author: www-data
Having to work on multiple computers is a pain, home-directory-wise.
Being a /bin/bash kid, I use a console to interface with
my computer. I do so even today, with all the graphical bells and whistles
around... I even managed my files from a terminal while using MacOS X!
As an upside, a more or less meaningful home directory structure crystallized
over the years.
The problem is that wherever I end up having an account,
sooner or later I find myself loosely replicating the home directory
structure from my main computer (i.e. my laptop).
The more time I spend on the other computer (at work, for example), the
more closely my new work environment resembles
the one on my laptop. At some point, I end up copying stuff from my
laptop to wherever I need to work (scripts, config files, documents etc).
Ultimately, I end up rsync-ing entire trees
resembling huge parts of my home directory
from one computer to the other. That's when it becomes a major pain in the
ass...
Hence the idea: why not put my home directory in a revision control system,
and then, whenever I have to work on a remote machine, just replicate whatever
I have?
Git is a good tool to do so.
What git can do for me
My requirements for good home directory management are the following:
- Must: easy replication of my work environment on any new account
- Must: selective and/or incremental replication of my work environment
(i.e. I don't want to download all my projects everywhere. Rather, I'd like
to be able to replicate projects as I move along with my work)
- Must: easy synchronization of my home directories between different locations
- Nice to have: version management (i.e. the ability to restore
earlier versions of my files, or to restore deleted files)
- Nice to have: the ability to easily and reliably back up my stuff
- Nice to have: my privacy being protected, i.e. no private data
accidentally ending up on machines I don't control
or might lose control over in the future.
Preparing the directory structure
My directory structure as visible from ~
roughly resembles the following (some comments about the purpose
of the individual directories in parentheses):
~
|-- Maildir             (...contains my mail, and is handled by 'offlineimap'...)
|-- local               (...contains "static" binary stuff, i.e. things that [almost]
|   `-- ...                never change, e.g. photos, media, large downloads...)
|-- shared              (...main shared dir: one single git repo with submodules...)
|   |-- bin             (...git submodule: contains my custom scripts...)
|   |-- defaults        (...git submodule: dot-files and dot-file samples...)
|   |-- ext             (...contains projects that are rooted elsewhere...)
|   |-- docs            (...git submodule: "static" personal stuff
|   |                      [i.e. letters to mom]...)
|   `-- pro             (...directory with lots of git submodules.
|                          Each of the subdirectories is a git submodule
|                          for one of my many projects [like software development,
|                          ...])
|-- bin -> shared/bin/  (...lots of symlinks from ~/ to ~/shared/...)
|-- docs -> shared/docs/
`-- pro -> shared/pro/
The trick here is to know what goes well with a versioning system
like git and what doesn't. For example, git
is great at handling a large number of small, text-only files.
It was explicitly designed to manage code ;)
git can also work well with small binary files (order of
magnitude ~1 MB), but the problem with frequently changing small binaries
(like JPGs, for example) is that, unlike with text-only files,
even small changes
in binaries generate history in the order of magnitude of the file itself.
After all, diff'ing a JPG is quite different
from diff'ing a text-only file :)
git sucks at large files. It gradually becomes painful for
files > 10 MB, to the point where it becomes virtually unusable for > 100 MB,
and really unusable for > 500 MB. If your files, besides being
large, are also binaries, you're out of luck (and memory ;) pretty fast.
That being known, you can:
- use git for software projects
- use git for projects involving a large number
of text-only files (like writing a book using LaTeX)
- use git for projects involving mainly text files
and small amounts of binary files that never or rarely change (like
LaTeX projects with images)
- not use git, but instead use something else (like rsync)
for everything else (movies, MP3s, large binary data...).
For media files and the like, this is not a major drawback, as one
mostly doesn't do any movie/sound editing that needs to be tracked.
Sometimes, however, one needs to work on large binary files that
change frequently, like me for example when I edit scientific data
with less fortunate tools like Igor Pro. For these cases, there is
simply no good tool to track versions (or at least none that I'm
aware of -- if you know any, drop me a mail, I'll gladly mention
it here).
- not use git on directories that
have files with frequently changing names, like for example
~/Maildir. The Maildir mailbox format stores
message status flags in the file name, so while the files' content
technically does not change, the names of the files do, and they
generally "misbehave" enough to make Maildir management
with git a major PITA. Use something else for that,
for example offlineimap -- it rocks ;)
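Before git-izing a directory, it can help to look for files that fall into the "git sucks at this" category. The following is only a minimal sketch: the helper name large_files and its 10 MB default are my own choices mirroring the rule of thumb above, not anything git provides.

```shell
#!/bin/sh
# List regular files below a directory that are larger than a limit given
# in MiB, ignoring everything inside .git directories.
# large_files is a hypothetical helper name; 10 MiB is just the rule of
# thumb from the text, not a hard git limit.
large_files() {
    dir=${1:-.}
    limit_mb=${2:-10}
    # POSIX find counts -size in 512-byte blocks: 1 MiB = 2048 blocks.
    find "$dir" -name .git -prune -o \
        -type f -size +"$((limit_mb * 2048))" -print
}

# Usage sketch:
#   large_files ~/shared        # files > 10 MiB
#   large_files ~/shared 100    # files > 100 MiB
```

Whatever this prints is a candidate for ~/local and rsync rather than for a repository.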
From a vanilla home directory to a git-backed one
Let's assume you have a home directory similar to mine on the computer
labeled laptop. I'll assume that you -- like me -- want
to have a computer somewhere on the internet that you want to use
as a central repository. Let's call it central.
What you first need to do is copy the shared part of your
home directory to
central:~/shared:
central:~$ mkdir ~/shared/
laptop:~$ rsync -av ~/shared/ you@central:shared/
The next step is to git-ize the ~/shared
folder on central:
central:~$ cd shared/docs
central:~/shared/docs$ git init --shared
central:~/shared/docs$ git add .
central:~/shared/docs$ git commit -m "Initial checkin"
Let's take the lines one by one and see what they do:
- The first line (obviously) enters the docs-directory
- The 2nd line initializes an empty repository in the docs directory.
The key part here is to specify the
--shared argument.
This saves us a headache later... But let me start at square one:
Say you have two git repositories: A and B, where A was first created,
and B was cloned from A. Now technically, you can
push your changes from repository A
to repository B using git push B from inside A.
Or you can pull
your changes into repository B from repository A using
git pull A. But in practice only git pull
is encouraged. Using git push into a repository that has a working copy is generally discouraged.
In our case, however, it's git push that's the more
interesting option. We really do want to be able to push our changes
to the central computer from whatever location we are just using, in order
to duplicate the changes to another location.
This is where the --shared option kicks in: it
tells git that we intend to git push
changes from more than one location (or user) to the repository
on central. Technically it does two things: it makes the repository
group-writable, and it sets receive.denyNonFastForwards, so that the
repository refuses any push that is not a fast-forward, even a forced one.
In other words, every time we push from somewhere else (laptop, for example),
the pushed revisions must all be based on the current revision
of the central repository. If this is not the
case, central will not accept the push, and the pushing
repository has to resolve the conflict first
(by pulling from central and merging with the local
changes) and then try again.
- The third line stages all the files in the directory and...
- ...the 4th line, finally, commits all the files to the repository.
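The push/reject/pull/push cycle can be tried out safely in a scratch directory. This is a sketch under assumptions: the directory names laptop and work only stand in for two machines, and the central repository is created with --bare in addition to --shared, because current git versions refuse by default to push into the checked-out branch of a repository that has a working copy.

```shell
#!/bin/sh
# Sketch: a shared central repository refusing a push that is not based on
# its current revision. Everything lives in a throw-away temp directory.
set -eu
export GIT_AUTHOR_NAME=you GIT_AUTHOR_EMAIL=you@example.org
export GIT_COMMITTER_NAME=you GIT_COMMITTER_EMAIL=you@example.org

demo=$(mktemp -d)
git init --quiet --shared --bare "$demo/central.git"

# "laptop" creates the first revision and pushes it.
git clone --quiet "$demo/central.git" "$demo/laptop" 2>/dev/null
echo one > "$demo/laptop/a.txt"
git -C "$demo/laptop" add a.txt
git -C "$demo/laptop" commit --quiet -m "first revision"
git -C "$demo/laptop" push --quiet origin HEAD

# "work" clones that state; afterwards both sides commit independently.
git clone --quiet "$demo/central.git" "$demo/work"
echo two > "$demo/laptop/a.txt"
git -C "$demo/laptop" commit --quiet -am "laptop edit"
git -C "$demo/laptop" push --quiet origin HEAD
echo hello > "$demo/work/b.txt"
git -C "$demo/work" add b.txt
git -C "$demo/work" commit --quiet -m "work edit"

# work is now behind central, so its push is refused...
if git -C "$demo/work" push --quiet origin HEAD 2>/dev/null; then
    first_push=accepted
else
    first_push=rejected
fi

# ...until it pulls (i.e. merges) central's changes and tries again.
git -C "$demo/work" pull --quiet --no-rebase --no-edit
git -C "$demo/work" push --quiet origin HEAD

echo "first push from work: $first_push"
echo "receive.denyNonFastForwards on central:" \
    "$(git -C "$demo/central.git" config --get receive.denyNonFastForwards)"
```

Note that git push already refuses this kind of push on the client side; the --shared setting additionally makes central refuse it when somebody insists with --force.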
At this point, the directory central:~/shared/docs is
a git repository. What we need to do now is repeat the
procedure for each of the subdirectories of central:~/shared
that we want to become individual repositories. More clearly: for each
of the directories ~/shared/pro/project1,
~/shared/pro/project2, ~/shared/whatever... on
central:~ we
will need to execute the four lines described above.
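Typing the same four lines for every project gets old quickly, so they can be wrapped in a small function. A minimal sketch, assuming each directory contains at least one file (otherwise there is nothing to commit); the name gitize is made up for this example.

```shell
#!/bin/sh
# Turn one directory into a git repository with a single initial commit,
# i.e. the four lines from above wrapped in a function.
# gitize is a hypothetical helper name.
gitize() {
    (
        # The subshell keeps the cd from leaking into the caller.
        cd "$1" || exit 1
        git init --quiet --shared &&
        git add . &&
        git commit --quiet -m "Initial checkin"
    )
}

# Usage sketch on central (the paths are examples):
#   for dir in ~/shared/docs ~/shared/pro/*/; do
#       gitize "$dir" || break
#   done
```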
As for the actual ~/shared directory itself, the procedure
is slightly different:
central:~/shared$ git init --shared
central:~/shared$ git submodule add ./docs docs
central:~/shared$ git submodule add ./pro/project1 pro/project1
central:~/shared$ git submodule add ./pro/project2 pro/project2
central:~/shared$ git submodule add ./pro/whatever... pro/whatever...
central:~/shared$ git commit -m "submodules added"
The command git submodule add REPO POSITION would normally check out
REPO into the tree of the ~/shared
repository as path element POSITION.
In our case, the repositories already sit within the ~/shared tree
at exactly that position, so nothing is checked out and the existing
repository is simply registered. If you leave out POSITION, git guesses it
from the last path component of REPO. That works for ./docs, but not for
the nested ./pro/... projects, so it is safer to spell it out.
Note: the leading ./ in the
repository URLs is important!
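With many projects, registering the submodules by hand is tedious. Here is a sketch that registers every repository found below the current directory, passing the path explicitly each time. The function name add_all_submodules is made up, and the sketch assumes that path names contain no newlines and that every nested repository already has at least one commit.

```shell
#!/bin/sh
# Register every git repository found below the current directory as a
# submodule of the repository in the current directory.
# add_all_submodules is a hypothetical helper; run it from ~/shared.
add_all_submodules() {
    # List the .git directories of nested repositories, skipping the
    # superproject's own ./.git. sort makes the order stable and lets
    # find finish before the first submodule is added.
    find . -path ./.git -prune -o -type d -name .git -print | sort |
    while IFS= read -r gitdir; do
        path=${gitdir%/.git}    # ./pro/project1/.git -> ./pro/project1
        path=${path#./}         # ./pro/project1      -> pro/project1
        # URL with leading ./ first, then the explicit position.
        git submodule add "./$path" "$path" </dev/null || exit 1
    done
}

# Usage sketch:
#   cd ~/shared && git init --shared
#   add_all_submodules && git commit -m "submodules added"
```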
Now it's time to go git. To do so, move your
shared directory laptop:~/shared out of the way.
You are well advised to back it up before you delete it! You have
been warned. The magic line:
laptop:~$ git clone ssh://you@central/home/you/shared shared
does the first part of the trick. After the command has completed,
you will have a kind of stub representation of your shared
directory downloaded. "Stub" in the sense that all the
submodule directories are there, but they are empty.
Try a git submodule status. Your result
will slightly resemble the following:
-7279bb4545d2882be08f3f5cfa259210dfb8b101 docs
-a2c71a518b0ee30ef3a080d1681a5320ed9188db pro/project1
-1f3387c37f99964afe4b6a6e197a412a8f7c86eb pro/project2
-fef8f07e85503b3c05d7953110e3bff3126e4651 pro/whatever...
The SHA1 hashes will differ (and the submodule names, of course),
but otherwise it's going to be a list of submodules. Mind the '-'
character at the beginning of the line. It means that the submodule
has not yet been initialized. You can initialize one or more
modules with the line:
laptop:~/shared$ git submodule init [module1 [module2 ...]]
If you do not specify any modules, all will be initialized.
However, this is exactly your chance to control what will be
downloaded and what not. If you want to selectively
download projects depending on your location
(for example no private projects on your work computer),
then simply call git submodule init accordingly :)
After initializing the submodules you want to download, type in:
laptop:~/shared$ git submodule update [module1 [module2 ...]]
This will actually download your projects.
If you omit the module names, git tries to update all
those that have already been initialized. For the others,
the error message "Maybe you want to use 'update --init'?" is printed.
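The "which modules on which machine" decision can live in a small script, so it doesn't have to be remembered on every new account. A sketch only: the function name modules_for_host, the host name pattern and the module selection are made-up examples.

```shell
#!/bin/sh
# Decide which submodules get initialized on which machine.
# modules_for_host, the host patterns and the module lists are
# hypothetical examples; adapt them to your own setup.
modules_for_host() {
    case "$1" in
        work*) echo "bin defaults pro/project1" ;;  # nothing private at work
        *)     echo "" ;;                           # empty list = everything
    esac
}

# Usage sketch, run inside ~/shared right after cloning:
#   modules=$(modules_for_host "$(hostname)")
#   git submodule init $modules     # unquoted on purpose: one word per module
#   git submodule update $modules
```

The unquoted $modules relies on word splitting, so this only works for module paths without spaces.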
Everyday work with a git home directory
Suppose you edit the file ~/shared/docs/foo.txt
on your laptop. At the end of the day, you need to
first commit and then push your changes to the central
computer:
laptop:~/shared/docs$ git add foo.txt
laptop:~/shared/docs$ git commit -m "bad bad typo in foo.txt"
laptop:~/shared/docs$ git push
Then,
the superproject ~/shared needs to be notified
that a submodule has changed:
laptop:~/shared$ git add docs
laptop:~/shared$ git commit -m "the foo.txt document fixed"
Please note that the argument of the first line is
docs and not docs/!
The difference matters, as the former represents the name of the
module, while the latter is the directory itself.
If you accidentally use the second,
you'll end up with all files from the
docs subdirectory added to the
~/shared project, which is not what you want! (If you
already blew it: git rm --cached is your friend.)
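Since the two steps (commit inside the module, then record the module in ~/shared) always go together, they can be combined. A minimal sketch with assumptions: save_module is a made-up name, it stages everything in the module with git add -A instead of naming single files, and it only pushes if the module has a remote called origin.

```shell
#!/bin/sh
# Commit all changes inside one submodule, push them if the submodule has
# a remote named origin, then record the new submodule revision in the
# superproject. save_module is a hypothetical helper; run it from the
# superproject root (~/shared).
save_module() {
    module=$1
    message=$2
    git -C "$module" add -A &&
    git -C "$module" commit --quiet -m "$message" || return 1
    if git -C "$module" remote | grep -qx origin; then
        git -C "$module" push --quiet origin HEAD || return 1
    fi
    # Module name without a trailing slash: "docs", not "docs/".
    git add "$module" &&
    git commit --quiet -m "$module: $message"
}

# Usage sketch:
#   cd ~/shared && save_module docs "bad bad typo in foo.txt"
```

Pushing the ~/shared superproject itself is left out here on purpose; do that whenever you want the other machines to see the new state.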
At some point, you'll want to transfer the work from your laptop
to another computer, say, work. Here are your options:
- Either you first git push your work to central,
then you git clone or git pull it to work.
- Or you git pull the files to your work computer
from your laptop,
by explicitly telling git to pull from the laptop URL.
- Or you git push the files from your laptop to
your work computer,
then you execute git reset --hard
on work (to reset the HEAD to the new revision).
However, please note that this is not
the intended way to do things! Besides, it only works safely
if there are no local edits already present on work. Otherwise,
you may lose whatever uncommitted work you have
(check git's docs on that,
I'm not too firm on this issue).
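The second option, pulling directly from the laptop, can be rehearsed locally. In this sketch two temp directories stand in for the two machines; over the network the plain path would be replaced by a URL along the lines of ssh://you@laptop/home/you/shared/docs (an assumed example, not something I tested here).

```shell
#!/bin/sh
# Sketch: "work" pulls straight from "laptop", bypassing central.
# Local temp directories play the role of the two machines.
set -eu
export GIT_AUTHOR_NAME=you GIT_AUTHOR_EMAIL=you@example.org
export GIT_COMMITTER_NAME=you GIT_COMMITTER_EMAIL=you@example.org

demo=$(mktemp -d)
git init --quiet "$demo/laptop"
echo one > "$demo/laptop/foo.txt"
git -C "$demo/laptop" add foo.txt
git -C "$demo/laptop" commit --quiet -m "first revision"
git clone --quiet "$demo/laptop" "$demo/work"

# New work happens on the laptop...
echo two > "$demo/laptop/foo.txt"
git -C "$demo/laptop" commit --quiet -am "laptop edit"
branch=$(git -C "$demo/laptop" symbolic-ref --short HEAD)

# ...and work fetches and merges it by naming the laptop URL and the
# branch explicitly.
git -C "$demo/work" pull --quiet --no-rebase "$demo/laptop" "$branch"
cat "$demo/work/foo.txt"
```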
What you can and cannot do with the repository on central
Provided that you don't do any local edits, the repository on
central
can be pushed to and pulled from as you like. It will keep track of your
files nicely. However, if you actually bother to enter the repository
and check the files' contents, you will notice
that the working copy of your files on
central is not up-to-date with your latest revision.
To bring the files up-to-date, you can do a git reset --hard
provided that you don't have any local changes that need to
be preserved.
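The stale working copy is easy to reproduce in a scratch directory. One assumption to be aware of: current git versions refuse by default to push into the checked-out branch of a repository with a working copy, so this sketch sets receive.denyCurrentBranch to ignore on the stand-in for central to get the behaviour described here.

```shell
#!/bin/sh
# Sketch: after a push, the working copy on a non-bare "central" is stale
# until git reset --hard brings it up to date.
set -eu
export GIT_AUTHOR_NAME=you GIT_AUTHOR_EMAIL=you@example.org
export GIT_COMMITTER_NAME=you GIT_COMMITTER_EMAIL=you@example.org

demo=$(mktemp -d)
git init --quiet --shared "$demo/central"
# Allow pushes into the checked-out branch (refused by default nowadays).
git -C "$demo/central" config receive.denyCurrentBranch ignore
echo old > "$demo/central/foo.txt"
git -C "$demo/central" add foo.txt
git -C "$demo/central" commit --quiet -m "Initial checkin"

# The laptop clones, edits and pushes back.
git clone --quiet "$demo/central" "$demo/laptop"
echo new > "$demo/laptop/foo.txt"
git -C "$demo/laptop" commit --quiet -am "edit on laptop"
git -C "$demo/laptop" push --quiet origin HEAD

before=$(cat "$demo/central/foo.txt")   # still the old content
git -C "$demo/central" reset --quiet --hard
after=$(cat "$demo/central/foo.txt")    # now matches the pushed revision
echo "before reset: $before, after reset: $after"
```

The reset throws away any uncommitted edits on central, which is exactly why it is only safe when there are none.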
If you want to make (and preserve!) any changes directly to the files
on central, you need to first git reset --hard the
project, then make your changes, and then
git add/git commit them. You can then git pull
the changes to your laptop or work
computer later on, no problem.
However, managing submodules on central is somewhat of a
PITA in the setup that I just presented. This is because
a plain git submodule will show you all modules marked
as non-initialized (i.e. preceded by the '-' character).
If you try to initialize them using git submodule init,
the error
remote (origin) does not have a url defined in .git/config
appears. I'm pretty sure that's because I just haven't completely understood
git and/or git-submodule, and I'll probably
slap myself on the forehead when I find out the reason.
However, git status will tell you correctly which
of the submodules were changed and need a git add/git commit,
so submodule management is -- although not perfect -- possible on
the central repository, if you really have to do it.
If you know for a fact that you'll never want to edit the files
on your central repository, have a look at the manpage and figure out
what git init --shared --bare does -- you'll like it :)
Conclusions
I definitely need to learn more about git.
It's a pretty overwhelming versioning system. But it becomes
obvious pretty fast that it's a very powerful one, too.
Other than that, home directory management with git
seems to work for me. Time will tell if it works
reliably... Check back here from time to time, I'll let
you know how it turns out :) As soon as I start my PhD
thesis in June, I'll have to do some heavy synchronization
between my central repository, my laptop and my home and
work computers, and I expect that to be the ultimate test.
If everything goes well, I'll gradually think of setting
up a decent git-based backup system for
my home dir, and maybe later for other important parts
of the filesystem (like /etc). So stay tuned,
it pays ;)
2009-03-10 03:00 | www-data rootshell.ro |