Looking for the Guilty

A blog.

 

 

Managing my home directory with git

author: www-data

Having to work on multiple computers is a pain, home-directory-wise. Being a /bin/bash kid, I use a console to interface with my computer. I do so even today, with all the graphical bells and whistles around... I even managed my files from a terminal while using MacOS X! As an upside, a more or less meaningful home directory structure crystallized over the years.

The problem now is that everywhere I end up having an account, I sooner or later find myself loosely replicating the home directory structure from my main computer (i.e. my laptop). The more time I spend on the other computer (at work, for example), the more closely my new work environment resembles the one on my laptop. At some point, I end up copying stuff from my laptop to wherever I need to work (scripts, config files, documents etc). Ultimately, I end up rsync-ing entire trees resembling huge parts of my home directory from one computer to the other. That's when it becomes a major pain in the ass...

Hence the idea: why not put my home directory in a revision control system, and then, whenever I have to work on a remote machine, just replicate whatever I have?

Git is a good tool to do so.

What git can do for me

My needs with a good home directory management are the following:

  • Must: easy replication of my work environment on any new account
  • Must: selective and/or incremental replication of my work environment (i.e. I don't want to download all my projects everywhere. Rather, I'd like to be able to replicate projects as I move along with my work)
  • Must: easy synchronization of my home directories between different locations
  • Nice to have: versioning management (i.e. the ability to restore earlier versions of my files, or to restore deleted files)
  • Nice to have: the ability to easily and reliably back up my stuff
  • Nice to have: my privacy being protected, i.e. private data must not accidentally end up on machines I don't control or might lose control over in the future.

Preparing the directory structure

My directory structure as visible from ~ roughly resembles the following (some comments about the purposes of the single directories in brackets):

~
|-- Maildir      (...contains my mail, and is handled by 'offlineimap'...)
|-- local        (...contains "static" binary stuff. i.e. things that [almost]
|   `--...           never changes, ie. photos, media, large downloads ...)
|-- shared       (...main shared dir: one single git repo with submodules...)
|   |-- bin      (...git submodule: contains my custom scripts...)
|   |-- defaults (...git submodule: dot-files and dot-file samples...)
|   |-- ext      (...contains projects that are rooted elsewhere...)
|   |-- docs     (...git submodule: "static" personal stuff
|   |                [ie. letters to mom]...)
|   `-- pro      (...directory with lots of git-submodules.
|                    each of the subdirectories is a git-submodule
|                    for one of my many projects [like software development,
|                    ...)
|-- bin -> shared/bin/     (...lots of symlinks from ~/ to ~/shared/...)
|-- docs -> shared/docs/
`-- pro -> shared/pro/

The trick here is to know what goes well with a versioning system like git and what doesn't. For example, git is great at handling a large number of small, text-only files. It was explicitly designed to manage code ;) git can also work well with small binary files (order of magnitude ~1 MB), but the problem with frequently changing small binaries (like JPGs, for example) is that, unlike with text-only files, even small changes in binaries generate history in the order of magnitude of the file itself. After all, diff'ing a JPG is quite different from diff'ing a text-only file :)

git sucks at large files. It gradually becomes painful for files > 10 MB, to the point where it becomes virtually unusable for > 100 MB, and really unusable for > 500 MB. If your files, besides being large, are also binaries, you're out of luck (and memory ;) pretty fast.

That being known, you can:

  • use git for software projects
  • use git for projects involving a large number of text-only files (like writing a book using LaTeX)
  • use git for projects involving mainly text files, and small amounts of binary files that never or only rarely change (like LaTeX projects with images).
  • not use git, but instead use something else (like rsync) for everything else (movies, MP3s, large binary data...). For media files and other things, this is not a major drawback, as one mostly doesn't do any movie/sound editing that needs to be tracked. Sometimes however, one needs to work on large binary files that change frequently, like me for example when I edit scientific data with less fortunate tools like Igor Pro. For these cases, there is simply no good tool to track versions (or at least none that I'm aware of -- if you know any drop me a mail, I'll gladly mention it here).
  • not use git on directories that have files with frequently changing names, like for example ~/Maildir. The Maildir mailbox format stores message status flags in the file name, so while the files' content technically does not change, the names of the files do generally "misbehave" enough as to make Maildir management with git a major PITA. Use something else for that, for example offlineimap -- it rocks ;)
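To get a feeling for which files fall on the wrong side of that line, a quick scan of the tree helps. The following is only a sketch: the function name list_large_files is made up for this post, and the 10 MB default is simply the pain threshold mentioned above, not a limit enforced by git.

```shell
# List regular files below a directory that are larger than a size limit.
# Hypothetical helper; the name and the default limit are assumptions.
#   $1: directory to scan (default: current directory)
#   $2: size limit in bytes (default: 10 MB)
list_large_files() {
    dir=${1:-.}
    limit=${2:-10485760}
    # Skip any .git directories, then print files bigger than the limit.
    # "-size +Nc" is the POSIX way to compare against a size in bytes.
    find "$dir" -type d -name .git -prune -o -type f -size +"${limit}c" -print
}

# Example: everything above 10 MB under ~/shared
#   list_large_files ~/shared
```

Whatever this prints is a candidate for ~/local and rsync rather than for a repository.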

From a vanilla home directory to a git-backed one

Let's assume you have a home directory similar to mine on the computer labeled laptop. I'll assume that you -- like me -- want to have a computer somewhere on the internet that you want to use as a central repository. Let's call it central. What you first need to do is copy the shared part of your home directory to central:~/shared:

	central:~$ mkdir ~/shared/
	laptop:~$ rsync -av ~/shared/ you@central:shared/

The next step is to git-ize the ~/shared folder on central:

	central:~$ cd shared/docs
	central:~/shared/docs$ git init --shared
	central:~/shared/docs$ git add .
	central:~/shared/docs$ git commit -m "Initial checkin"

Let's take the lines one by one and see what they do:

  • The first line (obviously) enters the docs-directory
  • The 2nd line initializes an empty repository in the docs directory. The key part here is to specify the --shared argument. This saves us a headache later... But let me start at square one:

    Say you have two git repositories: A and B, where A was first created, and B was cloned from A. Now technically, you can push your changes from repository A to repository B using git push B from inside A. Or you can pull your changes into repository B from repository A using git pull A. But in practice only git pull is encouraged. Using git push towards a repository that has a working copy is generally discouraged, because the push does not update the checked-out files there.

    In our case, however, it's git push that's the more interesting option. We really do want to be able to push our changes to the central computer from whatever location we are currently using, in order to duplicate the changes to another location. This is where the --shared option kicks in: it tells git that we intend to git push changes from more than one location to the repository on central. Technically, it makes the repository group-writable (core.sharedRepository) and sets receive.denyNonFastForwards. So every time we try to push from somewhere else (laptop, for example), git will check that the pushed revisions can be fast-forwarded to without problems. This means (if I understood correctly) that the pushing repository's changes are all based on the current revision of the central repository. If this is not the case, central will not accept the push, and the pushing repository will be asked to resolve the conflict (by pulling from central first and merging with the local changes) and then try again.

  • The third line stages all the files in the directory and...
  • ...the 4th line, finally, commits all the files to the repository.
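If you want to see what --shared actually recorded, you can ask the repository for the two settings involved. A minimal sketch, assuming a git recent enough to understand git -C; the function name shared_settings is mine and not part of git.

```shell
# Print the two config values that "git init --shared" writes.
# Hypothetical helper; $1 is the path to a repository (default: .).
shared_settings() {
    repo=${1:-.}
    printf 'sharedRepository=%s\n' \
        "$(git -C "$repo" config core.sharedRepository)"
    printf 'denyNonFastForwards=%s\n' \
        "$(git -C "$repo" config receive.denyNonFastForwards)"
}

# Example:
#   shared_settings ~/shared/docs
```

In a repository created without --shared, both values simply come back empty.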
At this point, the directory central:~/shared/docs is a git repository. What we need to do now is repeat the procedure for each of the subdirectories of central:~/shared that we want to become individual repositories. More clearly: for each of the directories ~/shared/pro/project1, ~/shared/pro/project2, ~/shared/whatever... on central:~ we will need to execute the four lines described above.
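Typing those four lines over and over gets old quickly, so a small loop can do it. This is a sketch under assumptions: gitize is a made-up name, every directory is assumed to contain at least one file (otherwise git commit has nothing to commit and the function stops), and your git identity (user.name, user.email) is assumed to be configured on central.

```shell
# Turn each given directory into its own shared git repository and
# commit its current content. Hypothetical helper, not part of git.
gitize() {
    for dir in "$@"; do
        if [ -d "$dir/.git" ]; then
            echo "skipping $dir: already a git repository" >&2
            continue
        fi
        # Subshell, so the cd does not leak into the caller's shell.
        (
            cd "$dir" &&
            git init --shared &&
            git add . &&
            git commit -m "Initial checkin"
        ) || return 1
    done
}

# Example, run on central:
#   gitize ~/shared/docs ~/shared/pro/project1 ~/shared/pro/project2
```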

As for the actual ~/shared directory itself, the procedure is slightly different:

	central:~/shared$ git init --shared
	central:~/shared$ git submodule add ./docs
	central:~/shared$ git submodule add ./pro/project1
	central:~/shared$ git submodule add ./pro/project2
	central:~/shared$ git submodule add ./pro/whatever...
	central:~/shared$ git commit -m "submodules added"
The command git submodule add REPO POSITION would normally check out REPO into the tree of the ~/shared repository as path element POSITION. In our case, since the repository URLs are already within the ~/shared tree, we don't need to explicitly specify a path. Note: the leading ./ in the repository URLs is important!
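With many projects, it is easy to forget one of them. The sketch below only prints the git submodule add lines for every nested repository it finds, so you can review them before running anything. The name list_submodule_adds is invented, and the search depth (one or two levels below ~/shared) is an assumption that matches the layout shown earlier; -mindepth and -maxdepth are supported by GNU and BSD find, but are not POSIX.

```shell
# Print one "git submodule add ./PATH" line per nested repository.
# Hypothetical helper; $1 is the superproject directory (default: .).
list_submodule_adds() {
    root=${1:-.}
    (
        cd "$root" &&
        # Depth 1 would be the superproject's own .git, so start at 2.
        find . -mindepth 2 -maxdepth 3 -type d -name .git | sort |
        while IFS= read -r gitdir; do
            # Strip the trailing "/.git" and keep the leading "./".
            printf 'git submodule add %s\n' "${gitdir%/.git}"
        done
    )
}

# Example, run on central:
#   list_submodule_adds ~/shared
```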

Now it's time to go git. To do so, move your shared directory laptop:~/shared out of the way. You are well advised to back it up before you delete it! You have been warned. The magic line:

	laptop:~$ git clone ssh://you@central/home/you/shared shared
does the first part of the trick. After the command has completed, you will have a kind of stub representation of your shared directory downloaded. "Stub" in the sense that all the submodule directories are there, but they are empty. Try a git submodule status. Your result will slightly resemble the following:
	 -7279bb4545d2882be08f3f5cfa259210dfb8b101 docs
	 -a2c71a518b0ee30ef3a080d1681a5320ed9188db pro/project1
	 -1f3387c37f99964afe4b6a6e197a412a8f7c86eb pro/project2
	 -fef8f07e85503b3c05d7953110e3bff3126e4651 pro/whatever...
The SHA1 hashes will differ (and the submodule names, of course), but otherwise it's going to be a list of submodules. Mind the '-' character at the beginning of the line. It means that the submodule has not yet been initialized. You can initialize one or more modules with the line:
	laptop:~/shared$ git submodule init [module1 [module2 [...]]]
If you do not specify any modules, all will be initialized. However, exactly here is your chance to control what will be downloaded and what not. If you want to selectively download projects depending on your location (for example no private projects on your work computer), then simply call git submodule init accordingly :)

After initializing the submodules you want to download, type in:

	laptop:~/shared$ git submodule update [module1 [module2 ...]]
This will actually download your projects. If you omit the module names, git tries to update all those that have already been initialized. For the others, the error message "Maybe you want to use 'update --init'?" is printed.
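The selective part can be scripted, too. The following sketch reads the output of git submodule status and prints only the modules that are not yet initialized, optionally leaving out everything below a given path prefix. Both the function name select_uninitialized and the pro/private prefix in the example are invented for illustration.

```shell
# Read "git submodule status" output on stdin and print the paths of
# all uninitialized submodules (those marked with a leading '-').
# Hypothetical helper; $1 is an optional path prefix to leave out.
select_uninitialized() {
    exclude_prefix=${1:-}
    while read -r sha path; do
        case $sha in
            -*) ;;            # '-' prefix: not initialized yet
            *) continue ;;    # anything else is already initialized
        esac
        if [ -n "$exclude_prefix" ]; then
            case $path in
                "$exclude_prefix"*) continue ;;
            esac
        fi
        printf '%s\n' "$path"
    done
}

# Example, on the work computer: everything except private projects.
#   git submodule status | select_uninitialized pro/private |
#   while IFS= read -r path; do
#       git submodule init -- "$path" && git submodule update -- "$path"
#   done
```

The while loop in the example is deliberate: if nothing is selected, nothing runs, whereas a bare git submodule init without arguments would initialize everything.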

Everyday work with a git home directory

Suppose you edit the file ~/shared/docs/foo.txt on your laptop. At the end of the day, you need to first commit and then push your changes to the central computer:

	laptop:~/shared/docs$ git add foo.txt
	laptop:~/shared/docs$ git commit -m "bad bad typo in foo.txt"
	laptop:~/shared/docs$ git push
Then, the superproject ~/shared needs to be notified that a submodule has changed:
	laptop:~/shared$ git add docs
	laptop:~/shared$ git commit -m "the foo.txt document fixed"
Please note that the argument of the first line is docs and not docs/! The difference matters, as the former represents the name of the module, while the latter is the directory itself. If you accidentally use the second, you'll end up with all files from the docs subdirectory added to the ~/shared project, which is not what you want! (If you already blew it: git rm --cached is your friend.)
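Since these two steps always belong together, they can be wrapped into one function. This is a sketch and not a polished tool: sync_module is a made-up name, it stages everything in the module with git add -A (instead of naming single files as above), and it assumes the module's current branch already has an upstream, so that a plain git push works.

```shell
# Commit and push all changes inside one submodule, then record the
# new submodule revision in the superproject. Hypothetical helper.
#   $1: superproject directory, $2: module name, $3: commit message
sync_module() {
    super=$1
    module=$2
    msg=$3
    (
        cd "$super/$module" &&
        git add -A &&
        git commit -q -m "$msg" &&
        git push -q
    ) &&
    (
        cd "$super" &&
        # Module name without a trailing slash, as explained above.
        git add "$module" &&
        git commit -q -m "$module: $msg"
    )
}

# Example:
#   sync_module ~/shared docs "bad bad typo in foo.txt"
```

If there is nothing to commit in the module, git commit fails and the superproject is left untouched.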

At some point, you'll want to transfer the work from your laptop to another computer, say, work. Here are your options:

  1. Either you first git push your work to central, then you git clone or git pull it to work.
  2. Or you git pull the files to your work computer from your laptop, by explicitly telling git to pull from the laptop URL.
  3. Or you git push the files from your laptop to your work computer, then you execute git reset --hard on work (to reset the HEAD to the new revision). However, please note that this is not the intended way to do things! Besides, it only works safely if there are no local edits already present on work. Otherwise, you may lose whatever uncommitted work you have (check git's docs on that, I'm not too firm on this issue).

What you can and cannot do with the repository on central

Provided that you don't do any local edits, the repository on central can be pushed to and pulled from as you like. It will keep track of your files nicely. However, if you actually bother to enter the repository and check the files' contents, you will notice that the working copy of your files on central is not up-to-date with your latest revision. To bring the files up-to-date, you can do a git reset --hard, provided that you don't have any local changes that need to be preserved.

If you want to make (and preserve!) any changes directly to the files on central, you need to first git reset --hard the project, then make your changes, and then git add/git commit them. You can then git pull the changes to your laptop or work computer later on, no problem.

However, managing submodules on central is somewhat of a PITA in the version that I just presented. This is because git submodule status will show you all modules marked as non-initialized (i.e. preceded by the '-' character). If you try to initialize them using git submodule init, the error remote (origin) does not have a url defined in .git/config appears. As far as I can tell, that's because relative submodule URLs like ./docs are resolved against the URL of the superproject's origin remote, and the repository on central was created with git init rather than git clone, so it has no origin. However, git status will tell you correctly which of the submodules were changed and need a git add/git commit, so submodule management is -- although not perfect -- possible on the central repository, if you really have to do it.

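If you know for a fact that you'll never want to edit the files on your central repository, have a look at the manpage and figure out what git init --shared --bare does -- you'll like it :) A bare repository has no working copy, so there is nothing that can get out of date. Also note that recent git versions refuse by default to push into the checked-out branch of a non-bare repository (see receive.denyCurrentBranch).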
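For the curious, here is roughly what the bare variant could look like for a single project. It is a sketch with invented names (publish_bare, docs.git) and meant to be run on central itself: it creates an empty bare repository and pushes all branches of an existing repository into it. The second argument should be an absolute path, since the push runs from inside the source repository.

```shell
# Create a bare, shared repository and fill it from an existing one.
# Hypothetical helper, not part of git.
#   $1: existing repository with a working copy
#   $2: absolute path of the bare repository to create
publish_bare() {
    src=$1
    bare=$2
    git init -q --shared --bare "$bare" &&
    git -C "$src" push -q --all "$bare"
}

# Example, run on central:
#   publish_bare ~/shared/docs ~/repos/docs.git
# and then, on the laptop:
#   git clone ssh://you@central/home/you/repos/docs.git docs
```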

Conclusions

I definitely need to learn more about git. It's a pretty overwhelming versioning system. But it becomes obvious pretty fast that it's a very powerful one, too.

Other than that, home directory management with git seems to work for me. The future will tell if it works reliably... Check back here from time to time, I'll let you know how it turns out :) As soon as I start my PhD thesis in June, I'll have to do some heavy synchronization between my central repository, my laptop and my home and work computers, and I expect that to be the ultimate test. If everything goes well, I'll gradually think of setting up a decent git-based backup system for my home dir, and maybe later for other important parts of the filesystem (like /etc). So stay tuned, it pays ;)

2009-03-10 03:00 | www-data.blog20090310@rootshell.ro | [/tech-sci/comp] | permanent link

