Slimming Down a Git Repository
Posted by Ryan Coyner on June 7, 2008 under Git - 3 Comments
Within the last year, Git has become widely popular among open source enthusiasts, especially in the Rails camp. Although I do not use Rails anymore, I also switched from using Subversion to Git, and I am absolutely loving it. The distributed system is much more flexible and fun to work with than a centralized system. Like others, I migrated my old Subversion repositories into Git repositories.
Reduction in Size
The size of my repositories decreased dramatically after importing them to Git. This example uses the most recent Django repository in Subversion, which has 7568 revisions and is 917 MB in size. A standard import to Git reduced it to 320 MB, about a 65% reduction. If you have two hours to spare, you can import the project using git-svn:
$ git svn clone http://code.djangoproject.com/svn/
Although git-svn does an excellent job of migrating the repository cleanly, git-gc should at the very least be run after the import. git-gc is a housekeeping utility that prunes unreachable objects and repacks the repository. However, for projects as large as Django with a deep history, rebuilding the deltas from scratch can make the repository slightly smaller still.
First, the compression level can be increased. A compression level of zero signifies no compression, and the values from one to nine represent various trade-offs between compression and speed. For instance, a compression level of nine is the slowest but the most compressed. Git defaults to a compression level of six. The level of compression can be changed in the .git/config file:
[pack]
    compression = 9
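The same setting can also be written from the command line instead of editing the file by hand. The sketch below is one way to do that, assuming a Git recent enough to support the -C option. The function name set_pack_compression is made up for this example and is not part of Git.

```shell
# set_pack_compression REPO LEVEL
# Hypothetical helper (not a Git command): writes pack.compression
# into REPO's .git/config after checking that LEVEL is 0..9.
set_pack_compression() {
    repo=$1
    level=$2
    case $level in
        [0-9]) ;;
        *) echo "level must be between 0 and 9" >&2; return 1 ;;
    esac
    git -C "$repo" config pack.compression "$level"
}

# Example: set_pack_compression . 9
# Read the value back with: git config --get pack.compression
```

The new level only affects packs written afterwards, so existing packs stay as they are until the repository is repacked.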
To rebuild the deltas, run:
$ git repack -a -d -f --depth=250 --window=250
git-repack normally packs only unpacked objects, which can leave the repository with multiple packs. With the -a option, it packs all the objects, packed or unpacked, into a single tight pack. The -d option removes any packs made redundant by newly created ones. Because -a creates a single pack for the entire project, every existing pack becomes redundant and is deleted. Git also likes to reuse existing deltas when packing, because skipping the delta recalculation lets the packing operation finish much faster. The -f option tells Git to drop all existing deltas and compute new ones from scratch.
The --depth option sets the maximum length of a delta chain, that is, how many deltas Git may have to apply in sequence to reconstruct an object. The --window option sets how many neighboring objects Git considers as possible delta bases for each object. Since the imported project has a deep history, setting these to high values increases the chances of Git finding good deltas. These options are difficult to explain fully without getting into pack heuristics, which I may cover in a future post.
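To see what a repack actually did, the pack count and the delta chain lengths can be inspected afterwards. The following is a rough sketch, not a definitive recipe: repack_tight, count_packs and chain_histogram are names invented here, the depth and window defaults simply mirror the command above, and the commands use the `git repack` spelling that current Git releases expect.

```shell
# repack_tight REPO [DEPTH] [WINDOW]
# Hypothetical wrapper: rebuilds every delta and leaves a single pack.
repack_tight() {
    repo=$1
    depth=${2:-250}
    window=${3:-250}
    git -C "$repo" repack -a -d -f -q --depth="$depth" --window="$window"
}

# count_packs REPO: number of pack files, as reported by git count-objects.
count_packs() {
    git -C "$1" count-objects -v | awk '/^packs:/ { print $2 }'
}

# chain_histogram REPO: how many objects sit at each delta chain length.
chain_histogram() {
    ( cd "$1" && git verify-pack -s .git/objects/pack/pack-*.idx )
}

# Example:
#   repack_tight . && count_packs . && chain_histogram .
```

After a repack with -a and -d, count_packs should report a single pack, and the histogram from git verify-pack shows how far the chains really extend under the chosen --depth.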
After running git-repack, the size of the repository dropped to 310 MB. This is not a huge reduction from the original import, but it did reduce the number of packs to one, and the process took only about three minutes. Furthermore, it ensures that future additions to the project will be built on top of a solid foundation of deltas.
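The percentages behind these numbers are simple arithmetic on the reported sizes. As a quick check, here is a small sketch using integer shell arithmetic. percent_reduction is a hypothetical helper name, and the results are rounded down.

```shell
# percent_reduction BEFORE AFTER
# Hypothetical helper: integer percentage saved going from BEFORE to AFTER.
percent_reduction() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

percent_reduction 917 320   # Subversion to Git import: prints 65
percent_reduction 320 310   # import to repacked repository: prints 3
```

So the import accounts for nearly all of the savings, and the repack trims roughly another 3% off the imported repository.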
Another command does what git-repack did with all those options: git-gc --aggressive. It throws away all the old deltas and computes new ones from scratch. However, according to Linus, the --aggressive option does a very poor job of repacking. He also mentions that the option may be removed from the documentation.
Understanding Git
Again, I would like to explain Git's packing heuristics, but to do so properly I would first have to explain delta encoding and delta chains. Furthermore, about the only person who knows Git's packing heuristics in detail is Linus himself. Although I have a general understanding of the heuristics, there are a lot of details that still throw me off.

3 Comments
Anonymous June 8, 2008 at 22:54
--aggressive doesn't do a BAD job, it just does a STANDARD job, and if you've generated a pack with high-compression options, it will throw that away and re-do it using standard compression, which can be annoying.
As for the depth and window options, http://vcscompare.blogspot.com/ has some useful benchmarks. Basically, past 100 or so, you get severely diminishing returns, and there's no real need to take depth past 50. Which is good because that affects decompression time.
Briefly, git builds a list of objects clustered by a similarity heuristic. Then it compares each file against the previous "window" files to see how large the delta is, and stores the smallest one.
The file chosen to delta against may itself be stored delta-encoded, so when unpacking, you have to chase down the entire chain until you get to a file that's stored directly. This can make unpacking slow, so there's a limit on the maximum length of the chains, set by the --depth option.
Pieter June 9, 2008 at 07:17
Why is the Django repository on github ( http://github.com/tswicegood/django/tree ) only 19 MB? That one has 7391 revisions.
Did you tell git-svn to use branches and tags (git-svn -s)? Otherwise you will get a really big repository, and especially a big working tree, with all the branches in it.
I think 320MB is a bit big for such a small project with only 7000 revisions, unless they put something like images in it (like the Webkit repo has). These must have been removed from the github repository then.
Ryan Coyner June 9, 2008 at 09:34
@Anonymous - Great find on the benchmark. I knew that there were diminishing returns, but didn't realize it was that severe.
@Pieter - Yes, I did use branches and tags. The 320MB is the entire repository, including branches and tags. If it was just the trunk, it would be significantly smaller like you said.