We saw in the previous section how git creates a new blob object for each revision of a file. But what happens if the file is big? Does git really create and store a new revision of the entire file, even if we only make a small change to it? The answer is, as it usually is: it's complicated.
Loose objects
Starting again with an empty repo, add the license file for this project (~34½ KB).
Then, add a single character to the end of the file, and commit this change.
Under .git/objects/, we can see two large files, roughly the same size[1], with the more recent one being slightly larger, so it would appear that the answer is "yes": git does store a full copy of every revision of a file, even if only a small part of it has changed.
However, as the number of these "loose objects" grows over time, git periodically gathers them up and combines them into a single "pack". We can also trigger this process by running git gc.
After doing this, we can see that the loose objects have gone, and been replaced by two pack files. The .pack file contains the content, and since it's around 12 KB in size, it looks like git is able to take advantage of the fact that the two files are almost identical.
Pack data file format
The .idx file is an index that git uses to speed things up, and we'll look at it in a later section. For now, we'll take a look at the .pack file, which contains the main object data, and is documented here.
This file is laid out as follows:
header (12 bytes)
packed data for object 0
packed data for object 1
...
packed data for object N-1
checksum (20 bytes)
Note that the packed data for each object is a variable-length record.
Reading the header
Reading the header is straightforward:
def read_pack_data_header( fp ):
    """Read the header of a pack data file."""
    assert os.path.splitext( fp.name )[1] == ".pack" # nb: this assumes we're reading from a file
    # read the magic number
    magic = fp.read( 4 )
    if magic != b"PACK":
        raise RuntimeError( "Invalid pack data magic number: {}".format( magic ) )
    # read the version number
    version = read_nbo_int( fp )
    if version != 2:
        # NOTE: Version 3 is also valid, but we don't support it.
        raise RuntimeError( "Invalid pack file version number: {}".format( version ) )
    return version
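Here, read_nbo_int is assumed to read a 4-byte integer in network byte order (big-endian). If you're not following along with the full script, a minimal sketch of such a helper might look like this:

import struct

def read_nbo_int( fp ):
    """Read a 4-byte network-byte-order (big-endian) integer."""
    return struct.unpack( ">I", fp.read( 4 ) )[0]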
Reading the packed object data
Reading the objects is where things start to get interesting.
Each object record starts with a variable-length integer, stored using "size encoding". This is like a little-endian integer, except that the MSB[3] is used as a continuation flag: 1 means that another byte follows, and 0 means that this byte is the last one. For example:
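As a rough sketch (this helper is illustrative only, not part of the article's script), a plain size-encoded integer can be decoded like this, with each byte contributing 7 bits, lowest group first:

def read_size_encoded_int( fp ):
    """Decode a size-encoded integer (7 data bits per byte, lowest bits first)."""
    value, shift = 0, 0
    while True:
        byt = fp.read( 1 )[0]
        value |= (byt & 0x7f) << shift  # the lower 7 bits are data
        if (byt & 0x80) == 0:  # MSB clear => this was the last byte
            return value
        shift += 7

So, for instance, the bytes 0x95 0x01 would decode to (0x95 & 0x7f) | (0x01 << 7) = 0x15 + 0x80 = 149.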
This lets arbitrarily large numbers be stored using only as many bytes as they need. This mechanism appears in several places in the git file formats, but unfortunately, there are sometimes minor differences.
In this particular case, there's an extra twist: in the first byte, the 3 bits after the continuation flag store the object type, and the remaining bits contain the size of the object before compression (i.e. its uncompressed size). For example:
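As a hypothetical illustration (these bytes are made up, not taken from the example repo below), we can decode a two-byte type+size header by hand:

# made-up bytes for illustration only
byt0, byt1 = 0xb4, 0x07
assert (byt0 & 0x80) != 0                        # MSB set => another byte follows
obj_type = (byt0 & 0x70) >> 4                    # 0b011 = 3 = blob
obj_size = (byt0 & 0x0f) | ((byt1 & 0x7f) << 4)  # 0x74 = 116 bytes (uncompressed)
assert (byt1 & 0x80) == 0                        # MSB clear => last byte

Note that the first byte only carries 4 bits of the size (the other 4 hold the continuation flag and the type), which is why the code below starts its bit shift at 4; every later byte carries 7 bits.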
The code to extract these two values looks like this:
# read the object type
byt = fp.read( 1 )[0]
obj_type = (byt & 0x70) >> 4
# continue reading the object's size
obj_size = byt & 0x0f
bshift = 4
while (byt & 0x80) != 0:
    byt = fp.read( 1 )[0]
    obj_size |= (byt & 0x7f) << bshift
    bshift += 7
Following directly after the object type+size is the object data itself, the format of which depends on the object type. Object types 1-4 correspond to commit, tree, blob and tag objects respectively, and they're just zlib-compressed bytes. However, while we have the number of bytes of object data, this is for the uncompressed data, not the compressed, so we can't just do something like this:
data = fp.read( obj_size )
data = zlib.decompress( data )
Instead, we have to feed the data into the decompressor one byte at a time[4], and it will tell us when it's reached the end of the compressed data:
def decompress_stream( fp ):
    """Decompress stream data."""
    dobj = zlib.decompressobj()
    output = b''
    while True:
        # NOTE: Decompressing one byte at a time is very slow in Python!
        block = fp.read( 1 )
        output += dobj.decompress( block )
        if dobj.eof:
            break
    return output
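The one-byte-at-a-time loop keeps the code simple, but it's slow (see footnote [4]). One possible variation, sketched here as an aside rather than as part of the article's script, reads larger chunks and then rewinds the file by however many bytes zlib didn't consume, using the decompressor's unused_data attribute:

import os
import zlib

def decompress_stream_chunked( fp, chunk_size=4096 ):
    """Decompress stream data, reading in larger chunks for speed."""
    dobj = zlib.decompressobj()
    output = b""
    while not dobj.eof:
        block = fp.read( chunk_size )
        if not block:
            break  # unexpected end of file
        output += dobj.decompress( block )
    # any leftover bytes belong to the next object, so put them back
    fp.seek( -len( dobj.unused_data ), os.SEEK_CUR )
    return output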
A full example
This is a small repo that contains a single commit of a Hello, world file. Unpack it somewhere, and do garbage collection on it. We can see the three loose objects that represent the commit have been converted into a single .pack/.idx pair of files.
Run the pack.py script[5] on the .pack file, and you can see that it now contains the three formerly-loose objects.
To the right is a binary dump of the pack file. It's easy to see the magic number, the version number, and the object count (3) (all marked in green).
The first object starts immediately after, at offset 0xc, with the object's type and size, stored in a variable-length integer (marked in blue).
The first byte is 0x96, and its MSB is set, which means that there's another byte. The next byte is 0x0a; its MSB is clear, so this is the last byte. So, we have 2 bytes for the variable-length integer, which are interpreted like this:
The 3 bits after the first byte's MSB contain the object type (1 = commit), and the remaining bits hold the object's uncompressed size (0xa6 = 166).
We then start decompressing the object data, from offset 0x0e. zlib consumes 120 bytes as it decompresses the object data, and the second object starts immediately after, at offset 0x86.
This process repeats until we've read the expected number of objects, and the only thing left in the file should be the checksum.
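The trailing 20 bytes are a SHA-1 checksum of everything that precedes them, so the file can be sanity-checked like this (a small sketch, separate from the code in this article):

import hashlib

def verify_pack_checksum( pack_fname ):
    """Check that a pack data file's trailing SHA-1 matches its contents."""
    with open( pack_fname, "rb" ) as fp:
        data = fp.read()
    return hashlib.sha1( data[:-20] ).digest() == data[-20:]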
The final code looks like this:
def read_pack_data( pack_fname ):
    """Read a pack data file."""
    with open( pack_fname, "rb" ) as fp:
        # read the header
        version = read_pack_data_header( fp )
        nobjs = read_nbo_int( fp )
        # read each object
        objs = []
        for _ in range( 0, nobjs ):
            fpos = fp.tell()
            obj_type, obj_data = _read_pack_object( fp )
            objs.append( ( obj_type, obj_data, fpos ) )
        # NOTE: The only thing left in the file should be the checksum.
        assert len( fp.read() ) == 20
        return version, objs

def _read_pack_object( fp ):
    """Read an object from a pack data file."""
    # read the object type
    byt = fp.read( 1 )[0]
    obj_type = (byt & 0x70) >> 4
    # continue reading the object's size
    obj_size = byt & 0x0f
    bshift = 4
    while (byt & 0x80) != 0:
        byt = fp.read( 1 )[0]
        obj_size |= (byt & 0x7f) << bshift
        bshift += 7
    # read the object data
    if obj_type in ( 1, 2, 3, 4 ): # commit, tree, blob, tag
        obj_data = decompress_stream( fp )
        assert len(obj_data) == obj_size
        obj_type = OBJ_TYPES[ obj_type ]
    else:
        assert False, "ERROR: Unknown object type: {}".format( obj_type )
    return obj_type, obj_data
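A quick usage sketch (the file name below is made up; point it at a real .pack file):

version, objs = read_pack_data( "pack-1234.pack" )
for obj_type, obj_data, fpos in objs:
    print( "{} object at offset 0x{:x} ({} bytes)".format( obj_type, fpos, len(obj_data) ) )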
One thing to note is that, while we have extracted the data for each object, we don't have their names. For these, we need to read the pack index, which we'll do in a later section.
Source code
A new script, pack.py, will dump the objects (types 1-4 only) in a .pack file.
This is a copy of the repo we created with the large license file.
References
[1] They're only about 13½ KB because they're compressed.
[2] For the purpose of this tutorial, we only support v2.
[3] Most significant bit, or the left-most bit.
[4] Decompressing one byte at a time is very slow in Python, but it will be OK for small repos.
[5] Get this at the bottom of the page. It won't be able to dump the pack for the repo we created above, but we'll address this in the next section.