In the previous section, we saw how git compresses objects into pack files, in order to save space, but at the cost of slower access. We wrote a script that dumps the contents of a .pack file, which works fine if you want to iterate through all the objects in the pack[1]For example, if you want to unpack everything into loose objects., but git usually reads individual objects from the pack as needed, and so needs a way to quickly locate where they are in the .pack file.
The pack index
The .idx file that you see alongside the .pack file is an index, that git uses to quickly locate an object in the .pack file, based on its object name, and are documented here.
This file is laid out as follows:
header (8 bytes)
|
Note that the main object records are all fixed size, so if we know how many objects are in the pack, we can quickly calculate where the object name, CRC and file offset are, for a given object index:
- start of object names = 8 + 256 × 4
- n'th object name = start of object names + 20 × n
- start of CRC's = start of object names + #objects × 20
- n'th object CRC = start of CRC's + 4 × n
- start of file offsets = start of CRC's + #objects × 4
- n'th object's file offset = start of file offsets + 4 × n
The number of objects in the index file can be found by looking at the last entry in the fanout table (explained below), so given that we know where the first object name is, we can easily calculate where the last one is, and given that the object names are sorted, we can do a binary search to quickly find the object name we're looking, and figure out its position within the table of object names.
Once we have the object's index, we can then quickly locate the corresponding entry in the table of file offsets[5]For example, if we determine that the object name we're looking for is 5th in the table of object names, then we look up the 5th entry in the table of offsets., get the offset within the .pack data file for the object we're looking for, fseek to that position, and call the _read_pack_object() function we wrote earlier.
The fanout table
But wait, there's more! The fanout table lets us do things even faster. This table contains 256 entries, one for each possible lead byte of an object name, where the N'th entry contains the index of the first object name whose lead byte is > N. This is best explained with a diagram
The diagram to the right is interpreted as follows:
- fanout[0x00] = index of the first object name that has a lead byte > 0x00
- fanout[0x01] = index of the first object name that has a lead byte > 0x01
- fanout[0x02] = index of the first object name that has a lead byte > 0x02
- etc...
So, when we're looking for an object name, we take its lead byte and look up the corresponding entry in the fanout table, which gives us the index of an object name that must be > than the one we're looking for (because its lead byte is > than ours).
We then check the previous entry in the fanout table, which gives us the index of an object name that must be ≤ to the one we're looking for (because its lead byte is ≤ to ours).
We then only need to search this much smaller range of object names.
Putting everything together, this is a function that takes an object name and returns its offset into the .pack data file:
def _find_obj_in_pack( pack_fname, obj_name ): """Find the specified object in a pack.""" index_fname = change_extn( pack_fname, ".idx" ) with open( index_fname, "rb" ) as fp: # read the header version = read_pack_index_header( fp ) assert version == 2 # read the fanout table byte0 = int( obj_name[:2], 16 ) fp.seek( 4*byte0, 1 ) end = read_nbo_int( fp ) if byte0 == 0: start = 0 else: fp.seek( -8, 1 ) start = read_nbo_int( fp ) # NOTE: We now have 2 indexes into the table of object names: # - start = index of an object name <= the name we're looking for # - end = index of an object name > the name we're looking for # do a binary search for the target object name fpos_names = 4 + 4 + 4*256 # nb: offset to the start of the object names while start <= end: mid = int( ( start + end ) / 2 ) fp.seek( fpos_names + 20 * mid ) mid_obj_name = read_obj_name( fp ) # check if we found the target object name if mid_obj_name == obj_name: # yup - get the object's offset in the pack data fp.seek( fpos_names - 4 ) nobjects = read_nbo_int( fp ) # nb: this is the last value in the fanout table fpos = fpos_names fpos += 20 * nobjects # skip over the object names fpos += 4 * nobjects # skip over the CRC's fpos += 4 * mid # point to the mid'th entry in the offset table fp.seek( fpos ) obj_offset = read_nbo_int( fp ) return obj_offset # nope - continue the binary search if mid_obj_name < obj_name: start = mid + 1 else: end = mid -1 return None # nb: we didn't find the object name we're looking for
Using this function is straight-forward e.g.
offset = _find_obj_in_pack( pack_fname, obj_name ) if not offset: print( "Not found." ) else: fname = change_extn( pack_fname, ".pack" ) with open( fname, "rb" ) as fp: fp.seek( offset ) obj_type, obj_data = _read_pack_object( fp, depth=0 )
Finally, the reason why the last entry of the fanout table contains the number of objects is because this entry contains "the index of the first object name whose lead byte is > 0xff." This doesn't actually make any sense (since a byte can't be larger than 0xff), but it has the side-effect of pointing just past the last entry in the table i.e. it contains the number of objects in the table.
Source code
The pack.py script will now dump a .pack data file, or a .idx index file, depending on what is passed in as the argument.
You can also include an object name, and the script will use the index to locate that object in the pack, then dump it.
References
↵1 | For example, if you want to unpack everything into loose objects. |
---|---|
↵2 | v1 index files don't have a field for the version number, and just start with the fanout table. However, \xfftOc is not a valid first entry for the fanout table, so the presence or absence of this value is how we distinguish between the two version. |
↵3 | Only 31 bits are used; the MSB flags a very large offset. |
↵4 | This is not used for pack files < 2 GB. |
↵5 | For example, if we determine that the object name we're looking for is 5th in the table of object names, then we look up the 5th entry in the table of offsets. |