Awasu » Git Guts: Pack indexes
Friday 7th January 2022 8:33 PM

In the previous section, we saw how git compresses objects into pack files, in order to save space, but at the cost of slower access. We wrote a script that dumps the contents of a .pack file, which works fine if you want to iterate through all the objects in the pack[1]For example, if you want to unpack everything into loose objects., but git usually reads individual objects from the pack as needed, and so needs a way to quickly locate where they are in the .pack file.

The pack index

The .idx file that you see alongside the .pack file is an index, that git uses to quickly locate an object in the .pack file, based on its object name, and are documented here.

This file is laid out as follows:

header (8 bytes)
  • magic number (b"\xfftOc")
  • version number (must be 2[2]v1 index files don't have a field for the version number, and just start with the fanout table. However, \xfftOc is not a valid first entry for the fanout table, so the presence or absence of this value is how we distinguish between the two version.)
fanout table (256 × 4-byte entries)
object names (20 raw bytes each, sorted in ascending order)
  • object 0
  • object 1

...

  • object N-1
CRC-32's for the packed object data (4 bytes each)
  • object 0
  • object 1

...

  • object N-1
offsets into the .pack data (4 bytes each[3]Only 31 bits are used; the MSB flags a very large offset.)
  • object 0
  • object 1

...

  • object N-1
.pack file very large offsets[4]This is not used for pack files < 2 GB. (8 bytes each)
trailer (2 × 20 bytes)
  • checksum of the .pack file (as it appears at the end of that file)
  • checksum of the .idx file

Note that the main object records are all fixed size, so if we know how many objects are in the pack, we can quickly calculate where the object name, CRC and file offset are, for a given object index:

  • start of object names = 8 + 256 × 4
  • n'th object name = start of object names + 20 × n
  • start of CRC's = start of object names + #objects × 20
  • n'th object CRC = start of CRC's + 4 × n
  • start of file offsets = start of CRC's + #objects × 4
  • n'th object's file offset = start of file offsets + 4 × n

The number of objects in the index file can be found by looking at the last entry in the fanout table (explained below), so given that we know where the first object name is, we can easily calculate where the last one is, and given that the object names are sorted, we can do a binary search to quickly find the object name we're looking, and figure out its position within the table of object names.

Once we have the object's index, we can then quickly locate the corresponding entry in the table of file offsets[5]For example, if we determine that the object name we're looking for is 5th in the table of object names, then we look up the 5th entry in the table of offsets., get the offset within the .pack data file for the object we're looking for, fseek to that position, and call the _read_pack_object() function we wrote earlier.

The fanout table

But wait, there's more! The fanout table lets us do things even faster. This table contains 256 entries, one for each possible lead byte of an object name, where the N'th entry contains the index of the first object name whose lead byte is > N. This is best explained with a diagram :-)

The diagram to the right is interpreted as follows:

  • fanout[0x00] = index of the first object name that has a lead byte > 0x00
  • fanout[0x01] = index of the first object name that has a lead byte > 0x01
  • fanout[0x02] = index of the first object name that has a lead byte > 0x02
  • etc...

So, when we're looking for an object name, we take its lead byte and look up the corresponding entry in the fanout table, which gives us the index of an object name that must be > than the one we're looking for (because its lead byte is > than ours).

We then check the previous entry in the fanout table, which gives us the index of an object name that must be ≤ to the one we're looking for (because its lead byte is ≤ to ours).

We then only need to search this much smaller range of object names.

Putting everything together, this is a function that takes an object name and returns its offset into the .pack data file:

    def _find_obj_in_pack( pack_fname, obj_name ):
        """Find the specified object in a pack."""

        index_fname = change_extn( pack_fname, ".idx" )
        with open( index_fname, "rb" ) as fp:

            # read the header
            version = read_pack_index_header( fp )
            assert version == 2

            # read the fanout table
            byte0 = int( obj_name[:2], 16 )
            fp.seek( 4*byte0, 1 )
            end = read_nbo_int( fp )
            if byte0 == 0:
                start = 0
            else:
                fp.seek( -8, 1 )
                start = read_nbo_int( fp )
            # NOTE: We now have 2 indexes into the table of object names:
            # - start = index of an object name <= the name we're looking for
            # - end = index of an object name > the name we're looking for

            # do a binary search for the target object name
            fpos_names = 4 + 4 + 4*256 # nb: offset to the start of the object names
            while start <= end:
                mid = int( ( start + end ) / 2 )
                fp.seek( fpos_names + 20 * mid )
                mid_obj_name = read_obj_name( fp )
                # check if we found the target object name
                if mid_obj_name == obj_name:
                    # yup - get the object's offset in the pack data
                    fp.seek( fpos_names - 4 )
                    nobjects = read_nbo_int( fp ) # nb: this is the last value in the fanout table
                    fpos = fpos_names
                    fpos += 20 * nobjects # skip over the object names
                    fpos += 4 * nobjects # skip over the CRC's
                    fpos += 4 * mid # point to the mid'th entry in the offset table
                    fp.seek( fpos )
                    obj_offset = read_nbo_int( fp )
                    return obj_offset
                # nope - continue the binary search
                if mid_obj_name < obj_name:
                    start = mid + 1
                else:
                    end = mid -1

            return None # nb: we didn't find the object name we're looking for

Using this function is straight-forward e.g.

    offset = _find_obj_in_pack( pack_fname, obj_name )
    if not offset:
        print( "Not found." )
    else:
        fname = change_extn( pack_fname, ".pack" )
        with open( fname, "rb" ) as fp:
            fp.seek( offset )
            obj_type, obj_data = _read_pack_object( fp, depth=0 )

Finally, the reason why the last entry of the fanout table contains the number of objects is because this entry contains "the index of the first object name whose lead byte is > 0xff." This doesn't actually make any sense (since a byte can't be larger than 0xff), but it has the side-effect of pointing just past the last entry in the table i.e. it contains the number of objects in the table.

Source code

The pack.py script will now dump a .pack data file, or a .idx index file, depending on what is passed in as the argument.

You can also include an object name, and the script will use the index to locate that object in the pack, then dump it.



References

References
1 For example, if you want to unpack everything into loose objects.
2 v1 index files don't have a field for the version number, and just start with the fanout table. However, \xfftOc is not a valid first entry for the fanout table, so the presence or absence of this value is how we distinguish between the two version.
3 Only 31 bits are used; the MSB flags a very large offset.
4 This is not used for pack files < 2 GB.
5 For example, if we determine that the object name we're looking for is 5th in the table of object names, then we look up the 5th entry in the table of offsets.
Have your say