Awasu » Git Guts: The staging index
Friday 7th January 2022 8:49 PM

Another important part of git is its staging index[1]. This is stored in .git/index, and its format is documented here.

The index is often described as a staging area for the changes you are working on for the next commit, but there is an important subtlety that is sometimes missed: it doesn't contain only the changes you've made, but the entire working tree, with your changes applied. In other words, you are building the next commit: when you finally do a git commit, a commit object will be stored in the repo that references a tree object describing the entire working tree.

This file is laid out as follows:

header (12 bytes)
  • magic number (b"DIRC")
  • version number (2, 3 or 4)
  • # entries
entry 0
entry 1
...
entry N-1
extension(s) (0 or more)
checksum

Note that the entries are variable length, and are sorted by name[2].

Reading the header is straight-forward, so we'll dive straight into reading the index entries.
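
For completeness, here is a minimal sketch of reading the header, based only on the layout shown above (three big-endian fields in 12 bytes). The function name read_index_header is made up for this example, and is not necessarily how index.py does it.

```python
import io
import struct

def read_index_header( fp ):
    """Read the 12-byte header and return (version, number of entries)."""
    header = fp.read( 12 )
    if len( header ) != 12:
        raise ValueError( "index file is too short" )
    # three big-endian fields: 4-byte magic, 32-bit version, 32-bit entry count
    magic, version, nentries = struct.unpack( ">4sII", header )
    if magic != b"DIRC":
        raise ValueError( "bad magic number: {!r}".format( magic ) )
    if version not in ( 2, 3, 4 ):
        raise ValueError( "unsupported version: {}".format( version ) )
    return version, nentries

# usage: a hand-built header for a v2 index with 3 entries
sample = io.BytesIO( b"DIRC" + struct.pack( ">II", 2, 3 ) )
print( read_index_header( sample ) )
```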

Reading index entries

Index entries usually refer to a file, and have the following format[3]:

ctime (4 bytes for the seconds, 4 bytes for the nano-seconds)
The time the file's metadata last changed.
mtime (4 bytes for the seconds, 4 bytes for the nano-seconds)
The time the file's data last changed.
dev (4 bytes)
ID of the device containing the file[4].
ino (4 bytes)
The file's inode number[5].
mode (4 bytes[6]): 0bTTTT...P 0bPPPPPPPP
  • TTTT = entry type (1000 = regular file, 1010 = symlink, 1110 = gitlink)
  • PPPPPPPPP = file permissions (only 0755 and 0644 are valid for regular files; symlinks and gitlinks must have 0)
uid (4 bytes)
User ID of the file's owner.
gid (4 bytes)
Group ID of the file's owner.
file size (4 bytes[7])
object name (20 raw bytes)
flags (2 bytes):
  • b15: assume-valid
  • b14: extended
  • b13-12: stage[8]
  • b11-0: name length[9]
extended flags[10] (2 bytes):
  • b15: reserved
  • b14: skip-worktree[11]
  • b13: intent-to-add[12]
  • b12-0: unused (must be 0)
entry path name[13] (NULL-terminated, encoding unknown)
padding (enough 0x00 bytes to make the size of the entry a multiple of 8)
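
The bit fields packed into the flags value can be pulled apart with a few masks and shifts. This is only a sketch of the layout listed above; decode_flags is a hypothetical helper, not something taken from index.py.

```python
def decode_flags( flags ):
    """Split the 16-bit flags field of an index entry into its parts."""
    return {
        "assume_valid": bool( flags & 0x8000 ), # b15
        "extended": bool( flags & 0x4000 ),     # b14
        "stage": (flags >> 12) & 0x3,           # b13-12
        "name_len": flags & 0x0fff,             # b11-0 (0xfff means "0xfff or longer")
    }

# usage: a stage 2 entry whose path name is 9 bytes long
print( decode_flags( 0x2009 ) )
```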

There's a lot there, but it's relatively straight-forward to read:

    # read the next entry
    fpos_start = fp.tell()
    entry = {
        "_offset": fpos_start,
        "ctime": ( read_nbo_int( fp ), read_nbo_int( fp ) ), # nb: seconds + nanoseconds
        "mtime": ( read_nbo_int( fp ), read_nbo_int( fp ) ), # nb: seconds + nanoseconds
        "dev": read_nbo_int( fp ),
        "ino": read_nbo_int( fp ),
    }
    # NOTE: The doco says that the mode field is a 32-bit value, but only accounts for 16 of them :-/
    mode = fp.read( 4 )
    entry["obj_type"] = (mode[2] & 0xf0) >> 4
    entry["perms"] = (mode[2] & 0x01) << 8 | mode[3]
    entry.update( {
        "uid": read_nbo_int( fp ),
        "gid": read_nbo_int( fp ),
        "file_size": read_nbo_int( fp ),
        "obj_name": read_obj_name( fp ),
        "flags": read_nbo_int( fp, nbytes=2 ),
    } )
    if version >= 3 and (entry["flags"] & 0x4000) != 0: # nb: b14 is the "extended" flag
        entry["extended_flags"] = read_nbo_int( fp, nbytes=2 )
    entry.update( {
        "path": read_string( fp ),
    } )
    if version != 4:
        # skip over the pad bytes (used to 8-align entries)
        while ( fp.tell() - fpos_start ) % 8 != 0:
            byt = fp.read( 1 )
            assert byt == b"\0"
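
The code above leans on a few helper functions that aren't shown here. The following is a guess at what they might look like, inferred purely from how they are called in this article (read_until is used further down, when reading the TREE extension). In particular, returning object names as hex strings, and having read_until consume its delimiter, are assumptions.

```python
import io

def read_nbo_int( fp, nbytes=4 ):
    """Read an unsigned integer stored in network (big-endian) byte order."""
    buf = fp.read( nbytes )
    if len( buf ) != nbytes:
        raise EOFError( "unexpected end of data" )
    return int.from_bytes( buf, byteorder="big" )

def read_until( fp, delim ):
    """Read bytes up to a 1-byte delimiter (which is consumed, but not returned)."""
    buf = bytearray()
    while True:
        byt = fp.read( 1 )
        if byt == b"":
            raise EOFError( "delimiter not found" )
        if byt == delim:
            return bytes( buf )
        buf += byt

def read_string( fp, encoding=None ):
    """Read a NUL-terminated string, decoding it only if an encoding is given."""
    buf = read_until( fp, b"\0" )
    return buf.decode( encoding ) if encoding else buf

def read_obj_name( fp ):
    """Read a 20-byte object name and return it as a hex string (assumed format)."""
    buf = fp.read( 20 )
    if len( buf ) != 20:
        raise EOFError( "unexpected end of data" )
    return buf.hex()

# usage: read a 32-bit integer, a path and an object name from some sample bytes
fp = io.BytesIO( b"\x00\x00\x01\x00" + b"hello.txt\0" + bytes( range( 20 ) ) )
print( read_nbo_int( fp ), read_string( fp ), read_obj_name( fp ) )
```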

Extensions

Immediately after the index entries come zero or more extension blocks, which are used to store additional information in the index, and have the following format:

signature (4 bytes)
Identifies which extension is present.
# data bytes (4 bytes)
the data bytes

This code will read these blocks of extension data:

    # read the rest of the index file (any extension data, and the overall file checksum)
    fpos_extn_data = fp.tell() # nb: the start of the extension data
    trailer = fp.read()
    extn_data = trailer[:-20] # nb: remove the overall checksum

    # read the extensions
    extns = []
    fp2 = io.BytesIO( extn_data )
    while fp2.tell() < len( extn_data ):
        fpos = fp2.tell() # nb: the start of the extension (within the trailer)
        extn_sig = fp2.read( 4 )
        nbytes = read_nbo_int( fp2 )
        data = fp2.read( nbytes ) # nb: don't re-use extn_data here, the loop condition needs it
        extns.append( ( extn_sig, fpos_extn_data+fpos, data ) )

Several different extensions are defined, but the only ones we'll look at here are TREE and REUC.
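
For any other extensions, the index format documentation says that a signature whose first byte is in the range 'A'..'Z' marks an optional extension, which a reader can ignore if it doesn't understand it. A small sketch of that check (is_optional_extn is a made-up name for this example):

```python
def is_optional_extn( extn_sig ):
    """Return True if a reader that doesn't understand this extension may skip it."""
    # a signature that starts with 'A'..'Z' marks an optional extension
    return b"A" <= extn_sig[:1] <= b"Z"

# usage: "link" is the signature used by the split index extension
for sig in ( b"TREE", b"REUC", b"link" ):
    print( sig, "optional" if is_optional_extn( sig ) else "required" )
```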

The TREE extension

These are used to record tree objects that already exist in the repo, for directories that don't contain any changes. Each entry in the extension has the following format:

path (relative to its parent, NULL-terminated)
# index entries covered by this entry (ASCII, -1 if invalid)
an ASCII space (0x20)
# subtrees this tree has (ASCII)
an ASCII newline (0x0a)
object name (20 raw bytes, only if the index entry count is valid)

The code to read and dump these:

    def _dump_tree_extn( extn_data ):
        """Dump a TREE extension."""

        fp = io.BytesIO( extn_data )
        while fp.tell() < len( extn_data ):

            # read the next entry
            path = read_string( fp ) # nb: encoding is unknown
            nentries = int( read_until( fp, b" " ) )
            nsubtrees = int( read_until( fp, b"\n" ) )
            obj_name = read_obj_name( fp ) if nentries != -1 else None

            # dump the entry
            print( "- path = {}".format(
                path.decode( "utf-8", errors="replace" ) # nb: the encoding is actually unknown :-/
            ) )
            print( "  - entries:  {}".format( nentries ) )
            print( "  - subtrees: {}".format( nsubtrees ) )
            if obj_name is not None:
                print( "  - name:     {}".format( obj_name ) )
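
Since each path is stored relative to its parent, the dump above doesn't show where an entry sits in the directory tree. The index format documentation says that the entries are written top-down, in depth-first order, so the subtree counts are enough to rebuild the full paths. Here is a self-contained sketch of that idea; parse_tree_extn and the sample data are invented for this example, and paths are assumed to be UTF-8.

```python
import io

def parse_tree_extn( extn_data ):
    """Parse a TREE extension into a list of (full path, entry count, object name)."""
    fp = io.BytesIO( extn_data )

    def read_until( delim ):
        # read up to (and consume) a 1-byte delimiter
        buf = bytearray()
        while True:
            byt = fp.read( 1 )
            if byt in ( b"", delim ):
                return bytes( buf )
            buf += byt

    results = []
    def read_node( parent ):
        name = read_until( b"\0" ).decode( "utf-8", errors="replace" )
        nentries = int( read_until( b" " ) )
        nsubtrees = int( read_until( b"\n" ) )
        obj_name = fp.read( 20 ).hex() if nentries != -1 else None
        # the root entry has an empty name, so its full path is the empty string
        path = parent + name + "/" if name else parent
        results.append( ( path, nentries, obj_name ) )
        # this entry's subtrees follow immediately, depth-first
        for _ in range( nsubtrees ):
            read_node( path )

    while fp.tell() < len( extn_data ):
        read_node( "" )
    return results

# usage: a root directory (3 files, 1 subtree) and a sub-directory (2 files)
sample = b"\0" + b"3 1\n" + bytes( 20 ) + b"subdir\0" + b"2 0\n" + bytes( [0xff] * 20 )
for path, nentries, obj_name in parse_tree_extn( sample ):
    print( repr( path ), nentries, obj_name )
```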

The REUC extension

These are used to save the conflicting versions of a file (its stage 1-3 index entries) when a merge conflict is resolved, so that the resolution can be undone. Each entry has the following format:

path (relative to the repo root, NULL-terminated)
mode of the entry in stage 1 (ASCII octal, NULL-terminated; "0" = not present)
mode of the entry in stage 2 (format as above)
mode of the entry in stage 3 (format as above)
object name in stage 1 (20 raw bytes) (only if mode1 != "0")
object name in stage 2 (20 raw bytes) (only if mode2 != "0")
object name in stage 3 (20 raw bytes) (only if mode3 != "0")

The code to read and dump these:

    def _dump_reuc_extn( extn_data ):
        """Dump a REUC extension.

        These are used to manage merge conflicts. The stages are used to store the different versions
        of the file during the merge process (so that it can be undone).
        """

        fp = io.BytesIO( extn_data )
        while fp.tell() < len( extn_data ):

            # read the next entry
            path = read_string( fp ) # nb: encoding is unknown
            modes = [
                read_string( fp, encoding="ascii" ) # nb: these are ASCII octal numbers
                for _ in range( 0, 3 )
            ]
            obj_names = [
                # NOTE: Object names are only present if their corresponding mode value is there.
                read_obj_name( fp ) if mode != "0" else None
                for mode in modes
            ]

            # dump the entry
            print( "- path = {}".format(
                path.decode( "utf-8", errors="replace" ) # nb: let's try UTF-8 :-/
            ) )
            for stage in range( 0, 3 ):
                print( "  - stage {}:".format( 1+stage ), end="" )
                perms = modes[ stage ]
                if perms != "0":
                    print( " {} {}".format( perms, obj_names[stage] ) )
                else:
                    print()
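
To try out a reader like this without having to set up a merge conflict, it can help to build an entry by hand. The following sketch goes in the opposite direction to the code above, writing an entry in the format just described; build_reuc_entry and the sample values are made up for this example.

```python
def build_reuc_entry( path, stages ):
    """Build one REUC entry from a path and 3 (mode, object name) pairs (None = not present)."""
    assert len( stages ) == 3
    buf = path + b"\0"
    # the three modes come first, as NUL-terminated ASCII octal numbers
    for stage in stages:
        mode = stage[0] if stage else 0
        buf += "{:o}".format( mode ).encode( "ascii" ) + b"\0"
    # then the object names, but only for the stages that are present
    for stage in stages:
        if stage:
            assert len( stage[1] ) == 20
            buf += stage[1]
    return buf

# usage: a file that was missing in stage 1, but present in stages 2 and 3
entry = build_reuc_entry( b"hello.txt", [
    None,
    ( 0o100644, b"\x11" * 20 ),
    ( 0o100644, b"\x22" * 20 ),
] )
print( len( entry ), entry[:26] )
```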

A full example

This repo contains a Hello, world file and two files in a sub-directory, similar to the one we set up before.

If we run the index.py script[14], we can see[15] that the index contains everything in the repo, even though git status reports the working tree as clean.

We can also see that there is one TREE extension, that has 2 entries:

  • one for the root directory, that covers 3 files[16], and has 1 subtree[17].
  • one for the subdir/ sub-directory, that covers 2 files (file1.txt and file2.txt), and has no subtrees.

If we modify the hello.txt file, and stage that change, we can see that git has created a new blob for the new version of the file, and updated the main index entry to point to it.

Note that the root entry in the TREE extension is now marked as invalid: these entries record directories that don't contain any changes, but our change is in this directory.

If we look at a binary dump of the index file, it's easy to see the magic number, version number and entry count at the start of the file (marked in green).

The index entries are marked in orange. These are variable-length records[18], but since all our filenames are about the same length, the entries are coincidentally the same size[19].

Following these are the extension(s) (marked in blue). Each one starts with its 4-byte signature and the number of bytes of data.

The last 20 bytes are the file checksum.
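
The index format documentation describes this as a hash of everything in the file that comes before it. Assuming a repo that uses SHA-1 (which matches the 20-byte size), it could be checked with something like the following sketch; verify_index_checksum is a hypothetical helper.

```python
import hashlib
import struct

def verify_index_checksum( data ):
    """Check that the last 20 bytes are the SHA-1 of everything before them."""
    body, checksum = data[:-20], data[-20:]
    return hashlib.sha1( body ).digest() == checksum

# usage: a hand-built, empty v2 index (header + checksum, no entries or extensions)
body = b"DIRC" + struct.pack( ">II", 2, 0 )
data = body + hashlib.sha1( body ).digest()
print( verify_index_checksum( data ) )
```

For a real repo, the data would come from reading .git/index in binary mode.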

After we commit the change, we can see that the main index entries remain unchanged, and the root entry in the TREE extension has gone back to recording the (now) unchanged root directory.

Source code

A new script index.py will dump the contents of a repo's staging index.



References
1 Not to be confused with the index file in a pack.
2 Although for our purposes, this is not important.
3 The format is completely different in split index mode, but we don't support that for this tutorial.
4 As reported by stat().
5 As reported by stat().
6 The doco says that this is a 32-bit field, but only describes 16 of them. The first 2 bytes appear to be unused.
7 This appears to be truncated if it overflows (!)
8 Files go through several stages during a merge.
9 0xfff is stored even if the length is > 0xfff.
10 Only for v3 or later, and if the extended flag is set.
11 Used by sparse checkout.
12 Used by git add -N.
13 In v4, this field is prefix-compressed relative to the previous entry, but we don't support this for this tutorial.
14 Get this at the bottom of the page.
15 Unimportant output has been removed from the screenshot.
16 This file count is for the entry and all its children.
17 The subdir/ sub-directory.
18 Because of the path name.
19 Also because of the padding bytes at the end, that ensures they are a multiple of 8 bytes in size.