In the previous section, we saw how git stores the basic objects in a .pack file. They are compressed, but there's no real space saving since so are the loose objects. In this section, we'll take a look at how git stores revisions of a file using a single full version, and then a series of diff's[1]Much like the older source control systems e.g. cvs or svn., which gives a significant reduction in space used.
Delta'fied objects
If we run the pack.py script on the packed repo (with the large license file) we created in the last section, we hit a bit of a snag: the pack file contains an object of type 6 (OFS_DELTA), which we don't yet know how to handle.
To get the object data for one of these, git builds it up progressively, by taking another object from the pack as a base object, and applying a series of transformations to it. There are only 2 transformations supported:
- COPY BYTES (from the base object)
- APPEND BYTES (add new data bytes)
With just these two things, we can transform the base object into anything we want.
For example, let's say we start with a base object of "abcde", and we want to delete some of the letters, ending up with an object of "abe". This can be achieved as follows:
COPY BYTES (start=0, #bytes=2) → "ab" COPY BYTES (start=4, #bytes=1) → "abe"
Or, if we want to insert some new letters at the start, and change some of the existing ones, to end up with "!!!axyze":
APPEND BYTES ("!!!") → "!!!" COPY BYTES (start=0, #bytes=1) → "!!!a" APPEND BYTES ("xyz") → "!!!axyz" COPY BYTES (start=4, #bytes=1) → "!!!axyze"
In the case of the change we made to the license file (appending "!" to it), generating the updated version of the file from the original is easy:
COPY BYTES (start=0, #bytes=34523) → ...the entire contents of LICENSE.txt... APPEND BYTES ("!") → ...the entire contents of LICENSE.txt, with a ! on the end...
Reconstructing delta'fied objects
Delta'fied objects are stored in a pack like this:
base object identifier |
transformation data (compressed) |
There are two types of delta'fied objects, and they differ only in how the base object is specified:
- OFS_DELTA: a variable-length integer that specifies the offset to the base object within the same .pack file.
- REF_DELTA: the base object's name (20 raw bytes), which could be in another .pack file.
Handling REF_DELTA's requires some code that we will write later, so for now, we'll just consider OFS_DELTA's. We update the code to recognize this type of object:
# read the object data if obj_type in ( 1, 2, 3, 4 ): # commit, tree, blob, tag obj_data = decompress_stream( fp ) assert len(obj_data) == obj_size obj_type = OBJ_TYPES[ obj_type ] elif obj_type == 6: # ofs_delta obj_type, obj_data = _read_ofs_delta_obj( fp, obj_size, fpos, depth=depth ) else: assert False, "ERROR: Unknown object type: {}".format( obj_type )
The file offset to the base object follows immediately after the object type+size, also as a variable-length integer, but this time using "offset encoding". This one has the additional twist of adding 27 to each byte as it is processed:
def read_vli_be( fp, offset=False ): """Read a variable-length integer (big-endian).""" val = 0 while True: # add in the next 7 bits of data byt = fp.read( 1 )[0] val = (val << 7) | (byt & 0x7f) if byt & 0x80 == 0: break # nb: that was the last byte if offset: # NOTE: When reading offsets for delta'fied objects, there is an additional twist :-/ # The sequences [ 0xxxxxxx ] and [ 10000000, 0xxxxxxx ] would normally be read as # the same value (0xxxxxxx), so for each byte except the last one, we add 2^7, # which has the effect of ensuring that all 1-byte sequences are less than all 2-byte # sequences, which are less than all 3-byte sequences, etc. We add 1 here, but since # we are going to loop back and left-shift val by 7 bits, that is the same as adding 2^7. # Look for "offset encoding" here: # https://git-scm.com/docs/pack-format val += 1 return val
We can now get the offset[2]The number we just read is not an absolute offset into the file, but the number of bytes counting backwards from the start of the object we're currently reading. to the base object, so we can read it from the pack file. The thing is, this object might itself be delta'fied (as might its base object, and so on), so clearly, we need a recursive solution.
def _read_ofs_delta_obj( fp, obj_size, fpos, depth ): """Read an OBJ_OFS_DELTA object.""" # read the base object offset offset = read_vli_be( fp, offset=True ) base_obj_offset = fpos - offset trace( "- offset = 0x{:x} (relative=0x{:x})", base_obj_offset, offset, depth=depth ) # get the base object # IMPORTANT: The base object could itself be delta'fied. trace( "Reading base object: fpos=0x{:x}", base_obj_offset, depth=depth ) prev_fpos = fp.tell() fp.seek( base_obj_offset ) # NOTE: We wrote _read_pack_object() earlier, to read an object from a pack file, # and we are actually running right now as part of a call to this function. # Calling it again here gives us the recursive behavior we need. base_obj_type, base_obj_data = _read_pack_object( fp, depth=depth+1 ) assert base_obj_type in ( "commit", "tree", "blob", "tag" ) fp.seek( prev_fpos ) trace( "- Read base object: type={}", base_obj_type, depth=depth, data=base_obj_data ) # reconstruct the delta'fied object return _make_delta_obj( fp, base_obj_type, base_obj_data, obj_size, depth=depth )
Note that we're still in the process of reading our object, so we have to remember where we're currently at within the file, before jumping somewhere else to read the base object.
Once we have the base object, we can decompress the transformation data, which has the following format:
base object size (VLI) |
size of the object being reconstructed (VLI) |
transformation command |
transformation command |
...etc... |
Here's the code that reads the header:
def _make_delta_obj( fp, base_obj_type, base_obj_data, obj_size, depth ): """Reconstruct a delta'fied object.""" # read the transformation data data = decompress_stream( fp ) trace( "- transformation data:", depth=depth, data=data ) assert len(data) == obj_size # set up a new input stream for the transformation data fp2 = io.BytesIO( data ) base_obj_size = read_vli_le( fp2 ) obj_size2 = read_vli_le( fp2 ) trace( "- base_obj_size={}, obj_size={}", base_obj_size, obj_size2, depth=depth )
Transformation commands
There are only 2 transformation commands available, and the MSB of the lead byte is used to distinguish between them:
- 0: append new bytes
- 1: copy bytes from the base object
Appending new bytes is easy - the remaining 7 bits of the lead byte holds the number of bytes of new data to be appended, which follow immediately after.
To copy bytes from the base object, we need 2 things:
- where to start copying from
- how many bytes to copy
These two numbers follow the lead byte as little-endian values (4 bytes for the start offset, 3 bytes for the size), but the thing is, all 7 bytes are optional, and whether or not they are present is controlled by the 7 remaining bits in the lead byte, like this:
For example, the bytes 0xb0 0xdb 0x86 would be interpreted like this:
which would be read as offset 0x00000000, and size 0x0086db (or 34,523 decimal).
If you don't have a background in C or assembler, this might seem like an incredibly convoluted way of doing things, but when you're doing systems programming, it's actually quite common and easy to do. Unfortunately, in Python it's bit janky
# process the delta commands obj_data = b"" while fp2.tell() < len( data ): ch = fp2.read( 1 ) byt = ch[0] if byt == 0x00: continue if (byt & 0x80) != 0: # copy data from base object vals = [] for i in range( 0, 6+1 ): bmask = 1 << i if (byt & bmask) != 0: vals.append( fp2.read( 1 )[0] ) else: vals.append( 0 ) start = int.from_bytes( vals[0:4], byteorder="little" ) nbytes = int.from_bytes( vals[4:6], byteorder="little" ) if nbytes == 0: nbytes = 0x10000 trace( "- COPY FROM BASE OBJECT: start=0x{:x}, #bytes={}", start, nbytes, depth=depth ) obj_data += base_obj_data[ start : start + nbytes ] else: # add new data nbytes = byt & 0x7f trace( "- APPEND NEW BYTES: #bytes={}", nbytes, depth=depth ) obj_data += fp2.read( nbytes ) trace( "- Final object data: #bytes={}", len(obj_data), depth=depth, data=obj_data ) assert len(obj_data) == obj_size2 return base_obj_type, obj_data
If we run the new pack.py script[3]Get this at the bottom of the page. again, we are now able to dump all the objects, including the OFS_DELTA one (marked in orange).
We can see two blob objects, one of size 34,523 bytes (the original), and one of 34,524 bytes (the modified version, marked in blue).
If we look at the trace output for how the OFS_DELTA object was reconstructed, we can see that it is the object at offset 0x2f80, and uses the object at offset 0x120 as its base object:
Reading object: fpos=0x2f80
- type=ofs_delta, size=9
- offset = 0x120 (relative=0x2e60[4]The offset stored in the .pack file is 0x2e60, but this is relative to the start of the object being reconstructed (0x2f80), so the offset to the base object is 0x2f80 - 0x2e60 = 0x120.)
Reading the object at offset 0x120 was straight-forward, since it's just a blob object:
Reading base object: fpos=0x120 > Reading object: fpos=0x120 > - type=blob, size=34524 > Read object: type=blob, #bytes=34524
Note that at 34,524 bytes, the base object must be the modified version of the license file, which means that we are reconstructing the original version.
We then decompress the transformation data for the object we are currently dumping:
- transformation data: 00000 | dc 8d 02 db 8d 02 b0 db 86 .. .. .. .. .. .. .. | .........
This starts with 2 variable-length integers, the base object size and the size of the object being reconstructed:
0xdc 0x8d 0x02 → 0x86dc = 34,524 (base object size) 0xdb 0x8d 0x02 → 0x86db = 34,523 (reconstructed object size)
The remaining bytes are the transformation command(s). The lead byte (0xb0) has its MSB set, so it's a "copy from base object" command, and the remaining bits define which of the offset and size bytes are present:
Reading these as little-endian numbers, we get an offset of 0x00000000 and a size of 0x0086db, or 34,523 bytes.
So, we copy 34,523 bytes from the modified license file, starting at offset 0. There are no more transformation commands, so we're done, and we have indeed reconstructed the original license file[5]The modified file is 34,524 bytes in size, so by copying the first 34,523 bytes, we omit the trailing "!"..
Note that reconstructing the original file from the modified file, rather than the other way around, is deliberate. The most recent version of a file is always stored as the full copy, and revisions going back in time are constructed by progressively applying more and more transformations. This is because you are most likely to want the most recent version of a file (which will be fastest to retrieve, since it will require no transformations), and less likely to want older revisions (which will be slower to retrieve since it will require the base object to be retrieved, and then transformations applied to it, and more of them, the older the revision is).
In this way, git is able to balance speed of operation with disk space used. The stuff you've worked on most recently is stored as loose files, which are fast to access, but take up more space. Periodically, these are gathered up and moved into a pack, where they take up much less space, but are slower to retrieve (with how slow depending on how many revisions back they are).
Source code
The pack.py script has been updated to handle OFS_DELTA (type 6) objects.
Trace messages are sent to stderr, so just redirect that to /dev/null if you don't want to see them (or disable the trace() function)[6]We'll set up proper logging later..
References
↵1 | Much like the older source control systems e.g. cvs or svn. |
---|---|
↵2 | The number we just read is not an absolute offset into the file, but the number of bytes counting backwards from the start of the object we're currently reading. |
↵3 | Get this at the bottom of the page. |
↵4 | The offset stored in the .pack file is 0x2e60, but this is relative to the start of the object being reconstructed (0x2f80), so the offset to the base object is 0x2f80 - 0x2e60 = 0x120. |
↵5 | The modified file is 34,524 bytes in size, so by copying the first 34,523 bytes, we omit the trailing "!". |
↵6 | We'll set up proper logging later. |