At its lowest level, git is a content-addressable store: a place where you can store data, indexed by its content. You give it some data, git stores it and gives you back an ID that is generated from that data[1]. Later on, when you want to retrieve the data, you give git the ID, and it returns the original data.
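To make the idea concrete, here is a minimal sketch of a content-addressable store in Python. The `ContentStore` class and its `put`/`get` methods are hypothetical names made up for this illustration, and it uses a plain SHA-1 of the data as the ID, which is a simplification and not exactly how git computes its IDs.

```python
import hashlib

class ContentStore:
    """Toy content-addressable store (illustration only, not git's implementation)."""

    def __init__(self):
        self._objects = {}  # maps ID (hex string) -> stored bytes

    def put(self, data: bytes) -> str:
        # The ID is derived purely from the content.
        # ASSUMPTION: plain SHA-1 of the data, chosen to keep the sketch simple.
        obj_id = hashlib.sha1(data).hexdigest()
        self._objects[obj_id] = data
        return obj_id

    def get(self, obj_id: str) -> bytes:
        # Give back the original data for this ID.
        return self._objects[obj_id]

store = ContentStore()
id1 = store.put(b"hello world\n")
id2 = store.put(b"hello world\n")  # the same content, stored a second time
print(id1 == id2)                  # identical content yields the identical ID
print(store.get(id1))              # the original data comes back
```

Because the ID depends only on the content, storing the same data twice costs nothing extra: the second `put` simply overwrites the entry with identical bytes.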
Storing an object
Let's start off with a new repo, and take a look at what's in there.
If we ignore the sample hooks, there are 4 files:
File | Purpose |
---|---|
.git/config | config settings that apply only to this repo |
.git/info/exclude | patterns for files to ignore[2] |
.git/description | a description of the repo[3] |
.git/HEAD | a symbolic ref that records which branch (or commit) you are currently on |
Content is stored in files in the .git/objects/ directory, which is, unsurprisingly, empty.
We create the canonical test file, and add it to git.
git stores[4] the file's content (known as an object) and returns its ID[5] (known as its name), which we can use to retrieve it.
If we take a look at the .git/ directory, we can see that a new file has been created.
Note that files are stored by their object name, but in separate directories based on their first byte (the first two hex digits of the name)[6].
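The directory layout can be sketched as a small helper that maps an object name to its expected location. The function name `loose_object_path` is hypothetical, and the ID used below is just a sample 40-character name (the well-known ID of an empty blob), not the object created in this tutorial.

```python
import os

def loose_object_path(obj_name, git_dir=".git"):
    """Return the path where a loose object with this name would be stored."""
    # The first 2 hex digits (i.e. the first byte) name the directory,
    # and the remaining 38 hex digits name the file inside it.
    return os.path.join(git_dir, "objects", obj_name[:2], obj_name[2:])

# ASSUMPTION: a sample object name, used only for illustration.
print(loose_object_path("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))
```

This is the reverse of what the dump script further down does with its regular expression, which glues the directory name and the file name back together to recover the full object name.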
If we look inside the file, it appears to be binary data.
However, it's actually zlib-deflated, so if we decompress[7] it first, it looks much more recognizable.
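If you don't have a zlib command-line tool handy, Python's standard `zlib` module can do the same job. The sketch below simulates the contents of an object file in memory rather than reading a real one; the sample bytes and the `inflate` helper are assumptions made up for this illustration.

```python
import zlib

def inflate(raw: bytes) -> bytes:
    """Decompress the raw bytes of a loose object file."""
    return zlib.decompress(raw)

# ASSUMPTION: sample object bytes for illustration, not the tutorial's actual file.
original = b"blob 12\x00hello world\n"
on_disk = zlib.compress(original)  # what would be written to disk; looks like binary data

print(on_disk[:2])       # the compressed stream starts with the byte 0x78
print(inflate(on_disk))  # decompressing makes the content recognizable again
```

To inspect a real object, you would pass the file's bytes instead, e.g. `inflate(open(path, "rb").read())`, where `path` points at a file under .git/objects/.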
The format of these files is:
the type of object (ASCII) |
an ASCII space (0x20) |
the size of the uncompressed data (ASCII) |
a binary 0 |
the data itself |
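The layout above is simple enough to build and take apart by hand. The following sketch uses hypothetical helper names (`encode_object`, `decode_object`, `object_id`) and relies on the fact that git computes an object's name as the SHA-1 of this uncompressed form, header included.

```python
import hashlib

def encode_object(obj_type: str, data: bytes) -> bytes:
    """Build the uncompressed object: type, space, size, zero byte, data."""
    header = "{} {}".format(obj_type, len(data)).encode("ascii")
    return header + b"\x00" + data

def decode_object(raw: bytes):
    """Split an uncompressed object back into (type, size, data)."""
    # Split on the FIRST zero byte only, since the data itself may contain zeros.
    header, _, data = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    return obj_type.decode("ascii"), int(size), data

def object_id(obj_type: str, data: bytes) -> str:
    """Compute the object name: SHA-1 over the header plus the data."""
    return hashlib.sha1(encode_object(obj_type, data)).hexdigest()

raw = encode_object("blob", b"hello world\n")
print(raw)
print(decode_object(raw))
print(object_id("blob", b""))  # the name of an empty blob
```

The last line should print e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the name git gives to an empty file, which is a handy sanity check that the header is being built correctly.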
git provides a suite of plumbing commands, which we can use to query the object we just created, e.g. git cat-file -t to get its type, and git cat-file -p to get its data.
Dumping object files
We'll be looking at these files a bit over the course of this tutorial, and dumping them manually gets old very quickly, so let's write a quick script to do it for us.
First, we walk the .git/objects/ directory, looking for object files:
    # walk the git objects directory
    objs = []
    for root, dirs, files in os.walk( ".git/objects" ):
        for fname in files:
            # check if the next file looks like an object
            fname = os.path.join( root, fname )
            mo = re.search( r"([0-9a-f]{2})[/\\]([0-9a-f]{38}$)", fname )
            if not mo:
                # NOTE: We can get here if there are extra files in the objects directory (e.g. pack files),
                # but this shouldn't happen on a newly-created repo which hasn't had too much activity.
                print( "UNKNOWN FILE:", fname, file=sys.stderr )
                continue
            # yup - save it
            objs.append( (
                mo.group(1) + mo.group(2), # nb: this is the full object name
                os.stat( fname ).st_mtime # nb: file timestamp
            ) )
It's not essential, but it's useful to sort them by timestamp:
    # sort the objects by timestamp
    objs.sort( key = lambda obj: obj[1] )
Then we run git cat-file on each file, to get its type and content:
    # dump the objects
    for obj_no, (obj_id, tstamp) in enumerate( objs ):
        obj_type = run_git( "cat-file", "-t", obj_id, utf8=True, strip=True )
        tstamp = datetime.fromtimestamp( tstamp ).strftime( "%H:%M:%S.%f" )
        if obj_no > 0:
            print()
        print( "=== {} [{}] ({}) ===".format( obj_id, obj_type, tstamp ) )
        print()
        obj_data = run_git( "cat-file", "-p", obj_id )
        try:
            print( obj_data.decode( "utf-8", errors="replace" ), end="" )
        except UnicodeError:
            print( obj_data )
This is the function that actually runs git:
    def run_git( cmd, *args, repo_dir=".", utf8=False, strip=False ):
        """Run git and return the output."""
        args = list( itertools.chain(
            [ _git_path, "--git-dir", os.path.join( repo_dir, ".git" ), cmd ],
            args
        ) )
        proc = subprocess.run( args, capture_output=True, check=True )
        output = proc.stdout
        if utf8:
            output = output.decode( "utf-8" )
        if strip:
            output = output.rstrip()
        return output
Note that because object data can be anything at all, we have to be mindful of whether we're dealing with text or binary data.
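One way to handle this distinction is to attempt a strict decode and fall back to an escaped representation when the bytes are not valid text. This is a sketch of an alternative approach, not what the dump script does (it decodes with errors="replace"), and the `render` helper is a hypothetical name.

```python
def render(data: bytes) -> str:
    """Return the data as text if it is valid UTF-8, else as an escaped bytes literal."""
    try:
        return data.decode("utf-8")  # strict decoding raises on invalid byte sequences
    except UnicodeDecodeError:
        return repr(data)            # e.g. b'\x89PNG...' for binary content

print(render(b"hello world\n"), end="")  # text is shown as-is
print(render(b"\x89PNG\r\n\x1a\n"))      # binary data is shown escaped
```

The trade-off is that strict decoding never silently swaps bytes for replacement characters, at the cost of showing the whole object in escaped form as soon as a single byte is invalid.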
Running this script[8] over our repo shows the one object we just created.
Storing binary data
Let's try storing another file, but this time with binary data.
Download this 1x1 image file and save it somewhere outside the repo, then add it to the repo using the --stdin switch:
This time, git has read the data from stdin, which demonstrates that the filename is not used as part of the process. The only thing that matters is the content being stored.
If we take a look at the .git/objects/ directory, we can see that a new object file has been created.
And our dump script shows that the new object looks like a PNG file.
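Rather than eyeballing the dump, we can check for a PNG programmatically, since every PNG file starts with the same fixed 8-byte signature. The helper name `looks_like_png` and the sample inputs below are assumptions for illustration; in practice you would pass in the object data returned by git cat-file -p.

```python
# The fixed 8 bytes that start every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    """Check whether the data begins with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)

# ASSUMPTION: made-up sample data, standing in for real object content.
print(looks_like_png(PNG_SIGNATURE + b"rest of the image"))
print(looks_like_png(b"hello world\n"))
```

Note that the leading 0x89 byte is deliberately not valid text, which is why the object shows up as binary data in the dump.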
Source code
A new script dump_loose_objects.py will walk your .git/objects/ directory, looking for loose object files, and dump them.
References
1 | In particular, if you try to store two different things (say, files) that happen to have the same data, you will get back the same ID. |
2 | This is similar to the .gitignore file. |
3 | This is used only by the web interface. |
4 | Because we specified -w; omit this to just calculate and show the ID. |
5 | This is actually the SHA-1 checksum of the content, prefixed with the short header described in the object file format. Support for SHA-256 is also available, but currently experimental. |
6 | This is to avoid having a single directory with a huge number of files in it, which can be slow and/or cause problems on some operating systems. |
7 | zlib-deflate is part of the qpdf package. |
8 | Get this at the bottom of the page. |