Git Guts: How git implements source control
Friday 7th January 2022 7:51 PM

In the previous section, we saw how git allows arbitrary blobs of data to be stored. In this section, we'll take a look at how git builds on top of this to provide a source control system.

What's in a commit

Let's start again with a new repo, then create and commit a test file.

Running the dump script, we see that three objects have been created in the object store.

The commit object at the end (marked in red) represents the commit itself. We can see the details of the author, and the commit message, but there's an extra line at the beginning, which is a reference to a tree object that was also created (marked in green).

This tree object describes the entire repo at the time of the commit. You may have heard that while other source control systems track just what changed in each commit, git stores a snapshot of the entire repo. Well, this is what that refers to. Right now, there's only one file in the repo, and so only one entry in the tree object, but as you add more files to the repo, the tree objects will get larger, as they track them all.

Finally, we can see that the tree object references a third object that was created, a blob object that contains the new file we added (marked in blue).
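A commit object's body is simple enough to pick apart by hand. The following is a minimal sketch in Python, not git's own parser: parse_commit is a hypothetical helper, the example body is made up (the identity, timestamp and tree hash are placeholders), and multi-line headers such as signatures are simply skipped.

```python
def parse_commit(body: bytes) -> dict:
    """Split a commit object's body into its header fields and message."""
    text = body.decode("utf-8", errors="replace")
    # Headers and message are separated by the first blank line.
    header, _, message = text.partition("\n\n")
    fields = {"parents": []}
    for line in header.split("\n"):
        if line.startswith(" "):
            continue  # continuation of a multi-line header; ignored in this sketch
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parents"].append(value)  # merge commits have several parents
        else:
            fields[key] = value
    fields["message"] = message
    return fields

# Placeholder commit body, for illustration only.
example = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A. Hacker <hacker@example.com> 1641585060 +0000\n"
    b"committer A. Hacker <hacker@example.com> 1641585060 +0000\n"
    b"\n"
    b"Added a greeting.\n"
)
commit = parse_commit(example)
print(commit["tree"])
print(commit["parents"])
print(commit["message"])
```

The tree field is the "extra line at the beginning" described above; following it leads to the tree object, and from there to the blobs.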

Format of a tree object

Commit objects are plain text, and blob objects contain whatever the file they represent contains, but tree objects are stored as a mixture of text and binary content, and record the files in a directory.

They start off with the standard object type and byte count (marked in green), followed by the tree data (one entry per file):

the file's mode, i.e. its type and permissions (ASCII octal digits)
an ASCII space (0x20)
the filename[1] (NUL-terminated)
the hash of the blob object that holds the file's contents, or of the tree object for a sub-directory (20 raw bytes)

In this example, there is only one file, but if there are more, they follow one after the other.
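The entry layout above can be decoded with a short loop. This is a sketch under a few assumptions: parse_tree is a hypothetical helper, it expects the tree data with the "tree <size>" header already stripped, it assumes 20-byte SHA-1 hashes, and it decodes filenames as UTF-8 even though the encoding is unspecified. The hash in the example is the ID of an empty blob, used purely as a placeholder.

```python
def parse_tree(body: bytes) -> list:
    """Return (mode, name, hex_hash) tuples from the body of a tree object."""
    entries = []
    pos = 0
    while pos < len(body):
        # The mode never contains a space, so the first space ends it.
        space = body.index(b" ", pos)
        # The filename runs up to the NUL terminator.
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8", errors="replace")
        # The hash is stored as 20 raw bytes, not as hex text.
        sha = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries

# One entry: a regular file called hello.txt (placeholder hash).
raw = b"100644 hello.txt\x00" + bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
entries = parse_tree(raw)
print(entries)
```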

Adding more files

Let's add some more files to the repo, this time in a sub-directory.

We run the dump script again[2] to see what git has in its object store.

At the bottom is the new commit object we just created (marked in red). Note that it has a new parent field that points to its parent commit[3].

The commit object points to a new tree object (marked in green) that represents the root of the working tree. Note that this contains the old hello.txt file which, even though it wasn't changed as part of this commit, is still part of the repo, and so is listed here[4].

Drilling down into the tree object for subdir/, we can see two blob objects for the new files we created (marked in blue).

The picture below shows a traditional diagram of the git history across the top, and the internal representation underneath:

Modifying and deleting files

Let's take a look at what happens when we change an existing file, or delete a file.

If we look at the tree object that got created for this commit, we can see that hello.txt has a new blob object for the new version of the file (marked in blue).

However, the old object is still there, as is the file we deleted (marked in orange).

Why? Well, git is a source control system, and has to be able to retrieve old versions of files, e.g.[5]:

    $ git log --pretty=oneline hello.txt
    4faaa8c1a998c76b16b59c08f401a4cfcb20362d (HEAD -> master) Modified hello, deleted file2.
    6577c1ef53a95d6bc73e7479b93a574e32787188 Added a greeting.

    $ git show 6577c1ef53a95d6bc73e7479b93a574e32787188:hello.txt
    Hello, world.

    $ git show 4faaa8c1a998c76b16b59c08f401a4cfcb20362d:hello.txt
    yo!

So, this is why it needs to keep the old objects in the store.

Going back to the new tree object that was created, if we drill down into the subdir tree, we can see that file1.txt is no longer there.

But similarly, the underlying blob object is still there, in case it ever needs to be retrieved:

    $ cat subdir/file1.txt
    cat: subdir/file1.txt: No such file or directory

    $ git cat-file -p 5c1170f2eaac6f78662a8cf899326a4b95c80dd2
    This is file 1.
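To see why the blob is still retrievable, it helps to look at what retrieving it involves. The sketch below assumes the object is stored "loose" (one zlib-compressed file per object under .git/objects, named after its hash); objects that git has packed won't be found this way. read_loose_object is a hypothetical helper, and the demo builds a throwaway object store in a temporary directory rather than touching a real repo.

```python
import hashlib
import os
import tempfile
import zlib

def read_loose_object(git_dir: str, sha: str):
    """Read a loose object and return (type, body)."""
    # The first two hex digits name the directory, the rest the file.
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    with open(path, "rb") as fp:
        raw = zlib.decompress(fp.read())
    # Stored form is b"<type> <size>\0<body>".
    header, _, body = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type.decode("ascii"), body

# Demo: write a blob into a temporary object store, then read it back.
with tempfile.TemporaryDirectory() as git_dir:
    content = b"This is file 1.\n"
    raw = b"blob %d\x00" % len(content) + content
    sha = hashlib.sha1(raw).hexdigest()
    obj_dir = os.path.join(git_dir, "objects", sha[:2])
    os.makedirs(obj_dir)
    with open(os.path.join(obj_dir, sha[2:]), "wb") as fp:
        fp.write(zlib.compress(raw))
    obj_type, body = read_loose_object(git_dir, sha)

print(obj_type, body)
```

Nothing here depends on the file being present in the working tree: as long as the object file exists, its content can be recovered from the hash alone.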

Updating our comparison diagram, we can see how git re-uses[6] objects that haven't changed, i.e. the blob objects for the files hello.txt and file2.txt:

Manually creating a commit

We've seen how git creates commits by creating three objects in the store, so let's try manually creating these three objects ourselves, and getting them to show up in the repo history as a commit.

For our manual commit, we will add a new file called manual.txt, so first, we need to add a blob object that contains its content.
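As a rough illustration of what adding a blob involves, the sketch below computes an object's ID and its stored form in Python. It is not the command used in this walkthrough: make_object is a hypothetical helper, and it assumes a repo using SHA-1 object IDs. The result would be written to .git/objects/<first 2 hex digits>/<remaining 38>.

```python
import hashlib
import zlib

def make_object(obj_type: str, body: bytes):
    """Return (hex_id, compressed_bytes) for a git object of the given type."""
    # The ID is the SHA-1 of the header plus the body, before compression.
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\x00"
    raw = header + body
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)

# An empty blob always gets the same, well-known ID.
blob_id, stored = make_object("blob", b"")
print(blob_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# The same helper works for the content of manual.txt (content is a placeholder).
manual_id, manual_stored = make_object("blob", b"Created manually.\n")
print(manual_id)
```

Because the ID depends only on the type and content, adding the same file content twice yields the same object, which is how git avoids storing duplicates.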

We also need a tree object that describes the files and directories in the repo and in particular, references the new blob object for the manual.txt file we just created. We can do this by adding manual.txt to the index, and then getting git to create a tree object based on what's in the index.
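For comparison, here is a sketch of what the resulting tree data looks like when assembled by hand, using the entry format described earlier. build_tree is a hypothetical helper, the blob hashes are placeholders (the empty-blob ID), and SHA-1 hashes are assumed. Entries are sorted by name, with directories (mode 40000) compared as if their name ended in "/", which matches how git orders tree entries.

```python
import hashlib

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # placeholder hash

def build_tree(entries) -> bytes:
    """Serialise (mode, name, hex_hash) entries into the body of a tree object."""
    def sort_key(entry):
        mode, name, _ = entry
        # Directories sort as though their name had a trailing slash.
        return name.encode("utf-8") + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, sha in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(sha)  # 20 raw bytes, not hex text
    return body

tree_body = build_tree([
    ("100644", "manual.txt", EMPTY_BLOB),
    ("100644", "hello.txt", EMPTY_BLOB),
])
# The tree's own ID is the SHA-1 of its header plus body.
tree_id = hashlib.sha1(b"tree %d\x00" % len(tree_body) + tree_body).hexdigest()
print(len(tree_body), tree_id)
```

Letting git build the tree from the index, as described above, avoids having to get these details right by hand.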

Finally, we need a commit object, which specifies the parent commit, a commit message, and the associated tree object.
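The body of such a commit object is just a few lines of text. The sketch below assembles one in Python; build_commit is a hypothetical helper, the identity and timestamp are made up, the tree hash is a placeholder, and the parent is the hash of the last commit shown in the git log output earlier.

```python
def build_commit(tree, parents, ident, timestamp, tz, message) -> bytes:
    """Assemble the body of a commit object."""
    lines = ["tree " + tree]
    # The first commit in a repo has no parent line; merges have several.
    lines += ["parent " + p for p in parents]
    who = f"{ident} {timestamp} {tz}"
    lines += ["author " + who, "committer " + who]
    # A blank line separates the headers from the commit message.
    lines += ["", message]
    return "\n".join(lines).encode("utf-8")

commit_body = build_commit(
    tree="4b825dc642cb6eb9a060e54bf8d69288fbee4904",       # placeholder tree hash
    parents=["4faaa8c1a998c76b16b59c08f401a4cfcb20362d"],
    ident="A. Hacker <hacker@example.com>",                # made-up identity
    timestamp=1641585060,
    tz="+0000",
    message="Added manual.txt by hand.\n",
)
print(commit_body.decode("utf-8"))
```

Hashing and compressing this body with a "commit <size>" header, in the same way as for blobs and trees, gives the object that goes into the store.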

If we run our dump script, we can see the new objects we created, but unfortunately, our commit is not showing up in the git history.

However, the first line of the git log output gives us a clue as to what the problem is: HEAD.

HEAD is a symbolic ref that points to where you currently are in the commit history, i.e. which commit you're currently on. It's stored in .git/HEAD, and if we look in this file, we can see that it's pointing to a reference called master.

References are just aliases for commits, with branches being a special kind of ref that keeps updating itself as you add commits, so that it always points to the last commit on that branch.

If we take a look at the master ref, we can see that it's still pointing to the last "normal" commit we made, not the manual one we created here.
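Following HEAD to a commit is a two-step lookup, which can be sketched as follows. resolve_head is a hypothetical helper; it assumes the ref is stored as a plain file under the git directory (refs that git has packed into a single file are not handled), and the demo uses a temporary directory standing in for .git, with the commit hash taken from the git log output earlier.

```python
import os
import tempfile

def resolve_head(git_dir: str):
    """Return (ref_name, commit_hash) for HEAD; either may be None."""
    with open(os.path.join(git_dir, "HEAD")) as fp:
        head = fp.read().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file holds a commit hash directly.
        return None, head
    ref = head[len("ref: "):]
    ref_path = os.path.join(git_dir, *ref.split("/"))
    if not os.path.exists(ref_path):
        # No loose ref file (e.g. packed refs, or a branch with no commits yet).
        return ref, None
    with open(ref_path) as fp:
        return ref, fp.read().strip()

# Demo: a stand-in for .git with HEAD pointing at the master branch.
with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as fp:
        fp.write("ref: refs/heads/master\n")
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as fp:
        fp.write("4faaa8c1a998c76b16b59c08f401a4cfcb20362d\n")
    ref, commit_hash = resolve_head(git_dir)

print(ref, commit_hash)
```

Since git log starts from whatever HEAD resolves to, a commit object that no ref points at never appears in the history.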

We could edit the master ref file directly to point to the new commit object we created, but the recommended way to change refs is to use git update-ref.

Now that the master ref has been updated to point to our new commit, it shows up in the git history. We can query it as if it were a normal commit, because it is a normal commit; we've just done manually what git commit does for you, behind the scenes.



References
1 The encoding is unspecified.
2 You'll need to cd up to the root directory of the repo.
3 The commit we created above didn't have this field, since it was the first commit in the repo.
4 Because commits always describe the entire working tree.
5 Because we're using normal git commands, we specify the hash of the commit, not the underlying blob object.
6 It hasn't happened in this tiny example, but tree objects will also be re-used where possible, which will happen for all but the smallest of repos.