When it comes to using git, I’ve mostly just learned as I go, figuring out whatever I need as situations arise. In doing so, though, I sometimes gloss over details about how things really work under the hood, so it’s nice to be able to learn a bit more by going through the tutorials/videos I mentioned before.

In fact, I learned that I’d been making an incorrect assumption: I thought git stored commits as deltas, i.e., only the bits that changed from the previous version. However, as I learned in GitHub’s video on Advanced Git Tricks, that’s not what’s going on. Each commit can be broken down into three kinds of objects: blobs, trees, and the commit object itself. Blobs are just content (anything from binary files like images to text files like source code); each one is identified internally by a hash of its content. Trees are like directory listings that link filenames to blobs (and subdirectory names to other trees). Commits contain information about the person doing the check-in, the hash of the top-level tree that captures the state being checked in, the hash of any parent commit, and the commit message.

When you make even the smallest change to a file, git creates a completely new blob with its own hash. This effect ripples upward, creating new trees and thus a new commit object. So no deltas actually get stored in these objects! (Git can later delta-compress objects when it packs them, but that is a storage optimization beneath the object model.) Also, what’s nice about using hashes is that it’s much easier, programmatically speaking, for git to recognize that two files are actually the same, since identical content always produces the same hash. This means that git can just store pointers to a single blob, even if you end up checking in the same content twice (under different filenames).
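To make the hashing idea concrete, here is a minimal Python sketch of how a blob’s ID can be computed. It assumes the default SHA-1 object format (newer repositories can opt into SHA-256), and the function name git_blob_hash is my own, not part of git.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID git would assign to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

original = git_blob_hash(b"hello world")
copy = git_blob_hash(b"hello world")                 # same bytes, different "file"
modified = git_blob_hash(b"hello world, goodbye\n")  # a small change

print(original)
print(original == copy)      # identical content gives an identical ID
print(original == modified)  # any change gives a different ID
```

Notice that the filename never enters the calculation, which is why two files with the same content can share one blob.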

To see this in action, consider this simple example. We’ll create a sample git project, add a text file, commit the file, modify the file, and then commit it again. We’ll also add another file with the same content and notice that git doesn’t create a new blob for it. The typical workflow for this would just be a series of git add and git commit commands. We’ll do that here, but we’ll also look under the hood a bit after each step.

$ git init test
$ cd test
$ echo -n "hello world" > helloworld.txt
$ git add helloworld.txt

After running git add, git creates a blob in the object database, which you can verify by looking in the .git/objects directory:

$ ls -lsa .git/objects
drwxr-xr-x  95
drwxr-xr-x  info
drwxr-xr-x  pack
$ ls -lsa .git/objects/95
-r--r--r-- d09f2b10159347eece71399a7e2e907ea3df4f

The 95 is the first two hex digits of the blob’s hash, used as a directory name; the remaining 38 characters are used as the file name inside that directory. Note that the hash is computed from the content of helloworld.txt, not from its name. By running git cat-file, we can verify that it’s a blob object and that it contains the "hello world" string.

$ git cat-file -t 95d09f2b10159347eece71399a7e2e907ea3df4f
blob
$ git cat-file -p 95d09f2b10159347eece71399a7e2e907ea3df4f
hello world
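The directory layout above can be sketched in a few lines of Python. This is only an approximation of what git does for a loose object: the helper name loose_object is hypothetical, SHA-1 is assumed, and the compressed bytes are not guaranteed to match git’s file byte for byte (that depends on zlib settings).

```python
import hashlib
import zlib

def loose_object(content: bytes, kind: bytes = b"blob"):
    """Return (object_id, relative_path, compressed_bytes) for a loose object."""
    # What gets hashed and stored is "<kind> <size>\0" plus the raw content.
    store = kind + b" " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    # The first two hex digits name the directory; the other 38 name the file.
    path = f".git/objects/{object_id[:2]}/{object_id[2:]}"
    return object_id, path, zlib.compress(store)

object_id, path, compressed = loose_object(b"hello world")
print(path)  # .git/objects/95/d09f2b10159347eece71399a7e2e907ea3df4f

# Reading it back: decompress, then split the header from the content.
header, _, body = zlib.decompress(compressed).partition(b"\0")
print(header.decode(), "->", body.decode())  # blob 11 -> hello world
```

The header is roughly what git cat-file -t reads, and the body is what git cat-file -p prints for a blob.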

After we check in helloworld.txt, we see that more objects are created in git’s object database:

$ git commit -m "add helloworld.txt"
[master (root-commit) 95e5ad0] add helloworld.txt
 1 file changed, 1 insertion(+)
 create mode 100644 helloworld.txt
$ ls -lsa .git/objects
drwxr-xr-x  33
drwxr-xr-x  95
drwxr-xr-x  info
drwxr-xr-x  pack

One of those objects is a tree, which we see lists helloworld.txt’s blob hash along with the user-provided filename. The 100644 part is the file mode of the entry (a regular, non-executable file).

$ ls -lsa .git/objects/33
-r--r--r-- afd6485aadae927bc4bc2986ea9a0d86d5d699
$ git cat-file -t 33afd6485aadae927bc4bc2986ea9a0d86d5d699
tree
$ git cat-file -p 33afd6485aadae927bc4bc2986ea9a0d86d5d699
100644 blob 95d09f2b10159347eece71399a7e2e907ea3df4f	helloworld.txt
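For the curious, here is a rough Python sketch of how a tree’s ID is derived from its entries. It assumes SHA-1 and only handles plain files (git sorts directory entries slightly differently, which this ignores); git_hash and tree_payload are names I made up for illustration. If the transcript above is accurate, the printed ID should match the tree hash shown there.

```python
import hashlib

def git_hash(kind: bytes, payload: bytes) -> str:
    """Hash "<kind> <size>\0<payload>" the way git names its objects."""
    header = kind + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

def tree_payload(entries):
    """entries: iterable of (mode, filename, hex_object_id) tuples."""
    payload = b""
    # Entries are stored sorted by name; byte order is enough for plain files.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        # Each entry is "<mode> <name>\0" followed by the 20 raw hash bytes.
        payload += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    return payload

blob_id = git_hash(b"blob", b"hello world")
tree_id = git_hash(b"tree", tree_payload([("100644", "helloworld.txt", blob_id)]))
print(tree_id)
```

Because the blob’s hash is part of the tree’s content, changing a file changes its blob hash, which in turn changes the tree hash.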

We also see there’s a commit object, which contains a link to the tree object’s hash signature:

$ ls -lsa .git/objects/95
-r--r--r-- d09f2b10159347eece71399a7e2e907ea3df4f
-r--r--r-- e5ad0870d4ab32784fe634b9820ac525db1591
$ git cat-file -t 95e5ad0870d4ab32784fe634b9820ac525db1591
commit
$ git cat-file -p 95e5ad0870d4ab32784fe634b9820ac525db1591
tree 33afd6485aadae927bc4bc2986ea9a0d86d5d699
author ktnode <ktnode@email.com> 1392936563 -0500
committer ktnode <ktnode@email.com> 1392936563 -0500
add helloworld.txt
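The commit object is just a small piece of text, so it can be sketched in Python too. The helper names here are hypothetical, SHA-1 is assumed, and the sketch ignores extras like signatures and merge commits with several parents. The printed ID will only match the one in the transcript if every byte (name, email, timestamps, message) is identical to what git recorded.

```python
import hashlib

def commit_payload(tree, author, when, message, parent=None):
    """Build the text git stores for a simple commit (a simplified sketch)."""
    lines = ["tree " + tree]
    if parent is not None:
        lines.append("parent " + parent)
    lines.append(f"author {author} {when}")
    lines.append(f"committer {author} {when}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")

def commit_id(payload: bytes) -> str:
    header = b"commit " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

# Values copied from the transcript above.
root = commit_payload(
    tree="33afd6485aadae927bc4bc2986ea9a0d86d5d699",
    author="ktnode <ktnode@email.com>",
    when="1392936563 -0500",
    message="add helloworld.txt",
)
print(root.decode())
print(commit_id(root))
```

Since the tree hash is embedded in the commit text, a new tree always leads to a new commit hash.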

The git commit command prints the beginning of the commit object’s hash (95e5ad0), so it’s not surprising that we find the commit object in the .git/objects/95 folder. (It’s a coincidence that the commit and the blob both start with 95.)

Now let’s modify the file, check it in again, and see how git treats it under the hood. We’ll see that there’s yet again a new object in git’s database: a new blob gets created, even though we’re touching the same file we just checked in. This shows us that a delta/diff is not what gets stored in git’s object database!

$ echo ', goodbye' >> helloworld.txt
$ git add helloworld.txt
$ ls -lsa .git/objects
drwxr-xr-x  33
drwxr-xr-x  66
drwxr-xr-x  95
drwxr-xr-x  info
drwxr-xr-x  pack
$ ls -lsa .git/objects/66
-r--r--r--  b049f59fc01dfb6d97dfe93dd1e149e6c2fa48
$ git cat-file -t 66b049f59fc01dfb6d97dfe93dd1e149e6c2fa48
blob

When we commit our changes, we’ll see that two new objects are again created (a tree and a commit object). The commit object contains a reference to the hash of the new tree object, as well as a parent line pointing to the previous commit.

$ git commit -m "modified helloworld.txt"
[master 3a70f16] modified helloworld.txt
 1 file changed, 1 insertion(+), 1 deletion(-)
$ ls -lsa .git/objects
drwxr-xr-x  33
drwxr-xr-x  3a
drwxr-xr-x  60
drwxr-xr-x  66
drwxr-xr-x  95
drwxr-xr-x  info
drwxr-xr-x  pack
$ ls -lsa .git/objects/60
-r--r--r--  74b9e407c82a07393cc1b7fe4ca31370e3f5a3
$ git cat-file -t 6074b9e407c82a07393cc1b7fe4ca31370e3f5a3
tree
$ git cat-file -p 6074b9e407c82a07393cc1b7fe4ca31370e3f5a3
100644 blob 66b049f59fc01dfb6d97dfe93dd1e149e6c2fa48	helloworld.txt
$ ls -lsa .git/objects/3a
-r--r--r--  70f16aea5ca9153500b73f779109399ba11f51
$ git cat-file -t 3a70f16aea5ca9153500b73f779109399ba11f51
commit
$ git cat-file -p 3a70f16aea5ca9153500b73f779109399ba11f51
tree 6074b9e407c82a07393cc1b7fe4ca31370e3f5a3
parent 95e5ad0870d4ab32784fe634b9820ac525db1591
author ktnode <ktnode@email.com> 1392936813 -0500
committer ktnode <ktnode@email.com> 1392936813 -0500
modified helloworld.txt

Now let’s consider what happens when we create a file with the same contents as our helloworld.txt, but that has another filename.

$ echo -n 'hello world' > another-helloworld.txt
$ git add another-helloworld.txt 
$ ls -lsa .git/objects
drwxr-xr-x  33
drwxr-xr-x  3a
drwxr-xr-x  60
drwxr-xr-x  66
drwxr-xr-x  95
drwxr-xr-x  info
drwxr-xr-x  pack

We see that there are no new objects created in git’s database! (Feel free to use the timestamps of each directory to see that none of them have been updated.) Once we commit this new file, we’ll see some git magic.

$ git commit -m "added a copy of helloworld.txt"
[master 370cbee] added a copy of helloworld.txt
 1 file changed, 1 insertion(+)
 create mode 100644 another-helloworld.txt
$ ls -lsa .git/objects
drwxr-xr-x  33
drwxr-xr-x  37
drwxr-xr-x  3a
drwxr-xr-x  60
drwxr-xr-x  66
drwxr-xr-x  95
drwxr-xr-x  f1
drwxr-xr-x  info
drwxr-xr-x  pack
$ ls -lsa .git/objects/f1
-r--r--r--  a0d7a708eb2ceed5af17de959c7eafe67fb3c4
$ git cat-file -t f1a0d7a708eb2ceed5af17de959c7eafe67fb3c4
tree
$ git cat-file -p f1a0d7a708eb2ceed5af17de959c7eafe67fb3c4
100644 blob 95d09f2b10159347eece71399a7e2e907ea3df4f	another-helloworld.txt
100644 blob 66b049f59fc01dfb6d97dfe93dd1e149e6c2fa48	helloworld.txt

Inside the tree object for this commit, we see that another-helloworld.txt is linked to 95d09f2b10159347eece71399a7e2e907ea3df4f. This is the same hash as our original helloworld.txt (before we modified it), so git reuses the existing blob instead of storing the content a second time. The second entry in the tree points to the other file in our current file structure, which is simply the latest version of helloworld.txt (after we modified it). This tree object, with entries for both another-helloworld.txt and helloworld.txt, is the one referenced by the new commit object.

$ ls -lsa .git/objects/37
total 8
-r--r--r--  0cbee313074a83cf517dd8d521a3afdc10c451
$ git cat-file -t 370cbee313074a83cf517dd8d521a3afdc10c451
commit
$ git cat-file -p 370cbee313074a83cf517dd8d521a3afdc10c451
tree f1a0d7a708eb2ceed5af17de959c7eafe67fb3c4
parent 3a70f16aea5ca9153500b73f779109399ba11f51
author ktnode <ktnode@email.com> 1392937033 -0500
committer ktnode <ktnode@email.com> 1392937033 -0500
added a copy of helloworld.txt
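The blob reuse we just saw can be modeled with a toy content-addressed store in Python. This is only a sketch: a dict stands in for .git/objects, trees are plain dicts of filename to blob ID, and the function names are my own.

```python
import hashlib

def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

blobs = {}  # object ID -> content (a stand-in for .git/objects)

def add(tree: dict, filename: str, content: bytes) -> dict:
    """Store content under its hash and return a new tree pointing at it."""
    oid = blob_id(content)
    blobs.setdefault(oid, content)  # does nothing if the blob already exists
    return {**tree, filename: oid}

# Replay the three commits from this walkthrough.
tree1 = add({}, "helloworld.txt", b"hello world")
tree2 = add(tree1, "helloworld.txt", b"hello world, goodbye\n")
tree3 = add(tree2, "another-helloworld.txt", b"hello world")

print(len(blobs))  # 2 blobs, even though three files were added
print(tree3["another-helloworld.txt"] == tree1["helloworld.txt"])  # True
```

Each tree is a full listing of the files at that point, yet the store only grows when genuinely new content shows up.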

Thus, stepping through this shows that, with each commit, git actually has a complete snapshot of the current state of whatever you’re checking in. Specifically, each commit points to a top-level tree (which may in turn point to subtrees) that captures the full file/directory structure at that commit, with unchanged content shared between commits through identical blob hashes.
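One consequence of the snapshot model is that reconstructing the files for any commit only requires following that commit’s tree; there are no diffs to replay from earlier commits. Here is a small Python sketch of that idea, using made-up short IDs (b1, t1, c1, ...) in place of real hashes.

```python
# A toy object database mirroring the three commits in this walkthrough.
objects = {
    "b1": ("blob", b"hello world"),
    "b2": ("blob", b"hello world, goodbye\n"),
    "t1": ("tree", {"helloworld.txt": "b1"}),
    "t2": ("tree", {"helloworld.txt": "b2"}),
    "t3": ("tree", {"another-helloworld.txt": "b1", "helloworld.txt": "b2"}),
    "c1": ("commit", {"tree": "t1", "parent": None}),
    "c2": ("commit", {"tree": "t2", "parent": "c1"}),
    "c3": ("commit", {"tree": "t3", "parent": "c2"}),
}

def files_in_tree(tree_id: str, prefix: str = "") -> dict:
    """Resolve a tree (and any subtrees) to a {path: content} mapping."""
    files = {}
    _, entries = objects[tree_id]
    for name, child_id in entries.items():
        kind, value = objects[child_id]
        if kind == "tree":
            files.update(files_in_tree(child_id, prefix + name + "/"))
        else:
            files[prefix + name] = value
    return files

def checkout(commit_id: str) -> dict:
    """Return the full file state of a commit without visiting its parents."""
    _, commit = objects[commit_id]
    return files_in_tree(commit["tree"])

print(checkout("c1"))
print(checkout("c3"))
```

The parent links are still there for history, but checkout never needs to touch them.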
