git commits: digging a bit deeper
When it comes to using git, I’ve mostly just learned as I go, figuring out whatever I need as the situation arises. In doing so, though, I sometimes gloss over details about how things really work under the hood, so it’s nice to be able to learn a bit more by going through the tutorials/videos I mentioned before.
In fact, I learned that one incorrect assumption I’d been making is that git stores commits as deltas, i.e., only the bits that have changed from the previous version. However, as I learned in GitHub’s video on Advanced Git Tricks, that’s not what’s going on. Each commit can be deconstructed into 3 kinds of objects: blobs, trees, and the commit object. Blobs are just content (anything from binary files like images to text files like source code); each one is identified internally by a hash of its content. Trees are like directory listings that link filenames to blobs (and to other trees, for subdirectories). Commits contain information about the person doing the check-in, the hash of the top-level tree that captures the state being checked in, and the commit message.
When you make even the smallest of changes to a file, git creates a completely new blob with a different hash. This effect ripples upward, creating new trees and thus a new commit object. So no deltas actually get stored in these objects! (Git can later delta-compress objects when it packs them, but that’s a storage optimization beneath this object model.) Also, what’s nice about using hashes is that it’s much easier, programmatically speaking, for git to recognize that two files are actually the same (since identical content always produces the same hash). This means that git can just store pointers to a single blob, even if you end up checking in the same content twice (under different filenames).
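To make the hashing idea a bit more concrete, here’s a minimal Python sketch (not git’s own code) of how a blob’s ID can be computed: a SHA-1 over a short header, `blob <size>` plus a NUL byte, followed by the raw content. The function name `blob_id` is just something I made up for illustration.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 hex ID git would assign to a blob with this content."""
    # Git hashes a small header ("blob <size in bytes>\0") followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


original = blob_id(b"hello world")
copy_under_other_name = blob_id(b"hello world")  # the filename is not part of the hash
modified = blob_id(b"hello world, goodbye\n")

print(original)
print(original == copy_under_other_name)  # identical content, identical ID
print(original == modified)               # any change gives a different ID
```

Since only the content goes into the hash, two files with the same bytes end up pointing at one and the same blob, no matter what they’re called.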
To see this in action, consider this simple example. We’ll create a sample git project, add a text file, commit the file, modify the file, and then commit it again. We’ll also add another file with the same content and notice that git doesn’t create a new blob for it. The typical workflow for this would just be a series of git add and git commit commands. We’ll do that here, but we’ll also look underneath the hood a bit after each step.
$ git init test
$ cd test
$ echo -n "hello world" > helloworld.txt
$ git add helloworld.txt
After running git add, git creates a blob in the object database, which you can verify by looking in the .git/objects directory:
$ ls -lsa .git/objects
drwxr-xr-x 95
drwxr-xr-x info
drwxr-xr-x pack
$ ls -lsa .git/objects/95
-r--r--r-- d09f2b10159347eece71399a7e2e907ea3df4f
The 95 directory name is the first two hex digits of the hash of helloworld.txt’s content. The remaining 38 characters of the hash are the blob’s filename inside that directory. By running git cat-file, we can verify that it’s a blob object and that it contains the "hello world" string.
$ git cat-file -t 95d09f2b10159347eece71399a7e2e907ea3df4f
blob
$ git cat-file -p 95d09f2b10159347eece71399a7e2e907ea3df4f
hello world
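For the curious, here’s a rough Python sketch of what sits in that file on disk, assuming the usual loose-object layout: the header and content are zlib-compressed, and the path comes from splitting the hash after its first two characters. The helper names `encode_loose_object` and `decode_loose_object` are hypothetical, and the compressed bytes won’t necessarily match git’s byte for byte, since that depends on the compression settings.

```python
import hashlib
import zlib
from typing import Tuple


def encode_loose_object(obj_type: str, content: bytes) -> Tuple[str, bytes]:
    """Return (hex ID, zlib-compressed bytes) for an object."""
    store = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\x00" + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def decode_loose_object(compressed: bytes) -> Tuple[str, bytes]:
    """Return (type, content), roughly what `git cat-file -t` and `-p` report."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\x00")   # split at the first NUL only
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):                 # the header records the content length
        raise ValueError("size in header does not match content")
    return obj_type.decode("ascii"), content


object_id, compressed = encode_loose_object("blob", b"hello world")
relative_path = f".git/objects/{object_id[:2]}/{object_id[2:]}"
obj_type, content = decode_loose_object(compressed)

print(relative_path)
print(obj_type)
print(content.decode())
```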
After we check in helloworld.txt, we see that more objects are created in git’s object database:
$ git commit -m "add helloworld.txt"
[master (root-commit) 95e5ad0] add helloworld.txt
 1 file changed, 1 insertion(+)
 create mode 100644 helloworld.txt
$ ls -lsa .git/objects
drwxr-xr-x 33
drwxr-xr-x 95
drwxr-xr-x info
drwxr-xr-x pack
One of those objects is a tree, which we see contains the hash of helloworld.txt’s blob along with the user-provided filename. The 100644 part is the file mode: a regular, non-executable file.
$ ls -lsa .git/objects/33
-r--r--r-- afd6485aadae927bc4bc2986ea9a0d86d5d699
$ git cat-file -t 33afd6485aadae927bc4bc2986ea9a0d86d5d699
tree
$ git cat-file -p 33afd6485aadae927bc4bc2986ea9a0d86d5d699
100644 blob 95d09f2b10159347eece71399a7e2e907ea3df4f helloworld.txt
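Here’s a small Python sketch of how a tree like this can be serialized, as far as I understand the format: each entry is the mode, a space, the filename, a NUL byte, and then the blob’s hash as 20 raw bytes (not hex). The helpers `object_id` and `tree_body` are names I made up, and sorting by plain filename bytes is a simplification that only holds for flat directories of regular files.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <size>\\0' followed by the object body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """entries: list of (mode, filename, blob hex ID) tuples."""
    body = b""
    # Assumption: sorting by filename bytes is enough here (no subdirectories).
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return body


blob = object_id("blob", b"hello world")
body = tree_body([("100644", "helloworld.txt", blob)])
tree = object_id("tree", body)

print(len(body))  # 6 + 1 + 14 + 1 + 20 bytes for the single entry
print(tree)
```

If I’ve got the layout right, the ID it prints should line up with the tree hash in the transcript above.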
We also see there’s a commit object, which references the tree object by its hash:
$ ls -lsa .git/objects/95
-r--r--r-- d09f2b10159347eece71399a7e2e907ea3df4f
-r--r--r-- e5ad0870d4ab32784fe634b9820ac525db1591
$ git cat-file -t 95e5ad0870d4ab32784fe634b9820ac525db1591
commit
$ git cat-file -p 95e5ad0870d4ab32784fe634b9820ac525db1591
tree 33afd6485aadae927bc4bc2986ea9a0d86d5d699
author ktnode <ktnode@email.com> 1392936563 -0500
committer ktnode <ktnode@email.com> 1392936563 -0500

add helloworld.txt
The git commit command prints the beginning of the commit object’s hash (95e5ad0), so it’s not surprising that we find the commit object in the .git/objects/95 folder. (That it shares a directory with our blob is a coincidence: both hashes just happen to start with 95.)
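A commit object is just text, so it’s easy to sketch in Python. This assumes the simplest case: one optional parent, the same person as author and committer, and a single-line message. `commit_body` and `object_id` are hypothetical helper names. I wouldn’t count on it reproducing the 95e5ad0 hash, since every byte (name, email, timestamp, trailing newline) has to match exactly.

```python
import hashlib


def object_id(obj_type, body):
    """SHA-1 over '<type> <size>\\0' followed by the object body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id, who, timestamp, tz, message, parent_id=None):
    lines = [f"tree {tree_id}"]
    if parent_id is not None:
        lines.append(f"parent {parent_id}")  # absent on the very first commit
    lines.append(f"author {who} {timestamp} {tz}")
    lines.append(f"committer {who} {timestamp} {tz}")
    # A blank line separates the header lines from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


tree = "33afd6485aadae927bc4bc2986ea9a0d86d5d699"
who = "ktnode <ktnode@email.com>"
body = commit_body(tree, who, 1392936563, "-0500", "add helloworld.txt")
commit = object_id("commit", body)

print(body.decode("utf-8"))
print(commit)
```

Note that the timestamp is part of the hashed text, so committing the very same tree a second later gives a different commit hash.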
Now let’s modify the file, check it in again, and see how git treats it under the hood. We’ll see there’s yet again a new object in git’s database. A new blob gets created, even though we’re touching the same file we just checked in. This shows us that, in this case, a delta/diff is not what gets stored in git!
$ echo ', goodbye' >> helloworld.txt
$ git add helloworld.txt
$ ls -lsa .git/objects
drwxr-xr-x 33
drwxr-xr-x 66
drwxr-xr-x 95
drwxr-xr-x info
drwxr-xr-x pack
$ ls -lsa .git/objects/66
-r--r--r-- b049f59fc01dfb6d97dfe93dd1e149e6c2fa48
$ git cat-file -t 66b049f59fc01dfb6d97dfe93dd1e149e6c2fa48
blob
When we commit our changes, we’ll see that two new objects are again created (a tree and a commit object). The commit object references the hash of the new tree object and, since this is no longer the first commit, also has a parent line pointing back to our previous commit.
$ git commit -m "modified helloworld.txt"
[master 3a70f16] modified helloworld.txt
 1 file changed, 1 insertion(+), 1 deletion(-)
$ ls -lsa .git/objects
drwxr-xr-x 33
drwxr-xr-x 3a
drwxr-xr-x 60
drwxr-xr-x 66
drwxr-xr-x 95
drwxr-xr-x info
drwxr-xr-x pack
$ ls -lsa .git/objects/60
-r--r--r-- 74b9e407c82a07393cc1b7fe4ca31370e3f5a3
$ git cat-file -t 6074b9e407c82a07393cc1b7fe4ca31370e3f5a3
tree
$ git cat-file -p 6074b9e407c82a07393cc1b7fe4ca31370e3f5a3
100644 blob 66b049f59fc01dfb6d97dfe93dd1e149e6c2fa48 helloworld.txt
$ ls -lsa .git/objects/3a
-r--r--r-- 70f16aea5ca9153500b73f779109399ba11f51
$ git cat-file -t 3a70f16aea5ca9153500b73f779109399ba11f51
commit
$ git cat-file -p 3a70f16aea5ca9153500b73f779109399ba11f51
tree 6074b9e407c82a07393cc1b7fe4ca31370e3f5a3
parent 95e5ad0870d4ab32784fe634b9820ac525db1591
author ktnode <ktnode@email.com> 1392936813 -0500
committer ktnode <ktnode@email.com> 1392936813 -0500

modified helloworld.txt
Now let’s consider what happens when we create a file with the same contents as our original helloworld.txt, but under another filename.
$ echo -n 'hello world' > another-helloworld.txt
$ git add another-helloworld.txt
$ ls -lsa .git/objects
drwxr-xr-x 33
drwxr-xr-x 3a
drwxr-xr-x 60
drwxr-xr-x 66
drwxr-xr-x 95
drwxr-xr-x info
drwxr-xr-x pack
We see that there are no new objects created in git’s database! (Feel free to use the timestamps of each directory to see that none of them have been updated.) Once we commit this new file, we’ll see some git magic.
$ git commit -m "added a copy of helloworld.txt"
[master 370cbee] added a copy of helloworld.txt
 1 file changed, 1 insertion(+)
 create mode 100644 another-helloworld.txt
$ ls -lsa .git/objects
drwxr-xr-x 33
drwxr-xr-x 37
drwxr-xr-x 3a
drwxr-xr-x 60
drwxr-xr-x 66
drwxr-xr-x 95
drwxr-xr-x f1
drwxr-xr-x info
drwxr-xr-x pack
$ ls -lsa .git/objects/f1
-r--r--r-- a0d7a708eb2ceed5af17de959c7eafe67fb3c4
$ git cat-file -t f1a0d7a708eb2ceed5af17de959c7eafe67fb3c4
tree
$ git cat-file -p f1a0d7a708eb2ceed5af17de959c7eafe67fb3c4
100644 blob 95d09f2b10159347eece71399a7e2e907ea3df4f another-helloworld.txt
100644 blob 66b049f59fc01dfb6d97dfe93dd1e149e6c2fa48 helloworld.txt
Inside the tree object for this commit, we see that another-helloworld.txt is linked to 95d09f2b10159347eece71399a7e2e907ea3df4f. This is the same hash as our original helloworld.txt (before we modified it), so git simply reuses the blob it already had. The second entry in the tree points to the other file in our current file structure, which is simply the latest version of helloworld.txt (after we modified it). Looking at the commit object, we see that it references this tree object (the one that contains entries for both another-helloworld.txt and helloworld.txt).
$ ls -lsa .git/objects/37
total 8
-r--r--r-- 0cbee313074a83cf517dd8d521a3afdc10c451
$ git cat-file -t 370cbee313074a83cf517dd8d521a3afdc10c451
commit
$ git cat-file -p 370cbee313074a83cf517dd8d521a3afdc10c451
tree f1a0d7a708eb2ceed5af17de959c7eafe67fb3c4
parent 3a70f16aea5ca9153500b73f779109399ba11f51
author ktnode <ktnode@email.com> 1392937033 -0500
committer ktnode <ktnode@email.com> 1392937033 -0500

added a copy of helloworld.txt
Thus, stepping through this shows that, with each commit, git actually records a complete snapshot of the current state of whatever you’re checking in. Specifically, each commit points to a top-level tree (which may in turn point to subtrees) that captures the latest file/directory structure that you’re checking in.
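To tie it all together, here’s a toy Python model of the whole walkthrough: an in-memory dictionary standing in for .git/objects, replaying the three commits. It’s only a sketch under the same format assumptions as git’s loose objects (flat directory, regular files), and the author line and timestamp are placeholders I made up, so the commit hashes won’t match the ones above.

```python
import hashlib

store = {}  # hex ID -> (type, body); a stand-in for .git/objects


def put(obj_type, body):
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    oid = hashlib.sha1(header + body).hexdigest()
    store[oid] = (obj_type, body)  # storing identical content again changes nothing
    return oid


def put_tree(files):
    """files: dict of filename -> blob ID (regular files only)."""
    body = b"".join(
        b"100644 " + name.encode("utf-8") + b"\x00" + bytes.fromhex(files[name])
        for name in sorted(files)
    )
    return put("tree", body)


def put_commit(tree_id, message, parent=None):
    who = "someone <someone@example.com> 0 +0000"  # hypothetical author and time
    parent_line = f"parent {parent}\n" if parent else ""
    text = f"tree {tree_id}\n{parent_line}author {who}\ncommitter {who}\n\n{message}\n"
    return put("commit", text.encode("utf-8"))


hello_v1 = put("blob", b"hello world")
c1 = put_commit(put_tree({"helloworld.txt": hello_v1}), "add helloworld.txt")

hello_v2 = put("blob", b"hello world, goodbye\n")
c2 = put_commit(put_tree({"helloworld.txt": hello_v2}), "modified helloworld.txt", c1)

copy = put("blob", b"hello world")  # same bytes as hello_v1, so no new blob
files = {"helloworld.txt": hello_v2, "another-helloworld.txt": copy}
c3 = put_commit(put_tree(files), "added a copy of helloworld.txt", c2)

counts = {}
for obj_type, _ in store.values():
    counts[obj_type] = counts.get(obj_type, 0) + 1
print(counts)  # {'blob': 2, 'tree': 3, 'commit': 3}
```

That’s 8 objects in total, which matches what we saw on disk: seven directories under .git/objects, with the 95 directory holding two objects. Three commits and three full snapshots, but only two blobs.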