Git – the Bin, the Trees, and the Commits of the Past

The Wind, The Trees And The Shadows Of The Past @2007 The Morningside - a Russian band

Git is hard, really. But it is hard mainly when viewed from the outside, that is, through its user interface: the Git commands. There are plenty of complaints that Git commands are inconsistent, needlessly complicated, and difficult to remember. For that reason, in order to understand Git, let’s set its commands aside for now, and maybe temporarily erase all the Subversion knowledge in your brain.

Three types of Git objects: Commit, Tree, and Blob

Git has three core concepts: commit, tree, and blob objects. Every object has an ID of its own, which is the hash value of its content (prefixed with a short header that records the object type and size). Git uses SHA-1, a cryptographic hash function that maps arbitrary content to a 160-bit number, usually written in hexadecimal as a string of 40 characters. You can observe this with a Git command called hash-object. Suppose you have a text file named hello.txt containing the single line Hello World. Now run:

D:\Projects\sample>git hash-object hello.txt
Output:
5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689

The hash value printed is always 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689, regardless of the file name, the file path, or the kind of computer or OS you use. Even if you run this command in another world, the result is the same as long as the file content is Hello World.
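To make the "content only" claim concrete, here is a minimal Python sketch of how a blob ID can be computed. The function name git_blob_hash is my own; the recipe (SHA-1 over a "blob <size>" header, a NUL byte, then the raw bytes) is how Git hashes blobs. Whether it prints the same ID as above depends on hello.txt holding exactly the bytes Hello World, with no trailing newline, which I am assuming here.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the ID Git would assign to a blob holding `content`."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The file name and path never enter the computation, only the bytes do.
# Assumption: the file has no trailing newline.
print(git_blob_hash(b"Hello World"))
```

Note that adding a single newline to the bytes produces a completely different ID, so two files share a blob only when they are byte-for-byte identical.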

Hash IDs are so powerful! They allow Git to avoid duplicating objects. When a new object, coming from either the local machine or the network, is being added to the Git database, it is ignored if its hash ID already exists in the database. Clearly, hash IDs are an efficient way to implement this deduplication.

I have just mentioned the Git database. It is actually the .git directory at the root of your working directory. In Git terms, this database is called the repository. Both the repository and the working directory are stored together on your hard disk. I will talk about the working directory later; for now let’s focus on the repository.

Of course, the repository stores objects. There are some extra things in the repository, but objects are the main one. You can see how objects are organized in the directory .git/objects. Basically, each object is stored in a file: the hash ID gives the file’s location (its first two characters name a subdirectory, the remaining 38 name the file), and the file content is the compressed content of the object.
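The storage scheme can be sketched in a few lines of Python. This is only a model of so-called loose objects (Git can also pack many objects into one file, which this sketch ignores), and the function names store_loose_object and read_loose_object are hypothetical. It writes into a temporary directory, not into a real .git/objects.

```python
import hashlib
import os
import tempfile
import zlib
from typing import Tuple


def store_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Write an object the way a loose Git object is laid out and return its ID."""
    data = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(data).hexdigest()
    # The first two hex characters name a subdirectory, the other 38 name the file.
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    if not os.path.exists(path):  # same content, same ID: nothing to store twice
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(data))
    return object_id


def read_loose_object(objects_dir: str, object_id: str) -> Tuple[str, bytes]:
    """Read an object back and return (type, content)."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
    header, _, content = data.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, content


with tempfile.TemporaryDirectory() as objects_dir:
    first = store_loose_object(objects_dir, "blob", b"Hello World")
    second = store_loose_object(objects_dir, "blob", b"Hello World")
    print(first == second)  # True: the duplicate maps to the same file
    print(read_loose_object(objects_dir, first))
```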

So far you have an overview of what objects are and how they are stored. Now let’s look at commit objects first and then work our way down, because commits depend on trees, and trees depend on blobs.

Commit – a snapshot of your project

Given a particular commit object, Git knows how to reconstruct exactly the project files and directories of that commit. So a commit object must be huge, right? No: a commit is very small, because it contains only references (hash IDs) and some lightweight supporting data. Here is the content of a commit:

<tree hash ID>
[<hash ID of parent commit #1>]
[<hash ID of parent commit #2>]
<author> <datetime>
<committer> <datetime>
<message>

All of this information is hashed to produce the commit hash ID, of course. Unlike the hash ID of the hello.txt content, commit hash IDs almost never repeat, because the hashed content includes the author, the committer, the date and time, and the message.
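As a rough illustration, a commit ID can be computed the same way as any other object ID: hash a header plus the commit text. The sketch below is an assumption-laden model, not a replacement for Git: the function name commit_id is mine, and I assume the message is stored with one trailing newline after a blank separator line. The sample values are copied from the cat-file example in this article, but I make no claim about which ID they produce.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Hash the textual content of a commit the way Git hashes objects."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]  # zero, one, or many
    lines.append("author " + author)
    lines.append("committer " + committer)
    # Assumption: blank line, then the message, then a final newline.
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


who = "katatunix <katatunix@gmail.com> 1424852748 +0700"
tree = "783326c7a1e7ffd6067ca594790e2ecba056b707"
parent = "287947ed8beef14bf3fa03694e94c89ab0bc409b"

print(commit_id(tree, [parent], who, who, "Fix bugs"))
# Same tree, same parent, different message: a completely different ID.
print(commit_id(tree, [parent], who, who, "Fix more bugs"))
```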

Given a hash ID of a commit, to view its content, you can use the Git command cat-file like this:

D:\Projects\sample>git cat-file -p 54e1cc
Output:
tree 783326c7a1e7ffd6067ca594790e2ecba056b707
parent 287947ed8beef14bf3fa03694e94c89ab0bc409b
author katatunix <katatunix@gmail.com> 1424852748 +0700
committer katatunix <katatunix@gmail.com> 1424852748 +0700
Fix bugs

Look! 54e1cc is short; it is not a 40-character string. Yes, Git is very convenient here: you do not need to type all 40 characters to refer to a hash ID, just the first few, as long as they identify a single object in the repository. If more than one object starts with 54e1cc, Git will print an error. Don’t worry, just press the UP arrow key to go back to the last command and append a few more characters.

Now look at the content of the commit above. On the 2nd line you’ll see the parent of this commit (in general there can be more than one). That means there are parent-child relationships between commits. Those relationships will be discussed later. For the moment, please concentrate on the 1st line: the tree hash ID. Aha, it is the hash ID of a tree object. By using this tree, Git can reconstruct exactly the project files and directories corresponding to that commit.

Tree – a list of blobs and other trees

To define a tree, there’s no better way than recursion: a tree is a list of blobs and other trees. This is very similar to a directory, which is a list of files and other directories, right?

Let’s use the cat-file command to see the content of the tree in the previous example, the tree with hash ID 783326c7a1e7ffd6067ca594790e2ecba056b707:

D:\Projects\sample>git cat-file -p 783326c
Output:
100644 blob 5e1c309dae7f45e0f39b1bf3ac3cd9db12e7d689 hello.txt
040000 tree 36918457d8a89b2bb4096faaeaf7ce9c0d1a3d47 yeah

And yeah, the tree 783326c contains:

  • File hello.txt with content stored in the blob 5e1c309dae...
  • Directory yeah which is another tree with hash ID 36918457d...

You can continue with cat-file -p 36918457d to see the content of the directory yeah, and so on.

Did you get my point? Git uses this mechanism to reconstruct exactly the files and directories of your project at a commit. Given a commit, there’s always a tree corresponding to it. Given a tree, there’s always a full directory corresponding to it.
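The recursion can be sketched with a toy object store in Python. Everything here is hypothetical: the IDs are readable placeholders instead of real hashes, the content of yeah/foo.txt is invented, and the function name checkout_tree is mine. The point is only the shape of the walk from a root tree down to the blobs.

```python
# Toy object store: ID -> ("tree", [(type, id, name), ...]) or ("blob", bytes).
# The IDs are placeholders, not real hash IDs.
store = {
    "tree-root": ("tree", [("blob", "blob-hello", "hello.txt"),
                           ("tree", "tree-yeah", "yeah")]),
    "tree-yeah": ("tree", [("blob", "blob-foo", "foo.txt")]),
    "blob-hello": ("blob", b"Hello World"),
    "blob-foo": ("blob", b"foo"),  # invented content
}


def checkout_tree(store, tree_id, prefix=""):
    """Rebuild {path: content} for a tree by walking it recursively."""
    files = {}
    _kind, entries = store[tree_id]
    for kind, object_id, name in entries:
        path = prefix + name
        if kind == "tree":
            # A subtree is a subdirectory: recurse with a longer path prefix.
            files.update(checkout_tree(store, object_id, path + "/"))
        else:
            files[path] = store[object_id][1]
    return files


print(checkout_tree(store, "tree-root"))
# {'hello.txt': b'Hello World', 'yeah/foo.txt': b'foo'}
```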

Blob – binary large object

The content, and only the content, of every file F in your project is stored as a blob object. The name of F is not part of the blob; it is stored in whichever tree refers to that blob, and a tree may list a blob under any file name. As you can see in the previous section, the tree 783326c contains the blob 5e1c309dae under the file name hello.txt, but another tree, or even 783326c itself, can also contain that blob under a different file name, e.g. hello2.txt.

Of course, two files with the same content are always stored as one and the same blob, because the blob’s file name is the hash ID of its content (I repeat this statement just in case I have not stressed it enough). Therefore, even different commits can share some of the same trees and/or blobs. This mechanism helps Git save disk space for the repository.

To view the content of a blob, use cat-file again:

D:\Projects\sample>git cat-file -p 5e1c309
Output:
Hello World

Don’t try to open blob files on your hard disk directly, because they are compressed.

You are still reading, thanks for that :)

Until now, I have discussed several Git data structures: a commit contains a tree, a tree contains trees and blobs, a blob contains binary data; and every object is identified by its hash value. This is not enough; there is still another important kind of data structure that we have to discover: the Directed Acyclic Graph. Do you remember the parent-child relationships between commit objects?

History is just a Directed Acyclic Graph of commit objects

Now you have a collection of commit objects. By selecting a commit object, you can get the corresponding snapshot of your project, do something on your project, and then create a new commit object to save your work. That’s cool! But is that enough?

A collection of commit objects

That is NOT enough because you need to know about the commit history, in other words, the order of commits. Order means a linear order, doesn’t it? So why don’t we use the date and time information in commits? Yes, the date and time are useful, but only for a linear order:

Commits with a linear order

Sometimes you need a non-linear order of commits, like this:

Commits with a non-linear order

Clearly, a commit B needs to store a reference to its parent A, which means that commit B was created right after commit A. A commit can have zero, one, or many parents. A commit without a parent is called an initial commit; it is the very first commit of your project, the one that adds the first files and directories.

So a newly created commit always refers to zero, one, or many existing commits, and the new commit has no children. This mechanism gives the graph of commits a property: it contains no cycle, thus the graph is called a Directed Acyclic Graph (DAG).

Most Git commands manipulate this DAG:
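Here is a small Python sketch of why the graph stays acyclic. Commits are modelled as a dictionary from a commit name to its list of parents; the single-letter names and the functions add_commit and ancestors are hypothetical. Because a commit may only name parents that already exist, no commit can ever end up among its own ancestors.

```python
def add_commit(parents, commit, commit_parents):
    """Add a commit whose parents must already be in the graph."""
    if commit in parents:
        raise ValueError(f"{commit} already exists")
    for parent in commit_parents:
        if parent not in parents:
            raise ValueError(f"unknown parent {parent}")
    parents[commit] = list(commit_parents)


def ancestors(parents, commit):
    """Return every commit reachable by following parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = {}
add_commit(parents, "A", [])          # initial commit: no parent
add_commit(parents, "B", ["A"])
add_commit(parents, "C", ["A"])       # history forks after A
add_commit(parents, "D", ["B", "C"])  # a commit with two parents

print(sorted(ancestors(parents, "D")))  # ['A', 'B', 'C']
```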

  • Create a new commit (and add to the DAG, this is done automatically because of the parent hash IDs).
  • Delete a leaf commit.
  • Cut a branch and add to another location of the DAG. (*)
  • Fetch other commits from network (and add to the DAG, again, automatically).
  • Push commits to network.

Why do we need branches?

Using commit hash IDs seems enough, because with these IDs you can do everything to the DAG. However, you would need the brain of a machine, not of a human. It’s terrible to remember a meaningless string like 36918457d. Meaningful and stable names such as master, hot-fix, feature-x … would be much better.

The term “branch” often makes you think that a branch is a list of commits forming a path from the initial commit to a particular commit. That’s true! And that’s the biggest advantage of branches: using a meaningful and stable name to refer to a historical path of commits. Clearly, we cannot do this with commit hash IDs: each time a new commit, with a different hash ID, is appended to the path, the name of the path would change and thus would not be stable.

From my experience, in many cases it is better to think of a branch as just a head (or pointer/reference) pointing to a specific commit. There are some more kinds of head, such as tag and stash; you can find them all in the directory .git/refs.

heads (branches) and HEAD

In a repository, there is one, and only one, special head named HEAD (uppercase), which normally points to a normal head (lowercase). For example, in the figure above, master, feature-x, and hot-fix are normal heads, and HEAD is pointing to master. In fact, HEAD can point to any head or commit, and it tells you which head or commit is currently active.

With heads and HEAD, now we have some ways to refer to a particular commit:

  • Commit hash ID
  • Name of a head, e.g. master, feature-x, and hot-fix
  • HEAD
  • One of the above plus ~number, e.g. 1234abcd~1 means the parent of commit 1234abcd, master~2 means the grandparent of master, HEAD~3 means .. you know what I mean :) What if the commit has more than one parent? Then ~ always follows the first parent; to pick another one, use ^number instead, e.g. HEAD^2 is the second parent of HEAD.

Very flexible, right? Please remember them, because knowing how to refer to a commit is essential when working with Git, as you will see later.
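These ways of naming a commit can be modelled in a few lines of Python. It is a simplified sketch: the function resolve and the single-letter commit names are hypothetical, abbreviated hash IDs are not handled, and only the ~number suffix is supported. For the multi-parent case it follows the first parent, which is what Git does for ~.

```python
def resolve(ref, parents, heads, HEAD):
    """Turn 'HEAD', a head name, or a commit ID, optionally with ~n, into a commit ID."""
    name, tilde, steps = ref.partition("~")
    if name == "HEAD":
        name = HEAD                  # HEAD points to a head (or directly to a commit)
    commit = heads.get(name, name)   # a head name, or already a commit ID
    if commit not in parents:
        raise KeyError(f"unknown commit or head: {name}")
    count = int(steps or 1) if tilde else 0  # a bare '~' means '~1'
    for _ in range(count):
        commit = parents[commit][0]  # '~' always follows the first parent
    return commit


parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
heads = {"master": "D", "hot-fix": "B"}
HEAD = "master"

print(resolve("HEAD", parents, heads, HEAD))      # D
print(resolve("master~2", parents, heads, HEAD))  # B
print(resolve("HEAD~3", parents, heads, HEAD))    # A
print(resolve("C~1", parents, heads, HEAD))       # B
```

Going past the initial commit (for example HEAD~4 here) raises an IndexError in this sketch, since that commit has no first parent to follow.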

Moving heads and HEAD

Okay, so how can we move heads and HEAD? Git provides the following commands to do that.

git commit

This creates a new commit whose parent is the commit HEAD currently refers to, and then moves the current head, and HEAD with it, to the new commit.

Create new commit F, master and HEAD move to F

git merge <commit>

This creates a new commit that has two parents: HEAD and <commit>. After that, the current head and HEAD are moved to the new commit, so you can interpret this command as: merge <commit> into HEAD.

Command "git merge A" merges A to B, creates a new commit C, moves master and HEAD from B to C

If HEAD is an ancestor of <commit>, there is nothing to merge and no new commit is created. <commit> is already the desired result, so Git just moves the current head and HEAD to <commit>. This is called a fast-forward.

Command "git merge A" does not create new commit, just moves master and HEAD from B to A. This is fast-forward.

<commit> can be given in any of the forms for referring to a commit that I stressed above.
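The decision between a real merge and a fast-forward can be sketched like this in Python. It models only the graph bookkeeping (no file contents, no conflicts); the functions is_ancestor and merge, the commit names, and the new_id parameter are all hypothetical. I also include the case where <commit> is already an ancestor of HEAD, in which Git has nothing to do.

```python
def is_ancestor(parents, older, newer):
    """True if `older` is `newer` itself or can be reached from it via parent links."""
    stack, seen = [newer], set()
    while stack:
        commit = stack.pop()
        if commit == older:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return False


def merge(parents, heads, current, other, new_id):
    """Merge commit `other` into the head named `current`."""
    ours = heads[current]
    if is_ancestor(parents, other, ours):
        return "already up to date"   # nothing new to bring in
    if is_ancestor(parents, ours, other):
        heads[current] = other        # just move the head forward
        return "fast-forward"
    parents[new_id] = [ours, other]   # a new commit with two parents
    heads[current] = new_id
    return "merge commit"


# Diverged history: A and B both descend from R, master is at B.
parents = {"R": [], "A": ["R"], "B": ["R"]}
heads = {"master": "B"}
print(merge(parents, heads, "master", "A", "C"), heads, parents["C"])

# Linear history: B is an ancestor of A, master is at B.
parents = {"B": [], "A": ["B"]}
heads = {"master": "B"}
print(merge(parents, heads, "master", "A", "C"), heads)
```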

git checkout <commit>

This command is straightforward: it moves HEAD to <commit> directly, and the current head is not moved.

git reset <commit>

This command does more than git checkout: it moves both the current head and HEAD to <commit>. git reset is interesting but dangerous.

git rebase <commit>

This is a kind of the (*) action. Assume HEAD and <commit> have a common ancestor named A. The command cuts the path from HEAD back to A (excluding A) and appends that path to <commit>. The content of each commit in the path changes, of course, since at least its parent hash ID changes; and sometimes its tree (do you remember the tree object inside a commit object?) changes as well.

In the end, the current head and HEAD still point to the same logical commit as before, the last one on the path. But the hash ID of that commit has changed, so the current head and HEAD are updated to the new ID too.

Before rebasing: "git rebase C"
After rebasing: D and E are modified to D' and E', D' becomes child of C, D and E are abandoned

There are some more complicated usages of git rebase. Whatever they are, they still belong to the (*) action.
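To see why rebased commits get new hash IDs, here is a toy Python model. It is a sketch under strong assumptions: history is a simple chain (one parent per commit), a commit is just a (parent, change) pair, and its ID is the SHA-1 of those two values. The function names and the change descriptions are invented, and the shape of the history is my guess at the figure above.

```python
import hashlib


def add_commit(commits, parent, change):
    """Store a commit and return its ID, derived from its parent ID and its change."""
    commit_id = hashlib.sha1(f"{parent}\n{change}".encode("utf-8")).hexdigest()
    commits[commit_id] = (parent, change)
    return commit_id


def history(commits, commit):
    """Return the commit followed by all its ancestors, newest first."""
    path = []
    while commit is not None:
        path.append(commit)
        commit = commits[commit][0]
    return path


def rebase(commits, head, onto):
    """Replay the commits that only `head` has on top of `onto`; return the new tip."""
    onto_history = set(history(commits, onto))
    to_replay = []
    for commit in history(commits, head):
        if commit in onto_history:  # reached the common ancestor: stop, exclude it
            break
        to_replay.append(commit)
    new_tip = onto
    for commit in reversed(to_replay):  # oldest first
        # Same change, different parent, therefore a different ID.
        new_tip = add_commit(commits, new_tip, commits[commit][1])
    return new_tip


commits = {}
a = add_commit(commits, None, "add hello.txt")
b = add_commit(commits, a, "edit hello.txt")
c = add_commit(commits, b, "add yeah/foo.txt")
d = add_commit(commits, b, "fix typo")
e = add_commit(commits, d, "fix bug")

e_new = rebase(commits, e, c)
d_new = commits[e_new][0]
print(d_new != d, e_new != e)   # True True: D' and E' are new commits
print(commits[d_new][0] == c)   # True: D' is now a child of C
```

The old D and E are still in the store after the rebase; nothing refers to them any more, which is exactly what "abandoned" means.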

I will come back to these commands later, maybe in later articles, once we have gained a broad knowledge of Git in general.

Detached HEAD

As you can see from the commands above, HEAD can also point directly to a commit, e.g. 1234abcd, without going through a head, but this is not recommended. HEAD should point to a head; otherwise it is called a detached HEAD. Why is a detached HEAD not recommended?

How to delete commit objects?

Heads can be used to delete commits. All commits that cannot be reached from any head are considered abandoned and will eventually be deleted by the Git garbage collector, as described in the image below:

C and E cannot be reached from hot-fix and master, thus C and E are abandoned and will be deleted

Now you should understand why a detached HEAD is not recommended. When you are on a detached HEAD and make some new commits, these new commits cannot be reached from any head, and therefore they will be abandoned (and thus may be deleted) after you move HEAD to a real head.
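The reachability rule is easy to sketch in Python. This models only the rule as described here: the graph shape and the function name abandoned are my own assumptions, and real Git also takes other references into account (tags, and a log of where heads have recently been) before it actually deletes anything.

```python
def abandoned(parents, heads):
    """Return the commits that cannot be reached from any head."""
    reachable = set()
    stack = list(heads.values())
    while stack:
        commit = stack.pop()
        if commit not in reachable:
            reachable.add(commit)
            stack.extend(parents[commit])  # everything behind a head is kept
    return set(parents) - reachable


# Hypothetical graph: C and E hang off B, but no head points at them.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["C"]}
heads = {"master": "D", "hot-fix": "B"}

print(sorted(abandoned(parents, heads)))  # ['C', 'E']

# Pointing a head at E rescues both E and its ancestor C.
heads["feature-x"] = "E"
print(sorted(abandoned(parents, heads)))  # []
```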

What will be affected when HEAD points to a commit?

They’re the index and the working directory!

Moving HEAD to another head/commit is also called switching branches (if you prefer the term). When this happens, the content of your index and working directory may or may not change. It depends on the command you run and the arguments you provide.

The working directory is the directory storing your project, excluding the repository (the .git directory). This is the place where you develop your project (modifying/adding/deleting files and directories).

The index (or staging area) is something unique to Git. In fact, it is a binary file (.git/index) storing a flat list of blob hash IDs. Let’s examine the content of the index with git ls-files -s:

D:\Projects\sample>git ls-files -s
Output:
100644 15c6adca0b50210e550733e17bafc077f252b0fc 0 hello.txt
100644 6cdee075d73a4b1ea0d34903915444e2490fa383 0 yeah/foo.txt

Semantically, the index is really a directory, just like the working directory and the tree object (snapshot) pointed to by HEAD. When HEAD is pointed at a commit by a command that forces the index and working directory to change (typically git checkout), Git will extract the snapshot in order to:

  • populate the file .git/index with corresponding content
  • populate the working directory with corresponding content

At this time, all three things (snapshot, index, and working directory) are identical. Running git status will report nothing to commit.

Now try changing the content of the file hello.txt in your working directory. git status will show that there’s a difference between the index and the working directory, namely the file hello.txt, and will suggest that you “stage” the file. To stage a file means creating a blob from that file, saving the blob into the repository, and then updating the corresponding hash ID in the file .git/index.

To stage hello.txt, run git add hello.txt. Now the index and the working directory are identical, but the snapshot and the index are not. Thus git status will show a “ready to commit” message. To commit means creating a tree object from the index, then creating a new commit object that includes that tree’s hash ID.

After committing, the current head and HEAD point to the new commit, and all three things (snapshot, index, and working directory) are identical again.

Back to the git reset <commit> command: recall that it always moves the current head and HEAD to <commit>. What happens next? There are 3 options you can give it:

  • --soft: stop here!
  • --mixed: force to change index only, working directory is unchanged, this is the default option.
  • --hard: force to change both index and working directory, very dangerous!
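The three options differ only in how far the reset reaches, which a small Python model makes visible. Everything here is a hypothetical simplification: a repository is a dictionary, a snapshot is a dictionary of file contents, and the names SNAPSHOTS, make_repo, and reset are invented for the sketch.

```python
SNAPSHOTS = {  # hypothetical commits and the files they record
    "A": {"hello.txt": "v1"},
    "B": {"hello.txt": "v2"},
}


def make_repo():
    """A repository sitting on commit B, with snapshot, index, and working directory in sync."""
    return {
        "heads": {"master": "B"},
        "HEAD": "master",
        "index": dict(SNAPSHOTS["B"]),
        "workdir": dict(SNAPSHOTS["B"]),
    }


def reset(repo, commit, mode="mixed"):
    """Model of git reset <commit> with --soft, --mixed (the default), or --hard."""
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unknown mode: {mode}")
    repo["heads"][repo["HEAD"]] = commit  # always: move the current head
    if mode in ("mixed", "hard"):
        repo["index"] = dict(SNAPSHOTS[commit])    # overwrite the index
    if mode == "hard":
        repo["workdir"] = dict(SNAPSHOTS[commit])  # overwrite your files too
    return repo


for mode in ("soft", "mixed", "hard"):
    repo = reset(make_repo(), "A", mode)
    print(mode, repo["index"]["hello.txt"], repo["workdir"]["hello.txt"])
# soft v2 v2
# mixed v1 v2
# hard v1 v1
```

In every mode master ends up at A; only --hard throws away what was in the working directory, which is why it is the dangerous one.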

Because git reset moves the current head, please be careful: some commits might be abandoned afterwards.

How to synchronize commit objects through network?

Working alone is boring! Do you want to send your commits to other repositories? Do you also want to receive new commits from those repositories? For that, Git provides a concept called remote repositories.

Setting up a remote repository

In order to set up a remote repository for your local repository, use the command git remote add, for example:

git remote add kata https://github.com/katatunix/sample

This sets up a remote repository named kata. From now on, you will use this name instead of the long URL https://github.com/katatunix/sample

Receiving commits from the remote repository

Setting up just creates a reference, kata, to the remote repository. It does not fetch any commits. To fetch commits, you have to explicitly run the command git fetch, for example:

git fetch kata

This will fetch all commits from kata and save them to your local repository.

Sending commits to the remote repository

The sending process is more complicated. You can use the command git push to send your specific local commits to a remote repository. You have to provide some information to this command: a remote name, local commits to be sent, and a head name in the remote repository.

Remote name

Here it is kata. Thus the command should start with: git push kata

Local commits to be sent

If you specify a commit A to be sent, then the parents of A must be sent along with it, then the parents of the parents must be sent too … and so on. Hence, in the git push command, you only have to provide a <commit>; Git will automatically find all ancestors of <commit> and send all of them to the remote repository.

Therefore the command should be either:

git push kata master or
git push kata 1234abcd or
git push kata HEAD

Don’t forget the ways to refer to a commit.

Wait a second! The number of commits to be sent is too big! Is there any way to determine which commits already exist in the remote repository, so that we don’t need to send them? Yes: remote heads (or remote branches).

The git fetch command does not only fetch commits but also heads. These heads are called remote heads and are named in the form <remote name>/<head name>, for example kata/master. In the local repository, remote heads are read-only; you cannot move them yourself. They are only moved/updated by Git when it talks to the remote, typically by git fetch.

Even if you move HEAD to a remote head, you will be on a detached HEAD.

Remote heads: kata/hot-fix, kata/master

Now you can see how remote heads are used to determine which commits already exist in the remote repository. If there’s a remote head pointing to a commit X, then we can be sure that X and its ancestors already exist in the remote repository.
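Put as a computation, the set of commits to send is "everything reachable from <commit>" minus "everything reachable from the remote heads". The Python sketch below shows just that set difference; the commit names and the functions reachable and commits_to_send are hypothetical, and the real exchange between Git and a remote is more involved than this.

```python
def reachable(parents, starts):
    """Return the given commits plus all of their ancestors."""
    seen = set()
    stack = list(starts)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def commits_to_send(parents, remote_heads, commit):
    """Commits needed by the remote: ancestors of `commit` it does not have yet."""
    already_there = reachable(parents, remote_heads.values())
    return reachable(parents, [commit]) - already_there


parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
remote_heads = {"kata/master": "B"}  # kata is known to have B, and therefore A

print(sorted(commits_to_send(parents, remote_heads, "D")))  # ['C', 'D']
```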

So our git push command looks like:

git push kata <commit>

This command will send <commit> and a portion of its ancestors to kata. In the kata repository, these commits are brand new, so they cannot be reached from any head, and they will be abandoned, right? That’s why the third piece of information is needed.

A head name in the remote repository

You also need to provide a head name H, so that when kata receives your commits, it will point H to the commit specified in <commit>. For example:

git push kata 1234abcd:hello

This will send commit 1234abcd and some of its ancestors to kata. After that, in kata, the head hello is moved to 1234abcd. If there’s no head named hello in kata, it will be created.

kata receives commit 1234abcd, then moves hello to 1234abcd

There are cases in which moving hello in kata is a problem: hello currently points to a very different commit, and if it is moved to 1234abcd, then some commits will be abandoned:

In kata, at current time, hello is pointing to E. After receiving 1234abcd, if hello moves to 1234abcd, then D and E will be abandoned

For that reason, the move of hello to 1234abcd must be a fast-forward. That means that, before the move, the commit hello points to must be an ancestor of 1234abcd. Otherwise Git will throw an error.
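The check on the receiving side can be sketched as follows. This is a model of the rule only, assuming the pushed commits have already arrived in the remote’s graph; the exception NonFastForward, the function receive_push, and the graph shape are hypothetical.

```python
class NonFastForward(Exception):
    """Raised when moving a head would abandon commits."""


def is_ancestor(parents, older, newer):
    """True if `older` is `newer` itself or can be reached from it via parent links."""
    stack, seen = [newer], set()
    while stack:
        commit = stack.pop()
        if commit == older:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return False


def receive_push(parents, heads, head_name, commit):
    """Point `head_name` at `commit`, but only if that is a fast-forward."""
    old = heads.get(head_name)  # None: the head does not exist yet and is created
    if old is not None and not is_ancestor(parents, old, commit):
        raise NonFastForward(f"{head_name}: {old} is not an ancestor of {commit}")
    heads[head_name] = commit


# kata's graph after receiving 1234abcd, which was built on top of B.
parents = {"A": [], "B": ["A"], "D": ["B"], "E": ["D"], "1234abcd": ["B"]}

heads = {"hello": "B"}
receive_push(parents, heads, "hello", "1234abcd")  # fine: B is an ancestor
print(heads)

heads = {"hello": "E"}
try:
    receive_push(parents, heads, "hello", "1234abcd")
except NonFastForward as error:
    print("rejected:", error)  # D and E would have been abandoned
```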

Conclusion

Within the scope of one article, I cannot cover all aspects of Git. But the knowledge in this article is essential. You now know that the core of Git is commit, tree, and blob objects, and that a repository is just a DAG of commits. Because most Git commands manipulate this DAG, once you understand it you should not have much trouble discovering other Git commands. I’m going to talk about some important Git commands such as git merge, git rebase ... in later articles.
