Exploring Git
I think Git may well be my favorite piece of software ever written. I love the C# programming language, and I’m a huge fan of TypeScript. Visual Studio is pretty cool, and so is Azure, but at the center of nearly all the development work I do (including this blog) is Git. It’s simply amazing. I’ve spent a ton of time deconstructing my own commit history, ranging from the simplest, linear commit graphs to graphs including several merges (particularly, multiple merges on experimental development branches). This post explores some of the features that I feel are most pertinent to understanding and using Git.
I’m preparing to give a presentation on Git at work in about a week, and felt that blogging about Git would help me get into the proper headspace for it. Part of the reason why I write this blog is to explore ideas and research gaps in my own knowledge, and I hope that both you and I benefit from this effort.
I’ll just get this out of the way up front. My favorite features in Git are:
- Fantastic command-line interface
- Porcelain and plumbing commands
- The fact that Git is distributed. I’ll never see centralized systems the same after using Git
- The brilliant, light-weight branching model
- Tags. They’re amazing
Some people may not appreciate it, but I tend to really enjoy the command-line interface in Git. Perhaps it’s because I accepted that a GUI would just never offer the same power as the CLI, or perhaps it’s because I had to dive into the CLI early on with Git, when I made my first Open-Source contribution. Whatever the reason, the Git command-line just makes me happy. I think it’s very well thought-out, and personally find it to be a very productive experience.
The next item on the list is porcelain and plumbing commands. These are actually very interesting, and expose the true power of Git to any user. Porcelain commands are those “top-level” commands, like add and commit, that you normally use on the Git CLI, but they’re really just sugar on top of a bunch of plumbing commands (low-level Git sub-commands that do all the work). The hallmark of this whole approach is that it’s very light-weight and Unix-like (which is not surprising, given Git’s origins). It’s trivial to do things like obtain a revision (commit) list for a branch.
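To make that last claim concrete, here’s a minimal sketch that builds a tiny history with porcelain commands and then reads it back with the rev-list plumbing command. The throwaway repository, the notes.txt file name, and the commit messages are all made up for illustration.

```shell
# Work in a throwaway repository so nothing real is touched
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email "john@example.com"

# Porcelain: two ordinary commits to give the branch some history
echo "one" > notes.txt
git add notes.txt
git commit -q -m "First commit"
echo "two" >> notes.txt
git commit -q -am "Second commit"

# Plumbing: list every commit reachable from HEAD, newest first
git rev-list HEAD

# Plumbing: count the commits instead of listing them
commit_count=$(git rev-list --count HEAD)
echo "$commit_count"
```

Because rev-list prints nothing but hashes, one per line, its output is easy to feed into other commands, which is a big part of the Unix-like feel.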
If you’re not familiar with what it means for Git to be distributed, allow me to explain. Many version control systems employ a single, central database, that all work gets checked into. This database is the “master,” and any operation involving the history of a project must be queried through the central system. This means that the database API must be either made available over the public internet, or that you need a VPN to access the system. Since I initially adopted Git, I’ve found centralized version control systems to be terribly unproductive and highly frustrating. Git supports a fully-offline workflow: the only time you need to connect to any other instance is when performing synchronization (such as getting the latest patches). Every Git repository is simply a node on a network, much the same way as your computer or your phone is a node in a larger network.
Git being distributed doesn’t mean that there aren’t shared Git servers: several platforms, such as GitHub, Azure DevOps, and GitLab, provide centralized hosting systems. These are fantastic for enabling collaboration, but the power of Git being distributed is that these centralized servers are not required for basic operations, such as viewing the commit history. A significant detail about this is that you’re only ever guaranteed to view a snapshot of the history: unless your repository instance is somehow designated as being authoritative (which usually occurs through community consensus), then your history is just a snapshot. This is a great feature, though; because no single instance is considered to be authoritative, it means any instance can be authoritative. If the agreed-upon Git server somehow gets corrupted, you just need to re-push a clone of the repository. You can manage Git repository backups by simply synchronizing a Git mirror on a nightly basis, and your backup is done: no third-party software, no additional licensing required. How cool is that?!
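Here’s a rough sketch of what that mirror-based backup could look like. The directory names (origin-repo, backup.git) are hypothetical stand-ins, and a local directory plays the part of the shared server; in practice the refresh step would run from whatever scheduler you prefer.

```shell
# A stand-in for the shared repository (hypothetical names throughout)
work=$(mktemp -d)
cd "$work"
git init -q origin-repo
cd origin-repo
git config user.name "John Doe"
git config user.email "john@example.com"
echo "hello" > readme.txt
git add readme.txt
git commit -q -m "Initial commit"
cd "$work"

# One-time setup: a bare mirror that copies every ref from the source
git clone -q --mirror origin-repo backup.git

# Some time later, new work lands in the source repository
cd origin-repo
echo "more" >> readme.txt
git commit -q -am "Second commit"
cd "$work"

# The "nightly" step: refresh the mirror, pruning refs deleted upstream
git -C backup.git remote update --prune

# The mirror now resolves HEAD to the same commit as the source
source_head=$(git -C origin-repo rev-parse HEAD)
backup_head=$(git -C backup.git rev-parse HEAD)
echo "$source_head"
echo "$backup_head"
```

If the source were ever lost, the mirror could be pushed to a fresh server to restore it.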
Before I get much further into this post, I would like to establish what degree of experience I expect a lot of readers will have. I’ll assume that you have a basic understanding, and possibly some experience, using some or all of the following Git features:
- Cloning repositories. This is a common first step for many Git newcomers
- Pushing to and pulling from Git remotes (servers)
- Making and committing changes to a Git repository
- Potentially, creating Git branches, and possibly merging them with other branches
If you haven’t used any of these, but have used other version control systems, then I believe you’ll still be able to follow along with most of what I have to say here. With that disclaimer out of the way, let’s dig in!
An Anatomical View of Commits in Git
There are three primary objects that Git uses for tracking your content:
- Blobs (Binary Large Objects)
- Trees
- Commits
A blob is very simply the raw, binary content of a file. This can be plain
text, compiled code, or whatever: it’s just the content of a file. An important
feature of Git is that it stores each distinct blob object one and only one
time. It’s rather simple to demonstrate this, using the following Git commands:
# Create a scratch dir, and change directories to it
mkdir ./example
cd ./example
# Convert this directory into a Git repository
git init
# Create dummy files
echo Hello, World! > file1.txt
echo Hello, World! > file2.txt
echo Hello, World! > file3.txt
echo Hello, World! > file4.txt
echo Hello, World! > file5.txt
# Stage and commit the files
git add file*.txt
git commit -m "Add example files"
# List the files in the commit, along with their blob hashes
git show --raw
# Move back up to the original directory
cd ..
# For Windows (cmd.exe) users
rmdir /S /Q example
# For Bash users
rm -rf ./example
The result of running git show --raw should be similar to this:
commit 58eb8d5000f0767549e200098407c2c939f545d1
Author: John Doe <john@example.com>
Date: Wed May 8 22:44:05 2019 -0400
Add example files
:000000 100644 0000000... 8ab686e... A file1.txt
:000000 100644 0000000... 8ab686e... A file2.txt
:000000 100644 0000000... 8ab686e... A file3.txt
:000000 100644 0000000... 8ab686e... A file4.txt
:000000 100644 0000000... 8ab686e... A file5.txt
The part of this command’s output that says 0000000... 8ab686e... A tells us
a few things:
- The SHA-1 hash of the original object. In this case, there is no original object (hence the 0000000... part)
- The SHA-1 hash of the new object. This is the 8ab686e... part, which is an abbreviated hash
- The A part tells us that an object was added (or created, or some other synonym)
This demonstrates that Git stores discrete content one and only one time. Each
of these files contains the exact same content, as the “Hello, World!” listings
in the example above show. Git stores this content in an internal objects
directory, where the first two characters of the object hash (8a in this case)
are the directory name, and the remainder of the hash is the file name. Git
doesn’t attempt to store blobs multiple times, because it wouldn’t be terribly
economical.
One note about this example: your commit hash will differ from mine in any case, because it covers the author and the timestamp. If you’re using Windows, and specifically Windows PowerShell, you may also get a different blob hash. This has to do with how things like newline characters and text encodings work across different environments.
Blobs aren’t the only objects that get stored one-and-only-one time. Every single object that Git creates gets hashed, and the hash is used as a unique identifier. If Git finds that an object already exists, it simply re-uses it.
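A quick way to see this re-use, and the objects directory layout, is to write the same content into the object store twice with the hash-object plumbing command. This is only a sketch in a throwaway repository; printf is used instead of echo so the line ending is predictable, and it assumes a freshly written (loose, not yet packed) object.

```shell
# Throwaway repository; nothing needs to be committed for this experiment
repo=$(mktemp -d)
cd "$repo"
git init -q

# Write the same content into the object store twice
first=$(printf 'Hello, World!\n' | git hash-object -w --stdin)
second=$(printf 'Hello, World!\n' | git hash-object -w --stdin)
echo "$first"
echo "$second"

# Split the hash the way Git does: two characters for the directory,
# the remainder for the file name
dir=$(printf '%s' "$first" | cut -c1-2)
file=$(printf '%s' "$first" | cut -c3-)
ls ".git/objects/$dir/$file"

# Read the stored content back out by its hash
git cat-file -p "$first"

# Summary of the loose objects currently in the repository
git count-objects
```

Both writes report the same hash, and only one file shows up under the objects directory. With the default SHA-1 object format, the hash should start with the same 8ab686e seen in the earlier listing.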
Git doesn’t associate blobs directly with files. Instead, it creates trees.
A tree is a Git representation of a directory structure. Trees are how Git
links blob instances with relative paths (for a repository), as well as mapping
sub-trees. Here’s what a sample tree object might look like:
100644 blob 5edff5abac3f31cd8ff26045151c7278e9503e5e .editorconfig
100644 blob 45c150536e5f3888554c294f27539c5d41072467 .gitignore
040000 tree 9ead4dcd336627a532665b0f91504ff505607982 src
The first part is the file mode, which records the kind of entry and its
permissions. Git uses it when it re-constructs the tree on the file system. The
next part tells you whether the item is a tree or a blob. As noted, blobs are
stored one-and-only-one time in Git, and trees are how they get associated with
files on the filesystem. Next, we see the SHA-1 hash of the object. Finally, we
have the name of the object. As you can see above, elements of a tree are
listed in sorted order by name (.editorconfig, .gitignore, src). When a tree is
the child of another tree, we’d refer to it as a sub-tree. This intrinsically
means that trees in Git are recursive, just like directories are on your
filesystem.
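You can print trees like the sample above with the ls-tree plumbing command. The sketch below builds a throwaway repository with a similar layout; the file contents and src/main.c are invented for the example, so the hashes you see will not match the sample.

```shell
# Throwaway repository with a layout similar to the sample tree
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email "john@example.com"

mkdir src
echo "root = true" > .editorconfig
echo "bin/" > .gitignore
echo "int main(void) { return 0; }" > src/main.c
git add .
git commit -q -m "Add sample layout"

# Root tree of the latest commit: mode, type, hash, name
git ls-tree HEAD

# The trailing slash asks for the contents of the src sub-tree
git ls-tree HEAD src/

# Or walk every level at once, printing only the paths
all_paths=$(git ls-tree -r --name-only HEAD)
echo "$all_paths"
```

The src entry in the first listing has type tree, and its hash is itself the name of another tree object, which is where the recursion comes from.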
This is all integrated through the commit object, which has a few attributes:
- A single tree hash, which refers to the root of the repository
- Zero or more parent commits (the very first commit in a repository has none)
- Author info (the person that created the commit)
- Committer info (the person that committed the commit)
- A commit message
Commit messages are like e-mails: they have a subject line, followed by a message body. Usually, I’ll type the message body as either a bulleted list, or in paragraph form. I also use markdown syntax, because several tools will at some point display commit messages the same way. Here’s a sample commit:
tree a34f789016317ef654ba2839f33bd3b8cbb8352c
parent 541fed384c76d0f94db213b230300cddba8b1e89
author John Doe <john@example.com> 1557289111 -0400
committer John Doe <john@example.com> 1557289111 -0400
Example commit: this is the "subject" line
Message body. I'm using a paragraph style here, but for many changes, will
employ a bulleted-list, similar to:
- One space before the "bullet" token
- A token, such as `-` or `*`
- One space after the token
- The text, using a pseudo-sentence style (without ending punctuation)
Here, you see the tree <hash> line, which gives the hash of the commit’s root
tree. Next, there’s that parent <hash> thing. That’s the parent commit. For
merge commits, you’ll see at least two parents, but Git can easily merge more
than just two commits. The author info tells us who created the commit, their
contact details, and a Unix timestamp (plus a timezone offset) for when the
commit was made. The committer info tells us who actually applied the commit to
the commit graph (committed, merged - which is still committing - rebased,
etc.). I formatted this message body to explain the remainder, and how I
generally structure my commits.
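To see how little there is to a commit, here’s a sketch that assembles two of them by hand with the write-tree and commit-tree plumbing commands instead of git commit. The repository, file name, and messages are invented for the example, and the commits are deliberately left unattached to any branch.

```shell
# Throwaway repository with one staged file
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email "john@example.com"
echo "Hello, World!" > file1.txt
git add file1.txt

# Snapshot the staged content as a tree object
tree=$(git write-tree)

# Wrap the tree in a commit; with no -p option it has no parent
root=$(echo "Root commit" | git commit-tree "$tree")

# A second commit that re-uses the same tree and names the first as its parent
child=$(echo "Child commit" | git commit-tree "$tree" -p "$root")

# Raw commit content: tree, parent, author, committer, then the message
git cat-file -p "$child"
```

The output has the same shape as the sample commit above: a tree line, a parent line, author and committer lines, and the message.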
And that’s pretty much all you “need” to know about commits. Git commits are beautifully straightforward! Next, we’ll review branches and tags.
Git Branches
I mentioned that one of my favorite things about Git is the branching model. In many version control systems, a branch is literally a cloned folder from some point in the history. This means you end up with additional folder paths on your hard disk, and unfortunately, the common systems I’ve worked with (SVN, TFVC) store the branch as a second directory on the server. If you only have two branches, then this isn’t terrible (though it’s still not great), but when you scale up to having many branches, you end up with lots of duplicated folders. Even worse, a lot of the branches are “sticky” in your history, and never really go away.
Which leads me to why I love Git branches. A Git branch consists of a very
simple object: it’s just a file in Git’s private repository directory, under
.git/refs/heads. Most repositories will have a branch called master (or main,
in many newer repositories). If you navigate to .git/refs/heads/ from your
repository root, you’ll see a file called master. When I examine this file in
the repository I’m currently working in, it looks like this:
3afad76c02773ddf753a13e05821ad0537560c3a
There’s a little plumbing command called cat-file we can use to query what
this hash refers to:
> git cat-file -t 3afad76c02773ddf753a13e05821ad0537560c3a
commit
That commit is the same commit I referred to earlier in this post. The content is:
> git cat-file -p 3afad76c02773ddf753a13e05821ad0537560c3a
tree a34f789016317ef654ba2839f33bd3b8cbb8352c
parent 541fed384c76d0f94db213b230300cddba8b1e89
author John Doe <john@example.com> 1557289111 -0400
committer John Doe <john@example.com> 1557289111 -0400
Example commit: this is the "subject" line
Message body. I'm using a paragraph style here, but for many changes, will employ
a bulleted-list, similar to:
- One space before the "bullet" token
- A token, such as `-` or `*`
- One space after the token
- The text, using a pseudo-sentence style (usually, without ending punctuation)
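If you’d like to check this in a repository of your own, the sketch below does it in a throwaway one. It assumes Git’s default storage for refs, where a freshly created branch is a plain file; the branch name experiment is made up, and the initial branch is looked up rather than assumed, since it may be master or main depending on configuration.

```shell
# Throwaway repository with a single commit
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email "john@example.com"
echo "Hello, World!" > file1.txt
git add file1.txt
git commit -q -m "Add example file"

# The branch HEAD is attached to (master or main, depending on configuration)
branch=$(git symbolic-ref --short HEAD)

# The branch file contains a single commit hash...
branch_file_hash=$(cat ".git/refs/heads/$branch")
echo "$branch_file_hash"

# ...which is the same commit that HEAD resolves to
head_hash=$(git rev-parse HEAD)
echo "$head_hash"

# Creating another branch just writes another small file with the same hash
git branch experiment
cat .git/refs/heads/experiment
```

Nothing in the working directory is copied when the second branch is created; only the new file under .git/refs/heads appears.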
In Git, you navigate between branches using the git checkout command. When
you perform a checkout, Git performs essentially these steps:
- Open the branch file and obtain the commit hash
- Locate the commit in the history, and open its tree object
- Recursively delete any tracked files or directories not listed in the tree
- Recursively re-instantiate all other files and directories in the tree
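The effect of those steps is easy to observe from the outside. In this sketch (throwaway repository, invented file and branch names), a file that exists only on one branch disappears from the working directory and comes back as you switch between branches.

```shell
# Throwaway repository with one commit on the initial branch
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email "john@example.com"
echo "shared" > file.txt
git add file.txt
git commit -q -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)

# A second branch with one extra tracked file
git checkout -q -b feature
echo "new" > feature-only.txt
git add feature-only.txt
git commit -q -m "Add feature-only file"
ls

# Back on the initial branch, the extra file is no longer in the tree,
# so Git removes it from the working directory
git checkout -q "$main_branch"
ls

# Switching to feature again re-creates it from the stored blob
git checkout -q feature
ls
```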
So what am I getting at here about branches? Well, let’s recap:
- A branch is just a file stored in Git’s private directory
- A branch file just records the SHA-1 hash of a commit
- When you navigate branches (using the checkout command), Git reconstructs that commit recursively down the tree
The consequence of these facts is that a Git branch is nothing more than a
commit pointer. There are no additional directories, such as with other version
control systems. Branches are “non-persistent” - they don’t stick around in
history after they’ve been created and destroyed. And this is why I love
branches in Git. “Destroying” a branch just means that you’ve deleted it (using
the git branch command with either the -d or -D option). The work done on a
branch isn’t necessarily purged, though: its commits remain part of the Git
history (assuming you’re not re-writing commits), but only if you choose to
publish them by merging (for this post, we’ll forget about things like
publishing via the git push command).
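Here’s a small sketch of the difference between the two delete options, using a throwaway repository and invented branch names. The -d option only deletes a branch whose commits are already reachable from somewhere else, while -D deletes it regardless.

```shell
# Throwaway repository with one commit
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email "john@example.com"
echo "shared" > file.txt
git add file.txt
git commit -q -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)

# This branch points at the current commit, so it counts as already merged
git branch merged-work

# This branch gets a commit that exists nowhere else
git checkout -q -b unmerged-work
echo "experiment" > scratch.txt
git add scratch.txt
git commit -q -m "Experimental change"
git checkout -q "$main_branch"

# -d succeeds because nothing would be lost
git branch -d merged-work

# -d refuses here, because the experimental commit has not been merged
git branch -d unmerged-work || echo "refused to delete unmerged-work"

# -D deletes the branch anyway
git branch -D unmerged-work

# Only the initial branch is left
remaining=$(git branch --list | wc -l | tr -d ' ')
echo "$remaining"
```

After the forced delete, no branch points at the experimental commit any more, which is exactly the “non-persistent” behavior described above.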
Git Tags
We’ve covered a whole lot of ground here. We’ve discussed the most basic
elements in Git, including blobs, trees, and commits. We’ve reviewed how
commits are linked together to form a commit graph. And finally, we’ve talked
about branches, and why I personally feel Git’s branching model is superior to
other systems I’ve used. The last concept I want to review is tags, because I
feel they complete the story in terms of Git repository concepts. If you’re
coming from SVN, these should be familiar. If you’re coming from TFVC, the
analogue in TFVC is a “label.”
So what are tags, and why do I feel they’re important? Well, a tag is just an alias for a commit … or, more succinctly, a commit pointer. That’s all. Nothing crazy, and nothing special.
But you just said that a branch is just a commit pointer. And now you’re telling me that a tag is nothing more than a commit pointer too? What gives?! If branches and tags are both “just commit pointers,” then why the distinction, and why should I choose one over the other?
Perhaps you’re not having this insane monologue with yourself. If not, I apologize. In any case, I’m going to attempt to answer all these questions as succinctly as I know how.
Tags are just named commits, or commit-pointers, and branches are just commit pointers. In each case, they can point at any commit in the repository’s history. So how are they different from each other? It’s actually quite simple: you can write new commits to Git branches, but you can’t write new commits to a Git tag. Here’s a short list of things you might do with tags:
- Create tags using the git tag <name> [commit-id] operation
- Overwrite an existing tag (using the -f option)
- Delete a tag (using the -d option)
You can also optionally add a description to a tag using the -a option, which
creates what Git calls an annotated tag.
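Put together, those operations look something like the sketch below. The repository is a throwaway, and the tag names (v1.0, v1.1, first-commit) are made up for illustration.

```shell
# Throwaway repository with two commits
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email "john@example.com"
echo "one" > notes.txt
git add notes.txt
git commit -q -m "First commit"
echo "two" >> notes.txt
git commit -q -am "Second commit"
first=$(git rev-parse HEAD~1)

# Create a tag on the current commit
git tag v1.0

# Create a tag on an older commit by passing its ID
git tag first-commit "$first"

# Overwrite an existing tag so that it points somewhere else
git tag -f v1.0 "$first"

# Create an annotated tag, with the description supplied via -m
git tag -a v1.1 -m "This is an example tag, which is annotated."

# Delete a tag
git tag -d first-commit

# List what is left
git tag --list
```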
There is one interesting point about tags: if you run the cat-file
sub-command with the -t option, and a tag name, Git will report that the
object is a tag, but only if you’ve annotated the tag. Otherwise, cat-file
will just resolve the tag to its commit and report commit. An annotated tag
includes some additional attributes that are not found on “normal” tags:
object 541fed384c76d0f94db213b230300cddba8b1e89
type commit
tag example-tag
tagger John Doe <john@example.com> 1557293448 -0400
This is an example tag, which is annotated.
The first line tells us that we’re pointing at an object with (abbreviated) ID
541fed. The next line tells us that object is a commit. The rest is pretty
self-explanatory.
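The difference between the two kinds of tag is easy to check for yourself. This sketch uses a throwaway repository; lightweight-tag is an invented name, and example-tag mirrors the sample above.

```shell
# Throwaway repository with a single commit
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "John Doe"
git config user.email "john@example.com"
echo "Hello, World!" > file1.txt
git add file1.txt
git commit -q -m "Add example file"

# One "normal" (lightweight) tag and one annotated tag on the same commit
git tag lightweight-tag
git tag -a example-tag -m "This is an example tag, which is annotated."

# The lightweight tag resolves straight to the commit
git cat-file -t lightweight-tag

# The annotated tag is an object of its own
git cat-file -t example-tag

# Its content has the object, type, tag, and tagger lines, then the message
git cat-file -p example-tag

# Peeling the annotated tag leads back to the same commit
git rev-parse "example-tag^{commit}"
```

The first cat-file call prints commit and the second prints tag, even though both names ultimately lead to the same commit.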
So when should you use a tag, and when should you use a branch? My general
guidance is that you use branches when you’re continuously incrementing the
history, such as the case with having a shared develop branch and a
release-quality master branch, or when you have two parallel branches that
are equivalent to dual master branches. An example of why you might choose to
do this is you’re maintaining a library or a framework of some kind, you’re
planning to make a breaking change, and you expect to patch both the version
following the change, and the previous major version with subsequent minor
patches (I’ve had to do this before).
Unlike branches, tags are more retrospective in nature. We use tags to identify events that have occurred in a repository’s history. One common use of tags is the identification of a project’s releases in its history. You can use them for other things too, but I’m having a hard time recalling other use-cases for tagging. In any case, if you just want to name an event in your repository’s history, that’s what you use tags for. They’re more useful when accompanied by annotations, but naming any special event in history is better than leaving it to guesswork.
The final piece I’ll say about tags is this: just because you don’t have a branch, doesn’t mean that you can’t continue incrementing the history for a tag. In fact, it’s fairly trivial, but beyond the scope of things I’d like to cover in this post.
Wrapping Up
So I’ve talked about a whole bunch of stuff on here. I hope this broadens your view of Git. I didn’t bother trying to talk about everything there is to know to get started with Git, and mostly assume that you have some experience using Git. If I’m mistaken in that assumption, then I’d recommend you go create an account with a service that will host your Git repositories for you. I personally like and recommend GitHub and Azure DevOps, but have heard great things about GitLab. Once you’ve got an account, I’d recommend learning how to use all the following:
git clone
git pull
git add
git commit
git push
Once you feel you’ve got the hang of using Git, this post may be worth re-reading. There are also great videos on YouTube and other places that go into much greater depth than I have here. Finally, I can’t recommend the Pro Git book enough, which you can read for free on the Git website.
I hope this post has helped improve your understanding of how Git works. Now, get out there and start committing!
- Brian