
Banner Image: https://unsplash.com/photos/black-red-and-yellow-coated-wires-6ySmw7CwYDk

Software Engineering Essentials: Version Control with Git

Published On: November 11, 2024

Licenses: CC-BY-4.0 | MIT

Software Engineering Essentials is a series of blog posts designed to help you get started with a wide variety of software engineering topics. This post was originally part of my Go-Twitter project, a project-based curriculum designed to take someone from “zero” to “competent” in the world of software engineering.

All writing for this post is licensed under CC-BY-4.0, while all code is licensed under the MIT license.

Git

What is Git?

“Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.”

Git is an application designed to manage your source code and make it easy to keep version history, but why would you want this? Programming gets messy, especially in large projects. Let’s say you’re testing an experimental bugfix. You may be tempted to copy/paste your code into a different folder and name it my-project_bugfix, but what happens when you end up with my-project_bugfix2 or my-project_bugfix_real? Copying around folders of source code might work initially, but it will quickly get messy, especially if you need to work with other people. Git provides a system to help keep your code organized, easy to work with, and easy to collaborate on.

A Git repository is essentially a place where you can store files to track and share changes to those files. Git repositories (or repos, for short) are usually used to store and track changes to source code files.

Installing Git

Git is available on a wide variety of platforms and you can find an installer here. There are some options to pick during setup:

Select Components: Default
Choosing the default editor used by Git: Use Visual Studio Code as Git’s default editor
- If you don’t have Visual Studio Code, exit the Git installer, install vscode, then re-launch the Git installer.
Adjusting the name of the initial branch in new repositories: Override the default branch name for new repositories: main
Adjusting your PATH environment: Git from the command line and also from 3rd-party software
Choosing the SSH executable: Use external OpenSSH
- SSH is now included in Windows 10 and Windows 11 by default
Choosing HTTPS transport backend: Use the OpenSSL library
Configuring the line ending conversions: Checkout as-is, commit as-is
Configuring the terminal emulator to use with Git Bash: Use MinTTY (the default terminal of MSYS2)
Choose the default behavior of git pull: Only ever fast-forward
Choose a credential helper: Git Credential Manager
Configuring extra options: Check both Enable file system caching and Enable symbolic links
Configuring experimental options: Leave all options unchecked

Configuring SSH and Setting Up Your GitLab Account

First, register for GitLab. Next, we need to create an SSH key so we can authenticate to GitLab.

Press Windows+R to bring up the “Run” dialog box.
Type powershell and press Enter to open up the PowerShell prompt.
Type ssh-keygen and press Enter to start the key creation process.
- Press Enter to accept the default file path
- Type a passphrase to protect your key and press Enter.
  - NOTE: Your passphrase will not appear in the terminal window as you type it. This is a security feature.
- Re-type your passphrase and press Enter.
- The key is generated and a list of file paths is printed out. Make note of where the public key has been saved.
- Use VS Code to open that public key file. Keep PowerShell and VS Code open for the next section.

Click here to go to your GitLab User Settings to manage your SSH keys. In the Add an SSH key section, copy and paste the entire public key file from before into the Key text box, then click Add key.

Now bring up the PowerShell window you had open before and run the ssh-add command. Type your key passphrase and press Enter. Remember: Your passphrase will not be shown here as you type it.

This command will add your key to an ssh-agent. This program just makes it easier to use your key. Instead of needing to type your passphrase every time you use the key, you just add it to the agent and use the passphrase only once. You’ll need to do this each time the computer reboots.

You can see what keys the agent has with the command: ssh-add -l

You can remove all keys from the agent, thereby “re-locking” them, with the command: ssh-add -D

How Git Views the World

A Brief Introduction to Hashes

Git’s view of the world is through hashes. A hash function is a way to “boil down” data to a fixed-length value. This sounds complex, but it’s simple in practice. Let’s look at the current hash of this file I’m writing now:

~ sha1sum Handouts/01-source-code-version-control-and-hosting/01-git.md
0b4597c1d9294a633ea584d715ce6681e57cb82a  Handouts/01-source-code-version-control-and-hosting/01-git.md

But as I write more, change the content, and save, the hash will change:

~ sha1sum Handouts/01-source-code-version-control-and-hosting/01-git.md
8393d5018367212dcfe4fa6e477fdf5deff28576  Handouts/01-source-code-version-control-and-hosting/01-git.md

So hashing is a way to take data of any size and get a fixed-length value that is, for all practical purposes, unique to that data’s content. Some practical uses involve things like error checking, data deduplication, and protection against corruption or tampering. Git uses hashes to identify individual objects like files, directories, commits, tags, and more.
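To make the idea concrete, here is a minimal Go sketch of hashing with SHA-1, the same algorithm sha1sum uses. The helper names hashHex and blobID are my own for illustration. The blobID function follows Git's documented approach of hashing a small "blob <size>" header in front of the file's bytes, but treat it as a sketch rather than a replacement for Git's own tooling.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hashHex returns the SHA-1 digest of data as a 40-character hex string,
// no matter how long or short the input is.
func hashHex(data []byte) string {
	sum := sha1.Sum(data)
	return hex.EncodeToString(sum[:])
}

// blobID sketches how Git names a file's contents: it hashes the header
// "blob <size>\x00" followed by the raw bytes of the file.
func blobID(content []byte) string {
	header := fmt.Sprintf("blob %d\x00", len(content))
	return hashHex(append([]byte(header), content...))
}

func main() {
	// Changing a single character produces a completely different hash.
	fmt.Println(hashHex([]byte("abc"))) // a9993e364706816aba3e25717850c26c9cd0d89d
	fmt.Println(hashHex([]byte("abd")))

	// The ID Git assigns to an empty file.
	fmt.Println(blobID([]byte(""))) // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
}
```

The second function also explains why sha1sum and Git report different values for the same file: Git hashes the header plus the content, not the content alone.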

Diffs on Diffs on Diffs

Git heavily utilizes the concept of diffs as a way to understand the differences between files (and multiple historical versions of the same file). This makes merges easier to manage and makes for more efficient file storage.

Let’s check out diff in action:

I have two files, a.txt and b.txt:

a.txt:

Test
This is file a

b.txt:

Test
This is file b

Here’s what I get when I run diff a.txt b.txt:

2c2
< This is file a
---
> This is file b

diff only shows us the differences between the files. Git can use this to figure out what changed between different objects. A commit in Git is like taking a snapshot of the entire repository, and Git can diff any two snapshots to show exactly what changed between them. As you make more commits, you are creating historical records of what the repository looked like at each moment in time. By playing the diffs in reverse, you can effectively wind back the clock and travel to the past, undoing changes.
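If you are curious what a diff boils down to, here is a deliberately naive Go sketch that compares two texts line by line. The function name naiveDiff and its output format are my own invention. The real diff tool is much smarter: it aligns matching lines, so an inserted line doesn't make every later line look changed.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveDiff compares a and b position by position and reports every line
// that differs, using "<" for the first text and ">" for the second.
// Assumption: this toy does no alignment, unlike the real diff tool.
func naiveDiff(a, b string) []string {
	aLines := strings.Split(a, "\n")
	bLines := strings.Split(b, "\n")

	n := len(aLines)
	if len(bLines) > n {
		n = len(bLines)
	}

	var out []string
	for i := 0; i < n; i++ {
		hasA, hasB := i < len(aLines), i < len(bLines)
		if hasA && hasB && aLines[i] == bLines[i] {
			continue // identical lines are not part of the diff
		}
		if hasA {
			out = append(out, fmt.Sprintf("%d< %s", i+1, aLines[i]))
		}
		if hasB {
			out = append(out, fmt.Sprintf("%d> %s", i+1, bLines[i]))
		}
	}
	return out
}

func main() {
	a := "Test\nThis is file a"
	b := "Test\nThis is file b"
	for _, line := range naiveDiff(a, b) {
		fmt.Println(line)
	}
}
```

Run against the contents of a.txt and b.txt, this reports only line 2, which matches what diff told us.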

So how does Git “play back” those diffs? This is an operation called patching. A patch is a series of changes intended to be applied to an existing system.

Say I had a program, hello.go, and someone found a typo:

package main

import "fmt"

func main() {
	fmt.Println("henlo world")
}

They could send me a small patch to fix it:

typo-fix.patch

--- hello.go	2023-01-01 21:25:05.694853862 +0000
+++ hello.go	2023-01-01 21:26:15.863385402 +0000
@@ -3,5 +3,5 @@
 import "fmt"

 func main() {
-	fmt.Println("henlo world")
+	fmt.Println("hello world")
 }

And I could use patch hello.go typo-fix.patch or git apply typo-fix.patch to apply the diff. Now my application will print hello world!
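Here is a small Go sketch of what “applying” and “reversing” a change means. The lineChange type and the apply and reverse functions are hypothetical stand-ins for a single patch hunk. Real patch files carry context lines and can change many lines at once, so this only shows the core idea.

```go
package main

import "fmt"

// lineChange is a simplified, hypothetical patch: replace one line
// (1-based) that currently reads before with the text in after.
type lineChange struct {
	line          int
	before, after string
}

// apply returns a patched copy of lines. It refuses to apply the change
// when the target line does not contain the expected text.
func apply(lines []string, c lineChange) ([]string, error) {
	if c.line < 1 || c.line > len(lines) {
		return nil, fmt.Errorf("line %d is out of range", c.line)
	}
	if lines[c.line-1] != c.before {
		return nil, fmt.Errorf("line %d does not match the expected text", c.line)
	}
	out := append([]string(nil), lines...)
	out[c.line-1] = c.after
	return out, nil
}

// reverse swaps before and after, so applying the result undoes the change.
func reverse(c lineChange) lineChange {
	return lineChange{line: c.line, before: c.after, after: c.before}
}

func main() {
	source := []string{
		"package main", "", `import "fmt"`, "",
		"func main() {", "\tfmt.Println(\"henlo world\")", "}",
	}
	fix := lineChange{
		line:   6,
		before: "\tfmt.Println(\"henlo world\")",
		after:  "\tfmt.Println(\"hello world\")",
	}

	patched, err := apply(source, fix)
	if err != nil {
		panic(err)
	}
	fmt.Println(patched[5])

	// Playing the change backwards restores the original line.
	restored, err := apply(patched, reverse(fix))
	if err != nil {
		panic(err)
	}
	fmt.Println(restored[5])
}
```

The real tools can do the same trick: both patch and git apply accept a -R flag to apply a patch in reverse.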

Conceptually, a Git repository is just hashed objects and diffs stacked on top of diffs.

Extra Reading

If you enjoyed this section, feel free to check out the Git Internals chapter in the official book for an even deeper dive.

Using Git on the Command Line

Git is an open source application and there are a ton of products and graphical interfaces designed to make working with it easier. While these can work well in some straightforward circumstances, they all tend to fall apart pretty badly when things go off the “happy path”, and it can be difficult to get back to a good state. For this reason, we’ll focus on using the command-line version of Git. Once you’re comfortable with this interface, you can absolutely use any Git-compatible application you want, including VS Code’s excellent Git integration. Just don’t skip learning the CLI basics first.

Project Creation

A Git repository is just a folder with another .git folder inside of it. In the .git folder are a bunch of directories and files that Git uses to make sense of your codebase. These include hashed objects, diffs, and metadata used to keep things organized and human-compatible. You shouldn’t need to dive into this directory or change anything.

The command to instantiate a new Git repository on your local system is: git init my-project-name

Once you run this command, you’ll have a folder named my-project-name. All of your code, as well as anything you want to keep under version control (like documentation files), should go here.

Cloning an Existing Project

Git is a distributed system, which means projects don’t have to live purely on your local system. Many programmers utilize a Git-hosting service to store and share their code with other contributors. These include services like GitHub, GitLab, and Bitbucket. Don’t worry about signing up for any of these services just yet; the next section will walk you through the basics of GitLab.

Git hosting services give you an easy way to interact with other contributors, report and accept issues, manage merge requests, and most importantly: share your code with the world. It’s important to understand that Git is a distributed system by default. You don’t need these hosting sites to share or collaborate on code, but they do make it easier.

You can think about Git hosting services like YouTube or Vimeo. You can absolutely share video files with people or even make your own video hosting service, but using someone else’s service is so much easier. It is important to keep in mind that Git is not GitHub. GitHub can be used to host a Git repository, just like YouTube can be used to host a video. The hosting service and the core technology are two different things.

Most Git hosting sites will give you one or two Clone addresses you can use to copy down the project locally. Most often these addresses are SSH and/or HTTPS links. You can use these addresses in this command:

git clone https://github.com/torvalds/linux.git

or like this:

git clone git@github.com:torvalds/linux.git

Typically, an HTTPS repository link is going to be read-only. You can pull down changes, but won’t be able to push your own changes (unless you set up HTTPS authentication). An SSH repository link (the one that looks like git@gitlab) is typically going to be read+write. You will be able to push your changes to the remote repository.

Either clone command above will copy the repository hosted on GitHub, by the user torvalds, with the project name linux, into a directory on your local system called linux. After you run the command, you can browse the Linux source code on your own system. Because Git is decentralized, you obtain a full copy of the source code repository, including history, commits, branches, tags, and other Git metadata. If you want, you can see what the code looked like last year, or even before then. You can see when certain features were merged in, or when a certain version number was tagged.
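Both address styles end up in a folder called linux because Git derives the folder name from the last part of the address. Here is a rough Go sketch of that rule. The function cloneDirName is my own approximation, not Git's actual code, and it ignores edge cases that Git handles.

```go
package main

import (
	"fmt"
	"strings"
)

// cloneDirName approximates the folder name git clone picks when you don't
// give one: the last segment of the address, minus a trailing ".git".
// Assumption: this covers only the common HTTPS and SSH address shapes.
func cloneDirName(address string) string {
	name := strings.TrimRight(address, "/")
	name = strings.TrimSuffix(name, ".git")
	// Cut at the last "/" or ":" so both address styles work.
	if i := strings.LastIndexAny(name, "/:"); i >= 0 {
		name = name[i+1:]
	}
	return name
}

func main() {
	fmt.Println(cloneDirName("https://github.com/torvalds/linux.git")) // linux
	fmt.Println(cloneDirName("git@github.com:torvalds/linux.git"))     // linux
}
```

If you would rather pick the folder name yourself, git clone accepts it as an extra argument after the address.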

One thing to keep in mind is some data found on the Git host site isn’t included with the repository. These would be things like merge request discussions, issues and bug reports, and project milestones and planning features. By default, Git does not track these items in the repository, so they are instead hosted directly by the Git hosting service.

Viewing and Committing Changes

Git’s version control isn’t automated. Users need to manually make “checkpoints” of their changes to let Git know “these changes are important to track”. In Git, these “checkpoints” are called commits.

A commit is a way to tell Git “here are the changes I want to track”. Not everything in a repository folder is automatically tracked. We need to tell Git what files we want to track. For now though, let’s see how Git sees the world:

In this example, I have a new repository and I’ve created my first file with some “Hello, World!” text in it. I can use git status to see what Git thinks of the files in the repository:

~/git/my-project-name git status
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        my-first-file.txt

nothing added to commit but untracked files present (use "git add" to track)

Git sees a new file that it has never tracked before: my-first-file.txt

Now, let’s follow Git’s helpful advice, add the file, and re-run the status command to see what changes:

~/git/my-project-name git add my-first-file.txt
~/git/my-project-name git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   my-first-file.txt

Now it looks like we have added the file to our staging area, but we haven’t committed. We’re only preparing the files for commit; we haven’t made the “checkpoint” yet. To do that, let’s use git commit. By default, a text editor will pop up or display in your terminal. This is used for writing a commit message. A commit message is what will be logged with your code changes into Git history. So if you’re working on a shared project, what you write here will be viewable by anyone with access to the repository.

Writing good commit messages is an art form and practice makes perfect. Try to include enough context with your messages, so a “code detective” down the line can figure out what your change did without needing to read the code itself. Keep in mind that you will be that “code detective” one day, and the code you’ll be digging into may very well be your own. Trying to track down a bug and running into a commit that just says “fix” is infuriating. Make it easy on yourself: write good commit messages.

A good Git commit message explains what changed and why. The first line is reserved for a title, the second line is left blank, and the third line and beyond are for your longer write-up. This is so git log --oneline and other Git tools can display text nicely. This isn’t a hard rule though: people can (and will) disregard good commit message etiquette, and the software will work with it all the same. Zach Holman has a great post about commit messages and why they just don’t matter, so there are multiple schools of thought on the subject.
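To show why the blank second line matters, here is a small Go sketch of how a tool could split a commit message into a title and a body. The function splitMessage is hypothetical and only illustrates the convention. It is not how Git itself parses messages.

```go
package main

import (
	"fmt"
	"strings"
)

// splitMessage separates a commit message into its title (the first line)
// and its body (everything after the blank second line). ok is false when
// the message has a second line that is not blank.
func splitMessage(msg string) (title, body string, ok bool) {
	lines := strings.Split(strings.TrimSpace(msg), "\n")
	title = lines[0]
	if len(lines) == 1 {
		return title, "", true // a title-only message is fine
	}
	if strings.TrimSpace(lines[1]) != "" {
		return title, strings.Join(lines[1:], "\n"), false
	}
	return title, strings.Join(lines[2:], "\n"), true
}

func main() {
	msg := "Added my-first-file.txt to use in a Git example\n" +
		"\n" +
		"I've added this file with default text in order to show Git examples\n" +
		"using commit messages.\n"

	title, body, ok := splitMessage(msg)
	fmt.Println("title:", title)
	fmt.Println("follows the convention:", ok)
	fmt.Println("body:")
	fmt.Println(body)
}
```

A compact view like git log --oneline only needs the title, which is why keeping it short and self-contained pays off.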

Back to the example. So I’ve written this as my commit message:

Added my-first-file.txt to use in a Git example

I've added this file with default text in order to show Git examples
using commit messages.

I saved the file and exited the text editor. Git then uses this to construct a commit. It returned this text:

[main (root-commit) 3fa8cef] Added my-first-file.txt to use in a Git example
 1 file changed, 1 insertion(+)
 create mode 100644 my-first-file.txt

Now I can use git log to see my history:

commit 3fa8cefb1c854a550158a58d6443ec2b27e957fe (HEAD -> main)
Author: Tom Webster <tom@samurailink3.com>
Date:   Wed Jan 4 00:17:48 2023 +0000

    Added my-first-file.txt to use in a Git example

    I've added this file with default text in order to show Git examples
    using commit messages.

To get a more compact view, you can use git log --oneline and that will display your commit titles in a list:

3fa8cef (HEAD -> main) Added my-first-file.txt to use in a Git example

You may have noticed that long string of jumbled text above: 3fa8cefb1c854a550158a58d6443ec2b27e957fe.

This is a commit ID. A commit ID is a hash used to identify the change you’ve added to history. Since hashes are designed to be unique, we only need to display enough characters to identify a single commit. We don’t need to display the full ID from above; we can just display 3fa8cef.

A simple example of this is like an ID system with too many digits. Let’s say you have a list of the best fruits, but each ID has 10 digits, so you end up with a table like this:

Fruit       - ID Number

Apple       - 0000000001
Orange      - 0000000002
Banana      - 0000000003
Strawberry  - 0000000004

So much wasted space in that ID number. If we think we’ll have fewer than 100 fruits in this list, we can just use 2 digits to display the ID instead of the full 10:

Fruit       - ID Number

Apple       - 01
Orange      - 02
Banana      - 03
Strawberry  - 04

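Hashes in Git operate in mostly the same way: Git shows you a short, unambiguous prefix of the hash (7 characters by default in a small repository, more if needed to avoid ambiguity), and the shortest abbreviation it will accept is 4 characters.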
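The “shortest unambiguous prefix” idea can be sketched in a few lines of Go. The function shortestPrefix is my own illustration, and two of the three IDs below are made up or borrowed from earlier in this post purely for the example. Git does this lookup against its own object database, not a slice of strings.

```go
package main

import (
	"fmt"
	"strings"
)

// shortestPrefix returns the shortest prefix of id, at least minLen
// characters long, that no other ID in all starts with.
// Assumption: minLen is not negative.
func shortestPrefix(id string, all []string, minLen int) string {
	for n := minLen; n < len(id); n++ {
		prefix := id[:n]
		unique := true
		for _, other := range all {
			if other != id && strings.HasPrefix(other, prefix) {
				unique = false
				break
			}
		}
		if unique {
			return prefix
		}
	}
	return id // no shorter prefix is unambiguous
}

func main() {
	ids := []string{
		"3fa8cefb1c854a550158a58d6443ec2b27e957fe", // the commit from above
		"3fa8ce" + strings.Repeat("0", 34),         // hypothetical near-collision
		"8393d5018367212dcfe4fa6e477fdf5deff28576", // unrelated hash
	}
	fmt.Println(shortestPrefix(ids[0], ids, 4)) // 3fa8cef
	fmt.Println(shortestPrefix(ids[2], ids, 4)) // 8393
}
```

The first ID needs 7 characters because a made-up neighbor shares its first 6, while the third is already unique at the 4-character minimum.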

Git Remotes

Git Remotes are remote repositories. Because Git is a distributed system, you can have multiple remote repositories you can read from or write to. If you cloned this course repository from GitLab, you’ll probably see this when you run git remote -v:

origin  https://gitlab.com/samurailink3/go-twitter (fetch)
origin  https://gitlab.com/samurailink3/go-twitter (push)

By convention, the first remote added to a Git repository is named origin, but this is only a general convention, not a hard rule.

You can have multiple remotes and custom rules for pushing and pulling changes for each of them if you want. Most often you’ll just have one remote, hosted by one of the popular Git hosting services. This setup will allow you to push changes to an online repository and/or pull down changes that others have pushed but that you don’t yet have in your local repository.

If your repository was created on your local system only, without cloning from a Git hosting service, you won’t have any remotes configured initially. You can always add a remote with:

git remote add remote-name remote-address

Like this:

git remote add origin git@gitlab.com:samurailink3/go-twitter.git

Pulling Changes

Git has two different methods to pull changes into your local repository: fetch and pull. Fetching changes pulls down new objects, diffs, and references from a Git remote down to your local repository. It’s important to keep in mind that fetching won’t change files in your local workspace. The changes are instead cached as part of that Git remote (for example, under names like origin/main).

To fetch changes from the origin remote, you can run:

git fetch origin

If you’d like to fetch changes from all remotes, you can use:

git fetch --all

To actually update files in your local workspace, you would use git pull. This fetches from the configured remote and then integrates the changes from the configured remote branch into your local branch. Because every fetched change is stored locally, things like viewing history or checking diffs are near-instantaneous: Git doesn’t need to go back to the internet to get more data.

Pushing Changes

Once you have commits you would like to push to your configured remote, you can use:

git push remote-name branch-name

Like this:

git push origin main

This tells Git, “Take my local commits and push them to the origin remote on the main branch.” Git will automatically figure out which commits are missing on the remote and push only what’s needed to make them match.
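As a rough mental model of “push only what’s needed”, here is a Go sketch that treats each side’s history as a plain list of commit IDs. The function missingCommits and the IDs are hypothetical. Real Git walks the commit graph and negotiates with the remote, so take this as an analogy rather than the actual protocol.

```go
package main

import "fmt"

// missingCommits returns the commits from local, in their original order,
// that do not appear in remote.
// Assumption: histories are modeled as flat lists of commit IDs.
func missingCommits(local, remote []string) []string {
	have := make(map[string]bool, len(remote))
	for _, id := range remote {
		have[id] = true
	}

	var missing []string
	for _, id := range local {
		if !have[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	local := []string{"c1", "c2", "c3"} // hypothetical commit IDs, oldest first
	remote := []string{"c1"}
	fmt.Println(missingCommits(local, remote)) // [c2 c3]
}
```

When both sides already match, the list is empty and there is nothing to send.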

Branches and Checkouts

Branches in Git are a way to create other “timelines” of history in your repository. So, if your main development branch is main, but you want to work on a complex feature without introducing bugs or breakages on the main branch, you can create a new branch named my-new-feature and start committing changes to that branch. If you chose the installer option from earlier, the first branch of a new Git repository is called main; Git’s long-standing built-in default is master.

To see a list of branches in your local repository, you can use:

git branch

To see all branches, even those present on remotes, you can use:

git branch --all

To create a new branch, you can just use:

git branch my-new-branch

Now, if you look at your list of branches with git branch, you’ll see it in your list of local branches:

* main
  my-new-branch

But if you look, you can see the * next to main, and if you run git status, you’ll see that you’re still on the main branch:

On branch main

So how do we switch branches?

git checkout my-new-branch

Now when you run git status, you’ll see:

On branch my-new-branch
nothing to commit, working tree clean

Now, when you commit changes, you’ll be committing to my-new-branch instead of main. To get back to the main branch, run git checkout main.
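If it helps, you can picture a branch as nothing more than a name that points at its newest commit. Here is a toy Go model of that idea. The repo type and its methods are hypothetical and leave out almost everything a real repository tracks, such as each commit’s parent.

```go
package main

import "fmt"

// repo is a toy model: each branch name maps to the ID of its newest
// commit, and current is the branch that is checked out.
type repo struct {
	branches map[string]string
	current  string
}

// branch creates a new branch pointing at the same commit as the current one.
func (r *repo) branch(name string) {
	r.branches[name] = r.branches[r.current]
}

// checkout switches to an existing branch.
func (r *repo) checkout(name string) error {
	if _, ok := r.branches[name]; !ok {
		return fmt.Errorf("no such branch: %s", name)
	}
	r.current = name
	return nil
}

// commit moves only the current branch forward to the new commit ID.
func (r *repo) commit(id string) {
	r.branches[r.current] = id
}

func main() {
	r := &repo{branches: map[string]string{"main": "3fa8cef"}, current: "main"}

	r.branch("my-new-branch")
	if err := r.checkout("my-new-branch"); err != nil {
		panic(err)
	}
	r.commit("b1c854a") // hypothetical commit ID

	fmt.Println("main:", r.branches["main"])                   // main: 3fa8cef
	fmt.Println("my-new-branch:", r.branches["my-new-branch"]) // my-new-branch: b1c854a
}
```

Committing on my-new-branch leaves main exactly where it was, which is what makes branches safe places to experiment.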

To save time, you can create and check out a new branch with one command:

git checkout -b my-new-branch

If you’d like to base your branch on a specific branch, you can use:

git checkout -b my-new-branch main

git checkout -b my-new-branch origin/some-other-branch

Additional Reading and Git Workflows

There are many strategies and theories on when you should make new branches and how you should organize them. When starting out: simpler is better. I’d recommend reading about GitHub Flow. The quick reference for this workflow is:

Create a new branch for your specific change
Make your commits
Push your branch to your remote
Create a merge request (called a Pull Request in GitHub’s terminology)
If you’re working in a group: Get sign off from your team and address any comments or requests
Merge your changes
Delete your branch

If you are working on software that needs multiple versions and maintenance releases (like operating systems or desktop software), something a bit more complex, like git-flow, could help.