Software Engineering Essentials is a series of blog posts designed to help you get started with a wide variety of software engineering topics. This post was originally part of my Go-Twitter project, a project-based curriculum designed to take someone from “zero” to “competent” in the world of software engineering.
All writing for this post is licensed under CC-BY-4.0, while all code is licensed under the MIT license.
Git
What is Git?
“Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.”
Git is an application designed to manage your source code and make it easy to
keep version history, but why would you want this? Programming gets messy,
especially in large projects. Let’s say you’re testing an experimental bugfix.
You may be tempted to copy/paste your code into a different folder and name it
my-project_bugfix, but what happens when you have my-project_bugfix2 or
my-project_bugfix_real? Copying around folders of source code might work
initially, but it will quickly get messy, especially if you need to work with
other people. Git provides a system to help keep your code organized, easy to
work with, and easy to collaborate on.
A Git repository is essentially a place where you can store files to track and share changes to those files. Git repositories (or repos, for short) are usually used to store and track changes to source code files.
Installing Git
Git is available on a wide variety of platforms and you can find an installer here. There are some options to pick during setup:
- Select Components: Default
- Choosing the default editor used by Git: Use Visual Studio Code as Git’s default editor
  - If you don’t have Visual Studio Code, exit the Git installer, install VS Code, then re-launch the Git installer.
- Adjusting the name of the initial branch in new repositories: Override the default branch name for new repositories: main
- Adjusting your PATH environment: Git from the command line and also from 3rd-party software
- Choosing the SSH executable: Use external OpenSSH
- Choosing HTTPS transport backend: Use the OpenSSL library
- Configuring the line ending conversions: Checkout as-is, commit as-is
- Configuring the terminal emulator to use with Git Bash: Use MinTTY (the default terminal of MSYS2)
- Choose the default behavior of git pull: Only ever fast-forward
- Choose a credential helper: Git Credential Manager
- Configuring extra options: Check both Enable file system caching and Enable symbolic links
- Configuring experimental options: Leave all options unchecked
Configuring SSH and Setting Up Your GitLab Account
First register for GitLab. Next, we need to create an SSH key so we can authenticate to GitLab.
- Press Windows+R to bring up the “Run” dialog box.
- Type powershell and press Enter to open up the PowerShell prompt.
- Type ssh-keygen and press Enter to start the key creation process.
  - Press Enter to accept the default file path.
  - Type a passphrase to protect your key and press Enter.
    - NOTE: Your passphrase will not appear in the terminal window as you type it. This is a security feature.
  - Re-type your passphrase and press Enter.
  - The key is generated and a list of file paths is printed out. Make note of where the public key has been saved.
- Use VS Code to open that public key file. Keep PowerShell and VS Code open for the next section.
- Click here to go to your GitLab User Settings to manage your SSH keys. In the Add an SSH key section, copy and paste the entire public key file from before into the Key text box, then click Add key.
Now bring up the PowerShell window you had open before and use the ssh-add
command. Type your key passphrase and press Enter. Remember: Your passphrase
will not be shown here as you type it.
This command will add your key to an ssh-agent. This program just makes it
easier to use your key. Instead of needing to type your passphrase every time
you use the key, you just add it to the agent and use the passphrase only once.
You’ll need to do this each time the computer reboots.
You can see what keys the agent has with the command: ssh-add -l
You can remove keys from the agent, thereby “re-locking” them, with the command:
ssh-add -D
How Git Views the World
A Brief Introduction to Hashes
Git’s view of the world is through hashes. A hash function is a way to “boil down” data to a fixed-length value. This sounds complex, but it’s simple in practice. Let’s look at the current hash of this file I’m writing now:
~ sha1sum Handouts/01-source-code-version-control-and-hosting/01-git.md
0b4597c1d9294a633ea584d715ce6681e57cb82a Handouts/01-source-code-version-control-and-hosting/01-git.md
But as I write more, change the content, and save, the hash will change:
~ sha1sum Handouts/01-source-code-version-control-and-hosting/01-git.md
8393d5018367212dcfe4fa6e477fdf5deff28576 Handouts/01-source-code-version-control-and-hosting/01-git.md
So hashing is a way to take data of any amount and get a fixed-length value that is, for all practical purposes, unique to that data’s content. Some practical uses involve things like error checking, data deduplication, and protection against corruption or tampering. Git uses hashes to identify individual objects like files, directories, commits, and more.
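To make the idea concrete, here is a minimal Go sketch of hashing some content with SHA-1. The function names and sample text are my own, picked for illustration. The second function mimics how Git (using its default SHA-1 object format) names a file's content: it hashes a short "blob <size>" header, a NUL byte, and then the content.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// sha1Hex returns the SHA-1 digest of data as 40 hex characters,
// the same value sha1sum prints for a file with that content.
func sha1Hex(data []byte) string {
	sum := sha1.Sum(data)
	return hex.EncodeToString(sum[:])
}

// gitBlobHash mimics how Git names a file's content: it hashes a small
// header ("blob <size>" plus a NUL byte) followed by the content itself.
func gitBlobHash(content []byte) string {
	header := fmt.Sprintf("blob %d\x00", len(content))
	return sha1Hex(append([]byte(header), content...))
}

func main() {
	// Hypothetical file contents, before and after a small edit.
	before := []byte("Hello, World!\n")
	after := []byte("Hello, World!!\n")

	fmt.Println(sha1Hex(before)) // always 40 hex characters
	fmt.Println(sha1Hex(after))  // one extra character, a very different hash
	fmt.Println(gitBlobHash(before))
}
```

No matter how large the input is, the output is always 40 hex characters, and even a one-character change produces a completely different value.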
Diffs on Diffs on Diffs
Git heavily utilizes the concept of diffs as a way to understand the
differences between files (and multiple historical versions of the same file).
This makes merges easier to manage and makes for more efficient file storage.
Let’s check out diff in action:
I have two files, a.txt and b.txt:
a.txt:
Test
This is file a
b.txt:
Test
This is file b
Here’s what I get when I run diff a.txt b.txt:
2c2
< This is file a
---
> This is file b
diff only shows us the differences between the files. Git can use this to
figure out what changed between different objects. Conceptually, a commit in
Git is like taking a snapshot of the diff of the entire repository. As you make
more commits, you are creating historical records of what changed in that
repository at that moment in time. By playing the diffs in reverse, you can
effectively wind back the clock and travel to the past, undoing changes.
So how does Git “play back” those diffs? This is an operation called patching. A patch is a series of changes intended to be applied to an existing system.
So if I had a program and someone found a typo:
hello.go
package main

import "fmt"

func main() {
	fmt.Println("henlo world")
}
They could send me a small patch to fix it:
typo-fix.patch
--- hello.go 2023-01-01 21:25:05.694853862 +0000
+++ hello.go 2023-01-01 21:26:15.863385402 +0000
@@ -3,5 +3,5 @@
 import "fmt"
 
 func main() {
-	fmt.Println("henlo world")
+	fmt.Println("hello world")
 }
And I could use patch hello.go typo-fix.patch or git apply typo-fix.patch to
apply the diff. Now my application will print hello world!
A Git repository is just hashed objects and diffs stacked on top of diffs.
Extra Reading
If you enjoyed this section, feel free to check out the Git Internals chapter in the official book for an even deeper dive.
Using Git on the Command Line
Git is an open source application and there are a ton of products and graphical interfaces designed to make working with it easier. While these can work well in straightforward circumstances, they all tend to fall apart pretty badly when things go off the “happy path”, and it can be difficult to get back to a good state. For this reason, we’ll focus on using the command-line version of Git. Once you’re comfortable with this interface, you can absolutely use any Git-compatible application you want, including VS Code’s excellent Git integration. Just don’t skip learning the CLI basics first.
Project Creation
A Git repository is just a folder with another .git folder inside of it. In
the .git folder are a bunch of directories and files that Git uses to make
sense of your codebase. These include hashed objects, diffs, and metadata used
to keep things organized and human-compatible. You shouldn’t need to dive into
this directory or change anything.
The command to instantiate a new Git repository on your local system is: git init my-project-name
Once you run this command, you’ll have a folder named my-project-name. All of
your code, as well as anything you want to keep under version control (like
documentation files), should go here.
Cloning an Existing Project
Git is a distributed system, which means projects don’t have to live purely on your local system. Many programmers utilize a Git-hosting service to store and share their code with other contributors. These include services like GitHub, GitLab, and Bitbucket. Don’t worry about signing up for any of these services just yet. The next section will walk you through the basics of GitLab.
Git hosting services give you an easy way to interact with other contributors, report and accept issues, manage merge requests, and most importantly, share your code with the world. It’s important to understand that Git is a distributed system by default: you don’t need these hosting sites to share or collaborate on code, but they do make it easier.
You can think about Git hosting services like YouTube or Vimeo. You can absolutely share video files with people or even make your own video hosting service, but using someone else’s service is so much easier. It is important to keep in mind that Git is not GitHub. GitHub can be used to host a Git repository, just like YouTube can be used to host a video. The hosting service and the core technology are two different things.
Most Git hosting sites will give you one or two Clone addresses you can use to copy down the project locally. Most often these addresses are SSH and/or HTTPS links. You can use these addresses in this command:
git clone https://github.com/torvalds/linux.git
or like this:
git clone git@github.com:torvalds/linux.git
Typically, an HTTPS repository link is going to be read-only. You can pull
down changes, but won’t be able to push your own changes (unless you set up
HTTPS authentication). An SSH repository link (the one that looks like
git@gitlab) is typically going to be read+write. If you have write access to
the project, you will be able to push your changes to the remote repository.
Either of the clone commands above will copy the repository hosted on GitHub, by the user torvalds, with the project name linux, into a directory on your local system called linux. After you run that command, you can browse the Linux source code on your own system. Because Git is decentralized, you obtain a full copy of the source code repository, including history, commits, branches, tags, and other Git metadata. If you want, you can see what the code looked like last year, or even before then. You can see when certain features were merged in, or when a certain version number was tagged.
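The local directory name comes from the last part of the clone address. Here is a simplified Go sketch of that rule, using a function name I made up. Git’s own logic covers more edge cases, so this is only an approximation that works for the two address styles shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// cloneDirName guesses the folder name "git clone" would create for an
// address: the last path component, without a trailing ".git".
func cloneDirName(address string) string {
	trimmed := strings.TrimSuffix(strings.TrimRight(address, "/"), ".git")
	// Split on '/' (HTTPS paths) and ':' (the SSH "host:owner/project" form).
	idx := strings.LastIndexAny(trimmed, "/:")
	return trimmed[idx+1:]
}

func main() {
	fmt.Println(cloneDirName("https://github.com/torvalds/linux.git")) // linux
	fmt.Println(cloneDirName("git@github.com:torvalds/linux.git"))     // linux
}
```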
One thing to keep in mind is some data found on the Git host site isn’t included with the repository. These would be things like merge request discussions, issues and bug reports, and project milestones and planning features. By default, Git does not track these items in the repository, so they are instead hosted directly by the Git hosting service.
Viewing and Committing Changes
Git’s version control isn’t automated; users need to manually make “checkpoints” of their changes to let Git know “these changes are important to track”. In Git, these “checkpoints” are called commits.
A commit is a way to tell Git “here are the changes I want to track”. Not everything in a repository folder is automatically tracked. We need to tell Git what files we want to track. For now though, let’s see how Git sees the world:
In this example, I have a new repository and I’ve created my first file with
some “Hello, World!” text in it. I can use git status to see what Git thinks
of the files in the repository:
~/git/my-project-name git status
On branch main
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
my-first-file.txt
nothing added to commit but untracked files present (use "git add" to track)
Git sees a new file that it has never tracked before: my-first-file.txt
Now, let’s follow Git’s helpful advice, add the file, and re-run the status command to see what changes:
~/git/my-project-name git add my-first-file.txt
~/git/my-project-name git status
On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: my-first-file.txt
Now we have added the file to our staging area, but we haven’t committed.
We’re only preparing the files for commit; we haven’t made the “checkpoint”
yet. To do that, let’s use git commit. By default, a text editor will pop up
or display in your terminal. This is used for writing a commit message. A
commit message is what will be logged with your code changes into Git history.
So if you’re working on a shared project, what you write here will be viewable
by anyone with access to the repository.
Writing good commit messages is an art form and practice makes perfect. Try to include enough context with your messages, so a “code detective” down the line can figure out what your change did without needing to read the code itself. Keep in mind that you will be that “code detective” one day, and the code you’ll be digging into may very well be your own. Trying to track down a bug and running into a commit that just says “fix” is infuriating. Make it easy on yourself: write good commit messages.
A good Git commit message explains what changed and why. The first line is
reserved for a title, the second line is left blank, and the third line and
beyond are for your longer write-up. This is so git log --oneline and other Git
tools can display text nicely. This isn’t a hard rule, though. People can (and
will) disregard good commit message etiquette, and the software will work with
it all the same. Zach Holman has a great post about commit messages and why
they just don’t matter, so there are multiple schools of thought on the subject.
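If you wanted to check that layout in code, a sketch in Go might look like this. The function name and the sample message are invented for illustration, and this only checks the title/blank line/body shape described above. It says nothing about whether the message is actually helpful.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCommitMessage separates a commit message into its title (the first
// line) and body (everything after the blank second line). wellFormed is
// false when the second line is not blank or the message is empty.
func splitCommitMessage(msg string) (title, body string, wellFormed bool) {
	lines := strings.Split(strings.TrimSpace(msg), "\n")
	title = lines[0]
	if len(lines) == 1 {
		return title, "", title != ""
	}
	if strings.TrimSpace(lines[1]) != "" {
		// No blank separator: treat the rest as body, but flag the layout.
		return title, strings.Join(lines[1:], "\n"), false
	}
	return title, strings.TrimSpace(strings.Join(lines[2:], "\n")), true
}

func main() {
	// A hypothetical message following the title / blank line / body layout.
	msg := "Fix typo in greeting\n\nThe program printed henlo instead of hello."
	title, body, ok := splitCommitMessage(msg)
	fmt.Println(title) // Fix typo in greeting
	fmt.Println(body)  // The program printed henlo instead of hello.
	fmt.Println(ok)    // true
}
```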
Back to the example. So I’ve written this as my commit message:
Added my-first-file.txt to use in a Git example

I've added this file with default text in order to show Git examples
using commit messages.
I saved the file and exited the text editor. Git then uses this to construct a commit. It returned this text:
[main (root-commit) 3fa8cef] Added my-first-file.txt to use in a Git example
1 file changed, 1 insertion(+)
create mode 100644 my-first-file.txt
Now I can use git log to see my history:
commit 3fa8cefb1c854a550158a58d6443ec2b27e957fe (HEAD -> main)
Author: Tom Webster <tom@samurailink3.com>
Date:   Wed Jan 4 00:17:48 2023 +0000

    Added my-first-file.txt to use in a Git example

    I've added this file with default text in order to show Git examples
    using commit messages.
To get a more compact view, you can use git log --oneline and that will
display your commit titles in a list:
3fa8cef (HEAD -> main) Added my-first-file.txt to use in a Git example
You may have noticed that long string of jumbled text above:
3fa8cefb1c854a550158a58d6443ec2b27e957fe.
This is a commit ID. A commit ID is a hash used to identify the change
you’ve added to history. Since hashes are designed to be unique, we only need to
display enough characters to identify a single commit. Instead of the full ID
from above, we can just display 3fa8cef.
A simple analogy is an ID system with too many digits. Let’s say you have a list of the best fruits, but each ID has 10 digits, so you end up with a table like this:
Fruit - ID Number
Apple - 0000000001
Orange - 0000000002
Banana - 0000000003
Strawberry - 0000000004
That’s a lot of wasted space in the ID number. If we think we’ll have fewer than 100 fruits in this list, we can just use 2 digits to display the ID instead of the full 10:
Fruit - ID Number
Apple - 01
Orange - 02
Banana - 03
Strawberry - 04
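Hashes in Git operate in mostly the same way: Git will accept any unambiguous prefix of a hash, as long as it is at least 4 characters long. When Git displays a shortened hash, it defaults to 7 or more characters and lengthens it as needed to keep it unambiguous.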
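Here is a small Go sketch of the “shortest unambiguous prefix” idea, assuming a minimum length of 4. The function name is my own, and the second commit ID in the example is made up so that the two IDs share their first six characters. This is only the concept, not how Git implements it internally.

```go
package main

import (
	"fmt"
	"strings"
)

// shortestUnique returns the shortest prefix of id (at least minLen
// characters long) that no other ID in all starts with.
func shortestUnique(id string, all []string, minLen int) string {
	for n := minLen; n < len(id); n++ {
		prefix := id[:n]
		unique := true
		for _, other := range all {
			if other != id && strings.HasPrefix(other, prefix) {
				unique = false
				break
			}
		}
		if unique {
			return prefix
		}
	}
	return id // nothing shorter is unambiguous
}

func main() {
	real := "3fa8cefb1c854a550158a58d6443ec2b27e957fe"
	madeUp := "3fa8ce00000000000000000000000000000000aa" // hypothetical ID

	fmt.Println(shortestUnique(real, []string{real}, 4))         // 3fa8
	fmt.Println(shortestUnique(real, []string{real, madeUp}, 4)) // 3fa8cef
}
```

With only one commit, four characters are enough. Once a second ID shares the first six characters, the prefix has to grow to seven to stay unambiguous.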
Git Remotes
Git Remotes are remote repositories. Because Git is a distributed system, you
can have multiple remote repositories you can read from or write to. If you
cloned this course repository from GitLab, you’ll probably see this when you run
git remote -v:
origin https://gitlab.com/samurailink3/go-twitter (fetch)
origin https://gitlab.com/samurailink3/go-twitter (push)
By convention, the first remote added to a Git repository is named origin, but this is only a general convention, not a hard rule.
You can have multiple remotes and custom rules for pushing and pulling changes for each of them if you want. Most often you’ll just have one remote, hosted by one of the popular Git hosting services. This setup will allow you to push your changes to an online repository and pull down changes that others have pushed but that you don’t yet have in your local repository.
If your repository was created on your local system only, without cloning from a Git hosting service, you won’t have any remotes configured initially. You can always add a remote with:
git remote add remote-name remote-address
Like this:
git remote add origin git@gitlab.com:samurailink3/go-twitter.git
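Each line of git remote -v output has three parts: the remote’s name, its address, and whether that address is used for fetching or pushing. Here is a minimal Go sketch that reads output in that shape. The type and function names are my own, and it assumes the simple three-column format shown earlier.

```go
package main

import (
	"fmt"
	"strings"
)

// remote describes one line of "git remote -v" output.
type remote struct {
	Name      string // e.g. origin
	URL       string // where the remote lives
	Direction string // "fetch" or "push"
}

// parseRemotes turns "git remote -v" style output into a list of remotes.
// Lines that do not have exactly three fields are skipped.
func parseRemotes(output string) []remote {
	var remotes []remote
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 3 {
			continue
		}
		remotes = append(remotes, remote{
			Name:      fields[0],
			URL:       fields[1],
			Direction: strings.Trim(fields[2], "()"),
		})
	}
	return remotes
}

func main() {
	output := "origin\thttps://gitlab.com/samurailink3/go-twitter (fetch)\n" +
		"origin\thttps://gitlab.com/samurailink3/go-twitter (push)\n"
	for _, r := range parseRemotes(output) {
		fmt.Printf("%s %s %s\n", r.Name, r.Direction, r.URL)
	}
}
```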
Pulling Changes
Git has two different methods to pull changes into your local repository:
fetch and pull. Fetching changes pulls down new objects, diffs, and
references from a Git remote to your local repository. It’s important to
keep in mind that fetching won’t change files in your local workspace. The
changes are instead cached as part of that Git remote.
To fetch changes from the origin remote, you can run:
git fetch origin
If you’d like to fetch changes from all remotes, you can use:
git fetch --all
To actually update files in your local workspace, you would use git pull. This
fetches and then imports changes from the configured remote branch into your
local branch. Because everything you fetch is stored locally, things like
viewing history or checking diffs are near-instantaneous: Git doesn’t need to
go back to the internet to get more data.
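One way to picture the difference between fetch and pull is with a toy model in Go. Everything here is invented for illustration (the type, the method names, the commit IDs), and it only covers the simple case where your local branch has no commits of its own, so pulling just moves it forward.

```go
package main

import "fmt"

// repo is a toy model of a local repository with one branch.
type repo struct {
	local    []string // commits on your local branch
	tracking []string // locally cached copy of the remote branch
}

// fetch copies the remote's commits into the local cache.
// Your local branch and working files are left untouched.
func (r *repo) fetch(remoteCommits []string) {
	r.tracking = append([]string{}, remoteCommits...)
}

// pull fetches, then brings the local branch up to date with the cache.
// This sketch assumes the local branch has no commits the remote lacks.
func (r *repo) pull(remoteCommits []string) {
	r.fetch(remoteCommits)
	r.local = append([]string{}, r.tracking...)
}

func main() {
	remote := []string{"c1", "c2", "c3"} // hypothetical commit IDs
	r := &repo{local: []string{"c1"}}

	r.fetch(remote)
	fmt.Println(r.local, r.tracking) // [c1] [c1 c2 c3]

	r.pull(remote)
	fmt.Println(r.local, r.tracking) // [c1 c2 c3] [c1 c2 c3]
}
```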
Pushing Changes
Once you have commits you would like to push to your configured remote, you can use:
git push remote-name branch-name
Like this:
git push origin main
This tells Git, “Take my local commits and push them to the origin remote on
the main branch.” Git will automatically figure out which commits are missing
on the remote and push only what’s needed to make them match.
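Conceptually, “figuring out what’s missing” is a comparison between the commits you have and the commits the remote has. Here is a simplified Go sketch of that comparison. The function name and commit IDs are made up, and real Git negotiates this far more efficiently than comparing full lists.

```go
package main

import "fmt"

// missingOnRemote returns the local commits that the remote does not
// have yet, in their original order.
func missingOnRemote(local, remote []string) []string {
	onRemote := make(map[string]bool, len(remote))
	for _, id := range remote {
		onRemote[id] = true
	}
	var missing []string
	for _, id := range local {
		if !onRemote[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	local := []string{"c1", "c2", "c3", "c4"} // hypothetical commit IDs
	remote := []string{"c1", "c2"}
	fmt.Println(missingOnRemote(local, remote)) // [c3 c4]
}
```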
Branches and Checkouts
Branches in Git are a way to create other “timelines” of history in your
repository. So, if your main development branch is main, but you want to work
on a complex feature without introducing bugs or breakages on the main branch,
you can create a new branch named my-new-feature and start committing changes
to that branch. With the installer setting chosen earlier, the first branch of
a new Git repository is called main. Without that setting, Git has
historically named it master.
To see a list of branches in your local repository, you can use:
git branch
To see all branches, even those present on remotes, you can use:
git branch --all
To create a new branch, you can just use:
git branch my-new-branch
Now, if you look at your list of branches with git branch, you’ll see it in
your list of local branches:
* main
my-new-branch
But if you look, you can see the * next to main, and if you run git status, you’ll see that you’re still on the main branch:
On branch main
So how do we switch branches?
git checkout my-new-branch
Now when you run git status, you’ll see:
On branch my-new-branch
nothing to commit, working tree clean
Now, when you commit changes, you’ll be committing to my-new-branch instead of
main. To get back to the main branch, run git checkout main.
To save time, you can create and check out a new branch with one command:
git checkout -b my-new-branch
If you’d like to base your branch on a specific branch, you can use:
git checkout -b my-new-branch main
or
git checkout -b my-new-branch origin/some-other-branch
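Under the hood, a branch is a name that points at a commit, and committing moves the current branch’s pointer forward. Here is a toy Go model of that idea. The type, method names, and the second commit ID are all invented for illustration.

```go
package main

import "fmt"

// repo is a toy model: each branch name points at its latest commit ID.
type repo struct {
	branches map[string]string
	current  string // the checked-out branch
}

// branch creates a new branch pointing at the current branch's commit.
func (r *repo) branch(name string) {
	r.branches[name] = r.branches[r.current]
}

// checkout switches to an existing branch.
func (r *repo) checkout(name string) error {
	if _, ok := r.branches[name]; !ok {
		return fmt.Errorf("no branch named %q", name)
	}
	r.current = name
	return nil
}

// commit moves only the current branch to the new commit ID.
func (r *repo) commit(id string) {
	r.branches[r.current] = id
}

func main() {
	r := &repo{branches: map[string]string{"main": "3fa8cef"}, current: "main"}

	r.branch("my-new-branch")
	if err := r.checkout("my-new-branch"); err != nil {
		fmt.Println(err)
		return
	}
	r.commit("a1b2c3d") // hypothetical commit ID

	fmt.Println(r.branches["main"])          // 3fa8cef
	fmt.Println(r.branches["my-new-branch"]) // a1b2c3d
}
```

The commit made on my-new-branch leaves main exactly where it was, which is what makes branches safe places to experiment.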
Additional Reading and Git Workflows
There are many strategies and theories on when you should make new branches and how you should organize them. When starting out: simpler is better. I’d recommend reading about GitHub Flow. The quick reference for this workflow is:
- Create a new branch for your specific change
- Make your commits
- Push your branch to your remote
- Create a merge request (called a Pull Request in GitHub’s terminology)
- If you’re working in a group: Get sign-off from your team and address any comments or requests
- Merge your changes
- Delete your branch
If you are working on software that needs multiple versions and maintenance releases (like operating systems or desktop software), something a bit more complex, like git-flow, could help.
