Version Control Systems: Git at a Glance

The absolute minimum required to use Git

Jun 01, 2023

Last week, we talked about Version Control Systems, or VCSes, and their general benefits for managing your workspace and understanding who did what, when, and why. Today, let's look at one such VCS and how we can use it within our environment: Git.

All the information in this post is in the Git docs. I would highly suggest reading through them.

What is Git?

Git is a type of VCS known as a Distributed Version Control System, or DVCS for short. A DVCS distributes a full copy of a repository to any machine that gets, or clones, it. As such, every operation made through the Git command line happens on the copy on your machine. Your copy will not be updated from a remote location, and the remote will not receive your changes, unless you explicitly pull from or push to that remote.

Git also almost always only adds data when storing changes to files. This means you can recover deleted files that were committed to a repository, or check out essentially any point in time within your history.
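As a rough illustration of recovering a deleted file, the sketch below commits a file, deletes it in a second commit, and then restores it from the earlier commit. The temporary folder, the file name `notes.txt`, and the user identity are placeholders chosen for this example only.

```shell
# Sketch: recover a file that was deleted in a later commit.
# Everything lives in a throwaway folder; names and identity are placeholders.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First commit: the file exists.
echo "important notes" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Second commit: the file is deleted.
git rm -q notes.txt
git commit -q -m "Remove notes"

# The file is gone from the folder but still stored in the first commit.
# HEAD~1 names the commit just before the latest one.
git checkout HEAD~1 -- notes.txt
cat notes.txt
```

The restored file is placed back in the folder and staged, so it can be committed again if you want to keep it.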

There are other useful safety checks and implementations, but those are less important to understand for general purpose use.

A Repository in a Nutshell

A Git repository can be boiled down to a folder on your machine with fancy features. All files that you want to save in the repository will go inside this folder.

To create a repository, create an empty folder or go to the root directory of an existing project. Then, open your shell (e.g., PowerShell, Terminal) and navigate to the folder. If you are unfamiliar with your shell, you can either use `cd` to navigate or right-click inside the folder and select the `Open in <shell>` option.

Afterwards, run the following command:

git init

`init` creates a new folder within your project called `.git`, which contains the files Git needs to track the project's history. Next, let's create a few files within the project folder. You can then add files to the repository using `add`. The `add` command is followed by a list of paths relative to the current directory (here, the project folder). Each path can be either the name of a file or a glob pattern indicating which files should be added.

# Add all files in the current folder (most Unix shells skip hidden files for *; 'git add .' includes them)
git add *

# Add all files which end with .html
git add *.html

# Add the license file in the root directory
git add LICENSE
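To see what `add` actually did, `git status --short` is handy. The following is a minimal sketch in a throwaway folder; the file names are made up for the example.

```shell
# Sketch: create a repository, add files by pattern and by name, inspect the result.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# A few empty placeholder files to work with.
touch index.html about.html LICENSE

# Add only the files ending in .html.
git add *.html

# The two .html files are listed as added (A); LICENSE is still untracked (??).
git status --short

# Add the remaining file by name.
git add LICENSE
git status --short
```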

By default, `*` matches every file it can find. However, if you create a `.gitignore` file within your project folder, Git will not pick up untracked files that match a pattern in that file. A `.gitignore` entry has no effect on a file that is already tracked. For example, a `.gitignore` file could contain:

# Do not add any files which end with .c
*.c
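The difference between ignored and already-tracked files can be checked directly. This sketch assumes a throwaway folder and two made-up files, `tracked.c` (committed before the pattern exists) and `scratch.c` (created afterwards).

```shell
# Sketch: .gitignore hides untracked files but does not affect tracked ones.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Track one C file before any ignore pattern exists.
echo 'int main(void) { return 0; }' > tracked.c
git add tracked.c
git commit -q -m "Track a C file"

# Now ignore all .c files and create a new one.
echo '*.c' > .gitignore
echo 'int unused;' > scratch.c

# check-ignore exits with status 0 when the path is ignored.
git check-ignore -q scratch.c && echo "scratch.c is ignored"

# Changes to the already-tracked file are still reported;
# scratch.c does not appear in the status output at all.
echo '/* edited */' >> tracked.c
git status --short
```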

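Once a file is added, it is tracked by the Git repository, so Git will notice whenever it changes. Later modifications still need to be included in a commit, either by running `add` again or by using the `-a` option of `commit`.

Once you have added your files to the Git repository, you can make a commit using the `commit` command: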

git commit -a -m "Initial commit"

`-a` tells Git to include every change to already-tracked files (modified or deleted files) in the commit. `-m` specifies that the following string is the commit message. Of course, if you want to commit only specific files, you can use the same path patterns shown above:

git commit -m "Only commit files ending with .c" *.c

Now you have a commit containing your files! You can repeat the above two commands over and over to add more commits to whatever you are working on.
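The edit, add, commit cycle can be sketched end to end as follows. The folder, the file `paper.txt`, and the identity are placeholders for illustration.

```shell
# Sketch: two rounds of the edit / add / commit cycle.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Round 1: a new file has to be added before it can be committed.
echo "draft 1" > paper.txt
git add paper.txt
git commit -q -m "Initial commit"

# Round 2: the file is already tracked, so -a picks up the modification.
echo "draft 2" >> paper.txt
git commit -q -a -m "Revise draft"

# One line per commit, newest first.
git log --oneline
```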

Branches

Currently, all the commits you have made are on the default branch of your repository, usually named main or master. You can verify this by running the `status` command:

# The first line should say 'On branch <branch_name>'
git status

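The branch name shown is Git's default for new repositories: the value of the `init.defaultBranch` setting if one was configured when you set up Git, or otherwise Git's built-in default (`master` in most current releases).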
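If you only want the branch name, for example in a script, something like the sketch below should work. It runs in a throwaway folder and makes no assumption about whether your default is main or master.

```shell
# Sketch: read the current branch name and the configured default.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Name of the branch HEAD points to; this works even before the first commit.
branch=$(git symbolic-ref --short HEAD)
echo "Current branch: $branch"

# The default configured for new repositories, if any.
git config --get init.defaultBranch || echo "init.defaultBranch is not set"
```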

However, let's say that you want to work on a new feature that you don't want in your main branch until it is complete. Maybe you want to add an environment that generates the documentation for your files. For this, we can make use of branches. A branch is basically a pointer to a commit. Each pointer marks the head of its branch, on top of which more commits can be made. You can think of commits as forming a tree in this case.

You can create a new branch from the current commit using the `checkout` command with the `-b` option:

git checkout -b example

This creates a new branch called 'example' and now tells Git to update the 'example' pointer when making commits, instead of the 'main' pointer.

The above is equivalent to running the following two commands:

git branch example
git checkout example

Then, you can repeat the process of making commits using `add` and `commit`. If you ever want to go back to another branch, all you have to do is run the `checkout` command again with that branch's name. The files in your project folder will be automatically updated to match that branch.

You can also merge branches into one another; however, that requires a more in-depth explanation of commit history, how merges work, and how to resolve conflicts. I would suggest reading the book chapter on this instead, though most researchers working alone will likely never encounter it.
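The "branch is a pointer" idea can be made visible with `git rev-parse`, which prints the commit a name points to. This is a sketch in a throwaway folder; `file.txt` and the identity are placeholders, and the default branch name is read from Git rather than assumed.

```shell
# Sketch: watch branch pointers move as commits are made.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "v1" > file.txt
git add file.txt
git commit -q -m "First commit"

# Whatever the default branch is called here (main, master, ...).
main_branch=$(git symbolic-ref --short HEAD)

# Create and switch to a new branch.
git checkout -q -b example

# Both names point at the same commit, so the two hashes are identical.
git rev-parse "$main_branch" example

# Commit on 'example': only the 'example' pointer moves.
echo "v2" >> file.txt
git commit -q -a -m "Work on example"
git rev-parse "$main_branch" example

# Switching back restores the files as they were on the default branch.
git checkout -q "$main_branch"
cat file.txt
```

After the last command, `file.txt` contains only the `v1` line again; the `v2` change is still safe on the 'example' branch.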

Remote Repositories

Currently, all of these commits and files are stored locally on your machine. However, you can also host a repository somewhere on the Internet; this is known as a remote repository. Usually, you store your files in a remote repository so that others can use them or collaborate on them. Plenty of services host remote Git repositories, such as GitHub, GitLab, and Bitbucket.

You can add a remote location using the `remote add` command:

git remote add origin https://github.com/exampleuser/examplerepo

This command adds a new remote named 'origin' that points to a Git repository at the specified URL. The URL can also be an SSH location or another path Git can reach, such as a directory on your own machine.
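To experiment without a hosting account, a bare repository in a local folder can stand in for the remote. The sketch below assumes made-up folder names (`examplerepo.git`, `project`) inside a temporary directory.

```shell
# Sketch: register a remote and list it, using a local bare repository
# in place of a hosted one.
work=$(mktemp -d)
remote_dir="$work/examplerepo.git"

# A bare repository has no working files; it only stores history.
git init -q --bare "$remote_dir"

# The project that will use it as its remote.
git init -q "$work/project"
cd "$work/project" || exit 1
git remote add origin "$remote_dir"

# List the configured remotes together with their URLs.
git remote -v
```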

Now, if we wanted to update the remote repository, we can run the `push` command:

# Assume we have the 'main' branch currently checked out
git push -u origin main

This command does two things. First, it pushes the local 'main' branch to a branch of the same name on the 'origin' remote. Second, because of `-u`, it sets up the local 'main' branch to track the 'main' branch on the 'origin' remote. The upstream location is set this way once per branch, on its first push. After that first push on a branch, we can simply call:

git push

Likewise, we can pull changes from the remote branch using the `pull` command:

git pull
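A full push and pull round trip might look like the sketch below. It uses a local bare repository as the remote and two working copies, hypothetically named `alice` and `bob`, to play the roles of two collaborators.

```shell
# Sketch: push from one copy, pull into another, via a local stand-in remote.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"

# First collaborator: create a commit and push it.
git init -q "$work/alice"
cd "$work/alice" || exit 1
git config user.name "Example User"
git config user.email "user@example.com"
git remote add origin "$work/remote.git"
echo "hello" > README
git add README
git commit -q -m "Initial commit"
branch=$(git symbolic-ref --short HEAD)

# First push: -u also records origin/<branch> as the upstream.
git push -q -u origin "$branch"

# Second collaborator: get a copy of the remote.
git clone -q "$work/remote.git" "$work/bob"

# First collaborator again: the upstream is set, so a bare push is enough.
echo "more" >> README
git commit -q -a -m "Second commit"
git push -q

# Second collaborator: fetch and apply the new commit.
cd "$work/bob" || exit 1
git pull -q
cat README
```

After the pull, both working copies are at the same commit and `README` contains both lines.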

Now you have the basics of using Git. Of course, there are plenty of other useful features and commands that make life easier. This is simply the minimum amount of knowledge necessary to get started.

Some Additional Thoughts

Read the Existing Documentation

This is a very brief overview of how to use Git and its commands. Git's own documentation is much more complete. There is even a free book, Pro Git, that walks through the process step by step. I highly recommend reading it.
