Chapter 5 Collaboration with version control

5.1 Overview

This chapter will introduce the concept of using version control systems to track changes to a project over its lifespan, to share and edit code in a collaborative team, and to distribute the finished project to its intended audience. This chapter also demonstrates how to implement these ideas effectively in practice using Git, GitHub, and JupyterHub.

5.2 Chapter learning objectives

By the end of the chapter, students will be able to:

  • Describe what version control is and why data analysis projects can benefit from it
  • Create a remote version control repository on GitHub
  • Move changes to files from GitHub to JupyterHub, and from JupyterHub to GitHub
  • Give collaborators access to the repository
  • Resolve conflicting edits made by multiple collaborators
  • Communicate with collaborators using issues
  • Use best practices when collaborating on a project with others

5.3 What is version control, and why should I use it?

Data analysis projects often require iteration and revision to move from an initial idea to a finished product that is ready for the intended audience. Without deliberate and conscious effort towards tracking changes made to the analysis, projects tend to become messy, with mystery results files that cannot be reproduced, temporary files with snippets of ideas, mind-boggling filenames like document_final_draft_v5_final.txt, large blocks of commented code “saved for later,” and more. Additionally, the iterative nature of data analysis projects makes it important to be able to examine earlier versions of code and writing. Finally, data analyses are typically completed by a team of people rather than a single person. This means that files need to be shared across multiple computers, and multiple people often end up editing the project simultaneously. In such a situation, determining who has the latest version of the project—and how to resolve conflicting edits—can be a real challenge.

Version control helps solve these challenges by tracking changes to the files in the analysis (code, writing, data, etc) over the lifespan of the project, including when the changes were made and who made them. This provides the means both to view earlier versions of the project and to revert changes. Version control also facilitates collaboration via tools to share edits with others and resolve conflicting edits. But even if you’re working on a project alone, you should still use version control. It helps you keep track of what you’ve done, when you did it, and what you’re planning to do next!

You mostly collaborate with yourself, and me-from-two-months-ago never responds to email.

–Mark T. Holder

In order to version control a project, you generally need two things: a version control system and a repository hosting service. The version control system is the software that is responsible for tracking changes, sharing changes you make with others, obtaining changes others have made, and resolving conflicting edits. The repository hosting service is responsible for storing a copy of the version controlled project online (a repository), where you and your collaborators can access it remotely, discuss issues and bugs, and distribute your final product. For both of these items, there is a wide variety of choices; some of the more popular ones are:

In this textbook we’ll use Git for version control, and GitHub for repository hosting, because both are currently the most widely-used platforms.

Note: technically you don’t have to use a repository hosting service. You can, for example, use Git to version control a project that is stored only in a folder on your computer. But using a repository hosting service provides a few big benefits, including managing collaborator access permissions, tools to discuss and track bugs, and the ability to have external collaborators contribute work, not to mention the safety of having your work backed up in the cloud. Since most repository hosting services now offer free accounts, there are not many situations in which you wouldn’t want to use one for your project.

5.4 Creating a space for your project online

Before you can create repositories, you will need a GitHub account; you can sign up for a free account at https://github.com/. Once you have logged into your account, you can create a new repository to host your project by clicking on the “+” icon in the upper right hand corner, and then on “New Repository” as shown below:

On the next page, do the following:

  1. Enter the name for your project repository. In the example below, we use canadian_languages. Most repositories follow this naming convention, which involves lowercase letter words separated by either underscores or hyphens.
  2. Choose an option for the privacy of your repository
    1. If you select “Public”, your repository may be viewed by anyone, but only you and collaborators you designate will be able to modify it.
    2. If you select “Private”, only you and your collaborators can view or modify it.
  3. Select “Add a README file.” This creates a template README.md file in your repository’s root folder.
  4. When you are happy with your repository name and configuration, click on the green “Create Repository” button.

Now you should have a repository that looks something like this:

5.5 Creating and editing files on GitHub

There are several ways to use the GitHub interface to add files to your repository and to edit them. Below we cover how to use the pen tool to edit existing files, and how to use the “Add file” drop down to create a new file or upload files from your computer. These techniques are useful for handling simple plaintext files, for example, the README.md file that is already present in the repository.

5.5.1 The pen tool

The pen tool can be used to edit existing plaintext files. Click on the pen tool:

Use the text box to make your changes:

Finally, commit your changes. When you commit a file in a repository, the version control system takes a snapshot of what the file looks like. As you continue working on the project, over time you will possibly make many commits to a single file; this generates a useful version history for that file. On GitHub, if you click the green “Commit changes” button, it will save the file and then make a commit. Do this now:

5.5.2 The “Add file” menu

The “Add file” menu can be used to create new plaintext files and upload files from your computer. To create a new plaintext file, click the “Add file” drop down menu and select the “Create new file” option:

A page will open with a small text box for the file name to be entered, and a larger text box where the desired file content text can be entered. Note the two tabs, “Edit new file” and “Preview”. Toggling between them lets you enter and edit text and view what the text will look like when rendered, respectively. Note that GitHub understands and renders .md files using a markdown syntax very similar to Jupyter notebooks, so the “Preview” tab is especially helpful for checking markdown code correctness.

Save and commit your changes by click the green “Commit changes” button at the bottom of the page.

You can also upload files that you have created on your local machine by using the “Add file” drop down menu and selecting “Upload files”:

To select the files from your local computer to upload, you can either drag and drop them into the grey box area shown below, or click the “choose your files” link to access a file browser dialog. Once the files you want to upload have been selected, click the green “Commit changes” button at the bottom of the page.

Note that Git and GitHub are designed to track changes in individual files. Do not upload your whole project in an archive file (e.g. .zip), because then Git can only keep track of changes to the entire .zip file—that wouldn’t be very useful if you’re trying to see the history of changes to a single code file in your project!

5.6 Cloning your repository on JupyterHub

Although there are several ways to create and edit files on GitHub, they are not quite powerful enough for efficiently creating and editing complex files, or files that need to be executed to assess whether they work (e.g., files containing code). Thus, it is useful to be able to connect the project repository that was created on GitHub to a coding environment. This can be done on your local computer, or using a JupyterHub; below we show how to do this using a JupyterHub.

We need to clone our project’s Git repository to our JupyterHub—i.e., make a copy that knows where it was obtained from so that it knows where send/receive new committed edits. In order to do this, first copy the url from the HTTPS tab of the Code drop down menu on GitHub:

Then open JupyterHub, and click the Git+ icon on the file browser tab:

Paste the url of the GitHub project repository you created and click the blue “CLONE” button:

On the file browser tab, you will now see a folder for your project’s repository (and inside it will be all the files that existed on GitHub):

5.7 Working in a cloned repository on JupyterHub

Now that you have cloned your repository on JupyterHub, you can get to work editing, creating, and deleting files. Once you reach a point that you want Git to keep a record of the current version, you need to commit (i.e., snapshot) your changes. Then once you have made commits that you want to share with your collaborators, you need to push (i.e., send) those commits back to GitHub. Again, we can use the JupyterLab Git extension tool to do all of this. In particular, your workflow on JupyterHub should look like this:

  1. You edit, create, and delete files in your cloned repository on JupyterHub.
  2. Once you want a record of the current version, you specify which files to “add” to Git’s staging area. You can think of files in the staging area as those modified files for which you want a snapshot.
  3. You commit those flagged files to your repository, and include a helpful commit message to tell your collaborators about the changes you made. Note: here you are only committing to your cloned repository stored on JupyterHub. The repository on GitHub has not changed, and your collaborators cannot see your work yet.
  4. Go back to step 1. and keep working!
  5. When you want to store your commits (that only exist in your cloned repository right now) on the cloud where they can be shared with your collaborators, you push them back to the hosted repository on GitHub.

Below we walk through how to use the Jupyter Git extension tool to do each of the steps outlined above.

5.7.1 Specifying files to commit

Below we created and saved a new file (named eda.ipynb) that we would like to send back to the project repository on GitHub. To “add” this modified file to the staging area (i.e., flag that this is a file whose changes we would like to commit), we click the Jupyter Git extension icon on the far left-hand side of JupyterLab:

This opens the Jupyter Git graphical user interface pane, and then we click the plus sign beside the file that we want to “add”.

Note: because this is the first change for this file that we want to add, it falls under the “Untracked” heading. However, next time we edit this file and want to add the changes we made, we will find it under the “Changed” heading.

Note: do not add the eda-checkpoint.ipynb file (sometimes called .ipynb_checkpoints). This file is automatically created by Jupyter when you work on eda.ipynb. You generally do not add auto-generated files to Git repositories, only add the files you directly create and edit.

This moves the file from the “Untracked” heading to the “Staged” heading, flagging this file so that Git knows we want a snapshot of its current state as a commit. Now we are ready to “commit” the changes. Make sure to include a (clear and helpful!) message about what was changed so that your collaborators (and future you) know what happened in this commit.

5.7.2 Making the commit

To snapshot the changes with an associated commit message, we put the message in the text box at the bottom of the Git pane and click on the blue “Commit” button. It is highly recommended to write useful and meaningful messages about what was changed. These commit messages, and the datetime stamp for a given commit, are the primary means to navigate through the project’s histry in the event that we need to view or retrieve a past version of a file, or revert our project to an earlier state.

When you click the “Commit” button for the first time, you will be prompted to enter your name and email. This only needs to be done once for each machine you use Git on.

After “commiting” the file(s), you will see there there are 0 “Staged” files and we are now ready to push our changes (and the attached commit message) to our project repository on GitHub:

5.7.3 Pushing the commits to GitHub

To send the committed changes back to the project repository on GitHub, we need to push them. To do this we click on the cloud icon with the up arrow on the Jupyter Git tab:

We will then be prompted to enter our GitHub username and password, and click the blue “OK” button:

If the files were successfully pushed to our project repository on GitHub we will be given the success message shown below. Click “Dismiss” to continue working in Jupyter.

You will see that the changes now exist there!

5.8 Collaboration

5.8.1 Giving collaborators access to your project

As mentioned earlier, GitHub allows you to control who has access to your project. The default of both public and private projects are that only the person who created the GitHub repository has permissions to create, edit and delete files (write access). To give your collaborators write access to the projects, navigate to the “Settings” tab:

Then click “Manage access”:

Click the green “Invite a collaborator” button:

Type in the collaborator’s GitHub username and select their name when it appears:

Finally, click the green “Add to this repository” button:

After this you should see your newly added collaborator listed under the “Manage access” tab. They should receive an email invitation to join the GitHub repository as a collaborator. They need to accept this invitation to enable write access.

5.8.2 Pulling changes from GitHub

If your collaborators send their own commits to the GitHub repository, you will need to pull those changes to your own cloned copy on JupyterHub before you’re allowed to push any more changes yourself. By pulling their changes, you sync your local repository to what is present on GitHub.

Note: you can still work on your own cloned repository and commit changes even if collaborators have pushed changes to the GitHub repository. It is only when you try to push your changes back to GitHub that Git will make sure nobody else has pushed any work in the meantime.

You can do this using the Jupyter Git tab by clicking on the cloud icon with the down arrow:

Once the files are successfully pulled from GitHub, you need to click “Dismiss” to keep working:

And then when you open (or refresh) the files whose changes you just pulled, you should be able to see them:

It can be very useful to review the history of the changes to your project. You can do this directly on the JupyterHub by clicking “History” in the Git tab.

It is good practice to pull any changes at the start of every work session before you start working on your local copy. If you do not do this, and your collaborators have pushed some changes to the project to GitHub, then you will be unable to push your changes to GitHub until you pull. This situation can be recognized by this error message:

Usually, getting out of this situation is not too troublesome. First you need to pull the changes that exist on GitHub that you do not yet have on your machine. Usually when this happens, Git can automatically merge the changes for you, even if you and your collaborators were working on different parts of the same file!

If however, you and your collaborators made changes to the same line of the same file, Git will not be able to automatically merge the changes—it will not know whether to keep your version of the line(s), your collaborators version of the line(s), or some blend of the two. When this happens, Git will tell you that you have a merge conflict and that it needs human intervention (you!), and which file(s) this occurs in.

5.8.3 Handling merge conflicts

To fix the merge conflict we need to open the file that had the merge conflict in a plain text editor and look for special marks that Git puts in the file to tell you where the merge conflict occurred.

The beginning of the merge conflict is preceded by <<<<<<< HEAD and the end of the merge conflict is marked by >>>>>>>. Between these markings, Git also inserts a separator (=======). The version of the change before the separator is your change, and the version that follows the separator was the change that existed on GitHub.

Once you have decided which version of the change (or what combination!) to keep, you need to use the plain text editor to remove the special marks that Git added.

The file must be saved, added to the staging area, and then committed before you will be able to push your changes to GitHub.

5.8.4 Communicating using GitHub issues

When working on a project in a team, you don’t just want a historical record of who changed what file and when in the project—you also want a record of decisions that were made, ideas that were floated, problems that were identified and addressed, and all other communication surrounding the project. Email and messaging apps are both very popular for general communication, but are not designed for project-specific communication: they both generally do not have facilities for organizing conversations by project subtopics, searching for conversations related to particular bugs or software versions, etc.

GitHub issues are an alternative written communication medium to email and messaging apps, and were designed specifically to facilitate project-specific communication. Issues are opened from the “Issues” tab on the project’s GitHub page, and they persist there even after the conversation is over and the issue is closed (in contrast to email, issues are not usually deleted). One issue thread is usually created per topic, and they are easily searchable using GitHub’s search tools. All issues are accessible to all project collaborators, so no one is left out of the conversation. Finally, issues can be setup so that team members get email notifications when a new issue is created or a new post is made in an issue thread. Replying to issues from email is also possible. Given all of these advantages, we highly recommend the use of issues for project-related communication.

To open a GitHub issue, first click on the “Issues” tab:

Next click the “New issue” button:

Add an issue title (which acts like an email subject line), and then put the body of the message in the larger text box. Finally click “Submit new issue” to post the issue to share with others:

You can reply to an issue that someone opened by adding your written response to the large text box and clicking comment:

When a conversation is resolved, you can click “Close issue”. The closed issue can be later viewed be clicking the “Closed” header link in the “Issue” tab:

5.9 Additional resources

Now that you’ve picked up the basics of version control with Git and GitHub, you can expand your knowledge using one of the many tutorials available online:

5.9.1 Best practices and workflows

5.9.2 Technical references

References

Matthews, David, Greg Wilson, and Steve Easterbrook. 2008. “Configuration Management for Large-Scale Scientific Computing at the Uk Met Office.” Computing in Science & Engineering 10 (6): 56–64.

Wilson, Greg, Dhavide A Aruliah, C Titus Brown, Neil P Chue Hong, Matt Davis, Richard T Guy, Steven HD Haddock, et al. 2014. “Best Practices for Scientific Computing.” PLoS Biol 12 (1): e1001745.