Chapter 5 Collaboration with version control
This chapter will introduce the concept of using version control systems to track changes to a project over its lifespan, to share and edit code in a collaborative team, and to distribute the finished project to its intended audience. This chapter also demonstrates how to implement these ideas effectively in practice using Git, GitHub, and JupyterHub.
5.2 Chapter learning objectives
By the end of the chapter, students will be able to:
- Describe what version control is and why data analysis projects can benefit from it
- Create a remote version control repository on GitHub
- Move changes to files from GitHub to JupyterHub, and from JupyterHub to GitHub
- Give collaborators access to the repository
- Resolve conflicting edits made by multiple collaborators
- Communicate with collaborators using issues
- Use best practices when collaborating on a project with others
5.3 What is version control, and why should I use it?
Data analysis projects often require iteration and revision to move from an initial idea
to a finished product that is ready for the intended audience. Without
deliberate and conscious effort towards tracking changes made to the analysis,
projects tend to become messy, with mystery results files that cannot be reproduced,
temporary files with snippets of ideas, mind-boggling filenames,
large blocks of commented code “saved for later,” and more. Additionally,
the iterative nature of data analysis projects makes it important to be able to examine
earlier versions of code and writing. Finally, data analyses are typically completed by a team of
people rather than a single person. This means that files need to be
shared across multiple computers, and multiple people often end up editing the project
simultaneously. In such a situation, determining who has the latest version of the project—and
how to resolve conflicting edits—can be a real challenge.
Version control helps solve these challenges by tracking changes to the files in the analysis (code, writing, data, etc) over the lifespan of the project, including when the changes were made and who made them. This provides the means both to view earlier versions of the project and to revert changes. Version control also facilitates collaboration via tools to share edits with others and resolve conflicting edits. But even if you’re working on a project alone, you should still use version control. It helps you keep track of what you’ve done, when you did it, and what you’re planning to do next!
You mostly collaborate with yourself, and me-from-two-months-ago never responds to email.
–Mark T. Holder
In order to version control a project, you generally need two things: a version control system and a repository hosting service. The version control system is the software that is responsible for tracking changes, sharing changes you make with others, obtaining changes others have made, and resolving conflicting edits. The repository hosting service is responsible for storing a copy of the version controlled project online (a repository), where you and your collaborators can access it remotely, discuss issues and bugs, and distribute your final product. For both of these items, there is a wide variety of choices; some of the more popular ones are:
- Version control systems: Git, Mercurial, Subversion
- Repository hosting services: GitHub, GitLab, Bitbucket
In this textbook we’ll use Git for version control, and GitHub for repository hosting, because both are currently the most widely used platforms.
Note: technically you don’t have to use a repository hosting service. You can, for example, use Git to version control a project that is stored only in a folder on your computer. But using a repository hosting service provides a few big benefits, including managing collaborator access permissions, tools to discuss and track bugs, and the ability to have external collaborators contribute work, not to mention the safety of having your work backed up in the cloud. Since most repository hosting services now offer free accounts, there are not many situations in which you wouldn’t want to use one for your project.
5.4 Creating a space for your project online
Before you can create repositories, you will need a GitHub account; you can sign up for a free account at https://github.com/. Once you have logged into your account, you can create a new repository to host your project by clicking on the “+” icon in the upper right hand corner, and then on “New Repository” as shown below:
On the next page, do the following:
- Enter the name for your project repository. In the example below, we use canadian_languages. Most repositories follow this naming convention: lowercase words separated by either underscores or hyphens.
- Choose an option for the privacy of your repository.
  - If you select “Public”, your repository may be viewed by anyone, but only you and collaborators you designate will be able to modify it.
  - If you select “Private”, only you and your collaborators can view or modify it.
- Select “Add a README file.” This creates a template README.md file in your repository’s root folder.
- When you are happy with your repository name and configuration, click on the green “Create Repository” button.
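The naming convention mentioned above can be checked with a short Python sketch. The pattern below is one possible reading of the convention, not a rule enforced by GitHub; allowing digits is an extra assumption, and the function name is hypothetical.

```python
import re

# Assumed reading of the convention: lowercase words (digits are allowed
# here as an extra assumption) joined by single underscores or hyphens.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:[_-][a-z0-9]+)*")


def follows_naming_convention(name):
    """Return True if `name` looks like canadian_languages or canadian-languages."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["canadian_languages", "canadian-languages", "Canadian Languages"]:
    print(candidate, follows_naming_convention(candidate))
```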
Now you should have a repository that looks something like this:
5.5 Creating and editing files on GitHub
There are several ways to use the GitHub interface to add files to your repository and to edit them.
Below we cover how to use the pen tool to edit existing files, and how to use the “Add file” drop down
to create a new file or upload files from your computer. These techniques are useful for handling simple
plaintext files, for example, the
README.md file that is already present in the repository.
5.5.1 The pen tool
The pen tool can be used to edit existing plaintext files. Click on the pen tool:
Use the text box to make your changes:
Finally, commit your changes. When you commit a file in a repository, the version control system takes a snapshot of what the file looks like. As you continue working on the project, over time you will possibly make many commits to a single file; this generates a useful version history for that file. On GitHub, if you click the green “Commit changes” button, it will save the file and then make a commit. Do this now:
5.6 Cloning your repository on JupyterHub
Although there are several ways to create and edit files on GitHub, they are not quite powerful enough for efficiently creating and editing complex files, or files that need to be executed to assess whether they work (e.g., files containing code). Thus, it is useful to be able to connect the project repository that was created on GitHub to a coding environment. This can be done on your local computer, or using a JupyterHub; below we show how to do this using a JupyterHub.
We need to clone our project’s Git repository to our JupyterHub; that is, we make a copy that knows where it was obtained from, so that it knows where to send and receive new committed edits. In order to do this, first copy the URL from the HTTPS tab of the Code drop-down menu on GitHub:
Then open JupyterHub, and click the Git+ icon on the file browser tab:
Paste the URL of the GitHub project repository you created and click the blue “CLONE” button:
On the file browser tab, you will now see a folder for your project’s repository (and inside it will be all the files that existed on GitHub):
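For readers curious about what the “CLONE” button corresponds to outside the graphical interface, here is a minimal Python sketch that builds the equivalent `git clone` command. The repository URL and the helper name `clone_command` are hypothetical placeholders; the sketch only constructs the command and does not run it.

```python
import subprocess


def clone_command(repo_url, target_dir=None):
    """Build the `git clone` command for an HTTPS repository URL."""
    if not repo_url.startswith("https://"):
        raise ValueError("expected the HTTPS URL copied from the Code menu")
    command = ["git", "clone", repo_url]
    if target_dir is not None:
        # Optional folder name for the local copy.
        command.append(target_dir)
    return command


# Hypothetical URL; replace it with the one copied from your own repository.
url = "https://github.com/your-username/canadian_languages.git"
print(clone_command(url))

# To actually clone (requires Git to be installed), you could run:
# subprocess.run(clone_command(url), check=True)
```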
5.7 Working in a cloned repository on JupyterHub
Now that you have cloned your repository on JupyterHub, you can get to work editing, creating, and deleting files. Once you reach a point where you want Git to keep a record of the current version, you need to commit (i.e., snapshot) your changes. Then once you have made commits that you want to share with your collaborators, you need to push (i.e., send) those commits back to GitHub. Again, we can use the JupyterLab Git extension tool to do all of this. In particular, your workflow on JupyterHub should look like this:
1. You edit, create, and delete files in your cloned repository on JupyterHub.
2. Once you want a record of the current version, you specify which files to “add” to Git’s staging area. You can think of files in the staging area as those modified files for which you want a snapshot.
3. You commit those flagged files to your repository, and include a helpful commit message to tell your collaborators about the changes you made. Note: here you are only committing to your cloned repository stored on JupyterHub. The repository on GitHub has not changed, and your collaborators cannot see your work yet.
4. Go back to step 1 and keep working!
5. When you want to store your commits (that only exist in your cloned repository right now) on the cloud where they can be shared with your collaborators, you push them back to the hosted repository on GitHub.
Below we walk through how to use the Jupyter Git extension tool to do each of the steps outlined above.
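The add, commit, and push steps of this workflow correspond to three Git commands. The Python sketch below builds them as command lists so the sequence is easy to see; the helper names, the file name, and the commit message are illustrative assumptions, and the commands only run if you call `run_all` yourself inside a cloned repository.

```python
import subprocess


def stage_commit_push_commands(files, message):
    """Return the Git commands for the add -> commit -> push workflow."""
    if not files:
        raise ValueError("choose at least one file to add to the staging area")
    if not message.strip():
        raise ValueError("write a helpful commit message")
    return [
        ["git", "add", "--", *files],      # flag files for the next snapshot
        ["git", "commit", "-m", message],  # snapshot, in the local clone only
        ["git", "push"],                   # send the commits to the remote
    ]


def run_all(commands, repo_dir):
    """Run each command inside the cloned repository, stopping on failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


commands = stage_commit_push_commands(
    ["eda.ipynb"], "Add exploratory data analysis notebook"
)
for command in commands:
    print(" ".join(command))
```

Keeping the push as a separate, final command mirrors the point made in the list above: commits stay in your cloned repository until you push them.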
5.7.1 Specifying files to commit
Below we created and saved a new file (named
eda.ipynb) that we would
like to send back to the project repository on GitHub.
To “add” this modified file to the staging area (i.e., flag that this is a
file whose changes we would like to commit), we click the Jupyter Git extension
icon on the far left-hand side of JupyterLab:
This opens the Jupyter Git graphical user interface pane, and then we click the plus sign beside the file that we want to “add”.
Note: because this is the first change for this file that we want to add, it falls under the “Untracked” heading. However, next time we edit this file and want to add the changes we made, we will find it under the “Changed” heading.
Note: do not add the eda-checkpoint.ipynb file (sometimes shown as .ipynb_checkpoints, the folder where Jupyter stores it). This file is automatically created by Jupyter when you work on eda.ipynb. You generally do not add auto-generated files to Git repositories; only add the files you directly create and edit.
This moves the file from the “Untracked” heading to the “Staged” heading, flagging this file so that Git knows we want a snapshot of its current state as a commit. Now we are ready to “commit” the changes. Make sure to include a (clear and helpful!) message about what was changed so that your collaborators (and future you) know what happened in this commit.
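The advice about not adding checkpoint files can be expressed as a small filter. This Python sketch assumes that Jupyter keeps its checkpoints in a folder named .ipynb_checkpoints; the function name and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def files_to_stage(paths):
    """Drop anything stored inside a .ipynb_checkpoints folder."""
    keep = []
    for path in paths:
        if ".ipynb_checkpoints" in PurePosixPath(path).parts:
            continue  # auto-generated by Jupyter, so do not add it
        keep.append(path)
    return keep


changed = [
    "eda.ipynb",
    ".ipynb_checkpoints/eda-checkpoint.ipynb",
    "README.md",
]
print(files_to_stage(changed))
```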
5.7.2 Making the commit
To snapshot the changes with an associated commit message, we put the message in the text box at the bottom of the Git pane and click on the blue “Commit” button. It is highly recommended to write useful and meaningful messages about what was changed. These commit messages, and the datetime stamp for a given commit, are the primary means to navigate through the project’s history in the event that we need to view or retrieve a past version of a file, or revert our project to an earlier state.
When you click the “Commit” button for the first time, you will be prompted to enter your name and email. This only needs to be done once for each machine you use Git on.
After committing the file(s), you will see that there are 0 “Staged” files, and we are now ready to push our changes (and the attached commit message) to our project repository on GitHub:
5.7.3 Pushing the commits to GitHub
To send the committed changes back to the project repository on GitHub, we need to push them. To do this we click on the cloud icon with the up arrow on the Jupyter Git tab:
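We will then be prompted to enter our GitHub username and password, and click the blue “OK” button. Note that GitHub no longer accepts account passwords for Git operations over HTTPS, so enter a personal access token in the password field instead: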
If the files were successfully pushed to our project repository on GitHub we will be given the success message shown below. Click “Dismiss” to continue working in Jupyter.
If you now visit the project repository on GitHub, you will see that the changes exist there!
5.8.1 Giving collaborators access to your project
As mentioned earlier, GitHub allows you to control who has access to your project. By default, for both public and private repositories, only the person who created the GitHub repository has permission to create, edit, and delete files (write access). To give your collaborators write access to the project, navigate to the “Settings” tab:
Then click “Manage access”:
Click the green “Invite a collaborator” button:
Type in the collaborator’s GitHub username and select their name when it appears:
Finally, click the green “Add
After this you should see your newly added collaborator listed under the “Manage access” tab. They should receive an email invitation to join the GitHub repository as a collaborator. They need to accept this invitation to enable write access.
5.8.2 Pulling changes from GitHub
If your collaborators send their own commits to the GitHub repository, you will need to pull those changes to your own cloned copy on JupyterHub before you’re allowed to push any more changes yourself. By pulling their changes, you sync your local repository to what is present on GitHub.
Note: you can still work on your own cloned repository and commit changes even if collaborators have pushed changes to the GitHub repository. It is only when you try to push your changes back to GitHub that Git will make sure nobody else has pushed any work in the meantime.
You can do this using the Jupyter Git tab by clicking on the cloud icon with the down arrow:
Once the files are successfully pulled from GitHub, you need to click “Dismiss” to keep working:
And then when you open (or refresh) the files whose changes you just pulled, you should be able to see them:
It can be very useful to review the history of the changes to your project. You can do this directly on the JupyterHub by clicking “History” in the Git tab.
It is good practice to pull any changes at the start of every work session before you start working on your local copy. If you do not do this, and your collaborators have pushed some changes to the project to GitHub, then you will be unable to push your changes to GitHub until you pull. This situation can be recognized by this error message:
Usually, getting out of this situation is not too troublesome. First you need to pull the changes that exist on GitHub that you do not yet have on your machine. Usually when this happens, Git can automatically merge the changes for you, even if you and your collaborators were working on different parts of the same file!
If, however, you and your collaborators made changes to the same line of the same file, Git will not be able to automatically merge the changes, because it will not know whether to keep your version of the line(s), your collaborator’s version of the line(s), or some blend of the two. When this happens, Git will tell you that you have a merge conflict that needs human intervention (you!), and which file(s) it occurs in.
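The rule for when a push is rejected can be pictured with a toy model in Python. This is a deliberate simplification and not how Git is implemented: each history is a plain list of made-up commit labels, and a push is accepted only if your copy already contains the newest commit on GitHub.

```python
def can_push(local_history, remote_history):
    """Toy check: the local copy must already contain the newest remote commit."""
    if not remote_history:
        return True  # nothing on the remote yet, so nothing can be missing
    return remote_history[-1] in local_history


shared = ["a1", "b2"]          # commits both copies had after the last pull
mine = shared + ["c3"]         # you committed locally
remote = shared + ["d4"]       # a collaborator pushed in the meantime

print(can_push(mine, shared))  # nobody else pushed, so the push is accepted
print(can_push(mine, remote))  # commit d4 is missing locally, so pull first
```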
5.8.3 Handling merge conflicts
To fix the merge conflict we need to open the file that had the merge conflict in a plain text editor and look for special marks that Git puts in the file to tell you where the merge conflict occurred.
The beginning of the merge conflict is marked by <<<<<<< HEAD and the end of the merge conflict is marked by >>>>>>>. Between these markings, Git also inserts a separator (=======). The version of the change before the separator is your change, and the version that follows the separator is the change that existed on GitHub.
Once you have decided which version of the change (or what combination!) to keep, you need to use the plain text editor to remove the special marks that Git added.
The file must be saved, added to the staging area, and then committed before you will be able to push your changes to GitHub.
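To make the marker layout concrete, here is a Python sketch that pulls the two competing versions out of a conflicted file. The example text, the label after the closing marker, and the function name are invented for illustration; in practice you would still resolve the conflict by editing the file yourself.

```python
def split_conflicts(text):
    """Return a list of (your_version, github_version) pairs found in `text`."""
    conflicts = []
    yours, theirs = [], []
    state = None  # None = outside a conflict, otherwise "yours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            state, yours, theirs = "yours", [], []
        elif line.startswith("=======") and state == "yours":
            state = "theirs"
        elif line.startswith(">>>>>>>") and state == "theirs":
            conflicts.append(("\n".join(yours), "\n".join(theirs)))
            state = None
        elif state == "yours":
            yours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts


# Invented example of a conflicted README.md; "1a2b3c4" is a made-up label.
example = """\
# canadian_languages
<<<<<<< HEAD
An analysis of languages spoken in Canada.
=======
Exploring Canadian language data.
>>>>>>> 1a2b3c4
"""

for yours, theirs in split_conflicts(example):
    print("Your change:      ", yours)
    print("Change on GitHub: ", theirs)
```

Once every marker line has been removed and only the text you want to keep remains, `split_conflicts` returns an empty list for the file.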
5.8.4 Communicating using GitHub issues
When working on a project in a team, you don’t just want a historical record of who changed what file and when in the project—you also want a record of decisions that were made, ideas that were floated, problems that were identified and addressed, and all other communication surrounding the project. Email and messaging apps are both very popular for general communication, but are not designed for project-specific communication: they both generally do not have facilities for organizing conversations by project subtopics, searching for conversations related to particular bugs or software versions, etc.
GitHub issues are an alternative written communication medium to email and messaging apps, and were designed specifically to facilitate project-specific communication. Issues are opened from the “Issues” tab on the project’s GitHub page, and they persist there even after the conversation is over and the issue is closed (in contrast to email, issues are not usually deleted). One issue thread is usually created per topic, and they are easily searchable using GitHub’s search tools. All issues are accessible to all project collaborators, so no one is left out of the conversation. Finally, issues can be set up so that team members get email notifications when a new issue is created or a new post is made in an issue thread. Replying to issues from email is also possible. Given all of these advantages, we highly recommend the use of issues for project-related communication.
To open a GitHub issue, first click on the “Issues” tab:
Next click the “New issue” button:
Add an issue title (which acts like an email subject line), and then put the body of the message in the larger text box. Finally click “Submit new issue” to post the issue to share with others:
You can reply to an issue that someone opened by adding your written response to the large text box and clicking “Comment”:
When a conversation is resolved, you can click “Close issue”. The closed issue can be later viewed by clicking the “Closed” header link in the “Issues” tab:
5.9 Additional resources
Now that you’ve picked up the basics of version control with Git and GitHub, you can expand your knowledge using one of the many tutorials available online:
5.9.1 Best practices and workflows
- Good enough practices in scientific computing (2014) by Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt & Tracy K. Teal
- Configuration Management for Large-Scale Scientific Computing at the UK Met Office (2008) by David Matthews, Greg Wilson & Steve Easterbrook
Matthews, David, Greg Wilson, and Steve Easterbrook. 2008. “Configuration Management for Large-Scale Scientific Computing at the UK Met Office.” Computing in Science & Engineering 10 (6): 56–64.
Wilson, Greg, Dhavide A Aruliah, C Titus Brown, Neil P Chue Hong, Matt Davis, Richard T Guy, Steven HD Haddock, et al. 2014. “Best Practices for Scientific Computing.” PLoS Biol 12 (1): e1001745.