Milestone 2

Overall project summary

In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.

Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook to a fully reproducible and robust data analysis project, composed of:

An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor

Milestone 2 summary

In this milestone, you will:

  1. Upgrade your project’s computational environment to a container.

  2. Abstract more code from your literate code document (*.ipynb or *.Rmd) to scripts (e.g., .R or .py). You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.

  3. Convert your *.ipynb or *.Rmd files into a Quarto document (*.qmd). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report. The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report).

  4. Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all. This script should run the others in sequence, hard coding in the appropriate arguments.

  5. Continue to manage issues professionally.

Milestone 2 specifics

1. Upgrade your project’s computational environment to a container.

Write a Dockerfile to create a custom container for the computational environment for your project. Build your container using GitHub Actions, and publish your container image on DockerHub. Once this is done, shift the development of your project from working in a virtual environment to working in a container!

The Dockerfile is the file used to specify and create the Docker image from which containers can be run to create a reproducible computational environment for your analysis. For this project, we recommend using a base Docker image that already has most of the software dependencies needed for your analysis. Examples of these include the Jupyter core team Docker images (documentation) and the Rocker team Docker images (documentation). When you add other software dependencies to this Dockerfile, ensure that you pin the version of the software that you add.

Note: this file should live in the root of your public GitHub.com repository.
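As a starting point, a minimal Dockerfile built on a Jupyter core team base image might look like the sketch below. The base image tag and the package names and versions are examples only; pin whatever versions your own analysis actually uses.

```dockerfile
# Example base image from the Jupyter core team; pin a specific tag,
# not "latest", so the environment is reproducible.
FROM quay.io/jupyter/minimal-notebook:2024-06-17

# Pin the version of every additional dependency you add.
RUN conda install --yes \
    pandas=2.2.2 \
    scikit-learn=1.5.0 \
    click=8.1.7
```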

In this milestone, we expect you to add a GitHub Actions workflow that automatically builds the image, pushes it to DockerHub, and versions both the image and the GitHub repo (using the corresponding GitHub SHA) when changes are pushed to the Dockerfile.

You will need to add your DockerHub username and password (naming them DOCKER_USERNAME and DOCKER_PASSWORD, respectively) as GitHub secrets to this repository for this to work. This part is similar to Individual Assignment 2.
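A workflow along these lines is one way to set this up. This is a sketch, not a drop-in file: the workflow name, trigger paths, and image name are placeholders you should adapt to your repository.

```yaml
# .github/workflows/publish-image.yml (sketch; adapt names and tags)
name: Publish Docker image
on:
  push:
    paths:
      - "Dockerfile"
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          # Tag the image with the commit SHA so image and repo versions match.
          tags: your-dockerhub-username/your-image-name:${{ github.sha }}
```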

Additionally, document how to use your container image in your README. This is important to make it easy for you and your team to shift to a container solution for your computational environment. We highly recommend using Docker Compose so that launching your containers is as frictionless as possible (which makes you more likely to use this tool in your workflow)!
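A minimal `docker-compose.yml` for a Jupyter-based image could look like the sketch below; the image name, port, and mount path are assumptions you should replace with your own (the `/home/jovyan/work` path is the convention used by the Jupyter core team images).

```yaml
# docker-compose.yml (sketch; replace the image name with your own)
services:
  analysis-env:
    image: your-dockerhub-username/your-image-name:latest
    ports:
      - "8888:8888"
    volumes:
      - .:/home/jovyan/work
```

With this file in place, `docker compose up` is all anyone on the team needs to type to launch the environment.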

2. Abstract more code from your literate code document (*.ipynb, *.Rmd, or .qmd) to scripts (e.g., .R or .py).

This code need not be converted to a function, but can simply be files that call the functions needed to run your analysis. You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.

The output of the first script must be the input of the second, and so on. All scripts should have command line arguments and we expect you to use either the docopt R package or the click Python package for parsing command line arguments.
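For a Python script, the click package handles this parsing. The sketch below shows the general shape; the script name, option names, and what the script actually does are hypothetical placeholders for one step of your pipeline.

```python
# read_data.py (hypothetical first script in the pipeline)
# Reads the raw data file and writes it out as the input for the next script.
import csv

import click


@click.command()
@click.option("--input-path", type=str, required=True,
              help="Path to the raw data file")
@click.option("--output-path", type=str, required=True,
              help="Path where the processed CSV will be written")
def main(input_path, output_path):
    """Read the raw data and write it out for the next script."""
    with open(input_path, newline="") as f:
        rows = list(csv.reader(f))
    # (Any cleaning/filtering for this step would go here.)
    with open(output_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    main()
```

Each subsequent script would follow the same pattern, taking the previous script's `--output-path` as its `--input-path`.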

The scripts could be organized something like this:

3. Convert your *.ipynb or *.Rmd files into a Quarto document (*.qmd). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report.

The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report). You should do all the things you did for the report in individual assignment 4, including:
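Quarto can hide all code globally from the document's YAML front matter, which is the simplest way to meet the "no visible code" requirement. A sketch (the title is a placeholder):

```yaml
---
title: "My Analysis Report"   # placeholder title
format:
  html:
    embed-resources: true
execute:
  echo: false     # hide all source code in the rendered report
  warning: false  # suppress warnings so only narrative and artifacts show
---
```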

4. Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all

This script should run the others in sequence, hard coding in the appropriate arguments. This script should:

Tip:
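A Makefile for the hypothetical scripts sketched earlier might look like this. The script names, file paths, and arguments are illustrative only; the structure to copy is that each target's output is declared with the files it depends on, so `make all` runs the pipeline in order and `make clean` resets it.

```make
# Makefile (sketch with hypothetical script and file names)
all: report/analysis_report.html

# Step 1: read and process the raw data
data/processed.csv: scripts/read_data.py data/raw.csv
	python scripts/read_data.py --input-path=data/raw.csv --output-path=data/processed.csv

# Step 2: create a figure from the processed data
results/figure.png: scripts/make_plot.py data/processed.csv
	python scripts/make_plot.py --input-path=data/processed.csv --output-path=results/figure.png

# Step 3: render the Quarto report, which displays the figure
report/analysis_report.html: report/analysis_report.qmd results/figure.png
	quarto render report/analysis_report.qmd

clean:
	rm -f data/processed.csv results/figure.png report/analysis_report.html
```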

5. Continue to manage issues professionally

Continue managing issues effectively through project boards and milestones, making it clear who is responsible for what and which project milestone each task is associated with. In particular, create an issue for each task and/or sub-task needed for this milestone. Each of these issues must be assigned to a single person on the team. We want all of you to get coding experience in the project, and each team member should be responsible for an approximately equal portion of the code.

Submission Instructions

You will submit three URLs to Canvas in the provided text box for milestone 2:

  1. the URL of your project’s GitHub.com repository
  2. the URL of a GitHub release of your project’s GitHub.com repository for this milestone.
  3. The URL of your DockerHub image that can be pulled and used to run your analysis following your README.md instructions.

Creating a release on GitHub.com

Just before you submit milestone 2, create a release on your project repository on GitHub and name it something like 1.0.0 (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes, while you continue to work on your project for the next milestone.

Expectations