In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook to a fully reproducible and robust data analysis project, comprising data analysis scripts, a reproducible report, a driver script, and a containerized computational environment.
An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor
In this milestone, you will:
- Upgrade your project’s computational environment to a container.
- Abstract more code from your literate code document (`*.ipynb` or `*.Rmd`) to scripts (e.g., `.R` or `.py`). You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.
- Convert your `*.ipynb` or `*.Rmd` files into a Quarto document (`*.qmd`). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report. The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report).
- Write another script, a `Makefile` (literally called `Makefile`), to act as a driver script to rule them all. This script should run the others in sequence, hard-coding in the appropriate arguments.
- Continue to manage issues professionally.
- Write a `Dockerfile` to create a custom container for the computational environment for your project. Build your container using GitHub Actions, and publish your container image on DockerHub. Once this is done, shift the development of your project from working in a virtual environment to working in a container!
The `Dockerfile` is the file used to specify and create the Docker image from which containers can be run to create a reproducible computational environment for your analysis. For this project, we recommend using a base Docker image that already has most of the software dependencies needed for your analysis. Examples of these include the Jupyter core team Docker images (documentation) and the Rocker team Docker images (documentation). When you add other software dependencies to this `Dockerfile`, ensure that you pin the version of the software that you add. Note: the `Dockerfile` should live in the root of your public GitHub.com repository.
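For example, a minimal `Dockerfile` building on a Jupyter base image might look like the sketch below; the base image tag and package versions are illustrative assumptions, and you should pin whatever your own analysis actually depends on:

```dockerfile
# A sketch only -- the base image tag and package versions below are
# illustrative assumptions; substitute and pin your own dependencies.
FROM quay.io/jupyter/minimal-notebook:2024-01-15

# Pin the version of every additional dependency you add.
RUN conda install -y --channel conda-forge \
    pandas=2.1.4 \
    scikit-learn=1.3.2 \
    click=8.1.7
```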
In this milestone, we expect you to add a GitHub Actions workflow that automatically builds the image, pushes it to DockerHub, and versions the image and the GitHub repo (using the corresponding GitHub SHA) whenever changes are pushed to the `Dockerfile`. You will need to add your DockerHub username and password (naming them `DOCKER_USERNAME` and `DOCKER_PASSWORD`, respectively) as GitHub secrets to this repository for this to work. This part is similar to Individual Assignment 2.
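A sketch of such a workflow is shown below; the file path, image name, and action versions are assumptions to adapt to your project:

```yaml
# .github/workflows/docker-publish.yml -- a sketch; the image name and
# action versions are illustrative assumptions.
name: Publish Docker image

on:
  push:
    paths:
      - Dockerfile

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # Tag with the commit SHA so each image maps to a repository state.
          tags: your-dockerhub-username/your-project:${{ github.sha }}
```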
Additionally, document how to use your container image in your `README`. This is important to make it easy for you and your team to shift to a container solution for your computational environment. We highly recommend using Docker Compose so that launching your containers is as frictionless as possible (which makes you more likely to use this tool in your workflow)!
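For instance, a `docker-compose.yml` along these lines lets the whole team launch the environment with a single command; the image name is a placeholder, and the port and mount point assume a Jupyter-based image:

```yaml
# docker-compose.yml -- a sketch; replace the image name with your
# team's published DockerHub image.
services:
  analysis-env:
    image: your-dockerhub-username/your-project:latest
    ports:
      - "8888:8888"            # expose Jupyter Lab on localhost:8888
    volumes:
      - .:/home/jovyan/work    # mount the project repository into the container
```

With this file in the project root, `docker compose up` launches the environment and `docker compose down` tears it down.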
Abstract more code from your literate code document (`*.ipynb`, `*.Rmd`, or `*.qmd`) to scripts (e.g., `.R` or `.py`). This code need not be converted to functions, but can simply be files that call the functions needed to run your analysis. You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.
The output of the first script must be the input of the second, and so on. All scripts should have command line arguments, and we expect you to use either the docopt R package or the click Python package for parsing command line arguments. The scripts could be organized something like the list below (a minimal `click` skeleton follows it):
- A first script that downloads the data from the internet and saves it locally. This should take at least two arguments: the URL from which to download the data, and a local file path where the data should be written (e.g., `data/raw/data.csv`). Note: choose more descriptive filenames than the ones used above; these are general, for illustrative purposes.
- A second script that reads the data from the first script and performs any data cleaning/pre-processing, transforming, and/or partitioning that needs to happen before exploratory data analysis or modeling takes place. This should take at least two arguments: a path to the raw data, and a path where the processed data should be written.
- A third script that creates exploratory data visualization(s) and table(s) that are useful to help the reader/consumer understand the data set. These analysis artifacts should be written to files. This should take at least two arguments: a path to the processed data, and a path to the directory where the figures and tables should be saved.
- A fourth script that reads the data from the second script, performs the modeling, and summarizes the results as figure(s) and table(s). These analysis artifacts should be written to files. This should take at least two arguments: a path to the processed data, and a path to the directory where the results should be saved.
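As a concrete illustration, the first script might look like the sketch below; the option names and the use of pandas are assumptions, not requirements:

```python
# download_data.py -- a sketch of the first script, using click for
# command line argument parsing; option names are illustrative.
import click
import pandas as pd


@click.command()
@click.option("--url", required=True, help="URL to download the raw data from")
@click.option("--out-file", required=True, help="Local file path to write the data to")
def main(url, out_file):
    """Download data from the web and save it locally."""
    data = pd.read_csv(url)             # read the raw data directly from the URL
    data.to_csv(out_file, index=False)  # save it to the requested local path


if __name__ == "__main__":
    main()
```

This would then be run as, e.g., `python download_data.py --url=https://example.com/data.csv --out-file=data/raw/data.csv`.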
Convert your `*.ipynb` or `*.Rmd` files into a Quarto document (`*.qmd`). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report. The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report). You should do all the things you did for the report in individual assignment 4.
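One way to keep code out of the rendered report is to set execute options in the Quarto document's YAML header, as in this sketch (the title and output format are placeholders):

```yaml
---
title: "Your Report Title"
format: html
execute:
  echo: false     # hide all code in the rendered report
  warning: false  # suppress warnings so only narration and artifacts appear
---
```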
Write another script, a `Makefile` (literally called `Makefile`), to act as a driver script to rule them all. This script should run the others in sequence, hard-coding in the appropriate arguments. This script should:

- be well documented (using the `README` and comments inside the `Makefile` to explain what it does and how to use it)
- have an `all` target so that you can easily run the entire analysis from top to bottom by running `make all` at the command line
- have a `clean` target so that you can easily “undo” your analysis by running `make clean` at the command line (e.g., deleting all generated data and files)

Tip: the `all` target can be a `.PHONY` target. You can also add intermediate targets such as `data`, `analysis`, `figures`, `pdf`, etc., so your build process only runs what’s necessary during development. A sketch of such a `Makefile` follows below.
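In the sketch, the script names, file paths, and arguments are illustrative assumptions rather than a prescribed project layout:

```makefile
# Makefile -- a sketch of a driver script; script names, paths, and
# arguments are illustrative. Recipe lines must be indented with a tab.
.PHONY: all clean

all: report/report.html

data/raw/data.csv: scripts/download_data.py
	python scripts/download_data.py --url=https://example.com/data.csv --out-file=data/raw/data.csv

data/processed/data.csv: scripts/clean_data.py data/raw/data.csv
	python scripts/clean_data.py --in-file=data/raw/data.csv --out-file=data/processed/data.csv

results/eda_plot.png: scripts/eda.py data/processed/data.csv
	python scripts/eda.py --in-file=data/processed/data.csv --out-dir=results

results/model_results.csv: scripts/fit_model.py data/processed/data.csv
	python scripts/fit_model.py --in-file=data/processed/data.csv --out-dir=results

report/report.html: report/report.qmd results/eda_plot.png results/model_results.csv
	quarto render report/report.qmd

clean:
	rm -f data/raw/data.csv data/processed/data.csv results/eda_plot.png results/model_results.csv report/report.html
```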
Continue managing issues effectively through project boards and milestones: make it clear who is responsible for what, and which project milestone each task is associated with. In particular, create an issue for each task and/or sub-task needed for this milestone. Each of these issues must be assigned to a single person on the team. We want all of you to get coding experience in the project, and each team member should be responsible for an approximately equal portion of the code.
You will submit three URLs to Canvas in the provided text box for milestone 2:
`README.md` instructions.

Just before you submit milestone 2, create a release on your project repository on GitHub and name it something like `1.0.0` (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes, while you continue to work on your project for the next milestone.
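If you prefer the command line to the GitHub web interface, the GitHub CLI can create the release too (assuming `gh` is installed and authenticated):

```bash
# Create a release named 1.0.0 for the current repository.
gh release create 1.0.0 --title "Milestone 2" --notes "Submission for milestone 2"
```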
- Everyone should contribute equally to all aspects of the project (e.g., code, writing, project management). This should be evidenced by a roughly equal number of commits, pull request reviews, and participation in communication via GitHub issues.
- Each group member should work in a GitHub flow workflow: create a branch for each feature or fix, and have it reviewed and critiqued by at least one other teammate before the pull request is accepted.
- You should be committing to Git and pushing to GitHub.com every time you work on this project.
- Git commit messages should be meaningful. These will be marked. It’s OK if one or two are less meaningful, but most should be.
- Use GitHub issues to communicate with your teammates (as opposed to email or Slack).
- Your question, analysis, and visualization should make sense. It doesn’t have to be complicated.
- Your analysis should be correct, and it should run reproducibly given the instructions provided in the README.
- You should use proper grammar and full sentences in your README. Point form may occur, but it should make up less than 10% of your written documents.
- R code should follow the tidyverse style guide, and Python code should follow the black style guide.
- You should not have extraneous files in your repository; files that should be ignored belong in your `.gitignore`.