In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook to a fully reproducible and robust data analysis project, comprising data analysis scripts, a reproducible report, a driver script, and a containerized computational environment.
An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor
In this milestone, you will:
- Upgrade your project’s computational environment to a container.
- Abstract more code from your literate code document (`*.ipynb` or `*.Rmd`) to scripts (e.g., `.R` or `.py`). You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.
- Convert your `*.ipynb` or `*.Rmd` files into a Quarto document (`*.qmd`). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report. The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report).
- Write another script, a `Makefile` (literally called `Makefile`), to act as a driver script to rule them all. This script should run the others in sequence, hard-coding in the appropriate arguments.
- Continue to manage issues professionally.
- Write a `Dockerfile` to create a custom container for the computational environment for your project. Build your container using GitHub Actions, and publish your container image on DockerHub. Once this is done, shift the development of your project from working in a virtual environment to working in a container!
The `Dockerfile` is the file used to specify and create the Docker image from which containers can be run to create a reproducible computational environment for your analysis. For this project, we recommend using a base Docker image that already has most of the software dependencies needed for your analysis. Examples of these include the Jupyter core team Docker images (documentation) and the Rocker team Docker images (documentation). When you add other software dependencies to this `Dockerfile`, ensure that you pin the version of the software that you add. Note: the `Dockerfile` should live in the root of your public GitHub.com repository.
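For example, a minimal `Dockerfile` building on a Jupyter base image might look like the sketch below; the base image tag and package versions are illustrative assumptions, and you should pin whatever your own analysis actually depends on:

```dockerfile
# A sketch only -- the base image tag and package versions below are
# illustrative assumptions; substitute and pin your own dependencies.
FROM quay.io/jupyter/minimal-notebook:2024-01-15

# Pin the version of every additional dependency you add.
RUN conda install -y --channel conda-forge \
    pandas=2.1.4 \
    scikit-learn=1.3.2 \
    click=8.1.7
```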
In this milestone, we expect you to add a GitHub Actions workflow that automatically builds the image, pushes it to DockerHub, and versions the image and the GitHub repo (using the corresponding GitHub SHA) whenever changes are pushed to the `Dockerfile`. You will need to add your DockerHub username and password (naming them `DOCKER_USERNAME` and `DOCKER_PASSWORD`, respectively) as GitHub secrets to this repository for this to work. This part is similar to Individual Assignment 2.
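A sketch of such a workflow is shown below; the file path, image name, and action versions are assumptions to adapt to your project:

```yaml
# .github/workflows/docker-publish.yml -- a sketch; the image name and
# action versions are illustrative assumptions.
name: Publish Docker image

on:
  push:
    paths:
      - Dockerfile

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # Tag with the commit SHA so each image maps to a repository state.
          tags: your-dockerhub-username/your-project:${{ github.sha }}
```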
Additionally, document how to use your container image in your `README`. This is important to make it easy for you and your team to shift to a container solution for your computational environment. We highly recommend using Docker Compose so that launching your containers is as frictionless as possible (which makes you more likely to use this tool in your workflow)!
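For instance, a `docker-compose.yml` along these lines lets the whole team launch the environment with a single command; the image name is a placeholder, and the port and mount point assume a Jupyter-based image:

```yaml
# docker-compose.yml -- a sketch; replace the image name with your
# team's published DockerHub image.
services:
  analysis-env:
    image: your-dockerhub-username/your-project:latest
    ports:
      - "8888:8888"            # expose Jupyter Lab on localhost:8888
    volumes:
      - .:/home/jovyan/work    # mount the project repository into the container
```

With this file in the project root, `docker compose up` launches the environment and `docker compose down` tears it down.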
Abstract more code from your literate code document (`*.ipynb`, `*.Rmd`, or `*.qmd`) to scripts (e.g., `.R` or `.py`). This code need not be converted to functions, but can simply be files that call the functions needed to run your analysis. You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.
The output of the first script must be the input of the second, and so on. All scripts should have command line arguments, and we expect you to use either the docopt R package or the click Python package for parsing command line arguments. The scripts could be organized something like the list below (a minimal `click` skeleton follows it):
- A first script that downloads the data from the internet and saves it locally. This should take at least two arguments: the URL from which to download the data, and a local file path where the data should be written (e.g., `data/raw/data.csv`). Note: choose more descriptive filenames than the ones used above; these are general, for illustrative purposes.
- A second script that reads the data from the first script and performs any data cleaning/pre-processing, transforming, and/or partitioning that needs to happen before exploratory data analysis or modeling takes place. This should take at least two arguments: a path to the raw data, and a path where the processed data should be written.
- A third script that creates exploratory data visualization(s) and table(s) that are useful to help the reader/consumer understand the data set. These analysis artifacts should be written to files. This should take at least two arguments: a path to the processed data, and a path to the directory where the figures and tables should be saved.
- A fourth script that reads the data from the second script, performs the modeling, and summarizes the results as figure(s) and table(s). These analysis artifacts should be written to files. This should take at least two arguments: a path to the processed data, and a path to the directory where the results should be saved.
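As a concrete illustration, the first script might look like the sketch below; the option names and the use of pandas are assumptions, not requirements:

```python
# download_data.py -- a sketch of the first script, using click for
# command line argument parsing; option names are illustrative.
import click
import pandas as pd


@click.command()
@click.option("--url", required=True, help="URL to download the raw data from")
@click.option("--out-file", required=True, help="Local file path to write the data to")
def main(url, out_file):
    """Download data from the web and save it locally."""
    data = pd.read_csv(url)             # read the raw data directly from the URL
    data.to_csv(out_file, index=False)  # save it to the requested local path


if __name__ == "__main__":
    main()
```

This would then be run as, e.g., `python download_data.py --url=https://example.com/data.csv --out-file=data/raw/data.csv`.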
Convert your `*.ipynb` or `*.Rmd` files into a Quarto document (`*.qmd`). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report. The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report). You should do all the things you did for the report in individual assignment 4.
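One way to keep code out of the rendered report is to set execute options in the Quarto document's YAML header, as in this sketch (the title and output format are placeholders):

```yaml
---
title: "Your Report Title"
format: html
execute:
  echo: false     # hide all code in the rendered report
  warning: false  # suppress warnings so only narration and artifacts appear
---
```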
Write another script, a `Makefile` (literally called `Makefile`), to act as a driver script to rule them all. This script should run the others in sequence, hard-coding in the appropriate arguments. This script should:

- be well documented (using the `README` and comments inside the `Makefile` to explain what it does and how to use it)
- have an `all` target so that you can easily run the entire analysis from top to bottom by running `make all` at the command line
- have a `clean` target so that you can easily “undo” your analysis by running `make clean` at the command line (e.g., deleting all generated data and files)

Tip: the `all` target can be a `.PHONY` target. You can also add intermediate targets such as `data`, `analysis`, `figures`, `pdf`, etc., so your build process only runs what’s necessary during development. A sketch of such a `Makefile` follows below.
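In the sketch, the script names, file paths, and arguments are illustrative assumptions rather than a prescribed project layout:

```makefile
# Makefile -- a sketch of a driver script; script names, paths, and
# arguments are illustrative. Recipe lines must be indented with a tab.
.PHONY: all clean

all: report/report.html

data/raw/data.csv: scripts/download_data.py
	python scripts/download_data.py --url=https://example.com/data.csv --out-file=data/raw/data.csv

data/processed/data.csv: scripts/clean_data.py data/raw/data.csv
	python scripts/clean_data.py --in-file=data/raw/data.csv --out-file=data/processed/data.csv

results/eda_plot.png: scripts/eda.py data/processed/data.csv
	python scripts/eda.py --in-file=data/processed/data.csv --out-dir=results

results/model_results.csv: scripts/fit_model.py data/processed/data.csv
	python scripts/fit_model.py --in-file=data/processed/data.csv --out-dir=results

report/report.html: report/report.qmd results/eda_plot.png results/model_results.csv
	quarto render report/report.qmd

clean:
	rm -f data/raw/data.csv data/processed/data.csv results/eda_plot.png results/model_results.csv report/report.html
```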
Continue managing issues effectively through project boards and milestones: make it clear who is responsible for what, and which project milestone each task is associated with. In particular, create an issue for each task and/or sub-task needed for this milestone. Each of these issues must be assigned to a single person on the team. We want all of you to get coding experience in the project, and each team member should be responsible for an approximately equal portion of the code.
You will submit three URLs to Canvas in the provided text box for milestone 2:
`README.md` instructions.

Just before you submit milestone 2, create a release on your project repository on GitHub and name it something like `1.0.0` (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes, while you continue to work on your project for the next milestone.
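If you prefer the command line to the GitHub web interface, the GitHub CLI can create the release too (assuming `gh` is installed and authenticated):

```bash
# Create a release named 1.0.0 for the current repository.
gh release create 1.0.0 --title "Milestone 2" --notes "Submission for milestone 2"
```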
- Everyone should contribute equally to all aspects of the project (e.g., code, writing, project management). This should be evidenced by a roughly equal number of commits, pull request reviews, and participation in communication via GitHub issues.
- Each group member should work in a GitHub flow workflow: create a branch for each feature or fix, and have it reviewed and critiqued by at least one other teammate before the pull request is accepted.
- You should be committing to Git and pushing to GitHub.com every time you work on this project.
- Git commit messages should be meaningful. These will be marked. It’s OK if one or two are less meaningful, but most should be.
- Use GitHub issues to communicate with your teammates (as opposed to email or Slack).
- Your question, analysis, and visualization should make sense. It doesn’t have to be complicated.
- Your analysis should be correct, and it should run reproducibly given the instructions provided in the README.
- You should use proper grammar and full sentences in your README. Point form may occur, but it should make up less than 10% of your written documents.
- R code should follow the tidyverse style guide, and Python code should follow the black style guide.
- You should not have extraneous files in your repository; files that should be ignored belong in your `.gitignore`.