Milestone 4
Abstract project functions to their own software package, and address any feedback received in earlier milestones.
Overall project summary
In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook to a fully reproducible and robust data analysis project, composed of:
- a well documented and modularized software package and scripts written in R and/or Python,
- a data analysis pipeline automated with GNU Make,
- a reproducible report powered by Quarto,
- a containerized computational environment created and made shareable by Docker, and
- remote version control repositories on GitHub for project collaboration and sharing, as well as automation of test suite execution, documentation deployment, and software deployment.
An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor
Milestone 4 summary
In this final project milestone, you will:
- Abstract project functions to their own software package (this means the code for your project will be split across two repositories)
- Add data validation checks to the analysis pipeline
- Address any feedback received in earlier milestones or during the peer review
Final project specifics
Abstract project functions to their own software package
To further refine your project, you are going to split the code for your analysis across two version control repositories. One will hold your analysis; the other will be a separate repository that serves as a place to package up the functions you wrote in milestone 3. Decoupling these two code bases will allow those functions to be more easily reused in future projects by you and by others who might find them useful!
Your new version control repository for your packaged functions should:
- be given a name relevant to the functions it contains (just like the data science packages you know and love are)
- be in the DSCI-310-2025 organization: https://github.com/DSCI-310-2025
- contain at least 3 or 4 useful functions that handle errors gracefully (you already wrote these; you now need to copy them from your analysis repository to the software package repository; see the illustrative sketch after this list)
- be well documented (use the three levels of package documentation we learned about in class: function/reference docs, vignette/tutorial docs, and high-level package website docs)
- have unit and integration tests (you already wrote these; you now need to copy them from your analysis repository to the software package repository)
- use GitHub Actions for continuous integration to run your tests, and continuous deployment to render your package documentation (if you are creating a Python package, you can even use continuous deployment to publish your package to PyPI!)
- have a code coverage badge
- have a README that describes your package at a high level, including where it sits in the package ecosystem for the language you chose to write it in (i.e., what other similar packages exist and how they differ)
- have a software licence that describes how you will permit others to use your work
- have a code of conduct document that describes behavioural expectations of team interactions
- have a contributions document that outlines how the team will work together (expected workflow practices)
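To make the requirements above concrete, here is a minimal, purely illustrative Python sketch of a packaged function that is documented, fails gracefully on bad input, and has a matching pytest unit test. The package name `mypkg`, the module names, and the function itself are placeholders, not required content; an R package would look analogous, with roxygen2 reference docs and testthat tests.

```python
# src/mypkg/clean.py  (hypothetical module in your package repository)
import pandas as pd


def drop_missing_rows(df, columns):
    """Drop rows with missing values in the given columns.

    Parameters
    ----------
    df : pandas.DataFrame
        The data to clean.
    columns : list of str
        Column names that must be non-missing.

    Returns
    -------
    pandas.DataFrame
        A copy of ``df`` with offending rows removed.

    Raises
    ------
    TypeError
        If ``df`` is not a DataFrame.
    KeyError
        If any name in ``columns`` is not a column of ``df``.
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    missing = set(columns) - set(df.columns)
    if missing:
        raise KeyError(f"columns not found in df: {sorted(missing)}")
    return df.dropna(subset=list(columns)).copy()


# tests/test_clean.py  (hypothetical matching unit tests)
import pandas as pd
import pytest

from mypkg.clean import drop_missing_rows


def test_raises_on_non_dataframe_input():
    # graceful failure: a clear exception rather than a cryptic downstream error
    with pytest.raises(TypeError):
        drop_missing_rows("not a dataframe", ["x"])


def test_drops_only_rows_with_missing_values():
    df = pd.DataFrame({"x": [1.0, None, 3.0], "y": ["a", "b", "c"]})
    assert len(drop_missing_rows(df, ["x"])) == 2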
You should update the analysis code in your data analysis repository so that:
- you load your R (or Python) package to access your functions, instead of sourcing or importing files in your analysis repository (a sketch of what this might look like follows this list)
- you remove the files that contain your functions (you no longer need these since you are loading them from your package), as well as the tests directory files that hold the tests and the helper data for those tests (again, these should now exist in your package repository)
- you update your project's computational environment so that it contains your R (or Python) package. Don't forget to include the version when doing this!
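As an illustration of the first two points (assuming a Python analysis and the hypothetical `mypkg` package sketched earlier), a script in the analysis repository would now import from the installed package instead of from files that live in the analysis repository; all names and paths below are placeholders.

```python
# scripts/clean_data.py  (hypothetical script in the analysis repository)

# Milestone 3: the function lived in this repository,
#   e.g. `from src.clean import drop_missing_rows`
# Milestone 4: the same function is imported from the installed package.
import pandas as pd

from mypkg.clean import drop_missing_rows

train_df = pd.read_csv("data/processed/train.csv")
clean_train = drop_missing_rows(train_df, ["mean_radius"])
clean_train.to_csv("data/processed/train_clean.csv", index=False)
```

For the third point, the package would then appear in the computational environment as a pinned dependency (for example, an entry such as `mypkg==0.1.0` in `environment.yml`, or the analogous pinned entry in `renv.lock`), so that rebuilding the container installs the exact version your analysis was run with.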
Add data validation checks to the analysis pipeline
To ensure that your data visualization and modeling results are correct, robust, and of high quality, we want you to add data validation checks to your data analysis. These checks should be written as code in your analysis pipeline. Ensure you cover the checks we recommend in our Data validation checklist. For this milestone you must do at least 8 of the checks listed.
Also be thoughtful about what you want to happen if/when your checks fail, and be sure that your code reflects that. Finally, when implementing your checks, be sure they do not create data leakage between the training and test set.
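If your analysis is in Python, one possible way to express several checklist items as code is a pandera schema that is run on the training split only, so the held-out test data is never examined; an R analysis could use a validation package such as pointblank in a similar way. Everything below, including the file path, column names, and thresholds, is a hypothetical sketch rather than a prescribed implementation.

```python
# scripts/validate_data.py  (hypothetical validation step in the pipeline)
import pandas as pd
import pandera as pa

# Column names, types, and thresholds are placeholders for your own data.
schema = pa.DataFrameSchema(
    columns={
        "diagnosis": pa.Column(str, pa.Check.isin(["Benign", "Malignant"])),
        "mean_radius": pa.Column(float, pa.Check.between(0, 50), nullable=False),
    },
    checks=[
        # dataframe-level checks: no duplicate rows, limited missingness
        pa.Check(lambda df: ~df.duplicated().any(), error="duplicate rows found"),
        pa.Check(lambda df: (df.isna().mean() <= 0.05).all(),
                 error="more than 5% missing values in at least one column"),
    ],
)


def validate_training_data(train_df: pd.DataFrame) -> pd.DataFrame:
    """Run all checks at once; lazy validation raises a single exception
    listing every failed check, halting the pipeline before modelling."""
    return schema.validate(train_df, lazy=True)
```

Deciding what should happen when a check fails is part of the design: in this sketch the pipeline stops with an error that lists every failure, but for some checks you might instead log a warning or drop offending rows, as long as your code makes that choice explicit.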
Address any feedback received in earlier milestones and during peer review
Revise your data analysis project to address feedback received from the DSCI 310 teaching team from past milestones, as well as feedback received from the peer review. 50% of your final grade for this milestone will be assessing whether you have addressed this feedback to improve your project.
To help us easily, and correctly, assess this, please create a GitHub issue in your analysis project’s GitHub repository with the title “Feedback addressed”. In this issue, describe any improvements you made to the project based on feedback, and point to evidence of these improvements by providing URLs that reference specific lines of code, commit messages, pull requests, etc. Be sure to add some narration when sharing these URLs so that it is easy for us to identify which changes to your work addressed which pieces of feedback.
Submission Instructions
Just before you submit milestone 4, create a release on your analysis project repository on GitHub and name it something like 3.0.0 (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes, even if you continue to make changes to your project afterwards.
Also create a release on your software package repository (do this manually if you are not using an automated tool for this in your continuous deployment).
You will submit a PDF to Gradescope for milestone 4 that includes:
- the URL of your analysis project’s GitHub.com repository
- the URL of a GitHub release of your analysis project’s GitHub.com repository
- the URL to the “Feedback addressed” issue that outlines the feedback you have addressed
- the URL of your software package’s GitHub.com repository
- the URL of a GitHub release of your software package’s GitHub.com repository
Expectations
Everyone should contribute equally to all aspects of the project (e.g., code, writing, project management). This should be evidenced by roughly equal numbers of commits, pull request reviews, and contributions to communication via GitHub issues.
Each group member should work in a GitHub flow workflow: create a branch for each feature or fix, and have it reviewed and critiqued by at least one other teammate before the pull request is accepted.
You should be committing to git and pushing to GitHub.com every time you work on this project.
Git commit messages should be meaningful. These will be marked. It’s OK if one or two are less meaningful, but most should be.
Use GitHub issues to communicate with your teammates (as opposed to email or Slack).
Your question, analysis and visualization should make sense. It doesn’t have to be complicated.
Your analysis should be correct, and run reproducibly given the instructions provided in the README.
You should use proper grammar and full sentences in your README. Point form may occur, but should be less than 10% of your written documents.
R code should follow the tidyverse style guide, and Python code should follow the black style guide.
You should not have extraneous files in your repository that should be ignored.