In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set that will allow you to answer that question. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook to a fully reproducible and robust data analysis project, comprised of:
An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor
In this milestone, you will:
An example project milestone 1 is available here: Breast Cancer Predictor in Python v0.0.4
This document will govern the working relationships on your team, and if done correctly, it should help you manage and resolve any issues that might arise when working in a group. A sample team work contract from a past data science project can be found here.
A team work contract communicates specifically how the core group of people will work together, and gives detail about the logistics of working together and the expectations you have for each other.
Some aspects of the team work contract could be:
Use this opportunity to apply your prior knowledge/experience to improve your teamwork, communication, leadership, and organizational skills. You will need all of these for your capstone projects (and beyond)!
Note - this document is fairly personal and does NOT need to reside in your public GitHub.com repo. Instead you can prove that you created this by submitting a copy of it to Gradescope under the Milestone 1 assignment.
(5 min) Each person quietly writes down three things that are important to them for the team work contract.
(5 min) Each person takes a turn sharing their three ideas with the group.
(10 min) Group discussion to decide which of the ideas presented should be incorporated into the contract (try to fairly distribute the inclusion of ideas from all teammates). Also discuss if there is anything missing from the contract.
Fill out this survey to provide us your GitHub.com username: https://ubc.ca1.qualtrics.com/jfe/form/SV_8cCRmyxpbch5uxU
After we add you to the DSCI 310 2024 organization on GitHub.com, set up a public GitHub repository for your group under that organization, being sure to add all team members as collaborators. Once you have decided on a project question and data set, make sure your repository has a name that is relevant to that question. Note: if you need to change the name of your project repository later, you can!
We are using GitHub.com and public repositories because we will eventually be using some fancy tooling (e.g., GitHub Actions) in this project. For this to work, all your work for this project (including scripts) should be developed and live in this repository on GitHub.com. Additionally, this will help you build a portfolio of your work you can share with others.
For this project, you are expected to follow the best practices we have learned about in regards to project directory and file structures. This should be reflected in your GitHub repository. Specifically for this milestone, please ensure that you include at least the following 7 files and directories in your project’s GitHub repository:
README.md
CODE_OF_CONDUCT.md
CONTRIBUTING.md
data/
<analysis>.ipynb (give it a meaningful name!)
environment.yml (or the corresponding renv files)
LICENSE.md
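As a quick self-check before submitting, the list above can be turned into a small script. This is only a sketch under stated assumptions, not an official grading tool: the function name `missing_items` is hypothetical, `renv.lock` is assumed to be the renv file that stands in for `environment.yml`, and the analysis document is assumed to sit in the repository root.

```python
from pathlib import Path

# Files named in the milestone's list of required items.
REQUIRED_FILES = ["README.md", "CODE_OF_CONDUCT.md", "CONTRIBUTING.md", "LICENSE.md"]
# Assumption: either a conda environment file or an renv lockfile is acceptable.
ENVIRONMENT_FILES = ["environment.yml", "renv.lock"]
# Assumption: the single literate code document lives in the repository root.
ANALYSIS_PATTERNS = ["*.ipynb", "*.Rmd", "*.qmd"]


def missing_items(repo_root="."):
    """Return a list describing which required items are absent from repo_root."""
    root = Path(repo_root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not (root / "data").is_dir():
        missing.append("data/")
    if not any((root / name).is_file() for name in ENVIRONMENT_FILES):
        missing.append("environment.yml (or renv.lock)")
    if not any(any(root.glob(pattern)) for pattern in ANALYSIS_PATTERNS):
        missing.append("analysis document (.ipynb, .Rmd, or .qmd)")
    return missing


if __name__ == "__main__":
    problems = missing_items(".")
    print("All required items found." if not problems else f"Missing: {problems}")
```

Running the script from the repository root prints either a confirmation or the items still to be added.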
The description of each of them follows.
README.md
In the main README.md file for this project you should include:
- the project title
- the list of contributors/authors
- a short summary of the project (view from 10,000 feet)
- how to run your data analysis
- a list of the dependencies needed to run your analysis
- the names of the licenses contained in LICENSE.md
Note - this document should live in the root of your public GitHub.com repo.
CODE_OF_CONDUCT.md
In an attempt to create a safe, positive, productive, and happy community, many organizations and developers create a code of conduct for their projects. A code of conduct in a Data Science project informs others of social norms, acceptable behaviour and general etiquette. It is more outward facing than the team work contract discussed above. We recommend Project Include for a comprehensive guide to writing an effective Code of Conduct.
At minimum, we believe that a good/effective code of conduct should include:
Note 1 - this document should live in the root of your public GitHub.com repository.
Note 2 - If you use all, or even part, of someone else’s code of conduct, be sure to make this clear by attributing them.
CONTRIBUTING.md
It is a good practice to include information about how others, outside the core team, can contribute to your project somewhere in your repository. This is typically done as a separate file in the repository called CONTRIBUTING.md. Here are some examples of this file:
Note 1 - this document should live in the root of your public GitHub.com repository.
Note 2 - If you use all, or even part, of someone else’s contributing document, be sure to make this clear by attributing them.
data/
The data directory should contain a copy of the downloaded data (it should get written there by code). An exception to this can be made if the data set is extremely large (> 100 MB per file). In such cases, please reach out to the instructional team to make an alternative plan.
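One way to have code write the data into the data directory is sketched below in Python, using only the standard library. The function name `download_to_data_dir` and the URL in the usage note are hypothetical placeholders, not part of the course materials; an equivalent approach in R works just as well.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download_to_data_dir(url, filename, data_dir="data"):
    """Download `url` and save it as `data_dir/filename`, creating the directory if needed."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # Stream the response to disk so the whole file is never held in memory.
    with urlopen(url) as response, open(target, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target
```

A call such as `download_to_data_dir("https://example.com/raw.csv", "raw.csv")` (placeholder URL) at the top of the analysis means anyone who reruns it gets the same copy of the data in `data/`.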
analysis.ipynb / analysis.qmd
For this milestone, the analysis code and narration should be contained within a single literate code document (e.g., Jupyter notebook, RMarkdown file, or Quarto document). This document should not actually be named analysis.***, but have a more descriptive title, related to the project title.
environment.yml (or the corresponding renv files)
Include the necessary files (e.g., environment.yml) so others can reproduce your work on their machine.
Note - this document should live in the root of your public GitHub.com repository.
LICENSE.md
Licenses tell others how they may (or may not) use your work. We will learn more about licenses later in the course, and for now we recommend using an MIT license for the project code and a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license for the project report. As long as your group agrees, these can be changed later in the course when you learn more about licenses. If you want to spend some time choosing different licenses now, we can recommend these two websites to help get you started:
Note - this document should live in the root of your public GitHub.com repository.
The content of this analysis project should be a narrated analysis that asks
and answers a predictive question using a classification
or regression method taught in the prerequisite course, DSCI 100.
For milestone 1,
the code and analysis narrative should be contained within a single Jupyter notebook (e.g., .ipynb file),
RMarkdown file (e.g., .Rmd file), or Quarto document (e.g., .qmd file).
The analysis narrative should be rich,
and at the level of the final project from DSCI 100.
Either R or Python can be used to do this.
The data for the project should be publicly available,
and clearly licensed to be shared and used openly on the internet.
We strongly suggest you avoid using data sets where authentication is needed for access
(e.g., Kaggle) as this adds another layer of complexity when making these projects reproducible.
Note - You are expected to create an original data analysis for this project. You are not allowed to reuse an analysis from another course.
Note: if you choose a data set from the fivethirtyeight R package, you cannot copy the scientific question, visualizations, or methods from the original FiveThirtyEight articles in which the data sets were first reported. Finally, with the fivethirtyeight R package data sets, we want you to get practice reading them using the read_* functions in R or Python, so please use the versions of the data sets listed here: https://github.com/rudeboybert/fivethirtyeight/tree/master/data-raw
Your analysis should have the following sections:
For this project, we will eventually be making the computation environment reproducible through containerization with Docker. For milestone 1, however, we will start with using a virtual environment as our first step towards this. You can use either conda or renv for this. Be sure to include the necessary files (e.g., environment.yml) so others can reproduce your work. Be sure to include documentation of how to use the environment with your project as well in the README.md file.
Finally, make sure the versions of all software and software packages used in the analysis
are recorded in the file that specifies the environment.
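To spot dependencies whose versions have not been recorded, a small check along these lines may help. It is a sketch only: the helper name `unpinned` is hypothetical, it assumes conda-style `name=version` or pip-style `name==version` entries, and the package versions in the example are purely illustrative.

```python
import re

# Matches an exact pin such as "pandas=2.1.4" (conda) or "pandas==2.1.4" (pip).
# Range specifiers like "numpy>=1.26" are deliberately not treated as pinned.
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+={1,2}\d[\w.]*")


def unpinned(dependencies):
    """Return the dependency specs that do not pin an exact version."""
    return [spec for spec in dependencies if not PINNED.match(spec.strip())]


if __name__ == "__main__":
    # Illustrative entries, as they might appear under `dependencies:` in environment.yml.
    example = ["python=3.11.6", "pandas", "scikit-learn==1.3.2", "numpy>=1.26"]
    print(unpinned(example))
```

Any entry the function returns is one where the exact version still needs to be written into the environment file.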
Manage issues effectively through project boards and milestones, making it clear who is responsible for what and which project milestone each task is associated with. In particular, create an issue for each task in the project. Each of these issues must be assigned to a single person on the team. We want all of you to get coding experience in the project, and each team member should be responsible for an approximately equal portion of the code.
To ensure the TAs can grade your project board, change the visibility from the default private setting to the public setting. To change the visibility of the project board, do the following:
Go to the project board, click on the three vertical dots on the right-hand side of the board, and select Settings.
Scroll to the bottom of the Settings page to the “Danger zone” and under visibility select “Public”.
You will submit two URLs to Canvas in the provided text box for milestone 1:
Just before you submit milestone 1,
create a release on your project repository on GitHub and name it 0.0.1 (how to create a release).
This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes,
while you continue to work on your project for the next milestone.