Milestone 1

Overall project summary

In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set suited to that question. To do so, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.

Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook to a fully reproducible and robust data analysis project, comprised of:

An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor

Milestone 1 summary

In this milestone, you will:

  1. Draft a team work contract
  2. Set up a public GitHub repository
  3. Create an appropriate file and directory structure for a data analysis project
  4. Perform the data analysis in a literate code document
  5. Ensure the computation environment is reproducible through a virtual environment (e.g., conda or renv). Hint: do this as you create the analysis, don’t wait until the end!
  6. Manage issues professionally

An example project milestone 1 is available here: Breast Cancer Predictor in Python v0.0.4

1. Draft a team work contract

This document will govern the working relationships on your team, and if done correctly, it should help you manage and resolve any issues that might arise when working in a group. A sample team work contract from a past data science project can be found here.

A team work contract communicates specifically how the core group of people will work together. It details the logistics of working together and the expectations you have for each other.

Some aspects of the team work contract could be:

Use this opportunity to use your prior knowledge/experience to improve your teamwork, communication, leadership, and organizational skills. You will need all of these for your capstone projects (and beyond)!

Note - this document is fairly personal and does NOT need to reside in your public GitHub.com repo. Instead you can prove that you created this by submitting a copy of it to Gradescope under the Milestone 1 assignment.

Activity to start drafting the team work document

  1. (5 min) Each person quietly writes down three things that are important to them for the team work contract.

  2. (5 min) Each person takes a turn sharing their three ideas with the group.

  3. (10 min) Group discussion to decide which of the ideas presented should be incorporated into the contract (try to fairly distribute the inclusion of ideas from all teammates). Also discuss if there is anything missing from the contract.

2. Set up a GitHub repository

Fill out this survey to provide us your GitHub.com username: https://ubc.ca1.qualtrics.com/jfe/form/SV_8cCRmyxpbch5uxU

After we add you to the DSCI 310 2024 organization on GitHub.com, set up a public GitHub repository for your group under that organization, being sure to add all team members as collaborators. Once you have decided on a project question and data set, make sure your repository has a name that is relevant to that question. Note: if you need to change the name of your project repository later, you can!

We are using github.com and public repositories because we will eventually be using some fancy tooling (e.g., GitHub Actions) in this project. For this to work, all your work for this project (including scripts) should be developed and live in this repository on GitHub.com. Additionally, this will help you build a portfolio of your work you can share with others.

3. Create an appropriate file & directory structure for a data analysis project

For this project, you are expected to follow the best practices we have learned for project directory and file structures. This should be reflected in your GitHub repository. Specifically for this milestone, please ensure that you include at least the following 7 files and directories in your project’s GitHub repository:

The description of each of them follows.

i. README.md

In the main README.md file for this project you should include:

  - the project title
  - the list of contributors/authors
  - a short summary of the project (view from 10,000 feet)
  - how to run your data analysis
  - a list of the dependencies needed to run your analysis
  - the names of the licenses contained in LICENSE.md

Note - this document should live in the root of your public GitHub.com repo.

ii. CODE_OF_CONDUCT.md

In an attempt to create a safe, positive, productive, and happy community, many organizations and developers create a code of conduct for their projects. A code of conduct in a Data Science project informs others of social norms, acceptable behaviour and general etiquette. It is more outward facing than the team work contract discussed above. We recommend Project Include for a comprehensive guide to writing an effective Code of Conduct.

At minimum, we believe that a good/effective code of conduct should include:

Sample Codes of Conduct:

Note 1 - this document should live in the root of your public GitHub.com repository.

Note 2 - If you use all, or even part, of someone else’s code of conduct, be sure to make this clear by attributing them.

iii. CONTRIBUTING.md

It is a good practice to include information about how others, outside the core team, can contribute to your project somewhere in your repository. This is typically done as a separate file in the repository called CONTRIBUTING.md. Here are some examples of this file:

Note 1 - this document should live in the root of your public GitHub.com repository.

Note 2 - If you use all, or even part, of someone else’s contributing guidelines, be sure to make this clear by attributing them.

iv. data/

The data directory should contain a copy of the downloaded data (it should get written there by code). An exception to this can be made if the data set is extremely large (> 100 MB per file). In such cases, please reach out to the instructional team to make an alternative plan.
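One way to have code write the data into the data directory is a small download helper. The following is a minimal sketch using only the Python standard library; the function name `download_data`, the URL, and the file name in the usage comment are hypothetical placeholders, not part of the course materials.

```python
from pathlib import Path
from shutil import copyfileobj
from urllib.request import urlopen


def download_data(url: str, dest: Path) -> Path:
    """Download `url` to `dest`, creating parent directories as needed.

    Skips the download when `dest` already exists, so re-running the
    analysis does not fetch the file again.
    """
    dest = Path(dest)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk rather than holding it all in memory.
    with urlopen(url) as response, open(dest, "wb") as out_file:
        copyfileobj(response, out_file)
    return dest


# Hypothetical usage (placeholder URL and file name):
# download_data("https://example.com/raw_data.csv", Path("data/raw_data.csv"))
```

Calling a helper like this at the top of the analysis document means the copy in data/ is produced by code rather than by a manual download.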

v. analysis.ipynb / analysis.qmd

For this milestone, the analysis code and narration should be contained within a single literate code document (e.g., Jupyter notebook, RMarkdown file, or Quarto document). This document should not actually be named analysis.***, but have a more descriptive title, related to the project title.

vi. environment.yml (or the corresponding renv files)

Include the necessary files (e.g., environment.yml) so others can reproduce your work on their machine.

Note - this document should live in the root of your public GitHub.com repository.

vii. LICENSE.md

Licenses tell others how they may (or may not) use your work. We will learn more about licenses later in the course, and for now we recommend using an MIT license for the project code and a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license for the project report. As long as your group agrees, these can be changed later in the course when you learn more about licenses. If you want to spend some time choosing different licenses now, we can recommend these two websites to help get you started:

Note - this document should live in the root of your public GitHub.com repository.
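Before submitting, a team might want to confirm that the items described in this section are present. The sketch below is one possible check, not a course-provided tool: the function name `missing_items` is hypothetical, `renv.lock` is assumed to be the renv counterpart of environment.yml, and the literate code document is searched for anywhere in the repository because its location is not prescribed here.

```python
from pathlib import Path

# Files this milestone asks for in the repository root.
REQUIRED_FILES = [
    "README.md",
    "CODE_OF_CONDUCT.md",
    "CONTRIBUTING.md",
    "LICENSE.md",
]
# Accepted literate code document types (Jupyter, RMarkdown, Quarto).
ANALYSIS_SUFFIXES = {".ipynb", ".Rmd", ".qmd"}


def missing_items(root: Path) -> list[str]:
    """Return a description of each expected item not found under `root`."""
    root = Path(root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not (root / "data").is_dir():
        missing.append("data/")
    # Assumption: environment.yml for conda, renv.lock for renv.
    if not any((root / n).is_file() for n in ("environment.yml", "renv.lock")):
        missing.append("environment.yml or renv.lock")
    # Assumption: the analysis document may live anywhere in the repository.
    if not any(p.suffix in ANALYSIS_SUFFIXES for p in root.rglob("*")):
        missing.append("literate code document (.ipynb, .Rmd, or .qmd)")
    return missing


# Hypothetical usage from the repository root:
# print(missing_items(Path(".")))
```

An empty list means every item was found; the check says nothing about whether the contents of those files meet the expectations above.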

4. Perform the data analysis in a literate code document

The content of this analysis project should be a narrated analysis that asks and answers a predictive question using a classification or regression method taught in the prerequisite course, DSCI 100. For milestone 1, the code and analysis narrative should be contained within a single Jupyter notebook (e.g., .ipynb file), RMarkdown file (e.g., .Rmd file), or Quarto document (e.g., .qmd file). The analysis narrative should be rich, and at the level of the final project from DSCI 100. Either R or Python can be used to do this. The data for the project should be publicly available, and clearly licensed to be shared and used openly on the internet. We strongly suggest you avoid using data sets where authentication is needed for access (e.g., Kaggle) as this adds another layer of complexity when making these projects reproducible.

Note - You are expected to create an original data analysis for this project. You are not allowed to reuse an analysis from another course.

Possible data set sources

Note: if you choose a data set from the fivethirtyeight R package, you cannot copy the scientific question, visualizations or methods from the original FiveThirtyEight articles in which the data sets were first reported. Finally, with the fivethirtyeight R package data sets, we want you to get practice reading them using the read_* functions in R or Python, so please use the versions of the data sets listed here: https://github.com/rudeboybert/fivethirtyeight/tree/master/data-raw

Your analysis should have the following sections:

5. Ensure the computation environment is reproducible through a virtual environment (e.g., conda or renv)

For this project, we will eventually be making the computation environment reproducible through containerization with Docker. For milestone 1, however, we will start by using a virtual environment as our first step towards this. You can use either conda or renv for this. Be sure to include the necessary files (e.g., environment.yml) so others can reproduce your work. Also document in the README.md file how to use the environment with your project. Finally, make sure the versions of all software and software packages used in the analysis are recorded in the file that specifies the environment.
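To help record versions, the installed version of each package can be looked up from inside the environment. The following is a minimal sketch under stated assumptions: the function name `pinned_versions` is hypothetical, the package names in the usage comment are placeholders, and `importlib.metadata` looks packages up by their PyPI distribution name, which may differ from the conda package name.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def pinned_versions(packages: list[str]) -> list[str]:
    """Return `name=version` strings for Python and the given packages."""
    info = sys.version_info
    lines = [f"python={info.major}.{info.minor}.{info.micro}"]
    for name in packages:
        try:
            lines.append(f"{name}={version(name)}")
        except PackageNotFoundError:
            # Flag the package instead of failing, so the gap is visible.
            lines.append(f"{name}  # not installed in this environment")
    return lines


# Hypothetical usage (placeholder package names):
# print("\n".join(pinned_versions(["pandas", "scikit-learn"])))
```

The `name=version` form matches how conda pins dependencies in environment.yml, so the output can be compared against that file by hand.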

6. Manage issues professionally

Manage issues effectively through project boards and milestones. Make it clear who is responsible for what, and which project milestone each task is associated with. In particular, create an issue for each task in the project. Each of these issues must be assigned to a single person on the team. We want all of you to get coding experience in the project and each team member should be responsible for an approximately equal portion of the code.

To ensure the TAs can grade your project board, change the visibility from the default private setting to the public setting. To change the visibility of the project board do the following:

  1. Go to the project board, click on the three vertical dots on the right-hand side of the board, and select “Settings”.

  2. Scroll to the bottom of the Settings page to the “Danger zone” and under visibility select “Public”.

Submission Instructions

You will submit two URLs to Canvas in the provided text box for milestone 1:

  1. the URL of your project’s GitHub.com repository
  2. the URL of a GitHub release of your project’s GitHub.com repository

Creating a release on GitHub.com

Just before you submit milestone 1, create a release on your project repository on GitHub and name it 0.0.1 (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes, while you continue to work on your project for the next milestone.

Expectations