DSCI 310: Milestone 1

Overall project summary

In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set that will allow you to answer that question. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.

Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook, to a fully reproducible and robust data data analysis project, comprised of:

a well documented and modularized software package and scripts written in R and/or Python,
a data analysis pipeline automated with GNU Make,
a reproducible report powered by Quarto,
a containerized computational environment created and made shareable by Docker, and
remote version control repositories on GitHub for project collaboration and sharing, as well as automation of test suite execution and documentation and software deployment.

An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor

Milestone 1 summary

In this milestone, you will:

Draft a team work contract
Set-up a public GitHub repository
Create an appropriate file and directory structure for a data analysis project
Perform the data analysis in a literate code document
Ensure the computation environment is reproducible through a virtual environment (e.g., conda or renv) - hint do this as you create the analysis, don’t wait until the end!
Manage issues professionally

An example project milestone 1 is available here: Breast Cancer Predictor in Python v0.0.4

1. Draft a team work contract

This document will govern the working relationships on your team, and if done correctly, it should help you manage and resolve any issues that might arise when working in a group. A sample team work contract from a past data science project can be found here.

A team work contract communicates specifically how the core group of people who are working together and gives detail about the logisitics of working together and the expectations you have for each other.

Some aspects of the team work contract could be:

How will work be distributed in a fair and equitable way?
What are the expected work hours for the project?
How often will group meetings occur (here is a nice article on meetings)?
Will you have meeting agendas and minutes? If so, what will be the system for rotating through these responsibilities?
What will be the style of working?
- Will you start each day with stand-ups, or submit a summary of your contributions 4 hours before each meeting? or something else?
What is the quality of work each team member expects from themselves and each other?
When are team members not available (e.g., evenings and Sundays because of family obligations).
And any other similar things that govern your working relationships.

Use this opportunity to use your prior knowledge/experience to improve your teamwork, communication, leadership, and organizational skills. You will need all of these for your capastone projects (and beyond)!

Note - this document is fairly personal and does NOT need to reside in your public GitHub.com repo. Instead you can prove that you created this by submitting a copy of it to Gradescope under the Milestone 1 assignment.

Activity to start drafting the team work document

(5 min) Each person quietly writes down three things that are important to them for the team work contract.
(5 min) Each person takes a turn shares their three ideas with the group.
(10 min) Group discussion to decide which of the ideas presented should be incorporated into the contract (try to fairly distribute the inclusion of ideas from all teammates). Also discuss if there is anything missing from the contract.

2. Set-up a GitHub repository

Fill out this survey to provide us your GitHub.com username: https://ubc.ca1.qualtrics.com/jfe/form/SV_8cCRmyxpbch5uxU

After we add you to the DSCI 310 2024 organization on GitHub.com, set-up a public GitHub repository for your group under that organization being sure to add all team members as collaborators. Once you have decided on a project question and data set, make sure your repository has a name that is relevant to that question. Note: if you need to change the name of your project repository later, you can!

We are using github.com and public repositories because we will eventually be using some fancy tooling (e.g., GitHub Actions) in this project. For this to work, all your work for this project (including scripts) should be developed and live in this repository on GitHub.com. Additionally, this will help you build a portfolio of your work you can share with others.

3. Create an appropriate file & directory structure for a data analysis project

For this project, you are expected to follow the best practices we have learned about in regards to project directory and file structures. This should be reflected in your GitHub repository. Specifically for this milestone, please ensure that you include at least the following 7 files and directories in your project’s GitHub repository:

README.md
CODE_OF_CONDUCT.md
CONTRIBUTING.md
data/
<analysis>.ipynb (give it a meaningful name!)
environment.yml (or the corresponding renv files)
LICENSE.md

The description of each of them follows.

i. `README.md`

In the main README.md file for this project you should include: - the project title - the list of contributors/authors - a short summary of the project (view from 10,000 feet) - how to run your data analysis - a list of the dependencies needed to run your analysis - the names of the licenses contained in LICENSE.md

Note - this document should live in the root of your public GitHub.com repo.

ii. `CODE_OF_CONDUCT.md`

In an attempt to create a safe, positive, productive, and happy community, many organizations and developers create a code of conduct for their projects. A code of conduct in a Data Science project informs others of social norms, acceptable behaviour and general etiquette. It is more outward facing than the team work contract discussed above. We recommend Project Include for a comprehensive guide to writing an effective Code of Conduct.

At minimum, we believe that a good/effective code of conduct should include:

A Statement on diversity and inclusivity
Details on expected, and unacceptable behaviour
A list of consequences for unacceptable behaviour
A procedure for reporting and dealing with unacceptable behaviour

Sample Codes of Conduct:

Note 1 - this document should live in the root of your public GitHub.com repository.

Note 2 - If you all, or even part, of someone else’s code of conduct, be sure to make this clear by attributing them.

iii. `CONTRIBUTING.md`

It is a good practice to include information about how others, outside the core team, can contribute to your project somewhere in your repository. This is typically done as a separate file in the repository called CONTRIBUTING.md. Here are some examples of this file:

Altair
dplyr
Atom editor (very comprehensive)

Note 1 - this document should live in the root of your public GitHub.com repository.

Note 2 - If you all, or even part, of someone else’s code of conduct, be sure to make this clear by attributing them.

iv. `data/`

The data directory should contain a copy of the downloaded data (it should get written there by code). An exception to this can be made if the data set is extremely large (> 100 MB per file). In such cases, please reach out to the instructional team to make an alternative plan.

v. `analysis.ipynb` / `analysis.qmd`

For this milestone, the analysis code and narration should be contained within a single literate code document (e.g., Jupyter notebook, RMarkdown file, or Quarto document). This document should not actually be named analysis.***, but have a more descriptive title, related to the project title.

vi. `environment.yml` (or the corresponding `renv` files)

Include the necessary files (e.g., environment.yml) so others can reproduce your work on their machine.

Note - this document should live in the root of your public GitHub.com repository.

viii. `LICENSE.md`

License’s tell others how they may (or may not) use your work. We will learn more about licenses later in the course, and for now we recommend using an MIT license for the project code and a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license for the project report. As long as your group agrees, these can be changed later in the course when you learn more about licenses. If you want to spend some time choosing different licenses now, we can recommend these two websites to help get you started:

Note - this document should live in the root of your public GitHub.com repository.

4. Perform the data analysis in a literate code document

The content of this analysis project should be a narrated analysis that asks and answers a predictive question using a classification or regression method taught in the prerequisite course, DSCI 100. For milestone 1, the code and analysis narrative should be contained within a single Jupyter notebook (e.g., .ipynb file), RMarkdown file (e.g., .Rmd file), or Quarto document (e.g., .qmd file). The analysis narrative should be rich, and at the level of the final project from DSCI 100. Either R or Python can be used to do this. The data for the project should be publicly available, and clearly licensed to be shared and used openly on the internet. We strongly suggest you avoid using data sets where authentication is needed for access (e.g., Kaggle) as this adds another layer of complexity when making these projects reproducible.

Note - You are expected to create an original data analysis for this project. You are not allowed to reuse an analysis from another course.

Possible data set sources

City of Vancouver Open Data Portal
BC Data Catalogue
Government of Canada open data
GitHub repo of awesome public datasets
UCI Machine Learning Repository: Data Sets
tidytuesday data set
Google Data set search engine
fivethirtyeight R package (if you use this, see the note below!)

Note: if you choose a data set from the fivethirtyeight R package, you cannot copy their scientific question, visualizations or methods from the original FiveThirtyEight articles from where the data sets were first reported on. Finally, with the fivethirtyeight R package data sets, we want you to get practice reading them using the read_* functions in R or Python, so please use the versions of the data sets listed here: https://github.com/rudeboybert/fivethirtyeight/tree/master/data-raw

Your analysis should have the following sections:

Title
Summary
Introduction:
- provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
- clearly state the question you tried to answer with your project
- identify and describe the dataset that was used to answer the question
Methods & Results:
- describe in written english the methods you used to perform your analysis from beginning to end that narrates the code the does the analysis.
  - your report should include code which:
  - loads data from the original source on the web
  - wrangles and cleans the data from it’s original (downloaded) format to the format necessary for the planned classification or clustering analysis
  - performs a summary of the data set that is relevant for exploratory data analysis related to the planned classification analysis
  - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned classification analysis
  - performs classification or regression analysis
  - creates a visualization of the result of the analysis
  - note: all tables and figure should have a figure/table number and a legend
Discussion:
- summarize what you found
- discuss whether this is what you expected to find?
- discuss what impact could such findings have?
- discuss what future questions could this lead to?
References:
- at least 4 citations relevant to the project (format is your choose, just be consistent across the references).

6. Ensure the computation environment is reproducible through a virtual environment (e.g., conda or renv)

For this project, we will eventually be making the computation environment reproducible through containerization with Docker. For milestone 1 however, we will start with using a virtual environment as our first step towards this. You can use either conda or renv for this. Be sure to include the necessary files (e.g., environment.yml) so others can reproduce your work. Be sure to include documentation of how to use the environment with your project as well in the README.md file. Finally, make sure the versions of all software and software packages used in the analysis are recorded in the file that specifies the environment.

6. Manage issues professionally

Manage issues effectively through project boards and milestones, make it clear who is responsible for what and what project milestone each task is associated with. In particular, create an issue for each task in the project. Each of these issues must be assigned to a single person on the team. We want all of you to get coding experience in the project and each team member should be responsible for an approximately equal portion of the code.

To ensure the TA’s can grade your project board, change the visibility from the default private setting to the public setting. To change the visibility of the project board do the following:

Go to the project board and click on the three vertical dots on the right-hand side of the board and select settings
Scroll to the bottom of the Settings page to the “Danger zone” and under visibility select “Public”.

Submission Instructions

You will submit two URLS’s to Canvas in the provided text box for milestone 1:

the URL of your project’s GitHub.com repository
the URL of a GitHub release of your project’s project’s GitHub.com repository

Creating a release on GitHub.com

Just before you submit the milestone 1, create a release on your project repository on GitHub and name it 0.0.1 (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading puroposes, while you continue to work on your project for the next milestone.

Expectations

Everyone should contribute equally to all aspects of the project (e.g., code, writing, project management). This should be evidenced by a roughly equal number of commits, pull request reviews and participation in communication via GitHub issues.
After the repository is set-up, each group member should work in a GitHub flow workflow; where they create branches for each feature or fix, which are reviewed and critiqued by at least one other teammate before the the pull request is accepted.
You should be committing to git and pushing to GitHub.com every time you work on this project.
Git commit messages should be meaningful. These will be marked. It’s OK if one or two are less meaningful, but most should be.
Use GitHub issues to communicate to their team mate (as opposed to email, WhatsApp, WeChat, etc).
Your question, analysis and visualization should make sense. It doesn’t have to be complicated.
You should use proper grammar and full sentences in your written documents. Point form may occur, but should be less than 10% of your written documents.

Milestone 1