Milestone 1

GitHub repository, statistical question, data & rough draft of analysis in one monolithic literate code document, reproducible environment (full.ipynb, Dockerfile, docker-compose.yml).

Overall project summary

In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set that will allow you to answer that question. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.

Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook, to a fully reproducible and robust data data analysis project, comprised of:

  • A well documented and modularized software package and scripts written in R and/or Python,
  • A data analysis pipeline automated with GNU Make,
  • A reproducible report powered by either Jupyter book or R Markdown,
  • A containerized computational environment created and made shareable by Docker, and
  • A remote version control repository on GitHub for project collaboration and sharing, As well as automation of test suite execution and documentation and software deployment.

An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor

Milestone 1 summary

In this milestone, you will:

  1. Draft a team work contract
  2. Set-up a public GitHub repository
  3. Select a data analysis project to work with
  4. Create an appropriate file and directory structure for a data analysis project
  5. Add the data analysis as a literate code document
  6. Make the computation environment reproducible through containerization with Docker

An example project milestone 1 is available here: https://github.com/UBC-DSCI/predict-airbnb-nightly-price/tree/v1.0.2

1. Draft a team work contract

This document will govern the working relationships on your team, and if done correctly, it should help you manage and resolve any issues that might arise when working in a group. A sample team work contract from a past data science project can be found here.

A team work contract communicates specifically how the core group of people who are working together and gives detail about the logistics of working together and the expectations you have for each other.

Some aspects of the team work contract could be:

  • How will work be distributed in a fair and equitable way?
  • What are the expected work hours for the project?
  • How often will group meetings occur (here is a nice article on meetings)?
  • Will you have meeting agendas and minutes? If so, what will be the system for rotating through these responsibilities?
  • What will be the style of working?
    • Will you start each day with stand-ups, or submit a summary of your contributions 4 hours before each meeting? or something else?
  • What is the quality of work each team member expects from themselves and each other?
  • When are team members not available (e.g., evenings and Sundays because of family obligations).
  • And any other similar things that govern your working relationships.

Use this opportunity to use your prior knowledge/experience to improve your teamwork, communication, leadership, and organizational skills.

Note

This document is fairly personal and does NOT need to reside in your public GitHub.com repo.

Please submit this contract under a separate assignment.

2. Set-up a GitHub repository

Now set-up a public GitHub repository for your project in the format: dsci-310-group-XX-<optional_team_name>. You will create this project in the course github organization (not under a personal account).

We are using github.com and public repositories because we will eventually be using some fancy tooling (e.g., GitHub Actions) in this project. For this to work, all your work for this project (including scripts) should be developed and live in this repository on GitHub.com. Additionally, this will help you build a portfolio of your work you can share with others.

Note

If you need to change the name of your project repository later, you can.

The tracking of the project will be done using GitHub milestones. When you create a milestone, you can associate it with issues and pull requests. After creating issues you can link them to a milestone associating each them to a project due date.

In order to make more efficient team workflow, organize the issues you have previously associated with milestones into a GitHub project board.

The project board should show the current status of the project to the team. Each collaborator is responsible of converting the issues to cards and then moving them through the board columns as the project evolves.

3. Select a data analysis project to work with

The content of this analysis project should be a narrated analysis that asks and answers a predictive question using a classification or regression method taught in the prerequisite course, DSCI 100. For milestone 1, the code and analysis narrative should be contained within a single Jupyter notebook (e.g., .ipynb file), RMarkdown file (e.g., .Rmd file), or Quarto document (e.g., .qmd file). The analysis narrative should be rich, and at the level of the final project from DSCI 100. Either R or Python can be used to do this. However, all group members must have past experience using the language chosen and permission of the instructor must be given. The restrictions on the analysis methods and programming language are put in place to ensure that all group members can contribute equally to the project.

The data for the project should be publicly available, and clearly licensed to be shared and used openly on the internet. We strongly suggest you avoid using data sets where authentication is needed for access (e.g., Kaggle) as this adds another layer of complexity when making these projects reproducible.

Possible data set sources

Note

If you choose a data set from the fivethirtyeight R package, you cannot copy their scientific question, visualizations or methods from the original FiveThirtyEight articles from where the data sets were first reported on. Finally, with the fivethirtyeight R package data sets, we want you to get practice reading them using the read_* functions in R or Python, so please use the versions of the data sets listed here: https://github.com/rudeboybert/fivethirtyeight/tree/master/data-raw

4. Create an appropriate file and directory structure for a data analysis project

For this project, you are expected to follow the best practices we have learned about in regards to project directory and file structures. This should be reflected in your GitHub repository. Specifically for this milestone, please ensure that you include at least the following 7 files and directories in your project’s GitHub repository:

  • README.md
  • CODE_OF_CONDUCT.md
  • CONTRIBUTING.md
  • data/
  • analysis.Rmd or analysis.qmd
  • Dockerfile
  • .github/workflows/publish_docker_image.yml
  • LICENSE.md

The description of each of them follows.

i. README.md

In the main README.md file for this project you should include: - the project title - the list of contributors/authors - a short summary of the project (view from 10,000 feet) - how to run your data analysis - a list of the dependencies needed to run your analysis - the names of the licenses contained in LICENSE.md

Note - this document should live in the root of your public GitHub.com repo.

ii. CODE_OF_CONDUCT.md

In an attempt to create a safe, positive, productive, and happy community, many organizations and developers create a code of conduct for their projects. A code of conduct in a Data Science project informs others of social norms, acceptable behaviour and general etiquette. It is more outward facing than the team work contract discussed above. We recommend Project Include for a comprehensive guide to writing an effective Code of Conduct.

At minimum, we believe that a good/effective code of conduct should include:

  • A Statement on diversity and inclusivity
  • Details on expected, and unacceptable behaviour
  • A list of consequences for unacceptable behaviour
  • A procedure for reporting and dealing with unacceptable behaviour
Sample Codes of Conduct:

Note - this document should live in the root of your public GitHub.com repository.

iii. CONTRIBUTING.md

It is a good practice to include information about how others outside the core team can contribute to your project somewhere in your repository. This is typically done as a separate file in a repository called CONTRIBUTING.md. Here are some examples of this file:

If you’re interested in this, here’s a quick guide to creating this file from GitHub, but in the meantime, you can use the following snippet as a template:

We welcome all contributions to this project!
If you notice a bug, or have a feature request,
please open up an issue [here](https://github.com/UBC-DSCI/REPOSITORY_NAME/issues).
If you'd like to contribute a feature or bug fix,
you can fork our repo and submit a pull request.
We will review pull requests within 7 days.
All contributors must abide by our [code of conduct](CODE_OF_CONDUCT.md).

Note - this document should live in the root of your public GitHub.com repository.

iv. data/

The data directory should contain a copy of the downloaded data (written there by code). An exception to this can be made if the data set is extremely large (> 100 MB per file). In such cases, please reach out to the instructional team to make an alternative plan.

v. analysis.ipynb / analysis.Rmd / analysis.qmd

For this milestone, the analysis code and narration should be contained within a single literate code document (e.g., Jupyter notebook, RMarkdown file, or Quarto document). This document should not actually be named analysis.***, but have a more descriptive title, related to the project title.

vi. Dockerfile

The Dockerfile is the file used to specify and create the Docker image from which containers can be run to create an reproducible computational environment for your analysis. For this project, we recommend using a base Docker image that already has most of the software dependencies needed for your analysis. Examples of these include the Jupyter core team Docker images (documentation) and the Rocker team Docker images (documentation). When you add other software dependencies to this Dockerfile, ensure that you pin the version of the software that you add.

Note - this document should live in the root of your public GitHub.com repository.

vii. .github/workflows/publish_docker_image.yml

In this milestone, we expect you to add a GitHub Actions workflow to automatically build the image, push it to DockerHub, and version the image (using latest is okay for now) and GitHub repo when changes are pushed to the Dockerfile

You will need to add your DockerHub username and password (naming them DOCKER_USERNAME and DOCKER_PASSWORD, respectively) as GitHub secrets to this repository for this to work.

This part is similar to Individual Assignment 3.

viii. LICENSE.md

License’s tell others how they may (or may not) use your work. We will learn more about licenses later in the course, and for now we recommend using an MIT license for the project code and a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license for the project report. As long as your group agrees, these can be changed later in the course when you learn more about licenses. If you want to spend some time choosing different licenses now, we can recommend these two websites to help get you started:

  • https://choosealicense.com/
  • https://chooser-beta.creativecommons.org/

5. Add the data analysis as a literate code document

For this milestone, the analysis code and narration should be contained within a single literate code document (i.e., Jupyter notebook). The following sections should be included:

  • Title
  • Summary
  • Introduction:
    • provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
    • clearly state the question you tried to answer with your project
    • identify and describe the dataset that was used to answer the question
  • Methods & Results:
    • describe in written english the methods you used to perform your analysis from beginning to end that narrates the code the does the analysis.
    • your report should include code which:
    • loads data from the original source on the web
    • wrangles and cleans the data from it’s original (downloaded) format to the format necessary for the planned classification or clustering analysis
    • performs a summary of the data set that is relevant for exploratory data analysis related to the planned classification analysis
    • creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned classification analysis
    • performs classification or regression analysis
    • creates a visualization of the result of the analysis
    • note: all tables and figure should have a figure/table number and a legend
  • Discussion:
    • summarize what you found
    • discuss whether this is what you expected to find?
    • discuss what impact could such findings have?
    • discuss what future questions could this lead to?
  • References:
    • at least 4 citations relevant to the project (format is your choose, just be consistent across the references).

6. Make the computation environment reproducible through containerization with Docker

For this project, we will be making the computation environment reproducible through containerization with Docker. We will use Docker containers for collaborative project development, as well as for sharing a reproducible analysis at the end of the project. It is expected that: - the Docker image, from which containers are generated, will be specified by a Dockerfile that lives in the root of the project’s GitHub repository - the building of the Docker image and distribution of it via DockerHub will be automated via GitHub Actions - an explanation of how to run and use Docker to develop and execute the analysis will be documented in the README.md file.

Submission Instructions

You will submit two URLS’s to Canvas in the provided text box for milestone 1:

  1. the URL of your project’s GitHub.com repository

  2. the URL of a GitHub release of your project’s project’s GitHub.com repository

  3. In a separate file / Submission, your teamwork contract

Creating a release on GitHub.com

Just before you submit the milestone 1, create a release on your project repository on GitHub and name it 0.0.1 (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading puroposes, while you continue to work on your project for the next milestone.

Expectations

  • Everyone should contribute equally to all aspects of the project (e.g., code, writing, project management). This should be evidenced by a roughly equal number of commits, pull request reviews and participation in communication via GitHub issues.
  • After the repository is set-up, each group member should work in a GitHub flow workflow; where they create branches for each feature or fix, which are reviewed and critiqued by at least one other teammate before the the pull request is accepted.
  • You should be committing to git and pushing to GitHub.com every time you work on this project.
  • Git commit messages should be meaningful. These will be marked. It’s OK if one or two are less meaningful, but most should be.
  • Use GitHub issues to communicate to their team mate (as opposed to email or Slack).
  • Your question, analysis and visualization should make sense. It doesn’t have to be complicated.
  • You should use proper grammar and full sentences in your README. Point form may occur, but should be less than 10% of your written documents.