Workflows for reproducibile and trustworthy data science wrap-up

This topic serves as a wrap-up of the course, summarizing the course learning objectives, redefining what is meant by reproducible and trustworthy data science, as well as contains data analysis project critique exercises to reinforce what has been learned in the course.

Course learning objectives

By the end of this course, students should be able to:

  • Defend and justify the importance of creating data science workflows that are reproducible and trustworthy and the elements that go into such a workflow (e.g., writing clear, robust, accurate and reproducible code, managing and sharing compute environments, defined collaboration strategies, etc).

  • Constructively criticize the workflows and data analysis of others in regards to its reproducibility and trustworthiness.

  • Develop a data science project (including code and non-code documents such as reports) that uses reproducible and trustworthy workflows

  • Demonstrate how to effectively share and collaborate on data science projects and software by creating robust code packages, using reproducible compute environments, and leveraging collaborative development tools.

  • Defend and justify the benefit of, and employ automated testing regimes, continuous integration and continuous deployment for managing and maintaining data science projects and packages.

  • Demonstrate strong communication, teamwork, and collaborative skills by working on a significant data science project with peers throughout the course.

Definitions review:

Data science

the study, development and practice of reproducible and auditable processes to obtain insight from data.

From this definition, we must also define reproducible and auditable analysis:

Reproducible analysis:

reaching the same result given the same input, computational methods and conditions \(^1\).

  • input = data

  • computational methods = computer code

  • conditions = computational environment (e.g., programming language & it’s dependencies)

Auditable/transparent analysis,

a readable record of the steps used to carry out the analysis as well as a record of how the analysis methods evolved \(^2\).

  1. National Academies of Sciences, 2019

  2. Parker, 2017 and Ram, 2013

What makes trustworthy data science?

Some possible criteria:

  1. It should be reproducible and auditable

  2. It should be correct

  3. It should be fair, equitable and honest

There are many ways a data science can be untrustworthy… In this course we will focus on workflows that can help build trust. I highly recommend taking a course in data science ethics* to help round out your education in how to do this. Further training in statistics and machine learning will also help with making sure your analysis is correct.

* UBC’s CPSC 430 (Computers and Society) will have a section reserved for DSCI minor students next year in 2022 T1, which will focus on ethics in data science.

Exercise

Answer the questions below to more concretely connect with the criteria suggested above.

  1. Give an example of a data science workflow that affords reproducibility, and one that affords auditable analysis.

  2. Can a data analysis be reproducible but not auditable? How about auditable, but not reproducible?

  3. Name at least two ways that a data analysis project could be correct (or incorrect).

Critiquing data analysis projects

Critiquing is defined as evaluating something in a detailed and analytical way (source: Oxford Languages Dictionary). It is used in many domains as a means to improve something (related to peer review), but also serves as an excellent pedagogical tool to actively practice evaluation.

We will work together in class to critique the following projects from the lens of reproducible and trustworthy workflows:

Exercise

When prompted for each project listed above:

  • In groups of ~5, take 10 minutes to review the project from the lens of reproducible and trustworthy workflows. You want to evaluate the project with the questions in mind:

    • Would I know where to get started reproducing the project?

    • If I could get started, do I think I could reproduce the project?

  • As a group, come up with at least 1-2 things that have been done well, as well as 1-2 things that could be improved from this lens. Justify these with respect to reproducibility and trustworthiness.

  • Choose one person to be the reporter to bring back these critiques to the larger group.