40 Workflows for reproducible and trustworthy data science wrap-up
This topic wraps up the course: it summarizes the course learning objectives, revisits what is meant by reproducible and trustworthy data science, and provides data analysis project critique exercises to reinforce what has been learned in the course.
40.1 Course learning objectives
By the end of this course, students should be able to:
- Defend and justify the importance of creating data science workflows that are reproducible and trustworthy and the elements that go into such a workflow (e.g., writing clear, robust, accurate and reproducible code, managing and sharing compute environments, defined collaboration strategies, etc.).
- Constructively criticize the workflows and data analyses of others with regard to their reproducibility and trustworthiness.
- Develop a data science project (including code and non-code documents such as reports) that uses reproducible and trustworthy workflows.
- Demonstrate how to effectively share and collaborate on data science projects and software by creating robust code packages, using reproducible compute environments, and leveraging collaborative development tools.
- Defend and justify the benefit of, and employ, automated testing regimes, continuous integration, and continuous deployment for managing and maintaining data science projects and packages.
- Demonstrate strong communication, teamwork, and collaborative skills by working on a significant data science project with peers throughout the course.
40.2 Definitions review
40.2.1 Data science
the study, development and practice of reproducible and auditable processes to obtain insight from data.
From this definition, we must also define reproducible and auditable analysis:
40.2.2 Reproducible analysis
reaching the same result given the same input, computational methods and conditions \(^1\).
- input = data
- computational methods = computer code
- conditions = computational environment (e.g., programming language & its dependencies)
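The three ingredients above can be made concrete with a small sketch. This is only an illustration, not a method prescribed by the course: the helper names (`fingerprint`, `bootstrap_mean`, `run_record`), the toy data, and the seed value are all assumptions made for the example. It fixes the input (recorded as a hash), the computational method (a seeded bootstrap), and a small part of the conditions (the Python version), then checks that two runs agree.

```python
import hashlib
import platform
import random


def fingerprint(data):
    """Return a SHA-256 digest of the input data (hypothetical helper)."""
    joined = ",".join(repr(x) for x in data)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def bootstrap_mean(data, n_resamples=1000, seed=310):
    """Estimate the mean by bootstrapping with an explicitly seeded generator."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


def run_record(data):
    """Bundle the result with a record of the input and (part of) the conditions."""
    return {
        "input_sha256": fingerprint(data),              # input = data
        "python_version": platform.python_version(),    # conditions (partial)
        "result": bootstrap_mean(data),                 # computational method
    }


data = [2.1, 3.4, 1.9, 5.6, 4.2]  # toy input, made up for this example
first = run_record(data)
second = run_record(data)
print(first == second)  # same input, method and conditions give the same record
```

Recording only the Python version is a simplification; a fuller record of the conditions would also pin package versions, for example through an environment file or a container image.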
40.2.3 Auditable/transparent analysis
a readable record of the steps used to carry out the analysis as well as a record of how the analysis methods evolved \(^2\).
40.3 What makes trustworthy data science?
Some possible criteria:
It should be reproducible and auditable
It should be correct
It should be fair, equitable and honest
There are many ways a data science project can be untrustworthy. In this course we will focus on workflows that can help build trust. I highly recommend taking a course in data science ethics* to help round out your education in how to do this. Further training in statistics and machine learning will also help with making sure your analysis is correct.
* UBC’s CPSC 430 (Computers and Society) will have a section reserved for DSCI minor students next year in 2022 T1, which will focus on ethics in data science.
Exercise
Answer the questions below to more concretely connect with the criteria suggested above.
Give an example of a data science workflow that affords reproducibility, and one that affords auditable analysis.
Can a data analysis be reproducible but not auditable? How about auditable, but not reproducible?
Name at least two ways that a data analysis project could be correct (or incorrect).
40.4 Critiquing data analysis projects
Critiquing is defined as evaluating something in a detailed and analytical way (source: Oxford Languages Dictionary). It is used in many domains as a means to improve something (related to peer review), but also serves as an excellent pedagogical tool to actively practice evaluation.
We will work together in class to critique the following projects through the lens of reproducible and trustworthy workflows:
Genomic data and code to accompany the “Accelerating Gene Discovery by Phenotyping Whole-Genome Sequenced Multi-mutation Strains and Using the Sequence Kernel Association Test (SKAT)” manuscript by Timbers et al., PLoS Genetics, 2015
Data and code to accompany the “Altmetrics in the wild: Using social media to explore scholarly impact” manuscript by Priem et al., arXiv, 2012
Code to accompany the “Deep convolutional models improve predictions of macaque V1 responses to natural images” manuscript by Cadena et al., PLoS Computational Biology, 2019
40.4.1 Exercise
When prompted for each project listed above:
In groups of ~5, take 10 minutes to review the project through the lens of reproducible and trustworthy workflows. Evaluate the project with the following questions in mind:
- Would I know where to get started reproducing the project?
- If I could get started, do I think I could reproduce the project?
As a group, come up with one or two things that have been done well, as well as one or two things that could be improved, when viewed through this lens. Justify these with respect to reproducibility and trustworthiness.
Choose one person to be the reporter who brings these critiques back to the larger group.
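As a starting point for the "where would I get started?" question, a first pass over a project can be partly automated. The sketch below is a rough heuristic, not a substitute for reading the project: the `CHECKS` table, the function names, and `run_all.sh` as a possible entry-point script are assumptions made for this example, and the file names are only common conventions. It looks at the top level of a project directory for usage instructions, a license, an environment specification, and a pipeline entry point.

```python
from pathlib import Path
import tempfile

# Hypothetical checklist: file names that are commonly (not universally) used.
CHECKS = {
    "usage instructions": ["README.md", "README.rst", "README.txt", "README"],
    "license": ["LICENSE", "LICENSE.md", "LICENSE.txt"],
    "environment specification": [
        "environment.yml", "requirements.txt", "renv.lock", "Dockerfile",
    ],
    "pipeline entry point": ["Makefile", "run_all.sh"],
}


def screen_project(root):
    """Map each checklist item to the matching files found at the project root."""
    root = Path(root)
    return {
        item: [name for name in candidates if (root / name).is_file()]
        for item, candidates in CHECKS.items()
    }


def report(findings):
    """Format the findings as one line per checklist item."""
    lines = []
    for item, found in findings.items():
        status = ", ".join(found) if found else "not found"
        lines.append(f"{item}: {status}")
    return "\n".join(lines)


# Demonstrate on a throwaway directory that has a README and a Makefile only.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "README.md").write_text("# Demo project\n")
    (Path(tmp) / "Makefile").write_text("all:\n\techo done\n")
    findings = screen_project(tmp)
    print(report(findings))
```

A "not found" line is a prompt for discussion rather than a verdict: a project may document its environment inside the README, or keep its pipeline script in a subdirectory that this top-level check does not inspect.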