Data analysis pipelines#

Topic learning objectives#

By the end of this topic, students should be able to:

  1. Justify why analysis pipelines should be used over a single long script in data analysis projects, specifically with regard to how this affects the reproducibility, maintainability, and future derivatives of the work.

  2. Write a simple automated analysis pipeline using a workflow tool (e.g., GNU Make).

  3. Discuss the advantage of using software that has a dependency graph for analysis pipelines compared to software that does not.

Data analysis pipelines#

  • As analysis grows in length and complexity, one literate code document generally is not enough

  • To improve the readability of the code and report (as well as reproducibility and modularity), it is better to abstract at least parts of the code away (e.g., into scripts)

  • These scripts save figures and tables that will be imported into the final report

[Figure: scripts.png, analysis code abstracted into scripts whose figures and tables are imported into the final report]

Demo: building a data analysis pipeline using a shell script#

adapted from Software Carpentry

To illustrate how to make a data analysis pipeline using a shell script to drive other scripts, we will work through an example. Here are some set-up instructions so that we can do this together:

Set-up instructions#

  1. Click the green “Use this template” button from this GitHub repository to obtain a copy of it for yourself (do not fork it).

  2. Clone this repository to your computer.

Let’s do some analysis!#

Suppose we have a script, wordcount.py, that reads in a text file, counts the words in this text file, and outputs a data file:

python scripts/wordcount.py \
    --input_file=data/isles.txt \
    --output_file=results/isles.dat
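We do not need to read the script to use it, but to make the example concrete, here is a rough sketch of what wordcount.py might contain (hypothetical; the script in the repository may differ):

# hypothetical sketch of wordcount.py; the repository's script may differ
import argparse
from collections import Counter

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", required=True)
parser.add_argument("--output_file", required=True)
args = parser.parse_args()

# read the text and split it into lowercase words
with open(args.input_file) as f:
    words = f.read().lower().split()

# write one row per word: the word, its count, and its percentage of all words
total = len(words)
with open(args.output_file, "w") as f:
    for word, n in Counter(words).most_common():
        f.write(f"{word} {n} {100 * n / total}\n")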

If we view the first 5 rows of the data file using head,

head -5 results/isles.dat

we can see that the file consists of one row per word. Each row shows the word itself, the number of occurrences of that word, and the number of occurrences as a percentage of the total number of words in the text file.

the 3822 6.7371760973
of 2460 4.33632998414
and 1723 3.03719372466
to 1479 2.60708619778
a 1308 2.30565838181

Suppose we also have a script, plotcount.py, that reads in a data file and saves a plot of the 10 most frequently occurring words:

python scripts/plotcount.py \
    --input_file=results/isles.dat \
    --output_file=results/figure/isles.png
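Again as a rough sketch (hypothetical; the repository's script may differ), plotcount.py might look something like this:

# hypothetical sketch of plotcount.py; the repository's script may differ
import argparse
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", required=True)
parser.add_argument("--output_file", required=True)
args = parser.parse_args()

# the .dat file is sorted by frequency, so the first 10 rows
# hold the 10 most frequently occurring words
words, counts = [], []
with open(args.input_file) as f:
    for line in f.readlines()[:10]:
        word, count, _percent = line.split()
        words.append(word)
        counts.append(int(count))

plt.bar(words, counts)
plt.xlabel("word")
plt.ylabel("count")
plt.savefig(args.output_file)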

Together these scripts implement a data analysis pipeline:

  1. Read a data file.

  2. Perform an analysis on this data file.

  3. Write the analysis results to a new file.

  4. Plot a graph of the analysis results.

To document how we’d like the analysis to be run, so that we (and others) have a record and can replicate it, we will build a shell script called run_all.sh. Let’s build this pipeline so it does all of that!

# run_all.sh
# Tiffany Timbers, Nov 2017
#
# This driver script completes the textual analysis of
# one novel and creates a figure of the 10 most
# frequently occurring words in that novel. This script
# takes no arguments.
#
# Usage: bash run_all.sh

# perform word count on the novel
python scripts/wordcount.py \
    --input_file=data/isles.txt \
    --output_file=results/isles.dat

# create plots
python scripts/plotcount.py \
    --input_file=results/isles.dat \
    --output_file=results/figure/isles.png

We actually have 4 books that we want to analyze, and then we want to put the figures in a report. The full pipeline is:

  1. Read a data file.

  2. Perform an analysis on this data file.

  3. Write the analysis results to a new file.

  4. Plot a graph of the analysis results.

  5. Save the graph as an image, so we can put it in a paper.

Let’s extend our shell script to do that!

# run_all.sh
# Tiffany Timbers, Nov 2018

# This driver script completes the textual analysis of
# 4 novels and creates figures on the 10 most frequently
# occurring words from each of the 4 novels. This script
# takes no arguments.

# example usage:
# bash run_all.sh

# count the words
python scripts/wordcount.py \
    --input_file=data/isles.txt \
    --output_file=results/isles.dat
python scripts/wordcount.py \
    --input_file=data/abyss.txt \
    --output_file=results/abyss.dat
python scripts/wordcount.py \
    --input_file=data/last.txt \
    --output_file=results/last.dat
python scripts/wordcount.py \
    --input_file=data/sierra.txt \
    --output_file=results/sierra.dat

# create the plots
python scripts/plotcount.py \
    --input_file=results/isles.dat \
    --output_file=results/figure/isles.png
python scripts/plotcount.py \
    --input_file=results/abyss.dat \
    --output_file=results/figure/abyss.png
python scripts/plotcount.py \
    --input_file=results/last.dat \
    --output_file=results/figure/last.png
python scripts/plotcount.py \
    --input_file=results/sierra.dat \
    --output_file=results/figure/sierra.png

# write the report
quarto render report/count_report.qmd

Another example#

From the breast cancer prediction example analysis repo, here is a data analysis pipeline using a shell script:

# run_all.sh
# Tiffany Timbers, Jan 2020

# download data
python src/download_data.py --out_type=feather --url=https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.data --out_file=data/raw/wdbc.feather

# run eda report
Rscript -e "rmarkdown::render('src/breast_cancer_eda.Rmd')"

# pre-process data 
Rscript src/pre_process_wisc.r --input=data/raw/wdbc.feather --out_dir=data/processed 

# create exploratory data analysis figure and write to file 
Rscript src/eda_wisc.r --train=data/processed/training.feather --out_dir=results

# tune model
Rscript src/fit_breast_cancer_predict_model.r --train=data/processed/training.feather --out_dir=results

# test model
Rscript src/breast_cancer_test_results.r --test=data/processed/test.feather --out_dir=results

# render final report
Rscript -e "rmarkdown::render('doc/breast_cancer_predict_report.Rmd', output_format = 'github_document')"

Discussion#

  • What are the advantages to using a data analysis pipeline?

  • How “good” is a shell script as a data analysis pipeline? What might not be ideal about this?

GNU Make as a data analysis pipeline tool#

We previously built a data analysis pipeline by using a shell script (we called it run_all.sh) to piece together and create a record of all the scripts and arguments we used in our analysis. That is a step in the right direction, but there were a few unsatisfactory things about this strategy:

  1. It takes time to manually erase all of the intermediate and final files generated by the analysis in order to completely test that everything is working from top to bottom.

  2. It runs every step every time. This can be problematic if some steps take a long time and you have only changed other, smaller parts of the analysis.

Thus, to improve on this we are going to use the build automation tool Make to create a smarter data analysis pipeline.

Makefile Structure#

Each block of code in a Makefile is called a rule, and it looks something like this:

file_to_create.png : data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create.png

  • file_to_create.png is a target, a file to be created, or built.

  • data_it_depends_on.dat and script_it_depends_on.py are dependencies, files which are needed to build or update the target. Targets can have zero or more dependencies.

  • : separates targets from dependencies.

  • python script_it_depends_on.py data_it_depends_on.dat file_to_create.png is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions. Actions are indented using the TAB character, not 8 spaces.

  • Together, the target, dependencies, and actions form a rule.

Structure if you have multiple targets from a script#

file_to_create_1.png file_to_create_2.png : data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create_1.png file_to_create_2.png
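Note that with a plain :, Make treats a multi-target rule as separate rules that happen to share a recipe, so in a parallel build (make -j) the action could run once per target. If you have GNU Make 4.3 or newer, grouped targets (written &:) declare that a single run of the action produces all the targets:

# grouped targets (GNU Make 4.3+): one run of the action builds both files
file_to_create_1.png file_to_create_2.png &: data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create_1.png file_to_create_2.png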

Demo: building a data analysis pipeline using Make#

adapted from Software Carpentry

Set-up instructions#

  1. Click the green “Use this template” button from this GitHub repository to obtain a copy of it for yourself (do not fork it).

  2. Clone this repository to your computer.

Good reference: http://swcarpentry.github.io/make-novice/reference

Create a file, called Makefile, with the following content:

# Count words.
results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

This is a simple build file, which for GNU Make is called a Makefile - a file executed by GNU Make. Let us go through each line in turn:

  • # denotes a comment. Any text from # to the end of the line is ignored by Make.

  • results/isles.dat is a target, a file to be created, or built.

  • data/isles.txt and scripts/wordcount.py are dependencies, files which are needed to build or update the target. Targets can have zero or more dependencies.

  • : separates targets from dependencies.

  • python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions.

  • Actions are indented using the TAB character, not 8 spaces. This is a legacy of Make’s 1970s origins.

  • Together, the target, dependencies, and actions form a rule.

Our rule above describes how to build the target results/isles.dat using the action python scripts/wordcount.py and the dependencies data/isles.txt and scripts/wordcount.py.

By default, Make looks for a build file called Makefile, and we can run Make as follows:

$ make results/isles.dat

Make prints out the actions it executes:

python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat

If we see,

Makefile:3: *** missing separator.  Stop.

then we have used spaces instead of a TAB character to indent one of our actions.
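Since tabs and spaces look identical in most editors, it can help to make them visible; with GNU coreutils, cat -A prints each TAB as ^I:

$ cat -A Makefile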

We don’t have to call our Makefile Makefile. However, if we call it something else, we need to tell Make where to find it. We can do this using the -f flag. For example:

$ make -f Makefile results/isles.dat

If we re-run Make, it now informs us that:

make: `results/isles.dat' is up to date.

This is because our target, results/isles.dat, has now been created, and Make will not create it again. To see how this works, let’s pretend to update one of the text files. Rather than opening the file in an editor, we can use the shell touch command to update its timestamp (which would happen if we did edit the file):

$ touch data/isles.txt

If we compare the timestamps of data/isles.txt and results/isles.dat,

$ ls -l data/isles.txt results/isles.dat

then we see that results/isles.dat, the target, is now older than data/isles.txt, its dependency:

-rw-r--r--    1 mjj      Administ   323972 Jun 12 10:35 data/isles.txt
-rw-r--r--    1 mjj      Administ   182273 Jun 12 09:58 results/isles.dat

If we run Make again,

$ make results/isles.dat

then it recreates results/isles.dat:

python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat

When it is asked to build a target, Make checks the ‘last modification time’ of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target.

We may want to remove all our data files so we can explicitly recreate them all. We can introduce a new target, and associated rule, clean:

# Count words.
results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

clean :
	rm -f results/isles.dat

This is an example of a rule that has no dependencies. clean has no dependencies on any .dat file as it makes no sense to create these just to remove them. We just want to remove the data files whether or not they exist. If we run Make and specify this target,

$ make clean

then we get:

rm -f results/isles.dat

No file called clean is actually built; rather, clean is a shorthand we can use to execute a useful sequence of actions.
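Because clean does not name a real file, the GNU Make convention is to declare it phony; this also protects us if a file called clean is ever accidentally created, which would otherwise make Make report that clean is up to date and skip the actions. The declaration (not included in this tutorial’s Makefile) is one line:

.PHONY: clean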

Let’s add another rule to the end of Makefile:

results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

results/figure/isles.png : results/isles.dat scripts/plotcount.py
	python scripts/plotcount.py \
        --input_file=results/isles.dat \
        --output_file=results/figure/isles.png

clean :
	rm -f results/isles.dat
	rm -f results/figure/isles.png

The new target results/figure/isles.png depends on the target results/isles.dat, so to make both, we can simply type:

$ make results/figure/isles.png
$ ls

Let’s add another book:

results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

results/abyss.dat : data/abyss.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/abyss.txt \
        --output_file=results/abyss.dat

results/figure/isles.png : results/isles.dat scripts/plotcount.py
	python scripts/plotcount.py \
        --input_file=results/isles.dat \
        --output_file=results/figure/isles.png

results/figure/abyss.png : results/abyss.dat scripts/plotcount.py
	python scripts/plotcount.py \
        --input_file=results/abyss.dat \
        --output_file=results/figure/abyss.png

clean :
	rm -f results/isles.dat \
        results/abyss.dat
	rm -f results/figure/isles.png \
        results/figure/abyss.png

To run all of the commands, we need to type make for each one:

$ make results/figure/isles.png
$ make results/figure/abyss.png

Or we can add a target all to the very top of the Makefile whose dependencies are the final targets we want built. Because Make builds the first target in the file when run with no arguments, typing just make (or make all) will then build everything:

all: results/figure/isles.png results/figure/abyss.png

Finish off the Makefile!#

Since we will also combine the figures into a report in the end, we will change our all target to be the rendered report file, and add a rule for the rendered report file at the end:

# Makefile
# Tiffany Timbers, Nov 2018

# This Makefile completes the textual analysis of
# 4 novels and creates figures on the 10 most frequently
# occurring words from each of the 4 novels. This Makefile
# takes no arguments.

# example usage:
# make all

all : report/count_report.html

# count the words
results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

results/abyss.dat : data/abyss.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/abyss.txt \
        --output_file=results/abyss.dat

results/last.dat : data/last.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/last.txt \
        --output_file=results/last.dat

results/sierra.dat : data/sierra.txt scripts/wordcount.py
	python scripts/wordcount.py \
        --input_file=data/sierra.txt \
        --output_file=results/sierra.dat

# create the plots
results/figure/isles.png : results/isles.dat scripts/plotcount.py
	python scripts/plotcount.py \
        --input_file=results/isles.dat \
        --output_file=results/figure/isles.png

results/figure/abyss.png : results/abyss.dat scripts/plotcount.py
	python scripts/plotcount.py \
        --input_file=results/abyss.dat \
        --output_file=results/figure/abyss.png

results/figure/last.png : results/last.dat scripts/plotcount.py
	python scripts/plotcount.py \
        --input_file=results/last.dat \
        --output_file=results/figure/last.png

results/figure/sierra.png : results/sierra.dat scripts/plotcount.py
	python scripts/plotcount.py \
        --input_file=results/sierra.dat \
        --output_file=results/figure/sierra.png

# write the report
report/count_report.html : report/count_report.qmd \
results/figure/isles.png \
results/figure/abyss.png \
results/figure/last.png \
results/figure/sierra.png
	quarto render report/count_report.qmd

clean :
	rm -f results/isles.dat \
        results/abyss.dat \
        results/last.dat \
        results/sierra.dat
	rm -f results/figure/isles.png \
        results/figure/abyss.png \
        results/figure/last.png \
        results/figure/sierra.png
	rm -f report/count_report.html
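Since all is the first target in the file, we can now rebuild the entire analysis from scratch to check that everything works from top to bottom:

$ make clean
$ make all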

Improving the Makefile#

Adding PHONY targets to run parts of the analysis (e.g., just counting the words, or just making the figures) can be very useful when iterating and refining a particular part of your analysis. Similarly, adding other PHONY targets to clean up parts of the analysis can help with this as well.

In the version of the Makefile below we do just that, creating the following PHONY targets (declared as sketched after this list):

  • dats (counts words and saves to .dat files)

  • figs (creates figures)

  • clean-dats (cleans up .dat files)

  • clean-figs (cleans up figure files)
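As with clean earlier, Make has no way of knowing that these targets are not files unless we tell it. The Makefile below omits the declaration, but a minimal one would be:

.PHONY: all dats figs clean-dats clean-figs clean-all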

# Makefile
# Tiffany Timbers, Nov 2018

# This Makefile completes the textual analysis of
# 4 novels and creates figures on the 10 most frequently
# occurring words from each of the 4 novels. This Makefile
# takes no arguments.

# example usage:
# make all

# run entire analysis
all: report/count_report.html

# count words
dats: results/isles.dat \
results/abyss.dat \
results/last.dat \
results/sierra.dat

results/isles.dat : scripts/wordcount.py data/isles.txt
	python scripts/wordcount.py \
		--input_file=data/isles.txt \
		--output_file=results/isles.dat
results/abyss.dat : scripts/wordcount.py data/abyss.txt
	python scripts/wordcount.py \
		--input_file=data/abyss.txt \
		--output_file=results/abyss.dat
results/last.dat : scripts/wordcount.py data/last.txt
	python scripts/wordcount.py \
		--input_file=data/last.txt \
		--output_file=results/last.dat
results/sierra.dat : scripts/wordcount.py data/sierra.txt
	python scripts/wordcount.py \
		--input_file=data/sierra.txt \
		--output_file=results/sierra.dat

# plot
figs : results/figure/isles.png \
	results/figure/abyss.png \
	results/figure/last.png \
	results/figure/sierra.png

results/figure/isles.png : scripts/plotcount.py results/isles.dat
	python scripts/plotcount.py \
		--input_file=results/isles.dat \
		--output_file=results/figure/isles.png
results/figure/abyss.png : scripts/plotcount.py results/abyss.dat
	python scripts/plotcount.py \
		--input_file=results/abyss.dat \
		--output_file=results/figure/abyss.png
results/figure/last.png : scripts/plotcount.py results/last.dat
	python scripts/plotcount.py \
		--input_file=results/last.dat \
		--output_file=results/figure/last.png
results/figure/sierra.png : scripts/plotcount.py results/sierra.dat
	python scripts/plotcount.py \
		--input_file=results/sierra.dat \
		--output_file=results/figure/sierra.png

# write the report
report/count_report.html : report/count_report.qmd figs
	quarto render report/count_report.qmd

clean-dats :
	rm -f results/isles.dat \
		results/abyss.dat \
		results/last.dat \
		results/sierra.dat

clean-figs :
	rm -f results/figure/isles.png \
	results/figure/abyss.png \
	results/figure/last.png \
	results/figure/sierra.png

clean-all : clean-dats \
	clean-figs
	rm -f report/count_report.html
	rm -rf report/count_report_files
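With these targets in place we can run, or clean up, individual parts of the pipeline, for example:

$ make dats        # just count the words
$ make clean-figs  # just remove the figures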

Pattern matching and variables in a Makefile#

It is possible to DRY out a Makefile using pattern matching and variables.

Using wildcards and pattern matching in a Makefile is possible, but the syntax is not very readable, so if you choose to do this, proceed with caution. Examples of how to do this are here: http://swcarpentry.github.io/make-novice/05-patterns/index.html
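To give a sense of the syntax, here is a sketch of a single pattern rule that could replace the four .dat rules above (% matches the book name, $< expands to the first dependency, and $@ to the target):

# one rule builds results/<book>.dat from data/<book>.txt for any book
results/%.dat : data/%.txt scripts/wordcount.py
	python scripts/wordcount.py --input_file=$< --output_file=$@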

As for variables in a Makefile, in most cases we actually do not want to use them. The reason is that we want this file to be a record of what we did to run our analysis (e.g., what files were used, what settings were used, etc.). If you start using variables in your Makefile, then you are shifting the problem of recording how your analysis was done to another file: there needs to be some file in your repo that captures what values the variables were given so that you can replicate your analysis. Examples of using variables in a Makefile are here: http://swcarpentry.github.io/make-novice/06-variables/index.html
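For reference, defining and using a variable looks like this (a sketch; the Makefiles in this tutorial deliberately avoid variables):

# define once, reuse in every action; note that running
# make PYTHON=python3 would change how the analysis runs without
# leaving a record in the Makefile (the problem described above)
PYTHON = python

results/isles.dat : data/isles.txt scripts/wordcount.py
	$(PYTHON) scripts/wordcount.py \
		--input_file=data/isles.txt \
		--output_file=results/isles.dat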