24 Data analysis pipelines with GNU Make

Learning Objectives

Write a simple automated analysis pipeline using a workflow tool (e.g., GNU Make)
Discuss the advantage of using software that has a dependency graph for analysis pipelines compared to software that does not.

24.1 GNU Make as a data analysis pipeline tool

We previously built a data analysis pipeline by using a shell script (we called it run_all.sh) to piece together and create a record of all the scripts and arguments we used in our analysis. That is a step in the right direction, but there were a few unsatisfactory things about this strategy:

It takes time to manually erase all intermediate and final files generated by the analysis before doing a complete test that everything works from top to bottom.
It runs every step every time. This can be problematic if some steps take a long time and you have only changed other, smaller parts of the analysis.

Thus, to improve on this we are going to use the build and automation tool, Make, to make a smarter data analysis pipeline.

24.2 Makefile Structure

Each block of code in a Makefile is called a rule. It looks something like this:

file_to_create.png : data_it_depends_on.dat script_it_depends_on.py
    python script_it_depends_on.py data_it_depends_on.dat file_to_create.png

file_to_create.png is a target, a file to be created, or built.
data_it_depends_on.dat and script_it_depends_on.py are dependencies, files which are needed to build or update the target. Targets can have zero or more dependencies.
: separates targets from dependencies.
python script_it_depends_on.py data_it_depends_on.dat file_to_create.png is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions. Actions are indented using the TAB character, not 8 spaces.
Together, the target, dependencies, and actions form a rule.

24.3 Structure if you have multiple targets from a script

file_to_create_1.png file_to_create_2.png : data_it_depends_on.dat script_it_depends_on.py
    python script_it_depends_on.py data_it_depends_on.dat file_to_create

Building a Data Analysis pipeline using Make, a tutorial

adapted from Software Carpentry

Set-up instructions

Click the green “Use this template” button from this GitHub repository to obtain a copy of it for yourself (do not fork it).
Clone this repository to your computer.

Good reference: http://swcarpentry.github.io/make-novice/reference

Create a file, called Makefile, with the following content:

# Count words.
results/isles.dat : data/isles.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

This is a simple build file, which GNU Make calls a Makefile: a file that GNU Make reads and executes. Let us go through each line in turn:

# denotes a comment. Any text from # to the end of the line is ignored by Make.
results/isles.dat is a target, a file to be created, or built.
data/isles.txt and scripts/wordcount.py are dependencies, files that are needed to build or update the target. Targets can have zero or more dependencies.
: separates targets from dependencies.
python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions.
Actions are indented using the TAB character, not 8 spaces. This is a legacy of Make’s 1970s origins.
Together, the target, dependencies, and actions form a rule.

Our rule above describes how to build the target results/isles.dat using the action python scripts/wordcount.py and the dependencies data/isles.txt and scripts/wordcount.py.

By default, Make looks for a Makefile, called Makefile, and we can run Make as follows:

$ make results/isles.dat

Make prints out the actions it executes:

python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat

If we see,

Makefile:3: *** missing separator.  Stop.

then we have used spaces instead of a TAB character to indent one of our actions.

We don’t have to call our Makefile Makefile. However, if we call it something else we need to tell Make where to find it. We can do this using the -f flag. For example:

$ make -f Makefile results/isles.dat

As we have re-run our Makefile, Make now informs us that:

make: `results/isles.dat' is up to date.

This is because our target, results/isles.dat, has now been created, and Make will not create it again. To see how this works, let’s pretend to update one of the text files. Rather than opening the file in an editor, we can use the shell touch command to update its timestamp (which would happen if we did edit the file):

$ touch data/isles.txt

If we compare the timestamps of data/isles.txt and results/isles.dat,

$ ls -l data/isles.txt results/isles.dat

then we see that results/isles.dat, the target, is now older than data/isles.txt, its dependency:

-rw-r--r--    1 mjj      Administ   323972 Jun 12 10:35 data/isles.txt
-rw-r--r--    1 mjj      Administ   182273 Jun 12 09:58 results/isles.dat

If we run Make again,

$ make results/isles.dat

then it recreates results/isles.dat:

python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat

When it is asked to build a target, Make checks the ‘last modification time’ of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target.
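The timestamp rule described above can be sketched in a few lines of Python. This is a minimal illustration, not how Make is actually implemented: the names needs_rebuild and demo are hypothetical, and real Make also brings each dependency up to date before making the comparison.

```python
import os
import tempfile
import time


def needs_rebuild(target, dependencies):
    """Return True if the target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def demo():
    """Replay the isles example using empty stand-in files in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        dep = os.path.join(tmp, "isles.txt")
        target = os.path.join(tmp, "isles.dat")
        with open(dep, "w"):
            pass
        # The target has not been created yet, so it must be built.
        before_first_build = needs_rebuild(target, [dep])
        with open(target, "w"):
            pass
        now = time.time()
        # Set the timestamps explicitly: dependency older than target.
        os.utime(dep, (now - 60, now - 60))
        os.utime(target, (now, now))
        after_build = needs_rebuild(target, [dep])
        # Mimic `touch data/isles.txt`: the dependency is now newer.
        os.utime(dep, (now + 60, now + 60))
        after_touch = needs_rebuild(target, [dep])
        return before_first_build, after_build, after_touch
```

Calling demo() walks through the three situations seen so far: the target does not exist yet, the target is up to date, and the target is stale after its dependency has been touched.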

We may want to remove all our data files so we can explicitly recreate them all. We can introduce a new target, and associated rule, clean:

# Count words.
results/isles.dat : data/isles.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

clean :
    rm -f results/isles.dat

This is an example of a rule that has no dependencies. clean has no dependencies on any .dat file as it makes no sense to create these just to remove them. We just want to remove the data files whether or not they exist. If we run Make and specify this target,

$ make clean

then we get:

rm -f results/isles.dat

There is no actual thing built called clean. Rather, it is a short-hand that we can use to execute a useful sequence of actions.

Let’s add another rule to the Makefile, placed before clean, and update clean so that it also removes the new output:

results/isles.dat : data/isles.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

results/figure/isles.png : results/isles.dat scripts/plotcount.py
    python scripts/plotcount.py \
        --input_file=results/isles.dat \
        --output_file=results/figure/isles.png

clean :
    rm -f results/isles.dat
    rm -f results/figure/isles.png

The new target results/figure/isles.png depends on the target results/isles.dat, so to make both we can simply type:

$ make results/figure/isles.png
$ ls
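Why one make command is enough to build both files can be sketched as a walk over the dependency graph. The Python below is an illustrative sketch only: the RULES dictionary and the build_order function are hypothetical names, and the sketch ignores timestamps, so it lists every target that has a rule rather than only the stale ones.

```python
# Each target maps to the list of files it depends on (taken from the Makefile above).
RULES = {
    "results/isles.dat": ["data/isles.txt", "scripts/wordcount.py"],
    "results/figure/isles.png": ["results/isles.dat", "scripts/plotcount.py"],
}


def build_order(goal, rules):
    """Return the targets to build, dependencies first, to reach the goal."""
    order = []
    visited = set()

    def visit(target):
        # Files without a rule (raw data, scripts) are sources: nothing to build.
        if target in visited or target not in rules:
            return
        visited.add(target)
        for dep in rules[target]:
            visit(dep)
        # A target is appended only after all of its dependencies.
        order.append(target)

    visit(goal)
    return order
```

Asking for results/figure/isles.png yields results/isles.dat first and the figure second, which mirrors the order in which Make runs the two actions.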

Let’s add another book:

results/isles.dat : data/isles.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

results/abyss.dat : data/abyss.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/abyss.txt \
        --output_file=results/abyss.dat

results/figure/isles.png : results/isles.dat scripts/plotcount.py
    python scripts/plotcount.py \
        --input_file=results/isles.dat \
        --output_file=results/figure/isles.png

results/figure/abyss.png : results/abyss.dat scripts/plotcount.py
    python scripts/plotcount.py \
        --input_file=results/abyss.dat \
        --output_file=results/figure/abyss.png

clean :
    rm -f results/isles.dat \
        results/abyss.dat
    rm -f results/figure/isles.png \
        results/figure/abyss.png

To run all of the commands, we need to type make for each one:

$ make results/figure/isles.png
$ make results/figure/abyss.png

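Or we can add a target, all, at the very top of the Makefile that depends on the final outputs of the pipeline. When Make is run without naming a target it builds the first target in the file, so running make (or make all) then builds everything that all depends on.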

all: results/figure/isles.png results/figure/abyss.png

24.4 Finish off the Makefile!

Since we will also combine the figures into a report in the end, we will change our all target to being the rendered report file, and add a target for the rendered report file at the end:

# Makefile
# Tiffany Timbers, Nov 2018

# This driver script completes the textual analysis of
# 4 novels and creates figures on the 10 most frequently
# occurring words from each of the 4 novels. This script
# takes no arguments.

# example usage:
# make all

all : report/count_report.html

# count the words
results/isles.dat : data/isles.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat

results/abyss.dat : data/abyss.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/abyss.txt \
        --output_file=results/abyss.dat

results/last.dat : data/last.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/last.txt \
        --output_file=results/last.dat

results/sierra.dat : data/sierra.txt scripts/wordcount.py
    python scripts/wordcount.py \
        --input_file=data/sierra.txt \
        --output_file=results/sierra.dat

# create the plots
results/figure/isles.png : results/isles.dat scripts/plotcount.py
    python scripts/plotcount.py \
        --input_file=results/isles.dat \
        --output_file=results/figure/isles.png

results/figure/abyss.png : results/abyss.dat scripts/plotcount.py
    python scripts/plotcount.py \
        --input_file=results/abyss.dat \
        --output_file=results/figure/abyss.png

results/figure/last.png : results/last.dat scripts/plotcount.py
    python scripts/plotcount.py \
        --input_file=results/last.dat \
        --output_file=results/figure/last.png

results/figure/sierra.png : results/sierra.dat scripts/plotcount.py
    python scripts/plotcount.py \
        --input_file=results/sierra.dat \
        --output_file=results/figure/sierra.png

# write the report
report/count_report.html : report/count_report.qmd \
    results/figure/isles.png \
    results/figure/abyss.png \
    results/figure/last.png \
    results/figure/sierra.png
    quarto render report/count_report.qmd

clean :
    rm -f results/isles.dat \
        results/abyss.dat \
        results/last.dat \
        results/sierra.dat
    rm -f results/figure/isles.png \
        results/figure/abyss.png \
        results/figure/last.png \
        results/figure/sierra.png
    rm -rf report/count_report.html

24.5 PHONY Targets

So far our Makefile contains multiple targets. Some of these targets are the names of files, while others, e.g., clean and all, are not the name of any particular file. A target that is really just a name for a recipe is known as a phony target.

The main reason for declaring phony targets is to avoid a conflict with a file of the same name. To create a phony target we use the special target .PHONY and list the target names after it, separated by spaces. By convention we list all the phony targets in one place at the top of the file. Sometimes you may instead see separate .PHONY: declarations right above the rule for each phony target. Pay attention to how it is spelled: .PHONY, in capital letters.

To declare our clean and all targets as phony targets we would put this line at the top of our Makefile:

.PHONY: all clean
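The name conflict that .PHONY guards against can be sketched in Python. This is a simplified model under stated assumptions, not Make's real implementation: should_run and demo are hypothetical names, and the check is the same timestamp comparison Make applies to ordinary file targets.

```python
import os
import tempfile


def should_run(target, dependencies, phony=()):
    """Decide whether a rule's actions run; phony targets always run."""
    if target in phony:
        return True
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def demo():
    """Show what happens when a file named 'clean' exists in the project."""
    with tempfile.TemporaryDirectory() as tmp:
        clean = os.path.join(tmp, "clean")
        with open(clean, "w"):
            pass
        # A file called clean exists and the rule has no dependencies,
        # so the target looks up to date and its actions are skipped.
        without_phony = should_run(clean, [])
        # Declared phony, the target ignores the file and always runs.
        with_phony = should_run(clean, [], phony={clean})
        return without_phony, with_phony
```

In this model, a stray file named clean silently disables the clean rule unless the target is declared phony, which is the conflict described above.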

24.6 Improving the Makefile

Adding PHONY targets to run parts of the analysis (e.g., just counting the words, or just making the figures) can be very useful when iterating and refining a particular part of your analysis. Similarly, adding other PHONY targets to clean up parts of the analysis can help with this as well.

In the version of the Makefile below we do just that, creating the following PHONY targets:

dats (counts words and saves to .dat files)
figs (creates figures)
clean-dats (cleans up .dat files)
clean-figs (cleans up figure files)

# Makefile
# Tiffany Timbers, Nov 2018

# This driver script completes the textual analysis of
# 4 novels and creates figures on the 10 most frequently
# occurring words from each of the 4 novels. This script
# takes no arguments.

# example usage:
# make all

.PHONY: all dats figs clean-dats clean-figs clean-all

# run entire analysis
all: report/count_report.html

# count words
dats: results/isles.dat \
    results/abyss.dat \
    results/last.dat \
    results/sierra.dat

results/isles.dat : scripts/wordcount.py data/isles.txt
    python scripts/wordcount.py \
        --input_file=data/isles.txt \
        --output_file=results/isles.dat
results/abyss.dat : scripts/wordcount.py data/abyss.txt
    python scripts/wordcount.py \
        --input_file=data/abyss.txt \
        --output_file=results/abyss.dat
results/last.dat : scripts/wordcount.py data/last.txt
    python scripts/wordcount.py \
        --input_file=data/last.txt \
        --output_file=results/last.dat
results/sierra.dat : scripts/wordcount.py data/sierra.txt
    python scripts/wordcount.py \
        --input_file=data/sierra.txt \
        --output_file=results/sierra.dat

# plot
figs : results/figure/isles.png \
    results/figure/abyss.png \
    results/figure/last.png \
    results/figure/sierra.png

results/figure/isles.png : scripts/plotcount.py results/isles.dat
    python scripts/plotcount.py \
        --input_file=results/isles.dat \
        --output_file=results/figure/isles.png
results/figure/abyss.png : scripts/plotcount.py results/abyss.dat
    python scripts/plotcount.py \
        --input_file=results/abyss.dat \
        --output_file=results/figure/abyss.png
results/figure/last.png : scripts/plotcount.py results/last.dat
    python scripts/plotcount.py \
        --input_file=results/last.dat \
        --output_file=results/figure/last.png
results/figure/sierra.png : scripts/plotcount.py results/sierra.dat
    python scripts/plotcount.py \
        --input_file=results/sierra.dat \
        --output_file=results/figure/sierra.png

# write the report
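# depend on the figure files themselves, not on the phony target figs:
# a phony dependency is always treated as out of date, so it would
# force the report to be re-rendered on every run
report/count_report.html : report/count_report.qmd \
    results/figure/isles.png \
    results/figure/abyss.png \
    results/figure/last.png \
    results/figure/sierra.png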
    quarto render report/count_report.qmd

clean-dats :
    rm -f results/isles.dat \
        results/abyss.dat \
        results/last.dat \
        results/sierra.dat

clean-figs :
    rm -f results/figure/isles.png \
    results/figure/abyss.png \
    results/figure/last.png \
    results/figure/sierra.png

clean-all : clean-dats \
    clean-figs
    rm -f report/count_report.html
    rm -rf report/count_report_files

24.7 Pattern matching and variables in a Makefile

It is possible to DRY out a Makefile (DRY stands for "Don't Repeat Yourself") by using pattern matching and variables.

Using wildcards and pattern matching in a Makefile is possible, but the syntax is not very readable, so if you choose to do this, proceed with caution. Examples of how to do this are here: http://swcarpentry.github.io/make-novice/05-patterns/index.html
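To give a feel for what pattern matching does, the Python sketch below imitates how a single % in a Make pattern rule stands for a shared stem. It is a rough illustration only: match_pattern and expand are hypothetical names, the pattern results/%.dat : data/%.txt scripts/wordcount.py is an assumed example that does not appear in the Makefiles above, and Make's real matching has extra rules for directories that are ignored here.

```python
def match_pattern(pattern, target):
    """Return the stem that % stands for if target matches, else None."""
    prefix, _, suffix = pattern.partition("%")
    long_enough = len(target) > len(prefix) + len(suffix)  # stem must be non-empty
    if long_enough and target.startswith(prefix) and target.endswith(suffix):
        return target[len(prefix):len(target) - len(suffix)]
    return None


def expand(target_pattern, dependency_patterns, target):
    """Fill the stem into each dependency pattern, or return None on no match."""
    stem = match_pattern(target_pattern, target)
    if stem is None:
        return None
    # Dependencies without a % (such as the script) are left unchanged.
    return [dep.replace("%", stem) for dep in dependency_patterns]
```

For example, expand("results/%.dat", ["data/%.txt", "scripts/wordcount.py"], "results/abyss.dat") gives the same dependencies that the explicit abyss rule spells out by hand, which is the repetition a pattern rule removes.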

As for variables in a Makefile, in most cases we actually do not want to use them. The reason is that we want this file to be a record of what we did to run our analysis (e.g., what files were used, what settings were used, etc.). If you set the values of variables outside your Makefile, then you are shifting the problem of recording how your analysis was done to another file. There needs to be some file in your repo that captures the values the variables were set to so that you can replicate your analysis. Examples of using variables in a Makefile are here: http://swcarpentry.github.io/make-novice/06-variables/index.html