24 Data analysis pipelines with GNU Make
Learning Objectives
- Write a simple automated analysis pipeline using a workflow tool (e.g., GNU Make)
- Discuss the advantage of using software that has a dependency graph for analysis pipelines compared to software that does not.
24.1 GNU Make as a data analysis pipeline tool
We previously built a data analysis pipeline by using a shell script (we called it run_all.sh) to piece together and create a record of all the scripts and arguments we used in our analysis. That is a step in the right direction, but there were a few unsatisfactory things about this strategy:
- It takes time to manually erase all the intermediate and final files generated by the analysis in order to test that everything works from top to bottom
- It runs every step every time. This can be problematic if some steps take a long time and you have only changed other, smaller parts of the analysis
Thus, to improve on this we are going to use the build and automation tool, Make, to make a smarter data analysis pipeline.
24.2 Makefile Structure
Each block of code in a Makefile is called a rule. It looks something like this:
file_to_create.png : data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create.png
- file_to_create.png is a target, a file to be created, or built.
- data_it_depends_on.dat and script_it_depends_on.py are dependencies, files which are needed to build or update the target. Targets can have zero or more dependencies.
- : separates targets from dependencies.
- python script_it_depends_on.py data_it_depends_on.dat file_to_create.png is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions. Actions are indented using the TAB character, not 8 spaces.
- Together, the target, dependencies, and actions form a rule.
24.3 Structure if you have multiple targets from a script
file_to_create_1.png file_to_create_2.png : data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create
Adapted from Software Carpentry.
Set-up instructions
Click the green “Use this template” button from this GitHub repository to obtain a copy of it for yourself (do not fork it).
Clone this repository to your computer.
Good reference: http://swcarpentry.github.io/make-novice/reference
Create a file, called Makefile, with the following content:
# Count words.
results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/isles.txt \
		--output_file=results/isles.dat
This is a simple build file, which for GNU Make is called a Makefile: a file executed by GNU Make. Let us go through each line in turn:
- # denotes a comment. Any text from # to the end of the line is ignored by Make.
- results/isles.dat is a target, a file to be created, or built.
- data/isles.txt and scripts/wordcount.py are dependencies, files that are needed to build or update the target. Targets can have zero or more dependencies.
- : separates targets from dependencies.
- python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions.
- Actions are indented using the TAB character, not 8 spaces. This is a legacy of Make’s 1970s origins.
- Together, the target, dependencies, and actions form a rule.
Our rule above describes how to build the target results/isles.dat using the action python scripts/wordcount.py and the dependencies data/isles.txt and scripts/wordcount.py.
By default, Make looks for a file called Makefile, and we can run Make as follows:
$ make results/isles.dat
Make prints out the actions it executes:
python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat
If we see,
Makefile:3: *** missing separator. Stop.
then we have used spaces instead of a TAB character to indent one of our actions.
We don’t have to call our Makefile Makefile. However, if we call it something else we need to tell Make where to find it. We can do this using the -f flag. For example:
$ make -f Makefile results/isles.dat
As we have re-run our Makefile, Make now informs us that:
make: `results/isles.dat' is up to date.
This is because our target, results/isles.dat, has now been created, and Make will not create it again. To see how this works, let’s pretend to update one of the text files. Rather than opening the file in an editor, we can use the shell touch command to update its timestamp (which would happen if we did edit the file):
$ touch data/isles.txt
If we compare the timestamps of data/isles.txt and results/isles.dat,
$ ls -l data/isles.txt results/isles.dat
then we see that results/isles.dat, the target, is now older than data/isles.txt, its dependency:
-rw-r--r-- 1 mjj Administ 323972 Jun 12 10:35 data/isles.txt
-rw-r--r-- 1 mjj Administ 182273 Jun 12 09:58 results/isles.dat
If we run Make again,
$ make results/isles.dat
then it recreates results/isles.dat:
python scripts/wordcount.py --input_file=data/isles.txt --output_file=results/isles.dat
When it is asked to build a target, Make checks the ‘last modification time’ of both the target and its dependencies. If the target does not exist yet, or if any dependency has been updated since the target was last built, then the actions are re-run to update the target.
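The timestamp check described above can be observed directly. The following is a minimal sketch, not part of the course repository: it assumes GNU Make is installed, works in a throwaway directory, and uses cp as a stand-in for scripts/wordcount.py so that it runs without the real data. It relies on Make's -q (--question) option, which runs nothing and only reports through its exit status whether the target is up to date.

```shell
#!/bin/sh
# Sketch: watch Make's timestamp check using a stand-in pipeline.
# Assumption: `cp` replaces the real word-count script.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir data results
echo "some text" > data/isles.txt

# printf writes the rule; \t is the TAB that must start the action line.
printf 'results/isles.dat : data/isles.txt\n\tcp data/isles.txt results/isles.dat\n' > Makefile

make results/isles.dat    # first run: the target is missing, so the action runs

# -q runs no actions; exit status 0 means "up to date", non-zero means "needs rebuild".
if make -q results/isles.dat; then
    before_touch="up to date"
else
    before_touch="needs rebuild"
fi

sleep 1                   # make sure the new timestamp is clearly later
touch data/isles.txt      # pretend we edited the dependency

if make -q results/isles.dat; then
    after_touch="up to date"
else
    after_touch="needs rebuild"
fi

echo "before touch: $before_touch"
echo "after touch:  $after_touch"
```

Only the modification time of data/isles.txt changed between the two checks, not its content, which is all Make looks at when deciding whether to rebuild.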
We may want to remove all our data files so we can explicitly recreate them all. We can introduce a new target, and associated rule, clean:
# Count words.
results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/isles.txt \
		--output_file=results/isles.dat
clean :
	rm -f results/isles.dat
This is an example of a rule that has no dependencies. clean has no dependencies on any .dat file as it makes no sense to create these just to remove them. We just want to remove the data files whether or not they exist. If we run Make and specify this target,
$ make clean
then we get:
rm -f results/isles.dat
There is no actual thing built called clean. Rather, it is a short-hand that we can use to execute a useful sequence of actions.
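Because a clean rule deletes files, it can be reassuring to preview it first. The sketch below assumes GNU Make is installed and uses a placeholder file in a throwaway directory rather than real results. It uses Make's -n (--dry-run) option, which prints the actions a target would run without executing them.

```shell
#!/bin/sh
# Sketch: preview a clean rule with a dry run before running it for real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir results
echo "placeholder" > results/isles.dat

printf 'clean :\n\trm -f results/isles.dat\n' > Makefile

# -n prints the actions without running them.
dry_run_output=$(make -n clean)
echo "dry run would execute: $dry_run_output"

if [ -f results/isles.dat ]; then
    echo "results/isles.dat still exists after the dry run"
fi

make clean                # now the action really runs

if [ ! -f results/isles.dat ]; then
    echo "results/isles.dat removed by make clean"
fi
```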
Let’s add another rule to the end of Makefile:
results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/isles.txt \
		--output_file=results/isles.dat
results/figure/isles.png : results/isles.dat scripts/plotcount.py
	python scripts/plotcount.py \
		--input_file=results/isles.dat \
		--output_file=results/figure/isles.png
clean :
	rm -f results/isles.dat
	rm -f results/figure/isles.png
The new target results/figure/isles.png depends on the target results/isles.dat. So to make both, we can simply type:
$ make results/figure/isles.png
$ ls
Let’s add another book:
results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/isles.txt \
		--output_file=results/isles.dat
results/abyss.dat : data/abyss.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/abyss.txt \
		--output_file=results/abyss.dat
results/figure/isles.png : results/isles.dat scripts/plotcount.py
	python scripts/plotcount.py \
		--input_file=results/isles.dat \
		--output_file=results/figure/isles.png
results/figure/abyss.png : results/abyss.dat scripts/plotcount.py
	python scripts/plotcount.py \
		--input_file=results/abyss.dat \
		--output_file=results/figure/abyss.png
clean :
	rm -f results/isles.dat \
		results/abyss.dat
	rm -f results/figure/isles.png \
		results/figure/abyss.png
To build both figures, we need to run Make once for each:
$ make results/figure/isles.png
$ make results/figure/abyss.png
Or we can add a target, all, at the very top of the Makefile that lists the final outputs as its dependencies. Because Make builds the first target in the Makefile when it is run without arguments, typing make (or make all) then builds everything:
all: results/figure/isles.png results/figure/abyss.png
24.4 Finish off the Makefile!
Since we will also combine the figures into a report in the end, we will change our all target to depend on the rendered report file, and add a rule for the rendered report file at the end:
# Makefile
# Tiffany Timbers, Nov 2018
# This driver script completes the textual analysis of
# 4 novels and creates figures on the 10 most frequently
# occurring words from each of the 4 novels. This script
# takes no arguments.
# example usage:
# make all
all : report/count_report.html
# count the words
results/isles.dat : data/isles.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/isles.txt \
		--output_file=results/isles.dat
results/abyss.dat : data/abyss.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/abyss.txt \
		--output_file=results/abyss.dat
results/last.dat : data/last.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/last.txt \
		--output_file=results/last.dat
results/sierra.dat : data/sierra.txt scripts/wordcount.py
	python scripts/wordcount.py \
		--input_file=data/sierra.txt \
		--output_file=results/sierra.dat
# create the plots
results/figure/isles.png : results/isles.dat scripts/plotcount.py
	python scripts/plotcount.py \
		--input_file=results/isles.dat \
		--output_file=results/figure/isles.png
results/figure/abyss.png : results/abyss.dat scripts/plotcount.py
	python scripts/plotcount.py \
		--input_file=results/abyss.dat \
		--output_file=results/figure/abyss.png
results/figure/last.png : results/last.dat scripts/plotcount.py
	python scripts/plotcount.py \
		--input_file=results/last.dat \
		--output_file=results/figure/last.png
results/figure/sierra.png : results/sierra.dat scripts/plotcount.py
	python scripts/plotcount.py \
		--input_file=results/sierra.dat \
		--output_file=results/figure/sierra.png
# write the report
report/count_report.html : report/count_report.qmd \
	results/figure/isles.png \
	results/figure/abyss.png \
	results/figure/last.png \
	results/figure/sierra.png
	quarto render report/count_report.qmd
clean :
	rm -f results/isles.dat \
		results/abyss.dat \
		results/last.dat \
		results/sierra.dat
	rm -f results/figure/isles.png \
		results/figure/abyss.png \
		results/figure/last.png \
		results/figure/sierra.png
	rm -rf report/count_report.html
24.5 PHONY Targets
So far our Makefile contains multiple targets. Some of these targets are the names of files, and others, e.g., clean and all, are not the name of any particular file. A target that is really just a name for a recipe is known as a phony target.
The main reason to declare phony targets is to avoid a conflict with a file of the same name: if a file called clean ever exists in the project directory, Make considers the clean target up to date and never runs its actions. To create a phony target we write .PHONY at the top of our Makefile, followed by a colon and the space-separated target names. By convention we list all the phony targets in one place at the top of the file. Sometimes you may see multiple .PHONY: declarations, each right above the recipe for a phony target. Pay attention to how it is spelled: .PHONY.
To declare our clean and all targets as phony targets, we would put this line at the top of our Makefile:
.PHONY: all clean
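The name conflict that .PHONY protects against can be reproduced in a few lines. This is a sketch under stated assumptions: GNU Make is installed, and the stray file named clean and the placeholder data file are created only for the demonstration.

```shell
#!/bin/sh
# Sketch: a file named `clean` blocks the clean rule unless the target is declared phony.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir results
touch results/isles.dat
touch clean               # a stray file that happens to share the target's name

# Without .PHONY: Make finds the file `clean`, sees no dependencies,
# decides the target is up to date, and skips the action.
printf 'clean :\n\trm -f results/isles.dat\n' > Makefile
make clean
if [ -f results/isles.dat ]; then
    without_phony="data file kept"
else
    without_phony="data file removed"
fi

# With .PHONY: the action runs regardless of the file named `clean`.
printf '.PHONY: clean\nclean :\n\trm -f results/isles.dat\n' > Makefile
make clean
if [ -f results/isles.dat ]; then
    with_phony="data file kept"
else
    with_phony="data file removed"
fi

echo "without .PHONY: $without_phony"
echo "with .PHONY:    $with_phony"
```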
24.6 Improving the Makefile
Adding PHONY targets to run parts of the analysis (e.g., just counting the words, or just making the figures) can be very useful when iterating and refining a particular part of your analysis. Similarly, adding other PHONY targets to clean up parts of the analysis can help with this as well.
In the version of the Makefile below we do just that, creating the following PHONY targets:
- dats (counts words and saves to .dat files)
- figs (creates figures)
- clean-dats (cleans up .dat files)
- clean-figs (cleans up figure files)
- clean-all (cleans up the .dat files, the figures, and the rendered report)
# Makefile
# Tiffany Timbers, Nov 2018
# This driver script completes the textual analysis of
# 4 novels and creates figures on the 10 most frequently
# occurring words from each of the 4 novels. This script
# takes no arguments.
# example usage:
# make all
.PHONY: all dats figs clean-dats clean-figs clean-all
# run entire analysis
all: report/count_report.html
# count words
dats: results/isles.dat \
	results/abyss.dat \
	results/last.dat \
	results/sierra.dat
results/isles.dat : scripts/wordcount.py data/isles.txt
	python scripts/wordcount.py \
		--input_file=data/isles.txt \
		--output_file=results/isles.dat
results/abyss.dat : scripts/wordcount.py data/abyss.txt
	python scripts/wordcount.py \
		--input_file=data/abyss.txt \
		--output_file=results/abyss.dat
results/last.dat : scripts/wordcount.py data/last.txt
	python scripts/wordcount.py \
		--input_file=data/last.txt \
		--output_file=results/last.dat
results/sierra.dat : scripts/wordcount.py data/sierra.txt
	python scripts/wordcount.py \
		--input_file=data/sierra.txt \
		--output_file=results/sierra.dat
# plot
figs : results/figure/isles.png \
	results/figure/abyss.png \
	results/figure/last.png \
	results/figure/sierra.png
results/figure/isles.png : scripts/plotcount.py results/isles.dat
	python scripts/plotcount.py \
		--input_file=results/isles.dat \
		--output_file=results/figure/isles.png
results/figure/abyss.png : scripts/plotcount.py results/abyss.dat
	python scripts/plotcount.py \
		--input_file=results/abyss.dat \
		--output_file=results/figure/abyss.png
results/figure/last.png : scripts/plotcount.py results/last.dat
	python scripts/plotcount.py \
		--input_file=results/last.dat \
		--output_file=results/figure/last.png
results/figure/sierra.png : scripts/plotcount.py results/sierra.dat
	python scripts/plotcount.py \
		--input_file=results/sierra.dat \
		--output_file=results/figure/sierra.png
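# write the report
# (the figure files are listed directly: a phony dependency such as figs
# would make Make re-render the report on every run)
report/count_report.html : report/count_report.qmd \
	results/figure/isles.png \
	results/figure/abyss.png \
	results/figure/last.png \
	results/figure/sierra.png
	quarto render report/count_report.qmd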
clean-dats :
	rm -f results/isles.dat \
		results/abyss.dat \
		results/last.dat \
		results/sierra.dat
clean-figs :
	rm -f results/figure/isles.png \
		results/figure/abyss.png \
		results/figure/last.png \
		results/figure/sierra.png
clean-all : clean-dats \
	clean-figs
	rm -f report/count_report.html
	rm -rf report/count_report_files
24.7 Pattern matching and variables in a Makefile
It is possible to reduce the repetition in a Makefile (to make it DRY: Don't Repeat Yourself) by using pattern matching and variables.
Using wildcards and pattern matching in a Makefile is possible, but the syntax is not very readable, so if you choose to do this, proceed with caution. Examples of how to do this are here: http://swcarpentry.github.io/make-novice/05-patterns/index.html
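To give a sense of what that syntax looks like, here is a minimal sketch of a single pattern rule standing in for the per-book word-count rules. It assumes GNU Make is installed, runs in a throwaway directory with two tiny made-up text files, and uses wc -w as a stand-in for scripts/wordcount.py. In the rule, % matches the book name, $< expands to the first dependency, and $@ expands to the target.

```shell
#!/bin/sh
# Sketch: one pattern rule replaces one hand-written rule per book.
# Assumption: `wc -w` stands in for the real word-count script.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir data results
echo "isles text" > data/isles.txt
echo "abyss text here" > data/abyss.txt

# In the printf format string, %% produces a literal % and \t the leading TAB.
printf 'results/%%.dat : data/%%.txt\n\twc -w < $< > $@\n' > Makefile

echo "--- generated Makefile ---"
cat Makefile

# The same rule builds any results/<book>.dat for which data/<book>.txt exists.
make results/isles.dat results/abyss.dat

echo "--- word counts ---"
cat results/isles.dat results/abyss.dat
```

The rule is shorter than the explicit versions, but a reader has to decode %, $< and $@ to see which files and commands were actually involved, which is the readability cost mentioned above.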
As for variables in a Makefile, in most cases we actually do not want to use them. The reason is that we want this file to be a record of what we did to run our analysis (e.g., what files were used, what settings were used, etc.). If you start using variables in your Makefile, then you are shifting the problem of recording how your analysis was done to another file. There needs to be some file in your repo that captures what values the variables were set to so that you can replicate your analysis. Examples of using variables in a Makefile are here: http://swcarpentry.github.io/make-novice/06-variables/index.html
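The record-keeping concern can be made concrete with a small sketch. It assumes GNU Make is installed; TOP_N and settings.txt are hypothetical names invented for this illustration and are not used by the chapter's scripts. A variable assigned in the Makefile can be overridden on the command line, so the value that was actually used for a run may not appear in the Makefile at all.

```shell
#!/bin/sh
# Sketch: a Makefile variable can be overridden from the command line,
# so the Makefile alone no longer records what was run.
# Assumption: TOP_N and settings.txt are made-up names for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'TOP_N = 10\nsettings.txt :\n\techo "top_n=$(TOP_N)" > settings.txt\n' > Makefile

make settings.txt
from_makefile=$(cat settings.txt)

rm -f settings.txt               # remove the target so Make rebuilds it
make TOP_N=5 settings.txt        # command-line value wins over the Makefile's
from_command_line=$(cat settings.txt)

echo "value recorded in the Makefile: $from_makefile"
echo "value actually used this time:  $from_command_line"
```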