Non-interactive scripts and data analysis pipelines

Topic learning objectives

By the end of this topic, students should be able to:

  1. Explain when it is optimal to work in a read-eval-print-loop (REPL) framework and when it is optimal to shift to using non-interactive scripts.

  2. Create simple scripts in R that can take input and be executed from the command line.

  3. Decide when to move from passing variables into a script via command line arguments to passing them in via a configuration file, and create scripts that can read configuration files.

  4. Justify why analysis pipelines should be used over a single long script in data analysis projects, specifically in regards to how this affects reproducibility, maintainability and future derivatives of the work.

  5. Write a simple automated analysis pipeline using a workflow tool (e.g., GNU Make).

  6. Discuss the advantage of using software that has a dependency graph for analysis pipelines compared to software that does not.

Read-eval-print-loop (REPL) framework (i.e., interactive mode) versus Scripts

  • Up until now, we have primarily been using R and Python in a read-eval-print-loop (REPL) framework (i.e., interactive mode)

  • The REPL framework (interactive mode) is what we use when we run our code in the R/Python console, or in cells/chunks in RStudio/Jupyter notebooks

  • A REPL framework (interactive mode) is very useful for:

    • solving small problems

    • developing code that will be knit to an analytic report

    • developing code that will be run as a script (i.e., in “batch” mode)

What is a script?

An R/Python script is simply a plain text file containing (almost) the same commands that you would enter into R/Python’s console or in cells/chunks in RStudio/Jupyter notebooks. We often run these from top to bottom from the command line/Unix shell.

Why write scripts?

  • Efficiency!

  • Automation!

  • Reusable!

  • Record of what you have done!

Scripts in R

Let’s start with a small, simple example to demonstrate how we write and run scripts in R (it is very similar in Python and we will get to this later in the lesson).

Our script will be called print_mean_hp.R, and it will calculate the mean horsepower of the cars from the built-in R data frame mtcars.

We will develop this script inside RStudio, make sure it works, and then run it from the command line/terminal/Git bash.

Our first R script:

# author: Tiffany Timbers
# date: 2020-01-15
#
# This script calculates the mean horsepower of the cars from the built-in 
# R data frame `mtcars`. This script takes no arguments.
#
# Usage: Rscript print_mean_hp.R

mean_hp <- mean(mtcars$hp)
print(mean_hp)

Running our first R script

To run our R script, we need to open the command line/terminal/Git bash, and either navigate to the directory that houses the script OR point to it when we call it. We will do the former.

Then to run the R script, we use the Rscript command, followed by the name/path to the file:

Rscript print_mean_hp.R

The output should be:

[1] 146.6875

A couple notes about scripts

  • If you want something to be output to the command line/terminal/Git bash, you should explicitly ask for it to be printed. This is not an absolute requirement in R, but it is in Python!

  • Similarly for figures: they need to be saved! You will never see a figure created by a script unless you write it to a file (see the sketch after this list).

  • From a reproducibility perspective, if we want input from the user, we usually design scripts to take command line arguments rather than use keyboard/user prompts.
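For example, here is a small sketch of the first two points (the use of ggplot2 and the file name are just for illustration):

library(ggplot2)

# print values explicitly so they appear in the terminal when the script is run with Rscript
mean_hp <- mean(mtcars$hp)
print(mean_hp)

# a figure created in a script is never displayed; it only exists once it is written to a file
hp_histogram <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(bins = 10)
ggsave("hp_histogram.png", hp_histogram, width = 5, height = 4)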

Script structure and organization

Although not required by R or Python, it is good practice to organize the code in your script into related sections; this keeps your code readable and organized. Below we outline how we typically organize R scripts:

Example R script organization:

# documentation comments

# load libraries/packages or source functions from other scripts

# parse/define command line arguments here

# code for "guts" of script goes here

A second example of an R script:

Here we write a script called quick_titanic_fare_mean.R which reads in the titanic dataset (original source: https://biostat.app.vumc.org/wiki/Main/DataSets) and calculates the mean for the fare (ticket price) variable.

# author: Tiffany Timbers
# date: 2020-01-15
#
# This script calculates the mean for the fare (ticket price)
# from titanic.csv. This script takes no arguments.
#
# Usage: Rscript quick_titanic_fare_mean.R

library(tidyverse)

data <- read_csv('data/titanic.csv')

out <- data %>%
         pull(fare) %>%
         mean(na.rm = TRUE)

print(out)

Wait! Other languages often use a main function (so that you can, for example, import and reuse parts of the code in other code documents) - why are we not doing this here? I truly do not know the answer (yet) to why not; however, it is not commonly done in R:

[Figure: Twitter poll on whether R users wrap their script code in a main() function]

Perhaps this is due to the historical origins and culture of the R programming language and its user base? A follow-up question is: can you do this if you want to? That question I can answer with a yes; see below for an example:

main <- function() {
  # script code
}


if (!interactive()) {
  main()
}

Saving things from scripts

Above we just printed the mean to the terminal. That was done because the purpose of that script was to give a very simple illustration of how to create and run scripts in R. However, in practice, we typically want to save our analysis artifacts (figures, tables, values, etc.) to disk so that we can load them into other files (e.g., our final reports, to communicate our analysis findings).

Below we show an example of how we would use readr::write_csv to save the mean value we calculated to a .csv file:

# author: Tiffany Timbers
# date: 2020-01-15
#
# This script calculates the mean horsepower of the cars from the built-in 
# R data frame `mtcars` and saves it to `results/mean_hp_col.csv`. 
# This script takes no arguments.
#
# Usage: Rscript print_mean_hp.R

library(readr)

mean_hp <- mean(mtcars$hp)
mean_hp <- data.frame(value = mean_hp)
write_csv(mean_hp, "results/mean_hp_col.csv")

Note: in this script we are saving the file to the results directory, so a results directory needs to exist before this script will work.
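One way to guard against this (a small sketch, not part of the original script) is to create the directory at the top of the script if it does not already exist:

# create the results directory if it does not already exist
if (!dir.exists("results")) {
  dir.create("results")
}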

Using command line arguments in R

Let’s make our script more flexible by letting the user specify which column to calculate the mean for when the script is called.

To do this, we use the docopt R package. This will allow us to collect the text we enter at the command line when we call the script, and make it available to us when we run the script.

When we run docopt, it takes the text we entered at the command line and gives it to us as a named list of the values provided after the script name. The names of the items in the list come from the documentation. Whitespace at the command line is used to split the text into the separate items in the list.

# author: Tiffany Timbers
# date: 2020-01-15

"This script calculates the mean for a specified column
from titanic.csv.

Usage: quick_titanic_col_mean.R <col>
" -> doc


library(tidyverse)
library(docopt)

opt <- docopt(doc)

data <- read_csv('data/titanic.csv')

out <- data %>%
         pull(!!opt$col) %>%
         mean(na.rm = TRUE)

print(out)

Note: we use !! in front of opt$col because all command line arguments are passed into R as strings, and are thus quoted. However, pull is a tidyverse function that expects an unquoted column name of a data frame; !! does this unquoting. This is similar to {{ that we saw before with functions (which quotes and then unquotes values when they are passed into functions). Here, however, we use !! because there is no indirection and we just need to perform the unquoting.
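As an aside, if the unquoting feels opaque, a base R version of the same calculation (assuming the same data and opt objects as above) can index the column with [[ ]], which accepts the quoted column name directly:

# base R alternative: [[ ]] selects a column using the string stored in opt$col
out <- mean(data[[opt$col]], na.rm = TRUE)
print(out)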

And we would run a script like this from the command line as follows:

Rscript src/quick_titanic_col_mean.R fare

Let’s make our script even more flexible and let the user specify the dataset as well (we could then use it more generally on other files, such as the Gapminder .csv’s we saw in Block 1).

# author: Tiffany Timbers
# date: 2020-01-15

"This script calculates the mean for any numerical vector
from a csv file. This script takes an unquoted column name and a data file path.

Usage: quick_csv_col_mean.R <file_path> <col>
" -> doc


library(tidyverse)
library(docopt)

opt <- docopt(doc)

data <- read_csv(opt$file_path)

out <- data %>%
         pull(!!opt$col) %>%
         mean(na.rm = TRUE)

print(out)

Now we would run a script like this from the command line as follows:

Rscript src/quick_csv_col_mean.R data/titanic.csv fare

Positional arguments vs options

In the examples above, we used docopt to specify positional arguments. This means that the order matters! If we change the order of the values of the arguments at the command line, our script will likely throw an error, because it will try to perform the wrong operations on the wrong values.

Another downside to positional arguments is that, without good documentation, they can be less readable, and the call to the script itself is certainly less readable. We can instead give the arguments names using the --ARGUMENT_NAME syntax. We call these “options”. Below is the same script, but specified using options as opposed to positional arguments:

# author: Tiffany Timbers
# date: 2020-01-15

"This script calculates the mean for any numerical vector
from a csv file. This script takes an unquoted column name and a data file path.

Usage: quick_csv_col_mean.R --file_path=<file_path> --col=<col>
" -> doc


library(tidyverse)
library(docopt)

opt <- docopt(doc)

data <- read_csv(opt$file_path)

out <- data %>%
         pull(!!opt$col) %>%
         mean(na.rm = TRUE)

print(out)

And we would run a script that uses options like this:

Rscript src/quick_csv_col_mean.R --file_path=data/titanic.csv --col=fare

or like this:

Rscript src/quick_csv_col_mean.R --col=fare --file_path=data/titanic.csv

because we gave the arguments names, and thus their position no longer matters!

Optional elements (arguments or options)

If you would like an argument or option to be “optional” (i.e., not required!), enclose it in square brackets (“[ ]”) in the documentation.
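For example, a hypothetical variant of the script above could make --col optional and fall back to a default column when it is not supplied (the default of fare is just for illustration):

"This script calculates the mean for a numerical column
from a csv file. The column defaults to fare if --col is not given.

Usage: quick_csv_col_mean.R <file_path> [--col=<col>]
" -> doc

library(docopt)

opt <- docopt(doc)

# docopt returns NULL for an optional option that was not supplied
if (is.null(opt$col)) {
  opt$col <- "fare"
}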

Improving our docs

We can improve our docs further by adding an “Options” section with {docopt}:

# author: Tiffany Timbers
# date: 2020-01-15

"This script calculates the mean for any numerical vector
from a csv file. This script takes an unquoted column name and a data file path.

Usage: quick_csv_col_mean.R --file_path=<file_path> --col=<col>

Options:
--file_path=<file_path>   Path to the data file
--col=<col>               Unquoted column name of the numerical vector for which to calculate the mean
" -> doc


library(tidyverse)
library(docopt)

opt <- docopt(doc)

data <- read_csv(opt$file_path)

out <- data %>%
         pull(!!opt$col) %>%
         mean(na.rm = TRUE)

print(out)

The user can then see these docs by calling the script from the command line with the --help flag (e.g., Rscript src/quick_csv_col_mean.R --help), which prints this information:

This script calculates the mean for any numerical vector
from a csv file. This script takes an unquoted column name and a data file path.

Usage: quick_csv_col_mean.R --file_path=<file_path> --col=<col>

Options:
--file_path=<file_path>   Path to the data file
--col=<col>               Unquoted column name of the numerical vector for which to calculate the mean

Some tips for RStudio IDE:

  • To indent a block of text, highlight and use tab

  • To fix indenting in general to R code standards, use Command/Ctrl + I

  • To get multiple cursors, hold alt/option and highlight lines using cursor

  • To get multiple cursors to the beginning of the line, use control A

  • To get multiple cursors to the end of the line, use control E

Data analysis pipelines

  • As analysis grows in length and complexity, one literate code document generally is not enough

  • To improve code report readability (and code reproducibility and modularity) it is better to abstract at least parts of the code away (e.g., to scripts)

  • These scripts save figures and tables that will be imported into the final report (see the sketch below)

[Figure: an analysis split into scripts whose saved figures and tables are imported into the final report]
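As a sketch of how a final report can pull in artifacts saved by scripts (the file paths here are just illustrative), a code chunk in an R Markdown report might do something like this:

library(knitr)
library(readr)

# embed a figure that a script wrote to disk
include_graphics("results/figure/isles.png")

# load and display a table/value that a script wrote to disk
summary_table <- read_csv("results/mean_hp_col.csv")
kable(summary_table)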

Demo: building a Data Analysis pipeline using a Shell script, a tutorial

adapted from Software Carpentry

To illustrate how to make a data analysis pipeline using a shell script to drive other scripts, we will work through an example together. Here are some set-up instructions so that we can do this together:

Set-up instructions

Let’s do some analysis!

Suppose we have a script, wordcount.py, that reads in a text file, counts the words in this text file, and outputs a data file:

$ python src/wordcount.py data/isles.txt results/isles.dat

If we view the first 5 rows of the data file using head,

$ head -5 results/isles.dat

we can see that the file consists of one row per word. Each row shows the word itself, the number of occurrences of that word, and the number of occurrences as a percentage of the total number of words in the text file.

the 3822 6.7371760973
of 2460 4.33632998414
and 1723 3.03719372466
to 1479 2.60708619778
a 1308 2.30565838181

Suppose we also have a script, plotcount.py, that reads in a data file and saves a plot of the 10 most frequently occurring words:

$ python src/plotcount.py results/isles.dat results/figure/isles.png

Together these scripts implement a data analysis pipeline:

  1. Read a data file.

  2. Perform an analysis on this data file.

  3. Write the analysis results to a new file.

  4. Plot a graph of the analysis results.

To document how we’d like the analysis to be run, so we (and others) have a record and can replicate it, we will build a shell script called run_all.sh. Let’s work to try to build this pipeline so it does all that!

# run_all.sh
# Tiffany Timbers, Nov 2017
#
# This driver script completes the textual analysis of
# one novel and creates a figure of the 10 most frequently
# occurring words in that novel. This script
# takes no arguments.
#
# Usage: bash run_all.sh

# perform wordcount on novel
python src/wordcount.py data/isles.txt results/isles.dat

# create plots
python src/plotcount.py results/isles.dat results/figure/isles.png

We actually have 4 books that we want to analyze, and then put the figures in a report.

  1. Read a data file.

  2. Perform an analysis on this data file.

  3. Write the analysis results to a new file.

  4. Plot a graph of the analysis results.

  5. Save the graph as an image, so we can put it in a paper.

Let’s extend our shell script to do that!

# run_all.sh
# Tiffany Timbers, Nov 2018

# This driver script completes the textual analysis of
# 4 novels and creates figures of the 10 most frequently
# occurring words from each of the 4 novels. This script
# takes no arguments.

# example usage:
# bash run_all.sh

# count the words
python src/wordcount.py data/isles.txt results/isles.dat
python src/wordcount.py data/abyss.txt results/abyss.dat
python src/wordcount.py data/last.txt results/last.dat
python src/wordcount.py data/sierra.txt results/sierra.dat

# create the plots
python src/plotcount.py results/isles.dat results/figure/isles.png
python src/plotcount.py results/abyss.dat results/figure/abyss.png
python src/plotcount.py results/last.dat results/figure/last.png
python src/plotcount.py results/sierra.dat results/figure/sierra.png

# write the report
Rscript -e "rmarkdown::render('doc/count_report.Rmd')"

Another example:

From the breast cancer prediction example analysis repo, here is a data analysis pipeline using a shell script:

# run_all.sh
# Tiffany Timbers, Jan 2020

# download data
python src/download_data.py --out_type=feather --url=https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.data --out_file=data/raw/wdbc.feather

# run eda report
Rscript -e "rmarkdown::render('src/breast_cancer_eda.Rmd')"

# pre-process data 
Rscript src/pre_process_wisc.r --input=data/raw/wdbc.feather --out_dir=data/processed 

# create exploratory data analysis figure and write to file 
Rscript src/eda_wisc.r --train=data/processed/training.feather --out_dir=results

# tune model
Rscript src/fit_breast_cancer_predict_model.r --train=data/processed/training.feather --out_dir=results

# test model
Rscript src/breast_cancer_test_results.r --test=data/processed/test.feather --out_dir=results

# render final report
Rscript -e "rmarkdown::render('doc/breast_cancer_predict_report.Rmd', output_format = 'github_document')"

Discussion

  • What are the advantages to using a data analysis pipeline?

  • How “good” is a shell script as a data analysis pipeline? What might not be ideal about this?

GNU Make as a data analysis pipeline tool

We previously built a data analysis pipeline by using a shell script (we called it run_all.sh) to piece together and create a record of all the scripts and arguments we used in our analysis. That is a step in the right direction, but there are a few unsatisfactory things about this strategy:

  1. It takes time to manually erase all of the intermediate and final files generated by the analysis in order to do a complete test that everything works from top to bottom

  2. It runs every step every time. This can be problematic if some steps take a long time and you have only changed other, smaller parts of the analysis

Thus, to improve on this we are going to use the build and automation tool, Make, to make a smarter data analysis pipeline.

Makefile Structure

Each block of code in a Makefile is called a rule; it looks something like this:

file_to_create.png : data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create.png

  • file_to_create.png is a target, a file to be created, or built.

  • data_it_depends_on.dat and script_it_depends_on.py are dependencies, files which are needed to build or update the target. Targets can have zero or more dependencies.

  • : separates targets from dependencies.

  • python script_it_depends_on.py data_it_depends_on.dat file_to_create.png is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions. Actions are indented using the TAB character, not 8 spaces.

  • Together, the target, dependencies, and actions form a rule.

Structure if you have multiple targets from a script

file_to_create_1.png file_to_create_2.png : data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create

Demo: building a Data Analysis pipeline using Make, a tutorial

adapted from Software Carpentry

Set-up instructions

  • Download data_analysis_pipeline_eg-2.0.zip

  • Unzip it and change into the data_analysis_pipeline_eg-2.0 directory.

  • note - the tutorial in lecture 4 is a prerequisite

Good reference: http://swcarpentry.github.io/make-novice/reference

Create a file, called Makefile, with the following content:

# Count words.
results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat

This is a simple build file, which for Make is called a Makefile - a file executed by Make. Let us go through each line in turn:

  • # denotes a comment. Any text from # to the end of the line is ignored by Make.

  • results/isles.dat is a target, a file to be created, or built.

  • data/isles.txt and src/wordcount.py are dependencies, files that are needed to build or update the target. Targets can have zero or more dependencies.

  • : separates targets from dependencies.

  • python src/wordcount.py data/isles.txt results/isles.dat is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions.

  • Actions are indented using the TAB character, not 8 spaces. This is a legacy of Make’s 1970’s origins.

  • Together, the target, dependencies, and actions form a rule.

Our rule above describes how to build the target results/isles.dat using the action python src/wordcount.py data/isles.txt results/isles.dat and the dependencies data/isles.txt and src/wordcount.py.

By default, Make looks for a build file called Makefile, and we can run Make as follows:

$ make

Make prints out the actions it executes:

python src/wordcount.py data/isles.txt results/isles.dat

If we see,

Makefile:3: *** missing separator.  Stop.

then we have used a space instead of a TAB character to indent one of our actions.

We don’t have to call our Makefile Makefile. However, if we call it something else, we need to tell Make where to find it. We can do this using the -f flag. For example:

$ make -f Makefile

Because we have just re-run Make, it now informs us that:

make: `results/isles.dat' is up to date.

This is because our target, results/isles.dat, has now been created, and Make will not create it again. To see how this works, let’s pretend to update one of the text files. Rather than opening the file in an editor, we can use the shell touch command to update its timestamp (which would happen if we did edit the file):

$ touch data/isles.txt

If we compare the timestamps of data/isles.txt and results/isles.dat,

$ ls -l data/isles.txt results/isles.dat

then we see that results/isles.dat, the target, is now older than data/isles.txt, its dependency:

-rw-r--r--    1 mjj      Administ   323972 Jun 12 10:35 data/isles.txt
-rw-r--r--    1 mjj      Administ   182273 Jun 12 09:58 results/isles.dat

If we run Make again,

$ make

then it recreates results/isles.dat:

python src/wordcount.py data/isles.txt results/isles.dat

When it is asked to build a target, Make checks the ‘last modification time’ of both the target and its dependencies. If any dependency has been modified more recently than the target, then the actions are re-run to update the target.

We may want to remove all our data files so we can explicitly recreate them all. We can introduce a new target, and associated rule, clean:

results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat

clean :
	rm -f results/*.dat

This is an example of a rule that has no dependencies. clean has no dependencies on any .dat file as it makes no sense to create these just to remove them. We just want to remove the data files whether or not they exist. If we run Make and specify this target,

$ make clean

then we get:

rm -f results/*.dat

There is no actual thing built called clean. Rather, it is a short-hand that we can use to execute a useful sequence of actions. Such targets, though very useful, can lead to problems. For example, let us recreate our data files, create a directory called clean, then run Make:

$ make results/isles.dat
$ mkdir clean
$ make clean

We get:

make: `clean' is up to date.

Make finds a directory called clean that already exists, and since the clean target has no dependencies that are newer, it concludes there is nothing to do. The standard fix is to tell Make that clean is not a real file by declaring it a phony target (adding the line .PHONY: clean to the Makefile).

Let’s add another rule to the end of Makefile:

results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat

results/figure/isles.png : results/isles.dat src/plotcount.py
	python src/plotcount.py results/isles.dat results/figure/isles.png

clean :
	rm -f results/*.dat
	rm -f results/figure/*.png

the new target results/figure/isles.png depends on the target results/isles.dat. So to make both, we can simply type:

$ make results/figure/isles.png
$ ls results results/figure

Let’s add another book:

results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat

results/abyss.dat : data/abyss.txt src/wordcount.py
	python src/wordcount.py data/abyss.txt results/abyss.dat

results/figure/isles.png : results/isles.dat src/plotcount.py
	python src/plotcount.py results/isles.dat results/figure/isles.png

results/figure/abyss.png : results/abyss.dat src/plotcount.py
	python src/plotcount.py results/abyss.dat results/figure/abyss.png

clean :
	rm -f results/*.dat
	rm -f results/figure/*.png

To run all of the commands, we need to type a make command for each final target:

$ make results/figure/isles.png
$ make results/figure/abyss.png

OR we can add a target all which depends on the final targets (the figures), so that a single call to make all builds everything.

all: results/figure/isles.png results/figure/abyss.png

# count words
results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat
	
results/abyss.dat : data/abyss.txt src/wordcount.py
	python src/wordcount.py data/abyss.txt results/abyss.dat

# plot word count
results/figure/isles.png : results/isles.dat src/plotcount.py
	python src/plotcount.py results/isles.dat results/figure/isles.png

results/figure/abyss.png : results/abyss.dat src/plotcount.py
	python src/plotcount.py results/abyss.dat results/figure/abyss.png

clean :
	rm -f results/*.dat
	rm -f results/figure/*.png

Finish off the Makefile!

# Makefile
# Tiffany Timbers, Nov 2018

# This driver script completes the textual analysis of
# 4 novels and creates figures of the 10 most frequently
# occurring words from each of the 4 novels. This script
# takes no arguments.

# example usage:
# make all

all : doc/count_report.md

# count the words
results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat
results/abyss.dat : data/abyss.txt src/wordcount.py
	python src/wordcount.py data/abyss.txt results/abyss.dat
results/last.dat : data/last.txt src/wordcount.py
	python src/wordcount.py data/last.txt results/last.dat
results/sierra.dat : data/sierra.txt src/wordcount.py
	python src/wordcount.py data/sierra.txt results/sierra.dat
	
# create the plots
results/figure/isles.png : results/isles.dat src/plotcount.py
	python src/plotcount.py results/isles.dat results/figure/isles.png
results/figure/abyss.png : results/abyss.dat src/plotcount.py	
	python src/plotcount.py results/abyss.dat results/figure/abyss.png
results/figure/last.png : results/last.dat src/plotcount.py	
	python src/plotcount.py results/last.dat results/figure/last.png
results/figure/sierra.png : results/sierra.dat src/plotcount.py	
	python src/plotcount.py results/sierra.dat results/figure/sierra.png
	
# write the report
doc/count_report.md : doc/count_report.Rmd results/figure/isles.png results/figure/abyss.png results/figure/last.png results/figure/sierra.png
	Rscript -e "rmarkdown::render('doc/count_report.Rmd')"
	
clean :
	rm -rf results/isles.dat results/abyss.dat results/last.dat results/sierra.dat
	rm -rf results/figure/isles.png results/figure/abyss.png results/figure/last.png results/figure/sierra.png
	rm -rf doc/count_report.md doc/count_report.html

Pattern matching and variables in a Makefile

It is possible to DRY out a Makefile by using pattern rules and variables.

Using wildcards and pattern matching in a Makefile is possible, but the syntax is not very readable, so if you choose to do this, proceed with caution. Examples of how to do this are here: http://swcarpentry.github.io/make-novice/05-patterns/index.html

As for variables in a Makefile, in most cases we actually do not want to use them. The reason is that we want this file to be a record of how we ran our analysis (e.g., what files were used, what settings were used, etc.). If you start using variables in your Makefile, then you are shifting the problem of recording how your analysis was done to another file: there then needs to be some file in your repo that captures what values the variables were given so that you can replicate your analysis. Examples of using variables in a Makefile are here: http://swcarpentry.github.io/make-novice/06-variables/index.html