14 Non-interactive scripts
Learning Objectives
- Explain when it is optimal to work in a read-eval-print-loop (REPL) framework and when it is optimal to shift to using non-interactive scripts.
- Be able to create simple scripts in R that can take input and be executed from the command line.
- Decide when to move from using command line arguments to pass variables into a script to passing variables in via a configuration file, and create scripts that can read configuration files.
14.1 Read-eval-print-loop (REPL) framework (i.e., interactive mode) versus Scripts
- Up until now, we have primarily been using R and Python in a read-eval-print-loop (REPL) framework (i.e., interactive mode)
- The REPL framework (interactive mode) is when we run our code in the console in R/Python, or in cells/chunks in RStudio/Jupyter notebooks
- A REPL framework (interactive mode) is very useful for:
- solving small problems
- developing code that will be knit to an analytic report
- developing code that will be run as a script (i.e., in “batch” mode)
14.2 What is a script?
An R/Python script is simply a plain text file containing (almost) the same commands that you would enter into R/Python’s console or in cells/chunks in the RStudio/Jupyter notebooks. We often run these from top to bottom from the command line/unix shell.
14.2.1 Why write scripts?
- Efficiency!
- Automation!
- Reusable!
- Record of what you have done!
It also makes your report files a lot cleaner!!!
You can find the code repository for this lesson here: https://github.com/DSCI-310/2024-02-13-scripts
14.3 Scripts in R
Let’s start with a small, simple example to demonstrate how we write and run scripts in R (it is very similar in Python and we will get to this later in the lesson).
Our script will be called print_mean_hp.R, and it will calculate the mean horsepower of the cars from the built-in R data frame mtcars.
We will develop this script inside RStudio, make sure it works, and then run it from the command line/terminal/Git bash.
14.3.1 Our first R script
# author: Tiffany Timbers
# date: 2020-01-15
#
# This script calculates the mean horsepower of the cars from the built-in
# R data frame `mtcars`. This script takes no arguments.
#
# Usage: Rscript print_mean_hp.R
mean_hp <- mean(mtcars$hp)
print(mean_hp)
14.3.2 Running our first R script
To run our R script, we need to open the command line/terminal/Git bash, and either navigate to the directory that houses the script OR point to it when we call it. We will do the former.
Then to run the R script, we use the Rscript
command, followed by the name/path to the file:
Rscript print_mean_hp.R
The output should be:
[1] 146.6875
14.3.3 A couple notes about scripts
- If you want something to be output to the command line/terminal/Git bash, you should explicitly ask for it to be printed. This is not an absolute requirement in R, but it is in Python!
- Similarly with figures, they need to be saved! You will never see a figure created in a script unless you write it to a file (see the sketch just after this list).
- From a reproducibility perspective, if we want input from the user, we will usually design the script to take command line arguments rather than keyboard/user prompts.
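To make the figure point concrete, here is a minimal sketch (the plot and the file name are just placeholders for illustration, not part of the scripts in this lesson) of saving a figure to disk from an R script using ggplot2::ggsave():
library(ggplot2)

# build a simple plot from the built-in mtcars data frame
hp_histogram <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(bins = 10)

# a script never displays figures on screen; write the figure to a file
# instead so it can be loaded into a report later
ggsave("hp_histogram.png", plot = hp_histogram, width = 5, height = 4)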
14.3.4 Script structure and organization
Although not required by R or Python, it is good practice to organize the code in your script into related sections. This keeps your code readable and easier to maintain. Below we outline how we typically organize R scripts:
14.3.5 Example R script organization:
# documentation comments
# import libraries/packages
# parse/define command line arguments here
# code for other functions
# define main function
main <- function() {
  # code for "guts" of script goes here
}

# call main function
main() # pass any command line args to main here
14.3.6 R script example
Here we write a script called quick_titanic_fare_mean.R
which reads in the titanic dataset (original source: https://biostat.app.vumc.org/wiki/Main/DataSets) and calculates the mean for the fare (ticket price) variable.
Our script has a main function which runs the “body” of our code.
# author: Tiffany Timbers
# date: 2020-01-15
#
# This script calculates the mean for the fare (ticket price)
# from titanic.csv. This script takes no arguments.
#
# Usage: quick_titanic_fare_mean.R
library(tidyverse)
main <- function() {
  data <- read_csv('data/titanic.csv')
  out <- data %>%
    pull(fare) %>%
    mean(na.rm = TRUE)
  print(out)
}

main()
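As before, we would run this script from the command line/terminal/Git bash (here assuming we are in the project root and the Titanic data lives at data/titanic.csv):
Rscript quick_titanic_fare_mean.R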
14.3.7 Saving things from scripts
Above we just printed the mean to the terminal. That was done because the purpose of that script was to give a very simple illustration of how to create and run scripts in R. However, in practice, we typically want to save our analysis artifacts (figures, tables, values, etc.) to disk so that we can load them into other files (e.g., our final reports to communicate our analysis findings).
Below we show an example of how we would use readr::write_csv
to save the mean value we calculated to a .csv
file:
# author: Tiffany Timbers
# date: 2020-01-15
#
# This script calculates the mean horsepower of the cars from the built-in
# R data frame `mtcars` and saves it to `results/mean_hp_col.csv`.
# This script takes no arguments.
#
# Usage: Rscript print_mean_hp.R
library(readr)
main <- function() {
  mean_hp <- mean(mtcars$hp)
  mean_hp <- data.frame(value = mean_hp)
  write_csv(mean_hp, "results/mean_hp_col.csv")
}

main()
In this script we are saving the file to the results directory. The results directory needs to exist before this script will work.
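If we wanted the script to guard against a missing directory itself, one option (a small sketch we are adding for illustration, not part of the script above) is to create the directory from within main() before writing:
# create the results directory if it does not already exist
if (!dir.exists("results")) {
  dir.create("results")
}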
14.4 Using command line arguments in R
Let’s make our script more flexible by letting us specify, when we call the script, which column variable we want to calculate the mean for.
To do this, we use the docopt
R package. This will allow us to collect the text we enter at the command line when we call the script, and make it available to us when we run the script.
When we run docopt, it takes the text we entered at the command line and gives it to us as a named list of the text provided after the script name. The names of the items in the list come from the documentation, and whitespace at the command line is used to parse the text into separate items in the list. The list is a bit funny, in that the arguments are duplicated. In the first occurrence of the arguments in the list, the names include the syntax used to specify the argument in the docopt docs (so they can include characters like <, - and [). These characters make the argument names non-syntactic, and thus difficult to program with in R, so the arguments are also given a second time without those special docopt characters.
It might be easier to understand if we see an example. Consider this script named demo-docopt.R:
# author: Tiffany Timbers
# date: 2024-11-29
"This script prints out docopt args and the type of object they are stored as.
Usage: demo-docopt.R <arg1> <arg2> <arg3>
Options:
<arg1> Takes any value
<arg2> Takes any value
<arg3> Takes any value
" -> doc
library(docopt)
opt <- docopt(doc)

print(opt)
print(typeof(opt))
If we run this via Rscript demo-docopt.R I LOVE MDS (so our three arguments were I, LOVE and MDS), we get this output:
List of 6
$ <arg1>: chr "I"
$ <arg2>: chr "LOVE"
$ <arg3>: chr "MDS"
$ arg1 : chr "I"
$ arg2 : chr "LOVE"
$ arg3 : chr "MDS"
NULL
[1] "list"
You can see that opt (the list of command line arguments) is of length 6, even though the script expects only 3 command line arguments and only 3 were passed at the command line. The argument values are repeated, with only the names being different. The second occurrence of each argument name is easier to work with because it doesn’t have the special docopt command line characters. So to reference a command line argument value, say the second command line argument value in the example above, we would type opt$arg2.
We could also use numerical indexing to access the second command line argument value in the example above. To do that we could type either opt[[2]]
or opt[[5]]
. But this is far less readable, and so we tend to reference the command line arguments using the $name
approach.
Let’s look at a more realistic example:
# author: Tiffany Timbers
# date: 2020-01-15
"This script calculates the mean for a specified column
from titanic.csv.
Usage: quick_titanic_col_mean.R <var>
" -> doc
library(tidyverse)
library(docopt)
opt <- docopt(doc)

main <- function(var) {

  # read in data
  data <- read_csv('data/titanic.csv')

  # print out statistic of variable of interest
  out <- data |>
    pull(!!var) |>
    mean(na.rm = TRUE)
  print(out)
}

main(opt$var)
Note: we use !! in front of opt$var because all command line arguments are passed into R as strings, and are thus quoted. However, pull is a function from the tidyverse that expects an unquoted column name of a data frame. !! does this unquoting. This is similar to {{ that we saw before with functions (which quotes and unquotes values when they are passed into functions). However, here we use !! as we have no indirection and just need to perform unquoting.
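As a quick illustration of this unquoting (a sketch we are adding here, using the built-in mtcars data frame rather than the Titanic data), compare what happens when a column name arrives as a string:
library(dplyr)

# command line arguments always arrive as strings, e.g.:
var <- "hp"

# !! unquotes the string so pull() looks for the column named "hp",
# rather than a column literally named var
mtcars |>
  pull(!!var) |>
  mean()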
And we would run a script like this from the command line as follows:
Rscript src/quick_titanic_col_mean.R fare
Let’s make our script even more flexible by letting us specify the dataset as well (we could then use it more generally on other files, such as the Gapminder .csv’s).
# author: Tiffany Timbers
# date: 2020-01-15
"This script calculates the mean for a specified column
from a specified csv file.
Usage: quick_csv_col_mean.R <file_path> <var>
" -> doc
library(tidyverse)
library(docopt)
opt <- docopt(doc)

main <- function(file_path, var) {

  # read in data
  data <- read_csv(file_path)

  # print out statistic of variable of interest
  out <- data |>
    pull(!!var) |>
    mean(na.rm = TRUE)
  print(out)
}

main(opt$file_path, opt$var)
Now we would run a script like this from the command line as follows:
Rscript src/quick_csv_col_mean.R data/titanic.csv fare
14.4.1 Positional arguments vs options
In the examples above, we used docopt
to specify positional arguments. This means that the order matters! If we change the order of the values of the arguments at the command line, our script will likely throw an error, because it will try to perform the wrong operations on the wrong values.
Another downside to positional arguments is that, without good documentation, they can be less readable, and certainly the call to the script is less readable. We can instead give the arguments names using --ARGUMENT_NAME syntax. We call these “options”. Below is the same script, but specified using options as opposed to positional arguments:
# author: Tiffany Timbers
# date: 2020-01-15
"This script calculates the mean for a specified column
from a specified csv file.
Usage: quick_csv_col_mean.R --file_path=<file_path> --var=<var>
Options:
--file_path=<file_path> Path to the data file
--var=<var> Unquoted column name of the numerical vector for which to calculate the mean
" -> doc
library(tidyverse)
library(docopt)
opt <- docopt(doc)

main <- function(file_path, var) {

  # read in data
  data <- read_csv(file_path)

  # print out statistic of variable of interest
  out <- data |>
    pull(!!var) |>
    mean(na.rm = TRUE)
  print(out)
}

main(opt$file_path, opt$var)
And we would run a script that uses options like this:
Rscript src/quick_csv_col_mean.R --file_path=data/titanic.csv --var=fare
or like this:
Rscript src/quick_csv_col_mean.R --var=fare --file_path=data/titanic.csv
because we gave the arguments names, and thus their position no longer matters!
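docopt also lets you give an option a default value via [default: ...] in the Options block, so that the option can be omitted at the command line. Here is a small sketch (the default of fare and the optional [--var=<var>] in the usage pattern are just for illustration, not part of the script above):
"This script calculates the mean for a specified column
from a specified csv file.
Usage: quick_csv_col_mean.R --file_path=<file_path> [--var=<var>]
Options:
--file_path=<file_path>   Path to the data file
--var=<var>               Column to calculate the mean for [default: fare]
" -> doc

library(docopt)

opt <- docopt(doc)
opt$var  # "fare" when --var is not supplied at the command line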
14.4.2 Some tips for RStudio IDE
- To indent a block of text, highlight and use tab
- To fix indenting in general to R code standards, use ⌘/ctrl + shift + i
- To get multiple cursors, hold alt/option and highlight lines using cursor
- To get multiple cursors to the beginning of the line, use control A
- To get multiple cursors to the end of the line, use control E
14.5 Scripts in Python
14.5.1 Example Python script organization:
# documentation comments
# import libraries/packages
# parse/define command line arguments here
# code for other functions
# define main function
def main():
    # code for "guts" of script goes here
    pass

# call main function
if __name__ == "__main__":
    main()  # pass any command line args to main here
You can see that R and Python scripts should have roughly the same style. One difference is the if __name__ == "__main__": block in Python scripts; R does not really have an equivalent. The benefit of some control flow around main, as is done in Python, is that you can import or source the other functions in the script without running the main function.
Another important note: if you are using the click package for parsing command line arguments in Python, this package uses decorators on the main function to do this. As a consequence, these decorators need to be positioned directly above the main function definition. So, in this case the organization rules for “# parse/define command line arguments here” above are a little different.
It is still worthwhile, however, to have a main function in your R scripts, as it helps with organization and readability.
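If you do want something roughly analogous to Python's if __name__ == "__main__": guard in R, one pattern you will sometimes see (a sketch of one possible convention, not a built-in R feature) is to check sys.nframe(), which is 0 when the file is run at the top level with Rscript but greater than 0 when the file is source()d from inside another call:
main <- function() {
  # code for "guts" of script goes here
}

# only call main() when the script is run with Rscript,
# not when its functions are source()d for reuse
if (sys.nframe() == 0) {
  main()
}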
14.5.2 Using command line arguments in Python
Although docopt
for Python exists, it is not currently being supported by an active development community. Thus, we will use the click
Python package instead. It is widely used, has a healthy and active development community, and offers excellent functionality.
Below is an example of using click
for a simple Python script:
import click
@click.command()
@click.argument('num1', type=int)
@click.argument('num2', type=int)
def main(num1, num2):
    """Simple program that adds two numbers."""
    result = num1 + num2
    click.echo(f"The sum of {num1} and {num2} is {result}")

if __name__ == '__main__':
    main()
Running this script via:
python sum.py 5 7
Would result in:
The sum of 5 and 7 is 12
We do not need to pass the variables into main()
. The click
decorators take care of that for us! How nice!!!
14.5.3 Positional arguments vs options in Python
If we instead wanted to use options in the script above, we swap the argument decorator for the option decorator and add -- as a prefix to our option names:
import click
@click.command()
@click.option('--num1', type=int)
@click.option('--num2', type=int)
def main(num1, num2):
    """Simple program that adds two numbers."""
    result = num1 + num2
    click.echo(f"The sum of {num1} and {num2} is {result}")

if __name__ == '__main__':
    main()
Running this script, now adding the names of the options, via:
python sum.py --num1=5 --num2=7
Would result in:
The sum of 5 and 7 is 12
14.6 Saving objects from scripts
14.6.1 Tables
As mentioned already, it is important that you save results from your scripts so that you can import them into your reports (or other data products). For data frame objects that will be presented as tables, writing the objects to a .csv file through readr (in R) or pandas (in Python) is great.
14.6.2 Figures
For figures, saving images as .png is also a good choice, although the downstream usage of the figure can sometimes change this recommendation. For a brief but more thorough discussion of this topic, see the “Saving the visualization” chapter from Data Science: A First Introduction by Timbers, Campbell & Lee (2020).
14.6.3 Model objects
Model objects that are trained/fit in one script, and then need to be used again later in another script, can and should be saved as binary files. In R, the format is .rds, and we use the functions saveRDS() and readRDS() to do this. In Python, the format is .pickle, and we use the functions pickle.dump() and pickle.load() from the pickle module.
example of saving a model using saveRDS()
saveRDS(final_knn_model, "final_knn_model.rds")
example of loading a saved model using readRDS()
final_knn_model <- readRDS("final_knn_model.rds")
example of saving a model using pickle.dump()
for very simple objects (like preprocessor)
import pickle
pickle.dump(knn_preprocessor, open("knn_preprocessor.pickle", "wb"))
for more complex objects (like a fit model)
import pickle
with open("knn_fit.pickle", 'wb') as f:
    pickle.dump(knn_fit, f)
example of loading a saved model using pickle.load()
for very simple objects (like preprocessor)
import pickle
knn_preprocessor = pickle.load(open("knn_preprocessor.pickle", "rb"))
for more complex objects (like a fit model)
import pickle
with open("knn_fit.pickle", 'rb') as f:
    knn_fit = pickle.load(f)