Customizing and building containers#

Topic learning objectives#

By the end of this topic, students should be able to:

  1. Write a container file (e.g., Dockerfile) that can be used to reproducibly build a container image that would contain the needed software and environment dependencies of your Data Science project

  2. Use manual and automated tools (e.g., Docker, GitHub Actions) to build and share container images

  3. List good container base images for Data Science projects

Building container images from Dockerfile’s#

  • A Dockerfile is a plain text file that contains commands primarily about what software to install in the Docker image. This is the more trusted and transparent way to build Docker images.

  • Once we have created a Dockerfile we can build it into a Docker image.

  • Docker images are built in layers, and as such, Dockerfiles always start by specifiying a base Docker image that the new image is to be built on top off.

  • Docker containers are all Linux containers and thus use Linux commands to install software, however there are different flavours of Linux (e.g., Ubuntu, Debian, CentOs, RedHat, etc) and thus you need to use the right Linux install commands to match your flavour of container. For this course we will focus on Ubuntu- or Debian-based images and thus use apt-get as our installation program.

Workflow for building a Dockerfile#

  1. Choose a base image to build off (from https://hub.docker.com/).

  2. Create a Dockerfile named Dockerfile and save it in an appropriate project repository. Open that file and type FROM <BASE_IMAGE> on the first line.

  3. In a terminal, type docker run --rm -it <IMAGE_NAME> and interactively try the install commands you think will work. Edit and try again until the install command works.

  4. Write working install commands in the Dockerfile, preceeding them with RUN and save the Dockerfile.

  5. After adding every 2-3 commands to your Dockerfile, try building the Docker image via docker build --tag <TEMP_IMAGE_NAME> <PATH_TO_DOCKERFILE_DIRECTORY>.

  6. Once the entire Dockerfile works from beginning to end on your laptop, then you can finally move to building remotely (e.g., creating a trusted build on GitHub Actions).

Demo workflow for creating a Dockfile locally#

We will demo this workflow together to build a Docker image locally on our machines that has R and the cowsay R package installed.

Let’s start with the debian:stable image, so the first line of our Dockerfile should be as such:

FROM debian:stable

Now let’s run the debian:stable image so we can work on our install commands to find some that work!

$ docker run --rm -it debian:stable

Now that we are in a container instance of the debian:stable Docker image, we can start playing around with installing things. To install things in the Debian flavour of Linux we use the command apt-get. We will do some demo’s in class today, but a more comprehensive tutorial can be found here.

To install R on Debian, we can figure out how to do this by following the CRAN documentation available here.

First they recommend updating the list of available software package we can install with apt-get to us via the apt-get update command:

root@5d0f4d21a1f9:/# apt-get update

Next, they suggest the following commands to install R:

root@5d0f4d21a1f9:/# apt-get install r-base r-base-dev

OK, great! That seemed to have worked! Let’s test it by trying out R!

root@5d0f4d21a1f9:/# R

R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 

Awesome! This seemed to have worked! Let’s exit R (via q()) and the Docker container (via exit). Then we can add these commands to the Dockerfile, proceeding them with RUN and try to build our image to ensure this works.

Our Dockerfile so far:

FROM debian:stable

RUN apt-get update

RUN apt-get install r-base r-base-dev -y 
$ docker build --tag testr1 src

Wait! That didn’t seem to work! Let’s focus on the last two lines of the error message:

Do you want to continue? [Y/n] Abort.
The command '/bin/sh -c apt-get install r-base r-base-dev' returned a non-zero code: 1

Ohhhh, right! As we were interactively installing this, we were prompted to press “Y” on our keyboard to continue the installation. We need to include this in our Dockerfile so that we don’t get this error. To do this we append the -y flag to the end of the line contianing RUN apt-get install r-base r-base-dev. Let’s try building again!

Great! Success! Now we can play with installing R packages!

Let’s start now with the test image we have built from our Dockerfile:

$ docker run -it --rm testr1

Now while we are in the container interactively, we can try to install the R package via:

root@51f56d653892:/# Rscript -e "install.packages('cowsay')"

And it looks like it worked! Let’s confirm by trying to call a function from the cowsay package in R:

root@51f56d653892:/# R

> cowsay::say("Smart for using Docker are you", "yoda")

Great, let’s exit the container, and add this command to our Dockerfile and try to build it again!

root@51f56d653892:/# exit

Our Dockerfile now:

FROM debian:stable

RUN apt-get update

RUN apt-get install r-base r-base-dev -y 

RUN Rscript -e "install.packages('cowsay')"

Build the Dockerfile into an image:

$ docker build --tag testr1 src

$ docker run -it --rm testr1

Looks like a success, let’s be sure we can use the cowsay package:

root@861487da5d00:/# R

> cowsay::say("why did the chicken cross the road", "chicken")

Hurray! We did it! Now we can automate this build on GitHub, push it to Docker Hub and share this Docker image with the world!

Source: https://giphy.com/gifs/memecandy-ZcKASxMYMKA9SQnhIl

Tips for installing things programmatically on Debian-flavoured Linux#

Installing things with apt-get#

Before you install things with apt-get you will want to update the list of packages that apt-get can see. We do this via apt-get update.

Next, to install something with apt-get you will use the apt-get install command along with the name of the software. For example, to install the Git version control software we would type apt-get install git. Note however that we will be building our containers non-interactively, and so we want to preempt any questions/prompts the installation software we will get by including the answers in our commands. So for example, to apt-get install we append --yes to tell apt-get that yes we are happy to install the software we asked it to install, using the amount of disk space required to install it. If we didn’t append this, the installation would stall out at this point waiting for our answer to this question. Thus, the full command to Git via apt-get looks like:

apt-get install --yes git

Breaking shell commands across lines#

If we want to break a single command across lines in the shell, we use the \ character. For example, to reduce the long line below which uses apt-get to install the programs Git, Tiny Nano, Less, and wget:

apt-get install --yes git nano-tiny less wget

We can use \ after each program, to break the long command across lines and make the command more readable (especially if there were even more programs to install). Similarly, we indent the lines after \ to increase readability:

apt-get install --yes \
    git \
    nano-tidy \
    less \
    wget

Running commands only if the previous one worked#

Sometimes we don’t want to run a command if the command that was run immediately before it failed. We can specify this in the shell using &&. For example, if we want to not run apt-get installation commands if apt-get update failed, we can write:

apt-get update && \
    apt-get install --yes git

Dockerfile command summary#

Most common Dockerfile commands I use:

Command

Description

FROM

States which base image the new Docker image should be built on top of

RUN

Specifies that a command should be run in a shell

ENV

Sets environment variables

EXPOSE

Specifies the port the container should listen to at runtime

COPY or ADD

adds files (or URL’s in the case of ADD) to a container’s filesystem

ENTRYPOINT

Configure a container that will run as an executable

WORKDIR

sets the working directory for any RUN, CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the Dockerfile

And more here in the Dockerfile reference.

Choosing a base image for your Dockerfile#

https://themuslimtimesdotinfodotcom.files.wordpress.com/2018/10/newton-quotes-2.jpg?w=1334

Source: https://themuslimtimes.info/2018/10/25/if-i-have-seen-further-it-is-by-standing-on-the-shoulders-of-giants/

Good base images to work from for R or Python projects!#

Image

Software installed

rocker/tidyverse

R, R packages (including the tidyverse), RStudio, make

continuumio/anaconda3

Python 3.7.4, Ananconda base package distribution, Jupyter notebook

jupyter/scipy-notebook

Includes popular packages from the scientific Python ecosystem.

For mixed language projects, I would recommend using the rocker/tidyverse image as the base and then installing Anaconda or miniconda as I have done here: https://github.com/UBC-DSCI/introduction-to-datascience/blob/b0f86fc4d6172cd043a0eb831b5d5a8743f29c81/Dockerfile#L19

This is also a nice tour de Docker images from the Jupyter core team: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#selecting-an-image

Dockerfile FAQ:#

1. Where does the Dockerfile live?#

The Dockerfile should live in the root directory of your project.

2. How do I make an image from a Dockerfile?#

There are 2 ways to do this! I use the first when developing my Dockerfile (to test quickly that it works), and then the second I use when I think I am “done” and want to have it archived on Docker Hub.

  1. Build a Docker image locally on your laptop

  2. Build a Docker image and push it to DockerHub using GitHub Actions,

3. How do I build an image locally on my laptop#

From the directory that contains your Dockerfile (usually your project root):

docker build --tag IMAGE_NAME:VERSION .

note: --tag let’s you name and version the Docker image. You can call this anything you want. The version number/name comes after the colon

After I build, I think try to docker run ... to test the image locally. If I don’t like it, or it doesn’t work, I delete the image with docker rmi {IMAGE_NAME}, edit my Dockerfile and try to build and run it again.

Build a Docker image from a Dockerfile on GitHub Actions#

Building a Docker image from a Dockerfile using an automated tool (e.g., DockerHub or GitHub Actions) lets others trust your image as they can clearly see which Dockerfile was used to build which image.

We will do this in this course by using GitHub Actions (a continuous integration tool) because is provides a great deal of nuanced control over when to trigger the automated builds of the Docker image, and how to tag them.

An example GitHub repository that uses GitHub Actions to build a Docker image from a Dockerfile and publish it on DockerHub is available here: ttimbers/gha_docker_build

We will work through a demonstration of this now starting here: ttimbers/dockerfile-practice

Version Docker images and report software and package versions#

It is easier to create a Docker image from a Dockerfile and tag it (or use it’s digest) than to control the version of each thing that goes into your Docker image.

  • tags are human readable, however they can be associated with different builds of the image (potentially using different Dockerfiles…)

  • digests are not human readable, but specify a specific build of an image

Example of how to pull using a tag:

docker pull ttimbers/dockerfile-practice:v1.0

Example of how to pull using a digest:

docker pull ttimbers/dockerfile-practice@sha256:cc512c9599054f24f4020e2c7e3337b9e71fd6251dfde5bcd716dc9b1f8c3a73

Tags are specified when you build on Docker Hub on the Builds tab under the Configure automated builds options. Digests are assigned to a build. You can see the digests on the Tags tab, by clicking on the “Digest” link for a specific tag of the image.

How to get the versions of your software in your container#

Easiest is to enter the container interactively and poke around using the following commands:

  • python --version and R --version to find out the versions of Python and R, respectively

  • pip freeze or conda list in the bash shell to find out Python package versions

  • Enter R and load the libraries used in your scripts, then use sessionInfo() to print the package versions

But I want to control the versions!#

How to in R:#

The Rocker team’s strategy#

This is not an easy thing, but the Rocker team has made a concerted effort to do this. Below is their strategy:

Using the R version tag will naturally lock the R version, and also lock the install date of any R packages on the image. For example, rocker/tidyverse:3.3.1 Docker image will always rebuild with R 3.3.1 and R packages installed from the 2016-10-31 MRAN snapshot, corresponding to the last day that version of R was the most recent release. Meanwhile rocker/tidyverse:latest will always have both the latest R version and latest versions of the R packages, built nightly.

See VERSIONS.md for details, but in short they use the line below to lock the R version (or view in r-ver Dockerfile here for more context):

    && curl -O https://cran.r-project.org/src/base/R-3/R-${R_VERSION}.tar.gz \

And this line to specify the CRAN snapshot from which to grab the R packages (or view in r-ver Dockerfile here for more context):

    && Rscript -e "install.packages(c('littler', 'docopt'), repo = '$MRAN')" \

How to in Python:#

Python version:

  • conda to specify an install of specific Python version, either when downloading (see example here, or after downloading with conda install python=3.6).

  • Or you can install a specific version of Python yourself, as they do in the Python official images (see here for example), but this is more complicated.

For Python packages, there are a few tools:

  • conda (via conda install scipy=0.15.0 for example)

  • pip (via pip install scipy=0.15.0 for example)

Take home messages:#

  • At a minimum, tag your Docker images or reference image digests

  • If you want to version installs inside the container, use base images that version R & Python, and add what you need on top in a versioned manner!