Customizing and building containers#
Topic learning objectives#
By the end of this topic, students should be able to:
Write a container file (e.g., Dockerfile) that can be used to reproducibly build a container image that would contain the needed software and environment dependencies of your Data Science project
Use manual and automated tools (e.g., Docker, GitHub Actions) to build and share container images
List good container base images for Data Science projects
Building container images from Dockerfiles#
A Dockerfile is a plain text file that contains commands primarily about what software to install in the Docker image. This is the more trusted and transparent way to build Docker images.
Once we have created a Dockerfile we can build it into a Docker image.
Docker images are built in layers, and as such, Dockerfiles always start by specifying a base Docker image that the new image is to be built on top of.
The Docker containers we work with are Linux containers and thus use Linux commands to install software. However, there are different flavours of Linux (e.g., Ubuntu, Debian, CentOS, RedHat, etc.) and thus you need to use the right Linux install commands to match your flavour of container. For this course we will focus on Ubuntu- or Debian-based images and thus use apt-get as our installation program.
Workflow for building a Dockerfile#
1. Choose a base image to build off (from https://hub.docker.com/).
2. Create a Dockerfile named Dockerfile and save it in an appropriate project repository. Open that file and type FROM <BASE_IMAGE> on the first line.
3. In a terminal, type docker run --rm -it <IMAGE_NAME> and interactively try the install commands you think will work. Edit and try again until the install command works.
4. Write working install commands in the Dockerfile, preceding them with RUN, and save the Dockerfile.
5. After adding every 2-3 commands to your Dockerfile, try building the Docker image via docker build --tag <TEMP_IMAGE_NAME> <PATH_TO_DOCKERFILE_DIRECTORY>.
6. Once the entire Dockerfile works from beginning to end on your laptop, then you can finally move to building remotely (e.g., creating a trusted build on GitHub Actions).
Demo workflow for creating a Dockerfile locally#
We will demo this workflow together to build a Docker image locally on our machines that has R and the cowsay R package installed.
Let’s start with the debian:stable image, so the first line of our Dockerfile should be as such:
FROM debian:stable
Now let’s run the debian:stable image so we can work on our install commands to find some that work!
$ docker run --rm -it debian:stable
Now that we are in a container instance of the debian:stable Docker image, we can start playing around with installing things. To install things in the Debian flavour of Linux we use the command apt-get. We will do some demos in class today, but a more comprehensive tutorial can be found here.
To install R on Debian, we can figure out how to do this by following the CRAN documentation available here.
First, they recommend updating the list of software packages available for us to install with apt-get, via the apt-get update command:
root@5d0f4d21a1f9:/# apt-get update
Next, they suggest the following commands to install R:
root@5d0f4d21a1f9:/# apt-get install r-base r-base-dev
OK, great! That seems to have worked! Let’s test it by trying out R!
root@5d0f4d21a1f9:/# R
R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
Awesome! This seems to have worked! Let’s exit R (via q()) and the Docker container (via exit). Then we can add these commands to the Dockerfile, preceding them with RUN, and try to build our image to ensure this works.
Our Dockerfile so far:
FROM debian:stable
RUN apt-get update
RUN apt-get install r-base r-base-dev
$ docker build --tag testr1 src
Wait! That didn’t seem to work! Let’s focus on the last two lines of the error message:
Do you want to continue? [Y/n] Abort.
The command '/bin/sh -c apt-get install r-base r-base-dev' returned a non-zero code: 1
Ohhhh, right! As we were interactively installing this, we were prompted to press “Y” on our keyboard to continue the installation. We need to include this in our Dockerfile so that we don’t get this error. To do this we append the -y flag to the end of the line containing RUN apt-get install r-base r-base-dev, so that it reads RUN apt-get install r-base r-base-dev -y. Let’s try building again!
Great! Success! Now we can play with installing R packages!
Let’s start now with the test image we have built from our Dockerfile:
$ docker run -it --rm testr1
Now while we are in the container interactively, we can try to install the R package via:
root@51f56d653892:/# Rscript -e "install.packages('cowsay')"
And it looks like it worked! Let’s confirm by trying to call a function from the cowsay package in R:
root@51f56d653892:/# R
> cowsay::say("Smart for using Docker are you", "yoda")
Great, let’s exit the container, add this command to our Dockerfile, and try to build it again!
root@51f56d653892:/# exit
Our Dockerfile now:
FROM debian:stable
RUN apt-get update
RUN apt-get install r-base r-base-dev -y
RUN Rscript -e "install.packages('cowsay')"
Build the Dockerfile into an image:
$ docker build --tag testr1 src
$ docker run -it --rm testr1
Looks like a success, let’s be sure we can use the cowsay package:
root@861487da5d00:/# R
> cowsay::say("why did the chicken cross the road", "chicken")
Hurray! We did it! Now we can automate this build on GitHub, push it to Docker Hub and share this Docker image with the world!
Source: https://giphy.com/gifs/memecandy-ZcKASxMYMKA9SQnhIl
Tips for installing things programmatically on Debian-flavoured Linux#
Installing things with apt-get#
Before you install things with apt-get you will want to update the list of packages that apt-get can see. We do this via apt-get update.
Next, to install something with apt-get you will use the apt-get install command along with the name of the software. For example, to install the Git version control software we would type apt-get install git. Note, however, that we will be building our containers non-interactively, and so we want to preempt any questions or prompts from the installation software by including the answers in our commands. So, for example, to apt-get install we append --yes to tell apt-get that yes, we are happy to install the software we asked it to install, using the amount of disk space required to install it. If we didn’t append this, the installation would stop at this point waiting for our answer to this question (and, in a non-interactive build, abort). Thus, the full command to install Git via apt-get looks like:
apt-get install --yes git
Breaking shell commands across lines#
If we want to break a single command across lines in the shell, we use the \ character.
For example, to reduce the long line below which uses apt-get to install the programs Git, Tiny Nano, Less, and wget:
apt-get install --yes git nano-tiny less wget
We can use \ after each program, to break the long command across lines and make the command more readable (especially if there were even more programs to install). Similarly, we indent the lines after \ to increase readability:
apt-get install --yes \
    git \
    nano-tiny \
    less \
    wget
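To see that the shell really treats the continued lines as one command, here is a minimal sketch that can be run anywhere without installing anything. It swaps the real installation for echo, so the words are only printed; the echo stand-in and the variable name joined are assumptions made for illustration.

```shell
# echo stands in for running apt-get, so nothing is actually installed.
joined=$(echo apt-get install --yes \
    git \
    nano-tiny \
    less \
    wget)

# The shell removed each backslash-newline pair before running echo,
# so echo received every word as an argument of one single command.
echo "$joined"
```

The indentation after each \ is only for the reader; the shell collapses it when it splits the command into words.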
Running commands only if the previous one worked#
Sometimes we don’t want to run a command if the command that was run immediately before it failed. We can specify this in the shell using &&. For example, if we want to not run apt-get installation commands if apt-get update failed, we can write:
apt-get update && \
apt-get install --yes git
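The behaviour of && can be checked without touching apt-get at all. The sketch below uses the shell built-ins true and false, plus a hypothetical function failing_update that stands in for an apt-get update that failed; the function and variable names are illustrative assumptions.

```shell
# false always fails, so the command after && is skipped.
false && echo "this line is never printed"

# true always succeeds, so the command after && runs.
true && echo "update succeeded, installing"

# Hypothetical stand-in for an apt-get update that failed.
failing_update() { return 1; }

installed=no
# The assignment only happens if failing_update succeeds, which it does not.
failing_update && installed=yes
echo "installed: $installed"
```

In a Dockerfile, the same pattern is commonly written inside a single RUN instruction, so that the install step is never attempted on top of a failed update.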
Dockerfile command summary#
Most common Dockerfile commands I use:
Command | Description |
---|---|
FROM | States which base image the new Docker image should be built on top of |
RUN | Specifies that a command should be run in a shell |
ENV | Sets environment variables |
EXPOSE | Specifies the port the container listens on at runtime |
COPY or ADD | Adds files (or URLs in the case of ADD) to a container’s filesystem |
ENTRYPOINT | Configures a container that will run as an executable |
WORKDIR | Sets the working directory for the instructions that follow it |
And more here in the Dockerfile reference.
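As a rough illustration of how a few of these commands fit together, the sketch below writes a small Dockerfile from the shell using a here-document. The R and cowsay steps come from the demo earlier in these notes; the ENV value PROJECT_NAME=demo, the WORKDIR path /home/project, and the temporary directory are hypothetical and only there to show the syntax.

```shell
# Work in a throwaway directory so nothing in the project is overwritten.
workdir=$(mktemp -d)

# Quoting 'EOF' stops the shell from expanding anything inside the file.
cat > "$workdir/Dockerfile" <<'EOF'
# Base image the new image is built on top of
FROM debian:stable

# Install R, skipping the install if the package list update fails
RUN apt-get update && \
    apt-get install --yes r-base r-base-dev

# Install the cowsay R package
RUN Rscript -e "install.packages('cowsay')"

# Hypothetical environment variable and working directory
ENV PROJECT_NAME=demo
WORKDIR /home/project
EOF

# Show the file that was written.
cat "$workdir/Dockerfile"
```

This only creates the file; building it still requires docker build, as described in the workflow above.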
Choosing a base image for your Dockerfile#
Source: https://themuslimtimes.info/2018/10/25/if-i-have-seen-further-it-is-by-standing-on-the-shoulders-of-giants/
Good base images to work from for R or Python projects!#
Image | Software installed |
---|---|
| | R, R packages (including the tidyverse), RStudio, make |
| | Python 3.7.4, Anaconda base package distribution, Jupyter notebook |
| | Includes popular packages from the scientific Python ecosystem. |
For mixed language projects, I would recommend using the rocker/tidyverse image as the base and then installing Anaconda or Miniconda as I have done here: https://github.com/UBC-DSCI/introduction-to-datascience/blob/b0f86fc4d6172cd043a0eb831b5d5a8743f29c81/Dockerfile#L19
This is also a nice tour de Docker images from the Jupyter core team: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#selecting-an-image
Dockerfile FAQ:#
1. Where does the Dockerfile live?#
The Dockerfile should live in the root directory of your project.
2. How do I make an image from a Dockerfile?#
There are 2 ways to do this! I use the first when developing my Dockerfile (to test quickly that it works), and then I use the second when I think I am “done” and want to have it archived on Docker Hub.
Build a Docker image locally on your laptop
Build a Docker image and push it to Docker Hub using GitHub Actions
3. How do I build an image locally on my laptop?#
From the directory that contains your Dockerfile (usually your project root):
docker build --tag IMAGE_NAME:VERSION .
Note: --tag lets you name and version the Docker image. You can call this anything you want. The version number/name comes after the colon.
After I build, I then try to docker run ... to test the image locally. If I don’t like it, or it doesn’t work, I delete the image with docker rmi {IMAGE_NAME}, edit my Dockerfile and try to build and run it again.
Build a Docker image from a Dockerfile on GitHub Actions#
Building a Docker image from a Dockerfile using an automated tool (e.g., DockerHub or GitHub Actions) lets others trust your image as they can clearly see which Dockerfile was used to build which image.
We will do this in this course by using GitHub Actions (a continuous integration tool) because it provides a great deal of nuanced control over when to trigger the automated builds of the Docker image, and how to tag them.
An example GitHub repository that uses GitHub Actions to build a Docker image from a Dockerfile and publish it on DockerHub is available here: ttimbers/gha_docker_build
We will work through a demonstration of this now starting here: ttimbers/dockerfile-practice
Version Docker images and report software and package versions#
It is easier to create a Docker image from a Dockerfile and tag it (or use its digest) than to control the version of each thing that goes into your Docker image.
tags are human readable, however they can be associated with different builds of the image (potentially using different Dockerfiles…)
digests are not human readable, but specify a specific build of an image
Example of how to pull using a tag:
docker pull ttimbers/dockerfile-practice:v1.0
Example of how to pull using a digest:
docker pull ttimbers/dockerfile-practice@sha256:cc512c9599054f24f4020e2c7e3337b9e71fd6251dfde5bcd716dc9b1f8c3a73
Tags are specified when you build on Docker Hub on the Builds tab under the Configure automated builds options. Digests are assigned to a build. You can see the digests on the Tags tab, by clicking on the “Digest” link for a specific tag of the image.
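If you keep image references in scripts, it can help to check whether each one is pinned by tag or by digest. The function below is a minimal sketch of such a check using plain shell pattern matching; the name reference_kind is hypothetical, and the sketch deliberately ignores registry hosts that include a port (e.g., localhost:5000/image), which it would misread as a tag.

```shell
# Classify an image reference as pinned by digest, by tag, or neither.
# Simplification: a registry host with a port would be misread as a tag.
reference_kind() {
    case "$1" in
        *@sha256:*) echo "digest" ;;
        *:*)        echo "tag" ;;
        *)          echo "none" ;;
    esac
}

reference_kind "ttimbers/dockerfile-practice:v1.0"
reference_kind "ttimbers/dockerfile-practice@sha256:cc512c9599054f24f4020e2c7e3337b9e71fd6251dfde5bcd716dc9b1f8c3a73"
reference_kind "ttimbers/dockerfile-practice"
```

A reference with neither a tag nor a digest is treated by Docker as the latest tag, which is the least reproducible of the three.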
How to get the versions of your software in your container#
Easiest is to enter the container interactively and poke around using the following commands:
python --version and R --version to find out the versions of Python and R, respectively
pip freeze or conda list in the bash shell to find out Python package versions
Enter R and load the libraries used in your scripts, then use sessionInfo() to print the package versions
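The first of these checks can be collected into a small script to run inside the container. This is only a sketch: the variable name report is an assumption, and the script simply notes when a tool is not on the PATH rather than failing.

```shell
# Print the version of each tool, or note that it is missing.
report=$(
    for tool in python R; do
        if command -v "$tool" >/dev/null 2>&1; then
            # 2>&1 because some older tools print their version to stderr.
            echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
        else
            echo "$tool: not found"
        fi
    done
)
echo "$report"
```

Saving this output alongside your analysis gives a simple record of which versions the image actually contained.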
But I want to control the versions!#
How to in R:#
The Rocker team’s strategy#
This is not an easy thing, but the Rocker team has made a concerted effort to do this. Below is their strategy:
Using the R version tag will naturally lock the R version, and also lock the install date of any R packages on the image. For example, rocker/tidyverse:3.3.1 Docker image will always rebuild with R 3.3.1 and R packages installed from the 2016-10-31 MRAN snapshot, corresponding to the last day that version of R was the most recent release. Meanwhile rocker/tidyverse:latest will always have both the latest R version and latest versions of the R packages, built nightly.
See VERSIONS.md for details, but in short they use the line below to lock the R version (or view in r-ver Dockerfile here for more context):
&& curl -O https://cran.r-project.org/src/base/R-3/R-${R_VERSION}.tar.gz \
And this line to specify the CRAN snapshot from which to grab the R packages (or view in r-ver Dockerfile here for more context):
&& Rscript -e "install.packages(c('littler', 'docopt'), repo = '$MRAN')" \
How to in Python:#
Python version:
Use conda to specify an install of a specific Python version, either when downloading (see example here), or after downloading with conda install python=3.6.
Or you can install a specific version of Python yourself, as they do in the Python official images (see here for example), but this is more complicated.
For Python packages, there are a few tools:
conda (via conda install scipy=0.15.0 for example)
pip (via pip install scipy==0.15.0 for example)
Take home messages:#
At a minimum, tag your Docker images or reference image digests
If you want to version installs inside the container, use base images that version R & Python, and add what you need on top in a versioned manner!