15 Customizing and building containers
Learning Objectives
- Write a container file (e.g., Dockerfile) that can be used to reproducibly build a container image that would contain the needed software and environment dependencies of your Data Science project
- Use manual and automated tools (e.g., Docker, GitHub Actions) to build and share container images
- List good container base images for Data Science projects
15.1 Building container images from Dockerfile’s
A
Dockerfileis a plain text file that contains commands primarily about what software to install in the Docker image. This is the more trusted and transparent way to build Docker images.Once we have created a
Dockerfilewe can build it into a Docker image.Docker images are built in layers, and as such,
Dockerfiles always start by specifying a base Docker image that the new image is to be built on top off.Docker containers are all Linux containers and thus use Linux commands to install software, however there are different flavours of Linux (e.g., Ubuntu, Debian, CentOs, RedHat, etc) and thus you need to use the right Linux install commands to match your flavour of container. For this course we will focus on Ubuntu- or Debian-based images (and that means if we need to install software outside of R and Python packages, we will use
apt-getas our installation program). However, most of what we’ll be doing for installation is R, Python and their packages. For that we can really use tools we’re already familiar with (condaandconda-lock, for example).
15.1.1 Workflow for building a Dockerfile
It can take a LOOOOOOOONNNG time to troubleshoot the building of Docker images. Thus, to speed things up and be more efficient, we suggest the workflow below:
Choose a base image to build off (from https://hub.docker.com/).
Create a
DockerfilenamedDockerfileand save it in an appropriate project repository. Open that file and typeFROM <BASE_IMAGE> on the first line. Add any files that need to be accessed at build time via theCOPYcommand.In a terminal, type
docker run --rm -it <IMAGE_NAME>and interactively try the install commands you think will work. Edit and try again until the install command works.Write working install commands in the
Dockerfile, preceding them withRUNand save theDockerfile.After adding every 2-3 commands to your
Dockerfile, try building the Docker image viadocker build --tag <TEMP_IMAGE_NAME> <PATH_TO_DOCKERFILE_DIRECTORY>.Once the entire
Dockerfileworks from beginning to end on your laptop, then you can finally move to building remotely (e.g., creating a trusted build on GitHub Actions).
15.1.2 Demo workflow for creating a Dockerfile locally
We will demo this workflow together to build a Docker image locally on our machines that has Jupyter, Python and the python packages pandas, pandera and deepcheck installed. When we do this, we will leverage a conda environment that we have for these packages already:
# environment.yml
name: my-env
channels:
- conda-forge
- defaults
dependencies:
- pandas=2.2.2
- pandera=0.20.4
- python=3.11
- pip
- pip:
- deepchecks==0.18.1To use this environment efficiently to build a Docker image, we need a conda-lock file, more specifically an explicit conda-lock file for the linux operating system (as our container will be a linux container). We can generate that from the environment.yml file via:
conda-lock -k explicit --file environment.yml -p linux-64From that, we get a file named conda-linux-64.lock, which we can copy into a Jupyter container and update the conda environment already installed there using mamba update.
OK, now we are ready to start writing our Dockerfile! Let’s start with the quay.io/jupyter/minimal-notebook:afe30f0c9ad8 image, which already has Jupyter lab installed. so the first line of our Dockerfile should be as such:
FROM quay.io/jupyter/minimal-notebook:afe30f0c9ad8Next, since we will be wanting the container to have access to a file at build time, we need to COPY it in so we can have access to it in the container. Our Dockerfile should now look like this:
FROM quay.io/jupyter/minimal-notebook:afe30f0c9ad8
COPY conda-linux-64.lock /tmp/conda-linux-64.lockNow let’s build an image from our Dockerfile so we can test out and find install commands that work for what we need to do! To build an image, we use docker build. We’ll want to tag/name the image so we can reference it after its built, so we can run it. Here we named it testing_cmds. Finally we say where to look for the Dockerfile. Here we say ., meaning the current working directory:
docker build --tag testing_cmds .Now we will run our image (named testing_cmds). Note that we run this not using the Jupyter web app because all we are doing right now is testing out installation commands.
$ docker run --rm -it testing_cmds bin/bashThe first command we will test out is mamba update to add the packages specified in conda-linux-64.lock to the base conda environment in the container.
Note the use of --quiet in the command below. These commands will be run non-interactively when building the container, and so we will not be able to see the output. If this command fails here though, we may want to remove --quiet while troubleshooting, and then add it back in once we get things working.
jovyan@91320098e7cb:~$ mamba update --quiet --file /tmp/conda-linux-64.lockGreat! That seemed to work. Next, we’ll try cleaning up (a good, but not necessary practice after updating conda environments).
Note the use of -y and -f in the command below. These commands will be run non-interactively when building the container, and so we cannot use the keyboard to say “yes” to conda/mamba questions. So instead we just give full permission for all the things when we run the command.
jovyan@91320098e7cb:~$ mamba clean --all -y -fOK, great! That seemed to have worked too! Last thing we’ll test out is fixing permissions of the directories where we installed things (sometimes these got modified during installation and can cause user issues):
jovyan@91320098e7cb:~$ fix-permissions "${CONDA_DIR}"jovyan@91320098e7cb:~$ fix-permissions "/home/${NB_USER}"Awesome! This seemed to have worked! Let’s exit the Docker container (via exit). Then we can add these commands to the Dockerfile, proceeding them with RUN and try to build our image to ensure this works.
Our Dockerfile so far:
FROM quay.io/jupyter/minimal-notebook:afe30f0c9ad8
COPY conda-linux-64.lock /tmp/conda-linux-64.lock
RUN mamba update --quiet --file /tmp/conda-linux-64.lock
RUN mamba clean --all -y -f
RUN fix-permissions "${CONDA_DIR}"
RUN fix-permissions "/home/${NB_USER}"Let’s try building an image named testimage locally:
$ docker build --tag testimage .Looks like a success, let’s be sure we can use the pandera package as a test:
jovyan@cc85f7afef69:~$ python
Python 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 11:57:02) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandera as pa
>>>Hurray! We did it! Now we can automate this build on GitHub, push it to Docker Hub and share this Docker image with the world!

15.1.3 Guidelines for RUN Commands:
Each RUN command creates a new layer in the Docker image. Each layer in a Docker image takes more disc space. As a consequence, we want to minimize the number of layers where reasonably possible. How can we do this?
Install everything that can be installed by a tool (e.g., conda, mamba, pip, apt-get, etc) at once (i.e., when installing 5 programs via apt-get, do not call apt-get five times, instead do: apt-get tool1 tool2 tool3 tool4 tool5). However, doing this can lead to long lines. In response, we can break a single command across lines in the shell by using the \ character. For example, to reduce the long line below which uses apt-get to install the programs Git, Tiny Nano, Less, and wget:
apt-get install --yes git nano-tiny less wgetWe can use \ after each program, to break the long command across lines and make the command more readable (especially if there were even more programs to install). Similarly, we indent the lines after \ to increase readability:
apt-get install --yes \
git \
nano-tidy \
less \
wgetWe can also group together related commands that depend upon each other. Whe doing this we need to be careful though, as sometimes we don’t want to run a command if the command that was run immediately before it failed. We can specify this in the shell using &&. For example, if we want to not run apt-get installation commands if apt-get update failed, we can write:
apt-get update && \
apt-get install --yes git15.2 Dockerfile command summary
Most common Dockerfile commands I use:
| Command | Description |
|---|---|
| FROM | States which base image the new Docker image should be built on top of |
| RUN | Specifies that a command should be run in a shell |
| ENV | Sets environment variables |
| EXPOSE | Specifies the port the container should listen to at runtime |
| COPY or ADD | adds files (or URL’s in the case of ADD) to a container’s filesystem |
| ENTRYPOINT | Configure a container that will run as an executable |
| WORKDIR | sets the working directory for any RUN, CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the Dockerfile |
And more here in the Dockerfile reference.
15.3 Choosing a base image for your Dockerfile

15.3.1 Good base images to work from for R or Python projects!
| Image | Software installed |
|---|---|
| jupyter/minimal-notebook | Includes popular packages from the scientific Python ecosystem. |
| rocker/rstudio | R, R, RStudio |
| rocker/tidyverse | R, R packages (including the tidyverse), RStudio, make, do not use for ARM based computers |
| continuumio/anaconda3 | Python 3.7.4, Ananconda base package distribution, Jupyter notebook |
15.4 Dockerfile FAQ:
1. Where does the Dockerfile live?
The Dockerfile should live in the root directory of your project.
2. How do I make an image from a Dockerfile?
There are 2 ways to do this! I use the first when developing my Dockerfile (to test quickly that it works), and then the second I use when I think I am “done” and want to have it archived on Docker Hub.
Build a Docker image locally on your laptop
Build a Docker image and push it to DockerHub using GitHub Actions,
3. How do I build an image locally on my laptop
From the directory that contains your Dockerfile (usually your project root):
docker build --tag IMAGE_NAME:VERSION .--tag let’s you name and version the Docker image. You can call this anything you want. The version number/name comes after the colon
After I build, I think try to docker run ... to test the image locally. If I don’t like it, or it doesn’t work, I delete the image with docker rmi {IMAGE_NAME}, edit my Dockerfile and try to build and run it again.
15.5 Build a Docker image from a Dockerfile on GitHub Actions
Building a Docker image from a Dockerfile using an automated tool (e.g., DockerHub or GitHub Actions) lets others trust your image as they can clearly see which Dockerfile was used to build which image.
We will do this in this course by using GitHub Actions (a continuous integration tool) because is provides a great deal of nuanced control over when to trigger the automated builds of the Docker image, and how to tag them.
An example GitHub repository that uses GitHub Actions to build a Docker image from a Dockerfile and publish it on DockerHub is available here: https://github.com/ttimbers/dsci522-dockerfile-practice
What is there that we haven’t already seen here? It’s the GitHub Actions workflow file .github/workflows/docker-publish.yml shown below. This workflow can be triggered manually, or automatically when a push to GitHub is made that changes Dockerfile or conda-linux-64.lock. When that happens, a computer on GitHub will copy the contents of the GitHub repository and build and version/tag a Docker image using the Dockerfile contained therein. The image will get two tags: latest and the short GitHub SHA corresponding to the Git commit SHA at the HEAD of main. It will also push the Docker image to DockerHub. For that last step to happen, the code owner’s DockerHub credentials must be stored in the GitHub repository as GitHub repository secrets.
Example .github/workflows/docker-publish.yml file:
# Publishes docker image, pinning actions to a commit SHA,
# and updating most recently built image with the latest tag.
# Can be triggered by either pushing a commit that changes the `Dockerfile`,
# or manually dispatching the workflow.
name: Publish Docker image
on:
workflow_dispatch:
push:
paths:
- 'Dockerfile'
- .github/workflows/docker-publish.yml # or whatever you named the file
- 'conda-linux-64.lock'
jobs:
push_to_registry:
name: Push Docker image to Docker Hub
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: actions/checkout@v4
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }} # do not change, add as secret
password: ${{ secrets.DOCKER_PASSWORD }} # do not change, add as secret
- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@v5
with:
images: ttimbers/dsci522-dockerfile-practice # change to your image
tags: |
type=raw, value={{sha}},enable=${{github.ref_type != 'tag' }}
type=raw, value=latest
- name: Build and push Docker image
uses: docker/build-push-action@v6
with:
context: .
file: ./Dockerfile
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}If you are creating this workflow file locally and try to push the file to your repository, you may run into an error that mentions your current PAT does not have permissions to push workflow files.
Go to (account) Settings > Developer Settings > Personal Access Token > Find and select your PAT, and then make sure the “Workflow” permission is selected.
