8  Virtual environments

Learning Objectives

  1. Explain what a virtual environment is and why it can be useful for reproducible data analyses
  2. Discuss the advantages and limitations of virtual environment tools (e.g., conda and renv) in the context of reproducible data analyses
  3. Use, create and share virtual environments (for example, with conda and renv)

Attribution

The conda virtual environment section of this guide was originally published at http://geohackweek.github.io/ under a CC-BY license and has been updated to reflect recent changes in conda, as well as modified slightly to fit the MDS lecture format.

8.1 Virtual environments

Virtual environments let’s you have multiple versions of programming languages packages, and other programs on the same computer, while keeping them isolated so they do not create conflicts with each other. In practice virtual environments are used in one or multiple projects. And you might have several virtual environments stored on your laptop, so that you can have a different collection of versions programming languages and their packages for each project, as needed.

Most virtual environment tools have a sharing functionality which aids in making data science projects reproducible, as not only is there a record of the computational environment, but that computational environment can be shared to others computer - facilitating the reproduction of results from data and code. This facilitation comes from the fact that programming languages and their packages are not static - they change! There are new features added, bugs are fixed, etc, and this can impact how your code runs! Therefore, for a data science project to be reproducible across time, you need the computational environment in addition to the data and the code.

There are several other major benefits of using environments:

  • If two of your projects on your computer rely on different versions of the same package, you can install these in different environments.
  • If you want to play around with a new package, you don’t have to change the packages versions you use for your data analysis project and risk messing something up (package version often get upgraded when we install other new packages that share dependencies).
  • When you develop your own packages, it is essential to use environments, since you want to to make sure you know exactly which packages yours depend on, so that it runs on other systems than your own.

There are MANY version virtual environment tools out there, even if we just focus on R and Python. When we do that we can generate this list:

R virtual environment tools

  • packrat
  • renv
  • conda

Python virtual environment tools

  • venv
  • virtualenv
  • conda
  • mamba
  • poetry
  • pipenv
  • … there may be more that I have missed.

In this course, we will learn about conda and renv. conda is nice because it can work with both R and Python. Although a downside of conda is that it is not as widely adopted in the R community as Python is, and therefore there are less R packages available from it, and less recent versions of those R packages than available directly from the R package index (CRAN). It is true, that you can create a conda package for any R package that exists on CRAN, however this takes time and effort and is sometimes non-trivial.

Given that, we will also learn about renv - a new virtual environment tool in R that is gaining widespread adoption. It works directly with the packages on CRAN, and therefore allows users to crete R virtual environments with the most up to date packages, and all R packages on CRAN, with less work compared to conda.

Note

Note on terminology: Technically what we are discussing here in this topic are referred to as virtual environments. However, in practice we often drop the “virtual” when discussing this and refer to these as simply “environments”. That may happen in these lecture notes, as well as in the classroom.

8.2 Conda

conda is an open source package and environment management system for any programming language; though it is most popular in the Python community. conda was originally developed by Anaconda Inc. and bundled with their Anaconda distribution of Python. However, conda’s widespread popularity and utility led to its decoupling into its own package.

It is now available for installation via: - Anaconda Python distribution - Miniconda Python distribution (this is what we recommended most of you install for this course) - Miniforge (this is what we recommended folks with Mac ARM machines install for this course)

Conda builds of R and Python packages, are in fact R and Python packages and built from R and Python source code, but they are packaged up and built differently, and with a different tool chain. How to create conda packages from R and Python source code is beyond the scope of this course. However, we direct keen learners of this topic to the documentation on how to do this: - Conda-build documentation

What we will focus on learning is how to use conda to create virtual environments, record the components of the virtual environment and share the virtual environment with collaborators in a way that they can recreate it on their computer.

8.2.1 Managing Conda

Let’s first start by checking if conda is installed (it should be if we followed the recommended course computer setup instructions) by running:

conda --version

To see which conda commands are available, type conda --help. To see the full documentation for any command of these commands, type the command followed by --help. For example, to learn about the conda update command:

conda update --help

Let’s update our conda to the latest version. Note that you might already have the latest version since we downloaded it recently.

conda update conda

You will see some information about what there is to update and be asked if you want to confirm. The default choice is indicated with [], and you can press Enter to accept it. It would look similar to this:

Using Anaconda Cloud api site https://api.anaconda.org
Fetching package metadata: ....
.Solving package specifications: .........

Package plan for installation in environment //anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-env-2.6.0            |                0          601 B
    ruamel_yaml-0.11.14        |           py27_0         184 KB
    conda-4.2.12               |           py27_0         376 KB
    ------------------------------------------------------------
                                           Total:         560 KB

The following NEW packages will be INSTALLED:

    ruamel_yaml: 0.11.14-py27_0

The following packages will be UPDATED:

    conda:       4.0.7-py27_0 --> 4.2.12-py27_0
    conda-env:   2.4.5-py27_0 --> 2.6.0-0
    python:      2.7.11-0     --> 2.7.12-1
    sqlite:      3.9.2-0      --> 3.13.0-0

Proceed ([y]/n)? y

Fetching packages ...
conda-env-2.6. 100% |################################| Time: 0:00:00 360.78 kB/s
ruamel_yaml-0. 100% |################################| Time: 0:00:00   5.53 MB/s
conda-4.2.12-p 100% |################################| Time: 0:00:00   5.84 MB/s
Extracting packages ...
[      COMPLETE      ]|###################################################| 100%
Unlinking packages ...
[      COMPLETE      ]|###################################################| 100%
Linking packages ...
[      COMPLETE      ]|###################################################| 100%

In this case, conda itself needed to be updated, and along with this update some dependencies also needed to be updated. There is also a NEW package that was INSTALLED in order to update conda. You don’t need to worry about remembering to update conda, it will let you know if it is out of date when you are installing new packages.

8.2.2 Managing conda environments

What is a conda environment and why is it so useful?

Using conda, you can create an isolated R or Python virtual environment for your project. The default environment is the base environment, which contains only the essential packages from Miniconda (and anything else you have installed in it since installing Miniconda). You can see that your shell’s prompt string is prefaced with (base) when you are inside this environment:

(base) Helps-MacBook-Pro:~ tiffany$

In the computer setup guide, we asked you to follow instructions so that this environment will be activated by default every time you open your terminal.

To create another environment on your computer, that is isolated from the (base) environment you can either do this through:

  1. Manual specifications of packages.
  2. An environment file in YAML format (environment.yml).

We will now discuss both, as they are both relevant workflows for data science. When do you use one versus the other? I typically use the manual specifications of packages when I am creating a new data science project. From that I generate an environment file in YAML format that I can share with collaborators (or anyone else who wants to reproduce my work). Thus, I use an environment file in YAML format when I join a project as a collaborator and I need to use the same environment that has been previously used for that project, or when I want to reproduce someone else’s work.

8.2.3 Creating environment by manually specifying packages

We can create test_env conda environment by typing conda -n <name-of-env>. However, it is often useful to specify more than just the name of the environment, e.g. the channel from which to install packages, the Python version, and a list of packages to install into the new env. In the example below, I am creating the test_env environment that uses python 3.12 and a list of libraries: jupyterlab and pandas.

conda create -n test_env -c conda-forge python=3.12 jupyterlab pandas=2.2.3

conda will solve any dependencies between the packages like before and create a new environment with those packages. Usually, we don’t need to specify the channel, but in this case I want to get the very latest version of these packages, and they are made available in conda-forge before they reach the default conda channel.

To activate this new environment, you can type conda activate test_env (and conda deactivate for deactivating). Since you will do this often, we created an alias shortcut ca that you can use to activate environments. To know the current environment that you’re in you can look at the prefix of the prompt string in your shell which now changed to (test_env). And to see all your environments, you can type conda env list.

8.2.4 Seeing what packages are available in an environment

We will now check packages that are available to us. The command below will list all the packages in an environment, in this case test_env. The list will include versions of each package, the specific build, and the channel that the package was downloaded from. conda list is also useful to ensure that you have installed the packages that you desire.

conda list
# packages in environment at //miniconda/envs/test_env:
#
Using Anaconda Cloud api site https://api.anaconda.org
blas                      1.1                    openblas    conda-forge
ca-certificates           2016.9.26                     0    conda-forge
certifi                   2016.9.26                py27_0    conda-forge
cycler                    0.10.0                   py27_0    conda-forge
freetype                  2.6.3                         1    conda-forge
functools32               3.2.3.2                  py27_1    conda-forge
libgfortran               3.0.0                         0    conda-forge

8.2.5 Installing conda package

Under the name column of the result in the terminal or the package column in the Anaconda Cloud listing, shows the necessary information to install the package. e.g. conda-forge/rasterio. The first word list the channel that this package is from and the second part shows the name of the package.

To install the latest version available within the channel, do not specify in the install command. We will install version 0.35 of rasterio from conda-forge into test_env in this example. Conda will also automatically install the dependencies for this package.

conda install -c conda-forge rasterio=0.35

If you have a few trusted channels that you prefer to use, you can pre-configure these so that everytime you are creating an environment, you won’t need to explicitly declare the channel.

conda config --add channels conda-forge

8.2.6 Removing a conda package

We decided that rasterio is not needed in this tutorial, so we will remove it from test_env. Note that this will remove the main package rasterio and its dependencies (unless a dependency was installed explicitly at an earlier point in time or is required be another package).

conda remove -n test_env rasterio
Using Anaconda Cloud api site https://api.anaconda.org
Fetching package metadata .........
Solving package specifications: ..........

Package plan for package removal in environment //anaconda/envs/test_env:

The following packages will be REMOVED:

    rasterio: 0.35.1-np111py27_1 conda-forge

Proceed ([y]/n)? y

Unlinking packages ...
[      COMPLETE      ]|#######################################################################################################| 100%

8.2.7 Sharing Environments with others

To share an environment, you can export your conda environment to an environment file, which will list each package and its version in the format package=version=build.

Exporting your environment to a file called environment.yaml (it could be called anything, but this is the conventional name and using it makes it easy for others to recognize that this is a conda env file, the extension can be either .yaml or .yml):

conda env export --from-history -f environment.yml

Remember that .yaml files are plain text, so you can use a text editor such as VS Code to open them. If you do, you will realize that this environment file has A LOT more packages than jupyterlab and pandas. This is because the default behavior is to also list the dependencies that were installed together with these packages, e.g. numpy. This is good in the sense that it gives an exact copy of everything in your environment.

We use the --from-history flag/option above as some dependencies might differ between operating systems, so this file might not work with someone from a different OS. The --from-history flag, looks at the history of the packages you explicitly told conda to install and only list those in the export. The required dependencies will then be handled in an OS-specific manner during the installation, which guarantees that they will work across OSes. This environment.yaml file would be much shorter and look something like this:

name: test_env
channels:
  - conda-forge
  - defaults
dependencies:
  - conda
  - python=3.7
  - pandas==1.0.2
  - jupyterlab

Importantly, this will not include the package version unless you included it when you installed with the package==version syntax. For an environment to be reproducible, you NEED to add the version string manually.

8.2.8 Creating environment from an environment file

Now, let’s install environment.yml environment file above so that we can create a conda environment called test_env.

$ conda env create --file environment.yml
Exercise

Create an environment on your laptop with an older version of Python!

  1. Clone this GitHub repository.

  2. Try to run some antiquated (Python 3.0.0 and higher compatible) Python code, such as python -c "print 'Back from the Future'". This should fail.

  3. In the terminal, navigate to the root of the repository and run: conda env create --file environment.yml (note: If you have a M1/M2 Mac, please run this command instead: CONDA_SUBDIR=osx-64 conda env create –file environment.yml).

  4. Activate the environment by typing conda activate oldie_but_a_goodie

  5. Try again to run some antiquated (Python 3.0.0 and higher compatible) Python code, such as python -c "print 'Back from the Future'". If you activated your environment correctly, this should succeed!

8.2.9 Copying an environment

We can make an exact copy of an environment to an environment with a different name. This maybe useful for any testing versus live environments or different Python 3.7 versions for the same packages. In this example, test_env is cloned to create live_env.

conda create --name live_env --clone test_env

8.2.10 Listing all environments on your laptop

You may have created an environment and forgotten what you named it, or you want to do some cleanup and delete old environments (next topic), and so you want to see which exist on your computer and remove the ones you are no longer using. To do this we will use the info command along with the --envs flag/option.

conda info --envs
Note

Listing all the conda environments on your laptop with this command also shows you where conda stores these environments. Typically conda environments are stored in /Users/<USERNAME>/opt/miniconda3/envs. This means that conda environments are typically in your terminal’s path, resulting in the environments being accessible from any directory on your computer, regardless of where they were created. However, despite this flexibility, commonly one environment is created per project, and the environment.yml file that is used for sharing the conda environment is stored in the project root.

8.2.11 Deleting an environment

Since we are only testing out our environment, we will delete live_env to remove some clutter. Make sure that you are not currently using live_env.

conda env remove -n live_env

8.2.12 Making environments work well with JupyterLab

In brief, you need to install a kernel in the new conda environment in any new environment your create (ipykernel for Python and the r-irkernel package for R), and the nb_conda_kernels package needs to be installed in the environment where JupyterLab is installed.

By default, JupyterLab only sees the conda environment where it is installed. Since it is quite annoying to install JupyterLab and its extensions separately in each environment, there is a package called nb_conda_kernels that makes it possible to have a single installation of JupyterLab access kernels in other conda environments. This package needs to be installed in the conda environment where JupyterLab is installed. For the computer setup for this course, we did that in the base environment, so that is where you would need to install nb_conda_kernels to make this work.

More details are available in the nb_conda_kernels README).

Remember that when you forget a specific command you can type in the help command we have created mds-help in you terminal to see a list of all commands we use in MDS.

8.3 renv

In R, environments can also be managed by renv, which works with similar principles as conda, and other virtual environment managers, but the commands are different. Detailed documentation for renv, can found at the package website.

renv differs from conda in the way that it adds package dependencies. Briefly, when you prompt renv to create (or update) a file to record the project dependencies (done via renv’s snapshot() function), it recursively crawls the files in the project looking for calls to library() or require().

The key file renv creates for recording and sharing environments is called renv.lock in the project’s root directory. Other files are created in the project’s root directory when you use renv but renv.lock is the file that documents which programming languages and packages (including versions) are used in the project. It is recommended that when sharing an renv environment that you version control renv.lock, .Rprofile and renv/activate.R to facilitate collaboration. When you setup an renv environment with renv::init() it creates a renv/.gitignore file so that files that renv creates and uses locally but are not helpful to share, are not shared.

renv environments work best in the context of RStudio projects - and so it is recommended that you create an RStudio project that corresponds to the root of your data science project repository. If this is not done - renv will crawl files outside of the project, looking for dependencies.

Below we create a table with the general virtual environment commands for renv as well as the equivalent conda command for comparison:

Description renv command conda command
Create a new environment without an environment file renv::init() conda create -n <ENV_NAME> ...
Activate a new environment renv::activate() conda activate <ENV_NAME>
Export environment to a file renv::snapshot() conda env export --from-history -f environment.yml
Create a new environment from an environment file renv::restore() conda env create --file environment.yml