8 Virtual environments

Learning Objectives

Explain what a virtual environment is and why it can be useful for reproducible data analyses
Discuss the advantages and limitations of virtual environment tools (e.g., conda and renv) in the context of reproducible data analyses
Use, create and share virtual environments (for example, with conda and renv)

Attribution

The conda virtual environment section of this guide was originally published at http://geohackweek.github.io/ under a CC-BY license and has been updated to reflect recent changes in conda, as well as modified slightly to fit the MDS lecture format.

8.1 Virtual environments

Virtual environments let’s you have multiple versions of programming languages packages, and other programs on the same computer, while keeping them isolated so they do not create conflicts with each other. In practice virtual environments are used in one or multiple projects. And you might have several virtual environments stored on your laptop, so that you can have a different collection of versions programming languages and their packages for each project, as needed.

Most virtual environment tools have a sharing functionality which aids in making data science projects reproducible, as not only is there a record of the computational environment, but that computational environment can be shared to others computer - facilitating the reproduction of results from data and code. This facilitation comes from the fact that programming languages and their packages are not static - they change! There are new features added, bugs are fixed, etc, and this can impact how your code runs! Therefore, for a data science project to be reproducible across time, you need the computational environment in addition to the data and the code.

There are several other major benefits of using environments:

If two of your projects on your computer rely on different versions of the same package, you can install these in different environments.
If you want to play around with a new package, you don’t have to change the packages versions you use for your data analysis project and risk messing something up (package version often get upgraded when we install other new packages that share dependencies).
When you develop your own packages, it is essential to use environments, since you want to to make sure you know exactly which packages yours depend on, so that it runs on other systems than your own.

There are MANY version virtual environment tools out there, even if we just focus on R and Python. When we do that we can generate this list:

R virtual environment tools

packrat
renv
conda

Python virtual environment tools

venv
virtualenv
conda
mamba
poetry
pipenv
… there may be more that I have missed.

In this course, we will learn about conda and renv. conda is nice because it can work with both R and Python. Although a downside of conda is that it is not as widely adopted in the R community as Python is, and therefore there are less R packages available from it, and less recent versions of those R packages than available directly from the R package index (CRAN). It is true, that you can create a conda package for any R package that exists on CRAN, however this takes time and effort and is sometimes non-trivial.

Given that, we will also learn about renv - a new virtual environment tool in R that is gaining widespread adoption. It works directly with the packages on CRAN, and therefore allows users to crete R virtual environments with the most up to date packages, and all R packages on CRAN, with less work compared to conda.

Note

Note on terminology: Technically what we are discussing here in this topic are referred to as virtual environments. However, in practice we often drop the “virtual” when discussing this and refer to these as simply “environments”. That may happen in these lecture notes, as well as in the classroom.