23  Introduction to R & Python packages

Learning Objectives

  1. Explain the circumstances in which one should consider creating a package for their code
  2. Name the key files and directories in both R & Python packages and describe the function of each
  3. Given a function and a unit test written in Python, use a Cookiecutter template and poetry to create a small and simple Python package
  4. Given a function and a unit test written in R, use devtools and usethis to create a small and simple R package

23.1 What is a minimal viable software package?

In this course, we are learning to build software packages in R and Python to a high standard, with regard to:

  • code robustness
  • documentation
  • collaboration strategies

This is really great, but that means there are a LOT of files to track and learn about. What can be helpful is to know and be able to identify which files are essential to make a locally installable software package.

23.2 When to start writing an R or Python package

When your DRY radar goes off:

It depends on your level of patience for violating the DRY (Don't Repeat Yourself) principle; for me, the magic number is about 3. If I repeat code within a single script or notebook, on the third try the hair starts to stand up on the back of my neck and I am irritated enough to write either a function or a loop. The same goes for copying files containing functions that I would like to reuse. The third time I do this, I am convinced that I will likely want to use this code again on another project, and thus I should package it to make my life easier.
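As a minimal sketch of that third-repetition moment, the example below replaces a formula written out three times with one function that could later move into a package. The temperature values and the function name `celsius_to_fahrenheit` are made up for illustration.

```python
# Repeated code: the same conversion formula written out three times.
boiling_f = 100.0 * 9 / 5 + 32
body_f = 37.0 * 9 / 5 + 32
freezing_f = 0.0 * 9 / 5 + 32


def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# DRY version: the formula lives in one place and is reused.
temps_f = [celsius_to_fahrenheit(c) for c in (100.0, 37.0, 0.0)]
```

Once a function like this starts getting copied between projects, that is the signal to give it a home in a package.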

When you think your code has general usefulness others could benefit from:

Another reason to package your code is that you think that it is a general enough solution that other people would find it useful. Doing this can be viewed as an act of altruism that also provides you professional gain by putting your data science skills on display (and padding your CV!).

When you want to share data:

The R package ecosystem has found great use in packaging data in addition to software. Packaging data is beyond the scope of this course, but you should know that it is widely done in R and that you can do it if you want or need to.

Optional video

I recommend watching a video from rstudio::conf 2020, called RMarkdown Driven Development by Emily Riederer, because I think it is such a nice narrative on how analysis code can evolve into software packages. Despite this talk being focused on R, it really applies to any data science language, and you could easily re-present this as “Jupyter Driven Development” or “Quarto Driven Development”.

RMarkdown Driven Development by Emily Riederer (video)

23.3 Tour de packages

We will spend some time now exploring GitHub repositories of some Python and R packages to get more familiar with their internals. As we do this, fill out the missing data in the table below (which tries to match up the corresponding R and Python package key files and directories):

| Description | Python | R |
|---|---|---|
| Online package repository or index | | |
| Directory for user-facing package functions | | |
| Directory for tests for user-facing package functions | | |
| Directory for documentation | | |
| Package metadata | | |
| File(s) that define the Namespace | | |
| Tool(s) for easy package creation | | |

Note: at the end of lecture I will review the answers to this table.

23.4 Tour de R packages

Some R packages you may want to explore:

  1. foofactors - the end product of the Whole Game Chapter from the R Packages book.

  2. convertempr - a simple example package I built as a demo for this course

  3. broom - a tidyverse R package

  4. tidyhydat - a reviewed ROpenSci R package

23.4.1 usethis and the evolution of R packaging

What constitutes an R package, and its configuration files, has not changed in a long time. However, the tools commonly used to build them have. Currently, the most straightforward way to build R packages involves the use of two developer packages called usethis and devtools.

These packages automate repetitive tasks that arise during project setup and development. They save you from having to manually create the boilerplate files and directory structure needed for your R package, and they simplify checking, installing and building your package from source code.

Packages created via usethis and devtools can be shared so that they can be installed via source from GitHub, or as package binaries on CRAN. For a package to be shared on CRAN, there is a check and gatekeeping system (in contrast to PyPI). We will learn more about this in later parts of the course.

Fun fact! Jenny Bryan, a past UBC Statistics Professor and UBC MDS founder, is the maintainer of the usethis R package!

23.4.2 Learn more by building a toy R package

For individual assignment 5, you will build a toy R package using a tutorial Jenny Bryan put together: The Whole Game.

After building the toy package, read Package structure and state to deepen your understanding of what packaging means in R and what packages actually are.

23.5 Tour de Python packages

Some Python packages you may want to explore:

  1. pycounts_tt25 - the end product of How to Package a Python from the Python Packages book

  2. PyWebCAT - a Pythonic way to interface with the NOAA National Ocean Service Web Camera Applications Testbed (WebCAT)

  3. laserembeddings - a Python package (uses the Poetry approach to creating Python packages)

  4. pandera - a reviewed PyOpenSci Python package (uses more traditional approach to creating Python packages)

23.5.1 Poetry and the evolution of Python packaging

If you want to package your Python code and distribute it on the Python Package Index (PyPI) so that others can easily use it, you need to build it into a standard distribution format called a wheel.

Previously, to do this it was standard to have 4 configuration files in your package repository:

  • setup.py
  • requirements.txt
  • setup.cfg
  • MANIFEST.in

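In 2016, a new PEP (518) was created. This PEP defined a new configuration file, pyproject.toml, so that a single file can now be used in place of those previous four. PEP 518 specifies the [build-system] table, which declares what is needed to build the package; settings for individual tools such as Poetry go in sub-tables of [tool] (for example, [tool.poetry]).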
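To make the shape of this file concrete, here is a minimal sketch that parses a small pyproject.toml with Python's TOML reader and pulls out the build backend. The package name `toypkg`, the author and the version numbers are invented for illustration, and the [tool.poetry] layout shown is the style Poetry has traditionally generated; the file a current Poetry release writes may look different.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    # Older Python: assumes the third-party "tomli" backport is installed.
    import tomli as tomllib

# Contents of a hypothetical pyproject.toml for a Poetry-managed package.
PYPROJECT_TEXT = """
[tool.poetry]
name = "toypkg"
version = "0.1.0"
description = "A toy package used only for illustration."
authors = ["Jane Doe <jane@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
"""

# Parse the TOML text into nested dictionaries.
config = tomllib.loads(PYPROJECT_TEXT)

# [build-system] tells build tools what they need to build the package.
build_backend = config["build-system"]["build-backend"]

# Tool-specific settings live under [tool.<name>]; here, Poetry's metadata.
package_name = config["tool"]["poetry"]["name"]

print(f"{package_name} is built with {build_backend}")
```

Only the [build-system] table is read by build front ends in general; everything under [tool.poetry] is meaningful to Poetry alone.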

This new single configuration file has inspired some new tools to be created in the package building and management ecosystem in Python. One of the most recent and simplest to use is Poetry (created in 2018). When used to build packages, Poetry roughly does two things:

  1. Uses the pyproject.toml to manage and solve package configurations via the Poetry commands init, add, config, etc

  2. Creates a lock file (poetry.lock) that records the exact versions of the resolved dependencies, and automatically creates and uses a virtual environment (if none is activated) where Poetry commands such as install, build, run, etc. are executed.

Cookiecutter templates are also useful for setting up the boilerplate project and directory structure needed for packages, and conda environments are useful for isolating your package development environment. We will use these in this course as well.

23.5.2 Learn more by building a toy Python package

You can learn more about building Python packages by first building a toy package using a tutorial we have put together for you: How to package a Python.

After building the toy package, read Package structure and distribution to deepen your understanding of what packaging means in Python and what packages actually are.

23.6 Key package things for Python and R packages

| Description | Python | R |
|---|---|---|
| Online package repository or index | PyPI | CRAN |
| Directory for user-facing package functions | src directory | R directory |
| Directory for tests for user-facing package functions | tests | tests/testthat |
| Directory for documentation | docs | man and vignettes |
| Package metadata | pyproject.toml | DESCRIPTION |
| File(s) that define the Namespace | __init__.py | NAMESPACE |
| Tool(s) for easy package creation | Cookiecutter & Poetry | usethis & devtools |
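The Python column of this table can be turned into a quick check. The sketch below builds an empty skeleton for a hypothetical package called `toypkg` in a temporary directory and reports which key paths are missing before and after. The helper `missing_key_paths` is made up for this example, and in practice a Cookiecutter template would create these files for you.

```python
import tempfile
from pathlib import Path

# Key paths from the Python column of the table, for a hypothetical "toypkg".
KEY_PATHS = [
    "pyproject.toml",          # package metadata
    "src/toypkg/__init__.py",  # marks the package and defines its namespace
    "tests",                   # tests for user-facing functions
    "docs",                    # documentation
]


def missing_key_paths(root):
    """Return the key paths that do not exist under the directory `root`."""
    root = Path(root)
    return [p for p in KEY_PATHS if not (root / p).exists()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Nothing exists yet, so every key path is reported as missing.
    missing_before = missing_key_paths(root)

    # Create the skeleton by hand (a template tool would normally do this).
    (root / "src" / "toypkg").mkdir(parents=True)
    (root / "src" / "toypkg" / "__init__.py").touch()
    (root / "tests").mkdir()
    (root / "docs").mkdir()
    (root / "pyproject.toml").touch()

    # After creating the files and directories, nothing is missing.
    missing_after = missing_key_paths(root)
```

The same idea carries over to the R column by swapping in DESCRIPTION, NAMESPACE, R, man and tests/testthat.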

23.7 What is a minimal viable software package?

In this course, we are learning to build software packages in R and Python to a high standard, with regard to:

  • code robustness
  • documentation
  • collaboration strategies

This is really great, but that means there are a LOT of files to track and learn about. What can be helpful is to know and be able to identify which files are essential to make a locally installable software package.