20  Testing functions Python with pytest

20.1 Example of workflow for writing python functions and tests for data science

Let’s say we want to write a function for a task we repeatedly are performing in our data analysis. For example, summarizing the number of observations in each class. This is a common task performed for almost every classification problem to examine how many classes there are to understand if we are facing a binary or multi-class classification problem, as well as to examine whether there are any class imbalances that we may need to deal with before tuning our models.

1. Write the function specifications and documentation - but do not implement the function:

The first thing we should do is write the function specifications and documentation. This can effectively represented by an empty function and docstrings in Python as shown below:

# translated my R Code to Python using ChatGPT.
# R Code source:
# https://github.com/ttimbers/demo-tests-ds-analysis-r/blob/main/R/count_classes.R

import pandas as pd


def count_classes(data_frame, class_col):
    """
    Count class observations in a pandas DataFrame.

    Creates a new DataFrame with two columns, listing the classes present
    in the input DataFrame and the number of observations for each class.

    Parameters:
    ----------
    data_frame : pandas.DataFrame
        The input DataFrame containing the data to analyze.
    class_col : str
        The name of the column in the DataFrame containing class labels.

    Returns:
    -------
    pandas.DataFrame
        A DataFrame with two columns:
        - 'class': Lists the unique classes found in the input DataFrame.
        - 'count': Lists the number of observations for each class in the input DataFrame.

    Examples:
    --------
    >>> import pandas as pd
    >>> data = pd.read_csv('mtcars.csv')  # Replace 'mtcars.csv' with your dataset file
    >>> result = count_classes(data, 'am')
    >>> print(result)

    Notes:
    -----
    This function uses the pandas library to perform grouping and counting
    of class observations in the input DataFrame.

    """
    # Group by the class column and count the observations for each class
    result = data_frame.groupby(class_col).size().reset_index(name="count")
    # Rename the class column to match the R function
    result = result.rename(columns={class_col: "class"})
    return result

2. Plan the test cases and document them:

Next, we should plan out our test cases and start to document them. At this point we can sketch out a skeleton for our test cases with code, but we are not yet ready to write them, as we first will need to reproducibly create test data that is useful for assessing whether your function works as expected. So considering our function specifications, some kinds of input we might anticipate our function may receive, and correspondingly what it should return is listed below:

Simple expected use test case #1
  • Dataframe with 2 classes, with 2 observations per class

Function input:

  1. dataframe
  class_labels values
0       class1    0.2
1       class2    0.5
2       class1    0.8
3       class2    0.5
  1. unquoted column name
class_labels

Expected function output:

Dataframe (or tibble)

   class count
0 class1     2
1 class2     2
Simple expected use test case #2
  • Dataframe with 2 classes, with 2 observations for one class, and only one observation in the other

Function input:

  1. dataframe
  class_labels values
0       class1    1.0
1       class1    0.9
2       class2    0.9
  1. unquoted column name
class_labels

Expected function output:

Dataframe (or tibble)

   class count
0 class1     2
1 class2     1
Edge test case #1
  • Dataframe with 1 classes, with 2 observations for that class

Function input:

  1. dataframe
  class_labels values
0       class1    0.7
1       class1    0.5
  1. unquoted column name
class_labels

Expected function output:

Dataframe (or tibble)

   class count
0 class1     2
Edge test case #2
  • Dataframe with no class observations

Function input:

  1. dataframe
  class_labels values
  1. unquoted column name
class_labels

Expected function output:

Dataframe (or tibble)

   class count

20.1.1 Error test case #1

  • A dictionary with 2 classes, with 2 observations for each class

Function input:

  1. list
["class1", "class2", "class1", "class2"]
  1. unquoted list element name
[0.4, 0.7, 0.0, 0.6]

Expected function output:

Error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 36, in count_classes
AttributeError: 'list' object has no attribute 'groupby'

Next, I sketch out a skeleton for the unit tests. For Python, we will use the well maintained and popular pytest Python package for writing our tests. For extra resources on Pytest beyond what is demonstrated here, we recommend reading: - pytest documentation

With Python and pytest, we create a test function for each related group of tests for a function. For our example, we will create the three test functions shown below:

import pytest

# `count_classes` should return a data frame, or tibble,
# with the number of rows corresponding to the number of unique classes
# in the `class_col` from the original dataframe. The new dataframe
# will have a `class column` whose values are the unique classes,
# and a `count` column, whose values will be the number of observations
# for each  class
def test_count_classes():
  pass

# `count_classes` should return an empty data frame, or tibble,
# if the input to the function is an empty data frame
def test_count_classes_empty():
  pass

# `count_classes` should throw an error when incorrect types
# are passed to the `data_frame` argument
def test_count_classes_errors():
  pass
3. Create test data that is useful for assessing whether your function works as expected:

Now that we have a plan, we can create reproducible test data for that plan! When we do this, we want to keep our data as small and tractable as possible. We want to test things we know the answer to, or can at a minimum calculate by hand. We will use R code to reproducibly create the test data. We will need to do this for the data we will feed in as inputs to our function in the tests, as well as the data we expect our function to return.

import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(2024)

# Test input data
two_classes_2_obs = pd.DataFrame(
    {
        "class_labels": ["class1", "class2", "class1", "class2"],
        "values": np.round(np.random.uniform(size=4), 1),
    }
)

two_classes_2_and_1_obs = pd.DataFrame(
    {
        "class_labels": ["class1", "class1", "class2"],
        "values": np.round(np.random.uniform(size=3), 1),
    }
)

one_class_2_obs = pd.DataFrame(
    {
        "class_labels": ["class1", "class1"],
        "values": np.round(np.random.uniform(size=2), 1),
    }
)

empty_df = pd.DataFrame({"class_labels": [], "values": []})

class_labels = (["class1", "class2", "class1", "class2"],)
class_values = [0.4, 0.7, 0.0, 0.6]

# Expected test outputs
two_classes_2_obs_output = pd.DataFrame(
    {"class": ["class1", "class2"], "count": [2, 2]}
)

two_classes_2_and_1_obs_output = pd.DataFrame(
    {"class": ["class1", "class2"], "count": [2, 1]}
)

one_class_2_obs_output = pd.DataFrame({"class": ["class1"], "count": [2]})
4. Write the tests to evaluate your function based on the planned test cases and test data:

Now that we have the skeletons for our tests, and our reproducible test data, we can actually write the internals for our tests! We will do this by using assert and pytest.raises functions.

import pytest

# `count_classes` should return a data frame, or tibble,
# with the number of rows corresponding to the number of unique classes
# in the `class_col` from the original data frame. The new data frame
# will have a `class column` whose values are the unique classes,
# and a `count` column, whose values will be the number of observations
# for each  class
def test_count_classes_succes():
  assert isinstance(count_classes(two_classes_2_obs, "class_labels"), pd.DataFrame)
  assert count_classes(two_classes_2_obs, "class_labels").equals(two_classes_2_obs_output)
  assert count_classes(two_classes_2_and_1_obs, "class_labels").equals(two_classes_2_and_1_obs_output)

# `count_classes` should return an empty data frame, or tibble,
# if the input to the function is an empty data frame
def test_count_classes_edge():
  assert count_classes(one_class_2_obs, "class_labels").equals(one_class_2_obs_output)
  assert count_classes(empty_df, "class_labels").empty

# `count_classes` should throw an error when incorrect types
# are passed to the data frame argument
def test_count_classes_errors():
  with pytest.raises(AttributeError):
    count_classes(class_values, class_labels)

Wait what??? Most of our tests fail…

Yes, we expect that, we haven’t written our function body yet!

5. Implement the function by writing the needed code in the function body to pass the tests:

FINALLY!! We can write the function body for our function! And then call our tests to see if they pass!

# translated my R Code to Python using ChatGPT.
# R Code source:
# https://github.com/ttimbers/demo-tests-ds-analysis-r/blob/main/R/count_classes.R

import pandas as pd


def count_classes(data_frame, class_col):
    """
    Count class observations in a pandas DataFrame.

    Creates a new DataFrame with two columns, listing the classes present
    in the input DataFrame and the number of observations for each class.

    Parameters:
    ----------
    data_frame : pandas.DataFrame
        The input DataFrame containing the data to analyze.
    class_col : str
        The name of the column in the DataFrame containing class labels.

    Returns:
    -------
    pandas.DataFrame
        A DataFrame with two columns:
        - 'class': Lists the unique classes found in the input DataFrame.
        - 'count': Lists the number of observations for each class in the input DataFrame.

    Examples:
    --------
    >>> import pandas as pd
    >>> data = pd.read_csv('mtcars.csv')  # Replace 'mtcars.csv' with your dataset file
    >>> result = count_classes(data, 'am')
    >>> print(result)

    Notes:
    -----
    This function uses the pandas library to perform grouping and counting
    of class observations in the input DataFrame.

    """
    # Group by the class column and count the observations for each class
    result = data_frame.groupby(class_col).size().reset_index(name='count')
    # Rename the class column to match the R function
    result = result.rename(columns={class_col: 'class'})
    return result
import pytest

# `count_classes` should return a data frame, or tibble,
# with the number of rows corresponding to the number of unique classes
# in the `class_col` from the original data frame. The new data frame
# will have a `class column` whose values are the unique classes,
# and a `count` column, whose values will be the number of observations
# for each  class
def test_count_classes_succes():
  assert isinstance(count_classes(two_classes_2_obs, "class_labels"), pd.DataFrame)
  assert count_classes(two_classes_2_obs, "class_labels").equals(two_classes_2_obs_output)
  assert count_classes(two_classes_2_and_1_obs, "class_labels").equals(two_classes_2_and_1_obs_output)

# `count_classes` should return an empty data frame, or tibble,
# if the input to the function is an empty data frame
def test_count_classes_edge():
  assert count_classes(one_class_2_obs, "class_labels").equals(one_class_2_obs_output)
  assert count_classes(empty_df, "class_labels").empty

# `count_classes` should throw an error when incorrect types
# are passed to the data frame argument
def test_count_classes_errors():
  with pytest.raises(AttributeError):
    count_classes(class_values, class_labels)

The tests passed!

Are we done? For the purposes of this demo, yes! However in practice you would usually cycle through steps 2-5 two-three more times to further improve our tests and and function!

Discussion: Does test-driven development afford testability? How might it do so? Let’s discuss controllability, observability, isolateablilty, and automatability in our case study of test-driven development of count_classes.

20.1.2 Where do the function and test files go?

In the workflow above, we skipped over where we should put our functions we will use in our data analyses, as well as where we put the tests for our function, and how we call those tests!

We summarize the answer to these questions below, but highly recommend you explore and test out our demonstration GitHub repository that has a minimal working example of this: https://github.com/ttimbers/demo-tests-ds-analysis-python

Where does the function go?

In Python, functions should be abstracted to Python scripts (plain text files that end in .py) which typicaly live in the project’s src directory. Commonly we name the Python script with the same name as the function (however, we might choose a more general name if the Python script contains many functions).

In the analysis script or notebook file where we call the function (e.g. eda.py) we need to call from PATH.TO.FILE import FUNCTION before we are able to use the function(s) contained in that Python script inside our analysis file. If we have to go up a directory (to a parent directory) in our PATH to point to the file containing the function, we will need to add that path to Python. We can do that via:

import sys
import os
sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
from src.count_classes import count_classes

Where do the tests go?

The tests for the function should live in tests/test-FUNCTION_NAME.py. The test suite can be run via running pytest in the project root.

20.2 Reproducibly generating test data

As highlighted above, where at all possible, we should use code to generate reproducible, simple and tractable helper data for our tests. Code for generating data can live in the test files themselves. If the code is complex, or the data needs to be shared across several test files, then we recommend defining pytest fixture functions in a file named conftest.py in the tests directory. A nice example of how this works can be seen here: https://www.tutorialspoint.com/pytest/pytest_conftest_py.htm

In some cases, many test cases are needed to be generated and iterated over to test a function that can take many values. In these cases, parameterization can be useful. An example of how to do this with pytest is shown here: https://www.tutorialspoint.com/pytest/pytest_parameterizing_tests.htm

20.3 Common types of test levels in data science

  1. Unit tests - exercise individual components, usually methods or functions, in isolation. This kind of testing is usually quick to write and the tests incur low maintenance effort since they touch such small parts of the system. They typically ensure that the unit fulfills its contract making test failures more straightforward to understand. This is the kind of tests we wrote for our example for count_classes above.

  2. Integration tests - exercise groups of components to ensure that their contained units interact correctly together. Integration tests touch much larger pieces of the system and are more prone to spurious failure. Since these tests validate many different units in concert, identifying the root-cause of a specific failure can be difficult. In data science, this might be testing whether several functions that call each other, or run in sequence, work as expected (e.g., tests for a tidymodel’s workflow function)

Source: CPSC 310 class notes from Reid Holmes, UBC

20.4 Testing in Python resources

20.5 Attribution: