13  Data validation

Learning Objectives

  1. Explain why it is important to validate data used in a data analysis project, and give examples of consequences that might occur with invalid data.
  2. Discuss where data validation should happen in a data analysis project.
  3. List the major checks that should be performed when validating data for data analysis, and justify why they should be used.
  4. Use the Python Pandera package to create a data schema and use it to validate data.
  5. Use the Python Pandera package to drop invalid rows.
  6. List other commonly used data validation packages for Python and R.

13.1 The role of data validation in data analysis

Regardless of the statistical question you are asking in your data analysis project, you will be reading data into Python or R to visualize and/or model it. If there are data quality issues, these issues will be propagated and will become data visualization and/or data modeling issues. This may remind you of an old saying from the mid 20th century:

“Garbage in, garbage out.”

Thus, to ensure that our data visualization and/or modeling results are correct, robust, and of high quality, it is important that we validate, or check, the quality of the data before we perform such analyses. It is important to note that data validation is not sufficient for a correct, robust, and high-quality analysis, but it is necessary.

13.2 Where does data validation fit in data analysis?

If we are going to validate, or check, our data for our data analysis, at what stage in our analysis should we do this? At a minimum, this should be done after data sourcing or extraction, but before data is used in any analysis. In the context of a project where data splitting is needed (e.g., predictive questions using supervised machine learning), this should be done before the data is split.

If there are larger, more severe consequences of the data analysis being incorrect (e.g., autonomous driving), and the data undergoes file input/output as it is passed through a series of scripts, it may be advisable to perform data validation/checking each time the data is read. This can be made more efficient by modularizing the data validation/checking into functions. Regardless of the application of the data analysis, however, this modularization is likely a good idea, because it also allows the data validation/checking code to be tested to ensure it is correct and that invalid data is handled as intended (more on this in the testing chapter later in this book).
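For example, a minimal sketch of what such a reusable (and therefore testable) data validation function might look like, using plain pandas checks (the function name, column names, and file path below are illustrative, not from any particular package):

import pandas as pd


def validate_input_data(df: pd.DataFrame, expected_columns: set) -> pd.DataFrame:
    """Run basic checks on a dataframe, raising an error if any fail."""
    if not expected_columns.issubset(df.columns):
        raise ValueError(f"Missing columns: {expected_columns - set(df.columns)}")
    if df.empty:
        raise ValueError("Dataframe contains no observations.")
    if df.duplicated().any():
        raise ValueError("Dataframe contains duplicate rows.")
    return df


# Call this each time the data is read in a script, e.g.:
# data = validate_input_data(pd.read_csv("data/raw_data.csv"), {"class", "mean_radius"})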

One note of caution for data analyses where data splitting is needed (e.g., splitting data into a training and test set for answering predictive questions): be sure that the data validation checks do not cause any data leakage between the split data sets. For example, when checking for anomalous correlations between the target/response variable and the features/explanatory variables for a predictive question, it is important not to use the entire data set. Using the entire data set for such checks could inadvertently reveal patterns, distributions, or relationships from the test set, which may influence the analyst’s decisions when performing feature and model selection. Given that, data validation checks like this should initially only be done on the training set. It may make sense to also apply them to the test set, but only after feature and model selection have been finalized. A sketch of this ordering is shown below.
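As a concrete illustration, below is a minimal sketch (assuming a pandas dataframe named cancer with a class column as the target, and scikit-learn available for splitting) in which a correlation-style check is computed on the training split only:

import pandas as pd
from sklearn.model_selection import train_test_split

# Split first, so that checks involving the target cannot leak information
# from the test set into feature and model selection decisions.
cancer_train, cancer_test = train_test_split(cancer, test_size=0.3, random_state=123)

# Check feature-target relationships on the training split only, e.g.,
# correlations between numeric features and a numerically encoded target.
train_target_corr = (
    cancer_train
    .assign(target=(cancer_train["class"] == "Malignant").astype(int))
    .select_dtypes("number")
    .corr()["target"]
)
print(train_target_corr)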

13.3 Data validation checks

What kind of data validation, or checks, should be done to ensure the data is of high quality? This depends somewhat on the type of data being used (e.g., tabular, images, language). Here we list validations, or checks, that should be done on tabular data. Readers interested in validations, or checks, for more complex data types (e.g., images, language) are referred to the deepchecks checks gallery for data integrity: https://docs.deepchecks.com/stable/tabular/auto_checks/data_integrity/index.html

13.3.1 Data validation checklist

Checklist references

  1. Chorev et al. (2022). Deepchecks: A Library for Testing and Validating Machine Learning Models and Data. Journal of Machine Learning Research 23: 1-6.
  2. Microsoft Industry Solutions Engineering Team (2024). Engineering Fundamentals Playbook: Testing Data Science and MLOps Code chapter.
  3. Breck et al. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. Proceedings of IEEE Big Data: 1123-1132.
  4. Hynes et al. (2017). The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets. NIPS MLSys Workshop 1(2017): 5.

13.4 Introduction to Python’s Pandera

Python’s Pandera is a package designed to make data validation/checking of dataframes and other dataframe-like objects easy, readable and robust. Key features of Pandera that we will discuss include:

  • Defining a data schema and using it to validate dataframes and other dataframe-like objects
  • Checking the types and properties of columns
  • Performing statistical validation of data
  • Executing validation lazily, so that all validation rules are run before an error is raised
  • Handling invalid data in a number of ways, including throwing errors, writing data validation logs, and dropping invalid observations

13.4.1 Validating data with Pandera

In the simplest use case, Pandera can be used to validate data by first defining an instance of the DataFrameSchema class. This object specifies the properties we expect (and thus would like to check) for our dataframe index and columns. After the DataFrameSchema instance has been created and defined, the DataFrameSchema.validate method can be applied to a pandas.DataFrame instance to validate, or check, all of the properties we specified in the DataFrameSchema instance.

13.4.2 Dataframe Schema

When we create an instance of the DataFrameSchema class, we can specify the properties we expect (and thus would like to check) for our dataframe index and columns.

Creating pa.DataFrameSchema and setting required columns

To create an instance of the DataFrameSchema class, we first import the Pandera package using the alias pa, and then call pa.DataFrameSchema. Below we demonstrate creating an instance of the DataFrameSchema class for the first two columns of the Wisconsin Breast Cancer data set from the UCI Machine Learning Repository (Dua and Graff 2017).

import pandas as pd
import pandera as pa


schema = pa.DataFrameSchema(
    {
        "class": pa.Column(),
        "mean_radius": pa.Column()
    }
)
Note

By default all columns listed are required to be in the dataframe for it to pass validation. If we wanted to make a column optional, we would set required=False in the column constructor.
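For example, if the mean_radius column were allowed to be absent from the dataframe, the schema above could be written as follows (a small sketch varying the schema above):

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(),
        # mean_radius may be missing from the dataframe without failing validation
        "mean_radius": pa.Column(required=False)
    }
)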

Specifying column types

We can specify the type we expect each column to be by passing the type as the first argument to pa.Column. Possible values include:

  • a string alias, as long as it is recognized by pandas.
  • a Python built-in type: int, float, bool, str.
  • a numpy data type.
  • a pandas extension type: it can be an instance (e.g., pd.CategoricalDtype(["a", "b"])) or a class (e.g., pandas.CategoricalDtype) if it can be initialized with default values.
  • a pandera DataType: it can also be an instance or a class.

See the Pandera Data Type Validation docs for details beyond what we present here.
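For example, the following sketch shows several ways (roughly equivalent on most platforms) to declare a float column, plus a categorical column declared with a pandas extension type instance; the schema names here are illustrative:

import numpy as np
import pandas as pd
import pandera as pa

# Equivalent ways to declare the expected type of mean_radius
schema_str_alias = pa.DataFrameSchema({"mean_radius": pa.Column("float64")})
schema_python = pa.DataFrameSchema({"mean_radius": pa.Column(float)})
schema_numpy = pa.DataFrameSchema({"mean_radius": pa.Column(np.float64)})

# A pandas extension type instance for a categorical column
schema_categorical = pa.DataFrameSchema(
    {"class": pa.Column(pd.CategoricalDtype(["Benign", "Malignant"]))}
)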

If we continue our example from above, we can specify that we expect the class column to be a string and the mean_radius column to be a float as shown below:

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str),
        "mean_radius": pa.Column(float)
    }
)

Missingness/null values

By default Column objects assume there should be no null/missing values. If you want to allow missing values, you need to set nullable=True in the column constructor. We demonstrate that below for the mean_radius column of our working example. Note that we do not set this to be true for our class column as we likely do not want to be working with observations where the target/response variable is missing.

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str),
        "mean_radius": pa.Column(float, nullable=True)
    }
)

If you wanted to allow a certain percentage of the values in a particular column to be missing, you could do this by writing a lambda function in a call to pa.Check in the column constructor. We show an example of that below, where we allow up to 5% of the mean_radius column values to be missing.

Note

This is putting the cart a bit before the horse, as we have not yet introduced pa.Check. We will do that in the next section, so please feel free to skip this example and come back to it after you have read that section.

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str),
        "mean_radius": pa.Column(float, 
                                pa.Check(lambda s: s.isna().mean() <= 0.05, 
                                    element_wise=False, 
                                    error="Too many null values in 'mean_radius' column."), 
                                nullable=True)
    }
)
Note

Above we created our custom check on-the-fly using a lambda function. We could do this here because the check was fairly simple. If we needed a custom check that was more complex (e.g., one that needs to generate data as part of the check), we would be better off registering our custom check. For situations like this, we direct the reader to the Pandera Extension docs.
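For reference, the registration pattern described in the Pandera Extension docs looks roughly like the sketch below (the check name here is illustrative; consult the docs for details):

import pandera as pa
import pandera.extensions as extensions


# Register a reusable custom check; it then becomes available as pa.Check.<name>
@extensions.register_check_method(statistics=["min_value", "max_value"])
def is_between_inclusive(pandas_obj, *, min_value, max_value):
    return (min_value <= pandas_obj) & (pandas_obj <= max_value)


schema = pa.DataFrameSchema(
    {
        "mean_radius": pa.Column(
            float, pa.Check.is_between_inclusive(min_value=5, max_value=45)
        )
    }
)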

Checking values in columns

Pandera provides pa.Check, which is useful for checking values within columns. For any type of data, there is usually some reasonable range of values that we would expect; these expectations usually come from domain knowledge about the data. For example, a column named age in a data set about adult human patients' age in years should probably be an integer with values between 18 and 122 (the oldest person whose age has ever been independently verified). To specify a check for a range like this, we can use the pa.Check.between method. We demonstrate how to do this below with our working example, checking that the mean_radius values are between 5 and 45, inclusive.

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    }
)
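For instance, the age check described above might look like the following sketch (assuming a hypothetical patient dataframe with an age column):

patient_schema = pa.DataFrameSchema(
    {
        # Adult patient ages in whole years: 18 through 122, inclusive
        "age": pa.Column(int, pa.Check.between(18, 122))
    }
)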

In our working example, we might also want to check that the class column only contains the strings we think are acceptable for our category label, which would be "Benign" and "Malignant". We can do this using the pa.Check.isin method, which we demonstrate below:

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"])),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    }
)

There are many more built-in pa.Check methods. A list can be found in the Pandera Check API docs: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html#pandera.api.checks.Check

If there is a check you wish to do that is not part of the Pandera Check API, you have two options:

  1. Use a lambda function with boolean logic inside of pa.Check (good for simple checks, similar to the percentage-of-missingness example in the section above), or
  2. Register a custom check (see how to do this in the Pandera Extension docs).

Duplicates

Pandera does not yet have a built-in method to check for duplicate rows in a dataframe; however, you can apply pa.Check to the entire dataframe using a lambda function with boolean logic. Thus, we can use pandas' duplicated method inside a lambda function to check for duplicate rows. We show an example of that below:

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"])),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found.")
    ]
)

Empty observations

Similar to duplicates, there is no built-in Pandera check for empty observations (rows where all values are missing), so again we can apply pa.Check to the entire dataframe using a lambda function with boolean logic.

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"])),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found."),
        pa.Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found.")
    ]
)

13.4.3 Data validation

Once we have specified the properties we expect (and thus would like to check) for our dataframe index and columns by creating an instance of pa.DataFrameSchema, we can use the pa.DataFrameSchema.validate method on a dataframe to check whether the dataframe is valid according to the schema we specified.

To demonstrate this, below we create two very simple versions of the Wisconsin Breast Cancer data set: one that we expect to pass our validation checks, and one where we introduce three data anomalies that should cause some checks to fail.

First we create two data frames:

import numpy as np

valid_data = pd.DataFrame({
    "class": ["Benign", "Benign", "Malignant"],
    "mean_radius": [6.0, 31.2, 22.8]
})

invalid_data = pd.DataFrame({
    "class": ["Benign", "Benign", "benign", "Malignant"],
    "mean_radius": [6.0, 6.0, 31.2, -9999]
})

Let’s see what happens when we apply pa.DataFrameSchema.validate to our valid data:

schema.validate(valid_data)
       class  mean_radius
0     Benign          6.0
1     Benign         31.2
2  Malignant         22.8

It returns a dataframe and does not throw an error. Excellent! What happens when we pass clearly invalid data?

schema.validate(invalid_data)
---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
Cell In[11], line 1
----> 1 schema.validate(invalid_data)

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/api/pandas/container.py:126, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    114     check_obj = check_obj.map_partitions(  # type: ignore [operator]
    115         self._validate,
    116         head=head,
   (...)
    122         meta=check_obj,
    123     )
    124     return check_obj.pandera.add_schema(self)
--> 126 return self._validate(
    127     check_obj=check_obj,
    128     head=head,
    129     tail=tail,
    130     sample=sample,
    131     random_state=random_state,
    132     lazy=lazy,
    133     inplace=inplace,
    134 )

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/api/pandas/container.py:156, in DataFrameSchema._validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    147 if self._is_inferred:
    148     warnings.warn(
    149         f"This {type(self)} is an inferred schema that hasn't been "
    150         "modified. It's recommended that you refine the schema "
   (...)
    153         UserWarning,
    154     )
--> 156 return self.get_backend(check_obj).validate(
    157     check_obj,
    158     schema=self,
    159     head=head,
    160     tail=tail,
    161     sample=sample,
    162     random_state=random_state,
    163     lazy=lazy,
    164     inplace=inplace,
    165 )

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/container.py:105, in DataFrameSchemaBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
    100 components = self.collect_schema_components(
    101     check_obj, schema, column_info
    102 )
    104 # run the checks
--> 105 error_handler = self.run_checks_and_handle_errors(
    106     error_handler,
    107     schema,
    108     check_obj,
    109     column_info,
    110     sample,
    111     components,
    112     lazy,
    113     head,
    114     tail,
    115     random_state,
    116 )
    118 if error_handler.collected_errors:
    119     if getattr(schema, "drop_invalid_rows", False):

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/container.py:180, in DataFrameSchemaBackend.run_checks_and_handle_errors(self, error_handler, schema, check_obj, column_info, sample, components, lazy, head, tail, random_state)
    169         else:
    170             error = SchemaError(
    171                 schema,
    172                 data=check_obj,
   (...)
    178                 reason_code=result.reason_code,
    179             )
--> 180         error_handler.collect_error(
    181             validation_type(result.reason_code),
    182             result.reason_code,
    183             error,
    184             result.original_exc,
    185         )
    187 return error_handler

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/api/base/error_handler.py:54, in ErrorHandler.collect_error(self, error_type, reason_code, schema_error, original_exc)
     47 """Collect schema error, raising exception if lazy is False.
     48 
     49 :param error_type: type of error
     50 :param reason_code: string representing reason for error
     51 :param schema_error: ``SchemaError`` object.
     52 """
     53 if not self._lazy:
---> 54     raise schema_error from original_exc
     56 # delete data of validated object from SchemaError object to prevent
     57 # storing copies of the validated DataFrame/Series for every
     58 # SchemaError collected.
     59 if hasattr(schema_error, "data"):

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/container.py:201, in DataFrameSchemaBackend.run_schema_component_checks(self, check_obj, schema_components, lazy)
    199 for schema_component in schema_components:
    200     try:
--> 201         result = schema_component.validate(
    202             check_obj, lazy=lazy, inplace=True
    203         )
    204         check_passed.append(is_table(result))
    205     except SchemaError as err:

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/api/dataframe/components.py:162, in ComponentSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    133 def validate(
    134     self,
    135     check_obj,
   (...)
    142 ):
    143     # pylint: disable=too-many-locals,too-many-branches,too-many-statements
    144     """Validate a series or specific column in dataframe.
    145 
    146     :check_obj: data object to validate.
   (...)
    160 
    161     """
--> 162     return self.get_backend(check_obj).validate(
    163         check_obj,
    164         schema=self,
    165         head=head,
    166         tail=tail,
    167         sample=sample,
    168         random_state=random_state,
    169         lazy=lazy,
    170         inplace=inplace,
    171     )

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/components.py:136, in ColumnBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
    132             check_obj = validate_column(
    133                 check_obj, column_name, return_check_obj=True
    134             )
    135         else:
--> 136             validate_column(check_obj, column_name)
    138 if lazy and error_handler.collected_errors:
    139     raise SchemaErrors(
    140         schema=schema,
    141         schema_errors=error_handler.schema_errors,
    142         data=check_obj,
    143     )

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/components.py:92, in ColumnBackend.validate.<locals>.validate_column(check_obj, column_name, return_check_obj)
     88         error_handler.collect_error(
     89             validation_type(err.reason_code), err.reason_code, err
     90         )
     91 except SchemaError as err:
---> 92     error_handler.collect_error(
     93         validation_type(err.reason_code), err.reason_code, err
     94     )

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/api/base/error_handler.py:54, in ErrorHandler.collect_error(self, error_type, reason_code, schema_error, original_exc)
     47 """Collect schema error, raising exception if lazy is False.
     48 
     49 :param error_type: type of error
     50 :param reason_code: string representing reason for error
     51 :param schema_error: ``SchemaError`` object.
     52 """
     53 if not self._lazy:
---> 54     raise schema_error from original_exc
     56 # delete data of validated object from SchemaError object to prevent
     57 # storing copies of the validated DataFrame/Series for every
     58 # SchemaError collected.
     59 if hasattr(schema_error, "data"):

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/components.py:72, in ColumnBackend.validate.<locals>.validate_column(check_obj, column_name, return_check_obj)
     69 def validate_column(check_obj, column_name, return_check_obj=False):
     70     try:
     71         # pylint: disable=super-with-arguments
---> 72         validated_check_obj = super(ColumnBackend, self).validate(
     73             check_obj,
     74             deepcopy(schema).set_name(column_name),
     75             head=head,
     76             tail=tail,
     77             sample=sample,
     78             random_state=random_state,
     79             lazy=lazy,
     80             inplace=inplace,
     81         )
     83         if return_check_obj:
     84             return validated_check_obj

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/array.py:81, in ArraySchemaBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
     75 check_obj = self.run_parsers(
     76     schema,
     77     check_obj,
     78 )
     80 # run the core checks
---> 81 error_handler = self.run_checks_and_handle_errors(
     82     error_handler,
     83     schema,
     84     check_obj,
     85     head=head,
     86     tail=tail,
     87     sample=sample,
     88     random_state=random_state,
     89 )
     91 if lazy and error_handler.collected_errors:
     92     if getattr(schema, "drop_invalid_rows", False):

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/array.py:145, in ArraySchemaBackend.run_checks_and_handle_errors(self, error_handler, schema, check_obj, **subsample_kwargs)
    134         else:
    135             error = SchemaError(
    136                 schema=schema,
    137                 data=check_obj,
   (...)
    143                 reason_code=result.reason_code,
    144             )
--> 145             error_handler.collect_error(
    146                 validation_type(result.reason_code),
    147                 result.reason_code,
    148                 error,
    149                 original_exc=result.original_exc,
    150             )
    152 return error_handler

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/api/base/error_handler.py:54, in ErrorHandler.collect_error(self, error_type, reason_code, schema_error, original_exc)
     47 """Collect schema error, raising exception if lazy is False.
     48 
     49 :param error_type: type of error
     50 :param reason_code: string representing reason for error
     51 :param schema_error: ``SchemaError`` object.
     52 """
     53 if not self._lazy:
---> 54     raise schema_error from original_exc
     56 # delete data of validated object from SchemaError object to prevent
     57 # storing copies of the validated DataFrame/Series for every
     58 # SchemaError collected.
     59 if hasattr(schema_error, "data"):

SchemaError: Column 'class' failed element-wise validator number 0: isin(['Benign', 'Malignant']) failure cases: benign

Wow, that’s a lot of output, but what is clear is that an error was thrown. If we read through to the end of the error message, we see the important and useful piece:

pandera.errors.SchemaError: Column 'class' failed element-wise validator number 0: isin(['Benign', 'Malignant']) failure cases: benign

The error arose because in our invalid_data, the column class contained the string "benign", and we specified in our pa.DataFrameSchema instance that we only accept two string values in the class column, "Benign" and "Malignant".

What about the other errors we expect from our invalid data? For example, there’s a value of -9999 in the mean_radius column that is well outside of the range we said was valid in the schema (5, 45), and we have a duplicate row as well. Why are these validation errors not reported? Pandera’s default is to throw an error at the first instance of invalid data. To change this behaviour, we can set lazy=True. When we do this, we see that all errors get reported.

schema.validate(invalid_data, lazy=True)
---------------------------------------------------------------------------
SchemaErrors                              Traceback (most recent call last)
Cell In[12], line 1
----> 1 schema.validate(invalid_data, lazy=True)

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/api/pandas/container.py:126, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    114     check_obj = check_obj.map_partitions(  # type: ignore [operator]
    115         self._validate,
    116         head=head,
   (...)
    122         meta=check_obj,
    123     )
    124     return check_obj.pandera.add_schema(self)
--> 126 return self._validate(
    127     check_obj=check_obj,
    128     head=head,
    129     tail=tail,
    130     sample=sample,
    131     random_state=random_state,
    132     lazy=lazy,
    133     inplace=inplace,
    134 )

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/api/pandas/container.py:156, in DataFrameSchema._validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    147 if self._is_inferred:
    148     warnings.warn(
    149         f"This {type(self)} is an inferred schema that hasn't been "
    150         "modified. It's recommended that you refine the schema "
   (...)
    153         UserWarning,
    154     )
--> 156 return self.get_backend(check_obj).validate(
    157     check_obj,
    158     schema=self,
    159     head=head,
    160     tail=tail,
    161     sample=sample,
    162     random_state=random_state,
    163     lazy=lazy,
    164     inplace=inplace,
    165 )

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pandera/backends/pandas/container.py:123, in DataFrameSchemaBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
    121         return check_obj
    122     else:
--> 123         raise SchemaErrors(
    124             schema=schema,
    125             schema_errors=error_handler.schema_errors,
    126             data=check_obj,
    127         )
    129 return check_obj

SchemaErrors: {
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": null,
                "column": "class",
                "check": "isin(['Benign', 'Malignant'])",
                "error": "Column 'class' failed element-wise validator number 0: isin(['Benign', 'Malignant']) failure cases: benign"
            },
            {
                "schema": null,
                "column": "mean_radius",
                "check": "in_range(5, 45)",
                "error": "Column 'mean_radius' failed element-wise validator number 0: in_range(5, 45) failure cases: -9999.0"
            },
            {
                "schema": null,
                "column": null,
                "check": "Duplicate rows found.",
                "error": "DataFrameSchema 'None' failed series or dataframe validator 0: <Check <lambda>: Duplicate rows found.>"
            }
        ]
    }
}

13.4.4 Handling invalid data

By default, Pandera will throw an error when a check is not passed. Depending on your situation, this can be the desired behaviour (e.g., a static data analysis published in a report) or a very undesirable behaviour that could potentially be dangerous (e.g., an autonomous driving application). In the latter case, we would want to do something other than throw an error. Possibilities we will cover here include dropping invalid observations and writing log files that report the errors.

13.4.5 Dropping invalid observations

In an in-production system, dropping invalid data could be a reasonable path forward instead of throwing an error. Another situation where this might be reasonable is when training a machine learning model on a million observations; you don’t want to throw an error in the middle of training because a single observation is invalid!

To change the behaviour of pa.DataFrameSchema.validate to instead return a dataframe with the invalid rows dropped, we need to do two things:

  1. add drop_invalid_rows=True to our pa.DataFrameSchema instance
  2. add lazy=True to our call to the pa.DataFrameSchema.validate method

Below we demonstrate this with our working example.

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"]), nullable=True),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found."),
        pa.Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found.")
    ],
    drop_invalid_rows=True
)

schema.validate(invalid_data, lazy=True)
    class  mean_radius
0  Benign          6.0
1  Benign          6.0

Hmmm… why did the duplicate row sneak through? This is because Pandera’s row dropping only works for data (column-level) checks, not for DataFrame-wide checks like our checks for duplicate or empty rows. Thus, to make sure we drop these, we need to rely on pandas. We demonstrate how we can do this below:

schema.validate(invalid_data, lazy=True).drop_duplicates().dropna(how="all")
    class  mean_radius
0  Benign          6.0

13.4.6 Writing data validation logs

Is removing the rows sufficient? Not at all! A human should be told that there was invalid data so that upstream data collection, cleaning, and transformation processes can be reviewed to minimize the chances of future invalid data. One way to do this is to again specify lazy=True so that all errors can be observed and reported. Then we can catch the SchemaErrors exception and write the errors to a log file. We show below how to do this for our working example, so that the valid rows are returned as a dataframe named validated_data and the errors are logged to a file called validation_errors.log:

import json
import logging
import pandas as pd
import pandera as pa

# Configure logging
logging.basicConfig(
    filename="validation_errors.log",
    filemode="w",
    format="%(asctime)s - %(message)s",
    level=logging.INFO,
)

# Define the schema
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"]), nullable=True),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True),
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found."),
        pa.Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found."),
    ],
    drop_invalid_rows=False,
)

# Initialize error cases DataFrame
error_cases = pd.DataFrame()
data = invalid_data.copy()

# Validate data and handle errors
try:
    validated_data = schema.validate(data, lazy=True)
except pa.errors.SchemaErrors as e:
    error_cases = e.failure_cases

    # Convert the error message to a JSON string
    error_message = json.dumps(e.message, indent=2)
    logging.error("\n" + error_message)

# Filter out invalid rows based on the error cases
if not error_cases.empty:
    invalid_indices = error_cases["index"].dropna().unique()
    validated_data = (
        data.drop(index=invalid_indices)
        .reset_index(drop=True)
        .drop_duplicates()
        .dropna(how="all")
    )
else:
    validated_data = data
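After running the code above on our invalid data, we can confirm what was flagged and what was logged, for example:

# The failure cases collected by Pandera (one row per failure case)
print(error_cases)

# The same information, as written to the log file
with open("validation_errors.log") as f:
    print(f.read())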

13.5 The data validation ecosystem

We have given a tour of one of the packages in the data validation ecosystem; however, there are a few others that are good to know about. We list them below:

Python:

R:

13.5.1 Deep Checks

In particular, the Deep Checks package is quite useful due to its high-level abstraction of several machine learning data validation checks that you would otherwise have to code manually if you chose to use something like Pandera. One example from the checklist above, demonstrated below, is checking for anomalous correlations between the target/response variable and the features/explanatory variables (feature-label correlation).

To use this, we first have to create a Deep Checks Dataset object (specifying the data set, the target/response variable, and any categorical features):

from deepchecks.tabular import Dataset
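# Note: cancer_train is assumed to be the training split of the breast cancer
# data, created earlier in the analysis (e.g., via a train/test split).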


cancer_train_ds = Dataset(cancer_train, label="class", cat_features=[])

Once we have that, we can use the FeatureLabelCorrelation() check, set the maximum threshold we’ll allow (here 0.9), and run the check:

from deepchecks.tabular.checks import FeatureLabelCorrelation


check_feat_lab_corr = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.9)
check_feat_lab_corr_result = check_feat_lab_corr.run(dataset=cancer_train_ds)

Finally, we can check if the result of the FeatureLabelCorrelation() validation has failed. If it has (i.e., correlation is above the acceptable threshold), we can do something, like raise a ValueError with an appropriate error message:

if not check_feat_lab_corr_result.passed_conditions():
    raise ValueError("Feature-Label correlation exceeds the maximum acceptable threshold.")
Note

Notice the names of the data frame and the Deep Checks data set above? They have the word “train” in them. This is important! Some data validation checks can cause data leakage if we perform them on the entire data set before finalizing feature and model selection. Be conscientious about your data validation checks to ensure they do not introduce data leakage.

Deep Checks has a nice gallery of different data validation checks for which it has high-level functions: https://docs.deepchecks.com/stable/tabular/auto_checks/data_integrity/index.html