import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "class": pa.Column(),
        "mean_radius": pa.Column()
    }
)
13 Data validation
Learning Objectives
- Explain why it is important to validate data used in a data analysis project, and give examples of consequences that might occur with invalid data.
- Discuss where data validation should happen in a data analysis project.
- List the major checks that should be performed when validating data for data analysis, and justify why they should be used.
- Use the Python Pandera package to create data schema for checking and to validate data.
- Use the Python Pandera package to drop invalid rows.
- List other commonly used data validation packages for Python and R.
13.1 The role of data validation in data analysis
Regardless of the statistical question you are asking in your data analysis project, you will be reading in data to Python or R to visualize and/or model. If there are data quality issues, these issues will be propagated and will become data visualization and/or data model issues. This may remind you of an old saying from the mid 20th century:
“Garbage in, garbage out.”
Thus, to ensure that our data visualization and/or modeling results are correct, robust and of high quality, it is important that we validate, or check, the quality of the data before we perform such analyses. It is important to note that data validation is necessary, but not sufficient, for a correct, robust and high-quality analysis.
13.2 Where does data validation fit in data analysis?
If we are going to validate, or check, our data for our data analysis, at what stage in our analysis should we do this? At a minimum, this should be done after data sourcing or extraction, but before data is used in any analysis. In the context of a project where data splitting is needed (e.g., predictive questions using supervised machine learning), this should be done before the data is split.
If there are larger, more severe consequences of the data analysis being incorrect (e.g., autonomous driving), and the data undergoes file input/output as it is passed through a series of scripts, it may be advisable for data validation/checking to be done each time the data is read. This can be made more efficient by modularizing the data validation/checking into functions. Modularizing is likely worth doing regardless of the application of the data analysis, because it also allows the validation code to be tested to ensure that it is correct and that invalid data is handled as intended (more on this in the testing chapter later in this book).
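As a minimal sketch of what modularizing a check into a function might look like, the example below wraps a single range check in a reusable function. The function name check_column_range, its parameters, and the toy data are hypothetical and chosen only for illustration; they are not part of any package.

```python
import pandas as pd


def check_column_range(df: pd.DataFrame, column: str, lower: float, upper: float) -> pd.DataFrame:
    """Hypothetical helper: raise ValueError if any non-missing value in
    `column` falls outside [lower, upper]; otherwise return `df` unchanged."""
    values = df[column].dropna()
    out_of_range = values[(values < lower) | (values > upper)]
    if not out_of_range.empty:
        raise ValueError(
            f"{len(out_of_range)} value(s) in '{column}' outside "
            f"[{lower}, {upper}]: {out_of_range.tolist()}"
        )
    return df


# The same function can be called every time the data is read from disk.
measurements = pd.DataFrame({"mean_radius": [6.0, 31.2, 22.8]})
checked = check_column_range(measurements, "mean_radius", 5, 45)
```

Because the check lives in a function, it can be called from several scripts and covered by unit tests that pass in deliberately invalid data.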
One note of caution for where to perform data validation checks in data analysis where data splitting is needed (e.g., splitting data into a training and test set for answering predictive questions) is that you want to be sure that the data validation checks do not cause any data leakage between the split data sets. For example, when checking for anomalous correlations between the target/response variable and features/explanatory variables, when attempting to answer a predictive question, it would be important to not use the entire data set. This is because using the entire dataset for such checks could inadvertently reveal patterns, distributions, or relationships from the test set – which may impact the analyst’s decisions/choices when performing feature and model selection. Given that, data validation checks like this should initially only be done on the training set. It may make sense to apply this data validation check also to the test set, but only after finalizing the feature and model selection.
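The ordering described above can be sketched as follows: split first, then run the correlation check on the training rows only. The toy data, the column names feature and target, and the 0.9 threshold are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are placeholders.
data = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "target": [0, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

# Split first ...
train = data.sample(frac=0.8, random_state=123)
test = data.drop(index=train.index)

# ... then check for an anomalous feature-target correlation
# using the training rows only, so the test set is never inspected.
train_corr = train["feature"].corr(train["target"])
if abs(train_corr) > 0.9:
    raise ValueError("Anomalous feature-target correlation in the training set.")
```

The test set is left untouched until feature and model selection are final, at which point the same check could be repeated on it.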
13.3 Data validation checks
What kind of data validation, or checks, should be done to ensure the data is of high quality? This does somewhat depend on the type of data being used (e.g., tabular, images, language). Here we will list validations, or checks, that should be done on tabular data. If the reader is interested in validations, or checks, that should be done for more complex data types (e.g., images, language) we refer them to the deepchecks checks gallery for data integrity.
13.3.1 Data validation checklist
Checklist references
- Chorev et al (2022). Deepchecks: A Library for Testing and Validating Machine Learning Models and Data. Journal of Machine Learning Research 23 1-6
- Microsoft Industry Solutions Engineering Team (2024). Engineering Fundamentals Playbook: Testing Data Science and MLOps Code Chapter
- Breck et al (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. Proceedings of IEEE Big Data 1123-1132
- Hynes et al (2017). The data linter: Lightweight, automated sanity checking for ml data sets. In NIPS MLSys Workshop 1(2017) 5
13.4 Introduction to Python’s Pandera
Python’s Pandera is a package designed to make data validation/checking of dataframes and other dataframe-like objects easy, readable and robust. Key features of Pandera that we will discuss include:
- The ability to define a data schema and use it to validate dataframes and other dataframe-like objects
- Check the types and properties of columns
- Perform statistical validation of data
- Execute all validation in a lazy manner, so that all validation rules are executed before an error is raised
- Handle invalid data in a number of ways, including throwing errors, writing data validation logs, and dropping observations that are invalid
13.4.1 Validating data with Pandera
In the simplest use case, Pandera can be used to validate data by first defining an instance of the DataFrameSchema class. This object specifies the properties we expect (and thus would like to check) for our dataframe index and columns. After the DataFrameSchema instance has been created and defined, the DataFrameSchema.validate method can be applied to a pandas.DataFrame instance to validate, or check, all of the properties that we specified in the DataFrameSchema instance.
13.4.2 Dataframe Schema
When we create an instance of the DataFrameSchema class, we can specify the properties we expect (and thus would like to check) for our dataframe index and columns.
Creating pa.DataFrameSchema and setting required columns
To create an instance of the DataFrameSchema class we first import the Pandera package using the alias pa, and then call pa.DataFrameSchema. Here we demonstrate creating an instance of the DataFrameSchema class for the first two columns of the Wisconsin Breast Cancer data set from the UCI Machine Learning Repository (Dua and Graff 2017).
By default all columns listed are required to be in the dataframe for it to pass validation. If we wanted to make a column optional, we would set required=False in the column constructor.
Specifying column types
We can specify the type we expect each column to be, by writing the type as the first argument to pa.Column. Possible values include:
- a string alias, as long as it is recognized by pandas.
- a python type: int, float, bool, str
- a numpy data type
- a pandas extension type: it can be an instance (e.g., pd.CategoricalDtype(["a", "b"])) or a class (e.g., pandas.CategoricalDtype) if it can be initialized with default values.
- a pandera DataType: it can also be an instance or a class.
See the Pandera Data Type Validation docs for details beyond what we present here.
If we continue our example from above, we can specify that we expect the class column to be a string and the mean_radius column to be a float as shown below:
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str),
        "mean_radius": pa.Column(float)
    }
)
Missingness/null values
By default Column objects assume there should be no null/missing values. If you want to allow missing values, you need to set nullable=True in the column constructor. We demonstrate that below for the mean_radius column of our working example. Note that we do not set this to be true for our class column as we likely do not want to be working with observations where the target/response variable is missing.
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str),
        "mean_radius": pa.Column(float, nullable=True)
    }
)
If you wanted to allow up to a certain percentage of the values in a particular column to be missing, you could do this by writing a lambda function in a call to pa.Check in the column constructor. We show an example of that below where we allow up to 5% of the mean_radius column values to be missing.
This is putting the cart a bit before the horse, as we have not yet introduced pa.Check. We will do that in the next section, so please feel free to skip this and come back to this example after you have read that.
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str),
        "mean_radius": pa.Column(float,
                                 pa.Check(lambda s: s.isna().mean() <= 0.05,
                                          element_wise=False,
                                          error="Too many null values in 'mean_radius' column."),
                                 nullable=True)
    }
)
Above we have created our custom check on-the-fly using a lambda function. We could do this here because the check was fairly simple. If we needed a custom check that was more complex (e.g., one that needs to generate data as part of the check) then it would be better to register our custom check. For situations like this, we direct the reader to the Pandera Extension docs.
Checking values in columns
Pandera has a class, pa.Check, that is useful for checking values within columns. For any type of data, there is usually some reasonable range of values that we would expect. These usually come from domain knowledge about the data. For example, a column named age in a data set recording the age in years of adult human patients should probably be an integer and have a range of values between 18 and 122 (the age of the oldest person whose age has ever been independently verified). To specify a check for a range like this, we can use the pa.Check.between method. We demonstrate how to do this below with our working example to check that the mean_radius values are between 5 and 45, inclusive.
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    }
)
In our working example, we might also want to check that the class column only contains the strings we think are acceptable for our category label, which would be "Benign" and "Malignant". We can do this using the pa.Check.isin method, which we demonstrate below:
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"])),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    }
)
There are many more built-in pa.Check methods. A list can be found in the Pandera Check API docs: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html#pandera.api.checks.Check
If there is a check you wish to do that is not part of the Pandera Check API, you have two options:
- Use a lambda function with boolean logic inside of pa.Check (good for simple checks, similar to the percentage of missingness in the section above), or
- Register your own custom check (see how in the Pandera Extension docs)
Duplicates
Pandera does not yet have a dedicated method to check for duplicate rows in a dataframe; however, you can apply pa.Check to the entire dataframe using a lambda function with boolean logic. Thus, we can easily use the pandas duplicated method inside a lambda function to check for duplicate rows. We show an example of that below:
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"])),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found.")
    ]
)
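To see what the lambda in this check evaluates, it may help to look at the pandas duplicated method on its own. This is a small sketch with hypothetical toy data: duplicated flags every row that repeats an earlier row, and any collapses those flags into a single True/False.

```python
import pandas as pd

toy = pd.DataFrame({
    "class": ["Benign", "Benign", "Malignant"],
    "mean_radius": [6.0, 6.0, 22.8],
})

# One flag per row: the second row repeats the first, so it is flagged.
duplicate_flags = toy.duplicated()  # False, True, False

# A single answer for the whole dataframe.
has_duplicates = bool(duplicate_flags.any())
```

The schema-level check passes only when has_duplicates would be False.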
Empty observations
Similar to duplicates, there is no dedicated Pandera function for this. So again we can use pa.Check applied to the entire dataframe using a lambda function with boolean logic.
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"])),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found."),
        pa.Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found.")
    ]
)
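The expression inside the empty-rows check can likewise be tried directly in pandas. In this sketch (hypothetical toy data), isna().all(axis=1) is True only for rows in which every column is missing, so a row with just one missing value is not treated as empty.

```python
import numpy as np
import pandas as pd

toy_missing = pd.DataFrame({
    "class": ["Benign", None, "Malignant"],
    "mean_radius": [6.0, np.nan, np.nan],
})

# True only where every column in the row is missing.
empty_row_flags = toy_missing.isna().all(axis=1)  # False, True, False
has_empty_rows = bool(empty_row_flags.any())
```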
13.4.3 Data validation
Once we have specified the properties we expect (and thus would like to check) for our dataframe index and columns by creating an instance of pa.DataFrameSchema, we can use the pa.DataFrameSchema.validate method on a dataframe to check whether the dataframe is valid according to the schema we specified.
To demonstrate this, below we create two very simple versions of the Wisconsin Breast Cancer data set. One which we expect to pass our validation checks, and one where we introduce three data anomalies that should cause some checks to fail.
First we create two data frames:
import numpy as np

valid_data = pd.DataFrame({
    "class": ["Benign", "Benign", "Malignant"],
    "mean_radius": [6.0, 31.2, 22.8]
})

invalid_data = pd.DataFrame({
    "class": ["Benign", "Benign", "benign", "Malignant"],
    "mean_radius": [6.0, 6.0, 31.2, -9999]
})
Let’s see what happens when we apply pa.DataFrameSchema.validate to our valid data:
schema.validate(valid_data)
|   | class | mean_radius |
|---|---|---|
| 0 | Benign | 6.0 |
| 1 | Benign | 31.2 |
| 2 | Malignant | 22.8 |
It returns a dataframe and does not throw an error. Excellent! What happens when we pass clearly invalid data?
schema.validate(invalid_data)
---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
Cell In[11], line 1
----> 1 schema.validate(invalid_data)

(...) [intermediate frames inside the pandera package omitted]

SchemaError: Column 'class' failed element-wise validator number 0: isin(['Benign', 'Malignant']) failure cases: benign
The full traceback is long, but what is clear is that an error was thrown. If we read through the error message to the end we see the important and useful piece of it:
pandera.errors.SchemaError: Column 'class' failed element-wise validator number 0: isin(['Benign', 'Malignant']) failure cases: benign
The error arose because in our invalid_data, the column class contained the string "benign", and we specified in our pa.DataFrameSchema instance that we only accept two string values in the class column, "Benign" and "Malignant".
What about the other errors we expect from our invalid data? For example, there’s a value of -9999 in the mean_radius column that is well outside the range we said was valid in the schema (5 to 45), and we have a duplicate row as well. Why are these validation errors not reported? Pandera’s default is to throw an error after the first instance of invalid data. To change this behaviour, we can set lazy=True. When we do this we see that all errors get reported.
schema.validate(invalid_data, lazy=True)
---------------------------------------------------------------------------
SchemaErrors                              Traceback (most recent call last)
Cell In[12], line 1
----> 1 schema.validate(invalid_data, lazy=True)

(...) [intermediate frames inside the pandera package omitted]

SchemaErrors: {
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": null,
                "column": "class",
                "check": "isin(['Benign', 'Malignant'])",
                "error": "Column 'class' failed element-wise validator number 0: isin(['Benign', 'Malignant']) failure cases: benign"
            },
            {
                "schema": null,
                "column": "mean_radius",
                "check": "in_range(5, 45)",
                "error": "Column 'mean_radius' failed element-wise validator number 0: in_range(5, 45) failure cases: -9999.0"
            },
            {
                "schema": null,
                "column": null,
                "check": "Duplicate rows found.",
                "error": "DataFrameSchema 'None' failed series or dataframe validator 0: <Check <lambda>: Duplicate rows found.>"
            }
        ]
    }
}
13.4.4 Handling invalid data
By default Pandera will throw an error when a check is not passed. Depending on your situation, this can be the desired behaviour (e.g., a static data analysis published in a report) or a very undesirable behaviour that could potentially be dangerous (e.g., an autonomous driving application). In the latter case, we would want to do something other than throw an error. Possibilities we will cover here include dropping invalid observations and writing log files that report the errors.
13.4.5 Dropping invalid observations
In an in-production system, dropping non-valid data could be a reasonable path forward instead of throwing an error. Another situation where this might be a reasonable thing to do is when training a machine learning model with a million observations. You don’t want to throw an error in the middle of training if only one observation is invalid!
To change the behaviour of pa.DataFrameSchema.validate to instead return a dataframe with the invalid rows dropped we need to do two things:
- add drop_invalid_rows=True to our pa.DataFrameSchema instance
- add lazy=True to our call to the pa.DataFrameSchema.validate method
Below we demonstrate this with our working example.
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"]), nullable=True),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True)
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found."),
        pa.Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found.")
    ],
    drop_invalid_rows=True
)
schema.validate(invalid_data, lazy=True)
|   | class | mean_radius |
|---|---|---|
| 0 | Benign | 6.0 |
| 1 | Benign | 6.0 |
Hmmm… Why did the duplicate row sneak through? This is because Pandera’s row dropping only works for data, or column, checks, not for DataFrame-wide checks like our checks for duplicates or empty rows. Thus, to make sure we drop these as well, we need to rely on pandas. We demonstrate how we can do this below:
schema.validate(invalid_data, lazy=True).drop_duplicates().dropna(how="all")
|   | class | mean_radius |
|---|---|---|
| 0 | Benign | 6.0 |
13.4.6 Writing data validation logs
Is removing the rows sufficient? Not at all! A human should be told that there was invalid data so that upstream data collection, cleaning and transformation processes can be reviewed to minimize the chances of future invalid data. One way to do this is to again specify lazy=True so that all errors can be observed and reported. Then we can get the SchemaErrors and write them to a log file. We show below how to do this for our working example so that the valid rows are returned as a dataframe named validated_data and the errors are logged as a file called validation_errors.log:
import json
import logging
import pandas as pd
import pandera as pa
from pandera import Check

# Configure logging
logging.basicConfig(
    filename="validation_errors.log",
    filemode="w",
    format="%(asctime)s - %(message)s",
    level=logging.INFO,
)

# Define the schema
schema = pa.DataFrameSchema(
    {
        "class": pa.Column(str, pa.Check.isin(["Benign", "Malignant"]), nullable=True),
        "mean_radius": pa.Column(float, pa.Check.between(5, 45), nullable=True),
    },
    checks=[
        pa.Check(lambda df: ~df.duplicated().any(), error="Duplicate rows found."),
        pa.Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found."),
    ],
    drop_invalid_rows=False,
)

# Initialize error cases DataFrame
error_cases = pd.DataFrame()
data = invalid_data.copy()

# Validate data and handle errors
try:
    validated_data = schema.validate(data, lazy=True)
except pa.errors.SchemaErrors as e:
    error_cases = e.failure_cases

    # Convert the error message to a JSON string
    error_message = json.dumps(e.message, indent=2)
    logging.error("\n" + error_message)

# Filter out invalid rows based on the error cases
if not error_cases.empty:
    invalid_indices = error_cases["index"].dropna().unique()
    validated_data = (
        data.drop(index=invalid_indices)
        .reset_index(drop=True)
        .drop_duplicates()
        .dropna(how="all")
    )
else:
    validated_data = data
13.5 The data validation ecosystem
We have given a tour of one of the packages in the data validation ecosystem; however, there are a few others that are good to know about. We list them below:
Python:
- Deep Checks: https://docs.deepchecks.com
- Great Expectations: https://docs.greatexpectations.io
- Pandera: https://pandera.readthedocs.io
- Pydantic: https://docs.pydantic.dev/latest/
R:
- pointblank: https://rstudio.github.io/pointblank
13.5.1 Deep Checks
In particular, the Deep Checks package is quite useful due to its high-level abstraction of several machine learning data validation checks that you would have to code manually if you chose to use something like Pandera for these. Examples from the checklist above include:
To use this, we first have to create a Deep Checks Dataset object (specifying the data set, the target/response variable, and any categorical features):
from deepchecks.tabular import Dataset

cancer_train_ds = Dataset(cancer_train, label="class", cat_features=[])
Once we have that, we can use the FeatureLabelCorrelation() check, set the maximum threshold we’ll allow (here 0.9), and run the check:
from deepchecks.tabular.checks import FeatureLabelCorrelation

check_feat_lab_corr = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.9)
check_feat_lab_corr_result = check_feat_lab_corr.run(dataset=cancer_train_ds)
Finally, we can check if the result of the FeatureLabelCorrelation() validation has failed. If it has (i.e., correlation is above the acceptable threshold), we can do something, like raise a ValueError with an appropriate error message:
if not check_feat_lab_corr_result.passed_conditions():
    raise ValueError("Feature-Label correlation exceeds the maximum acceptable threshold.")
Notice the name of the data frame and the Deep Checks data set above? Both have the word “train” in them. This is important! Some data validation checks can cause data leakage if we perform them on the entire data set before finalizing feature and model selection. Be conscientious about your data validation checks to ensure they do not introduce data leakage.
Deep Checks has a nice gallery of different data validation checks for which it has high-level functions: https://docs.deepchecks.com/stable/tabular/auto_checks/data_integrity/index.html