19 Python Data Validation: deepchecks
19.0.1 Deep Checks
In particular, the Deep Checks package is quite useful due to its high-level abstraction of several machine learning data validation checks that you would otherwise have to code manually if you chose to use something like Pandera. Examples from the checklist above include checking for anomalous correlations between the target/response variable and the features, which we demonstrate below.
To use this, we first have to create a Deep Checks Dataset object (specifying the data set, the target/response variable, and any categorical features):
from deepchecks.tabular import Dataset
cancer_train_ds = Dataset(cancer_train, label="class", cat_features=[])
Once we have that, we can use the FeatureLabelCorrelation() check, set the maximum threshold we’ll allow (here 0.9), and run the check:
from deepchecks.tabular.checks import FeatureLabelCorrelation
check_feat_lab_corr = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.9)
check_feat_lab_corr_result = check_feat_lab_corr.run(dataset=cancer_train_ds)
Finally, we can check whether the FeatureLabelCorrelation() validation has failed. If it has (i.e., the correlation is above the acceptable threshold), we can respond appropriately, for example by raising a ValueError with an informative error message:
if not check_feat_lab_corr_result.passed_conditions():
    raise ValueError("Feature-Label correlation exceeds the maximum acceptable threshold.")
Notice that the names of the data frame and the Deep Checks Dataset above both contain the word “train”? This is important! Some data validation checks can cause data leakage if we perform them on the entire data set before finalizing feature and model selection. Be conscientious about your data validation checks to ensure they do not introduce data leakage.
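To make this concrete, a common pattern is to split the data first and run validation checks only on the training portion. A minimal sketch, assuming a pandas data frame with a “class” column (the data and names here are illustrative, not the chapter’s actual cancer data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data frame standing in for the full cancer data set
cancer = pd.DataFrame({
    "mean_radius": [14.1, 20.6, 12.4, 18.2, 13.0, 15.7, 11.9, 19.8],
    "mean_texture": [19.3, 25.1, 17.0, 22.4, 18.6, 21.0, 16.2, 24.3],
    "class": ["B", "M", "B", "M", "B", "M", "B", "M"],
})

# Split BEFORE validating: checks like feature-label correlation should
# only ever see the training portion, so that nothing learned from the
# test set can influence feature or model selection.
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["class"], random_state=123
)
```

Only `cancer_train` would then be wrapped in a Deep Checks Dataset and validated; `cancer_test` stays untouched until final evaluation.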
Deep Checks has a nice gallery of different data validation checks for which it has high-level functions: https://docs.deepchecks.com/stable/tabular/auto_checks/data_integrity/index.html