16  Data validation

Learning Objectives

  1. Explain why it is important to validate data used in a data analysis project, and give examples of consequences that might occur with invalid data.
  2. Discuss where data validation should happen in a data analysis project.
  3. List the major checks that should be performed when validating data for data analysis, and justiy why they should be used.
  4. Use the Python Pandera package to create data schema for checking and to validate data.
  5. Use the Python Pandera package to drop invalid rows.
  6. List other commonly used data validation packages for Python and R.

16.1 The role of data validation in data analysis

Regardless of the statistical question you are asking in your data analysis project, you will be reading in data to Python or R to visualize and/or model. If there are data quality issues, these issues will be propagated and will become data visualization and/or data model issues. This may remind you of an old saying from the mid 20th century:

“Garbage in, garbage out.”

Thus, to ensure that our data visualization and/or modeling results are correct, robust and of high quality, it is important that we validate, or check, the quality of the data before we perform such analyses. It is important to note that data validation is not sufficient for a correct, robust and of high quality analysis, but it is necessary.

16.2 Where does data validation fit in data analysis?

If we are going to validate, or check, our data for our data analysis, at what stage in our analysis should we do this? At a minimum, this should be done after data sourcing or extraction, but before data is used in any analysis. In the context of a project where data splitting is needed (e.g., predictive questions using supervised machine learning), this should be done before the data is split.

If there are larger, more severe consequences of the data analysis being incorrect (e.g., autonomous driving), and the data undergoes file input/output as it is passed through a series of scripts, it may be advisable for data validation, checking to be done each time the data is read. This can be made more efficient by modularizing the data validation/checking into functions. This likely should be done however, regardless of the application of the data analysis, as modularizing the data validation/checking into functions also allows this code to be tested to ensure it is correct, and that invalid data is handled as intended (more on this in the testing chapter later in this book).

One note of caution for where to perform data validation checks in data analysis where data splitting is needed (e.g., splitting data into a training and test set for answering predictive questions) is that you want to be sure that the data validation checks do not cause any data leakage between the split data sets. For example, when checking for anomalous correlations between the target/response variable and features/explanatory variables, when attempting to answer a predictive question, it would be important to not use the entire data set. This is because using the entire dataset for such checks could inadvertently reveal patterns, distributions, or relationships from the test set – which may impact the analyst’s decisions/choices when performing feature and model selection. Given that, data validation checks like this should initially only be done on the training set. It may make sense to apply this data validation check also to the test set, but only after finalizing the feature and model selection.

16.3 Data validation checks

What kind of data validation, or checks, should be done to ensure the data is of high quality? This does somewhat depend on the type of data being used (e.g., tabular, images, language). Here we will list validations, or checks, that should be done on tabular data. If the reader is interested in validations, or checks, that should be done for more complex data types (e.g., images, language) we refer them to the deepchecks checks gallery for data integrity:

16.3.1 Data validation checklist

Checklist references

  1. Chorev et al (2022). Deepchecks: A Library for Testing and Validating Machine Learning Models and Data. Journal of Machine Learning Research 23 1-6
  2. Microsoft Industry Solutions Engineering Team (2024). Engineering Fundamentals Playbook: Testing Data Science and MLOps Code Chapter
  3. Breck et al (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. Proceedings of IEEE Big Data 1123-1132
  4. Hynes et al (2017). The data linter: Lightweight, automated sanity checking for ml data sets. In NIPS MLSys Workshop 1(2017) 5

16.4 The data validation ecosystem

There are a few data validation packages. We list the others below:

Python:

R