Messy Data to Meaningful Insights: The Importance of Data Checks


General practice (GP) electronic health records (EHR) can generate valuable insights about population health and how healthcare is delivered. But these data exist to support direct patient care rather than to answer research questions, which makes them tricky to work with.

EHR data are sometimes described as “messy” by analysts. Information that is important for research may be recorded irregularly, or not at all. There may be duplicate records, formats or units may vary, and the coding practices and systems used to record diagnoses and treatments may change over time. Turning raw EHR data into an analysis-ready dataset involves several steps, including data exploration and data cleaning. Below we explain how we make sense of GP data in OpenSAFELY.

In OpenSAFELY, what processing do we do to raw data?

At OpenSAFELY, we do minimal cleaning of the raw EHR data before it is made available for analysis. This gives researchers the freedom to make their own decisions about how to handle the data for their particular study.

But we do take care of a few basic housekeeping tasks: this includes replacing nonsensical missing placeholder values (like 9999-12-31 for date values, or -1 for Index of Multiple Deprivation) with proper NULL entries; removing a small number of duplicate mortality records from the Office for National Statistics (ONS) data; and mapping numeric codes to more meaningful text labels, like with status in the appointments table. To learn about exactly what processing we do to the raw data, all code is publicly available for inspection on GitHub.
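To illustrate what this kind of housekeeping looks like in general, here is a minimal pandas sketch of replacing placeholder values with nulls and mapping numeric codes to labels. The column names, placeholder values, and the status code mapping are invented for the example — they are not OpenSAFELY’s actual schema or processing code, which you can inspect on GitHub.

```python
import pandas as pd

# Toy data with placeholder values (columns and codes invented for illustration)
records = pd.DataFrame({
    "date_of_death": ["2021-03-01", "9999-12-31", "2020-11-15"],
    "imd": [12500, -1, 30200],
    "status": [1, 2, 1],
})

# Replace nonsensical placeholder values with proper nulls
records["date_of_death"] = records["date_of_death"].replace("9999-12-31", pd.NA)
records["imd"] = records["imd"].replace(-1, pd.NA)

# Map numeric codes to meaningful text labels (mapping is hypothetical)
records["status"] = records["status"].map({1: "Booked", 2: "Cancelled"})

print(records)
```

The point is that these steps are mechanical and documented in code, so anyone can see exactly which values were treated as missing and how codes were relabelled.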

Beyond that, the responsibility for checking and cleaning the data lies with the analyst. Systematically checking the data for completeness, accuracy and consistency is an important part of an OpenSAFELY analytic pipeline. This process is sometimes called initial data analysis or exploratory data analysis.

What is initial data analysis?

Initial data analysis involves exploring the quality of the data before undertaking any formal analyses, with the goal of detecting problems which might impact your research and understanding patterns in the data. To start with, analysts should have a good understanding of the tables and fields they are using, including the range of possible values and data formats. Much of this information can be found in OpenSAFELY’s table schemas.

Initial or exploratory data analysis is usually followed by data cleaning, which aims to address any problems in the data to make it ready for analysis. As much as possible, potential data cleaning decisions should be planned and discussed before seeing the real data - for instance, deciding on an upper age cutoff - although this isn’t always possible or practical.

In OpenSAFELY, to maintain the highest level of data privacy, analysts never set eyes on the raw, patient-level data. But this doesn’t hinder an analyst’s ability to explore the data and identify problems. You can also learn from other OpenSAFELY analysts’ experience. Protocols, outputs, and code may contain helpful guidance on how to approach the use of certain fields - these can be found in the project-specific GitHub repositories or on our website.

Common issues with GP EHR data

Completeness varies widely across GP data. While age and sex are rarely missing, 1 in 5 people will not have an ethnicity recorded in their GP record. In extreme cases, the degree of missingness means that a data field, or variable, is not useful for research. In OpenSAFELY, for example, the outpatient appointments data includes diagnosis fields, but these fields are rarely complete due to how outpatient care is paid for in the NHS.

Data can be missing for many different reasons. It is often not random, meaning there are biases in who does or doesn’t have something recorded. This is important to think about when deciding how to deal with missing values.

Infrequently, EHR data can also contain apparent errors, such as events dated in the future, or impossible values like a body mass index (BMI) of 200 kg/m2. Some potential errors can only be spotted when combined with other information. For instance, events dated before a patient’s date of birth or after their date of death, or a pregnant person over 80 years of age.

While impossible values should be discarded, data will also often have outliers - that is, values that stand out, often because they are very small or very large, like a patient who is 115 years old. Clinical input can help decide how likely extreme values are to be real and what to do with them.
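As a hedged sketch of the kinds of rule-based checks described above, the example below flags events dated before birth, events in the future, and implausible BMI values. The column names, the BMI plausibility range, and the reference date are all assumptions made for illustration; in a real study these rules would be agreed with clinical colleagues.

```python
import pandas as pd

# Toy patient-level data (column names invented for illustration)
patients = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1950-04-02", "2001-09-18", "1980-06-01"]),
    "event_date": pd.to_datetime(["1949-12-25", "2030-01-01", "2015-03-03"]),
    "bmi": [27.4, 200.0, 22.1],
})

# Fixed reference date so the example is reproducible
today = pd.Timestamp("2024-01-01")

# One boolean column per rule; the plausibility range 10-100 is an assumption
flags = pd.DataFrame({
    "event_before_birth": patients["event_date"] < patients["date_of_birth"],
    "event_in_future": patients["event_date"] > today,
    "implausible_bmi": ~patients["bmi"].between(10, 100),
})

# Rows with at least one flag warrant review: truly impossible values can be
# discarded, while borderline outliers deserve clinical input
print(patients[flags.any(axis=1)])
```

Keeping each rule as a separate boolean column makes it easy to count how often each problem occurs before deciding what to do about it.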

We suggest the following data checks:

  • Completeness: What proportion of values are missing? Does missingness vary by patient characteristics or time period?
  • Distribution of categorical variables: Does the relative frequency of each variable look as expected? Are there any unexpected values?
  • Distribution of continuous variables: Do summary statistics (median, interquartile range, range) look plausible? Visualise with a histogram to spot skew, outliers, or impossible values.
  • Relationship between variables: Do variables relate to each other as expected? Crosstabulations (for categorical variables) or scatterplots (for continuous variables) can help visualise associations.
  • Trends over time: Has recording frequency changed over time? Are there sudden shifts that might reflect changes in coding practice?
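The checks above can be sketched in a few lines of pandas. This example runs on a made-up dataset; the variable names (age, ethnicity, year) and the 20% missingness rate are assumptions chosen to mirror the examples in this post, not real figures from OpenSAFELY data.

```python
import numpy as np
import pandas as pd

# Simulated dataset (values and missingness rate invented for illustration)
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(0, 100, n),
    "ethnicity": rng.choice(
        ["White", "Asian", "Black", None], n, p=[0.6, 0.1, 0.1, 0.2]
    ),
    "year": rng.choice([2019, 2020, 2021], n),
})

# Completeness: proportion missing overall, and by time period
print(df["ethnicity"].isna().mean())
print(df.groupby("year")["ethnicity"].apply(lambda s: s.isna().mean()))

# Distribution of a categorical variable
print(df["ethnicity"].value_counts(dropna=False, normalize=True))

# Distribution of a continuous variable: summary statistics
print(df["age"].describe())

# Relationship between variables: a simple crosstabulation
print(pd.crosstab(df["year"], df["ethnicity"]))

# Trends over time: recording counts per year
print(df.groupby("year").size())
```

For visual checks (histograms to spot skew or outliers), the same DataFrame can be passed to a plotting library; in OpenSAFELY, remember that only appropriately aggregated outputs can be released.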

Remember: there are things that data checks alone can’t tell you! Knowing how and why things are - and aren’t - recorded is key to fully understanding what you are seeing in the data. This could be things like financial incentives that increase coding of certain conditions, or which patient groups are underrepresented in data collections. It is always helpful to involve people on your team who have prior experience with the data or clinical area!

What can I do if things aren’t as expected?

Deciding how to deal with unusual values in the data is an important, context-specific part of any study, and beyond the scope of this blog. But while a small number of odd values isn’t uncommon, a large number of incorrect or extreme values, or unusual patterns in the data, should prompt further investigation. If you encounter this situation, the following checks can be helpful:

  • Is your code correct? Have a colleague do a code review to check for potential errors.
  • Is your codelist correct? Make sure your codelists are up-to-date and have been reviewed by a clinical expert.
  • How do your data compare against other data sources? Externally validate your results against other publicly available data.

You can also ask for advice on our #opensafely-users Slack channel.

Don’t forget to pay it forward

The data quality checks and data cleaning that you do will often be useful for other OpenSAFELY users, so remember to document every decision that you make! While your OpenSAFELY code will have a record of every analysis that you run on the real data, also having a written record of every data cleaning decision and the reason behind it can help future research.

Consider describing it in a blog, a report, a post on Slack, or even a peer-reviewed paper, accompanied by reusable code for others to use and adapt. If you’d like to work with us to make your data cleaning efforts usable by the whole OpenSAFELY community, get in touch with your co-pilot or email us!