How OpenSAFELY makes reproducible research easier

“Reproducibility” is the term researchers use for independent investigators obtaining the same findings when they apply the same methods to the same data source. If team B can reproduce the findings of team A in this way, it builds trust in the evidence they’ve produced. It’s not just one team’s conclusion; it’s more than one. Many minds make right work.

Reproducibility is important for all research, but it’s especially challenging when working with pseudonymised primary care electronic health records (EHRs) - the sort of data that OpenSAFELY was built for. The data are confidential, so they can’t be shared openly, and that’s a barrier to openness and reproducibility.

That’s something we’ve kept in mind while building and iterating OpenSAFELY, which has a number of features designed to support and improve reproducibility.

We’ve just published a paper on this - but here’s the short version.

Five steps towards more reproducible research

1. Standardised workflows for preparing data

We developed a purpose-built query language, ehrQL, with a focus on readability and reusability. That means researchers without deep technical knowledge or extensive training can use and understand the code. It also means OpenSAFELY users can run the same queries on different health record databases, despite differences in their underlying schemas.
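To make the idea of "same query, different schemas" concrete, here is a minimal sketch in plain Python. It is not ehrQL and does not show how OpenSAFELY is implemented: the backend names, table names and column names are all hypothetical, and sqlite3 stands in for the real databases. The point is only that a query written once against logical names can be translated for each backend's own schema.

```python
import sqlite3

# Hypothetical mappings from logical names to each backend's real table and column names.
BACKENDS = {
    "backend_a": {"table": "patient", "patient_id": "pid", "date_of_birth": "dob"},
    "backend_b": {"table": "person_record", "patient_id": "person_id", "date_of_birth": "birth_date"},
}


def born_on_or_before(conn, backend, cutoff):
    """Return sorted patient ids born on or before `cutoff` (an ISO date string)."""
    m = BACKENDS[backend]
    # Identifiers come from the trusted mapping above; the cutoff is bound as a parameter.
    sql = (
        f"SELECT {m['patient_id']} FROM {m['table']} "
        f"WHERE {m['date_of_birth']} <= ? ORDER BY {m['patient_id']}"
    )
    return [row[0] for row in conn.execute(sql, (cutoff,))]


def make_demo_db(backend, rows):
    """Build an in-memory database laid out with the given backend's schema."""
    m = BACKENDS[backend]
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {m['table']} ({m['patient_id']} INTEGER, {m['date_of_birth']} TEXT)"
    )
    conn.executemany(f"INSERT INTO {m['table']} VALUES (?, ?)", rows)
    return conn


# The same made-up rows, stored under two different schemas.
ROWS = [(1, "1980-05-01"), (2, "2005-11-23"), (3, "1999-12-31")]
db_a = make_demo_db("backend_a", ROWS)
db_b = make_demo_db("backend_b", ROWS)

# One logical query, run unchanged against both backends.
result_a = born_on_or_before(db_a, "backend_a", "1999-12-31")
result_b = born_on_or_before(db_b, "backend_b", "1999-12-31")
```

The researcher-facing function never mentions a real column name, so the analysis code stays the same when the underlying database changes.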

2. Code is developed on dummy data

All code used for research in OpenSAFELY is developed on dummy data outside of the secure data environment. That makes the code easier to share, as it doesn’t have to be checked for confidential information.
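As a rough illustration of developing against dummy data, the sketch below generates made-up patient rows from a seeded random number generator and runs a small analysis function on them. This is an assumption-laden example rather than OpenSAFELY's own dummy data tooling: the field names, value ranges and helper functions are hypothetical.

```python
import random
from datetime import date, timedelta


def make_dummy_patients(n, seed=0):
    """Generate n made-up patient rows; the same seed always gives the same rows."""
    rng = random.Random(seed)
    start = date(1930, 1, 1)
    span_days = (date(2020, 12, 31) - start).days
    rows = []
    for patient_id in range(1, n + 1):
        dob = start + timedelta(days=rng.randint(0, span_days))
        rows.append(
            {
                "patient_id": patient_id,
                "date_of_birth": dob.isoformat(),
                "sex": rng.choice(["female", "male"]),
            }
        )
    return rows


def count_by_sex(rows):
    """A toy analysis step that can be written and tested without any real records."""
    counts = {}
    for row in rows:
        counts[row["sex"]] = counts.get(row["sex"], 0) + 1
    return counts


dummy = make_dummy_patients(100, seed=42)
counts = count_by_sex(dummy)
```

Because the rows contain nothing real, the analysis code and its test inputs can be shared and reviewed freely before the same code is run on patient data.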

3. All the code is public

All code run against patient data must be published on GitHub for sharing, review, and version control. All codelists used in OpenSAFELY must be shared on our partner site opencodelists.org for review and reuse.

4. Everyone uses the same computing environment

OpenSAFELY uses Docker, a system designed to improve software reproducibility. This gives every analysis a consistent computational environment with the same library versions, so users can re-run the code on their own computers.

5. There’s a public audit trail

OpenSAFELY Jobs is a live dashboard, showing the current status for every project on the platform. It logs every line of code run on real data, in public, which promotes clear, hypothesis-driven research and discourages data dredging.

A step in the right direction

These five steps are helpful, but they’re just a start. They are technical facilitators that help us to enable, encourage and occasionally enforce open working practices while maintaining patient privacy.

Beyond the technical, there’s cultural change, and that’s where there’s more work to be done.