How OpenSAFELY works

This article is part of a series: The Past, Present and Future of OpenSAFELY

Previous efforts to extract public health data into a central database – and then disseminate it to multiple locations – caused huge public disquiet, and 3 million people chose to opt out of their records being used in research. OpenSAFELY uses technology to mitigate those concerns. None of the raw patient data ever leaves the secure data centres where it already lives. OpenSAFELY provides a secure way for researchers to submit questions, run them against the data, and get back aggregated results about groups of people.

To use OpenSAFELY, researchers first get the raw GP data into a form that can be used in an analysis. They do this by writing code in ehrQL – the Electronic Health Records Query Language (it rhymes with ‘circle’) – which extracts and shapes data from the available data sets.
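The real query language is ehrQL itself (see the OpenSAFELY documentation for its actual API). As a rough illustration of what “extract and shape” means, here is a plain-Python sketch that collapses event-level records into one row per patient – the record shapes and clinical codes are invented for illustration:

```python
# Hypothetical event-level records: (patient_id, date, code).
# Real data would come from the secure data centre; these are made up.
events = [
    ("p1", "2021-01-05", "xyz01"),
    ("p1", "2021-06-10", "xyz02"),
    ("p2", "2021-03-15", "xyz01"),
]

def latest_event_per_patient(events):
    """Shape many events per patient into one row per patient,
    keeping only each patient's most recent event."""
    latest = {}
    for patient_id, date, code in events:
        # ISO dates compare correctly as strings
        if patient_id not in latest or date > latest[patient_id][0]:
            latest[patient_id] = (date, code)
    return latest

dataset = latest_event_per_patient(events)
```

In ehrQL proper, this kind of “one row per patient” reshaping is expressed declaratively rather than with loops, but the idea is the same: turn messy event-level records into a tidy analysis-ready dataset.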

When they’ve prepared their dataset for analysis, they then write analysis code (in standard languages like Python, R or Stata) to produce graphs and tables, or to run statistical tests.
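For instance, a minimal analysis step in Python might just aggregate the prepared dataset into a summary table – the field names below are invented for illustration:

```python
from collections import Counter

# Hypothetical extracted dataset: one row per patient
rows = [
    {"patient_id": "p1", "region": "north", "has_condition": True},
    {"patient_id": "p2", "region": "north", "has_condition": False},
    {"patient_id": "p3", "region": "south", "has_condition": True},
]

# A simple aggregated output: counts of the condition by region.
# Only tables like this - never the patient-level rows - would
# be candidates for release.
counts = Counter((row["region"], row["has_condition"]) for row in rows)
```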

All the code users write is made up of individual units called actions, and those actions are organised into a pipeline. By working in this way, we ensure every user’s code is well organised.
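In practice, the pipeline is declared in a `project.yaml` file in the project’s repository. The sketch below shows the general shape – the action names, commands and output paths are invented, and the exact keys should be checked against the OpenSAFELY documentation:

```yaml
actions:
  generate_dataset:   # first action: extract data with ehrQL
    run: ehrql:v1 generate-dataset analysis/dataset_definition.py --output output/dataset.csv
    outputs:
      highly_sensitive:
        dataset: output/dataset.csv

  run_analysis:       # second action: runs after the first
    run: python:latest analysis/analyse.py
    needs: [generate_dataset]
    outputs:
      moderately_sensitive:
        table: output/table.csv
```

Breaking work into named actions with declared inputs and outputs is what lets OpenSAFELY run, cache and audit each step independently.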

OpenSAFELY generates dummy data, so that researchers can test their assumptions, and make sure their code is likely to work, all on their own computer. This is a critical design feature for privacy: it means that users don’t interact directly with real patient data when writing their code. Users can also import their own dummy data, if they prefer.
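To give a flavour of the idea (this is a toy sketch, not OpenSAFELY’s actual dummy-data generator), dummy data is simply plausible-looking but entirely fake records that exercise the same code paths as the real thing:

```python
import random

def make_dummy_rows(n, seed=42):
    """Generate plausible-looking but entirely fake patient rows,
    so analysis code can be tested without touching real data.
    Fields and value ranges here are illustrative."""
    rng = random.Random(seed)  # fixed seed -> reproducible test data
    rows = []
    for i in range(n):
        rows.append({
            "patient_id": f"dummy-{i}",
            "age": rng.randint(0, 100),
            "sex": rng.choice(["male", "female"]),
        })
    return rows

dummy = make_dummy_rows(5)
```

Because the generator is seeded, the same dummy dataset comes out every time, which makes debugging on a researcher’s own computer predictable.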

Codelists are collections of short codes that match specific clinical terms in the data – they’re a useful tool for designing research projects. We’ve built an online tool called OpenCodelists to help researchers create and share their codelists.
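Conceptually, a codelist is just a set of codes, and “using” one means checking which records match. A minimal sketch (the codes and events below are invented for illustration):

```python
# A codelist: the set of clinical codes that stand for one concept.
# These code values are invented for this example.
example_codelist = {"195967001", "233678006"}

events = [
    {"patient_id": "p1", "code": "195967001"},
    {"patient_id": "p2", "code": "44054006"},
    {"patient_id": "p3", "code": "233678006"},
]

# Patients with at least one event matching the codelist.
matched = {e["patient_id"] for e in events if e["code"] in example_codelist}
```

Publishing codelists on OpenCodelists means other researchers can reuse exactly the same definition of a clinical concept, rather than each team re-deriving it.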

Once the code has been run on dummy data, researchers click a button to submit it for running on the real data. All code submitted to run in OpenSAFELY must first be made available online using GitHub, along with contextual information about the project.

Each package of work is known as a job. OpenSAFELY automatically keeps track of all the jobs, including every action being run, what it does, who requested it, and when it happened – there’s a live public dashboard on the web at jobs.opensafely.org, where anyone can keep an eye on what’s happening.

OpenSAFELY then runs that research code automatically, at arm’s length, inside a secure environment, meaning that researchers never need to access sensitive patient data directly.

When each job is complete, researchers can see summary results (mostly in the form of tables and graphs) inside the secure environment, using a tool called Airlock. Inside Airlock, users can see log files (useful for debugging and problem-solving), and data outputs, which must not contain any identifiable information. Airlock has automatic controls to restrict data (such as very large files, or certain file types).

For some (but not all) projects, the researcher might want to move selected outputs outside the secure environment, perhaps for use in a draft paper. Before that happens, we have to make sure that nothing leaving the secure environment could potentially identify any individual patients – what’s known as disclosive information.

This is where our output checking service comes in. After a researcher requests that some outputs be released from the secure environment – some graphs, or results tables – then at least two trained and qualified humans will manually check that they aren’t accidentally releasing anything that could possibly contain any information about any individual, even an anonymous individual.
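The checking itself is done by humans, but one classic statistical-disclosure-control technique it guards against is small counts: a cell of 1 or 2 in a table can single out an individual. As a toy sketch of that idea (the threshold below is illustrative, not OpenSAFELY’s actual rule):

```python
def suppress_small_counts(table, threshold=7):
    """Replace counts below a threshold with None, so that rare
    (and therefore potentially identifying) groups are not disclosed.
    The threshold value is illustrative only."""
    return {
        group: (count if count >= threshold else None)
        for group, count in table.items()
    }

safe = suppress_small_counts({"north": 1523, "south": 3, "east": 42})
```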

Those approved outputs are then moved to a secure job server, outside the secure environment, from where they can be released to the outside world.

The output checking process is also fully audited, including requests for changes made by output checkers, and responses from the researchers. It’s called ‘Airlock’ for a reason: it’s a secure place where outputs can be viewed, understood and output-checked. Some of those outputs will be released, but many aren’t.

Some important things to keep in mind:

  • The private patient-level data never leaves the secure data centre; only aggregated outputs do.
  • The researchers never get direct, unconstrained access to interact with private patient data; instead they develop their code using randomly generated dummy data; and their finished code is then run against real patient data.
  • OpenSAFELY was designed to encourage scientific rigour and best practice. There are a few hoops to jump through, but they exist for good reasons: to ensure the safety of the outputs, and to help users write high-quality code that’s capable of running on unprecedentedly huge national datasets.
  • Newcomers using OpenSAFELY for the first time are given a helping hand from experienced co-pilots.
  • We have strict information governance policies, and a team of in-house experts to make sure everyone sticks to them.
  • OpenSAFELY was created in close collaboration with teams from the main private sector suppliers of data services for GP surgeries – TPP and EMIS – and in close collaboration with research users who have deep expertise in working with electronic health records.
  • NHS England is the Data Controller for the whole service, and the GP practices whose records we are using remain the Data Controllers for patients’ records, with our tools integrated into their systems.
  • All the documentation for using OpenSAFELY is published on the web, so anyone can start learning how to use it.

How OpenSAFELY is different

We don’t give researchers huge extracts of pseudonymous data, either direct to their computer or inside the remote secure environment, because we don’t believe that pseudonymisation is secure enough. It’s often possible to identify individual patients, even in pseudonymised data.

The dummy data that OpenSAFELY generates is a unique and important feature – it means that researchers can check that their code works as expected, before using it with real data. This encourages a more hypothesis-driven approach, and discourages mid-research iterations that could potentially introduce biases and affect research findings.

OpenSAFELY has earned the trust of all the big names in medicine and medical privacy, including the British Medical Association (BMA), the Royal College of General Practitioners (RCGP), their Joint GP IT Committee, Citizens’ Juries and privacy campaigners such as medConfidential.

OpenSAFELY is designed to be open:

  • all the code that makes OpenSAFELY work is in the public domain
  • all the code that researchers write is open, making it easier for others to re-use as part of their research
  • there’s a live dashboard on the web, showing the current status of every job that’s running, or has been run before

Openness is part of the deal: researchers can develop their code privately, but must agree to make it open once any results are shared, as a condition of using the service. We’ve won awards for our commitment to openness. We think it’s very important, because it means:

  • anyone can scrutinise the work, to check a researcher’s findings
  • anyone can re-use the work as part of their own studies, which makes the science faster and more efficient