Six months at the Bennett Institute: reflections on AI-assisted tools, research integrity, and future collaboration

Last year, I sent a cold email to Bennett Institute Director Ben Goldacre and Research Integrity Lead Dr. Nick DeVito, hoping to visit the Institute for a six-month research stay. It was a long shot, but as a postdoctoral researcher with a particular interest in data-driven tools and infrastructure development, I felt that the Bennett Institute would be a natural fit.

As it turned out, Ben and Nick were as keen as I was.

Over the last six months, I’ve worked on developing and evaluating AI-assisted tools for improving research transparency, reproducibility, and trustworthiness, with a particular focus on how these tools can:

help researchers and users of OpenSAFELY develop more robust and accurate analysis code
facilitate checking of consistency between study registrations and papers in clinical trials

In this blog post, I’ll share the work that we were able to accomplish in this short stretch, and our plans to expand this research in the near future.

Assisting research with CodeBot and RegCheck

My visit was organised around two main projects.

The first was the development of CodeBot: a software tool for checking whether the statistical analyses described in a research paper are consistent with the analyses implemented in the accompanying code. In many empirical papers, the written methods and results sections provide a narrative description of the analyses, while the code contains the executable version of those analyses.

In principle, these should match. In practice, discrepancies can arise: a model may be described differently from how it was implemented, a covariate may appear in the code but not the manuscript, or the levels of a variable may be coded in the opposite direction to the one assumed when the effects are interpreted.
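To make one of these discrepancies concrete, here is a minimal sketch of the "covariate in the code but not the manuscript" case. It is a toy illustration and not how CodeBot works: the function names, the R-style formula, and the covariate names are all hypothetical, and it assumes the covariates described in the paper have already been extracted by hand.

```python
import re


def formula_covariates(formula: str) -> set[str]:
    """Return the term names on the right-hand side of an R-style formula.

    Assumption: a simple formula such as "y ~ a + b * c"; transformations,
    intercept terms and random effects are not handled.
    """
    rhs = formula.split("~", 1)[1]
    return {term.strip() for term in re.split(r"[+*:]", rhs) if term.strip()}


def compare_covariates(described: set[str], formula: str) -> dict[str, set[str]]:
    """Compare covariates named in the manuscript with those in the code."""
    in_code = formula_covariates(formula)
    return {
        "only_in_code": in_code - described,
        "only_in_paper": described - in_code,
    }


# Hypothetical example: the paper describes three covariates,
# but the model in the analysis code adjusts for four.
described = {"age", "sex", "bmi"}
formula = "outcome ~ age + sex + bmi + smoking_status"

result = compare_covariates(described, formula)
print(result)
```

Real manuscripts rarely name covariates exactly as they appear in code, so an exact set comparison like this would only ever be a first pass that flags candidates for a person to review.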

CodeBot is intended to support consistency checks of this kind, which Dr. Nick DeVito, Professor Malte Elson (Head of the Department of Digitalisation at the University of Bern), and I have termed “descriptive reproducibility.” We evaluated the prototype using the open-source analysis code and published papers available for OpenSAFELY research projects, and it performs well in detecting these discrepancies.

The second project focused on the further development of the RegCheck architecture – an open-source tool that uses AI to compare a study’s registration with its published paper to identify any inconsistencies – to automate the detection of outcome-switching in clinical trials (according to CONSORT guidelines).

Several years ago, colleagues at the Bennett Institute manually assessed the prevalence of outcome-switching in the medical trials literature as part of the COMPare Trials project. However, these checks involved a huge manual burden that made them difficult to carry out at scale. RegCheck’s architecture offered an avenue to build on the COMPare project by exploring whether recent advances in LLM-based workflows can assist in detecting outcome-switching.
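For readers unfamiliar with outcome-switching, the core comparison can be sketched in a few lines. This is a deliberately simplified illustration, not RegCheck's implementation: it assumes the outcomes have already been extracted from the registration and the paper, it matches them by normalised text only, and every name and example outcome in it is hypothetical. Deciding whether two differently worded outcomes are really the same one is the hard part, and that is where human judgement or an LLM-based workflow would come in.

```python
from dataclasses import dataclass


def normalise(outcome: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(outcome.lower().split())


@dataclass
class OutcomeComparison:
    reported_as_registered: list[str]
    registered_but_missing: list[str]
    reported_but_unregistered: list[str]

    @property
    def possible_switching(self) -> bool:
        # Any omitted or newly introduced outcome is flagged for review.
        return bool(self.registered_but_missing or self.reported_but_unregistered)


def compare_outcomes(registered: list[str], reported: list[str]) -> OutcomeComparison:
    """Compare pre-registered outcomes with those reported in the paper."""
    reg = {normalise(o) for o in registered}
    rep = {normalise(o) for o in reported}
    return OutcomeComparison(
        reported_as_registered=sorted(reg & rep),
        registered_but_missing=sorted(reg - rep),
        reported_but_unregistered=sorted(rep - reg),
    )


# Hypothetical trial: one registered outcome is reported, one is missing,
# and one reported outcome was never registered.
comparison = compare_outcomes(
    registered=["All-cause mortality at 12 months", "Hospital readmission at 30 days"],
    reported=["all-cause mortality at 12 months", "Quality of life at 6 months"],
)
print(comparison)
print("Flag for review:", comparison.possible_switching)
```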

Results and outputs: what we’ve achieved and what comes next

The creation of the CodeBot prototype software, and the promising early results from it, paved the way for me, Dr. Nick DeVito, and Professor Malte Elson to apply for a grant from the Swiss National Science Foundation to start a larger programme of work.

The application was successful, and the grant will now support the development, deployment, evaluation, and maintenance of CodeBot for 4 years starting in 2027. This provides a pathway for continuing the work beyond the initial prototype and for exploring how code-paper comparison tools might be integrated into broader reproducibility and research-integrity workflows throughout quantitative science.

For RegCheck, we now have evaluation data on how well it detects cases of outcome-switching in clinical trials, using existing data from the COMPare Trials project. Our results show that RegCheck detects outcome-switching with high accuracy (>90%), and even catches some cases that were missed by human evaluators.
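An evaluation of this kind boils down to comparing the tool's verdict on each trial with the human verdict, then looking closely at the disagreements. The sketch below shows that bookkeeping under simple assumptions: one yes/no label per trial, and accuracy defined as the share of trials where the two agree. The trial identifiers and labels are invented for illustration and are not our evaluation data.

```python
def evaluate(
    tool_labels: dict[str, bool], human_labels: dict[str, bool]
) -> tuple[float, list[str]]:
    """Return agreement rate and the trials where tool and humans disagree.

    Each dict maps a trial identifier to True if outcome-switching was flagged.
    Only trials present in both dicts are scored.
    """
    shared = sorted(tool_labels.keys() & human_labels.keys())
    if not shared:
        raise ValueError("no trials in common")
    disagreements = [t for t in shared if tool_labels[t] != human_labels[t]]
    accuracy = (len(shared) - len(disagreements)) / len(shared)
    return accuracy, disagreements


# Invented labels for five hypothetical trials.
human = {"trial_a": True, "trial_b": False, "trial_c": True, "trial_d": False, "trial_e": True}
tool = {"trial_a": True, "trial_b": False, "trial_c": True, "trial_d": True, "trial_e": True}

accuracy, to_review = evaluate(tool, human)
print(f"Agreement with human assessors: {accuracy:.0%}")
print("Trials to review manually:", to_review)
```

A disagreement does not automatically mean the tool was wrong, since human assessors can miss cases too, which is why the disagreements are returned for manual review rather than simply counted as errors.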

We’re currently manually reviewing the discrepancies, which will help us iteratively improve the performance of our system, and we have a pending grant submission to further develop this work into a live Outcome Tracking tool. This would provide real-time monitoring and reporting of outcome reporting discrepancies, similar to the Bennett Institute’s existing TrialsTracker project.

One of the most valuable aspects of the visit was the opportunity to develop new collaborations and identify possible integrations with ongoing work at the Bennett Institute. In particular, the visit opened up conversations about further ways that we can support researchers – both within Bennett, and more generally across different services.

I’m excited about the potential of harnessing some of the promise of AI to aid researchers in improving the robustness of their research, and will be exploring avenues to develop and pilot some exciting new ideas with the Bennett team in the coming months.

My main takeaway – beyond all of the interesting and exciting work I did in this six-month stretch – is that the best is yet to come. I feel incredibly grateful to have established new collaborations with colleagues at one of the most exciting and innovative research groups in health research, and have taken away too many lessons to count. The foundations we built and the insights we gained in this time will bear plenty of fruit in the months to come – stay tuned.

About Dr. Jamie Cummins

Jamie is an advanced postdoctoral researcher based at the University of Bern. As a meta-scientist, he is interested in how to make practical changes to the research production process to improve the transparency, robustness, and trustworthiness of research.

Jamie is part of the core team of ERROR, a bug bounty programme rewarding researchers for engaging in scientific self-correction. He has developed and continues to maintain RegCheck, a tool that makes it easier for researchers to compare study registrations with published papers.