
CREOLA: A Clinical Safety Framework for Large Language Models in Medical Documentation

Read our paper published in npj Digital Medicine.2

Ambient voice technology (AVT) is already demonstrating tremendous potential for streamlining clinical documentation, a task that consumes substantial clinician time and contributes to burnout. At TORTUS we are leading the way on AVT safety. Here we discuss the challenge of hallucinations, where LLMs generate information not present in the source material, which may pose clinical safety risks if left unchecked by the clinician.

As a UK MHRA Class I Medical Device, TORTUS is pioneering the science of clinical safety in the AI space. The CREOLA approach we developed last year now underpins the automated clinical guardrail systems that make TORTUS the safest AVT enterprise supplier on the market. We believe AI can be powerful if deployed well, but equally dangerous if not, and that it must therefore be regulated as any drug or other medical device would be, a position that NHS England agrees with.1

The CREOLA Framework

We’ve developed a comprehensive framework called CREOLA (short for Clinical Review Of LLMs and AI), paying tribute to Creola Katherine Johnson, a pioneering human computer at NASA. Just as human computers were integral to the safe landing of Apollo moon missions, clinicians play a vital role in safely integrating AI technologies into clinical practice.

In our paper we demonstrate how to apply CREOLA to any LLM, using GPT-4 as a case study. Our framework provides a structured approach to evaluating LLM outputs for clinical documentation. Our work represents one of the largest manual evaluations of LLM-generated clinical notes to date, analysing 49,590 transcript sentences and 12,999 clinical note sentences across 18 experimental configurations.

Key Components of the CREOLA Framework

  1. Error Taxonomy
  2. Clinical Safety Assessment
  3. Iterative Experimental Structure

  1. Error Taxonomy: We’ve developed a detailed taxonomy that categorises errors in clinical documentation. This gave us useful insight into LLM error rates and how to mitigate them (a minimal code sketch of the taxonomy follows this list).
    • Hallucinations: instances of text in the clinical document unsupported by the clinical encounter
      • Fabrication (43%) – completely invented information
      • Negation (30%) – contradicting clinical facts
      • Contextual (17%) – mixing unrelated topics
      • Causality (10%) – speculating on causes without evidence

  • Omissions: clinically important information from the encounter that was not included in the clinical documentation
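
To make the taxonomy concrete, here is a minimal sketch of how these labels might be represented in code. It is illustrative only, with hypothetical class and field names rather than anything taken from the CREOLA platform itself.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    HALLUCINATION = "hallucination"  # text unsupported by the clinical encounter
    OMISSION = "omission"            # clinically important text left out of the note

class HallucinationSubtype(Enum):
    FABRICATION = "fabrication"  # completely invented information
    NEGATION = "negation"        # contradicts clinical facts stated in the consultation
    CONTEXTUAL = "contextual"    # mixes unrelated topics
    CAUSALITY = "causality"      # speculates on causes without evidence

@dataclass
class LabelledSentence:
    sentence: str
    error_type: ErrorType
    subtype: HallucinationSubtype | None = None  # only applies to hallucinations
    is_major: bool = False  # could affect diagnosis or management if left uncorrected
```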

  2. Clinical Safety Assessment: A key innovation of our framework is that it incorporates accepted clinical hazard identification principles (based on the NHS DCB0129 standard) to evaluate the potential harm of errors:

AVTs generate a draft clinical document for clinicians to review; this is known as the “clinician-in-the-loop” workflow, providing vital human oversight of AI. Whilst errors will be exceedingly rare, it’s our responsibility to track and quantify these risks. In our experiments we assume a “worst case scenario” in which a clinician does not review or correct their note, and we consider all possible subsequent actions if that error were acted upon without question.

We categorise errors as either ‘major’ or ‘minor’, where major errors can have a downstream impact on the diagnosis or management of the patient if left uncorrected. This is further assessed using a risk matrix comprising:

  • Risk severity rating from 1 (minor) to 5 (catastrophic)
  • Likelihood assessment from very low to very high

Risk estimation based on the likelihood and consequence of harm occurrence.

This figure illustrates the scoring of clinical risk based on the likelihood of an incident occurring and the severity of harm it may cause (based on DCB0160 principles).

This provides TORTUS with a grounded way of quantifying the risks and errors of LLMs that aligns with medical device regulations.
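
As a worked illustration of how such a matrix combines severity and likelihood into a single score, here is a small sketch. The matrix values and function name below are examples of ours only; the authoritative tables are defined in DCB0129/DCB0160.

```python
SEVERITY = {1: "minor", 2: "significant", 3: "considerable", 4: "major", 5: "catastrophic"}
LIKELIHOOD = {1: "very low", 2: "low", 3: "medium", 4: "high", 5: "very high"}

# RISK_MATRIX[likelihood - 1][severity - 1] -> risk score from 1 (lowest) to 5 (highest).
# Illustrative values only; consult DCB0129/DCB0160 for the authoritative table.
RISK_MATRIX = [
    [1, 1, 2, 2, 3],  # very low likelihood
    [1, 2, 2, 3, 4],  # low
    [2, 2, 3, 3, 4],  # medium
    [2, 3, 3, 4, 5],  # high
    [3, 4, 4, 5, 5],  # very high
]

def risk_score(severity: int, likelihood: int) -> int:
    """Combine a severity rating (1-5) and a likelihood rating (1-5) into a risk score."""
    return RISK_MATRIX[likelihood - 1][severity - 1]

# Example: a major-severity error judged to have low likelihood scores risk_score(4, 2) == 3.
```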

  3. Iterative Experimental Structure: Our team at TORTUS implemented a methodical approach to comparing different prompts, models, and workflows, so that changes in performance can be attributed to specific modifications (e.g. a prompt or model change).

Our workflow for the assessment of LLM output using the CREOLA platform.

This diagram illustrates the process we followed across experiments using the available dataset, including clinician input for labelling and consolidating reviews, followed by a safety analysis.
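
A minimal sketch of the comparison step is shown below: given clinician labels for two experimental configurations, it tallies major and minor errors so that a prompt or model change can be credited with the difference. The data shape and function names are assumptions for illustration, not the actual CREOLA platform API.

```python
from collections import Counter

def error_summary(labels: list[dict]) -> Counter:
    """Tally major/minor hallucinations and omissions from clinician labels.

    Each label is assumed to look like:
    {"error_type": "hallucination" or "omission", "is_major": True or False}
    """
    summary = Counter()
    for label in labels:
        severity = "major" if label["is_major"] else "minor"
        summary[f"{severity}_{label['error_type']}"] += 1
    return summary

def compare_experiments(baseline: list[dict], candidate: list[dict]) -> dict:
    """Report the change in each error count between two configurations."""
    before, after = error_summary(baseline), error_summary(candidate)
    return {key: after[key] - before[key] for key in set(before) | set(after)}
```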

What We’ve Discovered

To demonstrate how to apply CREOLA to any LLM or AVT, we used GPT-4 (early 2024) as a case study. LLMs have improved since and likely have even lower error rates than those reported here.

Hallucinations, while less common than omissions, carry significantly more clinical risk. Of the 12,999 sentences across 450 clinical notes, 191 contained hallucinations (1.47%), of which 84 (44%) were major. Of the 49,590 sentences from our consultation transcripts, 1,712 were omitted (3.45%), of which 286 (16.7%) were classified as major and 1,426 (83.3%) as minor.
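
For readers who want to trace the percentages, they follow directly from the raw counts reported above:

```python
# Reproducing the reported rates from the raw counts.
hallucinated, note_sentences = 191, 12_999
omitted, transcript_sentences = 1_712, 49_590

print(f"hallucination rate: {hallucinated / note_sentences:.2%}")  # ~1.47% of note sentences
print(f"major hallucinations: {84 / hallucinated:.0%}")            # ~44% of hallucinations
print(f"omission rate: {omitted / transcript_sentences:.2%}")      # ~3.45% of transcript sentences
print(f"major omissions: {286 / omitted:.1%}")                     # ~16.7% of omissions
```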

Severity of risk in major hallucinations.

Based on our CREOLA framework, hallucinations can lead to significant clinical risk if left uncorrected in the notes.

We’re particularly concerned about negation-type hallucinations, which make up 30% of all hallucinations and contradict information stated in consultations. If unchecked, these could lead to clinical harm.

We found major hallucinations occurred most frequently in the Plan (21%), Assessment (10.5%), and Symptoms (5.2%) sections of clinical notes—areas with high potential for patient harm.

How TORTUS can reduce hallucinations 

Through our iterative prompt engineering approach, we’ve achieved substantial improvements in model output. Below are a few examples:

In this series of experiments, we eliminated major omissions completely, decreasing them from 61 to 0, and reduced minor omissions by 58%, from 130 to 54. Additionally, we lowered the total number of hallucinations by 25%, reducing them from 4 to 3.

In this experiment, we tackled major hallucinations and omissions through iterative prompting: the change in prompt from Experiment 3 to Experiment 8 reduced the incidence of major hallucinations by 75% (from 4 to 1), major omissions by 58% (from 24 to 10), and minor omissions by 35% (from 114 to 74).
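
To give a flavour of what an iteration like this can involve, here is a purely hypothetical example of the kind of grounding instruction that might be added between prompt versions. These are not the prompts used in our experiments, which are detailed in the paper.

```python
# Hypothetical example only: NOT the prompts used in the experiments above.
PROMPT_V1 = "Summarise the following consultation transcript as a clinical note."

PROMPT_V2 = (
    "Summarise the following consultation transcript as a clinical note.\n"
    "Only include information explicitly stated in the transcript.\n"
    "Do not infer diagnoses, causes, or negative findings that were not discussed.\n"
    "If a section (e.g. Plan) has no supporting content in the transcript, leave it empty."
)
```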

What CREOLA means for the future of AVT

At TORTUS, science and clinical safety are a priority, not an afterthought. CREOLA is a foundational framework that we hope other AVT companies can use to implement governance and safety assessment protocols in clinical settings. The data generated from CREOLA enable us to build automated safety guardrails into our technology, which allows us to perform “health checks” on each and every clinical document that we produce.
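
As a rough sketch of what such a health check could look like in code, the function below runs a supplied error detector over each sentence of a draft note and flags anything unsupported for the reviewing clinician. Everything here, from the function names to the data shapes, is a simplified assumption rather than a description of the actual TORTUS guardrail implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Flag:
    sentence: str
    reason: str  # e.g. "possible negation hallucination"

# A detector is assumed to return (is_supported, reason) for a note sentence given the transcript.
Detector = Callable[[str, str], tuple[bool, str]]

def health_check(transcript: str, note_sentences: list[str], detector: Detector) -> list[Flag]:
    """Flag draft-note sentences the detector cannot ground in the transcript."""
    flags = []
    for sentence in note_sentences:
        supported, reason = detector(transcript, sentence)
        if not supported:
            flags.append(Flag(sentence=sentence, reason=reason))
    return flags  # presented to the reviewing clinician before the note is finalised
```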

By quantifying both the occurrence and potential clinical impact of errors, we’re helping organisations make data-driven decisions about implementing LLM technologies in healthcare workflows. Our work demonstrates that careful prompt engineering and workflow design can reduce LLM error rates below those documented for human clinicians. 

At TORTUS, science and safety permeate everything that we do. Fundamentally, it is about putting clinical safety and science into the iterative process of designing and building for clinical environments. Evidence generation and research like this take time; deploying AI in healthcare means we have to move “as fast as we can, as slow as we need”. That’s why we are called TORTUS.

If you’re interested in further information about clinical safety in LLMs and AVT, CREOLA was recently showcased at the London OpenAI Dev Day. Click here for the full video!

1. NHS England. Guidance on the use of AI-enabled ambient scribing products in health and care settings. https://www.england.nhs.uk/long-read/guidance-on-the-use-of-ai-enabled-ambient-scribing-products-in-health-and-care-settings/.

2. Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine 8, 1–15 (2025).
