How we operate as a Class IIa device

Operating an LLM-based system as a regulated medical device imposes a discipline most software never faces: performance has to be defined before it can be measured, then the instruments to measure it have to be built, and both have to be sustained for as long as the device is live. That means maintaining technical performance monitoring alongside documentation that captures every aspect of the software's life cycle and often runs longer than a PhD thesis. This is an account of what that actually involves.

The evidence bar is far higher, and it is externally policed

For a self-declared Class I device, the manufacturer essentially promises that they have followed the rules. For a Class IIa device, an external Notified Body (or Approved Body in the UK) audits those promises.

This audit is not a sampling exercise; it is a line-by-line review of the entire technical file. It covers everything from the clinical evaluation report, which must prove the device is safe and effective, to the risk management file, which must identify and mitigate every conceivable way the system could fail.

It is the difference between marking your own homework and defending a thesis: a level of rigour that is uncomfortable for traditional software development, but necessary for anything that participates in a clinical decision.

Measuring performance to that bar required us to build the instruments from scratch

The first problem was definitional, and it was harder than it sounds: what does accuracy even mean for an ambient voice technology (AVT) system? There is no single number. Answering that rigorously took years of work and a peer-reviewed paper in npj Digital Medicine, a Nature Portfolio journal, that quantitatively defines hallucination rate and omission rate for AVT and sets out how to measure them. Only once the metric was defined could we build anything that reports it.
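As a rough illustration of why the definition has to come first, here is a minimal sketch of how two such rates could be computed from clinician reviews. It is not the definition from the paper: the NoteReview schema, its field names, and the choice of denominators (note sentences for hallucinations, clinically relevant facts for omissions) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteReview:
    """One clinician review of one generated note (hypothetical schema)."""
    note_sentences: int         # sentences in the generated note
    unsupported_sentences: int  # note sentences with no support in the consultation
    relevant_facts: int         # clinically relevant facts stated in the consultation
    missing_facts: int          # relevant facts that did not make it into the note


def hallucination_rate(reviews):
    """Assumed definition: unsupported note sentences / all note sentences."""
    total = sum(r.note_sentences for r in reviews)
    if total == 0:
        raise ValueError("no note sentences to evaluate")
    return sum(r.unsupported_sentences for r in reviews) / total


def omission_rate(reviews):
    """Assumed definition: missing relevant facts / all relevant facts."""
    total = sum(r.relevant_facts for r in reviews)
    if total == 0:
        raise ValueError("no relevant facts to evaluate")
    return sum(r.missing_facts for r in reviews) / total


# Illustrative numbers only, not real results.
reviews = [
    NoteReview(note_sentences=40, unsupported_sentences=1,
               relevant_facts=25, missing_facts=2),
    NoteReview(note_sentences=60, unsupported_sentences=1,
               relevant_facts=25, missing_facts=3),
]
print(f"hallucination rate: {hallucination_rate(reviews):.3f}")
print(f"omission rate: {omission_rate(reviews):.3f}")
```

Even in this toy version the two rates have different denominators, which is one reason a single "accuracy" figure does not work: a note can contain nothing false and still leave out something that matters.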

With the definition in place, we built the instrumentation in two layers. The first measures accuracy during development, using human clinicians to assess it against that definition. The second, an AI observation system we built, measures accuracy in production, in real time: this is the layer the post-market surveillance requirement actually depends on. That monitoring platform runs continuously. The live hallucination rate, omission rate, and other metrics are displayed on the wall in our office, and the platform is what lets us make claims about the device in the field, not only in the lab.
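To make the idea of a continuously running monitor concrete, here is a minimal sketch of a rolling rate over the most recent production notes. It is an assumption-laden illustration, not a description of our platform: the RollingRate class, the window size, and the alert threshold are all hypothetical.

```python
from collections import deque


class RollingRate:
    """Rate of flagged items over the most recent `window` observations."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque with maxlen silently drops the oldest item once full
        self._events = deque(maxlen=window)

    def record(self, flagged: bool) -> None:
        """Record whether the latest note was flagged (e.g. contained a hallucination)."""
        self._events.append(bool(flagged))

    @property
    def rate(self) -> float:
        """Fraction of flagged items in the current window; 0.0 if empty."""
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def breaches(self, threshold: float) -> bool:
        """True when the live rate is strictly above the (hypothetical) threshold."""
        return self.rate > threshold


# Illustrative stream of per-note flags, not real data.
monitor = RollingRate(window=4)
for flagged in [False, False, True, True, True]:
    monitor.record(flagged)

print(f"live rate: {monitor.rate:.2f}, breach: {monitor.breaches(0.5)}")
```

A fixed-size window keeps the figure sensitive to recent behaviour, so a regression shows up quickly instead of being diluted by months of earlier notes.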

Clinical benefit had to be measured to the same standard as accuracy. We ran the largest service evaluation of AVT in Europe: over 16,000 patients across nine clinical sites, with human observers in the room mapping how clinicians spent their time and comparing it against their normal routine, across multiple care settings simultaneously. That produced the headline results: 23.5% more direct patient care time, and a 13.4% productivity increase in the emergency department.

This is the scientific underpinning of TORTUS, both technical and clinical. The output so far is three peer-reviewed papers, a granted US patent, and three further papers in train. It has become central enough to the company that we set up a permanent research function to carry the same measurement work forward to the next set of problems.

We have to sustain this indefinitely, and every higher-risk feature re-enters the same gate

None of this ends at certification. The QMS is maintained under recurring audit, and any significant change to the system goes through the same level of scrutiny before it can ship. This is the part that most changes how you build: a feature is not done when it works; it is done when it has been evidenced to the standard and cleared. Our observational layer recently became an actuator, the Shell, which removes hallucinations before a clinician ever sees the note. Crossing from observing to acting is exactly the kind of change that re-enters the process in full. Everything beyond scribing (ordering, decision support, translation) carries higher risk and goes through the same gate. For a software company, that is extremely high friction.

But the friction is the point. As with a diamond, pressure is what produces something reliable, auditable and provable, rather than merely working.

Running an LLM as a Class IIa device is slow by construction. The bar is raised twice: once to evidence that the system works, and again to evidence that it keeps working, and every subsequent change is held to both. That is how healthcare has always treated things that can affect patients, and it is the reason we are called TORTUS: speed is the cost of assurance, and assurance is the product. Slow is smooth, smooth is fast.
