How we operate as a Class IIa device
Operating an LLM-based system as a regulated medical device imposes a discipline most software is never subjected to: we had to define performance before we could measure it, then build the instruments to do so, and sustain both for as long as the device is live. That means maintaining both complex technical performance monitoring and documentation systems that capture every aspect of the software life cycle, in files that run longer than most PhD theses. This is an account of what is actually involved.
The evidence bar is far higher, and it is externally policed.
For a self-declared Class I device, the manufacturer essentially promises that they have followed the rules. For a Class IIa device, an external Notified Body (or Approved Body in the UK) audits those promises.
This audit is not a sampling exercise; it is a line-by-line review of the entire technical file. It covers everything from the clinical evaluation report, which must prove the device is safe and effective, to the risk management file, which must identify and mitigate every conceivable way the system could fail.
It is the difference between marking your own homework and defending a thesis. It forces a level of rigour that is uncomfortable for traditional software development but necessary for anything that participates in a clinical decision.
Measuring performance to that bar required us to build the instruments from scratch.
The first problem was definitional, and it was harder than it sounds: what does accuracy even mean for an ambient voice system? There is no single number. Answering it rigorously took years of work and a peer-reviewed paper in npj Digital Medicine, a Nature Portfolio journal, which defines the hallucination rate and omission rate for ambient voice technology (AVT) quantitatively and sets out how to measure them. Only once we could define the metrics could we build anything that reports them.
With the definition in place, we built the instrumentation in two layers. The first measured accuracy during development, using human clinicians, against the definition above. The second, which we built on AI observation, measures accuracy in production, in real time, after deployment; it is the layer the post-market surveillance requirement actually depends on. That monitoring platform runs continuously. The live hallucination rate, omission rate and many other metrics are on the wall in our office, and the platform is what lets us make claims about the device in the field rather than only in the lab.
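To make the idea of a continuously reported rate concrete, here is a minimal Python sketch of a rolling monitor for the two rates named above. It is an illustration only, not our production platform. The `NoteReview` schema, the `RollingAccuracyMonitor` class and the rate definitions (hallucinated sentences over note sentences, omitted facts over source facts) are simplified assumptions made for this example; they are not the definitions from the paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteReview:
    """One reviewed note. Hypothetical schema, assumed for illustration."""
    note_sentences: int          # sentences in the generated note
    hallucinated_sentences: int  # note sentences not supported by the consultation
    source_facts: int            # clinically relevant facts in the consultation
    omitted_facts: int           # of those, facts missing from the note


class RollingAccuracyMonitor:
    """Pooled error rates over the most recent `window` reviews."""

    def __init__(self, window: int = 1000) -> None:
        # A bounded deque drops the oldest review once the window is full.
        self._reviews = deque(maxlen=window)

    def record(self, review: NoteReview) -> None:
        self._reviews.append(review)

    def hallucination_rate(self) -> float:
        # Assumed definition: hallucinated sentences / all note sentences.
        total = sum(r.note_sentences for r in self._reviews)
        bad = sum(r.hallucinated_sentences for r in self._reviews)
        return bad / total if total else 0.0

    def omission_rate(self) -> float:
        # Assumed definition: omitted facts / all source facts.
        total = sum(r.source_facts for r in self._reviews)
        missed = sum(r.omitted_facts for r in self._reviews)
        return missed / total if total else 0.0


if __name__ == "__main__":
    monitor = RollingAccuracyMonitor(window=1000)
    # Invented example counts, not real device data.
    monitor.record(NoteReview(20, 1, 10, 2))
    monitor.record(NoteReview(30, 0, 10, 1))
    print(f"hallucination rate: {monitor.hallucination_rate():.2%}")
    print(f"omission rate: {monitor.omission_rate():.2%}")
```

Pooling counts across the window, rather than averaging per-note rates, keeps short notes from dominating the figure; whether that is the right choice depends on the metric definition being reported.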
Clinical benefit had to be measured to the same standard as accuracy. We ran the largest service evaluation of AVT in Europe: more than 16,000 patients across nine clinical sites, with human observers present in the room, mapping how clinicians spent their time with the system against their normal routine, across multiple care settings simultaneously. That produced the headline results: 23.5% more direct patient care time and a 13.4% productivity increase in the emergency department.
This is the scientific underpinning of TORTUS, both technical and clinical. The output so far is three peer-reviewed papers, a granted US patent, and three further papers in train. It has become central enough to the company that we set up a permanent research function to carry the same measurement work forward to the next set of problems.
We have to sustain this indefinitely, and every higher-risk feature re-enters the same gate.
None of this terminates at certification. The quality management system (QMS) is maintained under recurring audit, and any significant change to the system is submitted to the same level of scrutiny before it can ship. This is the part that most changes how you build: a feature is not done when it works; it is done when it has been evidenced to the standard and cleared. Our observational layer recently became an actuator, the Shell, which removes hallucinations before a clinician ever sees the note, and crossing from observing to acting is exactly the kind of change that re-enters the process in full. Everything beyond scribing (ordering, decision support, translation) carries higher risk and goes through the same gate. This is extremely high friction for a software company.
But the friction is the point. As with a diamond, the pressure is what produces something reliable, auditable and provable, rather than merely working.
Running an LLM as a Class IIa device is slow by construction. It raises the bar twice: once for evidencing that the system works, and again for evidencing that it keeps working. Every subsequent change is held to both. That is how healthcare has always treated things that can affect patients, and it is the reason we are called TORTUS: the slower pace is the cost of the assurance, and the assurance is the product. Slow is smooth, smooth is fast.
Read more about why TORTUS became a UKCA Class IIa medical device and what that means for your organisation below.
