TORTUS AI Academy
Welcome to the TORTUS AI Academy – a free educational webinar series from TORTUS AI for NHS CXOs and other digital leaders, to learn all about AI.
Episodes
Watch Previous Episodes
1. What is AI? – Dr. Dom Pimenta
First Broadcast:
11 December 2024
Welcome to the inaugural session of the TORTUS AI Academy. My name is Dr Dominic Pimenta, I'm the CEO and co-founder of TORTUS. We're still letting a few people join, so I'm just gonna get started super slowly and go through the slides, but you can stop looking at this.
So the purpose of today’s session, and we’ll give you a bit of an overview of the whole series is really about orientation. So for those of you who are interested in AI, this actually might be too simple a session. It’s very much a fundamentals review of the very basics. And I mean, the very basics. We’re not going to get into anything too complicated.
The overview of the series as a whole is for NHS C-suite and digital leaders to better understand the AI landscape as a whole. And increasingly in our conversations now, as an AI company talking to all sorts of leaders across the NHS, I think there's a real gap in keeping up, but also in having to evaluate technologies which are new, haven't been evaluated before and have no precedent. So this really is to ground the fundamentals and to keep it super simple. We've got a bunch of polls, we've got a bunch of questions that we can be asking in the chat here, and we'll do a question and answer session at the very end of the session today. As for the programme today, we'll run a session every 2 weeks, with the exception that 2 weeks from today is in fact Christmas.
So, what is AI, today. Then at the beginning of January, we'll look at evaluation of AI in healthcare. Then we'll start looking at some of the clinical risks in late January, on the 22nd. And we'll look at how to drive adoption and implementation, and we can share some of the lessons we've learned and also some from the literature.
And then the last session, mid February, will be about the future of AI and the NHS, and another Ask Anything session. So the goal here today is to learn how to buy AI for healthcare wisely. I'll tell you what this isn't: it is not a computer science class, and I'd be the wrong person to be leading it if that were the case. It's a crash course in AI for healthcare leaders, from concept to clinic.
So we’re really gonna look at the very basics. So today we’re going to cover basic terminology in AI, some foundational concepts, and also look at some practical live examples of AI today. We’ll go through a bunch of explanations about how the models work, what the basic terms mean, but we won’t dive into a tremendous amount of the finer detail. So specifically what we will not cover, do not ask me about the maths of the models or the deeper technicals of how to build these models. That’s beyond the scope of this.
But what we will do is share some really good resources after the session of courses that people can do if they’re super interested in learning how to do that. Similarly, we won’t get into the deep philosophy of AI. We might touch on some of the ethics and the risks when we come to that session next month. But today, specifically, we’re not going to dive into will AI replace us or anything like that. We also won’t be looking at any specific workflows.
We'll just look very generally at the field, what the various streams and sources mean, what the various systems are useful for and what they're not useful for, and we'll keep it super light. And the other thing that we definitely won't dig into today, although we will cover it in quite a lot of detail in the following sessions on evaluation and clinical risk, is compliance, medical device status, the regulatory side, what we think of it now and what we think of it in the future. So with that in mind, let's make a start. So I think when we talk about AI, I really want to get right down to the fundamentals and set up some definitions from day dot that are useful when we're talking about practical deployment of artificial intelligence in healthcare. There are many, many definitions of these things in the academic world.
But for the purposes of today's session, we're gonna make some very simple definitions. Number 1: intelligence is the ability to solve a problem. And that is indexed on the speed at which you can solve that problem and the complexity of that problem. So artificial intelligence, then, is essentially a machine that can solve problems. Now to demonstrate this, and also to check that the chat's working, I would like somebody to put into the chat, as fast as they can, the answer to this maths equation: 20 times 18. Let's see if there are any takers for that.
Yeah 360. Damn. Okay. So you did a good job. So that’s very good.
Wow, a very responsive audience. This is awesome. Okay, so now what I'd like you all to do is answer this equation, and let's see how fast we can get to the answer. And I will wait a little bit because I appreciate it's either pre lunch or post lunch.
So there’s some brains that are working pretty slowly. Wow, Helen Russell. Okay, very good. 38745. People got some calculators out, I’m sure.
But I think the point stands that when we talk about intelligence, the complexity of the problem and the speed are a very natural way of assessing it. And also that our human brains are very good at certain types of problems. We are optimized for pattern recognition, for example; our instincts are very much, is that a tiger over there? Is it going to eat me? We are quite bad at calculation, and there's no way I could ever get that answer in my head, for example, whereas machines are exceptionally good calculators but very poor at pattern recognition in the way that we do it. And I think that's a really important thing to recognize: although artificial intelligence is a machine solving problems, potentially in a humanistic way, it actually finds some problems much harder than we do and others much easier. So it's about using the right tool at the right time. So we're just gonna go through a bunch of definitions today and we're gonna keep it super light. So let's talk about some terms that you might hear.
So one of the terms you might hear is narrow versus general AI. Narrow AI is easy to define: it solves one problem or very few problems. The classic example, and it always makes my presentations look better, is classifying cats. Now you might not think that's a problem, because it's not very hard for human beings, but for a machine it's actually quite a difficult problem. It was one of the groundbreaking problems used to demonstrate AI capabilities about 10 or 15 years ago: classification of various images. And the classic example is, is there a cat in this picture or not? Here's a cat. I think actually that's not a real cat.
It looks like a fake cat, but anyway, there’s a cat. And that’s an example of narrow AI where the machine solves one problem. And the problem is, is there a cat here or is there not? Now, you can also extend that to general AI. And there’s a lot of debate about where we are, but the fundamental definition means an AI model that can solve many, many, many problems.
Some people say, we haven’t reached that state yet and it has to be human capability. There’s some other argument that says, you know, for example, a language model can solve many different language problems. Does that count as general? It doesn’t really matter. And certainly when we look in the practical deployment side, it doesn’t matter at all, but it’s useful to understand that as a terminology in the background.
Okay. So you'll often hear terms used interchangeably in the AI space. So just for clarity, I want to take you through what it means to talk about an AI model, an AI algorithm and AI training, because these are things that you'll hear all the time. And also there are some fundamentals of how to evaluate these models, which we'll come on to later, that really require us to have a good metric and understanding of what those things mean. So when we talk about AI systems, we use the term model.
And that is a pretty good term, because it can be thought of as an internal representation of certain patterns that the model has learned that can help it to solve a problem. So typically, when you're training a model, and there are lots of different ways of training, you use an algorithm. That's the way of training the model: to take data, to put it into a series of what we call weights or parameters, which is essentially just a huge calculator, and to have some sort of output. Now, when we talk about training, the way that these models are trained is that we have a data set that we know is true. I mean, this is one type of training.
I won't go super hard into other types, but this is an easy way to understand it. And we'll talk about a worked example later. But when we're doing training, we need to know whether the solution that comes out of the model is correct or not. And when it's not correct, what we do is we say, okay, try again: change the model slightly, change the calculator slightly, put the same data back in and see what output you get. And we do this millions of times, round and round and round, changing the model, changing the model, changing the model, until we get the model that produces a solution as good, or as close to the right answer, as we think possible.
And that’s what we call machine learning. It’s AI getting better at solving problems over time. So just with this definition, you now have a definition of a model, which is the fundamental system where data comes in and solution comes out. You have a definition of what we mean by AI. You have a definition of what we mean by training and also what we mean by machine learning.
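To make that loop concrete, here is a deliberately tiny sketch in Python of the try, check, adjust, repeat idea. The data and the "model" (a single weight) are invented for illustration only and bear no relation to how any real clinical system is trained.

```python
# A toy version of the "try, check, adjust, repeat" training loop described above.
# The "model" here is a single weight w, and the truth we want it to learn is y = 2 * x.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]  # (input, known correct answer) pairs

w = 0.0                 # start with a model that knows nothing
learning_rate = 0.01

for epoch in range(1000):               # many passes over the same data ("epochs")
    for x, y_true in data:
        y_pred = w * x                  # the model's current answer
        error = y_pred - y_true         # how wrong it was
        w -= learning_rate * error * x  # nudge the weight to be slightly less wrong

print(f"learned weight: {w:.3f}")       # ends up very close to 2.0
```

The only thing that changes from loop to loop is the weight; "learning" is just that repeated nudging until the outputs stop being wrong.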
Now, here’s a good example. And we go back to cats because it’s fun for cats. So here’s some data. And we’ve labeled this already. So we use the term labels to annotate data so we know exactly what the truth is.
And sometimes that's human labelled. There are other ways of doing training, but we'll stick to the simple stuff for today. So here's a cat, you can see it's a cat. Here's a pillow, that's not a cat. Here's another cat, that's a cat. Here's a cat, that's a cat. Here's a slipper, not a cat. Now, again, I'm emphasizing that machines are quite bad at things that we find easy, but I do also like this example because you can talk about slippers and cats, which is a completely daft thing to do. So we put that into the model, and this represents the various loops of training, what we call epochs. And eventually, we get a model.
And then we show it a picture of a cat. And it goes, this is a cat. And we go, Okay, great. We’ve got a good model that knows that when we show it some new pictures of cats that it hasn’t seen before, it can recognize that in fact is a cat. Now, this is super interesting, right?
So when we talk about what the model has learned, in theory it's learned certain patterns or certain features of the training data that allow it to identify cats it hasn't seen before. So for example, and I'm just hypothesising, the model sort of learns that cats have pointy ears, they're a bit furry, and they're generally all this sort of brown-greeny colour, which I'm sure has a name that escapes me at this moment. And that equals cat. Now, here's a problem that happens with data, and this is why data is so important when you're training AI models. Problem number 1: there's not enough data. So for example, the model here has learned that cats look like this, but it's also learned that cats always stand up, and they stand up in a pretty similar pose. Now here we've shown it a picture of quite an obvious cat. It's got pointy ears, it's about the right colour, but it's lying down in a pose that the model hasn't seen before. So it classifies it as no cat. And this is a classic example of what we call underfitting. There was not enough data in the original training data for the model to understand that sometimes cats lie down, and that that is a feature it would need to recognize. Now translate that into the healthcare space: say you were training a model to identify heart attacks, and you only trained it on middle-aged men who came in with ST-elevation myocardial infarctions, big heart attacks, major heart attacks. That model would then fail to recognize women, ethnic minorities and non-ST-elevation myocardial infarctions. So I wouldn't use the term minor heart attacks, very controversial for a cardiologist, but other types of heart attacks wouldn't be recognized. So it's really important, when we're considering training models, that the data the model was trained on is diverse enough for the solution we want to implement. And that's a key thing to understand when you're evaluating vendors: they will tell you it's useful in this setting, so ask what evidence they have that it's trained to be capable in that setting.
And that’s actually a really hard question to answer. Here’s another problem. And it’s a bit of a more complicated problem. So we also have a problem called overfitting. Now, in this example, what’s happened here is we’ve trained this model many, many more times than actually is appropriate.
And we've trained it over and over again. So now, instead of learning the patterns, it's actually just memorized the specific pictures of cats that were in its training data. And when it sees a picture of a cat that's not one of the pictures it learned, it says this isn't a cat. And that's why data quality is also important, but also how the model was trained and how it was tested. And this is called overfitting. And you see that sometimes, funnily enough, in lots of healthcare papers, where they've created a model that performs really well on data it has seen before, but it doesn't, what we call, generalize: it doesn't work at all on new data. And in healthcare this is really important, because you'll also notice one important feature. All of these cats are that green-brown colour that cats sometimes are; someone will tell me what that's called, I think it's eggshell. But here's a white cat, and the model doesn't recognise that cats can be white. And therefore it says there's no cat.
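If you want to see what that failure to generalise looks like in numbers, here is a rough sketch, assuming scikit-learn is installed and using a purely synthetic dataset. The tell-tale sign of overfitting is a big gap between performance on data the model has seen and data it has not.

```python
# Synthetic data only: the signature of overfitting is a big gap between performance
# on data the model has seen and data it has not seen.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(max_depth=None)  # unlimited depth: free to memorise the training set
model.fit(X_train, y_train)

print("accuracy on data it has seen:    ", model.score(X_train, y_train))  # essentially perfect
print("accuracy on data it has not seen:", model.score(X_test, y_test))    # noticeably worse
```

This is exactly why evaluation on genuinely held-out data, ideally from the setting you plan to deploy in, matters so much.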
And another really good example of this is training data and diverse ethnic populations: if the data doesn't account for that diversity in the input, it can bias the output. So understanding diversity is also really important when we talk about overfitting. Some more definitions then, and we'll keep with cats because that's a fun thing to talk about on a Wednesday lunchtime. Deterministic. Now this is really important when we come on to large language models and some future tech.
But just to explain what this means, deterministic means when given the same input, the same output will always come out like a calculator. And that’s relatively easy for us to understand. Most AI models fundamentally are big statistical representations. They’re big calculators. Put a bunch of numbers in, the same numbers always come out.
So statistical models, prediction models and some computer vision models are all deterministic. And that means that if you put the same input in all day long, the same output will come out. Now, probabilistic is much harder to understand, but that context makes it a little bit easier: the model fundamentally has an element of randomness built in. It is a feature, it is not a bug. So sometimes a given input will have slightly different outputs. And that's because it's a prediction algorithm. And again, offline, we can get into some really good resources to explain exactly why that happens. But it's a really important thing to understand: if you put the same data into ChatGPT, for example, you won't always get the same answer, because large language models are probabilistic. The randomness can be controlled, but it can't be eliminated.
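As a toy illustration of that point, here is a sketch of how a language model picks its next word by sampling from a probability distribution. The words, probabilities and temperature values are invented; lowering the temperature reduces the randomness without ever removing it entirely.

```python
# Invented words and probabilities, purely to illustrate sampling and temperature.
import math
import random

next_word_probs = {"pneumonia": 0.55, "infection": 0.30, "effusion": 0.10, "fracture": 0.05}

def sample_next_word(probs, temperature=1.0):
    # Lower temperature sharpens the distribution (less random); higher flattens it.
    words = list(probs)
    scaled = [math.exp(math.log(p) / temperature) for p in probs.values()]
    total = sum(scaled)
    weights = [s / total for s in scaled]
    return random.choices(words, weights=weights, k=1)[0]

# The "same input" run five times does not always give the same output.
print([sample_next_word(next_word_probs, temperature=1.0) for _ in range(5)])
# Turning the temperature down makes it far more consistent, but never guaranteed.
print([sample_next_word(next_word_probs, temperature=0.2) for _ in range(5)])
```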
And that randomness is a feature of these systems. And it's a really good feature to have when you're dealing with uncertainty, when you're dealing with human complexity. So language is a humanly complex situation. Weather forecasting is another one, where some uncertainty in the system is actually quite useful. So large language models, vision language models when we come on to talk about those, and any speech-to-text transformer which is based on AI as opposed to traditional methods. And we'll come to these definitions in a second, so don't worry too much about that. But they will all have an inherent element of randomness. And that's really important, again, when looking at evaluation and safety systems for these models and how that randomness is accounted for. In a similar way, you and most of the people on this call are pretty experienced with diagnostic tests: they have an inherent error, specificity and sensitivity. How do you account for that error when you're doing tests in a high-risk environment like healthcare? Again, important to understand when evaluating systems. Okay, so all of that was very interesting, and I blew through it all really fast. We can come back to it, and I'll share these slides. Now what we'll do is look at some specific types of AI that you may or may not have already come across, trying to use healthcare examples for the most part.
So let’s start with computer vision. So computer vision is very simply, here’s an image. The AI model tends to break up that image into a bunch of numbers in some meaningful way. And then the output is usually some sort of classification task. So the classic example now, and actually I think this is technology that’s live in the NHS.
I've certainly seen it in a few different places: classification of radiology. So here's an example of an x-ray. The AI algorithm has detected that there's some abnormality here, and arguably you could say there's some consolidation and this is an early pneumonia. There are some really interesting use cases now combining AI with photoplethysmography, which I can never say. That's essentially shining light into the skin and looking at the relevant changes in the light that comes back, and you can do that with some complexity. So combine vision with PPG, which is the same technology that an Apple Watch uses to read your heartbeat, or a pulse oximeter uses to read your pulse, and it will give you the vital signs from a picture: heart rate, breathing rate and sats. And again, this is technology that is live in the NHS today. Facial recognition is much more controversial, but if you ever open your phone using your face, that is a computer vision model doing exactly that task. And skin, so dermatology: looking at lesions, taking pictures and classifying those lesions. And again, that is also technology live today in the NHS that people are already using at some scale. Then we've got predictive algorithms. So again, these are examples of deterministic algorithms, and a couple of them are pretty famous: a sepsis algorithm and a heart failure one. What these models are doing, essentially, is taking a bunch of statistics, a bunch of metrics, similar to the MEWS score for example, for sepsis: heart rate, age, white cell count, CRP, and creating statistical models that predict the presence or absence of sepsis. And this, I think, is quite a famous paper published in Nature a few years ago and deployed at a number of healthcare systems in the US. Here's a model for prediction of heart failure, looking at various predictors in various cohorts, looking at various features, for example age, previous history of heart attack, sex, ethnicity, etcetera, and predicting the outcome in heart failure.
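Under the hood, many prediction models of this kind boil down to something like the sketch below: a weighted combination of routine observations pushed through a function that outputs a risk between 0 and 1. The features, coefficients and numbers here are entirely made up for illustration and are not a clinical tool; the point is that the same input always gives the same output, which is what makes these models deterministic.

```python
# Invented features, coefficients and values; an illustration only, not a clinical tool.
import math

def sepsis_risk(heart_rate, resp_rate, white_cell_count, crp, age):
    # A logistic model: a weighted sum of observations pushed through a sigmoid,
    # giving a number between 0 (very unlikely) and 1 (very likely).
    score = (-10.0
             + 0.03 * heart_rate
             + 0.10 * resp_rate
             + 0.15 * white_cell_count
             + 0.01 * crp
             + 0.02 * age)
    return 1 / (1 + math.exp(-score))

print(f"risk score: {sepsis_risk(heart_rate=118, resp_rate=24, white_cell_count=17, crp=180, age=72):.2f}")
```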
These models are actually much harder, in my experience, to utilise (and I did a little bit of that as an academic), because as good as they are diagnostically, it's not always clear what to do with the outputs or how you actually implement them. So, funnily enough, from a product perspective these can sometimes be quite difficult, but statistically they are well recognized. And then we come on to other forms of AI that are becoming more and more ubiquitous. So when we take speech and turn it into text, typically transcripts for these use cases today, we call this speech-to-text AI. And you'll hear that term a lot when people talk about the space and talk about speech-to-text technology. Older speech-to-text systems, voice recognition systems, were what we call phoneme based. So here's a bunch of phonemes. If you've got kids, you'll recognize this from phonics: recognize small pieces of the word, single syllables, then match those phonemes to the specific word individually as closely as possible. That's how a lot of older voice recognition systems worked, and why they also took a lot of training, were quite prone to accents, and would often make a lot of mistakes.
Newer speech-to-text AI systems don't work like that at all. They take the entire context of the audio and then match each piece of speech within that context, with intelligence. So on the plus side, you get much more accurate speech-to-text. But on the downside, they're probabilistic. So there is an element of randomness even in the speech-to-text AI systems themselves, which you have to account for. So some key terms. Real time, you'll hear that: it means the transcript appears as you speak. And batch transcription: that means you take the whole audio file at the end of whatever speech needs to be recorded, send it off as a single file, and then get the entire transcript back. It's slower, although actually not that much slower these days, but it's also generally more accurate, because it has the whole context of all of the speech at the same time. And then the way that we measure accuracy with speech-to-text models is something called the word error rate. You'll see that if you look at any reports for accuracy and safety of speech-to-text AI. And that just literally means you have an existing transcript that somebody, normally a human being, has validated; you run the audio through the speech-to-text AI system, it gives another transcript, you match up where mistakes were made, and you call that the word error rate. And again, that's something that's been falling continuously over time.
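For the curious, word error rate is simple enough to compute by hand. Here is a minimal sketch using the standard edit-distance approach, with a made-up four-word example.

```python
# Compare a human-checked reference transcript with an AI transcript: count the
# substitutions, insertions and deletions needed to turn one into the other,
# then divide by the number of words in the reference.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic edit-distance table between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # a word was deleted
                          d[i][j - 1] + 1,         # a word was inserted
                          d[i - 1][j - 1] + cost)  # a word was substituted (or matched)
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four reference words gives a WER of 0.25 (25%).
print(word_error_rate("patient denies chest pain", "patient denies chess pain"))
```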
Now, large language models. If you're even adjacent to the space at the moment, you've probably all come across ChatGPT, or to mention some other vendors, Anthropic's Claude, Google's Gemini, Meta's Llama, for example. These are all large language models, and this is where we delve really into probabilistic models that have an element of randomness. Now, large language models are transformers. Transformers were designed to transform text from one form into another, and actually they were originally designed for machine translation. But the way that they work is that they sort of predict the next word. And then there was a finding that, actually, by doing this you put knowledge into the model. Because to write a sentence, you have to have some representation of knowledge: that sentence has to make sense, and that sense contains knowledge within it. So a typical use of one of these models is text, plus or minus an instruction. That's what we would call the prompt, which passes into the model, and then new text is output. The term that we use for the input is the prompt, and that's normally an instruction plus potentially some other information. And then the output is what we call the inference.
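To make the prompt and inference vocabulary concrete, this is roughly what a call to a hosted large language model looks like in code. It is a sketch that assumes the openai Python package and an API key are available; the model name and the clinical snippet are illustrative only.

```python
# Assumes the openai Python package is installed and an OPENAI_API_KEY is set in the
# environment; the model name and clinical snippet are illustrative only.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        # The prompt: an instruction plus the extra information it should work on.
        {"role": "system", "content": "You are a clinical documentation assistant."},
        {"role": "user", "content": "Summarise in one line: 54-year-old, two days of productive cough and fever."},
    ],
    temperature=0.2,  # lower temperature reduces the randomness, but never removes it entirely
)

# The inference: the new text the model has generated in response to the prompt.
print(response.choices[0].message.content)
```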
So you might hear, with lots of AI models, about inference, which means the model has inferred some sort of meaningful output, and we often talk about inference time, which is a bit different from training or data processing. And I think this is most important of all: these models were designed to output something that looks like, and this is the critical difference, looks like an accurate answer. But they aren't trained so that the answer is factually correct. And that's really, really important, because sometimes these models will show what we call hallucinations: the model will make things up, in our parlance. From the model's perspective, it's achieving what it was asked to achieve. If you asked it to give 10 references for some sort of legal case, it would give you 10 references, but the likelihood is those references do not exist. So using large language models as knowledge bases still has these inherent problems; it is not actually what they're designed for, even though it is a by-product. And even though they give a really good sense of understanding and context, it is not the fundamental purpose of these models. Now we're getting a little bit into the future state, but I think it's useful to talk about what other types of models are coming online now. So we use the term multimodal, which literally means different types, and we see that more and more. And all that means is that where a unimodal model takes, say, text and outputs text, with a multimodal model you can put in text, an image, now actually a video as well, documents, and the AI model can process that all together and then output a single output. And the output can in fact also be multimodal. You can have image to image, image to text, text to image. And this is a very interesting time to learn the nomenclature.
Here’s an example to be super concrete. So here’s a picture and you put some text in, tell me what this is. So that’s multimodal model. You’ve got an image plus some text goes into the transformer model and outputs, a text output. This is the Eiffel Tower.
And that would be a type of multimodal model. The reason it's important to understand this is that people are now thinking about using multimodal models in healthcare. My inherent distrust of that is that it's actually much harder to tease out which part of the model is doing what, and therefore to improve it and therefore to create safety cases, etcetera. And I think that's something that we'll have to figure out over time. And lastly, on this sort of run-through of AI as it is today: knowledge bases. So there's a technology called Retrieval Augmented Generation, or RAG. And what we're doing here is taking data, so actual knowledge, and turning it into numbers, which we call vectors, and then making that searchable for a large language model. So to solve one of the problems of large language models, which is that they don't necessarily have verifiable data, here's a technology where you take a bunch of structured data, or in fact unstructured text, turn it into a bunch of chunks, and turn those chunks into vectors. The maths of that escapes me, so we can talk about that and I'll send you some resources on how you do it; there's also a rough sketch of the idea below.
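This is a deliberately tiny sketch of the retrieval step only. Real systems use learned embedding models and vector databases; here a crude word-overlap score stands in so the example runs anywhere, and the guideline snippets and question are invented.

```python
# A crude word-overlap score stands in for a real embedding model so that this runs
# anywhere; the guideline snippets and question are invented.
import math
import re
from collections import Counter

documents = [
    "Sepsis guideline: give broad-spectrum antibiotics within one hour of recognition.",
    "Heart failure guideline: daily weights and fluid balance monitoring are recommended.",
    "Asthma guideline: review inhaler technique at every appointment.",
]

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: turn text into a vector of word counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between the two word-count vectors.
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

question = "How quickly should antibiotics be given in sepsis?"
query_vector = embed(question)
best_chunk = max(documents, key=lambda doc: similarity(query_vector, embed(doc)))

# The retrieved chunk is placed into the prompt so the language model can ground its
# answer in it, rather than relying on whatever it happens to "remember".
prompt = f"Using only this source: '{best_chunk}'\nAnswer the question: {question}"
print(prompt)
```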
And then the model can search that very quickly, and then utilize it in its response. In theory, this should increase the factuality of the outputs of large language models; in practice it still has some problems around hallucinations and omissions in the same way, and it's another thing that you need to learn to evaluate. Some other important terms on our trip through orientation, then. So how do these models run? They run on very high-powered computers, typically GPUs, and we need that technology to run them fast enough in real time. But obviously most of those GPUs require large data centres, they get very hot, and it's just not practical otherwise. You can't run most of the top-of-the-line AI models today on your computer at home, although increasingly that is changing.
So generally, these computers are accessed over the web, and we call that cloud compute; most of you will know all about this anyway. Or they can be installed locally, which we call on premises. So typically on-prem refers to something literally physically located on your premises, or within the data infrastructure that you already have, and the data doesn't have to leave, which is the crucial piece. But generally cloud is required, because most companies, most places, won't have access to the GPUs that they need.
The other two terms that are really important at the moment are open source and closed source. So open source, AI models literally means you can download the model. And if you have the expertise, you can run it on any system that you want and do pretty much anything that you want with it. They often come with licenses, so you can’t legally do anything that you want with it. But in practicality, you can.
Often that's cheaper, but it depends on the use case. And actually at scale you still need the hardware to run it reliably, to be enterprise grade, so it often doesn't work out cheaper at large volumes. And then closed source is where you're accessing models via companies. So OpenAI, Anthropic, Google: they're all hosted on various cloud service systems, AWS, Microsoft, Google, etcetera. And you can access them anywhere, though they're generally more expensive, and then you have to get into data agreements as well around hosting and bits and bobs. So there are pros and cons, and there's not really, in today's world, a good answer to where AI should run. It's just about managing the different systems. And again, it's really important to ask vendors where their models are sourced, how they're trained, who trained them, who looks after them, and what the enterprise agreements are if they're closed source models.
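As a concrete illustration of the open-source route, here is a rough sketch using the Hugging Face transformers library. It assumes the library is installed and the model weights can be downloaded; the small model named here is just an example, not a recommendation.

```python
# Assumes the Hugging Face transformers library is installed and the model weights can
# be downloaded; the small model named here is an example only, not a recommendation.
from transformers import pipeline

# distilgpt2 is a small, openly licensed model that will run on an ordinary laptop CPU.
generator = pipeline("text-generation", model="distilgpt2")

result = generator("The discharge summary should include", max_new_tokens=20)
print(result[0]["generated_text"])
```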
Okay, so let's have a little breather there, because that was definitely a bit of a gallop. Oh, it's called tortoiseshell. Sorry, I'm just looking at the chat now; I've been switching back and forth. That's actually very useful. Thank you. Thank you so much for that. Okay, so what we'll do now, just for a little bit of a break before the question and answer time, and just to see how many of you are actually listening to anything, is a quiz. So, not knowing quite how this will work, here we go. So, for the purposes of this webinar, what's the definition here? Yeah, so let's publish that. Publish poll. So you should see it in the polling side. Can I push this to, oh, yeah? Great.
So people are voting. So I’ll give you about 3 or 4 minutes to have a vote on that while I see actually how many people turned up to this. 41 people. Okay, great. So we’ll give it a little bit of time.
We've had a few different votes. So yes, for this webinar, what is the TORTUS AI definition of intelligence? So yes, there are many predefined definitions of intelligence, but what is the one that we're talking about today? Because that's basically how, for this session and for all the future sessions, we'll frame AI. I'll give you guys a few more minutes on that. Yeah, it looks pretty good. So I think generally people are agreeing that we're defining it as the ability to solve a problem. And again, remember we talked about the speed and the complexity of the problem, and it did seem that we had some pretty impressive mathematicians on the call with us. Okay, let's answer the next question.
So again, for the definitions that we’ve used today, what is narrow artificial intelligence? Is it narrow bandwidth cellular AI? It’s cool. Is it low latency AI running locally? Or is it solving very few or only one problem at a time?
And can everybody see that? Okay. It’s a slightly different poll. I could unpublish the other one. So that’s easier to see.
And that one's up for voting. Can you see the poll okay? The second poll, it's the same technology; it may not switch over as easily, actually. Is there anyone?
Oh, people are upping. Yes. We can see it. Okay. Great.
So let’s let people write on that. Oh, I can’t see that. Oh, I can now. Sorry. Yes.
That’s my fault for not using the system. Yeah, great. So solving very few or only one problem, not a super useful definition, especially as we’re moving to very complex models, but certainly something to understand a bit more about. Okay, so an AI model being trained means what? It practices with a human tutor.
It gets better at solving a problem using data, or attends a special AI school with other AIs, which sounds like a great TV show. We should definitely, definitely watch that one. Yeah, great. Okay. People are voting for that.
It gets better at solving a problem using data. Yeah, I mean, I'm not saying that these are difficult questions. It's just nice to consolidate knowledge at the end of a session. Okay, great. So AI models can make bad predictions when? Given too little data, given a lack of diverse data, or all of the above? Yep, most people going for all of the above there, and that's exactly right. So we talked about underfitting, we talked about overfitting. I guess one of the other answers here is given the wrong data, and you certainly see that in different models; that's one of the biggest complexities. One of the things I didn't mention, actually, is that there's a slight misconception that more is better with certain AI models, but actually training data that's poor quality often decreases the quality of the model. So quality of the data, and often having people do hand analysis of it, which is pretty grim, is often the key to producing really good AI models, as opposed to just slamming in a bunch of quantity. Obviously, ChatGPT has got the whole internet behind it, so that's a different example. Great. Yeah.
All of the above. Good. AI ingesting images and outputting a classification is known as what? Computer vision, visible working, or AI? I really enjoyed that one. Maybe too much. Yeah, people are enjoying that one. Visible working is always a bit of a joke. Yeah, people liked AI.
I thought that was quite fun. There’s a whole company name in that. I think AI people have missed a trick there. Some people have put invisible working. I like that.
I like that. I do think that in a world where AI is monitoring all situations, visible working will be one of them. And maybe there's a dystopian future there where AI is used to see if people aren't visibly working, but that is not something that we are doing today. Great. Okay. And the last couple now. So first of all, we didn't actually mention this, so this is maybe a test that's slightly off piste, to see how well people know the space. So ambient voice technology combines what types of models: image to text and computer vision, speech to text and large language models, or text to speech and multimodal models? Now when we talk about ambient voice, what we mean is listening to consultations and creating notes and letters.
So out of the model types that we've looked at today, which of those would fit that technology? Is there imagery in there? Is it multimodal models? Yeah, most people have gone for speech to text and large language models. Image to text we talked about a little bit with the picture of the Eiffel Tower; computer vision we talked about; and text to speech: you'll see models like ElevenLabs and a couple of others where you write some text and it will generate a voice for you. I think that has its place in healthcare, with agents and a number of different translation tools maybe down the line. But today, yeah, most people have got it right. It's a speech-to-text model that takes the audio, turns it ambiently into a transcript, and then that transcript is summarized to make the downstream documentation and potentially other tasks like clinical coding.
Okay, last question before we get to the Q and A, and well done for getting this far. So last but not least: knowledge bases can be added to LLMs by using A, RAG; B, OCR; or C, SVM. And I didn't explain what SVM is at all, so I won't ask anyone to explain what that is. But I think most people are getting it: yeah, it's retrieval augmented generation. OCR is optical character recognition. So that's the type of computer vision we use to read letters, so characters, and that's also obviously very useful for healthcare, where we're reading documents. And an SVM is an algorithm for training models, a support vector machine, which is a complete red herring here. Yeah, very good.
Okay, awesome. So you’ve come to the end of the first session of the AI Academy. What is AI? And I feel hopefully that maybe there’s 1% increased clarity, if not more. We’ve got some other sessions coming up beginning of January into February.
How to evaluate AI in healthcare is the next session. And we’ll go quite deep on the various model types, how to evaluate them, accuracy, what you should be asking vendors. The session after that is about understanding clinical risk. It’s a really interesting area constantly changing. Also managing things like bias, diversity of data, how you’re accounting for various use cases, thinking about product, not just about technology.
At the beginning of February, we look at driving adoption. Buying expensive technology that no one then wants to use is a problem that we all have across the NHS. What is the adoption gap? What about trust and credibility, but also the training and the requirements for implementation, which really shouldn't be underestimated? And in the last session, where I think we'll go a bit crazy in the middle of Feb, we'll talk about the future of AI and the NHS, the future of technology, and see where we get to from there.
Great, okay, well we’ve got about 10 minutes before we finish up this session and thanks so much for your engagement and your continued attendance. We’ll go for some questions. You can either ask them in the chat or there is like a little questions tab as well where you can just send questions directly. And yeah, well if there’s any questions to answer now is the time to ask them. And no one’s asked anything thus far, so let’s have a wait and see.
I can see that people are typing. Oh, someone's managed to make it into the questions tab, so first come, first served. How to arrange info or a demo of your product? Oh, that's nice. Yeah, so you can go to app.tortus.ai, or my email is on the next slide, dom@tortus.ai. And yeah, we can arrange that with the commercial team for sure. I'm just using this platform for the first time. "Dom Pimenta is live answering." Oh cool, okay, I'm done answering.
I’ll put the detail in the chat. Great, okay. Any other questions? Oh, another question. Can you discuss how data security is considered for both open and closed models and cloud based systems?
Yeah, that's actually a really good question. And it's very much dependent, first of all, on where the data is going. So I think the way to think about that question is to really look at the workflow for whatever the data system is. So if you have some data in your hospital, then whoever the vendor is, and it doesn't really matter in terms of open or closed source, they will probably have to send it somewhere, unless you're hosting the GPU locally. So it really is: where does the data physically go?
And then first of all, mapping out the data workflow, that is the most important part, because you’ll find all the different vulnerabilities. So if you have to transfer the data to a cloud, who’s the cloud provider? Is it someone that’s recognized? Are they based in the UK? Are they GDPR compliant?
Do they have the various ISO certifications, GDPR compliance, things like that? The second thing to ask is: when the data goes there, what model is it running on? Is it a model that is licensed from another company, or is it an open source model that the vendor is running on that GPU? Obviously, if it's not going out to the cloud, if it's staying inside your system, then data security is as secure as the GPU is in your data environment. And also just think about what the flow is in that cloud environment. Does it go to one vendor? Really study each part of it, and ask questions at every stage, not just about the flow as a whole: is the vendor taking really careful care over whether data with potentially patient-identifying information needs to be exposed to each of those models? Or can you actually remove that data at one point in the flow and put it back in at another point without affecting overall performance? So I think it starts with the data workflow. Open versus closed is not so important. I would say that on the open side, data security comes down a little bit to the cloud provider. On the closed side, if it's a big company with an enterprise zero data retention agreement, you can see those agreements and they're pretty straightforward, but it's going to be Microsoft, AWS, OpenAI, one of these similar ones. And then you also want some assurances from the company themselves that they're not training on your data, and that if they are, they're declaring it and you're getting patient consent.
Hope that's useful. Okay, we'll keep going through these questions. Great. So Adam Smith asks, do you have any recommended resources or courses for clinicians? Resources and courses, that's quite a nice one: to learn about AI and how to implement it locally. Yeah, I'll share one. But the one that I did actually, to start with, was the Coursera course with Andrew Ng, Ng being N-G. That's excellent, and it's really pitched as a layperson's introductory course. You have to be somewhat techie, but it will also teach you a bit of Python.
So you can go through that. There are more courses now on large language models and some other resources, and I'll have a little think about the levels of healthcare resources; there are a few academies. The RSM runs a really good podcast with Dr Annabelle Painter. That's a great place just to hear about the space in general, capabilities, the future. The health tech podcasts are also an excellent resource, because you'll often hear about how things work and get an overview from the people building it. Yeah, we'll share a bunch of resources after this one. And then, how to implement.
Yeah, I think that’s actually a different track entirely and something that we’ll probably cover in the next couple of sessions about deployment and evaluation. Great. Naomi asks, do you have any advice for medical professionals in the process of creating an AI product and looking to connect with technical partners who can help with product design? Wow, that is a really good question. Do I have any advice?
Yes, I think one thing is, don’t forget that you know what the problem is, right? I think that’s something that we forget sometimes as medics is that it’s our problem. We understand it inherently. It might be super obvious to us that like the handover sheet, for example, is inherently dangerous if this is wrong. It’s not actually that obvious to anybody outside our field.
So medic-land, as we all know (and if you're married to a lay person, for example, you'll know this), is actually quite a weird space. Some of the rules, some of the unspoken things that we do, some of the things that we do subconsciously every day, are not at all obvious to non-medics, to people outside the industry. And that specifically includes engineers. So the first thing to say is: make sure that you understand the problem. And the best way, you know, there's a phrase in one of the books that we use here, from Basecamp, which is build uphill. And what that means is that you should build your way into finding these solutions. So the one thing I'd also recommend is, if you can't code, go and learn to code; or if you can't do that, visualize it, draw it, whatever you can do to take your idea as it is today as far into the future, into realness or testing, as possible. That's really important, because it will help your product designers massively. And then the flip side of that, to play devil's advocate with myself: we all think as clinicians that we know how to do it, and actually, what I've learned specifically in my time at TORTUS is that there are so many other use cases, and people are doing things in so many different ways. So as soon as you've got an idea that you can show someone, go and show it to a colleague, because again, that will make it much richer.
And anyone you work with on the technical side, whether it's product designers or engineers, will also want to hear: well, how many people did you test this with? Is this just your idea? Because in the real world, that actually doesn't go super far. And then sometimes, also, just make sure that you have a great solution, but ask yourself: do you understand the actual problem? Is this solution suitable for the problem, to solve it fundamentally? Or is it in fact a solution that solves your specific problem and doesn't generalize? Because again, it might not be worth even building it versus using something off the shelf. And the last thing I would say, in terms of where to find these people: hackathons are really good. There are quite a few in London now where a bunch of engineers interested in the healthcare space and a bunch of AI scientists will turn up.
They're always looking for clinicians. It's a really good place to start products. And TORTUS was founded in an accelerator, Entrepreneur First; that's where I met Chris Tan, my co-founder, a machine learning engineer, and then it actually, you know, became the company as it is today. So I think that's also another really good place to look. Okay, great.
Got some more questions. How does the UK compare as a location for an AI startup? Have you considered listing in the US? That is a good question. Yes, of course, we’ve considered listing in the US.
I think I’ll tell you a story. I think so for startups specifically, forget AI for a second. I guess that’s a different question. I think the difference between here and the US is like if you go to someone in the US and you’re like, I’ve got this crazy idea, I want to build this thing. They’ll be like, Yeah, man, let’s do it.
And in the UK, you'll go, oh, I've got this crazy idea, I wanna build this thing, and they'll be like, nah, you can't do that. So it's a bit about ambition. It's a bit about optimism.
It’s a bit about our view of failure. I think it’s just a cultural thing. Like if you go to someone in San Francisco and you’re like, I’ve got this cool idea, I want to build things like, yeah, I’m going to quit my job and I’m going to build it with you. Because they’ve seen what happens when you join Facebook as one of the first employees or you join Google, you know, and that culture exists there. And that’s one of the bigger problems that we have here is like new ideas are sometimes harder to get off the ground.
Now, having said that, the UK and the NHS specifically is actually a great place for an AI startup. Loads of innovation, loads of people super keen to actually like help patients. In the US, the incentives are crazily misaligned. And for healthcare specifically, probably not actually such a hot place to build. But then again, there’s differentials on that.
Talent: once upon a time engineering talent was concentrated in the US and very expensive here. London is now becoming a great place for AI talent. We've got DeepMind, we've got Google, we've got Meta, we've got Anthropic opening soon as well. Paris is also really good for AI talent. So actually the UK is becoming a really good place for AI, specifically London.
With the European regulations, you have to keep an eye on that, because that might change things again. But at the moment, it’s just about, I guess, have a big idea, find people that really have the big idea as well. Again, accelerators are good for that culture. And then everything else here is actually a really good place to build and can talk a bit more about working with the NHS because it’s been really good for us. Great.
Can you please repeat the name of the Coursera course again? Yeah, it's called Introduction to Machine Learning; I think it's the one with Andrew Ng. Sorry, I'll put that in the chat, because I've answered that question now and I didn't mean to do that. Okay, and I've got some nice comments. Simon Brunner says it's been a great session, thanks so much. Adam Smith said, do you have any recommended courses? Yes, I will send that in the chat. Thank you very much.
Awesome. Well, we're coming to the end of our time. Any more for any more, as I used to say when I was a waiter a long time ago: any questions for anybody else? Awesome. Well, I think it's just people being very nice, and that's always appreciated. So as I said, we will be running these sessions every couple of weeks, with the exception of Christmas Day. And for the next couple of sessions we'll be joined by some of the team here at TORTUS; there are 5 clinicians here as well as some scientists, so look out for some co-hosts. That might make it a bit more exciting. We'll just be talking for a good hour, and we will come back to it again soon. So thank you all so much. And yeah, we will also be sharing the recording around.
I will leave the chats and comments open for any residual questions that people wanted to ask off the bat. Yeah, have a lovely Christmas and we’ll see you in the new year.
Meeting Title: TORTUS AI Academy – Inaugural Orientation Session
Purpose of Meeting: To provide an introductory orientation on artificial intelligence (AI) in healthcare for NHS C-suite executives and digital leaders, focusing on fundamental concepts and practical applications.
Date: 11th December 2024
Time: 1-2pm
Location: Online
Attendees:
- Dr Dom Pimenta (CEO and Co-founder of TORTUS)
- Additional participants (total attendees: 41)
Agenda:
- Welcome and Introduction
- Purpose and Overview of the Series
- Basic Terminology and Foundational Concepts in AI
- Practical Examples of AI in Healthcare
- Interactive Quiz
- Question and Answer Session
- Next Sessions Outline
Discussion Points and Decisions Made:
- Welcome and Introduction
- Dr Dom Pimenta welcomed participants to the inaugural session of the TORTUS AI Academy.
- Introduced himself as the CEO and Co-founder of TORTUS.
- Noted that additional attendees were joining as the session commenced.
- Purpose and Overview of the Series
- The session aims to orient NHS C-suite executives and digital leaders to fundamental AI concepts.
- Emphasised that the content is designed to be a basic introduction, possibly too elementary for those already familiar with AI.
- The series will address gaps in understanding and evaluating new AI technologies in healthcare.
- Sessions will occur every two weeks, except during the Christmas period.
- Basic Terminology and Foundational Concepts in AI
- Defined intelligence as the ability to solve problems, emphasising speed and complexity.
- Explained artificial intelligence as machines solving problems.
- Discussed narrow versus general AI:
- Narrow AI solves specific problems (e.g., classifying images of cats).
- General AI solves multiple, varied problems.
- Introduced key AI terms:
- Model: The system that processes input data to produce output.
- Algorithm: The method used to train AI models.
- Training and Machine Learning: Processes by which AI models improve performance using data.
- Highlighted issues of data quality, underfitting, and overfitting in AI models.
- Practical Examples of AI in Healthcare
- Computer Vision:
- Used for tasks like radiology image classification, vital sign monitoring through imaging, and dermatological assessments.
- Predictive Algorithms:
- Examples include sepsis prediction models and heart failure risk assessments.
- Discussed challenges in implementing predictive models in clinical practice.
- Speech-to-Text AI:
- Compared traditional phoneme-based systems with modern AI models using context for improved accuracy.
- Introduced terms such as ‘real-time’ and ‘batch’ transcription, and ‘word error rate’ for accuracy measurement.
- Large Language Models (LLMs):
- Described LLMs like ChatGPT, Claude, and others as probabilistic models that predict text.
- Explained the concept of ‘hallucinations’ where models generate incorrect information.
- Emphasised that these models are designed to produce coherent text, not guaranteed factual accuracy.
- Multimodal Models:
- Introduced models that process multiple data types (e.g., text and images) simultaneously.
- Provided examples relevant to healthcare and discussed complexities in their evaluation.
- Knowledge Bases and Retrieval Augmented Generation (RAG):
- Explained how RAG enhances LLMs by providing access to verified data sources.
- Discussed ongoing challenges with accuracy and reliability despite advancements.
- Interactive Quiz
- Conducted a quiz to reinforce key concepts discussed during the session.
- Participants engaged by answering questions related to AI definitions, types of AI, and practical applications.
- Question and Answer Session
- Arranging Product Demos:
- Participants expressed interest in demos of TORTUS’s products.
- Guidance provided on how to arrange demonstrations via the company’s website or contact email.
- Data Security in AI Models:
- Discussed considerations for data security in both open and closed AI models, especially in cloud-based systems.
- Emphasised the importance of understanding data workflows and compliance with regulations such as GDPR.
- Resources for Clinicians:
- Recommended resources and courses for clinicians to learn about AI, including Coursera courses and podcasts.
- Mentioned the importance of understanding AI to implement it effectively in clinical settings.
- Advice for Developing AI Products:
- Advised medical professionals to deeply understand the problems they aim to solve with AI.
- Suggested collaborating with technical partners and participating in hackathons and accelerators.
- UK as a Location for AI Startups:
- Discussed the advantages of the UK, particularly London, for AI startups.
- Highlighted the growing AI talent pool and supportive environment within the NHS.
- Next Sessions Outline
- Provided an overview of upcoming sessions in the series:
- How to Evaluate AI in Healthcare (early January)
- Understanding Clinical Risk (late January)
- Driving Adoption and Implementation (February)
- The Future of AI and the NHS (mid-February)
- Invited participants to join future sessions and encouraged ongoing engagement.
Action Points:
- Resource Sharing:
- Compile and distribute recommended resources and courses on AI for clinicians.
- Share slides and materials presented during the session.
- Session Recording:
- Provide access to the recording of the session to all participants.
- Future Sessions Preparation:
- Schedule and prepare content for upcoming sessions in the series.
- Coordinate with additional team members and co-hosts for future presentations.
- Product Demonstrations:
- Arrange demonstrations of TORTUS’s products for interested participants.
- Questions and Feedback:
- Maintain open channels (chats and comments) for any residual questions or feedback from attendees.
Next Steps:
- Participants are encouraged to:
- Review the shared resources and materials to enhance understanding.
- Attend the next session scheduled for early January on ‘How to Evaluate AI in Healthcare’.
- The organising team will:
- Prepare and distribute the recording of the session.
- Plan and develop content for future sessions, incorporating feedback from participants.
- Remain available for queries and further discussions related to AI in healthcare.
2. How to Evaluate AI in Healthcare – Dr. Dom Pimenta and Dr Sarah Gebauer MD
First Broadcast:
8 January 2025
Speaker 0 (DP): For those of you who are joining us for the very first time, the purpose of the AI Academy is really designed for NHS leaders, CXOs and digital healthcare leaders to better understand the AI landscape. It's an evolving landscape, we're making senior decisions all the time, but the technology is new. And actually, as you'll see today, a lot of how we make decisions in this space is also a new and evolving science. This is the programme. So if you didn't catch the first webinar in the middle of December, there's a wiki; I'll share the link at the end.
You can watch the video back if you'd like to. That covered the basics, so some of what we'll talk about today we have talked about previously, and I won't be revisiting those terms specifically unless asked. And here we are; I'm joined by Sarah. Hi Sarah, hi, how are you?
Speaker 1 (SG): Hi. Apologies, I was in the meeting but wasn't sure how to actually join the stage. So thanks so much for having me. I'm so happy to be here.
Speaker 0: Yeah, no problem. Well, I’ll give over to you. I did a little bit of intro. But yeah, Sarah, love to love to get your intro.
Speaker 1: I’m an anesthesiologist and palliative care physician in the United States and have an AI background and do a lot of AI model evaluation work in different fields, including healthcare.
Speaker 0: Yeah, awesome. So I was just giving an overview of where we’ve gone up to. So today we’ll be doing how to evaluate AI in healthcare. And I think maybe just for the tech purposes, I’ll run your slides for you, Sarah, and you could just talk over them unless you can see them. I actually have no idea how it works.
But I don’t mind being the mouse monkey, and then what we’ll do at the end is have a little panel discussion between the two of us on some relevant topics. It’ll be good to dig into your experience a bit in the US, which is, well, maybe about 18 months ahead of us in some capacities in deploying AI over there, so that’d be a super useful perspective, and we’ll also touch on some of your other roles at RAND. And then just looking ahead: in 2 weeks’ time, we’ll be looking at clinical risk, joined by one of our clinicians here at TORTUS, who will do a more clinical safety session on the regs here in the UK and the medical device regulations. Beginning of Feb, we’ll look at adoption with my colleague, Dr. Dave Triska, and how we get stuff into the hands of people, the trust gap, the credibility gap, the implementation gaps, and what training really is.
And it’s actually a bit more complicated than we think. And then a bit of a fun session planned for February with our clinical lead here, Dr. Josh Au Yeung, on what the future can hold. And for that, we might actually try to get a bit of a bigger panel, time permitting. So without further ado then, let’s see.
Yes, and just to reiterate, so the point of this course is to learn how to buy AI for healthcare wisely. It’s not a computer science class. So although we will touch on some basic principles, we won’t be diving into them. We won’t necessarily be diving deep into how the models fundamentally work, the mathematics, the latest advancements. It’s very much about what is the market today?
How do we utilise it? How do we get to a level of understanding that’s useful, and make wise decisions, fundamentally in the best interests of our patients? So ideally from concept to clinic. Yeah, and at that point I’ll hand over to Sarah. Sarah, take it away.
Speaker 1: Great. Thank you. For those in the audience, I would really welcome questions and I’m really excited to hear about your experiences in this area and how you think things should be different or could be better for you as clinicians. In terms of what we want to talk about today, we’ll talk about why AI models need different kinds of metrics. We’re all used to metrics in medicine for better or worse.
That’s been a focus of the last 20 years or so, especially since the adoption of EHRs. Why are AI models different? Why do we even need a specific session to cover this? The meaning behind the metrics. There’s a lot of uncertainty right now about what it means to have a good AI platform.
What does that actually mean clinically? What does that actually mean in comparison to the current standard, and what should we be looking for? Then really focusing on key questions that you can ask people who are either trying to sell you something or that might be discussed during adoption of a clinical AI application. Because I think there’s a lot of noise out there in terms of what people might try to tell you that may or may not actually be applicable to what you need to know. As you guys talked about last time, does anyone want to put in the chat: can anyone remember the difference between the probabilistic and deterministic approaches to models?
Speaker 0: Let’s have a second so you can get it in there. Anyone brave enough to answer that? You’ll find the chat on the bottom right hand corner of the platform. So yeah, anyone want to take that question?
Speaker 1: Or just, and this can be kind of a general question of why, how is AI in terms of how it makes its decisions or how it creates its outputs, how it’s different than other platforms that you might have used in the EHR and other places.
Speaker 0: Oh, yeah, very good. So Parag says same input equals same output versus an element of randomness. I think that’s pretty good.
Speaker 1: Yeah, that’s a nice summary. Some people will say the word stochastic, meaning that it will produce one output some percentage of the time and a different output some other percentage of the time. So you might hear that word as well. And obviously that’s different than a lot of the tools we have, which are very rule based. So what we use right now, a lot of the time, is: if X, then Y.
That is not how AI systems work. When you’re evaluating them, there is that same randomness, which can make it great in terms of creating different kinds of outputs and being able to be more creative. It also brings up some questions: if you imagine a factory, you’re putting in the same thing and then possibly getting out different outputs, or different probabilities, every time. It makes it a little more challenging in terms of determining how many tests you have to do, for example, to show how often the right answer might occur.
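To make the distinction concrete, here is a minimal Python sketch; the triage rule, probabilities and function names are purely illustrative, not taken from any real system. The deterministic rule always maps the same input to the same output, while the stochastic version samples its output, so repeated calls with the same input can disagree.

```python
import random

def deterministic_triage(temp_c: float) -> str:
    """Rule-based 'if X then Y': the same input always gives the same output."""
    return "flag_fever" if temp_c >= 38.0 else "no_action"

def stochastic_triage(temp_c: float) -> str:
    """Probabilistic: the output is sampled, so the same input can give different outputs."""
    p_flag = 0.9 if temp_c >= 38.0 else 0.1  # illustrative probabilities only
    return random.choices(["flag_fever", "no_action"], weights=[p_flag, 1 - p_flag])[0]

print({deterministic_triage(38.5) for _ in range(10)})  # always {'flag_fever'}
print({stochastic_triage(38.5) for _ in range(10)})     # usually both outputs appear
```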
Speaker 0: We have a few more answers in the chat here. So Vitor says a deterministic system has predictable outcomes while a probabilistic system involves randomness and uncertainty. Also pretty good. And Claire said trained or learning; not quite, we’ll come back to that. And then OMA says basis for answer guided by rules or policies, or just reasoned.
That’s interesting, isn’t it? I guess maybe that’s similar to what you were saying about workflows and deterministic versus probabilistic in that sort of sense.
Speaker 1: Yeah. I think it is interesting because a lot of the AI systems do use the rule based systems in terms of policies and guidelines that we have in order to inform themselves about what should happen next. But they still do have that element of stochasticity or probabilistic thinking, if we want to use that word, as they work. Okay, so we got some great answers in here. Wonderful.
But we can see how that would cause a problem for model evaluation. It can vary with the context. That context could mean the kind of patients that you have. It can mean the setting: are you in the ER, are you in a clinic setting? There are a lot of aspects, things like bias, and then the training set and whether that reflects your patient population.
Just because it’s been evaluated in one context with one patient population or in one setting, that doesn’t necessarily translate into another setting. We know that just intuitively as clinicians a lot of things we do would be basically useless as an anesthesiologist if I were to be in a clinic without the tools that I usually have, for example. Especially when we’re talking about things like bias that we’re worried about not having the kind of data to accurately represent people and their conditions, then we get more and more concerned about knowing how often we need to run the tests to be sure that they have a kind of output that we can at least trust in a basic sense. Yes, things like psychiatry. You are used to some randomness probably in that field too.
Then the other issue is model drift. This is a well known phenomenon within AI in which the performance of the AI models changes over time. The point of them often is to learn from the data that they’re presented with and the decisions that people make based on that; not so much question and answer pairs, but the input and then the next steps from that. That can be great because it can inform the model and give it even better output, but then it can also stray away from its original setting. There was this recent paper people might have seen looking at chest x-rays, and the authors asked the model which of these patients eats refried beans most often and which drinks beer most often.
The models provided answers because they had learned about things like demographic patterns in other parts of their training, in terms of what areas of the country were most likely to eat refried beans and drink beer, and what kinds of people were most likely to do that. While that’s a slightly different concept, you can see how, as you put more information into a system that may or may not be relevant, it may then infer different answers to the same questions over time. Dom, feel free to pop in if you have any comments.
Speaker 0: Yeah, I was gonna say, who eats refried beans in the UK? No one. Maybe a burrito here and there, but nobody. So if you trained it here, it’d be like zero, no one eats that.
Speaker 1: Probably a wise decision. Right. Exactly. So this means that you can’t just do one evaluation for an AI model in most cases. You really need to be continuously monitoring these systems.
And really this is something that we should probably always have been doing with any kind of IT product that we have, not because of this model drift situation, but more because practice patterns change, workflows change, and we really should be checking, making sure that the technology we’re using is still relevant and helpful at the different periods that we use it. But in AI, there is this additional layer of making sure that the model hasn’t drifted away from its primary purpose and from its primary outputs, in terms of the metrics for accuracy and the metrics for bias and the things like that that we’re concerned about.
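As a rough sketch of what that continuous monitoring might look like in practice, the idea is to re-run the same fixed evaluation set periodically and flag any run whose score falls more than a chosen tolerance below the score measured at deployment. The metric, baseline and tolerance below are assumptions for illustration, not a recommended standard.

```python
from datetime import date

BASELINE_SCORE = 0.92   # evaluation score measured at deployment (illustrative)
TOLERANCE = 0.03        # how far below baseline we accept before flagging (assumption)

def flag_drift(periodic_scores: dict) -> list:
    """Return the dates of evaluation runs that have drifted below tolerance."""
    return [run_date for run_date, score in periodic_scores.items()
            if score < BASELINE_SCORE - TOLERANCE]

quarterly_runs = {
    date(2025, 1, 1): 0.91,
    date(2025, 4, 1): 0.90,
    date(2025, 7, 1): 0.86,  # this run would be flagged
}
print(flag_drift(quarterly_runs))  # [datetime.date(2025, 7, 1)]
```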
Speaker 0: Yeah. I would just double down on that. Actually, with some of our partners, we have an evolving technology test. So every 3 months we’ll look at the LLM quality and the outputs, and we make a lot of changes much more rapidly than that as a startup, so we kind of have to do that. But it’s interesting what you said about the workflow changes, because if you’re going to have a probabilistic model where you have a fixed input and the output changes over time, and then the input starts to change over time as well because the workflows change (maybe the letter type is different, or the summary ingestion is different), then actually the output will also change.
So it’s like constantly looking back and seeing what we’re doing, and maybe building that into the system is also a requirement for AI specifically. But I love your point that we don’t do this for any other tech. I could think of so many things that are broken because they don’t work like they were supposed to, because we put them in 10 years ago (lots of EHR functions being one of them), and we never look back and ask, you know, is that safe? Has that been tested?
So yeah, I agree 100%, there’s a wider principle here that probably goes beyond AI.
Speaker 1: Right. And this might be a way for AI, for places like TORTUS and others who do this kind of more frequent testing, to be able to provide some leadership and say, no, this is really what should be happening for everyone. And then one other point that’s related: a lot of times you’ll get results from testing. For example, the Hippocratic AI group has done a really great job doing a lot of safety testing on their product. But they’ve been changing the model as they’ve been testing.
Therefore, the results that they provide are not necessarily the results of the current model, which I think is something, it’s a subtle point, but it is important, especially when we talk about this model drift and how AI models change over time. There’s this natural drift that happens as the inputs change or as the models change themselves, but then there’s also this artificial drift in which the developers change the model as it’s going on too. The evaluation of models that they did 9,000 patients ago might not be relevant for the 1,000 that they just did and just reported as a 10,000 person set, for example.
Speaker 0: Yeah, I mean, we see the same problem with 3rd party vendors. So if you’re using and evaluating a company that’s hitting a 3rd party cloud service, even if they’re very clear about it, that 3rd party cloud system is also updating its model pretty regularly. And if the vendor, when you’re talking to them, doesn’t know exactly which instance of that model they’re hitting, and it’s not locked to that dated instance as opposed to the evolving model, then actually the performance will change over time. So it’s almost like: what pre-evaluation metrics are we actually testing, but then afterwards, is that still valid? And actually, is that even something the vendor’s monitoring?
And I think that’s another, you know, who’s watching the watchmen situation. So what is the evaluation of the evaluators, and how reliable is it? And that’s again about instilling knowledge across the whole procurement spectrum, right from the patient all the way back, maybe, to the cloud provider, because actually we need to figure out what is actually happening in a very rigorous way, which we probably haven’t had to do before.
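A small sketch of the "locked to a dated instance" point: record exactly which model snapshot every evaluation was run against, rather than a floating alias that the cloud provider can silently update. The identifiers and numbers below are hypothetical, not any real vendor’s model names.

```python
# Floating alias: whatever the provider currently serves; results can change without notice.
FLOATING_MODEL = "vendor-llm-latest"       # hypothetical identifier

# Dated snapshot: the specific instance the evaluation actually measured.
PINNED_MODEL = "vendor-llm-2025-01-08"     # hypothetical identifier

evaluation_record = {
    "model_id": PINNED_MODEL,       # stored alongside every reported metric
    "eval_date": "2025-01-08",
    "hallucination_rate": 0.01,     # illustrative numbers only
    "omission_rate": 0.03,
}
print(evaluation_record)
```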
Speaker 1: Yeah. I totally agree. This is a graphic of some US quality metrics that we have to report. Some are for cardiology (Dom’s a cardiologist) and some are for anesthesiologists. You can see that if I tried to report these cardiology metrics, that would be completely useless, and vice versa for Dom.
In the same way, we need different metrics for different kinds of AI models. This is intuitive for most physicians; most physicians really understand this because we already do this very regularly for our own specialties.
Speaker 0: Yeah, it’s interesting because I’ve added some sort of flavor there. So we have different metrics for primary care. We have different metrics for surgery. We have national audits, national registers, where you have the same thing. And then we have tariffs and individual audits for what people care about.
And it’s interesting because even when we’re talking about the same AI system, the value of what’s actually being measured is really, really different. And the quality and the time spent in psychiatry may be very different to, you know, the ambulance service, for example. So that’s another really important thing: the evaluation needs to be dependent on the use case. And I think, yeah, super good. I didn’t clock actually that these are so different, but it’s interesting that you have to report prevention of post-op nausea and vomiting.
That’s an interesting metric, a granular detail of the US system, and maybe we can get into that in another session.
Speaker 1: Yes, they are very big on granular detail. Metrics: for ambient scribes, we talk about metrics related to accuracy of the transcription itself. You know, are the words being captured accurately? Then note organization.
Once the transcription is made, is the note organized in a way that people can read it and that’s meaningful? Then is it fluent? If the sentence is choppy, is it hard to read, or is it something that flows really nicely and that people actually want to read? Whereas things like diagnostic prediction, we want to know about things like, can this identify patients who are at high risk for this condition? Are the clinicians adhering to appropriate guidelines or can the diagnostic prediction algorithm help them do that for certain patient populations?
What’s the sensitivity and specificity of the prediction model? These are really different kinds of metrics that are measured in different ways and reported in different ways, and that mean different things. They really are captured at different points in the process. You can see how quickly this gets complicated, and these are just 2 very basic examples of technology that we’re getting to know better and better, and that already have good baselines in medicine itself. We already know a lot about how accurate physician notes and general clinician notes are, which is: really terrible.
About half of notes in the United States, up to 80%, have information that can’t be verified, and about half of it is inaccurate. Between 50 and 70% is copy and pasted. Then in terms of note organization, we have data about how well residents do that versus medical students versus attendings and that kind of thing. Interestingly, in the United States we have mostly switched to putting the assessment and plan up top, acknowledging that basically no one reads the subjective and objective portions; those have been flipped relatively recently.
Then fluency obviously is something that can be measured pretty well in an AI context, just because that’s such a strength of generative AI in general: creating words and text. Then for the diagnostic prediction, that’s something we’ve been trying to do for a long time in medicine also. I just mention those because it’s nice when we have some baselines for knowing how well we already do things in medicine. There are starting to be, and there will be more, cases in AI in which we don’t have that clear baseline. We don’t know how well we do at things already, and those are even harder to compare.
We know with ambient scribes that almost anything is better than a human. Even dictation: 24% of dictations have some kind of inaccuracy, and that includes the clinical part. This is only going to get more complicated as the technology improves and the use cases proliferate.
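For the diagnostic-prediction side, here is a quick worked example of the sensitivity and specificity mentioned above, computed from an ordinary confusion matrix; the counts are made up for illustration.

```python
def sensitivity_specificity(tp: int, fp: int, fn: int, tn: int):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative counts for a risk-prediction model screening 1,000 patients.
sens, spec = sensitivity_specificity(tp=80, fp=45, fn=20, tn=855)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.80, 0.95
```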
Speaker 0: Yeah, and I think what was really interesting about putting these up side by side is that I was thinking about bias. So over here on the diagnostic, deterministic side, bias has a really clear statistical meaning, a mathematical meaning. And then we also have the wider use of the term bias in the healthcare sector, where we really care about, you know, how are you treating your diverse populations? How are you systematically treating them, from a human perspective?
And then we introduce large language model bias, and we get asked a question that actually could mean both. Like, how does it deal with accents? That’s one version of bias. But there’s also a statistical element, which doesn’t make a lot of sense with probabilistic models. So we are often asked questions like, how do you deal with bias on the large language model side? And what do you actually mean?
And then the answer we often get is, oh, we’re not actually sure. And this is the problem we have back and forth. And again, it’s about definition. So maybe there is an argument to say that, as the metrics get better, we develop at least more defined or clearer frameworks, and maybe redefine some of these terms, which are now becoming overloaded: one word means 3 things in the same setting, with patient impact. That’s not a great situation to be in.
And I’m sure there’ll be loads more as AI becomes more and more part of our day to day and takes over more of the human element.
Speaker 1: Those are really interesting points. The human bias piece is actually really interesting too, because we also have baseline data for that, which shows that clinicians tend to include quite a bit of bias in their notes at baseline as well. There are a lot of parts of this that people think humans do a good job at, which has been shown to really not be the case.
Speaker 0: Yeah, and I think that’s interesting because now we have to study this stuff. And I think we’ll come onto that, but studying what humans are doing is probably something we should have been doing, but weren’t doing because there was very little point. But suddenly we’re like, oh, maybe the AI could do that. Or maybe the AI shouldn’t do that, but what is the performance metric that we’re trying to beat? I think we don’t understand that at all for large parts of the system we have today.
Speaker 1: Exactly. We talked about this some in terms of the bias and what bias means in different kinds of settings and then when it’s more important versus less important and it can have a really a range of meanings. This is going to be an answers in the chat question. Say you’re listening to a presentation and someone says, Our ambient scribe is 90% accurate. What do they mean by that?
What is your first thought? What have people told you when they’ve talked to you about their products?
Speaker 0: Ooma’s come straight in with BS. Obviously this is a 1 o’clock showing, so we can all intuit what he means by that. But that is also my first thought. And mostly, you know, I was an academic before I got into this, and I would sit in a lot of meetings and hear that word accurate a lot. And my first question would always be, what do you mean?
How have you measured that? What metrics have you used to do that? Perrig says 10% of information is wrong. That actually sounds quite worrying when you put it like that. Accurate so I missed the point.
Steve says, my first thought is they don’t have a beautifully attractive product. That’s very silly; I like this British cynicism here. I think we’re really showing up for the country right now. Correct. 9 out of 10, 90% is good.
10% is a lot of work; arbitrary number; 90% accurate would not be good enough, as Parekh says. I mean, it’s actually really interesting, isn’t it? A, the breadth and diversity of how we’re interpreting that; B, the cynicism. But C, I think the question no one’s asking, which is the only question, is: well, how accurate is the human? And I think that’s what we’ve just been talking about, because if it’s at 90% accurate by the same metric, but humans are 60%, then it’s a done deal.
Right. And we should be iterating from there. What else? 90% of what, of whom? Yeah, that’s very valid, definitely looking at that.
Yeah, I think Stuart hits it nicely: it’s pretty meaningless without the hard data behind it, or what the situation is and what it means. Accents, clinic specifically. James is asking, how do you get to 100? Maybe that is the optimal question actually.
It doesn’t matter where you start from an AI perspective, but what are you going to improve about the system that gets you there? And obviously that tells you how they’ve measured it. Yeah, really good answers here.
Speaker 1: So if we click through the slides, we’ll have some possibilities of things that I have seen people mean when they say this kind of thing. Do we get to 100% even in conversations? I’ve been married for 16 years and I can tell you, a lot of times, no, despite everyone trying their best. That’s a great point. I’ve seen this mean how many words are captured from the conversation in the transcription itself.
More like an ASR, speech recognition, metric. How much of the important transcription is included in the note? How often hallucinations occurred or didn’t occur? The percentage of clinicians who would write the same note given that transcript or given that information; the percentage of medical terminology that was used correctly; the percent of clinicians who used the technology and thought the note they were provided was accurate. Then how much even the patient agrees with the note is something that I’ve seen. I’m sure there are probably another 20 possible meanings for this one relatively simple-seeming question.
I think all of you had a great first instinct, which was to mostly disregard this as being meaningful, but I think there is an important point: you do want to be talking to people who are measuring something, and knowing what to measure is hard. They’re not going to have every single answer to every single metric you probably want at their fingertips. But I think what they’re measuring, and how they’re measuring it, will probably tell you a lot about them. There’s this dichotomy within the AI metric world in healthcare in which any study by clinicians tends to use the PDQI-9 scale, which is a subjective review by physicians reading the note, rating it on a 1-to-5 Likert scale, basically. Whereas if you look at all the AI literature, if an AI person does basically the same study, they run metrics like BLEU and BLEURT, these more quantitative accuracy metrics, on the same kind of data.
You’ll see both of those reported and they mean really different things in terms of your clinical needs also.
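A sketch of the two families of measures being contrasted here: averaging clinician Likert ratings (a PDQI-9-style subjective score) versus a crude token-overlap F1 between the AI note and a reference note, standing in for quantitative metrics such as BLEU or BLEURT. The ratings and note snippets are invented, and the overlap function is a simplification, not the actual BLEU algorithm.

```python
from collections import Counter

def mean_likert(ratings):
    """Subjective, PDQI-9-style: average of clinician ratings on a 1-to-5 scale."""
    return sum(ratings) / len(ratings)

def token_overlap_f1(candidate: str, reference: str) -> float:
    """Crude quantitative proxy: unigram-overlap F1 between AI note and reference note."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(mean_likert([4, 5, 3, 4, 4, 5, 4, 3, 4]))                  # clinician-judged quality
print(token_overlap_f1("chest pain for two days no fever",
                       "two days of chest pain without fever"))  # text-overlap quality
```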
Speaker 0: Yeah, I mean, this is a science that we’ve been working on at TORTUS for about 2 years. And one of the things that we had to do very early on was figure out something, because we didn’t know how to make any decisions about which models to choose, or whether we were getting better or worse at prompting, with any of these metrics. So we ended up actually somewhere between those 2, saying, well, what is the task? Delineate the clinical entities in the transcript; delineate the clinical entities in the note. If they’re present and they should be present, that’s a true positive.
If they’re not present and they shouldn’t be present, that’s an infinite space, but that is your true negative. Then your precision score captures your false positives: of the clinical entities in the note, what percentage are made up? That’s your hallucination rate. And then your omission score, your recall, is going to be how many clinical entities from the transcript are missing from the note.
It was a great idea, and as soon as we landed on it, hallucinations pretty much made sense, right? But there’s some subtlety with inference, and external versus internal hallucinations, and there are whole papers written about that. Most of the nomenclature in the AI world is a bit all over the place around that. But omissions are really hard, and you’ve nailed it here as well.
Like what is the important transcription? But what’s important to me as a cardiologist? Is it important to the dietitian and vice versa? And in fact, we’ve had some feedback that dietitians are using the system and saying, well, it never writes down the flavor of crisps the patient likes. And I can laugh about that as a clinician.
I just care about the salt content, right? As the cardiologist, but the flavor is so important because that’s something that they can eat. They can eat meaningfully and they have to buy that specific thing to get their weight or their nutritional status up or whatever. So like that’s extremely subjective. And I still don’t really know how we score that other than it’s relative and it’s getting better.
And then there’s this medical legal question of, okay, the recall isn’t great for this situation, but it’s subjective. And you’ve made a choice that in your AI system, you don’t want that data in because it’s not relevant to your specialty, but can any vendor actually make that decision? I don’t have any answers to these questions, by the way. I’m just like going through some of the issues that we’ve had, but yeah, it’s actually quite a fascinating problem in that sense.
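A minimal sketch of the entity-level scoring described above: treat the clinical entities found in the transcript and in the note as sets, then the hallucination rate is the share of note entities not supported by the transcript (one minus precision) and the omission rate is the share of transcript entities missing from the note (one minus recall). The entity extraction step is assumed to have happened upstream, and the example entities are invented.

```python
def hallucination_and_omission(transcript_entities: set, note_entities: set):
    """Return (hallucination_rate, omission_rate) for one transcript/note pair."""
    true_positives = note_entities & transcript_entities
    precision = len(true_positives) / len(note_entities) if note_entities else 1.0
    recall = len(true_positives) / len(transcript_entities) if transcript_entities else 1.0
    return 1 - precision, 1 - recall

transcript = {"chest pain", "shortness of breath", "aspirin 75mg", "ex-smoker"}
note = {"chest pain", "shortness of breath", "aspirin 75mg", "diabetes"}  # "diabetes" was never said
halluc, omit = hallucination_and_omission(transcript, note)
print(f"hallucination rate={halluc:.2f}, omission rate={omit:.2f}")  # 0.25, 0.25
```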
Speaker 1: I agree. There are some really great comments in the chats. Yes, these AI metrics do focus on semantic similarity rather than meaning. There are advantages and disadvantages to both ways to look at these metrics. Probably some combination is likely most meaningful for clinicians.
And then yes, contextualization is just so important for clinicians, especially to be able to make decisions. So gold standards. So what is the gold standard for a good AI product? How do you know if you’re getting a good one? Does anyone have a way that they decide this?
I’m curious. I mean, I think it’s a really hard question, and one that matters because, especially as clinicians, we often have to advocate for a specific technology or for a specific way of doing things, especially if it’s expensive and changes the workflow. And so we need to be able to say: this is best in task, this will do something that I really want it to do for some reason, and you’ll probably end up saying something about saving money in there, because you’ll be talking to someone who holds a budget. Here are some things that I have heard people say the gold standard should be. You can see this is great in the chat.
There’s a huge range. It’s everything from no hallucinations, to allowing me to focus on tasks that I prefer, to reliably performing its stated function. No critical errors is one that I hear a lot. Not likely to kill someone is the bar, which is probably a higher bar than we have for interns (first-year doctors) a lot of times, frankly, but we do do our best. Or is it that you as a physician, together with the AI system, do better than you would have done otherwise?
Is it purely a machine issue, or is it a clinician-plus-AI improvement that you’re really looking for?
Speaker 0: It’s really interesting. I think actually Lorne has hit the nail on the head for me: it’s reliability. When we first started out, we had this whole list of accuracy measures, and I mean, we certainly do aim for all of this as a quality product, but I think the fundamental is, like any technology, the closer you can get to something repeatable, the better you can iterate upon it and the more you can mitigate for risk, if you understand that it’s going to perform day in and day out at the same capability.
Because even if it’s worse than the human, you can make it better as soon as you get to that iterable state. And actually, like most things in medicine, it’s the boring stuff done well, every day, that is so key to a quality product that’s sustainable. Whereas, you know, a product that saves you a bunch of time but only on one task in 20, we all know we’re going to get rid of after the second task. Right. So it’s interesting how, on reflection, something that works pretty much the same way every single time, in a very simple but very reliable way, would actually be in my top three, I would say, in terms of quality metrics.
Speaker 1: I agree. And just checking the time, I think we only have about 9 minutes left. Is that right?
Speaker 0: Yeah, we’re alright. We’ve been doing some questions and answers as we go, right? So yeah, we can carry on; I think we’ve only got a few more slides left?
Speaker 1: We call this thing where we compare the computer to the human, human baselining, but that also is an evolving science. There are all these kind of science of evaluations questions that have yet to be answered in any field in terms of how many humans is enough. Do you want an expert? Do you want a non expert? Do you want a mix of those 2?
What if the experts disagree or have really varied performance on tasks? One thing we see a lot is that AI systems have a wide breadth of domain expertise, whereas humans tend to have one very deep area of expertise. When you say we’re going to compare this product to human performance, if you’re going to do it well, you often have to find maybe 5 or 6 experts who evaluate different sections of the AI, because there’s no one person who has that entire breadth of knowledge. Then there are things like how the human experts were incentivised: if you’re having them take a multiple choice test, for example, to compare knowledge between humans and the AI systems, were the humans paid by the number of questions they completed, or by the number of questions they got right?
Even things like that can make a difference in terms of what you expect from humans, and therefore what the context would be for what you would expect from the AI systems. I took this from one of the ambient scribe systems, which I mostly thought was funny because it said unmatched outcomes, and I was kind of like, well, that’s because no one really wants to match these, because these are silly outcomes to have. The time saved per clinician, that’s important, but a lot of these, as you can see, are very subjective metrics based on how the clinicians or the patients felt about the technology, and there really are not any objective ones: it says that the clinicians believe it improves documentation quality, but not that it actually improves documentation quality. You’ll see a lot of these metrics out there. I know this is an especially savvy group and I feel confident that none of you would be unduly swayed by this particular set of numbers, but I think a lot of people see numbers above 70% and are like, well, it must be pretty good, and move on. Or their administrators may be more impressed by this than some of the clinicians would be.
I think, as we’re all learning together how to do these evaluations and what’s important and what’s meaningful, the questions that you guys identified in the chat matter: where is this number coming from? How many tests have you run to show that this number is accurate? What are you going to do on an ongoing basis, as Dom mentioned, not just for monitoring your product, but for monitoring how your product interacts with other systems? I would also note that that’s not just confined to a system that is related specifically to your product.
There have been some studies showing that when more than one AI system interacts, the performance of both of them decreases. That, I expect, will be a growing area of evaluation as well; it is so nascent that it hasn’t even really started yet. For that company, I think it’s really important to understand, in terms of their values and how rigorously they think about evaluating their products, why they thought these were the most important things to measure and why they felt that’s what they wanted to put resources towards. I think that’ll give you some information about what you can expect from them going forward as well, in terms of where they’re going to focus and how well they plan to continue to look at metrics even after the product is sold, because that is going to be a big part of you relying on them to provide ongoing support. I agree.
I shouldn’t have said meaningless. I should say that in the context in which they were studied, I don’t think it’s super meaningful, but in a real life setting, those could definitely be more meaningful. I agree. You’re not going to get any technology anywhere unless people actually like it and are using it. You are absolutely right.
Yes, inter- and intra-observer calculations are really important. I’m not sure those are being done anywhere that I have seen so far.
Speaker 0: Yeah, especially for subjective measures. We’re quite bad at that, I think. A few more key questions here on the next slide; it’s a different slide, it just looks very similar.
Speaker 1: Yeah. So then, similarly: how did you determine what was a reasonable performance standard? Did you compare it to humans?
Did you compare it to perfection? Did you compare it to the current standard? Then what’s the human-computer interaction, which I also expect to be a growing area of research in the coming years. This has been shown multiple times with AI systems: it works great by itself, and then once you put a human with it, it just doesn’t have any impact.
Speaker 0: I think we’ve seen that as well, with recommendation systems: the human doctor is wrong, but also ignores the AI. And it was like, well, what was the output? Right. Great, that’s the end of the slides. So we’ll come on to the quiz section.
So we’ve got a little bit of a quiz that we have prepared. You should be able to see it in the poll section; I’ll just push them live as we go. So the first question (let’s see how well the audience was listening): model drift describes A, how AI model performance changes over time; B, the use of inappropriate data in training; or C, Kim Kardashian on a raft.
Let’s see how people are voting on that. Can I see the poll? Okay, 100% of people are going for AI model performance changing over time. Oh, one person is in a good mood today and thinking it’s Kim Kardashian. I think most people have got that.
And I think it’s not just the model, but increasingly the system. So one thing that we also haven’t talked about is how you evaluate increasingly multi-model systems: multiple models doing multiple things, multiple different workloads. And I think increasingly that might actually become an outcomes-based assessment, as opposed to assessing each individual model in the system, because that will probably be very difficult, potentially impossible. Okay, next question then. Very good.
Describing accuracy. Hold on, let me publish it first. Describing accuracy for AI systems is: A, an evolving science; B, variable depending on the vendor; or C, all of the above. I’ll let you guys have a little look at that. Yeah, it’s a bit of a tricky one, but as you’ve seen, it’s certainly all of the above.
And I think it could potentially also be variable depending on the buyer, right? Because how much you dig in, what your frameworks are, and what your expertise is to ask the right questions and then to understand the answers, that’s kind of the gap that this course is trying to address. But that will also change over time, because the speed of technology is exponential, and how we even start to understand these things is, I think, becoming increasingly difficult. So very good.
So the answer to that one was indeed all of the above. And last question then: human baselining describes what? A, human and AI systems working together; B, determining current human performance at a task; or C, BASE jumping straight off a cliff instead of down it? Yeah, great: current human performance.
I think this is really interesting. I mean, we did some human baselining of summarisation of notes a long, long time ago. It was a super crude study; we did it basically over a weekend. But the idea was: if you look at a bunch of letters and you write a summary of the patient based on them, which we all do every single day for every single patient we see, how good are we actually at that task?
And the answer was: awful. About 6 errors for every two letters was what we saw, whereas the AI model made none. And it’s only 2 letters, so even with a small context window you wouldn’t expect it to make many errors at all, but that is a task we’re repeating and compounding every single day. So you can increasingly see that this is going to be a bigger and bigger issue as we get into more AI systems.
Well, fantastic. Thank you so much, Sarah, for that. Just a brief reminder: we’ll be uploading today’s session, and we’ll send out a bunch of resources alongside it as well, some extra material to have a read of and have a think about. Understanding clinical risk is in 2 weeks’ time, and then driving adoption in clinical AI at the beginning of Feb. So we’ll ask if there are any questions; we have a little bit of time left and we can run slightly late.
The floor’s open to the audience; we’ve got a nicely engaged audience today, 70 people. Do ask us any questions and we can see what comes out of it. There were some really interesting comments that maybe we didn’t get to; I wanted to go back to the beginning, actually.
Speaker 1: I wanted to say that if anyone has questions about evaluating a system or something, I’m happy to answer them, and happy for people to get in touch and email me at any time. I know this can be a really overwhelming thing to try to do, especially if you’re trying to integrate it into a hospital and talk to other people about it. People get very confused very quickly, so I’m happy to talk through that with anyone, if it’s helpful.
Speaker 0: Yeah, no, for sure. I’ve just added the Wiki there as well for the previous session. So it’s an open site. Anyone should be able to access it and have a look. And it’s got the first session and we’ll upload this video hopefully a bit later today.
Any questions from the audience? We went through quite a lot as we went, and there was some really interesting stuff I actually wanted to go back through. So there was a comment about deterministic always being accurate. And I think it’s interesting, because the best way of thinking about it is: it’s going to be wrong, but predictably wrong, in a way that you can measure and rely on, whereas a probabilistic system is a bit more variable.
Is that a fair characterization, do you think?
Speaker 1: I think so. I’d be interested in hearing others’ thoughts. It puts me in mind of the classic picture of a target, accuracy versus precision: it can be off the mark, but it reliably hits the same place off the mark over and over. I think that’s probably a good visual.
Speaker 0: I did have a question for you, Sarah. So when you see a new AI vendor who comes to you in your clinical practice, do you have like one killer question that’s like really sorts the, you know, the wheat from the chaff in that sense?
Speaker 1: You know, I think the main thing I look at is the metrics that they present and if they can explain them well, I am very impressed. That happens honestly pretty rarely. Part of that is because often we’re dealing with salespeople and not the data people. There is usually some disconnect there that’s a little harder for some of these companies to bridge on a regular basis. I’m hoping that as this matures, that there’ll be a better understanding.
I did want to also piggyback off your comment about the multi-system interactions. I think there’s that, and then soon a lot of this will be agentic. There’ll be models that are not just interacting with other models but actually using them, and there’ll be multi-agent models. Those are even more complicated in terms of measuring them, but it’s just something for the audience to think about as they move forward: these models are going to start to be able to use tools, to use the internet, to use the EHRs more reliably, and things like that. In understanding these questions, you guys are so far ahead of your peers already; I can tell by the questions you’re asking that you’re really well positioned to educate your peers, and you’re probably going to need to be in that role.
Speaker 0: Yeah, the chief AI officer: we’re seeing the rise of that in the US certainly. And I see Omar asks, what is AI anyway? That was the previous session, Omar; you can check that out. It’s a good one.
I had another question for you, but that threw me off there. I think what’s really interesting with the tools is, we often see, and you’ve seen this before, the lawyer that was struck off because they were using GPT-4 to write their cases for them without realising that it’s not a knowledge summary, it doesn’t have knowledge embedded in it. So when we start bringing in things like guidelines, or sources of knowledge, which some of the models are starting to do now, and we’ll definitely see that this year for sure, how do we start thinking about that? Because then there are 2 layers: is the knowledge source good?
And is the model interpreting it in the right way, and not missing things? And that’s much more of an autonomous task. Do you have any thoughts on how we’re going to start dealing with those kinds of evaluations?
Speaker 1: Yeah, I mean, I think in some ways these are getting more and more similar to the evaluations that we do with humans, because humans also do a lot of kinds of tasks and use a lot of kinds of tools along the way. Clinical validation will focus more on outcomes and less on process. Probably you just have to look at: does this help with things that are meaningful to clinicians? Because you can’t look at the more process-based outcomes, like transcription accuracy and things like that, anymore when you have so many different variables that could affect the end result.
Speaker 0: Yeah, that’s much more the thing. And if there are no more questions from the audience, I’ve got one other question, which we’ve not really talked about before (well, we vaguely mentioned it a little bit), and it’s about the human-AI interaction: not even necessarily the impact of the physical buttons or the time spent on the computer, but what does it do to our brains? And I was thinking about this a lot a couple of months ago, because someone said to me, Dom, you’re not communicating as clearly as you normally do. And I was like, oh, why is this?
And I realized that I’d been using AI to do a lot of 0-to-1 thinking. I’d be like, oh, what should I think about this? And then it gives me an answer and I’m just editing it. And I realized I’d lost a bit of capacity, cognitively, to do that; it seemed a bit extra hard, you know, the muscle wasn’t as strong. But I do wonder, when we have AI suggesting things, for example AI guidance, or pulling up guidelines.
Does that influence our decision making? There’s another portion of that, which is the human-AI pair. I have no idea how we begin to evaluate that. I don’t know if you’ve had any thoughts around that.
Speaker 1: Yeah. I mean, there is some interesting data from radiology, from the vision-based fields, because they’re so far ahead of a lot of the other AI-based fields. There definitely is evidence that if you present the answer from the AI first, the humans turn off their brains and just go with whatever the AI says. If you present it too far after someone has already decided what’s happening, they just completely disregard it, as you mentioned.
There’s some sweet spot in terms of providing the information at the right time and at the right level to make a difference, but I don’t think we have a complete answer yet about when that is. You may be familiar with this: in the United States, these systems are approved by the FDA as software as a medical device and therefore they can be billed for. They read at the same time as the radiologist, and then if they see something the radiologist doesn’t see, they will say, we have detected something in here. The radiologist doesn’t always know what that is.
Then they’re kind of guessing what that might be and whether it’s right or not, which is kind of a way to make somebody crazy: just saying something in there might be wrong, but I’m not sure. It ends up taking more time.
Speaker 1: A lot of these systems are really put in place to save radiologists’ time and to increase the speed of readings, and instead they have either kept them the same or often increased how long they spend on each read. I think that’s another place where the context has to come into play. How often are things meaningfully missed in radiology at baseline, and is the 2% to 5% of catches that the AI has been shown to provide for readers actually clinically valuable? I think the answer may be that it’s clinically valuable from 10 pm to 6 am, or something, when we know that people are fatigued and humans naturally make more mistakes. There may be some element of, when the humans are known to be especially bad at something, that’s when we include an AI co-pilot, or some sort of AI helper, to really assist in those kinds of tasks or specific time periods.
Speaker 0: Yeah, I remember my favorite SIM training was human factors training. So it was a sort of resource-management situation in a SIM suite, but very much focused on human factors. And I remember I really liked it, actually, because it had some completely nuts scenarios. There was, oh, what’s that poison gas? Sarin. It had a sarin gas attack as one of the simulations and you had to figure it out. But I think what’s really interesting here is that maybe there’s a gap already for AI factors, or human-AI factors training, to think about how to interpret and when to interpret.
I mean, the co-pilot analogy, I was just thinking about that as well. There’s a really clear handoff of who’s flying the plane at any given moment, and there’s a whole bunch of safety procedure around it: I’m going to give it over to the autopilot, it’s going to come back to me.
This is how I do it; the co-pilot’s there. We haven’t quite got there with the cognitive side, but often, you know, as clinicians we’re flying planes in the truest sense of the word, in that we’re trying to carry a human being through a process or through a journey. And if the AI is assisting us, but we haven’t really thought about who’s making the decision or how (medically, legally, that’s a minefield already), then even just practically, will we deskill a whole generation before they even have a chance to learn how to fly a plane in the clinician sense? I think that’s also something we need to think about.
We are at time, and slightly over time, so I will thank the audience. Thank you so much for your time. And thank you, Sarah; I don’t know what time it is there, but it’s super early, far earlier than most UK doctors would ever wake up. It’d be lovely to follow up and have another chat, and maybe you can come back for the panel next month and we can have a further chat.
Speaker 1: That’d be great. I look forward to keeping in touch with any participants who are interested in this. I love talking about this stuff, so I’m thrilled that Dom was kind enough to invite me, because I could talk all day and I think I’ve used up all the people that are also interested in this and want to hear me talk about it.
Speaker 0: No, you’re not. I was just thinking maybe we get a podcast going, but we’ll talk about that offline. All right. Well, thanks everyone.
And I’ll close it there. And, yeah, the slides will be available soon.
Speaker 1: Thanks. All right. Bye.
Meeting title: AI Academy Session – How to Evaluate AI in Healthcare
Purpose of meeting: To educate NHS leaders, CXOs, and digital healthcare leaders on evaluating AI models in healthcare, focusing on understanding AI metrics, biases, and model evaluation techniques to make wise decisions in the best interest of patients.
Date: 8th January 2025
Time: [No time mentioned]
Location: Online
Attendees:
- Dr Dom Pimenta
- Dr Sarah Gebauer
Discussion points and decisions made:
1. Introduction to AI Academy
- Welcome and Purpose
- Dom welcomed participants to the AI Academy session, designed for NHS leaders, CXOs, and digital healthcare leaders.
- Emphasized that the AI landscape is evolving, and leaders are making senior decisions with new technology.
- Highlighted the aim to understand how decisions are made in the AI space, which is a new and evolving science.
- Program Overview
- Mentioned the previous webinar held in mid-December.
- Informed participants about the availability of a Wiki with resources and a video of the previous session.
- Presented the program schedule and upcoming topics.
2. Introduction of Sarah
- Joining the Session
- Sarah joined the session after resolving technical issues with joining the stage.
- Dom welcomed her and handed over for an introduction.
- Sarah’s Introduction
- Sarah introduced herself as an anesthesiologist and palliative care physician from the United States.
- Explained her background in AI, focusing on AI model evaluation work across different fields, including healthcare.
3. Overview of Today’s Session
- Session Agenda
- Dom outlined that the day’s topic is “How to Evaluate AI in Healthcare.”
- Explained the plan for Sarah to present her slides, with Dom assisting by controlling the slides as needed.
- Panel Discussion and Future Topics
- Mentioned a panel discussion between Dom and Sarah to delve into relevant topics.
- Expressed interest in exploring Sarah’s experience in the US, which is ahead in deploying AI in healthcare.
- Previewed upcoming sessions:
- Clinical risk and medical device regulations in two weeks.
- Adoption challenges and training in early February.
- A future-focused session with Dr. Josh Au Yeung planned for February.
- Course Aim Reminder
- Reiterated that the course is about learning how to wisely buy AI for healthcare.
- Emphasized it is not a computer science class but about understanding the market and making decisions in patients’ best interests.
4. Presentation: How to Evaluate AI in Healthcare by Sarah
- Introduction to AI Model Evaluation
- Sarah invited questions and was eager to hear about participants’ experiences.
- Set out to discuss why AI models need different metrics, the meaning behind metrics, and key questions to ask when evaluating AI applications.
- Why AI Models Need Different Kinds of Metrics
- Discussed the difference between probabilistic and deterministic models.
- Explained the concept of stochasticity in AI and how outputs can vary even with the same input.
- Highlighted challenges in model evaluation due to variability and context dependence.
- Introduced the concept of model drift.
- Discussed how model drift necessitates ongoing evaluation rather than one-time assessments.
- The Meaning Behind the Metrics
- Highlighted the complexity of AI metrics and the importance of understanding what they represent.
- Discussed variability in metrics such as accuracy and how different interpretations can lead to misunderstandings.
- Engaged participants with a question about what “90% accuracy” might mean for an AI system.
- Addressed the challenge of defining gold standards for AI products.
- Key Questions When Evaluating AI Models
- Presented critical questions to ask vendors.
- Emphasized assessing the vendor’s understanding of their own metrics and their relevance to clinical needs.
- Discussed the importance of ongoing support and evaluation after product deployment.
- Human Baseline and AI Comparison
- Discussed the concept of human baselining—determining current human performance to compare with AI systems.
- Highlighted challenges in comparing AI performance to humans due to variability among experts and context-specific expertise.
- Mentioned that AI systems often have a wide breadth of domain expertise, while humans have depth in specific areas.
5. Interactive Discussions with Participants
- Engagement through Questions
- Sarah and Dom encouraged audience participation through chat interactions.
- Asked participants about their understanding of probabilistic vs. deterministic models.
- Presented scenarios and solicited opinions on interpreting AI accuracy metrics.
- Audience Insights
- Participants shared varied interpretations of AI accuracy and expressed skepticism about vague metrics.
- Discussed the importance of context, reliability, and understanding the specifics behind performance claims.
- Highlighted concerns about bias, importantly differentiating between statistical bias and human biases in AI systems.
6. Discussion on Human-AI Interaction
- Impact on Clinicians
- Dom shared personal reflections on how AI tools can influence human cognition and decision-making.
- Discussed concerns about over-reliance on AI potentially reducing clinicians’ own cognitive abilities over time.
- Explored the idea of AI affecting clinicians’ communication clarity.
- Human-AI Partnership
- Sarah mentioned studies showing that AI can influence human decision-making in fields like radiology.
- Discussed the need to find a balance in human-AI interaction to enhance clinical outcomes.
- Future Considerations
- Highlighted the potential for AI to incorporate tools and external knowledge sources.
- Discussed the complexity of evaluating AI systems that use multiple models and handle tasks autonomously.
- Emphasized the necessity for clinicians to adapt and develop skills in interpreting and working alongside AI.
7. Future Sessions and Closing Remarks
- Upcoming Sessions
- Dom provided reminders about future AI Academy sessions.
- Resource Availability
- Promised to upload the current session’s video and provide additional materials.
- Shared a Wiki link for participants to access previous sessions and resources.
- Closing Thoughts
- Sarah encouraged participants to reach out with questions or for assistance in evaluating AI systems.
- Dom thanked Sarah for her contributions and acknowledged the active participation of the audience.
- Mentioned the possibility of future collaborations and further discussions on the topics.
Action points:
- Upload Session Materials
- Upload the video recording of the session for participant access.
- Provide additional resources and materials related to the topics discussed.
- Update the Wiki with the current session’s content.
- Prepare for Upcoming Sessions
- Develop content and materials for the next session on clinical risk.
- Coordinate with the clinician at TORTUS who will lead the session.
- Plan for the session on driving adoption in clinical AI with Dr. Dave Triska.
- Organize the future-focused panel discussion with Dr. Josh Au Yeung and consider inviting additional panelists.
- Participant Engagement
- Encourage participants to reach out with questions or for support in AI evaluation.
- Foster ongoing communication and provide assistance in educating peers.
- Consider the development of human-AI factor training sessions to address the impact of AI on clinical decision-making.
- Continuous Improvement
- Reflect on the feedback and insights shared during the session to enhance future presentations.
- Stay updated on emerging trends in AI model evaluation and incorporate them into upcoming sessions.
- Explore opportunities for further collaboration between Dom, Sarah, and other experts in the field.
Published papers on evaluation in ambient AI:
https://www.sciencedirect.com/science/article/pii/S2514664524015479
LLM evaluation Framework: https://www.medrxiv.org/content/10.1101/2024.09.12.24313556v1.full.pdf
The CHAI model card has a long list of questions that may be applicable to implementation. Its questions won’t apply to all AI software, so some discernment is required.
This is probably the most well-known graph of general human baselines for AI. Not immediately useful but broadly interesting.
This is a new Nature scoping review about ML robustness concepts, which are related to evaluation metrics, though this is more on the technical side.
This is a more clinical piece about evaluation of ChatGPT.
3. Understanding Clinical Risk in AI – Dr. Dom Pimenta and Dr. Ellie Asgari
First Broadcast:
20250122
Speaker 0: Okay, great. Well, welcome back everyone to another session of the TORTUS AI Academy. I shall also be controlling the slides. And it’s my great pleasure to introduce my good friend and colleague here at TORTUS, consultant nephrologist, Dr. Ellie Asgari.
Hi, Ellie, how are you?
Speaker 1: Hi, good to see you, Dom. Yes, good to meet you, everyone. Thank you very much for joining us in this lunch hour this afternoon. We look forward to talking to you in this third session.
Speaker 0: Yeah. So today’s topic is understanding clinical risk in AI. Just a brief recap of the series so far. What we’re doing here at the TORTUS AI Academy is a series for NHS CXOs and other digital leaders to better understand the AI landscape, which is obviously moving at an increasingly fast pace. There is a wiki, and I’ll share the link after this, to look back at the previous sessions.
And you can also look back at this session as well. Before Christmas, we covered what is AI, the very basics, and it’s worth a rewatch; we will be revisiting some of those terms here again today. In the last session, a couple of weeks ago now, with Dr. Sarah Gebauer, we looked at how to evaluate AI in terms of business case and efficiency savings.
And today we’re going to talk specifically about clinical risk. And just to remember, the goal of this series is not a computer science class. Many of the in-depth areas are very interesting and we will provide some additional resources, but the goal is to understand the fundamentals, so that we as clinicians and decision makers can make better decisions about what clinical AI we’re buying, implementing and deploying on behalf of our patients. And with that in mind, I will hand over to Ellie.
Speaker 1: Thank you very much. So I think clinical risk is a big topic, but what we’re going to go over today is to tell you a little bit about categories of clinical risk and some strategies to mitigate them, some conversation about governance and accountability, and some tips about implementation best practice. I think people who have joined this meeting today probably do know that there are lots of benefits in using AI. There’s a huge amount of promise in using AI to improve patient outcomes by improving decision making, diagnosis and treatment plans; to increase accessibility through remote monitoring; and to enhance patient experience, with so many different types of chatbots and devices available for patients to use. There’s a lot of cost saving to be had by reducing administrative tasks, by reducing patient readmissions, and by monitoring patients effectively rather than as regularly as we do in clinics.
However, to do this correctly, we need to make sure that we do a clinical evaluation, demonstrate the benefits, and make sure that we identify the risks and mitigate them, both to increase the benefits and to improve people’s confidence. This is a new technology and it’s really important that we do it right, so that we increase the confidence of users in this technology. As for the various types of clinical risk, there’s no one way to categorise them, but we thought we’d discuss some of them: things like diagnostic errors, therapeutic errors, and bias and discrimination. We won’t be talking too much about data privacy and security, which is a whole area in itself and not so much a clinical risk, and we’ll have a bit of conversation about automation bias. So we’ll go through some examples of each, and one of the very important
issues with AI models is potential diagnostic errors. AI models, and if you listen to the previous sessions about what AI is and how it works you’ll get an idea of this, can misinterpret data, because that’s what they do: they use a lot of data and they find patterns, so they can make mistakes, and this can lead to misdiagnosis, false positives or negatives, or incorrect risk stratification. You might have heard this example; it’s quite famous and people talk about it a lot because it makes a really interesting point. An AI model was developed to decide whether skin lesions, such as suspected melanoma, were malignant or benign, and what they found is that the model was working really, really well. Then, when they went into more detail, they realised the model was doing really well because it was picking up on the images that had a ruler in them: the dermatologists were using a ruler in the images where the lesion was malignant, because they wanted to measure the size of the malignant lesion, whereas for the benign lesions they weren’t using a ruler. So the machine was actually picking out all the images with a ruler and saying these are malignant, without paying attention to the lesion itself.
So you can see, if you think about how the machine is identifying things, you can understand how things can potentially go wrong, and I think as our knowledge increases about the use of AI and training models, these risks are getting less and less, and we can identify them and mitigate them a lot better. Another really important thing is incorrect treatment options. You can think of many examples, but one I was thinking about is infectious diseases. Every region has its own pattern of infection risk; even in the UK, different trusts and different areas have different epidemiology with regards to what type of bacteria is going around for urine infections or chest infections. So if you train your model to pick up infections and suggest treatment in one area, and then you pick it up and use it somewhere else, it might not give you the correct advice. Or, for example, so many models have been developed for identification of sepsis, and if your model makes a mistake and says a person has sepsis when they don’t, or misses sepsis, you can see how the treatment offered could be affected, and that can have detrimental effects on the outcome if you don’t get it right.
So this can be a model bias, as we have described. It’s very important to identify what data your model has been trained on, what the output is, whether it fits your particular population, and whether it would apply.
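To make the ruler example concrete, here is a minimal toy sketch in Python (using NumPy and scikit-learn) of how a model can look highly accurate by latching onto a spurious feature that tracks the label in the training and evaluation data, and then fail once that shortcut disappears. The features, numbers and data are entirely invented for illustration; this is not how any real dermatology model was built.

# Toy illustration of shortcut learning (the "ruler" problem), not a real dermatology model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, ruler_follows_label):
    """Two features: a weakly informative lesion feature and a 'ruler present' flag."""
    y = rng.integers(0, 2, n)                       # 0 = benign, 1 = malignant
    signal = y + rng.normal(0, 1.5, n)              # weak genuine signal
    if ruler_follows_label:
        ruler = y.copy()                            # ruler photographed only with malignant lesions
    else:
        ruler = rng.integers(0, 2, n)               # ruler appears at random in deployment
    return np.column_stack([signal, ruler]), y

X_train, y_train = make_data(2000, ruler_follows_label=True)
X_dev, y_dev = make_data(1000, ruler_follows_label=True)      # evaluation data shares the confound
X_real, y_real = make_data(1000, ruler_follows_label=False)   # deployment-like data, confound broken

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on data with the confound:", accuracy_score(y_dev, model.predict(X_dev)))
print("accuracy once the confound disappears:", accuracy_score(y_real, model.predict(X_real)))

The headline accuracy on the confounded evaluation set looks excellent, while performance on the deployment-like data collapses towards the weak genuine signal, which is exactly the failure mode described above.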
Speaker 0: Yeah, that’s really interesting. I can give radiology examples for both of those, actually. The classic one with chest x-ray evaluation was, oh, we can age chest x-rays using the model, and after very successful training they realised it was just reading the date of birth at the top right, and they had to take that out of the image.
And the other interesting thing here is that there’s adult chest x-ray reading technology, and all sorts of other radiology tools: if you show them an image of a child, they will make a diagnosis, but it’s completely incorrect because of that training bias. I remember once walking into the surgical offices as a junior, and two of the senior regs showed me an x-ray and said, what do you think this is? It was a picture of a curled-up skeleton within a bigger whole-body x-ray, and I said, oh, the lady must have a cat on her lap or something. No, she was pregnant. So even we only recognise the images we expect to see, right? I was expecting there must be an animal in there and it was actually a child, but I’d never seen a fetal x-ray before, because it’s not something you should normally see.
Speaker 1: That’s right yeah.
Speaker 0: Exactly the same situation with AI.
Speaker 1: So I think exacerbation of bias and inequality is one of the really important things; you’ll hear it in the media and in all the conversations about AI, and it’s a real, true issue. One of the challenges is that we know a lot of our data, including our health data, is inherently biased, because of how society works, unfortunately. So there’s already a lot of inherent bias within the data, and of course, if we’re using that data to drive our models or make predictions, we need to be prepared that there will be some bias; being aware of it and making sure we address it as much as possible is a very important topic. I really like this study, published a few years ago in the US, which is a very money-driven health system, where they tried to understand which groups of the population require more healthcare. They found that the black patients appeared to require less healthcare, and they thought, how is that possible? What had happened is that the black patients were actually using less healthcare because they couldn’t afford it; they were quite a lot sicker, presented a lot later and had poorer health. It wasn’t that they needed less healthcare, it’s that they weren’t using healthcare as much as they needed to, because they couldn’t.
So I think this is a really good example to think about: when you see something like this, you need to be very careful and question the output and say, is this really true? Does it actually fit with what’s happening? Just because the machine said so doesn’t mean it’s right. And there are many ways people are trying to mitigate these types of biases in the data. A lot of attention is being paid to collecting data from more diverse racial groups, and also to being aware that, for example, if we are testing something and we don’t have representation of a certain ethnicity or population, at least we can add a disclaimer to say we don’t know if the model works exactly as claimed: if we have tested it all on one group of people, it might not work the same in another group of people.
Speaker 0: I mean, the way I think about it is leverage. Used in the right way, it can massively exacerbate the positive impact, but equally, used in an ignorant way, you can massively exacerbate the existing negative impact if you’re not careful about how you deploy it. And that example is a classic one of an existing negative bias, unequal access to care driven by ability to pay, then being amplified by an AI algorithm.
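The cost-versus-need problem described above can be illustrated with a small, entirely hypothetical simulation: two groups have identical clinical need, but one group has less access to (and therefore lower spend on) care, so a rule that refers the highest predicted spenders systematically under-refers that group. All numbers and group labels below are invented for illustration.

# Toy sketch of proxy-label bias: using healthcare *cost* as a stand-in for healthcare *need*.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
group = rng.integers(0, 2, n)                    # two patient groups with identical illness burden
need = rng.gamma(2.0, 1.0, n)                    # true clinical need, same distribution in both groups
access = np.where(group == 1, 0.5, 1.0)          # group 1 can access/afford less care per unit of need
cost = need * access + rng.normal(0, 0.1, n)     # observed spend is need filtered through access

# "Refer the top 20% by predicted cost" - the proxy the algorithm actually optimises
threshold = np.quantile(cost, 0.80)
referred = cost >= threshold

for g in (0, 1):
    mask = group == g
    print(f"group {g}: mean need {need[mask].mean():.2f}, referred {referred[mask].mean():.1%}")

Both groups have the same mean need, yet the under-served group is referred far less often, which is the shape of the result described in the study above.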
Speaker 1: Exactly. And I really like this point about automation bias. In general we always say, oh, you know, human in the loop, and that having a human in the loop will sort of solve all the problems that AI may have; if you have a human to check the output, it’s all going to be fine. But I think it’s very important to keep automation bias in mind.
So clinicians, and not just clinicians, humans in general, can become very over-reliant on a technology, and after a while they stop paying attention to potential mistakes. This has been studied both in clinical and non-clinical settings. There are several small studies coming out that test small clinical scenarios and compare how doctors solve the problem on their own, doctors together with AI, and AI on its own. I would say the results are a bit hit and miss: in some scenarios, quite often the doctor and AI together work best, but it is not always the case, and it’s not the huge improvement you might expect, because doctors at some point get tired.
Imagine you have a model that reviews x-rays and makes very few mistakes; at some point the doctor loses the ability to pay the amount of attention that it requires. There is a study, actually a non-clinical one, where they had 49 experienced drivers use driverless cars for an hour a day for a week, and after a few days they realised that 80% of the drivers weren’t paying attention to the road anymore; they were just watching their phones or reading a book or something. So humans are fallible, and it’s really important to keep that in mind. That’s why, and we’ll talk about it in a minute, constant evaluation, not taking your eye off the ball, is really important in evaluating and tracking the models.
Speaker 0: I have to say, this is actually one of my favourite areas, because we know so little about it. We’re just starting to see what it looks like when someone has worked with an AI system for a year; in the clinical world, that’s only really existed for the last 6 to 8 months. So even for those people, how have their behaviours changed, in a good way or a bad way, and how do they use the system now compared to how they started using it, whatever that system might be? To your point about autonomous vehicles, I got into a Waymo in San Francisco.
It was so funny: with an Uber driver, when the human is there, I’m really vigilant, are they driving okay, what are they doing? And actually the Waymo drove really well, and I completely forgot I was in a car. At one point someone started beeping at me and I thought, I’m just sitting in this, I’m not driving it, and I was really struck by the whole situation. I thought, wow, that was such an instant behaviour change. We don’t really understand why human clinicians would ignore AI suggestions.
I mean, we have ideas why, but we don’t know why. We don’t know why they sometimes take them and sometimes don’t. And we also don’t know why they would ever over-rely on something. I think that’s something we need to study, especially as the systems get more complicated and start suggesting guidelines and making diagnoses: how do they fit into the hierarchy of medicine? Does a consultant defer to the AI?
Does a junior doctor learn from the AI? How does it all fit together? I think that’s something we’ll have to study quite extensively, as you say, continuously, over the next decade or more, most likely.
Speaker 1: Definitely, I think we need to really pay attention to this. Behavioural science is not very exciting and sexy, so people are generally more excited about the tech side of things, but it is a really, really important area; I can’t emphasise it enough. So, to mitigate risk and make sure that we cover it, rigorous evaluation is key, and that should happen at all the different stages of any technology: before deployment, in a clinical trial if the tool needs one (not everything actually fits into a clinical trial, it depends on the tool), and in post-market surveillance. These are very, very important. Some of the things that need to be considered, as I mentioned earlier, relate to the quality of data: for the AI tool you’re planning to use, what kind of data has it been trained on, is it generalisable, and how does it perform when you assess it in the real world? If you are going to procure or use an AI tool, understanding the relevant regulations and ethical guidelines is really important. Since ChatGPT came out, the conversation about regulation of AI has gone through the roof; if you’re following this topic, there isn’t a day when people are not saying something about the regulation of AI, and there’s a lot of anxiety, because it’s such a new technology developing so fast, and people are not sure how to go about regulating it. But I would say that if you understand the regulatory landscape for medical devices, it’s reassuring that this technology also fits within medical device regulation. Of course there are challenges, because these tools develop so fast, and how we keep up with regulating them is something that people continuously talk about.
But I think the foundation is there; it’s just a matter of fine-tuning how to regulate certain specificities. So in the UK, the MHRA oversees the regulation of all medical devices; in the EU there is the Medical Device Regulation and the EU AI Act, which came into effect last year; and in the US, the FDA sets the guidelines for medical devices. In addition to the regulations, which are essentially law and which we need to adhere to, it’s very important that we also adhere to ethical AI principles. Although some of them form the foundations for the law, ethical principles are not law in themselves, but abiding by them, and making sure that the devices we use fit with those principles, is really important to build trust in the population and to ensure we align with wider government AI principles, so that we abide by fairness, reduce bias, and follow all the things that an ethical society tries to follow.
If that makes sense. So what is really, really important, and I think this is one of the most important slides, is the assessment of the medical device. Most of these technologies will fall into some sort of medical device category, and the definition of a medical device is really important: whatever tool you’re using, whether it’s software or hardware, it will be a medical device if it is a product intended for a medical purpose, to diagnose, monitor, treat, alleviate or prevent disease or injury. This definition is the key factor in how the tool you’re using is going to be regulated. As for classification, most places follow a similar scheme: Class 1, 2 and 3, with Class 2 divided into 2a and 2b, and as you go higher the risk increases. So, for example, Class 1 devices don’t require notified-body approval; companies with a Class 1 medical device can self-certify by following a lot of requirements that were set by NHS Digital, which has now merged with NHS England. But as devices become more sophisticated, say for example a pacemaker, which would be Class 3, they require much more vigilant assessment. What is really important is the intended purpose. If, for example, you make a smartwatch and say the intended purpose is just so that you can check your pulse, your activity or how many calories you burn, that would not be a medical device. But if you take the same watch and say, actually, I’m making this watch with the intended purpose of finding out whether you have an irregular heartbeat and are at risk of stroke, then that watch becomes a medical device and needs to go through all the regulation.
So the intended purpose is a very important one, and identifying what class of risk your tool falls into will then guide you on what kind of regulatory approval you require and how to go about it.
Speaker 0: Yeah, I mean, this is another really fascinating area where the technology is moving faster than the regulators can keep up. I agree with you. TORTUS is a Class 1 medical device for this particular reason. There is a structure, there is an evidence requirement, there is a process. It is imperfect.
And actually, a lot of the elements will need to be updated quite regularly as we move, and as typical everyday AI that you’re working with on site goes from Class 1 to Class 2 and potentially even further, this is the kind of rigour that we will need. We’re thankfully seeing some AI innovation on the compliance side as well, to be able to keep up with the documentation. But I think, more fundamentally, our understanding of what is and isn’t a medical device is changing quite rapidly now.
And AI is probably somewhere between a person and software, as opposed to purely software, which is very deterministic. A person has to be trained, can make mistakes, can have variable performance, and that stochasticity is something that none of the traditional ways of assessing technologies can really account for today. Therefore you have to build evaluation in. Automating evaluation is a big part of the scene outside the medical space in the AI sphere; it’s all about evaluation at scale.
And that’s how you get to enterprise-grade performance; we’ll probably have to become part of this. My suspicion is it will all end up at the patient outcome level, as in, was there any patient harm? So the best and simplest definition I’ve heard for a medical device is this: if you’re thinking of creating an innovation and it goes wrong, could your patient come to harm? If the answer is yes, then it’s a medical device and you just have to figure out where you are on that ranking.
And if the answer is no, then you’re building something for the consumer market, and that’s something else. But I think, you know, while we’re still pioneering, this has become a really interesting area, but a potentially really hairy one as we go forward as well, and obviously somewhere we spend a lot of time at TORTUS.
Speaker 1: Absolutely. So the MHRA published a roadmap, I think last year, and they’ve recently updated their AI roadmap for medical devices. They require many things, but these fall into four categories. One is clinical evidence requirements: if you have a medical device you need to show its scientific validity, its clinical performance evaluation, and real-world performance monitoring. For the technical requirements you need change management protocols, data quality standards and cybersecurity measures; so, for example, your company needs to show that it follows cybersecurity guidelines, is not easily penetrable, and complies with GDPR and data processing requirements. You also need risk management, and anyone who is using a medical device will probably be familiar with this: you need a clinical safety officer to complete clinical risk management documentation, keep a hazard log, continually do risk assessments and benefit-risk analysis, and carry out post-market surveillance. There are many, many documents required, even for Class 1, and there are companies that help with that. Most places that are going to be using medical devices will need some kind of regulatory expert to help with this documentation and evidence gathering and to look after all of these elements, because it’s quite comprehensive and extensive.
So, in a way, these are all already in place for medical devices, and while AI has some slightly different capabilities and capacities, these fundamental things are already there and need to be complied with.
Speaker 0: It’s a great forcing function actually, as painful as it is for startups. Once you’ve gone through these processes and put them in place, you’ve created a culture almost by design, where you have to maintain this documentation, where you have processes that you continually test, and therefore safety becomes part of the DNA if you want to get to market this way. And I think that’s exactly why, even if they’re imperfect, this is definitely the way to go forward with any AI technology, I would say.
Speaker 1: I think it’s really good when you actually do it, as you said: when you try to fill in the hazard log, for example, it makes you think a lot. It makes you think about a lot of risks that you wouldn’t otherwise have thought about, as you were saying earlier with “would my patient come to harm?” For example, what would happen if an AI scribe wrote something wrong? We’ve talked a lot about this: at what point would you consider it a harm? We can standardise risk estimation based on the probability of a risk and how bad the consequence of that risk would be, and by harm we mean that some injury happens to a person.
This is one of the things that I find is not so straightforward in some cases, because some of it can be slightly subjective, and people’s understanding of risk differs; how individuals perceive risk is a science in itself. Even in the general population, when I discuss a kidney biopsy with patients and say there is roughly a 1 in 100 chance they will need a blood transfusion, some people find that really shocking and some people find it’s fine. So it’s really interesting, but essentially that matrix of likelihood and consequence is one of the key ways we estimate risk.
Speaker 0: Yeah, again, imperfect, isn’t it? It’s all context specific. When we used to do risk assessments in cardiology land, catastrophic bleeding in cardiothoracic surgery means the aorta is hosing everywhere, whereas catastrophic or severe bleeding when you do a cath is a haematoma here or there that you can’t control. It’s a very different level of risk, but equally those events should be very low relative to the danger of what you’re trying to achieve. And that’s why I think this is a very transferable way of talking about risk, but it’s actually quite difficult to then make those decisions.
It’s useful to have, and we’re both CSOs, an external clinical safety officer, and to talk to the CSO in the hospital as well. One of the things to stress here is to make sure that when you’re adopting a new technology, those hazard workshops that you’re supposed to do actually do get done, because the subjectivity of what is and isn’t significant can be quite variable, and your risk tolerance as an organisation might not be the risk tolerance of the organisation down the road, for example. Lots of inherently interesting things can come out of those discussions anyway: use cases, intended use, better instructions, better training. So it’s well worth doing as an exercise.
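For readers who want to see the likelihood-times-consequence idea written down, here is a minimal sketch of a scoring matrix of the kind used in hazard logs. The band labels, cut-offs and scores are illustrative assumptions only; use your own organisation’s agreed matrix rather than this one.

# Minimal sketch of a likelihood x consequence risk matrix for a hazard log.
# Labels, scores and cut-offs are illustrative assumptions, not an official scheme.
LIKELIHOOD = {"very low": 1, "low": 2, "medium": 3, "high": 4, "very high": 5}
CONSEQUENCE = {"minor": 1, "significant": 2, "considerable": 3, "major": 4, "catastrophic": 5}

def risk_score(likelihood: str, consequence: str) -> int:
    return LIKELIHOOD[likelihood] * CONSEQUENCE[consequence]

def risk_band(score: int) -> str:
    if score <= 4:
        return "acceptable"
    if score <= 12:
        return "review / mitigate"
    return "unacceptable without redesign"

# Same hazard, two different contexts:
print(risk_band(risk_score("very low", "major")))   # e.g. a missed finding with a human second read
print(risk_band(risk_score("high", "major")))       # same consequence, but high volume and no second read

The point of writing it down is that the score is only as good as the judgments feeding it, which is exactly what the poll question that follows illustrates.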
Speaker 1: So I thought I’d put this question out to the audience: if you have an AI radiology tool that reports lung cancers and makes a mistake 1 out of 1,000 times, how would you categorise the risk? Which number would you give it? I’m just curious whether everyone would think about the risk the same way, if, say, it misses 1 in 1,000.
Speaker 0: It’s interesting, how would people categorise that? So a radiology AI misses a lung cancer on some sort of scan. How would we think about that in terms of impact? Is that minor, significant, considerable, major, catastrophic? We could do with a poll really, but you could put your answers in the chat.
Speaker 1: If you put your number in.
Speaker 0: So we’ve got one vote for 4. Yeah. Which is confusing. Stephen, do you mean major, or do you mean likelihood high? You could categorise that on the numbers there, I suppose.
Yeah, interesting to break it down. So I guess likelihood is also subjective, right? What is 1 in 1,000? Is that a very high likelihood, a low likelihood, a very low likelihood? For example, 1 in 1,000 might be considered very low if the AI radiology tool only looks at 100 scans a year. If it looks at a million scans a year, 1 in 1,000 suddenly becomes almost a daily event. So how do you assess likelihood as well?
Speaker 1: I think it’s very difficult.
Speaker 0: Like a low-to-medium likelihood with a catastrophic consequence. Interesting. So yeah, that would come out as a 4, whereas we’ve had some disagreement already about the impact of the risk. I think this is really good to highlight how difficult this actually can be.
And again, it depends on the context. So, for example, if the AI radiology tool is an over-read, there are two steps and the human is in there, then you can massively decrease the risk, right? In that scenario the human picks it up, and the risk is significant but it’s not going to cause the patient significant harm, because there’s a second line. Similarly, the numbers and the context become very important. Adele says major to catastrophic, reasonably high.
How do you feel about that 1 in 1,000, and what is the fate of the cases that are missed? It depends on sample size; is the risk relevant to the current situation, is it better or worse? I mean, that’s a very good point. With AI technology, one of the things that we rarely talk about is: what is the baseline?
And actually, that suggestion from Kate is a very good one, and I think I made this point to you a couple of months ago when we were looking at this: what is the relevant baseline of risk today? For example, with ambient AI, how often do human physicians miss important details that could potentially change the diagnosis in what they document? I don’t think we actually know that; if we did, then baselining the performance and the risk, whether it’s safer or worse, would be much easier. And similarly here: if the AI misses a lung cancer 1 in 1,000 times, but the human radiologist was missing 1 in 200, does that make us look again at this table? That’s what I’ve always found tricky about this process: it’s kind of a static one, when really it’s the dynamic, relative change that we should be paying attention to, especially when we’re trying to achieve something in the organisation or in the clinic.
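The back-of-the-envelope arithmetic behind these two points, that the same miss rate means very different things at different volumes and only makes sense against a human baseline, looks something like this (all rates and volumes are hypothetical):

# Hypothetical numbers only: converting a per-scan miss rate into expected misses per year
# and comparing against an assumed human baseline.
def expected_misses(miss_rate: float, scans_per_year: int) -> float:
    return miss_rate * scans_per_year

ai_rate, human_rate = 1 / 1000, 1 / 200   # assumed AI vs baseline human miss rates

for volume in (100, 100_000, 1_000_000):
    ai = expected_misses(ai_rate, volume)
    human = expected_misses(human_rate, volume)
    print(f"{volume:>9,} scans/year: ~{ai:,.1f} AI misses vs ~{human:,.0f} human misses "
          f"({human - ai:,.0f} fewer if the AI matches or replaces the baseline)")

At 100 scans a year the AI miss is a once-in-a-decade event; at a million scans a year it happens roughly three times a day, yet under these assumed rates it is still five times fewer misses than the human baseline. Both halves of that sentence matter when scoring likelihood.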
Speaker 1: Absolutely, I think all the comments that people are making are actually quite valid. It’s important to have the baseline, to see what the rate already is in the organisation with humans, and of course that really helps us. I’ll talk a bit later about the question of what level of risk you are happy to accept when you’re trying to use an AI tool, and that then helps you communicate it with your patients as well, because they are one of the most important stakeholders. So I just wanted to bring that in here.
Yeah. So I just wanted to make a point about the EU AI Act, which a lot of people have been talking about. It took a long time for the different EU countries to come up with this regulation, and what they have done is also a risk-based assessment for AI technologies, which includes medical technologies. This covers all AI technologies, from minimal risk at the bottom, which is things like spam filters or video games, going higher and higher up to unacceptable risk, which is things like social scoring or facial recognition. And you can see that this is different from the US, for example, where they use facial recognition in airports, which I found quite interesting after studying the EU AI Act for a long time.
When I went to the US and was at the airport, I was thinking, how is it that they’re just letting everyone be scanned by facial recognition? So it’s very different, and the EU does have a lot stricter regulations, and medical devices fall into the high-risk category, so they need to go through a more rigorous process before they can be used. There’s a lot of discussion about what the best approach is: some people think the EU approach is very limiting, because it has tough regulation and doesn’t let innovation flourish, and some say that in the US the FDA takes a much lighter approach, so the US is a better place for it. I think the UK is trying to sit somewhere in the middle, to have a good safety approach but also allow for innovation, and I think that’s probably a better way forward. Difficult to know; I think time will tell.
Speaker 0: I think it’s a more sustainable way, and obviously the latest administration in the US has just slightly deregulated again. But the fundamentally interesting thing about that is that the US is an extremely litigious environment. So you can be as deregulated as you want on the innovation side, but when you come to deploy, you are then entering one of the highest-risk medico-legal environments in the world in terms of how easy it is to sue everybody. So AI companies who are going, oh yeah, it’s all deregulated, straight into hospitals, are going to be woken up very, very sharply to the fact that medico-legal litigiousness is not going to change, and will probably get worse in the world of AI.
And so actually this is the type of thing that I think, in the limit, will give EU companies an advantage: growing up in this environment, growing up hard as it were in the clinical regs, means that when you go over there you can have high confidence in your systems, you can evidence it, and that becomes defensible in that legal environment. And the FDA, slightly aside, will maintain a lot of standards around devices, but equally who knows what gets deployed and how it gets deployed. I think it will be the Wild West until it isn’t. That’s generally my view with most of these industries, especially in medicine: it will persist with a relatively light touch until there’s an event, or there’s public interest in protection and safety.
And then those on the right side of the regs, of the medical devices, will be the ones that persist and are sustainable as companies and innovations. Anyway.
Speaker 1: Yeah, time will tell.
Speaker 0: So how do we mitigate clinical risk?
Speaker 1: I think it is a continuous cycle, like with any other medical device that we’ve been using. It’s very important to identify, assess and monitor risk regularly, making sure there is a robust risk management process and a governance process in place; to do clinical validation and pilot studies, checking whether the company has done any real-world testing, whether they needed to do any clinical trials depending on the device, and what their results are. It’s very important to have multidisciplinary collaboration, with the engineers, the clinicians and the governance team engaged together, so that people know the pathway for reporting incidents, raising issues and getting them fixed; and also continuous performance monitoring, to make sure that post-deployment you keep an eye on it. We’ve been emphasising this throughout the talk, but it is one of the most important things for the safe deployment of an AI tool, whatever that may be. And I think this is one of the issues I don’t know how we’re going to address in the UK, and that’s the upskilling of staff.
In my view, knowledge about AI is way behind how fast the technology is progressing. Unfortunately, the situation with clinical teams is that everyone is so busy, it’s really hard to make time to put in this education. As far as I know, AI is now in the curriculum for medical students, but that’s going to take some time before they come into clinical practice. For the time being, it’s really important that clinical staff have some basic data literacy, some basic understanding of AI, how it works and its limitations.
Then they know what to look out for. It’s very important that there is collaboration between the data scientists and the IT people, and regular training, because these AI tools are constantly updating; things are changing so fast it’s really hard to keep up, so regular updates really matter. And I think sometimes we underestimate the change in the workflow. We were talking about the human-AI interface and how we don’t really know how it works: with any tool you implement that is going to affect the workflow, we don’t know how it influences the whole dynamic of the clinical care people are going through. If you have an AI technology that helps with reading x-rays, how is that going to affect the workflow? Will the clinician then be asked to review more x-rays? One of the things I heard from some radiologists is this: if I look at, say, 100 x-rays, and half of them are simple and half of them are hard, and then you put in an AI technology that looks at all the easy ones, the head of department can argue, well, I’ve taken 50 off your load, so you need to look at the others. But if the remaining ones are harder to analyse, how are we going to factor that in? How are we going to give radiologists enough time? And of course there’s the cognitive load as well, which we don’t pay attention to: analysing a difficult x-ray is harder for your brain, so it takes more time. There are a lot of nuances that we don’t know about until we pay attention to them and factor in the risks.
Speaker 0: It’s such a good question. And I think probably the impetus for starting this webinar in the first place was that fundamental gap: what we know about deploying AI today versus what we knew 6 months ago versus what we knew 12 months ago is vastly different, and that’s working on clinical AI full time, every single day. And then we talk to people and they’re like, oh, can you do this thing now?
And we’re like, yeah, we could do that 8 months ago; the technology is moving so fast. But I think it’s not just clinicians. Interestingly, this is one of those innovations that shouldn’t live in the innovation department, and shouldn’t live in the transformation department. Everybody is going to have to understand some of the basics of how this technology is going to change not just healthcare but so many other parts of our society. It’s actually already everywhere: autocorrect on your phone, that’s AI; the way your phone opens when it recognises your face, that’s AI, right?
These are technologies we’re already kind of familiar with, but the pitfalls are new, the limitations are new, and some of the edge cases and behaviours, like hallucinations for example, are new terms that we really need to understand. I think your point about workflows is exactly right. The technology is just the technology; it’s the product solving the problem. So what is the problem, right? Radiologists don’t have enough time.
And then you go through all the steps and you realise that you’ve actually landed them with less time to do the important work, even though you’ve moved the workflow around significantly. That’s something we need to be super conscious of if we don’t look at the problem across the whole workflow. Sometimes, and you say this all the time, AI is not actually the right solution, or it’s actually going to make things worse, it’s going to amplify the problem in the wrong way. So, for example, diagnostic screening algorithms sound great, the sort of box that gives you an AI-based diagnosis. That’s great.
But then who’s going to deal with the white noise that then gets created downstream in the system? Because it’s not going to be the NHS, for sure; it’s already full of actual patients who are actually unwell and already have symptoms. What do we do with the walking well who may have a problem? We don’t actually understand the data behind that either.
So these big system changes will come, but we need to manage them through very carefully. And I think that’s where a lot of the value will be added, or in fact detracted, by deploying AI in the right or the wrong way respectively.
Speaker 1: So there are a lot of questions about accountability whenever AI tools are being used, and people ask a lot of the time, what if we buy a tool and it just makes mistakes, who is going to answer for it? I think there’s no definite right or wrong answer to this. I listened to a recent podcast by Annabelle Painter at the RSM with a lawyer who was talking about this accountability issue, and it was really interesting. This is a lawyer who does a lot of litigation, negligence cases in the NHS, and what he was saying is that he hasn’t yet come across a case against an individual involving the use of AI technology, though it’s probably not going to be too long before one comes about. But in answer to the question of who is accountable, I think there are different levels of accountability, and different people are responsible for different aspects of it. Ultimately, if a clinician uses an AI tool in the clinic, they are responsible for checking the output.
So if I use an AI scribe, I need to check the output. If I use it to check an image report, I need to make sure I have checked whether I agree with the result. The organisation needs to have oversight of the device, to make sure they’ve done their checks with the provider, they have a clear governance structure in place, they know how to train the staff, and staff know how to report incidents. And there is of course technical accountability: if you buy a tool from a company that says its accuracy rate is 90%, and then something happens with the model they’re using and the next thing you know it’s 50% accurate, they are responsible as a provider for giving you a device that is good. So different people are accountable at various stages of the AI journey. Implementation and evaluation go hand in hand. Some good tips for practice, and we’ve been through this throughout the talk: start small and iterate, see how it’s working in a small group, and make sure you can iron out any problems; don’t suddenly roll it out across the organisation and then find that it’s not working, which is quite an obvious point.
Multi-stakeholder engagement: I would emphasise involving patients very early, because that helps with building confidence, getting their views and seeing what is tolerable and acceptable from the patient’s perspective; having robust project management in place; and continually improving by learning from what didn’t go well. Did I not communicate with my staff well? Did patients not like it? Did the microphone not work? Do I have the right things in place to make the most of the tool?
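On the earlier point about a vendor’s claimed accuracy drifting after deployment, here is a minimal sketch of what continuous post-deployment monitoring could look like in code: keep auditing a rolling sample of cases against human review and flag when accuracy falls below an agreed threshold. The threshold, window size and escalation route are assumptions to be agreed locally with the vendor and your clinical safety officer, not a standard.

# Minimal sketch of post-deployment performance monitoring (illustrative assumptions only).
from collections import deque

class PerformanceMonitor:
    def __init__(self, threshold: float = 0.90, window: int = 200):
        self.threshold = threshold
        self.results = deque(maxlen=window)   # rolling window of recently audited cases

    def record(self, ai_output_correct: bool) -> None:
        """Record whether a human reviewer agreed with the AI output for one case."""
        self.results.append(ai_output_correct)

    def check(self) -> str:
        if len(self.results) < self.results.maxlen:
            return "collecting baseline sample"
        accuracy = sum(self.results) / len(self.results)
        if accuracy < self.threshold:
            return (f"ALERT: rolling accuracy {accuracy:.1%} below agreed "
                    f"{self.threshold:.0%} - escalate to CSO and vendor")
        return f"OK: rolling accuracy {accuracy:.1%}"

The useful part is not the code itself but the agreement it forces: who audits the sample, how often, and what happens when the alert fires.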
Speaker 0: 100%. Okay, great. Oh, is this your last one?
Speaker 1: Yeah, so this is the last one. So, questions that you need to consider when you’re thinking about clinical risk. As an organisation, and Dom was saying earlier that different organisations might accept different levels of risk, what is an acceptable level of risk that your organisation is happy to take? I think that’s something you need to be clear about. Who is responsible for the performance of the device? I think it’s really important, and I’m sure most organisations have this when they procure a device, that they have an agreement with the provider about who is responsible for which bit and who is going to follow up, and that you have a plan in place for when things go wrong.
I think that’s the key for the governance structure and the risk assessment. So the key messages we’ve gone through: rigorous risk management is important, strong governance is important, and continuous monitoring is very important.
Speaker 0: Yeah. Awesome. Okay. Well, thank you so much, Ellie. Let’s go through your quiz and see how closely the audience was paying attention.
So if you’ve done this before, you’ll see that there’s now a poll that you can answer if you look at Polls there on Livestorm. Automation bias is defined as which of the following: a tendency to accept computer-generated recommendations against your own judgment; a tendency to ignore AI recommendations due to past experiences with technical errors; or a preference for using the newest AI technology over older systems. So you can have a little vote. We’ve got a few votes in there now, if you’ve managed to find the poll button; it’s just on the bottom right.
I’ve also heard this called reliance bias, and maybe in the future we’ll have other categories of this type of human-computer interaction issue. One of the things I’ve also seen, which we don’t really have a word for at the moment, is the loss of your own ability to do the zero-to-one thinking that you require; we see it with autonomous vehicles, and it’s quite similar. I’ve seen this also in writing: I used ChatGPT for a while to do some writing work.
And then about a month in, I realised I couldn’t start something myself anymore, so I had to stop and start again. That switch in your brain is a muscle, right, and it’s that muscle this kind of problem wears down. And we need to understand that clinically.
Great, yes. So most of you put down that it is indeed the tendency to accept computer-generated recommendations, and it’s an interesting area to study; we should probably start studying it quite quickly. A quick one: which organisation oversees medical device regulation in the UK? Let me just publish this poll as well.
You should be able to vote on that one as well. So again, a slightly transitory one as things are changing a little bit, but the answer we’re looking for is: who is the body to look to today? If a vendor comes to you and says, we’re a medical device, where should you go to see if they’re registered? And yeah, most of you have said the correct answer, which is indeed the MHRA. If a vendor says they are a medical device, go and check them out on the MHRA website and see if they’re registered there.
And then, last but by no means least, and this is a hard one because I’ve noticed I put two easy questions in: which of these is considered high risk by the EU AI Act? Chatbot interactions, emotion recognition, or justice systems? I’ll let you have a little bit of a think as people go through that. If you remember, there was unacceptable risk, high risk, sort of moderate risk and low risk.
So, a few people, lots of don’t-knows I’m guessing, because we’ve only got 4 votes so far. The answer is actually justice systems. And I think the EU is very much thinking about the impact on society as a whole with this legislation. There’s a lot of interest in things like facial recognition, emotion recognition and tracking of individuals, and I think the EU is very nervous about that.
And that actually is a bit of a differentiator in terms of how people are going about implementing AI technologies in society a little bit more broadly. Awesome. Well, thank you very much, Ellie. I’m gonna leave the floor open a little bit now for some questions. Oh, I didn’t change that last slide.
So let me just go back there because like it still says the wrong person. But we’ll stop sharing there and we’ll leave it open for questions from the audience. So there’s a question button. You can press question and ask new questions and they get logged actually. So that is the most useful way of asking questions because then they’ll go into the replay and we can store those or you can just stick them in the chat.
And we’ll wait for any interested parties to do that. Or we just cycle back to the beginning of this presentation. We can leave the opening slide up nicely. Take a look. Perfect.
Any questions from the audience at all? Steven’s typing, see what he is saying. Or you can put it in the question chat. That was very interesting. Thanks very much.
I have one question. Adoption is pretty low at the moment, but let’s flash forward to two or three years’ time, where co-working with some form of AI is the norm for most clinicians most of the day. What do you think the biggest risk is, if you were to worry about one thing more than anything else?
Speaker 1: De-skilling, I think, is a good one.
Speaker 0: Interesting, what do you mean like diagnostics?
Speaker 1: Yeah, it is. It’s exactly as you were saying about using ChatGPT to write. I find the same, because I use ChatGPT and Grammarly; I can’t now write a text message without checking with Grammarly to see if I’m doing it right. You become really dependent on this stuff. I mean, imagine if you have a machine that does a lot of things for you, then exactly as you say, it’s like a brain muscle, it might make you de-skilled. I think that would be my main one. And of course the question is how we are going to keep up with the updates, that’s another big problem: AI models are improving so rapidly, how are we going to keep up with the improvements? Are we going to be stuck with one while something else does better? So it’s a constant iteration.
Speaker 0: Yeah, no, I think the de-skilling one is one I’m particularly worried about. There’s a question here from Adele. She asks: I come from the veterinary field, where there is no regulation of medical devices. How would you approach ethical and responsible AI adoption in a clinical setting where there are no guardrails?
Speaker 1: Is that for me, with regards to animals? To be honest, I’m not familiar with what regulations exist for animals, but I’m pretty sure there would be some; there are lots of requirements from different organisations about how to look after the well-being of animals. So I think as AI does come into use, you would expect some proper regulation, as in law and ethical standards, where you would need to have animal welfare safeguards in place. But that’s a good question, and I wouldn’t be in a perfect position to answer, because I’m not familiar with ethics and regulation with regards to animal welfare.
Speaker 0: No, 100%. I think the model is similar to the one we’ve taken: consciously make a decision that this is a powerful and potentially dangerous technology, and then apply your judgment as best you can. For the vendor, or whoever you’re buying from: what evidence do they have, and how significant is it? There might not be a framework for that, but again, that’s maybe where the clinician role comes in more than anything else. If you’re applying something to the patient in front of you, or your animal patient in front of you, equally there’s still a duty of care in that sense.
And I think that’s probably the only way that we can do this. And then obviously, lobbying for impact on seniority of clinicians or any industry using AI on de skilling?
Speaker 1: That’s a really good question. Not so much on de-skilling as such, but there are quite a lot of studies looking at clinician age and how they take on AI recommendations, about adoption of AI technologies. Some studies suggest that more senior clinicians are less likely to adopt AI, either because they feel like they know it all, or because they’re a bit more old school and don’t use AI technologies as much. But there are some studies that have shown that more senior clinicians are less likely to make mistakes, because they have more confidence in their own clinical judgment. With de-skilling, I think we need a bit more time, because adoption has not been around long enough to study the de-skilling of clinicians, but I think that would be a really good thing to understand.
Speaker 0: I would also love to understand the kind of AI interaction that’s required to not de-skill you. You can imagine a world where the AI says, I know the answer, but what do you think? Right? Very much like active teaching in the same situation. And I can see that for juniors for sure.
Yeah. And I think we have to take that super seriously, otherwise we will accidentally wipe away all the existing algorithms that we have out there, which are the humans, you know, 40-plus years of clinical experience per head in some of the top-level consultants. We would just wipe all that away. And equally, interestingly, I think AI as a consistent source of information is really important where humans are very inconsistent. So how do we marry those two things together?
And it’s a question that I think we’ll have to answer. If anyone is interested in doing some research in the sim lab, I’ve got a whole bunch of experiments I’d love to run and we haven’t had time to think about them. Okay, well, we’re over time; I’ve just checked and there’s nothing else in the questions. Thanks so much for your time, everybody. Just to remind everybody again, the next session will be in a couple of weeks’ time.
It’s about driving adoption in clinical AI, a very different tack, but adoption is pretty critical once you’ve decided you want to get something through your organisation. Anybody who has tried to deploy technology will understand the pains, and AI has some unique and some non-unique elements of that. I’ll be hosting that with Dr. Dave Triska in a couple of weeks. So thanks very much again for your time, and we’ll see you next time.
Speaker 1: Thank you very much, everyone. Bye.
Meeting title: Understanding Clinical Risk in AI
Purpose of meeting: To discuss the various categories of clinical risk associated with artificial intelligence (AI) in healthcare, strategies to mitigate these risks, and the importance of governance and accountability in the implementation of AI technologies.
Date: 22nd January 2025
Time: Not specified
Location: Online
Attendees:
- Dom Pimenta (Host)
- Dr. Ellie Asgari (Consultant Nephrologist at TORTUS)
Discussion points and decisions made:
Introduction:
- Dom welcomed everyone to the session and introduced his colleague, Dr. Ellie, a consultant nephrologist at TORTUS.
- Dr. Ellie greeted the attendees and expressed enthusiasm for discussing the third session’s topic.
Overview of TORTUS AI Academy Series:
- Dom provided a brief recap of the series aimed at NHS CXOs and digital leaders to better understand the rapidly evolving AI landscape.
- Mentioned that the first session covered the basics of AI, and the second session with Dr. Sarah Gebauer focused on evaluating AI in terms of business cases and efficiency savings.
Today’s Topic – Understanding Clinical Risk in AI:
- The goal is to understand the fundamentals of AI to make better decisions regarding the purchase, implementation, and deployment of clinical AI systems.
Benefits of AI in Healthcare:
- Dr. Ellie highlighted the potential benefits of AI, including improved patient outcomes through better decision-making, diagnosis, and treatment plans.
- Discussed increased accessibility via remote monitoring, enhanced patient experience through chatbots and devices, and cost savings by reducing administrative tasks and patient readmissions.
Categories of Clinical Risk:
- Dr. Ellie outlined various clinical risks associated with AI:
- Diagnostic Errors
- Therapeutic Errors
- Bias and Discrimination
- Automation Bias
Diagnostic Errors:
- Explained how AI models can misinterpret data, leading to misdiagnosis, false positives or negatives, and incorrect risk stratification.
- Provided examples:
- A skin lesion detection model incorrectly identified malignant lesions by mistakenly associating the presence of a ruler in images with malignancy.
- Radiology AI models misreading pediatric images due to training exclusively on adult data.
Incorrect Treatment Options:
- Discussed risks where AI may suggest incorrect treatments due to biases in training data.
- Example: AI models trained on specific regional infection data may not provide accurate antibiotic recommendations in different regions with different bacterial profiles.
Exacerbation of Bias and Inequality:
- Emphasized that bias in AI can lead to inequalities in healthcare delivery.
- Cited a study where an AI system underestimated the health needs of black patients due to biased cost data, failing to account for socioeconomic factors affecting healthcare access.
Automation Bias:
- Highlighted the risk of over-reliance on AI recommendations, leading clinicians to overlook errors.
- Mentioned studies showing that clinicians might accept AI output without critical evaluation, potentially causing harm.
- Dom added personal anecdotes about human error and the tendency to trust AI outputs.
Strategies to Mitigate Clinical Risk:
- Dr. Ellie stressed the importance of rigorous evaluation at all stages, from pre-deployment to post-market surveillance.
- Key strategies include:
- Assessing data quality and ensuring it’s representative and generalizable.
- Understanding regulatory and ethical guidelines specific to AI in healthcare.
Regulatory Landscape:
- Discussed the role of regulatory bodies:
- In the UK, the Medicines and Healthcare products Regulatory Agency (MHRA) oversees medical device regulations.
- Mentioned the EU Medical Device Regulation and the EU AI Act.
- In the US, the Food and Drug Administration (FDA) provides guidelines.
MHRA and Medical Device Regulations:
- Explained the classification of medical devices (Class I, IIa, IIb, III) based on risk.
- Emphasized the importance of the intended purpose in determining classification.
- Discussed regulatory requirements, including clinical evidence, technical documentation, risk management, and post-market surveillance.
EU AI Act:
- Introduced the EU AI Act’s risk-based framework, categorizing AI applications from minimal to unacceptable risk.
- Noted that medical devices are considered high-risk under this framework.
- Compared regulatory approaches in the EU, UK, and US.
Mitigating Clinical Risk:
- Outlined a continuous cycle for risk management:
- Identify and monitor risks regularly.
- Conduct clinical validation and pilot studies.
- Collaborate across disciplines (clinicians, engineers, governance teams).
- Ensure ongoing performance monitoring and evaluation.
Up-skilling Staff:
- Discussed the need to improve AI literacy among clinical staff.
- Emphasized regular training and collaboration between data scientists and clinicians.
- Addressed the impact of AI on clinical workflows and the importance of managing change effectively.
Accountability in AI:
- Addressed questions about who is responsible when AI errors occur.
- Responsibility layers include:
- Clinicians verifying AI outputs and ensuring patient safety.
- Organizations overseeing device performance and establishing governance structures.
- Providers ensuring the accuracy and reliability of their AI tools.
Implementation and Evaluation:
- Recommended best practices:
- Start small with pilot implementations and iterate based on feedback.
- Engage multiple stakeholders early, including patients.
- Establish robust project management and governance processes.
- Continuously learn and adapt from experiences.
Questions and Answers:
Impact of AI on De-skilling Clinicians:
- Dom asked about the biggest risks in widespread AI adoption in healthcare over the next few years.
- Dr. Ellie expressed concern about de-skilling clinicians due to over-reliance on AI tools.
- They discussed the potential loss of critical thinking skills and the need to ensure clinicians maintain their expertise.
Ethical AI Adoption in Veterinary Medicine:
- Adele, from the veterinary field, asked about adopting AI ethically in a setting without medical device regulations.
- Dr. Ellie acknowledged the challenge but suggested that ethical principles and duty of care still apply.
- Dom added that clinicians should consciously assess the technology’s risks and evidence, applying their judgment to ensure responsible use.
Effect of Clinician Seniority on AI Adoption and De-skilling:
- Stephen inquired about the impact of clinician seniority on AI adoption and the risk of de-skilling.
- Dr. Ellie noted studies showing that senior clinicians might be less likely to adopt AI due to confidence in their judgment or discomfort with new technologies.
- Discussed the need for more research on AI’s impact on skills across different experience levels.
Action points:
- Explore opportunities for research on the impact of AI on clinician de-skilling to better understand and address this risk.
- Encourage continuous education and up-skilling initiatives for clinical staff to improve AI literacy and promote responsible adoption.
- Plan to attend the next session on “Driving Adoption in Clinical AI” hosted by Dom and Dr. Dave Triska in a couple of weeks.
Artificial intelligence in medicine: mitigating risks and maximizing benefits via quality assurance, quality control, and acceptance testing https://pmc.ncbi.nlm.nih.gov/articles/PMC10928809/
Artificial Intelligence in Skin Cancer Diagnosis: A Reality Check https://www.sciencedirect.com/science/article/pii/S0022202X23029640
Automation Bias and Assistive AI. Risk of Harm From AI-Driven Clinical Decision Support https://jamanetwork.com/journals/jama/fullarticle/2812931
Dissecting racial bias in an algorithm used to manage the health of populations https://www.science.org/doi/abs/10.1126/science.aax2342
Guidance for unbiased predictive information for healthcare decision-making and equity (GUIDE): considerations when race may be a prognostic factor https://www.nature.com/articles/s41746-024-01245-y
Medical Devices Regulatory Reform Roadmap to Implementation https://assets.publishing.service.gov.uk/media/6759a8827e419d6e07ce2b21/Med_Tech_Regulatory_Roadmap_V2_December_2024.pdf
Software and artificial intelligence (AI) as a medical device (updated 13 June 2024) https://www.gov.uk/government/publications/software-and-artificial-intelligence-ai-as-a-medical-device/software-and-artificial-intelligence-ai-as-a-medical-device
MHRA’s AI regulatory strategy ensures patient safety and industry innovation into 2030 https://www.gov.uk/government/news/mhras-ai-regulatory-strategy-ensures-patient-safety-and-industry-innovation-into-2030
Navigating the EU AI Act: implications for regulated digital medical products https://www.nature.com/articles/s41746-024-01232-3
Liability & Healthcare AI. With Majid Hassan, Partner at Capsticks LLP https://open.spotify.com/episode/59Rl3RXFzkB6iuzQS3DE3R
Clinical Risk Management Standards https://digital.nhs.uk/services/clinical-safety/clinical-risk-management-standards
Clinical Safety Documentation https://digital.nhs.uk/services/clinical-safety/documentation
4. Driving Adoption in Clinical AI – Dr. Dom Pimenta
First Broadcast:
20250205
Speaker 0: Okay, hello and welcome again to another episode of the TORTUS AI Academy series. I'm Dr. Dom Pimenta, CEO and co-founder of TORTUS. I was going to be joined by my good colleague, Dr. David Triska, for this session. Unfortunately, he's away for personal reasons, but he helped write the slides and has contributed a lot to this session. So just as a reminder, what are we doing here? TORTUS AI Academy is all about helping NHS digital leaders and CXOs to better understand the AI landscape. So just to have a little recap, we kicked off in December.
This is the 4th of the 5 sessions in the initial series. We did What is AI?, which covers all the basics and the fundamentals. I'll share a link to our AI Wiki if you haven't seen it already, but you can watch all these sessions back and there are lots of resources on those sessions from December. We looked at evaluation with Sarah Gebauer in January, and we looked at clinical risk a couple of weeks ago with Ellie Asgari.
Today, we’re going to do a great topic and we might go a bit off piece today because we’re going to delve into how I look at adoption in clinical AI. And then next fortnight, we’ll probably end up doing quite an interesting panel session with a bunch of leaders in the space to sort of think about the future and you know what healthcare could look like in 2 years, in 5 years, in 10 years. So that one’s going to be super fun. Okay, So today, we, oh, yeah, that’s a good point. The goal of this series is not a computer science class.
It is not to delve into the deep ethics or how the models work or the maths behind it. The goal is to simply make better buying decisions. When you’re looking at clinical AI from a vendor, which is going to become more and more part of the skill set that we all need if we’re holding budgets for our hospital systems or organ patients. So, it’s all about wisdom and understanding the different elements and the complexities of how to do that. So it’s a crash course in AI for healthcare from concept to clinic.
So today, we’re going to cover adoption in clinical AI. So we’re going to talk about practical insights on implementation and how to think about implementation. And we might go a bit off piece there into product management, which I think is actually a slightly more interesting way of talking about this. How to persuade the most skeptical and what to do with sort of trust, credibility, how to roll out, how to monitor and how to ensure ROI. Okay, so to kick us off then, what are we talking about?
Now, this is interesting for me. I’m 16 years in the NHS. I’ve been in tech for a couple of years and actually how people talk about adoption is quite different. The one thing everybody acknowledges is driving adoption does not simply mean to tell the people to use the thing. And you will hear me refer to the thing a lot as the generic technology or product or software or whatever we’re trying to implement.
There are lots and lots of words for this in the NHS, some of which you may have come across before, change management, transformation, digital delivery. But it all means the same thing. How do we get people using something that they weren’t previously using? So a really good way of thinking about this and maybe a bit different to how we thought about this before is to think about how tech thinks about this. So in my mind now, and this is something I’ve been learning about personally, like crossing the fence from clinical into tech, adoption is product management at its most basic.
So what we’ll do now is we’ll go through some very simple basic definitions of product management and then look at adoption through that lens. So, I have a really simple definition of what a product is in the tech space. A product is something that transforms an input into an output. And a good product does that for a reason. So chocolate kettle being a terrible product, for example, solves no problems.
Good products solve a problem. And we’ll come back to that because understanding that you’re trying to solve a problem first helps you decide the quality that you need for the technology and the product in front of you. So here’s my favorite example of, you know, man’s first product or human beings’ first product, I should say. So let’s have a look at this. So here’s a problem.
We’ve got some cold people who are hungry. Somebody invented the product of fire. So it’s obviously a Neanderthal time product. And it was solved with warm people and cooked food. Easy peasy, right?
So what is product management then? So product management is the continuous optimization of that input to the output. So first fire invented, warm people, cooked food people, and then people started moaning or feedback. So, Ah, the fire took ages to build, or It’s super inconvenient, or The logs were too wet, or it’s very inefficient, or, you know, we all fell into it and burned alive. So, fire by itself is a product and product management iteration is about taking feedback from that product and trying to figure out is there a better way to optimize this input output.
Flash forward 10000 years or whatever. We’ve decided the best way to optimise the problems that we’re trying to solve then is to split them up. People are hungry, people are cold. We rarely try and put those 2 things together anymore. We’ve optimized the experience products for both of those problems very specifically.
So we’ve invented ovens and we’ve invented radiators and now we have cooked food, which is convenient and accessible and controllable. And we are warm and it’s all automatic and in the background and we’ve got lots of fancy thermostats on top of that. But I think the point is, we both know we’re inventing new ovens all the time, air fryers, shout out. And we’re inventing new ways to stay warm: heaters and thermostats and automation and AI. So this is a continuous process.
And it’s really important to think about this mindset when you come to think about adoption of technologies, especially in fact AI technologies, simply because it gives you a roadmap to think about, well, what are we actually trying to do? So before anything else before technology, before AI, before workflow, we need to ask one simple question: how well do we understand the problem that we are trying to solve? So here’s a nice healthcare orientated example, completely hypothetical, but based on some real world data and some real world problems that have occurred. So, a problem that we’re trying to solve with technology: Do Not Attend rates, DNA rates, are higher in lower socioeconomic backgrounds. So we think we’re going to build ourselves a product, fancy AI machine, automates sending SMS text messages to patients in those cohorts due to come in the next day.
Sounds like a great idea. And then we get some feedback, completing that loop: no change at all. Nothing happened. Didn't make a difference.
What happened? Perhaps we didn't understand the problem that we were actually trying to solve. Again, this is based on some real-world cases. The DNA rates are higher in lower socioeconomic groups, but not because they are lower socioeconomic groups; it's because there's a correlation in those groups generally with patients who don't speak English as a first language. So we misunderstood the problem that we were trying to solve, and therefore we built the wrong solution.
The correct solution was to send letters to the patients in their first language. And obviously we solved this, and then the clinics were too busy; they were already overbooked to account for DNAs. So, we bought ourselves a new problem, which is often how these product cycles actually work. When you change a workflow and solve one problem, you create another one.
And that’s actually a good thing, right? You’ve solved one problem, you’ve created a new one, you’re constantly iterating. But to think about it as a constant process as opposed to 1 and done, everything’s going to be great because it rarely is and there’s new processes that you have to manage and that’s just the nature of deploying any technology. So once we know our problem, then we have to know your user. I have to say, this is actually my favourite webinar so far.
So what do they actually do? And I think that's a really hard question to answer. At the top level, you know, for example, what does a postman do? I can speak from a little bit of personal experience; my brother was a postman for about a year, about 25 years ago.
What do they do? I mean, we know they deliver letters. But when we ask this question from this perspective of trying to solve a problem for a specific user, a specific clinician, a specific workflow, we need to basically ask the question, what time do they wake up? Do they use an alarm? Where do they keep their clothes?
Do they need to keep their uniform? How many uniforms do they have? Do they have to wash them? Do they dry them? Do they iron them before they leave the house?
What shoes do they wear? Where do they park their car? Do they drive there and back? Do you see what I mean? The complexity and granularity of what you need to actually understand to be able to solve a problem is something that I think we're actually quite bad at.
And that’s universal as human beings. We think we know what’s going on over there. But until we’ve lived that experience and looked at it, we very rarely actually know enough to solve a problem for somebody over there. And that’s the same thing in the hospital system and clinicians as it is in any other sector of tech. What are they afraid of?
And I’ll talk specifically around that because that’s a really important part of driving adoption in any technology, but really, really important in AI. It new technology? There’s a credibility issue around it. People have lots and lots of existential worry. Lots of stuff is happening all the time.
It’s scary. So understanding that is really important. And what are the pain points and how do we characterise them? What are they trying to solve that’s causing them pain that will drive adoption if you fix it. So, there’s some really nice things from the tech world that really characterize this.
Many of you, if you’ve ever worked in any sort of change, transformation or implementation team or any product team for that matter, will have seen this before. So this is what’s known as the 4 forces of change. You have forces that push you from a before state to the new thing. So, you have a push, what pushes you away from the old and what pulls you towards the new. So, what is the pain that’s pushing you to change your behavior?
What is the pull of the new functionality that attracts you towards that change? And, you know, we talk about strong pushes being painkillers, And we talk about sort of nice pulls being vitamins. But equally, it’s what directs you from before to the new thing. And then we have the forces of change in the opposite direction that keep you staying in the status quo. So anxieties, again, super important when we talk about technology, especially AI.
What is the user worried about that prevents using the thing? And habits. And this, I think, is a really key one, especially in clinical workflows where everything we do is extremely ritualized. If you do anything 30 times a day, it will become a habit. You cannot help it.
That is how our brains are wired. So if you're seeing 30 patients a day, whatever you repeat in those workflows will become an intensely strong habit. And if we look around, most of the failures of adoption of any technology in the clinical landscape are because we were trying to change a habit that was happening multiple, multiple times a day. Trying to change ward round discharge processes, trying to change the way the doctors do their consultations, trying to change the way the nurses do their drug rounds. These things are really hard because they are habitual, and breaking them needs to take account of all the pushes and the pulls and all the anxieties.
So this is some really interesting data. Let me pause this. This is a video, and I don't know if I can zoom in, but what we've been doing at TORTUS is studying clinician workflows. This was the only data I had to show you this with.
I don’t tend to veer too much into it. But we’ll use Ambient Voice Technology implementation as our case study because that’s where we personally have a whole bunch of knowledge and data about what we’ve been doing for the last 2 years. So what this is, is it’s a time graph. Each one of these dots here represents a single clinical episode, right? So, this is one clinician doing a whole clinic.
So this is roughly, I think, about 10 different patients that she saw over the clinic. We're stacking them up and we're playing them back to you in parallel, so you're basically watching each clinic happen at the same time. And to measure this, we had a human observer in the room. In the centre is when the clinician was interacting directly with the patient.
Around the outside are a whole bunch of other tasks that they could potentially be doing: reading the computer, being outside the room, doing notes on their computer, so typing, not doing anything at all, writing physical notes, ordering things on the computer. And there are some other elements to it. So as you watch this, you'll see that constantly, in every patient interaction, the clinician is flicking between multiple tasks. We call this context switching in the industry, and we use that term in multiple different circumstances. It's very bad for the brain.
We know this from Twitter and Instagram as much as anything else these days. Context switching is really important to understand when you're looking at how user workflows are working, especially when you want to break that context. When you want to introduce something different, you have to understand how noisy it is already. So this is the workflow running; let's see, she's done some clinic appointments already.
So, there’s a sort of end of finish day. But they keep going back and forth between computer notes, between computer read, between direct care, back and back and back and forth. So, we mapped that out and that gave us a lot of insight. And then what we did is we introduced Ambient Voice Technology. For those who don’t know what that is, it’s listening to the computer, listening to your conversation.
You don’t have to really take notes contemporaneously, you know, at the same time as typing, talking to the clinician, and it’s doing stuff in the background. So you can see the difference. Most of that context switching now is eliminated. And now the clinician is basically spending almost the entire time on direct care, occasionally going to the computer, occasionally going to the computer read. But what’s really interesting from us is we’ve changed the workflow again.
And now the computer is not being used for large parts of the consultation. Whereas before it was intermittently being used all the time. So there’s a change in workflow and there’s a new problem to solve or a new opportunity to solve. Maybe that computer could be doing something else by itself. And you’ll also see that consultation is now a little bit shorter.
So the workflow has changed and some of the habits have changed. That's a sort of very interesting, pretty video to show you. But really, to emphasize, this is the kind of level of detail you really do need to understand before implementing any technology into any workplace, especially a clinical one. So then let's take a look at AVT, continuing the case study. So again, ambient voice technology: speech-to-text AI plus large language models that listen to consultations and create notes and letters, technology most of us are quite familiar with at this stage in our lives.
So let’s look at the pushes and the pulls and the habits and the anxieties using that use case. And this is from our own learnings. So what is the push? The context switching, you’ve already seen that. It’s very heavy cognitive loads.
It takes a lot out of you to constantly switch between keyboard, between mouse, between pop ups and the patient. Who’s often telling you things that are very important from a diagnostic point of view or clinical significance. What is the pull? Product could be very simple to use, could save you a lot of time, could improve patient contact, a few different pulls and technologies, but it’s primarily the push, actually, for this technology of how much work you’re currently doing that you just don’t want to do anymore. But we also have to acknowledge the anxieties.
And these are something that we’ve learned from experience, talking to lots and lots of users for this specific technology. But any AI technology, you won’t really know what people’s anxieties are until you put it in front of them until they start using it and you just ask them. So the other principle of implementation, especially for AI, is you really do have to try it. And you have to try it in real world environments where the stakes are very high and try to figure out why don’t people use it. And that will help you with adoption at scale.
So anxiety, is it going to miss something? Big anxiety for this type of technology. Is it going to sound like me? And we have to remember that often when you’re making clinical notes, that’s what you will read back in the coroner’s court. So, I was always taught write your notes to the judge.
Similar thing here, we’re looking specifically at technology that can change the documentation style. So, it has to represent you. You have to have minimal edits to make sure it does represent you. And is it going to break? So, one of the big anxieties here is if you ask the clinician to not type anymore because they don’t need to, the opportunity cost is they didn’t take any notes if they were normally typing.
Some people actually do write handwritten notes as well, especially for long consults, but that's the exception. So if it's going to break, robustness becomes really important, because you've suddenly lost everything and you haven't made that record. And the opportunity cost of using an AI technology versus not is not something that we often factor in. But that's a key anxiety. And lastly, habits.
Keyboard typing is very habitual. Trying to break that is really, really hard. And there are other habits that we don't necessarily know that we don't know about. One of the most interesting things that I've observed is that when you take the keyboard away, there is no longer an excuse to look away from the patient when you're actually interviewing them. And that sounds like a really weird thing to say, but any clinicians on the call will know exactly what I mean.
Sometimes you'll use the computer time to think, or you want to look away from the patient while you're talking to them. But if you take that excuse away, that context switching, A, you can't interrupt the patient, which is much better for them, but B, you're suddenly having to look directly at them, which is a normal human interaction. So the humanity gets restored. But yeah, it is quite uncomfortable, and it's uncomfortable because it's such a strong habit to not look at the patient when you're talking to them.
It's a good thing to restore, but just to acknowledge again, that's something habitual that can, for this specific type of technology, reduce the ability to get it adopted. I'm just going to pause there and see if there are any thoughts or questions. There's some chat. Does TORTUS recognise any other... oh yeah, we'll come back to some questions.
Okay, so I think this is also important to understand that the push will be different in different clinical settings and with different users. So what motivates somebody to change, even with a similar technology, so again, sticking with this case study of ambient voice technology. In primary care, it’s probably quite different from what it is in outpatients. I mean, the technology is very similar, documentation is very similar, but what actually the drivers are, are different. So what we’ve observed in primary care, it’s very much about quality of life and reduction in cognitive load.
In outpatients, there’s a much higher per patient administrative burden, if you think about all the letters and notes and sometimes very long complex documents, which is not accounted for in short consultation times, there’s lots of pyjama time. So reducing that time outside or at home when you’re writing out complex notes or NDT decisions or whatever it might be, that’s a different use case, different push to understand. A and E, it’s all about throughput. It’s all about how many patients can you see per hour. And in mental health, there’s different use case again, very long consultations.
When you’re actually a 90 minute consultation, you personally are not going to actually maybe remember without taking notes what you started at, especially if you’re doing things like mental state examination, for example, during the consult. You might want to have some non verbal observations as well. But certainly you could argue that it’s a better patient rapport than typing during the consultation. So therefore, rolling out by department, you need to optimize. What is the push and what is the pull?
So that’s some basics of product management. But a really nice way, I think, to think about it. And certainly, obviously, the way that we think about it as a tech company. But really, it’s something that we’re trying to push more and more people to think about, especially for AI adoption of any technology, really. Okay, so some basic things that we’ve learned from implementation.
I think this is one you read in lots of textbooks, but this is really, really true. You need a champion. And actually that champion has to have a certain number of characteristics that are actually pretty repeatable across most of them. We’ve done this a couple of dozen times now, and actually it’s almost always exactly the same person. So they need to understand the push.
And what I mean by understand is that depth of understanding: like, I've lived this problem, I understand why we would want to change it. So ideally they're a local clinician leader, or at least somebody who is very familiar with the problem. You know, they're an A and E consultant and they're helping their A and E department adopt the technology, or they're a radiologist and they're helping the radiology department adopt the technology. Tech has always understood this, but I think for any implementation, having these people on the ground and super engaged, and actually having the, what's the word, the ear of the department, I suppose, is really, really important. They also have to be the evangelist for the pull, right?
So everyone can understand the pain, but sometimes they'll disagree about where they need to go to get away from the pain. So a leader will be able to evangelise and love whatever that product is: I use this AI score every day, it saves me so much time, or it's so accurate, or whatever that might be. Often evangelists are converted by science.
The one thing I left off this slide is that often the conversion is what creates the strongest champions. So if you walk into a room and you're trying to get an organization to adopt some sort of clinical AI thing, the cynic in the room that you can convince, with evidence, with the product itself or whatever, often becomes the strongest champion. And vice versa: your strongest champion who gets disillusioned will often become your biggest cynic. So you do need to be really careful about how you manage those conversions, but managed in the right way they become a massive asset. They've already overcome their own anxieties.
And I think that’s really important. Again, if you haven’t lived it and you haven’t used it, there’s nothing more powerful than someone saying, I’ve used this technology. I know you’re worried about X. I was worried about X too, but this thing is fine. Or I’ve overcome it this way.
Or, I just do this one thing. And that's a really, really important part of the champion: again, being local and being connected to the team that you're trying to convert. And lastly is understanding the habits: how strong they are, what's important, what isn't important, what kind of little hacks can you get out of those habits? Human beings, you know, from a neuroplasticity point of view, are very malleable in general.
With these types of technologies, sometimes the workflow change is quite big, sometimes it isn't. And I think the best products sometimes incrementally change you to do something else, or to do something in a different way, and let you discover it naturally. Now, this is some other secret knowledge, which I don't think I've ever read in a textbook, but something that we've observed: champions are often visionaries. They're often great leaders. They're often really good evangelists.
They aren’t the best day to day, do the boring stuff well, operators. So we term that operators. So if you’re thinking about implementation, especially for clinical AI, you need to pair these people up. Find your champions and then find around them the operators that believe in that champion, but can deliver. And they can deliver in the, is the training done on time?
Did people come to their dates? Did they sign the microphones in and out? And that maintains the momentum because the champions can get drained. It’s very draining to try to change a department, for example, to adopt something because that takes a lot of positive energy, whereas the operators can sort of deploy that, make sure it happens, get people through it, get people get people excited and continue to be excited about it and maintain that momentum. You do need to put those 2 people together.
It’s not obvious at all but even now, some of you are listening and thinking, oh yeah, I get it, I know who an operator is or a champion is. There are people that you’ve met. And thank God the NHS is for the both of those people, because otherwise nothing would actually get done. Okay. Adoption is not a quick and it takes a lot of effort.
It takes time. It takes trust, especially for AI. And it does take a lot of trial and error. So time is really important. Clinicians are time poor.
Can’t is this a PC time? You can’t half ass this, right? It’s not something that you can throw something in front of somebody. If they have a bad experience, often that is the experience. It’s just like your first impression of a person.
You meet it for the first time, it doesn't really work, and you're probably not coming back, at least not for months and months. Some hacks here: make sure that people have time and space, especially if you're trying to originate something with a champion. They need time and space and resource to do that.
Scalable, highly engaging training materials. And I say that as somebody who's done the "what is fire?" e-learning many, many times in an NHS career. They need to be engaging, they need to be something that shows the product or shows the workflow or speaks to the habits and the anxieties; really be thoughtful about that, because that's very scalable. We, for example, now have an e-learning platform that has been very useful for us in getting people to understand the technology and have that moment, as we call it, much quicker. Now, trust.
Trust, I think in AI especially, is absolutely critical. It cannot be underestimated. And AI is at that stage where people being cautious is not necessarily the wrong thing. They want to understand the evidence. They want to understand how you've assessed it.
They want to understand what it might mean for them, for their patients, for their medico-legal position. I'll come on to that in a second. So how do you build trust? I think evidence is really important. If this is the right audience, then published papers are very helpful, and education as well.
But I think the biggest thing to make sure you can get people to do is to play with it. You can tell people about something as much as you like, and most people only listen to about 5% of what anyone says anyway. But when they try something, then you’ll see either it’s something that they’ll understand or they won’t, and they’ll understand it more or less instantly. So again, when you’re looking at technologies in the clinical AI space, putting a product management hat on in this sense is so important because you really think, okay, what’s the initial user experience for the actual people on the ground who I think want to use this technology? Whether it’s the admin staff or the clinic staff or the clinicians themselves.
What is their first onboarding experience? How do they access the UI? These things aren't just prettiness. They actually define the experience, get them to be able to start using it, and then let them start making new habits so it gets adopted. Time and time and time again, we spend a lot of money on technology and we miss this one bit: it doesn't matter how much the hospital or the organization has spent on it.
If they don't like it and it doesn't fit into their workflow easily, it will be really hard to get much value out of it, because people just won't use it. And that's just the reality. Okay. So, for change, it's a team sport; at least in the best instances that we've seen, it's a team sport.
So the most successful implementations, and I'm thinking now of a very specific and very, very good implementation in one of our A and E sites, have been teams taking it on all together. And there are lots of different reasons for that. It builds community knowledge and lowers mutual anxiety. We're all trying something together, so it takes down that individual anxiety, that individual risk. It creates a social pull.
Everyone’s doing that cool thing over there, let’s go do that. And actually, interestingly, we observe that people naturally form tips and tricks, hacks. They start sharing ideas. It becomes a very exciting way to integrate. And then you lose all of that if you try and lots of people say, oh, anyone in the hospital wants to use, or anyone who wants the org wants to use this individual technology on their own.
Actually, that’s quite isolating. And what we’ve observed is when you start like that, it’s actually very hard for that one adopter to spread it even to the colleagues adjacent to them, Because there’s a different mentality, whereas if they all start and learn together, then there’s something that bonds during that. And then some AI technologies do require networks to maximize effect. I’m struggling to think from off the top of my head, but there’s certainly some in that category. Get ahead of common anxieties.
I think some of the things we’ve heard, again, I’ll talk specifically around some of the things that we hear sometimes around AI, are pretty common. And I think rather than trying to wait for the question or skip around the question, just hit it up front, hit it in some training materials, talk to people about it, acknowledge their fears. We’re really bad at this as well. So being replaced, this put me out of a job. The learning curve and the change fatigue.
Oh my God, it’s not another IT thing. Oh no, it’s a cool IAI thing. Privacy and security. Where does my data go? People are really worried about that, especially in a lot of AI.
And then medical legal concerns. What will I say in court? What’s my defensibility? How does this change? These are the things that everyday clinicians always worry about.
But it’s really important that, A, we thought about it, and B, that we communicate those solutions to the people in our office. So, I thought I’d just finish here, before we get into the question and answer time, with a bit of a cookbook. This is something that we’ve internally been working on some of the thoughts that we’ve had that may or may not be useful if you’re implementing any AI technology. So this one I’ve slightly adapted from Dave, but start small and waterfall. So obviously you know what a waterfall is, but like, I think it’s really easy to want to try to roll out big, I’m not talking about like do a little pilot and then see what happens.
I mean, when you want to talk about an actual rollout strategy, this is probably one of the strongest ways of doing it. There are 2 ways of doing it. One: if you have an AI technology you could use anywhere in the org, and I'm not talking about specific radiology technologies, but something a bit more ubiquitous like ambient voice technology or a scoring system or something, find the highest-value point, right? Where those forces of change, the push and the pull, are the strongest, and the habits and anxieties may be the weakest. Because what you really want to do is start a new technology somewhere you will see an instant change, instant excitement, and people will be really excited about that.
And start it off there. Find your champion and then several operators behind them to make sure that's sustainable. And then resource them with sufficient time and materials to train a majority in that area all together. That's one team. And then use that team to seed the next 2 teams, and then use those teams to seed the next 2 teams beyond that.
And we've seen this technique used quite a lot for non-AI technologies in organisations. It's a much cheaper way than paying hundreds of thousands to consultancies to achieve transformation. But it does require really deep engagement and really good champions. So there's a lot of energy to put in at the top end, but this is a way to propagate. And you often see this with EHRs: superusers, floor walkers.
There’s lots of models for this. This is a non obvious one. Measure what you already measure. So we, I mean, as a company, we’ve been running the largest trial of ambient in Europe, right? With human observers.
And it’s incredibly expensive. It’s really hard to maintain. The quality of the data you get really depends on how well you train your human observers to do it. So everyone thinks that if you want to try a new technology or you want to try something, the best thing to do is just have people watch, do a proper study, apply all the pharma elements to it. But actually, in our experience, it’s not actually the best thing to do at all.
It’s not scalable. You can’t do large amounts of data with it. Most systems now will have some data. They know how deep it is. It’s very variable in our experience.
So, for example, if you want to run an A and E study, what do you know or measure already about patient flow in A and E? Have you got the check-in time for the patients? Have you got the time to the decision to admit? What have you got stored somewhere that you can then compare, if that's what you're trying to compare? And everywhere you can, try to use what you already have.
If you're already somehow measuring the quality of the notes, or you're already measuring time to discharge, don't change your metrics. But make sure that whatever you're implementing has a meaningful change that you can measure. If you really get stuck, try and use humans sparingly. But if you want to do something at scale and demonstrate things that are robust, you really want a large throughput, like 5,000, 10,000, 20,000 patients. It's very hard to do that reliably and sustainably with human beings.
This is my favorite one, and it's super obvious: everybody is different. I mean, that is an obvious thing to say. But what I mean is that, for technology, you can't, A, expect to win everybody's heart.
And B, customization is also really important. That's something you can genuinely do with AI technology, and it's a really key difference from previous software systems or digital health systems, which were very fixed. AI, especially large language models, being probabilistic, can be very customizable. And that's a big part of the journey. If the users that you're trying to get to adopt technology inside your organization can make it feel like their own, you've already won half the adoption.
Like Epic: you can do a lot of customisation, people can already make their own templates, and that's true in lots of EHRs. This is stuff that we recognise, but again, ask vendors and think, okay, how does the user make this their own? What's the user journey? And some of that delight, some of those elements, are really important for stickiness. Again, prettiness and stickiness can be separate things, but actually they often go together.
If it's an enjoyable thing to use, people will use it, because we don't make rational decisions, even in healthcare. And that's something to factor into how you calculate your ROI. So, don't expect to win everybody. The best examples that we've seen thus far of system-wide adoption of older technologies like voice recognition rarely get above 30%, and that's a really good technology. If it's something that's mandatory, EHRs are a good example of that, then you're not leaving people with a choice.
So, are they using it to the fullest? Probably not. Are they optimizing it for the data model that you wanted to create? Probably not. And this next one is something that's really useful, something that I've tried to apply to most of what we do here at TORTUS and actually to lots of things.
So there's a company called Superhuman. They make an email app, which is a weird thing to try to make because there are so many incumbents. So they tried to create a framework to understand how to improve their product, based on a very, very simple question. In product management, we use a term called product-market fit, which basically means you've built something that people want, in the simplest terms. But how do you work that out?
That’s really hard. If you find one evangelist that loves it, is that enough? Do you need everybody to love it? Is that enough? And how do you know how to make it better to get to that maximum state?
So in this framework, you literally just ask one question. How disappointed would you be if we took the thing away tomorrow? And they only have to say, I’m going to be very disappointed, somewhat disappointed, or not at all disappointed. And the genius, I think for me, is what you do with the response. So the users in your very disappointed box, if they say, I love it because it’s so fast, then you say, I’m going to make it faster.
Or they say, I love it because it’s so accurate. And you try and make it more accurate or you make the accuracy more visible to them. The people in the Somewhat group take their feedback with a little bit of pinch of salt. If it’s easy to do, or you can do it, or you can change something meaningfully, then that might move them up to very. That’s what you should do.
But if it’s very hard to do, it’s very off piece and you can ignore it. And the genius other point here is that what you do with the people who say, I’m not at all disappointed. I don’t like this thing at all. Just ignore their feedback because there’s nothing that you can do that will ever win them into that daily users implementation. And actually focusing on a core segment that use it all the time is way more valuable than creating something which hardly anyone uses or really likes to use, at least to its full extent.
And then you kind of want to get to about 40% saying they’re very disappointed and then you know that you’ve got a great product that people are actually going to use. And then lastly, it’s just a journey. And as I said to you at the beginning, once you change a workflow, then you have a new problem. Then you have to start from scratch. What is the problem?
What am I trying to solve? What do I need to iterate? And again, it's just having that implementation mindset. Okay, great. So we are going to go through some of these questions and then we'll come to some of the questions in the chat.
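As a side note, the Superhuman-style survey just described is simple enough to tally in a few lines. The sketch below is hypothetical: it just counts the share of "very disappointed" responses against the roughly 40% benchmark mentioned above, and the response labels are assumptions rather than any survey tool's actual output.

from collections import Counter

# Hypothetical responses to: "How disappointed would you be if we took
# the thing away tomorrow?"
responses = [
    "very disappointed", "somewhat disappointed", "very disappointed",
    "not at all disappointed", "very disappointed", "somewhat disappointed",
]

counts = Counter(responses)
very_share = counts["very disappointed"] / len(responses)
print(f"Very disappointed: {very_share:.0%}")

# Rough benchmark from the framework: ~40% "very disappointed" suggests
# a product people will actually keep using.
print("Benchmark met" if very_share >= 0.40 else "Keep iterating")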
So, if you remember, I often submit this poll. Yeah, nice. So, in tech, a product is defined as something you can buy in a shop, a process of converting an input to an output, or a mathematical equation. And remember, this is in tech. So I’ll let some people answer it.
I'll have some of my lunchtime coffee. Yeah, very good. So, a process of converting an input to an output. This is not an official definition, by the way, this is just my definition, but it's an easy one to think about when we think about this.
Hold on, how do I get rid of this? Very good. The 4 forces of change are defined as: push, pull, habits and anxieties; earth, wind, water, fire; or money, money, money, money. And that's kind of a joke, and it's not really a joke, because obviously money is extremely important.
Yeah, again, great. Most people are going for push, pull, habits and anxieties, and I think, if nothing else, this is probably one of the most useful ways of thinking about whatever technology you're looking at: making sure that you've filled in all those boxes, so you've really thought about why people would want to change and how you do that. Okay. I'm going to vote for money, money, money, because that's also true. Very good.
And then last, but by no means least, and this should be very easy because we’ve literally just talked about it. The superhuman product market fit framework asks you just one question: If the thing broke tomorrow, would you fix it? If the thing could be anything, what should it be? Or if we took the thing away tomorrow, how disappointed would you be? Which I appreciate is a bit of a double negative, but it’s really important because it’s one of those things that human beings think a lot more about loss aversion than they do about getting something new.
So actually it’s a good way to phrase it because it makes you focus and you get more reliable answers as a consequence. And then obviously people go like, I think. Very good. Okay, well, that’s it from me. We’ll do some Q and A.
Just a reminder, we’ve got one more session coming up in 2 weeks. This session will be recorded and published. You can access it on the AI Training Academy Wiki, which I will put in the chat and send out shortly. And you can also just get in touch with me at domtortoise.ai if you have any other questions. 2 weeks’ time, we’re going to talk about the future of AI in the NHS and just ask anything you want about AI.
To a bunch of clinical experts. So, let’s have a look at some questions. We’ve got a few different ones. Let’s go through chat first. Nice visualization.
Waseem Ahmed asked, does TORTUS recognise other languages or vernacular speech? Yes and no. Yes, in that it does recognise them; no, in that it won't do anything with that deliberately, because we haven't built clinical evaluation systems for languages other than English as yet. And therefore we're wondering about whether there's an emergency-use-only kind of button, but we haven't quite figured that out yet.
I mean, for now, it's English only until we can do it safely. The models themselves are capable of hundreds of languages, but we don't know what their accuracy is, and that's why it's no for now. Tim Hunter says, I wonder about the change in clinical practice. Templates form part of patient safety.
How do we add patient safety frameworks to ambient without visual templates? Again, interesting: you've changed the workflow, you've taken away a template that they were filling in, so now you have a new problem to solve. And I think it was something, again, we've been asked about. Maybe we just show the templates back to the user as a basic way of doing that, to reinforce that safety.
And maybe go one step further and add some real-time checking, where it can check that you've done parts of the template. So we can probably do a bit of real-time work together. Some more questions, and keep asking your questions. So let's have a look at this one: what do you think are the biggest hurdles to utilizing AI in primary care?
That’s actually a really good question. I think there’s a few things. I think we’ve gone through a phase of early adopters. So, if you when anyone has any new technology, there’s a curve. The people that start using it are mad and anyone starting a new technology is always mad.
They are a certain subset of users that are really techie, really understanding, always want to use the cutting edge, right? Then you have a gap and that gap is often quite hard to get into the mainstream. And then you have the mainstream users who understand the technology and are happy to use it. I think one of the things right now is that a lot of the early adopters, that’s a saturated market, but their needs, which are quite techy and really customizable and doing a whole bunch of stuff, are actually quite different to mainstream. So, now we’re moving towards widespread and scale use.
Those in primary care actually want a digital experience. They want it to be simpler, more straightforward. They want to have trust. So, I think there’s a user adoption. And obviously there’s a money adoption.
In primary care, specifically in the UK, as we all know, there’s a real squeeze on cash. Procurement rules are pretty tricky. But also some of the governance around this stuff, especially in primary care where it’s left up to each individual practice sometimes to do a lot of the paperwork without much support, that’s really hard. And that creates a lot of friction. So I guess one of my things I maybe missed out in the push and the pulls and the habits and anxieties is like, well, how do you actually access the technology in the first place?
What’s the paperwork? What’s the friction to do that? I think that’s probably one of the biggest hurdles: procurement and the IG side, the trust that somebody’s done it so you don’t have to do it yourself. If you could solve that, you could actually get adoption much, much, much more at scale. Ian Robinson has asked, can you please summarize the best available evidence for the use of AI scribes in a healthcare setting?
Yeah, I mean, without having to quote papers, it's a really tough question. I think there have been a few papers published in reasonably good journals. The NEJM published a study of DAX, but that was from 2 years ago; so that's Microsoft's product, but from 2022. There was a study from Nabla at Kaiser a couple of years ago.
There have been a couple of other nameless vendors. We published some of our own stuff; there's a published paper in JMIR last year, which looked at an earlier iteration of TORTUS. I think there are a few threads that come through every single paper. One of the ones that's most obvious and almost undeniable now is cognitive load.
There are lots of different ways to measure that, but essentially it means how hard or frustrating did you find the task with and without the technology. We use something called the NASA Task Load Index, but there are variations of that. It was invented by NASA to figure out how hard tasks are for astronauts to think about. And we see reductions of like 60%.
And that’s mostly because you’re only doing one task instead of trying to do 2 or 3. That’s pretty universal. I think time saving is quite difficult. We have observed that if you don’t actually meet the user, to go to my point about the workflow, if you don’t meet them where they are and it minimizes their edits for their workflow and what they need, you don’t really save a lot of time at all. It has to be that sort of experience where it comes out exactly as you want it with no edits to really save the maximum time.
We’ve seen and many people have seen, some total time saving. We’ve definitely seen increases in direct care, so more time spent with patients. I mean, that’s an obvious one. Quality of documentation. It’s Very hard to measure.
So lots of people try to measure it in different ways, successfully and unsuccessfully. The Kaiser paper from a couple of years ago kind of looked at a subjective analysis of like, this is pretty good, but it was just sort of a visual analog scale of doctors saying, one’s 10, this is good or bad, not super reproducible. We’ve tried a bunch of staff who use a SAIL score, which is the Sheffield assessment of inpatient letters, and PDQ-nine you could potentially use, but it’s not really fair. So, quality is actually quite hard to measure. Some of the things I’d like to measure that no one’s measured, there’s no published data, is like interrupting with your decision making, your cognitive flows, simulations.
These are stuff that we kind of need to dig, but it is kind of not a great space actually for evidence. And we’re trying to sort of correct that with some studies at the moment. And obviously, the study that we’re running, we should be publishing in a couple of months time. So that hopefully will set the bar. Okay.
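For anyone who wants to see what a NASA Task Load Index comparison actually looks like, here is a minimal sketch in Python. It uses the unweighted "raw TLX" score (the mean of the six subscale ratings) and entirely made-up numbers, so treat it as an illustration of the calculation rather than data from any study mentioned above.

```python
# Illustrative only: the "raw" NASA-TLX score is the mean of six subscale
# ratings, each typically recorded on a 0-100 scale. The numbers below are
# made up for the example and are not taken from any TORTUS study.

SUBSCALES = (
    "mental_demand", "physical_demand", "temporal_demand",
    "performance", "effort", "frustration",
)

def raw_tlx(ratings: dict[str, float]) -> float:
    """Mean of the six subscale ratings (raw TLX, without pairwise weighting)."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

without_avt = {"mental_demand": 75, "physical_demand": 20, "temporal_demand": 70,
               "performance": 40, "effort": 80, "frustration": 65}
with_avt = {"mental_demand": 30, "physical_demand": 15, "temporal_demand": 35,
            "performance": 20, "effort": 30, "frustration": 20}

before, after = raw_tlx(without_avt), raw_tlx(with_avt)
print(f"Raw TLX before: {before:.1f}, after: {after:.1f}, "
      f"reduction: {100 * (before - after) / before:.0f}%")
# Raw TLX before: 58.3, after: 25.0, reduction: 57%
```

Swapping in real ratings collected before and after a rollout gives the kind of percentage reduction quoted above.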
No more questions on the questions side. Any other questions from anybody else? I'll pause there to have some coffee. Stephen's typing, so we shall wait with bated breath. Oh, he said no.
Thanks, Stephen. That wasn't that helpful. Oh, someone's got a question. It's very good. Which trusts are using TORTUS now?
That's a bit... I don't think I'm allowed to say that on a public call, but email me and I can tell you. You have to go talk to their comms teams. But we're live in quite a few, well, a couple of hospitals, some stuff in Quest, got some primary care stuff as... oh, no, no, it just cut out there. So hopefully you heard the answer.
But yeah, just drop me an email on that one. And am I at Rewired? Yes, apparently I'm speaking there. I don't know what I'm talking about yet. I shouldn't admit to that, should I?
But it will be very exciting. So I will see you all there. A reminder again then in 2 weeks time, we have the future of AI in the NHS. Oh, Alice has got one more question. As a GP who is interested in introducing AI to my practice to help develop research.
Keen to educate myself more. Where would you advise starting? Start right here, my friend. There's a bunch of other AI-related topics in this series pitched exactly at you as an audience, or at somebody who wants to practice with it. We also have a bunch of e-learning material specifically for TORTUS as a product.
And there are lots of resources. For people who really want to get into this stuff, there are some really good courses on Coursera by Andrew Ng which are very accessible. Some of them involve a bit of coding, but they get into the basic principles of a lot of this. And then playing with it, I think playing with it, thinking about what other research topics or tools could be produced with this technology. You'll see a lot of crazy developments all over the shop, but I think paying attention to some of the basics is probably the best place to start.
Awesome, more questions. Great. Will AI products support the development or deployment of PAs? Presumably you mean physician associates? I don't see why not.
I mean, most of these tools are pretty ubiquitous. Clinicians in general tend to follow the same workflow: find out something about the patient, take a history and exam, diagnose, commit to some decision making. I do think it's a very interesting one when you get into guidelines and algorithmic care, in terms of who benefits most from suggestions or cognitive support. A lot of the clinical evidence suggests that senior physicians sometimes reject AI, but certainly don't benefit as much, whereas junior physicians in training do benefit a lot more. So whether that also translates to physician associates, with a degree of oversight and autonomy, I don't know.
So that's something I'd be very interested in. And there have been some AI products in the past that looked at, the term is, practicing at the top of your license. I don't know about that, and I'm probably not going to wade into it right now, but I think there is an element of saying, you know, if we have a lot of knowledge around and we can distribute expertise, augmenting the full range of staff is probably a very good idea, and seeing what we can do with that as a technology. I wouldn't be surprised if that becomes the norm, for sure. Oh, loads of people have questions now.
Okay, so Christo said: are there good cases, examples of people in this position porting their ambient AI across trusts? Not sure I understand that question. As in people taking it within their trust to lots of different places, or between trusts, or scaling out? I'm not sure I understand that one, so I'm just going to leave it for now. How are you overcoming the challenge of selling TORTUS to the village chairs?
Oh God, the political ones. With great grace. No, I mean, it's difficult, right? It's a difficult macro environment. We've done pretty well in terms of being on the right procurement frameworks, doing all the clinical evidence, and sorting all the security elements of that.
I think building evidence is really important. I think we're not very good at that as a nation, building really high quality evidence for technology, and doing so has been super helpful for us. Having all the badges and having a really strict way of managing data is really helpful for getting through the right doors. I do think there is a lot of difficulty right now.
There’s a lot of uncertainty. Is there money coming? Is there not? Where is it happening? That’s a bit of a frustration.
But we are seeing that this is technology people want. That always helps. And certainly, you know, working with clinical champions generally across the board to build this also really helps. I think things will change. I'm hoping there'll be more ability to easily port governance, for example, and easily port procurement.
And that will mean that when we've gone a long way down the road with 1 trust, we can take at least some of that to the next one and it becomes a bit more repeatable. But yeah, it's certainly an ongoing, fun challenge. Does TORTUS support ordering bloods or scans in the EHR? Not yet. We're a Class 1 medical device moving towards Class 2.
And when you get into the ordering of prescriptions, it's pretty hard to argue that isn't a medical device. So the bar for clinical evaluation, for clinical evidence, for accuracy is very, very high, even if you're only drafting the order and giving it to the clinician. And there's a lot of complexity around the data models of different EHRs and how you can serve that back to clinicians so they can approve it. So for example in Epic, and there are only a few Epic sites in the UK, they do have the ability to place an order and store it as a draft, which means the clinician can then take over and approve it.
That gives you an easy way to hand off between the AI suggestion and what the clinician wanted to do, which is safe and comfortable. That's actually the main blocker. The technical capability to do this is not actually that hard; it's handing back the clinical safety element, I think, which is much more important. So it will come, but it will take a little bit longer.
What is the USP of TORTUS versus other AI scribes such as Heidi or Corti? Yeah, I've got a little spiel for this now. So, first, simplicity. We've deliberately gone for a simple workflow and a very simple user interface. You can't do as much with our tools as you can with some of our competitors.
There are 2 reasons for that: 1 is that we think, for mainstream requirements, simplicity is actually more important. But 2, as a medical device, we have to be really constrained about what we allow users to do, because everything has to be evidenced and documented, and we're really careful about that. The second thing is safety: we're the only Class 1 medical device in our category, moving towards Class 2, and there's also a bunch of clinical evidence and clinical evaluation behind it. And the third is sovereignty.
The only British company doing this at the moment, NHS born and raised, you know, I trained down the road at UCH many, many years ago. And I think that does count because we are building specific things for the NHS. NHS coding systems, NHS specific notation systems, and constantly thinking about what can we build specifically for our system to help our system do better. I think that’s a really important part of the product and our DNA. Stephen says, gosh, loads of questions.
Do you need more clinicians interested in AI working with you? Jobs on offer. Watch this space, Stephen. I’m always interested in people who are interested. But I think there is an element as well around implementation and creating champion teams and things.
And that's something we're probably looking to do more of as we start scaling up. Adele says there are about 60 AI medical scribes on the market. They won't all survive. What's the future? Where are they heading with their abilities?
Super good question. In the UK, there aren't that many. Globally there are actually about 90, but 70 of them are in the US right now. They won't all survive, and they won't all survive because this is going to become a regulated space. I don't see a world where it doesn't, in terms of the safety of what you can do, especially as you expand the capabilities.
When you start making suggestions or placing orders or doing diagnosis, anything useful beyond scribing, it becomes a medical device, a regulated environment. So that will compress. A really good example of this is a story I heard recently, so don't quote me on whether it's true. The combustion engine was invented, and within 2 or 3 years of that invention there were something like 400 car companies.
But within 10 years of that invention, there were something like 6. So what happened to the other 394? Well, the cars blew up on the road. And I think this is a similar thing. As we see adoption, safety, evaluation and medical device regulation will oversee the market and actually push out a lot of the poorly thought out players, and the ones that survive will be the ones that can operate in healthcare, where, compared to other industries, the stakes are much higher and the requirements are much more severe.
Great, I think we got through all of them. Yeah. Oh, sorry. Yeah. Christo asked about moving between trusts.
Unfortunately, the answer to this one, Christo, is that if you're going to move between trusts and use our technology, you can't take it with you unless the new trust approves it. And we've seen this. I can tell you firsthand that when we found other, I won't say which, companies being used in trusts where they weren't approved, those clinicians got into some significant disciplinary difficulty. What's the phrase? A meeting with no coffee, I think, is the phrase we use.
And that's because essentially what you're doing is taking patient data and storing it somewhere the trust hasn't overseen. Now, most of these technologies have been through these approval processes somewhere, so there is an element of that, but it's probably just not worth your personal risk. So I'll give you a definitive answer on that one.
But often it has actually kick-started conversations, and then people get to use it. And that's actually a great way to become a champion for something you think is useful, when you want to drive change in your department. Very good. Chat-wise, Adele said: one recent paper evaluating ambient AI showed no overall increase in efficiency for doctors, but there were differences between types of clinician and specialty focus. Have you noticed this at all, and what types is it best suited for?
Super good question. The paper you're referring to is from a competitor, I won't say who, but it's 2 years old, so it predates a lot of the systems out now. And as I said before, customization is a big part of time saving. I know for a fact that a lot of the early pilots of this technology failed because they would just produce outputs which weren't close enough to the physician's style to actually be usable.
We saw that with our early iterations, and most other ambient companies did too. Now the technology is much more mature and much more capable, and latency has also massively reduced. So I'd say most of the papers in this space go out of date as soon as they're published, really, which makes it quite difficult to follow. In terms of different types of clinician or specialty focus, not particularly.
There's a really good case for GP land, but in my experience it's more dependent on the clinician than on the specialty itself, because, as we know, we all practice quite differently, even within the same department. Where we've seen some real winners: primary care, and I think potentially a real winner developing in the A&E department, for obvious reasons. And new patients in outpatients, but interestingly not follow-ups. So that's another change.
Okay, well, we're at time, in fact we're over time. Thank you so much again. See you all in 2 weeks' time for a fun session of looking into the future. And we'll leave it there.
Meeting title: TORTUS AI Academy Session 4: Adoption in Clinical AI
Purpose of meeting: To provide practical insights on the implementation and adoption of AI in clinical settings, focusing on strategies to overcome barriers, understand user needs, build trust, and ensure return on investment (ROI) for NHS digital leaders and CXOs.
Date: 5th February 2025
Location: Online
Attendees:
- Dr. Dom Pimenta, CEO and co-founder of TORTUS
- Dr. David Triska (absent)
Discussion points and decisions made:
1. Introduction
- Welcome and Overview
- Dom opened the session, welcoming participants to the fourth installment of the TORTUS AI Academy series.
- Mentioned that David was scheduled to join but was absent due to personal reasons; however, he contributed significantly to the session’s content.
2. Recap of Previous Sessions
- Session Summaries
- December: “What is AI?” covering basics and fundamentals.
- January: Evaluation with Dr Sarah Gebauer (external).
- Previous Session: Clinical risk with Dr Ellie Asgari.
- Upcoming Session
- Next session will be an engaging panel discussion on the future of healthcare in 2, 5, and 10 years.
3. Purpose of the Series
- Goal Clarification
- The series aims to help NHS digital leaders and CXOs understand the AI landscape to make better purchasing decisions.
- Emphasized that it is not a computer science class or an in-depth ethical or mathematical exploration.
4. Today’s Topic: Adoption in Clinical AI
- Focus Areas
- Practical insights on implementation and adoption strategies.
- Understanding how to persuade skeptics, build trust, roll out technologies, monitor progress, and ensure ROI.
5. Understanding Adoption
- Misconceptions
- Adoption is not simply instructing people to use new technology.
- Terminology
- Terms like change management, transformation, and digital delivery all relate to how to get people to use something new.
- Product Management Perspective
- Adoption is viewed as product management, focusing on optimizing the input-to-output transformation.
6. Basics of Product Management
- Product Definition
- A product transforms an input into an output to solve a problem.
- Example: Invention of Fire
- Fire as an early product solved problems of cold and hunger.
- Product management involves continuous optimization (e.g., developing ovens and radiators).
7. Understanding the Problem
- Importance of Accurate Problem Identification
- Misunderstanding the problem leads to ineffective solutions.
- Case Study: Reducing DNA (Did Not Attend) Rates
- Initial assumption: High DNA rates in lower socioeconomic groups.
- Real issue: Language barriers due to non-English speakers.
- Correct solution: Sending appointment letters in patients’ first languages.
8. Knowing the User
- Deep Understanding Required
- Need to understand daily routines, pain points, fears, and habits of users (clinicians).
- Context Switching
- Frequent task-switching increases cognitive load.
- Visualization of Workflows
- Mapping workflows helps identify areas for improvement.
9. The Four Forces of Change
- Framework Explanation
- Push: Factors that drive users away from the old way.
- Pull: Attractive features of the new solution.
- Habits: Existing routines that resist change.
- Anxieties: Fears about adopting the new solution.
- Application to AVT
- Analyzed how these forces affect the adoption of Ambient Voice Technology.
10. Case Study: Ambient Voice Technology (AVT)
- Description
- AVT uses AI to transcribe consultations, reducing the need for manual note-taking.
- Push Factors
- High cognitive load from context switching.
- Pull Factors
- Simplifies workflow and saves time.
- Anxieties
- Fear of missing information, data security concerns, reliance on technology.
- Habits
- Habitual typing and note-taking during consultations.
11. Implementation Insights
- Role of Champions
- Identify local clinician leaders who understand the problem and can evangelize the solution.
- Operators
- Support champions with logistics and maintain momentum.
- Adoption Takes Time
- Requires effort, trust-building, and addressing users’ concerns.
12. Building Trust
- Evidence and Education
- Provide clinical evidence and educational resources.
- User Experience
- Allow users to try the technology to overcome skepticism.
13. Adoption as a Team Effort
- Collaborative Implementation
- Implementing technology in teams fosters mutual support and knowledge sharing.
- Social Influence
- Team adoption creates a positive environment for change.
14. Addressing Common Anxieties
- Identifying Concerns
- Fear of job replacement, learning new systems, data privacy, legal implications.
- Proactive Communication
- Acknowledge and address these concerns directly.
15. Adoption Strategies (“Cookbook”)
- Start Small and Expand
- Begin with high-impact areas and gradually expand adoption.
- Measure Existing Metrics
- Use current measurement systems to evaluate impact.
- Customization
- Recognize users’ individual needs and allow for personalization.
- Acceptance Levels
- Focus on users who find the technology essential.
- Continuous Improvement
- Treat adoption as an ongoing journey with iterative enhancements.
16. Future Plans
- Next Session Preparation
- Organizing a panel discussion on “The Future of AI in the NHS.”
- Participant Engagement
- Encouraged attendees to submit questions for the upcoming session.
Action points:
- Resource Sharing
- Provide links to previous session recordings and resources on the AI Wiki.
- Next Session Organization
- Prepare for the panel discussion by coordinating with speakers and compiling participant questions.
- Educational Materials Development
- Create engaging training materials that address common anxieties and build trust in AI solutions.
- Implementation Support
- Identify and support champions and operators within organizations to lead adoption efforts.
- Participant Outreach
- Encourage attendees to reach out with additional questions or comments.
- Ongoing Communication
- Keep participants informed about future sessions and updates.
Note: No specific individuals were assigned tasks during the meeting.
Superhuman Product Market Framework:
https://review.firstround.com/how-superhuman-built-an-engine-to-find-product-market-fit/
The Four Forces of Progress:
https://jtbd.info/the-forces-of-progress-4408bf995153
5. The Future of AI in the NHS and AMA – Dr Josh Au Yeung, Daria Gherghelas, Dr Ekanjali Dhillon, Dr Keith Grimes
First Broadcast:
20250219
Not Available Yet
FAQ
Data security depends on several factors:
- Data Workflow: Map out where your data is going. Identify every point in the data flow to pinpoint potential vulnerabilities.
- Data Location: Determine whether data stays on-site (on-premises) or moves to the cloud. If it’s cloud-based, know which cloud provider is used.
- Cloud Providers: Ensure cloud providers are reputable, comply with GDPR (if in the UK/EU), and have necessary certifications like ISO standards.
- Model Source: Understand if the AI model is open-source or licensed from another company. This affects who handles your data.
- Data Handling: At each stage, verify whether patient-identifying information is necessary. Minimize exposure by removing or anonymizing data when possible (see the illustrative sketch after this list).
- Vendor Agreements: For closed-source models, review enterprise agreements regarding data retention and security. Companies like Microsoft, AWS, and OpenAI often have clear policies.
- Data Training Practices: Ensure vendors are transparent about whether they train models using your data and that they obtain patient consent if required.
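To make the data handling point concrete, here is a minimal, hypothetical sketch of stripping a few obvious identifier patterns from free text before it leaves the local environment. It is not TORTUS's pipeline and is no substitute for validated de-identification tooling or a proper information governance review; the helper function and patterns are purely illustrative.

```python
import re

# Toy example only: real de-identification needs validated tooling, clinical
# safety review and information governance sign-off. These patterns are
# illustrative and will miss many identifiers (names, addresses, and so on).
PATTERNS = {
    "NHS_NUMBER": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),   # e.g. 943 476 5919
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),           # e.g. 05/02/2025
    "PHONE": re.compile(r"\b(?:\+44\s?7|07)\d{3}\s?\d{6}\b"),     # UK mobile formats
}

def redact(text: str) -> str:
    """Replace each matched identifier with a labelled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Seen on 05/02/2025, NHS number 943 476 5919, contact 07700 900123."
print(redact(note))
# Seen on [DATE], NHS number [NHS_NUMBER], contact [PHONE].
```

In practice the pattern list would be far longer, and most organisations would rely on validated libraries or vendor tooling rather than hand-written regular expressions.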
Yes, here are some valuable resources:
- Online Courses:
- Introduction to Machine Learning by Andrew Ng on Coursera. This course is excellent for beginners and covers foundational AI concepts and basic Python programming.
- Podcasts:
- RSM Podcast with Dr. Annabelle Painter – Discusses AI’s role in healthcare and its future.
- Health Tech Podcast – Offers insights into how AI works within the health tech industry from innovators and leaders.
- Additional Resources:
- Courses on large language models and AI applications in healthcare.
- Upcoming sessions (like our future webinars) focusing on deployment and evaluation of AI in healthcare settings.
- Deeply Understand the Problem:
- Leverage your clinical expertise to clearly define the problem you’re addressing.
- Remember that what seems obvious to clinicians may not be clear to engineers or those outside the medical field.
- Prototype Your Idea:
- Learn basic coding or create visual representations to bring your idea to life.
- Developing prototypes helps convey your concept effectively to potential partners.
- Gather Feedback:
- Share your idea with colleagues to refine it and ensure it addresses a real need.
- Collecting input can help you identify any gaps or areas for improvement.
- Connect with Technical Partners:
- Attend hackathons and networking events focused on healthcare and AI to meet engineers and data scientists.
- Consider accelerators like Entrepreneur First, where professionals from different backgrounds collaborate on startups.
- Validate Your Solution:
- Ensure your solution is scalable and addresses a generalizable problem, not just a specific case.
- Research existing products to see if similar solutions are available off the shelf.
- UK Advantages:
- Innovation in Healthcare: The NHS offers a unique environment with opportunities for innovation and improving patient care.
- Growing AI Talent Pool: London hosts major AI companies like DeepMind, Google, Meta, and soon Anthropic, making it rich in AI expertise.
- Supportive Environment: There’s increasing support for startups, with accelerators and communities fostering collaboration.
- Cultural Differences:
- While the US has a more risk-taking culture, the UK is building momentum in supporting ambitious ideas.
- The UK may have a more cautious approach, but this can lead to well-considered and sustainable ventures.
- Considerations for the US:
- Healthcare System Differences: The US healthcare incentives can be misaligned, potentially making it a challenging environment for healthcare startups.
- Talent and Costs: Engineering talent is abundant but can be more expensive in the US.
- Regulatory Environment: Be mindful of UK and EU regulations which may differ from the US and impact your operations.
- Expansion Strategy:
- Focus on establishing a strong foundation in the UK before considering international expansion.
- Engage with accelerators and networks that can provide support and resources for scaling globally.
Try TORTUS
Experience the future of AI healthcare with us. Join us today to transform your practice.