Artificial intelligence algorithms are increasingly being integrated into various aspects of healthcare, from breast cancer screenings to virtual nurses and transcription services. However, the effectiveness of these tools is being called into question by some experts, who argue that they may not perform as well as claimed. One area of concern is the testing and evaluation of these AI tools, particularly for large language models (LLMs) trained on vast amounts of text data.
Current benchmark tests for LLMs in healthcare are criticized for not accurately measuring their clinical abilities or accounting for the complexities of real-world scenarios. They are also limited in flexibility and do not adequately evaluate different types of clinical tasks. These tests are primarily designed based on evaluations using medical student exams like the MCAT, which may not accurately represent real patient data or the tasks that LLMs are supposed to perform in clinical settings.
There is a growing need to develop better evaluations for healthcare AI models to ensure their effectiveness and safety in real-world applications. Computer scientist Deborah Raji emphasizes the importance of creating evaluations that are closer to real-world clinical workflows and collecting naturalistic datasets to capture a wider range of interactions and outputs. This approach helps in understanding how people interact with these systems and ensures that the evaluations are more representative of actual usage.
To improve benchmark testing for healthcare AI models, researchers are exploring methods like “red teaming” to adversarially prompt the model and gathering insights from actual hospitals on how they integrate AI systems into their workflows. By grounding evaluations in observations of reality and using anonymized patient data, the goal is to develop benchmarks that more accurately reflect the performance and challenges posed by these models in clinical practice.
In terms of policy and frameworks, there is a call for greater transparency at institutional and vendor levels. Hospitals should share the list of AI products they use, along with information about how these systems are integrated into clinical practice. Additionally, AI vendors should be more transparent about their evaluation practices and benchmarks, helping to bridge the gap between current testing methodologies and more realistic assessments of AI performance in healthcare settings.
Overall, the advice for those working with healthcare AI models is to be more thoughtful about evaluations and assessments. Instead of relying on traditional benchmarks that may not accurately represent real-world use cases, the focus should be on constructing evaluations that align with the expectations for these models in actual deployment scenarios. By being more methodical and careful in evaluation practices, the field can ensure the effectiveness and safety of AI tools in healthcare.