Artificial intelligence tools such as ChatGPT have been lauded for their potential to alleviate clinician workload in healthcare by triaging patients, taking medical histories, and even providing preliminary diagnoses. While these large language models perform well on standardized medical tests, a new study by researchers at Harvard Medical School and Stanford University suggests that they perform less well in scenarios that more closely resemble real-world practice. The study, published in Nature Medicine, introduced a new evaluation framework called CRAFT-MD, designed to closely mimic actual interactions with patients and assess how well AI models perform in those settings.
The CRAFT-MD analysis revealed that while all four large language models tested performed well on medical exam-style questions, their performance declined significantly in conversations that more closely resembled real-world interactions with patients. The researchers noted that these AI models struggle with the dynamic nature of medical conversations, which requires asking the right questions, piecing together scattered information, and reasoning through symptoms, a task that goes well beyond answering multiple-choice questions. The findings highlight the importance of more realistic evaluations for gauging whether AI models are fit for use in real-world clinical settings.
The research team recommends using conversational, open-ended questions in the design, training, and testing of AI models to better mirror unstructured doctor-patient interactions. Models should also be assessed for their ability to extract essential information, ask the right questions, and integrate information from multiple conversations. The team further recommends that AI models be capable of interpreting non-verbal cues such as facial expressions, tone, and body language to improve their diagnostic accuracy and performance in clinical conversations. Finally, the evaluation process should involve both AI agents and human experts to enhance accuracy and efficiency.
The study also emphasizes the importance of continuously updating and optimizing evaluation frameworks like CRAFT-MD so that they incorporate improved patient-AI models. Using AI evaluators as the first line of assessment minimizes the risk of exposing real patients to unverified AI tools. The approach is also more efficient: CRAFT-MD can process a large number of conversations in a short time frame, whereas human-based evaluation of the same material would be far more time-consuming and resource-intensive.
Overall, the research highlights the need for AI developers to design models that are more adept at conducting clinical conversations, extracting relevant information, and making accurate diagnoses based on real-world interactions. If their performance in healthcare settings improves, these tools could enhance clinical practice and patient outcomes, provided they are implemented ethically. The study was supported by various grants and foundations, and the authors disclosed relevant conflicts of interest related to their research activities and partnerships with industry organizations.