In a research letter published in JAMA Internal Medicine, physician-scientists at Beth Israel Deaconess Medical Center (BIDMC) compared the clinical reasoning of GPT-4, the large language model (LLM) behind ChatGPT, with that of internal medicine residents and attending physicians. The study asked whether LLMs are as proficient as physicians not just at reaching diagnoses but at the clinical reasoning that leads to them. Twenty-one attending physicians and 18 residents worked through 20 clinical cases presented in sequential stages of diagnostic reasoning, and GPT-4 was given the same cases. Surprisingly, the chatbot outperformed the human physicians on reasoning, earning the highest scores on the revised-IDEA (r-IDEA) scale.
Lead author Stephanie Cabral, MD, explained that each case unfolded in four sequential stages of diagnostic reasoning: triage data, review of systems, physical exam, and diagnostic testing and imaging. The chatbot scored higher on the r-IDEA scale than both attending physicians and residents, demonstrating strong reasoning skills. On diagnostic accuracy and correct clinical reasoning, however, humans and the model performed comparably. The researchers also noted that the chatbot produced instances of frankly incorrect reasoning, underscoring that AI should serve as a supplementary tool that enhances, rather than replaces, human reasoning in clinical practice.
The study emphasized the potential for integrating LLMs such as GPT-4 into clinical practice to improve patient care and physician efficiency. Co-author Adam Rodman, MD, highlighted the significance of the AI demonstrating reasoning ability across multiple steps of the diagnostic process, which could ultimately improve both the quality and the experience of healthcare for patients. The researchers called for further studies to determine how LLMs can best be used in clinical settings, with the hope that AI could strengthen the patient-physician interaction by reducing inefficiencies and freeing clinicians to focus on meaningful conversations with their patients.
The research was supported by Harvard Catalyst and by financial contributions from Harvard University and its affiliated academic healthcare centers. The authors disclosed potential conflicts of interest: Rodman reported grant funding from the Gordon and Betty Moore Foundation, and other co-authors reported affiliations with various medical organizations. The findings shed light on the evolving role of AI in healthcare, showing how LLMs like GPT-4 can augment clinical reasoning and potentially improve patient outcomes.
Overall, the BIDMC study demonstrated that GPT-4 surpassed internal medicine residents and attending physicians in clinical reasoning as measured by the r-IDEA score, while matching them on diagnostic accuracy and still producing occasional errors in reasoning. The results underscore AI's role as a complementary tool in healthcare and point to the potential benefits of integrating LLMs into clinical practice to enhance patient care, physician efficiency, and, ultimately, the quality and experience of healthcare for patients.