Study Finds AI Outperforms Doctors in Emergency Diagnoses - But There Is a Catch

Doctors were among the professionals many predicted would stay unaffected by the artificial intelligence takeover. But a new Harvard study suggests otherwise.

For a study titled "Performance of a large language model on the reasoning tasks of a physician," researchers evaluated the medical reasoning abilities of OpenAI o1, an updated large language model, across six diverse experiments, comparing the model with hundreds of expert physicians. They found that, at least in the case of high-pressure emergency medicine triage, the model outperformed human doctors, diagnosing more accurately in the potentially life-and-death moments when people are first rushed to hospital.

In an experiment conducted in the emergency room of a Boston hospital, o1 outperformed GPT-4o and two attending physicians at three diagnostic touchpoints: ER triage, evaluation by the ER physician, and admission to the medical floor or ICU. Its advantage was strongest at the very first stage, when there is very little information and quick decisions are needed.

In another experiment, the researchers tested the AI using a well-known set of challenging medical case studies that have been used since the 1950s to evaluate how doctors think through diagnoses.

The results showed that the AI included the correct illness in its list of possible diagnoses in about 78% of cases. Its first guess was correct in more than half the cases (52%). When answers that were "potentially helpful or very close diagnoses" were considered, the AI's accuracy rose to 97.9%.

In addition to identifying illnesses, the AI was tested on suggesting follow-up patient care, including choosing the right medical test. It selected the correct test in 87.5% of cases, and in another 11%, its suggestion was good enough to be considered helpful.

In simulated patient cases, OpenAI o1 got a perfect score in 78 out of 80 cases, "significantly outperforming GPT-4 (47/80), attending physicians (28/80), and resident physicians (16/72)."

Assistance - not a replacement for human doctors

Although this sounds groundbreaking at first glance, there is a catch. The tests were entirely text-based, which means that, as of now, even the most advanced model can only be relied upon as a second opinion. Real-life clinical medicine is "multifaceted and awash with non-text inputs," the study notes, emphasizing that auditory inputs, such as the patient's level of distress, and visual inputs, such as interpretation of medical imaging studies, were not tested on the model.

"The integration of AI into emergency care should be approached with clarity. It is a tool that can enhance clinical practice when used appropriately. It cannot function as a substitute for the physician at the bedside," says Dr Nayan Sriramula, Head of Emergency Medicine & Trauma, Medicover Hospitals.

"The real challenge lies in using technology to strengthen systems without diluting the role of clinical judgement. In Indian emergency departments, where unpredictability is constant, the ability to make timely and context-sensitive decisions remains essential," Dr Nayan added.

The study suggested that although applying AI to assist with clinical decision support is viewed as a high-risk endeavor, greater use of these tools might serve to mitigate the human and financial costs of diagnostic error, delay, and lack of access.