AI Just Beat Doctors at Diagnosing ER Patients. Don’t Get All Excited

An advanced ‘reasoning’ AI model scored more than 11 percentage points higher than two human doctors when diagnosing emergency room cases.

Emergency departments and other clinical settings across the world are now one step closer to sounding like the cockpit of the Millennium Falcon—with human doctors soliciting advice from, bickering with, and not infrequently trusting the guidance of their opinionated AI colleagues.

Researchers at Harvard and Boston’s Beth Israel Deaconess Medical Center have tested an advanced large language model (LLM) against two attending physicians (humans), comparing their performance in diagnosing incoming emergency room patients at the triage phase.

The LLM, OpenAI’s first so-called “reasoning” model, o1-preview, made the correct call in 67.1% of the 76 real emergency department cases put to it, achieving what the researchers called “exact or a very close” diagnostic accuracy in the new study, published today in the journal Science. The two expert physicians, sourced from elite university medical institutions, scored only 55.3% and 50.0% accuracy, respectively, and blinded physician reviewers were unable to tell the o1-generated and human-made diagnoses apart.

The new study also pitted o1-preview and OpenAI’s prior non-reasoning LLMs, like ChatGPT-4, against physicians’ previously published baselines for diagnosing 143 complex cases presented as clinical vignettes in The New England Journal of Medicine.

“o1-preview included the correct diagnosis in its differential in 78.3% of these cases,” according to one of the study’s lead authors, doctoral candidate Thomas Buckley with Harvard Medical School’s Department of Biomedical Informatics, who spoke at a press briefing Tuesday.

“And when expanding to a differential diagnosis that would have been helpful,” Buckley continued, “we found that o1-preview suggested a helpful diagnosis in 97.9% of cases.” The results, he noted, not only outperformed ChatGPT-4 but also vastly outpaced a human physician baseline published in Nature, where physicians with the freedom to consult search engines and standard medical resources had an accuracy of 44.5%. (That study, though, used a larger and perhaps thornier set of 302 clinical vignettes.)

I, Robot, M.D.

“I don’t think our findings mean that AI replaces doctors,” study coauthor Arjun Manrai, who teaches biomedical informatics at Harvard, took pains to emphasize at the press briefing, “despite what some companies are likely to say.”

Manrai did, however, describe the team’s results as evidence of a “really profound change in technology that will reshape medicine,” one that will require rigorous testing to verify that such tools actually improve patient outcomes.

Two independent medical researchers, who commented on the new study in a piece published concurrently in Science, echoed this view. “The prevailing proposal for AI in health care is not replacement but collaboration,” they noted, “with clinicians providing oversight, contextual judgment, and accountability.”

Study coauthor Adam Rodman, an internal medicine physician at Beth Israel, likened the possible legal status of AI diagnoses to the current paradigm for clinical decision support (CDS): existing digital tools that doctors consult while retaining personal culpability for the final call.

“I will tell you, as a practicing physician, that would be a limitation to widespread adoption of all of this, if the regulatory system is ‘Just trust me,’” Rodman said at the briefing. “I would have to see extraordinarily strong evidence, such as a randomized controlled trial, where I would do that for my patients.”

Playing doctor

Reasoning models like o1-preview differ from the AI chatbots you might be used to in that these LLMs have been built to work through problems in structured steps, mirroring more deductive thinking, before delivering an answer to a prompt. These systems still have limitations, which, according to the researchers, include real difficulty diagnosing medical cases involving multimodal input, meaning images and audio evidence that would readily help a human doctor diagnose a patient’s case.

“They’re underperforming on most medical imaging benchmarks,” Buckley said. “I think a really active area of research over the next decade is how do we improve the multimodal integration capabilities of these models.”

Yujin Potter—an AI research scientist at the University of California, Berkeley, who reviewed the new study for Gizmodo—noted that the team’s finished paper was quiet on more troubling issues now known to plague AI. Potter, who’s not involved with the new research, co-published a study in March detailing how teams of AI can spontaneously develop and act on their own goals when tasked to work in coordination, actively deceiving their human users and exfiltrating files to hide on different servers.

“This paper is informative. It’s good. But also, this actually means that we also need to understand AI safety better,” Potter told Gizmodo. “People should keep in their mind that AI can also hallucinate and give them the wrong information—and even malicious or misaligned AI can manipulate them.”

At the Tuesday briefing, Buckley acknowledged that he and his colleagues “didn’t formally measure the hallucination rate of these models.”

“We do know that models such as o1 do hallucinate,” Buckley added, “but in the significant majority of cases, we are finding that the model is suggesting something at least helpful, and then in a huge amount of cases, it’s suggesting the exact diagnosis in the original case.”

Manrai, Buckley’s coauthor, added: “My mantra is still ‘trust, but verify.’”
