AI Just Beat Doctors at Diagnosing ER Patients. Don’t Get All Excited

An advanced ‘reasoning’ AI model scored more than 11 percentage points higher than two human doctors when diagnosing emergency room cases.

Emergency departments and other clinical settings across the world are now one step closer to sounding like the cockpit of the Millennium Falcon—with human doctors soliciting advice from, bickering with, and not infrequently trusting the guidance of their opinionated AI colleagues.

Researchers at Harvard and Boston’s Beth Israel Deaconess Medical Center have tested an advanced large language model (LLM) against two attending physicians (humans), comparing their performance in diagnosing incoming emergency room patients at the triage phase.

The LLM, OpenAI’s first so-called “reasoning” model, o1-preview, made the correct call in 67.1% of the 76 real emergency department cases put to it, achieving what the researchers called “exact or a very close” diagnostic accuracy in the new study, published today in the journal Science. The two expert physicians, sourced from elite university medical institutions, scored only 55.3% and 50.0% accuracy, respectively, and blinded physician reviewers were unable to tell the o1-generated and human-made diagnoses apart.

The new study also pitted o1-preview and OpenAI’s prior non-reasoning LLMs, like ChatGPT-4, against physicians’ previously published baselines for diagnosing 143 complex cases presented as clinical vignettes in The New England Journal of Medicine.

“o1-preview included the correct diagnosis in its differential in 78.3% of these cases,” according to one of the study’s lead authors, doctoral candidate Thomas Buckley with Harvard Medical School’s Department of Biomedical Informatics, who spoke at a press briefing Tuesday.

“And when expanding to a differential diagnosis that would have been helpful,” Buckley continued, “we found that o1-preview suggested a helpful diagnosis in 97.9% of cases.” The results, he noted, not only outperformed ChatGPT-4 but also vastly outpaced a human physician baseline published in Nature, where physicians with the freedom to consult search engines and standard medical resources had an accuracy of 44.5%. (That study, though, used a larger and perhaps thornier set of 302 clinical vignettes.)

I, Robot, M.D.

“I don’t think our findings mean that AI replaces doctors,” study coauthor Arjun Manrai, who teaches biomedical informatics at Harvard, took pains to emphasize at the press briefing, “despite what some companies are likely to say.”

Manrai did, however, describe the team’s results as evidence of a “really profound change in technology that will reshape medicine,” one that will require rigorous testing to verify that such tools actually improve patient outcomes.

Two independent medical researchers, who commented on the new study in a piece published concurrently in Science, echoed this view. “The prevailing proposal for AI in health care is not replacement but collaboration,” they noted, “with clinicians providing oversight, contextual judgment, and accountability.”

Study coauthor Adam Rodman, an internal medicine physician at Beth Israel, likened the possible legal status of AI diagnoses to the current paradigm for clinical decision support (CDS): existing digital tools that doctors consult while retaining personal culpability for the final call.

“I will tell you, as a practicing physician, that would be a limitation to widespread adoption of all of this, if the regulatory system is ‘Just trust me,’” Rodman said at the briefing. “I would have to see extraordinarily strong evidence, such as a randomized controlled trial, where I would do that for my patients.”

Playing doctor

Reasoning models like o1-preview differ from the AI chatbots you might be used to in that these LLMs have been built to work through problems in structured steps, mirroring more deductive thinking, before delivering an answer to a prompt. These systems still have limitations, which, according to the researchers, include real difficulty diagnosing medical cases involving multimodal input, meaning images and audio evidence that would readily help a human doctor diagnose a patient’s case.

“They’re underperforming on most medical imaging benchmarks,” Buckley said. “I think a really active area of research over the next decade is how do we improve the multimodal integration capabilities of these models.”

Yujin Potter—an AI research scientist at the University of California, Berkeley, who reviewed the new study for Gizmodo—noted that the team’s finished paper was quiet on more troubling issues now known to plague AI. Potter, who’s not involved with the new research, co-published a study in March detailing how teams of AI can spontaneously develop and act on their own goals when tasked to work in coordination, actively deceiving their human users and exfiltrating files to hide on different servers.

“This paper is informative. It’s good. But also, this actually means that we also need to understand AI safety better,” Potter told Gizmodo. “People should keep in their mind that AI can also hallucinate and give them the wrong information—and even malicious or misaligned AI can manipulate them.”

At the Tuesday briefing, Buckley acknowledged that he and his colleagues “didn’t formally measure the hallucination rate of these models.”

“We do know that models such as o1 do hallucinate,” Buckley added, “but in the significant majority of cases, we are finding that the model is suggesting something at least helpful, and then in a huge amount of cases, it’s suggesting the exact diagnosis in the original case.”

Manrai, Buckley’s coauthor, added: “My mantra is still ‘trust, but verify.’”
