AI Beat Doctors? Not Exactly.

The study hit the news on April 30. It dropped in Science, a journal with weight. Then the headlines started screaming. Social media lit up. Cable news jumped in. Hospital inboxes flooded with alerts saying our jobs are gone.

OpenAI’s o1 model. That was the villain in their script. Or the hero, depending on your stance. The coverage claimed it outperformed emergency physicians in diagnosing triage complaints.

NPR ran the headline: In real-world test, an AI did better than doctors.

Erasure by hype. That’s what it felt like. Many ER physicians bristled. I read the study myself. What it actually shows is interesting, but also incredibly nuanced.

One author later clarified the context, but the damage to the narrative was already done.

What The Paper Actually Did

Here is the setup. Researchers gave OpenAI’s o1 and 5 models electronic medical records from 76 patients. These folks had gone through Beth Israel Deaconess ER and were admitted to the hospital.

Two internal medicine attending doctors looked at the same cases. Simple enough. Then, two separate internal medicine physicians — blinded to whether a human or AI generated the answer — evaluated the results.

The stats?

  • The AI got the exact or a closely related diagnosis right in 67% of triage cases.
  • Physician 1 scored 55%.
  • Physician 2 scored 50%.
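
How much can 76 cases really tell us? Here is a rough back-of-the-envelope sketch in Python. It assumes all 76 patients were scored at triage for each rater, and it back-calculates the number of correct answers from the rounded percentages above. Those counts are my approximation, not figures from the paper, and the Wilson score interval is just one common way to gauge the uncertainty in a proportion.

```python
import math

# Assumption: all 76 patients were scored at triage for every rater.
N_CASES = 76

# Reported triage accuracy, taken from the percentages listed above.
reported = {"AI": 0.67, "Physician 1": 0.55, "Physician 2": 0.50}


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


intervals = {}
for rater, rate in reported.items():
    # Back-calculated count (an approximation, not a number from the paper).
    correct = round(rate * N_CASES)
    intervals[rater] = (correct, wilson_interval(correct, N_CASES))

for rater, (correct, (low, high)) in intervals.items():
    print(f"{rater}: ~{correct}/{N_CASES} correct, 95% interval {low:.0%} to {high:.0%}")
```

Under these assumptions, each interval runs roughly ten points either side of the reported score, and the AI’s range overlaps both physicians’. This is not a significance test. The raters saw the same cases, so a paired analysis would be the proper comparison. It only shows how much wiggle room a sample this size leaves.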

The AI had its biggest edge at the first touch point. Initial triage. When info is sparse. The researchers made sure the AI only got the raw, unprocessed EHR data available at the moment of each decision. No cheating with later results.

But the headlines missed a big chunk. The ER test was just one of six experiments in the paper. The other five used standard benchmarks for evaluating diagnostic systems.

Impressive? Yes. Proof AI should run solo in a clinic? No. Still, ER docs remained uneasy. And rightly so.

The Doctors Were Not ER Docs

This is the friction point. The physicians in the study weren’t emergency specialists. They were internal med doctors. Different training. Different focus. Different pressures.

Emergency medicine isn’t just about pinning down a diagnosis. It’s about ruling out the stuff that kills you now. Managing chaos. Moving bodies safely through a high-volume funnel.

Spend a shift on the floor. Try it. You’ll see why a text-based exercise — no matter how clean the dataset — doesn’t capture the reality.

The AI read notes. Just text.
It didn’t see the patient who looked “ill” in that indefinable way.
It missed the subtle neuro exam finding.
It didn’t hear the patient’s story change slightly from the triage room to the exam table.

That nuance changes the differential. The AI didn’t practice medicine. It offered an opinion on data.

Is text all we need? Probably not.

An Author Pushes Back

Dr. Adrian Haimovich weighed in. He’s one of the study’s authors, an assistant professor at Harvard Med, and an attending at Beth Israel. An actual ER doc.

He framed it differently.

“Even tough cases in medical journals get solved by LLMs now,” he wrote. He pointed out the handoff dynamic. ER docs stabilize. Internal med docs admit. This experiment compared LLMs to internal medicine docs using only the data available during the ER stay.

“ERs are messy,” he noted. “Reasoning under pressure is key. We restricted data to the ER because that’s when uncertainty peaks. That’s the hard part.”

Haimovich’s take? Not a head-to-head trophy contest. It was a signal that reasoning models can actually do clinical reasoning across messy domains.

Why This Matters (Without The Hype)

I think the results matter. Hence the Science placement. But the importance isn’t the scorecard.

It’s the fact that AI held up on raw, messy, real-world data. Previous studies used polished cases. Sterile scenarios that look nothing like an actual ER visit.

o1 handled the uncertainty. That’s a signal. Also, remember: the data is old. By AI standards, this is vintage tech. Newer models have already blown o1 away. The ceiling keeps rising.

The authors were clear. Next step: prospective trials. Not deployment. Definitely not replacing physicians.

The Accountability Vacuum

So where are we in mid-2025? The debate over AI in diagnosis is mostly over. It will be involved.

Docs use it for second opinions already. Sometimes it helps. Sometimes it doesn’t.

The real problem? Governance.

Who is liable when AI is wrong?

  • The physician who trusted it?
  • The hospital that bought it?
  • The vendor who built the black box?

If a patient dies because an AI missed something obvious, the backlash will be brutal. Healthcare hates risk. The system could hit the kill switch overnight.

What Now?

Haimovich calls it “all-hands-on-deck.” He’s right.

The question isn’t if the tech works. It does. It’s about integration. Can it reduce errors? Can it double-check EKGs to help decide who needs the cath lab urgently? Can it spot the subtle signs humans miss?

Specialty groups like the American College of Emergency Physicians (ACEP) are already working on frameworks for this.

Headlines said AI beat doctors. That’s lazy journalism. The science is real. The technology is coming.

But it’s not here yet.

Are you ready?