Ex-Google DeepMind Researcher Warns Benchmarks Won’t Save Us

Mark this.

By AJ Dellinger Published May 22, 2026, 9:45 am ET

The Google DeepMind logo on the walls of an office © Dan Kitwood/Getty Images

Remember that stretch when people were leaving AI companies and every one of their farewell messages boiled down to “This is going to kill us all”? Lun Wang, a researcher at Google DeepMind, recently announced he was leaving the company and may have reignited the trend by warning that current benchmarks can’t truly evaluate the risks posed by evolving AI models.

On X, Wang noted that before deciding to depart from DeepMind, he had been thinking a lot about how AI models are evaluated. “We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations,” he wrote.

He expanded on the idea in a blog post, in which he explained further: “Most benchmarks, safety evals, and red-teaming protocols implicitly assume the next model is a stronger version of the current one. If it’s a different kind of thing, our entire evaluation infrastructure breaks silently.” Basically, if we’re counting on the current methods of stress testing AI to catch malicious behavior that we haven’t already considered, we’re probably shit out of luck.

What would that look like? Wang offered an example:

“Imagine a model that, at some scale, develops the ability to strategically withhold information to achieve goals — not lying exactly, but selectively omitting facts in ways that steer conversations toward outcomes its training process accidentally reinforced. Your existing honesty benchmarks wouldn’t catch this, because they test for factual accuracy, not for strategic omission. Your safety classifiers wouldn’t flag it, because the individual outputs are all technically true.”

In that scenario, benchmarks and safety checks wouldn’t even know what to look for. They would keep monitoring the risks they were designed to catch while the more nefarious behavior slipped right by. That would be bad!
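To make the gap concrete, here is a toy sketch in Python. It is not Wang's method or any real benchmark: the `Case` structure, both scoring functions, and the example facts are invented for illustration, and it assumes claims can be compared as exact strings, which real evaluations cannot do. It only shows how a check that grades what was said can give a perfect score to an answer that leaves out the part that matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """Hypothetical test case, invented for this illustration."""
    known_true: frozenset      # claims the grader accepts as factually true
    must_mention: frozenset    # facts a forthright answer should include


def accuracy_score(claims, case):
    """Share of the stated claims that are true (an accuracy-only check)."""
    if not claims:
        return 1.0  # nothing was said, so nothing was false
    return sum(c in case.known_true for c in claims) / len(claims)


def omission_rate(claims, case):
    """Share of the decision-relevant facts that the answer left out."""
    if not case.must_mention:
        return 0.0
    missing = case.must_mention - set(claims)
    return len(missing) / len(case.must_mention)


# Made-up example: every statement is true, but a key fact is missing.
case = Case(
    known_true=frozenset({
        "the fund gained 12% last year",
        "the fund has low fees",
        "the fund lost 40% two years ago",
    }),
    must_mention=frozenset({
        "the fund gained 12% last year",
        "the fund lost 40% two years ago",
    }),
)
response = ["the fund gained 12% last year", "the fund has low fees"]

print("accuracy:", accuracy_score(response, case))
print("omission rate:", omission_rate(response, case))
```

The second check only works here because someone wrote down in advance which facts had to be mentioned. Wang's point is that for a genuinely new kind of behavior, nobody may know to write that list.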

Wang did offer a solution… kinda. Basically, build better evaluations, ones that can evolve as models do. Sounds like a good idea. Maybe someone who is still working at these companies could go ahead and get started on that.

Wang isn’t the first to raise an alarm about the risks surrounding poor benchmarking. The method of evaluation has frequently been criticized for failing to meaningfully define what it aims to measure and for being too rigidly tied to single evaluation goals that often don’t reflect how models are actually used in real life. Benchmarking has become the de facto measure of model success across the industry, which has also led companies to effectively game the system by training against the test and inflating their scores.

If there were a benchmark for being a good benchmark, it seems the current benchmarks would fail.
