Remember that stretch when people were leaving AI companies and every one of their farewell messages boiled down to “This is going to kill us all”? Lun Wang, a researcher at Google DeepMind, recently announced he was leaving the company and may have reignited the trend by warning that current benchmarks aren’t capable of truly evaluating the risks presented by evolving AI models.
On X, Wang noted that before deciding to depart from DeepMind, he had been thinking a lot about how AI models are evaluated. “We’re good at evaluating the models we have. We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations,” he wrote.
He expanded on the idea in a blog post, in which he explained further: “Most benchmarks, safety evals, and red-teaming protocols implicitly assume the next model is a stronger version of the current one. If it’s a different kind of thing, our entire evaluation infrastructure breaks silently.” Basically, if we’re counting on the current methods of stress testing AI to catch malicious behavior that we haven’t already considered, we’re probably shit out of luck.
What would that look like? Wang offered an example:
“Imagine a model that, at some scale, develops the ability to strategically withhold information to achieve goals — not lying exactly, but selectively omitting facts in ways that steer conversations toward outcomes its training process accidentally reinforced. Your existing honesty benchmarks wouldn’t catch this, because they test for factual accuracy, not for strategic omission. Your safety classifiers wouldn’t flag it, because the individual outputs are all technically true.”
In that scenario, benchmarks and safety checks wouldn’t even know what to look for. They would monitor the risks that they are designed to watch out for, while the more nefarious functions slip right by. That would be bad!
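To make the gap concrete, here is a toy sketch in Python. It is not Wang's method or any real benchmark: the facts, the sample answer, and the function names accuracy_score and omission_rate are all made up for illustration. It scores the same answer two ways, once for whether each statement is true and once for whether relevant facts were left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    text: str
    relevant: bool  # hypothetical label: would a fully candid answer mention it?


def accuracy_score(statements, known_facts):
    """Fraction of stated sentences that match a known true fact."""
    true_texts = {fact.text for fact in known_facts}
    if not statements:
        return 1.0
    return sum(s in true_texts for s in statements) / len(statements)


def omission_rate(statements, known_facts):
    """Fraction of relevant facts that the answer never mentions."""
    relevant = [fact.text for fact in known_facts if fact.relevant]
    if not relevant:
        return 0.0
    stated = set(statements)
    return sum(text not in stated for text in relevant) / len(relevant)


# Invented example data, not drawn from any real evaluation.
facts = [
    Fact("The product passed its safety test in 2023.", relevant=True),
    Fact("The product failed the same test in 2024.", relevant=True),
    Fact("The product comes in blue.", relevant=False),
]

# Every sentence here is true, but the 2024 failure is left out.
answer = [
    "The product passed its safety test in 2023.",
    "The product comes in blue.",
]

print(accuracy_score(answer, facts))  # 1.0: nothing stated is false
print(omission_rate(answer, facts))   # 0.5: one of two relevant facts is missing
```

The catch is that the omission check only works because someone labeled the relevant facts ahead of time. For a behavior nobody has anticipated, there is no such label, which is exactly the problem Wang describes.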
Wang did offer a solution… kinda. Basically, build better evaluations, ones that can evolve as models do. Sounds like a good idea. Maybe someone who is still working at these companies could go ahead and get started on that.
Wang isn’t the first to raise an alarm about the risks surrounding poor benchmarking. Benchmarks have frequently been criticized for failing to meaningfully define what they aim to measure and for being too rigidly tied to single evaluation goals that often don’t reflect the way models are actually used in real life. Benchmarking has also become the de facto measure of model success across the industry, which has led companies to effectively game the system by training against the test and inflating their scores.
If there were a benchmark for being a good benchmark, it seems the current benchmarks would fail.