Anthropic Apologizes For One of the Guardrails on Its Fable 5 Model, and Will Change It

Fable 5 is a nerfed version of Anthropic's Mythos model. Turns out it was actually too nerfed, and the company is sorry.

Anthropic’s Fable 5 model is the nerfed version of Mythos, which is in turn the model so scarily powerful that it could ostensibly endanger the world if it were released without guardrails. Most of the guardrails, especially the ones designed to prevent users from using Fable to build cyber- or bio-weapons, are very noticeable.

But one guardrail, aimed at preventing users from using Fable 5 to train other AI models, was invisible, which sparked unusual displays of user outrage.

And now Anthropic has asked for take-backs. The controversial invisible guardrail will be made visible. In a statement to Wired, Anthropic wrote, “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.”

“We made the wrong tradeoff and we apologize for not getting the balance right,” the statement added.

In the model’s system card, Anthropic was upfront about what it was trying to do:

“Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).”

In other words, when prompts to Fable 5 showed the telltale signs of a user developing a frontier LLM, the model did not do what it does with prompts about biology, chemistry, or cybersecurity, which is to switch to an inferior model or simply refuse the request. Instead, it silently degraded its own output through methods such as changing the prompt, generating faulty results with the potential to hamper the user’s model development.

Using the model to train another model is against Anthropic’s terms of service, but users still felt that this measure violated their trust. Reddit user CheatCodesOf Life put it this way: “I wouldn’t use this thing for anything to be honest. A refusal or HTTP-4xx error for content is fair enough, but this is basically taking your money and poisoning your code base.”
