Earlier this year, Sam Altman was confronted directly with a video from what has become a viral trend: people showing off the significant shortcomings of OpenAI’s voice model. It seems he didn’t particularly enjoy that, because OpenAI is taking steps to save Altman from future embarrassment. On Thursday, the company announced three new voice models meant to open up the technology to developers who might be able to do groundbreaking things like program a functional timer.
Per the company, it is releasing GPT-Realtime-2, its first voice model with “GPT-5-class reasoning” that can allegedly handle difficult prompts and better maintain conversations than its predecessors. It also introduced GPT-Realtime-Translate, which it claims can translate speech from more than 70 input languages into 13 output languages while “keeping pace with the speaker.” The final model, GPT-Realtime-Whisper, is meant for live speech-to-text transcription.
“Voice is becoming one of the most natural ways for people to use software,” the company said in a statement. “But building useful voice products takes more than fast turn-taking or a natural-sounding voice. A voice agent needs to understand what someone means, keep track of context, recover when a request changes, use tools while the conversation continues, and respond in a way that feels appropriate to the moment.”
The challenges of building AI models have become the subject of many a meme over the past year or so. TikTok user @huskistaken, aka Husk, is perhaps the master of the genre, regularly poking holes in the capabilities of OpenAI’s previous voice models. Rather than doing so as a red teamer who keeps issues from making it into the final product, he primarily pushes OpenAI to make changes through embarrassment.
It was one of Husk’s videos that made its way to Altman earlier this year. The CEO was made to watch ChatGPT’s voice model very obviously lie about starting a timer. In the video, Husk asked the model to time how long it took him to run a mile, then immediately said he was done, only for the model to claim he had finished his mile in 10 minutes. Altman, visibly annoyed about the whole thing, said it would be “maybe another year before something like that works well.”
The new models are meant to speed up solutions to this confounding problem. Per OpenAI’s press release, the new releases are adept at “voice-to-action, where people can describe what they need and the system can reason through the request, use tools, and complete the task.” The company gives the example of asking Zillow to “find me homes within my BuyAbility, avoid busy streets, and schedule a tour for Saturday.” That certainly feels a bit more advanced than “start a timer,” but it stands to reason that a timer would fall under the same functionality.
The real test of OpenAI’s new models will be the jailbreakers like Husk. Earlier this year, OpenAI co-founder Andrej Karpathy argued that people simply haven’t updated their priors on AI models, which he said are advancing all the time in ways that don’t garner the same attention as videos messing with the voice model. But those videos aren’t old; Husk uploads new ones regularly. If he stops posting with the release of these new models, chalk up a win for the true believers like Karpathy.