In recent years, eerily accurate deepfake videos have gotten a lot of press, but automated voice replication has been quietly sliding into the uncanny valley as well. Case in point: The AI company Dessa has created a simulation of podcaster Joe Rogan’s voice that is nearly indistinguishable from the real thing.
Listen to it in this video that Dessa released last week. According to Dessa, the voice comes from a machine learning model, and all the words come from text input.
Sure, Robo-Rogan doesn’t sound quite as relaxed as the real thing does when he’s stoned and on a roll with a guest. It sounds a bit like the slightly stilted voice he might use if he were reading an ad. But it’s undeniably Rogan’s “voice.”
It’s especially hard to tell whether the voice is real when you hear it only in short snippets. To prove this, Dessa released a quiz, and personally, I got a failing grade on it. I’ve heard a lot of his voice over the years, and I had a difficult time telling the difference between Joe Rogan and Joe Fauxgan.
As The Verge pointed out, Dessa obviously had a lot of material to work with. Rogan just released episode 1,299 of his podcast, and most of these episodes are two to three hours. So Dessa could easily access thousands of hours of Rogan’s voice to use for AI training.
The Dessa blog post announcing its speech synthesis model dives into the societal implications of this technology, because “in the next few years (or even sooner), we’ll see the technology advance to the point where only a few seconds of audio are needed to create a life-like replica of anyone’s voice on the planet,” according to Dessa. “It’s pretty f*cking scary.”
The post lays out a few examples of nefarious ways the technology could be used, including spam callers impersonating family members, fake voices being used to gain high security clearance, and audio deepfakes of politicians that could cause an uprising or manipulate elections.
Dessa also provides examples of what it sees as good things that could come from this technology, like automated voices that could make voice assistants sound more natural, improved text-to-speech applications for people with disabilities, and, um, “a workout app that contains a personalized pre-workout pep talk from Arnold Schwarzenegger.”
All those suggested benefits, I must say, don’t seem to outweigh the dystopian possibilities of anyone being able to mimic anyone else’s voice.
Because of these implications, Dessa said it’s not releasing its model to the public. But it’s probably only a matter of time before we’re going to have to worry about someone threatening to send our boss a recording of us talking about peeing in their office if we don’t send the scammer $5,000 in bitcoin.