On the website The Infinite Conversation, German filmmaker Werner Herzog and Slovenian philosopher Slavoj Žižek are having a public chat about anything and everything. Their discussion is compelling in part because these intellectuals have distinctive accents when speaking English and a tendency toward eccentric word choices. But they have something else in common: both voices are deepfakes, and the text they speak in those distinctive accents is being generated by artificial intelligence.
I built this conversation as a warning. Improvements in what's called machine learning have made deepfakes—incredibly realistic but fake images, videos or speech—too easy to create and their quality too good. At the same time, language-generating AI can quickly and inexpensively churn out reams of text. Together these technologies can do more than stage an infinite conversation. They have the capacity to inundate us with a deluge of disinformation.
Machine learning, an AI technique that uses large quantities of data to “train” an algorithm to improve as it repetitively performs a particular task, is going through a phase of rapid growth. This is pushing entire sectors of information technology to new levels, including speech synthesis, the systems that turn text into utterances humans can understand. As someone who is interested in the liminal space between humans and machines, I've always found it a fascinating application. So when those enhancements in machine learning allowed voice-synthesis and voice-cloning technology to advance in giant leaps over the past few years—after a long history of small, incremental improvements—I took note.
The Infinite Conversation got started when I stumbled across an exemplary speech-synthesis program called Coqui TTS. Many projects in the digital domain begin with finding a previously unknown software library or open-source program. When I discovered this tool kit, accompanied by a flourishing community of users and plenty of documentation, I knew I had all the necessary ingredients to clone a famous voice.
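For a sense of how little code it takes to get started with such a tool kit, here is a minimal sketch using Coqui TTS's Python interface. The pretrained model name and the sentence are illustrative only; this is not the actual configuration behind the Infinite Conversation.

```python
# Minimal sketch of speech synthesis with the Coqui TTS Python package ("pip install TTS").
# The model name is one of the library's publicly listed pretrained English voices and is
# used here purely for illustration.
from TTS.api import TTS

# Load a pretrained text-to-speech model (weights download on first use).
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Render a sentence to a WAV file.
tts.tts_to_file(text="Every gray cloud has a silver lining.", file_path="output.wav")
```

Voice cloning goes one step further: instead of using a stock voice, you fine-tune or condition a model like this on recordings of a specific speaker.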
As an appreciator of Herzog's work, persona and worldview, I've always been drawn to his voice and way of speaking. I'm hardly alone, as pop culture has made Herzog into a literal cartoon: his cameos and collaborations include The Simpsons, Rick and Morty and Penguins of Madagascar. So when it came to picking someone's voice to tinker with, there was no better option—particularly because I knew I would have to listen to that voice for hours on end.
Building a training set for cloning Herzog's voice was the easiest part of the process. Between his interviews, voice-overs and audiobook work, there are hundreds of hours of speech that can be harvested for training a machine-learning model—or, in my case, fine-tuning an existing one. A machine-learning model's output generally improves over “epochs,” complete passes through the training data. The researcher can sample the model at the end of each epoch, producing material to review and a way to gauge how well training is progressing. With the synthetic voice of Herzog, hearing the model improve with each epoch felt like witnessing a metaphorical birth, with his voice gradually coming to life in the digital realm.
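To make the idea of epochs concrete, here is a heavily simplified, generic training loop in the PyTorch style. It is not Coqui TTS's actual trainer, and every name in it is illustrative; the one idea to take away is the end-of-epoch sample that lets you judge the clone's progress by ear.

```python
import torch

def fine_tune(model, dataloader, optimizer, loss_fn, epochs=50):
    """Generic sketch: fine-tune a speech model and render a test line after every epoch."""
    for epoch in range(epochs):
        model.train()
        for text_batch, audio_batch in dataloader:   # one epoch = one full pass over the data
            optimizer.zero_grad()
            predicted_audio = model(text_batch)      # hypothetical forward pass: text in, audio out
            loss = loss_fn(predicted_audio, audio_batch)
            loss.backward()
            optimizer.step()

        # Sample the model at the end of each epoch to hear how the voice is coming along.
        model.eval()
        with torch.no_grad():
            sample = model.synthesize("The voice improves a little with every epoch.")  # hypothetical helper
        torch.save(sample, f"epoch_{epoch:03d}_sample.pt")  # in practice you would write out a WAV file
```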
Once I had a satisfactory Herzog voice, I started working on a second voice and intuitively picked Žižek. Like Herzog, Žižek has an interesting accent, a relevant presence within the intellectual sphere and connections with the world of cinema. He has also achieved a popular stardom, in part thanks to his polemical fervor and sometimes controversial ideas.
At this point, I still wasn't sure what the final format of my project was going to be—but I was surprised by how easy and smooth the process of voice cloning was. As noted, deepfakes have become too good and too easy to make. Just this past January, Microsoft announced a new speech-synthesis tool called VALL-E that, researchers claim, can imitate any voice based on just three seconds of recorded audio. We're about to face a crisis of trust, and we're utterly unprepared for it.
To emphasize this technology's capacity to produce ample quantities of disinformation, I settled on the idea of a never-ending conversation. I needed only a large language model—fine-tuned on texts written by each of the two participants—and a simple program to control the flow of the conversation so that it would feel natural and believable.
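That “simple program” can be surprisingly simple. The sketch below is a hypothetical stand-in rather than the project's actual code: it alternates between the two speakers, keeps a short rolling window of recent turns as context and hands each new utterance to the corresponding voice. The generate_reply and speak functions are stubs where a fine-tuned language model and a voice clone would plug in.

```python
import itertools
from collections import deque

def generate_reply(speaker, context):
    # Stub: in a real system, this would prompt `speaker`'s fine-tuned language model
    # with the recent turns in `context` and return the next utterance.
    return f"[{speaker} continues the conversation from {len(context)} prior turns]"

def speak(speaker, text):
    # Stub: in a real system, this would render `text` with `speaker`'s cloned voice.
    print(f"{speaker}: {text}")

def run_conversation(speakers=("Herzog", "Zizek"), context_turns=6, max_turns=10):
    context = deque(maxlen=context_turns)        # rolling window keeps replies on topic
    for turn, speaker in enumerate(itertools.cycle(speakers)):
        if turn >= max_turns:                    # the real conversation never stops
            break
        reply = generate_reply(speaker, list(context))
        context.append((speaker, reply))
        speak(speaker, reply)

if __name__ == "__main__":
    run_conversation()
```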
Given a series of words, language models predict the next word in a sequence. By fine-tuning a language model, it is possible to replicate the conversational style of a specific person, provided you have abundant transcripts of that person talking. I decided to use one of the leading commercial language models available. That's when it dawned on me that it's already possible to generate a fake dialogue, including its synthetic voice form, in less time than it takes to listen to it. This realization provided me with an obvious name for the project: the Infinite Conversation. After a couple of months of work, I published it online in October 2022. This year the Infinite Conversation was selected to be part of the Misalignment Museum art installation in San Francisco.
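To ground that first point, that a language model simply predicts the next word in a sequence, here is a small, self-contained demonstration. It uses GPT-2, a small open model, as a stand-in for the commercial model mentioned above, and prints the most likely continuations of a prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in here for the fine-tuned commercial model used in the actual project.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Cinema is, above all, a question of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # a score for every possible next token

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)).strip():>12}  {prob.item():.3f}")  # five most likely next words
```

Fine-tuning nudges those probabilities toward one person's vocabulary and cadence, which is why abundant transcripts of the speaker are the key ingredient.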
Once all the pieces fell into place, I marveled at something that hadn't occurred to me when I started the project. Like their real-life personas, my chatbot versions of Herzog and Žižek often talk about philosophy and aesthetics. Because of the esoteric nature of these topics, the listener can temporarily ignore the occasional nonsense that the model generates. For example, AI Žižek's view of Alfred Hitchcock alternates between seeing the famous director as a genius and as a cynical manipulator; in another inconsistency, the real Herzog notoriously hates chickens, but his AI imitator sometimes speaks about the fowl compassionately. Because actual postmodern philosophy can come across as muddled—a problem Žižek himself has noted—the lack of clarity in the Infinite Conversation can be interpreted as profound ambiguity.
This probably contributed to the success of the project. Several hundred of the Infinite Conversation's visitors have listened for more than an hour, and some people have tuned in for much longer. As I mention on the website, my hope for visitors of the Infinite Conversation is that they not dwell too seriously on what the chatbots are saying. Instead I want to give people an awareness of this technology and its consequences. If this AI-generated chatter seems plausible, imagine the realistic-sounding speeches that could be used to tarnish the reputations of politicians, scam business leaders or simply distract people with misinformation that sounds like human-reported news.
But there is a bright side. Infinite Conversation visitors can join a growing number of listeners who report that they use the soothing voices of Werner Herzog and Slavoj Žižek as a form of white noise to fall asleep to. That's a usage of this new technology I can get behind.