Back in 2010 Matt Thompson, then with National Public Radio, forecast in an op-ed that “at some point in the near future, automatic speech transcription will become fast, free, and decent.” He called that moment the “Speakularity,” in a sly reference to inventor Ray Kurzweil's vision of the “singularity,” in which our minds will be uploaded into computers. And Thompson predicted that access to reliable automatic speech-recognition (ASR) software would transform the work of journalists—to say nothing of lawyers, marketers, people with hearing disabilities, and everyone else who deals in both spoken and written language.
Desperate for any technology that would free me from the exhausting process of typing real-time notes during interviews, I was enraptured by Thompson's prediction. But while his brilliant career in radio has continued (he is now editor in chief of the Center for Investigative Reporting's news output, including its show Reveal), the Speakularity seems as far away as ever.
There has been important progress, to be sure. Several start-ups, such as Otter, Sonix, Temi and Trint, offer online services that allow customers to upload digital audio files and, minutes later, receive computer-generated transcripts. In my life as an audio producer, I use these services every day. Their speed keeps increasing, and their cost keeps going down, which is welcome.
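As a rough illustration of how these services work from the customer's side, the basic pattern is to upload an audio file, wait for a job to finish, and fetch the finished transcript. The endpoint, field names, and response format below are hypothetical, not any of the named vendors' actual APIs; this is just a sketch of the upload-then-poll shape they generally share.

```python
import time
import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint and credentials, for illustration only.
API = "https://transcription.example.com/v1"
API_KEY = "YOUR_API_KEY"

def transcribe(audio_path: str) -> str:
    headers = {"Authorization": f"Bearer {API_KEY}"}
    # 1. Upload the audio file and start a transcription job.
    with open(audio_path, "rb") as f:
        job = requests.post(f"{API}/jobs", headers=headers,
                            files={"audio": f}).json()
    # 2. Poll until the transcript is ready (usually a matter of minutes).
    while True:
        status = requests.get(f"{API}/jobs/{job['id']}", headers=headers).json()
        if status["state"] == "done":
            return status["transcript"]
        time.sleep(15)

print(transcribe("interview.wav"))
```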
But accuracy is another matter. In 2016 a team at Microsoft Research announced that it had trained its machine-learning algorithms to transcribe speech from a standard corpus of recordings with record-high 94 percent accuracy. Professional human transcriptionists performed no better than the program in Microsoft's tests, which led media outlets to celebrate the arrival of “parity” between humans and software in speech recognition.
The thing is, that last 6 percent makes all the difference. I can tell you from bitter experience that cleaning up a transcript that is 94 percent accurate can take almost as long as transcribing the audio manually. And four years after that breakthrough, services such as Temi still claim no better than 95 percent—and then only for recordings of clear, unaccented speech.
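For context on where such figures come from: accuracy claims like these are conventionally derived from the word error rate (WER), the number of word substitutions, insertions, and deletions needed to turn the machine's transcript into a human reference transcript, divided by the length of the reference. Roughly speaking, 94 percent accuracy corresponds to a WER of about 6 percent. A minimal sketch, with invented sample sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = "we should not think of decent automatic transcription as a luxury"
hypothesis = "we should not think of descent automatic transcription as luxury"
# One substitution ("descent") and one deletion ("a") across 11 reference words.
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")           # WER: 18%
print(f"Accuracy: {1 - word_error_rate(reference, hypothesis):.0%}")  # Accuracy: 82%
```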
Why is accuracy so important? Well, to take one example, more and more audio producers (myself included) are complying with Internet accessibility guidelines by publishing transcripts of their podcasts—and no one wants to share a transcript in which one in every 20 words contains an error. And think how much time people could save if voice assistants such as Alexa, Bixby, Cortana, Google Assistant and Siri understood every question or command the first time.
ASR systems may never reach 100 percent accuracy. After all, humans do not always speak fluently, even in their native languages. And speech is so full of homophones that comprehension always depends on context. (I have seen transcription services render “iOS” as “ayahuasca” and “your podcast” as “your punk ass.”)
But all I am asking for is an improvement of one or two percentage points in accuracy. In machine learning, one of the main ways to reduce an algorithm's error rate is to train it on more high-quality data. It is going to be crucial, therefore, for transcription services to figure out privacy-friendly ways of gathering more such data. Every time I clean up a Trint or Sonix transcript, for example, I am generating new, validated data that could be matched to the original audio and used to improve the models. I would be happy to let the companies use it if it meant there would be fewer errors over time.
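To make that concrete, here is a minimal sketch of what pairing a human-corrected transcript with its source audio might look like. The manifest format, file paths, and field names are my own illustration, not any company's actual pipeline; the point is simply that each cleaned-up transcript, matched to its recording, is one more supervised training example.

```python
import json
from pathlib import Path

def add_training_example(audio_path: str, corrected_transcript: str,
                         manifest: Path = Path("asr_training_manifest.jsonl")) -> None:
    """Append one (audio, verified transcript) pair to a JSON-lines manifest.

    Hypothetical format: many ASR training pipelines consume manifests of
    roughly this shape, but the field names here are illustrative only.
    """
    record = {
        "audio_filepath": audio_path,
        "text": corrected_transcript.strip().lower(),
    }
    with manifest.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# After a human cleans up a machine-generated transcript, the corrected text
# (not the raw ASR output) is what gets paired with the original recording.
add_training_example("interviews/2020-03-12_guest.wav",
                     "We should not think of decent automatic transcription as a luxury.")
```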
Getting such data is surely one path to the Speakularity. Given the growing number of conversations we have with our machines and the increasing amount of audio created every day, we should not be thinking of decent automatic transcription as a luxury or an aspiration anymore. It is an absolute necessity.