Advances in speech synthesis have accelerated the adoption of smart assistants like Amazon Alexa and Apple Siri, but sophisticated speech capabilities are edging closer to offering a more vital service. Speech technologies based on artificial intelligence (AI) are evolving toward the ultimate goal of giving voice to the millions of individuals who suffer from speech loss or impairment.
Cutting-edge voice technology underlies a massive, tremendously competitive marketplace for smart products. According to the 2022 Smart Audio Report from NPR and Edison Research, 62 percent of Americans aged 18 and over use a voice assistant in some type of device. For companies, participation in the trend for sophisticated voice capabilities is critical—not just for securing their synthetic voice brand, but also for participating in the unprecedented opportunities for direct interaction with consumers through AI-based agents that listen and respond through the user’s device in a natural-sounding conversation.
Complex Speech Synthesis Pipeline
Speech synthesis technology has evolved dramatically from voice encoder, or vocoder, systems first developed nearly a century ago to reduce bandwidth in telephone line transmissions. Today's vocoders are sophisticated subsystems based on deep learning algorithms like convolutional neural networks (CNNs). In fact, these neural vocoders serve only as the backend stage of complex speech synthesis pipelines that incorporate an acoustic model capable of generating the various aspects of voice that listeners use to identify gender, age, and other factors associated with individual human speakers. In this pipeline, the acoustic model generates acoustic features, typically in the form of mel-spectrograms, which map the linear frequency domain onto a scale considered more representative of human perception. In turn, neural vocoders like Google DeepMind's WaveNet use these acoustic features to generate high-quality audio output waveforms.
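To make the mel scale concrete: it compresses high frequencies, where human hearing is less discriminating, relative to low ones. A minimal sketch of the commonly used HTK-style conversion formula (one of several mel formulas in use, shown here only for illustration, not tied to any particular vocoder):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse conversion: mel value back to Hz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Equally spaced points on the mel axis map to progressively wider
# frequency bands in Hz, mirroring the ear's coarser resolution at
# higher frequencies. Mel filterbanks for spectrograms are typically
# centered on points chosen this way.
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10)
band_edges_hz = mel_to_hz(mel_points)
```

Placing filterbank centers at equal mel intervals like this is what makes a mel-spectrogram "perceptually" spaced, even though the underlying spectrum is computed on a linear frequency axis.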
Text-to-speech (TTS) offerings abound in the industry, ranging from downloadable mobile apps and open-source packages like OpenTTS to comprehensive cloud-based, multi-language services such as Amazon Polly, Google Text-to-Speech, and Microsoft Azure Text to Speech. Many TTS packages and services support the industry-standard Speech Synthesis Markup Language (SSML), which gives speech synthesis applications a consistent way to produce more realistic speech patterns, including pauses, phrasing, emphasis, and intonation.
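As a brief illustration, a short SSML document using standard elements such as `break`, `emphasis`, and `prosody` might look like the following; the exact set of supported elements and attribute values varies from service to service:

```xml
<speak>
  Welcome back.
  <break time="400ms"/>
  Your package is <emphasis level="strong">out for delivery</emphasis>
  and should arrive
  <prosody rate="slow" pitch="+5%">before noon today</prosody>.
</speak>
```

Because the markup is declarative, the same annotated text can be handed to different TTS engines, each rendering the pauses, stress, and pitch changes with its own voices.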
Giving Voice to the Individual
Today’s TTS software can deliver voice quality that’s a far cry from the robotlike speech of the electrolarynx, or the synthesized voice that the late Stephen Hawking kept as his signature even after improved voice rendering technology became available. Even so, these packages and services focus on providing a realistic voice interface for applications, websites, videos, automated voice response systems, and the like. Reproducing a specific individual’s voice, including their unique tone and speech patterns, is not their primary objective.
Although some providers, such as Google, offer an option for creating a user-supplied voice by special arrangement, these services aren’t geared to the critical need of reproducing a voice that an individual has lost. For these individuals, the need is indeed critical because our unique voice is so closely tied to our identity: a simple voiced greeting conveys far more than the individual words. People who have lost their voice feel a disconnection that goes beyond the loss of vocalization. For them, the ability to interact with others in their own voice is the real promise of emerging speech synthesis technology.
The Emergence of Voice Cloning
Efforts continue to lower the barrier to providing synthetic voices that can match the unique persona of individuals. For example, last year actor Val Kilmer revealed that after he lost his voice to throat cancer surgery, UK company Sonantic provided him with a synthetic voice that was recognizably his own. In another high-profile voice cloning application, the voice of the late celebrity chef Anthony Bourdain was cloned in a film about his life, delivering, in Bourdain’s voice, words that the chef wrote but never spoke in life.
Another voice pioneer, VocalID, provides individuals with custom voices based on recordings that each individual “banks” with the company in anticipation of losing their voice, or on banked recordings made by volunteers and matched to the individual who has lost theirs. The individual can then run the custom voice synthesis application on their iOS, Android, or Windows mobile device, carrying on conversations in their unique voice.
The technology for cloning voices is moving quickly. This summer, Amazon demonstrated the ability to clone a voice using audio clips less than 60 seconds in duration. Although billed as a way to resurrect the voice of dearly departed relatives, Amazon’s demonstration highlights AI’s potential for delivering speech output in a familiar voice.
Given the link between voice and identity, high-fidelity speech generation is both a promise and a threat. Like deepfake videos, deepfake voice cloning represents a significant security threat. A high-quality voice clone was cited as the contributing factor in the fraudulent transfer of $35 million in early 2020. In that case, a bank manager wired the funds in response to a telephone transfer request delivered in a voice he recognized but that proved to be a deepfake.
With an eye on the market potential for this technology, researchers in academic and commercial organizations are actively pursuing new methods to generate speech output capable of all the nuances of a human speaker to fully engage the consumer. For all the market opportunity, however, advanced speech synthesis technology promises to deliver a more personal benefit to the millions of individuals who are born without a voice or have lost their voice due to accident or illness.