Meta AI researchers have introduced Voicebox, a generative AI model for speech that generalizes across a range of speech-generation tasks and surpasses previous state-of-the-art performance. Voicebox is built on a method called Flow Matching, which Meta reports outperforms diffusion models, and it can modify any part of a given audio sample. With strong results in intelligibility, audio similarity, and task versatility, Voicebox marks a significant step forward for generative speech models.
A New Approach to Speech Generation
Existing speech synthesizers are limited mainly by their dependence on meticulously prepared training data. Voicebox overcomes this limitation by building on the Flow Matching model, which allows it to learn from raw audio and an accompanying transcription. Trained on more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in multiple languages, Voicebox learns to predict a speech segment from the surrounding audio and the transcript of that segment. Because it fills in a segment from its context, the model can generate speech in the middle of an audio recording without recreating the entire input.
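As a rough illustration of the training setup described above, the following Python sketch masks a span of audio frames and builds one Flow Matching training pair for them. It is a toy under stated assumptions, not Meta's implementation: frames are single numbers rather than spectrogram vectors, the straight-line path between noise and data and the SIGMA_MIN value are assumptions, and the names mask_span, flow_matching_pair, and masked_loss are hypothetical.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant, chosen only for illustration


def mask_span(frames, start, end):
    """Zero out frames[start:end]; return the context and a boolean mask."""
    mask = [start <= i < end for i in range(len(frames))]
    context = [0.0 if m else f for f, m in zip(frames, mask)]
    return context, mask


def flow_matching_pair(x1, t, rng):
    """Build one training pair on an assumed straight path from noise to data.

    x1 is the clean data, t is a time in [0, 1], rng is a random.Random.
    Returns the noisy point x_t and the velocity a model would regress.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    xt = [(1 - (1 - SIGMA_MIN) * t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1 - SIGMA_MIN) * a for a, b in zip(x0, x1)]
    return xt, target


def masked_loss(pred, target, mask):
    """Mean squared error computed over the masked frames only."""
    errs = [(p - u) ** 2 for p, u, m in zip(pred, target, mask) if m]
    if not errs:
        raise ValueError("mask selects no frames")
    return sum(errs) / len(errs)


if __name__ == "__main__":
    frames = [0.2, 0.4, 0.6, 0.8, 1.0]       # toy "audio" frames
    context, mask = mask_span(frames, 1, 4)  # hide the middle three frames
    xt, target = flow_matching_pair(frames, 0.5, random.Random(0))
    guess = [0.0] * len(frames)              # stand-in for a model's output
    print(context, masked_loss(guess, target, mask))
```

In a real system a neural network would receive the noisy frames, the unmasked context, and the transcript, and would be trained to reduce this loss on the masked span.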
Versatile Applications of Voicebox
Voicebox’s capabilities extend across various speech-generation tasks, demonstrating its versatility and potential impact.
In-context text-to-speech synthesis
Voicebox can synthesize speech by matching the audio style of a given input sample as short as two seconds. This feature holds promise for future projects, such as enabling speech for individuals who cannot speak or allowing customization of voices used by non-player characters and virtual assistants.
Cross-lingual style transfer
Given a sample of speech and a passage of text, Voicebox can produce a reading of that text in languages including English, French, German, Spanish, Polish, and Portuguese. This capability could help people who speak different languages communicate in a natural, authentic way.
Speech denoising and editing
Voicebox’s in-context learning enables seamless editing of audio recordings. It can effectively remove short-duration noise or replace misspoken words within a speech segment without requiring the entire recording to be redone. This capability may revolutionize audio editing, making it as accessible as popular image-editing tools have made photo adjustments.
Diverse speech sampling
Having learned from diverse real-world data, Voicebox generates speech that is more representative of how people naturally speak. This capability can help generate synthetic data for training speech assistant models. Speech recognition models trained on Voicebox-generated synthetic speech perform nearly as well as those trained on real speech: their error rate degrades by only 1 percent, compared with 45 to 70 percent for models trained on synthetic speech from previous text-to-speech models. For cross-lingual style transfer, Voicebox outperforms YourTTS, reducing the average word error rate from 10.9 percent to 5.2 percent and improving audio similarity from 0.335 to 0.481.
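For readers unfamiliar with the metric, the sketch below shows how a word error rate is commonly computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The article does not say exactly how Meta computed its figures, so this is the standard definition rather than their evaluation code, and the function names word_error_rate and relative_reduction are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(relative_reduction(10.9, 5.2))
```

Applied to the figures above, going from 10.9 percent to 5.2 percent is a relative reduction of roughly 52 percent.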
Sharing generative AI research responsibly
While Voicebox represents a significant advancement in generative AI, Meta AI acknowledges the potential for misuse and is limiting the release accordingly. To address these concerns, the researchers have developed a highly effective classifier that distinguishes between authentic speech and audio generated with Voicebox. Although the model and code are not publicly available at this time, Meta AI is sharing audio samples and a detailed research paper outlining its approach and results, encouraging the research community to build on the work and to take part in conversations about responsible AI development.
Voicebox, Meta AI’s state-of-the-art generative AI model for speech, outperforms previous models in word error rate and audio style similarity. Its ability to generalize across tasks opens up applications including in-context text-to-speech synthesis, cross-lingual style transfer, speech denoising and editing, and diverse speech sampling. While Meta AI emphasizes responsible sharing of its research, it anticipates that Voicebox will have a positive impact on the future of generative AI for speech and looks forward to further advances in the field.