Microsoft’s VALL-E AI Can Simulate Any Voice with Just 3 Seconds of Audio

Microsoft has announced a new artificial intelligence (AI) model, called VALL-E, that can synthesize audio closely simulating a person’s voice. The model builds on a technology called EnCodec, which breaks audio into discrete components, or “tokens,” and uses training data to match these tokens with the corresponding sounds. To create VALL-E, Microsoft trained the model on an audio library called LibriLight, which contains 60,000 hours of English-language speech from more than 7,000 speakers. The model can also imitate the “acoustic environment” of the sample audio: if the sample came from a telephone call, for example, the output will also sound like a telephone call.
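To make the “tokens” idea concrete, here is a minimal, purely illustrative sketch (not Microsoft’s actual code) of vector quantization, the basic mechanism behind EnCodec-style audio tokenization: each short frame of audio is mapped to the index of its nearest entry in a learned “codebook,” and the resulting sequence of indices is the discrete token stream a model like VALL-E is trained on. The frame and codebook values below are made-up toy data.

```python
# Illustrative sketch of vector quantization, the core idea behind
# EnCodec-style audio tokenization. Each audio frame (a short vector of
# samples or features) is replaced by the index of its nearest codebook
# vector, turning continuous audio into a sequence of discrete tokens.

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))
        for frame in frames
    ]

# Hypothetical toy data: 2-dimensional "frames" and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.4]]

tokens = quantize(frames, codebook)
print(tokens)  # -> [0, 1, 2]: each frame becomes one discrete token index
```

A real codec like EnCodec learns its codebooks during training and stacks several quantizers for higher fidelity, but the principle is the same: continuous audio in, discrete token indices out.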

VALL-E could be used for high-quality text-to-speech applications, speech editing, and audio content creation when combined with other generative AI models such as GPT-3. However, the researchers at Microsoft acknowledge the potential for misuse, such as impersonating specific speakers or spoofing voice-identification systems. To mitigate these risks, they suggest building a detection model that can distinguish synthesized speech from human speech.

While VALL-E’s capabilities are impressive, it is important to consider the ethical implications of this technology. The ability to synthesize audio that closely mimics a person’s voice has the potential to be used for malicious purposes, such as creating fake audio recordings or impersonating individuals. It is essential that measures are put in place to prevent the misuse of this technology and to ensure that it is used responsibly.
