Sunday, July 7, 2024

Microsoft’s AI Voice Cloning Technology Is So Good, But There’s a Catch

Microsoft's research team has revealed VALL-E 2, a new AI speech synthesis system that, given just a few seconds of audio, can generate voices with “human-level performance” that are indistinguishable from the source.

“VALL-E 2 is the latest advancement in neural codec language modeling that marks a milestone in training-free text-to-speech (TTS) synthesis, achieving human parity for the first time,” the paper said.

The system builds on its predecessor VALL-E introduced in early 2023. Neural codec language models represent speech as code sequences.

What sets VALL-E 2 apart from other speech cloning techniques is its “Repetition Aware Sampling” approach and adaptive switching between sampling techniques, the team says. These strategies improve consistency and address the most common problems with traditional speech generation.
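
As a rough illustration of what adaptive switching between sampling techniques can look like, here is a minimal Python sketch. It assumes a decoder that normally uses nucleus (top-p) sampling and falls back to sampling from the full distribution when the chosen token already dominates the recent output. The function names and the values of `top_p`, `window`, and `max_ratio` are hypothetical choices for this example, not Microsoft's code or settings.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest top-ranked set whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def adaptive_sample(probs, history, top_p=0.8, window=10, max_ratio=0.1, rng=random):
    """Pick the next token, switching sampling strategy when output gets repetitive.

    probs     -- probability of each candidate token (assumed to sum to 1)
    history   -- tokens already generated, oldest first
    window    -- how many recent tokens to inspect (illustrative value)
    max_ratio -- repetition share that triggers the switch (illustrative value)
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history)[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # The token is over-represented in recent output, so resample
            # from the full distribution to give the decoder a way out of a loop.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

In this sketch, nucleus sampling keeps output stable in the common case, while the fallback only activates when a token keeps recurring, which is the kind of looping that repetitive phrases tend to trigger.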

“VALL-E 2 synthesizes consistently high-quality speech, even for sentences that are difficult to understand due to complexity or repetitive phrases,” the researchers wrote, pointing out that the technology could help create voices for people who have lost the ability to speak.

However, the tool is so capable that Microsoft will not make it available to the public.

“We currently have no plans to incorporate VALL-E 2 into products or expand its accessibility to the public,” Microsoft said in its ethics statement, noting that such tools carry risks such as voice mimicry without consent and the use of convincing AI voices in fraud and other criminal activities.

The research team stressed the need for a standard method for digitally watermarking AI generations, noting that detecting AI-generated content with high accuracy remains a challenge.

“If the model is to generalize to unseen people in the real world, it must include a protocol to ensure that the speaker consents to the use of their voice and a synthetic voice detection model.”

That said, VALL-E 2’s results are remarkably accurate compared to other tools. In a series of tests conducted by the research team, VALL-E 2 surpassed human-level benchmarks for the robustness, naturalness, and similarity of the generated speech.

Source: Microsoft

VALL-E 2 was able to achieve these results with just 3 seconds of audio. However, the team notes that “using a 10-second voice sample yields even better results.”

Microsoft isn’t the only AI company to demonstrate advanced AI models without releasing them to the market. Meta’s Voicebox and OpenAI’s Voice Engine are two impressive voice cloning tools that face similar restrictions.

A Meta AI spokesperson said last year:

“There are many interesting use cases for generative speech models, but due to the risk of misuse, we are not publicly releasing the Voicebox model or code at this time.”

Similarly, OpenAI says it is trying to address privacy issues before rolling out its synthetic voice model.

The company wrote in a post on its official blog:

“In line with our approach to AI safety and our voluntary commitments, we are choosing to preview but not broadly release the technology at this time.”

Calls for ethical guidelines are spreading across the AI community, especially as regulators begin to raise concerns about the impact of generative AI on our everyday lives.

According to Decrypt

Mark Tyson
Freelance News Writer. Always interested in the way in which technology can change people's lives, and that is why I also advise individuals and companies when it comes to adopting all the advances in Apple devices and services.