Microsoft's AI Voice Cloning Technology Reaches Human Parity

Microsoft's research team has unveiled VALL-E 2, a new artificial intelligence (AI) speech-synthesis system that can clone a voice from just a few seconds of sample audio, generating speech that matches human performance and closely resembles the source.

According to the research paper, VALL-E 2 is the latest advance in neural codec language models and marks a breakthrough in zero-shot text-to-speech (TTS) synthesis, attaining human parity for the first time. The system builds on its predecessor, VALL-E, which was unveiled early last year. A neural codec language model represents speech as sequences of discrete codes.
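The "speech as code sequences" idea can be illustrated with a toy sketch. This is not Microsoft's actual codec: the codebook below is random and `quantize` is a stand-in for a learned neural quantizer (the VALL-E line of work uses EnCodec-style codecs), but it shows how continuous audio frames become discrete tokens that a language model can then predict autoregressively.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each audio frame (row) to the index of its nearest codebook vector."""
    # Pairwise distances: shape (n_frames, n_codes).
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 "learned" code vectors, dimension 4
# Fake frames: codebook entries 2, 5, 5, 1 plus a little noise.
frames = codebook[[2, 5, 5, 1]] + 0.01 * rng.normal(size=(4, 4))

codes = quantize(frames, codebook)
print(codes.tolist())  # frames snap back to the code sequence [2, 5, 5, 1]
```

Once speech is a token sequence like `[2, 5, 5, 1]`, text-to-speech reduces to next-token prediction, which is what makes the "language model" framing possible.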

VALL-E 2's 'Repetition Aware Sampling' method, which adaptively switches between sampling strategies, sets it apart from other voice-cloning systems. The approach improves consistency and addresses the most common failure modes of conventional generative voice models.
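The sampling idea can be sketched as follows. This is a hedged toy, not the paper's exact procedure: the window size, threshold, and function names are illustrative choices. The gist is to draw a token with nucleus (top-p) sampling, but if that token already dominates the recent history (a sign of the looping failure mode in autoregressive speech decoding), fall back to random sampling from the full distribution.

```python
import numpy as np

def nucleus_sample(probs, p, rng):
    """Sample from the smallest set of tokens whose probability mass >= p."""
    order = np.argsort(probs)[::-1]                # tokens by descending prob
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]    # smallest top-p set
    w = probs[keep] / probs[keep].sum()            # renormalize
    return int(rng.choice(keep, p=w))

def repetition_aware_sample(probs, history, rng, p=0.9, window=10, thresh=0.5):
    """Nucleus sampling, with a fallback when the drawn token is looping."""
    tok = nucleus_sample(probs, p, rng)
    recent = history[-window:]                     # history: list of past tokens
    ratio = recent.count(tok) / max(len(recent), 1)
    if ratio > thresh:                             # token repeats too often:
        tok = int(rng.choice(len(probs), p=probs)) # resample from full dist.
    return tok
```

In use, the decoder would call `repetition_aware_sample` at each step, appending the returned token to `history`; the adaptive switch keeps the model from getting stuck emitting the same acoustic code indefinitely.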

VALL-E 2 Synthesizes High-quality Speech

According to the researchers, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally difficult because of their complexity or repetitive phrasing.

Additionally, the researchers pointed out that the technology could help generate speech for people who are unable to talk.

Impressive as it is, the tool will not be released to the public. In its ethics statement, Microsoft noted that it has no plans to incorporate VALL-E 2 into a product or expand public access.

The statement also noted that such tools carry risks, including voice impersonation without consent and the use of convincing AI voices in scams and other illegal activities.

The research team stressed that a standard method for digitally watermarking AI-generated audio is needed, acknowledging that reliably identifying AI-created content remains an open problem.

According to the team, if the model is generalized to unseen speakers in the real world, it should include a protocol ensuring that speakers approve the use of their voice, along with a model for detecting synthesized speech.

VALL-E 2's results are strong compared with other tools. In a battery of tests run by the research team, VALL-E 2 surpassed human benchmarks in the robustness, naturalness, and similarity of the generated speech.

VALL-E 2 achieved these results from just three seconds of audio, although, according to the research team, 'utilizing 10-second speech samples led to improved quality.'

Use Cases for Generative Speech Models

Other AI firms have demonstrated advanced voice models without releasing them. OpenAI's Voice Engine and Meta's Voicebox are two notable voice cloners with similar restrictions.

In 2023, a Meta AI representative said there are many exciting use cases for generative speech models but that, because of the risks of misuse, the company was not releasing the Voicebox model or code at the time.

Further, OpenAI revealed that it is working to address safety concerns before releasing its synthetic voice model.

In an official blog post, OpenAI noted that, in line with its approach to AI safety and its voluntary commitments, it is choosing to preview but not widely release the technology at this time.

The call for ethical regulation is spreading across the AI community, particularly as regulators begin raising concerns about the effects of generative AI on people's daily lives.


By Michael Scott

Michael Scott is a skilled and seasoned news writer with a talent for crafting compelling stories. He is known for his attention to detail, clarity of expression, and ability to engage his readers with his writing.
