A frightening new AI can simulate your voice perfectly after hearing it for just 3 seconds

It's so good that its creators admit the model "may carry potential risks" if misused.


Modern technology has revolutionized the way we get things done. Even the most basic smartphones in most people's pockets, or the smart home devices in our living rooms, have an impressive amount of capability, especially when you consider that you can control them simply by speaking, thanks to artificial intelligence (AI). But even as computers have progressed to make our lives easier, they're also entering new territory as they become capable of imitating human behavior and even thinking for themselves. And now, a frightening new form of AI can simulate your voice perfectly after hearing it for only three seconds. Read on to learn more about this groundbreaking technology.


Microsoft has developed a new type of AI that can perfectly simulate your voice.


We all rely on machines to make our daily lives easier in one way or another. But what if a computer could step in and imitate the way you talk without others even noticing?

Last week, Microsoft researchers announced that they had developed a new form of text-to-speech AI they've dubbed VALL-E, Ars Technica reports. The technology can simulate a person's voice using just a three-second audio clip, even picking up and preserving the original speaker's emotional tone and the acoustic conditions of the environment in which they were recorded. The team says the model could be useful for creating automated text-to-speech applications, though it also carries the potential risk of highly sophisticated audio fakes similar to deepfake videos.

The company says the new technology is based on a "neural codec language model."


In its paper discussing the new technology, Microsoft dubs VALL-E a "neural codec language model." This means that while traditional text-to-speech (TTS) software takes written words and manipulates waveforms to generate speech, the AI can pick up the subtle elements of a specific voice from a short audio prompt and use them to create a convincing recreation of that person saying whatever text it is fed, according to the website Interesting Engineering.

"To synthesize personalized speech (for example, TTS zero-shot), Vall-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the registration registered in 3 seconds and the phoneme prompt, which respectively constitute the information of the information speaker and content ", information" The team explains in their newspaper. "Finally, the acoustic tokens generated are used to synthesize the final waveform with the corresponding neural codec decoder."


The team used more than 60,000 hours of recorded speech to train the new AI.


To develop the new model, the team said it used around 60,000 hours of recorded English speech from more than 7,000 individual speakers, drawn from an audio library assembled by Meta known as LibriLight. In most cases, the recordings came from readings of public-domain audiobooks stored on LibriVox, Ars Technica reports. In its tests, the team said VALL-E needed the voice in the three-second sample to closely resemble one of the voices in its training data to produce a convincing result.

The team now showcases its work by publishing specific examples of the software in action on a GitHub page. Each example provides a three-second clip of a speaker's voice reading random text, along with a "ground truth," which is a recorded example of the same speaker reading a sentence for comparison. They then provide a "baseline" recording to show how typical TTS software would generate the spoken audio, and a "VALL-E" version of the recording to compare with the previous two.

While the results aren't entirely perfect, they include some very convincing examples in which the machine-generated speech sounds shockingly human. The researchers also add that, besides mimicking inflection and emotion, the software can reproduce the environment in which the base audio was recorded, making it sound, for example, as if someone is speaking outdoors, in an echoing room, or on a phone call.

So far, Microsoft has not released the program for others to test or experiment with.


The research team concludes its paper by saying it plans to increase the amount of training data to help the model improve its speaking styles and get better at imitating the human voice. But for now, Microsoft has held back from making the new software available for developers or the general public to test, likely due to its potential to deceive people or be used for harmful purposes.

"Since Vall-e could synthesize the speech which maintains the identity of the speaker, it may include potential risks in improper use of the model, such as identification of voice identification or the identity of a speaker speaker specific, "wrote the authors in their conclusion. "To mitigate these risks, it is possible to build a detection model to discriminate if an audio clip has been synthesized by Vall-E. We will also put the principles of Microsoft AI in practice during the development of models."

