Microsoft's artificial intelligence will be able to imitate your voice from just 3 seconds of audio

AI, or artificial intelligence, is increasingly making its mark on today’s society, with new solutions disrupting all kinds of industries. Now Microsoft has created a voice model called VALL-E that can imitate any voice from a 3-second audio sample.


AI in the human voice with VALL-E

That this artificial intelligence can imitate anyone’s voice from a 3-second clip is a little scary, particularly because of the ways it could be misused, for all kinds of purposes.

If in art AI already makes it impossible to know whether a work was made by the hand of an artist (to the point that someone who makes illustrations similar to AI-generated ones has been blocked on networks such as Reddit), the future that awaits us is completely uncertain.

General view of VALL-E

El Androide Libre

The project’s GitHub page explains how this neural voice model, named VALL-E, works: it uses discrete codes derived from a neural audio codec model.
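To give a rough idea of what "discrete codes" means here, the following is a toy sketch in Python, not VALL-E's actual codec. It assumes a tiny hand-written codebook (the name CODEBOOK and its four values are hypothetical) and maps each audio sample to the index of its nearest entry. A real neural audio codec learns its codebooks and works on encoded frames rather than raw samples.

```python
# Toy illustration only: a hand-written codebook, NOT VALL-E's real codec.
# CODEBOOK and its values are hypothetical, chosen just for this example.
CODEBOOK = [-0.75, -0.25, 0.25, 0.75]


def encode(samples, codebook=CODEBOOK):
    """Replace each sample with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - s))
        for s in samples
    ]


def decode(codes, codebook=CODEBOOK):
    """Turn the discrete codes back into approximate sample values."""
    return [codebook[c] for c in codes]


if __name__ == "__main__":
    samples = [-0.9, -0.1, 0.3, 0.8]
    codes = encode(samples)
    print("codes:", codes)
    print("reconstruction:", decode(codes))
```

The point of the sketch is only that audio becomes a short list of integers, which a model can then treat much like words in a sentence.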

The researchers used 60,000 hours of English speech data to train this voice model, which is roughly hundreds of times more than what existing systems use.

VALL-E shows in-context learning capabilities and can synthesize high-quality personalized speech from only a 3-second recording of a person’s voice.

And this voice model does not stop at imitating the voice: it also preserves the speaker’s emotion and even the acoustic environment around them; in other words, it is almost a copy-paste of someone’s voice.

VALL-E

Various examples of how VALL-E works can be played on GitHub, and the truth is that this voice model’s ability to imitate any person’s timbre is remarkably surprising.