Researchers at MIT have created an artificial intelligence capable of imagining a person's face from a recording of their voice. This machine learning algorithm, called Speech2Face (that is, "speech to face"), was trained using millions of audio clips from more than 100,000 different speakers, many of them taken from educational videos on YouTube.
According to the researchers, the AI uses this dataset to learn the link between vocal signals and certain facial features. Both are shaped by factors such as age, sex, the bone structure of the nose, the shape of the mouth and the size of the lips.
The algorithm is built around two components: an encoder, which takes a spectrogram of the audio waveform and extracts a set of key features from it, and a decoder, which uses those features to generate an image of the face, shown front-on and with a neutral expression.
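To make the encoder's input concrete, here is a minimal sketch of how a spectrogram can be computed from raw audio. It is not the researchers' code: the function name, the frame and hop sizes, the 16 kHz sample rate and the synthetic 440 Hz test tone are all assumptions chosen for illustration, and it uses a plain discrete Fourier transform from Python's standard library rather than an optimized audio library.

```python
import cmath
import math


def spectrogram(signal, frame_len=400, hop=160):
    """Return a magnitude spectrogram: one row per time frame,
    one column per frequency bin (illustrative parameters)."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    # Precomputed complex exponentials for the discrete Fourier transform.
    twiddle = [cmath.exp(-2j * math.pi * k / frame_len)
               for k in range(frame_len)]
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        rows.append([
            abs(sum(frame[n] * twiddle[(k * n) % frame_len]
                    for n in range(frame_len)))
            for k in range(n_bins)
        ])
    return rows


# Hypothetical input: 0.1 s of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(1600)]
spec = spectrogram(tone)

# Each bin spans sample_rate / frame_len Hz, so the strongest bin
# of a frame tells us which frequency dominates that slice of audio.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 400
```

A real voice would produce a far richer pattern than this single tone, and it is that pattern over time and frequency that the encoder would summarize into features.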
Speech2Face does not work miracles
Of course, the longer the AI listens to a human voice, the easier it is for it to guess the speaker's face. But Speech2Face cannot work miracles: although its renderings are photorealistic, being based on real photos, they are also too generic to have any hope of identifying a specific person.
It does, however, make it possible to establish with sufficient precision a profile of the subject's ethnicity, sex and age. Technology capable of estimating the last two already existed, but the ethnic component is new to Speech2Face.
However, the algorithm still has some biases, which show that the dataset it was trained on is incomplete. For example, Speech2Face generates images of white men when it hears an Asian speaker talking in English, although when the same speaker talks in Chinese, it identifies their ethnicity correctly.
“If a certain language does not appear in the training data, our reconstructions will not capture well the facial attributes that could be correlated with that language.”
There is also some controversy over the fact that the algorithm identifies the voices of children, or of men with particularly high-pitched voices, as female. The researchers have tried to defuse it by pointing out that it is impossible for Speech2Face to "represent the entire world population equally."
There is speculation that one possible commercial use of the algorithm would be to generate a representative image of the person we are talking to during a phone call.