Speech2Face: Learning the Face Behind a Voice

Type: Article

Publication Date: 2019-06-01

Citations: 151

DOI: https://doi.org/10.1109/cvpr.2019.00772

Abstract

How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how, and in what manner, our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.

Locations

  • arXiv (Cornell University)
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Similar Works

  • Speech2Face: Learning the Face Behind a Voice (2019). Tae-Hyun Oh, Tali Dekel, Chang-Il Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik
  • Reconstructing faces from voices (2019). Yandong Wen, Rita Singh, Bhiksha Raj
  • Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks (2019). Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, Xavier Giró-i-Nieto
  • The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features (2023). Liao Qu, Xianwei Zou, Li Xiang, Yandong Wen, Rita Singh, Bhiksha Raj
  • Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging (2022). Yeqi Bai, Tao Ma, Lipo Wang, Zhenjie Zhang
  • Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion (2024). Rong Yan, Li Liu
  • On Learning Associations of Faces and Voices (2018). Chang-Il Kim, Hijung Valentina Shin, Tae-Hyun Oh, Alexandre Kaspar, Mohamed Elgharib, Wojciech Matusik
  • Disentangled Speech Embeddings using Cross-modal Self-supervision (2020). Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
  • Learn2Talk: 3D Talking Face Learns from 2D Talking Face (2024). Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin
  • Seeing Voices and Hearing Faces: Cross-modal biometric matching (2018). Arsha Nagrani, Samuel Albanie, Andrew Zisserman
  • Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers (2017). Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, Ian Sturdy
  • LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization (2021). Avisek Lahiri, Vivek Kwatra, Christian Frueh, John Lewis, Chris Bregler
  • MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement (2021). Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, Yaser Sheikh

Works That Cite This (76)

  • S2IGAN: Speech-to-Image Generation via Adversarial Learning (2020). Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalić, Odette Scharenborg
  • FaceFilter: Audio-Visual Speech Separation Using Still Images (2020). Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang
  • Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network (2020). Ruijie Tao, Rohan Kumar Das, Haizhou Li
  • Vocoder-Based Speech Synthesis from Silent Videos (2020). Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emília Gómez, Zheng-Hua Tan, Jesper Rindom Jensen
  • Learning to Have an Ear for Face Super-Resolution (2020). Givi Meishvili, Simon Jenni, Paolo Favaro
  • Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision (2020). Soo-Whan Chung, Hong-Goo Kang, Joon Son Chung
  • Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment (2023). Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, Zhen-Hua Ling
  • Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association (2021). Peisong Wen, Qianqian Xu, Yangbangyan Jiang, Zhiyong Yang, Yuan He, Qingming Huang
  • Unauthorized AI cannot recognize me: Reversible adversarial example (2022). Jiayang Liu, Weiming Zhang, Kazuto Fukuchi, Youhei Akimoto, Jun Sakuma