Speech2Face: Learning the Face Behind a Voice

Tae-Hyun Oh, Tali Dekel, Chang-Il Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik

Type: Article

Publication Date: 2019-06-01

Citations: 151

DOI: https://doi.org/10.1109/cvpr.2019.00772

Abstract

How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how, and in what manner, our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.

Locations

arXiv (Cornell University)
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Similar Works

Title	Year	Authors
Speech2Face: Learning the Face Behind a Voice	2019	Tae-Hyun Oh, Tali Dekel, Chang-Il Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik
Reconstructing faces from voices	2019	Yandong Wen, Rita Singh, Bhiksha Raj
Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks	2019	Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, Xavier Giró-i-Nieto
The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features	2023	Liao Qu, Xianwei Zou, Li Xiang, Yandong Wen, Rita Singh, Bhiksha Raj
Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging	2022	Yeqi Bai, Tao Ma, Lipo Wang, Zhenjie Zhang
Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion	2024	Rong Yan, Li Liu
On Learning Associations of Faces and Voices	2018	Chang-Il Kim, Hijung Valentina Shin, Tae-Hyun Oh, Alexandre Kaspar, Mohamed Elgharib, Wojciech Matusik
Disentangled Speech Embeddings using Cross-modal Self-supervision	2020	Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
Learn2Talk: 3D Talking Face Learns from 2D Talking Face	2024	Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin
Seeing Voices and Hearing Faces: Cross-modal biometric matching	2018	Arsha Nagrani, Samuel Albanie, Andrew Zisserman
Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers	2017	Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, Ian Sturdy
LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization	2021	Avisek Lahiri, Vivek Kwatra, Christian Frueh, John Lewis, Chris Bregler
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement	2021	Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, Yaser Sheikh

Works That Cite This (76)

Title	Year	Authors
S2IGAN: Speech-to-Image Generation via Adversarial Learning	2020	Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalić, Odette Scharenborg
FaceFilter: Audio-Visual Speech Separation Using Still Images	2020	Soo-Whan Chung, Soyeon Choe, Joon Son Chung, Hong-Goo Kang
Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network	2020	Ruijie Tao, Rohan Kumar Das, Haizhou Li
Vocoder-Based Speech Synthesis from Silent Videos	2020	Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emília Gómez, Zheng-Hua Tan, Jesper Rindom Jensen
Learning to Have an Ear for Face Super-Resolution	2020	Givi Meishvili, Simon Jenni, Paolo Favaro
Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision	2020	Soo-Whan Chung, Hong-Goo Kang, Joon Son Chung
Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment	2023	Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, Zhen-Hua Ling
Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association	2021	Peisong Wen, Qianqian Xu, Yangbangyan Jiang, Zhiyong Yang, Yuan He, Qingming Huang
Unauthorized AI cannot recognize me: Reversible adversarial example	2022	Jiayang Liu, Weiming Zhang, Kazuto Fukuchi, Youhei Akimoto, Jun Sakuma

Works Cited by This (27)

Title	Year	Authors
Distilling the Knowledge in a Neural Network	2015	Geoffrey E. Hinton, Oriol Vinyals, Jay B. Dean
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift	2015	Sergey Ioffe, Christian Szegedy
Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers	2017	Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, Ian Sturdy
VoxCeleb: A Large-Scale Speaker Identification Dataset	2017	Arsha Nagrani, Joon Son Chung, Andrew Zisserman
Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data	2017	Wei-Ning Hsu, Yu Zhang, James Glass
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features	2018	Andrew Owens, Alexei A. Efros
X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes	2018	Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild	2018	Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion	2019	Suwon Shon, Tae-Hyun Oh, James Glass
Diversity in Faces	2019	Michele Merler, Nalini Ratha, Rogério Feris, John R. Smith