Word Discovery in Visually Grounded, Self-Supervised Speech Models

Puyuan Peng, David Harwath

Type: Article

Publication Date: 2022-09-16

Citations: 25

DOI: https://doi.org/10.21437/interspeech.2022-10652

Abstract

Figure 1: HuBERT: sum of attention weights each frame receives from other frames. Ours (VG-HuBERT3): attention weights each frame receives from the [CLS A] token. Attention weights from different attention heads are coded with different colors.
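The two quantities named in the caption can be illustrated with a minimal sketch. It is not the authors' code. It assumes a single-head attention matrix in which `attn[i][j]` is the weight that query position `i` places on key position `j`, that "other frames" excludes a frame's attention to itself, and that the [CLS A] token sits at index 0. The function names and the toy matrix are hypothetical.

```python
def received_from_others(attn):
    """For each position j, sum the weights it receives from all other positions i != j.

    Assumes attn[i][j] is the weight query position i places on key position j.
    """
    n = len(attn)
    return [sum(attn[i][j] for i in range(n) if i != j) for j in range(n)]


def received_from_cls(attn, cls_index=0):
    """Weights the CLS query places on every non-CLS position.

    Assumes the CLS token is at cls_index (0 here, an illustrative choice).
    """
    return [w for j, w in enumerate(attn[cls_index]) if j != cls_index]


# Toy single-head attention over [CLS, frame 1, frame 2]; each row sums to 1.
toy = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]

others = received_from_others(toy)   # column sums without the diagonal
from_cls = received_from_cls(toy)    # CLS row without the CLS column
```

With several attention heads, the same computation would be applied to each head's matrix separately, which matches the caption's per-head color coding.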

Locations

arXiv (Cornell University)
Interspeech 2022

Similar Works

Title	Year	Authors
Word Discovery in Visually Grounded, Self-Supervised Speech Models	2022	Puyuan Peng David Harwath
Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling	2024	Leanne Nortje
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model	2023	Puyuan Peng Shang-Wen Li Okko Räsänen Abdelrahman Mohamed David Harwath
Integrating Self-Supervised Speech Model with Pseudo Word-Level Targets from Visually-Grounded Speech Model	2024	Hung-Chieh Fang Nai-Xuan Ye Yi-Jen Shih Puyuan Peng Hsuan-Fu Wang Layne Berry Hung-yi Lee David Harwath
Speech Representation Analysis based on Inter- and Intra-Model Similarities	2024	Yassine El Kheir Ahmed Ali Shammur Absar Chowdhury
Visually Grounded Models of Spoken Language: A Survey of Datasets, Architectures and Evaluation Techniques	2022	Grzegorz Chrupała
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input	2019	David Harwath Adrià Recasens Dídac Surís Galen Chuang Antonio Torralba James Glass
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input	2018	David Harwath Adrià Recasens Dídac Surís Galen Chuang Antonio Torralba James Glass
Attention-Based Keyword Localisation in Speech Using Visual Grounding	2021	Kayode Olaleye Herman Kamper
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech	2017	Herman Kamper Shane Settle Gregory Shakhnarovich Karen Livescu
Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech	2018	Herman Kamper Gregory Shakhnarovich Karen Livescu
What Do Self-Supervised Speech Models Know About Words?	2024	Ankita Pasad Chung-Ming Chien Shane Settle Karen Livescu
What Do Self-Supervised Speech Models Know About Words?	2023	Ankita Pasad Chung-Ming Chien Shane Settle Karen Livescu
Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese	2019	William N. Havard Jean‐Pierre Chevrot Laurent Besacier

Works That Cite This (20)

Title	Year	Authors
Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples	2023	Hyeonggon Ryu Arda Senocak In So Kweon Joon Son Chung
Generative Spoken Language Model based on continuous word-sized audio tokens	2023	Robin Algayres Yossi Adi Tu Anh Nguyen Jade Copet Gabriel Synnaeve Benoît Sagot Emmanuel Dupoux
Computational Insights to Acquisition of Phonemes, Words, and Word Meanings in Early Language: Sequential or Parallel Acquisition?	2023	Khazar Khorrami María Andrea Cruz Blandón Okko Räsänen
Visually Grounded Speech Models Have a Mutual Exclusivity Bias	2024	Leanne Nortje Dan Oneață Yevgen Matusevych Herman Kamper
Towards Visually Prompted Keyword Localisation for Zero-Resource Spoken Languages	2023	Leanne Nortje Herman Kamper
ConceptBeam	2022	Yasunori Ohishi Marc Delcroix Tsubasa Ochiai Shoko Araki Daiki Takeuchi Daisuke Niizumi Akisato Kimura Noboru Harada Kunio Kashino
Word Segmentation on Discovered Phone Units With Dynamic Programming and Self-Supervised Scoring	2022	Herman Kamper
XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words	2023	Robin Algayres Pablo Diego-Simón Benoît Sagot Emmanuel Dupoux
Self-Supervised Speech Representation Learning: A Review	2022	Abdelrahman Mohamed Hung-yi Lee Lasse Borgholt Jakob D. Havtorn Joakim Edin Christian Igel Katrin Kirchhoff Shang-Wen Li Karen Livescu Lars Maaløe
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model	2023	Yi-Jen Shih Hsuan-Fu Wang Heng-Jui Chang Layne Berry Hung-yi Lee David Harwath

Works Cited by This (33)

Title	Year	Authors
Deep Multimodal Semantic Embeddings for Speech and Images	2015	David Harwath James Glass
A segmental framework for fully-unsupervised large-vocabulary speech recognition	2017	Herman Kamper Aren Jansen Sharon Goldwater
Deep Visual-Semantic Alignments for Generating Image Descriptions	2016	Andrej Karpathy Li Fei-Fei
Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner	2018	Emmanuel Dupoux
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input	2019	David Harwath Adrià Recasens Dídac Surís Galen Chuang Antonio Torralba James Glass
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	2018	Jacob Devlin Ming‐Wei Chang Kenton Lee Kristina Toutanova
The zero resource speech challenge 2017	2017	Ewan Dunbar Xuan Cao Juan Benjumea Julien Karadayi Mathieu Bernard Laurent Besacier Xavier Anguera Emmanuel Dupoux
An embedded segmental K-means model for unsupervised segmentation and clustering of speech	2017	Herman Kamper Karen Livescu Sharon Goldwater
Transfer Learning from Audio-Visual Grounding to Speech Recognition	2019	Wei-Ning Hsu David Harwath James Glass
Large-Scale Representation Learning from Visually Grounded Untranscribed Speech	2019	Gabriel Ilharco Yuan Zhang Jason Baldridge