Ask a Question

Prefer a chat interface with context about you and your work?

Word Discovery in Visually Grounded, Self-Supervised Speech Models

Word Discovery in Visually Grounded, Self-Supervised Speech Models

OursFigure 1: HuBERT: sum of attention weights each frame receives from other frames.Ours (VG-HuBERT3): attention weights each frame receives from the [CLS A] token.Attention weights from different attention heads are coded with different colors.