Word Discovery in Visually Grounded, Self-Supervised Speech Models
Figure 1: HuBERT: sum of attention weights each frame receives from other frames. Ours (VG-HuBERT3): attention weights each frame receives from the [CLS A] token. Attention weights from different attention heads are coded with different colors.
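The two quantities named in the caption can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes a hypothetical per-layer attention array of shape (heads, queries, keys) whose rows sum to 1, that "other frames" excludes a frame's attention to itself, and that the [CLS A] token sits at index 0; the function names are invented for illustration.

```python
import numpy as np


def attention_received_from_frames(attn):
    """Sum of attention each frame receives from all *other* frames.

    attn: array of shape (heads, T, T); attn[h, i, j] is the weight that
    query frame i places on key frame j in head h (assumed layout).
    Returns an array of shape (heads, T).
    """
    attn = np.asarray(attn, dtype=float)
    total = attn.sum(axis=1)                            # sum over query frames
    self_weight = np.diagonal(attn, axis1=1, axis2=2)   # attn[h, j, j]
    return total - self_weight                          # drop self-attention (assumption)


def attention_received_from_cls(attn, cls_index=0):
    """Attention each frame receives from the CLS token.

    attn: array of shape (heads, T + 1, T + 1) that includes the CLS token
    at position cls_index (assumed to be 0 here).
    Returns an array of shape (heads, T), one curve per attention head.
    """
    attn = np.asarray(attn, dtype=float)
    cls_row = attn[:, cls_index, :]                     # CLS as the query
    return np.delete(cls_row, cls_index, axis=1)        # keep only the frame columns
```

Each row of the returned arrays corresponds to one attention head, so plotting the rows in different colors gives one curve per head, as the caption describes.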