Exploring Audio-Visual Information Fusion for Sound Event Localization
and Detection In Low-Resource Realistic Scenarios
This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim to exploit information from both the audio and video modalities through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, …