Dual-modality Seq2Seq Network for Audio-visual Event Localization
Audio-visual event localization requires identifying an event that is both visible and audible in a video, at either the frame level or the video level. To address this task, we propose a deep neural network named the Audio-Visual sequence-to-sequence dual network (AVSDN). By jointly taking both audio and visual features at …