EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
We focus on multi-modal fusion for egocentric action recognition and propose a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities (RGB, Flow and Audio) and combine them with mid-level fusion alongside sparse temporal …
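To make the idea of binding modalities within a range of temporal offsets concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each modality is sampled at its own time index inside one shared window and that mid-level fusion is plain concatenation of per-modality feature vectors. The names `sample_offsets`, `fuse`, `num_steps` and `window`, as well as the toy random features, are hypothetical and serve only as illustration.

```python
import random

MODALITIES = ("rgb", "flow", "audio")


def sample_offsets(num_steps, window, rng):
    """Draw one time index per modality, all inside a single shared window."""
    if not 1 <= window <= num_steps:
        raise ValueError("window must be between 1 and num_steps")
    # Choose the window start so that the whole window fits inside the clip.
    start = rng.randrange(num_steps - window + 1)
    # Each modality gets its own index, so indices may differ across modalities.
    return {m: start + rng.randrange(window) for m in MODALITIES}


def fuse(features, offsets):
    """Mid-level fusion, assumed here to be concatenation of feature vectors."""
    fused = []
    for m in MODALITIES:
        fused.extend(features[m][offsets[m]])
    return fused


rng = random.Random(0)
num_steps, dim = 20, 4

# Toy per-step feature vectors standing in for per-modality network outputs.
features = {
    m: [[rng.random() for _ in range(dim)] for _ in range(num_steps)]
    for m in MODALITIES
}

offsets = sample_offsets(num_steps, window=5, rng=rng)
fused = fuse(features, offsets)
```

With three modalities and `dim = 4`, `fused` has 12 entries, and the sampled indices never differ by more than `window - 1` steps. Setting `window=1` forces all modalities onto the same time step, which is the synchronous special case of this sketch.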