Video as Conditional Graph Hierarchy for Multi-Granular Question Answering
Video question answering requires models to understand and reason about both complex video and language data in order to derive correct answers. Existing efforts have focused on designing sophisticated cross-modal interactions to fuse information from the two modalities, while encoding the video and question holistically as frame and …