Align and Attend: Multimodal Summarization with Dual Contrastive Losses

Type: Article

Publication Date: 2023-06-01

Citations: 28

DOI: https://doi.org/10.1109/cvpr52729.2023.01428

Abstract

The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable and high-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. To address these issues, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, achieving state-of-the-art performance on all datasets. Moreover, we collected a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at https://boheumd.github.io/A2Summ/.
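
For a concrete picture of the inter-sample objective mentioned above, the sketch below implements a standard InfoNCE-style contrastive loss over a batch of paired video and text embeddings, where matched pairs act as positives and all other pairings in the batch act as negatives. This is a minimal PyTorch illustration of the general technique, not the paper's exact formulation; the function name and the temperature value are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def inter_sample_contrastive_loss(video_emb, text_emb, temperature=0.07):
        # Illustrative InfoNCE loss (not A2Summ's exact objective):
        # matched video/text pairs in the batch are positives,
        # every other pairing in the batch serves as a negative.
        v = F.normalize(video_emb, dim=-1)   # (B, D) unit-norm video features
        t = F.normalize(text_emb, dim=-1)    # (B, D) unit-norm text features
        logits = v @ t.T / temperature       # (B, B) cosine-similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric cross-entropy over video-to-text and text-to-video.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))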

Locations

  • arXiv (Cornell University)
  • 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Works Cited by This (58)

  • Going deeper with convolutions (2015). Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
  • Neural Summarization by Extracting Sentences and Words (2016). Jianpeng Cheng, Mirella Lapata
  • Hierarchical Question-Image Co-Attention for Visual Question Answering (2016). Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
  • A Deep Reinforced Model for Abstractive Summarization (2017). Romain Paulus, Caiming Xiong, Richard Socher
  • Deep Reinforcement Learning for Unsupervised Video Summarization With Diversity-Representativeness Reward (2018). Kaiyang Zhou, Yu Qiao, Tao Xiang
  • End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features (2019). Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das
  • Iterative Document Representation Learning Towards Summarization with Polishing (2018). Xiuying Chen, Shen Gao, Chongyang Tao, Yan Song, Dongyan Zhao, Rui Yan
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). Jacob Devlin, Ming‐Wei Chang, Kenton Lee, Kristina Toutanova
  • How2: A Large-scale Dataset for Multimodal Language Understanding (2018). Ramon Sanabria, Ozan Çağlayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze
  • Discriminative Feature Learning for Unsupervised Video Summarization (2018). YunJae Jung, Donghyeon Cho, Dahun Kim, Sanghyun Woo, In So Kweon