Grounding Referring Expressions in Images by Variational Context

Type: Article

Publication Date: 2018-06-01

Citations: 243

DOI: https://doi.org/10.1109/cvpr.2018.00437


Abstract

We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task, since it requires not only the localization of objects but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity of modeling the context associated with multiple image regions, existing work oversimplifies the task to pairwise region modeling via multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Our model exploits the reciprocal relation between the referent and the context: either one influences the estimation of the posterior distribution of the other, so the search space of contexts can be greatly reduced. We also extend the model to the unsupervised setting, where no annotation of the referent is available. Extensive experiments on various benchmarks show consistent improvement over state-of-the-art methods in both supervised and unsupervised settings. The code is available at https://github.com/yuleiniu/vc/.
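
As a reading aid, the bound such a variational treatment optimizes can be sketched in standard terms. This is a minimal sketch in our own notation, assuming a referent region x, a latent context region z, and a referring expression L; the paper's exact objective and parameterization may differ:

\log p(x \mid L) \;\ge\; \mathbb{E}_{q(z \mid x, L)}\big[\log p(x \mid z, L)\big] \;-\; \mathrm{KL}\big(q(z \mid x, L) \,\|\, p(z \mid L)\big)

Here the approximate posterior q(z | x, L) estimates the context given a referent hypothesis, while p(x | z, L) localizes the referent given a context; this mutual dependence is the reciprocal relation the abstract describes, and scoring only the contexts favored by q is what cuts down the otherwise exponential context search space.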

Locations

  • arXiv (Cornell University)
  • DR-NTU (Nanyang Technological University)

Similar Works

  • Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions (2019). Yulei Niu, Hanwang Zhang, Zhiwu Lu, Shih-Fu Chang
  • Grounding Referring Expressions in Images by Variational Context (2017). Hanwang Zhang, Yulei Niu, Shih-Fu Chang
  • Relationship-Embedded Representation Learning for Grounding Referring Expressions (2019). Sibei Yang, Guanbin Li, Yizhou Yu
  • Referring Expression Grounding by Marginalizing Scene Graph Likelihood (2019). Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, Fanglin Wang
  • Joint Visual Grounding with Language Scene Graphs (2019). Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, Meng Wang, Qianru Sun
  • Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos (2021). Sijie Song, Xudong Lin, Jiaying Liu, Zongming Guo, Shih-Fu Chang
  • Correspondence Matters for Video Referring Expression Comprehension (2022). Meng Cao, Ji Jiang, Long Chen, Yuexian Zou
  • Referring Transformer: A One-step Approach to Multi-task Visual Grounding (2021). Muchen Li, Leonid Sigal
  • Video Referring Expression Comprehension via Transformer with Content-aware Query (2022). Ji Jiang, Meng Cao, Tengtao Song, Yuexian Zou
  • Improving Referring Expression Grounding with Cross-Modal Attention-Guided Erasing (2019). Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, Hongsheng Li
  • Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation (2023). Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr
  • Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding (2019). Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Li Su, Qingming Huang
  • Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation (2021). Liang Chen, Yu Wu, Tianfei Zhou, Wenguan Wang, Zongxin Yang, Yunchao Wei, Yi Yang
  • ScanFormer: Referring Expression Comprehension by Iteratively Scanning (2024). Wei Su, Peihan Miao, Huanzhang Dou, Xi Li
  • Meta Compositional Referring Expression Segmentation (2023). Li Xu, He Huang, Xindi Shang, Zehuan Yuan, Ying Sun, Jun Liu

Works Cited by This (32)

  • Multiple Object Recognition with Visual Attention (2014). Jimmy Ba, Volodymyr Mnih, Koray Kavukcuoglu
  • Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models (2015). Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik
  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015). Sergey Ioffe, Christian Szegedy
  • VQA: Visual Question Answering (2015). Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh
  • Neural Machine Translation by Jointly Learning to Align and Translate (2014). Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
  • Hierarchical Question-Image Co-Attention for Visual Question Answering (2016). Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
  • Visual Relationship Detection with Language Priors (2016). Cewu Lu, Ranjay Krishna, Michael S. Bernstein, Li Fei-Fei
  • Modeling Context in Referring Expressions (2016). Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg
  • Phrase Localization and Visual Relationship Detection with Comprehensive Linguistic Cues (2016). Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, Svetlana Lazebnik
  • Modeling Relationships in Referential Expressions with Compositional Modular Networks (2017). Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko