PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

Type: Article

Publication Date: 2022-01-01

Citations: 20

DOI: https://doi.org/10.18653/v1/2022.emnlp-main.763

Abstract

Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal tasks, where VLP models without reliance on object detectors are becoming the mainstream due to their superior computation efficiency and competitive performance. However, the removal of object detectors also deprives the capability of VLP models in explicit object modeling, which is essential to various position-sensitive vision-language (VL) tasks, such as referring expression comprehension and visual commonsense reasoning. To address the challenge, we introduce PEVL that enhances the pre-training and prompt tuning of VLP models with explicit object position modeling. Specifically, PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit VL alignment during pre-training, and also enables flexible prompt tuning for various downstream tasks. We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs. We make the data and code for this paper publicly available at https://github.com/thunlp/PEVL.
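The abstract's key mechanism, recasting discretized object positions as tokens in the same sequence as the caption text, can be made concrete with a small sketch. The Python snippet below is not taken from the paper or the thunlp/PEVL repository; the bin count (512), the [pos_*] token format, and the place where the tokens are inserted are illustrative assumptions only.

```python
# Minimal sketch (assumed details, not the official PEVL code): turn a bounding
# box into discrete position tokens and append them to the grounded mention.

def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """Map a pixel-space box (x1, y1, x2, y2) to discrete position tokens."""
    x1, y1, x2, y2 = box
    # Normalize each coordinate to [0, 1], then quantize into num_bins bins.
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]
    return [f"[pos_{b}]" for b in bins]

def annotate_mention(text, mention, box, image_w, image_h):
    """Insert position tokens right after the grounded mention in the caption."""
    tokens = " ".join(box_to_position_tokens(box, image_w, image_h))
    return text.replace(mention, f"{mention} {tokens}", 1)

# Example: ground "a dog" at pixel box (34, 50, 210, 300) in a 640x480 image.
print(annotate_mention("a dog chasing a ball", "a dog", (34, 50, 210, 300), 640, 480))
# -> "a dog [pos_27] [pos_53] [pos_168] [pos_320] chasing a ball"
```

Because the position tokens live in the same vocabulary as ordinary words, they can be masked and recovered during language modeling, which is one way to read the abstract's "unified language modeling framework"; see the repository linked above for the actual implementation.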

Locations

  • arXiv (Cornell University)
  • Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Similar Works

  • PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models (2022). Yuan Yao, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat‐Seng Chua, Maosong Sun
  • ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding (2024). Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, Yuan Zhang, David Doermann, Qixiang Ye
  • CPT: Color-based Prompt Tuning for pre-trained vision-language models (2024). Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat‐Seng Chua, Maosong Sun
  • Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models (2023). Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang
  • Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring (2024). Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
  • CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models (2021). Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat‐Seng Chua, Maosong Sun
  • Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models (2024). Wei Wang, Zhaowei Li, Qi Xu, Linfeng Li, Yiqing Cai, Botian Jiang, Hang Song, Xingcan Hu, Pengyu Wang, Xiaobin Li
  • Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception (2024). Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, Xuansong Xie
  • Language Grounded QFormer for Efficient Vision Language Understanding (2023). Moulik Choraria, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav R. Varshney
  • Position-guided Text Prompt for Vision-Language Pre-training (2022). Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan
  • SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models (2024). Tongtian Yue, Jie Cheng, Longteng Guo, Xingyuan Dai, Zijia Zhao, Xingjian He, Gang Xiong, Yisheng Lv, Jing Liu
  • Position-Guided Text Prompt for Vision-Language Pre-Training (2023). Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan
  • Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training (2024). David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
  • Fine-Grained Visual Prompting (2023). Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang
  • GLIPv2: Unifying Localization and Vision-Language Understanding (2022). Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen‐Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq‐Neng Hwang, Jianfeng Gao
  • Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM (2024). Navid Rajabi, Jana Košecká
  • Grounded 3D-LLM with Referent Tokens (2024). Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang
  • Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models (2023). Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, Jie Yang
  • Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes (2024). Antonio Carlos Rivera, Anthony F. T. Moore, Stephen Robinson
  • Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models (2020). Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu

Works That Cite This (9)

  • Fine-Grained Scene Graph Generation with Data Transfer (2022). Ao Zhang, Yuan Yao, Qianyu Chen, Wei Ji, Zhiyuan Liu, Maosong Sun, Tat‐Seng Chua
  • ViM: Vision Middleware for Unified Downstream Transferring (2023). Yutong Feng, Biao Gong, Jianwen Jiang, Yiliang Lv, Yujun Shen, Deli Zhao, Jingren Zhou
  • CPT: Color-based Prompt Tuning for pre-trained vision-language models (2024). Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat‐Seng Chua, Maosong Sun
  • Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models (2024). Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu
  • Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models (2023). Chengcheng Ma, Yang Liu, Jiankang Deng, Lingxi Xie, Weiming Dong, Changsheng Xu
  • Learning to Agree on Vision Attention for Visual Commonsense Reasoning (2023). Zhenyang Li, Yangyang Guo, Kejie Wang, Fan Liu, Liqiang Nie, Mohan Kankanhalli
  • Domain-Wise Invariant Learning for Panoptic Scene Graph Generation (2024). Li Li, You Qin, Wei Ji, Yuxiao Zhou, Roger Zimmermann
  • SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data (2024). Ziyan Yang, Kushal Kafle, Zhe Lin, Scott Cohen, Zhihong Ding, Vicente Ordóñez
  • Unified Visual Relationship Detection with Vision and Language Models (2023). L. Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

Works Cited by This (63)

  • VQA: Visual Question Answering (2015). Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh
  • Modeling Context in Referring Expressions (2016). Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg
  • Scene Graph Generation by Iterative Message Passing (2017). Danfei Xu, Yuke Zhu, Christopher Choy, Li Fei-Fei
  • Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (2018). Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). Jacob Devlin, Ming‐Wei Chang, Kenton Lee, Kristina Toutanova
  • Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization (2017). Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra
  • Generation and Comprehension of Unambiguous Object Descriptions (2016). Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy
  • From Recognition to Cognition: Visual Commonsense Reasoning (2019). Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi
  • GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering (2019). Drew A. Hudson, Christopher D. Manning
  • Bilinear Attention Networks (2018). Jin-Hwa Kim, Jae-Hyun Jun, Byoung‐Tak Zhang