TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Type: Preprint

Publication Date: 2024-11-04

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2411.02545

Abstract

Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating "hard" negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark under an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Our code, models, and data are available at: https://tripletclip.github.io
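To make the recipe the abstract describes concrete, below is a minimal PyTorch sketch of a CLIP-style contrastive step in which LLM-generated hard negative captions and synthetic negative images are appended as extra candidates in each direction of the InfoNCE loss. This is an illustration under stated assumptions, not the authors' released implementation: the names info_nce and triplet_clip_step, the stand-in linear encoders, and the equal weighting of the two loss directions are hypothetical, and the paper alternates between the two negative types across batches rather than necessarily combining them in a single step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(queries, candidates, temperature=0.07):
    # One-directional InfoNCE: row i's positive is candidate i; every other
    # column, including any appended hard negatives, serves as a negative.
    logits = queries @ candidates.t() / temperature
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)

def triplet_clip_step(image_encoder, text_encoder, img, txt, neg_img, neg_txt):
    # Hypothetical interface: encoders map raw inputs to embeddings; neg_txt
    # is an LLM-written negative caption, neg_img a text-to-image rendering
    # of that negative caption.
    i  = F.normalize(image_encoder(img), dim=-1)
    t  = F.normalize(text_encoder(txt), dim=-1)
    ni = F.normalize(image_encoder(neg_img), dim=-1)
    nt = F.normalize(text_encoder(neg_txt), dim=-1)
    # Image -> text: each image must rank its caption above all in-batch
    # captions plus the appended hard negative captions.
    loss_i2t = info_nce(i, torch.cat([t, nt], dim=0))
    # Text -> image: each caption must rank its image above all in-batch
    # images plus the appended synthetic negative images.
    loss_t2i = info_nce(t, torch.cat([i, ni], dim=0))
    # Equal weighting is an assumption; the paper alternates negative types.
    return 0.5 * (loss_i2t + loss_t2i)

# Smoke test with stand-in linear encoders on random features.
if __name__ == "__main__":
    image_encoder, text_encoder = nn.Linear(128, 64), nn.Linear(128, 64)
    feats = torch.randn(8, 128)
    loss = triplet_clip_step(image_encoder, text_encoder, feats, feats,
                             torch.randn(8, 128), torch.randn(8, 128))
    print(f"loss = {loss.item():.4f}")
```

Appending the negatives simply widens the candidate set of the softmax, so the sketch reduces to the standard symmetric CLIP loss when no hard negatives are supplied.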

Locations

  • arXiv (Cornell University)

Similar Works

  • CoCa: Contrastive Captioners are Image-Text Foundation Models (2022). Jiahui Yu, Zirui Wang, Vijay K. Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
  • The Hard Positive Truth about Vision-Language Compositionality (2024). Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, Ranjay Krishna
  • Modeling Caption Diversity in Contrastive Vision-Language Pretraining (2024). Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas
  • Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding (2023). Le Zhang, Rabiul Awal, Aishwarya Agrawal
  • ProtoCLIP: Prototypical Contrastive Language Image Pretraining (2023). Delong Chen, Wu Zhao, Fan Liu, Zaiquan Yang, Shaoqiu Zheng, Ying Tan, Erjin Zhou
  • ProtoCLIP: Prototypical Contrastive Language Image Pretraining (2022). Delong Chen, Wu Zhao, Fan Liu, Zaiquan Yang, Yixiang Huang, Yiping Bao, Erjin Zhou
  • Boosting Visual-Language Models by Exploiting Hard Samples (2023). Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu, Tianyang Hu, Xiaodan Liang, Zhenguo Li
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (2022). Junnan Li, Dongxu Li, Caiming Xiong, Steven C. H. Hoi
  • Improving CLIP Training with Language Rewrites (2023). Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian
  • Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training (2024). Haicheng Wang, Jin Chen, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jie Lan, Ying Chen
  • CapsFusion: Rethinking Image-Text Data at Scale (2023). Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, Jingjing Liu
  • CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision (2021). Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordóñez
  • ALIP: Adaptive Language-Image Pre-training with Synthetic Caption (2023). Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
  • Scaling Up Vision-Language Pre-training for Image Captioning (2021). Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang
  • Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity (2024). Siddharth Joshi, Arnav Jain, Ali Payani, Baharan Mirzasoleiman
  • CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations (2021). Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordóñez
  • Teaching CLIP to Count to Ten (2023). Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel
  • Improved baselines for vision-language pre-training (2023). Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal
  • Generative Negative Text Replay for Continual Vision-Language Pretraining (2022). Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, Xuming He
