AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

Type: Article

Publication Date: 2023-01-01

Citations: 18

DOI: https://doi.org/10.18653/v1/2023.findings-acl.552

Abstract

CLIP (Contrastive Language-Image Pretraining) is an English multimodal representation model learned from a massive amount of English text-image pairs, and it has achieved great success in various downstream tasks, including image classification, text-to-image retrieval, and image generation. The major obstacle to extending CLIP to other languages is the lack of high-quality text-image pairs. In this work, we present AltCLIP, a simple and low-resource method for building a strong multilingual multimodal representation model. Instead of training a model from scratch on multilingual text-image pairs, we take the original CLIP model trained on English text-image pairs and replace its text encoder with a pre-trained multilingual text encoder, XLM-R. We then align the text and image representations with a two-stage training scheme consisting of teacher learning and contrastive learning. Our method leverages readily available parallel text data and pre-trained multilingual language models. Extensive experimental evaluations demonstrate the effectiveness of the proposed method: our model sets new state-of-the-art zero-shot performance on a wide range of multilingual multimodal benchmarks, including the ImageNet-CN/IT/JA/KO series, Flickr30k-CN, COCO-CN, Multi30k, and XTD. Furthermore, our model outperforms the original CLIP on zero-shot cross-modal retrieval, the Image Classification in the Wild (ICinW) tasks, and the CLIP Benchmark. We plan to open-source our code, pre-trained model weights, and evaluation toolkits for multilingual multimodal tasks to facilitate research on multilingual multimodal representation learning.
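As a rough illustration of the first (teacher-learning) stage described above, the sketch below distills sentence embeddings from the frozen CLIP text encoder into an XLM-R student on parallel text using an MSE loss. The checkpoint names, the linear projection head, and the toy parallel corpus are illustrative assumptions, not the authors' released training code.

```python
# Minimal sketch of AltCLIP's teacher-learning (distillation) stage,
# assuming Hugging Face checkpoints; not the authors' released code.
import torch
import torch.nn.functional as F
from transformers import (CLIPTextModelWithProjection, CLIPTokenizer,
                          XLMRobertaModel, XLMRobertaTokenizer)

# Frozen teacher: the original CLIP text encoder (English).
teacher = CLIPTextModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14").eval()
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Trainable student: XLM-R plus a linear head mapping its hidden size
# (1024) to CLIP's text-embedding size (768).
student = XLMRobertaModel.from_pretrained("xlm-roberta-large")
student_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
proj = torch.nn.Linear(student.config.hidden_size, teacher.config.projection_dim)

opt = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5)

# Toy parallel corpus: (English sentence, translation). The actual method
# trains on large parallel text corpora rather than image-text pairs.
parallel = [
    ("a photo of a cat", "一张猫的照片"),
    ("a dog running on the beach", "一只在海滩上奔跑的狗"),
]

for en, zh in parallel:
    with torch.no_grad():  # the teacher stays frozen throughout
        target = teacher(**teacher_tok(en, return_tensors="pt")).text_embeds

    out = student(**student_tok(zh, return_tensors="pt"))
    pred = proj(out.last_hidden_state[:, 0])  # first-token ([CLS]) embedding

    loss = F.mse_loss(pred, target)  # pull student toward teacher's embedding
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After this distillation stage, the student can stand in for CLIP's original text encoder, so the second, contrastive stage needs only a comparatively small amount of multilingual text-image data to fine-tune the alignment.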

Locations

  • arXiv (Cornell University)
  • Findings of the Association for Computational Linguistics: ACL 2023

Similar Works

  • AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities (2022) - Zhongzhi Chen, Guang Liu, Bowen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu
  • CapsFusion: Rethinking Image-Text Data at Scale (2023) - Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, Jingjing Liu
  • jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images (2024) - Andreas Koukounas, Georgios Mastrapas, Bo Wang, Mohammad Kalim Akram, Sedigheh Eslami, Michael Günther, Isabelle Mohr, Saba Sturua, Scott Martens, Nan Wang
  • Long-CLIP: Unlocking the Long-Text Capability of CLIP (2024) - Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
  • Contrastive Language-Image Pre-training for the Italian Language (2021) - Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, Sri Lakshmi
  • Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (2023) - Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang
  • Jina CLIP: Your CLIP Model Is Also Your Text Retriever (2024) - Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala
  • LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs (2021) - Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, Aran Komatsuzaki
  • CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment (2022) - Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo
  • Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training (2024) - Haicheng Wang, Jin Chen, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jie Lan, Ying Chen
  • FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training (2024) - A. Cao, Xing Wei, Zhiheng Ma
  • CoCa: Contrastive Captioners are Image-Text Foundation Models (2022) - Jiahui Yu, Zirui Wang, Vijay K. Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
  • CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling (2024) - Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng
  • IDEA: Image Description Enhanced CLIP-Adapter (2025) - Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang
  • CLIPPO: Image-and-Language Understanding from Pixels Only (2022) - Michael Tschannen, Basil Mustafa, Neil Houlsby
  • CLIPPO: Image-and-Language Understanding from Pixels Only (2023) - Michael Tschannen, Basil Mustafa, Neil Houlsby
  • ALIP: Adaptive Language-Image Pre-training with Synthetic Caption (2023) - Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
  • Improving CLIP Training with Language Rewrites (2023) - Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian
  • RWKV-CLIP: A Robust Vision-Language Representation Learner (2024) - Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng