SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

Type: Preprint

Publication Date: 2024-07-30

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2407.20756

Abstract

With the rise of web images, managing and understanding large-scale image datasets has become increasingly important. Vision Large Language Models (VLLMs) have recently emerged, offering robust vision-understanding capabilities. However, training these models requires vast amounts of data, posing challenges to efficiency, effectiveness, data quality, and privacy. In this paper, we introduce SynthVLM, a novel data synthesis pipeline for VLLMs. Unlike existing methods that generate captions from images, SynthVLM uses advanced diffusion models and high-quality captions to automatically generate and select high-resolution images from captions, creating precisely aligned image-text pairs. Leveraging these pairs, we achieve state-of-the-art (SoTA) performance on various vision question answering tasks while maintaining high alignment quality and preserving advanced language abilities. Moreover, SynthVLM surpasses traditional GPT-4 Vision-based caption generation methods in performance while significantly reducing computational overhead. Crucially, because our method relies on purely generated data, it preserves privacy, achieving SoTA performance with just 100k data points (only 18% of the official dataset size).
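
The abstract describes a two-stage pipeline: synthesize candidate images from high-quality captions with a diffusion model, then select only the precisely aligned image-text pairs. The sketch below illustrates that idea, assuming Stable Diffusion (via the diffusers library) as the generator and a CLIP similarity score as the selection criterion; the checkpoints, threshold, and function names are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative caption -> image -> filter pipeline. Model checkpoints and the
# selection threshold are assumptions for this sketch, not the paper's setup.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image generator (any diffusion checkpoint could stand in here).
generator = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1"
).to(device)

# CLIP model used to score image-text alignment for pair selection.
scorer = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(
        text=[caption], images=image, return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        out = scorer(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def synthesize_pairs(captions, threshold=0.30):
    """Generate one image per caption; keep pairs whose CLIP score clears
    the (assumed) alignment threshold."""
    kept = []
    for caption in captions:
        image = generator(caption).images[0]
        if clip_score(image, caption) >= threshold:
            kept.append((image, caption))
    return kept
```

Under this framing, dataset quality is controlled by the caption source and the alignment threshold rather than by human annotation, which is consistent with the privacy and efficiency claims above.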

Locations

  • arXiv (Cornell University)

Similar Works

  • Synth²: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (2024). Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilić, Jovana Mitrovic, Charles Blundell, Andrea Banino
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (2022). Junnan Li, Dongxu Li, Caiming Xiong, Steven C. H. Hoi
  • CompCap: Improving Multimodal Large Language Models with Composite Captions (2024). Xiaohong Chen, Satya Narayan Shukla, Mahmoud Azab, Ananya Singh, Qifan Wang, David Dawei Yang, Shengyun Peng, Hanchao Yu, Yan Shen, Xuewen Zhang
  • Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation (2022). Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung
  • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation (2023). Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023). Junnan Li, Dongxu Li, Silvio Savarese, Steven C. H. Hoi
  • PaLI: A Jointly-Scaled Multilingual Language-Image Model (2022). Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer
  • Image Captioning with Multi-Context Synthetic Data (2023). Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
  • Image Captioning with Multi-Context Synthetic Data (2024). Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
  • Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency (2023). Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan
  • BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions (2024). Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park
  • SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (2021). Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao
  • Unified Vision-Language Pre-Training for Image Captioning and VQA (2020). Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao
  • Unified Vision-Language Pre-Training for Image Captioning and VQA (2019). Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao
  • Kosmos-G: Generating Images in Context with Multimodal Large Language Models (2023). Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei
  • VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks (2024). Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu
  • LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation (2024). Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
  • CapText: Large Language Model-based Caption Generation From Image Context and Description (2023). S Ghosh, Sagnik Anupam
  • VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (2023). Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

Works That Cite This (0)

Works Cited by This (0)