SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

Type: Preprint

Publication Date: 2024-07-30

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2407.20756

Abstract

With the rise of web images, managing and understanding large-scale image datasets has become increasingly important. Vision Large Language Models (VLLMs) have recently emerged, offering robust vision-understanding capabilities. However, training these models requires vast amounts of data, posing challenges to efficiency, effectiveness, data quality, and privacy. In this paper, we introduce SynthVLM, a novel data synthesis pipeline for VLLMs. Unlike existing methods that generate captions from images, SynthVLM uses advanced diffusion models and high-quality captions to automatically generate and select high-resolution images from captions, creating precisely aligned image-text pairs. Leveraging these pairs, we achieve state-of-the-art (SoTA) performance on various vision question answering tasks while maintaining high alignment quality and preserving advanced language abilities. Moreover, SynthVLM surpasses traditional GPT-4 Vision-based caption generation methods in performance while significantly reducing computational overhead. Crucially, because our method relies on purely generated data, it preserves privacy, achieving SoTA performance with just 100k data points (only 18% of the official dataset size).
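
The abstract describes a two-stage pipeline: synthesize candidate images from high-quality captions with a diffusion model, then select only the precisely aligned image-text pairs. The sketch below illustrates that idea, assuming Stable Diffusion (via the diffusers library) as the generator and a CLIP similarity score as the selection criterion; the checkpoints, threshold, and function names are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative caption -> image -> filter pipeline. Model checkpoints and the
# selection threshold are assumptions for this sketch, not the paper's setup.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image generator (any diffusion checkpoint could stand in here).
generator = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1"
).to(device)

# CLIP model used to score image-text alignment for pair selection.
scorer = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(
        text=[caption], images=image, return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        out = scorer(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def synthesize_pairs(captions, threshold=0.30):
    """Generate one image per caption; keep pairs whose CLIP score clears
    the (assumed) alignment threshold."""
    kept = []
    for caption in captions:
        image = generator(caption).images[0]
        if clip_score(image, caption) >= threshold:
            kept.append((image, caption))
    return kept
```

Under this framing, dataset quality is controlled by the caption source and the alignment threshold rather than by human annotation, which is consistent with the privacy and efficiency claims above.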

Locations

  • arXiv (Cornell University)

Similar Works

  • Synth²: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (2024). Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilić, Jovana Mitrovic, Charles Blundell, Andrea Banino
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (2022). Junnan Li, Dongxu Li, Caiming Xiong, Steven C. H. Hoi
  • CompCap: Improving Multimodal Large Language Models with Composite Captions (2024). Xiaohong Chen, Satya Narayan Shukla, Mahmoud Azab, Ananya Singh, Qifan Wang, David Dawei Yang, Shengyun Peng, Hanchao Yu, Yan Shen, Xuewen Zhang
  • Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation (2022). Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung
  • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation (2023). Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023). Junnan Li, Dongxu Li, Silvio Savarese, Steven C. H. Hoi
  • PaLI: A Jointly-Scaled Multilingual Language-Image Model (2022). Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer
  • Image Captioning with Multi-Context Synthetic Data (2023). Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
  • Image Captioning with Multi-Context Synthetic Data (2024). Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
  • Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency (2023). Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan
  • BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions (2024). Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park
  • SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (2021). Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao
  • Unified Vision-Language Pre-Training for Image Captioning and VQA (2020). Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao
  • Unified Vision-Language Pre-Training for Image Captioning and VQA (2019). Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao
  • Kosmos-G: Generating Images in Context with Multimodal Large Language Models (2023). Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei
  • VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks (2024). Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu
  • LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation (2024). Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
  • CapText: Large Language Model-based Caption Generation From Image Context and Description (2023). S Ghosh, Sagnik Anupam
  • VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (2023). Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

Works That Cite This (0)

Works Cited by This (0)