TokenPacker: Efficient Visual Projector for Multimodal LLM

Type: Preprint

Publication Date: 2024-07-02

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2407.02392

Abstract

The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via a one-to-one transformation. However, the visual tokens are redundant, and their number grows considerably when dealing with high-resolution images, significantly impairing the efficiency of MLLMs. Some recent works introduce a resampler or abstractor to reduce the number of resulting visual tokens, but they fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics into the condensed visual tokens. Specifically, we first interpolate the visual features into a low-resolution point query, providing the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that utilizes high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75%~89% while achieving comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source code can be found at https://github.com/CircleRadon/TokenPacker.
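For illustration, the sketch below shows one way the coarse-to-fine idea described in the abstract could look in code, assuming single-level features and a PyTorch-style module: the visual feature map is interpolated into low-resolution point queries, and each query then cross-attends to the high-resolution tokens of its own local region (the region-to-point injection). This is a minimal sketch, not the authors' implementation (which also draws on multi-level region cues); the class and parameter names such as RegionToPointInjection and scale_factor are illustrative assumptions.

```python
# Minimal sketch (assumed names, not the TokenPacker codebase) of the
# coarse-to-fine projector idea: downsample visual features into coarse
# "point queries", then let each query absorb the high-resolution tokens
# in its local region via cross-attention ("region-to-point injection").
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionToPointInjection(nn.Module):
    """One coarse query per region cross-attends to its local high-res tokens."""

    def __init__(self, dim: int, scale_factor: int = 2):
        super().__init__()
        self.scale = scale_factor              # 2 -> 4x fewer tokens (75% compression)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H, W, C) high-resolution visual features from the encoder.
        B, H, W, C = feats.shape
        s = self.scale
        h, w = H // s, W // s

        # Coarse point queries: bilinear interpolation to a low-resolution grid.
        coarse = F.interpolate(
            feats.permute(0, 3, 1, 2), size=(h, w), mode="bilinear", align_corners=False
        ).permute(0, 2, 3, 1)                                  # (B, h, w, C)
        q = self.q_proj(coarse).reshape(B * h * w, 1, C)       # one query per region

        # Group the s*s high-resolution tokens sitting under each coarse query.
        regions = feats.reshape(B, h, s, w, s, C).permute(0, 1, 3, 2, 4, 5)
        regions = regions.reshape(B * h * w, s * s, C)         # (B*h*w, s*s, C)
        k, v = self.k_proj(regions), self.v_proj(regions)

        # Region-to-point injection: each coarse query absorbs its local cues.
        attn = torch.softmax(q @ k.transpose(-1, -2) / C**0.5, dim=-1)
        out = self.out_proj(attn @ v).reshape(B, h * w, C)     # condensed tokens
        return coarse.reshape(B, h * w, C) + out               # residual update


if __name__ == "__main__":
    x = torch.randn(1, 24, 24, 1024)                 # e.g., a ViT-L/14 patch grid
    packer = RegionToPointInjection(dim=1024, scale_factor=2)
    print(packer(x).shape)                           # (1, 144, 1024): 576 -> 144 tokens
```

With scale_factor=2, a 24x24 grid of 576 visual tokens is condensed to 144 (75% compression); a factor of 3 would yield 64 tokens (roughly 89%), matching the compression range reported in the abstract.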

Locations

  • arXiv (Cornell University) - View - PDF

Similar Works

  • Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation (2024): Si-Jin Qian, Bingquan Liu, Chengjie Sun, Zhen Xu, Baoxun Wang
  • DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models (2024): Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou
  • LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (2024): Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang
  • PerceptionGPT: Effectively Fusing Visual Perception into LLM (2023): Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, Tong Zhang
  • FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression (2024): Yilin Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo
  • LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models (2024): Yuzhang Shang, Mu Cai, Bingxin Xu, Yong-Jae Lee, Yan Yan
  • ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (2024): Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, J. J. Song, Shiji Song, Gao Huang, Bo Zheng
  • Honeybee: Locality-enhanced Projector for Multimodal LLM (2023): Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
  • AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning (2024): Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
  • Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight (2024): Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang
  • LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (2025): Shaolei Zhang, Qingkai Fang, Zhe Yang, Yan Feng
  • Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing (2024): Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
  • Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (2024): Shengbang Tong, Ellis Brown, Penghao Wu, S. Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan
  • From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models (2024): Yuying Shang, Xinyi Zeng, Yutao Zhu, Yang Xiao, Zhengwei Fang, Jingyuan Zhang, Jiawei Chen, Zinan Liu, Yu Tian
  • Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts (2024): Honglin Li, Yuting Gao, Chenglu Zhu, Jiahong Chen, Ming-Hsuan Yang, Lin Yang
  • FlexAttention for Efficient High-Resolution Vision-Language Models (2024): Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, Chuang Gan
  • Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (2024): Yifan Zhang, Qingsong Wen, Chaoyou Fu, X. Wang, Zhang Zhang, Chenghao Wang, Rong Jin
  • HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments (2024): Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
  • AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding (2024): Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li
  • Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion (2024): Ziyue Wang, Chi Chen, Yiqi Zhu, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu

Works That Cite This (0)

Works Cited by This (0)
