TokenPacker: Efficient Visual Projector for Multimodal LLM

Type: Preprint

Publication Date: 2024-07-02

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2407.02392

Abstract

The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via a one-to-one transformation. However, the visual tokens are redundant, and their number grows considerably when dealing with high-resolution images, significantly impairing the efficiency of MLLMs. Some recent works introduce a resampler or abstractor to reduce the number of resulting visual tokens, but they fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics into the condensed visual tokens. Specifically, we first interpolate the visual features into a low-resolution point query, providing the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that utilizes high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75%~89% while achieving comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source code can be found at https://github.com/CircleRadon/TokenPacker.
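For illustration, the sketch below shows one way the coarse-to-fine idea described in the abstract could look in code, assuming single-level features and a PyTorch-style module: the visual feature map is interpolated into low-resolution point queries, and each query then cross-attends to the high-resolution tokens of its own local region (the region-to-point injection). This is a minimal sketch, not the authors' implementation (which also draws on multi-level region cues); the class and parameter names such as RegionToPointInjection and scale_factor are illustrative assumptions.

```python
# Minimal sketch (assumed names, not the TokenPacker codebase) of the
# coarse-to-fine projector idea: downsample visual features into coarse
# "point queries", then let each query absorb the high-resolution tokens
# in its local region via cross-attention ("region-to-point injection").
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionToPointInjection(nn.Module):
    """One coarse query per region cross-attends to its local high-res tokens."""

    def __init__(self, dim: int, scale_factor: int = 2):
        super().__init__()
        self.scale = scale_factor              # 2 -> 4x fewer tokens (75% compression)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H, W, C) high-resolution visual features from the encoder.
        B, H, W, C = feats.shape
        s = self.scale
        h, w = H // s, W // s

        # Coarse point queries: bilinear interpolation to a low-resolution grid.
        coarse = F.interpolate(
            feats.permute(0, 3, 1, 2), size=(h, w), mode="bilinear", align_corners=False
        ).permute(0, 2, 3, 1)                                  # (B, h, w, C)
        q = self.q_proj(coarse).reshape(B * h * w, 1, C)       # one query per region

        # Group the s*s high-resolution tokens sitting under each coarse query.
        regions = feats.reshape(B, h, s, w, s, C).permute(0, 1, 3, 2, 4, 5)
        regions = regions.reshape(B * h * w, s * s, C)         # (B*h*w, s*s, C)
        k, v = self.k_proj(regions), self.v_proj(regions)

        # Region-to-point injection: each coarse query absorbs its local cues.
        attn = torch.softmax(q @ k.transpose(-1, -2) / C**0.5, dim=-1)
        out = self.out_proj(attn @ v).reshape(B, h * w, C)     # condensed tokens
        return coarse.reshape(B, h * w, C) + out               # residual update


if __name__ == "__main__":
    x = torch.randn(1, 24, 24, 1024)                 # e.g., a ViT-L/14 patch grid
    packer = RegionToPointInjection(dim=1024, scale_factor=2)
    print(packer(x).shape)                           # (1, 144, 1024): 576 -> 144 tokens
```

With scale_factor=2, a 24x24 grid of 576 visual tokens is condensed to 144 (75% compression); a factor of 3 would yield 64 tokens (roughly 89%), matching the compression range reported in the abstract.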

Locations

  • arXiv (Cornell University) - View - PDF

Similar Works

  • Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation (2024): Si-Jin Qian, Bingquan Liu, Chengjie Sun, Zhen Xu, Baoxun Wang
  • DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models (2024): Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou
  • LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (2024): Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang
  • PerceptionGPT: Effectively Fusing Visual Perception into LLM (2023): Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, Tong Zhang
  • FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression (2024): Yilin Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo
  • LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models (2024): Yuzhang Shang, Mu Cai, Bingxin Xu, Yong-Jae Lee, Yan Yan
  • ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (2024): Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, J. J. Song, Shiji Song, Gao Huang, Bo Zheng
  • Honeybee: Locality-enhanced Projector for Multimodal LLM (2023): Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
  • AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning (2024): Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
  • Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight (2024): Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang
  • LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (2025): Shaolei Zhang, Qingkai Fang, Zhe Yang, Yan Feng
  • Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing (2024): Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
  • Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (2024): Shengbang Tong, Ellis Brown, Penghao Wu, S. Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan
  • From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models (2024): Yuying Shang, Xinyi Zeng, Yutao Zhu, Yang Xiao, Zhengwei Fang, Jingyuan Zhang, Jiawei Chen, Zinan Liu, Yu Tian
  • Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts (2024): Honglin Li, Yuting Gao, Chenglu Zhu, Jiahong Chen, Ming-Hsuan Yang, Lin Yang
  • FlexAttention for Efficient High-Resolution Vision-Language Models (2024): Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, Chuang Gan
  • Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (2024): Yifan Zhang, Qingsong Wen, Chaoyou Fu, X. Wang, Zhang Zhang, Chenghao Wang, Rong Jin
  • HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments (2024): Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
  • AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding (2024): Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li
  • Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion (2024): Ziyue Wang, Chi Chen, Yiqi Zhu, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu

Works That Cite This (0)

Works Cited by This (0)
