FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Recent advances in Multi-modal Large Language Models have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in the number of visual tokens fed into LLMs, resulting in significant computational costs. Current work develops visual token compression …