FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual
Token Compression
Recent advances in multi-modal large language models (MLLMs) have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in the number of visual tokens fed into the LLM, resulting in significant computational cost. Current works develop visual token compression …