FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models

The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily focus on importance-based token pruning, which overlooks the redundancy caused by frame resemblance and repetitive visual elements. In this paper, we analyze …