FrameFusion: Combining Similarity and Importance for Video Token
Reduction on Large Visual Language Models
The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily focus on importance-based token pruning, which overlooks the redundancy caused by frame resemblance and repetitive visual elements. In this paper, we analyze …