Visual Context Window Extension: A New Perspective for Long Video
Understanding
Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. …