Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial …