Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial …
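The abstract is cut off before it says how the tokens are embedded, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes a token mark is a fixed vector added to every visual feature that falls inside a region mask, and that reusing the same vector on every frame is what keeps a region's representation consistent over time. The names `apply_token_mark` and `apply_token_mark_video`, the additive combination, and the toy shapes are all hypothetical.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]   # H x W grid of D-dimensional features
Mask = List[List[bool]]           # H x W grid, True inside the target region


def apply_token_mark(features: FeatureMap, mask: Mask, mark: Vector) -> FeatureMap:
    """Return a copy of `features` with `mark` added wherever `mask` is True.

    Assumption: additive injection. The source does not state how the
    token mark is combined with the visual features.
    """
    marked: FeatureMap = []
    for feat_row, mask_row in zip(features, mask):
        out_row = []
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                out_row.append([f + m for f, m in zip(feat, mark)])
            else:
                out_row.append(list(feat))  # copy untouched features
        marked.append(out_row)
    return marked


def apply_token_mark_video(
    frames: List[FeatureMap], masks: List[Mask], mark: Vector
) -> List[FeatureMap]:
    """Apply the same mark to the region in every frame (hypothetical helper)."""
    return [apply_token_mark(f, m, mark) for f, m in zip(frames, masks)]


# Toy example: two frames, each a 2 x 3 grid of 2-dimensional zero features.
frames = [[[[0.0, 0.0] for _ in range(3)] for _ in range(2)] for _ in range(2)]
masks = [
    [[True, True, False], [False, False, False]],   # region in frame 0
    [[False, True, True], [False, False, False]],   # region has moved in frame 1
]
mark = [1.0, -1.0]  # one token mark, shared across frames

marked_frames = apply_token_mark_video(frames, masks, mark)
```

In this toy setup the region moves one cell to the right between frames, yet the marked cells carry the same vector in both, which is the kind of cross-frame consistency the abstract describes.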