Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can serve as a testbed for evaluating how well vision-and-language models understand images and texts and how well they reason over the joint space. However, most existing VG datasets are constructed using simple description texts, which …