Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding

Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with visual features for target localization. This formulation offers limited ability to model the query at the word level, and is therefore prone to neglect words that may not be the most important …