Related Paper

Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding

Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with visual features for target localization. Such a formulation has limited ability to model the query at the word level, and is therefore prone to neglect words that may not be the most important …
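
The contrast described above, between one sentence embedding and word-level modeling, can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the mean pooling used for the sentence embedding, the additive fusion, and the single-head scaled dot-product attention are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sentence_level_fusion(pixels, words):
    """Hypothetical holistic baseline: pool all words into one vector
    (mean pooling is an assumption) and add it to every pixel feature.

    pixels: (P, d) visual features, words: (W, d) word embeddings.
    """
    sentence = words.mean(axis=0)   # (d,) one embedding for the whole query
    return pixels + sentence        # the same language signal at every pixel


def word_to_pixel_attention(pixels, words):
    """Hypothetical word-level alternative: each pixel attends over the
    individual words with scaled dot-product attention.

    Returns the attended language features (P, d) and the
    per-pixel attention weights over words (P, W).
    """
    d = pixels.shape[-1]
    scores = pixels @ words.T / np.sqrt(d)   # (P, W) pixel-word similarity
    weights = softmax(scores, axis=-1)       # each row sums to 1
    attended = weights @ words               # (P, d) word mix per pixel
    return attended, weights


# Toy inputs: 6 pixels, 4 words, feature dimension 8 (arbitrary sizes).
rng = np.random.default_rng(0)
pixels = rng.normal(size=(6, 8))
words = rng.normal(size=(4, 8))

attended, weights = word_to_pixel_attention(pixels, words)
print(weights.shape)  # (6, 4)
```

In the pooled baseline every pixel receives an identical language vector, so no word can matter more at one location than at another. In the attention variant each pixel gets its own weighting over the words, which is the kind of word-level modeling the abstract says holistic embeddings lack.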
