Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

When describing an image, reading the text in the visual scene is crucial for understanding its key information. Recent work explores the TextCaps task, i.e., image captioning that reads Optical Character Recognition (OCR) tokens, which requires models to read the text in an image and include it in the generated captions. Existing approaches fail to generate …