Turning a CLIP Model into a Scene Text Spotter

We exploit the large-scale Contrastive Language-Image Pretraining (CLIP) model to improve scene text detection and spotting, transforming it into a robust backbone, FastTCM-CR50. This backbone uses visual prompt learning and cross-attention in CLIP to extract image- and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 …