Decoder Pre-Training with only Text for Scene Text Recognition

Type: Preprint

Publication Date: 2024-08-11

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2408.05706

Abstract

Scene text recognition (STR) pre-training methods have achieved remarkable progress, relying primarily on synthetic datasets. However, the domain gap between synthetic and real images makes it difficult to acquire feature representations that align well with real-scene images, limiting the performance of these methods. We note that vision-language models such as CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting that representations of real images can be derived from text alone. Building on this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. An Offline Randomized Perturbation (ORP) strategy enriches the diversity of these text embeddings by incorporating natural-image embeddings extracted from the CLIP image encoder, effectively directing the decoder to acquire potential representations of real images. In addition, we introduce a Feature Merge Unit (FMU) that guides the extracted visual embeddings to focus on the character foreground within the text image, enabling the pre-trained decoder to work more efficiently and accurately. Extensive experiments across various STR decoders and language recognition tasks underscore the broad applicability and remarkable performance of DPTR, providing a novel insight for STR pre-training. Code is available at https://github.com/Topdu/OpenOCR
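The perturbation idea described above can be sketched in a few lines: a CLIP text embedding is mixed with a randomly chosen natural-image embedding to produce a pseudo visual embedding for decoder pre-training. The following NumPy sketch is only an illustration of that mixing step, not the authors' implementation; the function name, the mixing weight `alpha`, and the 512-dimensional embedding size (typical of CLIP ViT-B/32) are assumptions.

```python
import numpy as np

def offline_randomized_perturbation(text_emb, image_embs, alpha=0.1, rng=None):
    """Illustrative sketch of ORP: perturb an L2-normalized text embedding
    with a randomly selected natural-image embedding, then re-normalize.
    `alpha` (hypothetical value) controls the perturbation strength."""
    if rng is None:
        rng = np.random.default_rng()
    # pick one image embedding at random from the offline pool
    noise = image_embs[rng.integers(len(image_embs))]
    pseudo_visual = text_emb + alpha * noise
    # re-normalize so the result lives on the same unit sphere as CLIP embeddings
    return pseudo_visual / np.linalg.norm(pseudo_visual)

# toy usage with random unit vectors standing in for CLIP embeddings
rng = np.random.default_rng(0)
t = rng.normal(size=512)
t /= np.linalg.norm(t)
imgs = rng.normal(size=(8, 512))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
p = offline_randomized_perturbation(t, imgs, alpha=0.1, rng=rng)
```

In the paper's setting these pseudo visual embeddings would replace encoder outputs during decoder pre-training; the sketch only shows the embedding-level perturbation.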

Locations

  • arXiv (Cornell University)

Similar Works

  • CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model (2023). Shuai Zhao, Xiaohan Wang, Linchao Zhu, Ruijie Quan, Yi Yang
  • Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting (2022). Chuhui Xue, Hao Yu, Shijian Lu, Philip Torr, Song Bai
  • VIPTR: A Vision Permutable Extractor for Fast and Efficient Scene Text Recognition (2024). Xianfu Cheng, Weixiao Zhou, Xiang Li, Xiaohong Chen, Jian Yang, Tongliang Li, Zhoujun Li
  • TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models (2024). Jonathan Fhima, Elad Ben Avraham, Oren Nuriel, Yair Kittenplon, Roy Ganz, Aviad Aberdam, Ron Litman
  • CLIPTER: Looking at the Bigger Picture in Scene Text Recognition (2023). Aviad Aberdam, David Bensaïd, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, Ron Litman
  • Vision-Language Pre-Training for Boosting Scene Text Detectors (2022). Sibo Song, Jianqiang Wan, Zhibo Yang, Jun Tang, Wenqing Cheng, Xiang Bai, Cong Yao
  • Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition (2023). Zixiao Wang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Boqiang Zhang, Yongdong Zhang
  • I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition (2021). Chuhui Xue, Shijian Lu, Song Bai, Wenqing Zhang, Changhu Wang
  • VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer (2024). Humen Zhong, Zhibo Yang, Zhaohai Li, Peng Wang, Jun Tang, Wenqing Cheng, Cong Yao
  • UNIT: Unifying Image and Text Recognition in One Vision Encoder (2024). Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu
  • Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval (2024). Gangyan Zeng, Yuan Zhang, Jin Wei, Dongbao Yang, Peng Zhang, Yiwen Gao, Xugong Qin, Yu Zhou
  • MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining (2022). Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
  • Turning a CLIP Model into a Scene Text Detector (2023). Wenwen Yu, Yuliang Liu, Wei Hua, Deqiang Jiang, Bo Ren, Xiang Bai
  • TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model (2024). Jiahao Lyu, Jin Wei, Gangyan Zeng, Zeng Li, Enze Xie, Wei Wang, Yu Zhou
  • Multimodal Semi-Supervised Learning for Text Recognition (2022). Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman
  • Masked Vision-Language Transformers for Scene Text Recognition (2022). Jie Wu, Ying Peng, Shengming Zhang, Weigang Qi, Jian Zhang
  • Turning a CLIP Model into a Scene Text Spotter (2023). Wenwen Yu, Yuliang Liu, Xingkui Zhu, Haoyu Cao, Xing Sun, Xiang Bai

Works That Cite This (0)


Works Cited by This (0)
