Turning a CLIP Model Into a Scene Text Spotter

We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.6% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average improvement of 0.2% and 0.55% in text detection and spotting tasks, along with a 47.1% increase in inference speed. 3) It showcases robust few-shot training capabilities. Utilizing only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 4.7% for text detection and spotting tasks, respectively. 4) It consistently enhances performance on out-of-distribution text detection and spotting datasets, particularly the NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented object detection.

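For concreteness, the following is a minimal, self-contained PyTorch sketch of the instance-language matching idea described in the abstract: per-pixel image embeddings are scored against a prompt embedding that has been refined by cross-attention over the image features, yielding a text-region prior map. All names and shapes here (InstanceLanguageMatching, n_learnable_prompts, the 512-dim feature grid) are illustrative assumptions for exposition, not the authors' actual FastTCM-CR50 implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceLanguageMatching(nn.Module):
    """Hypothetical sketch: score per-pixel image embeddings against a
    prompt-conditioned text embedding to produce a text-region prior.
    Module and parameter names are illustrative, not from the paper."""

    def __init__(self, embed_dim: int = 512, n_learnable_prompts: int = 4):
        super().__init__()
        # Learnable prompt tokens appended to a predefined prompt embedding
        # (stands in for the paper's "predefined and learnable prompts").
        self.learnable_prompts = nn.Parameter(
            torch.randn(n_learnable_prompts, embed_dim) * 0.02)
        # Cross-attention: language queries attend over image features.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim, num_heads=8, batch_first=True)
        # CLIP-style learnable temperature for the similarity logits.
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, image_feats: torch.Tensor,
                text_embed: torch.Tensor) -> torch.Tensor:
        """
        image_feats: (B, H*W, D) per-pixel embeddings from the visual backbone.
        text_embed:  (B, 1, D) embedding of a predefined prompt (e.g. "text").
        Returns a (B, H*W) text-region prior map in [0, 1].
        """
        B = image_feats.size(0)
        prompts = self.learnable_prompts.unsqueeze(0).expand(B, -1, -1)
        queries = torch.cat([text_embed, prompts], dim=1)   # (B, 1+P, D)
        # Condition the language queries on the image via cross-attention.
        attended, _ = self.cross_attn(queries, image_feats, image_feats)
        text_query = attended.mean(dim=1, keepdim=True)     # (B, 1, D)
        # Cosine similarity between each pixel embedding and the text query.
        img = F.normalize(image_feats, dim=-1)
        txt = F.normalize(text_query, dim=-1)
        sim = self.logit_scale.exp() * (img @ txt.transpose(1, 2)).squeeze(-1)
        return sim.sigmoid()                                # (B, H*W)

if __name__ == "__main__":
    model = InstanceLanguageMatching()
    feats = torch.randn(2, 28 * 28, 512)  # fake CLIP-RN50-style feature grid
    text = torch.randn(2, 1, 512)         # fake prompt embedding
    prior = model(feats, text)
    print(prior.shape)                    # torch.Size([2, 784])

In a full pipeline, the resulting prior map would be fed to the downstream detector or spotter head to refine candidate text regions; the offline-computation benefit attributed to the BSM module would come from precomputing the text-side embeddings once rather than per image.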

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), 9, 20 Aug., pages 6040-6054
Main Author: Yu, Wenwen (Author)
Other Authors: Liu, Yuliang, Zhu, Xingkui, Cao, Haoyu, Sun, Xing, Bai, Xiang
Format: Online Article
Language: English
Published: 2024
Access to parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000caa a22002652 4500
001 NLM369969863
003 DE-627
005 20240807232513.0
007 cr uuu---uuuuu
008 240322s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2024.3379828  |2 doi 
028 5 2 |a pubmed24n1494.xml 
035 |a (DE-627)NLM369969863 
035 |a (NLM)38507385 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Yu, Wenwen  |e verfasserin  |4 aut 
245 1 0 |a Turning a CLIP Model Into a Scene Text Spotter 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 07.08.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.6% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average improvement of 0.2% and 0.55% in text detection and spotting tasks, along with a 47.1% increase in inference speed. 3) It showcases robust few-shot training capabilities. Utilizing only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 4.7% for text detection and spotting tasks, respectively. 4) It consistently enhances performance on out-of-distribution text detection and spotting datasets, particularly the NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented object detection 
650 4 |a Journal Article 
700 1 |a Liu, Yuliang  |e verfasserin  |4 aut 
700 1 |a Zhu, Xingkui  |e verfasserin  |4 aut 
700 1 |a Cao, Haoyu  |e verfasserin  |4 aut 
700 1 |a Sun, Xing  |e verfasserin  |4 aut 
700 1 |a Bai, Xiang  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 46(2024), 9 vom: 20. Aug., Seite 6040-6054  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:46  |g year:2024  |g number:9  |g day:20  |g month:08  |g pages:6040-6054 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2024.3379828  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 46  |j 2024  |e 9  |b 20  |c 08  |h 6040-6054