Turning a CLIP Model Into a Scene Text Spotter

We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and tex...

Bibliographic Details
Published in:	IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), issue 9, 20 Sept., pages 6040-6054
Main author:	Yu, Wenwen (Author)
Other authors:	Liu, Yuliang, Zhu, Xingkui, Cao, Haoyu, Sun, Xing, Bai, Xiang
Format:	Online article
Language:	English
Published:	2024
Access to the parent work:	IEEE transactions on pattern analysis and machine intelligence
Subjects:	Journal Article

Available online	Full text