Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

Despite the significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong...


Bibliographic details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), 7, 1 June, pages 4747-4762
First author: Wu, Zuxuan (Author)
Other authors: Weng, Zejia, Peng, Wujian, Yang, Xitong, Li, Ang, Davis, Larry S, Jiang, Yu-Gang
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article