Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture t...


Bibliographic Details

Published in: IEEE Transactions on Image Processing, vol. 33 (2024), pp. 6253-6267
Main author: Xu, Yifan (Author)
Other authors: Zhang, Mengdan; Yang, Xiaoshan; Xu, Changsheng
Format: Online article
Language: English
Published: 2024
Collection: IEEE Transactions on Image Processing : A Publication of the IEEE Signal Processing Society
Subjects: Journal Article