Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture t...


Bibliographic Details

Published in: IEEE Transactions on Image Processing, vol. 33 (2024), pp. 6253-6267
Main author: Xu, Yifan (Author)
Other authors: Zhang, Mengdan; Yang, Xiaoshan; Xu, Changsheng
Format: Online article
Language: English
Published: 2024
Collection: IEEE Transactions on Image Processing : A Publication of the IEEE Signal Processing Society
Subjects: Journal Article