Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture t...
Publié dans: | IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2024) vom: 29., Seite 6253-6267 |
---|---|
Auteur principal: | |
Autres auteurs: | , , |
Format: | Article en ligne |
Langue: | English |
Publié: |
2024
|
Accès à la collection: | IEEE transactions on image processing : a publication of the IEEE Signal Processing Society |
Sujets: | Journal Article |
Accès en ligne |
Volltext |