Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching
Published in: | IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2024) dated: 19., pages 1326-1337 |
---|---|
Main author: | |
Other authors: | , , , |
Format: | Online article |
Language: | English |
Published: | 2024 |
Access to the parent work: | IEEE transactions on image processing : a publication of the IEEE Signal Processing Society |
Subjects: | Journal Article |
Summary: | The mainstream of image and sentence matching studies currently focuses on fine-grained alignment of image regions and sentence words. However, these methods miss a crucial fact: the correspondence between images and sentences does not simply come from alignments between individual regions and words but from alignments between the phrases they form respectively. In this work, we propose a novel Decoupled Cross-modal Phrase-Attention network (DCPA) for image-sentence matching by modeling the relationships between textual phrases and visual phrases. Furthermore, we design a novel decoupled manner for training and inference, which is able to release the trade-off for bi-directional retrieval, where image-to-sentence matching is executed in textual semantic space and sentence-to-image matching is executed in visual semantic space. Extensive experimental results on Flickr30K and MS-COCO demonstrate that the proposed method outperforms state-of-the-art methods by a large margin, and can compete with some methods introducing external knowledge. |
---|---|
Description: | Date Revised: 14.02.2024; published: Print-Electronic; Citation Status: PubMed-not-MEDLINE |
ISSN: | 1941-0042 |
DOI: | 10.1109/TIP.2022.3197972 |