Multi-Perspective Cross-Modal Object Encoding for Referring Expression Comprehension
| Published in: | IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. - 1992. - PP(2025), 16 Oct. |
|---|---|
| Format: | Online Article |
| Language: | English |
| Published: | 2025 |
| In collection: | IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society |
| Subjects: | Journal Article |
| Abstract: | Referring expression comprehension (REC) is a crucial task in understanding how a given text description identifies a target object within an image. Existing two-stage REC methods have demonstrated strong performance due to their rational framework design. However, when encoding object candidates in an image, most two-stage methods rely exclusively on features extracted from pre-trained detectors, often neglecting the contextual relationships between an object and its neighboring elements. This limitation hinders the full capture of contextual and relational information, reducing the discriminative power of object representations and negatively impacting subsequent processing. In this paper, we propose two novel plug-and-adapt modules designed to enhance two-stage REC methods: an expression-guided label representation (ELR) module and a cross-modal calibrated semantic (CCS) module. Specifically, the ELR module connects the noun phrases of the expression to the categorical labels of object candidates in the image, ensuring effective alignment between them. Guided by these connections, the CCS module represents each object candidate by integrating its features with those of neighboring candidates from multiple perspectives. This preserves the intrinsic information of each candidate while incorporating relational cues from other objects, enabling more precise embeddings and effective downstream processing in two-stage REC methods. Extensive experiments on six datasets demonstrate the importance of incorporating prior statistical knowledge, and detailed analysis shows that the proposed modules strengthen the alignment between image and text. As a result, our method achieves competitive performance and is compatible with most two-stage methods in the REC task. The code is available on GitHub: https://github.com/freedom6927/ELR_CCS.git |
| Description: | Date Revised: 16.10.2025; Published: Print-Electronic; Citation Status: Publisher |
| ISSN: | 1941-0042 |
| DOI: | 10.1109/TIP.2025.3620129 |
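
To make the abstract's two ideas concrete, the following is a minimal, self-contained sketch, not the authors' implementation (their code is at the GitHub link above). It assumes pre-computed embeddings and reduces CCS to a single perspective; the function names `phrase_label_alignment` and `context_calibrated_features`, the tensor shapes, and the softmax-weighted aggregation are all illustrative assumptions.

```python
# Hypothetical sketch of the ELR/CCS ideas described in the abstract.
# Everything here (names, shapes, the aggregation scheme) is an assumption
# for illustration; see the repository above for the actual method.
import torch
import torch.nn.functional as F


def phrase_label_alignment(phrase_emb: torch.Tensor, label_emb: torch.Tensor) -> torch.Tensor:
    """ELR-style alignment: match expression noun phrases to candidate labels.

    phrase_emb: (P, d) embeddings of the P noun phrases in the expression
    label_emb:  (N, d) embeddings of the N candidates' categorical labels
    returns:    (N,)   per-candidate relevance (best-matching phrase score)
    """
    sim = F.cosine_similarity(
        phrase_emb.unsqueeze(1), label_emb.unsqueeze(0), dim=-1
    )  # (P, N) phrase-to-label similarity matrix
    return sim.max(dim=0).values


def context_calibrated_features(
    obj_feats: torch.Tensor, relevance: torch.Tensor, temperature: float = 0.1
) -> torch.Tensor:
    """CCS-style calibration (single-perspective simplification).

    obj_feats: (N, d) detector features for the N object candidates
    relevance: (N,)   alignment scores from phrase_label_alignment
    returns:   (N, d) each candidate's own features plus a relational term
    """
    attn = torch.softmax(relevance / temperature, dim=0)  # (N,) neighbor weights
    context = attn.unsqueeze(0) @ obj_feats               # (1, d) shared context
    # Keep the candidate's intrinsic information, add the relational cue.
    return obj_feats + context


if __name__ == "__main__":
    # Toy usage: 3 noun phrases, 5 candidates, embedding dim 256.
    phrases = torch.randn(3, 256)
    labels = torch.randn(5, 256)
    feats = torch.randn(5, 256)
    scores = phrase_label_alignment(phrases, labels)
    calibrated = context_calibrated_features(feats, scores)
    print(calibrated.shape)  # torch.Size([5, 256])
```

The paper's CCS module integrates neighbor features from multiple perspectives under the guidance of the ELR connections; this sketch collapses that to one shared context vector purely to show the stated pattern of preserving each candidate's intrinsic features while adding relational cues from other objects.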