Multi-Perspective Cross-Modal Object Encoding for Referring Expression Comprehension

Detailed Description

Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - PP(2025), 16 Oct.
First Author: Ke, Jingcheng (Author)
Other Authors: Wen, Jie; Wang, Huiting; Cheng, Wen-Huang; Wang, Jia
Format: Online Article
Language: English
Published: 2025
Part of: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
Description
Abstract: Referring expression comprehension (REC) is a crucial task in understanding how a given text description identifies a target object within an image. Existing two-stage REC methods have demonstrated strong performance due to their rational framework design. However, when encoding object candidates in an image, most two-stage methods rely exclusively on features extracted from pre-trained detectors, often neglecting the contextual relationships between an object and its neighboring elements. This limitation hinders the full capture of contextual and relational information, reducing the discriminative power of object representations and negatively impacting subsequent processing. In this paper, we propose two novel plug-and-adapt modules designed to enhance two-stage REC methods: an expression-guided label representation (ELR) module and a cross-modal calibrated semantic (CCS) module. Specifically, the ELR module connects the noun phrases of the expression to the categorical labels of the object candidates in the image, ensuring effective alignment between them. Guided by these connections, the CCS module represents each object candidate by integrating its features with those of neighboring candidates from multiple perspectives. This preserves the intrinsic information of each candidate while incorporating relational cues from other objects, enabling more precise embeddings and effective downstream processing in two-stage REC methods. Extensive experiments on six datasets demonstrate the importance of incorporating prior statistical knowledge, and detailed analysis shows that the proposed modules strengthen the alignment between image and text. As a result, our method achieves competitive performance and is compatible with most two-stage methods in the REC task. The code is available on GitHub: https://github.com/freedom6927/ELR_CCS.git
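
The abstract's description of ELR and CCS can be made concrete with a small sketch. The PyTorch-style code below is an illustrative assumption based only on the abstract, not the authors' released implementation (see the GitHub link above); all class and variable names are hypothetical, and the "multiple perspectives" of CCS are reduced to a single attention pass for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ELRSketch(nn.Module):
    # Expression-guided label representation (sketch): align noun-phrase
    # embeddings with the categorical-label embeddings of object candidates.
    def __init__(self, dim):
        super().__init__()
        self.phrase_proj = nn.Linear(dim, dim)
        self.label_proj = nn.Linear(dim, dim)

    def forward(self, phrases, labels):
        # phrases: (P, D) noun-phrase embeddings from the expression
        # labels:  (N, D) label embeddings of the N object candidates
        p = F.normalize(self.phrase_proj(phrases), dim=-1)
        l = F.normalize(self.label_proj(labels), dim=-1)
        return p @ l.t()  # (P, N) phrase-to-label affinity matrix

class CCSSketch(nn.Module):
    # Cross-modal calibrated semantics (sketch): re-encode each candidate
    # with cues from neighboring candidates, weighted by the ELR affinities.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, affinity):
        # feats:    (1, N, D) visual features of the candidates
        # affinity: (P, N) output of ELRSketch
        relevance = affinity.max(dim=0).values        # (N,) per-candidate weight
        context, _ = self.attn(feats, feats, feats)   # neighbor aggregation
        # Residual sum keeps each candidate's intrinsic features intact,
        # matching the abstract's "preserves the intrinsic information".
        return self.norm(feats + relevance.view(1, -1, 1) * context)

In this reading, the affinity matrix decides how strongly expression-guided context from neighboring candidates is mixed into each object embedding before the usual two-stage ranking; the actual modules presumably compute several such perspectives and calibrate them jointly.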
Description: Date Revised 16.10.2025
Published: Print-Electronic
Citation Status: Publisher
ISSN: 1941-0042
DOI: 10.1109/TIP.2025.3620129