Multi-Perspective Cross-Modal Object Encoding for Referring Expression Comprehension

Bibliographic details
Published in: IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society. - 1992. - PP(2025), 16 Oct.
Main author: Ke, Jingcheng (Author)
Other authors: Wen, Jie; Wang, Huiting; Cheng, Wen-Huang; Wang, Jia
Format: Online article
Language: English
Published: 2025
Part of collection: IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
Description
Abstract: Referring expression comprehension (REC) is a crucial task in understanding how a given text description identifies a target object within an image. Existing two-stage REC methods have demonstrated strong performance due to their rational framework design. However, when encoding object candidates in an image, most two-stage methods rely exclusively on features extracted from pre-trained detectors, often neglecting the contextual relationships between an object and its neighboring elements. This limitation hinders the full capture of contextual and relational information, reducing the discriminative power of object representations and negatively impacting subsequent processing. In this paper, we propose two novel plug-and-adapt modules designed to enhance two-stage REC methods: an expression-guided label representation (ELR) module and a cross-modal calibrated semantic (CCS) module. Specifically, the ELR module connects the noun phrases of the expression to the categorical labels of object candidates in the image, ensuring effective alignment between them. Guided by these connections, the CCS module represents each object candidate by integrating its features with those of neighboring candidates from multiple perspectives. This preserves the intrinsic information of each candidate while incorporating relational cues from other objects, enabling more precise embeddings and effective downstream processing in two-stage REC methods. Extensive experiments on six datasets demonstrate the importance of incorporating prior statistical knowledge, and detailed analysis shows that the proposed modules strengthen the alignment between image and text. As a result, our method achieves competitive performance and is compatible with most two-stage methods in the REC task. The code is available on GitHub: https://github.com/freedom6927/ELR_CCS.git
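
The abstract only outlines the two modules, but the flow it describes (score each candidate's category label against the expression's noun phrases, then refine each candidate's feature with neighbor context under that guidance) can be sketched in a few lines of PyTorch. The sketch below is a hypothetical illustration of that idea, not the authors' implementation (which is at the GitHub link above); all module and variable names, the feature dimensions, and the single-head attention are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelAlignment(nn.Module):
    # Sketch of the ELR idea: score each candidate's category-label
    # embedding against the expression's pooled noun-phrase embedding.
    def __init__(self, dim: int):
        super().__init__()
        self.proj_phrase = nn.Linear(dim, dim)
        self.proj_label = nn.Linear(dim, dim)

    def forward(self, phrase_emb: torch.Tensor, label_embs: torch.Tensor) -> torch.Tensor:
        # phrase_emb: (B, D) noun-phrase embedding of the expression
        # label_embs: (B, N, D), one embedding per candidate's category label
        q = F.normalize(self.proj_phrase(phrase_emb), dim=-1)
        k = F.normalize(self.proj_label(label_embs), dim=-1)
        return torch.einsum('bd,bnd->bn', q, k)  # (B, N) cosine alignment scores

class NeighborContextEncoder(nn.Module):
    # Sketch of the CCS idea: refine each candidate's feature with context
    # from neighboring candidates, biased by the alignment scores, with a
    # residual path that preserves the candidate's intrinsic feature.
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor, align_scores: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, N, D) detector features; align_scores: (B, N)
        d = obj_feats.size(-1)
        attn = torch.einsum('bnd,bmd->bnm', obj_feats, obj_feats) / d ** 0.5
        attn = attn + align_scores.unsqueeze(1)   # favor label-aligned neighbors
        weights = attn.softmax(dim=-1)            # (B, N, N) attention weights
        context = torch.einsum('bnm,bmd->bnd', weights, obj_feats)
        return self.norm(obj_feats + context)     # residual keeps intrinsic info

if __name__ == "__main__":
    B, N, D = 2, 5, 256  # batch size, candidates per image, feature dim (made up)
    phrase = torch.randn(B, D)
    labels = torch.randn(B, N, D)
    feats = torch.randn(B, N, D)
    scores = LabelAlignment(D)(phrase, labels)
    refined = NeighborContextEncoder(D)(feats, scores)
    print(scores.shape, refined.shape)  # torch.Size([2, 5]) torch.Size([2, 5, 256])

The residual connection in the second module reflects the abstract's claim that each candidate's intrinsic information is preserved while relational cues from other objects are incorporated.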
Description: Date Revised 16.10.2025
Published: Print-Electronic
Citation Status: Publisher
ISSN: 1941-0042
DOI: 10.1109/TIP.2025.3620129