Prompt-Based Modality Alignment for Effective Multi-Modal Object Re-Identification

Bibliographic details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 34(2025) dated: 05., pages 2450-2462
Main author:	Zhang, Shizhou (Author)
Other authors:	Luo, Wenlong, Cheng, De, Xing, Yinghui, Liang, Guoqiang, Wang, Peng, Zhang, Yanning
Format:	Online article
Language:	English
Published:	2025
Collection access:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects:	Journal Article

Description
Abstract:	A critical challenge for multi-modal Object Re-Identification (ReID) is the effective aggregation of complementary information to mitigate illumination issues. State-of-the-art methods typically employ complex and highly coupled architectures, which unavoidably result in heavy computational costs. Moreover, the significant distribution gap among different image spectra hinders the joint representation of multi-modal features. In this paper, we propose a framework named PromptMA to establish effective communication channels between different modality paths, thereby aggregating modal complementary information and bridging the distribution gap. Specifically, we inject a series of learnable multi-modal prompts into the Image Encoder and introduce a prompt exchange mechanism that lets the prompts alternately interact with the token embeddings of different modalities, thus capturing and distributing multi-modal features effectively. Building on the multi-modal prompts, we further propose Prompt-based Token Selection (PBTS) and Prompt-based Modality Fusion (PBMF) modules to achieve effective multi-modal feature fusion while minimizing background interference. Additionally, owing to the flexibility of our prompt exchange mechanism, our method is well suited to scenarios with missing modalities. Extensive evaluations are conducted on four widely used benchmark datasets, and the experimental results demonstrate that our method achieves state-of-the-art performance, surpassing the current benchmarks by over 15% on the challenging MSVR310 dataset and by 6% on RGBNT201. The code is available at https://github.com/FHR-L/PromptMA
Description:	Date Revised 05.05.2025; published: Print-Electronic; Citation Status: PubMed-not-MEDLINE
ISSN:	1941-0042
DOI:	10.1109/TIP.2025.3556531
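
The abstract describes the prompt exchange mechanism only at a high level: learnable prompts are injected into the image encoder and alternately interact with the token embeddings of different modalities. The block below is a minimal sketch of that idea under stated assumptions, not the authors' implementation (for that, see the repository linked in the abstract). The function names `self_attention` and `prompt_exchange_encoder`, the modality keys `rgb`, `nir`, and `tir`, the single-head attention layer, the weights shared across modalities, and the rotate-by-one exchange order are all hypothetical choices made for illustration.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a residual, over a (seq_len, dim) array.

    Assumption: a stand-in for one encoder layer, kept deliberately simple.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v


def prompt_exchange_encoder(tokens, prompts, layers):
    """Hypothetical prompt exchange across modality paths.

    tokens:  dict modality -> (num_tokens, dim) array
    prompts: dict modality -> (num_prompts, dim) array
    layers:  list of (w_q, w_k, w_v) tuples, one per layer
    """
    names = list(tokens)
    tokens, prompts = dict(tokens), dict(prompts)  # leave caller's dicts intact
    for w_q, w_k, w_v in layers:
        for name in names:
            n_prompts = prompts[name].shape[0]
            # Prompts and tokens of one modality attend to each other.
            x = np.concatenate([prompts[name], tokens[name]], axis=0)
            x = self_attention(x, w_q, w_k, w_v)
            prompts[name], tokens[name] = x[:n_prompts], x[n_prompts:]
        # Exchange (assumed rotation): each path receives the prompts that
        # just interacted with the previous modality in the list.
        rotated = [prompts[n] for n in names[-1:] + names[:-1]]
        prompts = dict(zip(names, rotated))
    return tokens, prompts


# Usage with random data: three modality paths, two layers.
rng = np.random.default_rng(0)
dim = 8
tokens = {m: rng.normal(size=(5, dim)) for m in ("rgb", "nir", "tir")}
prompts = {m: rng.normal(size=(2, dim)) for m in tokens}
layers = [
    tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
    for _ in range(2)
]
out_tokens, out_prompts = prompt_exchange_encoder(tokens, prompts, layers)
```

Because the prompts move between paths after every layer, information gathered from one modality is carried into the next layer of another. In this sketch a missing modality could simply be left out of the `tokens` and `prompts` dictionaries, which loosely mirrors the flexibility the abstract mentions; how PromptMA actually handles that case is not specified in this record.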