Entity-Graph Enhanced Cross-Modal Pretraining for Instance-Level Product Retrieval

Detailed Description

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 11 (30 Nov.), pages 13117-13133
First Author: Dong, Xiao (Author)
Other Authors: Zhan, Xunlin, Wei, Yunchao, Wei, Xiaoyong, Wang, Yaowei, Lu, Minlong, Cao, Xiaochun, Liang, Xiaodan
Format: Online Article
Language: English
Published: 2023
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Keywords: Journal Article
Description
Summary: Our goal in this research is to study a more realistic environment in which we can conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories. We first contribute the Product1M dataset and define two practical instance-level retrieval tasks that enable evaluations on price comparison and personalized recommendations. For both instance-level tasks, accurately identifying the intended product target mentioned in visual-linguistic data and mitigating the impact of irrelevant content are quite challenging. To address this, we devise a more effective cross-modal pretraining model capable of adaptively incorporating key concept information from multi-modal data. This is accomplished by utilizing an entity graph, in which nodes represent entities and edges denote the similarity relations between them. Specifically, a novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model is proposed for instance-level commodity retrieval, which explicitly injects entity knowledge in both node-based and subgraph-based ways into the multi-modal networks via a self-supervised hybrid-stream transformer. This reduces the confusion between different object contents, thereby effectively guiding the network to focus on entities with real semantics. Experimental results verify the efficacy and generalizability of our EGE-CMP, which outperforms several SOTA cross-modal baselines such as CLIP (Radford et al. 2021), UNITER (Chen et al. 2020), and CAPTURE (Zhan et al. 2021).
Description: Date Revised 03.10.2023
Published: Print-Electronic
Citation Status PubMed-not-MEDLINE
ISSN:1939-3539
DOI:10.1109/TPAMI.2023.3291237