VGSG : Vision-Guided Semantic-Group Network for Text-Based Person Search

Text-based Person Search (TBPS) aims to retrieve images of target pedestrian indicated by textual descriptions. It is essential for TBPS to extract fine-grained local features and align them crossing modality. Existing methods utilize external tools or heavy cross-modal interaction to achieve explic...

Ausführliche Beschreibung

Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2023) vom: 05., Seite 163-176
1. Verfasser: He, Shuting (VerfasserIn)
Weitere Verfasser: Luo, Hao, Jiang, Wei, Jiang, Xudong, Ding, Henghui
Format: Online-Aufsatz
Sprache:English
Veröffentlicht: 2024
Zugriff auf das übergeordnete Werk:IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Schlagworte:Journal Article
LEADER 01000caa a22002652 4500
001 NLM365425583
003 DE-627
005 20231227132216.0
007 cr uuu---uuuuu
008 231226s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2023.3337653  |2 doi 
028 5 2 |a pubmed24n1226.xml 
035 |a (DE-627)NLM365425583 
035 |a (NLM)38051615 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a He, Shuting  |e verfasserin  |4 aut 
245 1 0 |a VGSG  |b Vision-Guided Semantic-Group Network for Text-Based Person Search 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a ƒaComputermedien  |b c  |2 rdamedia 
338 |a ƒa Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 11.12.2023 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Text-based Person Search (TBPS) aims to retrieve images of target pedestrian indicated by textual descriptions. It is essential for TBPS to extract fine-grained local features and align them crossing modality. Existing methods utilize external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search to extract well-aligned fine-grained visual and textual features. In the proposed VGSG, we develop a Semantic-Group Textual Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to extract textual local features under the guidance of visual local clues. In SGTL, in order to obtain the local textual representation, we group textual features from the channel dimension based on the semantic cues of language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, a vision-guided attention is employed to extract visual-related textual features, which are inherently aligned with visual cues and termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, including a vision-language similarity transfer and a class probability transfer, to adaptively propagate information of the vision-guided textual features to semantic-group textual features. With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with corresponding visual features without external tools and complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate its superiority over state-of-the-art methods 
650 4 |a Journal Article 
700 1 |a Luo, Hao  |e verfasserin  |4 aut 
700 1 |a Jiang, Wei  |e verfasserin  |4 aut 
700 1 |a Jiang, Xudong  |e verfasserin  |4 aut 
700 1 |a Ding, Henghui  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 33(2023) vom: 05., Seite 163-176  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:33  |g year:2023  |g day:05  |g pages:163-176 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2023.3337653  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 33  |j 2023  |b 05  |h 163-176