Context Disentangling and Prototype Inheriting for Robust Visual Grounding

Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for the targets that have the same category as others. However, most previous me...

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), no. 5 (25 Apr.), pages 3213-3229
Main author: Tang, Wei (Author)
Other authors: Li, Liang, Liu, Xuejing, Jin, Lu, Tang, Jinhui, Li, Zechao
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000caa a22002652 4500
001 NLM365425648
003 DE-627
005 20240404234442.0
007 cr uuu---uuuuu
008 231226s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2023.3339628  |2 doi 
028 5 2 |a pubmed24n1364.xml 
035 |a (DE-627)NLM365425648 
035 |a (NLM)38051621 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Tang, Wei  |e verfasserin  |4 aut 
245 1 0 |a Context Disentangling and Prototype Inheriting for Robust Visual Grounding 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 03.04.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for targets that share a category with other objects. However, most previous methods underestimate such information. Moreover, they are usually designed for the standard scene (without any novel object), which limits their generalization to the open-vocabulary scene. In this paper, we propose a novel framework with context disentangling and prototype inheriting for robust visual grounding to handle both scenes. Specifically, the context disentangling disentangles the referent and context features, which achieves better discrimination between them. The prototype inheriting inherits the prototypes discovered from the disentangled visual features by a prototype bank to fully utilize the seen data, especially for the open-vocabulary scene. The fused features, obtained by applying the Hadamard product to the disentangled linguistic features and the visual features of prototypes to avoid sharply adjusting the importance between the two types of features, are then attached with a special token and fed to a vision Transformer encoder for bounding box regression. Extensive experiments are conducted on both standard and open-vocabulary scenes. The performance comparisons indicate that our method outperforms the state-of-the-art methods in both scenarios. 
650 4 |a Journal Article 
700 1 |a Li, Liang  |e verfasserin  |4 aut 
700 1 |a Liu, Xuejing  |e verfasserin  |4 aut 
700 1 |a Jin, Lu  |e verfasserin  |4 aut 
700 1 |a Tang, Jinhui  |e verfasserin  |4 aut 
700 1 |a Li, Zechao  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 46(2024), 5 vom: 25. Apr., Seite 3213-3229  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:46  |g year:2024  |g number:5  |g day:25  |g month:04  |g pages:3213-3229 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2023.3339628  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 46  |j 2024  |e 5  |b 25  |c 04  |h 3213-3229