Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation because self-attention has quadratic time complexity. To address this issue,...
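
To make the quadratic-cost remark concrete, here is a rough sketch that is not taken from the paper. It counts the query-key scores one self-attention layer computes over a joint image-text sequence. The function name and the token counts (400 visual, 20 text) are illustrative assumptions only.

```python
def self_attention_pairs(num_tokens: int) -> int:
    """Number of query-key scores in one self-attention layer (single head)."""
    return num_tokens * num_tokens


visual_tokens = 400  # assumed, e.g. a 20 x 20 feature map
text_tokens = 20     # assumed

joint = self_attention_pairs(visual_tokens + text_tokens)
doubled = self_attention_pairs(2 * (visual_tokens + text_tokens))

# Doubling the sequence length quadruples the number of scores.
print(joint, doubled, doubled // joint)
```

Under these assumed sizes, doubling the joint sequence length multiplies the score count by four, which is the growth the abstract refers to.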

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), 2, 27 Jan., pages 1181-1198
Main author: Shi, Fengyuan (Author)
Other authors: Gao, Ruopeng; Huang, Weilin; Wang, Limin
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000caa a22002652 4500
001 NLM363818839
003 DE-627
005 20240114233007.0
007 cr uuu---uuuuu
008 231226s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2023.3328185  |2 doi 
028 5 2 |a pubmed24n1253.xml 
035 |a (DE-627)NLM363818839 
035 |a (NLM)37889818 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Shi, Fengyuan  |e verfasserin  |4 aut 
245 1 0 |a Dynamic MDETR  |b A Dynamic Multimodal Transformer Decoder for Visual Grounding 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 08.01.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation because self-attention has quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, coined Dynamic Multimodal detection transformer (DETR) (Dynamic MDETR), by decoupling the whole grounding process into encoding and decoding phases. The key observation is that images exhibit high spatial redundancy. Thus, we devise a new dynamic multimodal transformer decoder by exploiting this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module selects informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we can reduce the GFLOPs of the multimodal transformer by ∼44% while still achieving higher accuracy than the encoder-only counterpart. With the same number of encoder layers as TransVG, our Dynamic MDETR (ResNet-50) outperforms TransVG (ResNet-101) while adding only marginal computational cost relative to TransVG. In addition, to verify its generalization ability and scale up our Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework and achieve state-of-the-art performance on these benchmarks. 
650 4 |a Journal Article 
700 1 |a Gao, Ruopeng  |e verfasserin  |4 aut 
700 1 |a Huang, Weilin  |e verfasserin  |4 aut 
700 1 |a Wang, Limin  |e verfasserin  |4 aut 
773 0 8 |i Contained in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 46(2024), 2, 27 Jan., pages 1181-1198  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:46  |g year:2024  |g number:2  |g day:27  |g month:01  |g pages:1181-1198 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2023.3328185  |3 Full text 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 46  |j 2024  |e 2  |b 27  |c 01  |h 1181-1198
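
The decoder described in the abstract (field 520) alternates between sampling a few feature points around a reference point and cross-attending to them. The NumPy sketch below illustrates that loop under loudly stated assumptions and is not the authors' implementation: the weights are random rather than learned, the text is reduced to a single query vector, and the rule for refining the reference point (an attention-weighted mean of the sampled locations) is a stand-in chosen for illustration. All names (`bilinear_sample`, `decoder_step`, `W_off`) are hypothetical. The sizes are chosen so that 36 sampled points out of a 20 x 20 map match the 9% figure quoted in the abstract.

```python
import numpy as np


def bilinear_sample(feat, pts):
    """Sample feat (H, W, C) at pts (N, 2), given as (x, y) in [0, 1]."""
    H, W, _ = feat.shape
    x = np.clip(pts[:, 0], 0.0, 1.0) * (W - 1)
    y = np.clip(pts[:, 1], 0.0, 1.0) * (H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy  # (N, C)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def decoder_step(feat, query, ref, W_off, num_points):
    """One sampling + cross-attention step (illustrative assumptions only)."""
    # Sampling: predict bounded offsets around the reference point.
    offsets = 0.5 * np.tanh(query @ W_off).reshape(num_points, 2)
    pts = np.clip(ref[None, :] + offsets, 0.0, 1.0)
    sampled = bilinear_sample(feat, pts)
    # Decoding: the query attends over the sampled points only.
    attn = softmax(sampled @ query / np.sqrt(query.size))
    grounded = attn @ sampled
    # Assumed refinement rule: attention-weighted mean of sampled locations.
    new_ref = attn @ pts
    return grounded, new_ref, pts


rng = np.random.default_rng(0)
H = W = 20
C, N = 16, 36                       # 36 of 400 points = 9%
feat = rng.standard_normal((H, W, C))
query = rng.standard_normal(C)      # stand-in for a text-informed query
W_off = 0.1 * rng.standard_normal((C, 2 * N))
ref = np.array([0.5, 0.5])          # start at the image centre

for _ in range(3):                  # stacked sampling/decoding stages
    grounded, ref, pts = decoder_step(feat, query, ref, W_off, N)
    query = query + grounded        # residual update of the query

print(pts.shape, grounded.shape, N / (H * W))
```

The point of the sketch is the cost profile: each stage touches only `N` sampled features instead of all `H * W` positions, which is how a sparse decoder can reduce computation relative to full self-attention over the joint sequence.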