Local-Global Context Aware Transformer for Language-Guided Video Segmentation

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components: one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant-size memory, whereas Transformer-style self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Further, we won first place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation of the winning solution.
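The abstract sketches the core mechanism: a finite two-part memory (persistent global slots plus a bounded local FIFO) that lets the model turn the language expression into an adaptive per-frame query, keeping cost linear in video length. Below is a minimal PyTorch sketch of that idea; all module names, sizes, and update rules (MemoryQuerySketch, the EMA-style write, mean-pooled local entries) are illustrative assumptions, not the authors' actual Locater implementation.

# Minimal sketch (assumption, not the paper's code) of memory-augmented
# language-guided querying as described in the abstract: a fixed-size global
# memory, a bounded FIFO of local context, an adaptive query per frame, and
# per-pixel similarity to that query as mask logits.
import torch
import torch.nn as nn

class MemoryQuerySketch(nn.Module):
    def __init__(self, dim=256, n_global=8, n_local=4, heads=8):
        super().__init__()
        self.n_local = n_local
        # Persistent global memory slots (initialised as learned parameters).
        self.global_init = nn.Parameter(torch.randn(n_global, dim))
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_query = nn.Linear(dim, dim)

    def forward(self, frames, text):
        """frames: (T, C, H, W) per-frame feature maps; text: (L, C) expression tokens."""
        T, C, H, W = frames.shape
        g = self.global_init                       # fixed-size global memory
        local = []                                 # bounded FIFO of local context
        masks = []
        for t in range(T):
            feat = frames[t].flatten(1).t()        # (H*W, C) pixel tokens of frame t
            # Write: global slots slowly absorb content from the current frame.
            upd, _ = self.write(g.unsqueeze(0), feat.unsqueeze(0), feat.unsqueeze(0))
            g = 0.9 * g + 0.1 * upd.squeeze(0)
            # Read: comprehend the expression against local-global context + frame.
            ctx = torch.cat([g, *local, feat], dim=0).unsqueeze(0)
            attn, _ = self.read(text.unsqueeze(0), ctx, ctx)
            q = self.to_query(attn.mean(dim=1)).squeeze(0)   # adaptive query (C,)
            # Query the frame: per-pixel similarity to q gives the mask logits.
            masks.append((feat @ q).view(H, W))
            # FIFO update: constant memory size keeps the whole pass linear in T.
            local.append(feat.mean(dim=0, keepdim=True))
            local = local[-self.n_local:]
        return torch.stack(masks)                  # (T, H, W) mask logits

# Toy run: 5 frames of 256-dim 16x16 features, a 7-token expression.
net = MemoryQuerySketch()
print(net(torch.randn(5, 256, 16, 16), torch.randn(7, 256)).shape)  # torch.Size([5, 16, 16])

Because the memory size is fixed and only the current frame's tokens enter each attention call, the per-frame cost is constant, which is how the linear-time claim in the abstract can hold in such a design.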

Detailed Description

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 8, dated 11 Aug., pages 10055-10069
Main Author: Liang, Chen (Author)
Other Authors: Wang, Wenguan, Zhou, Tianfei, Miao, Jiaxu, Luo, Yawei, Yang, Yi
Format: Online Article
Language: English
Published: 2023
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000naa a22002652 4500
001 NLM363129170
003 DE-627
005 20231226092707.0
007 cr uuu---uuuuu
008 231226s2023 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2023.3262578  |2 doi 
028 5 2 |a pubmed24n1210.xml 
035 |a (DE-627)NLM363129170 
035 |a (NLM)37819831 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Liang, Chen  |e verfasserin  |4 aut 
245 1 0 |a Local-Global Context Aware Transformer for Language-Guided Video Segmentation 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 31.10.2023 
500 |a published: Print 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components: one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant-size memory, whereas Transformer-style self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Further, we won first place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation of the winning solution
650 4 |a Journal Article 
700 1 |a Wang, Wenguan  |e verfasserin  |4 aut 
700 1 |a Zhou, Tianfei  |e verfasserin  |4 aut 
700 1 |a Miao, Jiaxu  |e verfasserin  |4 aut 
700 1 |a Luo, Yawei  |e verfasserin  |4 aut 
700 1 |a Yang, Yi  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 45(2023), 8 vom: 11. Aug., Seite 10055-10069  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:45  |g year:2023  |g number:8  |g day:11  |g month:08  |g pages:10055-10069 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2023.3262578  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 45  |j 2023  |e 8  |b 11  |c 08  |h 10055-10069