Learning Local and Global Temporal Contexts for Video Semantic Segmentation

Contextual information plays a core role in video semantic segmentation (VSS). This paper summarizes contexts for VSS into two types: local temporal contexts (LTC), which define the contexts from neighboring frames, and global temporal contexts (GTC), which represent the contexts from the whole video. A...

Detailed description

Bibliographic details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 46(2024), issue 10, 10 Sept., pages 6919-6934
First author: Sun, Guolei (author)
Other authors: Liu, Yun, Ding, Henghui, Wu, Min, Van Gool, Luc
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Keywords: Journal Article
LEADER 01000caa a22002652 4500
001 NLM370877071
003 DE-627
005 20240906232647.0
007 cr uuu---uuuuu
008 240411s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2024.3387326  |2 doi 
028 5 2 |a pubmed24n1525.xml 
035 |a (DE-627)NLM370877071 
035 |a (NLM)38598382 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Sun, Guolei  |e verfasserin  |4 aut 
245 1 0 |a Learning Local and Global Temporal Contexts for Video Semantic Segmentation 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 06.09.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Contextual information plays a core role in video semantic segmentation (VSS). This paper summarizes contexts for VSS into two types: local temporal contexts (LTC), which define the contexts from neighboring frames, and global temporal contexts (GTC), which represent the contexts from the whole video. LTC includes static and motional contexts, corresponding to static and moving content in neighboring frames, respectively. Both static and motional contexts have been studied previously, but no prior work learns them simultaneously, although they are highly complementary. Hence, we propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified representation of LTC. CFFM contains two parts: Coarse-to-Fine Feature Assembling (CFFA) and Cross-frame Feature Mining (CFM). CFFA abstracts static and motional contexts, and CFM mines useful information from nearby frames to enhance target features. To exploit more temporal contexts, we propose CFFM++, which additionally learns GTC from the whole video. Specifically, we uniformly sample certain frames from the video and extract global contextual prototypes by k-means. The information within those prototypes is mined by CFM to refine target features. Experimental results on popular benchmarks demonstrate that CFFM and CFFM++ perform favorably against state-of-the-art methods. (A code sketch of the GTC step follows the record below.) 
650 4 |a Journal Article 
700 1 |a Liu, Yun  |e verfasserin  |4 aut 
700 1 |a Ding, Henghui  |e verfasserin  |4 aut 
700 1 |a Wu, Min  |e verfasserin  |4 aut 
700 1 |a Van Gool, Luc  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 46(2024), 10 vom: 10. Sept., Seite 6919-6934  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:46  |g year:2024  |g number:10  |g day:10  |g month:09  |g pages:6919-6934 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2024.3387326  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 46  |j 2024  |e 10  |b 10  |c 09  |h 6919-6934
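The abstract above outlines the global-temporal-context (GTC) step of CFFM++: uniformly sample frames from the video, cluster their features into global contextual prototypes with k-means, and mine those prototypes to refine the target frame's features. Below is a minimal, hypothetical PyTorch sketch of that step under stated assumptions: the attention used here is a generic stand-in for the paper's Cross-frame Feature Mining (CFM) module, whose details are not given in this record, and all function names, shapes, and hyperparameters are illustrative, not the authors' implementation.

# Hypothetical sketch of the CFFM++ GTC step described in the abstract.
import torch
import torch.nn.functional as F


def uniform_sample(video_feats: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Uniformly pick `num_frames` frames from per-frame features (T, C, H, W)."""
    T = video_feats.shape[0]
    idx = torch.linspace(0, T - 1, num_frames).round().long()
    return video_feats[idx]


def kmeans_prototypes(feats: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Cluster pixel features (N, C) into k global prototypes with plain Lloyd k-means."""
    centers = feats[torch.randperm(feats.shape[0])[:k]]       # random initialization
    for _ in range(iters):
        assign = torch.cdist(feats, centers).argmin(dim=1)     # nearest-center assignment
        for j in range(k):
            members = feats[assign == j]
            if members.numel() > 0:
                centers[j] = members.mean(dim=0)               # recompute center
    return centers                                             # (k, C)


def refine_with_prototypes(target: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    """Refine target-frame features (C, H, W) by attending to prototypes (k, C).

    Generic cross-attention as a stand-in for CFM; residual fusion is an assumption.
    """
    C, H, W = target.shape
    q = target.flatten(1).t()                                  # (HW, C) pixel queries
    attn = F.softmax(q @ protos.t() / C ** 0.5, dim=-1)        # (HW, k) affinities
    mined = attn @ protos                                      # (HW, C) mined context
    refined = q + mined                                        # residual fusion (assumed)
    return refined.t().reshape(C, H, W)


if __name__ == "__main__":
    video = torch.randn(60, 64, 32, 32)                        # toy per-frame features
    sampled = uniform_sample(video, num_frames=8)              # uniformly sampled frames
    pixels = sampled.permute(0, 2, 3, 1).reshape(-1, 64)       # flatten to pixel features
    protos = kmeans_prototypes(pixels, k=16)                   # global contextual prototypes
    out = refine_with_prototypes(video[30], protos)            # refine one target frame
    print(out.shape)                                           # torch.Size([64, 32, 32])

Usage note: in practice the per-frame features would come from the segmentation backbone rather than random tensors, and the prototype count and number of sampled frames would be tuned per dataset; those choices are not specified in this record.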