Object-centric Representation Learning for Video Scene Understanding

Depth-aware Video Panoptic Segmentation (DVPS) is a challenging task that requires predicting the semantic class and 3D depth of each pixel in a video, while also segmenting and consistently tracking objects across frames. Predominant methodologies treat this as a multi-task learning problem, tackli...

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - PP(2024), 15 May
Main author: Zhou, Yi (Author)
Other authors: Zhang, Hui, Park, Seung-In, Yoo, ByungIn, Qi, Xiaojuan
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
LEADER 01000caa a22002652 4500
001 NLM372370187
003 DE-627
005 20240521234757.0
007 cr uuu---uuuuu
008 240516s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2024.3401409  |2 doi 
028 5 2 |a pubmed24n1414.xml 
035 |a (DE-627)NLM372370187 
035 |a (NLM)38748520 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Zhou, Yi  |e verfasserin  |4 aut 
245 1 0 |a Object-centric Representation Learning for Video Scene Understanding 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 21.05.2024 
500 |a published: Print-Electronic 
500 |a Citation Status Publisher 
520 |a Depth-aware Video Panoptic Segmentation (DVPS) is a challenging task that requires predicting the semantic class and 3D depth of each pixel in a video, while also segmenting and consistently tracking objects across frames. Predominant methodologies treat this as a multi-task learning problem, tackling each constituent task independently, thus restricting their capacity to leverage interrelationships amongst tasks and requiring parameter tuning for each task. To surmount these constraints, we present Slot-IVPS, a new approach employing an object-centric model to acquire unified object representations, thereby facilitating the model's ability to simultaneously capture semantic and depth information. Specifically, we introduce a novel representation, Integrated Panoptic Slots (IPS), to capture both semantic and depth information for all panoptic objects within a video, encompassing background semantics and foreground instances. Subsequently, we propose an integrated feature generator and enhancer to extract depth-aware features, alongside the Integrated Video Panoptic Retriever (IVPR), which iteratively retrieves spatial-temporal coherent object features and encodes them into IPS. The resulting IPS can be effortlessly decoded into an array of video outputs, including depth maps, classifications, masks, and object instance IDs. We undertake comprehensive analyses across four datasets, attaining state-of-the-art performance in both Depth-aware Video Panoptic Segmentation and Video Panoptic Segmentation tasks. Codes will be available at https://github.com/SAITPublic/ 
650 4 |a Journal Article 
700 1 |a Zhang, Hui  |e verfasserin  |4 aut 
700 1 |a Park, Seung-In  |e verfasserin  |4 aut 
700 1 |a Yoo, ByungIn  |e verfasserin  |4 aut 
700 1 |a Qi, Xiaojuan  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g PP(2024) vom: 15. Mai  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:PP  |g year:2024  |g day:15  |g month:05 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2024.3401409  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d PP  |j 2024  |b 15  |c 05