Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation

Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered across the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that introduce the multi-head cross-attention mechanism to fully explore the interaction between visual representations and the knowledge to generate spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and their relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving the mR50 by 8.1%, 4.7%, and 2.1% under different settings.
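The abstract describes the core mechanism: statistical spatial co-occurrence priors over subject-object class pairs are projected into knowledge embeddings and fused with visual relationship features through multi-head cross-attention. The snippet below is a minimal, hypothetical sketch of that idea in PyTorch, not the authors' implementation; the class name SpatialKnowledgeLayer, the uniform-initialized co-occurrence buffer, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SpatialKnowledgeLayer(nn.Module):
    """Hypothetical sketch: fuse visual relationship features with embeddings of
    statistical spatial co-occurrence priors via multi-head cross-attention."""

    def __init__(self, dim, num_object_classes, num_predicates, num_heads=8):
        super().__init__()
        # Co-occurrence prior: counts of predicate p given (subject class i, object class j).
        # In practice this would be tallied from training annotations; uniform placeholder here.
        self.register_buffer(
            "cooccurrence",
            torch.ones(num_object_classes, num_object_classes, num_predicates),
        )
        self.knowledge_proj = nn.Linear(num_predicates, dim)  # prior -> knowledge embedding
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rel_queries, subj_labels, obj_labels):
        # rel_queries: (B, N, dim) visual features of N subject-object pairs
        # subj_labels, obj_labels: (B, N) predicted object class indices
        prior = self.cooccurrence[subj_labels, obj_labels]      # (B, N, num_predicates)
        prior = prior / prior.sum(dim=-1, keepdim=True)         # normalize to a distribution
        knowledge = self.knowledge_proj(prior)                  # (B, N, dim)
        # Visual relationship queries cross-attend to the knowledge embeddings.
        attended, _ = self.cross_attn(rel_queries, knowledge, knowledge)
        return self.norm(rel_queries + attended)                # residual + layer norm


if __name__ == "__main__":
    layer = SpatialKnowledgeLayer(dim=256, num_object_classes=36, num_predicates=26)
    feats = torch.randn(2, 10, 256)
    subj = torch.randint(0, 36, (2, 10))
    obj = torch.randint(0, 36, (2, 10))
    print(layer(feats, subj, obj).shape)  # torch.Size([2, 10, 256])
```

A temporal counterpart would follow the same pattern, with the prior indexed by predicate transitions across adjacent frames rather than by class-pair co-occurrence within a frame.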

Detailed Description

Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - PP(2023), 28 Dec.
Main Author: Pu, Tao (Author)
Other Authors: Chen, Tianshui, Wu, Hefeng, Lu, Yongyi, Lin, Liang
Format: Online Article
Language: English
Published: 2023
Parent Work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000naa a22002652 4500
001 NLM366444670
003 DE-627
005 20240108140250.0
007 cr uuu---uuuuu
008 240108s2023 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2023.3345652  |2 doi 
028 5 2 |a pubmed24n1243.xml 
035 |a (DE-627)NLM366444670 
035 |a (NLM)38153822 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Pu, Tao  |e verfasserin  |4 aut 
245 1 0 |a Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 29.12.2023 
500 |a published: Print-Electronic 
500 |a Citation Status Publisher 
520 |a Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered on the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that introduce the multi-head cross-attention mechanism to fully explore the interaction between visual representation and the knowledge to generate spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and their relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving the mR50 by 8.1%, 4.7%, and 2.1% on different settings over current algorithms 
650 4 |a Journal Article 
700 1 |a Chen, Tianshui  |e verfasserin  |4 aut 
700 1 |a Wu, Hefeng  |e verfasserin  |4 aut 
700 1 |a Lu, Yongyi  |e verfasserin  |4 aut 
700 1 |a Lin, Liang  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g PP(2023) vom: 28. Dez.  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:PP  |g year:2023  |g day:28  |g month:12 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2023.3345652  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d PP  |j 2023  |b 28  |c 12