Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation

Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered on the whole scene but also a deep dive into their temporal motions and interactions. Inherentl...

Bibliographic Details
Published in:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - PP(2023), 28 Dec.
Main author:	Pu, Tao (Author)
Other authors:	Chen, Tianshui; Wu, Hefeng; Lu, Yongyi; Lin, Liang
Format:	Online article
Language:	English
Published:	2023
Access to parent work:	IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects:	Journal Article


LEADER	01000naa a22002652 4500
001	NLM366444670
003	DE-627
005	20240108140250.0
007	cr uuu---uuuuu
008	240108s2023 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TIP.2023.3345652 \|2 doi
028	5	2	\|a pubmed24n1243.xml
035			\|a (DE-627)NLM366444670
035			\|a (NLM)38153822
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Pu, Tao \|e verfasserin \|4 aut
245	1	0	\|a Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation
264		1	\|c 2023
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 29.12.2023
500			\|a published: Print-Electronic
500			\|a Citation Status Publisher
520			\|a Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered on the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that introduce the multi-head cross-attention mechanism to fully explore the interaction between visual representation and the knowledge to generate spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and their relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving the mR50 by 8.1%, 4.7%, and 2.1% on different settings over current algorithms
650		4	\|a Journal Article
700	1		\|a Chen, Tianshui \|e verfasserin \|4 aut
700	1		\|a Wu, Hefeng \|e verfasserin \|4 aut
700	1		\|a Lu, Yongyi \|e verfasserin \|4 aut
700	1		\|a Lin, Liang \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society \|d 1992 \|g PP(2023) vom: 28. Dez. \|w (DE-627)NLM09821456X \|x 1941-0042 \|7 nnns
773	1	8	\|g volume:PP \|g year:2023 \|g day:28 \|g month:12
856	4	0	\|u http://dx.doi.org/10.1109/TIP.2023.3345652 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d PP \|j 2023 \|b 28 \|c 12