TransVOD : End-to-End Video Object Detection With Spatial-Temporal Transformers

Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been wel...

Detailed description

Bibliographic details
Published in:	IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 6, 23 June, pages 7853-7869
First author:	Zhou, Qianyu (Author)
Other authors:	Li, Xiangtai, He, Lu, Yang, Yibo, Cheng, Guangliang, Tong, Yunhai, Ma, Lizhuang, Tao, Dacheng
Format:	Online article
Language:	English
Published:	2023
Access to the parent work:	IEEE transactions on pattern analysis and machine intelligence
Subjects:	Journal Article


LEADER	01000naa a22002652 4500
001	NLM349308454
003	DE-627
005	20231226042317.0
007	cr uuu---uuuuu
008	231226s2023 xx |||||o 00| ||eng c
024	7		|a 10.1109/TPAMI.2022.3223955 |2 doi
028	5	2	|a pubmed24n1164.xml
035			|a (DE-627)NLM349308454
035			|a (NLM)36417746
040			|a DE-627 |b ger |c DE-627 |e rakwb
041			|a eng
100	1		|a Zhou, Qianyu |e verfasserin |4 aut
245	1	0	|a TransVOD |b End-to-End Video Object Detection With Spatial-Temporal Transformers
264		1	|c 2023
336			|a Text |b txt |2 rdacontent
337			|a Computermedien |b c |2 rdamedia
338			|a Online-Ressource |b cr |2 rdacarrier
500			|a Date Completed 07.05.2023
500			|a Date Revised 07.05.2023
500			|a published: Print-Electronic
500			|a Citation Status PubMed-not-MEDLINE
520			|a Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of current VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs improve on the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments section. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU device. Code and models are available at https://github.com/SJTU-LuHe/TransVOD
650		4	|a Journal Article
700	1		|a Li, Xiangtai |e verfasserin |4 aut
700	1		|a He, Lu |e verfasserin |4 aut
700	1		|a Yang, Yibo |e verfasserin |4 aut
700	1		|a Cheng, Guangliang |e verfasserin |4 aut
700	1		|a Tong, Yunhai |e verfasserin |4 aut
700	1		|a Ma, Lizhuang |e verfasserin |4 aut
700	1		|a Tao, Dacheng |e verfasserin |4 aut
773	0	8	|i Enthalten in |t IEEE transactions on pattern analysis and machine intelligence |d 1979 |g 45(2023), 6 vom: 23. Juni, Seite 7853-7869 |w (DE-627)NLM098212257 |x 1939-3539 |7 nnns
773	1	8	|g volume:45 |g year:2023 |g number:6 |g day:23 |g month:06 |g pages:7853-7869
856	4	0	|u http://dx.doi.org/10.1109/TPAMI.2022.3223955 |3 Volltext
912			|a GBV_USEFLAG_A
912			|a SYSFLAG_A
912			|a GBV_NLM
912			|a GBV_ILN_350
951			|a AR
952			|d 45 |j 2023 |e 6 |b 23 |c 06 |h 7853-7869