SignBERT+ : Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding

Hand gesture serves as a crucial role during the expression of sign language. Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resource and suffer limited interpretability. In this paper, we propose the first self-super...

Bibliographic Details
Published in:	IEEE transactions on pattern analysis and machine intelligence. - 1979. - 45(2023), 9, 1 Sept., pages 11221-11239
Main author:	Hu, Hezhen (author)
Other authors:	Zhao, Weichao, Zhou, Wengang, Li, Houqiang
Format:	Online article
Language:	English
Published:	2023
Access to parent work:	IEEE transactions on pattern analysis and machine intelligence
Subjects:	Journal Article


LEADER	01000caa a22002652c 4500
001	NLM356031047
003	DE-627
005	20250304165943.0
007	cr uuu---uuuuu
008	231226s2023 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1109/TPAMI.2023.3269220 \|2 doi
028	5	2	\|a pubmed25n1186.xml
035			\|a (DE-627)NLM356031047
035			\|a (NLM)37099464
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Hu, Hezhen \|e verfasserin \|4 aut
245	1	0	\|a SignBERT+ \|b Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding
264		1	\|c 2023
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 08.08.2023
500			\|a published: Print-Electronic
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a Hand gesture serves as a crucial role during the expression of sign language. Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resource and suffer limited interpretability. In this paper, we propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated. In our framework, the hand pose is regarded as a visual token, which is derived from an off-the-shelf detector. Each visual token is embedded with gesture state and spatial-temporal position encoding. To take full advantage of current sign data resource, we first perform self-supervised learning to model its statistics. To this end, we design multi-level masked modeling strategies (joint, frame and clip) to mimic common failure detection cases. Jointly with these masked modeling strategies, we incorporate model-aware hand prior to better capture hierarchical context over the sequence. After the pre-training, we carefully design simple yet effective prediction heads for downstream tasks. To validate the effectiveness of our framework, we perform extensive experiments on three main SLU tasks, involving isolated and continuous sign language recognition (SLR), and sign language translation (SLT). Experimental results demonstrate the effectiveness of our method, achieving new state-of-the-art performance with a notable gain
650		4	\|a Journal Article
700	1		\|a Zhao, Weichao \|e verfasserin \|4 aut
700	1		\|a Zhou, Wengang \|e verfasserin \|4 aut
700	1		\|a Li, Houqiang \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t IEEE transactions on pattern analysis and machine intelligence \|d 1979 \|g 45(2023), 9 vom: 01. Sept., Seite 11221-11239 \|w (DE-627)NLM098212257 \|x 1939-3539 \|7 nnas
773	1	8	\|g volume:45 \|g year:2023 \|g number:9 \|g day:01 \|g month:09 \|g pages:11221-11239
856	4	0	\|u http://dx.doi.org/10.1109/TPAMI.2023.3269220 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a SYSFLAG_A
912			\|a GBV_NLM
912			\|a GBV_ILN_350
951			\|a AR
952			\|d 45 \|j 2023 \|e 9 \|b 01 \|c 09 \|h 11221-11239