Gloss Prior Guided Visual Feature Learning for Continuous Sign Language Recognition

Continuous sign language recognition (CSLR) aims to recognize the glosses in a sign language video. Enhancing the generalization ability of CSLR's visual feature extractor is a worthy area of investigation. In this paper, we model glosses as priors that help to learn more generalizable visual fea...

Bibliographic details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2024), day 30, pages 3486-3495
Main author: Guo, Leming (author)
Other authors: Xue, Wanli, Liu, Bo, Zhang, Kaihua, Yuan, Tiantian, Metaxas, Dimitris
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000caa a22002652 4500
001 NLM373030274
003 DE-627
005 20240605232928.0
007 cr uuu---uuuuu
008 240531s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2024.3404869  |2 doi 
028 5 2 |a pubmed24n1429.xml 
035 |a (DE-627)NLM373030274 
035 |a (NLM)38814773 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Guo, Leming  |e verfasserin  |4 aut 
245 1 0 |a Gloss Prior Guided Visual Feature Learning for Continuous Sign Language Recognition 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 05.06.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Continuous sign language recognition (CSLR) aims to recognize the glosses in a sign language video. Enhancing the generalization ability of CSLR's visual feature extractor is a worthy area of investigation. In this paper, we model glosses as priors that help to learn more generalizable visual features. Specifically, the signer-invariant gloss feature is extracted by a pre-trained gloss BERT model. Then we design a gloss prior guidance network (GPGN). It contains a novel parallel densely-connected temporal feature extraction (PDC-TFE) module for multi-resolution visual feature extraction. The PDC-TFE captures the complex temporal patterns of the glosses. The pre-trained gloss feature guides the visual feature learning through a cross-modality matching loss. We formulate the cross-modality feature matching as a regularized optimal transport problem, which can be efficiently solved by a variant of the Sinkhorn algorithm. The GPGN parameters are learned by optimizing a weighted sum of the cross-modality matching loss and the CTC loss. Experimental results on German and Chinese sign language benchmarks demonstrate that the proposed GPGN achieves competitive performance. The ablation study verifies the effectiveness of several critical components of the GPGN. Furthermore, the proposed pre-trained gloss BERT model and cross-modality matching can be seamlessly integrated into other RGB-cue-based CSLR methods as plug-and-play formulations to enhance the generalization ability of the visual feature extractor. 
650 4 |a Journal Article 
700 1 |a Xue, Wanli  |e verfasserin  |4 aut 
700 1 |a Liu, Bo  |e verfasserin  |4 aut 
700 1 |a Zhang, Kaihua  |e verfasserin  |4 aut 
700 1 |a Yuan, Tiantian  |e verfasserin  |4 aut 
700 1 |a Metaxas, Dimitris  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 33(2024) vom: 30., Seite 3486-3495  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:33  |g year:2024  |g day:30  |g pages:3486-3495 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2024.3404869  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 33  |j 2024  |b 30  |h 3486-3495
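
Illustrative note on the abstract: the summary describes matching visual and gloss features by solving a regularized optimal transport problem with a variant of the Sinkhorn algorithm. The sketch below is not the authors' implementation and not their variant. It shows only the standard entropy-regularized Sinkhorn iteration, under several assumptions that do not come from the record: a cosine-distance cost, uniform marginals, random stand-in features, and the hypothetical names `sinkhorn`, `cosine_cost`, `eps`, and `n_iters`.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x and rows of y (assumed cost)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Standard entropy-regularized optimal transport with uniform marginals.

    Returns a transport plan of the same shape as `cost`.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # row marginal (assumed uniform)
    b = np.full(m, 1.0 / m)   # column marginal (assumed uniform)
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows toward a
        v = b / (K.T @ u)     # rescale columns toward b
    return u[:, None] * K * v[None, :]


# Random stand-ins for 6 visual feature vectors and 3 gloss feature vectors.
rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 16))
gloss = rng.normal(size=(3, 16))

cost = cosine_cost(visual, gloss)
plan = sinkhorn(cost)

# One possible matching loss: the transport cost under the plan.
matching_loss = float((plan * cost).sum())
```

In a training setup like the one the abstract outlines, a quantity such as `matching_loss` would be combined with the CTC loss in a weighted sum; the weight and the exact form of the loss are not given in this record.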