Gloss Prior Guided Visual Feature Learning for Continuous Sign Language Recognition
Continuous sign language recognition (CSLR) is to recognize the glosses in a sign language video. Enhancing the generalization ability of CSLR's visual feature extractor is a worthy area of investigation. In this paper, we model glosses as priors that help to learn more generalizable visual fea...
Veröffentlicht in: | IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2024) vom: 30., Seite 3486-3495 |
---|---|
1. Verfasser: | |
Weitere Verfasser: | , , , , |
Format: | Online-Aufsatz |
Sprache: | English |
Veröffentlicht: |
2024
|
Zugriff auf das übergeordnete Werk: | IEEE transactions on image processing : a publication of the IEEE Signal Processing Society |
Schlagworte: | Journal Article |
Zusammenfassung: | Continuous sign language recognition (CSLR) is to recognize the glosses in a sign language video. Enhancing the generalization ability of CSLR's visual feature extractor is a worthy area of investigation. In this paper, we model glosses as priors that help to learn more generalizable visual features. Specifically, the signer-invariant gloss feature is extracted by a pre-trained gloss BERT model. Then we design a gloss prior guidance network (GPGN). It contains a novel parallel densely-connected temporal feature extraction (PDC-TFE) module for multi-resolution visual feature extraction. The PDC-TFE captures the complex temporal patterns of the glosses. The pre-trained gloss feature guides the visual feature learning through a cross-modality matching loss. We propose to formulate the cross-modality feature matching into a regularized optimal transport problem, it can be efficiently solved by a variant of the Sinkhorn algorithm. The GPGN parameters are learned by optimizing a weighted sum of the cross-modality matching loss and CTC loss. The experiment results on German and Chinese sign language benchmarks demonstrate that the proposed GPGN achieves competitive performance. The ablation study verifies the effectiveness of several critical components of the GPGN. Furthermore, the proposed pre-trained gloss BERT model and cross-modality matching can be seamlessly integrated into other RGB-cue-based CSLR methods as plug-and-play formulations to enhance the generalization ability of the visual feature extractor |
---|---|
Beschreibung: | Date Revised 05.06.2024 published: Print-Electronic Citation Status PubMed-not-MEDLINE |
ISSN: | 1941-0042 |
DOI: | 10.1109/TIP.2024.3404869 |