Semantic-Disentangled Transformer With Noun-Verb Embedding for Compositional Action Recognition


Detailed Description

Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 33(2024), 01, pages 297-309
Main author: Huang, Peng (Author)
Other authors: Yan, Rui; Shu, Xiangbo; Tu, Zhewei; Dai, Guangzhao; Tang, Jinhui
Format: Online article
Language: English
Published: 2024
Access to the parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM365910376
003 DE-627
005 20250305135656.0
007 cr uuu---uuuuu
008 231227s2024 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2023.3341297  |2 doi 
028 5 2 |a pubmed25n1219.xml 
035 |a (DE-627)NLM365910376 
035 |a (NLM)38100340 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Huang, Peng  |e verfasserin  |4 aut 
245 1 0 |a Semantic-Disentangled Transformer With Noun-Verb Embedding for Compositional Action Recognition 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 25.12.2023 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Recognizing actions performed on unseen objects, known as Compositional Action Recognition (CAR), has attracted increasing attention in recent years. The main challenge is to overcome the distribution shift of "action-objects" pairs between the training and testing sets. Previous works on CAR usually introduce extra information (e.g., bounding boxes) to enhance the dynamic cues of video features. However, these approaches do not fundamentally eliminate the inherent inductive bias in the video, which can be regarded as the stumbling block for model generalization, because video features are usually extracted from visually cluttered areas in which many objects cannot be removed or masked explicitly. To this end, this work attempts to implicitly accomplish semantic-level decoupling of "object-action" in the high-level feature space. Specifically, we propose a novel Semantic-Decoupling Transformer framework, dubbed DeFormer, which contains two sub-modules: an Objects-Motion Decoupler (OMD) and a Semantic-Decoupling Constrainer (SDC). In OMD, we initialize several learnable tokens that incorporate annotation priors to learn an instance-level representation, and then decouple it into appearance and motion features in the high-level visual space. In SDC, we use textual information in the high-level language space to construct a dual-contrastive association that constrains the decoupled appearance and motion features obtained from OMD. Extensive experiments verify the generalization ability of DeFormer. Specifically, compared to the baseline method, DeFormer achieves absolute improvements of 3%, 3.3%, and 5.4% under three different settings on STH-ELSE, while the corresponding improvements on EPIC-KITCHENS-55 are 4.7%, 9.2%, and 4.4%. Moreover, DeFormer achieves state-of-the-art results with both ground-truth and detected annotations
650 4 |a Journal Article 
700 1 |a Yan, Rui  |e verfasserin  |4 aut 
700 1 |a Shu, Xiangbo  |e verfasserin  |4 aut 
700 1 |a Tu, Zhewei  |e verfasserin  |4 aut 
700 1 |a Dai, Guangzhao  |e verfasserin  |4 aut 
700 1 |a Tang, Jinhui  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 33(2024) vom: 01., Seite 297-309  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnas 
773 1 8 |g volume:33  |g year:2024  |g day:01  |g pages:297-309 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2023.3341297  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 33  |j 2024  |b 01  |h 297-309
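
The 520 abstract above describes SDC as a dual-contrastive association that ties the decoupled appearance feature to object (noun) semantics and the decoupled motion feature to action (verb) semantics. The following is a minimal, hypothetical sketch of one plausible reading of that constraint, assuming a symmetric InfoNCE loss and linear visual-to-text projections; the names DualContrastiveConstraint, app_proj, and mot_proj, the embedding dimensions, and the temperature are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two aligned batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class DualContrastiveConstraint(nn.Module):
    """Hypothetical stand-in for SDC: pulls the decoupled appearance feature
    toward the noun (object) text embedding and the decoupled motion feature
    toward the verb (action) text embedding, pushing non-matching pairs apart."""

    def __init__(self, visual_dim: int = 768, text_dim: int = 512):
        super().__init__()
        self.app_proj = nn.Linear(visual_dim, text_dim)   # appearance -> language space
        self.mot_proj = nn.Linear(visual_dim, text_dim)   # motion -> language space

    def forward(self, appearance, motion, noun_emb, verb_emb):
        # appearance, motion: (B, visual_dim) decoupled features from OMD
        # noun_emb, verb_emb: (B, text_dim) text embeddings of the object/action words
        loss_noun = info_nce(self.app_proj(appearance), noun_emb)
        loss_verb = info_nce(self.mot_proj(motion), verb_emb)
        return loss_noun + loss_verb


if __name__ == "__main__":
    B = 8
    sdc = DualContrastiveConstraint()
    loss = sdc(torch.randn(B, 768), torch.randn(B, 768),
               torch.randn(B, 512), torch.randn(B, 512))
    print(loss.item())

In this sketch the two contrastive terms are simply summed; the record does not specify the paper's actual loss weighting or choice of text encoder.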