Action Anticipation Using Pairwise Human-Object Interactions and Transformers

The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only their visual features. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that cross-correlation based frame representation is more suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. So, we propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt) model that combines the evidence across spatio-temporal, motion, and HO representations. We show the performance of MM-Transformer on procedural datasets like 50 Salads and Breakfast, and an unscripted dataset like EPIC-KITCHENS55. Finally, we demonstrate that the combination of human-object representation and MM-Transformers is effective even for long-term anticipation.
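The abstract describes two ingredients: a cross-correlation pooling step that turns a variable number of pairwise human-object features into a fixed-size, second-order frame representation, and a transformer that aggregates these frame representations over time. The sketch below illustrates both ideas under stated assumptions; it is not the authors' released implementation (that is at the GitHub link above), and the feature dimension, number of action classes, layer sizes, and mean-pooled classification head are illustrative placeholders.

# Hypothetical sketch, not the authors' released code (see the GitHub link above):
# cross-correlation pooling over a variable number of human-object (HO) pair
# features, followed by a standard transformer encoder for temporal aggregation.
# Feature size, class count, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

def cross_correlation_pool(pair_feats: torch.Tensor) -> torch.Tensor:
    # pair_feats: (N, d) visual features, one row per HO pair in a frame.
    # Returns a flattened (d*d,) second-order representation whose size does
    # not depend on N, so frames with different numbers of pairs are comparable.
    n, d = pair_feats.shape
    corr = pair_feats.t() @ pair_feats / n        # (d, d) cross-correlation matrix
    return corr.flatten()

class TemporalTransformer(nn.Module):
    # Minimal temporal aggregator over frame-wise HO representations.
    def __init__(self, in_dim: int, model_dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.proj = nn.Linear(in_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(model_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, in_dim) sequence of frame representations
        h = self.encoder(self.proj(frames))
        return self.head(h.mean(dim=1))           # logits for the anticipated action

if __name__ == "__main__":
    d = 16
    # Two observed frames with 3 and 5 HO pairs, respectively.
    frame_reprs = torch.stack([
        cross_correlation_pool(torch.randn(3, d)),
        cross_correlation_pool(torch.randn(5, d)),
    ]).unsqueeze(0)                               # (1, T=2, d*d)
    logits = TemporalTransformer(in_dim=d * d)(frame_reprs)
    print(logits.shape)                           # torch.Size([1, 10])

Because the cross-correlation matrix has a fixed d-by-d shape regardless of how many human-object pairs a frame contains, every frame maps to a vector of the same length, which is what makes the per-frame representation compatible with a standard transformer encoder over the observation period.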


Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 30(2021) of the 22nd, pages 8116-8129
Main author: Roy, Debaditya (author)
Other authors: Fernando, Basura
Format: Online article
Language: English
Published: 2021
Access to the parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000naa a22002652 4500
001 NLM330926578
003 DE-627
005 20231225212340.0
007 cr uuu---uuuuu
008 231225s2021 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2021.3113114  |2 doi 
028 5 2 |a pubmed24n1103.xml 
035 |a (DE-627)NLM330926578 
035 |a (NLM)34550884 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Roy, Debaditya  |e verfasserin  |4 aut 
245 1 0 |a Action Anticipation Using Pairwise Human-Object Interactions and Transformers 
264 1 |c 2021 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 29.09.2021 
500 |a Date Revised 29.09.2021 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only their visual features. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that cross-correlation based frame representation is more suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. So, we propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt) model that combines the evidence across spatio-temporal, motion, and HO representations. We show the performance of MM-Transformer on procedural datasets like 50 Salads and Breakfast, and an unscripted dataset like EPIC-KITCHENS55. Finally, we demonstrate that the combination of human-object representation and MM-Transformers is effective even for long-term anticipation.
650 4 |a Journal Article 
700 1 |a Fernando, Basura  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 30(2021) vom: 22., Seite 8116-8129  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:30  |g year:2021  |g day:22  |g pages:8116-8129 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2021.3113114  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 30  |j 2021  |b 22  |h 8116-8129