Deep Attention Network for Egocentric Action Recognition

Recognizing a camera wearer's actions from videos captured by an egocentric camera is a challenging task. In this paper, we employ a two-stream deep neural network composed of an appearance-based stream and a motion-based stream to recognize egocentric actions. Based on the insight that human action and gaze behavior are highly coordinated in object manipulation tasks, we propose a spatial attention network to predict human gaze in the form of attention map. The attention map helps each of the two streams to focus on the most relevant spatial region of the video frames to predict actions. To better model the temporal structure of the videos, a temporal network is proposed. The temporal network incorporates bi-directional long short-term memory to model the long-range dependencies to recognize egocentric actions. The experimental results demonstrate that our method is able to predict attention maps that are consistent with human attention and achieve competitive action recognition performance with the state-of-the-art methods on the GTEA Gaze and GTEA Gaze+ datasets.
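The pipeline the abstract describes, a spatial attention map reweighting per-region frame features, followed by fusion of the appearance and motion streams, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the softmax normalization of the attention map, and the late-fusion averaging of class scores are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, attention_logits):
    """Pool per-region feature vectors into one vector, weighted by a
    spatial attention map (one logit per region)."""
    weights = softmax(attention_logits)
    dim = len(region_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]


def fuse(appearance_scores, motion_scores):
    """Late fusion of the two streams: average their class scores."""
    return [(a + m) / 2.0 for a, m in zip(appearance_scores, motion_scores)]


# Toy example: two spatial regions, 2-D features per region.
regions = [[1.0, 0.0], [3.0, 0.0]]
pooled = attend(regions, [0.0, 0.0])   # uniform attention -> mean feature
scores = fuse([1.0, 0.0], [0.0, 1.0])  # combine the two streams
```

With uniform attention logits the pooled feature is the plain mean of the region features; a peaked attention map would instead concentrate the pooled feature on the gaze region, which is the effect the paper's attention network is meant to produce.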

Detailed Description

Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 28(2019), 8, 27 Aug., pages 3703-3713
Main Author: Lu, Minlong (Author)
Other Authors: Li, Ze-Nian, Wang, Yueming, Pan, Gang
Format: Online Article
Language: English
Published: 2019
Access to parent work: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000naa a22002652 4500
001 NLM294593578
003 DE-627
005 20231225081823.0
007 cr uuu---uuuuu
008 231225s2019 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2019.2901707  |2 doi 
028 5 2 |a pubmed24n0981.xml 
035 |a (DE-627)NLM294593578 
035 |a (NLM)30835222 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Lu, Minlong  |e verfasserin  |4 aut 
245 1 0 |a Deep Attention Network for Egocentric Action Recognition 
264 1 |c 2019 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 02.01.2020 
500 |a Date Revised 02.01.2020 
500 |a published: Print-Electronic 
500 |a Citation Status MEDLINE 
520 |a Recognizing a camera wearer's actions from videos captured by an egocentric camera is a challenging task. In this paper, we employ a two-stream deep neural network composed of an appearance-based stream and a motion-based stream to recognize egocentric actions. Based on the insight that human action and gaze behavior are highly coordinated in object manipulation tasks, we propose a spatial attention network to predict human gaze in the form of attention map. The attention map helps each of the two streams to focus on the most relevant spatial region of the video frames to predict actions. To better model the temporal structure of the videos, a temporal network is proposed. The temporal network incorporates bi-directional long short-term memory to model the long-range dependencies to recognize egocentric actions. The experimental results demonstrate that our method is able to predict attention maps that are consistent with human attention and achieve competitive action recognition performance with the state-of-the-art methods on the GTEA Gaze and GTEA Gaze+ datasets.
650 4 |a Journal Article 
700 1 |a Li, Ze-Nian  |e verfasserin  |4 aut 
700 1 |a Wang, Yueming  |e verfasserin  |4 aut 
700 1 |a Pan, Gang  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 28(2019), 8 vom: 27. Aug., Seite 3703-3713  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnns 
773 1 8 |g volume:28  |g year:2019  |g number:8  |g day:27  |g month:08  |g pages:3703-3713 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2019.2901707  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 28  |j 2019  |e 8  |b 27  |c 08  |h 3703-3713