Relational Reasoning for Group Activity Recognition via Self-Attention Augmented Conditional Random Field

Bibliographic Details
Published in: IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society. - 1992. - 30 (2021), from the 23rd, pages 8184-8199
First author: Pramono, Rizard Renanda Adhi (Author)
Other authors: Fang, Wen-Hsien, Chen, Yie-Tarng
Format: Online article
Language: English
Published: 2021
Access to parent work: IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society
Keywords: Journal Article
Description
Abstract: This paper presents a new relational network for group activity recognition. The essence of the network is to integrate conditional random fields (CRFs) with self-attention to infer the temporal dependencies and spatial relationships of the actors. This combination can take advantage of the capability of CRFs in modelling the actors' interdependent features and the capability of self-attention in learning the temporal evolution and spatial relational context of every actor in videos. Additionally, the combination of CRF and self-attention has two distinct facets. First, the pairwise energy of the new CRF relies on both temporal self-attention and spatial self-attention, which apply the self-attention mechanism to the features in time and in space, respectively. Second, to address both local and non-local relationships in group activities, the spatial self-attention takes into account a collection of cliques with different scales of spatial locality. The associated mean-field inference can thus be reformulated as a self-attention network that generates the relational contexts of the actors and their individual action labels. Lastly, a bidirectional universal transformer encoder (UTE) is utilized to aggregate the forward and backward temporal context information, scene information and relational contexts for group activity recognition. A new loss function is also employed, consisting of not only the cost for the classification of individual actions and group activities, but also a contrastive loss to address the miscellaneous relational contexts between actors. Simulations show that the new approach can surpass previous works on four commonly used datasets.
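The following is a minimal, illustrative sketch (not the authors' released code) of the central idea in the abstract: mean-field CRF inference whose pairwise messages are computed by self-attention over actor features, followed by a loss that combines action/activity classification with a contrastive term. The module and function names (SelfAttentionMessage, crf_mean_field, GroupActivityHead, total_loss), tensor shapes, and the choice of shared action labels as contrastive positives are assumptions made for illustration only; the bidirectional UTE aggregation stage is omitted.

    # Illustrative PyTorch-style sketch; names and shapes are assumptions, not the paper's code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentionMessage(nn.Module):
        """Scaled dot-product self-attention used as the CRF pairwise message."""
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, x):                       # x: (batch, actors, dim)
            attn = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) * self.scale, dim=-1)
            return attn @ self.v(x)                 # relational context per actor

    def crf_mean_field(unary, message_fn, num_iters=3):
        """Mean-field-style updates: unary actor features refined by attention messages."""
        h = unary
        for _ in range(num_iters):
            h = unary + message_fn(h)               # unary term + pairwise (attention) term
        return h

    class GroupActivityHead(nn.Module):
        """Classify individual actions and the group activity from relational contexts."""
        def __init__(self, dim, num_actions, num_activities):
            super().__init__()
            self.message = SelfAttentionMessage(dim)
            self.action_cls = nn.Linear(dim, num_actions)
            self.activity_cls = nn.Linear(dim, num_activities)

        def forward(self, actor_feats):             # (batch, actors, dim)
            ctx = crf_mean_field(actor_feats, self.message)
            action_logits = self.action_cls(ctx)                  # per-actor action labels
            activity_logits = self.activity_cls(ctx.mean(dim=1))  # pooled group activity
            return ctx, action_logits, activity_logits

    def total_loss(ctx, action_logits, activity_logits, action_y, activity_y, tau=0.1):
        """Classification losses plus a simple contrastive term over actor contexts."""
        cls = F.cross_entropy(action_logits.flatten(0, 1), action_y.flatten()) \
            + F.cross_entropy(activity_logits, activity_y)
        z = F.normalize(ctx.flatten(0, 1), dim=-1)                # (batch*actors, dim)
        sim = z @ z.t() / tau
        same = (action_y.flatten()[:, None] == action_y.flatten()[None, :]).float()
        same.fill_diagonal_(0)
        # Pull together contexts of actors sharing an action label (InfoNCE-style);
        # this is only one plausible reading of the paper's contrastive loss.
        contrastive = -(torch.log_softmax(sim, dim=-1) * same).sum(-1) / same.sum(-1).clamp(min=1)
        return cls + contrastive.mean()

In this reading, each mean-field iteration adds an attention-derived pairwise message to the unary actor features, mirroring the abstract's reformulation of mean-field inference as a self-attention network.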
Description: Date Completed 29.09.2021
Date Revised 29.09.2021
published: Print-Electronic
Citation Status PubMed-not-MEDLINE
ISSN:1941-0042
DOI:10.1109/TIP.2021.3113570