Relational Reasoning for Group Activity Recognition via Self-Attention Augmented Conditional Random Field
Published in: IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society, vol. 30 (2021), pp. 8184-8199
Author:
Other authors:
Format: Online article
Language: English
Published: 2021
Parent work: IEEE Transactions on Image Processing : a publication of the IEEE Signal Processing Society
Keywords: Journal Article
Abstract: This paper presents a new relational network for group activity recognition. The essence of the network is to integrate conditional random fields (CRFs) with self-attention to infer the temporal dependencies and spatial relationships of the actors. This combination takes advantage of the capability of CRFs to model the actors' mutually dependent features and the capability of self-attention to learn the temporal evolution and spatial relational context of every actor in a video. Additionally, the proposed CRF and self-attention have two distinctive facets. First, the pairwise energy of the new CRF relies on both temporal self-attention and spatial self-attention, which apply the self-attention mechanism to the actors' features in time and space, respectively. Second, to capture both local and non-local relationships in group activities, the spatial self-attention takes into account a collection of cliques with different scales of spatial locality. The associated mean-field inference can then be reformulated as a self-attention network that generates the relational contexts of the actors and their individual action labels. Lastly, a bidirectional universal transformer encoder (UTE) aggregates the forward and backward temporal context information, scene information and relational contexts for group activity recognition. A new loss function is also employed, consisting not only of the classification costs for individual actions and group activities, but also of a contrastive loss that addresses the miscellaneous relational contexts between actors. Simulations show that the new approach surpasses previous works on four commonly used datasets.
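The abstract above describes the architecture only in prose. The following is a minimal, hypothetical sketch (in Python/PyTorch) of how the temporal and multi-scale spatial self-attention terms and the mean-field-style refinement could be wired together. All module names, the clique-mask construction, tensor shapes and hyper-parameters (e.g. `clique_sizes`, `num_mf_iters`) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): per-actor features of shape
# (batch B, frames T, actors N, dim D) are refined by temporal self-attention,
# spatial self-attention over cliques of several sizes, and a few mean-field-style
# mixing iterations, roughly following the abstract's description.
import torch
import torch.nn as nn


def build_clique_mask(n_actors, clique_size, device):
    """Assumed local-clique attention mask: actor i may attend to actors within
    a window of `clique_size` (actor ordering, e.g. by image position, assumed)."""
    idx = torch.arange(n_actors, device=device)
    dist = (idx[None, :] - idx[:, None]).abs()
    mask = torch.full((n_actors, n_actors), float("-inf"), device=device)
    mask[dist <= clique_size] = 0.0  # 0 = allowed, -inf = blocked
    return mask


class SelfAttentionCRFRelation(nn.Module):
    """Sketch: pairwise terms from temporal and multi-scale spatial self-attention,
    combined by a small number of mean-field-style update iterations."""

    def __init__(self, dim=256, heads=4, clique_sizes=(3, 6, 12), num_mf_iters=2):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.clique_sizes = clique_sizes  # assumed neighbourhood scales
        self.num_mf_iters = num_mf_iters  # assumed number of mean-field iterations
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, T, N, D) actor features
        B, T, N, D = x.shape

        # Temporal self-attention: each actor attends over its own time steps.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt, _ = self.temporal_attn(xt, xt, xt)
        xt = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial self-attention per frame, over cliques of several scales,
        # averaged so both local and non-local actor relations contribute.
        xs_in = x.reshape(B * T, N, D)
        clique_outs = []
        for k in self.clique_sizes:
            mask = build_clique_mask(N, k, x.device)
            out, _ = self.spatial_attn(xs_in, xs_in, xs_in, attn_mask=mask)
            clique_outs.append(out)
        xs = torch.stack(clique_outs, 0).mean(0).reshape(B, T, N, D)

        # Mean-field-style refinement: mix the unary features with the
        # attention-derived pairwise messages for a few iterations.
        h = x
        for _ in range(self.num_mf_iters):
            h = self.norm(h + xt + xs)
        return h  # relational context per actor, (B, T, N, D)


# Hypothetical usage: 2 clips, 10 frames, 12 actors, 256-dim features per actor.
model = SelfAttentionCRFRelation(dim=256)
feats = torch.randn(2, 10, 12, 256)
ctx = model(feats)  # relational contexts, shape (2, 10, 12, 256)
```

In the paper these relational contexts would then be aggregated with scene and forward/backward temporal information by a bidirectional universal transformer encoder and fed to the classification and contrastive losses; that stage is omitted from this sketch.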
Description: Date Completed 29.09.2021; Date Revised 29.09.2021; Published: Print-Electronic; Citation Status: PubMed-not-MEDLINE
ISSN: 1941-0042
DOI: 10.1109/TIP.2021.3113570