Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Full Description

Bibliographic Details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - 30(2021), from the 15th, pages 7878-7888
Main Author: Xuan, Hanyu (Author)
Other Authors: Luo, Lei, Zhang, Zhenyu, Yang, Jian, Yan, Yan
Format: Online article
Language: English
Published: 2021
In Collection: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000caa a22002652c 4500
001 NLM330208160
003 DE-627
005 20250302105858.0
007 cr uuu---uuuuu
008 231225s2021 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2021.3106814  |2 doi 
028 5 2 |a pubmed25n1100.xml 
035 |a (DE-627)NLM330208160 
035 |a (NLM)34478364 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Xuan, Hanyu  |e verfasserin  |4 aut 
245 1 0 |a Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization 
264 1 |c 2021 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 21.09.2021 
500 |a Date Revised 21.09.2021 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a It is theoretically insufficient to construct a complete set of semantics in the real world using single-modality data. As a typical application of multi-modality perception, the audio-visual event localization task aims to match audio and visual components to identify the simultaneous events of interest. Although some recent methods have been proposed to deal with this task, they cannot handle the practical situation of temporal inconsistency that is widespread in audio-visual scenes. Inspired by the human system, which automatically filters out event-unrelated information when performing multi-modality perception, we propose a discriminative cross-modality attention network to simulate such a process. Similar to this human mechanism, our network can adaptively select "where" to attend, "when" to attend, and "which" to attend for audio-visual event localization. In addition, to prevent our network from converging to trivial solutions, a novel eigenvalue-based objective function is proposed to train the whole network to better fuse audio and visual signals, yielding a discriminative and nonlinear multi-modality representation. In this way, even with large temporal inconsistency between the audio and visual sequences, our network is able to adaptively select event-valuable information for audio-visual event localization. Furthermore, we systematically investigate three subtasks of audio-visual event localization, i.e., temporal localization, weakly-supervised spatial localization, and cross-modality localization. The visualization results also help us better understand how our network works. 
650 4 |a Journal Article 
700 1 |a Luo, Lei  |e verfasserin  |4 aut 
700 1 |a Zhang, Zhenyu  |e verfasserin  |4 aut 
700 1 |a Yang, Jian  |e verfasserin  |4 aut 
700 1 |a Yan, Yan  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g 30(2021) vom: 15., Seite 7878-7888  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnas 
773 1 8 |g volume:30  |g year:2021  |g day:15  |g pages:7878-7888 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2021.3106814  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 30  |j 2021  |b 15  |h 7878-7888