Learning to Localize Sound Sources in Visual Scenes : Analysis and Applications

Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its empirical learnability, in this work we first present a novel unsup...

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 43(2021), 5, 1 May, pages 1605-1619
Main author: Senocak, Arda (Author)
Other authors: Oh, Tae-Hyun, Kim, Junsik, Yang, Ming-Hsuan, Kweon, In So
Format: Online article
Language: English
Published: 2021
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, Non-P.H.S.
LEADER 01000naa a22002652 4500
001 NLM303259280
003 DE-627
005 20231225112440.0
007 cr uuu---uuuuu
008 231225s2021 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2019.2952095  |2 doi 
028 5 2 |a pubmed24n1010.xml 
035 |a (DE-627)NLM303259280 
035 |a (NLM)31722472 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Senocak, Arda  |e verfasserin  |4 aut 
245 1 0 |a Learning to Localize Sound Sources in Visual Scenes  |b Analysis and Applications 
264 1 |c 2021 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 29.09.2021 
500 |a Date Revised 29.09.2021 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. In order to achieve this goal, a two-stream network structure which handles each modality with attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. Thereby, we show that this false conclusion cannot be fixed without human prior knowledge due to the well-known correlation and causality mismatch misconception. To fix this issue, we extend our network to the supervised and semi-supervised network settings via a simple modification due to the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., semi-supervised setup. Furthermore, we present the versatility of the learned audio and visual embeddings on the cross-modal content alignment and we extend this proposed algorithm to a new application, sound saliency based automatic camera view panning in 360 degree videos 
650 4 |a Journal Article 
650 4 |a Research Support, Non-U.S. Gov't 
650 4 |a Research Support, U.S. Gov't, Non-P.H.S. 
700 1 |a Oh, Tae-Hyun  |e verfasserin  |4 aut 
700 1 |a Kim, Junsik  |e verfasserin  |4 aut 
700 1 |a Yang, Ming-Hsuan  |e verfasserin  |4 aut 
700 1 |a Kweon, In So  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 43(2021), 5 vom: 01. Mai, Seite 1605-1619  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:43  |g year:2021  |g number:5  |g day:01  |g month:05  |g pages:1605-1619 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2019.2952095  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 43  |j 2021  |e 5  |b 01  |c 05  |h 1605-1619