Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around an...


Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 40(2018), 5, 19 May, pages 1086-1099
Main Author: Gebru, Israel D (author)
Other Authors: Ba, Sileye; Li, Xiaofei; Horaud, Radu
Format: Online Article
Language: English
Published: 2018
Access to parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article; Research Support, Non-U.S. Gov't
LEADER 01000naa a22002652 4500
001 NLM26815161X
003 DE-627
005 20231224222122.0
007 cr uuu---uuuuu
008 231224s2018 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2017.2648793  |2 doi 
028 5 2 |a pubmed24n0893.xml 
035 |a (DE-627)NLM26815161X 
035 |a (NLM)28103192 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Gebru, Israel D  |e verfasserin  |4 aut 
245 1 0 |a Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion 
264 1 |c 2018 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 19.03.2019 
500 |a Date Revised 19.03.2019 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it processes in a principled way speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each time slice, and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset, that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue, is introduced. The proposed method is thoroughly tested and benchmarked with respect to several state-of-the-art diarization algorithms. 
650 4 |a Journal Article 
650 4 |a Research Support, Non-U.S. Gov't 
700 1 |a Ba, Sileye  |e verfasserin  |4 aut 
700 1 |a Li, Xiaofei  |e verfasserin  |4 aut 
700 1 |a Horaud, Radu  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 40(2018), 5 vom: 19. Mai, Seite 1086-1099  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:40  |g year:2018  |g number:5  |g day:19  |g month:05  |g pages:1086-1099 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2017.2648793  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 40  |j 2018  |e 5  |b 19  |c 05  |h 1086-1099