LEADER |
01000naa a22002652 4500 |
001 |
NLM26815161X |
003 |
DE-627 |
005 |
20231224222122.0 |
007 |
cr uuu---uuuuu |
008 |
231224s2018 xx |||||o 00| ||eng c |
024 |
7 |
|
|a 10.1109/TPAMI.2017.2648793
|2 doi
|
028 |
5 |
2 |
|a pubmed24n0893.xml
|
035 |
|
|
|a (DE-627)NLM26815161X
|
035 |
|
|
|a (NLM)28103192
|
040 |
|
|
|a DE-627
|b ger
|c DE-627
|e rakwb
|
041 |
|
|
|a eng
|
100 |
1 |
|
|a Gebru, Israel D
|e verfasserin
|4 aut
|
245 |
1 |
0 |
|a Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
|
264 |
|
1 |
|c 2018
|
336 |
|
|
|a Text
|b txt
|2 rdacontent
|
337 |
|
|
|a Computermedien
|b c
|2 rdamedia
|
338 |
|
|
|a Online-Ressource
|b cr
|2 rdacarrier
|
500 |
|
|
|a Date Completed 19.03.2019
|
500 |
|
|
|a Date Revised 19.03.2019
|
500 |
|
|
|a published: Print-Electronic
|
500 |
|
|
|a Citation Status PubMed-not-MEDLINE
|
520 |
|
|
|a Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios in which several participants engaged in multi-party interaction move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it processes, in a principled way, speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process executed at each time slice and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset is introduced that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue. The proposed method is thoroughly tested and benchmarked against several state-of-the-art diarization algorithms.
|
650 |
|
4 |
|a Journal Article
|
650 |
|
4 |
|a Research Support, Non-U.S. Gov't
|
700 |
1 |
|
|a Ba, Sileye
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Li, Xiaofei
|e verfasserin
|4 aut
|
700 |
1 |
|
|a Horaud, Radu
|e verfasserin
|4 aut
|
773 |
0 |
8 |
|i Enthalten in
|t IEEE transactions on pattern analysis and machine intelligence
|d 1979
|g 40(2018), 5 vom: 19. Mai, Seite 1086-1099
|w (DE-627)NLM098212257
|x 1939-3539
|7 nnns
|
773 |
1 |
8 |
|g volume:40
|g year:2018
|g number:5
|g day:19
|g month:05
|g pages:1086-1099
|
856 |
4 |
0 |
|u http://dx.doi.org/10.1109/TPAMI.2017.2648793
|3 Volltext
|
912 |
|
|
|a GBV_USEFLAG_A
|
912 |
|
|
|a SYSFLAG_A
|
912 |
|
|
|a GBV_NLM
|
912 |
|
|
|a GBV_ILN_350
|
951 |
|
|
|a AR
|
952 |
|
|
|d 40
|j 2018
|e 5
|b 19
|c 05
|h 1086-1099
|