Partially supervised speaker clustering

Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we...

Bibliographic details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - 34(2012), 5, 13 May, pages 959-71
Main author: Tang, Hao (author)
Other authors: Chu, Stephen Mingyu; Hasegawa-Johnson, Mark; Huang, Thomas S
Format: Online article
Language: English
Published: 2012
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article; Research Support, U.S. Gov't, Non-P.H.S.
LEADER 01000naa a22002652 4500
001 NLM210780304
003 DE-627
005 20231224012305.0
007 cr uuu---uuuuu
008 231224s2012 xx |||||o 00| ||eng c
024 7 |a 10.1109/TPAMI.2011.174  |2 doi 
028 5 2 |a pubmed24n0703.xml 
035 |a (DE-627)NLM210780304 
035 |a (NLM)21844626 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Tang, Hao  |e verfasserin  |4 aut 
245 1 0 |a Partially supervised speaker clustering 
264 1 |c 2012 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 04.09.2012 
500 |a Date Revised 29.06.2012 
500 |a published: Print 
500 |a Citation Status MEDLINE 
520 |a Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm—linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. 
Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical model-based distance metrics, 2) our advocated use of the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly used euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to state-of-the-art speaker clustering performance 
650 4 |a Journal Article 
650 4 |a Research Support, U.S. Gov't, Non-P.H.S. 
700 1 |a Chu, Stephen Mingyu  |e verfasserin  |4 aut 
700 1 |a Hasegawa-Johnson, Mark  |e verfasserin  |4 aut 
700 1 |a Huang, Thomas S  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1979  |g 34(2012), 5 vom: 13. Mai, Seite 959-71  |w (DE-627)NLM098212257  |x 1939-3539  |7 nnns 
773 1 8 |g volume:34  |g year:2012  |g number:5  |g day:13  |g month:05  |g pages:959-71 
856 4 0 |u http://dx.doi.org/10.1109/TPAMI.2011.174  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 34  |j 2012  |e 5  |b 13  |c 05  |h 959-71