Limitations of Clustering with PCA and Correlated Noise

It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern stu...

Bibliographic details
Published in: Journal of statistical computation and simulation. - 1999. - vol. 94 (2024), no. 10, day 16, pp. 2291-2319
Main author: Lippitt, William (author)
Other authors: Carlson, Nichole E; Arbet, Jaron; Fingerlin, Tasha E; Maier, Lisa A; Kechris, Katerina
Format: Online article
Language: English
Published: 2024
Access to the parent work: Journal of statistical computation and simulation
Subjects: Journal Article; Correlation; Gaussian mixture models; PCA; Unsupervised filtering; Variance as relevance
LEADER 01000caa a22002652 4500
001 NLM376632933
003 DE-627
005 20240825232824.0
007 cr uuu---uuuuu
008 240823s2024 xx |||||o 00| ||eng c
024 7 |a 10.1080/00949655.2024.2329976  |2 doi 
028 5 2 |a pubmed24n1512.xml 
035 |a (DE-627)NLM376632933 
035 |a (NLM)39176071 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Lippitt, William  |e verfasserin  |4 aut 
245 1 0 |a Limitations of Clustering with PCA and Correlated Noise 
264 1 |c 2024 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 25.08.2024 
500 |a published: Print-Electronic 
500 |a Citation Status PubMed-not-MEDLINE 
520 |a It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern studies features are often highly correlated and noisy (e.g. SNPs, -omics, quantitative imaging markers, and electronic health record data). The practical performance of clustering approaches in these settings remains unclear. Through extensive simulations and empirical examples applying Gaussian Mixture Models and related clustering methods, we show these approaches (including variants of kmeans, VarSelLCM, HDClassifier, and Fisher-EM) can have very poor performance in many settings. We also show the poor performance is often driven by either an explicit or implicit assumption by the clustering algorithm that high variance features are relevant while lower variance features are irrelevant, called the variance as relevance assumption. We develop practical pre-processing approaches that improve analysis performance in some cases. This work offers practical guidance on the strengths and limitations of unsupervised clustering approaches in modern data analysis applications. 
650 4 |a Journal Article 
650 4 |a Correlation 
650 4 |a Gaussian mixture models 
650 4 |a PCA 
650 4 |a Unsupervised filtering 
650 4 |a Variance as relevance 
700 1 |a Carlson, Nichole E  |e verfasserin  |4 aut 
700 1 |a Arbet, Jaron  |e verfasserin  |4 aut 
700 1 |a Fingerlin, Tasha E  |e verfasserin  |4 aut 
700 1 |a Maier, Lisa A  |e verfasserin  |4 aut 
700 1 |a Kechris, Katerina  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t Journal of statistical computation and simulation  |d 1999  |g 94(2024), 10 vom: 16., Seite 2291-2319  |w (DE-627)NLM098160486  |x 0094-9655  |7 nnns 
773 1 8 |g volume:94  |g year:2024  |g number:10  |g day:16  |g pages:2291-2319 
856 4 0 |u http://dx.doi.org/10.1080/00949655.2024.2329976  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 94  |j 2024  |e 10  |b 16  |h 2291-2319