Who's your data? Primary immune deficiency differential diagnosis prediction via machine learning and data mining of the USIDNET registry

Copyright © 2023 Elsevier Inc. All rights reserved.

Bibliographic details
Published in: Clinical immunology (Orlando, Fla.). - 1999. - 255(2023), 7 Oct., page 109759
Main author: Méndez Barrera, Jose Alfredo (author)
Other authors: Rocha Guzmán, Samuel; Hierro Cascajares, Elisa; Garabedian, Elizabeth K; Fuleihan, Ramsay L; Sullivan, Kathleen E; Lugo Reyes, Saul O
Format: Online article
Language: English
Published: 2023
Access to the parent work: Clinical immunology (Orlando, Fla.)
Subjects: Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't; Classification; Data mining; Diagnosis prediction; Extreme gradient boosting; Inborn errors of immunity; Machine learning; Primary immune deficiencies; Rare diseases; Registry
LEADER 01000caa a22002652 4500
001 NLM361763247
003 DE-627
005 20240711231925.0
007 cr uuu---uuuuu
008 231226s2023 xx |||||o 00| ||eng c
024 7 |a 10.1016/j.clim.2023.109759  |2 doi 
028 5 2 |a pubmed24n1467.xml 
035 |a (DE-627)NLM361763247 
035 |a (NLM)37678719 
035 |a (PII)S1521-6616(23)00522-3 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Méndez Barrera, Jose Alfredo  |e verfasserin  |4 aut 
245 1 0 |a Who's your data? Primary immune deficiency differential diagnosis prediction via machine learning and data mining of the USIDNET registry 
264 1 |c 2023 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Completed 02.10.2023 
500 |a Date Revised 11.07.2024 
500 |a published: Print-Electronic 
500 |a Citation Status MEDLINE 
520 |a Copyright © 2023 Elsevier Inc. All rights reserved. 
520 |a PURPOSE: There are currently more than 480 primary immune deficiency (PID) diseases and about 7000 rare diseases that together afflict around 1 in every 17 humans. Computational aids based on data mining and machine learning might facilitate the diagnostic task by extracting rules from large datasets and making predictions when faced with new problem cases. In a proof-of-concept data mining study, we aimed to predict PID diagnoses using a supervised machine learning algorithm based on classification tree boosting. 
520 |a METHODS: Through a data query of the USIDNET registry, we obtained a database of 2396 patients with common diagnoses of PID, including their clinical and laboratory features. We kept 286 features and all 12 diagnoses for inclusion in the model. We used the XGBoost package with parallel tree boosting for the supervised classification model, and SHAP for variable importance interpretation, in Python v3.7. The patient database was split into training and testing subsets; after boosting through gradient descent, the predictive model provided measures of diagnostic prediction accuracy and individual feature importance. After a baseline performance test, we used the class-weighting hyperparameter, scale_pos_weight, to correct for class imbalance. 
520 |a RESULTS: The twelve PID diagnoses were CVID (1098 patients), DiGeorge syndrome, Chronic granulomatous disease, Congenital agammaglobulinemia, PID not otherwise classified, Specific antibody deficiency, Complement deficiency, Hyper-IgM, Leukocyte adhesion deficiency, ectodermal dysplasia with immune deficiency, Severe combined immune deficiency, and Wiskott-Aldrich syndrome. For CVID, the model achieved an accuracy of 0.80 on the training subset, with an area under the ROC curve (AUC) of 0.80 and a Gini coefficient of 0.60. In the test subset, accuracy was 0.76, AUC 0.75, and Gini 0.51. The positive feature value to predict CVID was highest for upper respiratory infections, asthma, autoimmunity, and hypogammaglobulinemia. Features with the highest negative predictive value were high IgE, growth delay, abscess, lymphopenia, and congenital heart disease. For the rest of the diagnoses, accuracy stayed between 0.75 and 0.99, AUC 0.46-0.87, Gini 0.07-0.75, and LogLoss 0.09-8.55. 
520 |a DISCUSSION: Clinicians should remember to consider the negative predictive features together with the positive ones. We consider this a proof-of-concept study on which to base further exploration. The good performance is encouraging, and feature importance might aid feature selection in future work. In the meantime, we can learn from the rules derived by the model and build a user-friendly decision tree to generate differential diagnoses. 
650 4 |a Journal Article 
650 4 |a Research Support, N.I.H., Extramural 
650 4 |a Research Support, Non-U.S. Gov't 
650 4 |a Classification 
650 4 |a Data mining 
650 4 |a Diagnosis prediction 
650 4 |a Extreme gradient boosting 
650 4 |a Inborn errors of immunity 
650 4 |a Machine learning 
650 4 |a Primary immune deficiencies 
650 4 |a Rare diseases 
650 4 |a Registry 
700 1 |a Rocha Guzmán, Samuel  |e verfasserin  |4 aut 
700 1 |a Hierro Cascajares, Elisa  |e verfasserin  |4 aut 
700 1 |a Garabedian, Elizabeth K  |e verfasserin  |4 aut 
700 1 |a Fuleihan, Ramsay L  |e verfasserin  |4 aut 
700 1 |a Sullivan, Kathleen E  |e verfasserin  |4 aut 
700 1 |a Lugo Reyes, Saul O  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t Clinical immunology (Orlando, Fla.)  |d 1999  |g 255(2023) vom: 07. Okt., Seite 109759  |w (DE-627)NLM098196855  |x 1521-7035  |7 nnns 
773 1 8 |g volume:255  |g year:2023  |g day:07  |g month:10  |g pages:109759 
856 4 0 |u http://dx.doi.org/10.1016/j.clim.2023.109759  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_11 
912 |a GBV_ILN_24 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 255  |j 2023  |b 07  |c 10  |h 109759
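
Illustrative note on the abstract above: the METHODS field mentions scale_pos_weight, and the RESULTS field reports AUC alongside a Gini coefficient. The sketch below shows how those quantities are commonly computed; it is not the authors' code. It assumes a one-vs-rest (binary) framing for each diagnosis, uses the negative-to-positive case ratio that the XGBoost documentation suggests as a typical starting value for scale_pos_weight, and uses the standard relation Gini = 2 * AUC - 1, which matches the reported training values for CVID (AUC 0.80, Gini 0.60). The function names and the toy scores are hypothetical; only the patient counts (2396 total, 1098 CVID) come from the record.

```python
from typing import Sequence


def scale_pos_weight(n_positive: int, n_negative: int) -> float:
    """Negative-to-positive case ratio, a common starting value for
    XGBoost's scale_pos_weight in a binary problem (assumed here)."""
    if n_positive <= 0:
        raise ValueError("need at least one positive case")
    return n_negative / n_positive


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case is
    scored above a randomly chosen negative case (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc: float) -> float:
    """Standard rescaling of AUC: 0 for a random model, 1 for a perfect one."""
    return 2.0 * auc - 1.0


# Counts taken from the abstract: 2396 patients, 1098 of them with CVID.
n_total, n_cvid = 2396, 1098
cvid_weight = scale_pos_weight(n_cvid, n_total - n_cvid)

# Toy labels and scores (hypothetical, not registry data).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = roc_auc(toy_labels, toy_scores)

print(f"scale_pos_weight for CVID vs. rest: {cvid_weight:.3f}")
print(f"toy AUC: {toy_auc:.2f}, toy Gini: {gini_from_auc(toy_auc):.2f}")
print(f"Gini implied by AUC 0.80: {gini_from_auc(0.80):.2f}")
```

With a real model, a weight like this would typically be passed as the scale_pos_weight parameter of an XGBoost binary classifier. Whether the authors used exactly this ratio is not stated in the abstract.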