TY - JOUR
T1 - Novel Insights on Establishing Machine Learning-Based Stroke Prediction Models Among Hypertensive Adults
AU - Huang, Xiao
AU - Cao, Tianyu
AU - Chen, Liangziqian
AU - Li, Junpei
AU - Tan, Ziheng
AU - Xu, Benjamin
AU - Xu, Richard
AU - Song, Yun
AU - Zhou, Ziyi
AU - Wang, Zhuo
AU - Wei, Yaping
AU - Zhang, Yan
AU - Li, Jianping
AU - Huo, Yong
AU - Qin, Xianhui
AU - Wu, Yanqing
AU - Wang, Xiaobin
AU - Wang, Hong
AU - Cheng, Xiaoshu
AU - Xu, Xiping
AU - Liu, Lishun
N1 - Copyright © 2022 Huang, Cao, Chen, Li, Tan, Xu, Xu, Song, Zhou, Wang, Wei, Zhang, Li, Huo, Qin, Wu, Wang, Wang, Cheng, Xu and Liu.
PY - 2022/5/6
Y1 - 2022/5/6
N2 - Background: Stroke is a major global health burden, and risk prediction is essential for the primary prevention of stroke. However, uncertainty remains about the optimal prediction model for analyzing stroke risk. In this study, we aim to determine the most effective stroke prediction method in a Chinese hypertensive population using machine learning and establish a general methodological pipeline for future analysis. Methods: The training set included 70% of data (n = 14,491) from the China Stroke Primary Prevention Trial (CSPPT). Internal validation was performed with the remaining 30% of CSPPT data (n = 6,211), and external validation was conducted using a nested case–control (NCC) dataset (n = 2,568). The primary outcome was the first stroke. Four established analysis methods were applied and compared: logistic regression (LR), stepwise logistic regression (SLR), extreme gradient boosting (XGBoost), and random forest (RF). Population characteristic data were analyzed separately with and without laboratory variables. Accuracy, sensitivity, specificity, kappa, and area under the receiver operating characteristic curve (AUC) were used to assess the models, with AUC as the primary criterion. Data balancing techniques, including random under-sampling (RUS) and synthetic minority over-sampling technique (SMOTE), were applied to the imbalanced training set. Results: The best model performance was observed in the RUS-applied RF model with laboratory variables. Compared with null models (sensitivity = 0, specificity = 100, and mean AUC = 0.643), data balancing techniques improved overall performance, with RUS demonstrating the more satisfactory effect in the current study (RUS: sensitivity = 63.9; specificity = 53.7; and mean AUC = 0.624). Adding laboratory variables improved the performance of the analysis methods. All results were reconfirmed in the validation sets. The top 10 important variables were determined by the analysis method with the best performance. Conclusion: Among the tested methods, the most effective stroke prediction model in the targeted population is the RUS-applied RF. Based on the insights the current study revealed, we provide general frameworks for building machine learning-based prediction models.
AB - Background: Stroke is a major global health burden, and risk prediction is essential for the primary prevention of stroke. However, uncertainty remains about the optimal prediction model for analyzing stroke risk. In this study, we aim to determine the most effective stroke prediction method in a Chinese hypertensive population using machine learning and establish a general methodological pipeline for future analysis. Methods: The training set included 70% of data (n = 14,491) from the China Stroke Primary Prevention Trial (CSPPT). Internal validation was performed with the remaining 30% of CSPPT data (n = 6,211), and external validation was conducted using a nested case–control (NCC) dataset (n = 2,568). The primary outcome was the first stroke. Four established analysis methods were applied and compared: logistic regression (LR), stepwise logistic regression (SLR), extreme gradient boosting (XGBoost), and random forest (RF). Population characteristic data were analyzed separately with and without laboratory variables. Accuracy, sensitivity, specificity, kappa, and area under the receiver operating characteristic curve (AUC) were used to assess the models, with AUC as the primary criterion. Data balancing techniques, including random under-sampling (RUS) and synthetic minority over-sampling technique (SMOTE), were applied to the imbalanced training set. Results: The best model performance was observed in the RUS-applied RF model with laboratory variables. Compared with null models (sensitivity = 0, specificity = 100, and mean AUC = 0.643), data balancing techniques improved overall performance, with RUS demonstrating the more satisfactory effect in the current study (RUS: sensitivity = 63.9; specificity = 53.7; and mean AUC = 0.624). Adding laboratory variables improved the performance of the analysis methods. All results were reconfirmed in the validation sets. The top 10 important variables were determined by the analysis method with the best performance. Conclusion: Among the tested methods, the most effective stroke prediction model in the targeted population is the RUS-applied RF. Based on the insights the current study revealed, we provide general frameworks for building machine learning-based prediction models.
KW - XGBoost
KW - machine learning
KW - primary prevention
KW - risk assessment
KW - stroke
UR - http://www.scopus.com/inward/record.url?scp=85135097271&partnerID=8YFLogxK
U2 - 10.3389/fcvm.2022.901240
DO - 10.3389/fcvm.2022.901240
M3 - Article
C2 - 35600480
AN - SCOPUS:85135097271
SN - 2297-055X
VL - 9
JO - Frontiers in Cardiovascular Medicine
JF - Frontiers in Cardiovascular Medicine
M1 - 901240
ER -