Abstract
This research addresses the prediction of cerebral stroke from imbalanced data, motivated by the critical need for early detection through statistical methods. The study identifies stroke risk factors, develops and evaluates precision-oriented classification models (logistic regression and other machine learning classifiers), and manages the class imbalance in the data. Using the Kaggle Cerebral Stroke dataset, which has 12 attributes and an imbalanced target variable, the investigation examines predictors such as gender, age, hypertension, heart disease, marital status, work type, residence type, glucose level, BMI, and smoking status. Previous studies on stroke prediction using Naïve Bayes, decision trees, and neural networks are reviewed. The research identifies key risk determinants, applies six data balancing techniques (ROSE, SMOTE, ADASYN, SVM-SMOTE, SMOTEENN, SMOTETomek), and evaluates six classification models (logistic regression, decision tree, support vector machine, k-nearest neighbor (KNN), random forest, Naïve Bayes). Notably, combining ADASYN with KNN significantly enhances cerebral stroke prediction accuracy. By mitigating the challenges of imbalanced data, this study advances early stroke prediction and has the potential to improve interventions and support timely medical responses.
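To make the ADASYN and KNN combination concrete, the following is a minimal pure-Python sketch of the idea, not the implementation used in the study. It assumes a two-feature toy dataset invented purely for illustration, and the function names (`nearest`, `adasyn`, `knn_predict`) and parameters (`k`, `beta`, `seed`) are hypothetical. ADASYN generates more synthetic minority samples around minority points whose neighborhoods are dominated by the majority class; KNN then classifies by majority vote among the nearest training samples.

```python
import math
import random

def nearest(point, pool, k):
    """Return the k samples in pool closest to point (Euclidean), excluding point itself."""
    others = [p for p in pool if p is not point]
    return sorted(others, key=lambda p: math.dist(point[0], p[0]))[:k]

def adasyn(data, minority=1, k=3, beta=1.0, seed=0):
    """Simplified ADASYN: oversample the minority class adaptively.

    data is a list of (features, label) tuples. beta=1.0 aims for a fully
    balanced result; rounding may leave the classes slightly unequal.
    """
    rng = random.Random(seed)
    minority_pts = [p for p in data if p[1] == minority]
    n_majority = len(data) - len(minority_pts)
    total = round((n_majority - len(minority_pts)) * beta)  # samples to generate
    # For each minority point, the share of majority samples among its k neighbours.
    ratios = [sum(n[1] != minority for n in nearest(p, data, k)) / k
              for p in minority_pts]
    norm = sum(ratios)
    if total <= 0 or norm == 0 or len(minority_pts) < 2:
        return list(data)  # nothing to generate
    synthetic = []
    for p, r in zip(minority_pts, ratios):
        neighbours = nearest(p, minority_pts, k)  # minority neighbours only
        for _ in range(round(r / norm * total)):
            q = rng.choice(neighbours)
            lam = rng.random()
            # Interpolate between the minority point and one of its minority neighbours.
            feats = tuple(a + lam * (b - a) for a, b in zip(p[0], q[0]))
            synthetic.append((feats, minority))
    return list(data) + synthetic

def knn_predict(train, features, k=5):
    """Classify features by majority vote among the k nearest training samples."""
    votes = [label for _, label in nearest((features, None), train, k)]
    return max(set(votes), key=votes.count)

# Toy data (invented): a 6x6 grid of majority samples and 4 minority samples.
majority = [((float(x), float(y)), 0) for x in range(6) for y in range(6)]
minority = [((4.5, 4.5), 1), ((5.5, 5.5), 1), ((5.5, 4.5), 1), ((4.5, 5.5), 1)]
data = majority + minority

balanced = adasyn(data, minority=1, k=3)
n_minority_after = sum(label == 1 for _, label in balanced)
prediction = knn_predict(balanced, (5.25, 5.25), k=5)
print(len(data), len(balanced), n_minority_after, prediction)
```

In practice, library implementations of both steps (for example in imbalanced-learn and scikit-learn) would normally be used, with resampling applied to the training split only so that synthetic samples do not leak into the evaluation data.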