Abstract
Stroke remains one of the leading causes of death and long-term disability worldwide, and accurate early prediction is an important step towards reducing its impact. In this study, we used the publicly available Healthcare Stroke Dataset on Kaggle to evaluate how well a range of machine learning algorithms could predict stroke. We applied KNN imputation to fill in missing values, RobustScaler to standardise continuous features, and random oversampling of the minority class to produce a balanced training set. Thirteen classifiers were examined, including Random Forest, XGBoost, CatBoost, AdaBoost, k-Nearest Neighbours, Decision Tree, Naïve Bayes, Support Vector Machines, and Neural Networks. Because the dataset is highly imbalanced, the F1-score was chosen as the main performance metric. The results show that AdaBoost, after hyperparameter optimisation with GridSearchCV, outperformed all other classifiers on the test set with an F1-score of 94.63%, followed closely by CatBoost (93.73%) and XGBoost (93.29%). A feature-importance analysis showed that only eight variables, namely age, average glucose level, body mass index (BMI), hypertension, heart disease, smoking status, marital status, and work type, were needed to achieve strong results. The study demonstrates that ensemble learning, particularly optimised AdaBoost, is a robust and interpretable approach to stroke prediction in clinical settings.
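For illustration, the following is a minimal sketch of the pipeline described above, assuming scikit-learn and imbalanced-learn; the file name, encoding of categorical columns, and the AdaBoost hyperparameter grid are assumptions made for the example, not the authors' exact settings.

```python
# Sketch of the described pipeline: KNN imputation, robust scaling,
# random oversampling of the training set, and a GridSearchCV-tuned
# AdaBoost classifier evaluated with the F1-score.
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler

df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # Kaggle stroke dataset
X = pd.get_dummies(df.drop(columns=["id", "stroke"]))   # one-hot encode categoricals
y = df["stroke"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# KNN imputation for missing values (e.g. BMI), then robust scaling
imputer = KNNImputer(n_neighbors=5)
scaler = RobustScaler()
X_train = scaler.fit_transform(imputer.fit_transform(X_train))
X_test = scaler.transform(imputer.transform(X_test))

# Balance only the training set; the test set keeps its natural imbalance
X_train, y_train = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# Hyperparameter search for AdaBoost; this grid is a plausible example only
grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300, 500], "learning_rate": [0.1, 0.5, 1.0]},
    scoring="f1",
    cv=5,
)
grid.fit(X_train, y_train)
print("Test F1:", f1_score(y_test, grid.predict(X_test)))
```

Note that oversampling is applied only after the train/test split, so the reported F1-score is computed on unseen, imbalanced test data.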