Leveraging machine learning for diabetes prediction: Ensemble model.

Otieno Ogutu, McDonald; Nzioka Kituku, Benson; Karume, Simon M.

dc.contributor.author	Otieno Ogutu, McDonald
dc.contributor.author	Nzioka Kituku, Benson
dc.contributor.author	Karume, Simon M.
dc.date.accessioned	2026-01-13T09:12:28Z
dc.date.available	2026-01-13T09:12:28Z
dc.date.issued	2025-10-07
dc.identifier.issn	eISSN:2582-5003
dc.identifier.uri	https://doi.org/10.30574/gjeta.2025.25.1.0267
dc.identifier.uri	https://repository.cuk.ac.ke/handle/123456789/1866
dc.description	A research article published in the Global Journal of Engineering and Technology Advances.	en_US
dc.description.abstract	Diabetes presents a major global health challenge, with delayed diagnosis significantly impeding effective management, particularly in resource-constrained regions. This project aimed to enhance timely and accurate diabetes prediction by developing an ensemble machine learning model. A hybrid dataset, compiled from the PIMA Indian (768 instances) and Hospital Frankfurt Germany (2000 instances) datasets and totaling 2768 data points, was used to improve generalizability beyond single-source limitations. The methodology involved comprehensive data preprocessing, including imputation of physiologically impossible zero values and feature standardization. F1-score was selected as the primary performance metric because it balances precision and recall, which is crucial in a medical context where both false positives and false negatives carry significant consequences. Six single-classifier models (Logistic Regression, Decision Tree, K-Nearest Neighbors, Support Vector Machine, Random Forest, and XGBoost) were trained on the data and evaluated after hyperparameter tuning. The F1-scores of these optimized models were: Logistic Regression (0.6328), Decision Tree (0.9843), K-Nearest Neighbors (0.9869), Support Vector Machine (0.9843), Random Forest (0.9947), and XGBoost (0.9974). Based on these results, XGBoost and Random Forest were selected as base learners for a Stacking Classifier ensemble with a Logistic Regression meta-learner. The ensemble model achieved a near-perfect ROC-AUC of 0.9999 and an F1-score of 0.9974. This performance surpassed results reported in recent studies and highlighted the potential of machine learning to predict diabetes accurately. The project recommended further development of the ensemble model and its integration into a web application.	en_US
dc.language.iso	en	en_US
dc.publisher	Global Journal of Engineering and Technology Advances.	en_US
dc.subject	Machine learning.	en_US
dc.subject	Support vector machine.	en_US
dc.subject	Gradient boosting.	en_US
dc.subject	Random Forest.	en_US
dc.subject	Decision Tree.	en_US
dc.title	Leveraging machine learning for diabetes prediction: Ensemble model.	en_US
dc.type	Article	en_US
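
The preprocessing steps and the F1-score named in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' code: the abstract does not say which imputation strategy was used, so median imputation is an assumption here, and the function names (impute_zeros, standardize, f1_score) and sample values are hypothetical.

```python
from statistics import mean, median, pstdev


def impute_zeros(values):
    """Replace zeros with the median of the non-zero values.

    Median imputation is an assumed strategy; the abstract only says
    that physiologically impossible zeros were imputed.
    """
    nonzero = [v for v in values if v != 0]
    if not nonzero:  # nothing to estimate a fill value from
        return list(values)
    fill = median(nonzero)
    return [fill if v == 0 else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant feature: only centering is possible
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:  # precision and recall are both zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical glucose readings where 0 marks a missing measurement.
glucose = impute_zeros([0, 100, 120, 0, 140])
scaled = standardize(glucose)

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

The sketch covers only preprocessing and the evaluation metric. The stacking step described in the abstract (XGBoost and Random Forest base learners with a Logistic Regression meta-learner) would sit between the two and requires a machine learning library.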