Leveraging machine learning for diabetes prediction: Ensemble model

Otieno Ogutu, McDonald

dc.contributor.author	Otieno Ogutu, McDonald
dc.date.accessioned	2026-07-02T09:54:24Z
dc.date.available	2026-07-02T09:54:24Z
dc.date.issued	2025
dc.identifier.uri	https://repository.cuk.ac.ke/handle/123456789/1964
dc.description	A research project submitted to the Department of Computer Science and Information Technology in the School of Computing and Mathematics in partial fulfillment of the requirements for the award of the degree of Master of Science in Data Science of the Cooperative University of Kenya	en_US
dc.description.abstract	Diabetes presents a major global health challenge, with delayed diagnosis significantly impeding effective management, particularly in resource-constrained regions. The critical shortage of medical professionals in regions like Kenya, where the doctor-to-population ratio is far below the WHO standard, severely hampers timely screening and diagnosis of diabetes. This deficit necessitates innovative, scalable tools, such as machine learning models, to assist in early prediction and intervention. This research project aimed to enhance timely and accurate diabetes prediction by developing an advanced ensemble machine learning model. A hybrid dataset, compiled from the PIMA Indian (762 instances) and Hospital Frankfurt Germany (2000 instances) datasets, totaling 2762 data points, was used to improve generalizability beyond single-source limitations. The research employed a quantitative design involving comprehensive data preprocessing, including the imputation of physiologically impossible zero values and feature standardization. After assessing multicollinearity, all independent variables were retained. Six machine learning algorithms (Logistic Regression, Decision Tree, K-Nearest Neighbors, Support Vector Machine, Random Forest, and XGBoost) were evaluated, each undergoing hyperparameter tuning to optimize its performance. XGBoost and Random Forest achieved the highest F1-scores (0.9974 and 0.9947 respectively) among individual classifiers. These two top-performing models were then selected as base learners for a StackingClassifier ensemble, which used a Logistic Regression meta-learner. The developed ensemble model demonstrated strong predictive performance, achieving an F1-score of 0.9974 and a near-perfect ROC-AUC of 0.9999. This performance matched XGBoost's F1-score and marginally surpassed its ROC-AUC.
Implemented in Python, this research underscores the significant potential of advanced ensemble machine learning to deliver highly accurate and robust diagnostic solutions, thereby contributing to earlier diabetes detection and improved health outcomes, particularly in underserved healthcare environments.	en_US
dc.language.iso	en	en_US
dc.publisher	Cuk	en_US
dc.title	Leveraging machine learning for diabetes prediction: Ensemble model	en_US
dc.type	Thesis	en_US
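
The preprocessing step named in the abstract (imputing physiologically impossible zero values, then standardizing features) can be illustrated with a minimal sketch. This is not the thesis code: the abstract does not say which imputation statistic was used, so the median here is an assumption, and the `glucose` readings and the helper names `impute_zeros` and `standardize` are made up for illustration.

```python
from statistics import mean, median, pstdev


def impute_zeros(values):
    """Replace zeros with the median of the non-zero entries.

    Assumption: the median is the fill value; the abstract does not
    specify the statistic actually used in the thesis.
    """
    observed = [v for v in values if v != 0]
    if not observed:
        # Nothing to estimate a fill value from; return a copy unchanged.
        return list(values)
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up glucose readings; a 0 stands for a missing measurement,
# since a glucose level of zero is not physiologically possible.
glucose = [148, 85, 0, 89, 137, 0, 78]

imputed = impute_zeros(glucose)
scaled = standardize(imputed)

print(imputed)
print([round(z, 3) for z in scaled])
```

In a real pipeline the fill value, mean, and standard deviation would normally be computed on the training split only and then reused on the test split, so that no information leaks from the evaluation data.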