Improving Model Performance in Predicting Heart Disease with the NCL Algorithm and GridSearchCV

Zulfan Ahmadi; Asrul Abdullah; Izhan Fakhruzi

doi:10.30865/mib.v7i3.6142

Authors

Zulfan Ahmadi, Universitas Muhammadiyah Pontianak, Pontianak
Asrul Abdullah, Universitas Muhammadiyah Pontianak, Pontianak
Izhan Fakhruzi, Universitas Muhammadiyah Pontianak, Pontianak

Keywords:

Heart Disease, GridSearchCV, Weighted Logistic Regression, Neighborhood Cleaning Rule, Machine Learning

Abstract

Heart disease is the leading cause of death worldwide. Reducing this mortality requires accurate prediction, so that people at risk can be warned early and the condition can be prevented or managed. This study aims to improve the ability of a machine learning classifier, Logistic Regression (LR), to predict heart disease, so that prediction errors that could harm patients are substantially reduced. The work combines two approaches: data preparation and model optimization. During data preparation, the dataset was found to be imbalanced between people with and without heart disease. The Neighborhood Cleaning Rule (NCL) algorithm was used to address this imbalance, and it had a significant impact on the performance of the prediction model. During model optimization, GridSearchCV was used to find the best combination of hyperparameters for the LR algorithm. The study also implemented Weighted Logistic Regression, which allows class weights to be set and further contributed to model performance. Evaluation with Accuracy, Recall, and Area Under the Curve (AUC) shows an improvement: recall increased from 0.10 to 0.93, and AUC increased from 0.83 to 0.98. The dataset was obtained from Kaggle and originates from the Centers for Disease Control and Prevention (CDC). With a better ability to identify heart disease, the model is expected to provide accurate early warning to individuals at risk, thereby helping to reduce mortality from heart disease.
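The three steps named in the abstract (NCL cleaning, class weighting, and a GridSearchCV search over LR hyperparameters) can be illustrated with a minimal sketch. It is not the authors' code. The data are synthetic rather than the CDC dataset, the helper `ncl_clean` is a hypothetical, simplified binary version of the Neighborhood Cleaning Rule written with scikit-learn only, and the parameter grid and the choice of recall as the search metric are assumptions. The imbalanced-learn library provides a full implementation as `NeighbourhoodCleaningRule`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import NearestNeighbors


def ncl_clean(X, y, minority=1, k=3):
    """Simplified binary NCL (hypothetical helper): removes majority samples only."""
    # Ask for k + 1 neighbours because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]
    is_min = y == minority
    # k-NN vote: is the majority of the k neighbours in the minority class?
    vote_min = is_min[idx].sum(axis=1) > k / 2
    misclassified = vote_min != is_min
    drop = np.zeros(len(y), dtype=bool)
    # Step 1: majority samples whose neighbours disagree with their label.
    drop[~is_min & misclassified] = True
    # Step 2: majority neighbours of minority samples that were misclassified.
    nbrs = np.unique(idx[is_min & misclassified])
    drop[nbrs[~is_min[nbrs]]] = True
    return X[~drop], y[~drop]


# Synthetic imbalanced data (about 10% positives); a stand-in, not the CDC data.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Clean only the training split so the test split stays untouched.
X_clean, y_clean = ncl_clean(X_train, y_train)

# Assumed grid: regularization strength and class weights (weighted LR).
param_grid = {"C": [0.01, 0.1, 1, 10],
              "class_weight": [None, "balanced", {0: 1, 1: 5}]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="recall", cv=5)
search.fit(X_clean, y_clean)

recall = recall_score(y_test, search.predict(X_test))
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(recall, 2), round(auc, 2))
```

Cleaning is applied to the training split only, so the reported recall and AUC reflect the original class distribution. The numbers printed by this sketch come from synthetic data and are not comparable to the scores reported in the abstract.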

Author Biography

Zulfan Ahmadi, Universitas Muhammadiyah Pontianak, Pontianak

Faculty of Engineering, Informatics Study Program, undergraduate (Strata-1) student
