Performance Evaluation of Machine Learning for Heart Disease Classification Using Data Balancing Techniques

Eni Rohaini; Gunardi Gunardi; Nurhayati Nurhayati; Jasmir Jasmir; Zahra Prisdian Tiararosa

doi:10.61132/prosemnasproit.v2i2.59

Authors

Eni Rohaini, Universitas Dinamika Bangsa
Gunardi Gunardi, Universitas Dinamika Bangsa
Nurhayati Nurhayati, Universitas Dinamika Bangsa
Jasmir Jasmir, Universitas Dinamika Bangsa
Zahra Prisdian Tiararosa, Universitas Dinamika Bangsa

Keywords:

Heart Disease, Machine Learning, Oversampling, Random Oversampling, SMOTE

Abstract

Imbalanced data remains a significant issue in heart disease classification using machine learning, because it tends to bias models toward the majority class while they neglect minority classes that carry high clinical importance. This can reduce accuracy and weaken the model's ability to detect disease cases. This study therefore assesses the effectiveness of two oversampling techniques, Random Oversampling and the Synthetic Minority Oversampling Technique (SMOTE), in improving the performance of the K-Nearest Neighbors (KNN), Naive Bayes (NB), and Random Forest (RF) algorithms. The dataset comes from Kaggle and consists of 918 records with 12 attributes representing patient information related to heart disease prediction. The research stages are data preprocessing, baseline model testing, and re-evaluation using the two oversampling methods. Experimental results show that oversampling can improve the performance of all three algorithms. KNN achieved its best results with SMOTE, with an accuracy of 72.98% and an F1-score of 75.39%. For Naive Bayes, both oversampling techniques produced relatively stable performance, with the highest F1-score of 73.56% obtained using SMOTE. Random Forest performed best when combined with Random Oversampling, with an accuracy of 79.19% and an F1-score of 81.51%. These findings indicate that the success of a data balancing technique depends strongly on the characteristics of the classification algorithm used, and they offer practical guidance for choosing strategies to handle imbalanced data in health research.
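To make the two balancing techniques and the two reported metrics concrete, the following is a minimal sketch in pure Python. It is not the authors' code, and the abstract does not say which library or settings were used. The function names (`random_oversample`, `smote_like`, `accuracy_and_f1`), the neighbour count `k`, and the toy data are all assumptions introduced for illustration; the toy data is not taken from the Kaggle dataset. Random oversampling duplicates existing minority rows, while the SMOTE-style function creates new rows by interpolating between a minority row and one of its nearest same-class neighbours.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(x) for x in X], list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(list(rng.choice(rows)))  # exact copy of an existing row
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, k=3, seed=0):
    """SMOTE-style sketch: synthesize rows on the segment between a minority row
    and one of its k nearest neighbours from the same class (k is an assumed setting)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(x) for x in X], list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        if n == target or len(rows) < 2:
            continue  # nothing to add, or no neighbour to interpolate with
        for _ in range(target - n):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            # Sort same-class rows by squared Euclidean distance to the base row.
            others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
            mate = rng.choice(others[:k])
            gap = rng.random()  # interpolation weight in [0, 1)
            X_out.append([a + gap * (b - a) for a, b in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for a binary problem, with F1 computed on the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, f1


# Toy, made-up data: six majority rows (label 0) and two minority rows (label 1).
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
         [0.2, 0.4], [0.4, 0.2], [1.0, 1.0], [1.2, 0.9]]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]

X_ros, y_ros = random_oversample(X_toy, y_toy)
X_smote, y_smote = smote_like(X_toy, y_toy)
print("class counts after random oversampling:", Counter(y_ros))
print("class counts after SMOTE-style oversampling:", Counter(y_smote))
```

In a study of this kind, oversampling is normally applied to the training split only, so that duplicated or synthetic rows do not leak into the data used for evaluation; whether the authors did so is not stated in the abstract.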
