Random Oversampling, Chi-Square, and AdaBoost for Handling Class Imbalance in C5.0 Classification

Tanti Tanti

Abstract


Classification is a core data mining task, and class imbalance is one of the problems it most often runs into. Class imbalance is a condition in which the class distribution of a dataset is uneven: the data split into a majority class and a minority class, with varying degrees of severity. Minority-class instances are often misclassified because the classifier tends to over-predict the majority class. This makes the classification process harder and leads to sub-optimal performance, with much higher accuracy on the majority class than on the minority class. This study applies Random Oversampling, Chi-Square, and AdaBoost to address class imbalance and improve the performance of the C5.0 classifier. When a dataset is imbalanced, performance assessment needs to focus on the positive class, so recall (also called sensitivity or true positive rate, TPR) is a more suitable metric for evaluating the results. The results show that applying Random Oversampling alone improved the recall of standard C5.0. Applying Chi-Square alone did not improve C5.0 performance, but performance did improve once Random Oversampling was applied as well. The combination of all three methods, Random Oversampling, Chi-Square, and AdaBoost, was also able to increase the recall of standard C5.0.
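The following is a minimal, illustrative sketch of two ideas from the abstract: why recall is more informative than accuracy on an imbalanced dataset, and what random oversampling does to the class counts. It is not the study's implementation. The toy data (90 negatives, 10 positives), the function names `random_oversample`, `recall`, and `accuracy`, and the "always predict the majority class" baseline are all assumptions made for illustration. Chi-Square feature selection, AdaBoost, and C5.0 itself are not shown.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows belonging to this class, sampled with replacement below.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def recall(y_true, y_pred, positive=1):
    """Recall / sensitivity / TPR = TP / (TP + FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced dataset (assumed): 90 majority (0) and 10 minority (1) rows.
labels = [0] * 90 + [1] * 10
rows = [[i] for i in range(100)]

# A degenerate baseline that always predicts the majority class.
always_majority = [0] * 100
print("accuracy:", accuracy(labels, always_majority))  # 0.9
print("recall:  ", recall(labels, always_majority))    # 0.0

# After oversampling, both classes have the same number of rows.
balanced_rows, balanced_labels = random_oversample(rows, labels)
print("class counts after oversampling:", dict(Counter(balanced_labels)))
```

The baseline reaches high accuracy while finding none of the positive cases, which is the reason the abstract prefers recall. In practice, oversampling would typically be applied to the training split only, so that duplicated rows do not leak into the evaluation data.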

Keywords


Class Imbalance; C5.0; Random Oversampling; Chi-Square; AdaBoost





DOI: https://doi.org/10.30865/mib.v7i2.5862



Copyright (c) 2023 JURNAL MEDIA INFORMATIKA BUDIDARMA

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.



JURNAL MEDIA INFORMATIKA BUDIDARMA
Universitas Budi Darma
Secretariat: Sisingamangaraja No. 338 Telp 061-7875998
Email: mib.stmikbd@gmail.com
