Random Oversampling, Chi-Square, and AdaBoost for Handling Class Imbalance in C5.0 Classification

Tanti Tanti

doi:10.30865/mib.v7i2.5862

Authors

Tanti Tanti, Universitas Mikroskil, Medan

DOI:

https://doi.org/10.30865/mib.v7i2.5862

Keywords:

Class Imbalance, C5.0, Random Oversampling, Chi-Square, AdaBoost

Abstract

Classification is a core task in data mining, and class imbalance is one of the problems it most often runs into. Class imbalance occurs when the classes in a dataset are unevenly distributed, so that the data split into a majority class and a minority class with varying degrees of severity. Because classifiers tend to favor the majority class, minority-class instances are often misclassified. This makes classification harder and leads to sub-optimal performance: accuracy on the majority class ends up much higher than on the minority class. This study applies Random Oversampling, Chi-Square, and AdaBoost to address class imbalance and improve the performance of the C5.0 classifier. With imbalanced datasets, performance evaluation needs to focus on the positive class, so the more suitable metric is recall (also called sensitivity or true positive rate, TPR). The results show that Random Oversampling alone improved the recall of standard C5.0. Chi-Square alone did not improve the performance of C5.0, but performance did improve once Random Oversampling was applied as well. The combination of all three, Random Oversampling, Chi-Square, and AdaBoost, increased the recall of standard C5.0.
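
To make two ideas from the abstract concrete, the following is a minimal sketch of random oversampling and of recall for the positive class. It is not the authors' implementation: the function names `random_oversample` and `recall`, the toy data, and the class labels `"yes"` and `"no"` are all assumptions made for illustration, and the Chi-Square, AdaBoost, and C5.0 steps are left out.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def recall(y_true, y_pred, positive):
    """Recall (sensitivity, TPR) = TP / (TP + FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (made up for illustration): 8 majority rows, 2 minority rows.
rows = [[i] for i in range(10)]
labels = ["no"] * 8 + ["yes"] * 2

bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)

# A classifier that always predicts the majority class looks accurate
# but never finds a single minority-class instance.
always_no = ["no"] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_no)) / len(labels)
minority_recall = recall(labels, always_no, positive="yes")
```

On this toy data the always-majority predictor has high accuracy and zero recall on the minority class, which is why the abstract prefers recall over accuracy for imbalanced datasets. In a real experiment the oversampling would be applied to the training split only, so that duplicated rows do not leak into the test data.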

