Optimizing Classification Performance Through Feature Selection and AdaBoost in Handling Class Imbalance

Authors

  • Tanti Tanti, STMIK Mikroskil, Medan
  • Pahala Sirait, STMIK Mikroskil, Medan
  • Andri Andri, STMIK Mikroskil, Medan

DOI:

https://doi.org/10.30865/mib.v5i4.3280

Keywords:

Class Imbalance, Classification, C5.0, Chi-Square, AdaBoost

Abstract

One of the problems in data mining classification is class imbalance, where the number of instances in the majority class far exceeds that in the minority class. During classification, minority-class instances are often misclassified because machine learning models prioritize the majority class and neglect the minority class, which can leave classification performance sub-optimal. The purpose of this study is to provide a solution that overcomes class imbalance and thereby optimizes classification performance by applying Chi-Square feature selection and AdaBoost to the C5.0 classification algorithm. In the dataset used in this study, the majority class is the negative class, so performance assessment should focus on the positive class. A more suitable metric is therefore recall (also called sensitivity or true positive rate, TPR), because its value depends only on the positive class. The results show that both methods increase the recall/sensitivity/TPR value, meaning that applying Chi-Square and AdaBoost improves classification performance on the minority class.
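
To make the workflow concrete, the following is a minimal sketch of the pipeline described above, written in Python with scikit-learn. scikit-learn does not ship a C5.0 implementation, so a CART decision tree is used here as a stand-in base learner for AdaBoost; the file name online_shoppers_intention_encoded.csv, the target column Revenue, and the choice k=10 are illustrative assumptions, not values taken from the paper.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Assumed input: the UCI Online Shoppers Purchasing Intention data, already
# one-hot encoded (chi2 requires non-negative features) with a binary target
# in the "Revenue" column (1 = purchase, the minority/positive class).
data = pd.read_csv("online_shoppers_intention_encoded.csv")
X, y = data.drop(columns=["Revenue"]), data["Revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Chi-square feature selection: keep the k features most dependent on the class.
selector = SelectKBest(score_func=chi2, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# AdaBoost over decision trees (CART as a stand-in for C5.0).
# Note: the parameter is named base_estimator in scikit-learn < 1.2.
model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=100,
    random_state=42,
)
model.fit(X_train_sel, y_train)

# Recall/sensitivity/TPR = TP / (TP + FN), computed on the positive class only.
print("Recall:", recall_score(y_test, model.predict(X_test_sel), pos_label=1))

Under these assumptions, the printed recall measures how many positive (minority-class) sessions are correctly identified, which is the metric the study emphasizes.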

Published

2021-10-26