Optimizing Classification Performance Through Feature Selection and AdaBoost in Handling Class Imbalance

Authors

  • Tanti Tanti, STMIK Mikroskil, Medan
  • Pahala Sirait, STMIK Mikroskil, Medan
  • Andri Andri, STMIK Mikroskil, Medan

DOI:

https://doi.org/10.30865/mib.v5i4.3280

Keywords:

Class Imbalance, Classification, C5.0, Chi-Square, AdaBoost

Abstract

One of the problems in data mining classification is class imbalance, where the number of instances in the majority class far exceeds that in the minority class. During classification, minority-class instances are often misclassified because machine learning models prioritize the majority class and neglect the minority class, which can leave classification performance sub-optimal. The purpose of this study is to provide a solution that overcomes class imbalance and thereby optimizes classification performance by applying Chi-Square feature selection and AdaBoost to the C5.0 classification algorithm. In the dataset used in this study, the majority class is the negative class, so performance assessment should focus on the positive class. A more suitable metric is therefore recall (also called sensitivity or true positive rate, TPR), because its value depends only on the positive class. The results show that both methods increase the recall/sensitivity/TPR value, meaning that applying Chi-Square and AdaBoost improves classification performance on the minority class.
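
To make the workflow concrete, the following is a minimal sketch of the pipeline described above, written in Python with scikit-learn. scikit-learn does not ship a C5.0 implementation, so a CART decision tree is used here as a stand-in base learner for AdaBoost; the file name online_shoppers_intention_encoded.csv, the target column Revenue, and the choice k=10 are illustrative assumptions, not values taken from the paper.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Assumed input: the UCI Online Shoppers Purchasing Intention data, already
# one-hot encoded (chi2 requires non-negative features) with a binary target
# in the "Revenue" column (1 = purchase, the minority/positive class).
data = pd.read_csv("online_shoppers_intention_encoded.csv")
X, y = data.drop(columns=["Revenue"]), data["Revenue"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Chi-square feature selection: keep the k features most dependent on the class.
selector = SelectKBest(score_func=chi2, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# AdaBoost over decision trees (CART as a stand-in for C5.0).
# Note: the parameter is named base_estimator in scikit-learn < 1.2.
model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=100,
    random_state=42,
)
model.fit(X_train_sel, y_train)

# Recall/sensitivity/TPR = TP / (TP + FN), computed on the positive class only.
print("Recall:", recall_score(y_test, model.predict(X_test_sel), pos_label=1))

Under these assumptions, the printed recall measures how many positive (minority-class) sessions are correctly identified, which is the metric the study emphasizes.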

Published

2021-10-26