Malware Detection Using K-Nearest Neighbor Algorithm and Feature Selection

Catur Supriyanto; Fauzi Adi Rafrastara; Afinzaki Amiral; Syafira Rosa Amalia; Muhammad Daffa Al Fahreza; Mohd. Faizal Abdollah

doi:10.30865/mib.v8i1.6970

Authors

Catur Supriyanto, Universitas Dian Nuswantoro, Semarang
Fauzi Adi Rafrastara, Universitas Dian Nuswantoro, Semarang
Afinzaki Amiral, Universitas Dian Nuswantoro, Semarang
Syafira Rosa Amalia, Universitas Dian Nuswantoro, Semarang
Muhammad Daffa Al Fahreza, Universitas Dian Nuswantoro, Semarang
Mohd. Faizal Abdollah, Universiti Teknikal Malaysia Melaka, Melaka

DOI:

https://doi.org/10.30865/mib.v8i1.6970

Keywords:

Classification, Feature Selection, Information Gain, K-Nearest Neighbor, Malware Detection

Abstract

Malware is one of the biggest threats in today's digital era. Malware detection is crucial because it protects devices and systems from the dangers malware poses, such as data loss or damage, data theft, account break-ins, and intruders gaining full access to a system. Because malware has evolved from its traditional (monomorphic) form to modern forms (polymorphic, metamorphic, and oligomorphic), a detection system is needed that is machine learning-based rather than signature-based. This research addresses malware detection by classifying each file as malware or goodware using the k-Nearest Neighbor (kNN) classification algorithm. To improve the performance of kNN, the number of features was reduced using two methods, Information Gain and Principal Component Analysis (PCA), and the resulting kNN performance was compared. With PCA, the feature set was reduced to 32 principal components (PCs), and kNN maintained its classification performance with an accuracy of 95.6% and an F1-Score of 95.6%. Using the same number of features as the basis, the Information Gain method was then applied by ranking the features by Information Gain score and keeping the 32 highest-scoring features. With this method, the classification performance of kNN increased to 96.9% for both accuracy and F1-Score.
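The Information Gain step described in the abstract (score each feature, sort by score, keep the best ones, then classify with kNN) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dataset, the binary feature values, the helper names (`information_gain`, `top_k_features`, `knn_predict`, `project`), the choice of 2 features instead of 32, and the use of 3 neighbors with Euclidean distance are all assumptions made for illustration. Continuous features would also need to be discretized before this entropy-based score applies.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(column, labels):
    """IG of one discrete feature: H(labels) minus the weighted entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def top_k_features(rows, labels, k):
    """Indices of the k features with the highest information gain."""
    n_features = len(rows[0])
    scores = [information_gain([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

def project(row, feature_indices):
    """Keep only the selected features of one sample."""
    return [row[j] for j in feature_indices]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_rows, train_labels, query, n_neighbors=3):
    """Majority vote among the n_neighbors training rows closest to the query."""
    order = sorted(range(len(train_rows)), key=lambda i: euclidean(train_rows[i], query))
    votes = Counter(train_labels[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 8 files, 4 binary features each.
rows = [
    [1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0],  # malware
    [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0],  # goodware
]
labels = ["malware"] * 4 + ["goodware"] * 4

selected = top_k_features(rows, labels, k=2)        # the paper keeps 32; 2 here
reduced = [project(r, selected) for r in rows]
prediction = knn_predict(reduced, labels, project([1, 0, 1, 0], selected))
print(selected, prediction)
```

On this toy data, feature 0 separates the two classes perfectly and feature 2 does so partially, so they are the two features kept. Features 1 and 3 carry no information about the label and are dropped before the distance computation, which is the effect the abstract attributes to the feature-reduction step.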
