Malware Detection Using K-Nearest Neighbor Algorithm and Feature Selection

 Catur Supriyanto (Universitas Dian Nuswantoro, Semarang, Indonesia)
 (*) Fauzi Adi Rafrastara (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Afinzaki Amiral (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Syafira Rosa Amalia (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Muhammad Daffa Al Fahreza (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Mohd. Faizal Abdollah (Universiti Teknikal Malaysia Melaka, Melaka, Malaysia)

(*) Corresponding Author

Submitted: November 6, 2023; Published: January 24, 2024


Malware is one of the biggest threats in today’s digital era. Malware detection is crucial because it protects devices and systems from the dangers malware poses, such as data loss or damage, data theft, account break-ins, and intruders gaining full access to the system. Because malware has evolved from its traditional (monomorphic) form to modern forms (polymorphic, metamorphic, and oligomorphic), detection systems can no longer rely on signatures and need to be based on machine learning instead. This research addresses malware detection by classifying each file as malware or goodware using a machine learning classification algorithm, k-Nearest Neighbor (kNN). To improve the performance of kNN, the number of features was reduced using the Information Gain and Principal Component Analysis (PCA) feature selection methods, and the performance of kNN with each method was then compared. With PCA, the features were reduced to 32 principal components (PCs), and kNN maintained its classification performance with an accuracy of 95.6% and an F1-Score of 95.6%. Using the same number of features, the Information Gain method was applied by ranking the features by Information Gain score and keeping the 32 highest-scoring features. With this method, the classification performance of kNN increased to 96.9% for both accuracy and F1-Score.
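The Information Gain branch of the approach described above can be illustrated with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the tooling, the toy data and all function names (`entropy`, `information_gain`, `top_k_features`, `knn_predict`) are hypothetical, and the features are assumed to be discrete (continuous features would first need to be binned). The sketch ranks features by Information Gain, keeps the best ones, and classifies a sample with kNN on the reduced feature set.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(column, labels):
    """Label entropy minus the entropy left after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def top_k_features(X, y, k):
    """Indices of the k features with the highest Information Gain, best first."""
    gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)[:k]

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training rows closest to x (Euclidean distance)."""
    nearest = sorted(zip(X_train, y_train), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical toy data: feature 0 tracks the label, features 1 and 2 are noise.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
     [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
y = ["malware"] * 4 + ["goodware"] * 4

# Keep only the best-scoring feature (the study keeps the 32 best).
selected = top_k_features(X, y, k=1)
X_reduced = [[row[j] for j in selected] for row in X]

# Classify a new sample described only by the selected feature.
prediction = knn_predict(X_reduced, y, [1], k=3)
print(selected, prediction)
```

In the study the same idea is applied at a larger scale, keeping the 32 highest-scoring features before running kNN. The choice of `k=3` neighbors here is arbitrary and not taken from the source.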


Classification; Feature Selection; Information Gain; K-Nearest Neighbor; Malware Detection




Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

STMIK Budi Darma
Secretariat: Sisingamangaraja No. 338 Telp 061-7875998
