Performance Comparison of k-Nearest Neighbor Algorithm with Various k Values and Distance Metrics for Malware Detection

 (*) Fauzi Adi Rafrastara (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Catur Supriyanto (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Afinzaki Amiral (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Syafira Rosa Amalia (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Muhammad Daffa Al Fahreza (Universitas Dian Nuswantoro, Semarang, Indonesia)
 Foez Ahmed (University of Rajshahi, Rajshahi, Bangladesh)

(*) Corresponding Author

Submitted: November 6, 2023; Published: January 24, 2024

Abstract

Malware can evolve and spread very quickly, which makes it a threat to anyone who uses a computer, whether offline or online. Research on malware detection therefore remains an active topic, driven by the need to protect devices and systems from the harm malware causes, such as data loss or damage, data theft, account hacking, and intrusion by attackers who can take control of the entire system. Malware has evolved from its traditional (monomorphic) form to modern forms (polymorphic, metamorphic, and oligomorphic). Conventional antivirus systems cannot detect modern malware effectively, because it changes its fingerprint each time it replicates and propagates. Given this evolution, machine learning-based malware detection is needed to replace signature-based detection. Machine learning-based antivirus or malware detection systems detect malware through dynamic analysis rather than the static analysis used by traditional systems. This research studies malware detection using one machine learning classification algorithm, k-Nearest Neighbor (kNN). To improve the performance of kNN, the number of features is reduced using the Information Gain feature selection method. The performance of kNN with Information Gain is then measured using two evaluation metrics, Accuracy and F1-Score. To find the best configuration, three distance metrics are compared across several values of k. The distance metrics compared are Euclidean, Manhattan, and Chebyshev, and the k values compared are 3, 5, 7, and 9. The results show that kNN with the Manhattan distance, k = 3, and Information Gain feature selection (reducing the feature set to 32 features) achieves the highest Accuracy and F1-Score, both 97.0%.
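The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-feature toy dataset, the function names, and the label coding (0 = benign, 1 = malware) are assumptions made for illustration, and the Information Gain feature selection step is omitted. The sketch classifies each test point by majority vote among its k nearest training points, then reports Accuracy and F1-Score for every combination of distance metric and k.

```python
import math
from collections import Counter

# The three distance metrics compared in the abstract.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

METRICS = {"euclidean": euclidean, "manhattan": manhattan, "chebyshev": chebyshev}

def knn_predict(train_X, train_y, query, k, metric):
    """Return the majority label among the k training points nearest to query."""
    dist = METRICS[metric]
    nearest = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy data (NOT the paper's dataset): two already-normalized
# features per sample, label 0 = benign, 1 = malware.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.3), (0.2, 0.4),
           (0.9, 0.8), (0.8, 0.9), (1.0, 0.7), (0.7, 0.7), (0.8, 0.6)]
train_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
test_X = [(0.15, 0.15), (0.85, 0.85), (0.25, 0.3), (0.75, 0.8)]
test_y = [0, 1, 0, 1]

# Evaluate every (metric, k) pair, mirroring the grid in the abstract.
results = {}
for metric in METRICS:
    for k in (3, 5, 7, 9):
        preds = [knn_predict(train_X, train_y, q, k, metric) for q in test_X]
        results[(metric, k)] = (accuracy(test_y, preds), f1_score(test_y, preds))

for (metric, k), (acc, f1) in results.items():
    print(f"{metric:<10} k={k}  accuracy={acc:.3f}  f1={f1:.3f}")
```

On a real dataset the features would normally be normalized first, since all three metrics are sensitive to feature scale, and the scores would be estimated with a held-out split or cross-validation rather than a handful of test points.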

Keywords


Malware Detection; K-Nearest Neighbor; Euclidean; Manhattan; Chebyshev; k Value





Copyright (c) 2024 JURNAL MEDIA INFORMATIKA BUDIDARMA

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.



JURNAL MEDIA INFORMATIKA BUDIDARMA
STMIK Budi Darma
Secretariat: Sisingamangaraja No. 338 Telp 061-7875998
Email: mib.stmikbd@gmail.com
