Performance Comparison of k-Nearest Neighbor Algorithm with Various k Values and Distance Metrics for Malware Detection
DOI: https://doi.org/10.30865/mib.v8i1.6971
Keywords: Malware Detection, K-Nearest Neighbor, Euclidean, Manhattan, Chebyshev, k Value
Abstract
Malware can evolve and spread very quickly. These capabilities make malware a threat to anyone who uses a computer, whether offline or online. Research on malware detection therefore remains a hot topic today, driven by the need to protect devices and systems from the dangers malware poses, such as data loss or damage, data theft, account hacking, and intrusions by attackers who can take control of the entire system. Malware has evolved from traditional (monomorphic) into modern forms (polymorphic, metamorphic, and oligomorphic). Conventional antivirus systems cannot detect these modern variants effectively, because such malware constantly changes its fingerprint each time it replicates and propagates. Given this evolution, a machine learning-based malware detection system is needed to replace signature-based detection. Machine learning-based antivirus and malware detection systems detect malware through dynamic analysis rather than the static analysis used by traditional tools. This research discusses malware detection using one of the classification algorithms in machine learning, namely k-Nearest Neighbor (kNN). To improve kNN performance, the number of features is reduced using the Information Gain feature selection method. The performance of kNN with Information Gain is then measured using the Accuracy and F1-Score evaluation metrics. To obtain the best score, the kNN algorithm is tuned by comparing three distance measurement methods together with several values of k. The distance measurement methods compared are Euclidean, Manhattan, and Chebyshev, and the k values compared are 3, 5, 7, and 9. The result is that kNN with the Manhattan distance metric, k = 3, and Information Gain feature selection (reducing the feature set to 32 features) achieves the highest Accuracy and F1-Score, both at 97.0%.
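To make the evaluated setup concrete, the sketch below reproduces the general pipeline described above with scikit-learn: select the 32 most informative features, then compare kNN with Euclidean, Manhattan, and Chebyshev distances for k = 3, 5, 7, and 9, scoring each combination with Accuracy and F1-Score. (Manhattan distance sums absolute coordinate differences, Euclidean takes the square root of summed squared differences, and Chebyshev takes the maximum coordinate difference.) This is a minimal illustration, not the authors' implementation: the dataset file name, the label column, the 80/20 split, the scaling step, and the use of mutual_info_classif as a stand-in for Information Gain are all assumptions.

# Minimal sketch of the pipeline in the abstract. Assumptions: dataset file
# name, label column, 80/20 stratified split, mutual information as a proxy
# for Information Gain.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("malware_dataset.csv")            # hypothetical file
X, y = df.drop(columns=["label"]), df["label"]     # hypothetical label column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the features, then keep the 32 highest-scoring ones.
scaler = MinMaxScaler().fit(X_train)
selector = SelectKBest(mutual_info_classif, k=32).fit(
    scaler.transform(X_train), y_train)
X_tr = selector.transform(scaler.transform(X_train))
X_te = selector.transform(scaler.transform(X_test))

# Grid over the three distance metrics and the four k values compared in the paper.
for metric in ("euclidean", "manhattan", "chebyshev"):
    for k in (3, 5, 7, 9):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric).fit(X_tr, y_train)
        pred = knn.predict(X_te)
        print(f"{metric:9s} k={k}  "
              f"acc={accuracy_score(y_test, pred):.3f}  "
              f"f1={f1_score(y_test, pred, average='weighted'):.3f}")

Under this setup, the abstract's finding corresponds to the Manhattan, k = 3 row scoring highest; whether a weighted or binary F1-Score is reported depends on how the class labels are encoded, which the sketch does not assume.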
License

This work is licensed under a Creative Commons Attribution 4.0 International License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (Refer to The Effect of Open Access).