Malware Detection Using K-Nearest Neighbor Algorithm and Feature Selection
DOI: https://doi.org/10.30865/mib.v8i1.6970
Keywords: Classification, Features Selection, Information Gain, K-Nearest Neighbor, Malware Detection
Abstract
Malware is one of the biggest threats in today's digital era. Malware detection is crucial because it protects devices and systems from the dangers malware poses, such as data loss or damage, data theft, account break-ins, and intruders gaining full access to the system. Because malware has evolved from its traditional (monomorphic) form to modern forms (polymorphic, metamorphic, and oligomorphic), a malware detection system is needed that is machine learning-based rather than signature-based. This research addresses malware detection by classifying each file as malware or goodware using a machine learning classification algorithm, k-Nearest Neighbor (kNN). To improve the performance of kNN, the number of features was reduced using two feature selection methods, Information Gain and Principal Component Analysis (PCA), and the performance of kNN under each method was then compared. With PCA, where the features were reduced to 32 principal components (PCs), kNN maintained its classification performance, with an accuracy of 95.6% and an F1-Score of 95.6%. Using the same number of features as the basis, the Information Gain method ranked the features by Information Gain score and kept the 32 highest-scoring features. With this method, the classification performance of kNN increased to 96.9% for both accuracy and F1-Score.
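The Information Gain ranking and kNN classification described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy rows, labels, binary feature encoding, helper names, and the choice of 3 neighbors are all assumptions made for illustration, and only 2 of 4 features are kept here where the study keeps 32. PCA is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def top_k_features(rows, labels, k):
    """Indices of the k features with the highest Information Gain score."""
    n_features = len(rows[0])
    scores = [information_gain([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def knn_predict(train_rows, train_labels, query, n_neighbors=3):
    """Majority vote among the n_neighbors closest rows (Euclidean distance)."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical toy data: 4 binary features per file (not from the study's dataset).
rows = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
labels = ["malware", "malware", "malware", "goodware", "goodware", "goodware"]

# Keep the 2 highest-scoring features, then classify an unseen file on those only.
selected = top_k_features(rows, labels, k=2)
reduced = [[r[j] for j in selected] for r in rows]
query = [1, 0, 1, 1]
prediction = knn_predict(reduced, labels, [query[j] for j in selected])
print(selected, prediction)
```

On real data with continuous features, the features would need to be discretized before computing Information Gain this way, and kNN is sensitive to feature scaling, so normalization would normally come first.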
License

This work is licensed under a Creative Commons Attribution 4.0 International License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (Refer to The Effect of Open Access).