Optimizing a Linear Support Vector Machine for Multi-Class Smishing Detection on Imbalanced Datasets

Authors

  • Anggun Vannia Universitas Dian Nuswantoro
  • Muljono Universitas Dian Nuswantoro

DOI:

https://doi.org/10.30865/json.v7i2.9299

Keywords:

Smishing; Detection; Linear SVM; Hybrid Sampling; Class Imbalance

Abstract

Machine-learning detection of smishing (SMS phishing) attacks faces a fundamental challenge: real-world datasets have imbalanced class distributions, and the minority-class instances (smishing) are precisely the ones most critical to identify. This study proposes a robust framework that optimizes a Linear Support Vector Machine (SVM) with a three-tier hybrid sampling strategy for multi-class classification on imbalanced data. The framework combines hybrid feature extraction (TF-IDF plus meta-features) with a comprehensive imbalance-handling strategy comprising Random Oversampling (ROS) for the minority class, Random Undersampling (RUS) for the majority class, and Embedding MixUp for embedding-level data augmentation. Hyperparameter optimization with GridSearchCV under 5-fold cross-validation selected C=0.5 as the optimal Linear SVM configuration. Evaluation on the test set shows high and balanced classification performance, with 96.7% accuracy and 87.6% macro-F1. This balance across classes is reflected in a smishing recall of 84% alongside a ham recall of 99%. These findings indicate that combining Linear SVM with the hybrid sampling strategy yields a smishing detection model that is robust, balanced, and ready for deployment in real-world scenarios.
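The three sampling tiers named in the abstract can be illustrated with a minimal sketch using only the Python standard library. This is not the authors' implementation: the function names `hybrid_resample` and `mixup_same_class`, the per-class `target` count, the MixUp parameter `alpha=0.4`, the restriction of MixUp to pairs from the same class, the third class label "spam", and the toy vectors are all assumptions made for illustration.

```python
import random
from collections import Counter


def hybrid_resample(samples, labels, target, seed=0):
    """Resample so every class ends with exactly `target` items (assumed policy)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, items in by_class.items():
        if len(items) > target:
            # RUS: keep a random subset of the larger class, without replacement
            chosen = rng.sample(items, target)
        else:
            # ROS: keep all originals, then add random duplicates
            chosen = items + [rng.choice(items) for _ in range(target - len(items))]
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


def mixup_same_class(vectors, n_new, alpha=0.4, seed=0):
    """Embedding-level MixUp: convex combinations of two vectors of one class."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(vectors, 2)
        lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
        synthetic.append([lam * u + (1 - lam) * v for u, v in zip(a, b)])
    return synthetic


# Toy, hypothetical data: 8 ham, 3 spam, 2 smishing feature vectors
labels = ["ham"] * 8 + ["spam"] * 3 + ["smishing"] * 2
vectors = [[float(i), float(i % 3)] for i in range(len(labels))]

bal_x, bal_y = hybrid_resample(vectors, labels, target=4)
counts = Counter(bal_y)

smish = [x for x, y in zip(vectors, labels) if y == "smishing"]
extra = mixup_same_class(smish, n_new=3)
```

After resampling, each class in `counts` holds four items, and every vector in `extra` lies on the segment between the two original smishing vectors. In a full pipeline, resampling and augmentation would be applied to the training folds only, so that the test set keeps its original class distribution.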

Published

2025-12-31

How to Cite

Vannia, A., & Muljono. (2025). Optimasi Linear Support Vector Machine untuk Deteksi Smishing Multi-Kelas pada Dataset Tidak Seimbang. Jurnal Sistem Komputer Dan Informatika (JSON), 7(2), 624–634. https://doi.org/10.30865/json.v7i2.9299