Comparative Analysis of Multinomial Naïve Bayes and Logistic Regression Models for Prediction of SMS Spam
DOI:
https://doi.org/10.30865/mib.v6i3.4019
Keywords:
Fraud, SMS Spam, Supervised Learning, Model Validation
Abstract
This research was motivated by a report from the United States Federal Trade Commission on fraud carried out through SMS text messages, which fraudsters use to manipulate potential victims. Scammers usually spread SMS spam as the vehicle for the crime. A supervised learning algorithm is developed to classify messages into three categories: SMS spam, SMS fraud, and promotional SMS. The prediction system is developed in several stages: data labelling, data preprocessing, modelling, and model validation. Logistic Regression achieves an accuracy of 99% with a test size of 15%, 99% with a test size of 20%, and 98% with a test size of 25%. Multinomial Naïve Bayes achieves an accuracy of 97% at each of the three test sizes (15%, 20%, and 25%). Logistic Regression therefore has the highest accuracy and is the method adopted for SMS spam prediction.
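The validation protocol described in the abstract (train on part of the labelled messages, test on a held-out 15%, 20%, or 25%) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, covers only the Multinomial Naïve Bayes side of the comparison, and assumes whitespace tokenization and add-one smoothing. The toy messages, the label names, and the helper names `train_mnb`, `predict_mnb`, and `holdout_accuracy` are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy messages, three per category named in the abstract.
DOCS = [
    ("your account is blocked send your pin now", "fraud"),
    ("you won a prize transfer the fee to claim", "fraud"),
    ("urgent send your pin to unlock your account", "fraud"),
    ("discount 50 percent on data package today", "promo"),
    ("buy one get one free data package", "promo"),
    ("special discount on call package this week", "promo"),
    ("click this link for free ringtone", "spam"),
    ("reply yes to get daily horoscope", "spam"),
    ("free ringtone click link now", "spam"),
]

def train_mnb(texts, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with add-alpha (Laplace) smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()  # assumed preprocessing: lowercase + split
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs in class_counts.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(texts)),
            "log_like": {w: math.log((word_counts[label][w] + alpha) / denom)
                         for w in vocab},
        }
    return model

def predict_mnb(model, text):
    """Return the class with the highest log posterior score."""
    tokens = text.lower().split()
    scores = {}
    for label, params in model.items():
        # Words never seen during training are skipped.
        scores[label] = params["log_prior"] + sum(
            params["log_like"][w] for w in tokens if w in params["log_like"])
    return max(scores, key=scores.get)

def holdout_accuracy(texts, labels, test_size, seed=0):
    """Shuffle, hold out a fraction `test_size`, train on the rest, score."""
    idx = list(range(len(texts)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(test_size * len(texts)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    model = train_mnb([texts[i] for i in train_idx],
                      [labels[i] for i in train_idx])
    hits = sum(predict_mnb(model, texts[i]) == labels[i] for i in test_idx)
    return hits / n_test

texts = [t for t, _ in DOCS]
labels = [y for _, y in DOCS]
for size in (0.15, 0.20, 0.25):  # the three test sizes reported in the abstract
    print(f"test_size={size:.2f} accuracy={holdout_accuracy(texts, labels, size):.2f}")
```

With nine toy messages the printed accuracies say nothing about the reported results; the sketch only shows how a test size maps to a hold-out split and an accuracy score. A Logistic Regression model would be evaluated on the same splits to reproduce the comparison.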

This work is licensed under a Creative Commons Attribution 4.0 International License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (Refer to The Effect of Open Access).