Comparative Analysis of Multinomial Naïve Bayes and Logistic Regression Models for Prediction of SMS Spam

Pradana Ananda Raharja; Muhammad Fajar Sidiq; Diandra Chika Fransisca

doi:10.30865/mib.v6i3.4019

Authors

Pradana Ananda Raharja Institut Teknologi Telkom Purwokerto, Banyumas http://orcid.org/0000-0002-5777-5902
Muhammad Fajar Sidiq Institut Teknologi Telkom Purwokerto, Banyumas
Diandra Chika Fransisca Institut Teknologi Telkom Purwokerto, Banyumas

DOI:

https://doi.org/10.30865/mib.v6i3.4019

Keywords:

Fraud, SMS Spam, Supervised Learning, Model Validation

Abstract

This research was motivated by a report from the United States Federal Trade Commission on fraud carried out through SMS text messages, which fraudsters use to manipulate potential victims. Scammers typically distribute SMS spam as the vehicle for the crime. Supervised learning algorithms are applied to classify messages into three categories: SMS spam, SMS fraud, and promotional SMS. The prediction system is developed in several stages, including data labelling, data preprocessing, modelling, and model validation. Logistic Regression achieved an accuracy of 99% with a test size of 15%, 99% with a test size of 20%, and 98% with a test size of 25%. Multinomial Naïve Bayes achieved an accuracy of 97% at each of the three test sizes (15%, 20%, and 25%). Logistic Regression therefore had the highest accuracy and is the method adopted for SMS spam prediction.
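The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the messages are synthetic tokens generated for illustration, the two classifiers are small pure-Python stand-ins (Multinomial Naïve Bayes with Laplace smoothing, and multinomial Logistic Regression trained by stochastic gradient descent), and names such as `make_dataset`, `compare`, `alpha`, `epochs`, and `lr` are assumptions. Only the three class labels and the three test sizes come from the abstract.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical class vocabularies, invented for illustration (not the paper's data).
VOCAB = {
    "spam": ["free", "prize", "winner", "claim", "click"],
    "fraud": ["transfer", "account", "pin", "bank", "urgent"],
    "promo": ["discount", "sale", "voucher", "offer", "store"],
}
COMMON = ["please", "now", "your", "today"]


def make_dataset(n_per_class, rng):
    """Build synthetic, already-tokenised messages with known labels."""
    data = []
    for label, words in VOCAB.items():
        for _ in range(n_per_class):
            tokens = rng.choices(words, k=4) + rng.choices(COMMON, k=2)
            data.append((tokens, label))
    rng.shuffle(data)
    return data


def split(data, test_size):
    """Hold out the first `test_size` fraction of the shuffled data for testing."""
    n_test = round(len(data) * test_size)
    return data[n_test:], data[:n_test]


def train_mnb(train, alpha=1.0):
    """Multinomial Naive Bayes: log priors plus Laplace-smoothed log likelihoods."""
    vocab = sorted({t for toks, _ in train for t in toks})
    class_docs = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)
    for toks, label in train:
        word_counts[label].update(toks)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_like = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (math.log(n_docs / len(train)), log_like)
    return model


def predict_mnb(model, toks):
    def score(label):
        log_prior, log_like = model[label]
        # Words never seen in training are skipped (they add 0 for every class).
        return log_prior + sum(log_like.get(t, 0.0) for t in toks)
    return max(model, key=score)


def features(toks):
    feats = Counter(toks)
    feats["__bias__"] = 1
    return feats


def train_logreg(train, epochs=50, lr=0.1):
    """Multinomial logistic regression on token counts, fitted with plain SGD."""
    labels = sorted({label for _, label in train})
    weights = {c: defaultdict(float) for c in labels}
    for _ in range(epochs):
        for toks, gold in train:
            feats = features(toks)
            scores = {c: sum(weights[c][f] * v for f, v in feats.items()) for c in labels}
            top = max(scores.values())
            exps = {c: math.exp(s - top) for c, s in scores.items()}
            norm = sum(exps.values())
            for c in labels:
                # Gradient of the cross-entropy loss: predicted probability minus target.
                grad = exps[c] / norm - (1.0 if c == gold else 0.0)
                for f, v in feats.items():
                    weights[c][f] -= lr * grad * v
    return weights


def predict_logreg(weights, toks):
    feats = features(toks)
    return max(weights, key=lambda c: sum(weights[c].get(f, 0.0) * v for f, v in feats.items()))


def accuracy(predict, model, test):
    return sum(predict(model, toks) == label for toks, label in test) / len(test)


def compare(test_sizes=(0.15, 0.20, 0.25), seed=0):
    """Evaluate both models on a hold-out split for each test size."""
    data = make_dataset(40, random.Random(seed))
    results = {}
    for size in test_sizes:
        train, test = split(data, size)
        results[size] = {
            "naive_bayes": accuracy(predict_mnb, train_mnb(train), test),
            "logistic_regression": accuracy(predict_logreg, train_logreg(train), test),
        }
    return results


results = compare()
for size, scores in results.items():
    print(f"test size {size:.0%}: {scores}")
```

Because the synthetic classes use disjoint vocabularies, both models separate them easily here; the sketch shows only the evaluation procedure and does not reproduce the accuracies reported above.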

Author Biography

Pradana Ananda Raharja, Institut Teknologi Telkom Purwokerto, Banyumas

Informatics Engineering (Teknik Informatika)
