Comparative Analysis of Multinomial Naïve Bayes and Logistic Regression Models for Prediction of SMS Spam

This research is motivated by a report from the United States Federal Trade Commission on fraud carried out through SMS text messages, which fraudsters use to manipulate potential victims. Scammers typically spread SMS spam as an intermediary for the crime. A supervised learning algorithm is developed to classify SMS messages into three categories: SMS spam, SMS fraud, and promotional SMS. The development of the prediction system is divided into several stages: data labelling, data preprocessing, modelling, and model validation. Logistic Regression achieves an accuracy of 99% with a test size of 15%, 99% with a test size of 20%, and 98% with a test size of 25%. The Multinomial Naïve Bayes algorithm achieves an accuracy of 97% with test sizes of 15%, 20%, and 25%. The SMS spam prediction approach therefore uses Logistic Regression, which has the highest accuracy.


INTRODUCTION
The United States Federal Trade Commission states that this kind of fraud involves sending fake text messages to trick someone into providing personal information such as passwords, account numbers, and identification numbers. Fraudsters use this information to access email or bank accounts, or sell it to other fraudsters. They use a variety of changing scenarios to get the victim's attention. Standard methods include promising gifts, gift cards, or coupons and offering low-interest or no-interest credit cards. Scammers also commonly send fake messages stating that they have information about the victim's account or transaction. Typically the message says that suspicious activity was seen on the victim's account, claims that there is a problem with payment information, or contains a fake invoice and tells the victim to make contact in order to cancel the purchase. In some incidents, fraudsters have sent victims fake package delivery notifications [1] [2]. According to spam statistics published by AV-TEST, Indonesia ranks 8th in the world for global spam. The Indonesian law relevant to the spread of spam, Undang-Undang No. 11 Tahun 2008 / Undang-Undang Informasi dan Transaksi Elektronik (UU ITE), does not yet regulate spam explicitly. However, sending spam can be categorized as a prohibited act under chapter VII, articles 27 to 34, and more precisely under article 33 [3][4]. Short Message Service (SMS) has developed over the decades to the point where it is used for business activities, and SMS text messages are more effective than email [5]. As a result, SMS is also used as a tool to commit crimes, luring victims and manipulating their circumstances [6] [7].
Sudibyo et al. studied the classification of spam attack attributes in email using a Decision Tree approach. The study used a spam dataset of 4601 records, of which 1813 records were considered spam and 278 not spam, with 57 initial attributes and 1 class attribute. Three experiments were carried out using 30%, 50%, and 70% of the attributes; the 70% setting gave a better result than 30% or 50%, with an accuracy of 92.469% [8]. Fitrian et al. aimed to create an email filtering application that uses the naive Bayes classifier to classify emails as SPAM or HAM, with lemmatization to reduce words to their base forms. Of the 131 email samples tested, 119 were classified correctly and 12 received the wrong prediction, giving an accuracy of 90.83% [9]. Setiyono and Pardede investigated several data mining techniques for automatic spam detection, namely Support Vector Machine, Multinomial Naïve Bayes, and Decision Tree. Their experimental results show that Support Vector Machine is the best of the three evaluated algorithms, reaching 98.33% accuracy, while Multinomial Naïve Bayes reached 98.13% and Decision Tree 97.10% [10]. The present research builds on this work by comparing algorithms and datasets, with the aim of finding an approach with more optimal prediction accuracy.
The development of computational methods for identifying various SMS in cyberspace requires analysing different SMS patterns [11] [12] and then making predictions against spam using processed datasets [13]. Machine learning techniques can be used to develop a data-based SMS spam detection model. However, the prediction of SMS spam using machine learning algorithms has limitations in identifying double classification results, which means that it depends on the characteristics of the data [14]. Several machine learning algorithms are analysed in the SMS spam detection system in order to protect users from cybercrime [15]. In this research, several popular machine learning classification techniques are applied, including Logistic Regression (LR) and Multinomial Naïve Bayes (MNB), to provide intelligent services in information and communication technology [16] [17].
The effectiveness of the algorithms is tested by conducting experiments on an SMS spam dataset consisting of three SMS categories, and is evaluated by measuring precision, recall, f1-score, and accuracy for the machine learning-based SMS spam detection model [18].

RESEARCH METHODOLOGY
This section describes the research sequence: the research design, the data pre-processing used to process text data, prediction using machine learning-based modelling, and model validation to determine accuracy, precision, recall, and f1-score. The explanation of each research step is supported by references. At the data selection stage, the datasets consist of SMS data of various types, which are sorted into three categories: original SMS, fraud SMS, and promo SMS. Pre-processing then processes the text by removing punctuation marks, converting to lowercase, and removing stopwords. The pre-processed text is transformed into an array so that it can easily be read by the applied algorithm. At the data mining stage, the goal is to predict the category of a text, including new text data not yet in the datasets. The prediction results are evaluated using a confusion matrix to determine how accurate the method is. The SMS spam datasets in this study are used with permission from the previous researchers for development research, which follows the Knowledge Discovery and Data Mining (KDD) methodology [19]. The research steps carried out in extracting the SMS spam text data are shown in Figure 1 and consist of the following stages.

Selection and Pre-processing
Selection and pre-processing are an essential part of research that develops machine learning-based modelling, and they form part of the analytical pipeline used as our research method. Applying data pre-processing is important in machine learning-based modelling in order to obtain the expected performance [20]. The data pre-processing consists of datasets availability, tokenization, case-folding, stopword removal, stemming, and vectorization [21] [22].

a. Datasets Availability
The dataset used in this study is SMS spam data that must be labelled by type. There are three types of SMS labels: label 0 is original SMS, label 1 is fraud SMS, and label 2 is promotional SMS [23]. Datasets are drawn from repositories whose content is relevant to the research, so that the data can support the research to be carried out [24].

b. Tokenization and Case-folding
At the initial stage, the text data consists of a set of characters, while the text analysis process requires the words contained in the data set. Tokenization is straightforward because the text is already stored in a machine-readable format. However, punctuation marks cause problems, so they are removed at the tokenization stage [25]. Case-folding converts capital letters to lowercase letters to prevent ambiguity for the machine, so that processing becomes more efficient [26].

c. Stopwords removal
Stopwords removal is a text processing step in information retrieval and text mining that deletes words that are irrelevant for indexing. Text documents contain many types of words, such as prepositions, conjunctions, pronouns, and adjectives. Some of these words may not be useful for indexing the document because they are not unique or are never used in a search query. These words are therefore filtered out using a stoplist. Zipf's law is sometimes used as the basis for forming lists of non-indexable words, especially in the analysis of word occurrence [27] [28].

d. Stemming
The stemming process is a method for reducing a word to its root word by removing all of its affixes. The affixes include prefixes, suffixes, and confixes [29]. The application of stemming differs between languages, depending on the morphology of each language. The result of the stemming process is a stem.
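The pre-processing steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the study's actual pipeline: the `STOPLIST` and `SUFFIXES` values and the `toy_stem` helper are hypothetical, and a real system would use a stoplist and stemmer built for the language of the messages.

```python
import string

# Hypothetical English stoplist for illustration only; the study's own
# stoplist is not given in the text.
STOPLIST = {"a", "an", "the", "to", "and", "or", "of", "is", "you", "your", "for"}

# Hypothetical suffixes for a toy stemmer; real stemming depends on the
# morphology of the target language.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Remove punctuation marks and split the text on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def case_fold(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stoplist."""
    return [t for t in tokens if t not in STOPLIST]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, case-fold, remove stopwords, then stem."""
    tokens = case_fold(tokenize(text))
    tokens = remove_stopwords(tokens)
    return [toy_stem(t) for t in tokens]
```

For example, `preprocess("Congratulations! You WON a prize, claim it today.")` returns `['congratulation', 'won', 'prize', 'claim', 'it', 'today']`.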

Transformation
Vectorization is part of data transformation and is the last stage of data pre-processing: it changes the words into a numerical representation [30]. The vectorization stage uses the Term Frequency-Inverse Document Frequency (TF-IDF) method to obtain the weight of each token in the vectorized dataset, as given in Equation (1).
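As a rough illustration of TF-IDF weighting, the sketch below assumes the common textbook form, weight = tf × ln(N / df), where tf is the count of a token in a document, N is the number of documents, and df is the number of documents containing the token. This form is an assumption and may differ from the exact variant used in the study; the `tf_idf` function and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical pre-processed messages.
docs = [
    ["win", "prize", "now"],
    ["promo", "discount", "now"],
    ["meet", "you", "at", "home"],
]
weights = tf_idf(docs)
```

In this toy corpus "now" appears in two of the three documents, so it receives a lower weight than "win", which appears in only one.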

Data Mining
This case study compares two modelling approaches, namely LR and MNB. Text classification of SMS spam is used to determine whether a message is a fraudulent SMS, a promo SMS, or an original SMS [32]. Before modelling, the datasets were tested to obtain the right level of accuracy [33]. Logistic Regression is a supervised learning algorithm used to classify individuals based on a logistic function. Equation (2) is the LR equation [34].

$$\ln\left(\frac{P}{1-P}\right) = B_0 + B_1 X \qquad (2)$$

where ln is the natural logarithm, $B_0 + B_1 X$ is the equation known from Ordinary Least Squares, and $P$ is the logistic probability.

MNB works by calculating how often each token appears in a document. The order in which words occur in the document is not taken into account, so the document, treated as a "bag of words", is processed using a multinomial distribution as in Equation (3) [35]. A sanity check is a testing mechanism to identify valid input data after modelling [36].
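The bag-of-words idea behind MNB can be illustrated with a small from-scratch sketch. It assumes add-one (Laplace) smoothing through the `alpha` parameter, which the text does not specify, and the functions `train_mnb` and `predict_mnb` and the toy messages are hypothetical. Labels follow the scheme above: 0 for original, 1 for fraud, 2 for promotion.

```python
import math
from collections import Counter, defaultdict


def train_mnb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = {t for doc in documents for t in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        token_counts[label].update(doc)  # token order is ignored
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_like


def predict_mnb(model, doc):
    """Return the class with the highest log posterior score."""
    log_prior, log_like = model
    scores = {
        # Tokens outside the training vocabulary are skipped.
        c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


train_docs = [
    ["see", "you", "at", "home"],
    ["call", "me", "later"],
    ["send", "your", "pin", "to", "claim", "prize"],
    ["transfer", "money", "to", "claim", "prize"],
    ["promo", "discount", "data", "package"],
    ["buy", "data", "package", "get", "discount"],
]
train_labels = [0, 0, 1, 1, 2, 2]
model = train_mnb(train_docs, train_labels)
```

With this toy model, `predict_mnb(model, ["claim", "your", "prize"])` returns 1 and `predict_mnb(model, ["discount", "package"])` returns 2.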

Evaluation
The method generally used to calculate accuracy in machine learning, and the one applied in this study, is the confusion matrix. The confusion matrix holds the information about which classifications the model predicted correctly. The parameters used include precision, recall, f1-score, and accuracy [37].
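The sketch below shows how precision, recall, f1-score, and accuracy can be derived from a three-class confusion matrix. The label vectors are made-up values for illustration, not results from the study, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_metrics(matrix):
    """Compute precision, recall, and f1-score for each class."""
    n = len(matrix)
    report = {}
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(matrix[c]) - tp                       # truly c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(matrix):
    """Share of all predictions that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels: 0 = original, 1 = fraud, 2 = promotion.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
matrix = confusion_matrix(y_true, y_pred, n_classes=3)
report = per_class_metrics(matrix)
```

Here the matrix is `[[3, 1, 0], [0, 2, 1], [0, 0, 3]]`, so 8 of 10 predictions are correct and the accuracy is 0.8.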

RESULT AND DISCUSSION
The results are based on research carried out with the stages of data pre-processing, modelling, and model validation. The research conducted by Rami and Wibisono used an SMS dataset of 1143 labelled messages, consisting of 569 original SMS, 335 fraud SMS, and 239 promo SMS, as shown in Figure 2.
The modelling applied in this study uses two supervised learning methods, namely, LR and MNB.

Selection, Pre-processing and Transformation
The data pre-processing stage consists of tokenization, case-folding, stopwords removal, stemming, and vectorization using libraries available in the Python programming language, as shown in Figure 3. Figure 4 shows the output of data pre-processing, which is in the form of vectors.

Data mining
The three SMS text classes are predicted using LR and MNB, as supported by the scikit-learn library, with test sets of 15%, 20%, and 25% of the total data. The resulting prediction accuracy of each algorithm is given in Table 1.
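A comparison of this kind can be sketched with scikit-learn as follows. The corpus is a hypothetical toy set generated from templates, not the study's dataset, and the choices of `random_state`, `stratify`, `max_iter`, and fitting the vectorizer on the training split only are assumptions made for the illustration. The accuracies it produces say nothing about the values in Table 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 20 messages per class, built from templates.
places = ["home", "school", "office", "station", "market"]
hours = ["seven", "eight", "nine", "ten"]
original = [f"see you at the {p} at {h}" for p in places for h in hours]

prizes = ["car", "motorbike", "cash", "phone", "voucher"]
banks = ["alpha", "beta", "gamma", "delta"]
fraud = [f"you won a {z} transfer the fee to bank {b} account"
         for z in prizes for b in banks]

items = ["data", "voice", "roaming", "streaming", "gaming"]
cuts = ["ten", "twenty", "thirty", "forty"]
promo = [f"promo {i} package discount {c} percent today"
         for i in items for c in cuts]

texts = original + fraud + promo
labels = [0] * 20 + [1] * 20 + [2] * 20  # 0 original, 1 fraud, 2 promotion

results = {}
for test_size in (0.15, 0.20, 0.25):
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=42, stratify=labels
    )
    # Fit TF-IDF on the training split only, then reuse it for the test split.
    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(x_train)
    test_vectors = vectorizer.transform(x_test)

    models = {"LR": LogisticRegression(max_iter=1000), "MNB": MultinomialNB()}
    for name, clf in models.items():
        clf.fit(train_vectors, y_train)
        predictions = clf.predict(test_vectors)
        results[(name, test_size)] = accuracy_score(y_test, predictions)

for (name, test_size), acc in sorted(results.items()):
    print(f"{name} test_size={test_size:.2f} accuracy={acc:.2f}")
```

Holding the split and the vectorizer fixed for both models within each test size keeps the comparison between LR and MNB like for like.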

Evaluation
The accuracy results for each split of the datasets, sorted from lowest to highest, are as follows. The MNB method has an accuracy of 97% with 75%, 80%, and 85% of the datasets. The LR algorithm has better results: an accuracy of 98% with 75% of the datasets, 99% with 80%, and 99% with 85%, as shown in Figure 5.

CONCLUSION
Based on the results of the research, validated using the confusion matrix, the LR algorithm has an accuracy of 99% with a test size of 15%, 99% with a test size of 20%, and 98% with a test size of 25%. The MNB algorithm has the same accuracy, 97%, with test sizes of 15%, 20%, and 25%. From the information obtained in this study, the LR algorithm has the best accuracy in making predictions.