Sentiment and Discussion Topic Analysis on Social Media Group using Support Vector Machine

− The growth of social media in this modern era is increasingly rapid, with people actively interacting with each other digitally. People who share a common interest or simply like being in a community often gather in online groups, especially on Facebook. Alumni of Telkom University are no exception; they actively discuss and share information in the Telkom University Alumni Forum (FAST) Facebook group. Using the statuses from that group, sentiment analysis can be performed to determine whether the polarity is positive, neutral, or negative. In addition, topic modeling can extract the topics that are often discussed in the group. In this research, sentiment analysis was performed using the Support Vector Machine (SVM) method. The classification process involved TF-IDF for word weighting and a confusion matrix for performance measurement. Several testing scenarios were carried out to obtain the best accuracy value. Based on tests of the preprocessing techniques and the addition of n-gram feature extraction, the highest accuracy obtained is 80.56%. This result indicates that the best performance is obtained by combining the preprocessing techniques without the stopword removal process and unigram feature extraction. Moreover, the topics extracted by topic modeling were related to telecommunications and Telkom, Indonesia, alumni, and FAST.


INTRODUCTION
In today's modern era, social media is growing rapidly along with the development of the internet and technology. People often share opinions or responses about an event, product, activity, or person through social media, which results in interaction between users. Those who share common interests or simply like being in a community often come together and create groups online, especially on Facebook. The reason is that Facebook makes it easy for its users to create a group with features that can be customized according to their needs.
Facebook is a popular social network widely used in Indonesia and worldwide. According to various online sources, among them Statista [1], as of January 2022 Facebook is still ranked first as the most popular social network in the world and in Indonesia, with the highest number of active users. People often share information, criticism, and opinions about something or someone with the public through social media. As a result, Facebook has a large and diverse dataset that can be used to overcome the limitations of lab-based studies by providing access to records of user behavior expressed in a natural environment [2], so that researchers no longer need to rely on traditional survey methods.
Referring to the Ministry of Education and Culture policy regarding the Tracer Study program, every university is required to conduct an alumni survey that aims to measure the performance of the college in preparing its students to enter the working world, as well as to provide input for evaluation. For this reason, sentiment analysis can serve as an attempt to conduct the survey without having to interact directly with alumni. Sentiment analysis is one of the important areas of research in social media analysis because it concentrates on detecting the polarity of opinions or emotions in texts on social media [3].
Research on sentiment analysis and topic modeling has been carried out by several researchers. In a study conducted by Handayani et al. in 2020, the Support Vector Machine algorithm was used to classify positive and negative sentiments in comments from BNI Mobile Banking application users. By applying the K-Fold Cross-Validation method, the highest accuracy value increased from 78.19% to 78.45%. 10-Fold Cross-Validation was used because the method had become the standard validation method in previous research [4].
In another study, conducted by Jaman et al. in 2019, sentiment analysis was performed on tweets discussing online motorcycle taxi services collected from Twitter, using the SVM method with TF-IDF feature selection. The dataset was divided into three classes: positive, neutral, and negative sentiment. The classification process was carried out using several train-test split ratios: 50:50, 60:40, 70:30, 80:20, and 90:10. Four kernels were used in the classification process: linear, RBF, sigmoid, and polynomial. The highest accuracy results, over 80%, were obtained in the scenario with 90% training data and 10% test data using the linear and sigmoid kernels [5].
Moreover, in research conducted by Kumari et al. in 2017, SVM was applied to a dataset of smartphone product reviews to determine whether the sentiment polarity is positive or negative. The highest accuracy value obtained was 90.99%, and the authors also noted that SVM is a robust method [6].
Another study, performed by Najadat et al. in 2018, analyzed the sentiment of customer statuses on the official Facebook pages of three Jordanian telecommunications companies, using and comparing several supervised learning methods: K-Nearest Neighbors, Support Vector Machine, Naïve Bayes, and Decision Tree. The results

System Flow Design
The alumni sentiment and discussion topic analysis system has several stages:
a. Data collecting, which gathers a set of data for training and testing the machine learning model.
b. Data labeling, where each data point is manually labeled positive, neutral, or negative.
c. Text preprocessing, consisting of data cleaning to remove marks, symbols, numbers, etc.; case folding to change all letters into lowercase; tokenizing to break sentences into words; normalization to turn non-standard words into their standard forms; stemming to remove affixes from words; and stopword removal to erase words that are considered meaningless.
d. Splitting the data into training and testing sets.
e. Word weighting using TF-IDF and n-grams to extract the features.
f. Training the model with the Support Vector Machine algorithm.
g. Evaluating the performance of the model with the confusion matrix.
h. Topic modeling with Latent Dirichlet Allocation to extract topics from the data.
i. Lastly, topic visualization to visualize the topics for further analysis.
The general design of the system is shown in Figure 1.

Dataset
The dataset used in this study consists of 481 statuses containing discussions of Telkom University alumni, collected from the Telkom University Alumni Forum (FAST) Facebook group from August 27, 2018, to February 19, 2022. Data collection was performed through crawling using Selenium tools. Each of these statuses was then labeled with one of three classes: positive (1), neutral (0), or negative (-1). The data labeling process involved three people labeling each item in the dataset to reduce bias. For example, if the first person labels positive, the second person labels negative, and the third person labels positive, then the data is given the majority label, which is positive. The purpose of this process is to train the machine learning model to make predictions in the classification process. An example of labeled data can be seen in Table 1.
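The majority-vote labeling described above can be sketched as follows; the function name and the label encoding are illustrative, not taken from the original system:

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by the most annotators.

    votes -- list of labels from the annotators, e.g. [1, -1, 1],
    where 1 = positive, 0 = neutral, -1 = negative.
    """
    counts = Counter(votes)
    label, _ = counts.most_common(1)[0]
    return label

# Three annotators: two vote positive (1), one votes negative (-1).
print(majority_label([1, -1, 1]))  # -> 1 (positive)
```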

Preprocessing Text
Data preprocessing is the first step in text processing [4]. This step prepares the text in the dataset before the following process by transforming it into a better form so that the resulting information has good quality [9]. The preprocessing techniques applied in this study are data cleaning, case folding, tokenizing, normalization, stemming, and stopword removal [10].
a. Data cleaning removes punctuation marks, numbers, symbols, emoticons or emojis, and URL links from the text.
b. Case folding converts every letter into the same form, i.e., lowercase.
c. Tokenizing breaks sentences down into parts of words, called tokens [11]. This process simplifies the text into a concise input for the classification process [12].
d. Normalization turns non-standard words into standard ones and expands abbreviations or acronyms into their original words [9].
e. Stemming removes affixes from words and returns them to their base words [4]. This process is carried out using Sastrawi, an Indonesian stemming algorithm.
f. The last preprocessing step is stopword removal, which removes common words that are considered meaningless and occur frequently in the text [12].
An overview of this preprocessing stage is shown in Table 2.
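The preprocessing pipeline can be sketched as below. This is a minimal, dependency-free illustration: the normalization dictionary and stopword set are tiny illustrative samples, and the Sastrawi stemming step is only indicated in a comment rather than invoked.

```python
import re

# Illustrative slang dictionary; the actual normalization list is much larger.
NORMALIZATION = {"tdk": "tidak", "yg": "yang", "nggak": "tidak"}

def preprocess(text, remove_stopwords=True,
               stopwords=frozenset({"dan", "yang", "di"})):
    # a. Data cleaning: drop URLs, then anything that is not a letter or space.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # b. Case folding: lowercase every letter.
    text = text.lower()
    # c. Tokenizing: split the sentence into word tokens.
    tokens = text.split()
    # d. Normalization: map non-standard words to their standard forms.
    tokens = [NORMALIZATION.get(t, t) for t in tokens]
    # e. Stemming would be done here with Sastrawi
    #    (StemmerFactory().create_stemmer().stem(...)); omitted to keep
    #    this sketch dependency-free.
    # f. Stopword removal: drop common, low-information words.
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(preprocess("Tdk ada yg baru di http://example.com!"))
# -> ['tidak', 'ada', 'baru']
```

As tested in this study, the stopword removal step can be switched off via the `remove_stopwords` flag to compare both configurations.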

Term Frequency-Inverse Document Frequency (TF-IDF) Word Weighting
After preprocessing, the data is ready for the next stage, weighting with TF-IDF. Term Frequency-Inverse Document Frequency, commonly known as TF-IDF, is a method of determining the weight of a word by giving different weights to each word in a document based on the frequency of the word per document and the frequency of the word across all documents [13]. The first step in this process is to calculate the frequency of appearance of a word in a document (TF) with equation (1):

tf_t = f(t, d) (1)

where tf_t is the number of occurrences of the word t in document d. Next, the number of documents containing a certain word is counted, and its inverse (IDF) is calculated [13] with equation (2):

idf_t = log(D / df_t) (2)

where idf_t is the inverse document frequency, D is the number of all existing documents, and df_t is the number of documents containing the word t. The last step in this process is to calculate the TF-IDF value by multiplying the TF result by the IDF result [13] with equation (3):

tf-idf_t = tf_t × idf_t (3)
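Equations (1) through (3) can be computed directly as a minimal sketch, assuming the natural logarithm (implementations vary; some use base-10 logs or smoothing terms):

```python
import math

def tf_idf(term, document, corpus):
    """Weight of `term` in `document` per equations (1)-(3).

    document -- list of tokens; corpus -- list of such documents.
    """
    tf = document.count(term)                     # eq. (1): raw term count
    df = sum(1 for doc in corpus if term in doc)  # documents containing term
    idf = math.log(len(corpus) / df)              # eq. (2): idf_t = log(D / df_t)
    return tf * idf                               # eq. (3): tf-idf = tf * idf

corpus = [["alumni", "telkom", "alumni"],
          ["telkom", "university"],
          ["forum", "alumni"]]
# "alumni" appears twice in the first document and in 2 of 3 documents,
# so its weight there is 2 * log(3/2).
print(tf_idf("alumni", corpus[0], corpus))
```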

N-gram Feature Extraction
An n-gram is a sequence of n words taken from a string, where the words are split based on their order in a particular sentence [14]. In this study, the n-grams used were n=1, also called unigram, and n=2, or bigram. An example of unigram and bigram application can be seen in Table 3.
Table 3. N-gram application example
N-gram: Result
Unigram: "turut", "bahagia", "dan", "berbangga"
Bigram: "turut bahagia", "bahagia dan", "dan berbangga"
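The n-gram extraction illustrated in Table 3 can be sketched as a small helper (the function name is illustrative):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (joined as strings) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["turut", "bahagia", "dan", "berbangga"]
print(ngrams(tokens, 1))  # -> ['turut', 'bahagia', 'dan', 'berbangga']
print(ngrams(tokens, 2))  # -> ['turut bahagia', 'bahagia dan', 'dan berbangga']
```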

Classification with Support Vector Machine (SVM)
Data that has been weighted is then classified with the Support Vector Machine (SVM) algorithm. The concept of this classification method is to find the best hyperplane by measuring the margin around candidate hyperplanes and choosing the one that maximizes it [15]. A hyperplane is a subspace whose dimension is one less than that of the surrounding space, used to separate data in three or more dimensions [16]. SVM can perform nonlinear classification by operating in a vector space whose dimension is larger than the original feature space of the given dataset [3]. For this purpose, SVM provides kernel functions: linear, polynomial, RBF, and sigmoid [17]. This is the reason this classification method was chosen: in sentiment analysis, the data processed is not always linearly separable.
a. Linear
K(x, x′) = x · x′
b. Polynomial
K(x, x′) = (γ(x · x′) + r)^d
c. RBF
K(x, x′) = exp(−γ‖x − x′‖²)
d. Sigmoid
K(x, x′) = tanh(γ(x · x′) + r)
Note: x, x′ : data to be classified; γ : gamma, values from 0 to 1; d : degree, on polynomial kernels; r : constant
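The four kernel functions above can be evaluated directly; the sample vectors and the γ, r, and d values below are arbitrary illustrations:

```python
import math

def dot(x, xp):
    return sum(a * b for a, b in zip(x, xp))

def linear(x, xp):
    # K(x, x') = x . x'
    return dot(x, xp)

def polynomial(x, xp, gamma=0.5, r=1.0, d=2):
    # K(x, x') = (gamma * (x . x') + r) ** d
    return (gamma * dot(x, xp) + r) ** d

def rbf(x, xp, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * sq_dist)

def sigmoid(x, xp, gamma=0.5, r=1.0):
    # K(x, x') = tanh(gamma * (x . x') + r)
    return math.tanh(gamma * dot(x, xp) + r)

x, xp = [1.0, 2.0], [2.0, 0.0]
print(linear(x, xp))      # -> 2.0
print(polynomial(x, xp))  # -> 4.0  ((0.5*2 + 1)^2)
print(rbf(x, xp))         # exp(-0.5 * 5) = exp(-2.5)
print(sigmoid(x, xp))     # tanh(0.5*2 + 1) = tanh(2.0)
```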

Evaluation
The performance of the built model can be evaluated with several parameters, such as accuracy, precision, recall, and F-measure, also known collectively as the confusion matrix [18]. This evaluation method refers to True Positive (TP), meaning the data is correctly predicted positive; False Positive (FP), meaning the data is incorrectly predicted positive; True Negative (TN), meaning the data is correctly predicted negative; and False Negative (FN), meaning the data is incorrectly predicted negative. From these four terms, the confusion matrix values can be calculated with the following equations [18]:
a. Accuracy is a statistical measurement of how good the model is at classifying correctly.
b. Precision shows the ratio of the amount of data correctly labeled positive to the total amount of data predicted to be positive.
c. Recall shows the ratio of the amount of data correctly classified as positive to the total data that is actually positive.
d. F-measure (F1) is the harmonic mean of precision and recall.
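These metrics can be computed from the four counts as below. Note this shows the binary form; for the three-class problem in this study, the per-class values would be averaged. The example counts are illustrative.

```python
def evaluate(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = evaluate(tp=50, fp=10, tn=30, fn=10)
print(acc)   # (50 + 30) / 100 = 0.8
print(prec)  # 50 / 60
print(rec)   # 50 / 60
```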

Latent Dirichlet Allocation (LDA) Modeling
The Latent Dirichlet Allocation (LDA) method was used to model the topics of discussion in this study. LDA is an unsupervised learning method that is able to explore and produce topics from a large number of documents, making it possible to identify the topic composition that best represents the content of each document [8]. The concept of this method is that a document consists of several topics, and each topic consists of a distribution of words [19]. Once the LDA model is built, the topics are visualized using the pyLDAvis library to make analysis easier.

RESULT AND DISCUSSION
In this study, sentiment classification of Indonesian texts was tested using the Support Vector Machine (SVM) algorithm with a linear kernel. The dataset contains 481 data points grouped into three classes, -1, 0, and 1, representing negative, neutral, and positive sentiment respectively. From the data labeling process, 369 data points with neutral sentiment, 72 with positive sentiment, and 40 with negative sentiment were obtained. The large number of neutral sentiments is due to the majority of statuses in the FAST group discussing job vacancy information, invitations to events, and news intended for alumni. The test scenarios focused on the preprocessing stage and the addition of n-gram feature extraction. The first scenario aims to find the combination of preprocessing techniques that produces the best accuracy value. The second scenario aims to determine the influence of applying unigrams and bigrams on the performance of the classification model. In each test scenario, hyperparameter tuning was performed using the GridSearchCV method with 10-fold cross-validation, which aims to find the best combination of SVM hyperparameters so that the results obtained are maximized.
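The setup described above (TF-IDF weighting, a linear-kernel SVM, and GridSearchCV tuning) can be sketched with scikit-learn. The toy texts, labels, and parameter grid are illustrative, and 2-fold cross-validation is used here only so the tiny example runs; the study itself used 10-fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for the labeled statuses: 1 = positive, 0 = neutral, -1 = negative.
texts = ["selamat dan sukses", "info lowongan kerja", "pelayanan sangat buruk",
         "turut bahagia dan bangga", "undangan acara alumni", "sangat kecewa sekali"]
labels = [1, 0, -1, 1, 0, -1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),  # unigram TF-IDF features
    ("svm", SVC(kernel="linear")),
])

# Small illustrative grid; the study searched hyperparameters with 10-fold CV.
grid = GridSearchCV(pipeline, {"svm__C": [0.1, 1, 10]}, cv=2)
grid.fit(texts, labels)
print(grid.best_params_)
print(grid.predict(["selamat atas kelulusan"]))
```

Changing `ngram_range` to `(2, 2)` gives the bigram variant tested in the second scenario.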

The Effect of Preprocessing Testing Result
In this scenario, testing is carried out at the preprocessing stage by examining the effect of each preprocessing technique. The complete preprocessing pipeline (data cleaning, case folding, tokenizing, normalization, stemming, and stopword removal) was used for the first test, followed by testing without stopword removal, then without the normalization, stemming, and stopword removal processes, and finally without any preprocessing. The model uses two split sizes, 70% training data with 30% test data and 80% training data with 20% test data, to observe the effect of the split ratio on the confusion matrix values. In addition, TF-IDF and unigram feature extraction are applied in this sentiment classification model. The results of the first scenario can be seen in Table 4. Based on these results, it can be concluded that the split size and the different preprocessing techniques have little influence on accuracy but a noticeable influence on the other confusion matrix parameters: precision, recall, and F1-measure. Furthermore, hyperparameter tuning was applied to search for the best combination of SVM parameters. The best parameter results are shown in Table 5, and the effect of applying the best SVM parameters on the confusion matrix measurements can be seen in Table 6. Based on the experimental results in Table 6, hyperparameter tuning using GridSearchCV with 10-fold cross-validation was able to increase the confusion matrix values. Where a decrease occurred, it was caused by the parameter grid in the code not covering every possible SVM hyperparameter combination; as a result, the newly selected hyperparameter combination was worse than the default.
However, overall, the implementation of this method can increase the accuracy, precision, recall, and F1-measure values.
In this test of the effect of preprocessing techniques, the highest accuracy, 80.56%, was obtained by the experiment without stopword removal and a split of 70% training data and 30% test data. In contrast, the lowest accuracy, 75.17%, occurred in the experiment without any preprocessing and the same 70:30 split. This demonstrates the influence of each preprocessing technique: normalization and stemming are essential, while stopword removal is better left out. In normalization, words such as "tdk", "yg", and "nggak" are changed to "tidak", "yang", and "tidak". Without normalization, these would be treated as different words despite having the same meaning ("yg" would be treated differently from "yang"), causing the system to misclassify the sentence. With stemming, each word in a sentence is reduced to its base form, so that an affixed word is treated the same as its base word; for example, "mendaftarkan" is returned to its base word, "daftar". Without stemming, "mendaftarkan" and "daftar" would be treated as different words, and, as with skipping normalization, the system would misclassify. Stopword removal, however, is better not performed, because it can discard information from a sentence that affects classification. For example, in the phrase "tidak kreatif", whose meaning is negative, stopword removal eliminates the word "tidak", leaving only "kreatif", which tends to be positive, so the system would classify the sentence as positive instead of negative. These results are also supported by the lower accuracy of the tests that apply only data cleaning, case folding, and tokenizing, compared to the tests using complete preprocessing.

The Effect of N-gram Feature Extraction Testing Result
In the second scenario, the test was performed by adding n-gram feature extraction, with the complete preprocessing pipeline and classification by the SVM method using a linear kernel. The purpose of this experiment is to determine the influence of unigrams and bigrams on the confusion matrix values produced by the model. The results of this scenario can be seen in Table 7. As in the previous scenario, the best SVM parameters were first obtained with the GridSearchCV algorithm; the results of that process can be seen in Table 8. The test results after applying the best SVM parameters based on 10-fold cross-validation are shown in Table 9. Based on the experimental results in Table 9, it can be concluded that feature extraction using unigrams resulted in higher accuracy than bigrams in both the 70:30 and 80:20 split scenarios. This is because more single-word features are found in the training data, whereas the probability of the same two-word sequence reappearing in the training data is lower. By applying unigrams in the scenario with an 80:20 train-test split, complete preprocessing (data cleaning, case folding, tokenizing, normalization, stemming, and stopword removal), and TF-IDF word weighting, the highest accuracy obtained was 80.21%, meaning the model is able to classify the data fairly well. However, based on the recall value, the ability of the model to find all positive data is still low.

LDA Topic Modelling Result
The data that has passed through the complete preprocessing stage then continues to the following process, the topic modeling stage, using the LDA algorithm. With this method, 10 topics were extracted from the dataset, each topic having 10 related words. The results of the topic extraction can be seen in Table 10.