Feature Expansion Using Word2vec for Hate Speech Detection on Indonesian Twitter with Classification Using SVM and Random Forest

Abstract− Hate speech is one of the most common problems on Twitter. Tweets are limited to 280 characters, which produces many word variations and possible vocabulary mismatches. This study therefore aims to overcome these problems and build a hate speech detection system for Indonesian Twitter. It uses 20,571 tweets and implements a Feature Expansion method based on Word2vec to overcome vocabulary mismatches. Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) are applied to represent feature values in tweets. Two classification methods are examined: Support Vector Machine (SVM) and Random Forest (RF). The final results show that Feature Expansion with TF-IDF weighting and Random Forest classification gives the best accuracy, 88.37%. Feature Expansion with TF-IDF weighting increased the accuracy across several tests, detecting hate speech while overcoming vocabulary mismatches.


INTRODUCTION
Social media is a source of information, entertainment, and communication for exchanging messages online regardless of distance and time [1]. Twitter is one of the social media platforms Indonesian people use most often [2]; it is used to send text messages called tweets [3]. Based on information from the General Resources for Post and Information Technology at the Ministry of Communication and Information, Indonesia has 19.5 million Twitter users and ranks fifth among countries with active Twitter users [4]. Twitter gives users freedom of expression when uploading or commenting on tweets [2], which can cause problems such as hate speech [5]. In addition, Twitter limits tweets to 280 characters; this restriction produces many word variations, allowing vocabulary mismatches in tweets [6]. Hate speech can occur when individuals or groups communicate with each other; it includes provocation, incitement, or insult directed at race, gender, ethnicity, disability, nationality, religion, sexual orientation, or other characteristics [7]. According to the Indonesian Ministry of Communication and Information, hate speech is the most frequent offense on social media [5]. Classifying hate speech is one way to deal with this problem. Many researchers have studied this topic, but the final results are sometimes unsatisfactory, such as low accuracy, which can be caused by limited training data [8]. The hate speech detection system developed here is expected to reduce hate speech on Twitter and help create healthy habits.
Studies on hate speech detection have been widely conducted. The study [9] used 14,509 tweets and compared classification methods such as RF, AdaBoost, and NN; the Random Forest method produced the best accuracy, 72.2%, and according to the authors also had better recall and F1-measure than AdaBoost and the Neural Network. Other researchers compared classifiers such as SVM, NB, BLR, and RFDT in three test scenarios, one of which compared feature-extraction performance across the four algorithms. The Random Forest Decision Tree (RFDT) classifier with the word n-gram feature gave the best F-measure, up to 93.5% [10]. The weakness of that research is that only 520 tweets were used; the data had gone through preprocessing, so there was a significant reduction from the initial data. Another experiment [11] used a set of 1,000 hate tweets and applied labeling based on hate speech against ethnicity, race, religion, and intergroup relations, plus a neutral class. That study tested SVM kernels, and the best accuracy, 93%, was obtained with the RBF kernel and the TF-IDF method.
Word Embedding methods represent words as numeric vectors. Researchers [12] compared weighting schemes such as Binary, TF, TF-IDF, and Word2vec, and obtained the best accuracy, 90%, with SVM classification and the Word2vec method. According to the authors, a word-similarity method is appropriate for weighting in the Word Embedding approach. The study [13] applied TF-IDF, n-gram, Word2vec, and Doc2vec methods to 14,509 tweets with classifiers such as NB, SVM, LR, KNN, DT, RF, AdaBoost, and MLP. The best accuracy, 79%, was obtained with SVM and a combination of TF-IDF and bigram features; from these results, TF-IDF was superior to Word2vec and Doc2vec. According to the authors, Word2vec cannot handle out-of-vocabulary (OOV) words in the Twitter domain and also requires a large amount of training data. Another study applied the same method [6], using a dataset of 19,401 tweets from 97 Twitter accounts together with data collected from Indonews and Google

System Design
This section explains the design of the hate speech detection system. It consists of several steps: data crawling, data labeling, data preprocessing, feature extraction (TF-IDF, BOW), feature expansion (Word2vec), data classification (SVM, RF), and system evaluation with the confusion matrix method. Figure 1 shows the flow of the system.

Data Crawling
Data were collected through the Twitter Application Programming Interface (API) using a crawler built in Python [5]. A total of 20,571 Indonesian-language tweets were collected based on specific topics or keywords; each record contains information such as the keyword, the tweet text, and the username.

Text Preprocessing
Text preprocessing selects and filters noisy or dirty data to make it cleaner and more structured [11]. It is essential to apply it before the next steps, because better-structured data can improve classification accuracy [1]. The process consists of several steps. Data cleaning deletes mentions, hashtags, characters/symbols such as punctuation marks, URLs, numbers (0-9), and emoticons [14]. Case folding converts capital letters to lowercase [6]. Normalization checks each tweet for abbreviated, informal, and slang words and converts them to standard words. Stop-word removal eliminates words considered unimportant in the tweet text [6]. Stemming reduces affixed words to their root words. Tokenizing splits the tweet into word tokens [1].
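The steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' code: the `slang_map` and `stopwords` arguments stand in for the normalization dictionary and stop-word list used in the study, and the stemming step (typically done with an Indonesian stemmer such as Sastrawi) is omitted for brevity.

```python
import re

def preprocess(tweet, slang_map=None, stopwords=None):
    """Sketch of the preprocessing pipeline described in the text."""
    slang_map = slang_map or {}
    stopwords = stopwords or set()
    # Data cleaning: drop mentions, hashtags, URLs, and digits
    tweet = re.sub(r"@\w+|#\w+|https?://\S+|[0-9]+", " ", tweet)
    # Drop remaining punctuation/symbols
    tweet = re.sub(r"[^\w\s]", " ", tweet)
    # Case folding
    tweet = tweet.lower()
    # Tokenizing
    tokens = tweet.split()
    # Normalization: map informal/slang words to standard words
    tokens = [slang_map.get(t, t) for t in tokens]
    # Stop-word removal
    return [t for t in tokens if t not in stopwords]

print(preprocess("@user Gak suka BANGET!!! https://t.co/x 123",
                 slang_map={"gak": "tidak"}, stopwords={"banget"}))
# -> ['tidak', 'suka']
```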

Bag of Word (BOW)
BOW represents a document as a vector of word frequencies; it ignores word order and grammar but still preserves vocabulary diversity [15]. This method is simpler than TF-IDF because it does not require a special weighting formula.
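A minimal sketch of the BOW representation (not the study's implementation, which in practice would use a library such as scikit-learn's `CountVectorizer`):

```python
from collections import Counter

def bow_matrix(docs):
    """Bag-of-Words: each document becomes a vector of raw word
    frequencies over a shared vocabulary; word order is ignored."""
    vocab = sorted({w for d in docs for w in d.split()})
    matrix = []
    for d in docs:
        counts = Counter(d.split())
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

vocab, X = bow_matrix(["ujaran kebencian ujaran", "kebencian tidak"])
print(vocab)  # ['kebencian', 'tidak', 'ujaran']
print(X)      # [[1, 0, 2], [1, 1, 0]]
```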

Term Frequency -Inverse Document Frequency (TF-IDF)
Another method is TF-IDF, a feature-weighting technique based on numerical statistics that indicates how relevant a word/term is to a set of documents [16]. The TF-IDF method combines the TF and IDF values: TF counts the occurrences of a word in a document, while IDF measures how the word is distributed across the documents [11]. The TF-IDF weight is given in equation (1):

W(t,d) = tf(t,d) × log(N / df(t)) (1)

where W(t,d) is the weight of term t in document d, tf(t,d) is the TF value, N is the total number of documents, and df(t) is the number of documents containing word t [17].
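Equation (1) can be computed directly. The sketch below is illustrative only (library implementations such as scikit-learn use a smoothed variant of the same idea):

```python
import math

def tfidf(docs):
    """Compute W(t,d) = tf(t,d) * log(N / df(t)) for each document."""
    tokenized = [d.split() for d in docs]
    N = len(tokenized)                     # number of documents
    vocab = {w for d in tokenized for w in d}
    df = {w: sum(w in d for d in tokenized) for w in vocab}
    # One weight dictionary per document
    return [{w: d.count(w) * math.log(N / df[w]) for w in d}
            for d in tokenized]

w = tfidf(["benci benci kamu", "kamu baik"])
# "benci" occurs twice in doc 0 and in 1 of 2 docs: 2 * log(2)
print(round(w[0]["benci"], 3))  # 1.386
# "kamu" occurs in every document, so its weight is 0
print(w[0]["kamu"])             # 0.0
```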

N-Gram
N-gram is a text-mining method used in this study to cut text data into sequences of n units. Based on the unit, n-grams are divided into character n-grams and word n-grams [18]. A character n-gram cuts the input text into sequences of n characters, while a word n-gram cuts it into sequences of n words. This study uses word n-grams, specifically unigrams, bigrams, and trigrams.
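A short sketch of word n-gram extraction as described above:

```python
def word_ngrams(tokens, n):
    """Slide a window of n tokens over the text to form word n-grams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "saya tidak suka kamu".split()
print(word_ngrams(tokens, 1))  # unigrams: the tokens themselves
print(word_ngrams(tokens, 2))  # ['saya tidak', 'tidak suka', 'suka kamu']
print(word_ngrams(tokens, 3))  # ['saya tidak suka', 'tidak suka kamu']
```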

Word2vec
Word2vec is a Word Embedding method that converts words into vectors. It is an unsupervised learning algorithm using a neural network consisting of a hidden layer and a fully connected layer [19]. The method was developed by Mikolov in 2013 and has two architectures: Continuous Bag of Words (CBOW) and Skip-Gram. CBOW predicts the target word from the context of the surrounding words, whereas Skip-Gram predicts the surrounding context from the target word [20]. Figure 2 shows the CBOW and Skip-Gram architectures.

Figure 2. CBOW and Skip-gram Architecture
The input to Word2vec is a text corpus, and the output is a set of vectors [21]. This study applies Word2vec to determine each word's vector and to measure semantic closeness between words, maximizing the likelihood of predicting a word from its surrounding context. Word similarity between vectors is computed with the cosine similarity method.
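The cosine similarity used to rank similar words can be sketched as below. The three-dimensional vectors are toy values for illustration; real Word2vec embeddings (e.g. as trained with gensim) are dense vectors of typically 100-300 dimensions learned from the corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: 1 for identical
    directions, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for two semantically close words
print(cosine_similarity([1.0, 2.0, 0.5], [0.9, 2.1, 0.4]))
```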

Feature Expansion
Feature Expansion is a method to solve the vocabulary-mismatch problem. Its concept is to identify features in a tweet whose value is 0 (zero) and replace them with semantically related words drawn from a word dictionary (corpus) built with the Word2vec method [6]. The data used to build this corpus, and to train the Word2vec model, consist of news data, tweet data, and a combination of the two. This study uses 142,544 news records from various sources and topics, obtained from previous researchers. The corpus contains the vocabulary and its similar words, and the Feature Expansion process takes similar words in ranked order. Table 1 explains how similar words are taken, and Table 2 shows a sample of the corpus built from news data: the word "Twitter" with its Top 10 similar words.
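The replacement step can be sketched as follows. This is an illustrative interpretation, not the authors' implementation: `similarity_corpus`, mapping each word to its ranked list of similar words, is an assumed structure for the Word2vec dictionary, and the borrowed value is the weight the similar word already has in the same tweet vector.

```python
def expand_features(vector, vocab, similarity_corpus, top_n=1):
    """Replace a zero-valued feature with the weight of its most
    similar word (Top-n from the corpus) present in the tweet."""
    index = {w: i for i, w in enumerate(vocab)}
    expanded = list(vector)
    for i, word in enumerate(vocab):
        if expanded[i] != 0:
            continue  # feature already has a value; leave it
        for similar in similarity_corpus.get(word, [])[:top_n]:
            j = index.get(similar)
            if j is not None and vector[j] != 0:
                expanded[i] = vector[j]  # borrow the related word's weight
                break
    return expanded

vocab = ["benci", "kesal", "suka"]
corpus = {"benci": ["kesal", "marah"]}  # hypothetical similarity lists
# "benci" is 0 but its Top-1 similar word "kesal" has weight 1.2
print(expand_features([0.0, 1.2, 0.0], vocab, corpus, top_n=1))
# -> [1.2, 1.2, 0.0]
```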

Table 2. Word Dictionary (Corpus) on News Data
Figure 3 shows a flowchart of the feature expansion concept.

Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised (directed) machine learning algorithm whose concept is to train on data in order to predict class labels [4]. SVM is widely implemented by researchers because it is grounded in statistical learning, which tends to provide good results [22].
SVM defines the best hyperplane (the boundary between two classes) by maximizing the distance between the classes. Determining the best hyperplane requires calculating the margin, defined as the distance between the hyperplane and the closest point of each class (the support vectors) [22].
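Once trained, a linear SVM classifies a sample by the side of the hyperplane it falls on. The sketch below shows only this decision step; the weights and bias are illustrative stand-ins for values a trained model (e.g. scikit-learn's `SVC`) would learn from the data.

```python
def svm_decision(w, b, x):
    """Sign of w.x + b: which side of the separating hyperplane
    the feature vector x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy hyperplane (weights and bias are hypothetical, not learned here)
w, b = [0.8, -0.5], -0.1
print(svm_decision(w, b, [1.0, 0.2]))  # 0.8 - 0.1 - 0.1 = 0.6 -> class 1
print(svm_decision(w, b, [0.0, 1.0]))  # -0.5 - 0.1 = -0.6  -> class -1
```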

Random Forest (RF)
Random Forest is an ensemble method [23] that combines several classifiers to find the best prediction. Random Forests are generally used to classify large amounts of data [9]. The algorithm lets each decision tree vote on the result and predicts the class with the most dominant vote [13].
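The voting step described above can be sketched in a few lines (the trees themselves are omitted; a library such as scikit-learn's `RandomForestClassifier` would train them):

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Aggregate the votes of the individual decision trees and
    return the majority class."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Votes from five hypothetical trees for one tweet
print(forest_predict(["hate", "hate", "non-hate", "hate", "non-hate"]))
# -> 'hate'
```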

Table 1. Taking Similar Words Based on Top Similarity
Top 1: take word similarity based on the order of the top one.
Top 5: take word similarity based on the order of the top five.
Top 10: take word similarity based on the order of the top ten.

Based on Table 3, performance can be measured in terms of accuracy, precision, recall, and F1-score [13]. Accuracy identifies how often the classification model predicts the data correctly, equation (2). Precision identifies how many of the model's positive predictions are correct, equation (3). Recall determines the model's success rate in finding the relevant information, equation (4). F1-score is the harmonic mean of precision and recall, equation (5).

Accuracy = (TP + TN) / (TP + TN + FP + FN) (2)
Precision = TP / (TP + FP) (3)
Recall = TP / (TP + FN) (4)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) (5)
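The four metrics, equations (2)-(5), follow directly from the confusion-matrix counts. The counts in the example are hypothetical, for illustration only:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from the
    confusion-matrix counts (equations (2)-(5))."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one test run
print(metrics(tp=80, tn=90, fp=10, fn=20))
```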

Data
This research uses a collection of tweets and news data. The tweet topics were selected from trending topics on Twitter between October 2020 and June 2021, plus a set of abusive keywords chosen to train the model to recognize hate speech tweets. The news data come from several Indonesian news media sources, with a total of 1,111 articles and 142,544 news records. Unlike the tweet data, the news data were obtained from the researchers of [6], the reference study for the method applied here. The news data contain information such as the news topic, media source, URL, and news text. Table 4 shows the crawled data together with the list of keywords.

Table 4. The Crawled Data with the List of Keywords
Tweet data were labeled manually by four people into two classes: hate speech and non-hate speech. In general, tweets classified as hate speech contain elements of provocation, humiliation, or discrimination against a person or group, or even threats based on ethnicity, religion, race, or intergroup relations (SARA) [24], as well as harsh words. Tweets that are not hate speech contain neutral words or sentences with a positive tone. From the data labeling process,

Pre-processing
Several preprocessing steps were carried out on the tweet and news data; this section describes the results. Table 6 shows an example of the preprocessing results on tweet data.

N-Gram
This study uses unigram, bigram, and trigram features; Table 7 reports the number of words produced by each. In practice, not all bigram and trigram features could be processed at the classification step because of limited hardware memory. The solution was to cap the number of features/words used, testing limits of 10,000, 20,000, and 30,000 to find the setting with the best accuracy. The unigram feature needed no cap, since only 17,943 word features were formed and the hardware could still process them.
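Capping the feature count amounts to keeping only the most frequent n-grams (library vectorizers expose this as a `max_features` parameter); a minimal sketch:

```python
from collections import Counter

def limit_features(token_docs, max_features):
    """Keep only the max_features most frequent n-grams so the
    feature matrix fits in memory."""
    counts = Counter(t for doc in token_docs for t in doc)
    return [t for t, _ in counts.most_common(max_features)]

# Two documents already cut into bigrams
docs = [["a b", "b c", "a b"], ["b c", "c d"]]
print(limit_features(docs, 2))  # the two most frequent bigrams
```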

Built Corpus Data
The word dictionary (corpus) is built from news data, tweet data, and a combination of the two. At this step, the Word2vec method finds similar vocabulary. Each dataset was tested at the similarity levels described in Section 2 (Top 1, Top 5, and Top 10); Table 8 reports the vocabulary size formed from each dataset.

Test Result
Several scenarios were tested, applying the Bag of Words (BOW), TF-IDF, and n-gram methods to represent values and weights in the tweet data. The Feature Expansion method was then applied to deal with vocabulary mismatches, and the data were classified with Support Vector Machine (SVM) and Random Forest. Each classification was repeated five times and the accuracy averaged. The data were split at a 90:10 ratio, 90% for training and 10% for testing; this ratio was chosen because it gave better accuracy than the others. Based on Table 9, the best baseline accuracy is 86.60% for Support Vector Machine (SVM) and 86.49% for Random Forest (RF). In this scenario, the values of the two baselines are not very different, with the SVM baseline slightly ahead.
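The evaluation protocol (shuffle, split 90:10, classify, average over five runs) can be sketched as below. The `evaluate` callback stands in for training and scoring a classifier, an assumption made here to keep the sketch self-contained:

```python
import random

def average_accuracy(samples, evaluate, runs=5, train_ratio=0.9):
    """Repeat a shuffled 90:10 train/test split `runs` times and
    average the accuracy returned by `evaluate(train, test)`."""
    accuracies = []
    for seed in range(runs):
        idx = list(range(len(samples)))
        random.Random(seed).shuffle(idx)       # reproducible shuffle
        cut = int(len(idx) * train_ratio)
        train, test = idx[:cut], idx[cut:]
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / len(accuracies)

# Dummy evaluate function returning a fixed score, for illustration
print(average_accuracy(list(range(100)), lambda tr, te: 0.86))
```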

Effect of TF-IDF on Baseline
The second scenario tests the TF-IDF method, applied to each baseline to see the effect of the weighting. Table 10 shows the results with SVM and Random Forest classification: TF-IDF weighting increases the accuracy of both classifiers, by up to 0.673% for the SVM baseline and up to 1.174% for the Random Forest baseline.

Feature Expansion
The third scenario applies the Feature Expansion method using a corpus built from news data, tweet data, and a combination of the two, at the similarity levels Top 1, Top 5, and Top 10. Feature Expansion is applied to each baseline (SVM and Random Forest) to measure its effect; Table 11 shows the results. Feature Expansion raises the SVM baseline to an accuracy of 87.41%, obtained with the news-data corpus at Top 1, an increase of up to 0.93%. It also raises the Random Forest baseline by up to 0.57%, to 87.43%, obtained with the combined news-and-tweet corpus at Top 10. Figure 4 shows the percentage of feature replacement behind the best result of each classifier. The SVM classification reaches its highest accuracy of 87.41% with the news-data corpus at Top 1, yet its feature-replacement percentage, 15.65%, is relatively lower than the others; the best SVM accuracy thus tends to occur at a lower replacement percentage. The Random Forest classification reaches its highest accuracy of 87.43% with the combined news-and-tweet corpus at Top 10, where the replacement percentage, 77.91%, is relatively higher than the others; Random Forest accuracy tends to increase with the highest replacement percentage.
The fourth scenario implements Feature Expansion together with the TF-IDF method on the baselines. The concept is the same as in the third scenario, with only the TF-IDF method added, and it uses the same corpora (news, tweet, and combined news+tweet data) and Top-similarity settings. Based on Table 12, Feature Expansion + TF-IDF increases the accuracy of the SVM baseline: the best result, 87.55%, is obtained with the combined news+tweet corpus at Top 1, an increase of 1.1%. It also improves the Random Forest baseline: the best result, 88.37%, is obtained with the combined corpus at Top 10, an increase of 1.64%. Figure 5 shows the feature-replacement statistics for the most optimal result of each classifier, identifying the replacement percentage behind the best accuracy for each corpus. The SVM classification reaches its highest accuracy of 87.55% with the combined corpus at Top 1, yet its replacement percentage, 39.80%, is relatively lower than the others, so the best SVM accuracy again tends to occur at a lower replacement percentage. The Random Forest classification reaches its highest accuracy of 88.37% with the combined corpus at Top 10, where the replacement percentage, 77.91%, is relatively higher than the others; feature replacement again tends to raise Random Forest accuracy at the highest percentage.

CONCLUSION
The hate speech detection system built in this study uses Feature Expansion (Word2vec) and feature extraction (TF-IDF and BOW). For Feature Expansion, a corpus was created from several datasets, taking word similarities at Top 1, Top 5, and Top 10. The classification methods used are Support Vector Machine (SVM) and Random Forest (RF). The results show that Feature Expansion combined with TF-IDF weighting increases the accuracy of both classifiers and performs better than the other test scenarios. SVM reaches an accuracy of 87.55%, an increase of 1.10% over its baseline, while Random Forest reaches 88.37%, an increase of 1.64%. The highest accuracies were obtained with the combined news+tweet corpus at the Top 1 and Top 10 similarity levels. The researchers observed that taking Top 5 similarity could also increase accuracy for every corpus, but the optimal and stable results remain at Top 1 and Top 10 on the combined data. Comparing the two classifiers, most of the highest accuracies across the test scenarios came from Random Forest; however, SVM sometimes achieved better accuracy than Random Forest, which happened when the Feature Expansion method was implemented.