Building Synonym Set for Indonesian WordNet using Commutative Method and Hierarchical Clustering

WordNet is a collection of synonym sets (synsets), each consisting of words that share the same meaning. The development of Indonesian WordNet aims to build an application that can store and present the relations between words. A synonym set is a set of one or more words that have a similar meaning, or synonym relation, taken from the Indonesian Thesaurus. In previous studies, synsets were built with several approaches, one of which was clustering to produce synsets and WSD (Word Sense Disambiguation). This research aims to discover the semantic similarities between words in the Indonesian Thesaurus automatically, and to measure the performance of the Agglomerative Hierarchical Clustering method for the development of Indonesian synsets. Performance is evaluated using the F-measure against a gold standard.


INTRODUCTION
WordNet is a collection of synonym sets (synsets), each consisting of interrelated words that have an equivalent meaning or sense [1]. WordNet began as an English semantic dictionary first built at Princeton University; as technology developed, it became one of the most widely used reference resources. Language dictionaries in general focus on words, while WordNet focuses on the meanings of words, or synonyms. In WordNet, words from several classes, such as nouns, verbs, adjectives, and adverbs, are grouped into synsets. A row of words in WordNet that represents one meaning is called a synset.
In the process of building WordNet, the first thing to do is to produce synsets, or collections of synonyms that have the same meaning [1]; that is, words are grouped into a synset according to their meaning. This is because the synset is the basic concept that supports the formation of semantic relations in the lexical database [2]. The monolingual resource used as the lexical resource is a thesaurus, because a thesaurus contains words that are linked by synonym relations [1]. A thesaurus entry that has gone through the extraction process produces one or more synsets. One way to combine the synsets produced by that process into the best synsets is to use clustering techniques. Therefore, further research is needed to find out the performance of clustering techniques in the development of synsets for Indonesian WordNet.
Previously, Indonesian WordNet was developed using Hierarchical Clustering. In that study, the input data were synsets generated from the commutative process, which were then grouped and combined in the clustering process. However, those input data were generated by a manual commutative process. This research focuses on two main stages: the first is synset extraction, and the second is combining synsets using a clustering technique when a word produces more than one synset in the first stage. In the synset extraction process, valid synsets are produced with the Commutative method using an available monolingual resource, Thesaurus Bahasa Indonesia. Commutative means that if a word k1 has a synonym k2, then k2 must also be a synonym of k1. In fact, commutative relations like this do not always occur in Thesaurus Bahasa Indonesia [1]. For the second stage, the clustering technique used in this research is Agglomerative Hierarchical Clustering.
The purpose of this research is to find the semantic similarities between words in Thesaurus Bahasa Indonesia automatically, and to implement the Agglomerative Hierarchical Clustering method in the system in order to determine the performance of that clustering technique in the development of synsets for Indonesian WordNet.
The flow of the system is depicted in Figure 1. The initial input is a word from Thesaurus Bahasa Indonesia, which is processed to find the other words with an equivalent meaning. Next comes synset extraction, the process of identifying synsets in the test data using the commutative method. The extracted synsets then go through preprocessing, which removes excess characters and spaces from the synsets produced by the previous process and ensures that the system can read the data properly. Finally, the preprocessed synsets are grouped and combined using Agglomerative Hierarchical Clustering.
The system is expected to produce synsets from the words entered as input, to find the semantic similarities between words in Thesaurus Bahasa Indonesia, and to show the performance of clustering techniques in the development of synsets for Indonesian WordNet.

WordNet
WordNet is an online lexical database. Its development is based on psycholinguistic theories of human lexical memory. In WordNet, verbs, nouns, adjectives, and adverbs are grouped into collections of cognitive synonyms (synsets), each representing a different concept [3].
According to [4], WordNet contains information about 155,000 words from various classes, such as nouns, verbs, adjectives, and adverbs; these words are grouped into synsets according to their meaning. At present, Indonesian WordNet has 1203 synsets and 1659 unique words. The number of semantic relations that can be made from the existing synsets reaches 2261 relations [5].

Automatic WordNet Development
Indonesian WordNet was previously developed based on Linked Data. The stages of that development included identification of data sources, data extraction, data transformation, loading data into relational databases, and mapping relational databases to RDF models [6].
Indonesian WordNet was also previously developed using Hierarchical Clustering. In that study, the input data were synsets generated from a manual commutative process, which were then grouped and combined in the clustering process [7]. In this study, by contrast, the input is a word taken from Thesaurus Bahasa Indonesia, which is processed using an automatic commutative process. The synsets generated from the commutative process are then grouped and combined in the clustering process.

Synonym Set (Synset)
A Synonym Set (Synset) is a collection of one or more words that have a synonym relation. The members of the set can replace each other without changing the sense or meaning of the sentence that contains them [1]. The synset plays a very important role in the WordNet of any language, because all semantic relations connect synsets, not words. Figure 2 shows an example of two words that have a synonym relation, "akomodasi" and "fasilitas". These two words have the same meaning: something that serves to meet the needs for running a particular business smoothly. A synonym set does not always have more than one member; it can also have a single member, which happens when the entry word has no synonym.

Commutative Method
A synset is valid if it satisfies the commutative concept. In Princeton WordNet, synonym relations should be commutative, which means that if a word k1 has a synonym k2, then k2 must also be a synonym of k1 [1]. Valid synsets are found using a matrix table. Table 1 shows the matrix table of the word "fana". The senses of "fana" are "sementara" and "temporer"; these words are used as the rows and columns [8]. In Table 1, a cell with a true value (T) indicates that the relation between the words in its row and column is commutative, and a cell with a false value (F) indicates the opposite. For example, the cell at the row of "temporer" and the column of "fana" is false (F), which means that "fana" is not a sense of "temporer". The synset produced from that matrix table is [fana, sementara].
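The commutative check described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the `thesaurus` dictionary is a hypothetical toy mapping that mirrors the "fana" example (it is not copied from Tesaurus Bahasa Indonesia), and the function names `is_commutative`, `commutative_matrix`, and `valid_synset` are assumptions introduced here.

```python
# Hypothetical toy thesaurus: each entry word maps to the words listed under it.
# The entries only mirror the "fana" example; they are not real thesaurus data.
thesaurus = {
    "fana": {"sementara", "temporer"},
    "sementara": {"fana", "temporer"},
    "temporer": {"sementara"},  # does not list "fana", so that relation is not commutative
}


def is_commutative(thesaurus, k1, k2):
    """Return True when k1 lists k2 and k2 lists k1 (a word is trivially related to itself)."""
    if k1 == k2:
        return True
    return k2 in thesaurus.get(k1, set()) and k1 in thesaurus.get(k2, set())


def commutative_matrix(thesaurus, entry):
    """Build the T/F matrix whose rows and columns are the entry word and its senses."""
    words = [entry] + sorted(thesaurus.get(entry, set()))
    matrix = [[is_commutative(thesaurus, row, col) for col in words] for row in words]
    return words, matrix


def valid_synset(thesaurus, entry):
    """Simplifying assumption: keep the entry word plus every sense that points back to it."""
    senses = thesaurus.get(entry, set())
    return {entry} | {w for w in senses if is_commutative(thesaurus, entry, w)}


words, matrix = commutative_matrix(thesaurus, "fana")
for word, row in zip(words, matrix):
    print(word, ["T" if cell else "F" for cell in row])
print(valid_synset(thesaurus, "fana"))
```

With this toy data the cell at row "temporer" and column "fana" is F, and the resulting synset contains only "fana" and "sementara", as in the example of Table 1.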

Synset Extraction
Before the preprocessing stage is carried out, there is an extraction process that uses the commutative concept. The extraction process is carried out in the following steps [9].
a. Search for the senses of the entry word. The meaning of the selected word is looked up in the thesaurus. For example, the chosen word "ahad" has the senses "minggu", "esa", "satu", and "tunggal".
b. Search for the synonyms of every sense from step (a).
c. Search for "ahad" in the senses being examined. The sets of synonyms of the senses of "ahad" are searched, as seen in Table 2. The next step is to look for the sets of synonyms that contain the keyword "ahad"; these words, i.e. "minggu", "esa", "satu", and "tunggal", and "ahad" itself are then made into the columns and rows of the matrix.
d. Identify the prospective synsets. At this stage, the candidate synsets that can be generated from each of the words in the dataset are identified.
f. Eliminate candidate synsets that are subsets of other candidate synsets.
g. Take the candidates that remain after elimination. The remaining synset candidates are taken and used as synsets. For example in "ahad" the resulting final synset is. .
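The elimination in steps (f) and (g) can be sketched as follows. This is only an illustration under stated assumptions: the function name `eliminate_subsets` is introduced here, and the candidate synsets for "ahad" are hypothetical inputs made up for the example, not the candidates or the final synset reported in the study.

```python
def eliminate_subsets(candidates):
    """Drop duplicates and every candidate that is a proper subset of another candidate."""
    unique = []
    for candidate in map(frozenset, candidates):
        if candidate not in unique:  # remove exact duplicates, keep first-seen order
            unique.append(candidate)
    # Keep a candidate only if no other candidate strictly contains it.
    return [set(c) for c in unique if not any(c < other for other in unique)]


# Hypothetical candidate synsets for the entry word "ahad" (illustrative only).
candidates = [
    {"ahad", "minggu"},
    {"ahad", "esa", "satu", "tunggal"},
    {"ahad", "esa"},  # subset of the second candidate, so it is eliminated
    {"ahad"},         # subset of every other candidate, so it is eliminated
]
remaining = eliminate_subsets(candidates)
print(remaining)
```

Using `frozenset` makes the proper-subset test (`<`) and the duplicate check straightforward; the remaining candidates are returned in their original order.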

Hierarchical Clustering
Hierarchical clustering is a clustering method that groups data based on the concept of hierarchy. In this method, the two closest groups are combined in each iteration [10]. In this research, Agglomerative Hierarchical Clustering is used as the hierarchical clustering method.

Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering is a bottom-up approach [11]: each item starts in its own cluster, and clusters are merged step by step, so that clusters have sub-clusters, which in turn have sub-clusters, and so on. However, a modification is needed, because the goal is not to merge all candidates into a single cluster. Instead, data are grouped based on a distance value, and the clustering process stops once it reaches a condition determined by a threshold value [2] [12].

The distance value is calculated with Equation 1. In that equation, Unique Words is the number of distinct words that exist in the two synonym sets being compared, and Similar Words is the number of words that the two synonym sets have in common. The threshold value is calculated with Equation 2.

$$\text{threshold} = \text{coefficient} \times \text{maximum distance} \quad (2)$$

In Equation 2, the threshold value is obtained by multiplying the coefficient value by the maximum distance value obtained from the first iteration. The coefficient value can be changed manually in the range 0.1 to 1.0.

This clustering process is needed to group the synsets in a redundant dataset in order to produce better synsets. The clustering groups synsets based on the largest similarity value and the largest distance value. The similarity value is obtained from the number of words that one synset shares with another synset. Indexes that have the same distance value will be combined with other indexes that have the largest distance value. The clustering process stops when the threshold value is greater than or equal to the maximum distance value [2]. The distance and threshold values are generated from Equation 1 and Equation 2. The algorithm used in this study is also depicted as a flowchart in Figure 3.
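The thresholded merging loop can be sketched as below. Several points are assumptions and not taken from the study: the distance is assumed to be the ratio of Similar Words to Unique Words (so a larger value means two synsets are closer), the stopping rule is assumed to be checked at the start of every iteration, and the names `distance`, `max_pair`, and `cluster` as well as the input synsets are hypothetical.

```python
from itertools import combinations


def distance(a, b):
    # ASSUMPTION: Similar Words / Unique Words; the source only names the two quantities.
    return len(a & b) / len(a | b)


def max_pair(synsets):
    """Return (distance, i, j) for the pair with the largest distance, or None if no pair exists."""
    best = None
    for i, j in combinations(range(len(synsets)), 2):
        d = distance(synsets[i], synsets[j])
        if best is None or d > best[0]:
            best = (d, i, j)
    return best


def cluster(synsets, coefficient):
    """Merge the closest pair repeatedly until threshold >= current maximum distance."""
    synsets = [set(s) for s in synsets]
    first = max_pair(synsets)
    if first is None:
        return synsets
    # Equation 2: threshold = coefficient x maximum distance of the first iteration.
    threshold = coefficient * first[0]
    while len(synsets) > 1:
        d, i, j = max_pair(synsets)
        if threshold >= d:  # stopping rule as stated in the text
            break
        merged = synsets[i] | synsets[j]
        synsets = [s for k, s in enumerate(synsets) if k not in (i, j)] + [merged]
    return synsets


# Hypothetical input synsets (illustrative only).
synsets = [{"esa", "satu"}, {"satu", "tunggal"}, {"ahad", "minggu"}]
print(cluster(synsets, coefficient=0.5))
```

In this toy run the first two synsets share one word out of three, so they are merged, while the third synset shares no word with the others and stays separate. Under these assumptions a coefficient of 1.0 makes the threshold equal to the first maximum distance; how the study handles that boundary case is not specified in the text.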

Tesaurus Bahasa Indonesia
A thesaurus contains sets of words that are related to each other. Basically, a thesaurus is a means of turning ideas into words, or vice versa. A thesaurus differs from a dictionary: a dictionary provides information about the meanings of words, while a thesaurus provides words that can be used to express the ideas of its users. Therefore, a thesaurus can help users express their ideas as intended [13]. The thesaurus used in this research is the Tesaurus Bahasa Indonesia in PDF format, which was published in 2008.

Data Test
In the testing phase, 80 words taken randomly from the thesaurus are used as test data. The list of words used as test data is shown in Table 3. Each selected word is processed by the system with the commutative method to produce one or more valid synsets, which are then processed using Agglomerative Hierarchical Clustering.

Gold Standard
The gold standard is used to find out how strongly the score produced by the system correlates with the relevance of the words being tested. The gold standard value is obtained from a collection of human judgments and is used as a reference measurement of similarity between words. In this study, the gold standard is the result of the validation of synonym sets performed by lexical experts (lexicographers). The validation is done very carefully so that it can be used as a comparison for the results of the system and as a measure of accuracy [12].

F-Measure
F-measure is a popular performance metric, especially for tasks with unbalanced data sets [14]. The F-measure combines precision and recall. The calculation of precision (P) and recall (R) can be seen in Equations 3 and 4 [15].

$$P = \frac{\text{number of correct words in the system synset}}{\text{number of words in the system synset}} \quad (3)$$

$$R = \frac{\text{number of correct words in the system synset}}{\text{number of words in the gold standard synset}} \quad (4)$$

The F-measure is twice the product of precision and recall, divided by the sum of the two. The calculation of the F-measure can be seen in Equation 5.

$$F = \frac{2 \times P \times R}{P + R} \quad (5)$$

Precision is the number of correct words in the synset produced by the system, judged against the manually built gold standard, divided by the number of words in the resulting synset. Recall is the number of correct words in the synset produced by the system, judged against the same gold standard, divided by the number of words in the gold standard synset, which was built manually by humans.
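The three measures can be computed for a single synset as in the sketch below. The function names and the example synsets are assumptions introduced for illustration; how the per-synset scores are aggregated over all test words is not stated in the text, so the sketch covers only one system synset and one gold standard synset.

```python
def precision_recall_f(system_synset, gold_synset):
    """Compute precision, recall, and F-measure for one system synset against one gold synset."""
    system, gold = set(system_synset), set(gold_synset)
    correct = len(system & gold)  # words the system produced that the gold standard accepts
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = f_from(precision, recall)
    return precision, recall, f_measure


def f_from(precision, recall):
    """F-measure: twice the product of precision and recall divided by their sum."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical example: the system keeps one word that the gold standard rejects.
p, r, f = precision_recall_f({"fana", "sementara", "temporer"}, {"fana", "sementara"})
print(round(p, 4), round(r, 4), round(f, 4))

# A precision of 77.48% and a recall of 93.48% give an F-measure of about 84.73%.
print(round(f_from(0.7748, 0.9348), 4))
```

Guarding the divisions keeps the function defined for empty synsets and for the case where precision and recall are both zero.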

RESULT AND DISCUSSION
This section presents the testing results and the analysis of those results.

Testing Result
The evaluation uses the F-measure method. In addition, testing is done by varying the coefficient value over the range 0.1 to 1.0, in order to find out which coefficient values combine the synonym sets exactly according to their sense. The results of these tests can be seen in Table 4. Based on the data in Table 4, the coefficient values 0.9 and 1.0 produce the same number of synsets and the same number of loops, because they have the same maximum distance value. The clustering process continues as long as the threshold value is still lower than the maximum distance value generated in each loop. As explained earlier, the clustering process stops when the threshold value is greater than or equal to the maximum distance value.
In addition, clustering performance was tested on the commutative and clustering result data sets with coefficient values in the range 0.1 to 1.0, in order to find out how each process performs by evaluating the synsets generated by the system. Table 5 is a comparison of the performance data generated from these tests. Table 6 is a comparison between Synset Validation, Synset Commutative, and Synset Clustering. Synset Validation is the synset used as validation, Synset Commutative is the synset generated by the automatic commutative process, and Synset Clustering is the synset generated by the clustering process.

Analysis of Testing Result
Referring to the data shown in Table 4, testing each coefficient value produces different values in each aspect. For example, the number of synsets and the distance value always increase as the coefficient value increases. Conversely, the number of loops in the clustering process always decreases as the coefficient value increases, because the clustering process continues only as long as the threshold value is less than the maximum distance value generated in each loop. In addition, the performance on the data set always increases from the smallest coefficient to the largest and reaches its optimum at coefficients 0.9 and 1.0, because at these coefficients the Max Distance Value, Number of Loops, and Number of Synset no longer change.
Based on the data listed in Table 5, building synonym sets using the commutative method (before the clustering process) with the predetermined test data gives a Precision of 51.3%, a Recall of 63.11%, and an F-Measure of 56.19%, which differs considerably from the performance of the clustering process. For example, the optimum coefficient gives a Precision of 77.48%, a Recall of 93.48%, and an F-Measure of 84.73%. This indicates that the clustering process is very useful for combining synsets into better ones.
Then, based on Table 6, three samples were taken for comparison between the validation data, the data from the commutative process, and the data from the clustering process. The table shows different degrees of similarity between the generated synsets, which affects the evaluation value: if the words in the validation synset and the synset produced by the system differ significantly, the resulting evaluation value is lower, and if the validation synset and the synset produced by the system are similar, the resulting evaluation value is higher.

CONCLUSION
The threshold value is derived from the coefficient. It is used to stop the loop in the clustering process when the threshold value is greater than or equal to the maximum distance value. Based on the data obtained from the tests, the clustering process is very useful for combining several synsets into better synsets, as evidenced by the F-Measure of 84.21% produced with clustering, compared to only 56.19% produced without clustering. In addition, among the coefficient values in the range 0.1 to 1.0, the coefficient value 0.9 was found to be the most optimal for the clustering process, for the reasons explained in the previous section. However, the optimum coefficient value can vary depending on the dataset used in the testing.
For further research, it is suggested to enlarge the list of words used, in order to determine the optimum coefficient value and to measure the performance of Agglomerative Hierarchical Clustering on a larger data scale. Further research is also expected to use other clustering methods in the development of synsets for Indonesian WordNet.