Comparative Analysis of Similarity Methods for Indonesian Document Similarity in Indonesian Text Similarity Detection

 (*) Sheraton Pawestri (Universitas Ahmad Dahlan, Yogyakarta, Indonesia)
 Yohanes Suyanto (Universitas Gadjah Mada, Yogyakarta, Indonesia)

(*) Corresponding Author

Submitted: April 25, 2024; Published: July 26, 2024

Abstract

Easy access to information brings diverse benefits, including the ability to develop models that detect similarities between documents, plagiarism-checking systems, automatic summarization, and classification. These benefits make similarity detection between documents an important area of research. However, studies on similarity detection specifically for Indonesian-language documents are still relatively few, and their performance can still be improved. This research therefore conducts a comparative analysis of the performance of Doc2Vec compared to the Jaccard Coefficient, Cosine Similarity, and Euclidean Distance in detecting the similarity of documents with Indonesian text. Three datasets are used in this analysis: the first consists of 200 news articles from Google News, the second comes from IndoNLU, and the third from TaPaCo. The findings show that, on average, Cosine Similarity performs better than the Jaccard Coefficient and Euclidean Distance. The best performance was an accuracy of 0.98, a precision of 0.84, a recall of 0.95, and an F1 score of 0.89, with the model built in 10.56 seconds, obtained using the Cosine Similarity algorithm on the Google News dataset. This is because Doc2Vec is better suited to datasets with higher dimensions than to datasets that contain only a few words.
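The three similarity measures named in the abstract, and the F1 score it reports, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the study works with Doc2Vec representations, whereas the sketch below assumes plain term-count vectors and a whitespace tokenizer as a stand-in, and the two example sentences and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the paper's preprocessing is not specified here.
    return text.lower().split()


def jaccard_coefficient(tokens_a, tokens_b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def term_vectors(tokens_a, tokens_b):
    # Assumption: raw term counts over the joint vocabulary stand in for Doc2Vec vectors.
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[t] for t in vocab], [counts_b[t] for t in vocab]


def cosine_similarity(vec_a, vec_b):
    # Cosine of the angle between the two vectors; 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(vec_a, vec_b):
    # Straight-line distance; unlike the other two, smaller means more similar.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical Indonesian sentences that differ in one word.
tokens_a = tokenize("saya suka membaca buku")
tokens_b = tokenize("saya suka menulis buku")
vec_a, vec_b = term_vectors(tokens_a, tokens_b)

print(jaccard_coefficient(tokens_a, tokens_b))  # 3 shared / 5 total terms
print(cosine_similarity(vec_a, vec_b))
print(euclidean_distance(vec_a, vec_b))
print(round(f1_score(0.84, 0.95), 2))  # precision and recall taken from the abstract
```

With the precision and recall reported in the abstract, the harmonic mean rounds to 0.89, which agrees with the stated F1 score. Note that Jaccard and cosine are similarities while Euclidean is a distance, so any comparison across the three has to flip or rescale one of them.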

Keywords


Similarity Indonesian Text; Doc2vec; Jaccard Coefficient; Cosine Similarity; Euclidean Distance

Copyright (c) 2024 JURNAL MEDIA INFORMATIKA BUDIDARMA

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.



JURNAL MEDIA INFORMATIKA BUDIDARMA
STMIK Budi Darma
Secretariat: Sisingamangaraja No. 338 Telp 061-7875998
Email: mib.stmikbd@gmail.com