Comparative Analysis of Similarity Methods for Indonesian Document Similarity in Indonesian Text Similarity Detection

Sheraton Pawestri; Yohanes Suyanto

doi:10.30865/mib.v8i3.7648

Authors

Sheraton Pawestri Universitas Ahmad Dahlan, Yogyakarta http://orcid.org/0009-0000-6911-7854
Yohanes Suyanto Universitas Gadjah Mada, Yogyakarta http://orcid.org/0000-0003-1670-8620

DOI:

https://doi.org/10.30865/mib.v8i3.7648

Keywords:

Indonesian Text Similarity, Doc2Vec, Jaccard Coefficient, Cosine Similarity, Euclidean Distance

Abstract

Easy access to information brings diverse benefits, including the ability to develop models that detect similarities between documents and support plagiarism checking, automatic summarization, classification, and related tasks. These benefits make document similarity detection an important area of research. However, studies of similarity detection specifically for Indonesian-language documents are still relatively few, and their performance can still be improved. This research therefore presents a comparative analysis of Doc2Vec used with the Jaccard Coefficient, Cosine Similarity, and Euclidean Distance for detecting the similarity of Indonesian text documents. Three datasets are used: the first consists of 200 news articles from Google News, the second is taken from IndoNLU, and the third from TaPaCo. The findings show that, on average, Cosine Similarity performs better than the Jaccard Coefficient and Euclidean Distance. The best performance, with an accuracy of 0.98, precision of 0.84, recall of 0.95, and F1 score of 0.89, and with the model built in 10.56 seconds, was obtained using Cosine Similarity on the Google News dataset. This is because Doc2Vec is better suited to datasets with higher dimensionality than to datasets whose documents contain only a few words.
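The three similarity measures named in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw term-count vectors; the `tokenize` and `count_vectors` helpers and the two example sentences are hypothetical, and the study's own preprocessing and Doc2Vec embeddings are not reproduced here. With Doc2Vec, the count vectors would be replaced by learned document vectors. The last function recomputes the F1 score from the precision and recall reported above.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical minimal tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def jaccard_coefficient(tokens_a, tokens_b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def count_vectors(tokens_a, tokens_b):
    # Build term-count vectors over the joint vocabulary (an assumption;
    # the paper's vectors come from Doc2Vec instead).
    vocab = sorted(set(tokens_a) | set(tokens_b))
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def cosine_similarity(vec_a, vec_b):
    # Cosine of the angle between the two vectors; 1.0 means same direction.
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(vec_a, vec_b):
    # Straight-line distance; 0.0 means identical vectors (smaller is more similar).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))


def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical Indonesian example sentences, not taken from the study's datasets.
doc_a = tokenize("pemerintah mengumumkan kebijakan baru hari ini")
doc_b = tokenize("pemerintah mengumumkan kebijakan ekonomi baru")
vec_a, vec_b = count_vectors(doc_a, doc_b)

jaccard = jaccard_coefficient(doc_a, doc_b)   # 4 shared words / 7 distinct words
cosine = cosine_similarity(vec_a, vec_b)      # 4 / sqrt(6 * 5)
euclid = euclidean_distance(vec_a, vec_b)     # sqrt(3): three words differ
reported_f1 = round(f1_score(0.84, 0.95), 2)  # matches the reported 0.89

print(jaccard, cosine, euclid, reported_f1)
```

Note that Jaccard and cosine are similarities (higher means more alike), while Euclidean distance is a dissimilarity, so any threshold used to label a pair as similar has to be applied in the opposite direction for it.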
