Comparative Analysis of Similarity Methods for Indonesian Document Similarity in Indonesian Text Similarity Detection
DOI:
https://doi.org/10.30865/mib.v8i3.7648
Keywords:
Similarity Indonesian Text, Doc2vec, Jaccard Coefficient, Cosine Similarity, Euclidean Distance
Abstract
Easy access to information brings diverse benefits, including the ability to develop models that detect similarities between documents, plagiarism-checking systems, automatic summarization, classification, and more. These benefits make similarity detection between documents an important research area. However, studies on similarity detection specifically for Indonesian-language documents are still relatively few, and their performance can still be improved. Therefore, this research conducts a comparative analysis of the performance of Doc2Vec against the Jaccard Coefficient, Cosine Similarity, and Euclidean Distance in detecting the similarity of Indonesian-language documents. Three datasets are used: the first consists of 200 news articles from Google News, the second comes from IndoNLU, and the third from TaPaCo. The findings show that, on average, Cosine Similarity performs better than the Jaccard Coefficient and Euclidean Distance. The best result was an accuracy of 0.98, a precision of 0.84, a recall of 0.95, and an F1 score of 0.89, with the model built in 10.56 seconds using the Cosine Similarity algorithm on the Google News dataset. This is because Doc2Vec is better suited to datasets with higher dimensions than to datasets that contain only a few words.
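To make the three baseline measures named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the whitespace tokenizer, the raw term-count vectors, the helper names, and the two sample sentences are all assumptions made for illustration, and the Doc2Vec embedding step is left out entirely. A small `f1_score` helper is included only to show how an F1 score follows from precision and recall.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def jaccard_coefficient(a, b):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def count_vectors(a, b):
    """Term-count vectors for both texts over their combined vocabulary."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[t] for t in vocab], [counts_b[t] for t in vocab]


def cosine_similarity(a, b):
    """Dot product of the count vectors divided by the product of their norms."""
    va, vb = count_vectors(a, b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between the count vectors (0 means identical)."""
    va, vb = count_vectors(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical sample sentences, not taken from the paper's datasets.
doc1 = "saya suka makan nasi goreng"
doc2 = "saya suka makan mie goreng"

print(round(jaccard_coefficient(doc1, doc2), 2))  # 0.67 (4 shared of 6 unique tokens)
print(round(cosine_similarity(doc1, doc2), 2))    # 0.8
print(round(euclidean_distance(doc1, doc2), 2))   # 1.41
print(round(f1_score(0.84, 0.95), 2))             # 0.89
```

Jaccard and cosine are similarities (higher means more alike), while Euclidean distance is a dissimilarity (lower means more alike), so any comparison across the three has to account for that difference in direction. The last line also shows that the reported precision of 0.84 and recall of 0.95 are consistent with the reported F1 score of 0.89.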
License

This work is licensed under a Creative Commons Attribution 4.0 International License