Comparative Performance Analysis of NMF and LDA for Topic Modeling of Indonesian Online News

Authors

  • Latifah Nurrohmah Handayani, Master of Information Technology, Universitas Teknologi Digital Indonesia
  • Lucia Nugraheni Harnaningrum, Universitas Teknologi Digital Indonesia

DOI:

https://doi.org/10.30865/json.v7i3.9469

Keywords:

Topic Modelling, Non-negative Matrix Factorization, Latent Dirichlet Allocation, Natural Language Processing, News Analysis

Abstract

The growth of digital news content in Indonesia has created a need for automatic methods to extract the main topics from large-scale news text datasets. This study performs a comparative analysis of the performance of Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) on topic modeling of Indonesian online news from three outlets: CNBC Indonesia, Kompas.com, and Detik.com. The dataset consists of 4,500 news articles; preprocessing covers tokenization, stopword removal, and feature extraction using TF-IDF for NMF and a count vectorizer for LDA. Performance is evaluated using the coherence score (Cᵥ), topic diversity, silhouette score, and a chi-square test on the topic distribution across outlets. The results show that NMF achieves a higher coherence score (0.7544) than LDA (0.5600), better topic diversity (0.9400 vs 0.8400), and substantially shorter training time (1.60 seconds vs 108.30 seconds). The chi-square test confirms a significant difference (p < 0.001) in topic distribution across outlets. Based on the evaluation on this dataset, NMF outperforms LDA for topic modeling of Indonesian online news.

References

R. Egger and J. Yu, “A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts,” Front. Sociol., vol. 7, pp. 1–16, May 2022, doi: 10.3389/fsoc.2022.886498.

M. Grootendorst, “BERTopic: Neural topic modeling with a class-based TF-IDF procedure,” 2022, [Online]. Available: http://arxiv.org/abs/2203.05794

E. Puspita, D. F. Shiddieq, and F. F. Roji, “Pemodelan Topik pada Media Berita Online Menggunakan Latent Dirichlet Allocation (Studi Kasus Merek Somethinc),” MALCOM Indones. J. Mach. Learn. Comput. Sci., vol. 4, no. 2, pp. 481–489, 2024, doi: 10.57152/malcom.v4i2.1204.

A. Ikegami and I. D. M. B. A. Darmawan, “Analisis Sentimen dan Pemodelan Topik Ulasan Aplikasi Noice Menggunakan XGBoost dan LDA,” Jnatia, vol. 1, no. 1, pp. 325–336, 2022.

D. Ridhwanulah and D. H. Fudholi, “Pemodelan Topik pada Cuitan tentang Penyakit Tropis di Indonesia dengan Metode Latent Dirichlet Allocation,” J. Ilm. SINUS, vol. 20, no. 1, p. 11, Jan. 2022, doi: 10.30646/sinus.v20i1.589.

N. A. Sanjaya ER, “Implementasi Latent Dirichlet Allocation (LDA) untuk Klasterisasi Cerita Berbahasa Bali,” J. Teknol. Inf. dan Ilmu Komput., vol. 8, no. 1, p. 127, 2021, doi: 10.25126/jtiik.0813556.

O. Ozyurt, H. Özköse, and A. Ayaz, “Evaluating the latest trends of Industry 4.0 based on LDA topic model,” J. Supercomput., vol. 80, no. 13, pp. 19003–19030, 2024, doi: 10.1007/s11227-024-06247-x.

R. P. F. Afidh and Syahrial, “Pemodelan Topik Menggunakan n-Gram dan Non-negative Matrix Factorization,” J. Inf. dan Teknol., vol. 5, no. 1, pp. 265–275, 2023, doi: 10.60083/jidt.v5i1.385.

F. Koto, A. Rahimi, J. H. Lau, and T. Baldwin, “IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP,” COLING 2020 - 28th Int. Conf. Comput. Linguist. Proc. Conf., pp. 757–770, 2020, doi: 10.18653/v1/2020.coling-main.66.

U. Khairani, V. Mutiawani, and H. Ahmadian, “Pengaruh Tahapan Preprocessing Terhadap Model Indobert Dan Indobertweet Untuk Mendeteksi Emosi Pada Komentar Akun Berita Instagram,” J. Teknol. Inf. dan Ilmu Komput., vol. 11, no. 4, pp. 887–894, 2024, doi: 10.25126/jtiik.1148315.

A. Nanyonga, H. Wasswa, and G. Wild, “Topic Modeling Analysis of Aviation Accident Reports: A Comparative Study between LDA and NMF Models,” 2023 3rd Int. Conf. Smart Gener. Comput. Commun. Networking, SMART GENCON 2023, 2023, doi: 10.1109/SMARTGENCON60755.2023.10442471.

O. Babalola, B. Ojokoh, and O. Boyinbode, “Comprehensive Evaluation of LDA, NMF, and BERTopic’s Performance on News Headline Topic Modeling,” J. Comput. Theor. Appl., vol. 2, no. 2, pp. 268–289, 2024, doi: 10.62411/jcta.11635.

M. S. Khine, Text Mining in Educational Research: Topic Modeling and Latent Dirichlet Allocation. Singapore: Springer Nature Singapore, 2025. doi: 10.1007/978-981-97-7858-4.

D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Stanford NLP Group, 2026. [Online]. Available: https://web.stanford.edu/~jurafsky/slp3/

Maulidya Prastita Syah, Ajeng Puspa Wardani, Mohammad Idhom, and Trimono, “Perbandingan Representasi Teks Tf-Idf Dan Bert Terhadap Akurasi Cosine Similarity Dalam Penilaian Otomatis Jawaban Berbasis Teks,” Data Sci. Indones., vol. 5, no. 1, pp. 47–59, 2025, doi: 10.47709/dsi.v5i1.6021.

D. Patel, V. Parikh, O. Patel, A. Shah, and B. Chaudhury, “Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization,” IEEE Trans. Artif. Intell., pp. 1–29, 2025, doi: 10.1109/TAI.2025.3579459.

Pavithra and Savitha, “Topic Modeling for Evolving Textual Data Using LDA, HDP, NMF, BERTOPIC, and DTM With a Focus on Research Papers,” J. Technol. Informatics, vol. 5, no. 2, pp. 53–63, 2024, doi: 10.37802/joti.v5i2.618.

J. Gan and Y. Qi, “Selection of the optimal number of topics for LDA topic model—Taking patent policy analysis as an example,” Entropy, vol. 23, no. 10, 2021, doi: 10.3390/e23101301.

A. Farea, S. Tripathi, G. Glazko, and F. Emmert-Streib, “Investigating the optimal number of topics by advanced text-mining techniques: Sustainable energy research,” Eng. Appl. Artif. Intell., vol. 136, p. 108877, Oct. 2024, doi: 10.1016/J.ENGAPPAI.2024.108877.

S. Mohammed et al., “The effects of data quality on machine learning performance on tabular data,” Inf. Syst., vol. 132, p. 102549, Jul. 2025, doi: 10.1016/J.IS.2025.102549.

Y. O. Odhianto, D. Swanjaya, and J. Sahertian, “Optimalisasi Latent Dirichlet Allocation untuk Ekstraksi Topik Utama dalam Teks Dongeng,” Semin. Nas. Inov. Teknol., vol. 9, 2025. [Online]. Available: https://proceeding.unpkediri.ac.id/index.php/inotek/article/view/7712

P. Malviya, V. Bhandari, P. S. Sisodiya, and S. Suman, “Customer Segmentation and Business Sales Forecasting using Machine Learning for Business Development,” Int. J. Recent Innov. Trends Comput. Commun., vol. 11, no. 11s, pp. 416–424, Oct. 2023, doi: 10.17762/IJRITCC.V11I11S.8170.

C. Meaney, T. A. Stukel, P. C. Austin, R. Moineddin, M. Greiver, and M. Escobar, “Quality indices for topic model selection and evaluation: a literature review and case study,” BMC Med. Inform. Decis. Mak., vol. 23, no. 1, pp. 1–18, 2023, doi: 10.1186/s12911-023-02216-1.

H. Rahimi, D. Mimno, J. L. Hoover, H. Naacke, C. Constantin, and B. Amann, “Contextualized Topic Coherence Metrics,” EACL 2024 - 18th Conf. Eur. Chapter Assoc. Comput. Linguist. Find. EACL 2024, pp. 1760–1773, 2024.

Y. Bu, M. Li, W. Gu, and W. bin Huang, “Topic diversity: A discipline scheme-free diversity measurement for journals,” J. Assoc. Inf. Sci. Technol., vol. 72, no. 5, pp. 523–539, May 2021, doi: 10.1002/asi.24433.

S. Terragni, E. Fersini, B. Galuzzi, P. Tropeano, and A. Candelieri, “OCTIS: Comparing and Optimizing Topic Models is Simple!,” in Proc. 16th Conf. Eur. Chapter Assoc. Comput. Linguist.: Syst. Demonstr., pp. 263–270, 2021.

M. Shutaywi and N. N. Kachouie, “Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering,” Entropy, vol. 23, no. 6, p. 759, Jun. 2021, doi: 10.3390/e23060759.

D. Chicco, A. Campagner, A. Spagnolo, D. Ciucci, and G. Jurman, “The Silhouette coefficient and the Davies-Bouldin index are more informative than Dunn index, Calinski-Harabasz index, Shannon entropy, and Gap statistic for unsupervised clustering internal evaluation of two convex clusters,” PeerJ Comput. Sci., vol. 11, p. e3309, Nov. 2025, doi: 10.7717/PEERJ-CS.3309.

Published

2026-03-31

How to Cite

Latifah Nurrohmah Handayani, & Lucia Nugraheni Harnaningrum. (2026). Analisis Perbandingan Performa NMF dengan LDA pada Topik Modeling Berita Online Indonesia. Jurnal Sistem Komputer Dan Informatika (JSON), 7(3), 875–886. https://doi.org/10.30865/json.v7i3.9469

Section

Articles