BERTopic Modeling of Natural Language Processing Abstracts: Thematic Structure and Trajectory

Samsir Samsir; Reagan Surbakti Saragih; Selamat Subagio; Rahmad Aditiya; Ronal Watrianthos

doi:10.30865/mib.v7i3.6426

Authors

Samsir Samsir Universitas Al Washliyah, Rantauprapat
Reagan Surbakti Saragih Universitas HKBP Nommensen, Pematangsiantar
Selamat Subagio Universitas Al Washliyah, Rantauprapat
Rahmad Aditiya Universitas Al Washliyah, Rantauprapat
Ronal Watrianthos Universitas Al Washliyah, Rantauprapat

DOI:

https://doi.org/10.30865/mib.v7i3.6426

Keywords:

Clustering Algorithms, BERTopic, Natural Language Processing, Scopus Database, Scientific Papers

Abstract

The rapid growth of academic literature makes it difficult to identify relevant studies. This research applied unsupervised clustering to 13,027 Scopus abstracts to uncover the structure and themes of natural language processing (NLP) publications. Abstracts were pre-processed with tokenization, lemmatization, and vectorization. Clustering used the BERTopic algorithm with the MiniLM-L6-v2 embedding model and a minimum topic size of 50. Quantitative analysis revealed eight main topics, ranging in size from 205 to 4,089 abstracts. The language models topic was the most prominent, with 4,089 abstracts. Topic coherence scores ranged from 0.42 to 0.58, indicating meaningful themes. Keywords and sample documents provided interpretable topic representations. The results show that the approach produces coherent topics and captures connections between NLP studies. Clustering supports focused browsing and identification of relevant literature. Unlike human-curated classification, the unsupervised, data-driven approach reduces the subjective bias of manual categorization. Given the need to understand research trends, clustering abstracts enables efficient knowledge discovery from scientific corpora. The methodology can be applied to other datasets and fields to uncover overlooked patterns, and its adjustable parameters allow customized analysis. Overall, unsupervised clustering provides a versatile framework for navigating, summarizing, and analyzing academic literature as publication volumes expand exponentially.
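
The clustering setup described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It assumes the `bertopic` and `sentence-transformers` packages, assumes that "MiniLM-L6-v2" refers to the `all-MiniLM-L6-v2` sentence-transformers checkpoint, and passes raw abstract strings to the model without reproducing the tokenization and lemmatization steps. The function names `fit_topics` and `topic_sizes` are hypothetical helpers introduced only for this example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Settings taken from the abstract. The full checkpoint name is an assumption.
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
MIN_TOPIC_SIZE = 50


def fit_topics(abstracts: Sequence[str]):
    """Fit BERTopic on abstract strings and return (model, topic_ids).

    Hypothetical helper: the libraries are imported lazily so that
    topic_sizes() below can be used without them installed.
    """
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer(EMBEDDING_MODEL)
    model = BERTopic(embedding_model=embedder, min_topic_size=MIN_TOPIC_SIZE)
    # fit_transform returns one topic id per document plus probabilities.
    topic_ids, _probabilities = model.fit_transform(list(abstracts))
    return model, topic_ids


def topic_sizes(topic_ids: Sequence[int]) -> List[Tuple[int, int]]:
    """Count documents per topic, largest first.

    BERTopic assigns the label -1 to outlier documents, so those are skipped.
    """
    counts = Counter(t for t in topic_ids if t != -1)
    return counts.most_common()
```

With a list of abstract strings loaded into `abstracts`, calling `model, ids = fit_topics(abstracts)` followed by `topic_sizes(ids)` would give per-topic document counts comparable to the topic sizes reported in the abstract. The exact topics depend on preprocessing, random seeds, and library versions, so results from this sketch should not be expected to match the paper.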

Author Biography

Ronal Watrianthos, Universitas Al Washliyah, Rantauprapat

Google Scholar ID: r2QIH5cAAAAJ
SINTA ID: 6154310
SCOPUS ID: 57207884978


