BERTopic Modeling of Natural Language Processing Abstracts: Thematic Structure and Trajectory
Abstract
The rapid growth of academic literature makes it difficult to identify relevant studies. This study applied unsupervised clustering to 13,027 Scopus abstracts to uncover the structure and themes of natural language processing (NLP) publications. Abstracts were pre-processed with tokenization, lemmatization, and vectorization. Clustering was performed with the BERTopic algorithm, using the MiniLM-L6-v2 embedding model and a minimum topic size of 50. The analysis yielded eight main topics, ranging in size from 205 to 4,089 abstracts; the language models topic was the largest, with 4,089 abstracts. Topic coherence scores ranged from 0.42 to 0.58, indicating meaningful themes, and keywords and sample documents provided interpretable representations of each topic. The results show that the approach produces coherent topics and captures connections between NLP studies, which supports focused browsing and the identification of relevant literature. Unlike human-curated classifications, the unsupervised, data-driven approach does not depend on a curator's judgment, which limits that source of bias. Given the need to understand research trends, clustering abstracts enables efficient knowledge discovery from scientific corpora. The methodology can be applied to other datasets and fields to uncover overlooked patterns, and its adjustable parameters allow the analysis to be customized. In general, unsupervised clustering provides a versatile framework for navigating, summarizing, and analyzing academic literature as publication volumes continue to grow rapidly.
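The configuration described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes the third-party `bertopic` package is installed, and it assumes that "MiniLM-L6-v2" refers to the sentence-transformers checkpoint named `all-MiniLM-L6-v2`. The names `TopicModelConfig` and `fit_topics` are hypothetical, and the pre-processing steps (tokenization, lemmatization) are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicModelConfig:
    """Settings taken from the abstract (hypothetical container)."""
    # Assumption: the abstract's "MiniLM-L6-v2" is this sentence-transformers model.
    embedding_model: str = "all-MiniLM-L6-v2"
    # Minimum number of abstracts required to form a topic, as stated in the abstract.
    min_topic_size: int = 50


def fit_topics(abstracts, config: TopicModelConfig = TopicModelConfig()):
    """Fit BERTopic on a list of abstract strings.

    Returns the fitted model and one topic id per abstract.
    """
    # Imported lazily so the configuration can be inspected without the dependency.
    from bertopic import BERTopic

    model = BERTopic(
        embedding_model=config.embedding_model,
        min_topic_size=config.min_topic_size,
    )
    topics, _probabilities = model.fit_transform(abstracts)
    return model, topics
```

After fitting, `model.get_topic_info()` returns a table of topic ids, sizes, and representative keywords, which is the kind of summary the abstract reports. The number of topics found depends on the corpus and on the clustering settings, so this sketch should not be expected to reproduce the eight topics reported.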
DOI: https://doi.org/10.30865/mib.v7i3.6426
Copyright (c) 2023 JURNAL MEDIA INFORMATIKA BUDIDARMA

This work is licensed under a Creative Commons Attribution 4.0 International License.
JURNAL MEDIA INFORMATIKA BUDIDARMA
Universitas Budi Darma
Secretariat: Sisingamangaraja No. 338 Telp 061-7875998
Email: mib.stmikbd@gmail.com