BERTopic Modeling of Natural Language Processing Abstracts: Thematic Structure and Trajectory
DOI:
https://doi.org/10.30865/mib.v7i3.6426
Keywords:
Clustering Algorithms, BERTopic, Natural Language Processing, Scopus Database, Scientific Papers
Abstract
The rapid growth of academic literature makes it difficult to identify relevant studies. This research applied unsupervised clustering to 13,027 Scopus abstracts to uncover the structure and themes of natural language processing (NLP) publications. Abstracts were pre-processed with tokenization, lemmatization, and vectorization. Clustering used the BERTopic algorithm with the MiniLM-L6-v2 embedding model and a minimum topic size of 50. Quantitative analysis revealed eight main topics, ranging in size from 205 to 4,089 abstracts; the language models topic was the largest, with 4,089 abstracts. Topic coherence scores ranged from 0.42 to 0.58, indicating meaningful themes. Keywords and sample documents provided interpretable topic representations. The results show that the approach produces coherent topics and captures connections between NLP studies. Clustering supports focused browsing and identification of relevant literature. Unlike human-curated classifications, the unsupervised, data-driven approach avoids the bias of manual categorization. Given the need to understand research trends, clustering abstracts enables efficient knowledge discovery from scientific corpora. The methodology can be applied to other datasets and fields to uncover overlooked patterns, and its adjustable parameters allow for customized analysis. In general, unsupervised clustering provides a versatile framework for navigating, summarizing, and analyzing academic literature as publication volumes expand exponentially.
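The abstract does not say how the topic keywords were derived. BERTopic, as described in its paper, represents each topic with class-based TF-IDF (c-TF-IDF): all documents in a topic are merged into one bag of words, and each term is weighted by its in-topic frequency times log(1 + A / f_t), where A is the average number of words per topic and f_t is the term's frequency across all topics. The sketch below illustrates that weighting in plain Python on a toy corpus. The function names `c_tf_idf` and `top_keywords`, the topic labels, and the tokens are hypothetical; the code is not the authors' pipeline and does not exercise the MiniLM-L6-v2 embeddings or the minimum topic size of 50 reported above.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Return {topic: {term: weight}} using class-based TF-IDF.

    docs_by_topic maps a topic label to a list of tokenized documents.
    """
    # Merge every document of a topic into a single bag of words.
    topic_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }
    # f_t: frequency of each term across all topics.
    total_counts = Counter()
    for counts in topic_counts.values():
        total_counts.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(topic_counts)
    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for topic, counts in topic_counts.items()
    }


def top_keywords(weights, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


# Toy corpus (hypothetical): two topics with two tokenized abstracts each.
toy = {
    "language_models": [
        ["bert", "language", "model", "pretraining"],
        ["language", "model", "transformer", "text"],
    ],
    "speech": [
        ["speech", "recognition", "acoustic", "model"],
        ["speech", "audio", "text"],
    ],
}

weights = c_tf_idf(toy)
for topic, w in weights.items():
    print(topic, top_keywords(w))
```

Terms that occur in several topics, such as "model" here, receive a smaller logarithmic factor than terms confined to one topic, which is why topic-specific words tend to surface as keywords.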
This work is licensed under a Creative Commons Attribution 4.0 International License