People Entity Recognition for the English Quran Translation using BERT

Authors

  • Retno Diah Ayu Ningtias, Telkom University, Bandung
  • Moch. Arif Bijaksana, Telkom University, Bandung

DOI:

https://doi.org/10.30865/mib.v7i1.5586

Keywords:

Quran, Information Extraction, Named Entity Recognition, Extraction of Human Entities, BERT

Abstract

The Quran is the holy book of Muslims around the world, and it has been translated not only into Indonesian but also into many other languages, including English. It consists of thousands of verses, each covering different topics and mentioning different entities, which can make its contents difficult to understand and study. One way to ease this is to extract information from the text and identify the entities it contains, such as human entities. Identifying human entities first is an important step, because it supports search, particularly searching for the names of people in the Quran. This task is commonly known as Named Entity Recognition (NER). NER automatically recognizes important entities, such as names of people, names of groups, and other entities, in a sentence or verse of the Quran. Research on the English translation of the Quran is still limited. In this research, we therefore build an information extraction model for human entities based on a pre-trained deep learning model, Bidirectional Encoder Representations from Transformers (BERT). The dataset consists of 19,473 tokens and 720 entities taken from the website tanzil.net. The results show that BERT can be used for NER on the English translation of the Quran, obtaining an F1-score of 53%.
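To make the reported F1-score concrete, the following minimal Python sketch shows one common way such a score can be computed for person entities. It rests on several assumptions that the abstract does not confirm: that entities are labeled with BIO tags (`B-PER`, `I-PER`, `O`), that a prediction counts as correct only when the whole span and its type match, and that the example sentence and both tag sequences are invented for illustration rather than taken from the dataset. The function names `bio_to_spans` and `entity_f1` are hypothetical.

```python
from typing import List, Set, Tuple

# A span is (start index, end index exclusive, entity type).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a sequence of BIO tags (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact span matching."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: the hypothetical model misses one of three person names.
tokens = ["Moses", "and", "Aaron", "went", "to", "Pharaoh"]
gold_tags = ["B-PER", "O", "B-PER", "O", "O", "B-PER"]
pred_tags = ["B-PER", "O", "O", "O", "O", "B-PER"]

precision, recall, f1 = entity_f1(gold_tags, pred_tags)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this exact-match convention a missed or partially tagged name lowers recall, and a spurious one lowers precision. The abstract does not state which tagging scheme or matching rule the authors used, so the sketch should not be read as their evaluation procedure.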

Published

2023-02-02