Handling Imbalance Dataset on Hoax Indonesian Political News Classification using IndoBERT and Random Sampling

Muhammad Ammar Fathin; Yuliant Sibaroni; Sri Suryani Prasetyowati

doi:10.30865/mib.v8i1.7099

Authors

Muhammad Ammar Fathin, Telkom University, Bandung
Yuliant Sibaroni, Telkom University, Bandung
Sri Suryani Prasetyowati, Telkom University, Bandung

DOI:

https://doi.org/10.30865/mib.v8i1.7099

Keywords:

Hoax Detection, IndoBERT, Imbalanced Data, Political News, BERT

Abstract

The rapid adoption of the internet in Indonesia, with over 200 million active users as of January 2022, has transformed how information is disseminated, particularly through social media and online platforms. These platforms have democratized information sharing, but they have also become hotbeds for misinformation and hoaxes that significantly affect the political landscape, as seen during the Jakarta gubernatorial election from late 2016 to April 2017. Research by the Indonesian Telematics Society (MASTEL) found a high prevalence of hoax content, predominantly socio-political, which underscores the need to address the problem. This research addresses hoax detection in Indonesian political news, focusing on classifying news as factual or hoax when the dataset is class-imbalanced. The dataset is significantly imbalanced, with 6,947 articles labeled as hoaxes and 20,945 as non-hoaxes. The study uses IndoBERT, a variant of the BERT framework pre-trained on the Indonesian language, and assesses its effectiveness in distinguishing factual from hoax news. This involves fine-tuning IndoBERT for text classification and examining how resampling techniques such as Random Over Sampling (ROS) and Random Under Sampling (RUS) affect performance on the imbalanced data. The findings show that IndoBERT's accuracy is consistent across ROS and RUS, indicating that it handles imbalanced datasets effectively, with hoax detection reaching 98.2% accuracy, 97.5% recall, 97.8% F1-score, and 97.2% precision.
This is particularly relevant for tasks like misinformation detection, where data imbalance is common. The success of IndoBERT, a language-specific BERT model, in text classification for the Indonesian language contributes to the understanding of BERT-based models in diverse linguistic contexts.
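
To illustrate the two resampling techniques named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the fixed seed, and the placeholder texts are assumptions made for illustration, and only the class counts (6,947 hoax, 20,945 non-hoax) come from the abstract. Random Over Sampling duplicates randomly chosen minority-class examples until every class matches the largest one; Random Under Sampling randomly discards majority-class examples until every class matches the smallest one.

```python
import random
from collections import Counter


def _group_by_label(texts, labels):
    """Collect the texts belonging to each label, preserving input order."""
    groups = {}
    for text, label in zip(texts, labels):
        groups.setdefault(label, []).append(text)
    return groups


def random_over_sample(texts, labels, seed=42):
    """ROS: draw extra copies (with replacement) from each smaller class
    until every class has as many examples as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = max(len(items) for items in groups.values())
    pairs = []
    for label, items in groups.items():
        extras = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((text, label) for text in items + extras)
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [l for _, l in pairs]


def random_under_sample(texts, labels, seed=42):
    """RUS: keep a random subset (without replacement) of each larger class
    so that every class has as many examples as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = min(len(items) for items in groups.values())
    pairs = []
    for label, items in groups.items():
        pairs.extend((text, label) for text in rng.sample(items, target))
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [l for _, l in pairs]


# Placeholder corpus with the class counts reported in the abstract.
# The texts themselves are hypothetical stand-ins for news articles.
texts = [f"hoax article {i}" for i in range(6947)] + \
        [f"non-hoax article {i}" for i in range(20945)]
labels = ["hoax"] * 6947 + ["non-hoax"] * 20945

_, ros_labels = random_over_sample(texts, labels)
_, rus_labels = random_under_sample(texts, labels)

print("original:", Counter(labels))
print("after ROS:", Counter(ros_labels))
print("after RUS:", Counter(rus_labels))
```

In practice, resampling is normally applied to the training split only, so that the validation and test sets keep the original class distribution; whether the study did this is not stated in the abstract.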

Author Biographies

Muhammad Ammar Fathin, Telkom University, Bandung

School of Computing, Informatics

Yuliant Sibaroni, Telkom University, Bandung

School of Computing, Informatics

Sri Suryani Prasetyowati, Telkom University, Bandung

School of Computing, Informatics
