Pengaruh Distribusi Panjang Data Teks pada Klasifikasi: Sebuah Studi Awal (The Effect of Text Length Distribution on Classification: A Preliminary Study)

Authors

  • Said Al Faraby, Telkom University, Bandung
  • Ade Romadhony, Telkom University, Bandung

DOI:

https://doi.org/10.30865/mib.v6i3.4259

Keywords:

Text Classification, Cross-Length

Abstract

In text classification, a model is often trained on data from one text domain but applied to data from another (cross-domain), or trained in one language but applied to another (cross-lingual). Many previous studies have investigated how classification models can be applied effectively and efficiently in these cross-domain and cross-lingual settings. One difference, however, has received little attention because it is assumed to have little influence: the difference in text length (cross-length). In this study, we investigate the cross-length condition by constructing a dedicated dataset and testing it with several commonly used classification models. The results show that a mismatch in the text-length distribution between training data and test data can degrade performance: transferring from long to short texts reduces the F1-score by 14% on average across all models, while transferring from short to long texts reduces it by 9% on average.
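The evaluation protocol the abstract describes can be sketched as follows: partition a labelled corpus into "short" and "long" subsets by token count, train on one subset, and measure macro-F1 on the other. This is a minimal illustration under assumed details (the toy corpus, the length threshold, and the Naive Bayes classifier are illustrative choices, not the paper's actual dataset or models):

```python
import math
from collections import Counter, defaultdict

def split_by_length(corpus, threshold):
    """Partition (text, label) pairs by whitespace token count."""
    short = [ex for ex in corpus if len(ex[0].split()) <= threshold]
    long_ = [ex for ex in corpus if len(ex[0].split()) > threshold]
    return short, long_

def train_nb(data):
    """Fit a multinomial Naive Bayes with add-one smoothing."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for text, label in data:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior for `text`."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

def macro_f1(gold, pred):
    """Macro-averaged F1 over the union of gold and predicted labels."""
    labels = set(gold) | set(pred)
    f1s = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy sentiment corpus with mixed text lengths (illustrative only).
corpus = [
    ("great movie", "pos"),
    ("terrible film", "neg"),
    ("i really enjoyed this great and moving film a lot", "pos"),
    ("this was a terrible boring film that i did not enjoy", "neg"),
]
short, long_ = split_by_length(corpus, threshold=5)

# Cross-length transfer: train on long texts, evaluate on short texts.
model = train_nb(long_)
gold = [y for _, y in short]
pred = [predict_nb(model, t) for t, _ in short]
print(macro_f1(gold, pred))
```

Swapping the roles of `short` and `long_` gives the opposite transfer direction (short-to-long), and comparing both scores against the matched-length baselines yields the kind of performance drop the study reports.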

References

B. Zhang, X. Zhang, Y. Liu, L. Cheng, and Z. Li, “Matching Distributions between Model and Data: Cross-domain Knowledge Distillation for Unsupervised Domain Adaptation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Aug. 2021, pp. 5423–5433.

H. S. Bhatt, M. Sinha, and S. Roy, “Cross-domain text classification with multiple domains and disparate label sets,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1641–1650.

B. Jing, C. Lu, D. Wang, F. Zhuang, and C. Niu, “Cross-Domain Labeled LDA for Cross-Domain Text Classification,” in 2018 IEEE International Conference on Data Mining (ICDM), Nov. 2018, pp. 187–196.

A. Mogadala and A. Rettinger, “Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 692–702.

G. Barbosa, R. Camelo, A. P. Cavalcanti, P. Miranda, R. F. Mello, V. Kovanović, and D. Gašević, “Towards automatic cross-language classification of cognitive presence in online discussions,” in Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, Mar. 2020, pp. 605–614.

G. Karamanolakis, D. Hsu, and L. Gravano, “Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Nov. 2020, pp. 3604–3622.

S. Jarvis and S. A. Crossley, Approaching Language Transfer Through Text Classification: Explorations in the Detection-based Approach. Multilingual Matters, 2012.

R. K. Amplayo, S. Lim, and S.-W. Hwang, “Text Length Adaptation in Sentiment Classification,” in Proceedings of The Eleventh Asian Conference on Machine Learning, 2019, vol. 101, pp. 646–661.

K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, “Text Classification Algorithms: A Survey,” Information, vol. 10, no. 4, p. 150, Apr. 2019.

V. S. Jagtap and K. Pawar, “Analysis of different approaches to sentence-level sentiment classification,” International Journal of Scientific Engineering and Technology, vol. 2, no. 3, pp. 164–170, 2013.

Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct. 2014, pp. 1746–1751.

Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.

M. P. Akhter, Z. Jiangbin, I. R. Naqvi, M. Abdelmajeed, A. Mehmood, and M. T. Sadiq, “Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network,” IEEE Access, vol. 8, pp. 42689–42707, 2020.

G. Wang and S. Y. Shin, “An Improved Text Classification Method for Sentiment Classification,” Journal of Information and Communication Convergence Engineering, vol. 17, no. 1, pp. 41–48, 2019.

Y. Goldberg, “Neural Network Methods for Natural Language Processing,” Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, Apr. 2017.

S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent Convolutional Neural Networks for Text Classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, Feb. 2015.

M. M. Mirończuk and J. Protasiewicz, “A recent overview of the state-of-the-art elements of text classification,” Expert Syst. Appl., vol. 106, pp. 36–54, Sep. 2018.

B. Jang, M. Kim, G. Harerimana, S.-U. Kang, and J. W. Kim, “Bi-LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism,” Applied Sciences, vol. 10, no. 17, p. 5841, Aug. 2020.

A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun, “Very Deep Convolutional Networks for Text Classification,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Apr. 2017, pp. 1107–1116.

Published

2022-07-25

Section

Articles