Pengaruh Distribusi Panjang Data Teks pada Klasifikasi: Sebuah Studi Awal (The Effect of Text Length Distribution on Classification: A Preliminary Study)

Authors

  • Said Al Faraby, Telkom University, Bandung
  • Ade Romadhony, Telkom University, Bandung

DOI:

https://doi.org/10.30865/mib.v6i3.4259

Keywords:

Text Classification, Cross-Length

Abstract

In text classification, a model is often trained on data from one text domain but applied to data from another (cross-domain), or trained in one language but applied to another (cross-lingual). Many previous studies have investigated how classification models can be applied effectively and efficiently in these cross-domain and cross-lingual settings. One difference, however, has received little attention because it is assumed to have little influence: the difference in text length (cross-length). In this study, we investigate the cross-length condition by constructing a dedicated dataset and testing it with several commonly used classification models. The results show that a mismatch in the text-length distribution between training data and test data can degrade performance: transferring from long to short texts reduces the F1-score by 14% on average across all models, while transferring from short to long texts reduces it by 9% on average.
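The evaluation protocol the abstract describes can be sketched as follows: partition a labelled corpus into "short" and "long" subsets by token count, train on one subset, and measure macro-F1 on the other. This is a minimal illustration under assumed details (the toy corpus, the length threshold, and the Naive Bayes classifier are illustrative choices, not the paper's actual dataset or models):

```python
import math
from collections import Counter, defaultdict

def split_by_length(corpus, threshold):
    """Partition (text, label) pairs by whitespace token count."""
    short = [ex for ex in corpus if len(ex[0].split()) <= threshold]
    long_ = [ex for ex in corpus if len(ex[0].split()) > threshold]
    return short, long_

def train_nb(data):
    """Fit a multinomial Naive Bayes with add-one smoothing."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for text, label in data:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior for `text`."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

def macro_f1(gold, pred):
    """Macro-averaged F1 over the union of gold and predicted labels."""
    labels = set(gold) | set(pred)
    f1s = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy sentiment corpus with mixed text lengths (illustrative only).
corpus = [
    ("great movie", "pos"),
    ("terrible film", "neg"),
    ("i really enjoyed this great and moving film a lot", "pos"),
    ("this was a terrible boring film that i did not enjoy", "neg"),
]
short, long_ = split_by_length(corpus, threshold=5)

# Cross-length transfer: train on long texts, evaluate on short texts.
model = train_nb(long_)
gold = [y for _, y in short]
pred = [predict_nb(model, t) for t, _ in short]
print(macro_f1(gold, pred))
```

Swapping the roles of `short` and `long_` gives the opposite transfer direction (short-to-long), and comparing both scores against the matched-length baselines yields the kind of performance drop the study reports.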

References

B. Zhang, X. Zhang, Y. Liu, L. Cheng, and Z. Li, “Matching Distributions between Model and Data: Cross-domain Knowledge Distillation for Unsupervised Domain Adaptation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Aug. 2021, pp. 5423–5433.

H. S. Bhatt, M. Sinha, and S. Roy, “Cross-domain text classification with multiple domains and disparate label sets,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1641–1650.

B. Jing, C. Lu, D. Wang, F. Zhuang, and C. Niu, “Cross-Domain Labeled LDA for Cross-Domain Text Classification,” in 2018 IEEE International Conference on Data Mining (ICDM), Nov. 2018, pp. 187–196.

A. Mogadala and A. Rettinger, “Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 692–702.

G. Barbosa, R. Camelo, A. P. Cavalcanti, P. Miranda, R. F. Mello, V. Kovanović, and D. Gašević, “Towards automatic cross-language classification of cognitive presence in online discussions,” in Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, Mar. 2020, pp. 605–614.

G. Karamanolakis, D. Hsu, and L. Gravano, “Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Nov. 2020, pp. 3604–3622.

S. Jarvis and S. A. Crossley, Approaching Language Transfer Through Text Classification: Explorations in the Detection-based Approach. Multilingual Matters, 2012.

R. K. Amplayo, S. Lim, and S.-W. Hwang, “Text Length Adaptation in Sentiment Classification,” in Proceedings of The Eleventh Asian Conference on Machine Learning, 2019, vol. 101, pp. 646–661.

K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, “Text Classification Algorithms: A Survey,” Information, vol. 10, no. 4, p. 150, Apr. 2019.

V. S. Jagtap and K. Pawar, “Analysis of different approaches to sentence-level sentiment classification,” International Journal of Scientific Engineering and Technology, vol. 2, no. 3, pp. 164–170, 2013.

Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct. 2014, pp. 1746–1751.

Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.

M. P. Akhter, Z. Jiangbin, I. R. Naqvi, M. Abdelmajeed, A. Mehmood, and M. T. Sadiq, “Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network,” IEEE Access, vol. 8, pp. 42689–42707, 2020.

G. Wang and S. Y. Shin, “An Improved Text Classification Method for Sentiment Classification,” Journal of Information and Communication Convergence Engineering, vol. 17, no. 1, pp. 41–48, 2019.

Y. Goldberg, “Neural Network Methods for Natural Language Processing,” Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, Apr. 2017.

S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent Convolutional Neural Networks for Text Classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, Feb. 2015.

M. M. Mirończuk and J. Protasiewicz, “A recent overview of the state-of-the-art elements of text classification,” Expert Syst. Appl., vol. 106, pp. 36–54, Sep. 2018.

B. Jang, M. Kim, G. Harerimana, S.-U. Kang, and J. W. Kim, “Bi-LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism,” Applied Sciences, vol. 10, no. 17, p. 5841, Aug. 2020.

A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun, “Very Deep Convolutional Networks for Text Classification,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Apr. 2017, pp. 1107–1116.

Published

2022-07-25

Section

Articles