Pengaruh Distribusi Panjang Data Teks pada Klasifikasi: Sebuah Studi Awal (The Effect of Text Length Distribution on Classification: A Preliminary Study)
DOI: https://doi.org/10.30865/mib.v6i3.4259

Keywords: Text Classification, Cross-Length

Abstract
In text classification, a model is often trained on data from one text domain but applied to data from another (cross-domain). In addition to domain differences, there are also language differences (cross-lingual). Many previous studies have examined how classification models can be applied effectively and efficiently in these cross-domain and cross-lingual situations. However, one difference has received little attention because it is assumed to have minor influence: the difference in text length (cross-length). In this study, we investigate the cross-length condition further by constructing a dedicated dataset and testing it with several commonly used classification models. The results show that a difference in the distribution of text length between the training data and the test data can affect performance. Cross-length transfer from long to short texts shows an average F1-score decrease of 14% across all models, while transfer from short to long texts shows an average decrease of 9%.

References
B. Zhang, X. Zhang, Y. Liu, L. Cheng, and Z. Li, “Matching Distributions between Model and Data: Cross-domain Knowledge Distillation for Unsupervised Domain Adaptation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Aug. 2021, pp. 5423–5433.
H. S. Bhatt, M. Sinha, and S. Roy, “Cross-domain text classification with multiple domains and disparate label sets,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1641–1650.
B. Jing, C. Lu, D. Wang, F. Zhuang, and C. Niu, “Cross-Domain Labeled LDA for Cross-Domain Text Classification,” in 2018 IEEE International Conference on Data Mining (ICDM), Nov. 2018, pp. 187–196.
A. Mogadala and A. Rettinger, “Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 692–702.
G. Barbosa, R. Camelo, A. P. Cavalcanti, P. Miranda, R. F. Mello, V. Kovanović, and D. Gašević, “Towards automatic cross-language classification of cognitive presence in online discussions,” in Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, Mar. 2020, pp. 605–614.
G. Karamanolakis, D. Hsu, and L. Gravano, “Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Nov. 2020, pp. 3604–3622.
S. Jarvis and S. A. Crossley, Approaching Language Transfer Through Text Classification: Explorations in the Detection-based Approach. Multilingual Matters, 2012.
R. K. Amplayo, S. Lim, and S.-W. Hwang, “Text Length Adaptation in Sentiment Classification,” in Proceedings of The Eleventh Asian Conference on Machine Learning, 2019, vol. 101, pp. 646–661.
K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, “Text Classification Algorithms: A Survey,” Information, vol. 10, no. 4, p. 150, Apr. 2019.
V. S. Jagtap and K. Pawar, “Analysis of different approaches to sentence-level sentiment classification,” International Journal of Scientific Engineering and Technology, vol. 2, no. 3, pp. 164–170, 2013.
Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct. 2014, pp. 1746–1751.
Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
M. P. Akhter, Z. Jiangbin, I. R. Naqvi, M. Abdelmajeed, A. Mehmood, and M. T. Sadiq, “Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network,” IEEE Access, vol. 8, pp. 42689–42707, 2020.
G. Wang and S. Y. Shin, “An Improved Text Classification Method for Sentiment Classification,” Journal of Information and Communication Convergence Engineering, vol. 17, no. 1, pp. 41–48, 2019.
Y. Goldberg, “Neural Network Methods for Natural Language Processing,” Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, Apr. 2017.
S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent Convolutional Neural Networks for Text Classification,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Feb. 2015, pp. 2267–2273.
M. M. Mirończuk and J. Protasiewicz, “A recent overview of the state-of-the-art elements of text classification,” Expert Systems with Applications, vol. 106, pp. 36–54, Sep. 2018.
B. Jang, M. Kim, G. Harerimana, S.-U. Kang, and J. W. Kim, “Bi-LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism,” Applied Sciences, vol. 10, no. 17, p. 5841, Aug. 2020.
A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun, “Very Deep Convolutional Networks for Text Classification,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Apr. 2017, pp. 1107–1116.
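The cross-length evaluation protocol described in the abstract — partition a labeled corpus by text length, train a classifier on one length regime, and measure F1 on the other — can be sketched as below. This is a minimal plain-Python illustration: the toy corpus, the token-count threshold, and the unigram Naive Bayes classifier are all assumptions for demonstration, not the dataset or models used in the study.

```python
# Minimal sketch of a cross-length evaluation: split a labeled corpus by
# token count, train on one length regime, and score macro-F1 on the other.
import math
from collections import Counter, defaultdict


def split_by_length(corpus, threshold):
    """Partition (text, label) pairs into short/long by whitespace-token count."""
    short = [(t, y) for t, y in corpus if len(t.split()) < threshold]
    long_ = [(t, y) for t, y in corpus if len(t.split()) >= threshold]
    return short, long_


class NaiveBayes:
    """Unigram multinomial Naive Bayes with add-one smoothing."""

    def fit(self, data):
        self.prior = Counter(label for _, label in data)
        self.counts = defaultdict(Counter)
        for text, label in data:
            self.counts[label].update(text.lower().split())
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, text):
        def log_posterior(label):
            denom = sum(self.counts[label].values()) + len(self.vocab)
            score = math.log(self.prior[label])
            for w in text.lower().split():
                score += math.log((self.counts[label][w] + 1) / denom)
            return score
        return max(self.prior, key=log_posterior)


def macro_f1(model, data):
    """Macro-averaged F1 over the gold label set."""
    preds = [(model.predict(t), y) for t, y in data]
    scores = []
    for c in {y for _, y in data}:
        tp = sum(p == c and y == c for p, y in preds)
        fp = sum(p == c and y != c for p, y in preds)
        fn = sum(p != c and y == c for p, y in preds)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Train on short texts only, then evaluate both in-distribution (short)
# and under cross-length transfer (long).
corpus = [
    ("good movie", "pos"), ("great film", "pos"),
    ("bad movie", "neg"), ("awful film", "neg"),
    ("the film was good and the acting was truly great throughout", "pos"),
    ("the movie was bad and the plot was awful from the start", "neg"),
]
short, long_ = split_by_length(corpus, threshold=5)
model = NaiveBayes().fit(short)
print("short -> short F1:", macro_f1(model, short))
print("short -> long  F1:", macro_f1(model, long_))
```

On real data the transfer would be run in both directions (long→short and short→long) and each compared against its matched-length baseline, mirroring the 14% and 9% average F1 drops reported in the abstract.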
License

This work is licensed under a Creative Commons Attribution 4.0 International License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (Refer to The Effect of Open Access).