Text Data Augmentation to Manage Imbalanced Classification: Apply to BERT-based Large Multiclass Classification for Product Sheets
Yu DU, Erwann LAVAREC, Colin LALOUETTE
Pages - 1 - 18     |    Revised - 30-04-2023     |    Published - 01-06-2023
Volume - 14   Issue - 1    |    Publication Date - June 2023
KEYWORDS
Text Classification, Imbalanced Classification, Natural Language Processing, BERT, CamemBERT, Data Augmentation, Deep Learning.
ABSTRACT
Recent studies have showcased the effectiveness of deep pre-trained language models, such as BERT (Bidirectional Encoder Representations from Transformers), in natural language processing and understanding tasks. BERT, with its ability to learn contextualized word vectors, has proven highly accurate for binary text classification and for basic multiclass classification, where the number of unique labels is relatively small. However, the performance of BERT-based models on more ambitious multiclass classification tasks, involving hundreds of unique labels, is seldom explored, despite the prevalence of such problems in real-world scenarios. Moreover, real-world datasets often exhibit class imbalance, with certain classes having significantly fewer texts than others. This paper makes two primary contributions: first, it examines the performance of BERT-based pre-trained language models on a large-scale multiclass classification task in a specific real-world context; second, it investigates text data augmentation techniques to mitigate the class imbalance problem. Through rigorous experiments in a real-world SaaS (Software as a Service) domain, the results demonstrate that: 1) BERT-based models can effectively tackle large-scale multiclass classification tasks with reasonable prediction performance; and 2) text data augmentation can significantly enhance prediction performance in terms of accuracy (by 34.7%) and F1-score (by 37.1%).
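The paper's exact pipeline is not reproduced on this page, but its two contributions can be made concrete with a short sketch. The first block below is a minimal, language-agnostic text augmentation in the spirit of EDA (Easy Data Augmentation, Wei and Zou, 2019), using only random word swaps and random deletions to oversample minority classes; eda_augment, oversample_minority, and min_per_class are hypothetical names introduced for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict

def eda_augment(text: str, n_swaps: int = 1, p_delete: float = 0.1) -> str:
    """EDA-style augmentation (hypothetical sketch): random word swaps
    followed by random word deletion. Both operations are language-agnostic,
    so they apply to French product-sheet texts without a synonym lexicon."""
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_swaps):                 # swap two randomly chosen words
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    kept = [w for w in words if random.random() > p_delete]  # random deletion
    return " ".join(kept) if kept else text

def oversample_minority(texts, labels, min_per_class=50):
    """Append augmented copies of minority-class texts until every class
    has at least min_per_class examples (illustrative balancing policy)."""
    by_label = defaultdict(list)
    for t, y in zip(texts, labels):
        by_label[y].append(t)
    aug_texts, aug_labels = list(texts), list(labels)
    for y, pool in by_label.items():
        for _ in range(max(0, min_per_class - len(pool))):
            aug_texts.append(eda_augment(random.choice(pool)))
            aug_labels.append(y)
    return aug_texts, aug_labels
```

The second block sketches fine-tuning a CamemBERT classifier with the Hugging Face transformers library on the balanced data produced above. NUM_LABELS, SheetDataset, the toy corpus, and the hyperparameters are assumptions for illustration; the paper's reported configuration may differ.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "camembert-base"  # French BERT variant discussed in the paper
NUM_LABELS = 300               # assumption: hundreds of unique labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)

class SheetDataset(torch.utils.data.Dataset):
    """Tokenizes raw texts once and serves (input, label) pairs."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy corpus standing in for the product-sheet dataset (assumption).
texts = ["bureau en chêne massif avec deux tiroirs",
         "lampe de bureau led orientable lumière blanche",
         "chaise ergonomique réglable en hauteur"]
labels = [0, 1, 2]

# Balance the training split with augmentation, then fine-tune end to end.
aug_texts, aug_labels = oversample_minority(texts, labels, min_per_class=50)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=SheetDataset(aug_texts, aug_labels))
trainer.train()
```

Random swap and deletion are the weakest EDA operations but require no external resources; synonym replacement via a French lexicon or back-translation would be natural, stronger alternatives in this setting.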
Mr. Yu DU
Cloud-is-Mine R&D, Perpignan, 66100 - France
du.yu.1411@gmail.com
Mr. Erwann LAVAREC
Cloud-is-Mine R&D, Perpignan, 66100 - France
Mr. Colin LALOUETTE
Cloud-is-Mine R&D, Perpignan, 66100 - France