Text Data Augmentation to Manage Imbalanced Classification: Apply to BERT-based Large Multiclass Classification for Product Sheets
Recent studies have shown that deep pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) are effective for natural language processing and understanding tasks. Because it learns contextualized word vectors, BERT has proven highly accurate for binary text classification and for basic multiclass classification, where the number of unique labels is relatively small. However, the performance of BERT-based models on more ambitious multiclass classification tasks involving hundreds of unique labels is seldom explored, even though such problems are common in real-world scenarios. Moreover, real-world datasets often exhibit class imbalance, with some classes having significantly fewer texts than others. This paper makes two primary contributions: first, it examines the performance of BERT-based pre-trained language models on large multiclass classification tasks within a specific real-world context; second, it investigates the application of text data augmentation techniques to mitigate the class imbalance problem. Rigorous experiments in a real-world SaaS (Software as a Service) domain demonstrate that: 1) BERT-based models can effectively tackle large multiclass classification tasks, delivering reasonable prediction performance; and 2) text data augmentation can significantly enhance prediction performance in terms of accuracy (by 34.7%) and F1-score (by 37.1%).
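To make the idea of augmentation-based rebalancing concrete, the following is a minimal sketch, not the method used in this paper. It assumes a simple random word swap as the augmentation operation and oversamples each minority class until it matches the largest class. The function names `random_swap` and `balance_by_augmentation`, as well as the toy product texts, are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def random_swap(text, rng):
    """Return a copy of `text` with two randomly chosen words swapped.

    Illustrative augmentation only; the abstract does not specify
    which augmentation techniques the paper applies.
    """
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def balance_by_augmentation(texts, labels, seed=0):
    """Add augmented copies of minority-class texts until every class
    has as many examples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the original texts by class label.
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = list(texts), list(labels)
    for label, items in by_label.items():
        # Generate as many augmented texts as the class is missing.
        for _ in range(target - len(items)):
            out_texts.append(random_swap(rng.choice(items), rng))
            out_labels.append(label)
    return out_texts, out_labels


# Toy data: three "clothing" sheets, one "electronics" sheet.
texts = [
    "red cotton shirt",
    "blue denim jeans",
    "black leather belt",
    "wireless optical mouse",
]
labels = ["clothing", "clothing", "clothing", "electronics"]

balanced_texts, balanced_labels = balance_by_augmentation(texts, labels)
class_counts = Counter(balanced_labels)
```

After the call, `class_counts` holds the same number of examples for both classes. In a setting like the one described above, the balanced texts would then be used as training data for fine-tuning a BERT-based classifier, with augmentation applied to the training split only so that evaluation data stays untouched.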
