Unicode-based Data Processing for Text Classification
Akash Sedai, Ben Houghton
Pages - 18 - 25     |    Revised - 30-06-2022     |    Published - 01-08-2022
Volume - 13   Issue - 2    |    Publication Date - August 2022
NLP, Classification, Text Processing, Machine Learning, Labeling.
In this paper we demonstrate a Unicode-based text data processing approach for machine learning classification. The fields are first converted to Unicode, and the features are then generated by splitting the characters at vowels or at any custom set of delimiters contained within the fields. The fields are labelled into classes, and the model outputs a class prediction for each field. The approach offers a simpler route to text preprocessing that can maintain high accuracy. It will be useful to database managers and researchers who work with large unlabelled datasets that need to be labelled into several classes.
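The feature-generation step described in the abstract can be illustrated with a minimal Python sketch. It is one possible reading of that description, not the authors' implementation: the function name `unicode_features`, the choice of English vowels as the default delimiter set, and the representation of each feature as a tuple of Unicode code points are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

# Assumed default delimiter set: English vowels in both cases.
VOWELS: FrozenSet[str] = frozenset("aeiouAEIOU")


def unicode_features(
    field: str, delimiters: FrozenSet[str] = VOWELS
) -> List[Tuple[int, ...]]:
    """Split a field at delimiter characters and return each remaining
    chunk as a tuple of Unicode code points (hypothetical helper)."""
    features: List[Tuple[int, ...]] = []
    chunk: List[int] = []
    for ch in field:
        if ch in delimiters:
            # A delimiter closes the current chunk; empty chunks are skipped.
            if chunk:
                features.append(tuple(chunk))
                chunk = []
        else:
            # Convert the character to its Unicode code point.
            chunk.append(ord(ch))
    if chunk:
        features.append(tuple(chunk))
    return features


# Splitting at vowels (the assumed default).
vowel_split = unicode_features("London")

# Splitting at a custom delimiter set, here a single space.
space_split = unicode_features("SE1 7ND", delimiters=frozenset(" "))
```

Under these assumptions, `"London"` yields three chunks of code points (for `L`, `nd`, and `n`), and `"SE1 7ND"` split on a space yields two. The chunks would still need to be encoded into fixed-length vectors before being passed to a classifier, a step the abstract does not specify.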
Mr. Akash Sedai
Institute of Finance & Technology, University College London, London, WC1E 6BT - United Kingdom
Mr. Ben Houghton
Quantexa Ltd, London, SE1 7ND - United Kingdom
