Unicode-based Data Processing for Text Classification

Akash Sedai, Ben Houghton

Pages - 18 - 25 | Revised - 30-06-2022 | Published - 01-08-2022

Published in International Journal of Computational Linguistics (IJCL)

Volume - 13 Issue - 2 | Publication Date - August 2022

KEYWORDS

NLP, Classification, Text Processing, Machine Learning, Labeling.

ABSTRACT

In this paper, we demonstrate a Unicode-based text data processing approach for machine learning classification. Each field is first converted to Unicode, and features are then generated by splitting its characters at vowels or at any custom set of delimiters contained within the field. The fields are labelled into classes, and the model outputs a class prediction for each field. The approach simplifies text preprocessing and can maintain high accuracy. It will be useful to database managers and researchers who work with large unlabelled datasets that need to be labelled into several classes.
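
The feature-generation step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact encoding or the classifier, so everything here is an assumption made for illustration: the function names `split_field` and `field_features` are hypothetical, "converted to Unicode" is read as mapping each character to its Unicode code point, and the default delimiter set is the English vowels.

```python
# Illustrative sketch only; names and encoding choices are assumptions,
# not the authors' published implementation.

VOWELS = "aeiouAEIOU"  # assumed default delimiter set


def split_field(field, delimiters=VOWELS):
    """Split a field at every delimiter character, dropping empty pieces."""
    pieces, current = [], []
    for ch in field:
        if ch in delimiters:
            # A delimiter closes the current piece (if any) and is discarded.
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


def field_features(field, delimiters=VOWELS):
    """Represent each piece of the field as a list of Unicode code points."""
    return [[ord(ch) for ch in piece] for piece in split_field(field, delimiters)]


if __name__ == "__main__":
    print(split_field("London"))                    # split at vowels
    print(field_features("London"))                 # code points per piece
    print(split_field("SE1 7ND", delimiters=" "))   # custom delimiter set
```

The resulting code-point sequences would still need to be padded or otherwise vectorised before being passed to a classifier; that step is left out because the abstract does not describe it.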

MANUSCRIPT AUTHORS

Mr. Akash Sedai

Institute of Finance & Technology, University College London, London, WC1E 6BT - United Kingdom

akash.sharma@ucl.ac.uk

Mr. Ben Houghton

Quantexa Ltd, London, SE1 7ND - United Kingdom
