Unicode-based Data Processing for Text Classification
Akash Sedai, Ben Houghton
Pages - 18 - 25     |    Revised - 30-06-2022     |    Published - 01-08-2022
Volume - 13   Issue - 2    |    Publication Date - August 2022
NLP, Classification, Text Processing, Machine Learning, Labeling.
In this paper we demonstrate a Unicode-based text data processing approach for machine learning classification. Each field is first converted to Unicode, and features are then generated by splitting its characters on vowels or on any custom set of delimiters contained within the field. The fields are labelled into classes, and the model outputs a class prediction for each field. The approach offers simpler text preprocessing while maintaining high accuracy. It will be useful to database managers and researchers who work with large unlabelled datasets that need to be labelled into several classes.
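The preprocessing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (split_field, to_code_points, field_features), the default vowel set, the choice to drop empty chunks, and the choice to split before mapping characters to code points are all assumptions made for illustration.

```python
import re

# Assumed default delimiter set: ASCII vowels in both cases.
VOWELS = "aeiouAEIOU"


def split_field(field, delimiters=VOWELS):
    """Split a text field at every delimiter character, dropping empty chunks."""
    if not delimiters:
        # No delimiters given: treat the whole field as a single chunk.
        return [field] if field else []
    pattern = "[" + re.escape(delimiters) + "]"
    return [chunk for chunk in re.split(pattern, field) if chunk]


def to_code_points(chunk):
    """Map each character of a chunk to its Unicode code point."""
    return [ord(ch) for ch in chunk]


def field_features(field, delimiters=VOWELS):
    """Hypothetical feature builder: one list of code points per chunk."""
    return [to_code_points(chunk) for chunk in split_field(field, delimiters)]


if __name__ == "__main__":
    print(split_field("London"))          # ['L', 'nd', 'n']
    print(field_features("London"))       # [[76], [110, 100], [110]]
    print(split_field("SE1 7ND", " "))    # ['SE1', '7ND']
```

The resulting lists vary in length from field to field, so a classifier would typically need them padded or aggregated into fixed-size vectors; how that is done is not stated in the abstract and is left open here.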
Mr. Akash Sedai
Institute of Finance & Technology, University College London, London, WC1E 6BT - United Kingdom
Mr. Ben Houghton
Quantexa Ltd, London, SE1 7ND - United Kingdom

