Computing Perplexity Values for Under-resourced Languages using n-gram and Deep Learning Approaches
Pages - 36 - 47     |    Revised - 30-09-2022     |    Published - 31-10-2022
Volume - 13   Issue - 3    |    Publication Date - October 2022
Language Model, Tpuri, n-grams, Neural Network, Long Short-term Memory, Multilayer Perceptron, Perplexity.
Interaction between computers and human language, through the approach known as natural language processing, requires a very good model of the language and a large amount of data. For under-resourced languages, however, the lack of text resources makes it challenging to devise a good model adapted to minority languages. To cope with this issue, in this paper we focus on the collection of data for the construction of a language model adapted to poorly endowed languages. First, we describe the concept of under-resourced languages and the difficulties related to the digital processing of those languages. To illustrate our model, we collected text data in Tpuri, an African language spoken in Cameroon and Chad. For the collection, we used diverse sources such as existing printed documents. Our dataset contains 1640128 words and 108553 sentences. With the collected dataset, two main approaches (n-gram and recurrent neural network) have been evaluated. Perplexity values have been computed in order to judge how good the language model is given the characteristics of under-resourced languages. For the statistical n-gram language model, we obtained a perplexity value of 420.01 for the bigram and 270.45 for the trigram. Relying on linear interpolation with xs = [0.2, 0.2, 0.4, 0.2], a best perplexity value of 56.74 could be determined. We also obtained a best perplexity of 47.21 with Laplace smoothing using 4-grams, when x has a value of 0.03. Implementing a recurrent neural network model using the multilayer perceptron (long short-term memory), we obtained a perplexity value of 77.18, which is to be considered a better result.
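As an illustration of the perplexity measure mentioned in the abstract, the following is a minimal sketch of a bigram model with additive smoothing. It is not the authors' implementation: the function names, the toy corpus, and the use of bigrams rather than 4-grams are assumptions made for brevity, and the constant `k` is assumed to play the role of the smoothing value given as 0.03 in the abstract.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def addk_prob(prev, word, unigrams, bigrams, k, vocab_size):
    """Add-k estimate of P(word | prev); k = 1 corresponds to Laplace smoothing."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

def perplexity(sentences, unigrams, bigrams, k=0.03):
    """Perplexity = exp of the negative mean log-probability per predicted token."""
    vocab_size = len(unigrams)
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = addk_prob(prev, word, unigrams, bigrams, k, vocab_size)
            log_sum += math.log(p)
            n += 1
    return math.exp(-log_sum / n)

# Hypothetical toy corpus, only to show the calls; not the Tpuri dataset.
train = [["a", "b"], ["a", "c"]]
uni, bi = train_bigram(train)
seen_ppl = perplexity([["a", "b"]], uni, bi)
unseen_ppl = perplexity([["c", "a"]], uni, bi)
```

A lower perplexity means the model assigns higher probability to the evaluated text, so a sentence whose bigrams occur in the training data scores lower than one built only from unseen bigrams.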
Faculty of Science/Department of Mathematics and Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Maroua - Cameroon
Faculty of Science/Department of Mathematics and Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Ngaoundéré - Cameroon
Higher Teachers' Training College/Department of Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Maroua - Cameroon
Faculty of Science/Department of Mathematics and Computer Science, National Institute of Cartography, Cameroon, The University of Ngaoundéré - Cameroon

