Computing Perplexity Values for Under-resourced Languages using n-gram and Deep Learning Approaches
Pages - 36 - 47     |    Revised - 30-09-2022     |    Published - 31-10-2022
Volume - 13   Issue - 3    |    Publication Date - October 2022
Language Model, Tpuri, n-grams, Neural Network, Long Short-term Memory, Multilayer Perceptron, Perplexity.
Interactions between computers and human language, through the approach called natural language processing, require a good model of the language and a large amount of data. For under-resourced languages, however, the lack of text resources makes it challenging to devise a good model adapted to minority languages. To cope with this issue, this paper focuses on collecting data for the construction of a language model adapted to poorly endowed languages. First, we describe the concept of under-resourced languages and the difficulties related to the digital processing of those languages. To illustrate our model, we collect text data in Tpuri, an African language spoken in Cameroon and Chad. For the collection, we used diverse sources such as existing printed documents. Our dataset contains 1,640,128 words and 108,553 sentences. With the collected dataset, two main language modeling approaches (n-gram and recurrent neural network) have been evaluated. Perplexity values have been computed to judge how good the language model is given the characteristics of under-resourced languages. For the statistical n-gram language model, we obtained a perplexity value of 420.01 for the bigram model and 270.45 for the trigram model. Relying on linear interpolation with xs = [0.2, 0.2, 0.4, 0.2], a best perplexity value of 56.74 was obtained. We also obtained a best perplexity of 47.21 with Laplace smoothing using 4-grams, when x has a value of 0.03. Implementing a recurrent neural network model using the multilayer perceptron (long short-term memory), we obtain a perplexity value of 77.18, which is to be considered a better result.
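To make the n-gram evaluation concrete, the following is a minimal sketch of how perplexity can be computed for a 4-gram model with additive (Laplace-style) smoothing and with linear interpolation. It is not the authors' implementation. The function names, the toy corpus, and the sentence padding scheme are hypothetical. It also assumes that the weights xs = [0.2, 0.2, 0.4, 0.2] apply to the unigram through 4-gram estimates in that order, and that x = 0.03 is the additive constant; the abstract does not state either point.

```python
import math
from collections import Counter


def train(sentences, max_n):
    """Collect (context, word) counts for every n-gram order 1..max_n."""
    ngram_counts, context_counts, vocab = Counter(), Counter(), {"</s>"}
    for sent in sentences:
        vocab.update(sent)
        padded = ["<s>"] * (max_n - 1) + list(sent) + ["</s>"]
        for i in range(max_n - 1, len(padded)):
            for n in range(1, max_n + 1):
                ctx = tuple(padded[i - n + 1:i])  # the n-1 preceding tokens
                ngram_counts[(ctx, padded[i])] += 1
                context_counts[ctx] += 1
    return ngram_counts, context_counts, vocab


def add_k_prob(model, ctx, word, k):
    """Additive smoothing: (c(ctx, w) + k) / (c(ctx) + k * |V|); k = 0 is the MLE."""
    ngram_counts, context_counts, vocab = model
    den = context_counts[ctx] + k * len(vocab)
    return (ngram_counts[(ctx, word)] + k) / den if den > 0 else 0.0


def interpolated_prob(model, history, word, weights):
    """Weighted sum of MLE estimates; weights[j] is assumed to belong to order j + 1."""
    total = 0.0
    for j, weight in enumerate(weights):
        ctx = history[len(history) - j:]  # last j tokens of the history
        total += weight * add_k_prob(model, ctx, word, 0.0)
    return total


def perplexity(sentences, max_n, score):
    """exp of the negative mean log-probability; score must return values > 0."""
    log_sum, n_tokens = 0.0, 0
    for sent in sentences:
        padded = ["<s>"] * (max_n - 1) + list(sent) + ["</s>"]
        for i in range(max_n - 1, len(padded)):
            history = tuple(padded[i - max_n + 1:i])
            log_sum += math.log(score(history, padded[i]))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)


# Hypothetical toy corpus (placeholder tokens, not Tpuri data).
train_sents = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "a"]]
test_sents = [["a", "b", "c"]]
model = train(train_sents, 4)

weights = [0.2, 0.2, 0.4, 0.2]  # assumed order: unigram, bigram, trigram, 4-gram
pp_interp = perplexity(
    test_sents, 4, lambda h, w: interpolated_prob(model, h, w, weights))
pp_addk = perplexity(
    test_sents, 4, lambda h, w: add_k_prob(model, h, w, 0.03))
print(f"interpolation: {pp_interp:.2f}  add-k (k=0.03): {pp_addk:.2f}")
```

The sketch assumes every test token occurs in the training vocabulary. A real evaluation would also need an explicit treatment of out-of-vocabulary words, for example an unknown-word token, which the abstract does not describe.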
Faculty of Science/Department of Mathematics and Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Maroua - Cameroon
Faculty of Science/Department of Mathematics and Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Ngaoundéré - Cameroon
Higher Teachers' Training College/Department of Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Maroua - Cameroon
Faculty of Science/Department of Mathematics and Computer Science, National Institute of Cartography, Cameroon, The University of Ngaoundéré - Cameroon
