Compression-Based Parts-of-Speech Tagger for The Arabic Language
Ibrahim S. Alkhazi, William J. Teahan
Pages - 1 - 15     |    Revised - 31-03-2019     |    Published - 30-04-2019
Volume - 10   Issue - 1    |    Publication Date - April 2019
Natural Language Processing, Arabic Part-of-Speech Tagger, Hidden Markov Model, Statistical Language Model.
This paper explores the use of compression-based models to train a part-of-speech (POS) tagger for the Arabic language. The new tagger is based on the Prediction by Partial Matching (PPM) compression scheme, which has already been applied successfully to several NLP tasks. Several models were trained for the tagger. The first models were trained on silver-standard data produced by two different Arabic POS taggers. The second model was trained on the BAAC corpus, a manually annotated Modern Standard Arabic (MSA) corpus of 50K terms; with this model, the PPM tagger achieved an accuracy of 93.07%. Tag-based compression models were also used to evaluate the new tagger: several Classical Arabic and Modern Standard Arabic corpora were first tagged and then compressed using tag-based compression models. The results show that the silver-standard models reduced the quality of tag-based compression by an average of 0.43%, whereas the gold-standard model increased it by an average of 4.61% when used to tag Modern Standard Arabic text.
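The idea of tagging by compression can be illustrated with a minimal sketch: among all candidate tag sequences, pick the one under which the tagged text costs the fewest bits to encode. The sketch below is not the paper's PPM tagger. It swaps in a fixed-order count model with add-one smoothing in place of PPM's escape mechanism, and the function names (`train`, `bits`, `tag`), the tag labels, and the toy training data in Buckwalter-style transliteration are all hypothetical, invented only for illustration.

```python
import math
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count tag-bigram and word-given-tag events from (word, tag) sentences."""
    trans = defaultdict(Counter)  # previous tag -> counts of the next tag
    emit = defaultdict(Counter)   # tag -> counts of the words it labels
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"              # sentence-start marker
        for word, tag_ in sentence:
            trans[prev][tag_] += 1
            emit[tag_][word] += 1
            vocab.add(word)
            prev = tag_
    return trans, emit, vocab

def bits(counter, symbol, alphabet_size):
    """Code length in bits of `symbol` under an add-one smoothed estimate.
    (Assumption: simple smoothing stands in for PPM's escape mechanism.)"""
    total = sum(counter.values())
    return -math.log2((counter[symbol] + 1) / (total + alphabet_size))

def tag(words, model):
    """Return the tag sequence that minimises the total code length."""
    trans, emit, vocab = model
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1      # one extra slot for unseen words
    # best[t] = (total bits, tag path) of the cheapest path ending in tag t
    best = {"<s>": (0.0, [])}
    for word in words:
        step = {}
        for t in tags:
            word_cost = bits(emit[t], word, n_words)
            step[t] = min(
                (cost + bits(trans[prev], t, n_tags) + word_cost, path + [t])
                for prev, (cost, path) in best.items()
            )
        best = step
    return min(best.values())[1]

# Toy training data (invented for this sketch, not taken from BAAC).
toy_corpus = [
    [("ktb", "VERB"), ("Alwld", "NOUN"), ("Aldrs", "NOUN")],
    [("qrO", "VERB"), ("Alwld", "NOUN"), ("fy", "PREP"), ("Albyt", "NOUN")],
    [("ktb", "VERB"), ("Albnt", "NOUN"), ("Aldrs", "NOUN")],
]
model = train(toy_corpus)
print(tag(["qrO", "Albnt", "Aldrs"], model))
```

The search is a standard Viterbi pass in which costs are code lengths rather than probabilities, so the cheapest path is also the most compressible tagging under the model. A real compression-based tagger would replace `bits` with the adaptive predictions of a PPM model.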
Mr. Ibrahim S. Alkhazi
College of Computers & Information Technology, Tabuk University, Tabuk, Saudi Arabia
Dr. William J. Teahan
School of Computer Science and Electronic Engineering, Bangor University, United Kingdom