Compression-Based Parts-of-Speech Tagger for The Arabic Language

Ibrahim S. Alkhazi; William J. Teahan

-- Creative Commons Attribution NonCommercial 4.0 International License

Pages - 1 - 15 | Revised - 31-03-2019 | Published - 30-04-2019

Published in International Journal of Computational Linguistics (IJCL)

Volume - 10 Issue - 1 | Publication Date - April 2019

KEYWORDS

Natural Language Processing, Arabic Part-of-Speech Tagger, Hidden Markov Model, Statistical Language Model.

ABSTRACT

This paper explores the use of compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction by Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger: the first models were trained on silver-standard data produced by two different Arabic POS taggers, and the second model was trained on the BAAC corpus, a manually annotated Modern Standard Arabic (MSA) corpus of 50K terms, on which the PPM tagger achieved an accuracy of 93.07%. In addition, tag-based compression models were used to evaluate the new tagger by first tagging different Classical Arabic and MSA corpora and then compressing the text with those models. The results show that the silver-standard models reduced the quality of tag-based compression by an average of 0.43%, whereas the gold-standard model increased it by an average of 4.61% when used to tag MSA text.
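The abstract evaluates the tagger indirectly, by how well tagged text compresses. As a rough illustration of that idea only, the sketch below computes how many bits an adaptive order-1 model would need to encode a sequence of POS tags. The function name `code_length_bits`, the toy tag sequences, and the add-one smoothing are assumptions made for this example; the paper's tagger uses PPM, whose escape mechanism and higher-order contexts are not reproduced here.

```python
import math
from collections import defaultdict


def code_length_bits(symbols, alphabet):
    """Bits needed to encode `symbols` with an adaptive order-1 model.

    Simplified stand-in for a PPM model (assumption): each symbol is
    predicted from the previous one, and add-one smoothing is used in
    place of PPM's escape mechanism.
    """
    counts = defaultdict(lambda: defaultdict(int))  # counts[context][symbol]
    totals = defaultdict(int)                       # symbols seen per context
    size = len(alphabet)
    bits = 0.0
    prev = None  # start-of-sequence context
    for sym in symbols:
        # Probability estimate from what has been seen so far in this context.
        p = (counts[prev][sym] + 1) / (totals[prev] + size)
        bits -= math.log2(p)
        # Update the model only after coding, as an adaptive coder would.
        counts[prev][sym] += 1
        totals[prev] += 1
        prev = sym
    return bits


# Hypothetical tag sequences over a two-tag alphabet.
tags = {"N", "V"}
regular = ["N", "V"] * 10
irregular = list("NNVNVVNVVVNNVNNVVNVN")

print(code_length_bits(regular, tags) / len(regular))      # bits per tag
print(code_length_bits(irregular, tags) / len(irregular))  # bits per tag
```

A more predictable tag sequence costs fewer bits per tag, which is the intuition behind using compression quality as a proxy for tagging quality.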

ABSTRACTING & INDEXING

1	Google Scholar

2	Semantic Scholar

3	BibSonomy

4	refSeek

5	Doc Player

6	Scribd

7	SlideShare

MANUSCRIPT AUTHORS

Mr. Ibrahim S. Alkhazi

College of Computers & Information Technology, Tabuk University, Tabuk, Saudi Arabia

i.alkhazi@ut.edu.sa

Dr. William J. Teahan

School of Computer Science and Electronic Engineering, Bangor University, United Kingdom
