Exploring Twitter as a Source of an Arabic Dialect Corpus
Areej Odah Alshutayri, Eric Atwell
Pages - 37 - 44     |    Revised - 30-04-2017     |    Published - 01-06-2017
Volume - 8   Issue - 2    |    Publication Date - June 2017
Dialectal Arabic, Phonological Variations, Social Media, Multi Dialect, Twitter, Tweet.
Given the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, which makes social media text an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.
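As a rough illustration of labelling tweets by the sender's geographic location, the following Python sketch maps a free-text profile location to one of the five dialect groups named in the abstract. The country list, the function names, and the assumption that each tweet arrives as a (text, location) pair are hypothetical choices made for illustration; they are not details taken from the paper.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical country-to-group mapping (an assumption, not from the paper);
# the abstract names only the five dialect groups themselves.
DIALECT_GROUP_BY_COUNTRY = {
    "saudi arabia": "Gulf",
    "kuwait": "Gulf",
    "qatar": "Gulf",
    "bahrain": "Gulf",
    "oman": "Gulf",
    "iraq": "Iraqi",
    "egypt": "Egyptian",
    "syria": "Levantine",
    "lebanon": "Levantine",
    "jordan": "Levantine",
    "morocco": "North African",
    "algeria": "North African",
    "tunisia": "North African",
    "libya": "North African",
}


def dialect_group(location: str) -> Optional[str]:
    """Return the dialect group for a free-text location, or None if unknown."""
    text = location.lower()
    for country, group in DIALECT_GROUP_BY_COUNTRY.items():
        # Whole-word match so that "oman" does not match inside "romania".
        if re.search(r"\b" + re.escape(country) + r"\b", text):
            return group
    return None


def label_tweets(tweets: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only tweets whose location maps to a group; return (text, group)."""
    labeled = []
    for text, location in tweets:
        group = dialect_group(location)
        if group is not None:
            labeled.append((text, group))
    return labeled
```

Labelled pairs of this kind could then be exported to a format that WEKA reads, such as ARFF, for the classification step. Tweets with no recognisable location are simply dropped in this sketch.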
Mrs. Areej Odah Alshutayri
Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia, and School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom
Associate Professor Eric Atwell
School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom
