Exploring Twitter as a Source of an Arabic Dialect Corpus

Areej Odah Alshutayri, Eric Atwell

Pages - 37 - 44 | Revised - 30-04-2017 | Published - 01-06-2017

Published in International Journal of Computational Linguistics (IJCL)

Volume - 8 | Issue - 2 | Publication Date - June 2017

KEYWORDS

Dialectal Arabic, Phonological Variations, Social Media, Multi Dialect, Twitter, Tweet.

ABSTRACT

Arabic dialect text corpora are scarce in comparison with what is available for dialects of English and other languages, so there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, which makes social media text an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.
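
As a rough illustration of labelling tweets by the sender's geographic location, the minimal sketch below assigns each tweet to one of the five dialect groups named in the abstract. It is not the authors' implementation: the country-to-dialect mapping, the `country` and `text` field names, and the helper names `label_tweet` and `label_corpus` are all assumptions made for the example.

```python
# Hypothetical mapping from sender country to dialect group.
# The five group names come from the abstract; the country lists are an
# assumption made for illustration only.
DIALECT_BY_COUNTRY = {
    "Saudi Arabia": "Gulf",
    "Kuwait": "Gulf",
    "Qatar": "Gulf",
    "United Arab Emirates": "Gulf",
    "Iraq": "Iraqi",
    "Egypt": "Egyptian",
    "Jordan": "Levantine",
    "Syria": "Levantine",
    "Lebanon": "Levantine",
    "Morocco": "North African",
    "Algeria": "North African",
    "Tunisia": "North African",
}


def label_tweet(tweet):
    """Return the dialect group for a tweet dict, or None if the
    sender's country is missing or not in the mapping."""
    country = (tweet.get("country") or "").strip()
    return DIALECT_BY_COUNTRY.get(country)


def label_corpus(tweets):
    """Keep only tweets whose location maps to a dialect group and
    return (text, dialect) pairs ready for a text classifier."""
    labelled = []
    for tweet in tweets:
        dialect = label_tweet(tweet)
        if dialect is not None:
            labelled.append((tweet["text"], dialect))
    return labelled
```

Pairs produced this way could then be exported to a format that a toolkit such as WEKA can read; tweets without a usable location are dropped here rather than guessed.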

ABSTRACTING & INDEXING

1	Google Scholar

2	BibSonomy

3	ResearchGate

4	White Rose Research Online

5	Scribd

6	SlideShare

MANUSCRIPT AUTHORS

Mrs. Areej Odah Alshutayri

Faculty of Computing and Information Technology, King Abdul Aziz University, Jeddah, Saudi Arabia; and School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom

aalshetary@kau.edu.sa

Associate Professor Eric Atwell

School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom