Authors
Yahya Aseri, Khalid Alreemy, Salem Alelyani, Mohamed Mohana, King Khalid University, Saudi Arabia
Abstract
Dialect identification is a prior requirement for learning lexical and morphological knowledge a language variation that can be beneficial for natural language processing (NLP) and potential AI downstream tasks. In this paper, we present the first work on sentence-level Modern Standard Arabic (MSA) and Saudi Dialect (SD) identification where we trained and tested three classifiers (Logistic regression, Multi-nominal Na¨ıve Bayes, and Support Vector Machine) on datasets collected from Saudi Twitter and automatically labeled as (MSA) or SD. The model for each configuration was built using two levels of language models, i.e., unigram and bi-gram, as feature sets for training the systems. The model reported high-accuracy performance using 10-fold cross- validations with average 98.98%. This model was evaluated on another unseen, manually-annotated dataset. The best performance of these classifiers was achieved by Multi-nominal Naïve Bayes, reporting 89%.
Keywords
Dialect Identification, NLP, Standard Arabic, Saudi Dialect, Classification.