Authors
Shahd Alharbi1 and Matthew Purver2, 1King Saud University, Saudi Arabia and 2University of London, UK
Abstract
Most of the recent researches have been carried out to analyse sentiment and emotions found in English texts, where few studies have been conducted on Arabic contents, which have been focused on analysing the sentiment as positive and negative, instead of the different emotions’ classes. Therefore this paper has focused on analysing different six emotions’ classes in Arabic contents, especially Arabic tweets which have unstructured nature that make it challenging task compared to the formal structured contents found in Arabic journals and books. On the other hand, the recent developments in the distributional sematic models, have encouraged testing the effect of the distributional measures on the classification process, which was not investigated by any other classification-related studies for analysing Arabic texts. As a result, the model has successfully improved the average accuracy to more than 86% using Support Vector Machine (SVM) compared to the different sentiments and emotions studies for classifying Arabic texts through the developed semi-supervised approach which has employed the contextual and the co-occurrence information from a large amount of unlabelled dataset. In addition to the different remarkable achieved results, the model has recorded a high average accuracy, 85.30%, after removing the labels from the unlabelled contextual information which was used in the labelled dataset during the classification process. Moreover, due to the unstructured nature of Twitter contents, a general set of pre-processing techniques for Arabic texts was found which has resulted in increasing the accuracy of the six emotions’ classes to 85.95% while employing the contextual information from the unlabelled dataset.
Keywords
SVM, DSM, classifying, Arabic tweets, hashtags, emoticons, NLP &co-occurrence matrix.