Authors
Hossein Hassani (1, 2) and Dzejla Medjedovic (2)
(1) University of Kurdistan Hewlêr, Iraq; (2) Sarajevo School of Science and Technology, Bosnia and Herzegovina
Abstract
Automatic dialect identification is a necessary language technology for processing multi-dialect languages whose dialects are linguistically distant from one another. It becomes crucial when the dialects are mutually unintelligible: to perform any computational task on such a language, a system must first identify the dialect of the input it is processing. The Kurdish language encompasses various dialects, is written in several different scripts, and lacks a standard orthography. This situation makes Kurdish dialect identification both more interesting and more necessary, from the research perspective as well as the application perspective. In this research, we apply a classification method based on supervised machine learning to identify the dialect of Kurdish texts. The research focuses on the two most widely spoken and dominant Kurdish dialects, namely Kurmanji and Sorani. The approach could be applied to the other Kurdish dialects as well, and to languages that resemble Kurdish in their dialectal diversity and differences.
Keywords
Dialect identification, NLP, Kurdish language, Kurmanji, Sorani