An Efficient Approach to Improve Arabic Documents Clustering Based on a new Keyphrases Extraction Algorithm

Authors

Hanane FROUD, Issam SAHMOUDI and Abdelmonaime LACHKAR, Sidi Mohamed Ben Abdellah University (USMBA), Morocco

Abstract

The goal of document clustering algorithms is to create clusters that are internally coherent but clearly different from one another. The useful expressions in a document are often accompanied by a large amount of noise caused by unnecessary words, so it is indispensable to eliminate this noise and keep only the useful information. Keyphrases extraction systems for Arabic are a recent development, and a number of text mining applications can use them to improve their results. Keyphrases are defined as phrases that capture the main topics discussed in a document; they offer a brief and precise summary of the document's content. They can therefore be a good way to remove the noise present in documents. In this paper, we propose a new method to solve this problem, specifically for documents in Arabic, which is one of the most complex languages, by using a new Keyphrases extraction algorithm based on the Suffix Tree data structure (KpST). To evaluate our approach, we conduct an experimental study on Arabic document clustering using the most popular hierarchical approach, the agglomerative hierarchical algorithm, with seven linkage techniques and a variety of distance functions and similarity measures. The obtained results show that our approach for extracting Keyphrases improves the clustering results.

Keywords

Arabic Language, Arabic Text Clustering, Hierarchical Clustering, Suffix Tree Algorithm, Keyphrases Extraction, Similarity Measures.

Full Text  Volume 3, Number 8