Punjabi Text Clustering by Sentence Structure Analysis

Authors

Saurabh Sharma and Vishal Gupta, Panjab University, India

Abstract

Punjabi text document clustering groups documents that share a topic by analyzing their sentence structure. The prevalent algorithms in this field use the vector space model, which treats each document as a bag of words. Meaning in natural language, however, depends inherently on word sequences, which such models ignore during clustering. This paper presents a new Punjabi text clustering algorithm, Clustering by Sentence Structure Analysis (CSSA), which has been applied to 221 Punjabi news articles available on news sites. Phrases are extracted by analyzing the structure of each sentence using the basic grammatical rules of Karaka theory. Sequences formed from these phrases are used to identify the topic and to find similarities among documents, which results in the formation of meaningful clusters.
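The limitation of the bag-of-words view noted above can be illustrated with a minimal sketch. This is not the CSSA algorithm and applies no Karaka rules; the English tokens, the helper names (`bag_of_words`, `bigrams`, `jaccard`), and the use of word bigrams as a stand-in for phrase sequences are all assumptions made purely for illustration.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count each token, discarding word order (hypothetical helper)."""
    return Counter(tokens)


def bigrams(tokens):
    """Return adjacent word pairs, a simple order-aware stand-in for phrase sequences."""
    return list(zip(tokens, tokens[1:]))


def jaccard(a, b):
    """Jaccard similarity between two collections, compared as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two sentences with the same words but opposite meaning (illustrative data only).
s1 = "dog bites man".split()
s2 = "man bites dog".split()

# The bag-of-words representations are identical, so the sentences look the same.
same_bag = bag_of_words(s1) == bag_of_words(s2)

# The sequence-based representations share no pairs, so the sentences look different.
seq_sim = jaccard(bigrams(s1), bigrams(s2))

print(same_bag, seq_sim)
```

Under these assumptions, the two sentences are indistinguishable as bags of words yet share no word pairs, which is the kind of information a sequence-aware method can exploit.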

Keywords

Punjabi language, Text clustering, Sentence structure analysis, Karaka theory.

Volume 2, Number 4