Applying Clustering of Hierarchical K-means-like Algorithm on Arabic Language
Authors: Sameh H. Ghwanmeh
Abstract:
This study implements a clustering technique: a K-means-like algorithm with a hierarchical initial set (HKM). The goal is to show that clustering document sets improves the precision of information retrieval systems, as Bellot & El-Bèze showed for the French language. A traditional information retrieval system is compared with a clustered one, and the effect of increasing the number of clusters on precision is also studied. Documents are indexed using Term Frequency * Inverse Document Frequency (TF * IDF). HKM clustering with 3 clusters over 242 Arabic abstract documents from the Saudi Arabian National Computer Conference gives significantly better results than a traditional information retrieval system without clustering. Additionally, increasing the number of clusters does not necessarily improve precision further.
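To make the indexing step concrete, the following is a minimal sketch of TF * IDF weighting. It assumes raw term counts for TF and log(N / df) for IDF, which is one common variant; the abstract does not state which variant the study used. The function name `tf_idf` and the toy English tokens are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: weight = tf * log(N / df), where tf is the raw
    count of the term in the document, N is the number of documents,
    and df is the number of documents that contain the term.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (hypothetical tokens, already split into terms).
docs = [
    ["cluster", "document", "retrieval"],
    ["cluster", "centroid", "cluster"],
    ["document", "precision"],
]
w = tf_idf(docs)
```

Under this variant, a term that occurs in every document gets weight 0, so it contributes nothing to the document distances on which clustering relies.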
Keywords: Hierarchical K-means-like clustering (HKM), K-means, cluster centroids, initial partition, document distances
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1327445
References:
[1] A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification", in Proc. of the AAAI-98/ICML-98 Workshop on Learning for Text Categorization (AAAI), Madison, 1998, pp. 71-74.
[2] D. Fragoudis, D. Meretakis and S. Likothanassis, Integrating Feature and Instance Selection for Text Classification, 2000, pp. 27-37.
[3] K. Nigam, A. K. McCallum, S. Thrun and T. Mitchell, Text Classification from Labeled and Unlabeled Documents using EM. Kluwer Academic Publishers, Boston, 1999.
[4] K. Thompson and R. Nickolov, "A Clustering-Based Algorithm for Automatic Document Separation", in Proc. of the SIGIR 2002 Workshop on Information Retrieval, 2002, pp. 38-43.
[5] N. Slonim and N. Tishby, "The Power of Word Clusters for Text Classification", in Proc. of the 23rd European Colloquium on Information Retrieval Research, 2001, pp. 1-12.
[6] P. Bellot and M. El-Bèze, "Clustering by means of Unsupervised Decision Trees or Hierarchical and K-means-like Algorithm", in Proc. of RIAO 2000, pp. 344-363.
[7] P. Dai, U. Iurgel and G. Rigoll, "A Novel Feature Combination Approach for Spoken Document Classification with Support Vector Machines", in Proc. Multimedia Information Retrieval Workshop in conjunction, 2003, pp. 1-5.
[8] R. Ghani, "Using error-correcting codes for text classification", in Proc. 17th International Conference on Machine Learning (ICML-00), Stanford, CA, 2000, pp. 303-310.
[9] R. Ramakrishnan and J. Gehrke, Database Management Systems. McGraw-Hill, 2002.
[10] T. Theeramunkong and V. Lertnattee, "Multi-Dimensional Text Classification", in Proc. of the 19th International Conference on Computational Linguistics, Taipei, 2002, pp. 34-38.
[11] Y. Fang, S. Parthasarathy, and F. Schwartz, "Using Clustering to Boost Text Classification", in Proc. of the IEEE International Conference on Data Mining, California, USA, 2001, pp. 123-127.