A Study on Finding Similar Document with Multiple Categories

Authors: R. Saraçoğlu, N. Allahverdi

Abstract:

Searching for similar documents and document management are important topics in text mining. One of the most important parts of similar-document research is the process of classifying or clustering the documents. This study presents a similar-document search approach that addresses the case in which a document belongs to multiple categories (the multiple categories problem). The proposed method, which is based on Fuzzy Similarity Classification (FSC), is compared with the Rocchio algorithm and the naive Bayes method, both of which are widely used in text mining. Empirical results show that the proposed method is quite successful and can be applied effectively. In the second stage, a multiple categories vector method is used, based on information about how frequently categories are seen together. Empirical results show that performance almost doubles when the proposed method is compared with the classical approach.
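
For readers unfamiliar with the Rocchio baseline named in the abstract, the following is a minimal sketch of a Rocchio-style (nearest-centroid) text classifier. It is not the authors' FSC method or their experimental setup; the function names, the raw term-frequency weighting, the cosine similarity, and the toy training data are all assumptions made for illustration. The sketch returns a ranking over categories rather than a single label, since a ranked list is one simple way to see that a document may be close to more than one category.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Raw term-frequency vector for lowercased, whitespace-split text (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled_docs):
    """Average the term vectors of each category to get one centroid per category."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(tf_vector(text))
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def rank_categories(text, centroids):
    """Score a document against every centroid, most similar category first."""
    vec = tf_vector(text)
    scores = {label: cosine(vec, c) for label, c in centroids.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data, not taken from the paper.
training = [
    ("fuzzy clustering of text documents", "text_mining"),
    ("document similarity in text mining", "text_mining"),
    ("neural network training with gradient descent", "machine_learning"),
    ("decision trees and neural network models", "machine_learning"),
]
centroids = train_centroids(training)
ranking = rank_categories("similarity of fuzzy text documents", centroids)
print(ranking)
```

In a multiple-category setting, one could keep every category whose score exceeds a chosen threshold instead of only the top-ranked one; the threshold would be a further assumption and is not specified in the abstract.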

Keywords: Document similarity, Fuzzy classification, Multiple categories, Text mining.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1086813

