A Study on Finding Similar Document with Multiple Categories
Authors: R. Saraçoğlu, N. Allahverdi
Abstract:
Searching for similar documents and document management are important topics in text mining. One of the most important steps in similar-document search is classifying or clustering the documents. This study carries out a similar-document search approach that addresses the case of a document belonging to multiple categories (the multiple categories problem). The proposed method, which is based on Fuzzy Similarity Classification (FSC), is compared with the Rocchio algorithm and the naive Bayes method, both of which are widely used in text mining. Empirical results show that the proposed method is quite successful and can be applied effectively. In the second stage, a multiple categories vector method is used, based on how frequently categories are seen together. Empirical results show that performance almost doubles when the proposed method is compared with the classical approach.
Keywords: Document similarity, Fuzzy classification, Multiple categories, Text mining.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1086813
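
Illustrative sketch: The abstract names the Rocchio algorithm as one of the baselines. As a rough illustration only, and not the authors' implementation or the FSC method, the following shows the simplest positive-centroid form of Rocchio classification (no negative-example term) over term-frequency vectors with cosine similarity. The function names, the toy training set, and the weighting scheme are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector, L2-normalised (assumed weighting)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labelled_docs):
    """Average the document vectors of each category into one prototype."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for text, category in labelled_docs:
        sizes[category] += 1
        for term, weight in tf_vector(text).items():
            sums[category][term] += weight
    return {
        cat: {t: w / sizes[cat] for t, w in vec.items()}
        for cat, vec in sums.items()
    }


def rank_categories(text, centroids):
    """Return (category, similarity) pairs, most similar prototype first."""
    vec = tf_vector(text)
    scores = [(cat, cosine(vec, cen)) for cat, cen in centroids.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, not taken from the paper.
TRAIN = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sport"),
    ("player scores in final match", "sport"),
]
CENTROIDS = train_centroids(TRAIN)
RANKED = rank_categories("football match tickets", CENTROIDS)
```

Because `rank_categories` returns a score for every category rather than a single label, one simple way to let a document belong to several categories would be to keep every category whose score exceeds a chosen threshold. That is a generic device and should not be read as the multiple categories vector method described in the abstract.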