{"title":"A Study on Finding Similar Document with Multiple Categories","authors":"R. Sara\u00e7o\u011flu, N. Allahverdi","volume":80,"journal":"International Journal of Computer and Information Engineering","pagesStart":1131,"pagesEnd":1136,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/16229","abstract":"<p>Searching for similar documents and document management have an important place in text mining. One of the most important parts of similar document research is the process of classifying or clustering the documents. In this study, a similar document search approach has been carried out that includes a discussion of the case of documents belonging to multiple categories (the multiple categories problem). The proposed method, which is based on Fuzzy Similarity Classification (FSC), has been compared with the Rocchio algorithm and the naive Bayes method, which are widely used in text mining. Empirical results show that the proposed method is quite successful and can be applied effectively. For the second stage, a multiple categories vector method based on how frequently categories are seen together has been used. Empirical results show that achievement almost doubles when the proposed method is compared with the classical approach.<\/p>\r\n","references":"<p>[1] S.S. Weng and C.K. Liu, Using text classification and multiple concepts\r\nto answer e-mails, Expert Systems with Applications 26(4), 529-543,\r\n2004.\r\n[2] D. Elworthy, Question answering using a large NLP system, The Ninth\r\nText Retrieval Conference, Gaithersburg, 2000.\r\n[3] C. Apte, P. Damerau and S. Weiss, Text Mining with Decision Rules\r\nand Decision Trees, In Proceedings of the Conference Automated\r\nLearning and Discovery, CMU, 1998.\r\n[4] J.R. 
Quinlan, Induction of Decision Trees, Machine Learning Journal 1,\r\n81-106, 1986.\r\n[5] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, A Bayesian\r\nApproach to Filtering Junk e-mail, AAAI 98, Workshops on Text\r\nCategorization, 1998.\r\n[6] K. Tzeras and S. Hartmann, Automatic Indexing Based on Bayesian\r\nInference Networks, In Proceedings of the 16th Annual ACM\/SIGIR\r\nConference on Research and Development in Information Retrieval, 22-\r\n34, 1993.\r\n[7] E. Wiener, J. Pedersen and A. Weigend, A Neural Network Approach to\r\nTopic Spotting, Fourth Annual Symposium on Document Analysis and\r\nInformation Retrieval, 1995.\r\n[8] G. Guo, H. Wang, D. Bell, Y. Bi and K. Greer, Using kNN model for\r\nautomatic text categorization, Soft Computing 10, 423-430, 2006.\r\n[9] S.S. Weng and Y.J. Lin, A Study On Searching For Similar Documents\r\nBased On Multiple Concepts And Distribution Of Concepts, Expert\r\nSystems with Applications 25(3), 355-368, 2003.\r\n[10] B. Masand, G. Linoff, and D. Waltz, Classifying News Stories Using\r\nMemory Based Reasoning, In Proceedings of the 15th Annual, 1992.\r\n[11] S. Tan, Neighbor-weighted K-nearest neighbor for unbalanced text\r\ncorpus, Expert Systems with Applications, 28, 667-671, 2005.\r\n[12] I.S. Dhillon, J. Fan and Y. Guan, Efficient Clustering of Very Large\r\nDocument Collections, In Data Mining for Scientific and Engineering\r\nApplications, Kluwer Academic Publishers, 357-381, 2001.\r\n[13] S. Dumais, J. Platt, D. Heckerman and M. Sahami, Inductive Learning\r\nAlgorithm and Representations for Text Categorization, In Proceedings\r\nof the 1998 ACM 7th International Conference on Information and\r\nKnowledge Management, 148-155, 1998.\r\n[14] T. Joachims, Text Categorization with Support Vector Machines:\r\nLearning with Many Relevant Features, In Proceedings of the 10th\r\nEuropean Conference on Machine Learning 1, 137-142, 1998.\r\n[15] A. Klose, A. N&uuml;rnberger, R. Kruse, G. Hartmann, and M. 
Richards,\r\nInteractive Text Retrieval Based on Document Similarities, Phys. Chem.\r\nEarth (A), 25(8), 649-654, 2000.\r\n[16] H.C. Yang and C.H. Lee, A text mining approach on automatic\r\ngeneration of web directories and hierarchies, Expert Systems with\r\nApplications, 27, 645-663, 2004.\r\n[17] H.C. Yang and C.H. Lee, A text mining approach on automatic\r\nconstruction of hypertexts, Expert Systems with Applications 29(4), 723-\r\n734, 2005.\r\n[18] D.H. Widyantoro, and J. Yen, A Fuzzy Similarity Approach in Text\r\nClassification Task, IEEE, 2000.\r\n[19] S. Miyamoto, Fuzzy Multisets and Fuzzy Clustering of Documents, In\r\nProc. of the IEEE International Conference on Fuzzy Systems, FUZZ-IEEE,\r\n2001.\r\n[20] G. Salton, and C. Buckley, Term Weighting Approaches in Automatic\r\nText Retrieval, Information Processing and Management, 24(5), 513-\r\n523, 1988.\r\n[21] R. Sara&ccedil;o\u011flu, K. T&uuml;t&uuml;nc&uuml; and N. Allahverdi, A Fuzzy Clustering\r\nApproach for Finding Similar Documents Using a Novel Similarity\r\nMeasure, Expert Systems with Applications, 33(3), 600-605, 2007.\r\n[22] X. Wan, A novel document similarity measure based on earth mover&rsquo;s\r\ndistance, Information Sciences, 177, 3718-3730, 2007.\r\n[23] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim,\r\nXTRACT: Learning Document Type Descriptors from XML Document\r\nCollections, Data Mining and Knowledge Discovery, 7, 23&ndash;56, 2003.\r\n[24] Y. Zhao and G. Karypis, Hierarchical Clustering Algorithms for\r\nDocument Datasets, Data Mining and Knowledge Discovery, 10, 141-\r\n168, 2005.\r\n[25] C.L.A. Clarke, G.V. Cormack, D.I.E. Kisman and T.R. Lynam, Question\r\nanswering by passage selection, The Ninth Text Retrieval Conference,\r\nGaithersburg, 2000.\r\n[26] R. Sara&ccedil;o\u011flu, Searching for Similar Documents Using Fuzzy Clustering,\r\nPhD Thesis, Institute of the Natural and Applied Sciences, Sel&ccedil;uk\r\nUniversity, 2007.\r\n[27] S. Kim, D. Baek, S. 
Kim, H. Rim, Question Answering Considering\r\nSemantic Categories and Co-occurrence Density, The Ninth Text\r\nRetrieval Conference, 2000.\r\n[28] T.S. Morton, Using Coreference in Question Answering, The Eighth\r\nText Retrieval Conference, 1999.\r\n[29] C. Elkan, Deriving TF-IDF as a Fisher Kernel, Proceedings of the\r\nInternational Symposium on String Processing and Information\r\nRetrieval (SPIRE&#39;05), Buenos Aires, Argentina, 296-301, 2005.\r\n[30] A. McCallum, K. Nigam, J. Rennie and K. Seymore, Automating the\r\nConstruction of Internet Portals with Machine Learning, Information\r\nRetrieval Journal, 3, 127-163, 2000.\r\n[31] S. Jones and P. Willett, Readings in information retrieval, Morgan\r\nKaufmann Publishers, 1997.<\/p>\r\n","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 80, 2013"}