{"title":"A Study on Finding Similar Document with Multiple Categories","authors":"R. Sara\u00e7o\u011flu, N. Allahverdi","volume":80,"journal":"International Journal of Computer and Information Engineering","pagesStart":1131,"pagesEnd":1136,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/16229","abstract":"
Searching for similar documents and document
\r\nmanagement are important topics in text mining. One of the
\r\nmost important parts of similar-document research is the
\r\nprocess of classifying or clustering the documents. This study
\r\npresents a similar-document search approach that addresses the
\r\ncase of a document belonging to multiple categories (the multiple
\r\ncategories problem). The proposed method, which is based on Fuzzy
\r\nSimilarity Classification (FSC), has been compared with the Rocchio
\r\nalgorithm and the naive Bayes method, both of which are widely used
\r\nin text mining. Empirical results show that the proposed method is
\r\nquite successful and can be applied effectively. In the second stage,
\r\na multiple categories vector method, based on information about how
\r\nfrequently categories are seen together, has been used.
\r\nEmpirical results show that success is almost doubled
\r\nwhen the proposed method is compared with the classical approach.<\/p>\r\n","references":"
[1] S.S. Weng and C.K. Liu, Using text classification and multiple concepts\r\nto answer e-mails, Expert Systems with Applications 26(4) ,529-543,\r\n2004.\r\n[2] D. Elworthy, Question answering using a large NLP system, The Ninth\r\nText Retrieval Conference, Gaithersburg, 2000.\r\n[3] C. Apte, P. Damerau and S. Weiss, Text Mining with Decision Rules\r\nand Decision Trees, In Proceedings of the Conference Automated\r\nLearning and Discovery, CMU, 1998.\r\n[4] J.R. Quinlan, Induction of Decision Trees, Machine Learning Journal 1\r\n81-108, 1986.\r\n[5] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, A Bayesian\r\nApproach to Filtering Junk e-mail, AAAI 98, Workshops on Text\r\nCategorization, 1998.\r\n[6] K. Tzeras and S. Hartmann, Automatic Indexing Based on Bayesian\r\nInference Networks, In Proceedings of the 16th Annual ACM\/SIGIR\r\nConference on Research and Development in Information Retrieval, 22-\r\n34, 1993.\r\n[7] E. Wiener, J. Pederson and A. Weigend, A Neural Network Approach to\r\nTopic Spotting, Fourth Annual Symposium on Document Analysis and\r\nInformation Retrieval, 1995.\r\n[8] G. Guo, H. Wang, D. Bell, Y. Bi and K. Greer, Using kNN model for\r\nautomatic text categorization, Soft Computing 10,423-430, 2006.\r\n[9] S.S. Weng and Y.J. Lin, A Study On Searching For Similar Documents\r\nBased On Multiple Concepts And Distribution Of Concepts, Expert\r\nSystems with Applications 25(3) 355-368, 2003.\r\n[10] B. Masand, G. Linoff, and D. Waltz, Classifying News Stories Using\r\nMemory Based Reasoning, In Proceedings of the 15th Annual, 1992.\r\n[11] S. Tan, Neighbor-weighted K-nearest neighbor for unbalanced text\r\ncorpus, Expert Systems with Applications, 28, 667-671, 2005.\r\n[12] I.S. Dhillon, J. Fan and Y. Guan, Efficient Clustering of Very Large\r\nDocument Collections, In Data Mining for Scientific and Engineering\r\nApplications, Kluwer Academic Publishers 357-381, 2001.\r\n[13] S. Dumais, J. Platt, D. Heckerman and M. 
Sahami, Inductive Learning\r\nAlgorithm and Representations for Text Categorization, In Proceedings\r\nof the 1998 ACM 7th International Conference on Information and\r\nKnowledge Management 148-155, 1998.\r\n[14] T. Joachims, Text Categorization with Support Vector Machines:\r\nLearning with Many Relevant Features, In Proceedings of the 10th\r\nEuropean Conference on Machine Learning 1, 137-142, 1998.\r\n[15] A. Klose, A. Nürnberger, R. Kruse, G. Hartmann, and M. Richards,\r\nInteractive Text Retrieval Based on Document Similarities, Phys. Chem.\r\nEarth (A), 25(8), 649-654, 2000.\r\n[16] H.C. Yang and C.H. Lee, A text mining approach on automatic\r\ngeneration of web directories and hierarchies, Expert Systems with\r\nApplications, 27, 645-663, 2004.\r\n[17] H.C. Yang and C.H. Lee, A text mining approach on automatic\r\nconstruction of hypertexts, Expert Systems with Applications 29(4), 723-\r\n734, 2005.\r\n[18] D.H. Widyantoro, and J. Yen, A Fuzzy Similarity Approach in Text\r\nClassification Task, IEEE, 2000.\r\n[19] S. Miyamoto, Fuzzy Multisets and Fuzzy Clustering of Documents, In\r\nProc. of the IEEE International Conference on Fuzzy Systems, FUZZIEEE,\r\n2001.\r\n[20] G. Salton, and C. Buckley, Term Weighting Approaches in Automatic\r\nText Retrieval, Information Processing and Management, 24(5), 513-\r\n523, 1988.\r\n[21] R. Saraço\u011flu, K. Tütüncü and N. Allahverdi, A Fuzzy Clustering\r\nApproach for Finding Similar Documents Using a Novel Similarity\r\nMeasure, Expert Systems with Applications, 33(3), 600-605, 2007.\r\n[22] X. Wan, A novel document similarity measure based on earth mover’s\r\ndistance, Information Sciences, 177, 3718-3730, 2007.\r\n[23] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim,\r\nXTRACT: Learning Document Type Descriptors from XML Document\r\nCollections, Data Mining and Knowledge Discovery, 7, 23–56, 2003.\r\n[24] Y. Zhao and G. 
Karypis, Hierarchical Clustering Algorithms for\r\nDocument Datasets, Data Mining and Knowledge Discovery, 10, 141-\r\n168, 2005.\r\n[25] C.L.A. Clarke, G.V. Cormack, D.I.E. Kisman and T.R. Lynam, Question\r\nanswering by passage selection, The Ninth Text Retrieval Conference,\r\nGaithersburg, 2000.\r\n[26] R. Saraço\u011flu, Searching for Similar Documents Using Fuzzy Clustering,\r\nPhD Thesis, Institute of the Natural and Applied Sciences, Selçuk\r\nUniversity, 2007.\r\n[27] S. Kim, D. Baek, S. Kim, H. Rim, Question Answering Considering\r\nSemantic Categories and Co-occurrence Density, The Ninth Text\r\nRetrieval Conference, 2000.\r\n[28] T.S. Morton, Using Coreference in Question Answering, The Eighth\r\nText Retrieval Conference, 1999.\r\n[29] C. Elkan, Deriving TF-IDF as a Fisher Kernel, Proceedings of the\r\nInternational Symposium on String Processing and Information\r\nRetrieval (SPIRE'05), Buenos Aires, Argentina, 296-301, 2005.\r\n[30] A. McCallum, K. Nigam, J. Rennie and K. Seymore, Automating the\r\nConstruction of Internet Portals with Machine Learning, Information\r\nRetrieval Journal, 3, 127-163, 2000.\r\n[31] S. Jones and P. Willett, Readings in information retrieval, Morgan\r\nKaufmann Publisher, 1997.<\/p>\r\n","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 80, 2013"}