A Methodology for Automatic Diversification of Document Categories

Dasom Kim; Chen Liu; Myungsu Lim; Soo-Hyeon Jeon; Byeoung Kug Jeon; Kee-Young Kwahk; Namgyu Kim

Abstract:

Recently, numerous documents containing large volumes of unstructured data and text have been created because of the rapid increase in the use of social media and the Internet. These documents are usually categorized for the convenience of users. However, the accuracy of manual categorization is not guaranteed, and such categorization requires a large amount of time and incurs huge costs. Many studies on automatic categorization have therefore been conducted to mitigate these limitations. Unfortunately, most of these methods cannot categorize complex documents that cover multiple topics, because they assume that each document belongs to a single category only. To overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, the learning process employed in these studies requires training on a multi-categorized document set, so traditional multi-categorization algorithms cannot be applied to most documents unless a multi-categorized training set is provided. To overcome this limitation, in this study we review our novel methodology for extending the category of a single-categorized document to multiple categories, and then introduce a survey-based verification scenario for estimating the accuracy of our automatic categorization methodology.
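
The general idea of extending single-category labels to multiple categories can be illustrated with a minimal sketch. It is not the methodology reviewed in this paper, whose details the abstract does not give. As a stand-in, it uses a simple category-centroid similarity: term counts are summed per category from single-categorized training documents, and a new document receives every category whose cosine similarity is at least a fixed fraction of its best score. The names `build_centroids`, `diversify`, `ratio`, and `SAMPLE_DOCS`, as well as the sample texts, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def build_centroids(labeled_docs):
    """Sum term counts per category from (text, single_category) pairs."""
    centroids = defaultdict(Counter)
    for text, category in labeled_docs:
        centroids[category].update(tokenize(text))
    return centroids


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def diversify(text, centroids, ratio=0.5):
    """Return every category scoring at least `ratio` times the best score.

    The training data is single-categorized, but the output may contain
    several categories. `ratio` is an illustrative threshold, not a value
    taken from the paper.
    """
    doc = Counter(tokenize(text))
    scores = {cat: cosine(doc, vec) for cat, vec in centroids.items()}
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return []
    return sorted(cat for cat, score in scores.items() if score >= ratio * best)


# Hypothetical single-categorized training set.
SAMPLE_DOCS = [
    ("stock market prices fall", "economy"),
    ("bank interest rates rise", "economy"),
    ("team wins football match", "sports"),
    ("player scores goal in match", "sports"),
    ("election vote parliament", "politics"),
]

centroids = build_centroids(SAMPLE_DOCS)
mixed = diversify("football team stock prices rise", centroids)
single = diversify("election vote", centroids)
```

In this toy setting, `mixed` contains both `economy` and `sports`, while `single` contains only `politics`. Lowering `ratio` assigns more categories per document; setting it to 1.0 reduces the sketch to ordinary single-category assignment.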

Keywords: Big Data Analysis, Document Classification, Text Mining, Topic Analysis.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1109351
