A New Approach for Flexible Document Categorization

Authors: Jebari Chaker, Ounelli Habib

Abstract:

In this paper we propose a new approach for flexible document categorization according to document type, or genre, rather than topic. Our approach implements two homogeneous classifiers: a contextual classifier and a logical classifier. The contextual classifier is based on the document URL, whereas the logical classifier uses the logical structure of the document to perform the categorization. The final categorization is obtained by combining the contextual and logical categorizations. In our approach, each document is assigned to all predefined categories with different membership degrees. Our experiments demonstrate that our approach performs better than other genre categorization approaches.
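
The abstract does not state which features each classifier uses or how the two categorizations are combined, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes hypothetical keyword cues for three made-up genres (`faq`, `homepage`, `paper`), scores a document once from its URL and once from its section headings, and combines the two sets of membership degrees with a simple weighted average. All names, cue lists, and the `weight` parameter are assumptions introduced for this example.

```python
from urllib.parse import urlparse

# Hypothetical genre cues; the features actually used in the paper are not
# given in the abstract.
URL_CUES = {
    "faq": ["faq", "help"],
    "homepage": ["home", "~"],
    "paper": ["paper", "pub", "pdf"],
}
STRUCTURE_CUES = {
    "faq": ["question", "answer"],
    "homepage": ["contact", "about"],
    "paper": ["abstract", "references"],
}


def normalize(scores):
    """Turn raw cue counts into membership degrees that sum to 1."""
    total = sum(scores.values())
    if total == 0:
        # No evidence at all: spread membership evenly over every genre.
        return {genre: 1 / len(scores) for genre in scores}
    return {genre: score / total for genre, score in scores.items()}


def contextual_memberships(url):
    """Score each genre from cue substrings found in the URL host and path."""
    parsed = urlparse(url.lower())
    text = parsed.netloc + parsed.path
    return normalize(
        {g: sum(text.count(cue) for cue in cues) for g, cues in URL_CUES.items()}
    )


def logical_memberships(headings):
    """Score each genre from cue substrings found in the section headings."""
    text = " ".join(headings).lower()
    return normalize(
        {g: sum(text.count(cue) for cue in cues) for g, cues in STRUCTURE_CUES.items()}
    )


def combine(contextual, logical, weight=0.5):
    """Weighted average of the two categorizations (an assumed combination rule)."""
    return {
        g: weight * contextual[g] + (1 - weight) * logical[g] for g in contextual
    }


memberships = combine(
    contextual_memberships("http://example.org/~smith/faq.html"),
    logical_memberships(["About", "Contact"]),
)
print(memberships)
```

Every genre receives a membership degree, including those with no supporting evidence, which mirrors the statement that each document is assigned to all predefined categories rather than to a single one.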

Keywords: Categorization, flexible, Genre, combination, URL, logical structure, category

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1334762


World Academy of Science, Engineering and Technology 2 2007