The Usefulness of Logical Structure in Flexible Document Categorization
Authors: Jebari Chaker, Ounalli Habib
Abstract:
This paper presents a new approach to automatic document categorization. By exploiting the logical structure of the document, the approach assigns an HTML document to one or more categories (thesis, paper, call for papers, email, ...). From a set of training documents, it generates a set of rules that are then used to categorize new documents. Flexibility is achieved by associating each rule with a weight that represents its importance in discriminating between the possible categories. This weight is dynamically modified each time a new document is categorized. Experiments with the proposed approach give satisfactory results.
Keywords: categorization rule, document categorization, flexible categorization, logical structure.
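The abstract does not give the rule format or the weight update formula, so the following is only a minimal sketch of how weighted categorization rules over logical-structure elements might behave. Everything in it is an assumption made for illustration: the `Rule` class, the element names, the normalized scoring, the `threshold` and the additive `step` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule: a logical-structure element that suggests a category."""
    element: str         # e.g. "abstract", "references" (illustrative names)
    category: str        # e.g. "paper", "thesis"
    weight: float = 1.0  # importance of the rule, adjusted over time


def categorize(elements, rules, threshold=0.6):
    """Score each category by the weight share of its rules that fire.

    Returns a dict {category: score} for every category whose score
    reaches the (assumed) threshold, so a document may get several.
    """
    fired, total = {}, {}
    for r in rules:
        total[r.category] = total.get(r.category, 0.0) + r.weight
        if r.element in elements:
            fired[r.category] = fired.get(r.category, 0.0) + r.weight
    return {
        c: w / total[c]
        for c, w in fired.items()
        if total[c] > 0 and w / total[c] >= threshold
    }


def update_weights(elements, rules, assigned, step=0.1):
    """Assumed update: reinforce fired rules that agree with the decision,
    weaken fired rules that pointed to a category that was not assigned."""
    for r in rules:
        if r.element in elements:
            if r.category in assigned:
                r.weight += step
            else:
                r.weight = max(0.0, r.weight - step)


rules = [
    Rule("abstract", "paper"),
    Rule("references", "paper"),
    Rule("deadline", "call for papers"),
    Rule("abstract", "thesis"),
    Rule("supervisor", "thesis"),
]
doc_elements = {"abstract", "references"}  # elements found in one HTML document

assigned = categorize(doc_elements, rules)
update_weights(doc_elements, rules, assigned)
print(assigned)
print([(r.element, r.category, round(r.weight, 2)) for r in rules])
```

In this sketch the document is assigned to "paper" only, and the "abstract" rule for "thesis" loses weight because it fired without leading to that category. That is one possible reading of a weight that is modified at each new categorization.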
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1331761