The Influence of Preprocessing Parameters on Text Categorization

Jan Pomikalek; Radim Rehurek

Abstract:

Text categorization (the assignment of natural-language texts to predefined categories) is an important and extensively studied problem in machine learning. Popular techniques for this task currently include many preprocessing and learning algorithms, many of which in turn require tuning nontrivial internal parameters. Although partial studies are available, many authors fail to report the parameter values they use in their experiments, or the reasons why these values were chosen over others. The goal of this work is therefore to provide a more thorough comparison of preprocessing parameters and their mutual influence, and to report interesting observations and results.

Keywords: Text categorization, machine learning, electronic documents, classification.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1332960
