A Content Vector Model for Text Classification

Eric Jiang

Abstract:

Latent Semantic Indexing (LSI), a popular rank-reduced vector space approach, has been used in information retrieval and other applications. This paper presents an LSI-based content vector model for text classification that constructs multiple augmented category LSI spaces and classifies documents by their content. The model integrates class-discriminative information from the training data and is equipped with several feature selection and text classification algorithms. The proposed classifier has been applied to email classification, and experiments on a benchmark spam-testing corpus (PU1) show that the approach is a competitive alternative to email classifiers based on the well-known SVM and naïve Bayes algorithms.

Keywords: Feature Selection, Latent Semantic Indexing, Text Classification, Vector Space Model.
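
The general idea of keeping one LSI space per category can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify how the category spaces are augmented, which feature selection methods are used, or how the final decision is made. The sketch assumes plain term-frequency vectors, a truncated SVD per category, and a decision rule that picks the category whose space captures the largest share of the document. All function names and the toy messages are hypothetical.

```python
import numpy as np

def build_vocab(docs):
    """Map each distinct lowercase token to a row index."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}

def term_vector(doc, vocab):
    """Raw term-frequency vector; tokens outside the vocabulary are ignored."""
    vec = np.zeros(len(vocab))
    for tok in doc.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def fit_category_space(docs, vocab, k):
    """Truncated SVD of one category's term-by-document matrix.

    Returns the k leading left singular vectors, an orthonormal basis
    for that category's rank-k LSI space (an assumed, unaugmented form).
    """
    A = np.column_stack([term_vector(d, vocab) for d in docs])
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k]

def classify(doc, spaces, vocab):
    """Return the label whose LSI space has the largest projection of doc."""
    q = term_vector(doc, vocab)
    norm = np.linalg.norm(q)
    if norm == 0.0:
        return None  # the document shares no terms with the training data
    q = q / norm
    scores = {label: np.linalg.norm(U.T @ q) for label, U in spaces.items()}
    return max(scores, key=scores.get)

# Toy training data (hypothetical, for illustration only).
training = {
    "spam": ["win cash prize now", "cheap prize offer now", "win cheap cash offer"],
    "legit": ["meeting agenda for monday", "project report for review",
              "monday project meeting notes"],
}
vocab = build_vocab([d for docs in training.values() for d in docs])
spaces = {label: fit_category_space(docs, vocab, k=2)
          for label, docs in training.items()}

print(classify("cash prize offer", spaces, vocab))
print(classify("project meeting monday", spaces, vocab))
```

In this sketch the rank `k` plays the role of the LSI dimension; a real classifier would choose it by validation and would typically apply term weighting and feature selection before the decomposition.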

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1078289
