Feature Selection for Web Page Classification Using Swarm Optimization

Authors: B. Leela Devi, A. Sankar

Abstract:

The web’s growing popularity has brought with it a huge amount of information, which makes automated web page classification systems essential for improving search engine performance. Web pages have many features, such as HTML or XML tags, hyperlinks, URLs and text content, that can be considered during automated classification. Hyperlinks are known to enhance web page classification because they reflect the linkages between web pages. The aim of this study is to reduce the number of features used, in order to improve the accuracy of web page classification. In this paper, a novel feature selection method based on an improved Particle Swarm Optimization (PSO) that uses the principle of evolution is proposed. The extracted features were tested on the WebKB dataset using a parallel Neural Network to reduce the computational cost.
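As a rough illustration of how swarm-based feature selection can work, the following minimal sketch implements a plain binary PSO with an inertia weight, in the spirit of the discrete binary PSO of Kennedy and Eberhart. It is not the improved, evolution-based variant proposed in the paper, whose details are not given in the abstract. The function names, the parameter values, the `INFORMATIVE` index set and the toy fitness function are all assumptions made for this example; in practice the fitness would be the accuracy of a classifier trained on the selected TF-IDF features.

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Maximise fitness(mask) over 0/1 feature masks with a basic binary PSO.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Each position is a bit mask: 1 keeps the feature, 0 drops it.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Pull towards the particle's own best and the swarm's best.
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp the velocity
                vel[i][d] = v
                # The sigmoid maps velocity to the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Hypothetical stand-in for classifier accuracy: reward a few "useful"
# feature indices and charge a small penalty per selected feature.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

mask, score = binary_pso(toy_fitness, n_features=10)
print("selected features:", [i for i, bit in enumerate(mask) if bit])
print("fitness:", score)
```

The per-feature penalty in the toy fitness mirrors the stated aim of using fewer features; how the paper itself balances accuracy against feature count is not specified in the abstract.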

Keywords: Web page classification, WebKB Dataset, Term Frequency-Inverse Document Frequency (TF-IDF), Particle Swarm Optimization (PSO).

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1099636

