Auto Classification for Search Intelligence

Authors: Lilac A. E. Al-Safadi


This paper proposes an algorithm for automatically classifying Web pages using data mining techniques. We consider the problem of discovering association rules between terms in a set of Web pages that belong to the same category in a search engine database, and present an auto-classification algorithm for this problem that is fundamentally based on the Apriori algorithm. The proposed technique has two phases. The first is a training phase, in which human experts determine the categories of different Web pages and a supervised data mining algorithm associates each category with appropriately weighted index terms, chosen according to the highest-supported rules among the most frequent words. The second is the categorization phase, in which a Web crawler traverses the World Wide Web to build a database categorized according to the results of the data mining approach. This database contains URLs and their categories.
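The two phases can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the weighting scheme, so this sketch assumes that each frequent term set is weighted by its support, that pages are tokenized on whitespace, and that term sets are limited to two terms. The function names (`frequent_termsets`, `train`, `categorize`) are hypothetical, and the crawler is left out; `categorize` simply takes the text of a page that has already been fetched.

```python
from itertools import combinations


def frequent_termsets(pages, min_support=0.5, max_size=2):
    """Apriori-style search for term sets found in at least
    min_support (a fraction) of the given pages."""
    n = len(pages)
    # Assumption: whitespace tokenization, case-insensitive.
    docs = [set(p.lower().split()) for p in pages]
    # Level 1: candidate sets holding a single term.
    candidates = {frozenset([t]) for d in docs for t in d}
    frequent = {}
    size = 1
    while candidates and size <= max_size:
        # Keep the candidates whose support reaches the threshold.
        level = {}
        for c in candidates:
            support = sum(1 for d in docs if c <= d) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: merge frequent sets into candidates one term larger.
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


def train(labelled_pages, min_support=0.5):
    """Training phase: map each expert-assigned category to its
    frequent term sets, weighted by support (an assumption)."""
    by_category = {}
    for text, category in labelled_pages:
        by_category.setdefault(category, []).append(text)
    return {
        cat: frequent_termsets(texts, min_support)
        for cat, texts in by_category.items()
    }


def categorize(model, text):
    """Categorization phase: return the category whose term sets
    have the largest total weight among those contained in the page."""
    terms = set(text.lower().split())
    scores = {
        cat: sum(w for ts, w in termsets.items() if ts <= terms)
        for cat, termsets in model.items()
    }
    return max(scores, key=scores.get)
```

In use, `train` would receive (page text, category) pairs labelled by the experts, and `categorize` would be called on each crawled page so that its URL can be stored alongside the returned category.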

Keywords: Data Mining, Document Classification, Information Processing on the Web


