Extraction of Data from Web Pages: A Vision Based Approach
Authors: P. S. Hiremath, Siddu P. Algur
Abstract:
With the explosive growth of information sources available on the World Wide Web, it has become increasingly difficult to identify the relevant pieces of information, since web pages are often cluttered with irrelevant content such as advertisements, navigation panels and copyright notices surrounding the main content of the page. Hence, tools for mining data regions, data records and data items need to be developed in order to provide value-added services. Currently available automatic techniques for mining data regions from web pages are still unsatisfactory because of their poor performance and their dependence on specific tags. In this paper, a novel method to extract data items from web pages automatically is proposed. It comprises two steps: (1) identification and extraction of the data regions based on visual clues; (2) identification of data records and extraction of data items from a data region. For step 1, a novel and more effective method is proposed that uses visual clues to find the data regions formed by all types of tags. For step 2, a more effective method, namely Extraction of Data Items from web Pages (EDIP), is adopted to mine data items. The EDIP technique is a list-based approach in which the list is a linear data structure. The proposed technique is able to mine non-contiguous data records and can correctly identify data regions irrespective of the type of tag in which they are bound. Our experimental results show that the proposed technique performs better than the existing techniques.
Keywords: Web data records, web data regions, web mining.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1063312
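The two steps outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify which visual clues are used or how EDIP works internally, so everything below is an assumption made for illustration: the `Block` type and its fields, the "largest rendered area" clue in `find_data_region`, the size-similarity rule and `tolerance` parameter in `extract_items`, and the " | " field separator are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered page element. All fields are hypothetical for this sketch."""
    container: str  # id of the enclosing visual container
    tag: str        # HTML tag name; deliberately ignored by the logic below
    width: int      # rendered width in pixels
    height: int     # rendered height in pixels
    text: str       # visible text, with fields separated by "|"


def find_data_region(blocks):
    """Step 1 (assumed clue): pick the container covering the largest area."""
    area = defaultdict(int)
    for b in blocks:
        area[b.container] += b.width * b.height
    return max(area, key=area.get)


def extract_items(blocks, region, tolerance=0.2):
    """Step 2 (assumed rule): blocks whose size is close to the most common
    block size in the region are treated as data records, even when other
    blocks sit between them. Records are kept in a plain linear list."""
    in_region = [b for b in blocks if b.container == region]
    (ref_w, ref_h), _ = Counter((b.width, b.height) for b in in_region).most_common(1)[0]
    records = []
    for b in in_region:
        similar_width = abs(b.width - ref_w) <= tolerance * ref_w
        similar_height = abs(b.height - ref_h) <= tolerance * ref_h
        if similar_width and similar_height:
            records.append([item.strip() for item in b.text.split("|")])
    return records


# A toy page: records use different tags and are interrupted by a sponsored link.
page = [
    Block("nav", "ul", 200, 30, "Home | About"),
    Block("main", "div", 600, 120, "Camera A | $199"),
    Block("main", "p", 600, 20, "Sponsored link"),
    Block("main", "div", 600, 120, "Camera B | $249"),
    Block("main", "table", 600, 118, "Camera C | $299"),
    Block("footer", "div", 800, 40, "Copyright notice"),
]

region = find_data_region(page)
records = extract_items(page, region)
print(region)
print(records)
```

Because the sketch never inspects `tag`, the record rendered as a `table` is picked up alongside the `div` records, and the sponsored link between them is skipped on size alone. This mirrors, in a very simplified way, the tag independence and handling of non-contiguous records that the abstract claims.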