Information Extraction from Unstructured and Ungrammatical Data Sources for Semantic Annotation

Authors: Quratulain N. Rajput, Sajjad Haider, Nasir Touheed


The internet has become an attractive avenue for global e-business, e-learning, knowledge sharing, and similar activities. Because the volume of web content keeps growing, it is not practical for a user to extract information by browsing and manually integrating data from the large number of web sources that existing search engines retrieve. Semantic web technology advances information extraction by providing a suite of tools for integrating data from different sources. To take full advantage of the semantic web, existing web pages must be annotated so that they become semantic web pages. This research develops a tool named OWIE (Ontology-based Web Information Extraction) for semantic web annotation using domain-specific ontologies. The tool automatically extracts information from HTML pages with the help of predefined ontologies and gives it a semantic representation. Two case studies were conducted to analyze the accuracy of OWIE.

Keywords: Ontology, Semantic Annotation, Wrapper, Information Extraction.
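
The general idea of ontology-driven extraction described in the abstract can be illustrated with a minimal sketch. This is not OWIE's implementation, which the abstract does not detail. It assumes a toy "ontology" expressed as a Python dictionary that maps each property of a hypothetical car-advertisement concept to a regular expression, and it emits plain (subject, property, value) triples in place of a real semantic web serialization. The names `CAR_ONTOLOGY`, `TextCollector`, and `extract_triples` are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical toy "ontology": each property of a car-advertisement concept
# is paired with a regular expression that recognizes its values in text.
CAR_ONTOLOGY = {
    "make": re.compile(r"\b(Honda|Toyota|Suzuki)\b", re.IGNORECASE),
    "year": re.compile(r"\b(19|20)\d{2}\b"),
    "price": re.compile(r"\$\s?\d[\d,]*"),
}


class TextCollector(HTMLParser):
    """Collects the visible text of an HTML page and discards the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_triples(html, subject, ontology):
    """Return (subject, property, value) triples found in the page text."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    triples = []
    for prop, pattern in ontology.items():
        match = pattern.search(text)  # keep only the first match per property
        if match:
            triples.append((subject, prop, match.group(0)))
    return triples


# An ungrammatical, loosely structured advertisement, as an assumed example.
page = "<html><body><p>honda civic 2005 good cond <b>$4,500</b> call now</p></body></html>"
triples = extract_triples(page, "ad1", CAR_ONTOLOGY)
for triple in triples:
    print(triple)
```

In this sketch the ontology does the work that grammar or page layout would do in other extraction approaches: the text needs no sentence structure, only values that the property patterns can recognize. A real system would use an ontology language such as OWL and handle multiple records and ambiguous matches, which this example does not attempt.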


