A Semantic Based Approach for Information Retrieval from Html Documents Using Wrapper Induction Technique

Authors

A. M. Abirami (1), A. Askarunisa (2), T. M. Aishwarya (1) and K. S. Eswari (1); (1) Thiagarajar College of Engineering, India; (2) Vickram College of Engineering, India

Abstract

Most internet applications are built using web technologies such as HTML. Web pages are typically designed either to display data records from an underlying database or to display unstructured text within a fixed template. Summarizing data that is dispersed across many such pages by hand is tedious and time-consuming. Wrapper induction, a supervised learning technique, can be used to learn data extraction rules from web pages; applying the learnt rules to other pages makes information extraction much easier. This paper focuses on developing a tool for extracting information from unstructured data. The use of semantic web technologies greatly simplifies the process. The tool makes it possible to query data scattered over multiple web pages in different ways. This is accomplished in the following steps: extracting the data from multiple web pages, storing them as RDF triples, integrating multiple RDF files using an ontology, generating a SPARQL query from the user query, and generating a report in the form of tables or charts from the results of the SPARQL query. The relationships between related web pages are identified using the ontology and used to improve querying, thus enhancing search efficacy.
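The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' tool. It assumes a simple delimiter-based form of wrapper induction (learning the text that consistently precedes and follows a labelled value), and it uses plain Python tuples and a pattern-matching helper as stand-ins for RDF triples and SPARQL. The sample pages, the `department` predicate, and all function names are hypothetical.

```python
import os

def common_suffix(strings):
    # Longest string that every input ends with.
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]

def learn_rule(examples):
    # examples: list of (page, labelled_value) pairs from template-based pages.
    # Returns (left, right) delimiters shared by all labelled examples.
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), os.path.commonprefix(rights)

def extract(page, rule):
    # Apply a learnt (left, right) rule to a page and return the enclosed text.
    left, right = rule
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

# Hypothetical pages that share one fixed template.
page_a = "<html><body><h1>Alice</h1><p>Dept: <b>CSE</b></p></body></html>"
page_b = "<html><body><h1>Bob</h1><p>Dept: <b>ECE</b></p></body></html>"
page_c = "<html><body><h1>Carol</h1><p>Dept: <b>EEE</b></p></body></html>"

# Supervised step: learn rules from two labelled pages.
name_rule = learn_rule([(page_a, "Alice"), (page_b, "Bob")])
dept_rule = learn_rule([(page_a, "CSE"), (page_b, "ECE")])

# Extraction step: apply the rules to every page, including the unseen page_c,
# and store the results as (subject, predicate, object) triples.
triples = []
for page in (page_a, page_b, page_c):
    subject = extract(page, name_rule)
    triples.append((subject, "department", extract(page, dept_rule)))

def match(triples, s=None, p=None, o=None):
    # Triple-pattern query; None acts as a wildcard (a stand-in for SPARQL).
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Query step: who belongs to the CSE department?
cse_staff = [s for s, _, _ in match(triples, p="department", o="CSE")]
```

In a fuller implementation the tuples would be serialized as RDF and queried with SPARQL through an RDF library, and an ontology would link predicates that come from different page templates; those parts are omitted here.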

Keywords

Information Retrieval, Ontology, Structured Information Extraction, RDF, SPARQL, Semantic Web.

Full Text  Volume 3, Number 6