Research on Web Information Extraction Based on XML

Yan HU, Yanyan XUAN
Dept. of Computer Science & Technology, Wuhan University of Technology
Wuhan 430070, China
Email: [email protected], [email protected]

Abstract

The standard XML technology is used for Web information extraction in this paper, and a generic XML-based Web information extraction solution is proposed. In the extraction process, two key technologies are proposed and implemented to simplify the information extraction work: an XML-based Web data conversion technology and a DOM-based XPath generation technology. XSLT is used as the description language for the extraction rules, which is conducive to the unity of extraction patterns.

1. Introduction

With the explosive growth of Web data, it is becoming more and more difficult for users to obtain useful information from the Web. How to find accurate information quickly and efficiently on the Web has become an urgent issue, and Web information extraction technology [1] has emerged to address it. A program that extracts information from the Web is called a wrapper [2]. Various approaches to constructing wrappers have been proposed, but each has its limitations in application. With the continuous development of XML technology, XML plays an increasingly important role in Web information extraction. Based on a study of existing information extraction technology, standard XML is used for Web information extraction in this paper.

2. Web information extraction principles and system design idea

The definition of Web information extraction is as follows: for a given group of Web pages S, define a mapping W that maps the objects in S to a data structure D which is more structured and has clearer semantics. The mapping W works equally well on any set of Web pages S' that is similar to S in semantics and structure. Thus, the key to information extraction is defining the mapping W, that is, the extraction rules.
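In symbols (our restatement, not notation from the paper):

\[ W : S \to D , \]

where the same W is expected to remain valid on any page set S' that resembles S in structure and semantics.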

The overall research on Web information extraction includes five stages: page acquisition, page optimization, rule learning, information extraction, and data storage. This article focuses on the core stages: page optimization and rule learning. The design idea of our system is as follows:

(1) Obtain a large number of HTML pages as sample pages for learning through the data acquisition system.

(2) Use an HTML parser to parse each HTML document into an HTML DOM tree, traverse this tree, and clean the page. Construct an XHTML document according to XML grammar. Use an XML parser to parse the XHTML document into an XML DOM tree.

(3) Use JTree to load this XML DOM tree, and apply a DOM-based XPath generation algorithm. With this algorithm integrated into the system, the XPath expressions of the information nodes are obtained when users mark the information points of interest. Compile XSLT template rules from these XPath expressions; the XSLT template rules are exactly the extraction rules. The extraction rules can then be optimized with the data location optimization method.

(4) Store the extraction rules obtained from rule learning in the extraction rule warehouse. Use the XSLT template rules to extract the information of interest from the XHTML documents by running an XSLT processor engine; the extraction results are expressed in XML, which is well structured and extensible.

3. Implementation of Web information extraction based on XML

3.1. Page acquisition

A large number of HTML samples can be acquired for Web information extraction research by integrating the Google Web API services into the system [3]. When acquiring Web pages, the search keyword of the Google Web API is set to "artificial intelligence", the number of downloaded pages to 100, and the saving path to "E:\Extractor\HTMLData", yielding 100 relevant sample pages in the field of artificial intelligence.
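The paper does not list its acquisition code; as a rough illustration of the download step only (the Google Web API wrapper of [3] is not reproduced here, and the URL is assumed to come from the search results), a minimal sketch might look like this:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class PageDownloader {
    /** Downloads one result page into the sample directory from Section 3.1. */
    public static void download(String pageUrl, String fileName) throws Exception {
        try (InputStream in = new URL(pageUrl).openStream();
             FileOutputStream out =
                     new FileOutputStream("E:\\Extractor\\HTMLData\\" + fileName)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // copy the raw HTML bytes to disk
            }
        }
    }
}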

3.2. Page optimization

The page optimization stage includes two steps: page cleaning and page parsing. The page optimization flow is shown in Figure 1.

Figure 1. Page optimization flow

(1) Page cleaning. The main tasks of page cleaning are to repair illegal characters and nonstandard or wrongly nested tags, remove non-theme elements from HTML pages, and convert HTML into XHTML. In this process, we propose an XML-based Web data conversion method, whose conversion flow is shown in Figure 2. NekoHTML [4] is used as the HTML parser to parse the HTML document into an HTML DOM tree.

Figure 2. Data conversion from HTML to XHTML
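The conversion code is not given in the paper; below is a minimal sketch of this step, assuming NekoHTML's DOM parser and a JAXP identity transform for the XHTML serialization (class and file names are ours):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;

public class PageCleaner {
    /** Parses messy HTML with NekoHTML and writes it back out as well-formed XML. */
    public static void htmlToXhtml(String htmlFile, String xhtmlFile) throws Exception {
        // NekoHTML repairs unbalanced and wrongly nested tags while parsing.
        DOMParser parser = new DOMParser();
        parser.parse(htmlFile);
        Document dom = parser.getDocument();

        // An identity transform serializes the repaired DOM as well-formed XML.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.transform(new DOMSource(dom), new StreamResult(new File(xhtmlFile)));
    }
}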

(2) Page parsing. Page parsing parses the XHTML document obtained from page cleaning into an XML DOM tree using an XML parser, so that extraction rules can be learned on this tree. We use Xerces-J [5] as the XML parser in this process.
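As a brief sketch, this step can go through the standard JAXP interface, which Xerces-J implements (the file name is illustrative):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class PageParser {
    /** Parses a cleaned XHTML file into the XML DOM tree used for rule learning. */
    public static Document parseXhtml(String xhtmlFile) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(xhtmlFile); // the XML DOM tree used in Section 3.3
    }
}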

3.3. Extraction rule learning

In this paper, XSLT is taken as the description language of the extraction rules; its key component is XPath, which can locate information in XML documents. In the rule learning stage, we use JTree to display the XML DOM tree produced by page optimization, obtain XPath expressions through the XPath generation algorithm proposed in this paper when users mark information of interest, and combine XSLT technology to prepare the extraction rules.

(1) Obtaining XPath expressions. It is difficult to locate information points and write XPath expressions by hand in HTML documents, so we propose an XPath generation method that derives the XPath expressions of the information points to be extracted. The DOM-based XPath generation method mainly includes the following two steps:

① Traverse the XML DOM tree recursively, convert all DOM Node objects into TreeNode objects, and construct a JTree to display the XML DOM structure.

② Use the TreeNode objects to construct XPath expressions. The XPath expression is obtained automatically when users mark information of interest.

Here, marking a single-block information point on the JTree yields the following XPath expression: /html[1]/body[1]/div[3]/table[4]/tr[1]/td[3]/table[1]/tr[2]/td[1]/table[1]/tr[1]/td[1]/div[1]/p[4]/text().
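A minimal sketch of step ② under our reading of it (the authors' exact code is not given): walk from the marked node up to the document root, recording each element's name and 1-based index among same-named siblings.

import org.w3c.dom.Node;

public class XPathBuilder {
    /** Returns an absolute XPath such as /html[1]/body[1]/div[3]/.../text(). */
    public static String buildXPath(Node node) {
        StringBuilder path = new StringBuilder();
        for (Node n = node; n != null && n.getNodeType() != Node.DOCUMENT_NODE;
             n = n.getParentNode()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                path.insert(0, "/text()"); // leaf text node, as in the example above
                continue;
            }
            // Count preceding siblings with the same element name to get the index.
            int index = 1;
            for (Node sib = n.getPreviousSibling(); sib != null;
                 sib = sib.getPreviousSibling()) {
                if (sib.getNodeType() == Node.ELEMENT_NODE
                        && sib.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }
}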

(2) Creating extraction rules. XSLT is a language for converting one class of XML documents into another class of XML documents; it describes the conversion rules over a tree structure. In XSLT, a conversion is called a stylesheet, which defines a set of conversion rules; each rule in a stylesheet corresponds, through an XPath pattern, to the operation on the matching nodes of the source XML document, and is known as a template rule. From the perspective of information extraction, XSLT template rules are exactly the extraction rules.

This paper studies the rules for single-block information as well as multi-block information. When extracting single-block information, the rules are generated automatically. When extracting multi-block information, the XPath expressions of all the nodes can be created through the platform and XSLT templates can be generated; the multi-block rule is then prepared by merging all the single-block rules. A multi-block information rule is shown in Figure 3.


Figure 3. Multi-block information rule:

<?xml version="1.0" encoding="gb2312"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output encoding="gb2312" indent="yes"/>
  <xsl:template match="/html[1]/body[1]/div[3]/table[4]/tr[1]/td[3]/table[1]/tr[2]/td[1]/table[1]/tr[1]/td[1]/div[1]">
    <content>
      <inf-block1><xsl:value-of select="h3[1]/text()"/></inf-block1>
      <inf-block2><xsl:value-of select="p[1]/text()"/></inf-block2>
      <inf-block3><xsl:value-of select="p[4]/text()"/></inf-block3>
    </content>
  </xsl:template>
</xsl:stylesheet>

3.4. Extraction rule optimization

XPath is the main component of XSLT, which is used as the extraction rule language in this paper. Thus, optimizing extraction rules amounts to optimizing the XPath expressions, for which there are three common positioning methods:

(1) Path-based location method. This is the most intuitive method. It builds on the DOM tree structure of Web pages and uses node names, node indices, and path structure as the restrictive conditions, so it is affected by the tree structure of pages. It is suitable for XML documents whose data model changes little. Because HTML tags carry no semantics, any change in page structure will affect information positioning; nevertheless, it is the most accurate and most common positioning method.

(2) Text-based location method. This method uses specific content as the constraint for locating information, so it is not affected by structural changes of pages. It uses the axes, predicates, and functions of XPath syntax to express the path to the information; for example, //p[contains(text(), 'Abstract')] locates a paragraph by a keyword rather than by its position. However, it is often difficult to find reliable keywords to depend on.

(3) Attribute-based location method. The attributes of page elements are used as restrictive conditions in this method, based on suitable properties such as display properties or link attributes. With display properties, good coverage can be achieved if the page presentation is relatively fixed, but such stability is hard to guarantee in general, so this method is the least reliable of the three. For example, a background-color constraint can be used in an XPath expression: html/body[@bgcolor="#ffffff"].

In practical applications, different location methods are usually combined according to the occasion to obtain a better positioning solution. The well-known Anchor-Hop model proposed in [6], shown in Figure 4, combines the path-based and text-based positioning ideas. The Anchor is a reference node, usually the last common ancestor node that contains all the information to be collected; the pieces of information to be collected are the Hops.

Figure 4. Anchor-Hop model

In this paper we optimize multi-block information rules by combining them with the Anchor-Hop model. As shown in Figure 5, there are three pieces of information in the <TABLE> node, which are the Hops. Because the text "What is artificial intelligence?" appears before the Hops, it can be used as the Anchor. The above path can then be amended as follows: //text()[contains(normalize-space(.), 'What is artificial intelligence?')].

Figure 5. Multi-block information file

Reconstructing the multi-block information rule of Figure 3 with the optimized path expression yields the optimized rule shown in Figure 6.

3.5. Information extraction

In accordance with the extraction rules, the XSLT processor is run on the XHTML documents from page cleaning to produce the result documents in XML format. We use Xalan-J as the XSLT processor engine.
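A minimal sketch of this step through the standard JAXP transformation API, which Xalan-J implements (file names are illustrative):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Extractor {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // The extraction rule (XSLT template rules) learned in Section 3.3.
        Transformer transformer =
                factory.newTransformer(new StreamSource("rule.xsl"));
        // Apply the rule to a cleaned XHTML page; the result is structured XML.
        transformer.transform(new StreamSource("page.xhtml"),
                              new StreamResult("result.xml"));
    }
}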




Figure 6. Optimized multi-block information rule:

<?xml version="1.0" encoding="gb2312"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output encoding="gb2312" indent="yes"/>
  <xsl:template match="//TABLE[starts-with(normalize-space(.), 'What is artificial intelligence?')]">
    <content>
      <inf-block1><xsl:value-of select="table[1]/tr[1]/td[1]/div[1]/h3[1]/text()"/></inf-block1>
      <inf-block2><xsl:value-of select="table[1]/tr[1]/td[1]/div[1]/p[3]/text()"/></inf-block2>
      <inf-block3><xsl:value-of select="table[1]/tr[1]/td[1]/div[1]/p[4]/text()"/></inf-block3>
    </content>
  </xsl:template>
</xsl:stylesheet>

4. Information extraction system evaluation

Two key indicators from the MUC conferences are used to assess information extraction systems: precision (P) and recall (R). They are defined as follows [7]:

\[ P = \frac{\#\,\text{correct answers (CA)}}{\#\,\text{produced answers (AP)}} \qquad R = \frac{\#\,\text{correct answers (CA)}}{\#\,\text{total possible corrects (TPC)}} \]

Usually the F-measure, which combines precision and recall into a single measurement, is used to evaluate the overall performance of a system:

\[ F = \frac{(F_1^{2} + 1) \cdot P \cdot R}{F_1^{2} \cdot P + R} \]

Here F1 is the relative weight of recall with respect to precision: with F1 equal to 1, precision and recall are weighted equally and F reduces to the harmonic mean 2PR/(P+R); with F1 greater than 1, recall is weighted more heavily; with F1 less than 1, precision is weighted more heavily. Usually F1 is set to 1.

We carried out an experiment using the 100 acquired HTML pages related to artificial intelligence. The test data and test results are shown in Table 1.

Table 1. Test data and results for the system

Type           TPC    AP     CA     P       R       F
Single-block   60     57     49     86%     81.7%   83.8%
Multi-block    170    158    131    82.9%   77.1%   79.9%
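As a quick check of the single-block row against the definitions above (our arithmetic, not from the paper):

\[ P = \frac{49}{57} \approx 86\%, \qquad R = \frac{49}{60} \approx 81.7\%, \qquad F = \frac{2PR}{P+R} \approx 83.8\% . \]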

Evidently, the system achieves better results on single-block information. For multi-block information, users can use the Web information extraction platform to help generate the template rules; however, multi-block rules are complex, and some manual participation is needed when merging them, which affects the extraction results. Overall, the Web information extraction system achieves good results, with high precision and recall.

5. Conclusions

Based on the study of existing Web information extraction technology, a general Web information extraction solution is proposed in this paper. Finally, we experiment on 100 HTML pages; the results show that the system can complete the information extraction task with a high rate of precision and recall. The system is a generic Web information extraction system, and for Web pages from different domains, corresponding wrappers can be built quickly with it.

References

[1] A. H. F. Laender, B. A. Ribeiro-Neto, A. S. da Silva, et al., "A Brief Survey of Web Data Extraction Tools", SIGMOD Record, Vol. 31, No. 2, 2002, pp. 84-93.

[2] Ion Muslea, Steven Minton, Craig A. Knoblock, "A hierarchical approach to wrapper induction", in Proceedings of the Third International Conference on Autonomous Agents, Seattle, WA, 1999, pp. 190-197.

[3] Yan Hu, Huzi Wu, "Research of Web Pages Acquisition Technology Based on Google Web API", Fujian Computer, 2007, pp. 114-115.

[4] CyberNeko HTML Parser, http://people.apache.org/~andyc/neko/doc/html/.

[5] Xerces-J, http://xerces.apache.org/xerces-j/index.html.

[6] Jussi Myllymaki, Jared Jackson, "Robust Web Data Extraction with XML Path Expressions", IBM Research Report, 2003.

[7] Line Eikvil, "Information Extraction from World Wide Web: A Survey", Technical Report 945, Norwegian Computing Center, 1999.

