[IEEE 2011 International Conference on Control, Automation and Systems Engineering (CASE) -...



Page 1: [IEEE 2011 International Conference on Control, Automation and Systems Engineering (CASE) - Singapore, Singapore (2011.07.30-2011.07.31)] 2011 International Conference on Control,

Ontology-based information extraction system in E-commerce websites

Yang Xiudan College of Management

Hebei University Baoding, China

[email protected]

Zhu Yuanyuan College of Management

Hebei University Baoding, China

Abstract—Information extraction helps users retrieve the information they need from the vast amount of information available. A variety of information extraction technologies now exist, and users typically rely on search engines to find information on the Internet, yet it remains difficult to find the “real” information they want. In this paper, we use the concept of ontology to analyze the structure and content of e-commerce websites and build an ontology model, so that information can be extracted from those websites on the basis of the ontology. Finally, the paper runs an extraction experiment on websites with the text engineering tool GATE and evaluates the results objectively.

Keywords- information extraction, ontology, e-commerce

I. INTRODUCTION

In the last century, the rapid development of the Internet caused the amount of available information to grow quickly, to the point of overload, so users need a clear way to choose among it. Selecting from this vast ocean of information has therefore become a central concern. Search engines such as Google and Baidu have emerged, but they only return relevant web links for users to search further; they do not provide the needed information directly.

The concept of information extraction is interpreted differently in the domestic and international research communities. Eikvil regards information extraction as the task of locating specific information in natural language text, a specialized and useful branch of natural language processing [1]. Banko et al. observe that traditional information extraction requires a great deal of human intervention, in the form of manually written rules or hand-labeled training examples [2]. In China, some scholars hold that information extraction means extracting information of specified classes from a piece of text and letting users query the resulting structured data [3]. Others describe its main function as extracting specific information from text, usually in the form of a structured description that can be stored directly in a database for users to query and analyze further [4]. It has also been suggested that information extraction is the automatic extraction of specific types of text-related information from text [5].

This paper takes information extraction to mean automatically extracting structured data from natural language text or semi-structured text in accordance with specific rules.

Current information extraction technology includes dictionary-based extraction, rule-based extraction, and extraction based on hidden Markov models. Ontology-based extraction is, in general, still a rule-based approach; in this paper we use ontology technology to build the wrapper and then extract information from the e-commerce site. Ontology was originally a philosophical term, defined as “the description of the objective existence of the world, namely, the existence”. It has since been applied in different fields, especially computer science and artificial intelligence. In computing, an ontology is an explicit, formal specification of a shared conceptual model. It provides the basic terms of a vocabulary and the relationships among them in order to capture the relevant domain knowledge, offers a common understanding of the field, identifies the commonly agreed vocabulary, and gives these terms clear formal definitions. Because an ontology is structured and e-commerce websites follow a similarly regular format, we first build an e-commerce ontology as part of the information extraction. Ontology-based information extraction relies mainly on the description of the data to perform the extraction and therefore depends less on the page structure.

In this paper, we propose applying a domain ontology to e-commerce information extraction: the system uses a semantic extraction algorithm, generates extraction rules from an OWL ontology, and constructs wrappers for web information extraction.


II. E-COMMERCE PRODUCTS ONTOLOGY

A. E-commerce websites features

E-commerce websites generally have a definite structure and follow similar formats, which is an advantage when extracting structured information. A search results page contains a wealth of data, but only the properties of interest need to be extracted.

978-1-4577-0860-2/11/$26.00 ©2011 IEEE


Figure 1. Part of the search results page for Movie DVDs on Taobao

Figure 1 shows part of a search results page from the Taobao website, obtained with “Movie DVDs” as the search keyword. The page shows that the products on the website are classified. The information collected at present generally takes the form of structured data. Such classifications can in fact be regarded as ontologies, so-called lightweight ontologies, which use natural language to describe simple classification relationships between products but do not yet establish semantic connections. The ontology built here takes into account the features that e-commerce websites have in common, including product entities, concepts, attributes, and rules.

The features of current e-commerce websites need further analysis. First, e-commerce websites give only a simple classification of a product. As Figure 1 shows, Taobao classifies the properties of a movie into film medium, film type, subtitling/voice, packaging, and release date. This reflects the website's own view of which properties a movie DVD should have. Other e-commerce websites may not agree with this classification; each has its own classification system and methods, which makes the websites incompatible with one another and causes unnecessary trouble for users when they search. Second, under film medium, Taobao offers five options: DVD5, Blu-ray, VCD, EVD, and other DVD. When users click one of these to query, the results are not necessarily film products; they may equally be other items of the same medium type, such as educational videos, because no links are established between the entities in each category. The website's categories are just a simple classification structure without deeper relationships at the semantic level.

Therefore, in our study, we use the ontology to establish relationships between entities and concepts and to form a semantic network, so that users get accurate and comprehensive results when they search.

B. The simple e-commerce products ontology

The composition of an ontology can be expressed by the following formula.

Ontology = Classes + Relations + Axioms + Instances + Functions.

Semantically, an instance is an object, a concept represents a collection of objects, and a relation represents a collection of sets of related objects. A concept is commonly defined with a frame structure, which includes the name of the concept, the set of its relationships with other concepts, and a natural language description. There are four basic relations: part-of, kind-of, instance-of, and attribute-of.
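The composition formula and the basic relations can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the `Ontology` class, its method names, and the instance `actress_001` are hypothetical and are not taken from the paper, and axioms and functions are left out for brevity.

```python
from dataclasses import dataclass, field

# The four basic relations named in the text.
BASIC_RELATIONS = ("part-of", "kind-of", "instance-of", "attribute-of")


@dataclass
class Ontology:
    """Hypothetical container mirroring Classes + Relations + Instances.

    Axioms and functions are omitted to keep the sketch short.
    """
    classes: set = field(default_factory=set)
    # Each relation is stored as a (subject, relation, object) triple.
    relations: set = field(default_factory=set)
    # Maps an instance name to the name of its class.
    instances: dict = field(default_factory=dict)

    def add_class(self, name, parent=None):
        """Register a class and, optionally, a kind-of link to its parent."""
        self.classes.add(name)
        if parent is not None:
            self.classes.add(parent)
            self.relations.add((name, "kind-of", parent))

    def add_instance(self, name, cls):
        """Register an instance of an already known class."""
        if cls not in self.classes:
            raise ValueError(f"unknown class: {cls}")
        self.instances[name] = cls
        self.relations.add((name, "instance-of", cls))

    def related(self, subject, relation):
        """Return all objects linked to `subject` by `relation`, sorted."""
        return sorted(o for s, r, o in self.relations
                      if s == subject and r == relation)


onto = Ontology()
onto.add_class("Performer")
onto.add_class("Actress", parent="Performer")
onto.add_instance("actress_001", "Actress")  # placeholder instance name
```

Storing every relation as a triple keeps the four basic relations and any additional, domain-specific relations in one uniform structure.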

In practical applications, it is not necessary to construct the ontology strictly from the elements listed above, and the relationships between concepts are not limited to the four basic relations. Here, “property” and “relationship”, and likewise “value” and “instance”, are used with essentially the same meaning. We define relationships according to the specific circumstances as needed.

During information extraction, the information we need may come from different sites with different representations, so a strong and well-organized knowledge base is needed. The domain ontology should therefore capture as much knowledge of the field as possible, so that both the position of the required data and its semantics can be determined during extraction. The paper analyzes e-commerce websites and builds an ontology for e-commerce video products. Because the e-commerce domain is large, the paper analyzes it only at a surface level, as shown in Figure 2.

Figure 2. Movie (E-commerce) ontology

The film video products ontology consists of the name of the film (Movie), location (Place), price (Price), film characters (Movie-person), freight (Shipping), service (saleservice), and evaluation (comments). The film characters class has two subclasses, performers (Performer) and producers (Creator). Producers comprise the producer, the writer, and the director, according to the general needs of films. Performers are divided into two categories, actor and actress. Instances of film characters (Movieperson) form the range of a movie's (Movie) cast. Each film can therefore be linked to its participating staff through the property relations (with_director, with_producer, with_writer). For example, the ontology can express a relationship such as: Feng Xiaogang (Director) directed (with_director) “The Phone” (Movie). This is an ontology relationship.

C. The extraction rules

Generally, before the extraction rules are described, the HTML should be modeled as a labeled tree structure to facilitate rule definition and data extraction. This paper uses the domain ontology instead of the website's tree structure. Because the ontology essentially captures the structure of the e-commerce site, the extraction rules can be generated from the ontology we have built. During information extraction, the wrapper uses the ontology of the web pages together with the extraction rules.

The paper begins extraction from the instances, so the wrapper starts from the child nodes. Given the ontology and the rules, the path of any node can be extracted. The rules are landmark-based: each extraction rule has a start rule and an end rule.

Here is part of the OWL ontology description.

<owl:Ontology rdf:about=""/>
<owl:Class rdf:ID="actor">
  <rdfs:subClassOf rdf:resource="#Performer"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasCast"/>
      <owl:allValuesFrom rdf:resource="#Movie"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
<owl:Class rdf:ID="actress">
  <rdfs:subClassOf rdf:resource="#Performer"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasCast"/>
      <owl:allValuesFrom rdf:resource="#Movie"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
<owl:DatatypeProperty rdf:ID="Chinesetitle">
  <rdfs:domain rdf:resource="#Movie"/>
  <rdfs:comment rdf:datatype="&xsd;string">chinesetitle</rdfs:comment>
</owl:DatatypeProperty>
<owl:Class rdf:ID="Comments"/>
<owl:Class rdf:ID="Creator">
  <rdfs:subClassOf rdf:resource="#MoviePerson"/>
</owl:Class>
<owl:ObjectProperty rdf:ID="direct">
  <rdfs:domain rdf:resource="#Director"/>
  <rdfs:range rdf:resource="#Movie"/>
</owl:ObjectProperty>
<owl:Class rdf:ID="Director">
  <rdfs:subClassOf rdf:resource="#Creator"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#has_direct"/>
      <owl:allValuesFrom rdf:resource="#Movie"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
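To illustrate how class and property information might be read out of such an OWL description before rules are generated, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the paper's parser. The `rdf:RDF` wrapper with its namespace declarations is an assumption added so that a shortened copy of the excerpt parses on its own, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF = "{%s}" % NS["rdf"]  # prefix for namespaced attributes such as rdf:ID

# Assumed wrapper: a shortened excerpt placed inside an rdf:RDF root.
OWL_DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="actor">
    <rdfs:subClassOf rdf:resource="#Performer"/>
  </owl:Class>
  <owl:Class rdf:ID="Creator">
    <rdfs:subClassOf rdf:resource="#MoviePerson"/>
  </owl:Class>
  <owl:Class rdf:ID="Director">
    <rdfs:subClassOf rdf:resource="#Creator"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#has_direct"/>
        <owl:allValuesFrom rdf:resource="#Movie"/>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
  <owl:ObjectProperty rdf:ID="direct">
    <rdfs:domain rdf:resource="#Director"/>
    <rdfs:range rdf:resource="#Movie"/>
  </owl:ObjectProperty>
</rdf:RDF>"""


def named_superclasses(root):
    """Map each declared class ID to the named classes it directly extends."""
    result = {}
    for cls in root.findall("owl:Class", NS):
        parents = [
            sub.get(RDF + "resource").lstrip("#")
            for sub in cls.findall("rdfs:subClassOf", NS)
            if sub.get(RDF + "resource") is not None  # skip anonymous restrictions
        ]
        result[cls.get(RDF + "ID")] = parents
    return result


def object_properties(root):
    """Return (property, domain, range) triples for each owl:ObjectProperty."""
    triples = []
    for prop in root.findall("owl:ObjectProperty", NS):
        domain = prop.find("rdfs:domain", NS).get(RDF + "resource").lstrip("#")
        rng = prop.find("rdfs:range", NS).get(RDF + "resource").lstrip("#")
        triples.append((prop.get(RDF + "ID"), domain, rng))
    return triples


root = ET.fromstring(OWL_DOC)
classes = named_superclasses(root)
properties = object_properties(root)
```

The resulting class hierarchy and property triples are the kind of input from which start and end landmarks could then be derived.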

The rules for actress are:

Start (the start rule) R1: SkipTo(<actress rdf:ID=);
End (the end rule) R2: SkipTo(>).
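A landmark-based rule pair of this kind can be sketched in a few lines. This is an illustrative reading of the start and end rules, not the paper's wrapper: the function names, the sample markup, and the instance IDs are hypothetical. The start rule skips past the start landmark, and the text up to the next end landmark is taken as the extracted value.

```python
def skip_to(text, landmark, start=0):
    """Return the index just past the first `landmark` at or after `start`, or -1."""
    pos = text.find(landmark, start)
    return -1 if pos == -1 else pos + len(landmark)


def extract_all(text, start_landmark, end_landmark):
    """Collect every span between a start landmark and the next end landmark."""
    results = []
    cursor = 0
    while True:
        begin = skip_to(text, start_landmark, cursor)
        if begin == -1:
            break
        end = text.find(end_landmark, begin)
        if end == -1:
            break
        results.append(text[begin:end])
        cursor = end + len(end_landmark)
    return results


# Hypothetical instance markup; the IDs are placeholders, not data from the paper.
page = '<actress rdf:ID="actress_001"> ... <actress rdf:ID="actress_002">'
raw = extract_all(page, "<actress rdf:ID=", ">")
ids = [value.strip('"') for value in raw]  # drop the surrounding quotes
```

Because the rules depend only on the landmarks, the same two functions would serve any other class by swapping in a different start landmark.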


The information extraction system targets e-commerce websites. E-commerce web pages contain a large amount of text; the paper uses the ontology model to extract triples of the form {concept (class), relation (property), instance (individual)} and constructs the wrapper, which stores the ontology instances in a database for later querying. When a user enters a query sentence, the ontology module first parses the sentence, the rule module then extracts the triples, and the instances are matched against the database to retrieve the required information and return the results to the user. This is the whole extraction process of the system.

A. System structure

The system consists of three parts: the domain ontology analysis module, the text pre-processing module, and the information extraction module. The three parts work together to extract information according to concepts, relations, and instances, and finally retrieve results from the knowledge base and return them to the user. The whole process is shown in Figure 3.

Figure 3. The process of the information extraction

B. Ontology parsing module

This module has two tasks: generating rules and building a knowledge base. The paper first builds a simple e-commerce domain ontology, as shown previously. It then parses the OWL ontology and generates the rules, and extracts the instances into a database to form the knowledge base, which is the database that the extraction relies on. The knowledge base changes according to the input text. It is a relational database whose structure follows the ontology classes: the paper first creates a master table in which each row is a class, and then creates a table for each class according to its object properties. The data table is shown in Table 1.



[Figure 3 block labels: Output information, Web pages, Rule execution, Rules base, Rule generation, Semantic Keys]


Table 1

Movie                  MoviePerson   Place        Price   Shipping   SaleService
Love of [...]rn Tree                 [...]gzhou   12.00   8.00

Director      Writer   Producer
Zhang Yimou   Ai Mi    Zhang Yimou
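The knowledge base layout described in this module, a master table of classes plus one table per class built from its object properties, could look roughly like the following `sqlite3` sketch. The table and column names and the placeholder title `movie_001` are assumptions made for illustration; only the person names are taken from Table 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
cur = conn.cursor()

# Master table: one row per ontology class.
cur.execute("CREATE TABLE ontology_class (name TEXT PRIMARY KEY)")
cur.executemany(
    "INSERT INTO ontology_class (name) VALUES (?)",
    [("Movie",), ("MoviePerson",), ("Place",),
     ("Price",), ("Shipping",), ("SaleService",)],
)

# Per-class table built from the object properties of MoviePerson.
cur.execute(
    "CREATE TABLE movie_person ("
    " movie TEXT, director TEXT, writer TEXT, producer TEXT)"
)
# 'movie_001' is a placeholder title; the names are those legible in Table 1.
cur.execute(
    "INSERT INTO movie_person VALUES (?, ?, ?, ?)",
    ("movie_001", "Zhang Yimou", "Ai Mi", "Zhang Yimou"),
)
conn.commit()


def movies_directed_by(connection, director):
    """Return the movie identifiers stored for a given director."""
    rows = connection.execute(
        "SELECT movie FROM movie_person WHERE director = ? ORDER BY movie",
        (director,),
    )
    return [row[0] for row in rows]


result = movies_directed_by(conn, "Zhang Yimou")
class_count = conn.execute("SELECT COUNT(*) FROM ontology_class").fetchone()[0]
```

Keeping the class list in its own table means a new ontology class only requires a new row and a new per-class table, without changing existing tables.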

C. Text pre-processing module

This module is used when users query the system. When a user enters a query, the system first pre-processes the text to identify the user's request. The main function of this module is to divide the query text into paragraphs, the paragraphs into sentences, and finally the sentences into words. Once the system has identified the triples, the query can be executed through the rule execution module.

D. Information extraction module

The system is instance-based: following the structure of the ontology, it searches for relevant information along parent-child relationships, moving from instances up to concepts. This module mainly consists of an information extraction algorithm.
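The paper does not list the algorithm itself, so the following is only a sketch of how the pre-processing steps (paragraphs, sentences, words) and the instance-to-concept walk might fit together. The hierarchy table, the instance table, the sample query, and all function names are assumptions; the class names and the director instance follow the ontology described earlier.

```python
import re

# Hypothetical kind-of hierarchy (child -> parent), using class names from the paper.
PARENT = {
    "Director": "Creator",
    "Creator": "MoviePerson",
    "Actress": "Performer",
    "Performer": "MoviePerson",
}
# Hypothetical knowledge base of instances (instance -> class).
INSTANCES = {"Feng Xiaogang": "Director"}


def split_paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Split a paragraph into sentences on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def split_words(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9_]+", sentence)


def find_instances(sentence):
    """Return known instances whose words appear contiguously in the sentence."""
    words = split_words(sentence)
    found = []
    for name in INSTANCES:
        parts = name.split()
        for i in range(len(words) - len(parts) + 1):
            if words[i:i + len(parts)] == parts:
                found.append(name)
                break
    return found


def concept_chain(instance):
    """Walk from an instance's class up through its parent concepts."""
    chain = [INSTANCES[instance]]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


query = "Which movies did Feng Xiaogang direct? Show the price."
paragraphs = split_paragraphs(query)
sentences = split_sentences(paragraphs[0])
matches = find_instances(sentences[0])
chain = concept_chain(matches[0])
```

Starting from the matched instance and climbing the hierarchy mirrors the instance-first strategy described above: the query is anchored on a concrete individual, and broader concepts are reached only through its parents.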


To test the performance of the system, the paper uses GATE for the information extraction experiment. GATE (General Architecture for Text Engineering) is over 15 years old and is in active use for all types of computational tasks involving human language. It is an architecture for language engineering developed by the University of Sheffield and contains a suite of tools for language processing. For IE, it runs over a corpus of texts to produce a set of annotated texts. The input takes the form of URLs of target web pages or an ontology of the domain [6]. The ontology sets the structure. For the IE application, the paper inputs the ontology of the DVDs.

First, the paper downloads a batch of e-commerce web pages (HTML), loads the ontology and the web resources into GATE, uses the ontology to extract information from the web resources, and finally outputs the results. GATE provides functions for loading an ontology and extracting with it. The results are as follows. The paper stores the extracted data in a database table and queries it when users make a request.

V. CONCLUSION

In general, ontology-based information extraction technology is not yet mature. Ontology construction in particular still involves a great deal of manual work, and its development requires further study. The paper extracts information from only a small area of e-commerce websites. E-commerce is an emerging field that has developed rapidly in recent years and plays an important role in the Internet age; future work should broaden the domain covered and explore it in greater depth. The information extraction rules and the construction of the ontologies also need continued improvement.


REFERENCES

[1] Line Eikvil, “Information Extraction from World Wide Web”, Norwegian Computer Center, P.B. 114 Blindern, N-0314 Oslo, Norway, pp. 5-10, July 1999.

[2] Michele Banko, Michael J. Cafarella, and Stephen Soderland, “Open Information Extraction from the Web”, Department of Computer Science and Engineering, University of Washington, Seattle, USA, pp. 2670-2671, 2007.

[3] Wang Jingpu, “Algorithm Research for Text Information Extraction Based on Wrapper Model”, Hunan University, Hunan, pp. 1-16, July 2002.

[4] Li Baoli, Chen Yuzhong, and Yu Shiwen, “Research on Information Extraction: A Survey”, Computer Industry and Application, pp. 1-5, 2003.

[5] Hongsheng Wang, Lu Yuan, and Hong Shao, “Text Information Extraction Based on OWL Ontologies”, Fifth International Conference on Fuzzy System and Knowledge Discovery, IEEE Computer Society, DOI 10.1109/FSKD.2008.311, pp. 217-221, 2008.

[6] Diana Maynard, Milena Yankova, Alexandros Kourakis, and Antonis Kokossis, “Ontology-based information extraction for market monitoring and technology watch”, EU-funded Knowledge Web Network of Excellence (IST-2004-507482) and SEKT project (IST-2004-506826).

Figure 4. Results of the extraction on GATE