
Constructing Domain Ontology from Texts:

A Practical Approach and a Case Study

William L Sousan, Kristina L Wylie and Zhengxin Chen

College of Information Science and Technology

University of Nebraska at Omaha, Omaha, NE 68182

{ wsousan, zchen }@mail.unomaha.edu

Abstract-Constructing domain ontology from texts is an

effective way of achieving ontology on demand. However,

this process is very complex. In this paper, we explore a

practical approach for domain ontology construction from

texts using existing tools. We also describe a case study, which

focuses on selecting a variety of papers on avian influenza

virus. According to our approach, after being reduced to only

key sentences, the selected texts are parsed by NLP software to

identify the words and their corresponding parts of speech.

The labeled sentences are then related to one another to create

an ontology. In addition, the Graphviz tool is used to visualize the ontology as a graph. Advantages, limitations, and possible improvements of this approach are discussed.

Keywords-Semantic Web; domain ontology; ontology

construction; ontology on demand; natural language processing

I. INTRODUCTION

The idea behind ontology on demand is to construct, reasonably quickly and accurately, domain-specific ontologies that model highly specialized domains.

Constructing domain ontology from a text corpus is an

effective way of achieving ontology on demand.

However, performing this process programmatically is very complex and difficult due to the nature of the English

language. In addition, there are challenges in determining

the structure of the ontology such as the levels of

generalization and specialization of concepts and the

relevancy of concepts.

In this paper, we explore an experimental approach for

domain ontology construction from texts found on the Web

in a semi-automated fashion, describing a general

methodology and a case study focusing on the avian

influenza virus. We combine the use of natural language

processing (NLP) tools, visualization, and user interaction

to develop a domain specific ontology of the avian/swine

influenza virus from a small text corpus.

The rest of the paper is organized as follows. We first

review basics of ontology, focusing on ontology

construction from texts, with related work (Section II). Then

in Section III we describe our general methodology, which

is followed by a case study in Section IV. Discussion and

conclusion are provided in Section V.

II. BASICS OF ONTOLOGY AND ONTOLOGY CONSTRUCTION FROM TEXTS

A. Ontology and Semantic Web

As noted in [4], the aim of the Semantic Web is to add a

layer of meaning on top of data, services and resources to

enforce their interoperability and enable machine

interpretability. The data, services and resources are then

described semantically via metadata, captured with respect to ontologies, which are logical theories and thus have a formal logical interpretation independent of specific applications. Gruber defined an ontology as a formal explicit specification of a shared conceptualization [5]. An ontology relates a large number of ideas and concepts together in a hierarchical format. A survey of 1300 OWL ontologies and RDFS schemas is provided in [11].

The use of ontologies in information systems provides

several benefits. First and foremost, the knowledge needed

and acquired can be stored in a standardized format that

unambiguously describes the knowledge in a formal model.

Next, ontologies are hierarchical and thus provide a

taxonomy of concepts that allows for the semantic indexing and retrieval of information. Besides their retrieval and

indexing characteristics, ontologies provide a means of data

fusion by supplying synonyms or concepts defined using

various descriptions. For example, the concept of "Avian Bird Flu" could be found in text under various descriptions, such as "Bird Flu", "Avian Bird Flu", "Avian Flu", and other variations that all identify the same concept of Avian Bird Flu. This feature is also useful as a means of

semantically annotating keywords found in text with their

corresponding ontological concepts. Thus users can clearly

specify or query using ontological concepts instead of

keywords.
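As a rough illustration of this kind of synonym-based annotation, the following minimal Python sketch maps surface forms found in text to one canonical concept. The synonym table and the function name `annotate` are hypothetical and only mirror the Avian Bird Flu example above; they are not part of any system described in this paper.

```python
import re

# Hypothetical synonym table: lowercased surface form -> canonical concept label.
SYNONYMS = {
    "bird flu": "Avian Bird Flu",
    "avian bird flu": "Avian Bird Flu",
    "avian flu": "Avian Bird Flu",
}

def annotate(text, synonyms=SYNONYMS):
    """Return (matched text, concept) pairs for every known surface form in text."""
    # Try longer surface forms first so "avian bird flu" wins over "bird flu".
    forms = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), synonyms[m.group(0).lower()]) for m in pattern.finditer(text)]
```

A query expressed against the concept label would then retrieve text containing any of the listed variations.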

B. Towards ontology on demand: Ontology construction from texts

Typically, ontologies are difficult and labor intensive to

create. In order to acquire domain knowledge needed for

ontologies, we need domain experts, as well as domain

information. Ontology construction from texts deserves particular attention because texts provide the largest source of information on the Web. Texts in specific knowledge areas

form the domain corpus and provide a model of the domain.

2009 Fifth International Conference on Next Generation Web Services Practices

978-0-7695-3821-1/09 $25.00 2009 IEEE

DOI 10.1109/NWeSP.2009.7


These considerations justify why the concept of ontologies on demand is so attractive: it will allow us to quickly construct domain-specific ontologies for knowledge management.

Later in this paper we will discuss related work for

ontology construction from texts, and present our own

approach. But first, we provide some background on the particular knowledge domain we have worked with, which necessitates ontology on demand.

C. Ontology on demand for avian/swine influenza

In recent years, bioinformatics researchers at the University of

Nebraska at Omaha have undertaken a series of research

projects to deal with avian flu (and more recently, swine

flu), employing data mining, information retrieval and text

mining techniques [7]. A related website of involved

databases can be found at http://www.flugenome.org/. As a

case study for ontology construction from texts, we have

continued to work in this particular field. Constructing

ontologies from texts to achieve ontology on demand is

particularly important to this knowledge domain, because well-known ontologies in the biological/medical domain, such

as Gene Ontology (GO, http://www.geneontology.org/) and

Unified Medical Language System (UMLS,

http://www.nlm.nih.gov/research/umls/), are far too general,

and are not specific enough in our focused area. In addition,

as noted in [6], a challenging issue is how to keep an

ontology up-to-date. Recent development in avian flu and

swine flu reveals the dynamics of this research field:

unknown cases and new discoveries are emerging at a fast

pace. In order to overcome the knowledge acquisition

bottleneck to deal with an evolving world (and sometimes, an uncharted territory), we need automatic or semi-automatic tools to build ontologies. The result will be used

for database annotation.

D. General steps of ontology construction from texts and related work

Reference [4] provides a useful survey of the current status of what its authors refer to as ontology learning. In our

view, although there may be some subtle differences

between ontology learning and a full-fledged ontology

construction, they share a lot of things in common. The

tasks involved in ontology learning from texts are:

- Extracting relevant domain terminology and synonyms;
- Discovering concepts which can be regarded as abstractions of human thought;
- Deriving a concept hierarchy to organize these concepts;
- Extending an existing concept hierarchy by adding new concepts;
- Learning non-taxonomic relationships;
- Extracting instances of relations and concepts;
- Discovering other axiomatic relationships or rules involving concepts and relations.

Although these tasks are, in general, the basic elements of ontology construction, practices vary depending on the actual application. In

particular, ontology construction-related research has been

quite active in bioinformatics, as exemplified in [1, 3, 6, 8].

Through a comprehensive examination of existing methods

for ontology construction, we realized that the overall tasks

as outlined in [4] are viable for ontology construction; however, we also believe more practical concerns should be incorporated. In particular, instead of reinventing the wheel for each task, we would like to use existing and publicly available tools, particularly those related to natural language processing (NLP), to do the job. We also believe that incorporating human experts' experience, including the use of a seed ontology at the initial stage, could be an effective way to achieve our goal, as outlined in the next section.

III. A PRACTICAL APPROACH

Our understanding of the overall process for ontology

construction from texts is summarized in Fig. 1.

Construction of ontologies from text often includes many

complex sub-tasks arranged in a pipeline. Concept and relationship discovery is first applied

to the result obtained from using NLP tools. From there,

various discovery algorithms (including lexico-syntactic

pattern discovery, Noun-Verb-Noun patterns discovery,

word frequency discovery, as well as association rules, etc.)

are applied, and are incorporated with seed ontology (or bag

of words), to build domain concepts and relationships.
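To make one of these discovery algorithms concrete, here is a minimal Python sketch of word-frequency discovery combined with a seed bag of words. It is only an illustration under simple assumptions (whitespace-like tokenization, a tiny hand-written stop-word list); the names `STOP_WORDS` and `candidate_terms` are hypothetical and do not come from the tools used in this paper.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would need a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "was", "were", "by"}

def candidate_terms(text, seed_words, top_n=5):
    """Rank words by frequency and flag those already present in the seed bag of words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    seeds = {s.lower() for s in seed_words}
    # Each entry is (term, frequency, already_in_seed).
    return [(term, n, term in seeds) for term, n in counts.most_common(top_n)]
```

Frequent terms that are not yet in the seed set are natural candidates for a domain expert to review as new concepts.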

In order to make this approach work, more specific details are needed, as outlined in the rest of this section. Our approach has several important features, including feasibility (ease of use), semi-automation (with human interaction), and flexibility. The general steps involved in the practical approach of ontology construction from texts are:

- Manual selection of a set of related papers.
- Selection of relevant sentences from these papers (if needed). Although not necessary, this step could reduce the result to a manageable size. The selection can be done manually, or automatically by applying certain heuristics so that certain sentences can be entirely skipped.
- Run an NLP program; for example, the Stanford parser (http://nlp.stanford.edu/software/parser.shtml).
- Locate noun phrases and verbs, which provide potential concepts and relationships.
- Build a graph to visualize and evaluate the ontology, for example using Graphviz (http://www.graphviz.org/).
- Post-construction analysis by domain experts.
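The automatic variant of the sentence-selection step could, for instance, keep only sentences that mention a seed term. The sketch below is one such heuristic, written in Python under the assumption of a naive punctuation-based sentence splitter; the function name `select_sentences` and the seed terms are hypothetical, and the paper does not specify which heuristics were actually applied.

```python
import re

def select_sentences(text, seed_terms):
    """Keep only sentences that mention at least one seed term (case-insensitive)."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    seeds = [t.lower() for t in seed_terms]
    return [s for s in sentences if any(t in s.lower() for t in seeds)]
```

Sentences that pass the filter would then be handed to the parser in the next step, which keeps the parser output closer to a manageable size.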


Figure 1. Common ontology construction methods from text

IV. ONTOLOGY CONSTRUCTION FROM AVIAN FLU RESEARCH PAPERS: A CASE STUDY

A. Paper Selection

This case study focuses on creating an ontology for the

avian influenza virus. The avian influenza virus affects a

variety of countries and animals. We found that the influenza virus has many sublineages, clades, and changes. Most of the papers focused only on H5N1 influenza, but we also included some research that looked at how influenza affects humans. There were similarities between these two types of influenza. We also included one paper that discussed preventative measures. With the concern of influenza epidemics in the media, it was important to learn more about preventative measures.

Shown below is the result of using four well-selected

papers for ontology construction. Full texts (rather than

abstracts) of these articles have been used for ontology

construction.

1. G. Di Guiseppe, R. Abbate, L. Albano, P. Marinelli, I. F. Angelillo et al., "A survey of knowledge, attitudes and practices towards avian influenza in an adult population of Italy." BioMedCentral 8, 2008.

2. D. Van Riel, V. J. Munster, E. De Wit et al., "Human and Avian Influenza Viruses Target Different Cells in the Lower Respiratory Tract of Humans and Other Mammals." The American Journal of Pathology 171, 1215-223, 2007.

3. D. Vijaykrishna, J. Bahl, S. Riley, L. Duan et al., "Evolutionary Dynamics and Emergence of Panzootic H5N1 Influenza Viruses." PLoS Pathog 4(9), e1000161, 2008.

4. X.-F. Wan, T. Nguyen, DC. T. Davis et al., "Evolution of Highly Pathogenic H5N1 Avian Influenza Viruses in Vietnam between 2001 and 2007." PLoS ONE 3(10), e3462, 2008.

B. NLP and Reducing text

To create the ontology, we used two open source

programs. First, the natural language parsing tool from

Stanford University was used. This program statistically

analyzes each sentence and labels each word with its proper

part of speech thereby creating a grammatical structure of

sentences. As a result, the output format allows for the identification of noun and verb phrases which may indicate

potential concepts and relationships. Furthermore, the tool

provides different output formats. The following shows an

example of a portion of one of the available output formats

produced by the NLP software which uses codes such as NP

(Noun Phrase) to identify the parts of speech:

(ROOT
  (S
    (PP (IN Since)
      (NP (RB then)))
    (, ,)
    (NP
      (NP (NNS outbreaks))
      (PP (IN of)
        (NP (DT the) (JJ H5N1)
          (ADJP (RB highly) (JJ pathogenic))
          (JJ avian) (NN influenza) (NN strain))))
    (VP (VBP have)
      (VP (VBN been)
        (VP (VBN identified)

The challenge thus lies in parsing this output to look for NP-VP-NP sequences that provide potential concept-relationship-concept or domain-relation-range triples.

Note similar work of using parts of speech for building

ontologies has been performed in [2,10].
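A minimal Python sketch of such an NP-VP-NP search over bracketed parser output is given below. It is an illustration only, not the procedure used in this study: the helper names (`parse_tree`, `words`, `verbs`, `first`, `triples`) are hypothetical, the input must be a complete (balanced) tree, and the object is simply taken to be the first NP found anywhere inside the VP, which is a simplifying assumption.

```python
import re

def parse_tree(text):
    """Parse a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            node = [tokens[pos + 1]]      # constituent label, e.g. NP
            pos += 2
            while tokens[pos] != ")":
                node.append(read())
            pos += 1                      # consume ")"
            return node
        word = tokens[pos]                # a leaf word
        pos += 1
        return word

    return read()

def words(node):
    """Return the leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]

def verbs(node):
    """Collect verb tokens (tags starting with VB) in a VP, without entering NPs."""
    out = []
    for child in node[1:]:
        if isinstance(child, str):
            continue
        if child[0].startswith("VB"):
            out.extend(words(child))
        elif child[0] != "NP":
            out.extend(verbs(child))
    return out

def first(node, label):
    """Depth-first search for the first descendant with the given label."""
    for child in node[1:]:
        if isinstance(child, str):
            continue
        if child[0] == label:
            return child
        found = first(child, label)
        if found is not None:
            return found
    return None

def triples(node):
    """Yield (subject NP, verb, object NP) wherever an NP is directly followed by a VP."""
    if isinstance(node, str):
        return
    children = [c for c in node[1:] if not isinstance(c, str)]
    for left, right in zip(children, children[1:]):
        if left[0] == "NP" and right[0] == "VP":
            obj = first(right, "NP")
            if obj is not None:
                yield (" ".join(words(left)), " ".join(verbs(right)), " ".join(words(obj)))
    for child in children:
        yield from triples(child)
```

For the complete tree `(ROOT (S (NP (DT The) (NN virus)) (VP (VBZ infects) (NP (JJ wild) (NNS birds)))))`, `list(triples(parse_tree(...)))` gives `[("The virus", "infects", "wild birds")]`. Passive or prepositional constructions would need extra handling before the resulting triples could be trusted.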

An extremely large set of output was produced, creating result files that were hundreds of pages long. This was far too many relations to analyze manually. To

reduce the workload, only noun phrases that described the

influenza virus were used. After the final reduction, the

results still spanned 73 pages.
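A reduction of this kind can be approximated by a keyword filter over extracted relations. The Python sketch below assumes relations are held as (noun phrase, verb, noun phrase) tuples and that a short list of domain terms is available; both the function name `keep_domain_triples` and the example terms are hypothetical, and the reduction in this study was not necessarily automated this way.

```python
def keep_domain_triples(triples, domain_terms):
    """Keep triples whose subject or object noun phrase mentions a domain term."""
    terms = [t.lower() for t in domain_terms]

    def mentions(phrase):
        lowered = phrase.lower()
        return any(t in lowered for t in terms)

    # Index 0 is the subject noun phrase, index 2 the object noun phrase.
    return [t for t in triples if mentions(t[0]) or mentions(t[2])]
```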

Once the information was parsed, we began creating the

ontology. We linked together words and phrases to create a

relational word network. This task was done by combining

background knowledge of influenza virus, the output from

the NLP, and reading through the papers discussed

previously.

C. Visual presentation

We visually present the obtained ontology by using the open source package Graphviz, which creates images of graphs that are described in the DOT language. Thus in our process, we had to convert concepts into nodes and their relations/connections into arrows described in DOT syntax. However, in our

present implementation, the delineation between concepts

and relationships needs some additional work as we do not

indicate the difference between concepts and relationships.
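One possible way to keep concepts and relationships apart is to emit concepts as nodes and relationships as edge labels. The Python sketch below generates DOT text in that form from (concept, relation, concept) triples. It is a suggestion rather than a description of the present implementation, and the function name `to_dot` and the graph name are hypothetical.

```python
def to_dot(triples, graph_name="ontology"):
    """Render (concept, relation, concept) triples as DOT source text."""
    def quote(s):
        # Wrap in double quotes and escape any embedded double quotes.
        return '"' + s.replace('"', '\\"') + '"'

    lines = ["digraph " + graph_name + " {"]
    for subj, rel, obj in triples:
        # Concepts become nodes; the relation becomes the label on the arrow.
        lines.append("  " + quote(subj) + " -> " + quote(obj) + " [label=" + quote(rel) + "];")
    lines.append("}")
    return "\n".join(lines)
```

Saving the returned text to a file such as `ontology.dot` and running `dot -Tpng ontology.dot -o ontology.png` should then produce an image in which relationship names appear on the arrows rather than as nodes.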

For better readability, a portion of the ontology is enlarged,

[Figure 1 diagram. Recoverable labels: Text Corpus; NLP Tools (extracts word tokens, parts of speech, noun and verb phrases, parse trees, named entities); Concept & Relationship Discovery; Discovery Algorithms (lexico-syntactic patterns, association rules, noun-verb-noun patterns, word frequency); Seed Ontology or Bag of Words; Build Domain Concepts and Relationships (semantic distance, clustering, formalizing concepts and relationships, extending concepts); Final Ontology.]
