Constructing Domain Ontology From Texts
Constructing Domain Ontology from Texts:
A Practical Approach and a Case Study
William L Sousan, Kristina L Wylie and Zhengxin Chen
College of Information Science and Technology
University of Nebraska at Omaha
Omaha, NE 68182
{ wsousan, zchen }@mail.unomaha.edu
Abstract: Constructing domain ontology from texts is an
effective way of achieving ontology on demand. However,
this process is very complex. In this paper, we explore a
practical approach for domain ontology construction from
texts using existing tools. We also describe a case study built
on a selection of papers on the avian influenza virus. In our
approach, the selected texts are first reduced to key sentences
and then parsed by NLP software to identify the words and
their corresponding parts of speech. The labeled sentences are
then related to one another to create an ontology. In addition,
the Graphviz tool is used to visualize the resulting ontology as
a graph. Advantages, limitations, and possible improvements of
this approach are discussed.
Keywords-Semantic Web; domain ontology; ontology
construction; ontology on demand; natural language processing
I. INTRODUCTION
The idea behind ontology on demand is the ability to
construct, reasonably quickly and accurately, domain-specific
ontologies that model highly specialized domains.
Constructing domain ontology from a text corpus is an
effective way of achieving ontology on demand.
However, performing this process programmatically is very complex and difficult due to the nature of the English
language. In addition, there are challenges in determining
the structure of the ontology such as the levels of
generalization and specialization of concepts and the
relevancy of concepts.
In this paper, we explore an experimental approach for
domain ontology construction from texts found on the Web
in a semi-automated fashion, describing a general
methodology and a case study focusing on the avian
influenza virus. We combine the use of natural language
processing (NLP) tools, visualization, and user interaction
to develop a domain specific ontology of the avian/swine
influenza virus from a small text corpus.
The rest of the paper is organized as follows. We first
review basics of ontology, focusing on ontology
construction from texts, with related work (Section II). Then
in Section III we describe our general methodology, which
is followed by a case study in Section IV. Discussion and
conclusion are provided in Section V.
II. BASICS OF ONTOLOGY AND ONTOLOGY CONSTRUCTION FROM TEXTS
A. Ontology and Semantic Web
As noted in [4], the aim of the Semantic Web is to add a
layer of meaning on top of data, services and resources to
enforce their interoperability and enable machine
interpretability. The data, services and resources are then
described semantically via metadata, captured with respect to
ontologies, which are logical theories and thus have a
formal logical interpretation, independent of specific
applications. Gruber defined an ontology as a formal explicit
specification of a shared conceptualization [5]. An
ontology relates a large number of ideas and concepts
together in a hierarchical format. A survey of 1300 OWL
ontologies and RDFS schemas is provided in [11].
The use of ontologies in information systems provides
several benefits. First and foremost, the knowledge needed
and acquired can be stored in a standardized format that
unambiguously describes the knowledge in a formal model.
Next, ontologies are hierarchical and thus provide a
taxonomy of concepts that allows for the semantic indexing
and retrieval of information. Besides their retrieval and
indexing characteristics, ontologies provide a means of data
fusion by supplying synonyms or concepts defined using
various descriptions. For example, the concept of "Avian
Bird Flu" could be found in text under various descriptions
such as "Bird Flu", "Avian Bird Flu", "Avian Flu" and other
variations that all identify the same concept.
This feature is also useful for providing a means of
semantically annotating keywords found in text with their
corresponding ontological concepts. Thus users can clearly
specify or query using ontological concepts instead of
keywords.
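The synonym-based annotation described above can be illustrated with a minimal Python sketch. The synonym table and the function name annotate are hypothetical and serve only to show how several surface forms can be mapped to a single ontological concept; they are not part of the system described in this paper.

```python
import re

# Hypothetical synonym table: surface forms mapped to one canonical concept.
SYNONYMS = {
    "bird flu": "Avian Bird Flu",
    "avian flu": "Avian Bird Flu",
    "avian bird flu": "Avian Bird Flu",
}

def annotate(text, synonyms=SYNONYMS):
    """Return (matched_text, concept) pairs for every synonym found in text."""
    # Try longer surface forms first so "avian bird flu" wins over "bird flu".
    forms = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), synonyms[m.group(0).lower()])
            for m in pattern.finditer(text)]
```

A query for the concept "Avian Bird Flu" could then retrieve every sentence in which annotate finds any of the listed variants, rather than relying on one exact keyword.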
B. Towards ontology on demand: Ontology construction from texts
Typically, ontologies are difficult and labor intensive to
create. In order to acquire domain knowledge needed for
ontologies, we need domain experts, as well as domain
information. Ontology construction from texts deserves
particular attention, as texts provide the largest source of
information on the Web. Texts in specific knowledge areas
form the domain corpus and provide a model of the domain.
2009 Fifth International Conference on Next Generation Web Services Practices
978-0-7695-3821-1/09 $25.00 2009 IEEE
DOI 10.1109/NWeSP.2009.7
These considerations justify why the concept of
"ontologies on demand" is so attractive: it will
allow us to quickly construct domain-specific ontologies for
knowledge management.
Later in this paper we discuss related work on
ontology construction from texts and present our own
approach. But first, we provide some background
on the particular knowledge domain we have worked with,
which necessitates ontology on demand.
C. Ontology on demand for avian/swine influenza
In recent years, bioinformatics researchers at the University of
Nebraska at Omaha have undertaken a series of research
projects to deal with avian flu (and more recently, swine
flu), employing data mining, information retrieval and text
mining techniques [7]. A related website of the databases
involved can be found at http://www.flugenome.org/. As a
case study for ontology construction from texts, we have
continued to work in this particular field. Constructing
ontologies from texts to achieve ontology on demand is
particularly important to this knowledge domain, because
well-known ontologies in the biological/medical domain, such
as Gene Ontology (GO, http://www.geneontology.org/) and
Unified Medical Language System (UMLS,
http://www.nlm.nih.gov/research/umls/), are far too general
and not specific enough for our focused area. In addition,
as noted in [6], a challenging issue is how to keep an
ontology up-to-date. Recent developments in avian flu and
swine flu reveal the dynamics of this research field:
unknown cases and new discoveries are emerging at a fast
pace. In order to overcome the knowledge acquisition
bottleneck and deal with an evolving world (and sometimes,
an uncharted territory), we need automatic or semi-automatic
tools to build ontologies. The result will be used
for database annotation.
D. General steps of ontology construction from texts and related work
Reference [4] provides a survey of the current status of
what its authors refer to as ontology learning. In our
view, although there may be some subtle differences
between ontology learning and full-fledged ontology
construction, they have much in common. The
tasks involved in ontology learning from texts are:
• Extracting relevant domain terminology and synonyms;
• Discovering concepts which can be regarded as abstractions of human thought;
• Deriving a concept hierarchy to organize these concepts;
• Extending an existing concept hierarchy by adding new concepts;
• Learning non-taxonomic relationships;
• Extracting instances of relations and concepts;
• Discovering other axiomatic relationships or rules involving concepts and relations.
Although these tasks are, in general, the basic elements
of ontology construction, practices vary depending on the
actual application. In particular, ontology
construction-related research has been
quite active in bioinformatics, as exemplified in [1, 3, 6, 8].
Through a comprehensive examination of existing methods
for ontology construction, we concluded that the overall tasks
as outlined in [4] are viable for ontology construction;
however, we also believe more practical concerns should be
incorporated. In particular, instead of
reinventing the wheel for each task, we would like to use
existing, publicly available tools, particularly those related to
natural language processing (NLP), to do the job. We also
believe that incorporating human experts' experience, including
the use of a seed ontology at the initial stage, could be
an effective way to achieve our goal, as outlined in the
next section.
III. A PRACTICAL APPROACH
Our understanding of the overall process for ontology
construction from texts is summarized in Fig. 1.
Construction of ontologies from text often includes many
complex sub-tasks arranged in a pipelined fashion.
Concept and relationship discovery is first applied
to the result obtained from the NLP tools. From there,
various discovery algorithms (including lexico-syntactic
pattern discovery, Noun-Verb-Noun pattern discovery,
word frequency discovery, and association rules)
are applied and combined with a seed ontology (or bag
of words) to build domain concepts and relationships.
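Two of the discovery algorithms named above, word frequency discovery and lexico-syntactic pattern discovery, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seed terms, the single "such as" pattern, and both function names are hypothetical and are not taken from the tools used in this paper.

```python
import re
from collections import Counter

# Hypothetical seed "bag of words" for the domain.
SEED_TERMS = {"influenza", "virus", "h5n1"}

def term_frequencies(sentences, seed=SEED_TERMS):
    """Count how often each seed term occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"[A-Za-z0-9]+", sentence.lower()):
            if token in seed:
                counts[token] += 1
    return counts

# One lexico-syntactic pattern: "X such as Y" suggests that Y is a kind of X.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")

def such_as_pairs(sentence):
    """Return (general, specific) word pairs matched by the 'such as' pattern."""
    return SUCH_AS.findall(sentence)
```

Frequent seed terms point to candidate concepts, while each matched pair is a candidate generalization/specialization link that a domain expert would still need to confirm.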
In order to make this approach work, more specific
details are needed, as outlined in the rest of this
section. Our approach has several important features,
including feasibility (easy to use), semi-automation (with
human interaction), and flexibility. The general steps
involved in the practical approach of ontology construction
from texts are:
• Manual selection of a set of related papers.
• Selection of relevant sentences from these papers (if needed). Although not necessary, this step could reduce the result to a manageable size. The selection can be done manually, or automatically by applying certain heuristics so that certain sentences can be entirely skipped.
• Run an NLP program; for example, the Stanford parser (http://nlp.stanford.edu/software/parser.shtml).
• Locate noun phrases and verbs which provide potential concepts and relationships.
• Build a graph to visualize and evaluate the ontology, such as by using Graphviz (http://www.graphviz.org/).
• Post-construction analysis by domain experts.
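The automatic sentence-selection step mentioned above could, for instance, rely on a simple keyword heuristic. The Python sketch below assumes a hypothetical keyword list and a naive sentence splitter; the paper does not specify which heuristics were applied, so this is one possible reading rather than the method actually used.

```python
import re

# Hypothetical domain keywords; a real list would come from a domain expert
# or a seed ontology.
KEYWORDS = ("influenza", "h5n1", "virus")

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def select_sentences(text, keywords=KEYWORDS):
    """Keep only sentences that mention at least one keyword."""
    return [s for s in split_sentences(text)
            if any(k in s.lower() for k in keywords)]
```

Sentences that mention none of the keywords are skipped entirely, which reduces the amount of text passed on to the parser.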
Figure 1. Common ontology construction methods from text
IV. ONTOLOGY CONSTRUCTION FROM AVIAN FLU RESEARCH PAPERS: A CASE STUDY
A. Paper Selection
This case study focuses on creating an ontology for the
avian influenza virus. The avian influenza virus affects a
variety of countries and animals. We found that the
influenza virus has many sublineages, clades, and changes.
Most of the papers focused only on H5N1
influenza, but we also included some research that looked at
how influenza affects humans. There were similarities
between these two types of influenza. We also included one
paper that discussed preventative measures. With the
concern about influenza epidemics in the media, it was
important to learn more about preventative measures.
The four selected papers used for ontology construction
are listed below. Full texts (rather than
abstracts) of these articles were used.
1. G. Di Guiseppe, R. Abbate, L. Albano, P. Marinelli, I. F. Angelillo et al., "A survey of knowledge, attitudes and practices towards avian influenza in an adult population of Italy." BioMed Central 8, 2008.
2. D. Van Riel, V. J. Munster, E. De Wit et al., "Human and Avian Influenza Viruses Target Different Cells in the Lower Respiratory Tract of Humans and Other Mammals." The American Journal of Pathology 171, 1215-223, 2007.
3. D. Vijaykrishna, J. Bahl, S. Riley, L. Duan et al., "Evolutionary Dynamics and Emergence of Panzootic H5N1 Influenza Viruses." PLoS Pathog 4(9), e1000161, 2008.
4. X.-F. Wan, T. Nguyen, DC. T. Davis et al., "Evolution of Highly Pathogenic H5N1 Avian Influenza Viruses in Vietnam between 2001 and 2007." PLoS ONE 3(10), e3462, 2008.
B. NLP and Reducing text
To create the ontology, we used two open source
programs. First, the natural language parsing tool from
Stanford University was used. This program statistically
analyzes each sentence and labels each word with its
part of speech, thereby creating a grammatical structure for
each sentence. As a result, the output allows for the
identification of noun and verb phrases, which may indicate
potential concepts and relationships. The tool
provides several output formats. The following shows
a portion of one of the output formats
produced by the NLP software, which uses codes such as NP
(Noun Phrase) to identify the parts of speech:
(ROOT
  (S
    (PP (IN Since)
      (NP (RB then)))
    (, ,)
    (NP
      (NP (NNS outbreaks))
      (PP (IN of)
        (NP (DT the) (JJ H5N1)
          (ADJP (RB highly) (JJ pathogenic))
          (JJ avian) (NN influenza) (NN strain))))
    (VP (VBP have)
      (VP (VBN been)
        (VP (VBN identified)
The challenge thus lies in parsing this output to find
NP-VP-NP sequences that provide potential
concept-relationship-concept (or domain-relation-range) triples.
Similar work using parts of speech for building
ontologies has been reported in [2, 10].
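A minimal Python sketch of this search is given below. It reads a complete bracketed parse tree into nested lists and reports one (noun phrase, verb, noun phrase) triple for each clause whose S node has an NP child and a VP child containing a verb and an NP. The function names are hypothetical, and the sketch covers only this simplest active pattern: passive constructions such as the excerpt above, and truncated trees with unbalanced brackets, are not handled.

```python
import re

def parse_tree(text):
    """Parse a bracketed parse tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                          # skip "("
        node = [tokens[pos]]              # node label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())       # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                          # skip ")"
        return node

    return read()

def words(node):
    """Join all words under a node, left to right."""
    if isinstance(node, str):
        return node
    return " ".join(words(child) for child in node[1:])

def triples(node):
    """Yield (subject NP, verb, object NP) for each matching clause."""
    if isinstance(node, str):
        return
    if node[0] == "S":
        kids = [c for c in node[1:] if not isinstance(c, str)]
        subj = next((c for c in kids if c[0] == "NP"), None)
        vp = next((c for c in kids if c[0] == "VP"), None)
        if subj and vp:
            vp_kids = [c for c in vp[1:] if not isinstance(c, str)]
            verb = next((c for c in vp_kids if c[0].startswith("VB")), None)
            obj = next((c for c in vp_kids if c[0] == "NP"), None)
            if verb and obj:
                yield words(subj), words(verb), words(obj)
    for child in node[1:]:
        yield from triples(child)
```

Each triple is only a candidate concept-relationship-concept statement; deciding whether it belongs in the ontology remains a task for the human reviewer.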
The parser produced an extremely large set of output, with
result files that were hundreds of pages long. This was far
too many relations to analyze manually. To
reduce the workload, only noun phrases that described the
influenza virus were used. After the final reduction, the
results still spanned 73 pages.
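The reduction to virus-related noun phrases can be approximated by a keyword filter. The sketch below assumes that candidate relations are available as (subject, verb, object) tuples of strings; the term list and function names are hypothetical, since the paper does not state the exact criterion used to decide that a noun phrase describes the influenza virus.

```python
# Hypothetical terms taken to mean "describes the influenza virus".
VIRUS_TERMS = ("influenza", "h5n1", "virus")

def mentions_virus(phrase, terms=VIRUS_TERMS):
    """True if the phrase contains any of the virus-related terms."""
    lowered = phrase.lower()
    return any(term in lowered for term in terms)

def filter_triples(triples, terms=VIRUS_TERMS):
    """Keep triples whose subject or object mentions the virus."""
    return [t for t in triples
            if mentions_virus(t[0], terms) or mentions_virus(t[2], terms)]
```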
Once the information was parsed, we began creating the
ontology. We linked together words and phrases to create a
relational word network. This task was done by combining
background knowledge of the influenza virus, the output from
the NLP software, and a reading of the papers listed
above.
C. Visual presentation
We visually present the obtained ontology using
the open source package Graphviz, which creates images of
graphs described in the dot
language. Thus in our process, we had to convert concepts
into nodes and their relations/connections into arrows
written in dot language syntax. However, our
present implementation needs some additional work, as it
does not yet indicate the difference between concepts and
relationships.
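The conversion into dot syntax can be sketched in Python as follows, assuming the ontology is held as (concept, relation, concept) tuples. The function names are hypothetical. Note that this sketch writes each relation as an edge label, which is one possible way of separating concepts from relationships and differs from the present implementation described above.

```python
def quote(label):
    """Quote a label for dot, escaping backslashes and double quotes."""
    return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'

def to_dot(triples, name="ontology"):
    """Render (concept, relation, concept) triples as a dot digraph."""
    lines = ["digraph " + name + " {"]
    for subj, rel, obj in triples:
        # Concepts become nodes; the relation becomes a labeled arrow.
        lines.append("  %s -> %s [label=%s];" % (quote(subj), quote(obj), quote(rel)))
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be saved to a file such as ontology.dot and rendered with the Graphviz command line tool, for example `dot -Tpng ontology.dot -o ontology.png`.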
For better readability, a portion of the ontology is enlarged,
[Figure 1 diagram labels: Text Corpus; NLP Tools (extracts word tokens, parts of speech, noun and verb phrases, parse trees, named entities); Concept & Relationship Discovery; Discovery Algorithms (lexico-syntactic patterns, association rules, noun-verb-noun patterns, word frequency); Seed Ontology or Bag of Words; Build Domain Concepts and Relationships; Semantic Distance; Clustering; Formalizing Concepts & Relationships; Extending Concepts; Final Ontology]