
    Constructing Domain Ontology from Texts:

    A Practical Approach and a Case Study

    William L Sousan, Kristina L Wylie and Zhengxin Chen

    College of Information Science and Technology

    University of Nebraska at Omaha, Omaha, NE 68182

    { wsousan, zchen }@mail.unomaha.edu

    Abstract: Constructing domain ontology from texts is an

    effective way of achieving ontology on demand. However,

    this process is very complex. In this paper, we explore a

    practical approach for domain ontology construction from

    texts using existing tools. We also describe a case study, which

    focuses on selecting a variety of papers on avian influenza

    virus. According to our approach, after being reduced to only

    key sentences, the selected texts are parsed by NLP software to

    identify the words and their corresponding parts of speech.

    The labeled sentences are then related to one another to create

    an ontology. In addition, the Graphviz tool is used to visualize the graphical ontology. Advantages, limitations, as well as the

    improvements of this approach are discussed.

    Keywords: Semantic Web; domain ontology; ontology

    construction; ontology on demand; natural language processing

    I. INTRODUCTION

    The idea behind ontology on demand is the

    ability to construct, reasonably quickly and accurately,

    domain-specific ontologies that model highly specialized domains.

    Constructing domain ontology from a text corpus is an

    effective way of achieving ontology on demand.

    However, performing this process programmatically is very complex and difficult due to the nature of the English

    language. In addition, there are challenges in determining

    the structure of the ontology such as the levels of

    generalization and specialization of concepts and the

    relevancy of concepts.

    In this paper, we explore an experimental approach for

    domain ontology construction from texts found on the Web

    in a semi-automated fashion, describing a general

    methodology and a case study focusing on the avian

    influenza virus. We combine the use of natural language

    processing (NLP) tools, visualization, and user interaction

    to develop a domain specific ontology of the avian/swine

    influenza virus from a small text corpus.

    The rest of the paper is organized as follows. We first

    review basics of ontology, focusing on ontology

    construction from texts, with related work (Section II). Then

    in Section III we describe our general methodology, which

    is followed by a case study in Section IV. Discussion and

    conclusion are provided in Section V.

    II. BASICS OF ONTOLOGY AND ONTOLOGY CONSTRUCTION FROM TEXTS

    A. Ontology and Semantic Web

    As noted in [4], the aim of the Semantic Web is to add a

    layer of meaning on top of data, services and resources to

    enforce their interoperability and enable machine

    interpretability. The data, services and resources are then

    described semantically via metadata, captured with respect to ontologies, which are logical theories and thus have a

    formal logical interpretation, independent of specific

    applications. Gruber defined an ontology as a formal explicit specification of a shared conceptualization [5]. An

    ontology relates a large number of ideas and concepts

    together in a hierarchical format. A survey of 1300 OWL

    ontologies and RDFS schemas is provided in [11].

    The use of ontologies in information systems provides

    several benefits. First and foremost, the knowledge needed

    and acquired can be stored in a standardized format that

    unambiguously describes the knowledge in a formal model.

    Next, ontologies are hierarchical and thus provide a

    taxonomy of concepts that allows for the semantic indexing and retrieval of information. Besides their retrieval and

    indexing characteristics, ontologies provide a means of data

    fusion by supplying synonyms or concepts defined using

    various descriptions. For example, the concept of Avian

    Bird Flu could be found in text under various descriptions

    such as "Bird Flu", "Avian Bird Flu", "Avian Flu" and other

    variations that all identify the same concept of Avian Bird

    Flu. This feature is also useful as a means of

    semantically annotating keywords found in text with their

    corresponding ontological concepts. Thus users can clearly

    specify or query using ontological concepts instead of

    keywords.
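As a minimal sketch of this kind of semantic annotation, the following Python fragment maps surface variants found in a sentence to a single concept label. The synonym table, the function name `annotate`, and the sample sentence are hypothetical and chosen only for illustration; they are not part of the authors' system.

```python
import re

# Hypothetical synonym table: surface form (lowercase) -> canonical concept.
SYNONYMS = {
    "avian bird flu": "Avian Bird Flu",
    "avian flu": "Avian Bird Flu",
    "bird flu": "Avian Bird Flu",
}

def annotate(text, synonyms=SYNONYMS):
    """Return (matched surface form, canonical concept) pairs found in text."""
    # Try longer variants first so "avian bird flu" wins over "bird flu".
    variants = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), synonyms[m.group(0).lower()])
            for m in pattern.finditer(text)]

for surface, concept in annotate("Bird flu and Avian Flu were reported."):
    print(surface, "->", concept)
```

Every variant resolves to the same concept label, so a query phrased in terms of the concept would retrieve text that uses any of the variants.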

    B. Towards ontology on demand: Ontology construction from texts

    Typically, ontologies are difficult and labor intensive to

    create. In order to acquire domain knowledge needed for

    ontologies, we need domain experts, as well as domain

    information. Ontology construction from texts deserves

    particular attention, as texts provide the largest source of

    information on the Web. Texts in specific knowledge areas

    form the domain corpus and provide a model of the domain.

    2009 Fifth International Conference on Next Generation Web Services Practices

    978-0-7695-3821-1/09 $25.00 2009 IEEE

    DOI 10.1109/NWeSP.2009.7


    These considerations explain why the concept of

    ontologies on demand is so attractive, because it will

    allow us to quickly construct domain-specific ontologies for

    knowledge management.

    Later in this paper we will discuss related work for

    ontology construction from texts, and present our own

    approach. But first, we want to provide a little background

    on the particular knowledge domain we have worked with,

    which necessitates ontology on demand.

    C. Ontology on demand for avian/swine influenza

    In recent years, bioinformatics researchers at the University of

    Nebraska at Omaha have undertaken a series of research

    projects to deal with avian flu (and more recently, swine

    flu), employing data mining, information retrieval and text

    mining techniques [7]. A related website of involved

    databases can be found at http://www.flugenome.org/. As a

    case study for ontology construction from texts, we have

    continued to work in this particular field. Constructing

    ontologies from texts to achieve ontology on demand is

    particularly important to this knowledge domain, because well-known ontologies in the biological/medical domain, such

    as Gene Ontology (GO, http://www.geneontology.org/) and

    Unified Medical Language System (UMLS,

    http://www.nlm.nih.gov/research/umls/), are far too general,

    and are not specific enough in our focused area. In addition,

    as noted in [6], a challenging issue is how to keep an

    ontology up to date. Recent developments in avian flu and

    swine flu reveal the dynamics of this research field:

    unknown cases and new discoveries are emerging at a fast

    pace. In order to overcome the knowledge acquisition

    bottleneck to deal with an evolving world (and sometimes,

    an uncharted territory), we need automatic or semi-

    automatic tools to build ontologies. The result will be used

    for database annotation.

    D. General steps of ontology construction from texts and related work

    Reference [4] surveys the current status of

    what its authors refer to as ontology learning. In our

    view, although there may be some subtle differences

    between ontology learning and full-fledged ontology

    construction, the two have much in common. The

    tasks involved in ontology learning from texts are:

    - Extracting relevant domain terminology and synonyms;

    - Discovering concepts which can be regarded as abstractions of human thought;

    - Deriving a concept hierarchy to organize these concepts;

    - Extending an existing concept hierarchy by adding new concepts;

    - Learning non-taxonomic relationships;

    - Extracting instances of relations and concepts;

    - Discovering other axiomatic relationships or rules involving concepts and relations.

    Although these tasks are, in general, the basic elements

    that ontology construction must accomplish,

    practices vary depending on the actual application. In

    particular, ontology construction-related research has been

    quite active in bioinformatics, as exemplified in [1, 3, 6, 8].

    Through a comprehensive examination of existing methods

    for ontology construction, we realized that the overall tasks

    as outlined in [4] are viable for ontology construction; however, we also believe more practical concerns should be

    incorporated. In particular, instead of

    reinventing the wheel for each task, we prefer to use

    existing, publicly available tools, particularly those related to

    natural language processing (NLP), to do the job. We also

    believe that incorporating human experts' experience, including

    the use of a seed ontology at the initial stage, could be

    an effective way to achieve our goal, as outlined in the

    next section.

    III. A PRACTICAL APPROACH

    Our understanding of the overall process for ontology

    construction from texts is summarized in Fig. 1.

    Construction of ontologies from text often includes many

    complex sub-tasks arranged in a pipelined fashion.

    Concept and relationship discovery is first applied

    to the output of the NLP tools. From there,

    various discovery algorithms (including lexico-syntactic

    pattern discovery, Noun-Verb-Noun patterns discovery,

    word frequency discovery, as well as association rules, etc.)

    are applied and combined with a seed ontology (or bag

    of words), to build domain concepts and relationships.
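To make one of these discovery algorithms concrete, here is a minimal sketch of word-frequency discovery combined with a seed bag of words. The stop-word list, the seed words, the `min_count` threshold, and the function name `candidate_terms` are all assumptions made for illustration; the paper does not specify them.

```python
import re
from collections import Counter

# Hypothetical stop-word list and seed bag of words, for illustration only.
STOP_WORDS = {"the", "of", "in", "and", "a", "an", "is", "are", "to", "have", "been"}
SEED_WORDS = {"influenza", "virus", "h5n1"}

def candidate_terms(sentences, seed=SEED_WORDS, min_count=2):
    """Rank frequent words, listing seed terms ahead of non-seed terms."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"[a-z0-9]+", sentence.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    frequent = [(w, c) for w, c in counts.items() if c >= min_count]
    # Seed terms first, then by descending count, then alphabetically.
    return sorted(frequent, key=lambda wc: (wc[0] not in seed, -wc[1], wc[0]))

corpus = [
    "The H5N1 virus infects poultry.",
    "Poultry outbreaks of the H5N1 virus are common.",
]
print(candidate_terms(corpus))
```

Ranking seed terms first is only one way to let prior knowledge guide the discovery; a domain expert would still review the resulting candidates.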

    In order to make this approach work, more specific

    details are needed, as outlined in the rest of this

    section. Our approach has several important features, including feasibility (ease of use), semi-automation (with

    human interaction), and flexibility. The general steps

    involved in the practical approach of ontology construction

    from texts are:

    - Manual selection of a set of related papers.

    - Selection of relevant sentences from these papers (if needed). Although not necessary, this step could reduce the result to a manageable size. The selection can be done manually, or automatically by applying certain heuristics so that certain sentences can be entirely skipped.

    - Run an NLP program; for example, the Stanford parser (http://nlp.stanford.edu/software/parser.shtml).

    - Locate noun phrases and verbs, which provide potential concepts and relationships.

    - Build a graph to visualize and evaluate the ontology, for example using Graphviz (http://www.graphviz.org/).

    - Post-construction analysis by domain experts.
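The automatic variant of the sentence-selection step could be sketched as a simple keyword filter, as below. The paper does not say which heuristics were used, so the keyword list, the naive sentence splitter, and the function name `select_sentences` are assumptions made for illustration.

```python
import re

# Hypothetical domain keywords; a real list would come from a domain expert.
KEYWORDS = ("influenza", "h5n1", "virus", "flu")

def select_sentences(text, keywords=KEYWORDS):
    """Keep only sentences that mention at least one domain keyword."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    selected = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if words & set(keywords):
            selected.append(sentence)
    return selected

text = ("The H5N1 virus spread quickly. Funding was provided by a grant. "
        "Avian flu affects poultry.")
print(select_sentences(text))
```

Sentences that mention none of the keywords are skipped entirely, which shrinks the input handed to the parser in the next step.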


    Figure 1. Common ontology construction methods from text

    IV. ONTOLOGY CONSTRUCTION FROM AVIAN FLU RESEARCH PAPERS: A CASE STUDY

    A. Paper Selection

    This case study focuses on creating an ontology for the

    avian influenza virus. The avian influenza virus affects a

    variety of countries and animals. We found that the

    influenza virus has many sublineages, clades, and changes.

    Most of the papers focused only on H5N1

    influenza, but we also included some research that looked at

    how influenza affects humans. There were similarities

    between these two types of influenza. We also included one

    paper that discussed preventative measures. With the

    concern of influenza epidemics in the media, it was

    important to learn more about preventative measures.

    Shown below is the result of using four well-selected

    papers for ontology construction. Full texts (rather than

    abstracts) of these articles have been used for ontology

    construction.

    1. G. Di Guiseppe, R. Abbate, L. Albano, P. Marinelli, I. F. Angelillo et al., "A survey of knowledge, attitudes and practices towards avian influenza in an adult population of Italy." BioMedCentral 8, 2008.

    2. D. Van Riel, V. J. Munster, E. De Wit et al., "Human and Avian Influenza Viruses Target Different Cells in the Lower Respiratory Tract of Humans and Other Mammals." The American Journal of Pathology 171, 1215-223, 2007.

    3. D. Vijaykrishna, J. Bahl, S. Riley, L. Duan et al., "Evolutionary Dynamics and Emergence of Panzootic H5N1 Influenza Viruses." PLoS Pathog 4(9), e1000161, 2008.

    4. X.-F. Wan, T. Nguyen, DC. T. Davis et al., "Evolution of Highly Pathogenic H5N1 Avian Influenza Viruses in Vietnam between 2001 and 2007." PLoS ONE 3(10), e3462, 2008.

    B. NLP and Reducing text

    To create the ontology, we used two open source

    programs. First, the natural language parsing tool from

    Stanford University was used. This program statistically

    analyzes each sentence and labels each word with its proper

    part of speech thereby creating a grammatical structure of

    sentences. As a result, the output format allows for the identification of noun and verb phrases which may indicate

    potential concepts and relationships. Furthermore, the tool

    provides different output formats. The following shows an

    example of a portion of one of the available output formats

    produced by the NLP software which uses codes such as NP

    (Noun Phrase) to identify the parts of speech:

    (ROOT
      (S
        (PP (IN Since)
          (NP (RB then)))
        (, ,)
        (NP
          (NP (NNS outbreaks))
          (PP (IN of)
            (NP (DT the) (JJ H5N1)
              (ADJP (RB highly) (JJ pathogenic))
              (JJ avian) (NN influenza) (NN strain))))
        (VP (VBP have)
          (VP (VBN been)
            (VP (VBN identified)

    The challenge thus lies in parsing this output to find

    NP-VP-NP sequences that provide potential

    concept-relationship-concept (or domain-relation-range) triples.

    Note that similar work using parts of speech for building

    ontologies has been performed in [2,10].
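The search for NP-VP-NP sequences could be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a complete, balanced bracketed tree (the excerpt shown earlier is truncated and would not parse), takes the NP preceding a VP inside an S node as the first concept, and takes the first NP found inside that VP as the second. All function names and the sample tree are hypothetical.

```python
import re

def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                                # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])    # a word (leaf)
                pos += 1
        pos += 1                                # skip ")"
        return (label, children)

    return read()

def words(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]

def first_np(node):
    """Return the first NP in a depth-first, left-to-right search."""
    if isinstance(node, str):
        return None
    if node[0] == "NP":
        return node
    for child in node[1]:
        found = first_np(child)
        if found is not None:
            return found
    return None

def verb_words(vp):
    """Collect verbs along the chain of nested VPs, skipping other phrases."""
    found = []
    for child in vp[1]:
        if isinstance(child, str):
            continue
        if child[0].startswith("VB"):
            found.extend(words(child))
        elif child[0] == "VP":
            found.extend(verb_words(child))
    return found

def np_vp_np(node, triples=None):
    """Collect (concept, relationship, concept) candidates from S nodes."""
    triples = [] if triples is None else triples
    if isinstance(node, str):
        return triples
    label, children = node
    if label == "S":
        subject = None
        for child in children:
            if isinstance(child, str):
                continue
            if child[0] == "NP":
                subject = child
            elif child[0] == "VP" and subject is not None:
                obj = first_np(child)
                if obj is not None:
                    triples.append((" ".join(words(subject)),
                                    " ".join(verb_words(child)),
                                    " ".join(words(obj))))
    for child in children:
        np_vp_np(child, triples)
    return triples

sample = "(ROOT (S (NP (DT The) (NN virus)) (VP (VBZ infects) (NP (JJ domestic) (NNS birds))) (. .)))"
print(np_vp_np(parse_tree(sample)))
# [('The virus', 'infects', 'domestic birds')]
```

Note that this sketch drops prepositions, so "have been identified in" would be reduced to "have been identified"; a fuller treatment would have to decide how much of the verb phrase belongs to the relationship.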

    The parser produced an extremely large amount of output, creating

    result files that were hundreds of pages long, with far too many relations to analyze manually. To

    reduce the workload, only noun phrases that described the

    influenza virus were used. After the final reduction, the

    results still spanned 73 pages.

    Once the information was parsed, we began creating the

    ontology. We linked together words and phrases to create a

    relational word network. This task was done by combining

    background knowledge of the influenza virus, the output from

    the NLP software, and a reading of the papers discussed

    previously.

    C. Visual presentation

    We visually present the obtained ontology by using

    the open source package Graphviz, which creates images of

    graphs that are described in the DOT graph description

    language. Thus in our process, we had to convert concepts

    into nodes and their relations/connections into arrows

    described in the dot language syntax. However, our

    present implementation does not yet distinguish visually

    between concepts and relationships, so this delineation

    needs some additional work.

    For better readability, a portion of the ontology is enlarged,

    [Figure 1 diagram. Boxes: Text Corpus; NLP Tools (extracts: word tokens, parts of speech, noun and verb phrases, parse trees, named entities); Concept & Relationship Discovery (discovery algorithms: lexico-syntactic patterns, association rules, noun-verb-noun patterns, word frequency; seed ontology or bag of words); Build Domain Concepts and Relationships (semantic distance, clustering, formalizing concepts and relationships, extending concepts); Final Ontology.]
