Enhancing portability with multilingual ontology-based...

18
Enhancing portability with multilingual ontology-based knowledge management Aviv Segev ,1 , Avigdor Gal Technion Israel Institute of Technology, Haifa 32000, Israel Available online 6 August 2007 Abstract Information systems in multilingual environments, such as the EU, suffer from low portability and high deployment costs. In this paper we propose an ontology-based model for multilingual knowledge management in information systems. Our unique feature is a lightweight mechanism, dubbed context, that is associated with ontological concepts and specified in multiple languages. We use contexts to assist in resolving cross-language and local variation ambiguities. Equipped with such a model, we next provide a four-step procedure for overcoming the language barrier in deploying a new information system. We also show that our proposed solution can overcome differences that stem from local variations that may accompany multilingual information systems deployment. The proposed mechanism was tested in an actual multilingual eGovernment environment and by using real-world news syndication traces. Our empirical results serve as a proof-of-concept of the viability of the proposed model. Also, our experiments show that news items in different languages can be identified by a single ontology concept using contexts. We also evaluated the local interpretations of concepts of a language in different geographical locations. © 2007 Elsevier B.V. All rights reserved. Keywords: Knowledge management; Knowledge sharing; Ontology; Context; Multilinguality; eGovernment 1. Introduction Experiences in developing information systems have shown it to be a long and expensive process. Therefore, once a generic information system has been developed, it is the aim of the developer to make it as portable as possible and the aim of users to deploy it with minimum effort. In some cases, such deployment requires the change of lan- guage, which affects the user interface as well as the inter- nal decision making processes. In this work we focus on applications in which a language transfer serves as a main obstacle in adapting an information system to user needs. As a case in point, consider eGovernment applications in the European Union. The EU puts effort into homogeniz- ing its governance procedures to allow easy interoperabil- ity. Yet it does so without committing to a single language. On the contrary, the EU values the preservation of local culture (including language). In such applications, the development of an information system that is monolingual will result in low portability and high deployment costs and therefore multilingual information systems seem to be more appropriate. Recent advances in information system development suggest the use of ontologies as a main knowledge man- agement tool. Ontologies model the domain of discourse and may be used for routing data, controlling the workflow of activities, assisting in semantic annotation of both data and queries, etc. In this paper we take advantage of these Available online at www.sciencedirect.com Decision Support Systems 45 (2008) 567 584 www.elsevier.com/locate/dss Corresponding author. E-mail addresses: [email protected] (A. Segev), [email protected] (A. Gal). 1 Present address: National Chengchi University, Taipei 11605, Taiwan. 0167-9236/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2007.07.011

Transcript of Enhancing portability with multilingual ontology-based...

Page 1: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

Available online at www.sciencedirect.com

45 (2008) 567–584www.elsevier.com/locate/dss

Decision Support Systems

Enhancing portability with multilingual ontology-basedknowledge management

Aviv Segev ⁎,1, Avigdor Gal

Technion — Israel Institute of Technology, Haifa 32000, Israel

Available online 6 August 2007

Abstract

Information systems inmultilingual environments, such as the EU, suffer from low portability and high deployment costs. In this paperwe propose an ontology-basedmodel for multilingual knowledgemanagement in information systems. Our unique feature is a lightweightmechanism, dubbed context, that is associated with ontological concepts and specified in multiple languages. We use contexts to assist inresolving cross-language and local variation ambiguities. Equipped with such a model, we next provide a four-step procedure forovercoming the language barrier in deploying a new information system. We also show that our proposed solution can overcomedifferences that stem from local variations that may accompany multilingual information systems deployment. The proposed mechanismwas tested in an actual multilingual eGovernment environment and by using real-world news syndication traces. Our empirical resultsserve as a proof-of-concept of the viability of the proposed model. Also, our experiments show that news items in different languages canbe identified by a single ontology concept using contexts.We also evaluated the local interpretations of concepts of a language in differentgeographical locations.© 2007 Elsevier B.V. All rights reserved.

Keywords: Knowledge management; Knowledge sharing; Ontology; Context; Multilinguality; eGovernment

1. Introduction

Experiences in developing information systems haveshown it to be a long and expensive process. Therefore,once a generic information system has been developed, it isthe aim of the developer to make it as portable as possibleand the aim of users to deploy it with minimum effort. Insome cases, such deployment requires the change of lan-guage, which affects the user interface as well as the inter-nal decision making processes. In this work we focus onapplications in which a language transfer serves as a main

⁎ Corresponding author.E-mail addresses: [email protected] (A. Segev),

[email protected] (A. Gal).1 Present address: National Chengchi University, Taipei 11605, Taiwan.

0167-9236/$ - see front matter © 2007 Elsevier B.V. All rights reserved.doi:10.1016/j.dss.2007.07.011

obstacle in adapting an information system to user needs.As a case in point, consider eGovernment applications inthe European Union. The EU puts effort into homogeniz-ing its governance procedures to allow easy interoperabil-ity. Yet it does so without committing to a single language.On the contrary, the EU values the preservation of localculture (including language). In such applications, thedevelopment of an information system that is monolingualwill result in low portability and high deployment costsand therefore multilingual information systems seem to bemore appropriate.

Recent advances in information system developmentsuggest the use of ontologies as a main knowledge man-agement tool. Ontologies model the domain of discourseandmay be used for routing data, controlling theworkflowof activities, assisting in semantic annotation of both dataand queries, etc. In this paper we take advantage of these

Page 2: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

568 A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

recent advances and propose an ontology-based model formultilingual knowledge management in information sys-tems.Our mechanism is based on a single ontology, whoseconcepts can have multiple representations (i.e., conceptnames) in various languages.While such solutions alreadyexist (e.g., in Protégé), we argue that they are insufficient.On the one hand, a single global ontology is preferred overlocal ontologies when it comes to interoperability. On theother hand, mere translation of ontological concepts fromone language to another is insufficient to fully representdifferences that may arise from the change of language.Such differences may result in concept ambiguity andgenerally in under-specification of semantic meaning [9].

To compensate for ontology under-specification wepropose to support multilingual ontologies with a light-weight mechanism, dubbed context. Contexts serve in theliterature to represent local views of a domain, as opposedto the global view of an ontology [10]. While the specificrepresentation of contexts vary, one may envision a con-text, as an example, to be represented by a set of words,possibly associated with weights, reflecting some notion ofimportance. Contexts, in our proposed solution, are asso-ciated with ontological concepts and specified in multiplelanguages. Therefore, they aim at conveying the local in-terpretation of ontological concepts, thus assisting in theresolution of cross-language and local interpretationambiguities.

Equipped with such a model, we next provide a four-step procedure of overcoming the language barrier indeploying information systems. We also show that ourproposed solution can overcome differences that stem fromlocal interpretations that may accompany cross-languageinformation systems deployment. The proposed mecha-nism was tested in an actual multilingual environment inthe framework of the QUALEG (Quality of Service andLegitimacy in eGovernment) project.2 The QUALEGproject is an innovative project sponsored by the EuropeanUnion to promote the relationship between local govern-ments and citizens. This multilingual information systemaims at allowing local governments to maintain a directconnectionwith citizens through the ongoing adjustment oftheir policies according to the assessment of citizen needs.This implies that local governments should be able tomeasure the performance of the services they offer, assesscitizen satisfaction, and re-formulate policy orientations onsuch elements with the participation of citizens, all in amultilingual setting.

To complement our experiences with QUALEG, weprovide a set of experiments performed in a controlled

2 http://ww.qualeg.eupm.net.

environment using news syndication data. Our empiricalanalysis shows that news items in different languages canbe identified by a single ontology concept using contexts.We also evaluate the local interpretations of concepts of alanguage in different geographical locations and show thatthe results drop when using a different local interpretationtraining set and drop considerably when using a mixedlocal interpretation training set from an identical language.

To summarize, our main contributions are as follows:

• We propose a knowledge management model, basedon the relationships between ontologies and contexts,which lends itself well to the support of effectiveportability and deployment of multilingual informationsystems.

• The high degree of flexibility the proposed modelprovides is translated into procedures for the deploy-ment and querying of amultilingual information system.

• We demonstrate the feasibility of our model using animplementation and deployment in the context of aEuropean eGovernment project.

• We provide a thorough empirical analysis, revealingthat joint classification by uniting all languages in asingle concept has minimal impact on the results.

The rest of the paper is organized as follows. Section 2provides a review of relatedwork as a starting point for theontology-based multilingual knowledge managementmodel, proposed in Section 3. Section 4 provides themethodology for managing ontologies when deploying anew multilingual information system and when queryingit. Section 5 describes our experiences with the QUALEGproject and experiments with RSS news data.We end withconcluding remarks in Section 6.

2. Related work

This section reviews previous research work relevant tothis paper. We start by discussing current support formultilingual ontologies (Section 2.1). We then discuss themachine translation approach for multilinguality (Section2.2), and current techniques of multilingual informationretrieval (Section 2.3).

2.1. Ontologies and multilinguality

Ontologies are currently considered the de-facto stan-dard for representing semantic information. Their design,however, is a difficult task, requiring the collaborationof ontology engineers and organization experts. There-fore, ontologies are manually crafted and tuned, whichresults in a static domain model, infrequently modified.

Page 3: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

569A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

Nevertheless, once designed their universal naturemakesthem an excellent mechanism for application interoper-ability. Without universal ontologies, interoperabilitybecomes an uncertain process [8], which may not beacceptable for some applications.

The static nature of ontologies conflicts with thedynamic nature of the world. Businesses nowadays oftenchange and need to adapt the semantic representation oftheir occupations to their changing business environment.Governments, which change less often, still need to adapttheir own regulations to a global community, whilemaintaining some divergence from standard governance,reflecting local interpretations and lingual differences. Theresearch literature has proposed a hybrid approach [14], inwhich ontologies are recognized as static entities yet anorganization can change its business semantic representa-tion dynamically. To do so, an ontology is defined to havetwo parts: a static part (which is the global ontology) and adynamic part, which evolves either by exporting ontologiesor by discovery. With such a model at hand, organizationscan still interoperate using the universal part of theontology, and continuously change their business modelsusing the local component of the ontology. Every now andthen, an industry may recognize certain practices asuniversal and add them to the global ontology, a standardaccepted by consensus. Themodel we propose in this workreplaces local ontologies with contexts, which are easier toextract and manage.

To support multilingual applications, ontologies canseparate their internal concept descriptors (typicallysemantic-less textual description) from the name of aconcept. In doing so, one can associate more than a singlename to the same ontology concept. In particular, if eachname is given in a different language, one can construct aclear and cohesive system, without redundancy. Such asupport has already become a standard and is used byseveral ontology management systems (e.g., Protégé). Weextend this idea into multilingual contexts.

2.2. Translation

The first solution that comes to mind when workingwith multiple languages is to use translation. However,translation entails language-specific difficulties, such as theimportance of the connection between grammar and mean-ing, the role of word endings and word position, and thelength and complexity of words, which are comprised ofother words. Translation also entails difficulties that arisefrom the translation effort itself: some words do not haveexact parallels in other languages, nuances are hard toconvey, and a word may have different meanings indifferent contexts.

The use of automatic tools for language translation hasbeen suggested as a solution for multilingual applications[33]. However, this solution is not viable, since automaticmachine translation (MT) today suffers from severalcritical limitations [13]. First, these tools have yet toachieve a level of proficiency comparable to humantranslation. Although there are no universally acceptedevaluation methods, different methods of evaluation ofMT in specified operational contexts still indicate that MTdoes not attain a sufficiently good level, in terms ofmeasures such as intelligibility, accuracy, fidelity, andappropriateness of style. While human translation canidentify errors and deficiencies that can be corrected orimproved,MT has yet to acquire this ability. A personwhomakes a mistake once can learn for the future, but MTstillcannot. Currently, any prospect of a fully automaticgeneral-purpose systemcapable of good quality translationwithout human intervention is beyond the scope of MT.

Therefore, this paper presents a solution that bypassesmachine translation in multilingual environments by usinga single ontology system towhich predeterminedmanuallytranslated ontology concepts are automatically mapped.We compensate for under-specification of the ontology byusing contexts, local view points that can be automaticallygenerated in a language-independent fashion.

2.3. Multilingual information retrieval

Research into multilingual information retrieval hasbeen going on for more than 50 years. In recent years, theincreasing impact of the Internet has generated even moreinterest in the topic [12].

One issue with multilingual information retrieval refersto the performance of information retrieval in differentlanguages. The performance of AlltheWeb, Altavista,Google, and three Arabic engines was examined byMoukdad [21] to see how they handle Arabic linguisticcharacteristics. He found that it is necessary to make usersaware of the limitations of general search engines inretrieving Arabic documents. Along the same lines, Bar-Ilan and Gutman [3] examined the performance ofAlltheWeb, Altavista, and Google to handle four differentlanguages: French, Hebrew, Hungarian, and Russian.They reported that “non-English languages have a muchlarger chance of being lost in Cyberspace”. Multilingualinformation retrieval from the Internet in the Turkishlanguage entails certain problems, particularly due to theexistence of special characters in the language [1].

Another line of research involves cross-retrievalamong different languages. One approach to cross-lingual text retrieval (CLTR) using multilingual textmining finds themultilingual concept–term relationships

Page 4: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

570 A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

from linguistically diverse textual data relevant to adomain. These, in turn, are used to find the conceptualcontent of the multilingual text. When language-indepen-dent concepts hidden beneath both document and queryare revealed, concept-basedmatching ismade possible [6].

Examples of other multilingual systems includeMEMPHIS, an agent-based system for enabling acqui-sition of multilingual content [24], and MUMISmultimedia indexing through multi-source and multi-language information extraction [28].

The model presented in this paper uses ontologies andcontexts for multilingual knowledge management, includ-ing tasks of information retrieval. Our model is languageindependent and requiresminimal training to define a topicautomatically. Furthermore, the model allows cross-lingual storage of information and can integrate some ofthe aforementioned techniques of information retrieval.

3. A model for multilingual knowledge management

This section presents a model for multilingualknowledge management in information systems. Themodel is based on a semantic representation tool (ontol-ogy) enhanced with contexts. We start by formally defin-ing contexts and ontologies and their inter-relationships(Section 3.1). We then present the use of ontologies andcontexts in supporting multilingual tasks (Section 3.2).Finally, we discuss the advantages of using the model formultilingual knowledge management (Section 3.3).

3.1. Preliminaries

A common definition of an ontology considers it tobe “a specification of a conceptualization” [10], whereconceptualization is an abstract view of the worldrepresented as a set of objects. The term has been used indifferent research areas, including philosophy (where itwas coined), artificial intelligence, information sciences,knowledge representation, object modeling, and mostrecently, eCommerce applications. For our purposes, anontology O=V,E is a directed graph, with nodesrepresenting concepts (vocabulary or things [4,5])associated with certain semantics and relationships [26].For example, in eGovernment a concept can be PublicService with a relation includes to a concept Activity ofPublic Administration and a relation responsibility to aconcept Local Spatial Management Strategic Plan.Typically, ontologies are represented using DescriptionLogic [7], where subsumption typifies the semanticrelationship between terms; or Frame Logic [15], wherea deductive inference system provides access to semi-structured data.

We define a descriptor c from domain D as an indexterm used to identify a record of information [20]. It canconsist of a word, phrase, or alphanumerical term. Aweight w∈ℜ identifies the importance of descriptor c inrelation to the record of information. For example, we canhave a descriptor Immovables, and a weight of 6. A de-scriptor set {⟨ci,wi⟩} is defined by a set of pairs, descriptorsand weights.

By collecting descriptor sets together we obtain acontext. A context C={{⟨cij,wij⟩}i}j is a set of finite sets ofdescriptors. For example, a context C may be a set ofwords (hence D is a set of all possible charactercombinations) defining a document Doc and the weightscan represent the relevance of a descriptor to Doc. Inclassic Information Retrieval, ⟨cij,wij⟩ may represent thefact that the word cij is repeated wij times in Doc.

3.2. Ontologies, contexts, and multilingual knowledgemanagement

We now move on to describe a model for multilingualknowledge management using ontologies and context. Wecan consider each descriptor c to be a different point ofview of some concept ν∈V. A descriptor set then definesdifferent perspectives and their relevant weight, whichidentifies the importance of each perspective. For example,an ontology concept Local Spatial Management StrategicPlan can be represented by descriptors such as: ⟨Immo-vables, 40⟩, ⟨Building, 25⟩, ⟨Infrastructure, 20⟩, etc. Wecan now assume that each descriptor set represents adifferent language and then a context is a multilingualrepresentation of a concept.

The proposed model associates an ontology conceptwith a name and a context. We extend the multiple-namesupport mechanism, described in Section 2.1 and proposemultiple-context support in a similar fashion. A concept isassociatedwithmultiple contexts (note that in [32]we havedefined a context algebra that is closed under the unionoperator and therefore multiple contexts are in themselvesa context), each in a different language. Fig. 1 provides aschematic illustration of our model for multilingualknowledge management. Four ontology concepts aredisplayed: Public Service, Citizen, Activity of Public,and Local Spatial. Each one has concept names also inFrench,German, and Polish. For the Local Spatial concept,a set of contexts represents the local perspective of theconcepts in both English and Polish.

3.3. Discussion

Onemay argue, and rightly so, that an ontology conceptneeds no interpretation. All one needs to know about a

Page 5: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

Fig. 1. Multilingual ontology example.

571A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

concept can be deduced from the ontology and fromreasoning about the semantic relationships between aconcept and other concepts. For example, consider theconcepts of Public Service and School: Public Serviceencompasses School since School is an area of responsi-bility of Public Service, while School is administered aspart of a Public Service. This argument holds true as longas the ontology fully specifies the domain of discourse.However, there is a philosophical ongoing debate onwhether fully-specified ontologies actually exist. Classicalresults (e.g., the model-theoretic argument of the philos-opherH. Putnam in [25]) indicate that there is nothing to bedone to prevent unintended interpretations of clauses in aformal language. Therefore, to be pragmatic, we mustassume that a universal ontology is under-specified in thatit fails to identify local variations. For example, the conceptof Rahmenprogramm in German can be translated into amaster program, fringe event, or supporting program butcannot be understood from its relation to the concept ofFinanzen (finance). Therefore, a context is necessary tounderstand a concept whenever an ontology is under-specified and to fill this semantic gap. The contextsassociated with Rahmenprogramm include: Pressekonfer-enz (press conference), Festivalclub, and Festveranstal-tung (festival event), indicating that its truemeaning here isthe festival main program.

Contexts are less expressive than ontologies. Theirinterpretation of a concept is “flat” in that they describe

only a vague notion of semantic relationship and theirstructure is typically limited to a set of keywords. Forexample, the synonymous contexts Bulletin, Eintritt(admission), and Startschuss (starting shots) can onlyassume a clearmeaning once they are related to the conceptof Rahmenprogramm. Therefore, one may derive a greaterbenefit from representing the local component as anontology aswell (whichwas the proposed solution in [14]).However, recall that ontology engineering is a manual anddifficult task. This brings us to themain difference betweenpreviously proposed solutions and ourmodel. Contexts canbe automatically generated and we present, by way ofmotivation, an algorithm for context extraction in Section4.2.1. Contexts can also bemanually tuned by organizationexperts, rather than ontology engineers, which enables amore dynamicmodification of contexts, better representingthe evolving business environment.

We see three main benefits of the proposed model:

Flexibility with respect to a global ontologyNewly developed applications follow some common

standard and come, most likely, with a generic globalontology. Such an ontology is the outcome of a well-designed set of concepts and relationships modeled byexperts. Once defined, evolving it becomes a difficult task.Here, contexts can serve as the local interpretation of theglobal ontology, which can bemaintained by a local expertwithout the involvement of the IT personnel. For example,

Page 6: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

572 A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

to utilize an ontology concept of Spatial Management inthe public service all the civil servant needs to do is add acontext, semantically translating the concept to her locallanguage.

Flexibility with respect to languageMultilinguality results in a need to adapt the ontology to

different languages separately. Avoiding such multipleefforts is desirable, both for the initial specification of theontology and for the ontology evolution. Here, the contextcan serve as the “translation” mechanism, in whichontological concepts are interpreted in the local language.To illustrate this point, consider the English concept state,representing a medium government level. While Germanyuses a translation of the English concept state for thisconcept, Poland officially uses a related term of a province.The latter seems to represent an entity with less autonomy(similar to the difference between the federal systems in theUS and Canada) than the former. When translating theEnglish word state to German, one gets ten concepts thatare close in meaning to that of a state, while with Polish,there are fourteen possible meanings. The use of a contextand the mechanism suggested below for generating thecontext of a concept (such as state) compensates for anyunder-specification thatmay result from the universality ofthe ontology.

Flexibility with respect to local interpretationsEven without a language barrier, different entities may

have different emphases on this or that task, emphases thatrepresent local interpretations. For example, border cities(e.g., Saarbrücken at the German–French border) may putmore emphasis on recognizing language differences thancities in the heart of countries, therefore investing more inmulticultural events. Also, capital cities may have moresensitivity to minority culture than cities in the periphery.Context can therefore serve as a compensating element inontologies, adding topics of interest to the global ontology.

4. From a model to a multilingual information system

Equipped with the model, presented in Section 3, wenow illustrate its usage by providing a four-step procedurefor deploying a new multilingual information system. Theprocedure is focused on adapting an existing ontology tothe needs of a new information system. It includes thefollowing four steps, selection, collection, extraction, andadaptation. Selection involves selecting an existingontology. In the collection step, sample documents thatrepresent ontology concepts are collected. Contexts areextracted from the sample documents in the extractionstep. Finally, extracted contexts are associated with

ontology concepts. We now describe each of these stepsin more details.

The first step of ontology selection involves a selectionof an existing ontology, relevant to the information systemdomain of discourse. Such an ontology can be part of theinformation system, or can be exported from brokers thatspecialize in domain-specific ontologies (e.g., ontology.org). If an ontology designer is involved, one may considerdesigning a new ontology from scratch, revising an existingontology, or integrating existing ontologies from severaldomains. One approach to a multilingual ontology canfocus on the automatic generation of a hierarchical knowl-edge map—NewsMap [23], which displayed a techniquefor extracting relevant phrases from a news collection usinga statistical phrase extractor, hierarchical categorization,and knowledge maps visualizer. In [31], a semi-automaticmethod for supporting organization experts (as opposed toontology engineers) in the task of evolving ontology wasprovided, increasing the automation of this step.

In the collection step, an organization expert is re-quested to provide organizational documents that bestdescribe each concept in the ontology. Sample documentscan include organization documents representing theconcepts, news articles representing the topic, or corre-spondence, such as email, between the organization and itsclients. This step is the most manual-intensive of all foursteps. Existing documentation of document classificationin an organization may prove to be useful here. Forexample, to identify documents for a yearly Theater Festi-val concept in a local government, documents describingthe festival from the previous years can be used.

The extraction step yields a context for each concept inthe ontology. The context is computed from the set ofdocuments deemed relevant for a concept. Due to a longresearch tradition in the area of automated text categori-zation (see, for example, a survey in [29]), this step can bedone automatically. For illustration purposes, we providein Section 4.2.1 one such automatic context extractionalgorithm. We consider our ability to use automatic meansfor this step a major benefit of our model, as opposed toexisting models, e.g., [14], where the local view of thesystem is given as an ontology, which is hard to design andmaintain by organization experts.

The final step is a technical one. It involves adding thenew contexts to their relevant concepts. It is worth notingthat at this time the organization expert may decide toeither keep existing contexts as well, or remove them.Keeping existing contexts can assist in multilingual infor-mation systems, where contexts in different languages canco-serve in tasks of knowledge management. However, ifthese contexts are too greatly oriented towards localinterpretations, then keeping existing contexts from other

Page 7: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

573A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

deployments of the same information systems may harmthe system performance.

4.1. Document collection

The document collection step is an essential step inproviding the information system with the correct inter-pretation of ontology concepts. To highlight the benefits ofthis approach,we continue our discussion along the lines ofthe three benefits, as proposed in Section 3.3:

Flexibility with respect to a global ontology: Recallthat a global ontology serves as a basis for localapplications. An organization expert, given an ontology,determines the most appropriate set of documents ofeach concept, based on the set of concepts in the onto-logy. Therefore, the domain expert implicitly compen-sates for ontology under-specification by manipulatingthe relevance of documents, associating them with theconcept that is most relevant in the given ontology.Flexibility with respect to language: The documents,as provided by the organization expert, are given in thelocal language. In the discussion in Section 4.2 weemphasize that the extraction algorithm should belanguage independent, and therefore the generatedcontexts are given in the local language. At times, if aterm in a different language is closely related to thecontext, it will be added as well. For example, inSaarbrücken at the German–French border the Frenchnamed Perspective du Theatre festival encompassesGerman related documents.Flexibility with respect to local variations: Localvariations will be taken into account, using the localorganization expert document classification. Therefore,if a certain document falls under one concept in oneorganization and under another concept in anotherorganization, such classification will affect the contextgeneration process.

This step relies, to a great extent, on the subjectiveassessment of a local organization expert. Therefore, it maygenerate undesirable biases in concept interpretation. In thenext section we demonstrate how such biases can becountered by using query expansion for context extractionand recognition.

4.2. Context extraction and recognition

We now focus on the third step, the extraction step. Alarge body of research exists for extracting context fromtext. A class of algorithms were proposed in the IRcommunity, based on the principle of counting the number

of appearances of each word in a text, assuming that thewords with the highest number of appearances serve as thecontext. Variations on this simple mechanism involvemethods for identifying the relevance of words to a do-main, using methods such as stop-lists and inverse docu-ment frequency [29]. In this section we first discuss thedesirable properties of context extraction for our research,followed by a description of a context recognition algo-rithm, to illustrate our approach. Other models, such as[18] and [11], can be adopted for context recognition aswell. Evaluating the best extraction algorithm for our needsis beyond the scope of this paper.We settle for an algorithmthat shows reasonable results as a proof of concept. Ourexperiences and experiments with the context extractionalgorithm are detailed in Section 5.

A desirata for an algorithm that implements theextraction step should consist of three elements. First, itshould be automatic, ensuring a quick deployment. Sec-ond, it should be language independent. This way, it can beapplied with each new deployment, regardless of thelanguage of choice. Third, it should circumvent biases thatmay have been introduced in the collection step, given thatthe collection step is manual labor intensive.

There may be many possible algorithms for contextextraction and recognition that satisfy these three require-ments. For illustration purposes, we next provide a des-cription of one such algorithm. It is fully automatic anduses the Internet as a knowledge base to extract multiplecontexts. The use of the Internet provides a language-independent mechanism and also avoids biases byapplying query expansion to the original text, as providedby the organization expert. This algorithm was adaptedfrom [30] and is currently part of the QUALEG solution.

The use of the Internet as a context database instead of aprecalculated frequencies base [6] has several advantages.The use of the Internet does not require the constant up-dating and maintenance of a database, while the pre-calculated frequencies base requires the user to work in alimited predefined knowledge domain. Also, the Internetcan serve as an unlimited knowledge domain that is con-tinuously being updated. Last but not least, the multilin-gual nature of the Internet makes it a perfect infrastructurefor the proposed method.

The success of the algorithm depends, to a great extent,on the number of documents retrieved from the Internet.With a greater number of relevant documents, less pre-processing (using methods such as Natural LanguageProcessing) is needed in the data collection phase.

4.2.1. A context recognition algorithmLet D={P1, P2,…, Pm} be a set of textual propositions

representing a document, where for all Pi there exists a

Page 8: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

Fig. 2. Identifying the current contexts.

574 A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

collection of descriptor sets forming the context Ci={⟨ci1,wi1⟩,…,⟨cin,win⟩}so that ist(Ci,Pi) is satisfied. McCarthy[19] defines a relation ist(C,P), asserting that a propositionP is true in a context C. The granularity of the textualpropositions varies, based on the case at hand, and may bea single sentence, a single paragraph, a statement made bya single participant (in a chat discussion or a Shakespearianplay), etc. The context recognition algorithm identifies theouter context C defined by

ist C; \mi¼1

ist Ci;Pið Þ� �

:

The algorithm input is defined as a set of textualpropositions representing a document. Each textualproposition is sent to a Web search engine. The set ofdescriptors is extracted by clustering theWeb pages searchresults. The number of textual propositions that extract thesame descriptor identifies the number of references to thedescriptor in the text. Similarly, the number of Web pagesthat identify the same descriptor represents the number ofreferences in Internet documents. A high ranking in onlyone metric does not necessarily indicate the importance ofthe context: for example, high ranking in only Internetreferencesmaymean that it is an important topic but mightnot be relevant to the document. To combine both metricsthe two values are weighted to contribute equally to finalweight value.

The context recognition algorithm consists of the fol-lowingmajor phases: collecting data, selecting contexts foreach text, ranking the contexts, and declaring the currentcontexts. The phase of data collection includes parsing thetext and checking it against a stop-list. To improve thisprocess, text can be checked against a domain-specificdictionary. The result is a list of keywords obtained fromthe text. The selection of the current context is based onsearching the Internet for relevant documents according tothese keywords and on clustering the results into possiblecontexts. The output of the ranking stage is the currentcontext or a set of highest ranking contexts. The set ofpreliminary contexts that has the top number of references,both in number of Internet pages and in number ofappearances in all the texts, is declared to be the currentcontext and theweight is defined by integrating the value ofreferences and appearances. An example of identifying thecontexts that receive both high number of references andhigh number of appearances is illustrated in Fig. 2.

4.3. Querying the multilingual information system

We now turn our attention to querying, one of the mainusages of knowledgemanagement of information systems.

When a user submits a query to the information system, itcan be classified using the ontology to a specific concept,based on context comparison. As a concrete example,consider a new document that is submitted to theinformation system. We can extract a context from thisdocument and then compare it to the existing contextsassociated with ontology concepts. Matching contextsmay lead to tasks such as document classification, emailrouting, workflow activity processing, etc.

Automatic context extraction is an uncertain process,subject to noise that exists in the documents at hand.Different context extraction algorithms may yield varyinglevels of uncertainty. In any case, it may be too restrictiveto adhere to a strict approach according to which a contextcan be matched to an ontology concept only if it com-pletely matches the concept’s context. In context extrac-tion, a generated false negative context can be, forexample, Music, which is not represented in the TheaterFestival concept but is required.Conversely, a context suchas Art can also provide false positive identification ofrelated documents, since not all of them are related toTheater.

The descriptors ci and their respective weight wi areextracted by the algorithm described in Section 4.2.1. Inthis algorithm, the weight represents the number ofappearances in the text and the number of references inthe Internet to the context.

The selection of the context is based on a searchthrough the Internet for all relevant documents accordingto the text from the documents. The retrieved documentsare clustered into possible descriptors. For each descriptor,wemeasure howmany times it is referred to in the text andhow many Internet pages refer to it. For example, Musicmight not appear at all in the document, but the descriptorbased on clustered Internet pages could refer to it 2 times inthe text and 235 Internet pagesmight be referring to it. The

Page 9: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

575A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

following formula is used based on [30]. The descriptorsthat receive the highest ranking form the context. TheWeightedValue forms the wi weight previously described.The weight is calculated according to the following steps.Find the difference between each value of the number ofreferences and its nearest lower value neighbor, defined asCurrent References Difference Value (CRDV). Find thedifference between each value of the number ofappearances and its nearest lower value neighbor, definedas Current Appearances Difference Value (CADV). Theweight of the number of appearances in the text and thenumber of references in the Internet is calculated accordingto the following formula:

MVR=Maximum Value of ReferencesMVA=Maximum Value of Appearances

Weighted Value

¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi2⁎CADV⁎MVR

3⁎MVA

� �2

þ CRDVð Þ2s ð1Þ

To compare contexts, we first define distance betweentwo descriptors ci and cj with their associated weights wi

and wi as follows:

d ci; cj� � ¼ jwi � wjj ci ¼ cj

max wi;wj

� �cipcj

�ð2Þ

This distance function assigns greater importance todescriptors with larger weights, assuming that weightsreflect the importance of a descriptor within a context.According to Eq. (2), if ci=cj, then d(ci=cj) is the absolutedifference between the weights. In other words, theequation measures the distance of the weights betweenidentical contexts. In all other cases, it takes the maximalweight as the distance between them,which allows a lowervalue to be given only to similar context when calculatingdistance. To define the best ranking concept in comparisonwith a given context we use Hausdorff metric, as follows.Let A and B be two contexts and α and b be descriptors inA and B, respectively. Then,

d a;Bð Þ ¼ inf d a; bð ÞjbaBf g ð3Þ

d A;Bð Þ ¼ max sup d a;Bð ÞjaaAf g; sup d b;Að ÞjbaBf gf gð4Þ

Eq. (3) provides the value of minimal distance of anelement from all elements in a set. Eq. (4) identifies thefurthest elements when comparing both descriptor sets.

To summarize, the mapping process introduced aboveis language independent. Therefore, relevant ontologyconcepts can be identified as long as the ontology contextis sufficiently similar to the matched context. Such acontext mapping process enjoys all the benefits mentionedabove. Therefore, it allows variations of emphases thatstem from local interpretation and lingual differences. Inaddition, its support of less-than-perfect mappings com-pensates for differences of terminology while ontologiesalone often lack such flexibility. To understand why, oneshould recall that the ontology is designed with the assis-tance of an organization expert. However, the new arrivingdocuments can come from laymen (e.g., an email thatarrives from a citizen), using a different terminology. It ishard, in particular for an organization expert, to anticipatesuch variations and design them as part of the ontology.

4.3.1. ExamplesWe now present three examples to illustrate possible

usage of knowledge management with the proposedmodel in querying a multilingual information system. Thethree examples are taken from the eGovernment domain.The first example involves the routing of input, such as anincoming email, to the appropriate place in the organiza-tion. Given a distance threshold, t1, any ontology conceptwhose context matches an automatically generatedcontext from an email and its distance is lower than thethreshold (d(A,B)b t1) will be considered relevant. If sucha context has an associated email address, the email will berouted to it. An overlap between contexts belonging todifferent concepts is possible, similar to dynamic taxon-omies [27].

The second example involves opinion analysis. Arelevant set of ontology concepts is identified, as in thecase of email routing. The ontology also contains contextsdefining various opinions. Such contexts may be globallydefined (for the whole ontology) or specific to someconcepts. Opinion contexts can be defined in multiplelanguages. InQUALEG, positive and negativewordsweretaken from theWebUseScientific ResearchOn the Internetfrom the University of Maryland (http://www.webuse.umd.edu:9090/tags/). These words were translated usingan online dictionary to the German language. The relativedistances of the different opinions of a matching conceptare evaluated. If the difference in distance is too close tocall (given an additional threshold t2), the system refrainsfrom providing an opinion. Otherwise, the email is markedwith the opinion with minimal distance.

The determination of public agenda is a third task thesystem can support. If all ontology concepts (of the nrelevant concepts) do not exceed the threshold d(A,B)≥ t1,then the email is considered to be part of a new topic on the

Page 10: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

576 A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

public agenda and is added to other emails under thisconcept. Periodically, such emails are clustered andprovided to decision makers to determine the addition ofnew ontology concepts.

Fig. 1 presents an example of a multilingual ontology.Each concept is represented by a node with multiplecontexts for each language. It contains three ontologyconcepts, namely Citizen, Public Service, and Activityof Public Service. Each ontology concept in English istranslated to Polish, French, and German. Next, there areontology concepts that are relevant only to the localgovernment of Tarnow and therefore they appear only inEnglish and in Polish. Each local government can ex-tend the ontology concepts to include ones that interest italone and can decide to use existing ontology conceptssimply by adding the translation to the local language.

Consider the following example of an email in Polish,describing an email received by local government ofTarnow on the topic taxes on immovable assets:

Subject: podatek od nieruchomosciSzanowni Panstwo,Zwracam siȩ z prosba o przesłanie wysokosci stawekopłat za podatek od nieruchomosci dla osób prawnychw Państwa miescie obowiazujace w latach 2000–2004?Z poważaniemEdyta

The context as extracted by the context recognitionalgorithm include: {⟨Podatek, 59⟩, ⟨nieruchomosci, 43⟩,⟨Polska, 26⟩, ⟨Strona, 21⟩}. The first two, translated tovalue added tax (v.a.t.) and immovables, are very relevantto the topic of the email. The other two contexts, Polska,which is Poland, and Strona, which has multiple meaningssuch as a party or side, have less relevance to the scenariodescribed in the text.

Whenmapping the contexts to the ontology, the contextof nieruchomoUci can be identified in the list. Therefore,this document can be mapped to the topic of Local Spatial

Table 1Algorithm sample results

Perspectives du TheatreContext Descriptor Concept Relevance

Art 10 Rahmenprogramm, OrganiAbseits des Mainstreams 0Oscar Wildes 0Jenseits des Mainstreams 0Movie 0Firma 1 BesucherSaar 2Download 0Programm 3 Rahmenprogramm, Informa

Management Strategic Plan, which can now be accessedby both English or Polish queries.

The following is an example of a German email:

Strassentheater und das kostenlos und auf hohem iveau.Ein Aushängeschild für Saarbrücken und ein leuch-tendes Beispiel für junges wildes Theater, abseits desMainstreams.Toll.

The context recognition algorithm identified thefollowing context, described in detail in Table 1 with theactual values assigned by the algorithm. The results arerepresented by the context on the left side. There are twopossible main ontology concepts, Perspectives du Theatreand LongDay School, to which each data item can belong.The Perspective du Theatre concept encompasses six othersubconcepts, which include Rahmenprogramm, Organisa-tion, Spielplan, Veranstalter, Besucher, and Informationen,when each context can belong to one or more concepts.Note that the context is extracted not only in German butalso in French. The highest ranking was received by thePerspective du Theatre concept. Note also that some of thecontexts are not mapped to any concept. Therefore, whenwe examine which of the subconcepts related toPerspective du Theatre are relevant to the text, we canclassify the document as belonging to the Rahmenpro-gramm (master program) in Perspective du Theatre whichreceived the total highest score out of all subconcepts.

5. Experiences with QUALEG

QUALEG has pilots in France, Poland, and Germanyand thus currently focuses on four languages, three ofwhich are French, Polish, andGerman. English is also usedas a common international representation language. Tomaintain uniformity and avoid repetitive translations,QUALEG processes the information from the input,such as debates and emails, in the local languages.

Long Day SchoolConcept Relevance

sation, Spielplan, Veranstalter 10000000

tionen 0

Page 11: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

577A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

For the deployment in QUALEG the first step includedstarting with an existing ontology and expanding it for thespecific project needs. The ontology used was a localgovernment ontology developed for TerreGov, a differentEU project. In the collection step, local government repre-sentatives from each of the pilots supplied organizationaldocuments that describe each concept in the ontology col-lected from previous years. The extraction step created acontext for each concept in the ontology using thealgorithm described in Section 4. The last step involvedadding the new contexts to their relevant concepts andstoring them. These contexts are monitored according tothe system performance and can be updated when needed.

The system is built to support multilingual ontologymanagement. The system allows an ontology search to beperformed, retrieving documents that relate to a specificontology concept. The mapping of the documents to theontology concepts is performed using the context recog-nition algorithm implemented in the Knowledge Extrac-tion module.

Section 5.1 presents our experiences with the Germanlanguage. Section 5.2 discusses the advantages thistechnique has in some languages versus others and theextent of language independence of the model. Weconclude (Section 5.3) with applications of the model ofontology and context in the field of opinion analysis.

5.1. The German Perspectives du Theatre Festival

Our first experience was with the Perspectives duTheatre Festival held during May every year inSaarbrücken, located at the French border of Germany.The festival includes contemporary French theatre,films, street events, music, etc. Our challenge was toanalyze the material and provide a useful set ofclassifications so that the materials could be rapidlyunderstood and routed to the appropriate civil servants.

The data we received included daily communications(in German) about this event, consisting of 104 differentemails, primarily emails from citizens to the city halland press releases and announcements from the cityoutward. The festival is an annual event and we weregiven data from 2004 and 2005.

The goal of the topic classification experiment was toidentify the topic of the email according to a predefinedlist of ontology concepts as supplied by Saarbrücken fororganizing cultural events. The predefined concepts ofthe emails supplied by Saarbrücken were: Organisation,Veranstalter, Finanzen, Besucher, Informationen, Rah-menprogramm, Spielplan, Other. Each topic, an ontol-ogy concept, was accompanied by a set of contexts thatdescribe it.

To examine the proposed model we used a singleontology and two different methods to define and extractcontexts. One method was the one described in Section4.2.1. Thismethod used the technique ofmapping contextsto ontology concepts, as detailed in Section 4.3. Each set ofwords was mapped from the email to the Internet,extracting a set of best ranked textual contexts that definethe document. These contexts were searched against thelist of contexts describing each concept in the ontology.The other method was based on conventional NaturalLanguage Processing techniques, enhanced by a languagedomain expert to build a set of rules for identifying relevantwords and grammar relevant to theGerman language. Thistechnique was based on a per sentence analysis. For eachsentence a classifier, automatically trained on keywordsand morphological variants (based on the initial list oftopics from Saarbrücken), was used. Each sentence in theemail was searched against the list of keywords andmorphological variants. The two techniques are verydifferent. The former is language independent, making itmore suitable formultilingual environments at the possiblecost of lacking language-specific analysis tools, used bythe latter.

The experiment included classifying the incoming dataaccording to the concepts described above. We comparedthe recall and precision of the proposed model to theNatural Language Processing technique. The input to bothmethods was identical. The context extraction techniquewere different. The output of both methods was a bestmatching concept. The data was also analyzed by twoNatural Language experts and a local governmentrepresentative from Saarbrücken to supply the “goldenstandard.”

For both methods the input was parsed at thegranularity of sentences. Our Internet search basedtechnique parsed long sentences according to the maxi-mum number of words that could be used in a searchengine. The Natural Language Processing preprocessingincluded a Tokenizer, a tool for breaking up compoundnouns, and a German Demorpher (Morphy engine),downloaded from the University of Stuttgart (http://www.lezius.de/wolfgang/morphy/). The Demorpherremoves case markings, tense markings, etc.

Two different experiments were performed. The firstexperimentwas to analyze ourmodel based on theGermandata. The knowledge extraction component achieved aPrecision of 85.37%, Recall of 84.34%, and total F-Scoreof 84.85%. This is based on the comparison of the resultsof the Context Recognition/Knowledge Extraction com-ponent to the human judgements. The German input datawas classified by two German Language experts and bySaarbrücken local government civil servant employees.

Page 12: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

578 A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

The second experiment analyzed the performance ofour method implemented in the Knowledge Extractioncompared to the NLP technique. In this experiment asubset of 72 different emails representing data from asingle year was used for comparison. The number ofemails in the second experiment is a subset of the total 104emails since only opinion-related emails were used in thisexperiment as opposed to official announcements mes-sages that were sent at the time. The results show thatmethod proposed in this work achieved the F-Score of81% while the Natural Language Processing techniqueachieved the F-Score of 78% where the precision andrecall were weighted equally. The results show that theproposed ontology-based multilingual model of contextsand ontology achieved better results.

These results show the promising ability of theproposed model to provide language-independent supportto local government decision making.

5.2. Extent of language independence

Having shown a proof-of-concept of the proposedmodel applicability, we next discuss the extent to whichlanguage-independent algorithms, such as the one pro-posed in Section 4.2.1, can be used in a multilingual set-ting. The proposed algorithm compensates for the lack oflanguage-specific rules by using a vast database, such asthe Internet, to gain better statistical knowledge in iden-tifying contexts. However, there is a skewed distribution oflanguages over the Internet. Xu [34] estimated that 71% ofthe pages (453 million out of 634 million Web pagesindexed by the Excite search engine at that time) werewritten in English, followed by Japanese (6.8%), German(5.1%), French (1.8%), Chinese (1.5%), Spanish (1.1%),Italian (0.9%), and Swedish (0.7%). Earlier experiments[30] show over 90% recall of the proposed algorithm forthe English language. Our experience with the Germanlanguage also suggests reasonable performance, with an F-Score of above 80%.Would such a success rate be retainedwith languages whose relative presence on the Web ismuch lower? Some researchers argue that one hundredmillion words is a large enough corpus for many empiricalstrategies for learning about language, either for linguists[2] and lexicographers [16] or for technologies that needquantitative information about the behavior of words asinput (most notably parsers [17]).

To test the extent to which language-independentalgorithms can be used, we tried a classification taskwith the Polish language. The relative presence of thePolish language on the Web is less than 0.5%, 10 timessmaller than the German presence [16]. This means thatthere are a few million Polish Web pages, which

according to previous research may not suffice for thistask.

For the purpose of our experiment, we analyzed fourontology concepts, namely Przyroda, Transport, Zabytkiand Zagospodarowanie, in the local government of Tar-now. A total of 30 documents were analyzed. For eachconcept, contexts were identified manually by Tarnowlocal government civil servant employees and each of thedocuments was classified using the algorithm in Section4.2.1. Our initial experiment, which avoided the use ofany language dependent tool, yielded poor results of lessthan 11% recall. Therefore, we applied a simple NLPmechanism.We used a synonym dictionary on the resultsof the context recognition algorithm. The Polish dic-tionary in Portal Wiedzy Tłumacz (http://portalwiedzy.onet.pl/tlumacz.html) provides multiple synonyms anddemorphing for each word translated. The use of thedictionary for the identification of synonyms and demor-phing increased the number of contexts and thus in-creased the chances that the words associated with theontology concept would be identified. The use of thesetools increased the recall to 97%.

Due to the small sample we had at our disposal, noconclusive conclusions can be given at this time, and moreexperiments are needed. Nevertheless, this experimentindicates that for languages with smaller presence on theInternet, the proposed algorithm needs to be enhancedwithlanguage-specific methods. We note that language depen-dent tools are available and can easily be combined in amultilingual system. Other methods require a more in-depth knowledge of a language and may be specificallytailored to a given domain. Using more of the former (aswe did with the Polish language) and less of the latterimproves the deployment of information systems acrossborders.

5.3. Multilinguality and opinion analysis

Another possible application of the multilingual modellies in the field of opinion analysis. Opinions can beviewed as perspectives expressed in the input information.Opinions can be included in the ontology as concepts,associated with sets of contexts that provide the localinterpretation of each opinion.

In the QUALEG project the system was developed intwo different parts, separating the task of knowledgeextraction from that of opinion analysis. The maindifference between the two parts of the system is that theknowledge extraction component avoids the language-specific implementation and bases its analysis techniqueson the use of a large corpus of relevant documents takenfrom the Internet, while the opinion analysis component

Page 13: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

579A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

uses techniques from IR and NLP to improve contentunderstanding. As in the knowledge extraction, the resultsof the opinion analysis are mapped to concepts in theontology, in this case, opinion concepts. These opinionsfall into three categories of concepts— positive, negative,and neutral.

The experiment included 72 emails in German toanalyze the opinion analysis component. For the use ofopinion analysis a set of opinion words were analyzed inEnglish and machine translated to German. These opinionwords are associated with the three opinion concepts. Twopossibilities were examined: first, to translate the emailsinto English and then analyze the texts for opinion, andsecond, to translate opinion words. The latter alternativewas found to achieve better results, resulting in slightlybetter accuracy of 60% versus 59% of the first option.These results tested the identification of the correct opinionwords of each sentence. The results can be explained asslightly better since fewer words are translated. Thesetranslationswere based on the European Parliament corpususing GIZA ++, followed by the alignment intersectionheuristic [22].

The final results included adding additional NLPinformation in German. For example, in German the lastopinion word in a sentence overrules previous opinionwords. In addition, the opinion of each sentence wasanalyzed separately and summed to identify the opinion ofthe whole email. The results of the opinion analysisreached a Precision of 78.95%, Recall of 69.23%, and F-Score of 73.77%. The results indicate that expanding thecontext and ontologymodel to perform opinion analysis isfeasible, albeit at a lower accuracy.

5.4. Experiments

To analyze the impact of our model for the support ofmultilinguality in information systems we experimentedwith data from news RSS. RSS is a format for distributingand gathering content from sources across the Web,including newspapers, magazines, and blogs. Web pub-lishers use RSS to easily create and distribute news feedsthat include links, headlines, and summaries. As a result,RSS serves as a useful platform for comparing news itemsin different languages supplied by different sources. Westart with a description of the real-world data trace andexperiment set-up, followed by a description of ourexperiments and an empirical analysis of the results.

5.4.1. Data sets and metricsThe RSS news data traces come from BBC — British

English news (http://news.bbc.co.uk/1/hi/help/3223484.stm), CNN—American English news (http://edition.cnn.

com/services/rss), Stern — German news (http://www.stern.de/sonst/?id=517321), and Le Figaro — Frenchnews (http://www.lefigaro.fr/rss/). We use the first two astwo sources with local interpretations of topics.

The list of topics we selected from each news data traceis displayed in Fig. 3, where selected topics are circled.Topics are taken from four categories, namely Politics,World, Science, and Technology. Topics may vary slightlyamong different RSS sites. Some sites unify topics, e.g.,Science and Medicine in Le Figaro. In what follows, aconcept is either a topic or a category, based on theexperiment.

In these data traces data are partitioned to topics with noontological relationships. The experiments focus on theconcepts/contexts relationships, for which these data setsserve adequately. Research and experiments on ontolog-ical relationships using contexts are reported in [32].

The RSS trace was collected during August–Septem-ber 2006. The news topics in each data set include between33 and 641 data, where a datum is anRSS news header or anews descriptor. There was a total of 1778 data items usedin the experiments. Table 2 describes the RSS news datasets. The table summarizes the number of news data foreach news data set (size), the number of categories, theminimum and maximum number of data per concept, andeach concept size. Concept size represents the total numberof data in up to the four data sets.

We generated a context for each concept using thealgorithm described is Section 4. This context is referred toas context⁎ and the data that was used for this contextgeneration is referred to hereafter as the context⁎ data. Thenumber of data items that were used for generatingcontext⁎ data was set to 10 and to 51 data items, based onempirical analysis performed in [32]. The context⁎ data areselected randomly from the data items associated with aconcept. Avarying number of concepts was used, rangingfrom 1 to 13 concepts depending on the experiment. Wealso experimented using only four concepts, representingthe four categories. In this case, the data for generatingcontext⁎were chosen randomly from themultilingual dataset. For context extraction we use the algorithm, adaptedfrom [30]. Our aim is to test the impact of multilingualityon the performance of ourmodel, rather than to test the textclassification abilities of this or that algorithm. Neverthe-less, this algorithm is known to have generated reasonablecontexts in the past (see experiments in [30]).

As a measure of evaluation we use recall and precisionmetrics. The recall of a concept is defined as the ratio ofdata items which share at least one descriptor with thedescriptors of context⁎ and the number of the data itemswhich belong to the concept. A high recall measure meansthat the algorithm was able to classify correctly a good

Page 14: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

Fig. 3. Experiment outline.

Table 2RSS data set statistics

Data set BBC CNN Stern Le Figaro

Size 1007 257 246 268Categories 4 3 3 3Minimum dataper category

68 42 33 48

Maximum dataper category

641 159 157 156

Data set Politics Science Technology World

Size 358 266 198 956

580 A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

portion of the data item set, minimizing false negative.Precision is defined to be the ratio of the number of truepositive identifications and the number of data itemsassociated with a concept. We measure precision withrespect to the original classification of data items to topics/categories as given in the data traces. It is worth noting thatin most of the experiments the algorithm classifies a datumto a single concept, the one whose contexts share thehighest number of descriptors with its context, thus settinga lower bound on the algorithm performance.

5.4.2. Experiment resultsWe are now ready to discuss our experiments in details.

The overall purpose of these experiments is to test theimpact of multilinguality on our model. We start withevaluating the model ability to correctly classify docu-ments to concepts in various languages. Then, we analyzethe impact of having amultilingual corpus on the precisionof classification. Finally, we analyze the impact of localinterpretations, showing the need to compensate ontolo-gies for under-specification.

5.4.2.1. Single class classification. In the first exper-iment, we evaluated the impact of multilinguality on

classification recall. In each experiment we selected asingle concept and generated a context using context⁎

data. For each of the remaining data in this category acontext was generated and compared with the context⁎.For each context⁎ data size we repeated the experiment 4times, each time choosing randomly the context⁎ data. Toevaluate the impact of multilinguality on a single classclassification, we also generated four multilingual datasets, each representing a single category and repeated theexperiment with these data sets.

Page 15: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

Fig. 5. RSS precision.

581A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

A graphic illustration of our results is given in Fig. 4,where the top part provides an overview of the results andthe bottom part focuses on the recall range of 90%–100%.The y-axis displays the average recall over the fourexperiments of each data set. Each group represents adifferent category. The last bar of each group shows theaverage recall of the multilingual data set. The right-mostgroup provides an average recall over all data sets. In thisexperiment each context, besides context⁎, is limited to 10descriptors.

A per concept analysis shows an average recall rangingfrom 91.67% to 100% in the news RSS data set, whencontext⁎ is defined using up to 45 descriptor sets. Theimpact of multilinguality is seen by comparing the averagerecall of the “joint” bar vs. the individual topic results. Onaverage, the use of multilingual corpus results in a minorreduction of less than 2% in recall (from about 98.35% to96.58%). A category-based analysis shows that in one casethe joint results are lower than individual data sets averagerecall, in two cases they fall in between other results, and inone case they reach the top results. All-in-all, the impact ofusing a multilingual corpus is negligible.

5.4.2.2. Multiple class classification. Next, we analyzethe impact of multilinguality on classification precision.For this set of experiments we enforced a rigid classi-

Fig. 4. RSS recall.

fication scheme, in which each document is classified to asingle concept. It is worth noting that not all applicationsfollow such a strict approach. In QUALEG, for example,email routing requires emails to be routed to all conceptswhose context is sufficiently similar to the email's context.

The experiment results are summarized in Fig. 5. Thehorizontal axis displays the number of participatingconcepts and the vertical axis presents the classificationprecision. Fig. 5(top) analyzes the 13 concepts separately.In Fig. 5(bottom) we present the results for four multi-lingual corpora, as described in the previous experiment. Inthis experiment each context was limited to 10 descriptorsand context⁎ is defined using up to 45 descriptor sets.

As the number of concepts increases, precisiondeclines, stabilizing at a level of 50%–60% after 7 con-cepts. To analyze the impact of the multilinguality wecompare the results displayed in the two graphs. Theprecision for all 13 concepts reaches 55.17%while the useof multilingual corpora reduces the precision by about 6%to 49.06%. Even when ignoring the data corpus sizes, andcomparing to the performance of classifying to four classesin Fig. 5 (bottom), one observes a reduction of 10.84%,from 59.9% to 49.06%. These results indicate that ourmodel suffers a minor reduction in performance with theintroduction of multilinguality.

Page 16: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

Table 3Average recall for BBC and CNN

Recall Training

BBC CNN BBC+CNN

Data BBC 96.06% 75.31% 94.58%CNN 97.68% 99.32% 99.90%

Table 5Average precision BBC and Stern

Precision Training

BBC STERN BBC+STERN

Data BBC 73.64% 73.53% 40.12%STERN 60.31% 72.08% 44.66%

582 A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

5.4.2.3. Analysis of local interpretation. This experi-ment analyzes the performance of our model, whenpresented with local interpretation of concepts in the samelanguage. For each context⁎ data size we repeated theexperiment 4 times, each time choosing randomly thecontext⁎ data. We compared three similar concepts ofCNN and BBC, namely World, Technology, and Science,as well as three similar concepts of BBC and Stern,Politics (Politik & Panorama), Technology (Computer &Technik), and Science (Wissenschaft & Gesundhelt), as acontrol set. We used one data set for training and then usedthe other data sets as test data, interchanging the roles ofnews agencies.We also analyzed the results of uniting twodata sets into one when selecting a similar size of datafrom each data set concept. The united data set was usedas a training set and each of the two original data sets wasused as test sets.

Average recall results over the three topics, testedseparately, are displayed in Table 3. When using one dataset for training and another for testing, recall is lower(75.31% and 97.68%) thanwhen training data and test dataare taken from the same data set (96.06% and 99.32%,respectively). The impact of our model is evident in theright-most column of Table 3. Using data from both BBCand CNN for training improves recall (94.58% vs. 75.31%and 99.90% vs. 97.68%) and is similar to that of using ahomogeneous data set for both training and testing(94.58% vs. 96.06% and 99.90% vs. 99.32%). For CNNdata, the use of both data sets for training slightly increasesrecall (from 99.32% to 99.90%), a result that may beattributed to statistical variation.

Average precision results for CNN and BBC aredisplayed in Table 4 and CNN and Stern results aredisplayed in Table 5. Precision drops when usingtraining from a different data set, e.g., training withCNN for BBC data reduces precision from 68.15% to

Table 4Average precision BBC and CNN

Precision Training

BBC CNN +CNN

Data BBC 68.15% 47.31% 37.03%CNN 47.09% 59.60% 35.54%

47.31%. These results serve as an empirical justificationof our claim that even within the same language, localinterpretation has a major impact, and therefore evenwithout the language barrier, ontologies quickly becomeunder-specified. Another observation is that the use of acombined data set for training does not improveprecision. This means that whenever an informationsystem is deployed for local use only, it is better to avoidusing contexts from other deployments. As a control forour experiments, we tested precision also by comparingBBC with Stern, generating multilingual contexts. Weobserved the same phenomenon in a multilingual settingas well, although precision is better here. Oneexplanation for the improved precision may be thehigher word variation in different languages.

6. Conclusion

In this work we proposed a knowledge managementmodel for the support of multilingual applications. Themodel is based on a global ontology, manually designedfor a specific domain, and local contexts, associated withontology concepts. The combination of ontologies andcontexts lends itself well to multilingual applications inwhich a single ontology fails to capture all nuances thatstem from language and cultural differences. The modelwas presented both in technical terms and via an examplefrom the eGovernment domain. Themodel propertieswerediscussed and some experiences with a specific eGovern-ment application, QUALEG, were described andanalyzed.

The single ontology systemwith associated concepts inmultiple languages proposed here provides a frameworkthat is both versatile and flexible. The system functionssimultaneously inmultiple languages, is low-maintenance,and is easily extended in and adapted to differentlanguages. The model captures cultural as well as lingualdifferences using contexts, thus allowing easy customiza-tion across cultures and languages.

Future directions of research include identifyingmethods for allowing real-time interaction between localgovernment representatives and citizens through the use ofmultilingual ontologies. Another direction is identifying

Page 17: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

583A. Segev, A. Gal / Decision Support Systems 45 (2008) 567–584

the ability to recommend a policy to the local government,based on the information in the ontology, and automatictranslation of words that define the ontology based on theircontext. In addition, future research can examine the per-formance of the system when implemented on languageswith different character sets, such as Chinese and Arabic.

Acknowledgments

The work was partially supported by two EuropeanCommission 6th Framework IST projects, TerreGov(http://www.terregov.eupm.net) and QUALEG (http://www.qualeg.eupm.net), and the Fund for the Promotionof Research at the Technion.

References

[1] S. Aytac, Multilingual information retrieval on the internet: a casestudy of Turkish users, International Information & Library Review37 (2005) 275–284.

[2] C.F. Baker, C.F. Fillmore, J.B. Lowe, The Berkeley framenetproject, In Proceedings of COLING-ACL, 1998, pp. 86–90.

[3] J. Bar-Ilan, T. Gutman, How do search engines handle non-English queries? A case study, In Proceedings of the twelfthinternational World Wide Web conference, 2003.

[4] M. Bunge, Treatise on basic philosophy, The Furniture of theWorld,Ontology I, vol. 3, D. Reidel Publishing Co., Inc., New York, NY,1977.

[5] M. Bunge, Treatise on basic philosophy, A World of Systems,Ontology II, vol. 4, D. Reidel Publishing Co., Inc., New York, NY,1979.

[6] R. Chau, C. Yeh, A multilingual text mining approach to webcross-lingual text retrieval, Knowledge-Based Systems 17 (2001)219–227.

[7] F.M. Donini, M. Lenzerini, D. Nardi, A. Schaerf, Reasoning indescription logic, in: G. Brewka (Ed.), Principles on KnowledgeRepresentation, Studies in Logic, Languages and Information,CSLI Publications, 1996, pp. 193–238.

[8] A. Gal, A. Anaby-Tavor, A. Trombetta, D. Montesi, A frameworkfor modeling and evaluating automatic semantic reconciliation,VLDB Journal 14 (1) (2005) 50–67.

[9] A. Gal, A. Segev, Putting things in context: dynamic eGovernmentre-engineering using ontologies and context, In Proceedings of the2006 WWW Workshop on E-Government: Barriers and Opportu-nities, 2006.

[10] T.R. Gruber, A translation approach to portable ontologies,Knowledge Acquisition 5 (2) (1993).

[11] A. Hotho, S. Staab, A. Maedche, Ontology-based text clustering,In Proceedings of the IJCAI— 2001 Workshop Text Learning:Beyond Supervision, 2001, p. 29.

[12] E. Hovy, N. Ide, R. Frederking, J. Mariani, A. Zampolli,Multilingual information management: current levels and futureabilities, Linguistica Computazionale XIV–XV (2001).

[13] J. Hutchins. Current commercial machine translation systems andcomputer-based translation tools: System types and their uses.International Journal of Translation, 2005. to appear.

[14] J. Kahng, D. McLeod, Dynamic classificational ontologies fordiscovery in cooperative federated databases, In Proceedings of

the First IFCIS International Conference on CooperativeInformation Systems (CoopIS'96), June 1996, pp. 26–35,Brussels, Belgium.

[15] M. Kifer, G. Lausen, J. Wu, Logical foundation of object-oriented and frame-based languages, Journal of the ACM 42(1995).

[16] A. Kilgarriff, G. Grefenstette, Introduction to the special issue onthe web as corpus, Computational Linguistics 29 (3) (2003).

[17] A. Korhonen, Using semantically motivated estimates to helpsubcategorization acquisition, In Proceedings of the Joint Confer-ence on Empirical Methods in NLP and Very Large Corpora, 2000,pp. 216–223.

[18] T. Liu, Z. Chen, B. Zhang, W.-Y. Ma, G. Wu, Improving textclassification using local latent semantic indexing, In ICDM,2004, pp. 162–169.

[19] J. McCarthy, Notes on formalizing context, In Proceedings of theThirteenth International Joint Conference on Artificial Intelli-gence, 1993.

[20] C. Mooers, Encyclopedia of library and information science, vol. 7,Marcel Dekker, 1972, pp. 31–45, chapter Descriptors.

[21] H. Moukdad, Lost in cyberspace: how do search engines handleArabic queries? In access to information: Technologies, skills,and socio-political context. In Proceedings of the 32nd annualconference of the Canadian Association for Information Science,2004.

[22] F.J. Och, H. Ney, Improved statistical alignment models, 38thAnnual Meeting of the Association for Computational Linguis-tics, 2000, pp. 440–447.

[23] T. Ong, H. Chen, W. Sung, B. Zhu, Newsmap: a knowledge mapfor online news, Decision Support Systems 39 (2005) 583–597.

[24] N. Papadakis, A. Litke, D. Skoutas, and T. Varvarigou. Memphis:A mobile agent-based system for enabling acquisition of multilin-gual content and providing flexible format internet premiumservices. Journal of Systems Architecture, 2005. to appear.

[25] H. Putnam (Ed.), Reason, Truth, and History, Cambridge UniversityPress, 1981.

[26] S. Russell, P. Norving, Artificial Intelligence: A ModernApproach, 2nd edition, Prentice Hall, Upper Saddle River,New Jersey, 2003.

[27] G. Sacco, Dynamic taxonomies: a model for large informationbases, IEEE Transactions on Knowledge and Data Engineering12 (2) (2000) 468–479.

[28] H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O.Hamza, Y. Wilks, Multimedia indexing through multi-source andmulti-language information extraction: the MUMIS project, Dataand Knowledge Engineering 48 (2004) 247–264.

[29] F. Sebastiani, Machine learning in automated text categorization,ACM Computing Surveys 34 (1) (March 2002) 1–47.

[30] A. Segev, Identifying the multiple contexts of a situation,Proceedings of IJCAI—Workshop Modeling and Retrieval ofContext (MRC2005), 2005.

[31] A. Segev, A. Gal, Ontology verification using contexts, InProceedings of ECAI—Workshop on Contexts and Ontologies:Theory, Practice and Applications, 2006, p. 30.

[32] A. Segev, A. Gal, Putting things in context: a topological approachto mapping contexts to ontologies, Journal on Data Semantics IX(2007) 113–140.

[33] P. Vossen, Eurowordnet general document, LE2-4003 LE4-8328,EuroWordNet, 1999.

[34] J.L. Xu, Multilingual search on the World Wide Web, InProceedings of the Hawaii International Conference on SystemScience (HICSS-33), 2000.

Page 18: Enhancing portability with multilingual ontology-based ...schoolofcomputing.southalabama.edu/~segev/... · content of the multilingual text. When language-indepen-dent concepts hidden

uppor

Avigdor Gal received the DSc degree in tem-poral active databases from the Technion —Israel Institute of Technology in 1995. He isan Associate Professor at the Faculty of In-dustrial Engineering and Management, Tech-

584 A. Segev, A. Gal / Decision S

Society.

nion. He has published more than 70 papers

in journals, books, and conferences on dataintegration, temporal databases, informationsystems architectures, and active databases.He is a member of the steering committee ofIFCIS, a member of IFIP WG 2.6, and a

Aviv Segev is an Assistant Professor at theCollege of Commerce, National ChengchiUniversity. Previously, he was a postdoc at theFaculty of Industrial Engineering & Manage-ment at the Technion. In 2004 he received his

t Systems 45 (2008) 567–584

textual data, mappinginformation. He has puband conferences. Previothe Israeli Aircraft Indu

Ph.D. from Tel-Aviv University in manage-

ment information systems in the field ofcontext recognition. During his studies, Avivreceived the Vatat Scholarship for ExcellingDoctorate Students in Elite Technology andthe Adams Institute Scholarship Award. His

current research includes classifying information and opinions of

of context to ontologies, and mapping of

recipient of the IBM Faculty Award for 2002–2004. He is a memberof the ACM and a senior member of the IEEE and the IEEE Computer

lished a number of papers in scientific journalsusly Aviv was a simulation project manager instry.