
Web Semantics: Science, Services and Agents on the World Wide Web 1 (2004) 187–206

Unveiling the hidden bride: deep annotation for mapping and migrating legacy data to the Semantic Web

Raphael Volz a,b,∗, Siegfried Handschuh a, Steffen Staab a,c, Ljiljana Stojanovic b, Nenad Stojanovic a

a Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany
b FZI—Research Center for Information Technology, 76131 Karlsruhe, Germany

c Ontoprise GmbH, Amalienbad Street 36, 76227 Karlsruhe, Germany

Received 26 September 2003; received in revised form 12 November 2003; accepted 21 November 2003

Abstract

The success of the Semantic Web crucially depends on the easy creation, integration, and use of semantic data. For this purpose, we consider an integration scenario that defies core assumptions of current metadata construction methods. We describe a framework of metadata creation where Web pages are generated from a database and the database owner is cooperatively participating in the Semantic Web. This leads us to the deep annotation of the database—directly by annotation of the logical database schema or indirectly by annotation of the Web presentation generated from the database contents. From this annotation, one may execute data mapping and/or migration steps, and thus prepare the data for use in the Semantic Web. We consider deep annotation as particularly valid because: (i) dynamic Web pages generated from databases outnumber static Web pages, (ii) deep annotation may be a very intuitive way to create semantic data from a database, and (iii) data from databases should remain where it can be handled most efficiently—in its databases. Interested users can then query this data directly or choose to materialize the data as RDF files.

© 2003 Elsevier B.V. All rights reserved.

Keywords: H.3.3 Information search and retrieval; H.3.5 Online information services; H.1.2 User/machine systems; I.2.1 Applications and expert systems

∗ Corresponding author.

E-mail addresses: [email protected], [email protected], http://www.aifb.uni-karlsruhe.de/, http://www.fzi.de/ (R. Volz); [email protected], http://www.aifb.uni-karlsruhe.de/ (S. Handschuh); [email protected], [email protected], http://www.aifb.uni-karlsruhe.de/, http://www.ontoprise.de/ (S. Staab); [email protected], http://www.fzi.de/ (L. Stojanovic); [email protected], http://www.aifb.uni-karlsruhe.de/ (N. Stojanovic).

1. Introduction

One of the core challenges of the Semantic Web is the creation of metadata by mass collaboration, i.e. by combining semantic content created by a large number of people. To attain this objective, several approaches have been conceived (e.g. CREAM [11], MnM [31], or Mindswap [10]) that deal with the manual and/or the semi-automatic creation of metadata from existing information. These approaches, however, as well

1570-8268/$ – see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.websem.2003.11.005


Table 1
Principal situation

Web site   Cooperative owner                                     Uncooperative owner

Static     Embedded (α) or remote (β) metadata by                Remote metadata by conventional
           conventional annotation [A]                           annotation [B]

Dynamic    Deep annotation with (γ) server-side mapping          Wrapper construction, remote
           rules or (δ) client-side mapping rules, or            metadata [C]
           migrated data [D]

as older ones that provide metadata, e.g. for search on digital libraries, build on the assumption that the information sources under consideration are static, e.g. given as static HTML pages or given as books in a library (cf. [A, B] in Table 1). Such static information need not necessarily be embedded in the HTML page (case α in Table 1) but might also be stored remotely on some other server (case β in Table 1).

Nowadays, however, a large percentage of Web pages are not static documents. On the contrary, the majority of Web pages are dynamic.1 For dynamic Web pages (e.g. ones that are generated from the database that contains a bibliography) it does not seem to be useful to manually annotate every single page. Rather, one wants to “annotate the database” in order to reuse it for one’s own Semantic Web purposes.

For this objective, approaches have been conceived that allow for the construction of wrappers by explicit definition of HTML or XML queries [25] or by learning such definitions from examples [4,14]. Thus, it has been possible to manually create metadata for a set of structurally similar Web pages. The wrapper approaches come with the advantage that they do not require cooperation by the owner of the database. However, their shortcoming is that the correct scraping of metadata is dependent to a large extent on data layout rather than on the structures underlying the data (cf. [C] in Table 1).

While for many Web sites and underlying databases the assumption of non-cooperativity may remain valid, we assume that many Web sites will in fact participate in the Semantic Web and will support the sharing of information. Such Web sites may present their information as HTML pages for viewing by the user,

1 It is not possible to give a general ratio of dynamic to static Web pages, because a single Web site may use a simple algorithm to produce an infinite number of, probably not very interesting, Web pages. Estimates based on Web pages actually crawled by existing search engines suggest, however, that dynamic Web pages outnumber static ones by 100 to 1.

but they may also be willing to give (partial) access to the underlying database and to describe the structure of their information on the very same Web pages. Thus, they give their users the possibility to utilize:

(1) information proper;
(2) information structures (e.g. the logical database schema of a relational database);
(3) information context (e.g. the presentation into which information retrieved from the database is included).

A user may then exploit these three in order to create mappings into his own information structures (e.g. his ontology) and/or to migrate the data into his own repository—which may be a lot easier than if the information a user receives is restricted to information structures [20] and/or information proper alone [8].

We define “deep annotation” as an annotation process that reaches out to the so-called Deep Web2 in order to make data available for the Semantic Web—combining the capabilities of conventional annotation and databases.

Table 1 summarizes the different settings just laid out.

Thereby, Table 1 further distinguishes between two scenarios regarding static Web sites. In the one scenario [B], the annotator is not allowed to change static information, but he can create the metadata and retain it remotely, linked to the source it belongs to (e.g. by XPointer). In the other scenario [A], he is free to choose between embedding the metadata created in the annotation process into the information proper (α; e.g. via the 〈meta〉 tag of an HTML page) or keeping it remote (β).3 For deep annotation [D], the two choices boil down to either storing a created mapping within

2 See http://library.albany.edu/internet/deepweb.html for a discussion of the term ‘deep web.’

3 Cf. [11] on those two possibilities.


the database of the server (γ) or storing the mapping and/or migrated data remotely from the server (δ).

In the remainder of the paper, we will describe the building blocks for deep annotation. First, we elaborate on the use cases of deep annotation in order to illustrate its possible scope (Section 2). One of the use cases, a portal, will serve as a running example in the remainder of the paper. We continue with a description of two possible scenarios for using the deep annotation process in Section 3. These two scenarios exploit a tool set described in the architecture section (Section 4). The tool set exploits:

(1) the description of the database, given as a complete logical schema or as a description of the database by server-side Web page markup that defines the relationship between the database and the Web page content (cf. Section 5);

(2) annotation tools that actually let the user utilize the description of queries underlying a dynamic Web page (cf. Section 6) or the database schema underlying a dynamic Web site (cf. Section 7) for mapping information;

(3) components that let the user exploit the mappings, allowing him to investigate the constructed mappings (cf. Section 8), and to query the serving database or to migrate data into his own repository.

Before we conclude with future work, we summarize our lessons learned and relate our work to other communities that have contributed to the overall goal of metadata creation and exploitation.

2. Use cases for deep annotation

Deep annotation is relevant for a large and fast-growing number of dynamic Web sites that aim at cooperation. For instance:

2.1. Scientific databases

They are frequently built to foster cooperation among researchers. Medline, Swissprot, or EMBL are just a few examples that can be found on the Web. In the bioinformatics community alone, current estimates are that more than 500 large databases are freely accessible.

Such databases are frequently hard to understand, and it is often difficult to evaluate whether the table named “gene” in one database corresponds to the table named “gene” in the other database. For example, [26] reports an interesting case of semantic mismatches, since even an (apparently) unambiguous term like gene may be conceptualized in different ways in different genome databases. According to one (GDB), a gene is a DNA fragment that can be transcribed and translated into a protein, whereas for others (GenBank and GSDB), it is a “DNA region of biological interest with a name and that carries a genetic trait or phenotype.” Hence, it may be much easier to tell the meaning of a term from the context in which it is presented than from the concrete database entry, for example, if genes are associated with the proteins they encode.

2.2. Syndication

Besides direct access to HTML pages of news stories or market research reports, etc., commercial information providers frequently offer syndication services. The integration of such syndication services into the portal of a customer typically requires expensive manual programming effort that could be reduced by a deep annotation process that defines the content mappings.

For the remainder of the paper we will focus on thefollowing use case.

2.3. Community Web portal (cf. [27])

This serves the information needs of a community on the Web with possibilities for contributing and accessing information by community members. A recent example that is also based on Semantic Web technology is [29].4 The interesting aspect of such portals lies in the sharing of information, and some of them are even designed to deliver semantic information back to their community as well as to the outside world.5

The primary objective of a community setting up a portal will continue to be the opportunity of access for human viewers. However, given the appropriate tools, they could easily provide information content, information structures, and information context to their

4 http://www.ontoweb.org.
5 Cf., e.g. [28] for an example producing RDF from database content.


members for deep annotation. The way that this process runs is described in the following.

Fig. 1. The process of deep annotation.

3. The process of deep annotation

The process of creating deep annotation consists of the following four steps (depicted in Fig. 1):

Input: A Web site6 driven by an underlying relational database.

Step 1: The database owner produces server-side Web page markup according to the information structures of the database (described in detail in Section 5).

Result: Web site with server-side markup.

6 Cf. Section 10 on other information sources.

Step 2: The annotator produces client-side annotations conforming to the client ontology, either via Web presentation-based annotations (Step 2a, cf. Section 6) or via automatic schema-to-ontology mapping (Step 2b, cf. Section 7).

Result: Mapping rules between database and client ontology.

Step 3: The annotator publishes the client ontology (if not already done before) and the mapping rules derived from annotations (Section 7).

Result: The annotator’s ontology and mapping rules are available on the Web.

Step 4: The annotator assesses and refines the mapping using certain guidelines (Section 7).

Result: The annotator’s ontology and refined mapping rules are available on the Web.

Deep annotations can be used in two ways (depicted in Fig. 1): firstly, the querying party loads


the second party’s ontology and mapping rules and uses them to query the database via the database interface (e.g. a Web service API) (cf. Section 8.1). Secondly, the querying party may be interested in a migration of the complete (mapped) database content to ontology-based RDF instance data (e.g. in an RDF store such as Sesame [3]).
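The second usage mode, migrating mapped database content to RDF instance data, can be sketched as follows. The mapping-record layout and all names in it (materialize, onto:Institute, onto:name) are illustrative assumptions, not the paper's actual mapping-rule syntax; only the identifier pattern and the AIFB example data are taken from Sections 5.3 and 6.2.

```python
# Sketch: turning mapped query results into RDF-style triples.
# The mapping layout below is an assumption for illustration; the paper
# expresses such mappings as rules evaluated by an inference engine.

def materialize(rows, mapping, prefix):
    """Materialize database rows as (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        # Global instance identifier: <prefix>key-name=key-value (Section 5.3).
        key = mapping["key"]
        subject = f"{prefix}{key}={row[key]}"
        triples.append((subject, "rdf:type", mapping["concept"]))
        for column, prop in mapping["attributes"].items():
            triples.append((subject, prop, row[column]))
    return triples

rows = [{"OrganizationID": 1, "OrgName": "AIFB"}]
mapping = {"concept": "onto:Institute",   # assumed client-ontology names
           "key": "OrganizationID",
           "attributes": {"OrgName": "onto:name"}}
triples = materialize(rows, mapping, "http://www.ontoweb.org/org/")
# triples[0] == ("http://www.ontoweb.org/org/OrganizationID=1",
#                "rdf:type", "onto:Institute")
```

A real migration would additionally write these triples into an RDF store such as Sesame, as the text notes.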

4. Architecture

Our architecture for deep annotation consists of three major pillars corresponding to the three different roles (database owner, annotator, querying party) as described in the process.

4.1. Database and Web site provider

At the Web site, we assume that there is an underlying database (cf. Fig. 2) and a server-side scripting

Fig. 2. An architecture for deep annotation.

environment, e.g. Zope or Java Server Pages (JSP), used to create dynamic Web pages. The database also has to provide some Web-accessible interface (e.g. a Web service API) allowing third parties to query the database directly.

Furthermore, the information about the underlying database structure, which is necessary for the mapping of the database, will be available on the Web pages and via the API.

4.2. Annotator

The annotator has two choices for the deep annotation of the database: (i) indirectly, by annotation of the Web presentation, or (ii) directly, by annotation of the logical database schema.

For the annotation of the Web presentation, the annotator uses an extended version of the OntoMat-Annotizer, viz. OntoMat-DeepAnnotizer, in order to manually create relational metadata, which


correspond to a given client ontology, for some Web pages. OntoMat-DeepAnnotizer takes into account problems that may arise from the generic annotations required by deep annotation (see Section 6). With the help of OntoMat-DeepAnnotizer, we create mapping rules from such annotations (route 1–2a–3–4 in Fig. 1) that are later exploited by an inference engine.

Alternatively, the annotator can base his mappings directly on the database schema. In this case, the input of the migration is an extended relational model that is derived from the SQL-DDL. The database schema is mapped into the given ontology using the mapping process described below, which applies the rules specified in Section 7. The same holds for database instances, which are transformed into a knowledge base based on the domain ontology.

For the automation of the mapping process based on the schema of the underlying database (route 1–2b–3–4 in Fig. 1), we used OntoMat-Reverse, a tool for semi-automatically connecting relational databases to ontologies, which enables less experienced users to perform this mapping process. Moreover, OntoMat-Reverse automates some phases of that mapping process, particularly capturing information from the relational schema, validation of the mapping process, and data migration (see Section 7.3).
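A common scheme for such a schema-to-ontology mapping can be sketched as below. Since the actual rules are specified in Section 7, the particular scheme used here (relations to concepts, non-key attributes to datatype properties, inclusion dependencies to object properties) and all example names are assumptions for illustration only.

```python
# Illustrative sketch in the spirit of OntoMat-Reverse's mapping step;
# not necessarily the rules of Section 7.

def map_schema(relations, att, key, inclusion_deps):
    # Each relation becomes a concept.
    concepts = {r: r.capitalize() for r in relations}
    # Each non-key attribute becomes a datatype property of its concept.
    datatype_props = {(r, a): f"{r}_{a}"
                      for r in relations
                      for a in att[r] if a not in key[r]}
    # Each inclusion dependency ((r1, A1), (r2, A2)) suggests an
    # object property from the domain relation r1 to the range relation r2.
    object_props = {(r1, r2): f"{r1}_to_{r2}"
                    for ((r1, _), (r2, _)) in inclusion_deps}
    return concepts, datatype_props, object_props

concepts, dprops, oprops = map_schema(
    relations={"person", "organization"},
    att={"person": {"id", "name", "org_id"}, "organization": {"id", "name"}},
    key={"person": {"id"}, "organization": {"id"}},
    inclusion_deps=[(("person", ("org_id",)), ("organization", ("id",)))],
)
```

The key columns are excluded from the datatype properties because, per Section 5.3, they are consumed by the instance-identifier pattern instead.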

4.3. Querying party

The querying party uses a corresponding tool to visualize the client ontology, to compile a query from the client ontology, and to investigate the mapping. In our case, we use OntoEdit [30] for those three purposes. In particular, OntoEdit also allows for the investigation, debugging, and change of given mapping rules. To this end, OntoEdit integrates and exploits the Ontobroker [9] inference engine (see Fig. 2).

5. Server-side Web page markup

The goal of the mapping process is to allow interested parties to gain access to the source data. To this end, pointers to the underlying data sources are required. The role of the server-side Web page markup is exactly that, i.e. it describes the structure of the database schema and the queries issued to the database from a certain page. This way, one can obtain all information required to access the data from outside.

5.1. Requirements

All required information has to be published, so that an interested party can use this information to retrieve the data from the underlying database. The information which must be provided is as follows: (i) which database is used as a data source, how this database is structured, and how it can be accessed; (ii) which query is used to retrieve data from the database; (iii) which elements of the query result are used to create the dynamic Web page. Those three components are detailed in the remainder of this section.

5.2. Database representation

The database representation is specified using a dedicated deep annotation ontology, which is instantiated to describe the physical structure of the part of the database which may facilitate the understanding of the query results. Thereby, the structure of all tables/views involved in a query can be published.7

5.2.1. Formal model

The deep annotation ontology contains concepts and relations such that the formal model of a database schema, viz. the relational model, can be expressed using RDF. In our ontological description we extend the common formal definition of the relational model (e.g. in [24]) with additional constructs typically found in SQL-DDLs, i.e. constructs which allow one to state inclusion dependencies [7]. Hence, our ontology captures the following formal model:

Definition 1. A relational schema S is an 8-tuple (R, A, T, I, att, key, type, notnull) with:

7 The reader may note that, alternatively, the structure of the complete database can also be published once, and the description for an individual page can refer to this description via OWL imports mechanisms.


(1) a finite set R called Relations;
(2) a finite set A called Attributes;
(3) a function att: R → 2^A which defines the attributes contained in a specific relation ri ∈ R;
(4) a function key: R → 2^A that defines which attributes are primary keys in a given relation (thus key(ri) ⊆ att(ri) must hold);
(5) a set T of atomic data types;
(6) a function type: A → T that states the type of a given attribute;
(7) a function notnull: R → 2^A which states those attributes of a relation which have to have a value;
(8) a set of inclusion dependencies I, where:
    • each element has the form ((r1, A1), (r2, A2));
    • r1, r2 ∈ R;
    • A1 = {a11, a12, ..., a1n};
    • A2 = {a21, a22, ..., a2n};
    • A1 ⊆ att(r1) and A2 ⊆ att(r2);
    • |A1| = |A2| and type(a1i) = type(a2i).

Remark 2.

(1) We will refer to r1 as the domain relation and r2 as the range relation.

(2) Ic denotes the transitive closure of I.
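Definition 1 can be transcribed directly into code to make the 8-tuple concrete. The field names follow the definition; the small sample schema at the end is invented for illustration.

```python
# A direct transcription of Definition 1; check() verifies the
# constraints stated in the definition and in Section 5.2.2.
from dataclasses import dataclass

@dataclass
class RelationalSchema:
    R: set        # relations
    A: set        # attributes
    T: set        # atomic data types
    I: list       # inclusion dependencies ((r1, A1), (r2, A2))
    att: dict     # R -> subset of A
    key: dict     # R -> subset of A
    type: dict    # A -> T
    notnull: dict # R -> subset of A

    def check(self):
        for r in self.R:
            # SQL automatically enforces key(r) ⊆ notnull(r) ⊆ att(r).
            assert self.key[r] <= self.notnull[r] <= self.att[r]
        for (r1, A1), (r2, A2) in self.I:
            assert r1 in self.R and r2 in self.R
            assert set(A1) <= self.att[r1] and set(A2) <= self.att[r2]
            assert len(A1) == len(A2)
            assert all(self.type[a] == self.type[b] for a, b in zip(A1, A2))
        return True

# Invented two-relation example: a person works at an organization.
S = RelationalSchema(
    R={"person", "organization"},
    A={"pid", "pname", "worksat", "oid", "oname"},
    T={"int", "string"},
    I=[(("person", ("worksat",)), ("organization", ("oid",)))],
    att={"person": {"pid", "pname", "worksat"}, "organization": {"oid", "oname"}},
    key={"person": {"pid"}, "organization": {"oid"}},
    type={"pid": "int", "worksat": "int", "oid": "int",
          "pname": "string", "oname": "string"},
    notnull={"person": {"pid", "pname"}, "organization": {"oid"}},
)
```

The single inclusion dependency models the foreign key worksat referencing organization.oid, i.e. person is the domain relation and organization the range relation in the sense of Remark 2.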

5.2.2. Modeling considerations

The reader may note that SQL-DDLs are typically more expressive than relational algebra. For instance, it is usually possible to specify constraints (such as DEFAULT and NOT NULL). Default values for datatypes are not supported in current Web ontology languages and are therefore ignored. The NOT NULL constraint is translated to the function “notnull” and expressed in the ontology as a concept “NotNullAttribute” that is subsumed by Attribute. Please note that SQL automatically enforces that ∀ ri ∈ R : key(ri) ⊆ notnull(ri) ⊆ att(ri).

Fig. 3. Example database schema.

In our modeling we do not consider the dynamic aspects found in SQL-DDLs for the deep annotation. Thus, triggers, referential actions (like ON UPDATE, etc.), and assertions are not mapped.

In SQL-DDLs it is also possible to specify referential integrity constraints by means of so-called foreign keys. This information is especially useful for the mapping process, as it indicates associations between database relations. SQL referential integrity constraints reinforce the view that inclusion dependencies [7] are valid at all times.

With respect to inclusion dependencies, a pitfall in the translation process becomes apparent: the database designer is supposed to express associations between database relations by means of foreign keys, making these semantics explicit. However, the processing of these semantics is usually not supported by the foreign-key definition alone. The combination of information conveyed by such associations is created manually by stating appropriate joins of tables in the queries to the database. Hence, those associations often remain unspecified. The appropriate semantics could only be extracted by analyzing the queries sent to the database, but the latter is not feasible, since running information systems would have to be altered to track the queries issued by the system. We therefore allow users to specify foreign keys a posteriori.

5.2.3. An example representation

For example, the following representation (which corresponds to the database schema shown in Fig. 3) is part of the HTML head of the Web page presented in Fig. 4.


The RDF property da:accessService provides a link to the service which allows for anonymous database access. Consequently, additional security measures can be implemented in this access service. For example, anonymous users should usually have read access to public information only.

5.3. Query representation

Additionally, the query itself, which is used to retrieve the data from a particular source, is placed in the header of the page. It contains the intended SQL query, is associated with a name as a means to distinguish between queries, and operates on a particular data source.

The structure of the query result must be published by means of column groups. Each column group must have at least one identifier, which is used in the annotation process to distinguish individual instances and detect their equivalence. Since database keys are only local to the respective table, while the Semantic Web has a global space of identifiers, appropriate prefixes have to be established. The prefix also ensures that the equality of instance data generated from multiple queries can be detected, if the Web application maintainer chooses the same prefix for each occurrence of that id in a query. Eventually, database keys are translated to

instance identifiers (cf. Section 8.1) via the following pattern:

〈prefix〉[keyi-name=keyi-value]

For example: http://www.ontoweb.org/person/id=1.
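The identifier pattern can be written as a small helper. For composite keys the text does not state how the name=value pairs are joined, so the "&" separator and the sorting below are assumptions.

```python
# The instance-identifier pattern of Section 5.3:
# <prefix> followed by key-name=key-value pairs.

def instance_identifier(prefix, key_values):
    """Build a global instance identifier from a prefix and the key columns.

    key_values maps key-column names to their values; pairs are sorted by
    name and joined with "&" (an assumed convention for composite keys).
    """
    return prefix + "&".join(f"{name}={value}"
                             for name, value in sorted(key_values.items()))

uri = instance_identifier("http://www.ontoweb.org/person/", {"id": 1})
# uri == "http://www.ontoweb.org/person/id=1"
```

Choosing the same prefix for every occurrence of an id, as the text requires, then makes instances generated from different queries comparable by simple string equality.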

5.4. Result representation

Whenever parts of the query results are used in the dynamically generated Web page, the generated content is surrounded by a tag which carries information


Fig. 4. Screenshot of providing deep annotation with OntoMat-DeepAnnotizer.

about which column of the result tuple delivered by a query represents the used value. In order to stay compatible with HTML, we used the 〈span〉 tag as an information carrier. The actual information is represented in attributes of 〈span〉:

Such span tags are then interpreted by the annotation tool and are used in the mapping step.
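How an annotation tool might recover such markup from a generated page can be sketched with the standard library HTML parser. The qresult and column attribute names are taken from the concrete example in Section 6.2; any further attributes the real markup may carry are not shown here.

```python
# Sketch: collecting (query, column, value) for every marked-up <span>.
from html.parser import HTMLParser

class SpanMarkupParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = []        # collected (query, column, value) tuples
        self._current = None    # pending (query, column) from an open span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "qresult" in attrs:
            self._current = (attrs["qresult"], attrs.get("column"))

    def handle_data(self, data):
        if self._current is not None:
            query, column = self._current
            self.fields.append((query, column, data))
            self._current = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

p = SpanMarkupParser()
p.feed('<span qresult="q1" column="Orgname">AIFB</span>')
# p.fields == [("q1", "Orgname", "AIFB")]
```

Plain page content without server-side markup passes through handle_data with no open span and is simply ignored, mirroring the distinction between marked-up and ordinary content.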

6. Create mappings by annotation of the Web presentation

The annotator has two choices for the deep annotation of the database: (i) indirectly, by annotation of the Web presentation, or (ii) directly, by annotation of the logical database schema. In this section we present the indirect way (route 1–2a–3–4 in Fig. 1), namely creating the mappings by annotating individual pages, viz. the queries issued to the database when a certain page is dynamically constructed. The results of the annotation presented here are mapping rules between the database and the client ontology.


6.1. Annotation process

An annotation in our context is a set of instantiations related to an ontology and referring to parts of an (HTML) document. We distinguish (i) instantiations of concepts, (ii) instantiated properties from one concept instance to a datatype instance—henceforth called attribute instance (of the concept instance), and (iii) instantiated properties from one concept instance to another concept instance—henceforth called relationship instance.

In addition, for deep annotation one must distinguish between a generic annotation and a literal annotation. In a literal annotation, the piece of text may stand for itself. In a generic annotation, an annotated piece of text that corresponds to a database field is only considered to be a place holder, i.e. a variable must be generated for such an annotation, and the variable may have multiple relationships allowing for the description of general mapping rules. For example, a concept Institute in the client ontology may correspond to one generic annotation for the Organization identifier in the database.

Consequential to the above terminology, we will refer to generic annotations in detail as generic concept instances, generic attribute instances, and generic relationship instances.

An annotation process of server-side markup (generic annotation) is supported by the user interface as follows:

(1) In the browser, the user opens a server-side marked-up Web page.

(2) The server-side markup is handled individually by the browser, e.g. it provides graphical icons on the page wherever markup is present, so that the user can easily identify values which come from a database.

(3) The user can select one of the server-side markups to either create a new generic instance and map its database field to a generic attribute, or map a database field to a generic attribute of an existing generic instance.

(4) The database information necessary to query the database in a later step is stored along with the generic instance.

The reader may note that literal annotation is still performed when the user drags a marked-up piece of content that is not a server-side markup.
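The decision between the two annotation kinds can be sketched as follows: dragged content becomes a generic instance when it carries server-side markup, and a literal annotation otherwise. The GenericInstance record and its field names are illustrative assumptions; the span syntax is taken from the example in Section 6.2.

```python
# Sketch of the generic-vs-literal decision in the annotation step.
import re
from dataclasses import dataclass

@dataclass
class GenericInstance:
    query: str   # reference to the query description in the page head
    column: str  # selected result column
    sample: str  # the concrete value shown on this page (a place holder)

# Matches the span syntax shown in Section 6.2.
SPAN = re.compile(r'<span qresult="([^"]+)" column="([^"]+)">([^<]*)</span>')

def annotate(dragged):
    m = SPAN.fullmatch(dragged)
    if m:
        # Server-side markup: the value is only a place holder, so store
        # the database information needed to query later (step 4 above).
        return GenericInstance(*m.groups())
    return dragged  # literal annotation: the text stands for itself

inst = annotate('<span qresult="q1" column="Orgname">AIFB</span>')
lit = annotate("AIFB")
```

Here inst carries the query reference q1 while lit is just the string "AIFB", mirroring the place-holder versus stands-for-itself distinction above.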

6.2. Creating generic instances of concepts

When the user drags an item identified as server-side markup onto a particular concept of the ontology, a new generic concept instance is generated (cf. arrow #1 in Fig. 4). A dialog is then presented to the user to capture the instance name and the attributes to which the database values should be mapped. Attributes which resemble the column name are pre-selected (cf. dialog #1a in Fig. 4). If the user confirms this operation, database concept and instance checks are performed and the new generic instance is created. Generic instances will later appear with a database symbol in their icon.

Along with each generic instance, the information about its underlying database query and the unique identifier pattern is stored. This information is retrieved from the markup at creation time of the instance, as each server-side markup contains a reference to the query, the selected column, and the value of the column. The identifier pattern is obtained from the reference to the query description and the according column group (cf. Section 5.3). The markup used to create the instance defines the identifier pattern for the generic instance. The identifier pattern will be used when instances are generated from the database (cf. Section 8.1).

Let us consider the example displayed in Fig. 4: At the beginning the user selects the server-side markup “AIFB” and drops it on the concept Institute. The content of the server-side markup is ‘〈span qresult=“q1” column=“OrgName”〉AIFB〈/span〉.’ This creates a new generic instance with a reference to the query q1 (cf. Section 5.3). Immediately after this action the user chooses to use the values of database column “OrgName” as a filler of the generic concept attribute “name” for all generic instances “AIFB.” The identifier pattern for generic instances is created as a side effect of this action. In our example this is http://www.ontoweb.org/org/organizationid=organizationid, since “OrganizationID” is the database column which represents the primary key in query q1.
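The identifier pattern above can be instantiated for a concrete tuple as sketched below. The helper name is ours; the pattern base and the resulting URI follow the example given here and the materialized identifier shown later in Section 8.3.

```python
def make_instance_uri(pattern_base, key_column, key_value):
    """Instantiate an identifier pattern such as
    http://www.ontoweb.org/org/organizationid=<value>
    by filling in the primary-key value of a tuple."""
    return f"{pattern_base}{key_column.lower()}={key_value}"

uri = make_instance_uri("http://www.ontoweb.org/org/", "OrganizationID", 3)
# uri == "http://www.ontoweb.org/org/organizationid=3"
```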

6.3. Creating generic attribute instances

The user simply drags the server-side markup into the corresponding table entry (cf. arrow #2 in Fig. 4)


to create a generic attribute instance. If the dragged content does not contain server-side markup, the content itself will be used as a default value (that is shared by all generic instances).

If the content contains server-side markup, the following steps are performed. Firstly, all generic attributes which are mapped to database table columns will also show a special icon. The current value of the dragged content will be used for display purposes in the interface, where it appears in italic font. In this case, the assignment of generic attributes can be removed, but the exemplary value itself is immutable.

The following steps are additionally performed by the system when the generic attribute is filled:

(1) The integrity of the database definition in the server-side markup is checked.

(2) We check whether the generic attribute is selected from the same query as used for the generic instance. This ensures that result fields come from the same database and the same query; otherwise non-matching information (e.g. publication titles and countries) could be queried.

(3) All information given by the markup, i.e. which column of the result tuple delivered by the query represents the value, is associated with the generic attribute instance.
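Steps (2) and (3) can be sketched as follows, using a hypothetical dictionary representation of generic instances and markups (the real tool's data model is not given in the paper):

```python
def attach_generic_attribute(instance, markup):
    """Sketch of the checks performed when server-side markup is dropped
    onto an attribute of a generic instance. `instance` and `markup` are
    hypothetical dicts with 'query', 'column', and 'attribute' keys."""
    # Step 2: attribute and instance must come from the same query;
    # otherwise unrelated result fields could be combined.
    if markup["query"] != instance["query"]:
        raise ValueError("attribute must be selected from the same query "
                         "as the generic instance")
    # Step 3: remember which result column delivers the attribute value.
    instance["attributes"][markup["attribute"]] = markup["column"]

inst = {"query": "q1", "attributes": {}}
attach_generic_attribute(
    inst, {"query": "q1", "column": "OrgName", "attribute": "name"})
# inst["attributes"] == {"name": "OrgName"}
```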

6.4. Creating generic relationship instances

In order to create a generic relationship instance, the user simply drops the selected server-side markup onto the relation of a pre-selected instance (cf. arrow #3 in Fig. 4). As in Section 6.2, a new generic instance is generated. In addition, the new generic instance is connected with the pre-selected generic instance.

7. Create mappings by annotation of the schema

In this section we consider the second alternative (route 1–2b–3–4 in Fig. 1), namely to annotate the database schema and base the annotations directly on the schema. While this has the benefit that one can annotate a whole site that originates from a database, it comes with the drawback that we miss the information context, e.g. the presentation into which information in a Web page is embedded.

7.1. Annotation process

An annotation process of server-side markup is supported by the user interface as follows:

(1) In the browser, the user opens a server-side marked-up Web page.

(2) The database schema description included in the server-side markup is loaded by the tool.

(3) The user can ask the tool to automatically create an initial mapping to the ontology based on the lexicalizations and the structure of the database (cf. the remainder of this section). This mapping is created in two phases: first, concept mappings are created, and afterwards relationship mappings are created.

(4) The user may refine, remove, and create mappings between the relational schema and the ontology as required.

The reader may note that no literal annotation is performed anymore; only generic annotations remain.

7.2. Schema to ontology mapping

The automatic schema to ontology mapping is carried out via mapping rules that are applicable if formal preconditions are met. Mapping rules map elements of the server schema to entities in the client ontology. In the following presentation we use the auxiliary functions:

• concept: R → C to denote the association between a relation and a concept;

• typetrans: T → D transforming relational datatypes into the appropriate XML Schema datatypes.
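A possible realization of typetrans is sketched below. The concrete datatype table is not given in the paper, so the mapping entries here are illustrative assumptions:

```python
# Illustrative sketch of typetrans: T -> D, mapping relational datatypes
# to XML Schema datatypes; the concrete table is an assumption.
TYPETRANS = {
    "INTEGER": "xsd:integer",
    "VARCHAR": "xsd:string",
    "DATE":    "xsd:date",
    "FLOAT":   "xsd:float",
    "BOOLEAN": "xsd:boolean",
}

def typetrans(sql_type):
    """Map a relational datatype to the appropriate XML Schema datatype."""
    return TYPETRANS[sql_type.upper()]
```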

In order to discover mappings, several approaches [2,5,8,18,20,23] can be used. Our own approach is based on [17] and extended to incorporate the structural heterogeneity resulting from the fact that we do not map between ontologies but between two different meta models, viz. relational schema and ontology. This extension is required to map the implicit semantics incorporated in some relational schemas to explicit ontological structures. The automatic translation generally depends on a lexical agreement between parts of the two ontologies/database schemata. This dependency is weakened since users typically manually refine the obtained mapping at a later stage.

7.2.1. Creating concept mappings

Database relations are mapped to concepts if a lexical agreement in the naming exists. However, since schemas usually incorporate many elements that are only defined to avoid the structural limitations of the relational model or are made for performance reasons, certain relations are excluded from potential concept mappings.

7.2.1.1. n:m associations. Certain database relations are only defined to express (n:m) associations between two other relations. Typically, this type of auxiliary relation is characterized by the fact that it contains only two attributes, which are both primary keys and foreign keys to two other relations. A dedicated mapping rule excludes these relations from the mapping; hence, as a result, no concept mapping is created.

Translation Rule 3 ((n:m) association). Auxiliary relations used to specify (n:m) associations may be identified via the following conditions:

• att(ri) = key(ri);
• A1 ⊂ att(ri), A2 ⊂ att(ri);
• A1 ∪ A2 = att(ri);
• A1 ∩ A2 = ∅;
• ((ri, A1), (rj, key(rj))) ∈ I;
• ((ri, A2), (rk, key(rk))) ∈ I.

A similar rule triggers the creation of a mapping to a relationship that holds between both concepts cj, ck.8 Preference is given to relationships that are mutually inverse to each other.
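The preconditions of Translation Rule 3 can be checked mechanically. The sketch below uses a hypothetical schema representation (attribute lists plus the foreign-key attribute sets of the relation); the function name and encoding are ours:

```python
def is_nm_association(attrs, key, foreign_keys):
    """Translation Rule 3 (sketch): a relation is an auxiliary (n:m)
    association table if all of its attributes form the primary key and
    partition into two disjoint, non-empty foreign keys to other relations.
    `foreign_keys` is a list of the attribute lists of each foreign key."""
    if set(attrs) != set(key):          # att(ri) = key(ri)
        return False
    fk_sets = [set(fk) for fk in foreign_keys]
    if len(fk_sets) != 2:
        return False
    a1, a2 = fk_sets
    return (len(a1) > 0 and len(a2) > 0
            and a1.isdisjoint(a2)              # A1 ∩ A2 = ∅
            and a1 | a2 == set(attrs)          # A1 ∪ A2 = att(ri)
            and a1 < set(attrs) and a2 < set(attrs))  # proper subsets

# WORKS_FOR(PersonID, OrgID): both columns form the key, two foreign keys
ok = is_nm_association(attrs=["PersonID", "OrgID"],
                       key=["PersonID", "OrgID"],
                       foreign_keys=[["PersonID"], ["OrgID"]])
# ok is True, so no concept mapping is created for WORKS_FOR
```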

7.2.1.2. Information distribution. In certain cases information that logically corresponds to one entity in the ontology is distributed across several database relations for performance reasons. This is for example the case if one of the attributes is storage-intensive and optional. To optimize the clustering behavior of the database, such attributes are often stored in separate relations together with the primary key of the main relation.

8 Where cx = concept(rx).

Fig. 5. Specialization in extended entity relationship diagrams.

Translation Rule 4 (information distribution). Information distribution may be detected with the following heuristic:

• ((ri, key(ri)), (rj, key(rj))) ∈ I.

As a result, no concept mappings are created. Instead, a similar rule triggers the aggregation of the attributes of both relations and their conversion to relationship mappings on only one concept.

This heuristic is slightly problematic since it also covers one of the well-known mapping procedures for translating entity hierarchies in EER models to relational schemas. Hence, a similar heuristic could alternatively lead to two concept mappings, given that a subsumption relation between those concepts holds.

Fig. 5 depicts such a situation. Here the entity PERSON has several specialized entities ASSIST-PROF and FULL-PROF. Usually this may be translated to relational schemas via the following principles [1]:

(1) Create a relation for the super entity and relations for each sub entity, such that they contain all attributes of the sub entity and the primary key of the super entity.

(2) If the specialization is total (there is no instance of the super entity; only sub entities are instantiated): create relations for the sub entities only and copy the attributes defined for the super entity.

(3) If the specialization is disjoint (an instance may only be an instance of one entity): create only one relation that contains the attributes defined for all entities and make those attributes optional. Add an extra attribute to explicitly state the type.

This clearly shows that the semantic intention behind a given relational structure cannot be fully captured by the automatism.

Translation Rule 5. Our general heuristic is that each relation is mapped to the lexically closest concept.

This heuristic is applied when no other rule could be applied. In the current tool, edit distance [15] is used to identify the lexically closest concept. Other lexical distances, e.g. Hamming distance, could be used as well but have not been tested within the tool. Along the same lines, one could apply different preprocessing heuristics such as stemming. However, we have not yet experimented with such strategies.
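Translation Rule 5 can be sketched with a standard Levenshtein distance [15]; the helper names are ours, and the example anticipates the “Projekt”/“Project” case discussed in Section 7.3:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of insertions, deletions,
    and substitutions turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_concept(relation_name, concepts):
    """Translation Rule 5 (sketch): map a relation to the lexically
    closest concept of the client ontology."""
    return min(concepts,
               key=lambda c: edit_distance(relation_name.lower(), c.lower()))

best = closest_concept("Projekt", ["Person", "Project", "Publication"])
# best == "Project" (edit distance 1)
```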

7.2.2. Creating relationship mappings

7.2.2.1. Attributes versus relationships. As mentioned before, the OntoMat tool distinguishes between attributes and relationships.9 The tool cannot decide whether a given attribute is only defined for organizational purposes, such as many auto-incremented primary keys, which do not carry any real meaning beyond tuple identification10 and should therefore be excluded from the mapping. Consequently, all attributes are mapped to corresponding datatype properties, if such properties are defined for the concept or one of its subconcepts. The domain of an attribute is naturally the concept to which the relation upon which the attribute is defined is mapped.

Associations between database relations, which are expressed via foreign keys (and constitute inclusion dependencies), are mapped to object properties. Hence, inclusion dependencies are translated into mappings to object properties. The domain concept of an object property corresponds to the mapping target of the domain-relation in the inclusion dependency; the range concept corresponds to the mapping target of the range-relation of the inclusion dependency, respectively.

9 In OWL terminology: datatype properties and object properties.

10 The role of the latter is provided by URIs in the Semantic Web; hence this information would no longer be needed.

7.2.2.2. Translation rules. The default rule is to map an attribute to the lexically closest attribute defined for a concept (or one of its subconcepts). The default rule will be applied if no other rules are applicable. Further mapping rules are used to create mappings to relationships. In the following, we use these definitions across all mapping rules:

• ((ri, A1), (rj, key(rj))) ∈ I;
• ((ri, A2), (rk, key(rk))) ∈ I, rj ≠ rk;
• cj = concept(rj);
• ck = concept(rk).

The mapping rules generally result in up to two mappings to the lexically closest relationships holding in either direction between cj and ck. Preference is given to relationships that are mutually inverse to each other and have either cj or ck or one of their subconcepts as domain and range, respectively. The remaining translation rules are presented in the following.

Translation Rule 6 ((n:m) association). The first heuristic takes care of (n:m) associations between database relations. The preconditions for the application of this heuristic are:

• A1 ⊂ att(ri), A2 ⊂ att(ri);
• att(ri) = key(ri);
• A1 ∪ A2 = att(ri);
• A1 ∩ A2 = ∅;
• ((ri, A1), (rj, key(rj))) ∈ I;
• ((ri, A2), (rk, key(rk))) ∈ I, rj ≠ rk;
• cj = concept(rj);
• ck = concept(rk).

Translation Rule 7 ((1:1) association). This heuristic captures (1:1) associations between database relations. The preconditions for the applicability of this heuristic are:

• ci = concept(ri);
• cj = concept(rj);
• ((ri, key(ri)), (rj, key(rj))) ∈ I;
• ((rj, key(rj)), (ri, key(ri))) ∈ I.


Translation Rule 8 ((1:m) association). This heuristic treats one-to-many associations, which are constituted by foreign keys. The preconditions for the application of this heuristic are:

• ci = concept(ri);
• cj = concept(rj);
• A1 ⊆ att(ri);
• ((ri, A1), (rj, key(rj))) ∈ I.

Translation Rule 9 (role grouping). This mapping rule is used to group attributes that are distributed across several relations and complements Rule 4. The preconditions for the applicability of this rule are:

• ¬∃ ci = concept(ri);
• ∃ cj = concept(rj);
• ((ri, key(ri)), (rj, key(rj))) ∈ I.

For all attributes A = att(ri) \ key(ri), new roles with domain concept cj are created.
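Rule 9 can be sketched in a few lines; the data representation and the example relation (an auxiliary table holding a storage-intensive Photo/CV attribute, as motivated in Section 7.2.1.2) are illustrative:

```python
def group_roles(ri_attrs, ri_key, cj):
    """Translation Rule 9 (sketch): for an unmapped relation ri whose key
    is a foreign key to a mapped relation rj, turn every non-key attribute
    into a new role with domain concept cj = concept(rj)."""
    return [(cj, a) for a in ri_attrs if a not in ri_key]

roles = group_roles(["PersonID", "Photo", "CV"], ["PersonID"], "Person")
# roles == [("Person", "Photo"), ("Person", "CV")]
```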

7.3. Prototype implementation

The automatic mapping is prototypically implemented in the OntoMat-Reverse tool, which also enables a very intuitive presentation/inspection of the database's relational schema and the structure of the given ontology, as presented in Fig. 6. OntoMat-Reverse uses edit distance [15] as a measure to obtain lexical matching hypotheses as required for the mapping rules. For example, the database relation named “Projekt” can be mapped to the concept named “Project,” since the edit distance between these words is very small.

Fig. 6. Automatic schema to ontology mapping.

Additionally, OntoMat-Reverse supports a validation step: for example, it can discover that a database relation is not mapped to any concept. In that case, it analyzes the structure and the content of such a database relation in order to determine the cause of the problem, for example that this database relation has no foreign keys and that no other database relations have a foreign key to this database relation.

Fig. 6 shows how OntoMat-Reverse makes recommendations for assigning attributes of a database relation to ontological relations between concepts in the ontology. The left side of the figure depicts the structure of the source database schema. The right side shows the target ontology. Highlighted entities are mapped to each other. Recommendations for mapping attributes are listed in the dialog.


8. Accessing the database

Deep annotations can be used in two ways to access the database. Firstly, interested parties can query the database. Here, the querying party loads the second party's ontology and mapping rules and uses them to obtain database tuples via the database interface. Secondly, interested parties can use the mapping to migrate the complete (mapped) database into ontology-based RDF instance data.

8.1. Querying the database

The querying party can use further tools to visualize the client ontology, to investigate the published mapping, and to compile a query from the client ontology. In our case, we used the OntoEdit plugins OntoMap and OntoQuery.

OntoMap visualizes the database query, the structure of the client ontology, and the mapping between them (cf. Fig. 7). The user can control and change the mapping and also create additional mappings.

OntoQuery is a Query-by-Example user interface. One creates a query by clicking on a concept and selecting the relevant attributes and relationships. The underlying Ontobroker system transforms a query on the ontology into a corresponding SQL query on the database. To this end, Ontobroker uses the mapping descriptions, which are internally represented as F-Logic axioms, to transform the query. The SQL query is sent to the database via the interface published in the server-side markup. The database answer, viz. a set of tuples, is transformed back into the ontology-based representation using the mapping rules. This task is executed automatically; hence no interaction with the user is necessary.

Fig. 7. Mapping between server database (left window) and client ontology (right window).
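The kind of rewriting Ontobroker performs can be illustrated with a deliberately simplified sketch. The actual system represents the mapping as F-Logic axioms; the dictionary encoding and function below are assumptions of ours:

```python
def ontology_query_to_sql(concept, attributes, mapping):
    """Simplified illustration of rewriting an ontology query into SQL
    using concept and attribute mappings (the real system uses F-Logic
    axioms inside Ontobroker; this dict encoding is hypothetical)."""
    table = mapping["concepts"][concept]
    cols = [mapping["attributes"][(concept, a)] for a in attributes]
    return f"SELECT {', '.join(cols)} FROM {table}"

mapping = {
    "concepts":   {"Institute": "Organization"},
    "attributes": {("Institute", "name"): "OrgName"},
}
sql = ontology_query_to_sql("Institute", ["name"], mapping)
# sql == "SELECT OrgName FROM Organization"
```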

8.2. Data migration

As an alternative to querying, the mappings can also be used to materialize ontology instances into RDF files. Here, all tuples of the relational database that are reachable via the mapping are stored explicitly and form a knowledge base for the published client ontology.

8.3. Data transformation

Data has to be transformed into ontology-based data for both methods of accessing the database. This data transformation is executed in two separate steps. In the first step, all the required concept instances are created without considering relationships or attributes. These instances are stored together with their identifiers. The identifier is translated from the database keys using the identifier pattern (see Section 5.2). For example, the instance with the name “AIFB” of the concept Institute, which is a subconcept of Organization, has the identifier: http://www.ontoweb.org/org/organizationid=3.

After the creation of all instances, the system starts computing the values of the instance relationships and attributes. The way the values are assigned is determined by the mapping rules. Since the values of an attribute or a relationship have to be computed from both the relational database and the ontology, we generate two queries per attribute/relationship: one SQL query and one Ontobroker query. Each query is invoked with an instance key value (the corresponding database key in SQL queries) as a parameter and returns the value of the attribute/relationship.
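The two-step transformation can be sketched as follows, with a hypothetical in-memory representation of the database tuples and the attribute mapping:

```python
def migrate(tuples, key_col, pattern_base, attr_cols):
    """Two-step data transformation (sketch): first create all concept
    instances with their identifiers, then fill in attribute values.
    `attr_cols` maps ontology attribute names to database columns."""
    # Step 1: create instances, keyed by their identifier URI.
    instances = {f"{pattern_base}{key_col.lower()}={t[key_col]}": {}
                 for t in tuples}
    # Step 2: compute the attribute values per instance.
    for t in tuples:
        uri = f"{pattern_base}{key_col.lower()}={t[key_col]}"
        for attr, col in attr_cols.items():
            instances[uri][attr] = t[col]
    return instances

rows = [{"OrganizationID": 3, "OrgName": "AIFB"}]
kb = migrate(rows, "OrganizationID", "http://www.ontoweb.org/org/",
             {"name": "OrgName"})
# kb == {"http://www.ontoweb.org/org/organizationid=3": {"name": "AIFB"}}
```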

Note that the database communication takes place through bind variables. The corresponding SQL query is generated and, if this is the first call, cached. A second call would try to use the same database cursor if still available, without parsing the respective SQL statement. Otherwise, it would find an unused cursor and retrieve the results. In this way, efficient access methods for relations and database rules can be maintained throughout the session. Ontobroker's API function dbaccess enables the generation of instances from the given relational database on the fly.
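The caching behaviour described above can be sketched with an in-memory SQLite database. The class and its statement cache are illustrative; the real system manages database cursors inside Ontobroker:

```python
import sqlite3

class CachedAccess:
    """Sketch of per-attribute statement caching with bind variables:
    the SQL for an attribute is generated once and reused, so it is not
    re-parsed on every call (hypothetical stand-in for dbaccess)."""
    def __init__(self, conn):
        self.conn = conn
        self._sql = {}  # (table, column) -> cached SQL text

    def attribute_value(self, table, key_col, attr_col, key):
        sql = self._sql.get((table, attr_col))
        if sql is None:  # first call: generate the statement and cache it
            sql = f"SELECT {attr_col} FROM {table} WHERE {key_col} = ?"
            self._sql[(table, attr_col)] = sql
        row = self.conn.execute(sql, (key,)).fetchone()
        return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Organization (OrganizationID INTEGER, OrgName TEXT)")
conn.execute("INSERT INTO Organization VALUES (3, 'AIFB')")
db = CachedAccess(conn)
value = db.attribute_value("Organization", "OrganizationID", "OrgName", 3)
# value == "AIFB"
```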

9. Related work

Deep annotation as we have presented it here is a cross-sectional enterprise.11 Therefore, there are a number of communities that have contributed towards reaching the objective of deep annotation. So far, we have identified the communities for information integration (Section 9.1), mapping frameworks (Section 9.2), wrapper construction (Section 9.3), database reverse engineering (Section 9.4), and annotation (Section 9.5).

9.1. Information integration

The core idea of information integration lies in providing an algebra that may be used to translate information proper between different information structures. Underlying algebras are used to provide compositionality of translations as well as a sound basis for query optimization (cf. e.g. a commercial system as described in [21], with many references to previous work, much of the latter based on principal ideas issued in [32]).

11 Just like the Semantic Web overall!

Unlike [21], our objective has not been the provisioning of a flexible, scalable integration platform per se. Rather, the purpose of deep annotation lies in providing a flexible framework for creating the translation descriptions that may then be exploited by an integration platform like EXIP (or Nimble, Tsimmis, Infomaster, Garlic, etc.). Thus, we have more in common with the approaches for creating mappings for the purpose of information integration described next.

9.2. Mapping and merging frameworks

Approaches for mapping and/or merging ontologies and/or database schemata may be distinguished mainly along the following three categories: discovery [2,5,8,18,20,23], mapping representation [2,16,19,22], and execution [6,19].

In the overall area, closest to our own approach is [17], as it handles, like we do, the complete mapping process involving the three process steps just listed (in fact, it also takes care of some more issues like evolution).

What makes deep annotation different from all these approaches is that for the initial discovery of overlaps between different ontologies/schemata, they all depend on lexical agreement between parts of the two ontologies/database schemata. Deep annotation only depends on the user understanding the presentation, i.e. the information within an information context, developed for him anyway. Concerning the mapping representation and execution, we follow a standard approach exploiting Datalog, giving us many possibilities for investigating, adapting and executing mappings as described in Section 7.

9.3. Wrapper construction

Methods for wrapper construction achieve many objectives that we pursue with our approach of deep annotation. They have been designed to allow for the construction of wrappers by explicit definition of HTML or XML queries [25] or by learning such definitions from examples [4,14]. Thus, it has been possible to manually create metadata for a set of structurally similar Web pages. The wrapper approaches come with the advantage that they do not require cooperation by the owner of the database. However, their shortcoming is that the correct scraping of metadata depends to a large extent on data layout rather than on the structures underlying the data.

Furthermore, when definitions are given explicitly [25], the user must cope directly with querying by layout constraints, and when definitions are learned, the user must annotate multiple Web pages in order to derive correct definitions. Also, these approaches do not map to ontologies. They typically map to lower-level representations, e.g. nested string lists in [25], from which the conceptual descriptions must be extracted, which is a non-trivial task. In fact, we have integrated a wrapper learning method, viz. Amilcare [4], into our OntoMat-Annotizer. How to bridge between wrapper construction and annotation is described in detail in [12].

9.4. Database reverse engineering

There are very few approaches investigating the transformation of a relational model into an ontological model. The approach most similar to ours is the project InfoSleuth [11]. In this project, an ontology is built based on the database schemas of the sources that should be accessed. The ontology is refined based on user queries. However, there are no techniques for creating axioms, which are a very important part of an ontology. Our approach is heavily based on the mapping of some database constraints into ontological axioms. Moreover, the semantic characteristics of the database schema are not always analyzed. More work has addressed the issues of explicitly defining semantics in database schemas [4,16], extracting semantics out of database schemas [4,10], and transforming a relational model into an object-oriented model [3], which is close to an ontological theory. Rishe [16] introduces the semantics of the database as a “means” to closely capture the meaning of user information and to provide a concise, high-level description of that information. In [4] an interactive schema migration environment that provides a set of alternative schema mapping rules is proposed. In this approach, which is similar to our approach on the conceptual level, the reengineer repeatedly chooses an adequate mapping rule for each schema artifact. However, this stepwise process creates an object-oriented schema; therefore axioms are not discussed.

9.5. Annotation

Finally, we need to consider information proper as part of deep annotation. There, we “inherit” the principal annotation mechanism for creating relational metadata as elaborated in [11]. The interested reader finds an elaborate comparison of annotation techniques there, as well as in a forthcoming book on annotation [13].

10. Conclusion

In this paper, we have described deep annotation, an original framework to provide semantic annotation for large sets of data. Deep annotation leaves semantic data where it can be handled best, viz. in database systems. Thus, deep annotation provides a means for mapping and reusing dynamic data in the Semantic Web with tools that are comparatively simple and intuitive to use.

To attain this objective we have defined a deep annotation process and the appropriate architecture. We have incorporated the means for server-side markup that allows the user to define semantic mappings, using OntoMat-Annotizer to create Web presentation-based annotations12 and OntoMat-Reverse to create schema-based annotations. An ontology and mapping editor and an inference engine are then used to investigate and exploit the resulting descriptions, either for querying the database content or for materializing the content into RDF files. In total, we have provided a complete framework and its prototype implementation for deep annotation.

12 The methodology “CREAM” and its implementation “OntoMat-Annotizer” have been intensively tested by authors of ISWC-2002 when annotating the summary pages of their papers with RDF metadata; see http://www.annotation.semanticweb.org/iswc/documents.html.


For the future, there is a long list of open issues concerning deep annotation, from the more mundane, though important, ones (top) to far-reaching ones (bottom):

(1) Granularity: So far we have only considered atomic database fields. For instance, one may find a string “Proceedings of the Eleventh International World Wide Web Conference, WWW2002, Honolulu, Hawaii, USA, 7–11 May 2002.” as the title of a book, whereas one might rather be interested in separating this field into title, location, and date.

(2) Automatic derivation of server-side Web page markup: A content management system like Zope could provide the means for automatically deriving server-side Web page markup for deep annotation. Thus, the database provider could be freed from any workload, while still allowing for participation in the Semantic Web. Some steps in this direction are currently being pursued in the KAON CMS, which is based on Zope.13

(3) Other information structures: For now, we have built our deep annotation process on SQL and relational databases. Future schemes could exploit XQuery14 or an ontology-based query language.

(4) Interlinkage: In the future, deep annotations may even link to each other, creating a dynamic interconnected Semantic Web that allows translation between different servers.

(5) Opening the possibility to directly query the database certainly creates problems, such as new possibilities for denial-of-service attacks. In fact, queries, e.g. ones that involve too many joins over large tables, may prove hazardous. Nevertheless, we see this rather as a challenge to be solved by clever schemes for allotting CPU processing time (with the possibility that queries are not answered because the time allotted for one query by one user is up) than as a complete “no go.”

We believe that these options make deep annotationa rather intriguing scheme on which a considerablepart of the Semantic Web might be built.

13 See http://www.kaon.aifb.uni-karlsruhe.de/Members/rvo/kaonportal.

14 http://www.w3.org/TR/xquery.

Acknowledgements

Research for this paper has been funded by the projects DARPA DAML OntoAgents, EU IST Bizon, and EU IST WonderWeb. We gratefully thank Leo Meyer, Dirk Wenke (Ontoprise), Gert Pache, and our colleagues at AIFB and FZI for discussions and implementations that contributed toward the deep annotation prototype described in this paper.

References

[1] C. Batini, S. Ceri, S. Navathe, Conceptual DatabaseDesign—An Entity-Relationship Apporach, Benjamin/Cum-mings Publishing Company, 1992.

[2] S. Bergamaschi, S. Castano, D. Beneventano, M. Vincini,Semantic Integration of Heterogeneous Information Sources,in: Proceedings of the Special Issue on Intelligent Informa-tion Integration, Data and Knowledge Engineering, vol. 36,Elsevier, 2001, pp. 215–249.

[3] J. Broekstra, A. Kampman, F. van Harmelen, Sesame: AnArchitecture for Storing and Querying RDF Data and SchemaInformation, MIT Press, 2001.

[4] F. Ciravegna, Adaptive information extraction from text byrule induction and generalisation, in: B. Nebel (Ed.), Pro-ceedings of the 17th International Conference on ArtificialIntelligence (IJCAI), Morgan Kaufmann Publishers, Inc., SanFrancisco, CA, 4–10 August 2001, pp. 1251–1256.

[5] W. Cohen, The WHIRL Approach to Data Integration, IEEEIntelligent Systems, 1998, pp. 1320–1324.

[6] T. Critchlow, M. Ganesh, R. Musick, Automatic gener-ation of warehouse mediators using an ontology engine,in: Proceedings of the Fifth International Workshop onKnowledge Representation Meets Databases (KRDB), 1998,pp. 8.1–8.8.

[7] C. Date, Referential integrity, in: Proceedings of Interna-tional Conference on Very Large Data Bases (VLDB), 1981,pp. 2–12.

[8] A. Doan, J. Madhavan, P. Domingos, A. Halevy, Learningto map between ontologies on the Semantic Web, in: Pro-ceedings of the World-Wide Web Conference (WWW), ACMPress, 2002, pp. 662–673.

[9] D. Fensel, J. Angele, S. Decker, M. Erdmann, H.-P. Schnurr,S. Staab, R. Studer, Andreas witt, on2broker: Semantic-basedaccess to information sources at the WWW, in: Proceedings ofthe World Conference on the WWW and Internet (WebNet),Honolulu, Hawaii, USA, 1999, pp. 366–371.

[10] J. Golbeck, M. Grove, B. Parsia, A. Kalyanpur, J. Hendler,New Tools for the Semantic Web, in: Proceedings of EKAW,LNCS 2473, Springer, 2002, pp. 392–400.

[11] S. Handschuh, S. Staab, Authoring and annotation of Webpages in CREAM, in: Proceedings of the 11th InternationalWorld Wide Web Conference (WWW), Honolulu, Hawaii,ACM Press, 7–11 May 2002, pp. 462–473.

[12] S. Handschuh, S. Staab, F. Ciravegna, S-CREAM—semi-automatic creation of metadata, in: Proceedings of the EKAW, LNCS, 2002, pp. 358–372.

[13] S. Handschuh, S. Staab (Eds.), Annotation in the Semantic Web, Frontiers in Artificial Intelligence and Applications, vol. 96, IOS Press, 2003.

[14] N. Kushmerick, Wrapper induction: efficiency and expressiveness, Artif. Intell. 118 (1–2) (2000) 15–68.

[15] V. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Phys. Dokl. 10 (8) (1966) 707–710; Russian original, Doklady Akademii Nauk SSSR 163–164 (1965) 845–848.

[16] J. Madhavan, P.A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), 2001, pp. 49–58.

[17] A. Maedche, B. Motik, N. Silva, R. Volz, MAFRA—a mapping framework for distributed ontologies, in: Proceedings of the EKAW, LNCS 2473, Springer, 2002, pp. 235–250.

[18] D. McGuinness, R. Fikes, J. Rice, S. Wilder, The Chimaera ontology environment, in: Proceedings of the AAAI, 2000, pp. 1123–1124.

[19] P. Mitra, G. Wiederhold, M. Kersten, A graph-oriented model for articulation of ontology interdependencies, in: Proceedings of the Conference on Extending Database Technology (EDBT), Konstanz, Germany, 2000.

[20] N.F. Noy, M.A. Musen, PROMPT: algorithm and tool for automated ontology merging and alignment, in: Proceedings of the AAAI, 2000, pp. 450–455.

[21] Y. Papakonstantinou, V. Vassalos, Architecture and implementation of an XQuery-based information integration platform, IEEE Data Eng. Bull. 25 (1) (2002) 18–26.

[22] J.Y. Park, J.H. Gennari, M.A. Musen, Mappings for reuse in knowledge-based systems, Technical Report SMI-97-0697, Stanford University, 1997.

[23] E. Rahm, P. Bernstein, A survey of approaches to automatic schema matching, VLDB J. 10 (4) (2001) 334–350.

[24] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison-Wesley, 1995.

[25] A. Sahuguet, F. Azavant, Building intelligent Web applications using lightweight wrappers, Data Knowl. Eng. 36 (3) (2001) 283–316.

[26] S. Schulze-Kremer, Adding semantics to genome databases: towards an ontology for molecular biology, in: Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, Halkidiki, Greece.

[27] S. Staab, J. Angele, S. Decker, M. Erdmann, A. Hotho, A. Maedche, H.-P. Schnurr, R. Studer, Y. Sure, Semantic community Web portals, Proc. WWW9/Comput. Networks 33 (1–6) (2000) 473–491.

[28] N. Stojanovic, A. Maedche, S. Staab, R. Studer, Y. Sure, SEAL: a framework for developing SEmantic PortALs, in: Proceedings of the K-CAP, ACM Press, 2001, pp. 155–162.

[29] R. Studer, Y. Sure, R. Volz, Managing user focused access to distributed knowledge, J. Univ. Comput. Sci. (J. UCS) 8 (6) (2002) 662–672.

[30] Y. Sure, J. Angele, S. Staab, Guiding ontology development by methodology and inferencing, in: K. Aberer, L. Liu (Eds.), ODBASE 2002: Ontologies, Databases and Applications of Semantics, Irvine, CA, USA, LNCS, Springer, 29–31 October 2002, pp. 1025–1222.

[31] M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni, A. Stutt, F. Ciravegna, MnM: ontology driven semi-automatic and automatic support for semantic markup, in: Proceedings of the EKAW, LNCS 2473, Springer, 2002, pp. 379–391.

[32] G. Wiederhold, Intelligent integration of information, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1993, pp. 434–437.