PORSCHE: Performance ORiented SCHEma mediation

21
Information Systems 33 (2008) 637–657 PORSCHE: Performance ORiented SCHEma mediation Khalid Saleem a, , Zohra Bellahsene a , Ela Hunt b a LIRMM-UMR 5506, Universite´Montpellier 2, 34392 Montpellier, France b GlobIS, Department of Computer Science, ETH Zurich, CH-8092 Zurich, Switzerland Abstract Semantic matching of schemas in heterogeneous data sharing systems is time consuming and error prone. Existing mapping tools employ semi-automatic techniques for mapping two schemas at a time. In a large-scale scenario, where data sharing involves a large number of data sources, such techniques are not suitable. We present a new robust automatic method which discovers semantic schema matches in a large set of XML schemas, incrementally creates an integrated schema encompassing all schema trees, and defines mappings from the contributing schemas to the integrated schema. Our method, PORSCHE (Performance ORiented SCHEma mediation), utilises a holistic approach which first clusters the nodes based on linguistic label similarity. Then it applies a tree mining technique using node ranks calculated during depth- first traversal. This minimises the target node search space and improves performance, which makes the technique suitable for large-scale data sharing. The PORSCHE framework is hybrid in nature and flexible enough to incorporate more matching techniques or algorithms. We report on experiments with up to 80 schemas containing 83,770 nodes, with our prototype implementation taking 587 s on average to match and merge them, resulting in an integrated schema and returning mappings from all input schemas to the integrated schema. The quality of matching in PORSCHE is shown using precision, recall and F-measure on randomly selected pairs of schemas from the same domain. We also discuss the integrity of the mediated schema in the light of completeness and minimality measures. r 2008 Elsevier Ltd All rights reserved. Keywords: XML schema tree; Schema matching; Schema mapping; Schema mediation; Tree mining; Large scale 1. Introduction Schema matching relies on discovering corre- spondences between similar elements in a number of schemas. Several different types of schema matching [1–8] have been studied, demonstrating their benefit in different scenarios. In data integra- tion schema matching is of central importance [9]. The need for information integration arises in data warehousing, OLAP, data mashups [10], and work- flows. Omnipresence of XML as a data exchange format on the web and the presence of metadata available in that format force us to focus on schema matching, and on matching for XML schemas in particular. Previous work on schema matching was devel- oped in the context of schema translation and integration [1,2,11], knowledge representation [4,8], machine learning, and information retrieval [3]. Most mapping tools map two schemas with human intervention [1–6,12]. The goals of research in this field are typically differentiated as matching, map- ping or integration oriented. A Matching tool finds ARTICLE IN PRESS www.elsevier.com/locate/infosys 0306-4379/$ - see front matter r 2008 Elsevier Ltd All rights reserved. doi:10.1016/j.is.2008.01.010 Corresponding author. Tel.: +33 467418585; fax: +33 467418500. E-mail addresses: [email protected] (K. Saleem), [email protected] (Z. Bellahsene), [email protected] (E. Hunt).

Transcript of PORSCHE: Performance ORiented SCHEma mediation

Page 1: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

0306-4379/$ - se

doi:10.1016/j.is.

�Correspondfax: +3346741

E-mail addre

[email protected]

Information Systems 33 (2008) 637–657

www.elsevier.com/locate/infosys

PORSCHE: Performance ORiented SCHEma mediation

Khalid Saleema,�, Zohra Bellahsenea, Ela Huntb

aLIRMM-UMR 5506, Universite Montpellier 2, 34392 Montpellier, FrancebGlobIS, Department of Computer Science, ETH Zurich, CH-8092 Zurich, Switzerland

Abstract

Semantic matching of schemas in heterogeneous data sharing systems is time consuming and error prone. Existing

mapping tools employ semi-automatic techniques for mapping two schemas at a time. In a large-scale scenario, where data

sharing involves a large number of data sources, such techniques are not suitable. We present a new robust automatic

method which discovers semantic schema matches in a large set of XML schemas, incrementally creates an integrated

schema encompassing all schema trees, and defines mappings from the contributing schemas to the integrated schema. Our

method, PORSCHE (Performance ORiented SCHEma mediation), utilises a holistic approach which first clusters the

nodes based on linguistic label similarity. Then it applies a tree mining technique using node ranks calculated during depth-

first traversal. This minimises the target node search space and improves performance, which makes the technique suitable

for large-scale data sharing. The PORSCHE framework is hybrid in nature and flexible enough to incorporate more

matching techniques or algorithms. We report on experiments with up to 80 schemas containing 83,770 nodes, with our

prototype implementation taking 587 s on average to match and merge them, resulting in an integrated schema and

returning mappings from all input schemas to the integrated schema. The quality of matching in PORSCHE is shown

using precision, recall and F-measure on randomly selected pairs of schemas from the same domain. We also discuss the

integrity of the mediated schema in the light of completeness and minimality measures.

r 2008 Elsevier Ltd All rights reserved.

Keywords: XML schema tree; Schema matching; Schema mapping; Schema mediation; Tree mining; Large scale

1. Introduction

Schema matching relies on discovering corre-spondences between similar elements in a numberof schemas. Several different types of schemamatching [1–8] have been studied, demonstratingtheir benefit in different scenarios. In data integra-tion schema matching is of central importance [9].The need for information integration arises in data

e front matter r 2008 Elsevier Ltd All rights reserved.

2008.01.010

ing author. Tel.: +33467418585;

8500.

sses: [email protected] (K. Saleem),

(Z. Bellahsene), [email protected] (E. Hunt).

warehousing, OLAP, data mashups [10], and work-flows. Omnipresence of XML as a data exchangeformat on the web and the presence of metadataavailable in that format force us to focus on schemamatching, and on matching for XML schemas inparticular.

Previous work on schema matching was devel-oped in the context of schema translation andintegration [1,2,11], knowledge representation [4,8],machine learning, and information retrieval [3].Most mapping tools map two schemas with humanintervention [1–6,12]. The goals of research in thisfield are typically differentiated as matching, map-ping or integration oriented. A Matching tool finds

Page 2: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657638

possible candidate correspondences from a sourceschema to a target schema. Mapping is an expres-sion which distinctly binds elements from a sourceschema to elements in the target schema, dependingupon some function. And integration is the processof generating a schema which contains the conceptspresent in the source input schemas. Anotherobjective, mediation, is mapping between each sourceschema and an integrated schema. The objectivebehind our work is to explore the ensemble of allthese aspects in a large set of schema trees, usingscalable syntactic and semantic matching and inte-gration techniques. The target application area forour research is a large-scale scenario, like WSML1

based web service discovery and composition, webbased e-commerce catalogue mediation, schemabased P2P database systems or web based queryingof federated database systems.

We consider input schemas to be rooted labelledtrees. This supports the computation of contextualsemantics in the tree hierarchy. The contextualaspect is exploited by tree mining techniques,making it feasible to use automated approximateschema matching [3] and integration in a large-scalescenario. Initially, we create an intermediatemediated schema based on one of the inputschemas, and then we incrementally merge inputschemas with this schema. The individual semanticsof element labels have their own importance. Weutilise linguistic matchers to extract the meaningshidden within the labels. This produces clusters ofnodes bearing similar labels, including nodes fromthe intermediate mediated schema. During themerge process, if a node from a source schema hasno correspondence in the intermediate mediatedschema, a new node in the intermediate mediatedschema is created to accommodate this node.

Tree mining techniques extract similar sub-treepatterns from a large set of trees and predictpossible extensions of these patterns. The patternsize starts from one and is incrementally augmented.There are different techniques [13,14] which minerooted, labelled, embedded or induced, ordered orunordered sub-trees. The basic function of treemining is to find sub-tree patterns that are frequentin the given set of trees, which is similar to schemamatching activity that tries to find similar conceptsamong a set of schemas.

1Web Service Modeling Language, http://www.w3.org/submission/

WSML.

1.1. Contributions

We present a new scalable methodology for schemamapping and producing an integrated schema from aset of input schemas belonging to the same con-ceptual domain, along with mappings from the sourceschemas to the integrated (mediated) schema. Themain features of our approach are as follows.

1.

2

3

The approach is almost automatic and hybrid innature.

2.

It is based on a tree mining technique supportinglarge-scale schema matching and integration. Tosupport tree mining, we model schemas as rootedordered (depth-first) labelled trees.

3.

It uses node level clustering, based on node labelsimilarity, to minimise the target search space, asthe source node and candidate target nodes are inthe same cluster. Label similarity is computedusing tokenisation and token level synonym andabbreviation translation tables.

4.

The technique extends the tree mining datastructure proposed in [14]. It uses ancestor/descendant scope properties (integer logicaloperations) on schema nodes to enable fastcalculation of contextual (hierarchical) similaritybetween them.

5.

It provides as output the mediated schema and aset of mappings (1:1, 1: n, n: 1) from input sourceschemas to the mediated (integrated) schema andvice versa.

6.

The approach was implemented as a prototype.We report on experiments using different real(OAGIS,2 xCBL3) and synthetic scenarios, de-monstrating:(a) fast performance for different large-scale

data sets, which shows that our method isscalable and supports a large-scale dataintegration scenario;

(b) input schema selection options (smallest,largest or random) for the creation of initialmediated schema, allowing us to influencematching performance;

(c) match quality evaluation using precision, recall

and F-measure as measures of mapping quality;(d) an analysis of the integrity of integrated

schema with reference to completeness andminimality measures;

(e) quadratic time complexity.

http:

http:

//www.openapplications.org.

//www.xcbl.org.

Page 3: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657 639

The remainder of the paper is organised as follows.Section 2 presents the desiderata and issues encoun-

tered in large-scale schema mediation. Section 3defines the concepts used in the paper. Ourapproach, Performance ORiented SCHEma media-tion (PORSCHE) is detailed in Section 4, comprisingof the architecture, algorithms and data structures,supported by a running example. Section 5 presentsthe experimental results in the framework of thedesiderata for schema mediation. Section 6 reviewsrelated work and compares it to ours. Section 7 givesa discussion on the lessons learned and Section 8concludes.

2. Desiderata for schema mediation

Schema mediation can be defined as integrationof a set of input schemas into a single integratedschema, with concepts mappings from the inputschemas to the integrated schema, also called themediated schema. There are numerous issues in thesemantic integration of a large number of schemas.Beside mapping quality, the performance andintegrity of the integrated schema are also veryimportant. For example, the Semantic Web, bydefinition, offers a large-scale environment whereindividual service providers are independent. Insuch a situation the mappings can never be exact,rather they are approximate [3,4,16]. And withhundreds of services available, searching andselecting the required service needs to be fast andreliable enough to satisfy a user query. Following, isa brief discussion of the desiderata for schemaintegration and mediation.

2.1. Feasibility and quality of schema mapping and

integration

The quality of mappings depends on the numberand types of matching algorithms and their combi-nation strategy, for instance their execution order.To produce high quality mappings, matching toolsapply a range of match algorithms to every pair ofsource and target schema elements. The results ofmatch algorithms are aggregated to compute thepossible candidate matches which are then scruti-nised by users who select the best/most correctcandidate as the mapping for the source schemaelement(s) to the target schema element(s).COMAþþ [2] and S-Match [4] follow thistechnique and create a matrix of comparisons ofthese pairs of elements and then generate mappings.

The final mapping selection can also be automated,depending upon some pre-defined criteria (matchquality confidence), for example, choosing thecandidate with the highest degree of similarity.QOM [12] and RONDO [15] use a variation of thesetechniques. The mapping algorithms used there areapplied to each pair of schemas of size amenable tohuman inspection. Large schemas, anything inexcess of 50 elements, make human interventionvery time consuming and error prone. Therefore,automated selection of best mappings from thesource to the target schema is a must in large-scalescenarios.

The second aspect regarding large-scale scenariosis the requirement for batch schema integration,where schemas may contain thousands of elements.Often, an integrated schema is to be created andmappings from source schemas to the integratedschema have to be produced, and this is to be donefast and reliably. This requires both matching andintegration and is not supported by tools likeCOMAþþ, S-Match, QOM and RONDO.

2.2. Schema integration approaches and integrity

measures

Integrating a batch of schemas is a specialisedapplication of schema matching, and a natural nextstep in data integration. Schema integration can beholistic, incremental or iterative (blend of incre-mental and holistic), as surveyed in [9]. The binaryladder (or balanced incremental integration) ap-proach can be implemented as a script whichautomates the merging process, based on algorithmslike QOM and similarity flooding (RONDO). Forexample, in a binary ladder implementation, theresult of merging the first two schemas is consecu-tively merged with subsequent schemas. [16,22], onthe other hand, discuss mining techniques and applyan n-ary integration approach (all schemasexploited holistically), to generate an integratedschema, along with approximate acceptable map-pings. An iterative approach first finds clusters ofschemas, based on some similarity criteria, and thenperforms integration at cluster level iteratively[17–19]. Although the iterative process is automaticand robust, the mediation aspect is very difficult toimplement on top of those approaches.

The main purpose of schema integration in an in-formation system is to hide the complexity of the un-derlying data sources and schemas from the user. Theuser accesses the integrated schema (mediated schema)

Page 4: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657640

to find a service or an answer to a query. Batista andcolleagues [20] explain the quality aspects of theintegrity of a mediated schema in an informationsystem. They highlight three quality criteria.

Schema completeness, computed as completeCon-

cepts/allConcepts is the percentage of conceptsmodelled in the integrated schema that can befound in the source schema. Completeness isindicative of the potential quality of query results.

Minimality, calculated as nonRedundantElements/

allElements, measures the compactness of theintegrated schema and is normally reflected in queryexecution time.

Type consistency, measured as consistentElements/

allElements spans all the input and mediated schemas.It is the extent to which the elements representingthe same concept are represented using the same datatypes.

2.3. Performance

Performance is an open issue in schema matching[7,8], governed by schema size and matching algo-

rithms. The complexity of the matching task for apair of schemas is typically proportional to the sizeof both schemas and the number of matchalgorithms employed, i.e., Oðn1n2aÞ, where n1 andn2 are element counts in the source and the targetand a is the number of algorithms used [2,4,6].Selection of the best match from a set of possiblematches increases the complexity of the problem. Inschema integration and mediation, beside theparameters discussed above, as there are more thantwo schemas, we may consider instead the batch size(number of schemas) and the manner in which theschemas are compared and integrated. The perfor-mance of the integration and mapping process canbe improved by optimising the target search spacefor a source schema element. Minimising the searchspace by clustering will improve performance.

Our three desiderata, outlined at the start of thissection, motivated us to deliver a new robust andscalable solution to schema integration and media-tion. Here, we focus on a large number of schemas,automated matching, and good performance. Weexplore mediated schema generation. For a givenbatch of large schemas and an initial mediatedschema (derived from one of the input schemas), weefficiently construct a mediated schema whichintegrates all input schemas in an incrementalholistic manner. To enhance the speed and lowerthe cost of data integration, we try to remove the

need for human intervention. We present a newmethod for schema matching and integration whichuses a hybrid technique to match and integrateschemas and create mappings from source schemasto the mediated schema. The technique combinesholistic node label matching and incremental binaryladder integration [9]. It uses extended tree miningdata structures to support performance orientedapproximate schema matching. Our approach isXML schema centric, and it addresses the require-ments of the Semantic Web.

3. Preliminaries

3.1. Match cardinalities

Schema matching finds similarities between ele-ments in two or more schemas. There are three basicmatch cardinalities at element level [7]. Since we arematching schema tree structures (elements arenodes), where the leaf nodes hold data, we placemore emphasis on leaf node matching. Our categor-isation of node match cardinalities is driven by thenode’s leaf or non-leaf (inner node) status.

(i)

1:1—one node of source schema corresponds toone node in the target schema; leaf:leaf or non-leaf:non-leaf.

(ii)

1: n—one node in the source schema is equiva-lent to a composition of n leaves in the targetschema; leaf:non-leaf, where a source leaf nodeis mapped to a sub-tree containing n leaf nodesin the target.

(iii)

n: 1—n leaves in source schema compositelymap to one leaf in the target schema; non-leaf:leaf, allowing a sub-tree with n leaves in asource to be mapped to a target leaf.

Example 1 (Match cardinality) : For 1:1, cardinalityis straightforward. For 1: n, consider Fig. 1. Amatch is found between Ssourcename [2], child ofwriter, and Stargetname [2], child of author, withchildren first and last. We have a 1:2 mappingðnameÞsource: ðname=first; name=lastÞtarget. Also, thereis a match between Ssourcepublisher [4] and Starget

publisher [5], with a 1:1 mapping ðpublisher=nameÞsource: ðpublisherÞtarget.

Semantically speaking, a match between two nodesis fuzzy. It can be either an equivalence or a partialequivalence. In a partial match, the similarity is partial.It is highlighted in the following example.

Page 5: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Fig. 1. Example schema trees showing labels and depth-first order number for each node.

K. Saleem et al. / Information Systems 33 (2008) 637–657 641

Example 2 (Partial match) : In source schemaName ¼ ‘John M: Brown’, is partially matchedto LastName ¼ ‘Brown’ and FirstName ¼ ‘John’in the target, because Name also contains theMiddleInitial ¼ ‘M’.

3.2. Definitions

Semantic matching requires the comparison ofconcepts structured as schema elements. Labelsnaming the schema elements are considered to beconcepts and each element’s contextual placementinformation in the schema further defines thesemantics. For example, in Fig. 1, Ssourcename [5]and Stargetname [2] have similar labels but their treecontexts are different, which makes them concep-tually disjoint. Considering the XML schema as atree, the combination of the node label and thestructural placement of the node defines theconcept. Here we present the definitions used inschema matching and integration.

Definition 1 (Schema tree) : A schema S ¼ ðV ;EÞ isa rooted, labelled tree [14], consisting of nodesV ¼ f0; 1; . . . ; ng, and edges E ¼ fðx; yÞjx; y 2 Vg.One distinguished node r 2 V is called the root,and for all x 2 V , there is a unique path from r to x.Further, lab:V ! L is a labelling function mappingnodes to labels in L ¼ fl1; l2; . . .g.

Schema tree nodes bear two kinds of information:the node label, and the node number allocatedduring depth-first traversal. Labels are linguisticallycompared to calculate label similarity (Definition 2,Label semantics). Node number is used to calcu-late the node’s tree context (Definition 4, Nodescope).

Definition 2 (Label semantics) : A label l is acomposition of m strings, called tokens. We applythe tokenisation function tok which maps a label toa set of tokens Tl ¼ ft1; t2; . . . ; tmg. Tokenisation [4]

helps in establishing similarity between two labels.

tok:L! PðTÞ; where PðTÞ is a power set over T .

Example 3 (Label equivalence) : FirstName, toke-nised as ffirst, nameg, and NameFirst, tokenised asfname, firstg, are equivalent, with 100% similarity.

Label semantics corresponds to the meaning ofthe label (irrespective of the node it is related to). Itis a composition of meanings attached to the tokensmaking up the label. As shown by Examples 3–5,different labels can represent similar concepts. Wedenote the concept related to label l as CðlÞ.

Example 4 (Synonymous labels) : WriterName, to-kenised as fwriter, nameg, and AuthorName, toke-nised as fauthor, nameg are equivalent (theyrepresent the same concept), since ‘writer’ is asynonym of ‘author’.

Semantic label matching minimises the searchspace of possible mappable target nodes [4,12]. Thederivation of concept similarity in two schemas isinitiated by comparing their labels. Similaritybetween labels is either equivalence or partialequivalence, or no similarity, as defined below.

(a)

Equivalence: CðlxÞ ¼ CðlyÞ. (b) Partial equivalence: CðlxÞ ffi CðlyÞ.

(i) More specific (is part of): CðlxÞ � CðlyÞ.(ii) More general (contains): CðlxÞ � CðlyÞ.(iii) Overlaps: CðlxÞ \ CðlyÞa;.

(c)

No similarity: CðlxÞ \ CðlyÞ ¼ ;.

Example 5 (Label similarity) : As AuthorName

and WriterName are equivalent (Example 4),we write AuthorName ¼WriterName. Also,AuthorLastName � AuthorName, as LastName

is conceptually part of name. Conversely, AuthorName � AuthorLastName. MiddleLastName and

Page 6: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657642

FirstNameMiddle overlap, as they share tokensfname, middleg.

Definition 3 (Node clustering) : To minimise thetarget search space (see Section 2), we cluster allnodes, based on label similarity. The clusteringfunction can be defined as VT :L! PðV Þ wherePðV Þ is the power set over V. VT returns for eachlabel l 2 L a set of nodes vi � V , with labels similarto l. Fig. 2 illustrates node clustering. Here, thecluster contains nodes fx52; x11;x23; x42g bearingsynonymous labels fauthor, novelist, writerg.

Definition 4 (Node scope) : In schema S each nodex 2 V is numbered according to its order in thedepth-first traversal of S (the root is numbered 0).Let SubTreeðxÞ denote the sub-tree rooted at x, andx be numbered X, and let y be the rightmost leaf (orhighest numbered descendant) under x, numberedY. Then the scope of x is scopeðxÞ ¼ ½X ;Y �.Intuitively, scopeðxÞ is the range of nodes under x,and includes x itself, see Fig. 3. The count of nodesin SubTreeðxÞ is Y � X þ 1.

Definition 5 (Node semantics) : Node semantics ofnode x, Cx, combines the semantics of the nodelabel l, CðlxÞ, with its contextual placement in thetree, TreeContextðxÞ [4], via function NodeSem.

Cx: x! NodeSemðCðlxÞ;TreeContextðxÞÞ.

TreeContext of a node is calculated relative to othernodes in the schema, using node number and scope(Example 7).

Fig. 2. Node clustering. In xsn, s is the schema number and n is

the node number within the schema.

Fig. 3. Example schema tree with labels and (number, scope) for

each node.

Definition 6 (Schema mediation) : INPUT: A set ofschema trees SSet ¼ fS1;S2; . . . ;Sug.

OUTPUTS: (a) An integrated schema tree Sm

which is a composition of all distinct concepts Cx inSSet (see Definition 5, and [20]).

Sm ¼^

x2Si

Vu

i¼1ðCxÞ,

where V is a composition operator on the set ofschemas which produces a tree containing all thedistinct concepts (encoded as nodes). The tree has tobe complete (see integrated schema integrity mea-sure, Section 5 and [20]) to ensure correct queryresults.

(b) A set of mappings M ¼ fm1;m2; . . . ;mwg fromthe concepts of input schema trees to the concepts inthe integrated schema.

The integrated schema Sm is a compositionof all nodes representing distinct concepts inSSet. During the integration, if an equivalent nodeis not present in Sm, a new edge e0 is created inSm and the node is added to it, to guaranteecompleteness.

3.3. Scope properties

Scope properties describe the contextual place-ment of a node [14]. They explain how structuralcontext (within a tree) can be extracted during theevaluation of node similarity. The propertiesrepresent simple integer operations.

Unary Properties of a node x with scope [X,Y]:

Property 1 (Leaf nodeðxÞ) : X ¼ Y .

Property 2 (Non-leaf nodeðxÞ) : XoY .

Example 6 (Leaf and non-leaf) : See Fig. 3. Property1 holds for title [7,7] which is a leaf. Property 2holds for writer [1,2] which is an inner node.Intuitively, Properties 1 and 2 detect simple andcomplex elements in a schema.

Binary Properties for x [X,Y], xd ½X d ;Y d �,xa½X a;Y a�, and xr½X r;Y r�:

Property 3 (Descendant ðx; xdÞ, xd is a descendant

of x) : X d4X and Y dpY.

Property 4 (DescendantLeaf ðx; xdÞ) : This combines

Properties 1 and 3. X d4X and Y dpY and

X d ¼ Y d .

Page 7: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657 643

Property 5 (Ancestor ðx;xaÞ, xa is an ancestor of x) :(Complement of Property 3) X aoX and Y aXY.

Property 6 (RightHandSideNode ðx; xrÞ, xr is Right-

hand side node of x with non-overlapping scope) :X r4Y .

Example 7 (Node relationships) : See Fig. 4. Starget—Property 5 holds for nodes [4,5] and [5,5], asAncestor ([5,5],[4,5]), so publisher [4,5] is an ancestorof name [5,5]. Also Ancestor ([4,5],[0,7]) holds forbook [0,7] and publisher [4,5]. In Starget, Right-

HandSideNode ([4,5],[6,6]) holds, implying nodelabelled isbn is to the right of node labelled publisher.

Example 8 (Using scope properties) : The task is tofind a mapping for Ssourceauthor=name in the targetschema Starget (Fig. 4). Starget has two nodes calledname: [2,2] and [5,5]. We assume synonymy betweenauthor and writer, top-down traversal, andSsourceauthor being already mapped to writer [1,2].We perform the descendant node check on [2,2] and[5,5] with respect to writer [1,2]. Using Property 3,Descendantð½5; 5�; ½1; 2�Þ ¼ false implies [5,5] is not adescendant of [1,2], whereas Descendant ([2,2],[1,2])is true. Thus, [2,2] is a descendant of [1,2], andauthor/name is mapped to writer/name.

4. PORSCHE

PORSCHE accepts a set of XML schema trees. Itoutputs an integrated schema tree and mappingsfrom source schemas to the integrated schema.

4.1. PORSCHE architecture

PORSCHE architecture (see Fig. 5) supports thecomplete semantic integration process involvingschema trees in a large-scale scenario. The integra-tion system is composed of three parts: (i) Pre-

Mapping, (ii) Label Conceptualisation and (iii) Node

book[0,3]

author[1,2]

name[2,2]

title[3,3]

Ssource Starget

writer[1,2]

name[2,2]

Fig. 4. Source and tar

Mapping, supported by a repository which housesoracles and mappings.

The system is fed a set of XML schema instances.Pre-Mapping module processes the input as trees,calculating the depth-first node number and scope(Definition 4) for each of the nodes in the inputschema trees. At the same time, for each schema treea listing of nodes is constructed, sorted in depth-firsttraversal order. As the trees are being processed, asorted global list of labels over the whole set ofschemas is created (see Section 4.2).

In Label Conceptualisation module, label conceptsare derived using linguistic techniques. We tokenisethe labels and expand the abbreviated tokens usingan abbreviation oracle. Currently, we utilise adomain specific user defined abbreviation table.Further, we make use of token similarity, supportedby an abbreviation table and a manually defineddomain specific synonym table. Label comparison isbased on similar token sets or similar synonymtoken sets. The architecture is flexible enough toemploy additional abbreviation or synonym oraclesor arbitrary string matching algorithms.

In Node Mapping module, the mediated schemacreator constructs the initial mediated schema fromthe input schema tree with the highest number ofnodes augmented with a virtual root. Then itmatches, merges and maps. Concepts from inputschemas are matched to the mediated schema. Themethod traverses each input schema depth-first,mapping parents before siblings. If a node is foundwith no match in the mediated schema, a newconcept node is created and added to the mediatedschema. It is added as the rightmost leaf of the nodein the mediated schema to which the parent of thecurrent node is mapped. This new node is used asthe target node in the mapping. The techniquecombines node label similarity and contextualpositioning in the schema tree, calculated with thehelp of properties defined in Section 3.

book[0,7]

title[7,7]

isbn[6,6]

info[3,6]

publisher[4,5]

name[5,5]

get schema trees.

Page 8: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Fig. 5. PORSCHE architecture consists of three modules: Pre-Mapping, Label Conceptualisation and Node Mapping.

4Lx.y refers to line y of algorithm in figure x.

K. Saleem et al. / Information Systems 33 (2008) 637–657644

The Repository is an indispensable part of thesystem. It houses oracles: thesauri and abbreviationlists. It also stores schemas and mappings, andprovides persistent support to the mapping process.

4.2. Algorithms and data structures

In this section we discuss the hybrid algorithmsused in PORSCHE. We make assumptions pre-sented in [16,22] which hold in single domainschema integration. Schemas in the same domaincontain the same domain concepts, but differ instructure and concept naming, for instance, name inone schema may correspond to a combination ofFirstName and LastName in another schema. Also,in one schema different labels for the same conceptare rarely present. Further, only one type of matchbetween two labels in different schemas is possible,for example, author is a synonym of writer.

Pre-Mapping comprises a number of functions:(1) depth-first, scope and parent node number

calculation, (2) creation of data structures for

matching and integration: schema node list (SNL)for each schema and global label list (GLL), and (3)identification of the input schema from which theinitial mediated schema is created. SNL and GLL

are later updated by the Node Mapping module (seeFig. 11). Pre-Mapping requires only one traversal ofeach input schema tree (Fig. 6).

Algorithm nodeScope (Fig. 7) is a recursivemethod, called from the preMap algorithm (L6.3)4

for each schema, generating a schema node list(SNL). It takes as input the current node, its parentreference, and the node list. In its first activation,reference to the schema root is used as the currentnode: there is no parent, and SNL is empty. A newnode list element is created (L7.2-6) and added toSNL. Next, if the current node is a leaf, a recursivemethod update (Fig. 8) is called. This adjusts therightmost node reference for the current SNL

element and then goes to adjust its ancestor entriesin SNL. This method recurses up the tree till it

Page 9: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Fig. 6. Pseudocode of Pre-Mapping.

Fig. 7. Pseudocode for node scope calculation and node list creation.

K. Saleem et al. / Information Systems 33 (2008) 637–657 645

reaches the root, and adjusts all elements on thepath from the current node to the root. Next, innodeScope (L7.9-10), the method is called for eachchild of the current node.

After calculating node scope, preMap adds thenew SNL to its global list of schemas and creates aglobal (sorted) label list (GLL) (L6.5-9). (L6.10)chooses the largest (smallest or random) schematree for subsequent formation of the initial mediatedschema. GLL creation is a binary incrementalprocess. SNL of the first schema is sorted andused as an initial GLL (L6.6-7) and then SNLs

from other schemas are iteratively merged with theGLL, as follows.

mergeLabelLists (see Fig. 9) is a variant of mergesort. At (L9.7-8), we skip labels shared betweenSNL and GLL and add links between them, to keeptrack of shared labels. Multiple occurrences of thesame label within an SNL; however, are needed, asthey help us in identifying labels attached to distinctnodes, e.g., author/name is different from publisher/name. If more distinct nodes with the same label areencountered in the matching process, they arehandled in an overflow area. In the GLL, multiple

Page 10: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Fig. 9. Pseudocode for label list merging, based on merge sort.

Fig. 8. Pseudocode for updating scope entries in SNL.

K. Saleem et al. / Information Systems 33 (2008) 637–657646

occurrences of the same label are equivalent to thelargest number of their separate occurrences in oneof the input schemas. This is further explained withthe help of the following example.

Example 9 (Repeated labels) : Assume three inputschema trees: S1, S2 and S3. Their SNLs beforemerge sort are: S1fbook;author;name;info;publisher;name; titleg, S2fbook;writer;name; publisher; address;name; titleg and S3fbook; author; address publisher;address;name; titleg.

The sorted GLL is faddress, address, author,book, info, name, name publisher, title, writerg.

Label Conceptualisation is implemented in label-

Similarity (Fig. 10). The method creates similarityrelations between labels, based on label semantics(Definition 2). This method traverses all labels,tokenises them (with abbreviations expanded), and

compares them exhaustively. If similarity is de-tected, a link is added in GLL recording labelsimilarity between two list entries.

After Pre-Mapping, PORSCHE carries out NodeMapping (see Fig. 11, mediation), which accom-plishes both schema matching and integration. Thishandles the semantics of node labels, along with thetree context, to compute the mapping. The algo-rithm accepts as input a list of SNLs (V) and theidentifier of the input schema with which otherschemas will be merged (j). It outputs the mediatedschema node list (MSNL), Vm, and a set ofmappings M, from input schema nodes to themediated schema nodes.

First, the initial mediated schema tree whichworks as a seed for the whole process is identified.To this purpose, a clone of the largest SNL iscreated with a new ROOT node, Vm (L11.2). The new

Page 11: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Fig. 10. Pseudocode for label similarity derivation, based upon tokenisation (tok).

Fig. 11. Pseudocode for schema integration and mapping generation.

K. Saleem et al. / Information Systems 33 (2008) 637–657 647

root node allows us to add input schema trees at thetop level of the mediated schema whenever ouralgorithm does not find any similar nodes. This isrequired to support the completeness of the finalmediated schema.

We use three data structures mentioned in thepreceding paragraphs: (a) inputSNLs,V, (b) GLL,and (c) MSNL, V m. Here, we give a briefexplanation of attributes associated with the nodeobjects in each data structure. V holds u SNLs,

Page 12: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657648

representing each input schema, where each elementof the list has attributes fdepth-first order number,scope, parent, mapping data fmediated schema node

reference, map typegg. GLL element attributes areflabel, link to similar labelg. V m’s elements compriseattributes fdepth-first order number, scope, parent,list of mappingsfinput schema identifier, node

numbergg. Data structures (a) and (c) are furthersorted according to the order of (b), i.e., sortedGLL, where each node object is placed aligned withits label (see Table 1). This helps in the clustering ofsimilar nodes (Definition 3) and speeds up mapping,as similar nodes can be looked up via an indexlookup. In mapping data in V, the mediated schema

node reference is the index number of the node inMSNL to which it has been matched, and maptype

records mapping cardinality (1:1, 1: n or n: 1).During mediation a match for every node (L11.4)

of each input schema (L11.3) is calculated, mapping itto the mediated schema nodes. For each input nodex, a set Vt of possible mappable target nodes in themediated schema is created, producing the targetsearch space for x. The criterion for the creation ofthis set of nodes is node label equivalence or partialequivalence (L11.5,6). V t can have zero (L11.13), one(L11.9) or several (L11.11) nodes.

If there is only one possible target node in themediated schema, method oneMatch (Fig. 12) isexecuted. (L12.2,8,10) compare the tree context of

Fig. 12. Pseudocode for mat

nodes x (input node) and xt1 (possible target nodein Vt), to ensure that a leaf node is mapped toanother leaf node. Second check, ancestorMap

(L12.3), ensures that at some point up the ancestorhierarchy of x and xt1 there has been a mapping, inaccordance with Property 5. This guarantees thatsub-trees containing x and xt1 correspond to similarconcept hierarchies. This increases the match con-fidence of nodes x and xt1.

Alternatively, if the target search space V t hasmore than one node, algorithm multipleMatch

(Fig. 13) is executed. Here, we check the descendantProperty 3 (L13.4) for each node xt in V t (L13.3).Descendant function verifies if xt lies in the sub-treeof the node to which the parent is mapped, that iswithin the scope. The function returns true for onlyone node or none.

Method addNewNode (L11.14,13.9,12.6) adds anew node xt as the rightmost sibling of the node towhich the parent of x is mapped. Adding a newnode requires a scope update: the ancestor andright-hand side nodes of the new node have now alarger scope. Properties 5 and 6 are used to find theancestor and right-hand side nodes. For ancestornodes, scope is incremented, and for right-hand sidenodes, both node number and scope are incremen-ted.

Mapping method mapOneToOne, creates a 1:1map from x to xt, whereas mapOneN creates a

ching one target node.

Page 13: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Fig. 13. Pseudocode for matching several target nodes.

Fig. 14. Input schema trees Sa and Sb.

K. Saleem et al. / Information Systems 33 (2008) 637–657 649

mapping from x (a leaf) to a set of leaves in the sub-tree rooted at xt (non-leaf), which is a 1: n mapping.This mapping is considered to be an approximatemapping but follows the semantics of leavesmapping to leaves. And, similarly, mapNOne is acomposite mapping of leaves under x, to leaf xt.Node Mapping algorithm (Fig. 11) integrates allinput nodes into the mediated schema and createscorresponding mappings from input schemas to themediated schema.

4.3. A schema integration example

Fig. 14 shows two trees after Pre-Mapping. A listof labels created in this traversal is shown inTable 1a. The two nodes with the same label name

but different parents are shown (labels with index 2and 3). The last entry in the list is the ROOT of themediated schema. In the label list the semanticallysimilar (equivalent or partially equivalent) labels aredetected: author (index 0) is equivalent to writer

(index 7).

A matrix of size uq (here 2� 9) is created, where u

is the number of schemas and q the number ofdistinct labels in the GLL, see Table 1b. Each matrixrow represents an input schema tree and each non-null entry contains the node scope, parent node link,and the mapping, which is initially null. Matrixcolumns are alphabetically ordered. The largerschema, Sb, Fig. 14, is selected for the creation ofthe initial mediated schema Sm. A list of size q,Table 1c, is created to hold Sm, assuming the samelabel order as in Table 1a and b.

The Node Mapping algorithm takes the datastructures in Table 1 as input, and producesmappings shown in Table 2b and the integratedschema in Table 2c. In the process, the input schemaSa is mapped to the mediated schema Sm, i.e.,extended schema Sb. The mapping is taken as thecolumn index (Table 2b, in bold) of the node in Sm.Saving mappings as column index gives us theflexibility to add new nodes to Sm, by appending toit. Scope values of some nodes may be affected, asexplained previously, because of the addition of new

Page 14: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Table 2

After Node Mapping

(a) Global Label List

0 1 2 3 4 5 6 7 8

author book name name price pub title writer ROOT

(b) Mapping Matrixa: Row 1 is Sa and Row 2 is Sb

1,2,0,7 0; 3;�1; 1 2,2,1,2 3,3,0,4

0; 5;�1; 1 2,2,1,2 4,4,3,3 3,4,0,5 5,5,0,6 1,2,0,7

(c) Final Mediated Schemab

1,7,0, 3,3,2, 5,5,4, 7,7,1, 4,5,1, 6,6,1, 2,3,1, 0; 7;�11.0, 2.0 1.2, 2.2 2.4 1.3 2.3 2.5 1.1, 2.1

aColumn entry is (node number, scope, parent, mapping).bBold numbers refer to sourceSchema.node.

Table 1

Before Node Mapping

(a) Global Label List

0 1 2 3 4 5 6 7 8

author book name name price pub title writer ROOT

(b) Input Schema Matrix: Row 1 is Sa and Row 2 is Sb

1,2,0 0; 3;�1 2,2,1 3,3,0

0; 5;�1 2,2,1 4,4,3 3,4,0 5,5,0 1,2,0

(c) Initial Mediated Schema

1,6,0 3,3,2 5,5,4 4,5,1 6,6,1 2,3,1 0; 6;�1

Schema column entry is (node number, scope, parent).

K. Saleem et al. / Information Systems 33 (2008) 637–657650

nodes, but column indices of all previous nodesremain the same. Intuitively, none of the existingmappings are affected.

Node Mapping for input schema Sa (Table 1brow 1) starts from the root node Sa[0,3]book. Itfollows the depth-first traversal of the input tree, toensure that parent nodes are mapped before thechildren. Sa[0,3] has only one similar node in themediated schema tree Sm i.e., node 1 at index 1. Soentry 1 at index 1 for Sa records the mapping(Sa½0; 3�;Sm½1; 6�), see Table 2b. Information regard-ing mapping is also saved in the mediated schemanode as 1.0 (node 0 of schema 1). Next node to mapin Sa is [1,2]author, similar to Sm[1,2]writer. Bothnodes are internal and the method ancestorMap

returns true since parent nodes of both are alreadymapped. The resulting mapping for node with labelauthor is entry 7. For node with label 2 name, thereare two possibilities, label 2 (index 2) and label 3(index 3) in the mediated schema. Descendant is truefor node at index 2 (author/name,writer/name), andfalse for 3 (author/name,pub/name). Hence, 2 is thecorrect match.

The last node to map in Sa is [3,3]price. There isno node in Sm with a similar label, so a new node isadded to Sm, recorded by an entry in the columnwith label ‘price’ in the mediated schema (Table 2c).A new node is created as the rightmost sibling ofthe node in the mediated tree to which the parentnode of current input node is mapped, i.e., aschild of book. The scope and parent are accordinglyadjusted for the new node, and for its ancestors,and for right-hand side nodes, scope and numberare adjusted. There is no effect on the exist-ing mapping information. Finally, a mapping iscreated from the input node to this new targetnode.

If a new node is added in the middle of the tree,its ancestor’s scope is incremented by one. And,accordingly, right-hand side nodes (satisfying Prop-erty 6) have their order numbers and scopesincremented by one. The implementation keepstrack of the columns for the next node accordingto depth-first order. Thus, the final mediatedschema tree can be generated from the finalmediated schema row by a traversal starting from

Page 15: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Fig. 15. Mediated schema with mappings.

K. Saleem et al. / Information Systems 33 (2008) 637–657 651

the ROOT. The mediated schema with mappings(dotted lines) is shown in Fig. 15.

4.4. Complexity analysis

The worst-case time complexity of PORSCHEcan be expressed as a function of the number ofnodes in an input schema, n on average, where thenumber of schemas is u. The algorithm has thefollowing phases.

Data structure creation—all input schemas aretraversed as trees (depth-first) and stored as nodelists (SNLs), a global list of node labels (GLL) iscreated, and one of the schemas is selected as thebase for the mediated schema, with time com-plexity OðnuÞ. � Label list sorting—the number of labels is nu and

sort time complexity for the global list isOðnu logðnuÞÞ.

� Label similarity—we compare each label to the

labels following it in the list. Time complexity ofOððnuÞ2Þ.

� Node contextual mapping—nodes are clustered,

based on node label similarity. Worst case iswhen all labels are similar and form one clusterof size nu. We compare each node once to allother nodes, using tree mining functions. Atworst, nu nodes are compared to nu nodes, withtime complexity OððnuÞ2Þ. Realistically, there aremultiple clusters, much smaller than nu, whichimproves performance.

Overall time complexity is OððnuÞ2Þ, with spacecomplexity OðnðuÞ2Þ, as the largest data structure isa matrix of size nu2 used in the schema mappingprocess.

The theoretical time and space complexity ofPORSCHE is similar to other algorithms wesurveyed (Section 2). Because we perform only onefull traversal of all the schemas and reuse the

memory-resident data structures generated in thetraversal of the fist schema, instead of performingrepeated traversals used in alternative approacheswhich use a script to repeatedly merge any twoschemas, we can generate a mediated schema fasterthan other approaches. Our experiments confirmthe complexity figures derived above, see Section 5.

5. Experimental evaluation

The prototype implementation uses Java 5.0. APC with Intel Pentium 4, 1.88GHz processor and768MB RAM, running Windows XP was used. Weexamine both the performance and quality ofschema mediation, with respect to mapping qualityand mediated schema integrity.

5.1. Performance evaluation

Performance is evaluated as the number ofschemas or nodes processed versus the time requiredfor matching, merging and mapping. We selectedthree sets of schema trees from different domains,shown in Table 3. OAGIS and xCBL are realschemas, whereas BOOKS are synthetic schemas.

Experiments were performed with different num-bers of schemas (2–176). Algorithm performancewas timed under three different similarity scenarios:

(A)

Label string equivalence. (B) Label token set equivalence. (C) Label synonym token set equivalence.

Fig. 16 shows a comparison of three similarityscenarios, A, B, and C, for sets of 2, 4, 8, 16, 32, 64,128, and 176 schemas from BOOKS. There is novisible difference in the performance of variousmatchers. This is possibly due to the fact thatsynthetic schemas vary little in their labels.

Fig. 17 shows time in seconds for OAGIS andxCBL. The execution time for PORSCHE depends

Page 16: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

Table 3

Schema domains used in the experiment

Domain OAGIS xCBL BOOKS

Number of Schemas 80 44 176

Average nodes per schema 1047 1678 8

Largest schema size 3519 4578 14

Smallest schema size 26 4 5

0

200

400

600

800

1000

0 50 100 150 200

time

(ms)

Schemas

Performance Scalability (Books Schemas)

Same trend for case A,B and C

Fig. 16. BOOKS: integration time (ms) as a function of the

number of schemas.

0

200

400

600

800

1000

0 20 40 60 80 100

time

(s)

Nodes (x 1000)

Performance Scalability (Nodes)

OAGIS Schemas

XCBL Schemas

Fig. 17. Comparison of schema integration times (seconds) for

real web schemas.

0

100

200

300

400

500

600

700

0 10 20 30 40 50 60 70 80 90

time

(s)

Nodes (x 1000)

Performance Scalability (Nodes)

Similarity A

Similarity B

Similarity C

Fig. 18. Integration of OAGIS schemas.

0

200

400

600

800

1000

0 2010 4030 6050 8070

time

(s)

Nodes (x 1000)

Performance Scalability (Nodes)

Similarity A

Similarity B

Similarity C

Fig. 19. Integration of xCBL schemas.

K. Saleem et al. / Information Systems 33 (2008) 637–657652

upon the number of schemas to be integrated, andappears to be quadratic in the number of nodes, aspredicted by the complexity analysis (Section 4.4).Figs. 18 and 19 show the time in seconds against thenumber of nodes processed for the three similaritymethods (A, B, and C), for OAGIS and xCBLschemas. xCBL schemas (Fig. 19) are slower tomatch than OAGIS schemas (Fig. 18). This is due tothe higher average number of nodes in xCLB

schemas. It takes approximately 600 s to match 80OAGIS schemas, while 44 xCLB schemas requireabout 800 s.

In both cases there is a slight difference inmatching times for categories A–C (Figs. 18 and19), due to different label matching strategies. A isthe fastest, as it works only on the labels. B isslightly slower, as labels have to be tokenised, and Cis the slowest, as one needs to match synonyms aswell. These evaluation cases show that PORSCHEhas acceptable performance on an office PC for alarge number of schemas.

5.2. Mapping quality

As available benchmarks focus on two schemas ata time, and our approach targets a set of schemas,we can only compare mapping quality for twoschemas at a time. From the set of 176 booksschemas we randomly selected five pairs of schemas

Page 17: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

0

0.2

0.4

0.6

0.8

1

Books1

Books2

BOoks3

Books4

Books5

PurOrder

Oagis1

Oagis2

Rec

all

Different Scenarios

Mapping Quality

PORSCHE SF COMA++

Fig. 20. Precision for PORSCHE, COMAþþ and Similarity

Flooding for eight pairs of schemas.

0

0.2

0.4

0.6

0.8

1

Books1

Books2

BOoks3

Books4

Books5

PurOrder

Oagis1

Oagis2

Rec

all

Different Scenarios

Mapping Quality

PORSCHE SF COMA++

Fig. 21. Recall for PORSCHE, COMAþþ and Similarity

Flooding for eight pairs of schemas.

0

0.2

0.4

0.6

0.8

1

Books1

Books2

BOoks3

Books4

Books5

PurOrder

Oagis1

Oagis2

F-m

easu

re

Different Scenarios

Mapping Quality

PORSCHE SF COMA++

Fig. 22. F-measure for PORSCHE, COMAþþ and Similarity

Flooding for eight pairs of schemas.

0

0.2

0.4

0.6

0.8

1

PORSCHE SF COMA++

Ave

rage

Qua

lity

Tools Comparison

Mapping Quality

Precision Recall F-measure

Fig. 23. Global comparison of quality metrics.

K. Saleem et al. / Information Systems 33 (2008) 637–657 653

(sizes ranging between 5 and 14 nodes), frompurchase order one schema pair (schema sizes 14and 18 nodes), and from OAGIS two schema pairs(schema sizes from 26 to 52 nodes).5 We computedmapping quality for COMAþþ and SF: SimilarityFlooding (RONDO) and compared those toPORSCHE. The quality of mappings is evaluatedusing precision, recall and F-measure. The resultsare summarised in Figs. 20–23. For PORSCHE,similarity uses token level synonym lookup.

The comparison shows almost equivalent map-ping quality for PORSCHE and COMAþþ, sinceboth are driven by similar user defined pre-matcheffort of constructing synonym tables used to derivelabel similarity. Similarity flooding demonstratesbetter recall than PORSCHE or COMAþþ. The

5Schemas and results detail at http://www.lirmm.fr/PORSCHE/

results/.

results also demonstrate local schema differences.Books2 is the hardest case, and we discovered that itcontained inverted paths (book/author is matchedto author/book).

PORSCHE has also been evaluated within ourbenchmark, XBenchMatch [23]. The benchmarkuses precision, recall, F-measure, overall measure,and other metrics of structural schema proximity.Structural proximity measures are based on differ-ences in the number and size of sub-trees sharedbetween input and mediated schemas, and schemaproximity compares two schema trees which arebeing matched. PORSCHE demonstrated goodresults in comparison with COMAþþ and RON-DO (Similarity Flooding).

5.3. Mediated schema integrity

We gave a brief overview of the integrated schemaquality evaluation in Section 2. Since our testdomains are XML schema instances with only

Page 18: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESS

0

0.2

0.4

0.6

0.8

1

Books1

Books2

BOoks3

Books4

Books5

PurOrder

Oagis1

Oagis2

Min

imal

ty

Different Scenarios

Integrity of Integrated Schema

Fig. 24. Minimality of PORSCHE schema integration for the

selected pairs of schemas.

0

0.2

0.4

0.6

0.8

1

0 20 40 60 80 100120

140160

180

Min

imal

ty

Schema Number

Integrity of Integrated Schema

Fig. 25. Minimality in PORSCHE for 176 book schemas. About

176 experiments are shown (along the x-axis), with each schema

selected, in turn, as the initial mediated schema.

K. Saleem et al. / Information Systems 33 (2008) 637–657654

element label information, type consistency can-not be evaluated. In the following, we consideronly completeness and minimality. As discussedin Section 2.2 and algorithm implementation(Figs. 11–13), when a match is not found in themediated schema, a new concept node is addedto it. Mapping from source schema tree to the newnode is then established. This demonstrates thatPORSCHE inherently fulfills the completenesscriterion of mediated schema integrity.

To evaluate minimality, we take the pairs ofschemas considered in mapping quality evaluation.For each pair of schemas we perform matching andmerging. The resulting integrated schema is scruti-nised manually for redundancies. The results(minimality ¼ 1� ðredundantNodes=totalNodesÞ)are shown in Fig. 24. An inspection of integratedschemas showed that minimality decreased whereinput schemas being matched had inverted paths.To further investigate the integrity of our approach,we calculated the minimality of the integratedschema created by merging 176 books schemas(small in size and amenable to human inspection).We used an ideal integrated schema (manuallycreated, containing 17 nodes) to assess redundancy.A batch of 176 experiments was performed, eachselecting one schema as the seed mediated schema.Fig. 25 shows that minimality fluctuates between0.68 and 0.85.

6. Related work

Most schema matching systems compare twoschemas at a time and aim for quality matching but

require significant human intervention. CUPID [6],COMAþþ [2], S-Match [4], QOM [12], GLUE [3]are some of these tools presented in this section.Several surveys [2,7,8] argue that extending thematching to data integration is time consuming andlimited in scope. Matching of two large bio-medicaltaxonomies has been demonstrated by Do andRahm using COMAþþ, and Mork and Bernstein[24] using CUPID and similarity flooding (RONDO[15]). Quick Ontology Matching (QOM) matcheslarge ontologies with performance in view. Large-scale schema matching has also been investigated inthe web interface schema integration [16,22] usingdata mining. PORSCHE’s goal is to match, mergeand map in a hybrid manner, whereas most ofthe tools separate the two activities of matchingand integrating, which makes them unsuitablefor automated integration scenarios, as needed ine-commerce.

COMAþþ [2] is one of the most recentmatching and merging tools. It is a compositematcher producing quality matches for a pair ofschemas. It provides an extensible platform whichcan combine several matchers. It uses user definedsynonym and abbreviation tables as a pre-mappingeffort. It has a comprehensive interface for navigat-ing through the matches produced by the software.The user verifies and selects or edits the matchresults to finalise the mapping between the twoschemas. Based on these mappings, COMAþþ canalso generate a merged schema, which has to bevalidated by the user. The authors of COMAþþgive a comprehensive evaluation of match qualityresults but do not quantify the quality of the merged

Page 19: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657 655

schema. In large-scale schema matching, COMAþþ requires a significant amount of human interven-tion. First, the user identifies fragments in the twoschemas to be mapped. This aims to manage thenamespace/ include characteristics of XML sche-mas. Then, individual fragments from the sourceschemas are mapped to fragments of the targetschema, one by one, and match results are saved ina relational database for subsequent comparison.COMAþþ uses a bottom up hierarchical ap-proach: if leaves are similar, then parent nodesmay also be similar. In PORSCHE we use a reverseapproach: given similarity between parent orancestor nodes, we hope to discover similarityamong the descendants. We believe that ourapproach is more appropriate in single domainschemas with large tree depth, which requiresmatching of the structural context of descendantnodes in a robust manner.

S-Match [4] is a hybrid matcher which carries outsemantic matching by using the WordNet diction-ary. It tackles the matching as a propositionalsatisfiability problem. It demonstrates better map-pings than COMAþþ and CUPID [6] but hasworse performance. S-Match employs 16 (13element level and three structural level) matchalgorithms. It does not fulfill the requirement formerging and creating mappings from the source tothe integrated schema. PORSCHE specificallyaddresses the data integration needs of largeenvironments where efficiency is important.

Mork and Bernstein [24] present a case study ofmatching two large ontologies of human anatomy.They use lexical and hierarchical matching modulesfrom CUPID [6] and a structural algorithm calledSimilarity Flooding [15] to find mappings. Thehierarchical algorithm goes one step further thanCOMAþþ. The similarity of descendants is usedto evaluate ancestor similarity, and the approach isnot limited just to parent–child relationships. Theauthors argue that the hierarchical approachproduced disappointing results because of differ-ences in context. They report that a lot ofcustomisation was required to get satisfactoryresults. Our approach, in comparison, considersonly top-down similarity, and looks for matchingancestors only. It seems that further research isneeded to properly define the role of node context inXML, in particular the scope of context, in terms ofnode distance and structure within a tree.

QOM [12] is a semi-automatic (RDF based)ontology mapping tool. It uses heuristics and

ontological structures to classify candidate map-pings as promising or less promising. It usesmultiple iterations, where in each iteration thenumber of possible candidate mappings is reduced.It employs label string similarity (sorted label list) inthe first iteration, and, afterwards, it focuses onmapping change propagation. The structural algo-rithm follows top-down (level-wise) element simi-larity, which reduces time complexity. PORSCHEmatching also uses label similarity using linguisticrather than lexical algorithms. In QOM, in thesecond iteration, depth-first search is used to selectthe appropriate mappings from among the candi-date mappings.

Another interesting schema matching domainunder active research is matching across queryinterfaces of structured web databases. Web pagelayout forms a hierarchy backed by a databaseschema. For certain Web domains, such as travel,these interfaces are very numerous. [16,22] handleholistically the integration of these structuredlayouts as a mining problem. He and colleagues[16] observe that Web database query interfaces inthe same domain are usually semantically similar, asa label is often unambiguous in a domain but it canhave several meanings in a dictionary, and synon-ymous labels are rarely co-present in the sameschema. However, grouping of elements such asLastName and FirstName under the same parent iscommon, as those together form a larger concept.He et al. [16] utilise data mining techniques on theinput forms and data ranges, for elements availablefrom the web pages. The technique is more biasedtowards the accuracy of integration than perfor-mance. Thus, it is ideal for scenarios where theschemas are small, as in web query/data interfaces.Our method targets tree schema structures (orschemas which can be converted to trees) whichcome with minimum information (just elementlabels) and each schema can have thousands ofelements. This gives PORSCHE a much broaderapplication domain.

Clustering, both at the element and schema level,is often used in matching and merging. Smiljanicand co-authors [19] suggest element clusteringwithin a schema, in a repository of schemas whichare matched to a given person schema. The tech-nique shows how the combinatorial nature of thematching problem can be reduced to polynomial,and, further, to linear complexity, by using elementclustering. The method only caters for schemamatching. Lee et al. [17] present an integration

Page 20: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657656

method based on the clustering of similar DTDs ofXML sources. The method works incrementally, bycreating clusters of similar schemas. Similar schemaselection is based on element similarity. Thesolution uses external oracles defining elementsimilarity. It mainly focuses on the integrationprocess but lacks the mediation aspect.

Instance level schema matching tools work on asample mapping set. They use data and example setsto learn a strategy to compute equal instances andconcepts, and are characterised by a high timecomplexity. For example, GLUE [3] uses relaxationlabelling to calculate schema mappings: mappingassigned to an entity is typically influenced by thefeatures of the node’s neighbourhood in the graph.Multiple iterations have to be performed to confirma mapping. Overall, such methods are slow. Anadvantage of instance based learners is theircapability to learn mapping expressions, as illu-strated in iMAP [21], a variant implementation ofGLUE.

7. Lessons learned

Implementing PORSCHE gave us a good insightinto the existing schemas and their properties. Welearned that newer standard schemas available onthe web can be used together with linguistictechniques, and that string similarity should notbe the sole criterion for label matching. Elementnames are now becoming more structured andmeaningful, and new matching techniques areincreasingly based on external oracles like Word-Net. Linguistic label similarity can help in minimis-ing the target search space for matching verysignificantly. In a large-scale scenario, performingclustering just once will enhance performance. Theapplication of tree mining further reduces the timeto select the correct target for mapping.

For schema integration purposes, the selection ofone of the schemas as the seed for the initialmediated schema follows no fixed rule. We triedrandom, the smallest and the largest schema fromthe input set of schemas as the initial selection.However, using such heuristics showed that theselection depending on schema size does not boostperformance. Rather, it is the structure of the initialseed schema which influences performance: if theinitial mediated schema contained more frequentsub-trees shared with the set of input schema trees,the speed of matching improved.

Overall, the architecture of PORSCHE is flexible,and can accommodate new syntactic and linguisticsimilarity algorithms. Most importantly, PORSCHE isscalable, as demonstrated, and can be used in large-scale data integration. At present, it uses a limitedarray of linguistic methods but its domain specificmatching quality is approximately equivalent to othercurrent tools.

In the future, we will investigate the applicationof this technique in information systems based onP2P architectures. Secondly, we want to enhance thelinguistic matching part of the system. Our study ofthe tree mining technique reveals that it can beutilised to identify relationships between the ele-ments and groups of elements within a single treeand in a forest of schema trees. This will help inidentifying subsumptions and overlaps for n:mcomplex mappings [25]. Another possible extensionis the development of persistent indexes for incre-mental matching.

8. Conclusions

We presented a novel schema integration method,PORSCHE, which has shown very promisingresults for large-scale schema integration. It uses atree based depth-first traversal algorithm for match-ing, merging and mapping a set of schema trees. Toimprove performance, we adapted a technique fromtree mining used in the clustering of similar nodelabels. This minimises the target search space for anode match and improves performance. PORSCHEuses an optimistic top-down depth-first matchtraversal (parents are mapped before children, andthe left sub-tree is traversed before the right sub-tree), since our assumption is that we utilise it in adomain specific environment. This helps in using thestructural contextual semantics of nodes for betterquality matching.

The novelty of our method is fourfold. First, wesupport automated schema matching. Second, wenot only generate matches, but also build anintegrated schema at the same time. Third, ourapproach scales to hundreds of schemas. Fourth,the use of tree mining techniques for schemamatching is also new in this field.

Acknowledgements

Z.B. is supported by ANR-05-MMSA-007 andE.H. is funded by an EU Marie Curie Fellowship.

Page 21: PORSCHE: Performance ORiented SCHEma mediation

ARTICLE IN PRESSK. Saleem et al. / Information Systems 33 (2008) 637–657 657

References

[1] P.A. Bernstein, S. Melnik, M. Petropoulos, C. Quix,

Industrial-strength schema matching, SIGMOD Rec. 33

(4) (2004) 38–43.

[2] H.-H. Do, E. Rahm, Matching large schemas: approaches

and evaluation, Inf. Syst. 32 (6) (2007) 857–885.

[3] A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, A.Y.

Halevy, Learning to match ontologies on the semantic web,

VLDB J. 12 (4) (2003) 303–319.

[4] F. Giunchiglia, P. Shvaiko, M. Yatskevich, S-Match: an

algorithm and an implementation of semantic matching, in:

ESWS, 2004, pp. 61–75.

[5] J. Lu, S. Wang, J. Wang, An Experiment on the matching

and reuse of XML schemas, in: ICWE, 2005, pp. 273–284.

[6] J. Madhavan, P.A. Bernstein, E. Rahm, Generic schema

matching with cupid, in: VLDB, 2001, pp. 49–58.

[7] E. Rahm, P.A. Bernstein, A survey of approaches to

automatic schema matching, VLDB J. 10 (4) (2001) 334–350.

[8] P. Shvaiko, J. Euzenat, A survey of schema-based matching

approaches, J. Data Semantics IV (2005) 146–171.

[9] C. Batini, M. Lenzerini, S.B. Navathe, A comparative

analysis of methodologies for database schema integration,

ACM Comput. Surv. 18 (4) (1986) 323–364.

[10] A. Jhingran, Enterprise information mashups: integrating

information, simply—keynote address, in: VLDB, 2006,

pp. 3–4.

[11] A. Halevy, Z. Ives, D. Suciu, I. Tatarinov, Schema

mediation in peer data management systems, in: ICDE,

2003, pp. 505–516.

[12] M. Ehrig, S. Staab, QOM—quick ontology mapping, in:

ISWC, 2004, pp. 683–697.

[13] T. Asai, H. Arimura, S. Nakano, Discovering frequent

substructures in large unordered trees, in: ICDS, 2003,

pp. 47–61.

[14] M.J. Zaki, Efficiently mining frequent embedded unordered

trees, Fundam. Informaticae 66 (1–2) (2005) 33–52.

[15] S. Melnik, E. Rahm, P.A. Bernstein, Rondo: a programming

platform for generic model management, in: SIGMOD,

2003, pp. 193–204.

[16] B. He, K.C.-C. Chang, J. Han, Discovering complex

matchings across web query interfaces: a correlation mining

approach, in: KDD, 2004, pp. 148–157.

[17] M.-L. Lee, L.H. Yang, W. Hsu, X. Yang, XClust: clustering

XML schemas for effective integration, in: CIKM, 2002,

pp. 292–299.

[18] C. Pluempitiwiriyawej, J. Hammer, Element matching across

data-oriented XML sources using a multi-strategy clustering

model, Data Knowl. Eng. 48 (3) (2003) 297–333.

[19] M. Smiljanic, M. van Keulen, W. Jonker, Using element

clustering to increase the efficiency of XML schema

matching, in: ICDE Workshops, 2006, p. 45.

[20] M. da Conceicao Moraes Batista, A.C. Salgado, Informa-

tion quality measurement in data integration schemas, in:

Quality in Databases, VLDB workshop, 2007.

[21] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, P. Domingos,

iMAP: discovering complex semantic matches between

database schemas, in: ACM SIGMOD, 2004, pp. 383–394.

[22] W. Su, J. Wang, F. Lochovsky, Holistic query interface

matching using parallel schema matching, in: ICDE, 2006,

pp. 122–125.

[23] F. Duchateau, Z. Bellahsene, E. Hunt, XBenchMatch: a

benchmark for XML schema matching tools, in: VLDB,

2007, pp. 1318–1321.

[24] P. Mork, P.A. Bernstein, Adapting a generic match

algorithm to align ontologies of human anatomy, in: ICDE,

2004, pp. 787–790.

[25] T. Pankowski, E. Hunt, Data merging in life science data

integration systems, in: Intelligent Information Systems,

2005, pp. 279–288.