MAPPING REUSE FOR META-QUERIER CUSTOMIZATION
By
XIAO LI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2011
© 2011 Xiao Li
In memory of my grandfather, Peixin Li
ACKNOWLEDGMENTS
First, I would like to thank my supervisor, Dr. Randy Chow, for providing the
guidance and support throughout the course of my research. None of this would have
been possible without his patience and support. Second, I would also like to thank the
other members of my advisory committee (Dr. Jih-Kwon Peir, Dr. Markus Schneider, Dr.
Tuba Yavuz and Dr. Raymond Issa), for teaching me what constitutes quality research.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
1.1 Meta-querier Customization
1.2 Mapping Reuse in Community-driven Meta-querier Customization
1.3 Contributions
1.4 Dissertation Outline
2 RELATED WORK
2.1 Data Integration
2.2 Meta-queriers
2.3 Requirement Specification in Data Integration Systems
3 META-QUERIER CUSTOMIZATION
3.1 Research Questions
3.2 MQ-Customizer
3.2.1 Customization Workflow
3.2.2 System Architecture
3.3 Ontology-centric Mass Customization
3.4 Reuse-oriented Meta-querier Construction and Maintenance
4 MAPPING MODELING
4.1 Modeling of Query Forms
4.2 Change-oriented Mapping Modeling
4.2.1 Motivating Scenario
4.2.2 Modeling
4.3 Ontology-based Mapping Modeling
4.3.1 Motivating Scenario
4.3.2 Modeling
4.4 Metadata in Mapping Modeling
4.5 Related Work
4.5.1 Mapping Modeling for Data Integration
4.5.2 Schema Element Clustering for Data Integration
5 MAPPING REPOSITORY
5.1 M-Ontology
5.2 MO-Repository and M-Table
6 REUSE-ORIENTED MAPPING DISCOVERY
6.1 Problem Definition
6.2 Discovery through M-Table
6.3 Discovery through M-Ontology
6.4 Validating & Correcting Mappings
6.5 Related Work
6.6 Conclusion
7 ONTOLOGY-CENTRIC SOURCE SELECTION
7.1 Related Work
7.2 Capability-based Recommendation
7.2.1 Modeling
7.2.2 Demand Capture and Matching
7.3 Conclusion
8 IMPLEMENTATION AND EXPERIMENTS
8.1 System Structure of Mapping Repositories
8.2 Experiments for Ontology Construction
8.3 Experiments for Mapping Discovery
8.4 Experiments for Source Selection
9 CONCLUSION AND FUTURE DIRECTIONS
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
8-1 Statistics of the domains
8-2 Capability-based matching by MOM
8-3 Capability-based matching by NNM
LIST OF FIGURES
3-1 The workflow of MQ-Customizer.
3-2 The system architecture of MQ-Customizer.
4-1 A query form and its graph model.
4-2 Mapping evolution scenarios.
4-3 The life cycle of a mapping.
4-4 The incremental formulation of a mapping object.
4-5 Three job search forms with the mappings.
4-6 A fragment of a mapping ontology (E-Nodes and G-Nodes).
4-7 A fragment of a mapping ontology (G-Nodes and A-Nodes).
4-8 The life cycle of a node/edge.
5-1 The flowchart of ontology construction.
5-2 (a) An example of a mapping object. (b) The evolution of a global query form. (c) The evolution of a local query form.
7-1 The ontology-centric source selection algorithm.
8-1 The repository structure.
8-2 Experiment results of schema element classification without schema repetition.
8-3 Experiment results of schema element classification with schema repetition.
8-4 Experiment results of mapping discovery through M-Ontology.
8-5 Experiment results of concept searching for schema elements.
8-6 Experiment results of mapping discovery through M-Ontology and MO-Repository.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
MAPPING REUSE FOR META-QUERIER CUSTOMIZATION
By
Xiao Li
May 2011
Chair: Dr. Randy Chow
Major: Computer Engineering
The primary goal of this dissertation is to investigate the methodologies for
developing a system framework that supports the customization of meta-queriers
over the Web. Meta-queriers facilitate effective information retrieval from multiple
and heterogeneous data sources that are accessible only through query interfaces.
They are virtual data integration systems that shield users from data heterogeneity
and source locations. Due to the ever-increasing number of available data sources
and sophistication of users, it is highly desirable (and in many cases necessary) to
allow for customization of meta-queriers based on users’ preferences.
Customization comes at a cost: the scalability of the underlying system and the
complexity of human-machine interaction. This dissertation investigates new open
research issues for the implementation of meta-queriers with respect to customization.
There are two aspects of the scalability issue: a potentially large number of
customized meta-queriers and a large repository that stores mapping information
between entities in different data sources for interoperability. The complexity of
human-machine interaction results from the need for collaboration in sharing the
mapping information. The innovation of the proposed approach is that it tackles both the scalability and complexity problems holistically through a system framework
based on a synergetic combination of two concepts: a community-driven collaborative
meta-querier construction and an ontology-based sharing of meta-querier components.
Meta-queriers are domain-specific. Domain users constitute a loose community with
some common interest and might need to collaborate. In our solution approach, we turn the need for human collaboration from a challenge into an opportunity: the scalability problem can be alleviated by distributing the meta-querier construction workload across the community and harvesting its human power (assistance).
Another significant challenge turned opportunity is the reuse potential from a possibly large number of existing meta-queriers and their components. The innovation in our
ontology-based mapping repository is its ability to fully exploit the impact of reuse for
human-friendly construction and maintenance of meta-queriers.
CHAPTER 1
INTRODUCTION
1.1 Meta-querier Customization
As the number of information sources accessible through HTML
query interfaces [20, 28, 87] continues to grow, data integration has become a very
challenging issue for effective and flexible information retrieval from multiple and
heterogeneous data sources in the WWW. Meta-queriers (a.k.a., vertical search
engines) are virtual data integration systems that shield users from data heterogeneity
and source locations. They provide the users with a uniform query interface (a.k.a.
global schema) for simultaneous access to a set of integrated data sources in the same
domain. Users do not need to input repetitive information to each source interface (a.k.a.
local schema). User queries over the global schema are reformulated into queries in terms of the respective local schemas, and then the query results from data sources
are presented to the users in an integrated form. As opposed to physically integrating
the data sources, virtual data integration offers more flexibility when the underlying
system involves a large number of data sources and a large variety of user needs, and
in particular, when the number and the variety are frequently changing.
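As a sketch of the reformulate-and-merge flow just described, the toy code below fans a global-schema query out to each source through a per-source mapping and merges the answers. All names (`reformulate`, `meta_query`, the field names) are illustrative assumptions, not artifacts of this dissertation.

```python
# Illustrative sketch of virtual integration (assumed names throughout):
# a mapping renames global fields into the field names one source expects.
from typing import Callable, Dict, List

Record = Dict[str, str]
Mapping = Dict[str, str]  # global field -> local field

def reformulate(global_query: Record, mapping: Mapping) -> Record:
    """Rewrite a global-schema query in terms of one local schema."""
    return {mapping[f]: v for f, v in global_query.items() if f in mapping}

def meta_query(global_query: Record,
               sources: List[Callable[[Record], List[Record]]],
               mappings: List[Mapping]) -> List[Record]:
    """Fan the query out to every source and merge the local results."""
    merged: List[Record] = []
    for source, mapping in zip(sources, mappings):
        merged.extend(source(reformulate(global_query, mapping)))
    return merged
```

A real meta-querier would also translate operators and deduplicate results; the sketch only shows the mapping-based reformulation step.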
Much effort has been made on the construction of a single domain-specific
meta-querier (e.g., WISE-Integrator [59] and Meta-Queriers [29]). In general, these
data integration systems are designed for specific user requirements to integrate a given
set of data sources. However, the differences in user interests and preferences require
different source selection and global schemas (interfaces). For example, kayak.com
and zoomtra.com are two meta-queriers for searching airfares. Although kayak.com
is more popular in general, zoomtra.com finds better airfares to India most of the time.
The difference between them is mainly due to different selections of data sources, and
consequently different interfaces. As the number of data sources continues to grow, it is
highly desirable that users have the freedom to customize their meta-queriers: 1) by
selecting their preferred data sources according to the functionalities, data quality, and
site credibility of the sources; 2) by tailoring the global schema with needed functional
components. This is the ultimate goal of the proposed research.
Furthermore, user-specified source selection is not static but dynamic, and the
query interfaces of data sources are also evolving. For example, users might want to
insert a new data source into an existing meta-querier. The changes on source selection
might affect the functionalities and the global schema of a meta-querier, and can even
destroy the whole system. This means the existing global schema and mappings need
to be updated to adapt to these changes. Therefore, meta-queriers should be allowed
to dynamically evolve to adapt to the changing user needs and source interfaces (i.e.,
dynamic re-configurability). Most often, this important requirement is not fully considered
in the existing meta-querier research, especially in the context of multiple customized
meta-queriers.
This dissertation explores the open design and implementation issues, and outlines some preliminary work on a community-driven infrastructure and an ontology-based mapping management scheme to effectively achieve this goal.
1.2 Mapping Reuse in Community-driven Meta-querier Customization
Allowing ad hoc customization of meta-queriers without some planned strategies
could naturally result in a large number of independently constructed meta-queriers, many of them overlapping or redundant, with duplicated effort. Since meta-queriers
are designed for a specific application domain and users in the domain share some
common interests, it is beneficial for meta-querier builders to form a community for the
purpose of collaborative construction of meta-queriers through sharing of knowledge on
schema transformation. In the following, we summarize four research fronts along the
line of how the concepts of mapping reuse are to be investigated in the context of data
integration.
Mapping modeling: Mapping modeling is a critical problem in meta-querier customization. It is highly desirable to design new approaches to mapping modeling
for facilitating mapping management, utilization and reuse. The major challenges
include the potentially large scales and untraceable evolution of mappings. First, to
match various user needs and numerous data sources, the meta-querier customization
must solve the issues caused by the interoperability between the corresponding global
schemas of meta-queriers and local schemas of data sources. The potentially large volume of schemas (of both types) implies the need to model a large number of mappings, which is a new research topic. Second, to support dynamic re-configurability of meta-queriers, the mappings must evolve to adapt to emerging user needs and
the updated schemas. Since the changes on local schemas are normally unpredictable
and untraceable, the evolution of mappings cannot be modeled through the traditional
techniques in schema evolution and versioning.
Mapping sharing: Meta-querier construction centers on the mapping repositories shared by the meta-queriers in the same domain. Separate storage of mappings might
cause a high degree of data redundancy and potential update anomalies. Through
sharing of construction information, construction of a meta-querier can reuse the
previous work of its own and of peer meta-queriers. It is necessary to develop reuse-oriented repositories to store shared components for effective construction. The
major design considerations should include: 1) Facilitation of mapping reuse by both humans and machines. The internal structure should not only be processable by machines but also be easy to understand and manage, even for non-expert volunteers. 2)
Best-effort avoidance of human intervention in repository construction. The original
goal of the repositories is to reduce the human effort in building meta-queriers, and thus repository construction should not introduce more human involvement than the repositories save.
Mapping discovery: In the context of meta-querier customization, query-form matching needs to be revisited due to the increasing scalability issues caused by customization. In essence, it is equivalent to the classical research problem of schema matching. It is well known that full automation is AI-complete [54, 91]: too much ambiguity and uncertainty exists in real-world applications. Inevitably, human intervention cannot be avoided in any practical solution. As a considerable number of mappings are to be discovered holistically rather than individually, care must be taken to reduce the overall workload by avoiding repetitive tasks and reusing the existing mappings.
Source selection: The selection of data sources determines the content coverage of meta-queriers. User-driven source selection is a convenient and straightforward method to customize meta-queriers. To achieve accurate selection, the system needs to understand the capabilities of data sources, which can be learned from the related query forms. To understand the semantics of query forms, ontology methodologies are commonly used. However, the construction and maintenance of ontologies are labor-intensive and error-prone. Since the mappings between the heterogeneous query forms of the data sources can be viewed as instances of relations connecting concepts, the potentially large number of unordered mappings could be employed to form an ontology. Thus, it is desirable to design a mapping-driven solution to ontology construction and an ontology-centric solution to source selection.
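One plausible reading of this mapping-driven idea, offered purely as an illustration: if each mapping links two schema elements, then elements connected by chains of mappings can be grouped into candidate concept nodes, e.g., with a union-find pass. The function and field names below are assumptions, not the dissertation's algorithm.

```python
# Hypothetical sketch: group schema elements connected by mappings
# into candidate concept nodes using union-find with path halving.
from typing import Dict, List, Tuple

def concepts_from_mappings(mappings: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each mapping merges the groups of its two elements.
    for a, b in mappings:
        parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for element in parent:
        groups.setdefault(find(element), []).append(element)
    return {root: sorted(members) for root, members in groups.items()}
```

For example, mappings (depart, from), (from, origin), and (arrive, to) would yield two candidate concepts, {depart, from, origin} and {arrive, to}; a real construction algorithm would of course need to validate and split such groups.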
1.3 Contributions
Concentrating on the research fronts discussed above, our research makes the
following specific contributions:
• Architecture design: Our research is the first to explicitly articulate an open
three-layer architecture, called MQ-Customizer, for meta-querier customization.
The first layer, called service layer, is to capture the individual needs from users
for meta-querier discovery and construction. The bottom layer, called the builder layer, is composed of a reuse-oriented auto builder and a community-driven mass builder. The second layer, the info layer, stores and manages the information for
the operations in the other two layers. Additionally, for the first time, we introduce
into data integration a novel concept, mass customization, which originally is a
marketing and manufacturing strategy [34, 109] for combining low cost of mass
production and high flexibility of individual customization. Its open issues and
potential solution approaches are elaborated in the discussions of the proposed
system architecture.
• Mapping modeling: To capture the semantics of numerous and unordered
mappings, we introduce the concepts of ontology to model the mappings. These
semantics-based concepts make mappings easier for humans to understand and for machines to reason about. The structure of the ontology helps
to guide the abstraction, optimization and validation of the mapping information
for effective sharing and reuse by the community. Furthermore, we propose a
change-based model, the mapping object, to record the evolution of mappings for reuse. Unlike traditional techniques, our model can support the
untraceable evolution of mappings.
• Mapping repository: Two different repositories (called M-Ontology and MO-Repository) are respectively designed based on the proposed ontology-based and change-based mapping models. To adapt to the dynamic customization of meta-queriers, we also develop construction algorithms through incrementally
inserting individual mappings. The insertion procedures also consider the ease
of human understanding, which is one of the major design principles in the
community-driven approaches. In addition, the proposed mapping-driven ontology
construction is also a novel approach to building a task-specific ontology in the field of
knowledge representation.
• Mapping discovery: Although many attempts have been made at mapping discovery (a.k.a. schema matching), the discovery of complex mappings is still an
open problem. We propose a promising approach to this problem with assistance
of MO-Repository and M-Ontology. In essence, this is a reuse-oriented solution that reuses a mapping's own history (i.e., its previous versions) and its peers (i.e., the other mappings in the same domain). Our approach is straightforward to ordinary users
so that they can be easily involved in mapping discovery and validation.
• Source selection: To better reuse the pre-integrated data sources, we provide a capability-based source selection algorithm to recommend potentially desired sources to users. Complementary to the query-based solutions, our approach is
based on the query capabilities of data sources, instead of the sampled source
contents. Unlike the prior work on query-form clustering, our approach is able to
distinguish the query capabilities of the data sources whose query forms have
been clustered in the same domain.
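The change-based mapping object mentioned in the mapping-modeling contribution could, in a minimal hypothetical form, record a mapping as a log of changes so that any past version can be replayed for reuse. The class below is a sketch under that assumption, not the actual model of Chapter 4.

```python
# Hypothetical sketch of a change-based "mapping object": store the
# sequence of changes rather than whole versions, and replay a prefix
# of the log to reconstruct any historical version for reuse.
from typing import List, Set, Tuple

Correspondence = Tuple[str, str]      # (global element, local element)
Change = Tuple[str, Correspondence]   # ("add" | "remove", correspondence)

class MappingObject:
    def __init__(self) -> None:
        self.changes: List[Change] = []

    def add(self, corr: Correspondence) -> None:
        self.changes.append(("add", corr))

    def remove(self, corr: Correspondence) -> None:
        self.changes.append(("remove", corr))

    def version(self, upto: int) -> Set[Correspondence]:
        """Replay the first `upto` changes to rebuild that version."""
        state: Set[Correspondence] = set()
        for op, corr in self.changes[:upto]:
            if op == "add":
                state.add(corr)
            else:
                state.discard(corr)
        return state
```

Because only deltas are stored, this shape tolerates the unpredictable, untraceable evolution discussed above: a new observation simply appends changes rather than requiring a known schema version to diff against.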
1.4 Dissertation Outline
The rest of the dissertation is organized as follows: Chapter 2 discusses the current
state of the art on data integration, meta-queriers and the requirement specification
in data integration systems. We make a system-level comparison of our work with the
others. Chapter 3 focuses on the system design of MQ-Customizer. In this chapter, we
first discuss the open issues in meta-querier customization. We then present our
proposed workflow of the customization process. To realize the two-phase customization
process, we design a three-layer architecture for the MQ-Customizer. Under this system,
we propose the potential solutions with the relevant challenges and directions for future
work. The solutions include ontology-based mass customization and reuse-oriented
construction and maintenance. Chapter 4 first presents query-form modeling for
meta-queriers, and then two models of mappings from the perspectives of their
semantics and evolution history. These two mapping models are change-based and
ontology-based modeling. Finally, we review the various mapping models published in
the literature in comparison with our models. Based on our proposed two models, we
develop two corresponding mapping repositories, i.e., M-Ontology and MO-Repository,
discussed in Chapter 5. This chapter also includes their incremental construction
algorithms with human-friendliness consideration. Chapter 6 and Chapter 7 respectively
apply these two repositories to address two critical research problems, mapping
discovery and source selection. In Chapter 8, we presents the implementation and
experimental analysis for repository construction, source selection and mapping
discovery. Finally, we conclude with a summary of future work in Chapter 9.
CHAPTER 2
RELATED WORK
2.1 Data Integration
Data integration aims at providing users a uniform way to manipulate and manage the (heterogeneous) data residing at different sources. Major solutions can be
categorized into two groups based on the location of data: physical integration and
virtual integration. Physical integration (a.k.a., data warehousing) loads data into a
repository through data extraction and transformation from a collection of data sources.
Users can directly interact with the physical repository without any query reformulation.
As opposed to physically integrating the data sources, virtual integration provides users
a virtual view (a.k.a., global schema) of underlying data sources without any physical
storage. User queries through such a virtual view are translated to respective local
queries and then the combined local query results are returned. The query translation
is based on the mappings between the global schema and the local schemas of the
underlying sources.
Content on the Web is blossoming: it has experienced tremendous growth in (semi-)structured information, in particular in the databases behind the deep Web [28, 87].
Virtual integration techniques are widely applied in many data integration frameworks
[54, 73] to accommodate the rising needs of integrating data sources over the Web.
These data sources are autonomous agents, independent of data integration systems. Most such sources are only accessible through HTML query interfaces. Thus, data completeness and freshness are almost impossible to achieve in physical integration. Studies [15, 32, 58] have shown that virtual integration
is the only practical solution to accurate information retrieval and integration from online
data sources.
2.2 Meta-queriers
Meta-queriers are virtual data integration systems. The most significant benefit
of a meta-querier is its ability to utilize multiple local search engines to query multiple
different but related objects at the same time. Many such commercial meta-queriers,
such as kayak.com, have implemented similar but simplified functions. Their internal
designs are proprietary and often unreported. With the prevalence of e-commerce,
multiple research groups [29, 59, 95, 127] have attempted to analyze and build
meta-queriers from various perspectives. These systems can be viewed as an automatic
integration of web databases to facilitate efficient web exploration.
However, these frameworks only focus on the initial construction of meta-queriers,
but ignore the maintenance issues caused by evolving user needs. In addition, they do
not describe any specific details for the storage of mappings, which are essential for
the construction and operation of meta-queriers, such as schema matching, schema
merging, and query translation.
The construction of a single meta-querier can be divided into five major research
problems: result extraction [6, 25, 45, 71, 126, 136, 153, 154], query-form extraction
[19, 59, 60, 62, 70, 97, 115, 151], result merging [8, 23, 31, 42, 44, 118, 132, 146],
query-form matching [55, 56, 59, 90, 135, 140, 143, 144, 152], and global-form generation
[3, 41, 59, 142]. Since the first two research problems are not affected by meta-querier
customization, the related research results can be directly applied to MQ-Customizer.
Personalized result merging and query refinement have been widely studied in the
previous research on the construction of customized meta-search engines [42, 49,
63, 68, 83, 85, 99, 145]. These customized meta-search engines mainly function as
document/text retrieval systems, and thus, their local query forms are simple and
uniform. They do not address the challenges of query-form matching and global-form
generation in terms of complex local forms. The scalability and maintenance issues resulting from meta-querier customization did not arise in these customized meta-search engines.
We will focus on query-form matching and global-form generation that need
to be revisited in the context of meta-querier customization. In essence, they are
equivalent to two classical research problems of schema matching [12, 46, 65, 72,
103, 116] and schema merging [12, 14, 18, 82, 92, 113, 114, 128, 133]. Although
much research effort has been made in different contexts (e.g., semi-structured data,
ontology and ER model), the existing solutions still heavily rely on human involvement.
It is impractical to solve them separately due to the increasing scalability issues
caused by customization. To tackle these challenges, we propose an ontology-based
community-driven MQ-Customizer by leveraging the opportunities brought by meta-querier
customization (i.e., human contribution from a potentially large number of users,
and reuse potentials from a possibly large scale of existing meta-queriers and their
components).
2.3 Requirement Specification in Data Integration Systems
Data integration systems are regarded as mediators to effectively match information
needs and resources. The scale and diversity of user needs have been recognized in
the context of data integration over the Web [29, 87]. Generally, users can specify their
preference and needs explicitly or implicitly. MetaQuerier [29] and WISE-Integrator [59]
(data integration systems) allow users to explicitly select a domain of interest. For each
domain, a single mediated schema is created by combining the local schemas of all
inclusive data sources. Such a mediated schema can only provide the functionalities
that are supported by all the underlying sources; otherwise, the results from data sources cannot conform to the users' original queries. The more data sources it includes, the fewer functionalities it can support. Moreover, different users may
have different selection criteria for data sources, e.g., data quality, site credibility, and brand loyalty. Thus, this approach can hardly satisfy diverse user needs and preferences.
PAYGO-based data integration architecture [87] attempts to obtain the information needs
from the keywords input by users. However, the extracted user needs are usually not
exact and it is difficult for machines to transform user queries from flat keywords to
structured queries. MySearchView [59, 85] enables users to specify their preferred data
sources for generating the personalized data integration systems, but it provides users
a very limited source type for selection, i.e., only single-keyword-box data sources. In
reality, query forms are much more complex, i.e., they include more controls. Due to
the limit on source types, MySearchView does not address the challenges of mapping
discovery between complex query forms, which is at the core of data integration system construction.
Our work follows the basic idea of MySearchView on personalized source
selection, but employs a different construction approach, i.e., community-driven and
reuse-oriented construction, for the purpose of eliminating its limitation on data sources.
Since such customization leads to a large variety of data integration systems to be
constructed, the construction burdens should be distributed to a considerable number of
cooperative members in a community, which can include users, technicians and domain
experts. In this setting, we propose a “human-friendly” mapping repository for efficient
mapping storage, management and discovery. Unlike the traditional data integration
systems, these data integration systems in the same domain share the same mapping
repository. Its main consideration is that the construction and maintenance of the data
integration systems might benefit from the previous outcomes of repetitive tasks. This
assumption is confirmed by our experimental results: the more mapping information a mapping repository owns, the more benefit the construction and maintenance of these systems can obtain.
CHAPTER 3
META-QUERIER CUSTOMIZATION
3.1 Research Questions
A trend toward an increasing variety of meta-queriers is noticeable even within the same application (e.g., airline ticket booking). It is impossible to build a one-size-fits-all
meta-querier that can satisfy diverse user needs and preferences [158]. Therefore, a
meta-querier construction system intended for a large audience must provide users with
the freedom to customize their meta-queriers. Many challenging issues arise and we
propose to address the fundamental questions on the design and implementation of an
infrastructure to support dynamic customization of meta-queriers:
Q1: How can users’ needs be best accommodated through the customization of meta-queriers?
From the users’ viewpoints, the global query form and the selection of source
databases are the most critical (or perhaps the only) factors that influence the contents
retrieved from the meta-queriers. First, source selection determines the content
coverage of meta-queries. The meta-queriers are virtual data integration systems
[73] that do not physically store any information. That is, the returned contents are
completely determined by the underlying data sources. Second, modifying the global
schema is the only way for the users to express their demands on the results. All the
returned contents should conform to the constraints set by the users, i.e., by
modifications of the controls (clicking radio buttons, selecting from drop-down menus,
entering text, etc.). In a sense, the contents of global forms are also decided by the
selection of data sources, since the global forms should only consist of the functionalities
supported by every underlying data source; otherwise, the results might include
some or many records that violate the original user-specified conditions. Therefore, source
selection is arguably one of the most critical problems in meta-querier customization. In
our research, we focus on source selection for customization.
Q2: What are the challenges for supporting meta-querier customization?
Automating the construction of data integration systems is one of the primary
research areas in the information integration community [40, 54, 123, 125]. The
construction of meta-queriers has also received considerable attention [29, 59, 95, 127].
Much effort has been made especially on two fundamental research issues: i)
constructing and maintaining global schemas; ii) constructing and maintaining schema
mappings. All the existing complete/partial solutions mainly aim at building a single (or
very few) data integration system, and most of them still rely heavily on the involvement
of domain experts and system designers. Supporting meta-querier customization is
likely to result in the creation of a large number of meta-queriers, which is impractical
for a small set of domain experts and system designers to construct and to maintain.
Our research tackles both construction and maintenance issues with an approach that
maximizes machine automation while minimizing the required human involvement.
Q3: Why is the reuse-oriented approach essential for supporting community-
based customization?
Another opportunity brought by meta-querier customization is the reuse potential
of the previous human efforts. In the same application-specific domain, users share
common interests and similar background. To meet user needs, data sources offer
highly overlapped functionalities. The designs of query forms often follow the same
trend. The involved vocabularies and control types are “clustering in localities and
converging in sizes” [28]. When more and more meta-queriers are available in the same
domain, we believe that the reuse of these existing meta-queriers (i.e., system-level
reuse) and their components (i.e., component-level reuse) is arguably the best approach
to construct new meta-queriers. Our proposed strategies, models and algorithms
capture the essence of reuse.
Figure 3-1. The workflow of MQ-Customizer.
3.2 MQ-Customizer
3.2.1 Customization Workflow
Before digging into the details of MQ-Customizer, we first present the workflow
of the customization process. Imagine a user who wants to construct a meta-querier
tailored to his/her individual needs. As illustrated in Figure 3-1, the whole process of
customization is abstracted in two phases from the viewpoint of the user:
Phase 1: Resource Selection. From a user-specified domain, Meta-querier
Recommender and Source Recommender recommend a ranked list of meta-queriers
and a ranked list of data sources, respectively, based on user requirements. The
requirements are derived from the user inputs: the preferred data sources and the
corresponding application domain. The domain can be selected only from the existing
ones in the system. If the user can find an appropriate meta-querier in the meta-querier
list, the final goal is achieved by reusing an existing one; otherwise, the process enters
the next phase to create a new one.
Phase 2 : Meta-querier construction. A new meta-querier can be generated by
integrating a set of data sources from scratch, or built on top of an existing one by
removing the unwanted data sources and inserting some additional sources. Following
Figure 3-2. The system architecture of MQ-Customizer.
the user requirements, Auto Builder attempts to construct a new meta-querier without
human assistance. If the effort fails, Mass Builder is invoked to continue the task with
Auto Builder. Mass Builder relies on a mass collaboration strategy. We believe the users
not only have enough motivation to build and maintain their own meta-queriers, but
also are willing to volunteer to assist others. These users along with a small number
of domain experts can practically form a collaborative domain-specific community for
meta-querier construction. Finally, a customized meta-querier is delivered to the user.
3.2.2 System Architecture
To realize the two-phase customization process, we present a three-layer architecture
for the MQ-Customizer in Figure 3-2. For simplicity, this figure only shows the major
components relating to the three research strategies discussed in the introduction,
i.e., mass customization, mass collaboration, and reuse-oriented construction and
maintenance.
Service Layer provides interactive services to assist users in discovering and
constructing meta-queriers based on their individual needs. Meta-querier Recommender
performs the recommendation of the existing meta-queriers with the pre-integrated
data sources. With the assistance of Compatibility Checker , Source Recommender
guides and recommends users in their selection of data sources. Both recommenders
should be fully automated to provide users with an interactive environment. The success
of recommendation is mainly determined by the understanding of user requirements
and resource capabilities/functionalities. To improve the performance of run-time
recommendation, a functionality-based algorithm should be designed for matching
user needs and system resources (including all the meta-queriers and their underlying
data sources). The mining algorithms for discovering user needs and preferences have
been extensively studied in user adaptive systems, e.g., content-based [10, 105, 106]
and collaborative [5, 124, 137] filtering algorithms. They should be combined with our
ontology-based resource selection to implement the three major components in the
service layer.
Builder Layer contains the reuse-oriented Auto Builder and the community-driven Mass
Builder. Two major research issues here are schema generation (i.e., generating global
query forms) and schema matching (i.e., discovering query-form mappings) for the
construction and maintenance of a large number of meta-queriers. Unfortunately, both
are well-known AI-complete problems [54, 91]. Most existing approaches still rely heavily
on human involvement, especially in the context of dynamic and complex schema
integration. Auto Builder is proposed to address the two problems by maximally reusing
the previous validated outcomes. The research focus is to design human-friendly
reuse-oriented Schema Matcher and Generator to reduce and to facilitate the inevitable
human interaction. Auto Builder is augmented by the Mass Builder, which exploits
collaborative intelligence to further tackle the same problems. However, collaborative
activities in Mass Builder might lead to potential inconsistencies or errors introduced
due to intentional or unintentional mistakes by community members. Thus, a secondary
focus in this layer is to reduce the adverse influences through the incorporation of three
error-handling mechanisms: error avoidance, identification and recovery.
Information Layer stores and manages the information for the operation in the other
two layers. The first responsibility of Profile Recorder is to acquire user profiles (e.g., the
interaction between users and meta-queriers) and system logs (e.g., the involvement
records of community members), and then to store them in Profile Repository. The
second one is to record each human-validated version of meta-queriers (i.e., their
global and local schemas and the associated mappings) into Schema Repository.
With the assistance of community members, Repository Manager constructs and
maintains two complementary domain-specific mapping repositories: an evolution-
oriented repository (MO-Repository ) that records the evolution of mappings, and a
task ontology (M-Ontology ), which is the core of the whole system. M-Ontology has
multiple responsibilities through functioning as: (1) a mapping repository for efficiently
storing and managing a large number of mappings in a human-friendly manner; (2)
a communication platform for simple and secure mass collaboration of meta-querier
construction and maintenance; and (3) a knowledge base for maximal reuse of, and
complex reasoning over, previously validated schema matchings and generations.
Because of its critical roles in our system, our research addresses the issues in the
whole life cycle of M-Ontology, from its design and generation to its utilization and
maintenance. The
rest of this dissertation explains our work on M-Ontology with promising experimental
results over real-world data sources.
3.3 Ontology-centric Mass Customization
The proposed customization process of meta-queriers is a kind of mass customization
[34, 109], which lies between two extreme types of customization: one-size-fits-all
and individual personalized meta-queriers. It is desirable to minimize the number of
necessary and useful customizations to reduce the overall workload of meta-querier
construction and maintenance in the community. Our solution strategy focuses on the
best-effort reuse of pre-built meta-queriers and pre-integrated data sources in meeting
diverse user needs. The system-level and component-level reuse in meta-querier
construction will be discussed in Section 3.4. This section mainly copes with the
recommendation problem for this reuse (as discussed in Phase 1 of Section 3.2.1): how
to intelligently select meta-queriers and data sources in the domain to meet specific user
needs.
Related Work : This recommendation problem can be regarded as a source selection
problem in a non-cooperative environment, where the local data sources are autonomous
and self-governing, and their contents can be acquired only through submitting queries.
Different from sampling in distributed text retrieval systems [22, 51, 64], surfacing
the contents hidden behind complex HTML forms is very difficult [115, 140, 141] and
even infeasible [89]. That means this recommendation problem cannot be solved
through the traditional source selection techniques (e.g., selecting local sources by their
contents [50, 131] or scales [130, 131] estimated from query-based samples). Thus, it is
necessary to design new algorithms for such an emerging source selection problem.
Specific Research Tasks : The research should investigate how to exploit the query
capabilities [27, 74, 149] of resources for achieving more accurate recommendation.
The recommendation procedure is composed of three steps: acquiring user needs
from the current inputs and history records, calculating the matching scores between
the needs and each meta-querier and data source, and presenting users a list of
meta-queriers and data sources sorted by the matching scores. To integrate query
capabilities into this procedure, the following research tasks should be conducted:
• Resource modeling : Data sources and meta-queriers are the two primary resources
in our system. Query forms for these resources can be regarded as sets of query
conditions [62, 151], which we call schema elements. Each element consists of a control
and its associated attributes. A single query component or a set of components represents
a specific query capability [27, 74, 149] that a resource possesses. The components
with the same capability (normally in different query forms) are clustered to form a
higher-level capability concept (i.e., a G/A node) in M-Ontology, as discussed in Section
3.3. In a sense, M-Ontology is a domain-specific query capability repository, where
each connected G/A-Node sub-graph generally corresponds to an abstracted query
capability. Thus, each resource is modeled as a set of abstracted query capabilities.
Additionally, other resource properties (e.g., popularity, stability and credibility) that can
be discovered dynamically from interaction records are included in the resource
model.
• Requirement modeling and user interaction : In terms of query capabilities, user
needs and interests can be modeled by three preference vectors storing preferred
data sources, meta-queriers and capabilities, respectively. Three interaction mechanisms can be
introduced to construct these three vectors: 1) user inputs of the preferred data sources;
2) user selection of the preferred query capabilities from a list automatically generated
from M-Ontology; 3) user selection of meta-queriers with wanted/unwanted sources.
The preference vectors can be directly learned from the current and previous user
behaviors. In our current solution, based on the explicit capability specification from
users, the capability vector is acquired by grouping the capabilities of the user-preferred
meta-queriers and sources.
• Matching : The calculation of matching scores between needs and resources is based
on capability similarity. Similarity between user needs and resource capabilities can
be identified through comparison of the users’ preference vectors with the existing
resources and their query capabilities. It can be treated as a multi-criteria decision
making problem. Each criterion corresponds to the desirability of a specific capability.
By combining all the criteria, a utility function is needed to calculate the matching scores
that quantify the desirability of each data source for a particular user need.
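As a minimal sketch of this matching step, assume each resource is modeled as a set of abstracted capability identifiers from M-Ontology and user needs as a weighted capability preference vector; all identifiers, weights and resource names below are hypothetical, not taken from our system:

```python
# Illustrative sketch of capability-based matching (hypothetical data model).
# Each resource (data source or meta-querier) is a set of abstracted
# capability IDs; a user's preference vector maps capability IDs to weights.

def matching_score(preference: dict, capabilities: set) -> float:
    """Utility function: sum the desirability weights of the capabilities
    the resource supports, normalized by the total preference weight."""
    total = sum(preference.values())
    if total == 0:
        return 0.0
    covered = sum(w for cap, w in preference.items() if cap in capabilities)
    return covered / total

def rank_resources(preference, resources):
    """Return (name, score) pairs sorted by descending matching score."""
    scored = [(name, matching_score(preference, caps))
              for name, caps in resources.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)

# Hypothetical example: capabilities in an air-ticket booking domain.
prefs = {"depart_date": 3.0, "return_date": 3.0, "cabin_class": 1.0}
sources = {
    "S1": {"depart_date", "return_date"},
    "S2": {"depart_date", "return_date", "cabin_class"},
    "S3": {"depart_date"},
}
ranking = rank_resources(prefs, sources)
# S2 covers all preferred capabilities and therefore ranks first.
```

Each criterion here is simply the weight of one desired capability; richer utility functions could also fold in the dynamically discovered properties (popularity, stability, credibility) mentioned above.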
3.4 Reuse-oriented Meta-querier Construction and Maintenance
Dynamic customization of meta-queriers is our ultimate research goal. To achieve
this goal, a meta-querier must have two unique capabilities: 1) Customizability: its
construction enables users to specify their needs and preferences. 2) Dynamic
reconfigurability: it can evolve to adapt to the changing user needs and source interfaces.
Both system users and data sources are self-governing agents that are autonomous
and independent from meta-queriers. The maintenance issues are even more
important than the initial construction, especially in such a dynamic virtual environment.
Therefore, MQ-Customizer provides services for the construction and maintenance of
user-customized meta-queriers.
Specific Research Tasks : The research tasks are categorized as: a) design-level tasks
– analysis of user requirements, and methodologies of construction and maintenance; b)
implementation-level tasks – considerations and approaches to schema matching and
generation.
• User requirements : A fundamental question about the design of a meta-querier is:
“what kind of global query forms are expected by the normal users?” In global query
forms, schema elements can be classified based on how they match with local query
forms: 1) fully-overlapped elements are the global elements that have corresponding
elements in each and every underlying local schema; 2) partially-overlapped elements
are those that do not. We observe that the partially-overlapped elements greatly affect the
effectiveness of meta-queriers. Imagine a meta-querier integrating only two data
sources DA and DB. Assume that, besides fully-overlapped elements, its global schema
has two partially-overlapped elements Ea and Eb, from DA and DB respectively. If users
input some query conditions in both Ea and Eb (e.g., entering some text), the query
results should be null or further filtered by the meta-querier; otherwise, the results
might include some or many records that violate the original user-specified conditions.
Such result filtering and other post-processing operations are not hard to implement in
an integration system over a relatively small number of pre-configured data sources (e.g.,
Information Manifold [74], TSIMMIS [75]). However, the implementation of result filtering
becomes more challenging in the construction and maintenance of meta-queriers, where
a large number of fully-autonomous data sources need to be integrated. Best-effort
approaches are widely used in result extraction and merging.
Motivated by this observation, we propose two additional strategies complementary
to result filtering. The first strategy is to reduce the number of partially-overlapped
elements in global schemas. This can be achieved by introducing interaction mechanisms
in the source selection phase that guide and recommend users in their selection of data
sources. The second strategy is to reduce the possibility of inappropriate user inputs.
This can be accomplished in the global-interface generation phase by separating
partially-overlapped elements from fully-overlapped elements, and placing them in
groups based on the overlapping relations of their sources. We also provide an option to
exclude partially-overlapped elements from global schemas whenever possible.
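The fully/partially-overlapped distinction can be sketched directly, assuming each local schema is reduced to a set of element concept identifiers; the hotel-domain names below are hypothetical:

```python
# Sketch: classify global schema elements as fully- or partially-overlapped.
# Each local schema is represented as a set of element concept IDs.

def classify_elements(global_schema, local_schemas):
    """Split global elements by whether every local schema supports them."""
    fully, partially = [], []
    for elem in global_schema:
        if all(elem in local for local in local_schemas):
            fully.append(elem)
        else:
            partially.append(elem)
    return fully, partially

# Hypothetical hotel-booking sources DA and DB.
DA = {"city", "check_in", "check_out", "price_range"}
DB = {"city", "check_in", "check_out", "star_rating"}
global_schema = ["city", "check_in", "check_out", "price_range", "star_rating"]

fully, partially = classify_elements(global_schema, [DA, DB])
# fully     -> ['city', 'check_in', 'check_out']
# partially -> ['price_range', 'star_rating']
```

A generator following the second strategy above would place the two partially-overlapped elements in separate groups (one per supporting source), or omit them entirely when the exclusion option is chosen.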
• Methodologies of construction and maintenance : Our solution supports two types
of construction procedures: 1) Scratch construction is to construct a new meta-querier
from a given set of data sources. It is the default approach in all current meta-querier
construction systems [29, 59, 127]. To construct a meta-querier from scratch, the first
step is to find the correspondence among the local query forms (i.e., schema matching).
These element correspondences are used to generate global schemas (i.e., schema
generation). For translating user queries through global schemas to respective local
queries, the data transformation rules from global schema elements to local elements
must be discovered (i.e., schema matching). 2) Prefabricated construction is to build a
new meta-querier by re-configuring an existing meta-querier (e.g., inserting new data
sources or deleting included sources). Surprisingly, no sound solution exists in the
literature to cope with this construction approach. In our adaptation of the prefabricated
construction solution, MQ-Customizer first recommends meta-queriers and data
sources based on the specific requirements derived from user inputs such as keywords,
preferred data sources and essential functionalities. After the selection of a specific
meta-querier, the construction process is completed by two subsequent operations,
source deletion and source insertion, for removing unwanted and inserting preferred
user data sources, respectively.
After the initial construction, the global schemas and the associated transformation
rules need to be updated, if data sources change their original query interfaces or users
want to further insert/remove data sources. In essence, such system maintenance
is equivalent to a reuse-oriented construction process similar to the concept of reuse
in software development [11]. Thus, in our proposed solution, system maintenance
is integrated with prefabricated construction through system-level reuse (i.e., reusing
pre-existing meta-queriers). The two major operations, source deletion and insertion, for
implementing system-level reuse are discussed as follows.
To holistically construct and maintain a large number of meta-queriers, we take
into account three major design considerations: 1) Information-sharing: Construction of
a meta-querier can reuse the previous work of peer meta-queriers in the same domain
through sharing of construction information in the common knowledge bases, i.e.,
M-Ontology and Schema Repository; 2) Incremental construction: Construction of a
meta-querier always reuses its self-history as the first attempt of schema matching
and generation instead of starting from scratch; 3) Human-friendliness: the reasoning
procedures and results of construction should be easy to understand and manage even
for non-expert volunteers.
• Source insertion and deletion are two basic operations for incremental updates
of global schemas and their associated mappings for insertion and deletion of local
schemas. Intuitively, source deletion can be implemented by simply removing the
corresponding schema elements (if not shared) in the global schema and its associated
mappings [59]. However, this solution of source deletion cannot undo the effect of
source insertion. Consider, for example, an air-ticket booking meta-querier consisting of
two sources with different child-ticket age ranges, {2-16} and {4-12}. The integrated
global schema contains five sub-ranges on passenger age, {0-2, 2-4, 4-12, 12-16,
16+}, or else errors could occur. When one of the sources is removed, the global
schema remains the same. After successive source insertions and deletions on this
meta-querier, the global schema can become unnecessarily complex and even hard to
understand. This problem is worsened in our system since system-level reuse is applied
in both prefabricated construction and system maintenance. There is a need to develop
algorithms for source deletion and insertion that are inverses of each other. To insert or
delete a data source, MQ-Customizer first attempts to reuse the existing meta-queriers
with the same underlying data sources by searching Schema Repository. Then, the
ontology-based schema matcher and merger are called to update global schemas and
the associated mappings.
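The age-range example can be made concrete. One way to obtain deletion and insertion operations that are inverses of each other is to recompute the global partition from the current set of sources, rather than patching the previous global schema; the sketch below uses this recompute-from-sources strategy with a simplified range representation (all names are illustrative):

```python
# Sketch: recompute the global age partition from the current source set,
# so that deleting a source exactly undoes the effect of inserting it.
# A source range is a (low, high) pair; high=None denotes open-ended ("16+").

def global_partition(source_ranges, floor=0):
    """Build non-overlapping sub-ranges from the sources' breakpoints."""
    points = sorted({p for lo, hi in source_ranges for p in (lo, hi)})
    bounds = [floor] + [p for p in points if p > floor]
    parts = list(zip(bounds, bounds[1:]))  # bounded sub-ranges
    parts.append((bounds[-1], None))       # trailing open range, e.g. 16+
    return parts

sources = {"S_A": (2, 16), "S_B": (4, 12)}
both = global_partition(sources.values())
# -> [(0, 2), (2, 4), (4, 12), (12, 16), (16, None)]

# Deleting S_A: recompute instead of patching, yielding the simpler schema.
remaining = {k: v for k, v in sources.items() if k != "S_A"}
after_delete = global_partition(remaining.values())
# -> [(0, 4), (4, 12), (12, None)]
```

Because the partition is a pure function of the surviving sources, any sequence of insertions and deletions that returns to the same source set returns to the same global schema, avoiding the accumulation of stale sub-ranges described above.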
• Schema matching is to discover the mappings from one schema to another
schema. Although many-to-many mappings with conversion rules are pervasive in
real-world applications, fully automated discovery of them is almost impossible in most
existing learning-based [35, 39, 56, 86] and template-based [37, 43, 144] solutions.
Ontology-based approaches [7, 94, 117, 148, 157] show promising results for matching
schemas through an external ontology that specifies domain-specific knowledge.
However, there is no universal/generic ontology or even a small set of them in real-world
applications [100]. Constructing a customized global ontology [36, 110, 121] and
matching various local ontologies [46] are also labor-intensive, time-consuming and
error-prone tasks. Additionally, most ontologies do not contain conversion rules, which
are hard for machines to reason about. To tackle these problems, the proposed M-Ontology is
generated from schemas and mappings. As more schemas and mappings are inserted
into the ontology, more concept nodes and T-Edges can be generated and subsequently
more mappings can be discovered.
The basic algorithm for the schema matcher is to match two schemas by classifying
all the schema elements into the concept nodes in M-Ontology. The approach is based
on a feature of M-Ontology: when an E-Node that encapsulates a schema element is
classified into an existing G-Node, all the mappings associated with this G-Node are
automatically assigned to this E-Node. That is, all the E-Nodes in the same G-Node
share their mapping information, since they have identical semantics in the same
format. A-Nodes are selected if all their inclusive G-Nodes are already identified. The
mapping information stored in T-Edges can be reused if these edges connect the
selected A/G-Nodes. Following this basic idea, the critical issue is to classify the schema
elements into the correct concept nodes (i.e., G-Nodes).
We design a human-friendly algorithm for this specific element classification
problem. To classify an element en, we first search M-Ontology for its current or
previous versions. If none exists, we use semantic matching to compare en with
the representative object of each G-Node. For each G-Node gn in M-Ontology, a
representative object ro is automatically generated to describe its semantics as follows:
1) generating a bag of descriptive words DA by normalizing the descriptive attributes
of all human-verified E-Nodes in gn using NLP techniques [52] such as tokenization,
stop-word removal and stemming; 2) obtaining a set of descriptive labels DL by selecting
the terms with the top-k TF-IDF (i.e., term frequency-inverse document frequency [122])
weight from DA; 3) finding instances IST and constraints IC by combining the instances
and constraints of all verified E-Nodes. Finally, we generate the representative object ro
with a tuple 〈SetE−Node, DA, DL, IST, IC〉. In a sense, the element classification problem
is converted into a one-to-one schema-matching problem. This model provides three
main benefits: 1) ro offers community members simple and straightforward descriptions
of concept nodes that are easy to understand. Members can correct and enrich its
contents (discussed in Section 4.1); 2) The performance of the classification algorithm
can be easily improved by community members (e.g., through changing DL and DA); 3)
Most existing semantic matching techniques [46, 116] can be combined and integrated
into the classification algorithm. Instead of treating these matching techniques as black
boxes, we expose their matching procedures and results for efficient visualization and
control by the community members.
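Steps 1) and 2) of the representative-object construction can be sketched as follows; the tokenizer and stop-word list are toy placeholders (real preprocessing would also include stemming), and the G-Node contents are hypothetical:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "for", "or", "per"}  # toy list

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

def representative_labels(gnodes, k=2):
    """For each G-Node (a list of E-Node descriptive strings), pick the
    top-k terms by TF-IDF, treating each G-Node's word bag as a document."""
    bags = {g: tokenize(" ".join(texts)) for g, texts in gnodes.items()}
    n_docs = len(bags)
    df = Counter()                       # document frequency per term
    for bag in bags.values():
        df.update(set(bag))
    labels = {}
    for g, bag in bags.items():
        tf = Counter(bag)                # term frequency within this G-Node
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        labels[g] = [t for t, _ in sorted(scores.items(),
                                          key=lambda p: (-p[1], p[0]))[:k]]
    return labels

# Hypothetical G-Nodes built from E-Node descriptive attributes.
gnodes = {
    "g_price": ["Price range", "price per night", "nightly price"],
    "g_city":  ["Destination city", "city or airport", "arrival city"],
}
labels = representative_labels(gnodes)
# 'price' dominates g_price; 'city' dominates g_city.
```

The resulting label lists correspond to DL in the representative object; community members could then correct DL or the underlying word bags directly, which is exactly the human-friendliness benefit claimed above.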
CHAPTER 4
MAPPING MODELING
The cornerstones of MQ-Customizer are the mappings from global forms to local
forms. In our proposed solution, both global-form generation and user-query translation
rely on the existing mappings. Before digging into our reuse-oriented mapping discovery
and merging, this chapter first presents query-form modeling for meta-queriers, and then
two models of mappings from the perspectives of their semantics and evolution history.
4.1 Modeling of Query Forms
The proposed mapping repository requires a uniform representation of query
forms. This section introduces an undirected graph model for representing query forms.
Current languages used to represent query forms include HTML, CSS, and some
scripting languages. Structures of query forms can be classified into two groups:
one-step forms and multi-step forms. Theoretically, multi-step forms can be
decomposed into multiple one-step forms based on their appearance dependence [127].
Therefore, this work considers only the modeling of one-step forms.
For a one-step form, the W3C’s HTML specification [1] defines it as a section
of a document containing normal content, markup, special elements called controls
(checkboxes, radio buttons, menus, etc.), and labels on those controls. In meta-queriers,
global and local query forms can be regarded as sets of query conditions [62, 151].
User requests are normally made by modifying the HTML controls, e.g., clicking radio
buttons, entering text, etc.
A control and its associated attributes (e.g., name, id and class) and instances (i.e.,
possible user inputs) are regarded as “a whole”, also referred to as a schema element.
For instance, in the bottom left of Fig. 4-1, E1 is a schema element extracted from a
query interface. It has a control type “menu”, a label name “price”, a descriptive text
“Price”, and an instance set (from “20-40” to “350-400” and “All”). Typically, Query-form
Extractor (IE) can extract many useful attributes such as control type, name/label,
Figure 4-1. A query form and its graph model.
descriptive text, instances, data domain, default value, scale/unit (e.g., kg, million,
dollar), and data/value types (e.g., date type, time format, char type, etc.) [19, 59, 60,
62, 70, 97, 115, 151]. This work focuses on how to utilize these extractable attributes to
automate the construction and maintenance of meta-queriers.
In each global or local query form, schema elements are the most fundamental
building blocks. These schema elements are structured in a certain order and required
to obey some constraints, such as domain constraints and referential constraints. Two
schema elements are called syntactically equivalent iff all the attributes of the two
elements are the same.
To capture the structural semantics among different elements, we translate
query forms from their native format into undirected graphs. In such a graph, each vertex
corresponds to a specific schema element in one form. An edge is used to connect two
adjacent vertices, with a boolean property to represent its adjacency type, vertical or
horizontal. Each maximal connected subgraph corresponds to a semantic
block with a descriptive text D (if available). Each block is assigned a unique identifier
called BlockID. In addition, each graph corresponds to a unique query form. It can be
identified by its uniform resource identifier (URI) and a version number (denoted by
the time span T during which users can successfully access the data source through
this query form). The left portion of Fig. 4-1 shows a query interface for a hotel booking
system. It requires users to input three categories of information. The right portion
of Figure 4-1 shows its corresponding graph model. This graph is composed of three
corresponding semantic blocks. In each block, the vertical relations between schema
elements are denoted by solid lines, whereas the horizontal ones are denoted by dotted
lines.
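As an illustration, the graph model above can be encoded as follows; the class and field names are our own simplifications (e.g., the version is reduced to a single string rather than a full time span T, and blocks carry no descriptive text):

```python
# Sketch of the undirected query-form graph (field names are illustrative).
from dataclasses import dataclass, field

@dataclass
class SchemaElement:
    control: str                       # e.g. "menu", "textbox", "radio"
    label: str
    instances: list = field(default_factory=list)  # possible user inputs

@dataclass
class FormGraph:
    uri: str
    version: str                       # stands in for the time span T
    vertices: dict = field(default_factory=dict)   # id -> SchemaElement
    edges: list = field(default_factory=list)      # (id1, id2, vertical?)

    def add_edge(self, a, b, vertical):
        """Connect adjacent vertices; the boolean marks vertical adjacency."""
        self.edges.append((a, b, vertical))

    def blocks(self):
        """Maximal connected subgraphs = semantic blocks (via union-find)."""
        parent = {v: v for v in self.vertices}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b, _ in self.edges:
            parent[find(a)] = find(b)
        groups = {}
        for v in self.vertices:
            groups.setdefault(find(v), []).append(v)
        return list(groups.values())

# Hypothetical hotel-booking form with three elements.
g = FormGraph("http://example.com/search", "2011")
g.vertices = {
    "E1": SchemaElement("menu", "price"),
    "E2": SchemaElement("textbox", "city"),
    "E3": SchemaElement("textbox", "check-in"),
}
g.add_edge("E2", "E3", vertical=True)   # city placed above check-in
# E1 is unconnected, so it forms its own semantic block: 2 blocks total.
```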
4.2 Change-oriented Mapping Modeling
Before the introduction of change-oriented mapping modeling, we first present the
definition of mappings between two query forms based on the representation of query
forms (described in the last section). Having a semantically rich representation of
mappings is particularly important. The rest of this dissertation uses the following definition.
Definition 1. An element mapping mapST (also called a mapping instance) is an instance
of a specific relation from a query form QFS to QFT . It can be represented by a tuple
〈EListS ,EListT ,ExpST 〉, where
• EListS and EListT are two ordered lists of schema elements respectively from QFS
and QFT , whose semantics are relevant to each other. The element number of a
list can be one or greater than one, and thus mapping cardinality might be 1:1, 1:n
or n:m (n > 1 and m > 1).
• ExpST denotes a high-level declarative expression that specifies the transformation
rules from EListS to EListT . Expressions can be list-oriented functions (e.g.,
equivalence, concatenation, mathematical expressions) or other more complex
statements (e.g., if-else, while-loop). In addition, the format of ExpST should be both
human-understandable (i.e., easily modifiable by normal users) and
machine-processable (i.e., automatically transformable into executable rules).
• An element mapping without ExpST is called a correspondence corrST .
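Definition 1 can be encoded directly as a small data structure; the sketch below is a simplification in which ExpST is a plain string rather than a full declarative expression language, and the example element names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ElementMapping:
    """An element mapping <EList_S, EList_T, Exp_ST> per Definition 1."""
    elist_s: list               # ordered elements from source form QF_S
    elist_t: list               # ordered elements from target form QF_T
    exp: Optional[str] = None   # None => a bare correspondence corr_ST

    @property
    def cardinality(self) -> str:
        """Mapping cardinality, e.g. '1:1', '1:n' or 'n:m'."""
        return f"{len(self.elist_s)}:{len(self.elist_t)}"

    @property
    def is_correspondence(self) -> bool:
        return self.exp is None

# A 2:1 mapping joining two local date fields into one global field.
m = ElementMapping(["depart_month", "depart_day"], ["depart_date"],
                   exp="concat(depart_month, '/', depart_day)")
# m.cardinality == "2:1"; m.is_correspondence == False
```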
4.2.1 Motivating Scenario
Mappings are not static but dynamic in the context of meta-querier customization,
where both end users and data sources are autonomous agents that are independent
from meta-queriers. The concrete scenarios of mapping evolution can be characterized
into two types:
1) External changes. Changes in the schema elements of local and global
forms often cause the evolution of the corresponding mappings, as shown in Figure 4-2
(a) and (b). For example, if a car-rental local form wants to support new emerging car
models (e.g., 2011 Ford Fiesta), the corresponding entries need to be included in its
car-model control (e.g., a selection menu). Furthermore, changes in global forms occur
due to the evolution of its underlying local forms and user-specified source selection.
Consider, for example, an air-ticket booking meta-querier consisting of two sources with
different child-ticket age ranges, {2-16} and {4-12}. The integrated global form will
contain five sub-ranges on passenger age, {0-2, 2-4, 4-12, 12-16, 16+}, or else errors
could occur. If one source (e.g., 2-16) is removed, the age element in the global form
should be simplified (i.e., {0-4, 4-12, 12+}).
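The sub-range computation in this example can be sketched as follows: the global age axis is cut at every source-range endpoint, and consecutive cut points delimit the sub-ranges. The function name is illustrative; this is a sketch of the example's arithmetic, not the dissertation's integration algorithm:

```python
import math

def integrate_age_ranges(source_ranges):
    """Split the global age axis at every source-range endpoint.

    Each source range is a (low, high) pair; the result is the list of
    sub-ranges the global form must offer so that every source range is
    a union of sub-ranges.
    """
    cuts = {bound for r in source_ranges for bound in r}
    points = sorted(cuts | {0, math.inf})
    return list(zip(points[:-1], points[1:]))

# Two sources with child-ticket ranges 2-16 and 4-12:
print(integrate_age_ranges([(2, 16), (4, 12)]))
# -> [(0, 2), (2, 4), (4, 12), (12, 16), (16, inf)]

# After removing the 2-16 source:
print(integrate_age_ranges([(4, 12)]))
# -> [(0, 4), (4, 12), (12, inf)]
```

The two outputs correspond to the sub-range sets {0-2, 2-4, 4-12, 12-16, 16+} and {0-4, 4-12, 12+} from the scenario.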
These changes to query forms are unpredictable and untraceable, since their
updates (e.g., to functionality and representation) often occur without any notification.
That is, it is almost impossible to determine when and how a query form is
changed step by step. Syntactically different query forms with the same
URI are normally regarded as different versions of the gateway/portal to a specific
data source; the data source is accessible through these forms during
different time spans.

The unpredictable and untraceable characteristics contradict the key
assumptions [119, 102] of schema evolution and versioning, two traditional
research problems in the database field. Thus, we cannot simply employ the related
techniques to model the changes of query forms. Although we cannot automatically
identify the step-by-step change processes, the snapshot-by-snapshot processes are
still useful resources in mapping discovery.

Figure 4-2. Mapping evolution scenarios.

In essence, external changes indicate
not only the element-to-element correspondences but also the evolution trends in the
same application domain. Both information can be employed to facilitate the automatic
discovery of new mappings and the later manual correction.
2) Internal changes. Modification of mappings comprises changes to the
mapping composition and updates to the related context information, i.e., the
metadata of mappings. A mapping instance can be created from scratch, or derived from
another instance with some modification. There is no guarantee that mapping
instances discovered by machines or humans are completely free from error and
always function well. These changes are traceable. As shown in Figure 4-2 (a), (b)
and (c), corrections include changes to the expressions (i.e., ExpST ) and
to the element lists (i.e., EListS and EListT ). To enhance the robustness of customized
meta-queriers, internal changes should be stored for possible recovery, especially in
community-based construction and maintenance.
4.2.2 Modeling
Based on the above observations, we design a change-oriented mapping model to
preserve mapping evolution. First, we explain the life cycle of a mapping in our overall
framework. Then, a model (referred to as mapping objects) is presented for mapping
evolution.
Mapping Lifecycle: Each mapping instance has its own lifecycle starting from the initial
creation to the physical deletion. The state transitions are determined by its validation
status.

Figure 4-3. The life cycle of a mapping.

Fig. 4-3 illustrates a state diagram with the four states that a mapping can be
in: Validating, Usable, Detached, and L-Deleted. The Validating state of a mapping
indicates that its current correctness is undetermined. It remains in this state until
community members or machines validate it completely. A newly created mapping
begins its lifecycle in the Validating state. The state of an existing mapping transitions
into Validating when its correctness status is changed. While a mapping is in
the Usable state (i.e., ready to be utilized), it can perform correctly in the current
meta-querier. When a previously correct mapping is identified to be incorrect, it enters
the Detached state. Such a mapping might be useful for discovering new mappings.
Mappings that are of no value (e.g., never correct) enter the L-Deleted state (i.e.,
logically deleted). The lifecycle of a mapping ends when it is physically deleted.
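The lifecycle can be sketched as a small state machine. The transition table below is inferred from the prose description of Fig. 4-3, not copied from the figure, so the exact edge set may differ:

```python
# Transition table inferred from the prose around Fig. 4-3 (illustrative).
TRANSITIONS = {
    "Created":    {"Validating"},              # new mappings start validating
    "Validating": {"Usable", "L-Deleted"},     # validated correct, or valueless
    "Usable":     {"Validating", "Detached"},  # status changed, or found incorrect
    "Detached":   {"Validating", "L-Deleted"},
    "L-Deleted":  {"P-Deleted"},               # physical deletion ends the lifecycle
    "P-Deleted":  set(),
}

class MappingLifecycle:
    def __init__(self):
        self.state = "Created"
        self.transition("Validating")  # a newly created mapping begins validating

    def transition(self, target: str) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target
```

For example, a mapping validated as correct moves to Usable, and when later identified as incorrect it moves to Detached; moving a Detached mapping directly to P-Deleted would be rejected.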
Mapping Objects: Between one query form QFS and another QFT , there exists a set
of mapping objects. A mapping object mapObjST denotes a specific relation between
one query form QFS and another QFT (i.e., QFS→QFT and QFT→QFS ). It can be
represented by a bipartite graph whose edges Set〈RST ,RTS 〉 only connect nodes
from two disjoint node sets Set〈NS〉 and Set〈NT 〉. Each node in Set〈NS〉 and Set〈NT 〉
respectively corresponds to an element list from QFS and QFT . Each solid edge
represents a mapping instance of mapObjST . Based on their semantics, every pair of
nodes respectively in Set〈NS〉 and Set〈NT 〉 forms a correspondence corrST .
Each mapping object consists of a group of mapping instances that represent the
same relation between two query forms. These instances are regarded as different
versions of this object.

Figure 4-4. The incremental formulation of a mapping object.

Thus, a mapping object can be represented using a bipartite
graph G = 〈Set〈NS〉, Set〈NT 〉, Set〈RST/RTS 〉〉, as illustrated in Fig. 4-4(f). We assume
that all the mapping instances in the same direction (e.g., QFS→QFT ) are independent
of each other. This assumption is feasible since, whenever a constraint exists between
two instances, they can easily be merged, until no constraint remains between any pair.
Fig. 4-4 shows the formulation of a mapping object between one query form QFS
and QFT . Originally, the mapping object is only a single mapping instance R1ST (shown
in Fig. 4-4(a)). Later, another instance R2TS in the inverse direction is added to the
object (shown in (b)). When external changes occur in QFS , the two original mapping
instances, R1ST and R2TS , are detached and a new mapping instance R3ST is automatically
discovered by machines (shown in (c)). Then, since R3ST is incorrect based on manual
validation, it is logically deleted and replaced by another mapping instance R4ST (shown
in (d)). After the internal changes shown in (e), all the existing usable mappings
are detached. Finally, R5ST is the only usable mapping instance.
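The bipartite-graph view of a mapping object and the evolution steps above can be sketched in a few lines. The class and node names are illustrative assumptions, and the states reuse the lifecycle labels from Fig. 4-3:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class MappingObject:
    """Sketch of mapObj_ST as a bipartite graph between QF_S and QF_T."""
    nodes_s: Set[str] = field(default_factory=set)   # Set<N_S>: element lists of QF_S
    nodes_t: Set[str] = field(default_factory=set)   # Set<N_T>: element lists of QF_T
    edges: List[list] = field(default_factory=list)  # [instance id, src, dst, state]

    def add_instance(self, iid: int, n_s: str, n_t: str, direction: str = "ST"):
        self.nodes_s.add(n_s)
        self.nodes_t.add(n_t)
        src, dst = (n_s, n_t) if direction == "ST" else (n_t, n_s)
        self.edges.append([iid, src, dst, "Usable"])

    def detach_usable(self):
        # e.g., an external change in QF_S invalidates all usable instances
        for e in self.edges:
            if e[3] == "Usable":
                e[3] = "Detached"

    def usable_ids(self) -> List[int]:
        return [e[0] for e in self.edges if e[3] == "Usable"]

# Replaying panels (a)-(c) of Fig. 4-4 with hypothetical node names:
obj = MappingObject()
obj.add_instance(1, "N1S", "N1T")                   # (a): first instance, direction ST
obj.add_instance(2, "N1S", "N1T", direction="TS")   # (b): inverse-direction instance
obj.detach_usable()                                 # (c): external change in QF_S
obj.add_instance(3, "N2S", "N1T")                   # (c): machine-discovered instance
```

After these steps only the newly discovered instance remains usable, while the earlier two are kept in the Detached state as resources for later discovery.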
4.3 Ontology-based Mapping Modeling
In the context of meta-querier customization, the scale of such mappings might be
considerably large since a potentially large number of meta-queriers need to be built to
meet various user needs. We develop a novel semantics-based approach to modeling
mappings [77]. This model can be employed to organize unstructured domain-specific
mappings into a well-defined ontology. Such modeling makes mappings easier for
humans to manage, especially non-technical users. Many-to-many complex
mappings can also be discovered more efficiently by leveraging the ontology abstracted
from the existing mappings.
4.3.1 Motivating Scenario
To motivate our proposed model, we use a simple scenario illustrated in Fig. 4-5(a).
Consider two simple meta-queriers MS1 and MS2 built for job seekers with different
preferences. MS1 and MS2 provide users with different query forms (i.e., mediated schemas)
by integrating their own data sources. There exists an overlap between their two sets of
sources, such as LS . The contents of LS can be accessed by a query form (called a local
schema). In a query form, the components (e.g., textboxes and menus) and the related
descriptive texts and potential instances are regarded as schema elements.
In order to integrate LS into MS1 and MS2, the system designers need to specify the
mappings between schemas (called schema-level mappings) MS1 → LS and MS2 → LS
for translating user queries. Since the schema-level mappings hide the mapping details,
we decompose them into two sets of mappings between schema elements (called
element-level mappings) so that reusing element-level mappings becomes possible. The
directed edges in Fig. 4-5(a) indicate the element-level mappings from MS1 and MS2 to
LS .
Without considering the structural information, the schemas of LS , MS1 and
MS2 can be transformed to a flat schema as shown in Fig. 4-5(b). Each schema is
represented as a dotted-line rectangle, in which a solid-line rectangle corresponds
to a schema element. A mapping edge and its connected elements constitute an
element-level mapping between schemas.
There are two common approaches to modeling element-level mappings. The first
simply uses a table to model mappings as shown in Fig. 4-5(c). Its columns are
employed to represent the properties of mappings, such as source elements, target
elements and mapping expressions. The second uses a mapping graph as shown
in Fig. 4-5(d) to connect all the mappings together. That is, each node corresponds
to a source or target element and each edge represents a mapping.

Figure 4-5. Three job search forms with the mappings.

This graph-view
structure is more explicit and straightforward for discovering new mappings through
composition of mappings [17, 48, 88].
However, the above two approaches become almost impractical for managing a
large volume of mappings, particularly when the mappings are shared and operated
by a community. Thus, we introduce a semantics-based mapping modeling to make
mappings easier to manage. In the MS1 and MS2 meta-queriers, we observed that some
mappings are highly similar. For example, two mappings {E4,E1,m1} and {E6, E1,m1}
share the same target elements E1 and mapping expressions m1. From these similar
mappings, some concepts can be identified by grouping the schema elements based
on their semantics; the related relations can be transformed from the corresponding
element-level mappings. These concepts and relations can be used to form a mapping
ontology. For instance, as shown in Fig. 4-5(e), the original pair of mappings
{E4,E1,m1} and {E6, E1,m1} can be represented by a single relation {C4,C1,m1}, where
the concepts C4 and C1 are formed respectively by {E4,E6} and {E1}. In this setting,
mapping management becomes more intelligible to human users. Manipulations on
individual mappings can be replaced by more straightforward and convenient operations
on concepts and relations. Mapping redundancy and incorrectness are easier to
detect by humans or even machines. Repetitive operations on semantically equivalent
mapping instances can also be avoided, eliminating potential update anomalies.
Furthermore, semantic modeling also benefits the discovery of new mappings by both
humans and machines. Since it is a semantics-based abstraction of mappings, the
reuse of previous mappings becomes more straightforward and effective. Based on the
above discussion, it is evident that semantic modeling is a better approach with several
advantages. The rest of this section presents our proposed ontology-based modeling.
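The collapse of {E4,E1,m1} and {E6,E1,m1} into the relation {C4,C1,m1} can be sketched by grouping mappings that share a target element and expression. This is a deliberate simplification: real concept formation in the model also uses element semantics, not just shared targets and rules, and the function name is an assumption:

```python
from collections import defaultdict

def lift_to_concepts(mappings):
    """Group element-level (source, target, rule) triples into
    concept-level relations by shared target and rule.

    Returns triples (source concept, target concept, rule), where each
    concept is a frozenset of schema elements.
    """
    groups = defaultdict(set)
    for src, tgt, rule in mappings:
        groups[(tgt, rule)].add(src)
    return [(frozenset(srcs), frozenset({tgt}), rule)
            for (tgt, rule), srcs in groups.items()]

# The mappings of Fig. 4-5(c), abridged:
relations = lift_to_concepts([
    ("E4", "E1", "m1"), ("E6", "E1", "m1"),   # collapse into one relation
    ("E5", "E2", "m2"),
])
```

Here the two m1 mappings collapse into a single relation whose source concept {E4, E6} plays the role of C4 and whose target concept {E1} plays the role of C1.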
4.3.2 Modeling
In this model, element-level mappings are considered first-class entities.
Mappings between two schemas can be decomposed into separate element-level
mappings; element-level mappings between two schemas can also be combined to
generate a schema-level mapping [53]. In Section 4.3.1, we used a simple example to
explain the basic idea of the proposed element-level mapping modeling. The example
shown in Fig. 4-5(e) comprises only a few simple one-to-one mappings, but real-world
mappings can be much more complex. Section 4.2 introduces a semantically richer
representation of element-level mappings. With this mapping representation, we can
present a graph model for the proposed mapping ontology.
The mapping ontology is modeled as a directed acyclic graph where a node
represents a concept with a set of associated instances; an edge corresponds to a
relationship between two concepts. Both nodes and edges have some properties that
describe their semantics and constraints. Based on the composition of concepts, all
the nodes are classified into three different types: Elementary, Generalization and
Aggregation. Edges also have three categories for different purposes: Part-of, Is-a and
Transformation. Fig. 4-6 and 4-7 depict an ontology fragment about departure dates in
the domain of air-ticket booking. The graph model is explained in detail with this
example as follows.
Figure 4-6. A fragment of a mapping ontology (E-Nodes and G-Nodes).
Elementary Nodes: An Elementary node (called an E-Node, shown as a rectangle in
Fig. 4-6) represents the concept of a schema element using a tuple 〈URI, D-Attributes,
Instances, I-Constraints〉, where URI identifies the location of the schema, D-Attributes
refer to the descriptive attributes associated with this element (e.g., name, id, value,
class, domain), and Instances and I-Constraints are the known data instances and the
instance constraints, such as the instance type and domain.
In ontology-based modeling, E-Nodes are the most fundamental concept units
for constructing high-level concepts, i.e., the concepts represented by Generalization
nodes and Aggregation nodes. E-Nodes are introduced to eliminate language
heterogeneity, which refers to differences in the languages
used to represent schemas, such as XHTML or HTML. Each schema element can
be transformed to an E-Node without semantic loss. The semantics of an E-Node is
reflected by the D-Attributes of the corresponding schema element. Schema designers
always attempt to use a schema element to convey a certain concept. Intuitively,
such a concept can be instantiated by the data instances under the corresponding
schema element. For example, the E-Node E1 in Fig. 4-6 corresponds to a schema
element whose D-Attributes are 〈name: outbound departMonth〉, 〈id: monthCalendar〉,
〈class: month〉 and 〈text: Depart〉, and whose instances are represented as numbers
from “1” to “12”. It can be correctly inferred from its D-Attributes and instances that this
element indicates a concept about departure months.
Generalization Nodes: A Generalization node (called a G-Node, shown as a circle in
Fig. 4-6 and 4-7) represents a concept by a tuple 〈SetE−Node , D-Attributes, D-Labels,
Instances, I-Constraints〉, where SetE−Node refers to an unordered set of E-Nodes as the
internal constitution of this G-Node. D-Attributes and D-Labels are used to describe the
semantics of the G-Node, and Instances denote a potential instance set Set INSgn under
the constraints of I-Constraints.
The semantics of a G-Node is equivalent to that of its inclusive E-Nodes. The
D-Attributes and D-Labels of a G-Node gn are two sets of word tuples, SetDAgn and
SetDLgn . Each word tuple (wi , wfi ) consists of a distinct word wi and its corresponding
appearance frequency wfi in gn. SetDAgn is generated from the normalized D-Attributes
of the included E-Nodes whose insertion is already verified by humans. The D-Labels
can be automatically derived from frequent appearance of D-Attributes or manually
input by humans. The detailed algorithms for finding the D-Attributes and D-Labels of a
G-Node are discussed in the next chapter. In addition, the Instances and I-Constraints
of a G-Node are directly obtained from the inclusive E-Nodes. For example, G3 has the
same instance set and I-Constraints as E1, E2 and E3.
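The construction of SetDA as (word, frequency) tuples can be sketched as follows. The normalization here (lower-casing and splitting on non-alphanumerics) is an illustrative assumption; the dissertation's actual normalization and the full algorithm are deferred to the next chapter:

```python
from collections import Counter
import re

def g_node_d_attributes(e_node_attr_values):
    """Build Set_DA of a G-Node: (word, frequency) tuples over the
    normalized D-Attribute values of its verified E-Nodes."""
    words = []
    for value in e_node_attr_values:
        words += [w for w in re.split(r"[^a-z0-9]+", value.lower()) if w]
    return sorted(Counter(words).items())

# Abridged D-Attribute values of the E-Nodes E1-E3 from Fig. 4-6:
attrs = ["outbound_departMonth", "monthCalendar", "dateLeavingMonth",
         "AIR_frommonth", "Depart", "Depart"]
print(g_node_d_attributes(attrs))
```

Frequently occurring words (here "depart") are the natural candidates from which D-Labels could be derived automatically, per the text above.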
The purpose of G-Nodes is to address syntactic heterogeneity. The E-Nodes in SetE−Node
share the same semantics and instance formats, but their syntactic representations
are different. That is, the instances of these E-Nodes share the same format with
identical semantics if they represent the same object, but the D-Attributes of these
E-Nodes could be totally different. For example, in Fig. 4-6, the concept of
G-Node G3 is generalized from three verified E-Nodes. Although these E-Nodes
have equivalent semantics and instance formats, their D-Attributes differ in
the number, types and values of attributes. It is not necessary to impose any syntactic
transformation when these E-Nodes exchange their instances.
Aggregation Nodes: An Aggregation node (called an A-Node, shown as a triangle
in Fig. 4-7) represents a concept by a triple 〈ListG−Nodes , D-Labels, Instances〉, where
ListG−Nodes denotes an ordered list of G-Nodes, D-Labels and Instances respectively
refer to its descriptive attributes and possible instances.
For the purpose of representing many-to-many mappings, A-Nodes are generated
by aggregating an ordered list of G-Nodes. That is, an A-Node saves the order of its
inclusive concepts. Its Instances and D-Labels can be fetched from its inclusive nodes,
and then combined in that order. For example, the two A-Nodes A1 and A2 in Figure
4-7 represent a concept “departure dates” by aggregating the same three G-Nodes
G1, G2 and G3. These G-Nodes respectively represent a calendar year, day and month
as numbers, and their formats are shown in Fig. 4-7. Although their semantics are
equivalent, their instance formats are different due to different element orders. A1 uses
the middle-endian form “month-day-year” (i.e., starting with the month). A2 uses the
big-endian form “year-month-day” (i.e., starting with the year). Note that aggregation
cannot be directly applied to E-Nodes. In other words, A-Nodes only connect with
G-Nodes. This requirement forces E-Nodes to be encapsulated into G-Nodes, which
indicates that more concepts can be generated by clustering schema elements.
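Ordered aggregation can be sketched in a few lines: an A-Node stores an ordered list of G-Nodes and joins their instances in that order. The function name and the joining convention (hyphen-separated) are illustrative assumptions:

```python
def a_node_instance(g_node_order, values):
    """Combine one instance per G-Node in the A-Node's stored order.

    `g_node_order` is the A-Node's ordered list of G-Node names;
    `values` maps each G-Node name to one instance value.
    """
    return "-".join(str(values[g]) for g in g_node_order)

# G1, G2, G3 represent year, day and month as numbers (Fig. 4-7):
values = {"G1": 2011, "G2": 25, "G3": 3}
a1 = a_node_instance(["G3", "G2", "G1"], values)  # A1: month-day-year
a2 = a_node_instance(["G1", "G3", "G2"], values)  # A2: year-month-day
print(a1, a2)  # 3-25-2011 2011-3-25
```

The same three G-Nodes thus yield semantically equivalent A-Nodes whose instance formats differ only in element order, which is exactly the structural heterogeneity that T-Edges address below.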
Is-a Edges and Part-of Edges: Is-a edges and Part-of edges (shown as single-arrow
edges separately in Fig. 4-6 and 4-7) represent hierarchical relations in mapping
ontologies. They respectively correspond to two distinct forms of concept abstraction:
generalization and aggregation. Both attempt to conceptualize E-Nodes. The main
benefit of generalization is to share semantic information among schema elements:
1) In a G-Node, all the E-Nodes have the same semantics and instance formats;
their descriptive labels (such as name, id, class) and instances can be exchanged for
understanding their semantics more precisely. 2) In a G-Node, all the inclusive E-Nodes
share the mappings. That is, as long as an E-Node is assigned to a G-Node, all the
pre-configured mappings associated with this G-Node are also applied to this E-Node.

Figure 4-7. A fragment of a mapping ontology (G-Nodes and A-Nodes). Instance
formats: G0: month:English; G1: year:num; G2: day:num; G3: month:num; G4:
month-day (month:num); A0: month-day-year (month:English); A1: month-day-year
(month:num); A2: year-month-day (month:num).

In addition, the introduction of aggregation is necessary for modeling n:m complex
mappings which include multiple schema elements in a certain order.
Transformation Edges: A Transformation edge (called a T-Edge, shown as a directed
edge with double arrows in Fig. 4-7) connecting two nodes represents a transformation
relation. Its mapping expression is stored as an attribute of the T-Edge. T-Edges
cannot connect E-Nodes.
In combination with G-Nodes and A-Nodes, T-Edges aim to address three different
types of heterogeneity between two semantically overlapping concepts: instance
heterogeneity, structural heterogeneity and semantic heterogeneity. 1) Instance
heterogeneity refers to the variety in instance formats for the same entity. For example,
both G0 and G3 represent departure months, but they use different formats, i.e.,
English and number respectively. Thus, a T-Edge TR5 might be created for transforming
instances from G3 to G0, if necessary. 2) Structural heterogeneity often occurs when the
aggregation orders of concepts are different. For example, A1 and A2 represent a date
in different orders, i.e., “month-day-year” and “year-month-day”. A T-Edge TR4 is used
to link A1 and A2. 3) Semantic heterogeneity represents differences in concept
coverage and granularity. For instance, TR3 connects two concepts, A1 and G4,
with different coverage. Unlike A1, G4 represents the departure date without the
“year” component.
In addition to the above explicit relations, there exist some implicit transformation
relations stored in G-Nodes. In each G-Node, the E-Nodes have an equivalence relation
with each other. As mentioned above, the instances of these E-Nodes are shared
among them without any change. This kind of relation is common in real applications.
4.4 Metadata in Mapping Modeling
The last two sections respectively present the change-based and ontology-based
approaches to modeling mappings. Besides the internal composition of mappings, their
related context information (i.e., the metadata) also plays a critical role, especially in
the context of community-based meta-querier construction. The metadata is mainly used
to facilitate community cooperation and enhance system effectiveness and robustness.
The metadata covers the who, what, where, when and how aspects associated with
the whole lifecycle of mappings, including their creation, verification, modification
and deletion. To represent the metadata, the proposed mapping modeling assigns the
following metadata to each node and edge:
• Creation: Nodes/edges can be created manually or automatically. If humans are
the creators, the user names and the creation time are recorded in the related
nodes/edges. If they are discovered by machines, the algorithm names, the
credibility values (ranging between 0 and 1 and indicating the confidence that
the mapping holds) and the creation time are stored. In addition, the creation
process is also recorded: when a mapping map is inserted, all the related nodes
and edges are annotated with map.
• Verification: Verification checks for accuracy and completeness of nodes/edges.
In the current implementation, each node/edge is accompanied by a set of
attribute triples: 〈VS , Name, Time〉, where VS refers to the validation status of
this node/edge, Name and Time respectively represent the user names and time
points. VS has three types of values: Undetermined, Incorrect, and Correct.
Human users are able to view and verify all the components of mapping ontologies
manually. Although the current solution does not automatically evaluate the
correctness of mappings, it is not difficult to support such functionality if machines
can detect the usage of mappings.
• Modification and Deletion: Mappings are not static. Their nodes/edges need
to be updated or deleted. Since they might also be maliciously or accidentally
revised or removed, some version mechanisms [33, 66] should be merged into
ontology-based modeling.

Figure 4-8. The life cycle of a node/edge.

Although the current implementation does not include
such functionalities, it does record the related information, such as who modified or
deleted the nodes/edges and when.
• Annotation: Not all mappings can be automatically discovered if there exist
hidden contexts. For example, the scale factor of instances can be set to
any integer value, such as 1, 1000 or 1000000. Monetary amounts can be
reported in different currencies, such as Dollar, Euro and Yuan. Normally, such
contexts are almost impossible for machines to infer from individual mapping
instances. Thus, our solution is to enable users to annotate nodes and edges
with additional information that helps others understand them.
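The per-node/edge metadata described above can be sketched as one record. Field names are illustrative; VS takes the three values Undetermined, Incorrect and Correct from the Verification bullet, and the annotation list covers hidden contexts such as scale factors or currencies:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Metadata:
    """Sketch of the metadata attached to a node or edge."""
    creator: str                           # user name or algorithm name
    created_at: str                        # creation time
    credibility: Optional[float] = None    # in [0, 1]; machine-created only
    verifications: List[Tuple[str, str, str]] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)  # hidden contexts

    def verify(self, status: str, user: str, time: str) -> None:
        assert status in {"Undetermined", "Incorrect", "Correct"}
        self.verifications.append((status, user, time))  # <VS, Name, Time>

    @property
    def validation_status(self) -> str:
        # A node/edge with no verification record is still Undetermined.
        return self.verifications[-1][0] if self.verifications else "Undetermined"
```

For example, a machine-discovered edge might carry credibility 0.8 and an annotation "amounts reported in Euro", then move from Undetermined to Correct once a community member verifies it.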
4.5 Related Work
4.5.1 Mapping Modeling for Data Integration
Mapping modeling is the foundation of mapping repositories. The existing methods
can be classified based on the extent to which mappings are abstracted: instance-level
(IL) and abstract-level (AL) modeling. IL modeling is a method that regards mapping
instances as separate entities without any abstraction, while AL modeling considers
mapping instances as a whole body under some hidden templates/rules, which can be
discovered from manual or automatic analysis and reasoning on the mapping instances.
IL modeling is widely utilized in many data integration systems. Most mapping
repositories [93, 101, 155] simply employ a large table to store schema-level or
element-level mappings. Its columns represent different properties of mappings, such
as schema/element names, mapping expressions, and similarity values. Others use
graph models in which nodes denote schemas [16, 37] or elements [13, 96] and edges
represent mappings. In a mapping graph, a new mapping edge between two nodes
could be discovered by combining all the edges in a mapping path between these two
nodes [17, 48, 88].
Compared with IL modeling, AL modeling facilitates mapping management and
re-utilization by providing the patterns/heuristics hidden in mapping instances. Patterns
are useful to assist humans/machines in reviewing previous mappings and finding new
mappings. A study in [86] employs machine-learning techniques for finding the patterns.
It leverages a large number of the existing mapping instances to train classifiers.
Essentially, these classifiers can be viewed as a media to store the implicit patterns
for finding new mappings. However, patterns in real life are too complex to learn by
machines especially when the volume of training data is not large enough. Another
problem is that these patterns are not straightforward to users. It is difficult for human
users to view, verify and modify them. Consequently, this method is often applied to
discover approximate mappings [87], rather than exact mappings with expressions.
Other studies [26, 43, 144] rely on human efforts to find templates, which can be
viewed as explicit mapping patterns. These templates are separately hardcoded into
the systems. They simply assume that the templates are static and separate. It is
highly desirable that these templates can be semantically organized and dynamically
managed.
Our proposed semantic modeling is a type of AL modeling. Different from the above
learning-based and template-based methods, our semantics-based method
forms a domain-specific ontology by conceptualizing mapping instances. The ontology
is a container that stores a large number of templates. Users can dynamically manage
these templates, which are organized based on their semantics.
Compared with the learning-based methods, our proposed semantic modeling is
more straightforward to humans, especially non-technical users. Since user
interaction is inevitable in mapping management and discovery, ease of interaction
is of much significance, particularly for large-scale mappings. In this context, the
ontology-based model enhances human understanding. In addition, these
mapping ontologies are visualizable, and thus the operations in mapping management
can be reduced to a few simple operations on graphs.
Compared with the template-based methods, our method is abstract and dynamic.
Each template represents a group of mappings with some specific patterns. In
our model, individual templates are merged into domain ontologies based on their
meanings. The semantics-based organization facilitates the discovery of new templates
by maximally reusing the previous ones. Furthermore, these domain ontologies can be
easily re-configured through periodic updates, such as insertion and modification.
4.5.2 Schema Element Clustering for Data Integration
Clustering schema elements is not a new approach in the field of data integration.
Element clustering is the process of organizing similar elements into the same groups
and dissimilar elements into different groups. Its main applications in data integration
include schema merging, result merging and schema matching. To support schema
merging, mediated schemas are generated by abstracting a representative schema
element from each element cluster, an approach adopted in ARTEMIS [24] and by
Dragut et al. [41]. In result merging, element clustering facilitates detection of the
duplicated results returned from different data sources, as in IWIZ [111].
Most relevant to our work are the clustering algorithms in schema matching, including
Corpus-based [86], IceQ [144], and Zhao & Ram [156]. Hierarchical agglomerative
methods are employed by all the above references in constructing multiple concept
clusters. However, all these algorithms operate in a batch mode. That is, they assume
that all the schema elements are already collected before clustering starts. In
fact, this is not feasible in a dynamic context: for dynamic reconfiguration, schema
elements to be clustered arrive incrementally. Furthermore, Zhao & Ram [156]
also attempt to use K-means and SOM neural networks as element clustering methods.
Like the above methods, they do not consider user-interaction issues during
cluster construction.
Our clustering algorithm identifies concepts and relations for constructing a
mapping ontology. It is a type of representative-based method: the degree of
similarity of an element to a cluster is equal to its similarity value to the representative
of that cluster. The representatives (i.e., D-Labels) of a cluster are abstracted from the
existing nodes based on the term-frequency statistics of their D-Attributes, which are
treated as a bag of words.
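Incremental representative-based clustering can be sketched as follows. The cosine measure, the threshold value and the way the representative is refined are illustrative assumptions, not the dissertation's exact algorithm:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words representations."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def insert_element(clusters, words, threshold=0.5):
    """Assign an arriving element (as a word list) to the most similar
    cluster representative, or start a new cluster below the threshold."""
    bag = Counter(words)
    best, best_sim = None, 0.0
    for rep in clusters:
        sim = cosine(bag, rep)
        if sim > best_sim:
            best, best_sim = rep, sim
    if best is not None and best_sim >= threshold:
        best.update(bag)        # refine the representative incrementally
    else:
        clusters.append(bag)    # the element fits no existing cluster
    return clusters

clusters = []
insert_element(clusters, ["depart", "month"])
insert_element(clusters, ["depart", "month", "outbound"])  # joins cluster 1
insert_element(clusters, ["passenger", "age"])             # new cluster
print(len(clusters))  # 2
```

Because each cluster is summarized by a small bag of words, a human can inspect or correct a representative directly, which is the "human friendly" property argued below.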
Our algorithm mainly differs from the above work in several respects. 1) Inputs:
the inputs to our system are mappings rather than schemas, since our system
stores, manages and discovers mappings with their expressions. 2) Outputs: their
clustering algorithms only output concept clusters without any relations among these
clusters, whereas our construction method builds domain ontologies including concepts
and relations. 3) Algorithm: our algorithm is incremental and “human friendly”. First,
the representative object-based clustering is more straightforward to non-technical
users. The cluster representatives offer users simple and straightforward descriptions
of clusters that are easier for human understanding. The performance of our algorithm
can be more easily improved by manually changing the representatives, e.g., adding
or removing several D-Labels and D-Attributes. Second, to adapt to the evolution
of DI systems, it classifies the to-be-matched elements as soon as they arrive. This
incremental mechanism is very important for a long-running and shared mapping
repository.
CHAPTER 5
MAPPING REPOSITORY
User queries through global forms are translated to respective local queries. To
translate user queries, meta-queriers must find the element mappings from global forms
to local ones. In the context of meta-querier customization, the scale of such mappings
might be considerably large since a potentially large number of meta-queriers need to
be built to meet various user needs. Thus, it is highly desirable to design mechanisms
to manage these mappings. This chapter introduces two management mechanisms,
M-Ontology and MO-Repository, based on the aforementioned ontology-based and
change-oriented mapping modeling.
5.1 M-Ontology
M-Ontology is an ontology-based mapping repository. Mappings are the main
information sources for constructing such a domain-based mapping ontology. The
construction of the ontology is through incremental insertions of element-level mappings.
The incremental approach is critical for building a long-running, shared mapping
repository. Since both users and data sources are autonomous, source selection and
local schemas are dynamic: new mappings continue to arrive as new data sources are
added or old ones evolve, so M-Ontology requires periodic updates. It is more
effective to update the existing ontology incrementally rather than re-learn the ontology
from scratch, especially when the ontology has been manually corrected.
Like schema matching in data integration, ontology construction is also a labor-intensive,
time-consuming and error-prone problem [110, 121]. To reduce the human effort
involved in construction, M-Ontology provides a semi-automatic algorithm [76] for
Figure 5-1. The flowchart of ontology construction.
inserting mappings. Each insertion involves three main tasks: a) concept identification,
including E-Nodes, G-Nodes and A-Nodes; b) relationship identification, including Is-a
edges, Part-of edges and T-Edges; c) metadata storage, including all the metadata
of concepts and relationships. Recording the metadata is generally straightforward,
as it happens alongside the manipulation of edges and nodes throughout the life cycle
of the ontology. Hence, the remainder of this subsection focuses on how to
deal with the first two tasks when inserting a mapping (map〈EListS ,EListT ,ExpST 〉).
M-Ontology splits the semi-automatic mapping insertion into three sequential phases as
illustrated in Figure 5-1: 1) inserting the element lists, EListS and EListT ; 2) inserting the
expression ExpST ; 3) validating the insertion by humans.
Phase 1: Element Insertion. The purpose of this phase is to obtain E-Nodes,
G-Nodes, A-Nodes, Is-a edges and Part-of edges when inserting the element lists. The
two element lists (EListS and EListT, labeled ListE1 and ListE2 in Figure 5-1) are
inserted separately. Algorithm 5-1 illustrates the
whole process of inserting an element list EList into a mapping ontology MO. Besides
updating the ontology MO, the algorithm returns an abstract node cn corresponding to
EList for the next phase. (For simplicity of description, the term "abstract node" is used
hereafter for a node that may be either a G-Node or an A-Node.)
In the algorithm, each schema element e in EList is checked to see whether it
already exists in the ontology MO (lines 2-3). If so, its G-Node becomes (part of) the
target abstract node (line 4). If not, e is encapsulated into a new E-Node en whose
initial validation status is UNKNOWN (line 6). Line 7 then seeks the most suitable
G-Node gn in MO for inserting en. If none is found, en is inserted into a newly created
G-Node; otherwise, en is inserted directly into the found gn (lines 8-9). If EList contains
only a single element, gn is the abstract node cn that will be returned (lines 8, 15, 17);
otherwise, each found G-Node gn is inserted into a newly created A-Node an, in the
same order as the original schema elements in EList (lines 1, 2, 11). Line 12 then
searches MO for an A-Node anInMO with the same contents as an. If such an A-Node exists, anInMO
is the node to be returned; otherwise, an is returned (lines 13-14, 17). Before cn is
returned, the insertion is applied to MO in line 16, with the relevant metadata recorded.
Algorithm 5-1: Insert an element list into a mapping ontology.
Data: A mapping ontology MO, an element list EList
Result: MO ← MO ∪ {EList}, a returned abstract node cn corresponding to EList
 1  an = new ANode();
 2  foreach Schema Element e in EList do
 3      if e ∈ MO then
 4          gn = e.getENode().getGNode();
 5      else
 6          en = new ENode(e);
 7          gn = MO.searchGNode(en);
 8          if gn = ∅ then gn = new GNode();
 9          gn.insertENode(en);
10      if EList.getElementNum() > 1 then
11          an.insertGNode(gn);
12  anInMO = MO.searchANode(an);
13  if anInMO ≠ ∅ then cn = anInMO;
14  else if an.getNodeNum() ≠ 0 then cn = an;
15  else cn = gn;
16  MO.insertCNode(cn);
17  return cn;
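As an illustration, the control flow of Algorithm 5-1 can be rendered as a rough Python sketch. All class and method names here are hypothetical; exact label matching stands in for the semantic searchGNode discussed below, every element is wrapped as a fresh E-Node (a simplification of lines 3-6), and metadata recording is omitted:

```python
class ENode:
    """Wraps one raw schema element; validation status starts as UNKNOWN."""
    def __init__(self, element):
        self.element = element
        self.status = "UNKNOWN"

class GNode:
    """A cluster of semantically equivalent E-Nodes."""
    def __init__(self):
        self.enodes = []

class ANode:
    """An ordered aggregation of G-Nodes, used for multi-element lists."""
    def __init__(self):
        self.gnodes = []

class MappingOntology:
    """Toy M-Ontology; exact label matching stands in for semantic search."""
    def __init__(self):
        self.gnodes = []
        self.anodes = []

    def find_gnode(self, element):
        # stands in for searchGNode (line 7): here, exact element match
        for gn in self.gnodes:
            if any(en.element == element for en in gn.enodes):
                return gn
        return None

    def find_anode(self, an):
        # searchANode (line 12): equivalent iff same G-Nodes in the same order
        for existing in self.anodes:
            if existing.gnodes == an.gnodes:
                return existing
        return None

def insert_element_list(mo, elist):
    """Sketch of Algorithm 5-1: insert elist into mo, return its abstract node."""
    an = ANode()
    gn = None
    for e in elist:
        gn = mo.find_gnode(e)           # lines 3-4, 7: existing cluster?
        if gn is None:
            gn = GNode()                # line 8: no match, start a new cluster
            mo.gnodes.append(gn)
        gn.enodes.append(ENode(e))      # lines 6, 9 (simplified: always wrap)
        if len(elist) > 1:
            an.gnodes.append(gn)        # line 11: preserve the original order
    found = mo.find_anode(an)           # line 12
    if found is not None:
        cn = found                      # line 13: reuse the identical A-Node
    elif an.gnodes:
        cn = an                         # line 14: register the new A-Node
        mo.anodes.append(cn)
    else:
        cn = gn                         # line 15: single-element list
    return cn
```

Inserting the same single-element list twice returns the same G-Node, and inserting the same multi-element list twice returns the same A-Node, mirroring the deduplication behavior of lines 12-15.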
The core of Algorithm 5-1 consists of two search problems (lines 12 and 7,
respectively): MO.searchANode(an) and MO.searchGNode(en).
a) searchANode can be implemented by directly evaluating the equivalence
between an and the A-Nodes in MO. Two A-Nodes are equivalent only when they
include the same G-Nodes in the same order.
b) searchGNode is implemented by comparing the semantic similarities of all the
G-Nodes in MO with the E-Node en. The target G-Node gn is the one with the highest
similarity value simgn−en that must exceed a given threshold. If no G-Node is eligible,
searchGNode returns a null value to the caller. The similarity simgn−en between gn and
en is determined by two steps:
Step 1: Preprocessing. The D-Attributes and Instances of en are collected and then
normalized by NLP (natural language processing) techniques: tokenization, stop-word
removal and stemming. The resulting texts are treated as two bags of words, that is,
two unordered sets of words, Set^DA_en and Set^INS_en.
Step 2: Calculating. Resulting from previously verified insertions of E-Nodes, the
G-Node gn has an instance set Set^INS_gn and two word sets Set^DL_gn and
Set^DA_gn, corresponding to its D-Labels and D-Attributes, respectively. If gn and en
are constraint-compatible, their semantic similarity is calculated by

    sim_{gn-en} = w1 × sim1(Set^DL_gn, Set^DA_en) + w2 × sim2(Set^DA_gn, Set^DA_en)
                + w3 × sim3(Set^INS_gn, Set^INS_en)

where w1, w2 and w3 are weights in [0,1] such that w1 + w2 + w3 = 1. The similarity
functions sim1, sim2 and sim3 are implemented by a hybrid matcher that combines
several linguistics-based matchers, such as WordNet-synonyms distances, 3-gram
distances and Jaccard distances. Many algorithms for schema/ontology matching have
been proposed (see the surveys [46, 116]), and most can easily be employed in this
phase. As matcher design is not the focus of our research, the details are omitted
here.
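The two steps can be sketched minimally in Python. The stop-word list and the crude suffix-stripping stemmer below are illustrative stand-ins for full NLP tooling, and a single Jaccard coefficient stands in for the hybrid matcher combining WordNet, 3-gram and Jaccard distances; the weights are placeholders:

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or"}  # tiny illustrative list

def normalize(text):
    """Step 1: tokenize, drop stop words, and crudely stem by suffix stripping."""
    words = set()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        for suffix in ("ing", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        words.add(token)
    return words

def jaccard(a, b):
    """One of the linguistics-based matchers named in the text."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def sim_gn_en(gn_dl, gn_da, gn_ins, en_da, en_ins, w=(0.4, 0.4, 0.2)):
    """Step 2: weighted combination
    sim = w1*sim1(DL_gn, DA_en) + w2*sim2(DA_gn, DA_en) + w3*sim3(INS_gn, INS_en)."""
    w1, w2, w3 = w
    return (w1 * jaccard(gn_dl, en_da)
            + w2 * jaccard(gn_da, en_da)
            + w3 * jaccard(gn_ins, en_ins))
```

In searchGNode, this score would be computed for every constraint-compatible G-Node, and the highest-scoring one returned if it exceeds the threshold.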
Phase 2: Expression Insertion. The target of this phase is to identify and create
T-Edges. The inputs of phase 2 include an expression Exp, and two abstract nodes cn1
and cn2 that respectively correspond to ListE1 and ListE2. Phase 2 checks if there exists
a T-Edge with the same expression as Exp from cn1 to cn2. If it does not exist, a T-Edge
containing Exp is created with a validation status of UNKNOWN; otherwise, this phase is
finished.
Phase 3: Human Interaction. If M-Ontology is only used to store mappings, the
whole process is fully automatic. However, M-Ontology also supports reuse-oriented
mapping discovery. Since mapping discovery is known as an AI-complete problem
[54, 91], human involvement cannot be avoided in the construction of M-Ontology, during
which many new mappings are discovered. Humans are responsible for the validation
and correction of nodes and edges. To improve the performance of M-Ontology, three
intervention strategies are designed:
• Validation. To alleviate possible redundancy and error propagation, only the nodes
and edges that have been validated by humans can affect future mapping insertions.
VerifyENode and VerifyTEdge are two of the main verification functions.
a) VerifyENode (Verification of E-Nodes): After an E-Node en is verified, its two
normalized bags of words, Set^DA_en and Set^INS_en, start to affect the
corresponding G-Node gn. The D-Attributes Set^DA_gn and Instances Set^INS_gn of
gn are updated to incorporate the words occurring in the E-Node. The machine-found
D-Labels Set^DL_gn may also evolve by selecting the D-Attributes with the top-k
frequencies (e.g., k = 5). When the G-Node is updated, the related A-Nodes are also
updated if necessary.
b) VerifyTEdge (Verification of T-Edges): The verification of a new T-Edge affects
not only its originally associated mapping, but also all the indirectly connected E-Nodes
inserted in the past and future. Verifying one T-Edge thus validates and stores n × m
mappings, where n and m are the numbers of possible E-Node combinations in the two
connected abstract nodes. Even more mappings can be found from VerifyTEdge if
mapping composition [17, 48, 88] is supported.
• Avoidance. Validation does not always guarantee the correctness and
non-redundancy of the mapping ontology, especially in the context of mass
collaboration. Thus, automatic detectors are proposed based on the structural
properties of M-Ontology. Alarms are sent to human users for assistance if any of the
following cases occurs:
(i) two or more verified T-Edges are found from one abstract node to another; (ii) a
new abstract node is created for mapping insertion; (iii) the found cn1 and cn2 are not
identical if the original mapping expression is “equivalence”.
• Prevention. Even when the current contents of M-Ontology are correct and
non-redundant, human users are encouraged to participate in enriching the mapping
ontology to improve construction performance. Manually modifying the attributes of
abstract nodes (such as D-Labels and Instances) can improve the precision and recall
of the construction algorithm, since the algorithm relies on these attributes to insert
E-Nodes and expressions. In addition, creators and modifiers are encouraged to leave
comments on nodes and edges; the information stored in metadata (such as
annotations) is critical for others' understanding.
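The D-Label refresh performed by VerifyENode can be sketched as follows. The helper name is hypothetical; a Counter tracks D-Attribute word frequencies, and k = 5 mirrors the example in the text:

```python
from collections import Counter

def verify_enode(gn_da, gn_ins, en_words, en_instances, k=5):
    """VerifyENode sketch: fold a verified E-Node's normalized words and
    instances into its G-Node, then recompute the top-k D-Labels.

    gn_da is a Counter of D-Attribute word frequencies for the G-Node;
    gn_ins is the G-Node's instance set."""
    gn_da.update(en_words)           # D-Attributes accumulate word frequencies
    gn_ins.update(en_instances)      # instances are unioned
    # machine-found D-Labels: the k most frequent descriptive words
    return [word for word, _ in gn_da.most_common(k)]
```

Because only verified E-Nodes feed this update, unvalidated (UNKNOWN) insertions cannot perturb the representatives that guide future matching.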
Following the above mapping insertion algorithm, M-Ontology is constructed
and enriched as more and more mappings are inserted. The main functionalities of
M-Ontology are mapping storage, management and discovery. First, the basic mapping
retrieval can be achieved in a fully automatic way: even if the algorithm does not find
the correct abstract nodes for insertion, the inserted mappings can still be accurately
retrieved based on their URIs and related metadata. Second, the management of a
large volume of mappings becomes easier for humans: (i) the mapping ontology can be
represented as a graph, so the operations in mapping management can be transformed
into several simple graph operations; (ii) the semantic modeling makes sense of
unorganized mapping information and thus enhances human understanding; (iii) the
representative-object-based clustering algorithm is more accessible to human users,
since they can easily improve its performance by manually changing the
representatives, e.g., adding or removing several D-Labels and D-Attributes. Third,
based on the semantic modeling and construction mechanism, a reuse-oriented
algorithm for mapping discovery can be efficiently implemented as discussed in the next
chapter.
Digging a little deeper into the structure of the mapping ontology, we can see two
important considerations:
• 1) Traditional domain ontology construction aims at semantic richness [121]
that is important for performing complex reasoning. However, richer semantics
requires more human involvement in ontology construction, which contradicts one
of our original goals: the best-effort reduction of human effort during the construction
and maintenance of DI systems. Therefore, the contents that M-Ontology maintains
are restricted to a sufficiently rich level. It
only includes the mapping-related information, since it is used only for mapping
storage and reutilization. The following chapter discusses how to construct such an
ontology.
• 2) Most traditional mapping repositories [93, 101, 155] store many-to-many
mapping instances directly without organizing them based on their semantics.
In contrast, M-Ontology attempts to identify and store the hidden concepts from
instances in a straightforward way and, at the same time, to address five
types of heterogeneity: language heterogeneity, syntactic heterogeneity, instance
heterogeneity, structural heterogeneity and semantic heterogeneity. It aims at
better mapping management and reutilization not only by machines but also by
normal users. The following chapter explains how to reuse mappings in such an
ontology.
5.2 MO-Repository and M-Table
Complementary to M-Ontology, MO-Repository is a repository of mapping
evolution. It records the evolution of mappings that serve the meta-queriers in a specific
application domain. That is, it stores the corresponding mapping objects Set〈mapObj〉;
each mapping object mapObjST denotes a concrete relation between two query forms
QFS and QFT, as defined in Chapter 4.2, and consists of the different versions of that
relation. Each version corresponds to a mapping instance.
An M-Table is a logical partition of MO-Repository. It stores the mapping evolution
of a specific query form; that is, it contains only the mapping instances that connect
to that query form. Different M-Tables may share the same mapping objects. M-Tables
are introduced to discover new mappings by exploiting the evolution process of
mappings (i.e., mapping reuse), as discussed in the next chapter.
Construction of MO-Repository can be viewed as the process of incrementally
building mapping objects (i.e., recording the mapping evolution). The evolution can
be categorized into two groups: 1) traceable evolution and 2) untraceable evolution,
distinguished by the availability of the step-by-step changes to query forms. If these
changes are available, we can obtain the correspondences of schema elements
between two versions of a query form, and the mapping instances can easily be
inserted into the corresponding mapping objects. Thus, the key problem is how to
construct MO-Repository by inserting mapping instances whose step-by-step evolution
is unavailable.
Algorithm 5-2 presents an incremental algorithm for constructing MO-Repository.
The algorithm shows the whole process of inserting a mapping instance mapST into an
MO-Repository MOR. Besides updating MOR, the algorithm returns a mapping object
mapObj that contains the instance mapST .
Algorithm 5-2: Insert a mapping instance into an MO-Repository.
Data: A mapping instance map〈EListS, EListT, ExpST〉, an MO-Repository MOR.
Result: MOR ← MOR ∪ {map}, a returned mapping object mapObj that contains map.
 1  Set〈mapObjST〉 = MOR.findMOBJ(S.URI, T.URI);
 2  Set〈SmapObj〉 = synEquvMOBJ(EListS, Set〈mapObjST〉);
 3  Set〈TmapObj〉 = synEquvMOBJ(EListT, Set〈mapObjST〉);
 4  if (countMOBJ(Set〈SmapObj〉) == 1 && countMOBJ(Set〈TmapObj〉) == 1) then
 5      if equalMOBJ(Set〈SmapObj〉, Set〈TmapObj〉) then
 6          mapObj = MOR.insertEdge(map);
 7  else
 8      switch (countMOBJ(Set〈SmapObj〉) + countMOBJ(Set〈TmapObj〉)) do
 9          case 0
10              mapObj = MOR.createMOBJ(map);
11          case 1
12              mapObj = MOR.insertNodeEdge(map);
13          otherwise
14              MOR.mergeMOBJ(Set〈SmapObj〉, Set〈TmapObj〉);
15              mapObj = MOR.insertNodeEdge(map);
16  return mapObj;
As shown in Figure 5-2 (a), each mapObjST can be represented by a bipartite graph
G = 〈Set〈NS〉, Set〈NT 〉, Set〈RST/RTS 〉〉. In the bipartite graph G , a mapping instance
〈EListS ,EListT ,ExpST 〉 is represented as an edge RST directed from a node NS to another
NT . For a to-be-inserted mapping mapST , the target mapObj must connect query forms
Figure 5-2. (a) An example of a mapping object. (b) The evolution of a global query form. (c) The evolution of a local query form.
QFS and QFT . In the algorithm, the first step is to obtain a set of mapping objects
Set〈mapObjST 〉 by imposing the URI conditions on the connected query forms (line 1 in
Algorithm 5-2).
The second step is to search Set〈mapObjST 〉 to identify a target mapping object
mapObj , into which mapST is to be inserted. The relation represented by mapObj
should be semantically equivalent to that of mapST . As discussed in Section 5.1,
semantics-based approaches cannot be fully automated, which means that
MO-Repository construction might require additional involvement of community
members. To avoid such side effects, we design an automatic approach to searching for
mapObj . Our approach makes the following assumption: a mapping instance is
contained in a mapping object if and only if the element lists (EListS or EListT ) of
this instance are evolved from or syntactically equivalent to the ones of another instance
in this mapping object. Following this assumption, two subsequent issues need to be
discussed before the description of our algorithm. First, the generated mapping objects
(i.e., relations) between two query forms are independent of each other. If the mapping
objects share syntactically identical element lists, they should be merged to form
a new mapping object. The merging is feasible and reasonable in the customized
meta-queriers. Second, there might exist several generated mapping objects that
represent the same mapping object. However, this will not occur in the customized
meta-queriers, since the evolution of global query forms is traceable (as shown in Figure
5-2 (b) and (c)). Therefore, two sets Set〈SmapObj〉 and Set〈TmapObj〉 are respectively
selected from Set〈mapObjST 〉 by syntactically comparing EListS or EListT with the nodes
in the sets Set〈NS〉, Set〈NT 〉 (line 2-3 in Algorithm 5-2).
If both sets Set〈SmapObj〉 and Set〈TmapObj〉 contain exactly one mapping object
and these objects are equivalent, this object is the target one, and an edge is inserted
into it to represent the inserted mapping instance map (lines 4-6). If both sets are
empty, a new mapping object is created, with two nodes and a single edge incorporated
to represent map (lines 7-10). If the two sets together contain only one mapping object,
that object is returned after a node and an edge are created to contain map (lines
11-12). Otherwise, the mapping objects in Set〈SmapObj〉 and Set〈TmapObj〉 are
merged to form a single mapping object (lines 13-14), and line 15 inserts the
corresponding nodes and edge for map.
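The decision logic of Algorithm 5-2 can be sketched in Python. Here a plain list of mapping objects stands in for MOR, and exact equality of element lists stands in for the syntactic equivalence test synEquvMOBJ; URI filtering, edge states and versions are omitted, and all names are illustrative:

```python
class MappingObject:
    """Bipartite graph of element-list versions; edges carry expressions."""
    def __init__(self):
        self.source_nodes = []   # Set<N_S>: source element-list versions
        self.target_nodes = []   # Set<N_T>: target element-list versions
        self.edges = []          # mapping instances (elist_s, elist_t, exp)

def insert_mapping(mor, elist_s, elist_t, exp):
    """Sketch of Algorithm 5-2; mor is a plain list of MappingObject."""
    s_objs = [m for m in mor if elist_s in m.source_nodes]   # line 2
    t_objs = [m for m in mor if elist_t in m.target_nodes]   # line 3
    if len(s_objs) == 1 and len(t_objs) == 1 and s_objs[0] is t_objs[0]:
        obj = s_objs[0]                          # lines 4-6: just add an edge
    else:
        # deduplicate while preserving order
        matched = list({id(m): m for m in s_objs + t_objs}.values())
        if not matched:
            obj = MappingObject()                # lines 9-10: brand-new object
            mor.append(obj)
        elif len(matched) == 1:
            obj = matched[0]                     # lines 11-12: extend one object
        else:
            obj = matched[0]                     # lines 13-15: merge objects
            for other in matched[1:]:
                obj.source_nodes += other.source_nodes
                obj.target_nodes += other.target_nodes
                obj.edges += other.edges
                mor.remove(other)
    if elist_s not in obj.source_nodes:
        obj.source_nodes.append(elist_s)
    if elist_t not in obj.target_nodes:
        obj.target_nodes.append(elist_t)
    obj.edges.append((elist_s, elist_t, exp))
    return obj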
CHAPTER 6
REUSE-ORIENTED MAPPING DISCOVERY
The holistic integration of construction and maintenance of meta-queriers through
component-level reuse greatly eases the task of large-scale meta-querier construction.
This chapter presents a novel approach [79] to schema matching that employs both
the semantics and the evolution of mappings, since the previous versions of mappings
and the mappings of peer query forms are potentially numerous and similar.
6.1 Problem Definition
Mapping discovery is a critical operation in meta-querier construction and
maintenance. To build a new meta-querier, the mappings among local query forms are
necessary for automatic generation of a global query form, and the mappings between
global and local query forms are required for query translation at runtime. If the global
and local query forms are updated, the existing meta-querier should be reconfigured
with new mappings.
The goal of mapping discovery is to find the active mapping instances from one
query form QFS to another QFT . However, existing techniques [46, 116] are not
adequate for discovering the mappings automatically, especially for instances with
non-equivalence expressions. Thus, our proposed solution aims at decreasing the
complexity and workload of mapping discovery for community members. Our
algorithm outputs not only the active mapping instances mapST but also their previous
versions (i.e., the mapping object mapObjST ). If the mapping expressions cannot be
found, the correspondences corrST are offered to facilitate mapping discovery.
In addition, in the pursuit of wide and active participation of non-technical volunteers,
our solution is designed intentionally to be straightforward to ordinary users. That is,
it is easier for them to understand the process, validate the results, and improve the
performance.
Algorithm 6-1: Reuse-oriented mapping discovery.
Data: A source query form QFS , a target query form QFT , an M-Ontology MO, an M-Table MT .
Result: Set〈mapST , corrST , mapObjST 〉.
1  Set〈eS〉 = extract(QFS);
2  Set〈eT〉 = extract(QFT);
3  Set〈eMT, eS/T〉 = MT.matchElement2Element(Set〈eS〉, Set〈eT〉, uri(QFS), uri(QFT));
4  Set〈mapST, corrST, mapObjST〉 = MT.reuseMTable(Set〈eMT, eS/T〉);
5  Set〈cG, cA, eS/T〉 = MO.matchElement2Concept(Set〈eS〉, Set〈eT〉, Set〈eMT, eS/T〉);
6  Set〈mapST, corrST, mapObjST〉 = MO.reuseMOntology(Set〈cG, cA, eS/T〉, Set〈mapST, corrST, mapObjST〉);
7  verify(Set〈mapST, corrST, mapObjST〉);
8  return Set〈mapST, corrST, mapObjST〉;
The core of our approach is mapping reuse. Reusing mappings implies modeling
the mappings at some level of abstraction. Instead of directly reusing individual
mappings as in previous work [35, 43, 116, 144], our strategy is to reuse the conceptual
abstraction of the mappings and their evolution for better performance. M-Ontology and
M-Table are built on the ontology-based and change-oriented modeling of mappings,
respectively. With a domain-specific M-Ontology MO and a system-specific M-Table
MT , our reuse-oriented algorithm discovers Set〈mapST , corrST , mapObjST 〉 from a
query form QFS to another query form QFT , as illustrated in Algorithm 6-1. The
overall algorithm can be split into three phases: 1) mapping discovery through the
mapping table MT (lines 1-4); 2) mapping discovery through the mapping ontology MO
(lines 5-6); 3) human validation and correction of the discovered mappings (line 7). The
details are discussed below.
6.2 Discovery through M-Table
In the first phase, discovery through the MO-Repository MOR is a search process
for reusing the existing mapping instances between the same pair of query forms. To
match the query forms from QFS to QFT , we first seek two sets of one-to-one element
correspondences Set〈(eS , eMOR)〉 and Set〈(eT , eMOR)〉, where the eMOR are elements
from MOR. Through these correspondences, the eligible mapping instances mapST ,
correspondences corrST and mapping objects mapObjST are selected by reasoning on
MOR. The undetermined elements in Set〈eS〉 and Set〈eT 〉 are passed to the second
phase together with the element correspondences Set〈(eS , eMOR)〉 and
Set〈(eT , eMOR)〉. The core is the implementation of two functions:
matchElement2Element and reuseMTable.
• matchElement2Element (line 3) selects and returns a set of element
correspondences Set〈(eX , eMOR)〉 for each side (that is, X can be S or T ). The
objective of correspondence selection is to maximize the utility value

    Σ_{eX ∈ X} Σ_{eMOR ∈ MOR} R(eX , eMOR) × a(eX , eMOR),

subject to the following four constraints:

    Σ_{eX ∈ X} a(eX , eMOR) ≤ 1, for each eMOR ∈ MOR
    Σ_{eMOR ∈ MOR} a(eX , eMOR) ≤ 1, for each eX ∈ X
    a(eX , eMOR) ∈ {0, 1}, for eX ∈ X , eMOR ∈ MOR
    R(eX , eMOR) ∈ {0, 1}, for eX ∈ X , eMOR ∈ MOR
where all the elements eX come from the same query form X (i.e., QFS or QFT ), and
the elements eMOR come from the MO-Repository MOR and belong to the same query
form.
a(eX , eMOR) takes the value 1 if a correspondence between eX and eMOR is selected,
and 0 otherwise. The feasible result must guarantee that each eMOR is assigned to
at most one eX and each eX is assigned to at most one eMOR . The similarity value
R(eX , eMOR) is equal to 1 only when the similarity value between eX and eMOR is
larger than a pre-defined threshold δ; otherwise it is 0. The similarity values are
calculated by the formula

    w1 × sim_DA-DA(eX , eMOR) + w2 × sim_INS-INS(eX , eMOR)

where w1 and w2 are weights in [0,1] such that w1 + w2 = 1. The similarity values
returned by sim_DA-DA and sim_INS-INS indicate how similar elements eX and eMOR
are with respect to their Descriptive Attributes and INStances, respectively. The
element attributes and instances are first normalized by NLP (natural language
processing) techniques: tokenization, stop-word removal and stemming. The resulting
texts are then treated as bags of words, that is, unordered sets of words. To compare
the similarities of these word sets, a hybrid matcher is employed that combines several
typical linguistics-based matchers, such as WordNet-synonym distances, 3-gram
distances and Jaccard distances. Many algorithms for schema/ontology matching have
been proposed (see the surveys [46, 116]), and most can easily be employed in this
phase. As this is not the major target of this research, the details are not explained
here.
Determining these element correspondences is a generalized assignment problem,
which is NP-hard. However, a simple greedy solution can quickly attain good results in
this context. First, the similarity matrices are small and few [28]; for example, query
forms in flight-ticket searching normally contain around 10 functional components, and
each query form is compared only with the query form that has the same identifier.
Second, the similarity matrix is sparse: very few similarity values are non-zero.
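A sketch of this greedy correspondence selection: since R(eX , eMOR) is 0/1, the problem reduces to a one-to-one matching that a similarity-descending greedy pass handles well on small, sparse matrices. The function name and threshold value are illustrative:

```python
def greedy_match(sim, threshold=0.8):
    """Greedy one-to-one correspondence selection.

    sim: dict mapping (eX, eMOR) pairs to a similarity value in [0, 1].
    Returns the selected pairs; each eX and each eMOR is used at most
    once, mirroring the two assignment constraints in the text."""
    # R(eX, eMOR) = 1 only when the similarity reaches the threshold
    candidates = sorted(
        ((s, ex, em) for (ex, em), s in sim.items() if s >= threshold),
        key=lambda t: -t[0],          # most similar pairs first
    )
    used_x, used_m, selected = set(), set(), []
    for s, ex, em in candidates:
        if ex not in used_x and em not in used_m:
            selected.append((ex, em))
            used_x.add(ex)
            used_m.add(em)
    return selected
```

On a sparse 0/1 similarity structure, this greedy pass and the exact optimum rarely differ, which is why the NP-hardness of the general problem is not a practical obstacle here.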
• reuseMTable (line 4) determines the mapping instances, correspondences and
mapping objects (i.e., Set〈mapST , corrST , mapObjST 〉) through the identified element
correspondences. 1) A mapping object is selected only if its bipartite graph contains at
least one fully hooked node in Set〈NS〉 or Set〈NT 〉, that is, a node all of whose
elements appear in the element correspondences Set〈(eS , eMOR)〉 or Set〈(eT , eMOR)〉.
For example, in Fig. 4-4(f), if all the elements in N1S are fully hooked, the mapping
object is included in the return set. 2) corrST is determined based on a property of
MO-Repository: the elements in every pair of nodes, one from Set〈NS〉 and one from
Set〈NT 〉, constitute a correspondence. For example, in Fig. 4-4(f), the elements in N1S
and N2T constitute a correspondence. Thus, if both node sets Set〈NS〉 and Set〈NT 〉
have at least one fully hooked node, the elements eS/T linked to these node pairs
constitute new correspondences. 3) A mapping instance is discovered by examining
each identified correspondence: i) whether an edge connects the nodes in the
corresponding bipartite graph; ii) whether the mapping state of this edge is Detached or
Usable; iii) whether the edge direction is QFS→QFT . If all three conditions hold, the
correspondence with this edge can be regarded as a mapping instance mapST . Finally,
Set〈mapST , corrST , mapObjST 〉 is found by reusing the previous mappings in MOR.
6.3 Discovery through M-Ontology
The second phase is to discover the mappings through exploiting the conceptual
abstraction from the existing mappings in the same domain. Our approach is based
on a feature of M-Ontology: when an E-Node that encapsulates a schema element
is classified into an existing G-Node, all the mappings associated with this G-Node
will automatically be assigned to this E-Node. That is, all the E-Nodes in the same
G-Node share their mapping information, since they have identical semantics in the
same format. A-Nodes are selected if all their constituent G-Nodes are already
identified.
The mapping information stored in T-Edges can be reused if these edges connect the
selected A/G-Nodes.
To determine the mappings from QFS to QFT , all the schema elements in Set〈eS〉
and Set〈eT 〉 are first classified into the appropriate G/A-Nodes in M-Ontology MO.
Based on classification results, we can discover new mappings by reusing the mappings
associated with the corresponding G/A-Nodes. The implementation is through two major
functions: matchElement2Concept and reuseMOntology.
• matchElement2Concept (line 5) classifies each undetermined element eS/T in
Set〈eS〉 and Set〈eT 〉 into appropriate G/A-Nodes from the M-Ontology MO. The
resulting G/A-Nodes constitute two concept sets ASetS and ASetT . A-Nodes are
selected once all their constituent G-Nodes are identified, so the major focus is how to
find an appropriate G-Node gn for a given element eS/T . Five strategies [80] can be
used to measure the suitability of a G-Node gn for eS/T :
• 1) MIN: This strategy returns the minimum similarity value between eS/T and
all E-Nodes associated with gn. It is a pessimistic expectation, since the match with
gn is more likely to fail if any one of its E-Nodes is less similar to eS/T .

    Similarity_E2GN(eS/T , gn) = MIN_{eni ∈ gn} { Similarity_E2EN(eS/T , eni) }
• 2) MAX: This strategy is an optimistic forecast. It returns the maximum similarity
value between eS/T and all E-Nodes in gn: as long as there exists an E-Node highly
similar to eS/T , gn is also considered very similar.

    Similarity_E2GN(eS/T , gn) = MAX_{eni ∈ gn} { Similarity_E2EN(eS/T , eni) }
• 3) AVG: This strategy returns the average over all similarity values between eS/T
and the E-Nodes in gn. It is a compromise between the two strategies above.

    Similarity_E2GN(eS/T , gn) = AVG_{eni ∈ gn} { Similarity_E2EN(eS/T , eni) }
All three strategies above rely on the similarity between eS/T and eni . This
similarity function Similarity_E2EN(eS/T , eni) is defined as follows, where the
comparisons are based on the properties/attributes of the E-Nodes:

    Similarity_E2EN(eS/T , eni) = Aggregation_{propi ∈ eS/T , propj ∈ eni} ( Sim_Prop2Prop(propi , propj) )
• 4) HISTORY: This strategy uses the similarity values between eS/T and its
previous version eH . It assumes that query-form evolution is not a comprehensive
change: at least some components remain identical, especially in design, functionality
and representation. Normally, this strategy is employed as a complementary solution
to improve the performance of the other strategies. The core question is how to find
and select the previous version eH . In our solution, the selection of eH depends on the
utility maximization of the function matchElement2Element (discussed in Section 6.2).
• 5) ABSTRACT: This strategy returns the semantic similarity between eS/T and
the representative object ro of gn. ro is represented by a tuple 〈DA,DL, INS , IT 〉,
where DA is a bag of descriptive words that can be generated through normalizing
the descriptive attributes of all human-verified E-Nodes in gn using the aforementioned
NLP techniques; DL is a set of descriptive labels that consist of the terms with the
top-k term frequency from DA; INS and IT respectively represent the instances
and their constraint types obtained from the verified E-Nodes the G-Node contains. In a
sense, the element classification problem is converted into a one-to-one semantic
matching problem between eS/T and ro. The implementation of semantic matching is
the same as the algorithms mentioned for searchGNode (Section 5.1):

    Similarity_E2GN(eS/T , gn) = Similarity_E2RO(eS/T , ro)
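The MIN, MAX and AVG strategies above can be sketched as a single aggregation function over the per-E-Node similarity values (the helper name is hypothetical; the per-pair similarities would come from the Similarity_E2EN comparison described above):

```python
def classify_similarity(e_sims, strategy="AVG"):
    """Aggregate the similarities between an element e and the E-Nodes of a
    G-Node, following the MIN / MAX / AVG strategies from the text.

    e_sims: list of per-E-Node values Similarity_E2EN(e, en_i)."""
    if not e_sims:
        return 0.0                  # empty G-Node: nothing to match against
    if strategy == "MIN":           # pessimistic: any dissimilar E-Node vetoes
        return min(e_sims)
    if strategy == "MAX":           # optimistic: one similar E-Node suffices
        return max(e_sims)
    if strategy == "AVG":           # compromise between the two extremes
        return sum(e_sims) / len(e_sims)
    raise ValueError(f"unknown strategy: {strategy}")
```

The element would then be classified into the G-Node with the highest aggregated score, provided it exceeds a threshold, exactly as in searchGNode.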
To make the algorithm easier to understand and control, we design a hybrid
solution that combines the last two strategies. First, the results Set〈eMT , eS/T 〉 of
matchElement2Element (line 3) are reused to determine eH : for a specific pair
〈eMT , eS/T 〉, the corresponding eMT is the target eH . For elements eS/T without
previous versions, the representative-based strategy is employed to discover the
corresponding concept nodes. Non-technical users can easily understand the whole
process and be involved in validating the results of element classification. The
outcomes can also be improved through manual changes to representative objects,
and these operations can be implemented as straightforward graph operations using
emerging HCI technologies (e.g., [30, 98, 107, 108]).
• reuseMOntology (line 6) discovers Set〈mapST , corrST 〉 (i.e., the mapping instances
and correspondences) from M-Ontology. First, we find two G/A-Node sets RSetS and
CSetS whose nodes are reachable from the nodes in the concept set ASetS . RSetS
consists of the nodes that can be reached through a single T-Edge from any node in
ASetS . CSetS is the node union of maximal connected subgraphs of the nodes in ASetS
(T-Edge direction is ignored in this calculation). Second, the desired concept-level
mappings are the overlap ASetS ∩ ASetT and RSetS ∩ ASetT . The first overlap denotes
the semantics-equivalence mappings between two concepts. The second one indicates
the concept mappings whose expressions are in the connected T-Edges. We can also
deduce the target concept correspondences from the overlap CSetS ∩ ASetT , although
M-Ontology does not have a T-Edge to link them together. Finally, we can easily discover
Set〈mapST , corrST 〉 through the element classification (from matchElement2Concept) with
these concept mappings and correspondences.
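Under the assumption that T-Edges are stored as directed node pairs, the overlap computation above can be sketched as follows (all function and variable names are illustrative, not the dissertation's implementation):

```python
# Sketch of the reuseMOntology overlap computation: RSetS via single T-Edges,
# CSetS via maximal connected subgraphs (T-Edge direction ignored).
from collections import deque

def t_edge_neighbors(t_edges, node):
    """Nodes reachable via a single T-Edge from `node` (directed)."""
    return {dst for src, dst in t_edges if src == node}

def connected_component(t_edges, start):
    """Maximal connected subgraph containing `start`, ignoring edge direction."""
    undirected = {}
    for src, dst in t_edges:
        undirected.setdefault(src, set()).add(dst)
        undirected.setdefault(dst, set()).add(src)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in undirected.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reuse_m_ontology(aset_s, aset_t, t_edges):
    rset_s = set().union(*(t_edge_neighbors(t_edges, n) for n in aset_s))
    cset_s = set().union(*(connected_component(t_edges, n) for n in aset_s))
    equiv = aset_s & aset_t   # semantics-equivalence mappings
    mapped = rset_s & aset_t  # mappings whose expressions are in T-Edges
    corr = cset_s & aset_t    # correspondences without a direct T-Edge
    return equiv, mapped, corr
```

The three returned sets correspond to the overlaps ASetS ∩ ASetT, RSetS ∩ ASetT and CSetS ∩ ASetT described above.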
6.4 Validating & Correcting Mappings
The final phase is to verify and correct the machine-discovered mappings by human
beings. Supporting meta-querier customization is likely to result in a large number
of mappings to be discovered. It is impractical for a small set of domain experts and
system designers to validate and correct these mappings. We propose the following
strategies to decrease their workloads:
Mass collaboration. Although the to-be-discovered mappings might be numerous,
meta-querier customization will also attract more users. We believe these users not
only have enough motivation to be involved in the validation, but also are willing to
volunteer to assist others. In our design, these users along with a small number of
domain experts can practically form a collaborative domain-specific community for
meta-querier construction. We divide the error-prone procedure into three subsequent
procedures with different difficulty levels. The tasks in the easiest level include two
verification jobs: verifying if the classified schema elements are assigned to the correct
G-Nodes and checking if the discovered T-Edges are appropriate to represent the
relations among the schema elements. Normally, these basic tasks can be assigned to
novices. In the second difficulty level, the tasks are determining unfound mappings and
searching M-Ontology to seek correct G-Nodes, if they exist, for unclassified elements.
The experienced users are able to handle these jobs. At the most difficult level of tasks,
M-Ontology needs to be updated (e.g., creating a new G-Node) for supporting new
mapping instances. The maintenance of M-Ontology should be one of the major jobs of
the domain experts.
Human friendliness. For any community-based system to be successful, it
is absolutely necessary that the system be easy to control. In our proposed
discovery algorithms, we can see the following special considerations in the design.
First, in the function matchElement2Element, correspondence selection is viewed
as a utility maximization problem on similarity matrix R(eX , eMOR). Thus, the results
can be improved through manual or automatic changes on R(eX , eMOR). For example,
verification on mappings might trigger changes on the corresponding values. Second,
in the function matchElement2Concept, manual modification on the representative
objects of concept nodes (such as D-Labels and Instances) can improve the precision
and recall. Third, graph-oriented interface is an essential requirement for systems with
a large community of novice users. Thus, both mapping objects and mapping ontology
can be represented by graphs. The related operations can be implemented with more
straightforward graph operations using emerging HCI technologies (e.g., [30, 98]). Last
but not least, even if the current contents of M-Ontology are correct and non-redundant,
domain experts are also encouraged to be involved in the enrichment of M-Ontology to
improve its performance. Comments on nodes/edges are welcomed from the
creators/modifiers. The information stored in metadata (such as annotations) is critical for
others' understanding.
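In the function matchElement2Element described above, correspondence selection is a utility maximization over the similarity matrix R. A minimal pure-Python sketch (the function name, matrix values and the 0.5 threshold are illustrative assumptions; a production system would use the Hungarian algorithm instead of brute force):

```python
from itertools import permutations

def select_correspondences(R, threshold=0.5):
    """Brute-force utility maximization over a square similarity matrix R.
    Returns the one-to-one pairing with the largest total similarity,
    keeping only pairs whose similarity exceeds `threshold`."""
    n = len(R)
    best, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(R[i][perm[i]] for i in range(n))
        if score > best_score:
            best, best_score = perm, score
    return [(i, best[i]) for i in range(n) if R[i][best[i]] > threshold]
```

Manual verification of a mapping can then simply raise or lower an entry of R and re-run the selection, which is the feedback loop the text describes.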
6.5 Related Work
Although schema matching has been widely researched for thirty years, most of the
current solutions (as in surveys [46, 116]) only consider one-to-one mappings. In reality,
many-to-many mappings are pervasive in real-world applications. Current solutions to
complex mappings can be classified into three categories as follows:
Learning-based approaches learn mapping templates from the existing mapping
instances. iMAP [35], also called COMAP [38], is proposed to solve several types of
1-to-n complex mappings by comparing the content or statistical properties of data
instances from two schemas. However, the number of its hardcoded rules limits the
types of complex mappings that can be found. HSM [135] and DCM [56] employ
correlation-mining techniques to find the grouping relations (i.e., co-occurrence) among
elements, but their solutions do not take into account how these elements correspond to
each other (i.e., mapping expressions).
Template-based approaches search for the most appropriate mapping templates
in a template set. The main drawback is their limited capability of finding complex
mappings: the templates are separate and static, and their number is
normally very limited. IceQ [144] integrates two human-specified rules into their systems
for partially finding Part-of and Is-a type mappings (i.e., two common one-to-many
mappings). QuickMig [43], as an extension of COMA [9], summarizes ten common
mapping templates, some of which are complex mappings. Several possible combinations
of templates are also presented. It also designs an instance-level matcher for finding
splitting or concatenation relationships.
Ontology-based approaches employ external ontologies for mapping discovery.
Two schemas to be matched are first matched to a single ontology separately, and
then element-level mappings can be found by deducing from the intermediary ontology.
However, its performance is largely affected by the coverage and modeling of ontologies.
For example, the ontology defined in SCROL [117] does not consider the syntactic
heterogeneity (defined in Chapter 4.3) caused by various semantics-equivalent
representations. In addition, the existing ontology-based solutions [117, 148] simply
assume that the shared ontologies are already built, but building such well-organized
ontologies is as hard as finding complex mappings. Ontology construction is also a
labor-intensive, time-consuming and error-prone problem [36, 110, 121].
Our proposed mapping discovery algorithm is a hybrid solution. First, the evolution
history of a specific mapping can be viewed as a template/rule. In a sense, the
discovery through MO-Repository can be viewed as finding rules to apply. Second,
discovery through M-Ontology is a typical ontology-based approach. We provide an
integrated solution to three indispensable subproblems: ontology design, ontology
construction and mapping discovery. The design of M-Ontology addresses five kinds of
heterogeneity: language heterogeneity, syntactic heterogeneity, instance heterogeneity,
structural heterogeneity and semantic heterogeneity. Third, ontology construction
employs learning-based approaches, i.e. incremental representative-based clustering.
Different from the classical ontology construction [36][110], the proposed ontology
is generated from schemas and mappings, which are abundant in our proposed
MQ-Customizer. As more schemas and mappings are inserted into the ontology,
more mappings can be correctly discovered from the ontology.
6.6 Conclusion
Current research proves that mapping discovery cannot be fully automated in
the foreseeable future, especially for those many-to-many mappings with expressions.
We believe that reuse of existing mappings is one of the best (or maybe the only)
solutions. In our proposed solution, reusing the similar mappings in the same domain
can avoid the repetitive tasks, when a large number of customized meta-queriers need
to be built and maintained. To enhance reuse potential from the existing mappings, we
introduce the ontology-based and change-oriented mapping models. Based on these
two models, a user-friendly discovery algorithm is provided. In our solution, discovery
of many-to-many mappings is converted into the discovery of one-to-one correspondences
from their evolution history and a domain-based mapping ontology. The proposed
algorithms for finding correspondences are straightforward to non-technical users. Our
experimental results on real-world data sets confirm the feasibility and effectiveness of
our reuse-oriented solution.
CHAPTER 7
ONTOLOGY-CENTRIC SOURCE SELECTION
Our work [78] investigates how to exploit the query capabilities [27, 74, 149]
of semi-structured sources for achieving more accurate selection. We propose an
ontology-centric approach to source selection. A domain-based ontology (M-Ontology)
was designed for meta-querier customization. With the assistance of the concepts and
relations in M-Ontology, user demands and source capabilities are modeled as concept
sets, and identified through query-form annotation, matched by an additive utility
function. The experiments on real-world data (in Chapter 8.3) illustrate the potential of
this ontology-centric method.
7.1 Related Work
Source selection refers to the process that deals with the selection of a list of
relevance-ranked data sources for meta-queriers. To distinguish our work from the
current solutions to source selection, we discuss the solutions to source selection based
on the following two aspects:
• Source types: unstructured and (semi-)structured data sources.
In the fields of distributed information retrieval systems, prior research on
source selection mainly focuses on the selection of sources containing unstructured data
(a.k.a., texts). To acquire the contents of data sources, randomly generated queries are
sent to obtain the sample texts. Through these fetched samples, the data sources can
be represented as a single big document (such as in CORI [21], CVV [150] and KL [147])
or a set of big documents (such as in ReDDE [131], CRCS [129] and SUSHI [138]).
With the rapid growth of the deep Web, (semi-)structured information sources
have been experiencing a remarkable increase. Normally, the query interfaces of
structured information sources are more complex than those of text sources. First, the
query-based sampling becomes impractical since it is difficult to retrieve the sample
documents through randomly generated queries. The automatically generated queries
usually cannot satisfy the hidden constraints on the inputs to the query forms. Second,
the topics of sources can be directly acquired through semantic analysis of the
query forms. At the same time, more complex query forms often carry more semantics.
Several studies [57, 84, 134] have been conducted on cross-domain source
selection. They cluster structured Web sources based on the query-form similarity
so that each cluster corresponds to a single domain. However, the domain-oriented
clustering is a coarse classification without considering the capability distinction.
• Selection types (selection-per-query vs. selection-per-engine).
The personalization of meta-queriers can be achieved through the source selection.
The selection can be made when users issue a query (called selection-per-query) or
when the meta-queriers are constructed (called selection-per-engine).
In some domains, the query forms are considerably simple. For example, most
news search engines only include a single keyword box and a click button. Thus, it is
not difficult to construct a global query form that integrates all the forms in the same
domain. In this context, the data sources can be selected based on the user inputs to
the global interface (i.e., selection-per-query).
However, in the other domains, it is not practical to build such a single interface
(or even a few) to encompass all the functionalities provided by the query interfaces in
the same domain. For instance, in the air-ticket booking domain, we can observe that
various meta-queriers provide different query interfaces with different functionalities.
Most differences are caused by the selection of sources in their construction (i.e.,
selection-per-engine).
Our work proposes a selection-per-engine strategy in the customization of
meta-queriers. Users are allowed to select their preferred data sources. To better
reuse the pre-integrated data sources, we provide a capability-based source selection
algorithm to recommend to users their potentially desired data sources. Complementary
to the query-based solutions, our approach is based on the query capabilities of data
sources, instead of the sampled source contents. Unlike the prior work on query-form
clustering, our approach is able to distinguish the query capabilities of the data sources
whose query forms have been clustered in the same domain.
7.2 Capability-based Recommendation
Capability-based source selection is a typical content-based recommendation
problem [4]. In our capability-based recommendation, users can declare their
capability needs by inputting the domains DMI and their preferred data sources
DSI . Based on the user needs, our system recommends users a ranked list of the
pre-integrated sources DSO . Such a capability-based recommendation problem can be
simplified to a utility maximization problem. More formally, let u be a utility function that
quantifies the desirability of a meta-querier mq or a data source ds for a specific user
need, i.e., u : DMI × DSI × DSO → R, where R denotes the nonnegative real numbers. To
best match the user needs, the recommendation system should maximize the utility by
selecting the meta-queriers mq and the data sources ds. That is, ∀dmI ∈ DMI, ∀dsI ∈
DSI, ds* = argmax_{ds ∈ DSO} u(dmI, dsI, ds). The effectiveness of such a solution relies on
the correct understanding of the user demands and data sources. Section 7.2.1 first
explains our ontology-centric modeling for source capabilities and the user preferences,
and then Section 7.2.2 presents the corresponding source-selection algorithms.
7.2.1 Modeling
In the context of meta-querier customization, query capabilities refer to the abstract
abilities of sources to retrieve information. The information of these sources can be
accessed through their associated query forms. Generally, the contents of query forms
decide the possible queries that can be posed by users, and thus they also dictate the
capabilities of the meta-queriers and data sources. Understanding the query forms
is a cornerstone of the capability-based recommendation. Therefore, we propose an
ontology-based approach to capture their capabilities by analyzing and understanding
the query forms.
As described in Chapter 4, each query form can be regarded as a set of query
conditions [62, 151], which we call schema elements. An individual element or an ordered
element list can represent a specific query capability [27, 74, 149] that a resource
possesses. However, the information carried in individual components is very limited.
It does not contain the context or knowledge concerning the textual description.
Methodologies of ontology are commonly used to address such representation
heterogeneity. Ontologies store well-defined concepts and relations including context
knowledge. If query forms are properly annotated using concepts in the same ontology,
machines can understand their semantics. As described in Chapter 5, M-Ontology is
designed for the storage, management and discovery of mappings that are employed to
transform the user inputs (for queries) among query forms. In the following, we briefly
introduce the nodes/edges and their correspondence with query capabilities.
(1) E-Nodes, Is-a Edges and G-Nodes. Each E-Node encapsulates a schema
element in a specific query form. Through generalization of a set of E-Nodes with the
same semantics and instance formats, a new concept node (i.e., a G-Node) is formulated,
with an Is-a Edge linking each inclusive E-Node to it. In essence, each G-Node corresponds to an abstract query capability.
For describing the semantics of such a query capability, a representative object ro
is automatically abstracted from the associated attributes of the inclusive schema
elements.
(2) A-Nodes and Part-of Edges. Each A-Node is formed through aggregating a
list of G-Nodes. A Part-of Edge is used to represent such an aggregation relation by
linking the A-Node to its inclusive G-Nodes. The A-Node also denotes an abstract query
capability, whose semantics can be approximated by a list of representative objects
List_ro.
(3) T-Edges. A T-Edge corresponds to a transformation relation between two
concept nodes (i.e., G/A-Nodes) which can be used to fetch similar contents. The
transformation relation contains a specific rule to convert the instances of the connected
concept nodes. Since their instances are convertible, all these connected nodes
represent similar query capabilities. In a sense, M-Ontology is a domain-specific query
capability repository, where each connected G/A-Node sub-graph generally corresponds
to an abstracted query capability.
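As a rough illustration of this node/edge vocabulary, the following data model sketches E-Nodes, G-Nodes, A-Nodes and T-Edges; the field names are assumptions, not the dissertation's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ENode:                  # one schema element of one query form
    form_id: str
    label: str

@dataclass
class GNode:                  # generalization of same-semantics E-Nodes
    rep_labels: List[str]     # representative object: descriptive labels
    members: List[ENode] = field(default_factory=list)  # Is-a edges

@dataclass
class ANode:                  # aggregation of a list of G-Nodes
    parts: List[GNode] = field(default_factory=list)    # Part-of edges

@dataclass
class TEdge:                  # transformation rule between two concept nodes
    source: object            # a G-Node or A-Node
    target: object
    rule: str                 # e.g. a splitting/concatenation expression
```

Under this encoding, a connected G/A-Node subgraph (linked by T-Edges) is exactly the "abstracted query capability" described above.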
Capability modeling: By using the proposed M-Ontology, we model the capabilities of
data sources based on their own query forms. Each data source has its own query form,
whose inclusive schema elements can be clustered into a set of G-Nodes in M-Ontology.
Let M-Ontology include a set of G-Nodes denoted by GN = {N1, N2, ..., Nn},
numbered from 1 to n by creation time. Since each G-Node denotes an
abstract capability in a specific domain, we use a subset of GN (called Capability Set
CS) to represent the capabilities that a data source is able to provide. That is, CS can be
denoted by a G-Node set {Ni |Ni ∈ GN}.
To obtain such a capability set, we need to seek correct G-Nodes to annotate the
schema elements in the corresponding query form. More precisely, the process of
capability capture can be viewed as schema annotation. In the context of meta-querier
customization, it is straightforward to capture the capabilities of the pre-integrated data
sources from their query forms. M-Ontology functions as a mapping repository, and thus
the mappings linked to these query forms should have been inserted into M-Ontology.
That means, all the schema elements have been clustered into the corresponding
G-Nodes. These G-Nodes are the components of the corresponding capability sets. The
detailed algorithms are presented in Section 7.2.2.
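Assuming the clustering of elements into G-Nodes is available as a lookup table, capability capture for a pre-integrated source reduces to a simple set construction (names are illustrative):

```python
def capability_set(query_form_elements, element_to_gnode):
    """Capability set CS of a data source: the G-Nodes its schema elements
    were clustered into, as recorded in M-Ontology."""
    return {element_to_gnode[e]
            for e in query_form_elements
            if e in element_to_gnode}
```

Elements missing from the lookup correspond to the unclassified case handled later by Annotation Verification.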
Preference modeling: The widely used preference model is based on keywords.
However, in the real-world scenarios, a few keywords are often unable to represent
the exact semantics. For the purpose of accurately understanding user demands, our
solution relies on the mapping-generated M-Ontology. In our solution, a specific user
need is modeled by a vector PV , called a preference vector. Given that M-Ontology
contains G-Nodes {N1, N2, ..., Nn}, the vector PV has n corresponding entries. The i-th
entry of vector PV, denoted by PV(i), is a preference value that indicates how much the user
prefers this capability in the target.
To construct such a preference vector, our solution provides two types of customization
mechanisms: i) Needs-oriented. Users are allowed to specify their explicit demands
through inputting the preferred data sources DSI and the corresponding application
domain DMI . Based on the inputs, our system identifies the potential user-preferred
entries in the PV . ii) Feature-oriented. Users can modify the preference values of the
identified entries. The implementation of these two mechanisms is explained in the next
section.
7.2.2 Demand Capture and Matching
This section presents the procedure of demand capture and matching. As illustrated
in Figure 7-1, the whole procedure consists of six phases. Based on user selection
(i.e., DMI ) from a list of data domains that exist in the system, Ontology Selection is
to choose an appropriate M-Ontology MO for understanding the user-preferred data
sources (i.e., DSI ). Only for the data sources that have not been integrated into any
meta-querier, Q-Form Normalization is invoked to unify their query form representation.
By analyzing the normalized query forms, Q-Form Analysis can output the capability
set CS for each data source. From the analyzed query forms, Demand Identification
constructs a preference vector PV . Demand Matching generates a ranked list of
resources for user selection. After user selection, if necessary, Annotation Verification
verifies the correctness of analysis on query forms of un-integrated data sources. The
details are described as follows.
• Ontology Selection is to choose an appropriate M-Ontology MO. In different query
forms, there might exist many schema elements with highly similar representations
whose semantics are nevertheless completely different due to their contexts. For
example, the word "keywords" has various meanings in different domains: in job
seeking, it represents job titles or fields, while in music search it might denote a
song name. Thus, our designed M-Ontology is domain-specific.
This assumption is reasonable for the customization of meta-queriers, which combine
the data sources in the same domain.
• Query-form Normalization is to unify the representation of the query form for each
data source in DSI that is not pre-integrated into our system. The data sources can be
categorized as two types, pre-integrated and new data sources. Since the pre-integrated
data sources have been analyzed, the results can be directly reused without repetitive
normalization. This phase only processes the query forms qformI that are not included
in M-Ontology. For each query form, its inclusive schema elements are associated
with some descriptive attributes and a set of potential instances (if they exist). We
first perform natural language processing techniques on the descriptive attributes and
instances in the following order: tokenization, stop-word removal and stemming. Then,
we treat the normalized texts as two unordered bags of words, Set〈w_i^DA〉 and
Set〈w_i^INS〉. These two bags of words constitute a word-bag pair, which can indicate
the semantic characteristics of this element. Finally, each query form corresponds to a
set of word-bag pairs, Set〈(Set〈w_i^DA〉, Set〈w_i^INS〉)〉.
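A minimal sketch of this normalization pipeline; the stop-word list and toy suffix stemmer are placeholders for the Porter/Krovetz stemmers used in our implementation (Chapter 8):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "for"}

def toy_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def word_bag(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # tokenization
    return {toy_stem(t) for t in tokens if t not in STOP_WORDS}

def normalize_element(descriptive_attrs, instances):
    """Return the word-bag pair (Set<w_DA>, Set<w_INS>) for one schema element."""
    return (word_bag(" ".join(descriptive_attrs)),
            word_bag(" ".join(instances)))
```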
• Query-form Analysis is to understand the query forms by analyzing the semantics of
each inclusive schema element e. Our solution is to annotate each e by semantics-equivalent
concepts (i.e., G-Nodes from the M-Ontology MO). The involved G-Nodes constitute the
capability set CS of the corresponding data source. G-Node searching can be divided
Figure 7-1. The ontology-centric source selection algorithm.
into two separate types: a) For the pre-integrated data sources, their query forms have
been included in the M-Ontology, and thus each schema element of these forms should
have been clustered to a certain G-Node. That is, such an association can be directly
reused. b) For the new data sources, each schema element in their query forms is
normalized to two word bags Set〈w_i^DA〉 and Set〈w_i^INS〉. As described in Chapter 5.1, the
suitability of a G-Node gn can be measured by the constraint equivalence and semantic
similarity between e and the corresponding representative object ro of gn. The semantic
similarity can be calculated as follows.
λ1 Σ_{i=0}^{n1} Σ_{j=0}^{m1} sim(w_i^DA, t_j^DA) + λ2 Σ_{i=0}^{n2} Σ_{j=0}^{m2} sim(w_i^DA, t_j^DL) + λ3 Σ_{i=0}^{n3} Σ_{j=0}^{m3} sim(w_i^INS, t_j^INS)
where λ_i is a scale factor, the two word sets Set〈w_i^DA〉 and Set〈w_i^INS〉 are the
semantic characteristics of e, and the tuple 〈Set〈t_i^DA〉, Set〈t_i^DL〉, Set〈t_i^INS〉〉 is the
representative object ro. The function sim determines the semantic similarity between
two terms (respectively from e and ro). Our implementation relies on the
WordNet-synonyms distance, a linguistic-based matcher.
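The weighted score above can be sketched as follows, with a toy exact-match `sim` standing in for the WordNet-synonyms matcher; the weights are illustrative assumptions:

```python
def toy_sim(w, t):
    """Placeholder term similarity: 1.0 on exact match, else 0.0."""
    return 1.0 if w == t else 0.0

def element_gnode_similarity(e_da, e_ins, ro, weights=(0.4, 0.3, 0.3), sim=toy_sim):
    """e_da/e_ins: word bags of the element e;
    ro: the representative object (Set<t_DA>, Set<t_DL>, Set<t_INS>)."""
    l1, l2, l3 = weights
    t_da, t_dl, t_ins = ro
    return (l1 * sum(sim(w, t) for w in e_da for t in t_da)
            + l2 * sum(sim(w, t) for w in e_da for t in t_dl)
            + l3 * sum(sim(w, t) for w in e_ins for t in t_ins))
```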
• Demand Identification constructs a preference vector PV based on the selected data
sources and user interaction. The whole procedure is composed of two steps:
1) Automatic discovery: Each G-Node corresponds to an entry PV (i) whose value
indicates the preference degree for a specific capability. In M-Ontology, G-Nodes
that are connected via one or more T-Edges and Part-of Edges (with edge directions
ignored) constitute a maximal connected G-Node sub-graph. Such a sub-graph
corresponds to an abstracted query capability, and within it the preference values
of the G-Nodes are correlated. Since the conversion through T-Edges and Part-of
Edges might lose semantics, the distance between two G-Nodes i and j indicates
their dissimilarity. Here, the distance refers to the minimum hop number from one
node to another, and the value 1/(distance(i, j) + 1) is used to represent the potential
difference between their capabilities. Assume that GSetM is a multiset containing the
G-Nodes that encapsulate the schema elements of the data sources in DSI; it might
contain duplicates.
The occurrence number of a G-Node in GSetM is equal to the appearance frequency
of its encapsulated schema elements in DSI. Preference vectors can be decided by two
modes: combination and accumulation.
• The combination mode is preferred when users want to find the data sources
containing all the capabilities (with the same desirability) that are supported by
the user-inputted sources DSI . The values can be obtained through the following
procedure: First, all the entries whose G-Nodes are in GSetM are set to one, i.e.,
PV (i) = 1 if i ∈ GSetM ; otherwise, the values are initialized to zero. Second, for the
G-Nodes that are not in GSetM , their values are calculated based on the distance
to the nearest node in GSetM . The procedure can be represented as eq. (7-1).
PV(i) = 1, if i ∈ GSetM; PV(i) = 1 / min_{j ∈ GSetM} {distance(i, j) + 1}, if i ∉ GSetM. (7–1)
• The accumulation mode assumes that the most critical capabilities desired by the
user are the ones that appear most frequently in the user-inputted sources. For the
entries whose G-Nodes are in GSetM, the values of PV(i) are equal to the
multiplicity (i.e., occurrence number) of their corresponding G-Nodes in GSetM.
For the other entries, their values are accumulated based on the distances to the
nodes in GSetM and their preference values, as shown in eq. (7-2).
PV(i) = multiplicity(i), if i ∈ GSetM; PV(i) = Σ_{j ∈ GSetM} PV(j) / (distance(i, j) + 1), if i ∉ GSetM. (7–2)
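Both modes can be sketched with a breadth-first-search distance helper; the graph encoding (an undirected adjacency dict over G-Nodes) and all names are assumptions:

```python
from collections import deque

def distances_from(graph, sources):
    """Minimum hop count from any node in `sources` (undirected adjacency dict)."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def preference_vector(graph, gset_m, mode="combination"):
    """gset_m: multiset (list) of G-Nodes drawn from the user-selected sources."""
    members = set(gset_m)
    pv = {}
    if mode == "combination":                       # eq. (7-1)
        dist = distances_from(graph, members)
        for node in graph:
            pv[node] = 1.0 if node in members else (
                1.0 / (dist[node] + 1) if node in dist else 0.0)
    else:                                           # eq. (7-2), accumulation
        for node in graph:
            if node in members:
                pv[node] = float(gset_m.count(node))        # multiplicity
            else:
                pv[node] = sum(
                    gset_m.count(j) / (d + 1)
                    for j in members
                    for d in [distances_from(graph, {j}).get(node)]
                    if d is not None)
    return pv
```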
2) Manual correction (optional): Since the automatic decision of preference values
might not accurately reflect the user demands, optional manual correction is needed to
correct the values by the users themselves. However, it is often inappropriate or
impractical to present the whole preference vector, especially when the vector is very
long. In our design, the users are able to modify the values that are not equal to zero.
This is the so-called feature-oriented mechanism.
• Demand Matching: This phase generates a ranked list of data sources that best
match the identified demands. The list is ranked by comparing the numeric values
of the utility function u, which quantifies the desirability of a data source for a specific
user need. More precisely, in the application domain DMI, a resource R1 ∈ DMI is
preferred to a resource R2 ∈ DMI if and only if the expected utility of R1 is at least
that of R2: ∀R1, R2 : R1 ≽ R2 ⇔ u(R1) ≥ u(R2). Such a rational preference relation
≽ is transitive, reflexive and complete.
The preference ranking is a typical multi-criteria decision making problem. In
meta-querier customization, each criterion corresponds to a query capability identified
in preference vectors (i.e., a G-Node). To make the ranking outcomes manageable
by users, we assume additive independence exists among the maximal connected
subgraphs, which is a normal assumption [67]. The utility of a resource R can be
approximated by using an additive value function that breaks one n-criteria function
into n individual one-criterion functions. Such an approximation not only simplifies
the automatic adjustment and manual correction, but also performs well, even if the
assumption does not strictly hold [120]. We construct an additive utility function u to
aggregate the utility cu(R[Ni ]) of each individual capability Ni provided by the resource
R. The utility cu(R[Ni ]) is 1 when the capability set CS of R contains Ni ; otherwise it
is zero. The additive weight of Ni is decided by its preference value PV(i), which is
generated from user inputs. For the nodes in any maximal connected subgraph sg,
the sum of their weighted utility values should be less than or equal to δ: δ is 1 if the
combination mode is used, and δ = Σ_{Ni ∈ V(sg) ∩ GSetM} PV(i) if the accumulation
mode is used.
Let MCSG be the set of maximal connected subgraphs; the weighted utility function can
be represented as follows:

u(R[N1, N2, ..., Nn]) = Σ_{i=0}^{n} (PV(i) × cu(R[Ni])) (7–3)

subject to the constraint: ∀sg ∈ MCSG, Σ_{Ni ∈ V(sg)} (PV(i) × cu(R[Ni])) ≤ δ.
• Annotation Verification (after run-time): This phase verifies whether the new
data sources are annotated with the correct G-Nodes. For those schema elements
that cannot match with any concept node, the manual annotation is invoked to update
and maintain M-Ontology (e.g., by inserting new mappings among the related query
forms). As an optional phase, the manual verification can be conducted after the source
selection. Although the new data sources might not be supported immediately, the
verified annotation can be reused for the future recommendation.
7.3 Conclusion
Capability-based customization is desired to meet various user needs. In our
customization strategy, we design a capability-based solution to source selection
based on user inputs. Modeling of user demands is still an open problem in the
selection of semi-structured sources, especially for the capability-based selection.
Without adequately accurate understanding of source capabilities, the selection in the
previous research [57, 84, 134] is normally coarse-grained and unable to distinguish the
functionality differences among the sources. Our solution is based on M-Ontology, which is
viewed as a repository of capabilities. Both the user demands and source capabilities
are modeled by using the concepts in M-Ontology. A semi-automatic solution to
demand elicitation is also proposed through semantic annotation on the query forms
of user-preferred data sources.
In the recommendation results, data sources are ranked by their matching values
between needs and sources. Each matching value quantifies the desirability of a
specific data source for a particular user need. It is calculated through measuring the
similarity of the users’ preference vector with the capability sets of the sources. The
value calculation is treated as a multi-criteria decision making problem. Each criterion
corresponds to the desirability of a specific capability. An additive utility function is
proposed to combine all the criteria through comparison of the preference vector with
the capability set.
CHAPTER 8
IMPLEMENTATION AND EXPERIMENTS
8.1 System Structure of Mapping Repositories
Mapping repositories are built on top of an open-source system “Alignment Server”
[47]. This system provides some basic services, like mapping storage and ontology
matching. It employs a big table to store mappings, each of which is treated as a tuple
in the table. We extend it to support our proposed mapping repositories. As shown in
Figure 8-1, the mapping repositories can be employed by different applications such as
mapping retrieval, schema matching and source selection. The repositories are roughly
divided into four levels: mapping level, element level, structure level and system level.
Mapping level is responsible for management of mapping instances and the
associated metadata. These instances can be retrieved, inserted, edited or deleted
by community members. The query forms connected by the instances are stored in a
separate repository, called the schema repository. Each form is represented as an OWL file.
The file name can be used to identify and retrieve the corresponding query form for
mapping management.
Element level implements the concept of E-Nodes and offers the basic operations
on E-Nodes.

[Figure: the layered repository architecture. Applications (schema matching, source selection, mapping retrieval) access the mapping repositories through the Repository API; the repositories comprise the system level (views, operations), structure level (M-Ontology, MO-Repository), element level (E-Nodes) and mapping level (mappings), backed by a relational database.]

Figure 8-1. The Repository Structure.

Schema elements in mapping instances are extracted individually and
normalized by NLP (natural language processing) techniques: tokenization, stop-word
removal and stemming. Different algorithms can be selected for different application
domains. For example, we support two types of stemmer: Porter Stemmer [112] and
Krovetz Stemmer [69]. At this level, no relations between E-Nodes are considered; all the
E-Nodes are operated on separately.
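A minimal sketch of this element-level normalization pipeline follows. The stop-word list and the suffix rules here are illustrative stand-ins, not the Porter or Krovetz algorithms cited above.

```python
import re

# Illustrative stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"the", "of", "a", "an", "in", "to", "and", "or"}

def tokenize(label):
    """Split a form-field label such as 'DepartureDate' or 'departure_date'."""
    # Break camelCase boundaries, then split on non-alphanumeric characters.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]

def stem(token):
    """Crude suffix stripping standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(label):
    """Tokenization, stop-word removal, then stemming, as in the text."""
    return [stem(t) for t in tokenize(label) if t not in STOP_WORDS]

print(normalize("Date of Departure"))  # → ['date', 'departure']
```

Normalized token lists like these become the comparable form of E-Node labels.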
Structure level implements the mechanisms of MO-Repository and M-Ontology.
Mapping objects, G-Nodes, A-Nodes and T-Edges are defined and implemented in this
level. The operations (e.g., creation, merging, modification, insertion and deletion) on
these nodes and edges are conducted in this level, while the associated metadata are
also recorded.
System level offers the support of customized meta-querier construction and
maintenance. First, it offers a mechanism to define customized views to adapt
meta-querier customization. Second, it provides various syntactic difference and
similarity calculations among E-Nodes, G-Nodes and A-Nodes through a combination of
third-party software components, e.g., the calculation of 3-gram distances, WordNet-synonym
distances and Jaccard distances.
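For instance, the 3-gram and Jaccard distances named above can be computed as in the following sketch; the actual third-party components used by the system may normalize strings differently.

```python
def ngrams(s, n=3):
    # Character n-grams of a lower-cased label.
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 0))}

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B| over two sets (tokens, grams, instances, ...).
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def trigram_distance(s1, s2):
    # 3-gram distance between two labels as a Jaccard distance on gram sets.
    return jaccard_distance(ngrams(s1), ngrams(s2))
```

Identical labels yield a distance of 0, completely dissimilar labels a distance of 1.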
8.2 Experiments for Ontology Construction
To evaluate the feasibility and effectiveness of M-Ontology, we conduct experiments
to simulate ontology construction. Mapping diversity and repetition depend on user
selection of data sources, the composition of local schemas and the degree and
frequency of their changes. Thus, it is hard to simulate the complete processes and
performances of real applications. The following experiments focus on the effectiveness
of two functions (i.e., SearchGNode and VerifyENode) that are the main factors
determining the performance of the proposed algorithms in ontology construction and
mapping discovery. More specifically, two questions are addressed in the experiments.
i) Feasibility: in real-world meta-queriers, how high is the percentage of schema
elements that have corresponding concepts stored in M-Ontology during the incremental
construction process? ii) Effectiveness: how well do our algorithms perform in
searching out suitable G-Nodes for a schema (i.e., the function SearchGNode) after the
insertion results of a set of schemas have been verified (i.e., the function VerifyENode)?
Performance Metrics. We use four metrics to evaluate the performance of the
algorithms: Hit-rate_e, Precision_e, Recall_e and Fmeasure_e. Hit-rate_e is designed to measure
the feasibility of M-Ontology. Precision and Recall are widely used in evaluating the
effectiveness of information retrieval systems. Precision_e measures the degree of
correctness of G-Node searching. Recall_e measures the degree of completeness of
G-Node searching. By combining Precision_e and Recall_e, Fmeasure_e [37] examines
the overall effectiveness of the proposed algorithm. In the setting of M-Ontology, these
metrics can be defined for a specific value m as:

• Hit-rate_e = MO_e / NUM_e

• Precision_e = Crt_e / Total_e

• Recall_e = Crt_e / MO_e

• Fmeasure_e = 2 × (Precision_e × Recall_e) / (Precision_e + Recall_e)

where NUM_e is the total element number; MO_e is the number of elements that
can be correctly identified with the G-Nodes; Crt_e is the number of elements that are
matched to the correct G-Nodes; Total_e is the number of elements that are matched to
a specific G-Node. Fmeasure_e is the weighted harmonic mean of Precision_e and Recall_e.
Note that all the above element numbers are from the schema to be inserted/matched.
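Under these definitions the metrics reduce to a few ratios. A sketch, assuming the four counts have already been tallied for the inserted/matched schema:

```python
def element_metrics(num_e, mo_e, crt_e, total_e):
    # num_e: total elements in the schema; mo_e: elements whose correct
    # G-Node exists in M-Ontology; crt_e: elements matched to the correct
    # G-Node; total_e: elements matched to some G-Node.
    hit_rate = mo_e / num_e
    precision = crt_e / total_e
    recall = crt_e / mo_e
    fmeasure = 2 * precision * recall / (precision + recall)
    return hit_rate, precision, recall, fmeasure
```

For example, a 10-element schema with 9 concepts present in M-Ontology, 10 elements matched and 8 matched correctly gives Hit-rate 0.9, Precision 0.8 and Recall 8/9.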
Data Sets. To examine the performance of M-Ontology in real-world data integration
problems, we collect 141 query-form URLs from the UIUC web integration repository [2]
after removing the inactive websites. All these query-form URLs are from three domains:
Books (47), Movies (49) and Music Records (45), where the value in parentheses
denotes the number of URLs in the domain. The information in the query forms is manually
extracted from the HTML source code to generate the corresponding Web schemas.
Domain    GN no.   Rare GN pct.   Rare Schema pct.
Movies    30       43.3%          16.3%
Books     35       49.0%          12.8%
Records   23       56.5%          15.6%
Table 8-1. Statistics of the domains
These schemas are represented in OWL to accommodate the mapping format of
Alignment Server.
Given these schemas in OWL, we identify the concepts (i.e. G-Nodes) by manually
clustering the schema elements for each domain. As illustrated in Table 8-1, 30
G-Nodes are generated from the data set “Movies”. Among them, 43.3% of the G-Nodes
have only a single schema element, and these schema elements come from 16.3%
of the schemas. That is, the “rare” concepts occur in only a few schemas. This indicates that most
to-be-classified schema elements can find their corresponding concepts in the other
schemas of the same domain. The other two domains show the same patterns.
Experiment Scenario. Assume that m Web schemas from the same domain are
already inserted into M-Ontology based on our proposed algorithms. The insertions
of these schemas have already been corrected and verified by human beings, so that
the attributes (e.g., Set_DA^gn) of the corresponding G-Nodes are updated to combine the
verified schema elements. We attempt to find the correct G-Nodes for a single schema X
from M-Ontology. To estimate the performance, we design two sets of experiments: 1)
X is not in M-Ontology (illustrated in Figure 8-2); 2) X might be in M-Ontology (illustrated
in Figure 8-3).
Each experimental result shows eleven values of m ranging from 1 to 45. The
performance measures for each m are calculated by the average of 200 samples, each
of which is automatically generated from the schema sets in that domain. Each sample
includes m different schemas that exist in M-Ontology and the schema X. All these
schemas are randomly selected.
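The sample generation just described can be sketched as follows; the function and parameter names are hypothetical, chosen for illustration.

```python
import random

def make_samples(schemas, m, n_samples=200, exclude_x=True, seed=0):
    """Draw experiment samples: m schemas for M-Ontology plus a test schema X.

    With exclude_x=True, X is never among the m inserted schemas
    (Experiment 1); otherwise X may coincide with an inserted schema
    (Experiment 2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        if exclude_x:
            chosen = rng.sample(schemas, m + 1)
            inserted, x = chosen[:m], chosen[m]
        else:
            inserted = rng.sample(schemas, m)
            x = rng.choice(schemas)
        samples.append((inserted, x))
    return samples
```

Averaging a metric over the 200 returned samples then yields one point on each performance curve.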
Experiment 1: Without Schema Repetition. In this set of experiments, the schema X
is randomly selected from the schemas that are not stored in M-Ontology. That is, X is
different from any schema in M-Ontology for the purpose of removing the influence of
possible repetitiveness. As shown in Figure 8-2, the same trends can be observed in all
three domains.
In Figure 8-2(a), when m reaches 15, at least 90% of the elements in X can find the
correct G-Nodes, so these elements can directly reuse the associated mappings.
The importance of existing schemas for the Hit-rate is clearly indicated, as a sharp
increase of Hit-rate can be seen when m is smaller than 15. The remaining curve seems
to indicate that additional schemas (> 15) barely affect the Hit-rate, since
90% of the concepts are already collected from the first 15 schemas. However, this is not
always true, since more mapping edges might be added as more schemas are inserted.
[Figure: four panels — (a) Hit-Rate, (b) Precision, (c) Recall, (d) Fmeasure — plotted against the number of schemas included in M-Ontology (0 to 45) for the books, movies and records domains.]

Figure 8-2. Experiment results of schema element classification without schema repetition.
The experimental results also match the conclusion by a related study [28] showing
that the aggregated vocabularies used to describe schema elements are “clustering in
localities and converging in size”.
In Figure 8-2(b), the values of Precision stay around 90% after the initial learning
process (m > 10). Although more G-Nodes become available for classification as m
increases (Figure 8-2(a)), our proposed searching algorithm can still classify most
schema elements into the correct concepts.
In Figure 8-2(c), a sharp increase of Recall can be seen before m reaches 10,
followed by a steady and slow improvement as m increases. This phenomenon
indicates that the D-Labels of G-Nodes cannot be correctly identified when the included
schema elements are very few. As m increases, the contents of the D-Labels become
steady, but the Instances and D-Attributes can accumulate more useful information from the
newly verified schema insertions. Thus, Recall increases slowly and steadily after m is
larger than 10.
In Figure 8-2(d), the values of Fmeasure are higher than 85% when m > 15. As an
overall performance measure, Fmeasure values indicate that the algorithms are effective
in the classification of schema elements. Its real-world performance should improve
further if we also employ the other two human intervention strategies, i.e., avoidance
and prevention, discussed in the end of Section 5.1.
Experiment 2: With Schema Repetition. When building a large number of meta-queriers,
it is highly possible that X has already been inserted into M-Ontology since the ontology
is shared by all the meta-queriers in the same domain. One typical example is given
in the motivating scenario of Section 4.3.1. Thus, we also conducted the second set of
experiments in which X is randomly selected from the whole schema set and it might be
identical to one of the existing schemas in M-Ontology.
Intuitively, the results of Experiment 2 should be better than those of Experiment 1,
and the results clearly confirm this expectation. Figure 8-3(a) reveals that the Hit-rate
reaches 90% when m is around 10 and almost achieves the highest value (i.e., 100%)
when m is around 45. Figure 8-3(d) shows that the Fmeasure values are above 90% after
m is greater than 20, and also reach the highest value. In a real implementation, the
algorithms should perform better than the experiment results in Figure 8-3, since most
users will choose the data sources with more functionalities (i.e., complex query forms)
and better data quality.

[Figure: four panels — (a) Hit-Rate, (b) Precision, (c) Recall, (d) Fmeasure — plotted against the number of schemas included in M-Ontology (0 to 45) for the books, movies and records domains.]

Figure 8-3. Experiment results of schema element classification with schema repetition.
From the above two sets of experiments, we see that schema elements in the
same domain can be clustered to generate a relatively “small” mapping ontology. That
indicates it is feasible to construct a mapping ontology as a knowledge base shared
by all the meta-queriers in the same domain. The performance results in terms of
Fmeasure demonstrate the effectiveness of the algorithms even without considering the
avoidance and prevention human-intervention strategies. Furthermore, as the number of
data sources on the Web has been steadily increasing [20, 28, 87], better performance
of M-Ontology can be expected when more mappings are included.
8.3 Experiments for Mapping Discovery
We evaluate our reuse-oriented mapping discovery by conducting several sets
of experiments on real-world query forms. The goal of the experiments is to examine
the feasibility and effectiveness of our approach. Specifically, the experiments are
designed to answer the following three research questions in the context of meta-querier
customization: 1) Is it practical to find the target mappings/correspondences in
M-Ontology? 2) Is our algorithm effective in identifying the mappings from M-Ontology?
3) Is our algorithm able to improve the effectiveness of mapping discovery with the
introduction of mapping evolution?
Data sets: We first collected 38 query-form URLs from the air-ticket booking data set in
the UIUC web integration repository [2] after removing the inactive forms in May 2009.
These query forms are manually extracted from their HTML code and represented
in the Web Ontology Language (OWL). These OWL files form the first data set AIRO. After 22
months, we revisited the web pages containing these query forms, 23 of which had been
syntactically changed; 23 corresponding OWL files are generated for them. The resulting
61 query forms constitute the second data set AIR. From these forms, 730 schema
elements are manually classified based on their semantics. Each schema element is
classified to a single G-Node and a single G/A-Subgraph. These classification results
are used in the performance evaluation of the following four sets of experiments.
Experiment setup: In the following experiments, we assume that m Web query forms
are already contained in M-Ontology/MO-Repository. For M-Ontology, the representative
objects are automatically generated from the verified schema elements, without manual
correction on the D-Attributes, D-Labels and Instances. We also assume that T-Edges
have been created to connect the semantically equivalent concept nodes. In the following
experiment results, each value is obtained by the average of 100 samples. To ensure
measurement fairness and accuracy, each sample is randomly generated from the
data sets without any duplicate. Four measures are used in each experiment: Hit-rate
measures the proportion of the expected concepts/mappings that exist in a mapping
repository (i.e., M-Ontology or MO-Repository). Precision measures the proportion
of the correctly identified concepts/mappings over the total identified ones. Recall
measures the proportion of the correctly identified concepts/mappings over the total
correct and identifiable ones. Fmeasure is the weighted harmonic mean of precision
and recall.
Experiment 1 evaluates the performance of mapping discovery by using only M-Ontology.
Only the abstract-based element classification is used without the history-based
strategy. The experiment is to match two query forms (from the data set AIR) that are
not in M-Ontology. As illustrated by red dotted lines in Fig. 8-4, our ontology-based
approach to mapping discovery looks promising in both feasibility and effectiveness.
(The blue solid lines show the performance of abstract-based element classification; the
comparison and discussion are provided in Experiment 2.)

[Figure: four panels — (a) Hit-Rate, (b) Precision, (c) Recall, (d) Fmeasure — plotted against the number of schemas included in M-Ontology (0 to 60).]

Figure 8-4. Experiment results of mapping discovery through M-Ontology.

First, Fig. 8-4(a) shows that
Hit-Rate reaches almost the highest value 1.0 when m is above 10. That indicates that
most of the mappings can be found in M-Ontology when at least 10 query forms are fully
integrated in meta-queriers. That is, the reuse of mappings is practical. Second, in Fig.
8-4(b), we observe that Fmeasure is above 0.8 if m is larger than 5 and above 0.9 when m
is larger than 35. As expected, the effectiveness of our mapping discovery algorithm
improves as M-Ontology is enriched. When there is enough information
in M-Ontology, most of the mappings can be effectively identified by our algorithm. The
following two experiments will present more observations and analyses.
Experiment 2 evaluates the performance of abstract-based element classification,
which is a core operation in mapping discovery through M-Ontology. This operation
is to classify the elements in a specific query form (from the data set AIR) to suitable
concept nodes based on their similarity with the representative objects of these concept
nodes. The corresponding performance is shown by the blue solid lines in Fig. 8-4. As
illustrated, similar trends can be observed for both element classification and
mapping discovery. Discovery of correct mappings requires exact classification of the
elements in both query forms. Thus, ideally, element classification should perform
noticeably better than mapping discovery with respect to Hit-Rate and Precision, and
worse with respect to Recall. However, the difference in Recall values is relatively minor
compared with the corresponding Precision values, especially when m is smaller than 10
or larger than 35. Based on our analysis, the major reason is that some concept nodes
appear in only a few (though more than one) schemas. The schema elements that are classified
to these concept nodes often do not correspond to any element in the query form
that is to be matched. The total number of target mappings is lower than the average
number of schema elements in our data sets. Thus, the Recall difference is not large. The
existence of such concept nodes can be identified in Fig. 8-4(a). When m is above
10, almost all the mappings can be found, but more than 10% of the schema elements
still cannot be classified to the concept nodes.

[Figure: four panels — (a) Hit-Rate, (b) Precision, (c) Recall, (d) Fmeasure — plotted against the number of schemas included in M-Ontology (0 to 60).]

Figure 8-5. Experiment results of concept searching for schema elements.

In our experiments, we do not consider
the mismatches caused by T-Edge incompleteness and errors. The performance of
mapping discovery might be worse than the numbers shown in Experiment 1, but we
expect that the performance can be improved through manual enrichment of M-Ontology.
For example, humans can manually correct the auto-generated information in the
representative objects (e.g., representative labels).
Experiment 3 examines the effects of query-form evolution on the discovery through
M-Ontology without history-based element classification. The data set used in
Experiments 1 and 2 is AIR, which consists of the original query forms and the
evolved ones. To identify the possible influence of inclusion of the evolved forms,
we conduct another set of experiments for element classification using the data set
AIRO (without the evolved forms). As illustrated in Fig. 8-5, the blue solid lines and
the green dashed lines represent the performance values of abstract-based element
classification respectively using AIR and AIRO. All these four charts indicate that these
curve lines almost coincide. Although similarities in terms of design, naming and
functionality can be observed between the original query forms and the evolved ones,
our abstract-based classification ignores these similarities and does not gain any benefit
from them.

[Figure: two panels — (a) Precision and (b) Recall — plotted against the number of schemas included in the mapping repositories (0 to 60).]

Figure 8-6. Experiment results of mapping discovery through M-Ontology and MO-Repository.
Experiment 4 evaluates the performance improvement from using MO-Repository.
Due to the limits of the real-world data set (AIR), there do not exist enough evolved
mappings to obtain a complete evaluation. We can still conduct a set of experiments
to show the benefits gained from the function matchElement2Element, which returns
a set of element correspondences from MO-Repository. These correspondences are
used in the history-based element classification. As illustrated in Fig. 8-6, the red dotted
lines and black solid lines represent the performance values of mapping discovery
respectively using only M-Ontology and both repositories. They share the same
experiment samples. We see a significant increase in Recall and almost no change
in Precision when fewer than 25 query forms are integrated into the mapping repositories.
That means more correct mappings are found without loss of precision. If the number
of integrated query forms is larger than 25, the Recall increase appears modest and
even negligible. That is, most mappings can be discovered using M-Ontology (without
MO-Repository), when sufficient mappings have been inserted.
From the above four sets of experiments, we observe several important properties
of our reuse-oriented mapping discoverer. The data set AIR from real-world query forms
shows almost all the mappings can be found in the mapping repositories, when there
exist sufficient T-Edges to connect the G/A-Nodes. Our proposed hybrid solution is
capable of effectively discovering most of the mappings, as demonstrated by the promising
results for Precision, Recall and Fmeasure. We expect that better performance values can be
obtained through manual correction of the mapping repositories.
8.4 Experiments for Source Selection
To show the effectiveness of our approach in real-world scenarios, we design and
conduct two sets of experiments in the domain of air-ticket booking. Since the quality of
ranking is subjective, it is hard to measure its correctness. Given that our primary goal is
to find suitable data sources from a source repository SR to satisfy user demands (their
preferred data sources DSI ) on capabilities, the focus of our experiments is to evaluate
whether our approach can correctly identify capability matches between user inputs
DSI and the data sources in SR. Specifically, the experiments are designed to evaluate
the effectiveness of capability matching, which is the most critical factor affecting the
recommendation performance.
Experiment setup: From the UIUC web integration repository [2], we collect 38 query
forms for air-ticket booking after eliminating the inactive webpages. First, we manually
extract all the query forms from the webpages. They are expressed in Web Ontology
Language (OWL) and follow a query-form ontology that was designed based on
the HTML specification [1]. Second, we manually classify the schema elements of
these forms to generate 54 maximal connected G/A-Node sub-graphs, based on their
capabilities. Each schema element is associated with a G-Node and a G/A-Subgraph.
The manual classification is utilized in the initial construction of M-Ontology and the final
evaluation of our algorithm.
Experiment scenarios: Assume that users input three preferred data sources DSI , n of
which are not in the repository SR. Our experiments will examine how well the algorithm
can correctly find the data sources with the desired capabilities (that are possessed by
the sources in DSI ) from SR.
The first experiment is for our proposed solution (referred to as MOM). Except for the
n excluded data sources in DSI, all the remaining sources are used to construct
a domain-based M-Ontology. We assume users can correctly choose an appropriate
domain DMI for their queries. That is, an appropriate M-Ontology MO is chosen. With
assistance of MO, the query forms in DSI are normalized and analyzed to identify the
user demands.
The second experiment evaluates the performance of a reference solution that
is a classical nearest neighbor method (referred to as NNM). To find the capability
correspondences, it compares the query forms in DSI with the form of each source
in SR. For the performance comparison, we use the same algorithms of query-form
normalization and similarity calculation in both NNM and MOM.
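The NNM baseline can be sketched as a direct pairwise comparison; here a hypothetical set-based similarity stands in for the shared normalization and similarity code.

```python
def jaccard_similarity(a, b):
    # Similarity of two normalized element-label sets.
    return len(a & b) / len(a | b) if a or b else 1.0

def nearest_neighbor_match(user_form, source_forms):
    # NNM: compare the user-preferred query form with the form of each
    # source in SR individually and return the most similar source.
    best = max(source_forms,
               key=lambda s: jaccard_similarity(user_form, source_forms[s]))
    return best, jaccard_similarity(user_form, source_forms[best])
```

MOM differs in that it compares each form against the domain's M-Ontology rather than against every individual source form, which is why it is less sensitive to the name ambiguity of any single form.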
To evaluate the performance, we use the following three measures: precision
measures the proportion of the identified capabilities that are actually desired by users;
recall measures the proportion of the desired and identified capabilities out of all the
desired and identifiable capabilities; Fmeasure is the weighted harmonic mean of
precision and recall.
Experiment results: The performance values per n in Tables 8-2 and 8-3 are calculated
as the average of 100 samples. All the samples are randomly generated. The first and
second set of experiments share the same samples.
Unintegrated sources (n)   precision   recall   Fmeasure
3 of 3                     90.1%       86.6%    88.3%
2 of 3                     92.8%       90.6%    91.7%
1 of 3                     96.5%       95.8%    96.1%
Table 8-2. Capability-based matching by MOM
The first set of experiment results, in Table 8-2, shows promising evidence of the
effectiveness of MOM. Even if none of the data sources in the user inputs (DSI) has been
integrated by any existing meta-querier, the Fmeasure still reaches 88%. If only
one data source is unintegrated, the capabilities of almost all the data sources can be
correctly and completely identified. This indicates a high probability that our solution can
make an accurate recommendation.
Unintegrated sources (n)   precision   recall   Fmeasure
3 of 3                     76.0%       35.4%    48.3%
2 of 3                     83.9%       56.8%    67.7%
1 of 3                     92.0%       78.5%    84.7%
Table 8-3. Capability-based matching by NNM
The second set of experiment results are shown in Table 8-3. Clearly, our method
MOM outperforms the reference method NNM by a large margin, especially when
the user-preferred sources have not been integrated. The major reason is the name
ambiguity in HTML code, which makes it difficult to find the capability similarity between two
individual data sources. Unlike NNM, our method MOM performs better
by utilizing regular patterns that are learned from the integrated data sources.
From these two experiment sets, we observe that our approach MOM performs
better than NNM in all the measures, in all cases. MOM is able
to correctly identify capability matches between user inputs and the data sources in the
repository. Through the capability matches, we believe our approach can effectively
provide appropriate recommendation for users in terms of the capabilities.
CHAPTER 9
CONCLUSION AND FUTURE DIRECTIONS
Due to the explosive growth of data on the Internet, one of the major challenges
for future computing is how to build effective infrastructures to facilitate user-friendly
information access and knowledge discovery from the ever-increasing number of
searchable databases over the Web. Answering this challenge is the vision of this
research. We tackle the challenge through the design of meta-queriers that address the
issues of interoperability and scalability in accessing the hidden web.
Meta-queriers are a general information-access tool for any application with
multiple heterogeneous data sources. Although in this dissertation we
use e-commerce examples such as e-travel and book sales, the applications can
be extended to any domain with similar characteristics, for example, e-library,
e-job, e-newspaper and e-science (such as bioinformatics and physics with
data-intensive sharing). The shared M-Ontology is easily exportable. We believe that early
bootstrapping of the shared knowledge base is a key to the snowballing success of a
community-based system. Another significant impact of the project lies in the fact that
the mapping repositories developed can be used for dealing with general interoperability
problems between heterogeneous data sources as well.
The following interesting directions are suggested for future research topics:
• Schema generation is to construct a global schema by integrating a set of local
schemas. The output schema should satisfy three important principles [12]: 1)
Completeness: the output schema encompasses all the local schema elements.
2) Minimality: each group of overlapping elements has minimal representative
elements in the output schema. 3) Understandability: the output schemas are
easy to understand by users. The potential approach to schema generation is
different from the traditional approaches [12, 14, 18, 92, 114] in a number of
aspects. In the context of community-based customization, there exist a large
number of similar global schemas that can be shared and reused. The knowledge
of schema generation is cooperatively contributed by the community members
and the workload of schema generation is distributed among community members
as well, and both are achieved in a user-friendly manner. Moreover, the issue of
understandability can also be addressed through the composition and layout of the
output schema.
• Source selection is to intelligently select meta-queriers and data sources in
the domain to meet specific user needs. Our solution is ontology-centric. To
reduce the user interaction, ontology/domain selection should be automated by
calculating the similarity between user inputs and ontology contents. Additionally,
the resource preferences of a specific user can also be learned from the behaviors
and preference vectors of self and peers through classical recommendation
algorithms [4, 139]: 1) content filtering algorithms [10, 104–106] that recommend
the user resources similar to those that the user liked in the past; 2) collaborative
filtering algorithms [61, 81, 124, 137] that recommend the user resources that the
other users with similar capability vectors preferred.
• A communication platform is absolutely necessary in collaborative data
integration. Conventional communication methods (e.g., emails and discussion
forums) and emerging wiki systems are widely used as information-sharing tools.
However, these “informal” methods often lead to inefficiency and inconsistency
due to the ambiguity of unstructured natural-language representation, especially in
a long-running platform where people may join and leave over time. Often a small
ambiguity by one could cause a large adverse effect on others. It is highly
desirable to augment these information-sharing methods with a facility that
enhances the clarity of communication in a collaboration platform. In addition to
supporting meta-querier construction, our M-Ontology can be extended with a strong
emphasis on decreasing the degree of informality in the collaboration platform. In
M-Ontology, the shared components (i.e., mappings and schema elements) are
organized based on their semantics and changing history to form a semantic
space (i.e., domain-specific ontologies) and a version space (i.e., versioning trees),
respectively. When building/maintaining a specific meta-querier, the member
can easily track the previous work and find the similar/repetitive tasks with better
understanding of the construction process.
REFERENCES
[1] “HTML 4.01 Specification: Forms.” http://www.w3.org/TR/html4/interact/forms.html, 1999.
[2] “The UIUC Web Integration Repository.” http://metaquerier.cs.uiuc.edu/repository, 2003.
[3] Aboulnaga, Ashraf and Gebaly, Kareem El. “µBE: User Guided Source Selection and Schema Mediation for Internet Scale Data Integration.” ICDE. 2007, 186–195.
[4] Adomavicius, Gediminas and Tuzhilin, Alexander. “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions.” IEEE Trans. Knowl. Data Eng. 17 (2005).6: 734–749.
[5] Anand, Sarabjot S., Kearney, Patricia, and Shapcott, Mary. “Generating semantically enriched user profiles for Web personalization.” ACM Trans. Internet Techn. 7 (2007).4.
[6] Arasu, Arvind and Garcia-Molina, Hector. “Extracting Structured Data from Web Pages.” SIGMOD Conference. 2003, 337–348.
[7] Arens, Yigal, Chee, Chin Y., Hsu, Chun-Nan, and Knoblock, Craig A. “Retrieving and Integrating Data from Multiple Information Sources.” Int. J. Cooperative Inf. Syst. 2 (1993).2: 127–158.
[8] Aslam, Javed A. and Montague, Mark H. “Models for Metasearch.” SIGIR. 2001, 275–284.
[9] Aumueller, David, Do, Hong-Hai, Massmann, Sabine, and Rahm, Erhard. “Schema and ontology matching with COMA++.” SIGMOD Conference. 2005, 906–908.
[10] Balabanovic, Marko and Shoham, Yoav. “Content-Based, Collaborative Recommendation.” Commun. ACM 40 (1997).3: 66–72.
[11] Basili, Victor R. “Viewing Maintenance as Reuse-Oriented Software Development.” IEEE Software 7 (1990).1: 19–25.
[12] Batini, Carlo, Lenzerini, Maurizio, and Navathe, Shamkant B. “A Comparative Analysis of Methodologies for Database Schema Integration.” ACM Comput. Surv. 18 (1986).4: 323–364.
[13] Bergamaschi, Sonia, Castano, Silvana, and Vincini, Maurizio. “Semantic Integration of Semistructured and Structured Data Sources.” SIGMOD Record 28 (1999).1: 54–59.
[14] Bergamaschi, Sonia, Castano, Silvana, Vincini, Maurizio, and Beneventano, Domenico. “Semantic integration of heterogeneous information sources.” Data Knowl. Eng. 36 (2001).3: 215–249.
[15] Bergman, Michael K. “The Deep Web: Surfacing Hidden Value.” The Journal of Electronic Publishing 7 (2001).1.
[16] Bernstein, Philip A. “Applying Model Management to Classical Meta Data Problems.” CIDR. 2003.
[17] Bernstein, Philip A., Green, Todd J., Melnik, Sergey, and Nash, Alan. “Implementing mapping composition.” VLDB J. 17 (2008).2: 333–353.
[18] Buneman, Peter, Davidson, Susan B., and Kosky, Anthony. “Theoretical Aspects of Schema Merging.” EDBT. 1992, 152–167.
[19] Buyukkokten, Orkut, Kaljuvee, Oliver, Garcia-Molina, Hector, Paepcke, Andreas, and Winograd, Terry. “Efficient web browsing on handheld devices using page and form summarization.” ACM Trans. Inf. Syst. 20 (2002).1: 82–115.
[20] Cafarella, Michael J., Halevy, Alon Y., Zhang, Yang, Wang, Daisy Zhe, and Wu, Eugene. “Uncovering the Relational Web.” WebDB. 2008.
[21] Callan, James P. “Document Filtering With Inference Networks.” SIGIR. 1996, 262–269.
[22] Callan, James P. and Connell, Margaret E. “Query-based sampling of text databases.” ACM Trans. Inf. Syst. 19 (2001).2: 97–130.
[23] Callan, James P., Lu, Zhihong, and Croft, W. Bruce. “Searching Distributed Collections with Inference Networks.” SIGIR. 1995, 21–28.
[24] Castano, Silvana, Antonellis, Valeria De, and di Vimercati, Sabrina De Capitani. “Global Viewing of Heterogeneous Data Sources.” IEEE Trans. Knowl. Data Eng. 13 (2001).2: 277–297.
[25] Chang, Chia-Hui, Kayed, Mohammed, Girgis, Moheb R., and Shaalan, Khaled F. “A Survey of Web Information Extraction Systems.” IEEE Trans. Knowl. Data Eng. 18 (2006).10: 1411–1428.
[26] Chang, Kevin Chen-Chuan and Garcia-Molina, Hector. “Conjunctive Constraint Mapping for Data Translation.” ACM DL. 1998, 49–58.
[27] Chang, Kevin Chen-Chuan, Garcia-Molina, Hector, and Paepcke, Andreas. “Boolean Query Mapping Across Heterogeneous Information Sources.” IEEE Trans. Knowl. Data Eng. 8 (1996).4: 515–521.
[28] Chang, Kevin Chen-Chuan, He, Bin, Li, Chengkai, Patel, Mitesh, and Zhang, Zhen. “Structured Databases on the Web: Observations and Implications.” SIGMOD Record 33 (2004).3: 61–70.
[29] Chang, Kevin Chen-Chuan, He, Bin, and Zhang, Zhen. “Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web.” CIDR. 2005, 44–55.
[30] Chapuis, Olivier, Labrune, Jean-Baptiste, and Pietriga, Emmanuel. “DynaSpot: speed-dependent area cursor.” CHI. 2009.
[31] Chuang, Shui-Lung and Chang, Kevin Chen-Chuan. “Integrating web query results: holistic schema matching.” CIKM. 2008, 33–42.
[32] Chuang, Shui-Lung, Chang, Kevin Chen-Chuan, and Zhai, ChengXiang. “Context-Aware Wrapping: Synchronized Data Extraction.” VLDB. 2007, 699–710.
[33] Conradi, Reidar and Westfechtel, Bernhard. “Version Models for Software Configuration Management.” ACM Comput. Surv. 30 (1998).2: 232–282.
[34] Davis, Stanley M. Future Perfect. Reading, MA: Addison Wesley, 1987, 1st ed.
[35] Dhamankar, Robin, Lee, Yoonkyong, Doan, AnHai, Halevy, Alon, and Domingos, Pedro. “iMAP: discovering complex semantic matches between database schemas.” SIGMOD Conference. 2004, 383–394.
[36] Ding, Ying and Foo, Schubert. “Ontology research and development. Part 1 - a review of ontology generation.” Journal of Information Science 28 (2002).2: 123–136.
[37] Do, Hong Hai. Schema Matching and Mapping-based Data Integration. Ph.D. thesis, University of Leipzig, http://lips.informatik.uni-leipzig.de/pub/2006-4, 2006.
[38] Doan, Anhai. Learning to map between structured representations of data. Ph.D. thesis, University of Washington, 2002.
[39] Doan, AnHai, Domingos, Pedro, and Halevy, Alon Y. “Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach.” SIGMOD. 2001, 509–520.
[40] Doan, AnHai and Halevy, Alon Y. “Semantic Integration Research in the Database Community: A Brief Survey.” AI Magazine 26 (2005).1: 83–94.
[41] Dragut, Eduard C., Wu, Wensheng, Sistla, A. Prasad, Yu, Clement T., and Meng, Weiyi. “Merging Source Query Interfaces on Web Databases.” ICDE. 2006, 46.
[42] Dreilinger, Daniel and Howe, Adele E. “Experiences with Selecting Search Engines Using Metasearch.” ACM Trans. Inf. Syst. 15 (1997).3: 195–222.
[43] Drumm, Christian, Schmitt, Matthias, Do, Hong Hai, and Rahm, Erhard. “Quickmig: automatic schema matching for data migration projects.” CIKM. 2007, 107–116.
[44] Dwork, Cynthia, Kumar, Ravi, Naor, Moni, and Sivakumar, D. “Rank aggregation methods for the Web.” WWW. 2001, 613–622.
[45] Embley, David W., Campbell, Douglas M., Jiang, Y. S., Liddle, Stephen W., Ng, Yiu-Kai, Quass, Dallan, and Smith, Randy D. “Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages.” Data Knowl. Eng. 31 (1999).3: 227–251.
[46] Euzenat, Jerome and Shvaiko, Pavel. Ontology Matching. Springer-Verlag New York, Inc., 2007.
[47] Euzenat, Jerome, Valtchev, Petko, Duc, Chan Le, David, Jerome, Pierson, Jerome, Lee, Seunkeun, Troncy, Raphael, Sharma, Arun, Stoilos, Giorgos, Stamou, Georges, Bechhofer, Sean, Voltz, Raphael, Elonen, Jarno, and Nedas, Konstantinos A. “Alignment api and alignment server.” http://alignapi.gforge.inria.fr/, 2009.
[48] Fagin, Ronald, Kolaitis, Phokion G., Popa, Lucian, and Tan, Wang Chiew. “Composing schema mappings: Second-order dependencies to the rescue.” ACM Trans. Database Syst. 30 (2005).4: 994–1055.
[49] Ferragina, Paolo and Gulli, Antonio. “A personalized search engine based on Web-snippet hierarchical clustering.” WWW. 2005, 801–810.
[50] Gravano, Luis, Garcia-Molina, Hector, and Tomasic, Anthony. “GlOSS: Text-Source Discovery over the Internet.” ACM Trans. Database Syst. 24 (1999).2: 229–264.
[51] Gravano, Luis, Ipeirotis, Panagiotis G., and Sahami, Mehran. “QProber: A system for automatic classification of hidden-Web databases.” ACM Trans. Inf. Syst. 21 (2003).1: 1–41.
[52] Grossman, David A. and Frieder, Ophir. Information Retrieval: Algorithms and Heuristics. The Kluwer International Series of Information Retrieval. Springer, 2004, second ed.
[53] Haas, Laura M., Hernandez, Mauricio A., Ho, Howard, Popa, Lucian, and Roth, Mary. “Clio grows up: from research prototype to industrial tool.” SIGMOD Conference. 2005, 805–810.
[54] Halevy, Alon Y., Rajaraman, Anand, and Ordille, Joann J. “Data Integration: The Teenage Years.” VLDB. 2006, 9–16.
[55] He, Bin and Chang, Kevin Chen-Chuan. “Statistical Schema Matching across Web Query Interfaces.” SIGMOD Conference. 2003, 217–228.
[56] ———. “Automatic complex schema matching across Web query interfaces: A correlation mining approach.” ACM Trans. Database Syst. 31 (2006).1: 346–395.
[57] He, Bin, Tao, Tao, and Chang, Kevin Chen-Chuan. “Organizing structured web sources by query schemas: a clustering approach.” CIKM. 2004, 22–31.
[58] He, Hai, Meng, Weiyi, Lu, Yiyao, Yu, Clement T., and Wu, Zonghuan. “Towards Deeper Understanding of the Search Interfaces of the Deep Web.” World Wide Web (2007): 133–155.
[59] He, Hai, Meng, Weiyi, Yu, Clement T., and Wu, Zonghuan. “Automatic integration of Web search interfaces with WISE-Integrator.” VLDB J. 13 (2004).3: 256–273.
[60] ———. “Constructing Interface Schemas for Search Interfaces of Web Databases.” WISE. 2005, 29–42.
[61] Herlocker, Jonathan L., Konstan, Joseph A., Terveen, Loren G., and Riedl, John. “Evaluating collaborative filtering recommender systems.” ACM Trans. Inf. Syst. 22 (2004).1: 5–53.
[62] Hong, Jun, He, Zhongtian, and Bell, David A. “Extracting Web Query Interfaces Based on Form Structures and Semantic Similarity.” ICDE. 2009, 1259–1262.
[63] Howe, Adele E. and Dreilinger, Daniel. “SAVVYSEARCH: A Metasearch Engine That Learns Which Search Engines to Query.” AI Magazine 18 (1997).2: 19–25.
[64] Ipeirotis, Panagiotis G. and Gravano, Luis. “Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection.” VLDB. 2002, 394–405.
[65] Kashyap, Vipul and Sheth, Amit P. “Semantic and Schematic Similarities Between Database Objects: A Context-Based Approach.” VLDB J. 5 (1996).4: 276–304.
[66] Katz, Randy H. “Towards a Unified Framework for Version Modeling in Engineering Databases.” ACM Comput. Surv. 22 (1990).4: 375–408.
[67] Keeney, R.L. and Raiffa, H. Decisions with multiple objectives: Preferences and value tradeoffs. J. Wiley, New York, 1976.
[68] Kerschberg, Larry, Kim, Wooju, and Scime, Anthony. “Intelligent Web Search via Personalizable Meta-search Agents.” CoopIS/DOA/ODBASE. 2002, 1345–1358.
[69] Krovetz, Robert. “Viewing Morphology as an Inference Process.” SIGIR. 1993, 191–202.
[70] Kushmerick, Nicholas. “Learning to Invoke Web Forms.” CoopIS/DOA/ODBASE. 2003, 997–1013.
[71] Laender, Alberto H. F., Ribeiro-Neto, Berthier A., da Silva, Altigran Soares, and Teixeira, Juliana S. “A Brief Survey of Web Data Extraction Tools.” SIGMOD Record 31 (2002).2: 84–93.
[72] Larson, James A., Navathe, Shamkant B., and Elmasri, Ramez. “A Theory of Attribute Equivalence in Databases with Application to Schema Integration.” IEEE Trans. Software Eng. 15 (1989).4: 449–463.
[73] Lenzerini, Maurizio. “Data Integration: A Theoretical Perspective.” PODS. 2002, 233–246.
[74] Levy, Alon Y., Rajaraman, Anand, and Ordille, Joann J. “Querying Heterogeneous Information Sources Using Source Descriptions.” VLDB (1996): 251–262.
[75] Li, Chen, Yerneni, Ramana, Vassalos, Vasilis, Garcia-Molina, Hector, Papakonstantinou, Yannis, Ullman, Jeffrey D., and Valiveti, Murty. “Capability Based Mediation in TSIMMIS.” SIGMOD Conference. 1998, 564–566.
[76] Li, Xiao and Chow, Randy. “An Ontology-based Mapping Repository for Dynamic and Customized Data Integration.” Tech. Rep. REP-2009-483, Dept. of CISE at University of Florida, 2009. http://www.cise.ufl.edu/~chow/techreport483.pdf.
[77] ———. “An Ontology-based Mapping Repository for Meta-querier Customization.” SEKE. 2010, 325–330.
[78] ———. “Ontology-centric Source Selection for Meta-querier Customization.” Under Review. 2011.
[79] ———. “Reuse-oriented Mapping Discovery for Meta-querier Customization.” Under Review. 2011.
[80] Li, Xiao, Chow, Randy, and Chen, Lu. “Dynamic Personalization of Meta-Queriers.” IRI. 2009, 361–365.
[81] Li, Xiao, Wang, Ye-Yi, and Acero, Alex. “Learning query intent from regularized click graphs.” SIGIR. 2008, 339–346.
[82] Lin, Jinxin and Mendelzon, Alberto O. “Merging Databases Under Constraints.” Int. J. Cooperative Inf. Syst. 7 (1998).1: 55–76.
[83] Liu, Fang, Yu, Clement T., and Meng, Weiyi. “Personalized Web Search For Improving Retrieval Effectiveness.” IEEE Trans. Knowl. Data Eng. 16 (2004).1: 28–40.
[84] Lu, Yiyao, He, Hai, Peng, Qian, Meng, Weiyi, and Yu, Clement T. “Clustering e-commerce search engines based on their search interface pages using WISE-Cluster.” Data Knowl. Eng. 59 (2006).2: 231–246.
[85] Lu, Yiyao, Wu, Zonghuan, Zhao, Hongkun, Meng, Weiyi, Liu, King-Lup, Raghavan, Vijay, and Yu, Clement T. “MySearchView: a customized metasearch engine generator.” SIGMOD Conference. 2007, 1113–1115.
[86] Madhavan, Jayant, Bernstein, Philip A., Doan, AnHai, and Halevy, Alon Y. “Corpus-based Schema Matching.” ICDE. 2005, 57–68.
[87] Madhavan, Jayant, Cohen, Shirley, Dong, Xin Luna, Halevy, Alon Y., Jeffery, Shawn R., Ko, David, and Yu, Cong. “Web-scale Data Integration: You can only afford to Pay As You Go.” CIDR. 2007, 342–350.
[88] Madhavan, Jayant and Halevy, Alon Y. “Composing Mappings Among Data Sources.” VLDB. 2003, 572–583.
[89] Madhavan, Jayant, Ko, David, Kot, Lucja, Ganapathy, Vignesh, Rasmussen, Alex, and Halevy, Alon Y. “Google’s Deep Web crawl.” PVLDB 1 (2008).2: 1241–1252.
[90] McCann, Robert, Shen, Warren, and Doan, AnHai. “Matching Schemas in Online Communities: A Web 2.0 Approach.” ICDE. 2008, 110–119.
[91] Melnik, Sergey. Generic Model Management: Concepts and Algorithms. Ph.D. thesis, University of Leipzig, 2004.
[92] Melnik, Sergey, Bernstein, Philip A., Halevy, Alon Y., and Rahm, Erhard. “Supporting Executable Mappings in Model Management.” SIGMOD Conference. 2005, 167–178.
[93] Melnik, Sergey, Rahm, Erhard, and Bernstein, Philip A. “Rondo: A Programming Platform for Generic Model Management.” SIGMOD Conference. 2003, 193–204.
[94] Mena, Eduardo, Illarramendi, Arantza, Kashyap, Vipul, and Sheth, Amit P. “OBSERVER: An Approach for Query Processing in Global Information Systems Based on Interoperation Across Pre-Existing Ontologies.” Distributed and Parallel Databases 8 (2000).2: 223–271.
[95] Meng, Weiyi, Yu, Clement T., and Liu, King-Lup. “Building efficient and effective metasearch engines.” ACM Comput. Surv. 34 (2002).1: 48–89.
[96] Mitra, Prasenjit, Wiederhold, Gio, and Kersten, Martin L. “A Graph-Oriented Model for Articulation of Ontology Interdependencies.” EDBT. 2000, 86–100.
[97] Modica, Giovanni A., Gal, Avigdor, and Jamil, Hasan M. “The Use of Machine-Generated Ontologies in Dynamic Information Seeking.” CoopIS. 2001, 433–448.
[98] Moscovich, Tomer, Chevalier, Fanny, Henry, Nathalie, Pietriga, Emmanuel, and Fekete, Jean-Daniel. “Topology-aware navigation in large networks.” CHI. 2009.
[99] Ng, Wilfred, Deng, Lin, and Lee, Dik Lun. “Mining User preference using Spy voting for search engine personalization.” ACM Trans. Internet Techn. 7 (2007).4.
[100] Noy, Natalya Fridman. “Semantic Integration: A Survey Of Ontology-Based Approaches.” SIGMOD Record 33 (2004).4: 65–70.
[101] Noy, Natalya Fridman, Griffith, Nicholas, and Musen, Mark A. “Collecting Community-Based Mappings in an Ontology Repository.” ISWC. 2008, 371–386.
[102] Noy, Natalya Fridman and Klein, Michel C. A. “Ontology Evolution: Not the Same as Schema Evolution.” Knowl. Inf. Syst. 6 (2004).4: 428–440.
[103] Parent, Christine and Spaccapietra, Stefano. “Issues and Approaches of Database Integration.” Commun. ACM 41 (1998).5: 166–178.
[104] Pazzani, Michael J. “A Framework for Collaborative, Content-Based and Demographic Filtering.” Artif. Intell. Rev. 13 (1999).5-6: 393–408.
[105] Pazzani, Michael J. and Billsus, Daniel. “Learning and Revising User Profiles: The Identification of Interesting Web Sites.” Machine Learning 27 (1997).3: 313–331.
[106] ———. “Content-Based Recommendation Systems.” The Adaptive Web. 2007, 325–341.
[107] Pietriga, Emmanuel and Appert, Caroline. “Sigma lenses: focus-context transitions combining space, time and translucence.” CHI. 2008.
[108] Pietriga, Emmanuel, Appert, Caroline, and Beaudouin-Lafon, Michel. “Pointing and beyond: an operationalization and preliminary evaluation of multi-scale searching.” CHI. 2007.
[109] Pine, B. Joseph and Davis, Stan. Mass customization: the new frontier in business competition. Boston, Mass.: Harvard Business School Press, 1999.
[110] Pinto, Helena Sofia Andrade N. P. and Martins, Joao Pavao. “Ontologies: How can They be Built?” Knowl. Inf. Syst. 6 (2004).4: 441–464.
[111] Pluempitiwiriyawej, Charnyote and Hammer, Joachim. “Element matching across data-oriented XML sources using a multi-strategy clustering model.” Data Knowl. Eng. 48 (2004).3: 297–333.
[112] Porter, M. F. “An algorithm for suffix stripping.” Readings in information retrieval. Morgan Kaufmann Publishers Inc., 1997, 313–316.
[113] Pottinger, Rachel and Bernstein, Philip A. “Merging Models Based on Given Correspondences.” VLDB. 2003, 826–873.
[114] ———. “Schema merging and mapping creation for relational sources.” EDBT. 2008, 73–84.
[115] Raghavan, Sriram and Garcia-Molina, Hector. “Crawling the Hidden Web.” VLDB. 2001, 129–138.
[116] Rahm, Erhard and Bernstein, Philip A. “A survey of approaches to automatic schema matching.” VLDB J. 10 (2001).4: 334–350.
[117] Ram, Sudha and Park, Jinsoo. “Semantic Conflict Resolution Ontology (SCROL): An Ontology for Detecting and Resolving Data and Schema-Level Semantic Conflicts.” IEEE Trans. Knowl. Data Eng. 16 (2004).2: 189–202.
[118] Rasolofo, Yves, Hawking, David, and Savoy, Jacques. “Result merging strategies for a current news metasearcher.” Inf. Process. Manage. 39 (2003).4: 581–609.
[119] Roddick, John F. “A survey of schema versioning issues for database systems.” Information and Software Technology 37 (1995).7: 383–393.
[120] Russell, Stuart J. and Norvig, Peter. Artificial Intelligence: A Modern Approach, chap. 16.4. Prentice Hall, 2009, 3 ed., 622–626.
[121] Sabou, Marta, Wroe, Chris, Goble, Carole A., and Stuckenschmidt, Heiner. “Learning domain ontologies for semantic Web service descriptions.” J. Web Sem. 3 (2005).4: 340–365.
[122] Salton, Gerard and Buckley, Chris. “Term-Weighting Approaches in Automatic Text Retrieval.” Inf. Process. Manage. 24 (1988).5: 513–523.
[123] Sarma, Anish Das, Dong, Xin, and Halevy, Alon Y. “Bootstrapping pay-as-you-go data integration systems.” SIGMOD Conference. 2008, 861–874.
[124] Sarwar, Badrul M., Karypis, George, Konstan, Joseph A., and Riedl, John. “Item-based collaborative filtering recommendation algorithms.” WWW. 2001, 285–295.
[125] Seligman, Leonard J., Rosenthal, Arnon, Lehner, Paul E., and Smith, Angela. “Data Integration: Where Does the Time Go?” IEEE Data Eng. Bull. 25 (2002).3: 3–10.
[126] Shen, Warren, DeRose, Pedro, McCann, Robert, Doan, AnHai, and Ramakrishnan, Raghu. “Toward best-effort information extraction.” SIGMOD Conference. 2008, 1031–1042.
[127] Shestakov, Denis, Bhowmick, Sourav S., and Lim, Ee-Peng. “DEQUE: querying the deep Web.” Data Knowl. Eng. 52 (2005).3: 273–311.
[128] Sheth, Amit P. and Larson, James A. “Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases.” ACM Comput. Surv. 22 (1990).3: 183–236.
[129] Shokouhi, Milad. “Central-Rank-Based Collection Selection in Uncooperative Distributed Information Retrieval.” ECIR. 2007, 160–172.
[130] Shokouhi, Milad, Zobel, Justin, Scholer, Falk, and Tahaghoghi, Seyed M. M. “Capturing collection size for distributed non-cooperative retrieval.” SIGIR. 2006, 316–323.
[131] Si, Luo and Callan, James P. “Relevant document distribution estimation method for resource selection.” SIGIR. 2003, 298–305.
[132] Si, Luo and Callan, Jamie. “A semisupervised learning method to merge search engine results.” ACM Trans. Inf. Syst. 21 (2003).4: 457–491.
[133] Spaccapietra, Stefano and Parent, Christine. “View Integration: A Step Forward in Solving Structural Conflicts.” IEEE Trans. Knowl. Data Eng. 6 (1994).2: 258–274.
[134] Su, Weifeng, Wang, Jiying, and Lochovsky, Frederick H. “Automatic Hierarchical Classification of Structured Deep Web Databases.” WISE. 2006, 210–221.
[135] ———. “Holistic Schema Matching for Web Query Interfaces.” EDBT. 2006, 77–94.
[136] ———. “ODE: Ontology-assisted data extraction.” ACM Trans. Database Syst. 34 (2009).2.
[137] Sugiyama, Kazunari, Hatano, Kenji, and Yoshikawa, Masatoshi. “Adaptive Web search based on user profile constructed without any effort from users.” WWW. 2004, 675–684.
[138] Thomas, Paul and Shokouhi, Milad. “SUSHI: scoring scaled samples for server selection.” SIGIR. 2009, 419–426.
[139] Uchyigit, Gulden and Ma, Matthew Y., eds. Personalization Techniques and Recommender Systems, vol. 70. World Scientific Publishing, 2008.
[140] Wang, Jiying, Wen, Ji-Rong, Lochovsky, Frederick H., and Ma, Wei-Ying. “Instance-based Schema Matching for Web Databases by Domain-specific Query Probing.” VLDB. 2004, 408–419.
[141] Wu, Ping, Wen, Ji-Rong, Liu, Huan, and Ma, Wei-Ying. “Query Selection Techniques for Efficient Crawling of Structured Web Sources.” ICDE. 2006, 47.
[142] Wu, Wensheng, Doan, AnHai, and Yu, Clement T. “Merging Interface Schemas on the Deep Web via Clustering Aggregation.” ICDM. 2005, 801–804.
[143] ———. “WebIQ: Learning from the Web to Match Deep-Web Query Interfaces.” ICDE. 2006, 44.
[144] Wu, Wensheng, Yu, Clement, Doan, AnHai, and Meng, Weiyi. “An interactive clustering-based approach to integrating source query interfaces on the deep Web.” SIGMOD Conference. 2004, 95–106.
[145] Wu, Zonghuan, Raghavan, Vijay, Du, Chun C., Komanduru, Sai, Meng, Weiyi, He, Hai, and Yu, Clement T. “SE-LEGO: creating metasearch engines on demand.” SIGIR. 2003, 464.
[146] Xin, Dong, Han, Jiawei, and Chang, Kevin Chen-Chuan. “Progressive and selective merge: computing top-k with ad-hoc ranking functions.” SIGMOD Conference. 2007, 103–114.
[147] Xu, Jinxi and Croft, W. Bruce. “Cluster-Based Language Models for Distributed Retrieval.” SIGIR. 1999, 254–261.
[148] Xu, Li and Embley, David W. “Discovering Direct and Indirect Matches for Schema Elements.” DASFAA. 2003, 39–46.
[149] Yerneni, Ramana, Li, Chen, Garcia-Molina, Hector, and Ullman, Jeffrey D. “Computing Capabilities of Mediators.” SIGMOD Conference. 1999, 443–454.
[150] Yuwono, Budi and Lee, Dik Lun. “Server Ranking for Distributed Text Retrieval Systems on the Internet.” DASFAA. 1997, 41–50.
[151] Zhang, Zhen, He, Bin, and Chang, Kevin Chen-Chuan. “Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax.” SIGMOD Conference. 2004, 107–118.
[152] ———. “Light-weight Domain-based Form Assistant: Querying Web Databases On the Fly.” VLDB. 2005, 97–108.
[153] Zhao, Hongkun, Meng, Weiyi, Wu, Zonghuan, Raghavan, Vijay, and Yu, Clement T. “Fully automatic wrapper generation for search engines.” WWW. 2005, 66–75.
[154] Zhao, Hongkun, Meng, Weiyi, and Yu, Clement T. “Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages.” VLDB. 2006, 989–1000.
[155] Zhdanova, Anna V. and Shvaiko, Pavel. “Community-Driven Ontology Matching.” ESWC. 2006, 34–49.
[156] Zhou, Huimin and Ram, Sudha. “Clustering Schema Elements for Semantic Integration of Heterogeneous Data Sources.” J. Database Manag. 15 (2004).4: 88–106.
[157] Ziegler, Patrick and Dittrich, Klaus R. “User-Specific Semantic Integration of Heterogeneous Data: The SIRUP Approach.” ICSNW. 2004, 44–64.
[158] Ziegler, Patrick, Dittrich, Klaus R., and Hunt, Ela. “A call for personal semantic data integration.” ICDE Workshops. 2008, 250–253.
BIOGRAPHICAL SKETCH
Xiao Li received a bachelor’s degree from Nanjing University of Science and
Technology in 2004 and continued his graduate studies there in pattern recognition
until 2005. He then transferred to the University of Florida in Gainesville, where he
graduated with honors in computer science.