MAPPING REUSE FOR META-QUERIER CUSTOMIZATION
By
XIAO LI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2011
© 2011 Xiao Li
In memory of my grandfather, Peixin Li
ACKNOWLEDGMENTS
First, I would like to thank my supervisor, Dr. Randy Chow, for providing the
guidance and support throughout the course of my research. None of this would have
been possible without his patience and support. Second, I would also like to thank the
other members of my advisory committee (Dr. Jih-Kwon Peir, Dr. Markus Schneider, Dr.
Tuba Yavuz and Dr. Raymond Issa), for teaching me what constitutes quality research.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
1.1 Meta-querier Customization
1.2 Mapping Reuse in Community-driven Meta-querier Customization
1.3 Contributions
1.4 Dissertation Outline
2 RELATED WORK
2.1 Data Integration
2.2 Meta-queriers
2.3 Requirement Specification in Data Integration Systems
3 META-QUERIER CUSTOMIZATION
3.1 Research Questions
3.2 MQ-Customizer
3.2.1 Customization Workflow
3.2.2 System Architecture
3.3 Ontology-centric Mass Customization
3.4 Reuse-oriented Meta-querier Construction and Maintenance
4 MAPPING MODELING
4.1 Modeling of Query Forms
4.2 Change-oriented Mapping Modeling
4.2.1 Motivating Scenario
4.2.2 Modeling
4.3 Ontology-based Mapping Modeling
4.3.1 Motivating Scenario
4.3.2 Modeling
4.4 Metadata in Mapping Modeling
4.5 Related Work
4.5.1 Mapping Modeling for Data Integration
4.5.2 Schema Element Clustering for Data Integration
5 MAPPING REPOSITORY
5.1 M-Ontology
5.2 MO-Repository and M-Table
6 REUSE-ORIENTED MAPPING DISCOVERY
6.1 Problem Definition
6.2 Discovery through M-Table
6.3 Discovery through M-Ontology
6.4 Validating & Correcting Mappings
6.5 Related Work
6.6 Conclusion
7 ONTOLOGY-CENTRIC SOURCE SELECTION
7.1 Related Work
7.2 Capability-based Recommendation
7.2.1 Modeling
7.2.2 Demand Capture and Matching
7.3 Conclusion
8 IMPLEMENTATION AND EXPERIMENTS
8.1 System Structure of Mapping Repositories
8.2 Experiments for Ontology Construction
8.3 Experiments for Mapping Discovery
8.4 Experiments for Source Selection
9 CONCLUSION AND FUTURE DIRECTIONS
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
8-1 Statistics of the domains
8-2 Capability-based matching by MOM
8-3 Capability-based matching by NNM
LIST OF FIGURES
3-1 The workflow of MQ-Customizer.
3-2 The system architecture of MQ-Customizer.
4-1 A query form and its graph model.
4-2 Mapping evolution scenarios.
4-3 The life cycle of a mapping.
4-4 The incremental formulation of a mapping object.
4-5 Three job search forms with the mappings.
4-6 A fragment of a mapping ontology (E-Nodes and G-Nodes).
4-7 A fragment of a mapping ontology (G-Nodes and A-Nodes).
4-8 The life cycle of a node/edge.
5-1 The flowchart of ontology construction.
5-2 (a) An example of a mapping object. (b) The evolution of a global query form. (c) The evolution of a local query form.
7-1 The ontology-centric source selection algorithm.
8-1 The repository structure.
8-2 Experiment results of schema element classification without schema repetition.
8-3 Experiment results of schema element classification with schema repetition.
8-4 Experiment results of mapping discovery through M-Ontology.
8-5 Experiment results of concept searching for schema elements.
8-6 Experiment results of mapping discovery through M-Ontology and MO-Repository.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
MAPPING REUSE FOR META-QUERIER CUSTOMIZATION
By
Xiao Li
May 2011
Chair: Dr. Randy Chow
Major: Computer Engineering
The primary goal of this dissertation is to investigate the methodologies for
developing a system framework that supports the customization of meta-queriers
over the Web. Meta-queriers facilitate effective information retrieval from multiple
and heterogeneous data sources that are accessible only through query interfaces.
They are virtual data integration systems that shield users from data heterogeneity
and source locations. Due to the ever-increasing number of available data sources
and sophistication of users, it is highly desirable (and in many cases necessary) to
allow for customization of meta-queriers based on users’ preferences.
Customization comes at a cost: the scalability of the underlying system and the
complexity of human-machine interaction. This dissertation investigates new open
research issues for the implementation of meta-queriers with respect to customization.
There are two aspects of the scalability issue: a potentially large number of
customized meta-queriers and a large repository that stores mapping information
between entities in different data sources for interoperability. The complexity of
human-machine interaction results from the need for collaboration in sharing the
mapping information. The innovation of the proposed approach is that it tackles both the scalability and complexity problems holistically through a system framework
based on a synergetic combination of two concepts: a community-driven collaborative
meta-querier construction and an ontology-based sharing of meta-querier components.
Meta-queriers are domain-specific. Domain users constitute a loose community with
some common interest and might need to collaborate. In our solution approach, we turn the need for human collaboration from a challenge into an opportunity: the scalability problem can be alleviated by distributing the meta-querier construction workload across the community and harvesting its human power (assistance).
Another significant challenge turned opportunity is the reuse potential from a possibly large number of existing meta-queriers and their components. The innovation in our
ontology-based mapping repository is its ability to fully exploit the impact of reuse for
human-friendly construction and maintenance of meta-queriers.
CHAPTER 1
INTRODUCTION
1.1 Meta-querier Customization
As the number of information sources accessible through HTML
query interfaces [20, 28, 87] continues to grow, data integration has become a very
challenging issue for effective and flexible information retrieval from multiple and
heterogeneous data sources in the WWW. Meta-queriers (a.k.a., vertical search
engines) are virtual data integration systems that shield users from data heterogeneity
and source locations. They provide the users with a uniform query interface (a.k.a.
global schema) for simultaneous access to a set of integrated data sources in the same
domain. Users do not need to input repetitive information to each source interface (a.k.a.
local schema). User queries over the global schema are reformulated into queries in terms of the respective local schemas, and then the query results from data sources
are presented to the users in an integrated form. As opposed to physically integrating
the data sources, virtual data integration offers more flexibility when the underlying
system involves a large number of data sources and a large variety of user needs, and
in particular, when the number and the variety are frequently changing.
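As a sketch of the reformulate-and-merge flow just described, the toy code below fans a global-schema query out to each source through a per-source mapping and merges the answers. All names (`reformulate`, `meta_query`, the field names) are illustrative assumptions, not artifacts of this dissertation.

```python
# Illustrative sketch of virtual integration (assumed names throughout):
# a mapping renames global fields into the field names one source expects.
from typing import Callable, Dict, List

Record = Dict[str, str]
Mapping = Dict[str, str]  # global field -> local field

def reformulate(global_query: Record, mapping: Mapping) -> Record:
    """Rewrite a global-schema query in terms of one local schema."""
    return {mapping[f]: v for f, v in global_query.items() if f in mapping}

def meta_query(global_query: Record,
               sources: List[Callable[[Record], List[Record]]],
               mappings: List[Mapping]) -> List[Record]:
    """Fan the query out to every source and merge the local results."""
    merged: List[Record] = []
    for source, mapping in zip(sources, mappings):
        merged.extend(source(reformulate(global_query, mapping)))
    return merged
```

A real meta-querier would also translate operators and deduplicate results; the sketch only shows the mapping-based reformulation step.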
Much effort has been made on the construction of a single domain-specific
meta-querier (e.g., WISE-Integrator [59] and Meta-Queriers [29]). In general, these
data integration systems are designed for specific user requirements to integrate a given
set of data sources. However, the differences in user interests and preferences require
different source selection and global schemas (interfaces). For example, kayak.com
and zoomtra.com are two meta-queriers for searching airfares. Although kayak.com
is more popular in general, zoomtra.com finds better airfares to India most of the time.
The difference between them is mainly due to different selections of data sources, and
consequently different interfaces. As the number of data sources continues to grow, it is
highly desirable that users have the freedom to customize their meta-queriers: 1) by
selecting their preferred data sources according to the functionalities, data quality, and
site credibility of the sources; 2) by tailoring the global schema with needed functional
components. This is the ultimate goal of the proposed research.
Furthermore, user-specified source selection is not static but dynamic, and the
query interfaces of data sources are also evolving. For example, users might want to
insert a new data source into an existing meta-querier. The changes on source selection
might affect the functionalities and the global schema of a meta-querier, and can even
destroy the whole system. This means the existing global schema and mappings need
to be updated to adapt to these changes. Therefore, meta-queriers should be allowed
to dynamically evolve to adapt to the changing user needs and source interfaces (i.e.,
dynamic re-configurability). Most often, this important requirement is not fully considered
in the existing meta-querier research, especially in the context of multiple customized
meta-queriers.
This dissertation explores the open design and implementation issues, and outlines some preliminary work on a community-driven infrastructure and an ontology-based mapping management scheme to effectively achieve this goal.
1.2 Mapping Reuse in Community-driven Meta-querier Customization
Allowing ad hoc customization of meta-queriers without some planned strategies
could naturally result in a large number of independently constructed meta-queriers, many of them overlapping or redundant, with duplicated effort. Since meta-queriers
are designed for a specific application domain and users in the domain share some
common interests, it is beneficial for meta-querier builders to form a community for the
purpose of collaborative construction of meta-queriers through sharing of knowledge on
schema transformation. In the following, we summarize four research fronts along the
line of how the concepts of mapping reuse are to be investigated in the context of data
integration.
Mapping modeling: Mapping modeling is a critical problem in meta-querier customization. It is highly desirable to design new approaches to mapping modeling
for facilitating mapping management, utilization and reuse. The major challenges
include the potentially large scales and untraceable evolution of mappings. First, to
match various user needs and numerous data sources, the meta-querier customization
must solve the issues caused by the interoperability between the corresponding global
schemas of meta-queriers and local schemas of data sources. The potentially large volume of schemas (of both types) implies the need to model a large number of mappings, which is a new research topic. Second, to support dynamic re-configurability of meta-queriers, the mappings must evolve to adapt to emerging user needs and
the updated schemas. Since the changes on local schemas are normally unpredictable
and untraceable, the evolution of mappings cannot be modeled through the traditional
techniques in schema evolution and versioning.
Mapping sharing: Meta-querier construction centers on the mapping repositories shared by the meta-queriers in the same domain. Separate storage of mappings might
cause a high degree of data redundancy and potential update anomalies. Through
sharing of construction information, construction of a meta-querier can reuse the
previous work of its own and of peer meta-queriers. It is necessary to develop reuse-oriented repositories to store shared components for effective construction. The
major design considerations should include: 1) Facilitation of mapping reuse by both humans and machines. The internal structure should not only be processable by machines but also be easy to understand and manage, even for non-expert volunteers. 2)
Best-effort avoidance of human intervention in repository construction. The original
goal of the repositories is to reduce the human effort in building meta-queriers, and thus repository construction should not introduce more human involvement than the repositories save.
Mapping discovery: In the context of meta-querier customization, query-form matching needs to be revisited due to the increasing scalability issues caused by customization. In essence, it is equivalent to the classical research problem of schema matching. It is well known that full automation is AI-complete [54, 91]: too much ambiguity and uncertainty exists in real-world applications. Inevitably, human intervention cannot be avoided in any practical solution. As a considerable number of mappings are to be discovered holistically rather than individually, care must be taken to reduce the overall workload by avoiding repetitive tasks and reusing the existing mappings.
Source selection: The selection of data sources determines the content coverage of meta-queriers. User-driven source selection is a convenient and straightforward method to customize meta-queriers. To achieve accurate selection, the system needs to understand the capabilities of data sources, which can be learned from the related query forms. To understand the semantics of query forms, ontology methodologies are commonly used. However, the construction and maintenance of ontologies are labor-intensive and error-prone. Since the mappings between the heterogeneous query forms of the data sources can be viewed as instances of relations connecting concepts, the potentially large number of unordered mappings could be employed to form an ontology. Thus, it is desirable to design a mapping-driven solution to ontology construction and an ontology-centric solution to source selection.
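One plausible reading of this mapping-driven idea, offered purely as an illustration: if each mapping links two schema elements, then elements connected by chains of mappings can be grouped into candidate concept nodes, e.g., with a union-find pass. The function and field names below are assumptions, not the dissertation's algorithm.

```python
# Hypothetical sketch: group schema elements connected by mappings
# into candidate concept nodes using union-find with path halving.
from typing import Dict, List, Tuple

def concepts_from_mappings(mappings: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each mapping merges the groups of its two elements.
    for a, b in mappings:
        parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for element in parent:
        groups.setdefault(find(element), []).append(element)
    return {root: sorted(members) for root, members in groups.items()}
```

For example, mappings (depart, from), (from, origin), and (arrive, to) would yield two candidate concepts, {depart, from, origin} and {arrive, to}; a real construction algorithm would of course need to validate and split such groups.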
1.3 Contributions
Concentrating on the research fronts discussed above, our research makes the
following specific contributions:
• Architecture design: Our research is the first to explicitly articulate an open
three-layer architecture, called MQ-Customizer, for meta-querier customization.
The first layer, called service layer, is to capture the individual needs from users
for meta-querier discovery and construction. The bottom layer, called the builder layer, is composed of a reuse-oriented auto builder and a community-driven mass builder. The second layer, the info layer, stores and manages the information for
the operations in the other two layers. Additionally, for the first time, we introduce
into data integration a novel concept, mass customization, which originally is a
marketing and manufacturing strategy [34, 109] for combining low cost of mass
production and high flexibility of individual customization. Its open issues and
potential solution approaches are elaborated in the discussions of the proposed
system architecture.
• Mapping modeling: To capture the semantics of numerous and unordered
mappings, we introduce the concepts of ontology to model the mappings. These
semantics-based concepts make mappings easier for humans to understand and for machines to reason about. The structure of the ontology helps
to guide the abstraction, optimization and validation of the mapping information
for effective sharing and reuse by the community. Furthermore, we propose a
change-based model, the mapping object, to record the evolution of mappings for reuse. Unlike traditional techniques, our model can support the
untraceable evolution of mappings.
• Mapping repository: Two different repositories (called M-Ontology and MO-Repository) are respectively designed based on the proposed ontology-based and change-based mapping models. To adapt to the dynamic customization of meta-queriers, we also develop construction algorithms through incrementally
inserting individual mappings. The insertion procedures also consider the ease
of human understanding, which is one of the major design principles in the
community-driven approaches. In addition, the proposed mapping-driven ontology
construction is also a novel approach to building a task-specific ontology in the field of
knowledge representation.
• Mapping discovery: Although many attempts have been made at mapping discovery (a.k.a. schema matching), the discovery of complex mappings is still an
open problem. We propose a promising approach to this problem with assistance
of MO-Repository and M-Ontology. In essence, this is a reuse-oriented solution that reuses a mapping's own history (i.e., its previous versions) and its peers (i.e., the other mappings in the same domain). Our approach is straightforward to ordinary users
so that they can be easily involved in mapping discovery and validation.
• Source selection: To better reuse the pre-integrated data sources, we provide a capability-based source selection algorithm to recommend potentially desired sources to users. Complementary to the query-based solutions, our approach is
based on the query capabilities of data sources, instead of the sampled source
contents. Unlike the prior work on query-form clustering, our approach is able to
distinguish the query capabilities of the data sources whose query forms have
been clustered in the same domain.
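The change-based mapping object mentioned in the mapping-modeling contribution could, in a minimal hypothetical form, record a mapping as a log of changes so that any past version can be replayed for reuse. The class below is a sketch under that assumption, not the actual model of Chapter 4.

```python
# Hypothetical sketch of a change-based "mapping object": store the
# sequence of changes rather than whole versions, and replay a prefix
# of the log to reconstruct any historical version for reuse.
from typing import List, Set, Tuple

Correspondence = Tuple[str, str]      # (global element, local element)
Change = Tuple[str, Correspondence]   # ("add" | "remove", correspondence)

class MappingObject:
    def __init__(self) -> None:
        self.changes: List[Change] = []

    def add(self, corr: Correspondence) -> None:
        self.changes.append(("add", corr))

    def remove(self, corr: Correspondence) -> None:
        self.changes.append(("remove", corr))

    def version(self, upto: int) -> Set[Correspondence]:
        """Replay the first `upto` changes to rebuild that version."""
        state: Set[Correspondence] = set()
        for op, corr in self.changes[:upto]:
            if op == "add":
                state.add(corr)
            else:
                state.discard(corr)
        return state
```

Because only deltas are stored, this shape tolerates the unpredictable, untraceable evolution discussed above: a new observation simply appends changes rather than requiring a known schema version to diff against.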
1.4 Dissertation Outline
The rest of the dissertation is organized as follows: Chapter 2 discusses the current
state of the art on data integration, meta-queriers and the requirement specification
in data integration systems. We make a system-level comparison of our work with the
others. Chapter 3 focuses on the system design of MQ-Customizer. In this chapter, we
first discuss the open issues in meta-querier customization. We then present our
proposed workflow of the customization process. To realize the two-phase customization
process, we design a three-layer architecture for the MQ-Customizer. Under this system,
we propose the potential solutions with the relevant challenges and directions for future
work. The solutions include ontology-based mass customization and reuse-oriented
construction and maintenance. Chapter 4 first presents query-form modeling for
meta-queriers, and then two models of mappings from the perspectives of their
semantics and evolution history. These two mapping models are change-based and
ontology-based modeling. Finally, we review the various mapping models published in
the literature in comparison with our models. Based on our proposed two models, we
develop two corresponding mapping repositories, i.e., M-Ontology and MO-Repository,
discussed in Chapter 5. This chapter also includes their incremental construction
algorithms with human-friendliness consideration. Chapter 6 and Chapter 7 respectively
apply these two repositories to address two critical research problems, mapping
discovery and source selection. In Chapter 8, we presents the implementation and
experimental analysis for repository construction, source selection and mapping
discovery. Finally, we conclude with a summary of future work in Chapter 9.
CHAPTER 2
RELATED WORK
2.1 Data Integration
Data integration aims at providing users a uniform way to manipulate and manage the (heterogeneous) data residing at different sources. Major solutions can be
categorized into two groups based on the location of data: physical integration and
virtual integration. Physical integration (a.k.a., data warehousing) loads data into a
repository through data extraction and transformation from a collection of data sources.
Users can directly interact with the physical repository without any query reformulation.
As opposed to physically integrating the data sources, virtual integration provides users
a virtual view (a.k.a., global schema) of underlying data sources without any physical
storage. User queries through such a virtual view are translated to respective local
queries and then the combined local query results are returned. The query translation
is based on the mappings between the global schema and the local schemas of the
underlying sources.
Content on the Web is blossoming: it has experienced tremendous growth in (semi-)structured information, in particular in the databases behind the deep Web [28, 87].
Virtual integration techniques are widely applied in many data integration frameworks
[54, 73] to accommodate the rising needs of integrating data sources over the Web.
These data sources are autonomous agents, independent of data integration systems. Most such sources are only accessible through HTML query interfaces. Thus, data completeness and freshness are almost impossible to achieve in physical integration. Studies [15, 32, 58] have shown that virtual integration
is the only practical solution to accurate information retrieval and integration from online
data sources.
2.2 Meta-queriers
Meta-queriers are virtual data integration systems. The most significant benefit
of a meta-querier is its ability to utilize multiple local search engines to query multiple
different but related objects at the same time. Many such commercial meta-queriers,
such as kayak.com, have implemented similar but simplified functions. Their internal
designs are proprietary and often unreported. With the prevalence of e-commerce,
multiple research groups [29, 59, 95, 127] have attempted to analyze and build
meta-queriers from various perspectives. These systems can be viewed as an automatic
integration of web databases to facilitate efficient web exploration.
However, these frameworks only focus on the initial construction of meta-queriers,
but ignore the maintenance issues caused by evolving user needs. In addition, they do
not describe any specific details for the storage of mappings, which are essential for
the construction and operation of meta-queriers, such as schema matching, schema
merging, and query translation.
The construction of a single meta-querier can be divided into five major research
problems: result extraction [6, 25, 45, 71, 126, 136, 153, 154], query-form extraction
[19, 59, 60, 62, 70, 97, 115, 151], result merging [8, 23, 31, 42, 44, 118, 132, 146],
query-form matching [55, 56, 59, 90, 135, 140, 143, 144, 152], and global-form generation
[3, 41, 59, 142]. Since the first two research problems are not affected by meta-querier
customization, the related research results can be directly applied to MQ-Customizer.
Personalized result merging and query refinement have been widely studied in the
previous research on the construction of customized meta-search engines [42, 49,
63, 68, 83, 85, 99, 145]. These customized meta-search engines mainly function as
document/text retrieval systems, and thus, their local query forms are simple and
uniform. They do not address the challenges of query-form matching and global-form
generation in terms of complex local forms. The scalability and maintenance issues resulting from meta-querier customization did not arise in these customized meta-search engines.
We will focus on query-form matching and global-form generation that need
to be revisited in the context of meta-querier customization. In essence, they are
equivalent to two classical research problems of schema matching [12, 46, 65, 72,
103, 116] and schema merging [12, 14, 18, 82, 92, 113, 114, 128, 133]. Although
much research effort has been made in different contexts (e.g., semi-structured data,
ontology and ER model), the existing solutions still heavily rely on human involvement.
It is impractical to solve them separately due to the increasing scalability issues
caused by customization. To tackle these challenges, we propose an ontology-based
community-driven MQ-Customizer by leveraging the opportunities brought by meta-querier
customization (i.e., human contribution from a potentially large number of users,
and reuse potentials from a possibly large scale of existing meta-queriers and their
components).
2.3 Requirement Specification in Data Integration Systems
Data integration systems are regarded as mediators to effectively match information
needs and resources. The scale and diversity of user needs have been recognized in
the context of data integration over the Web [29, 87]. Generally, users can specify their
preference and needs explicitly or implicitly. MetaQuerier [29] and WISE-Integrator [59]
(data integration systems) allow users to explicitly select a domain of interest. For each
domain, a single mediated schema is created by combining the local schemas of all
inclusive data sources. Such a mediated schema can only provide the functionalities
that are supported by all the underlying sources; otherwise, the results from data sources cannot conform to the users' original queries. The more data sources it includes, the fewer functionalities it can support. Moreover, different users may
have different selection criteria for data sources, e.g., data quality, site credibility, and brand loyalty. Thus, this approach can hardly satisfy diverse user needs and preferences.
PAYGO-based data integration architecture [87] attempts to obtain the information needs
from the keywords input by users. However, the extracted user needs are usually not
exact and it is difficult for machines to transform user queries from flat keywords to
structured queries. MySearchView [59, 85] enables users to specify their preferred data
sources for generating the personalized data integration systems, but it provides users
a very limited source type for selection, i.e., only single-keyword-box data sources. In
reality, query forms are much more complex, i.e., they include more controls. Due to
the limit on source types, MySearchView does not address the challenges of mapping
discovery between complex query forms, which is at the core of data integration system construction.
Our work follows the basic idea of MySearchView on personalized source
selection, but employs a different construction approach, i.e., community-driven and
reuse-oriented construction, for the purpose of eliminating its limitation on data sources.
Since such customization leads to a large variety of data integration systems to be
constructed, the construction burdens should be distributed to a considerable number of
cooperative members in a community, which can include users, technicians and domain
experts. In this setting, we propose a “human-friendly” mapping repository for efficient
mapping storage, management and discovery. Unlike the traditional data integration
systems, these data integration systems in the same domain share the same mapping
repository. Its main consideration is that the construction and maintenance of the data
integration systems might benefit from the previous outcomes of repetitive tasks. This
assumption is confirmed by our experimental results: the more mapping information a mapping repository owns, the more benefit the construction and maintenance of these systems can obtain.
CHAPTER 3
META-QUERIER CUSTOMIZATION
3.1 Research Questions
A trend toward an increasing variety of meta-queriers is noticeable even within the same application (e.g., airline ticket booking). It is impossible to build a one-size-fits-all
meta-querier that can satisfy diverse user needs and preferences [158]. Therefore, a
meta-querier construction system intended for a large audience must provide users with
the freedom to customize their meta-queriers. Many challenging issues arise and we
propose to address the fundamental questions on the design and implementation of an
infrastructure to support dynamic customization of meta-queriers:
Q1: How can users’ needs be best accommodated through the customization of meta-queriers?
From the users’ viewpoints, the global query form and the selection of source
databases are the most critical (or perhaps the only) factors that influence the contents
retrieved from the meta-queriers. First, source selection determines the content
coverage of meta-queries. The meta-queriers are virtual data integration systems
[73] that do not physically store any information. That is, the returned contents are
completely determined by the underlying data sources. Second, modifying the global
schema is the only way for the users to express their demands on the results. All the
returned contents should conform to the constraints set by the users, i.e., by
modifications of the controls (clicking radio buttons, selecting from drop-down menus,
entering text, etc.). In a sense, the contents of global forms are also decided by the
selection of data sources, since the global forms should only consist of the functionalities
supported by every underlying data source; otherwise, the results might include
some or many records that violate the original user-specified conditions. Therefore, source
selection is arguably one of the most critical problems in meta-querier customization. In
our research, we focus on source selection for customization.
Q2: What are the challenges for supporting meta-querier customization?
Automating the construction of data integration systems is one of the primary
research areas in the information integration community [40, 54, 123, 125]. The
construction of meta-queriers has also received considerable attention [29, 59, 95, 127].
Much effort has been made especially on two fundamental research issues: i)
constructing and maintaining global schemas; ii) constructing and maintaining schema
mappings. All the existing complete/partial solutions mainly aim at building a single (or
very few) data integration system, and most of them still rely heavily on the involvement
of domain experts and system designers. Supporting meta-querier customization is
likely to result in the creation of a large number of meta-queriers, which is impractical
for a small set of domain experts and system designers to construct and to maintain.
Our research tackles both construction and maintenance issues with an approach that
maximizes machine automation while minimizing the required human involvement.
Q3: Why is the reuse-oriented approach essential for supporting community-
based customization?
Another opportunity brought by meta-querier customization is the reuse potential
of the previous human efforts. In the same application-specific domain, users share
common interests and similar background. To meet user needs, data sources offer
highly overlapped functionalities. The designs of query forms often follow the same
trend. The involved vocabularies and control types are “clustering in localities and
converging in sizes” [28]. When more and more meta-queriers are available in the same
domain, we believe that the reuse of these existing meta-queriers (i.e., system-level
reuse) and their components (i.e., component-level reuse) is arguably the best approach
to construct new meta-queriers. Our proposed strategies, models and algorithms
capture the essence of reuse.
Figure 3-1. The workflow of MQ-Customizer.
3.2 MQ-Customizer
3.2.1 Customization Workflow
Before digging into the details of MQ-Customizer, we first present the workflow
of the customization process. Imagine a user who wants to construct a meta-querier
tailored to his/her individual needs. As illustrated in Figure 3-1, the whole process of
customization is abstracted in two phases from the viewpoint of the user:
Phase 1: Resource Selection. From a user-specified domain, Meta-querier
Recommender and Source Recommender recommend a ranked list of meta-queriers
and a ranked list of data sources, respectively, based on user requirements. The
requirements are derived from the user inputs: the preferred data sources and the
corresponding application domain. The domain can be selected only from the existing
ones in the system. If the user can find an appropriate meta-querier in the meta-querier
list, the final goal is achieved by reusing an existing one; otherwise, the process enters
the next phase to create a new one.
Phase 2 : Meta-querier construction. A new meta-querier can be generated by
integrating a set of data sources from scratch, or built on top of an existing one by
removing the unwanted data sources and inserting some additional sources. Following
Figure 3-2. The system architecture of MQ-Customizer.
the user requirements, Auto Builder attempts to construct a new meta-querier without
human assistance. If the effort fails, Mass Builder is invoked to continue the task with
Auto Builder. Mass Builder relies on a mass collaboration strategy. We believe the users
not only have enough motivation to build and maintain their own meta-queriers, but
also are willing to volunteer to assist others. These users along with a small number
of domain experts can practically form a collaborative domain-specific community for
meta-querier construction. Finally, a customized meta-querier is delivered to the user.
3.2.2 System Architecture
To realize the two-phase customization process, we present a three-layer architecture
for the MQ-Customizer in Figure 3-2. For simplicity, this figure only shows the major
components relating to the three research strategies discussed in the introduction,
i.e., mass customization, mass collaboration, and reuse-oriented construction and
maintenance.
Service Layer provides interactive services to assist users in discovering and
constructing meta-queriers based on their individual needs. Meta-querier Recommender
performs the recommendation of the existing meta-queriers with the pre-integrated
data sources. With the assistance of Compatibility Checker , Source Recommender
guides and recommends users in their selection of data sources. Both recommenders
should be fully automated to provide users with an interactive environment. The success
of recommendation is mainly determined by the understanding of user requirements
and resource capabilities/functionalities. To improve the performance of run-time
recommendation, a functionality-based algorithm should be designed for matching
user needs and system resources (including all the meta-queriers and their underlying
data sources). The mining algorithms for discovering user needs and preferences have
been extensively studied in user adaptive systems, e.g., content-based [10, 105, 106]
and collaborative [5, 124, 137] filtering algorithms. They should be combined with our
ontology-based resource selection to implement the three major components in the
service layer.
Builder Layer contains the reuse-oriented Auto Builder and the community-driven Mass
Builder. Two major research issues here are schema generation (i.e., generating global
query forms) and schema matching (i.e., discovering query-form mappings) for the
construction and maintenance of a large number of meta-queriers. Unfortunately, both
are well-known AI-complete problems [54, 91]. Most existing approaches still rely heavily
on human involvement, especially in the context of dynamic and complex schema
integration. Auto Builder is proposed to address the two problems by maximally reusing
the previous validated outcomes. The research focus is to design human-friendly
reuse-oriented Schema Matcher and Generator to reduce and to facilitate the inevitable
human interaction. Auto Builder is augmented by the Mass Builder, which exploits
collaborative intelligence to further tackle the same problems. However, collaborative
activities in Mass Builder might lead to potential inconsistencies or errors introduced
due to intentional or unintentional mistakes by community members. Thus, a secondary
focus in this layer is to reduce the adverse influences through the incorporation of three
error-handling mechanisms: error avoidance, identification and recovery.
Information Layer stores and manages the information for the operation in the other
two layers. The first responsibility of Profile Recorder is to acquire user profiles (e.g., the
interaction between users and meta-queriers) and system logs (e.g., the involvement
records of community members), and then to store them in Profile Repository. The
second one is to record each human-validated version of meta-queriers (i.e., their
global and local schemas and the associated mappings) into Schema Repository.
With the assistance of community members, Repository Manager constructs and
maintains two complementary domain-specific mapping repositories: an evolution-
oriented repository (MO-Repository ) that records the evolution of mappings, and a
task ontology (M-Ontology ), which is the core of the whole system. M-Ontology has
multiple responsibilities through functioning as: (1) a mapping repository for efficiently
storing and managing a large number of mappings in a human-friendly manner; (2)
a communication platform for simple and secure mass collaboration of meta-querier
construction and maintenance; and (3) a knowledge base for maximal reuse of, and
complex reasoning over, previously validated schema matchings and generations.
Because of its critical roles in our system, our research addresses the issues in the
whole life cycle of M-Ontology, from its design and generation to its utilization and
maintenance. The
rest of this dissertation explains our work on M-Ontology with promising experimental
results over real-world data sources.
3.3 Ontology-centric Mass Customization
The proposed customization process of meta-queriers is a kind of mass customization
[34, 109], which lies between two extreme types of customization: one-size-fits-all
and individual personalized meta-queriers. It is desirable to minimize the number of
necessary and useful customizations to reduce the overall workload of meta-querier
construction and maintenance in the community. Our solution strategy focuses on the
best-effort reuse of pre-built meta-queriers and pre-integrated data sources in meeting
diverse user needs. The system-level and component-level reuse in meta-querier
construction will be discussed in Section 3.4. This section mainly copes with the
recommendation problem for this reuse (as discussed in Phase 1 of Section 3.2.1): how
to intelligently select meta-queriers and data sources in the domain to meet specific user
needs.
Related Work : This recommendation problem can be regarded as a source selection
problem in a non-cooperative environment, where the local data sources are autonomous
and self-governing, and their contents can be acquired only through submitting queries.
Different from sampling in distributed text retrieval systems [22, 51, 64], surfacing
the contents hidden behind complex HTML forms is very difficult [115, 140, 141] and
even infeasible [89]. That means this recommendation problem cannot be solved
through the traditional source selection techniques (e.g., selecting local sources by their
contents [50, 131] or scales [130, 131] estimated from query-based samples). Thus, it is
necessary to design new algorithms for such an emerging source selection problem.
Specific Research Tasks : The research should investigate how to exploit the query
capabilities [27, 74, 149] of resources for achieving more accurate recommendation.
The recommendation procedure is composed of three steps: acquiring user needs
from the current inputs and history records, calculating the matching scores between
the needs and each meta-querier and data source, and presenting users a list of
meta-queriers and data sources sorted by the matching scores. To integrate query
capabilities into this procedure, the following research tasks should be conducted:
• Resource modeling : Data sources and meta-queriers are the two primary resources
in our system. Query forms for these resources can be regarded as sets of query
conditions [62, 151], which we call schema elements. Each element consists of a control
and its associated attributes. A single query component or a set of components represents
a specific query capability [27, 74, 149] that a resource possesses. The components
with the same capability (normally in different query forms) are clustered to form a
higher-level capability concept (i.e., a G/A node) in M-Ontology, as discussed in Section
3.3. In a sense, M-Ontology is a domain-specific query capability repository, where
each connected G/A-Node sub-graph generally corresponds to an abstracted query
capability. Thus, each resource is modeled as a set of abstracted query capabilities.
Additionally, other resource properties (e.g., popularity, stability and credibility) that can
be discovered dynamically from interaction records are included in the resource
model.
• Requirement modeling and user interaction : In terms of query capabilities, user
needs and interests can be modeled by three preference vectors storing preferred
data sources, meta-queriers and capabilities, respectively. Three interaction mechanisms can be
introduced to construct these three vectors: 1) user inputs of the preferred data sources;
2) user selection of the preferred query capabilities from a list automatically generated
from M-Ontology; 3) user selection of meta-queriers with wanted/unwanted sources.
The preference vectors can be directly learned from the current and previous user
behaviors. In our current solution, based on the explicit capability specification from
users, the capability vector is acquired by grouping the capabilities of the user-preferred
meta-queriers and sources.
• Matching : The calculation of matching scores between needs and resources is based
on capability similarity. Similarity between user needs and resource capabilities can
be identified through comparison of the users’ preference vectors with the existing
resources and their query capabilities. It can be treated as a multi-criteria decision
making problem. Each criterion corresponds to the desirability of a specific capability.
By combining all the criteria, a utility function is needed to calculate the matching scores
that quantify the desirability of each data source for a particular user need.
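As a minimal sketch of this matching step, assume each resource is modeled as a set of abstracted capability identifiers from M-Ontology and user needs as a weighted capability preference vector; all identifiers, weights and resource names below are hypothetical, not taken from our system:

```python
# Illustrative sketch of capability-based matching (hypothetical data model).
# Each resource (data source or meta-querier) is a set of abstracted
# capability IDs; a user's preference vector maps capability IDs to weights.

def matching_score(preference: dict, capabilities: set) -> float:
    """Utility function: sum the desirability weights of the capabilities
    the resource supports, normalized by the total preference weight."""
    total = sum(preference.values())
    if total == 0:
        return 0.0
    covered = sum(w for cap, w in preference.items() if cap in capabilities)
    return covered / total

def rank_resources(preference, resources):
    """Return (name, score) pairs sorted by descending matching score."""
    scored = [(name, matching_score(preference, caps))
              for name, caps in resources.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)

# Hypothetical example: capabilities in an air-ticket booking domain.
prefs = {"depart_date": 3.0, "return_date": 3.0, "cabin_class": 1.0}
sources = {
    "S1": {"depart_date", "return_date"},
    "S2": {"depart_date", "return_date", "cabin_class"},
    "S3": {"depart_date"},
}
ranking = rank_resources(prefs, sources)
# S2 covers all preferred capabilities and therefore ranks first.
```

Each criterion here is simply the weight of one desired capability; richer utility functions could also fold in the dynamically discovered properties (popularity, stability, credibility) mentioned above.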
3.4 Reuse-oriented Meta-querier Construction and Maintenance
Dynamic customization of meta-queriers is our ultimate research goal. To achieve
this goal, a meta-querier must have two unique capabilities: 1) Customizability: its
construction enables users to specify their needs and preferences. 2) Dynamic
reconfigurability: it can evolve to adapt to the changing user needs and source interfaces.
Both system users and data sources are self-governing agents that are autonomous
and independent from meta-queriers. The maintenance issues are even more
important than the initial construction, especially in such a dynamic virtual environment.
Therefore, MQ-Customizer provides services for the construction and maintenance of
user-customized meta-queriers.
Specific Research Tasks : The research tasks are categorized as: a) design-level tasks
– analysis of user requirements, and methodologies of construction and maintenance; b)
implementation-level tasks – considerations and approaches to schema matching and
generation.
• User requirements : A fundamental question about the design of a meta-querier is:
“what kind of global query forms are expected by the normal users?” In global query
forms, schema elements can be classified based on how they match with local query
forms: 1) fully-overlapped elements are the global elements that have corresponding
elements in each and every underlying local schema; 2) partially-overlapped elements
are those that do not. We observe that the partially-overlapped elements greatly affect the
effectiveness of meta-queriers. Imagine a meta-querier integrating only two data
sources DA and DB. Assume that, besides fully-overlapped elements, its global schema
has two partially-overlapped elements Ea and Eb, from DA and DB respectively. If users
input some query conditions in both Ea and Eb (e.g., entering some text), the query
results should be null or further filtered by the meta-querier; otherwise, the results
might include some or many records that violate the original user-specified conditions.
Such result filtering and other post-processing operations are not hard to implement in
an integration system over a relatively small number of pre-configured data sources (e.g.,
Information Manifold [74], TSIMMIS [75]). However, the implementation of result filtering
becomes more challenging in the construction and maintenance of meta-queriers, where
a large number of fully-autonomous data sources need to be integrated. Best-effort
approaches are widely used in result extraction and merging.
Motivated by this observation, we propose two additional strategies complementary
to result filtering. The first strategy is to reduce the number of partially-overlapped
elements in global schemas. This can be achieved by introducing interaction mechanisms
in the source selection phase that guide and recommend users in their selection of data
sources. The second strategy is to reduce the possibility of inappropriate user inputs.
This can be accomplished in the global-interface generation phase by separating
partially-overlapped elements from fully-overlapped elements, and placing them in
groups based on the overlapping relations of their sources. We also provide an option to
exclude partially-overlapped elements from global schemas whenever possible.
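The fully/partially-overlapped distinction can be sketched directly, assuming each local schema is reduced to a set of element concept identifiers; the hotel-domain names below are hypothetical:

```python
# Sketch: classify global schema elements as fully- or partially-overlapped.
# Each local schema is represented as a set of element concept IDs.

def classify_elements(global_schema, local_schemas):
    """Split global elements by whether every local schema supports them."""
    fully, partially = [], []
    for elem in global_schema:
        if all(elem in local for local in local_schemas):
            fully.append(elem)
        else:
            partially.append(elem)
    return fully, partially

# Hypothetical hotel-booking sources DA and DB.
DA = {"city", "check_in", "check_out", "price_range"}
DB = {"city", "check_in", "check_out", "star_rating"}
global_schema = ["city", "check_in", "check_out", "price_range", "star_rating"]

fully, partially = classify_elements(global_schema, [DA, DB])
# fully     -> ['city', 'check_in', 'check_out']
# partially -> ['price_range', 'star_rating']
```

A generator following the second strategy above would place the two partially-overlapped elements in separate groups (one per supporting source), or omit them entirely when the exclusion option is chosen.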
• Methodologies of construction and maintenance : Our solution supports two types
of construction procedures: 1) Scratch construction is to construct a new meta-querier
from a given set of data sources. It is the default approach in all current meta-querier
construction systems [29, 59, 127]. To construct a meta-querier from scratch, the first
step is to find the correspondence among the local query forms (i.e., schema matching).
These element correspondences are used to generate global schemas (i.e., schema
generation). For translating user queries through global schemas to respective local
queries, the data transformation rules from global schema elements to local elements
must be discovered (i.e., schema matching). 2) Prefabricated construction is to build a
new meta-querier by re-configuring an existing meta-querier (e.g., inserting new data
sources or deleting included sources). Surprisingly, no sound solution exists in the
literature to cope with this construction approach. In our adaptation of the prefabricated
construction solution, MQ-Customizer first recommends meta-queriers and data
sources based on the specific requirements derived from user inputs such as keywords,
preferred data sources and essential functionalities. After the selection of a specific
meta-querier, the construction process is completed by two subsequent operations,
source deletion and source insertion, for removing unwanted and inserting preferred
user data sources, respectively.
After the initial construction, the global schemas and the associated transformation
rules need to be updated, if data sources change their original query interfaces or users
want to further insert/remove data sources. In essence, such system maintenance
is equivalent to a reuse-oriented construction process similar to the concept of reuse
in software development [11]. Thus, in our proposed solution, system maintenance
is integrated with prefabricated construction through system-level reuse (i.e., reusing
pre-existing meta-queriers). The two major operations, source deletion and insertion, for
implementing system-level reuse are discussed as follows.
To holistically construct and maintain a large number of meta-queriers, we take
into account three major design considerations: 1) Information-sharing: Construction of
a meta-querier can reuse the previous work of peer meta-queriers in the same domain
through sharing of construction information in the common knowledge bases, i.e.,
M-Ontology and Schema Repository; 2) Incremental construction: Construction of a
meta-querier always reuses its self-history as the first attempt of schema matching
and generation instead of starting from scratch; 3) Human-friendliness: the reasoning
procedures and results of construction should be easy to understand and manage even
for non-expert volunteers.
• Source insertion and deletion are two basic operations for incremental updates
of global schemas and their associated mappings for insertion and deletion of local
schemas. Intuitively, source deletion can be implemented by simply removing the
corresponding schema elements (if not shared) in the global schema and its associated
mappings [59]. However, this solution of source deletion cannot undo the effect of
source insertion. Consider, for example, an air-ticket booking meta-querier consisting of
two sources with different child-ticket age ranges, {2-16} and {4-12}. The integrated
global schema contains five sub-ranges on passenger age, {0-2, 2-4, 4-12, 12-16,
16+}, or else errors could occur. When one of the sources is removed, the global
schema remains the same. After successive source insertions and deletions on this
meta-querier, the global schema can become unnecessarily complex and even hard to
understand. This problem is worsened in our system since system-level reuse is applied
in both prefabricated construction and system maintenance. There is a need to develop
algorithms for source deletion and insertion that are inverses of each other. To insert or
delete a data source, MQ-Customizer first attempts to reuse the existing meta-queriers
with the same underlying data sources by searching Schema Repository. Then, the
ontology-based schema matcher and merger are called to update global schemas and
the associated mappings.
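The age-range example can be made concrete. One way to obtain deletion and insertion operations that are inverses of each other is to recompute the global partition from the current set of sources, rather than patching the previous global schema; the sketch below uses this recompute-from-sources strategy with a simplified range representation (all names are illustrative):

```python
# Sketch: recompute the global age partition from the current source set,
# so that deleting a source exactly undoes the effect of inserting it.
# A source range is a (low, high) pair; high=None denotes open-ended ("16+").

def global_partition(source_ranges, floor=0):
    """Build non-overlapping sub-ranges from the sources' breakpoints."""
    points = sorted({p for lo, hi in source_ranges for p in (lo, hi)})
    bounds = [floor] + [p for p in points if p > floor]
    parts = list(zip(bounds, bounds[1:]))  # bounded sub-ranges
    parts.append((bounds[-1], None))       # trailing open range, e.g. 16+
    return parts

sources = {"S_A": (2, 16), "S_B": (4, 12)}
both = global_partition(sources.values())
# -> [(0, 2), (2, 4), (4, 12), (12, 16), (16, None)]

# Deleting S_A: recompute instead of patching, yielding the simpler schema.
remaining = {k: v for k, v in sources.items() if k != "S_A"}
after_delete = global_partition(remaining.values())
# -> [(0, 4), (4, 12), (12, None)]
```

Because the partition is a pure function of the surviving sources, any sequence of insertions and deletions that returns to the same source set returns to the same global schema, avoiding the accumulation of stale sub-ranges described above.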
• Schema matching is to discover the mappings from one schema to another
schema. Although many-to-many mappings with conversion rules are pervasive in
real-world applications, fully automated discovery of them is almost impossible in most
existing learning-based [35, 39, 56, 86] and template-based [37, 43, 144] solutions.
Ontology-based approaches [7, 94, 117, 148, 157] show promising results for matching
schemas through an external ontology that specifies domain-specific knowledge.
However, there is no universal/generic ontology or even a small set of them in real-world
applications [100]. Constructing a customized global ontology [36, 110, 121] and
matching various local ontologies [46] are also labor-intensive, time-consuming and
error-prone tasks. Additionally, most ontologies do not contain conversion rules, which
are hard for machines to reason about. To tackle these problems, the proposed M-Ontology is
generated from schemas and mappings. As more schemas and mappings are inserted
into the ontology, more concept nodes and T-Edges can be generated and subsequently
more mappings can be discovered.
The basic algorithm for the schema matcher is to match two schemas by classifying
all the schema elements into the concept nodes in M-Ontology. The approach is based
on a feature of M-Ontology: when an E-Node that encapsulates a schema element is
classified into an existing G-Node, all the mappings associated with this G-Node are
automatically assigned to this E-Node. That is, all the E-Nodes in the same G-Node
share their mapping information, since they have identical semantics in the same
format. A-Nodes are selected if all their inclusive G-Nodes are already identified. The
mapping information stored in T-Edges can be reused if these edges connect the
selected A/G-Nodes. Following this basic idea, the critical issue is to classify the schema
elements into the correct concept nodes (i.e., G-Nodes).
We design a human-friendly algorithm for this specific element classification
problem. To classify an element en, we first search M-Ontology for its current or
previous versions. If none exists, we use semantic matching to compare en with
the representative object of each G-Node. For each G-Node gn in M-Ontology, a
representative object ro is automatically generated to describe its semantics as follows:
1) generating a bag of descriptive words DA by normalizing the descriptive attributes
of all human-verified E-Nodes in gn using NLP techniques [52] such as tokenization,
stop-word removal and stemming; 2) obtaining a set of descriptive labels DL by selecting
the terms with the top-k TF-IDF (i.e., term frequency-inverse document frequency [122])
weight from DA; 3) finding instances IST and constraints IC by combining the instances
and constraints of all verified E-Nodes. Finally, we generate the representative object ro
with a tuple 〈SetE−Node, DA, DL, IST, IC〉. In a sense, the element classification problem
is converted into a one-to-one schema-matching problem. This model provides three
main benefits: 1) ro offers community members simple and straightforward descriptions
of concept nodes that are easy to understand. Members can correct and enrich its
contents (discussed in Section 4.1); 2) The performance of the classification algorithm
can be easily improved by community members (e.g., through changing DL and DA); 3)
Most existing semantic matching techniques [46, 116] can be combined and integrated
into the classification algorithm. Instead of treating these matching techniques as black
boxes, we expose their matching procedures and results for efficient visualization and
control by the community members.
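Steps 1) and 2) of the representative-object construction can be sketched as follows; the tokenizer and stop-word list are toy placeholders (real preprocessing would also include stemming), and the G-Node contents are hypothetical:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "for", "or", "per"}  # toy list

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

def representative_labels(gnodes, k=2):
    """For each G-Node (a list of E-Node descriptive strings), pick the
    top-k terms by TF-IDF, treating each G-Node's word bag as a document."""
    bags = {g: tokenize(" ".join(texts)) for g, texts in gnodes.items()}
    n_docs = len(bags)
    df = Counter()                       # document frequency per term
    for bag in bags.values():
        df.update(set(bag))
    labels = {}
    for g, bag in bags.items():
        tf = Counter(bag)                # term frequency within this G-Node
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        labels[g] = [t for t, _ in sorted(scores.items(),
                                          key=lambda p: (-p[1], p[0]))[:k]]
    return labels

# Hypothetical G-Nodes built from E-Node descriptive attributes.
gnodes = {
    "g_price": ["Price range", "price per night", "nightly price"],
    "g_city":  ["Destination city", "city or airport", "arrival city"],
}
labels = representative_labels(gnodes)
# 'price' dominates g_price; 'city' dominates g_city.
```

The resulting label lists correspond to DL in the representative object; community members could then correct DL or the underlying word bags directly, which is exactly the human-friendliness benefit claimed above.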
CHAPTER 4
MAPPING MODELING
The cornerstones of MQ-Customizer are the mappings from global forms to local
forms. In our proposed solution, both global-form generation and user-query translation
rely on the existing mappings. Before digging into our reuse-oriented mapping discovery
and merging, this chapter first presents query-form modeling for meta-queriers, and then
two models of mappings from the perspectives of their semantics and evolution history.
4.1 Modeling of Query Forms
The proposed mapping repository requires a uniform representation of query
forms. This section introduces an undirected graph model for representing query forms.
Current languages used to represent query forms include HTML, CSS, and some
scripting languages. Structures of query forms can be classified into two groups:
one-step forms and multi-step forms. Theoretically, multi-step forms can be
decomposed into multiple one-step forms based on their appearance dependence [127].
Therefore, this work considers only the modeling of one-step forms.
For a one-step form, the W3C’s HTML specification [1] defines it as a section
of a document containing normal content, markup, special elements called controls
(checkboxes, radio buttons, menus, etc.), and labels on those controls. In meta-queriers,
global and local query forms can be regarded as sets of query conditions [62, 151].
User requests are normally made by modifying the HTML controls, e.g., clicking radio
buttons, entering text, etc.
A control and its associated attributes (e.g., name, id and class) and instances (i.e.,
possible user inputs) are regarded as “a whole”, also referred to as a schema element.
For instance, in the bottom left of Fig. 4-1, E1 is a schema element extracted from a
query interface. It has a control type “menu”, a label name “price”, a descriptive text
“Price”, and an instance set (from “20-40” to “350-400” and “All”). Typically, Query-form
Extractor (IE) can extract many useful attributes such as control type, name/label,
Figure 4-1. A query form and its graph model.
descriptive text, instances, data domain, default value, scale/unit (e.g., kg, million,
dollar), and data/value types (e.g., date type, time format, char type, etc.) [19, 59, 60,
62, 70, 97, 115, 151]. This work focuses on how to utilize these extractable attributes to
automate the construction and maintenance of meta-queriers.
In each global or local query form, schema elements are the most fundamental
building blocks. These schema elements are structured in a certain order and required
to obey some constraints, such as domain constraints and referential constraints. Two
schema elements are called syntactically equivalent iff all the attributes of the two
elements are the same.
To capture the structural semantics among different elements, we translate
query forms from their native format into undirected graphs. In such a graph, each vertex
corresponds to a specific schema element in one form. An edge is used to connect two
adjacent vertices, with a boolean property to represent its adjacency type, vertical or
horizontal. Each maximal connected subgraph corresponds to a semantic
block with a descriptive text D (if available). Each block is assigned a unique identifier
called BlockID. In addition, each graph corresponds to a unique query form. It can be
identified by its uniform resource identifier (URI) and a version number (denoted by
the time span T during which users can successfully access the data source through
this query form). The left portion of Fig. 4-1 shows a query interface for a hotel booking
system. It requires users to input three categories of information. The right portion
of Figure 4-1 shows its corresponding graph model. This graph is composed of three
corresponding semantic blocks. In each block, the vertical relations between schema
elements are denoted by solid lines, whereas the horizontal ones are denoted by dotted
lines.
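As an illustration, the graph model above can be encoded as follows; the class and field names are our own simplifications (e.g., the version is reduced to a single string rather than a full time span T, and blocks carry no descriptive text):

```python
# Sketch of the undirected query-form graph (field names are illustrative).
from dataclasses import dataclass, field

@dataclass
class SchemaElement:
    control: str                       # e.g. "menu", "textbox", "radio"
    label: str
    instances: list = field(default_factory=list)  # possible user inputs

@dataclass
class FormGraph:
    uri: str
    version: str                       # stands in for the time span T
    vertices: dict = field(default_factory=dict)   # id -> SchemaElement
    edges: list = field(default_factory=list)      # (id1, id2, vertical?)

    def add_edge(self, a, b, vertical):
        """Connect adjacent vertices; the boolean marks vertical adjacency."""
        self.edges.append((a, b, vertical))

    def blocks(self):
        """Maximal connected subgraphs = semantic blocks (via union-find)."""
        parent = {v: v for v in self.vertices}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b, _ in self.edges:
            parent[find(a)] = find(b)
        groups = {}
        for v in self.vertices:
            groups.setdefault(find(v), []).append(v)
        return list(groups.values())

# Hypothetical hotel-booking form with three elements.
g = FormGraph("http://example.com/search", "2011")
g.vertices = {
    "E1": SchemaElement("menu", "price"),
    "E2": SchemaElement("textbox", "city"),
    "E3": SchemaElement("textbox", "check-in"),
}
g.add_edge("E2", "E3", vertical=True)   # city placed above check-in
# E1 is unconnected, so it forms its own semantic block: 2 blocks total.
```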
4.2 Change-oriented Mapping Modeling
Before the introduction of change-oriented mapping modeling, we first present the
definition of mappings between two query forms based on the representation of query
forms (described in the last section). Having a semantically rich representation of
mappings is particularly important. The rest of this dissertation uses the following definition.
Definition 1. An element mapping mapST (also called a mapping instance) is an instance
of a specific relation from a query form QFS to QFT . It can be represented by a tuple
〈EListS ,EListT ,ExpST 〉, where
• EListS and EListT are two ordered lists of schema elements respectively from QFS
and QFT , whose semantics are relevant to each other. The element number of a
list can be one or greater than one, and thus mapping cardinality might be 1:1, 1:n
or n:m (n > 1 and m > 1).
• ExpST denotes a high-level declarative expression that specifies the transformation
rules from EListS to EListT . Expressions can be list-oriented functions (e.g.,
equivalence, concatenation, mathematical expressions) or other more complex
statements (e.g., if-else, while-loop). In addition, the format of ExpST should be both
human-understandable (i.e., easily modifiable by normal users) and
machine-processable (i.e., automatically transformable into executable rules).
• An element mapping without ExpST is called a correspondence corrST .
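Definition 1 can be encoded directly as a small data structure; the sketch below is a simplification in which ExpST is a plain string rather than a full declarative expression language, and the example element names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ElementMapping:
    """An element mapping <EList_S, EList_T, Exp_ST> per Definition 1."""
    elist_s: list               # ordered elements from source form QF_S
    elist_t: list               # ordered elements from target form QF_T
    exp: Optional[str] = None   # None => a bare correspondence corr_ST

    @property
    def cardinality(self) -> str:
        """Mapping cardinality, e.g. '1:1', '1:n' or 'n:m'."""
        return f"{len(self.elist_s)}:{len(self.elist_t)}"

    @property
    def is_correspondence(self) -> bool:
        return self.exp is None

# A 2:1 mapping joining two local date fields into one global field.
m = ElementMapping(["depart_month", "depart_day"], ["depart_date"],
                   exp="concat(depart_month, '/', depart_day)")
# m.cardinality == "2:1"; m.is_correspondence == False
```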
4.2.1 Motivating Scenario
Mappings are not static but dynamic in the context of meta-querier customization,
where both end users and data sources are autonomous agents that are independent
from meta-queriers. The concrete scenarios of mapping evolution can be characterized
into two types:
1) External changes. Changes in the schema elements of local and global
forms often cause the evolution of the corresponding mappings, as shown in Figure 4-2
(a) and (b). For example, if a car-rental local form wants to support new emerging car
models (e.g., 2011 Ford Fiesta), the corresponding entries need to be included in its
car-model control (e.g., a selection menu). Furthermore, changes in global forms occur
due to the evolution of its underlying local forms and user-specified source selection.
Consider, for example, an air-ticket booking meta-querier consisting of two sources with
different child-ticket age ranges, {2-16} and {4-12}. The integrated global form will
contain five sub-ranges on passenger age, {0-2, 2-4, 4-12, 12-16, 16+}, or else errors
could occur. If one source (e.g., 2-16) is removed, the age element in the global form
should be simplified (i.e., {0-4, 4-12, 12+}).
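The sub-range computation in this example can be sketched as follows: the global age axis is cut at every source-range endpoint, and consecutive cut points delimit the sub-ranges. The function name is illustrative; this is a sketch of the example's arithmetic, not the dissertation's integration algorithm:

```python
import math

def integrate_age_ranges(source_ranges):
    """Split the global age axis at every source-range endpoint.

    Each source range is a (low, high) pair; the result is the list of
    sub-ranges the global form must offer so that every source range is
    a union of sub-ranges.
    """
    cuts = {bound for r in source_ranges for bound in r}
    points = sorted(cuts | {0, math.inf})
    return list(zip(points[:-1], points[1:]))

# Two sources with child-ticket ranges 2-16 and 4-12:
print(integrate_age_ranges([(2, 16), (4, 12)]))
# -> [(0, 2), (2, 4), (4, 12), (12, 16), (16, inf)]

# After removing the 2-16 source:
print(integrate_age_ranges([(4, 12)]))
# -> [(0, 4), (4, 12), (12, inf)]
```

The two outputs correspond to the sub-range sets {0-2, 2-4, 4-12, 12-16, 16+} and {0-4, 4-12, 12+} from the scenario.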
These changes to query forms are unpredictable and untraceable, since their
updates (e.g., to functionality and representation) often occur without any notification.
That is, it is almost impossible to determine when and how a query form is
changed step by step. Syntactically different query forms with the same
URI are normally regarded as different versions of the gateway/portal to a specific
data source; the data source is accessible through these forms during
different time spans.

The unpredictable and untraceable characteristics contradict the key
assumptions [119, 102] of schema evolution and versioning, two traditional
research problems in the database field. Thus, we cannot simply employ the related
techniques to model the changes of query forms. Although we cannot automatically
identify the step-by-step change processes, the snapshot-by-snapshot processes are
still useful resources in mapping discovery.

Figure 4-2. Mapping evolution scenarios.

In essence, external changes indicate
not only the element-to-element correspondences but also the evolution trends in the
same application domain. Both information can be employed to facilitate the automatic
discovery of new mappings and the later manual correction.
2) Internal changes. Modification of mappings comprises changes to the
mapping composition and updates to the related context information, i.e., the
metadata of mappings. A mapping instance can be created from scratch, or derived from
another instance with some modification. There is no guarantee that mapping
instances discovered by machines or humans are completely free from error and
always function well. These changes are traceable. As shown in Figure 4-2 (a), (b)
and (c), corrections include changes to the expressions (i.e., ExpST ) and
to the element lists (i.e., EListS and EListT ). To enhance the robustness of customized
meta-queriers, internal changes should be stored for possible recovery, especially in
community-based construction and maintenance.
4.2.2 Modeling
Based on the above observations, we design a change-oriented mapping model to
preserve mapping evolution. First, we explain the life cycle of a mapping in our overall
framework. Then, a model (referred to as mapping objects) is presented for mapping
evolution.
Mapping Lifecycle: Each mapping instance has its own lifecycle starting from the initial
creation to the physical deletion. The state transitions are determined by its validation
status.

Figure 4-3. The life cycle of a mapping.

Fig. 4-3 illustrates a state diagram with the four states that a mapping can be
in: Validating, Usable, Detached, and L-Deleted. The Validating state of a mapping
indicates that its current correctness is undetermined. It remains in this state until
community members or machines validate it completely. A newly created mapping
begins its lifecycle in the Validating state. The state of an existing mapping transitions
into Validating when its correctness status is changed. While a mapping is in
the Usable state (i.e., ready to be utilized), it can perform correctly in the current
meta-querier. When a previously correct mapping is identified to be incorrect, it enters
the Detached state. Such a mapping might be useful for discovering new mappings.
Mappings that are of no value (e.g., never correct) enter the L-Deleted state (i.e.,
logically deleted). The lifecycle of a mapping ends when it is physically deleted.
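The lifecycle can be sketched as a small state machine. The transition table below is inferred from the prose description of Fig. 4-3, not copied from the figure, so the exact edge set may differ:

```python
# Transition table inferred from the prose around Fig. 4-3 (illustrative).
TRANSITIONS = {
    "Created":    {"Validating"},              # new mappings start validating
    "Validating": {"Usable", "L-Deleted"},     # validated correct, or valueless
    "Usable":     {"Validating", "Detached"},  # status changed, or found incorrect
    "Detached":   {"Validating", "L-Deleted"},
    "L-Deleted":  {"P-Deleted"},               # physical deletion ends the lifecycle
    "P-Deleted":  set(),
}

class MappingLifecycle:
    def __init__(self):
        self.state = "Created"
        self.transition("Validating")  # a newly created mapping begins validating

    def transition(self, target: str) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target
```

For example, a mapping validated as correct moves to Usable, and when later identified as incorrect it moves to Detached; moving a Detached mapping directly to P-Deleted would be rejected.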
Mapping Objects: Between one query form QFS and another QFT , there exists a set
of mapping objects. A mapping object mapObjST denotes a specific relation between
one query form QFS and another QFT (i.e., QFS→QFT and QFT→QFS ). It can be
represented by a bipartite graph whose edges Set〈RST ,RTS 〉 only connect nodes
from two disjoint node sets Set〈NS〉 and Set〈NT 〉. Each node in Set〈NS〉 and Set〈NT 〉
respectively corresponds to an element list from QFS and QFT . Each solid edge
represents a mapping instance of mapObjST . Based on their semantics, every pair of
nodes respectively in Set〈NS〉 and Set〈NT 〉 forms a correspondence corrST .
Each mapping object consists of a group of mapping instances that represent the
same relation between two query forms. These instances are regarded as different
versions of this object.

Figure 4-4. The incremental formulation of a mapping object.

Thus, a mapping object can be represented using a bipartite
graph G = 〈Set〈NS〉, Set〈NT 〉, Set〈RST/RTS 〉〉, as illustrated in Fig. 4-4(f). We assume
that all the mapping instances in the same direction (e.g., QFS→QFT ) are independent
of each other. This assumption is feasible since, whenever a constraint exists between
two instances, they can easily be merged, until no constraint remains between any pair.
Fig. 4-4 shows the formulation of a mapping object between one query form QFS
and QFT . Originally, the mapping object is only a single mapping instance R1ST (shown
in Fig. 4-4(a)). Later, another instance R2TS in the inverse direction is added to the
object (shown in (b)). When external changes occur in QFS , the two original mapping
instances, R1ST and R2TS , are detached and a new mapping instance R3ST is automatically
discovered by machines (shown in (c)). Then, since R3ST is incorrect based on manual
validation, it is logically deleted and replaced by another mapping instance R4ST (shown
in (d)). After the internal changes shown in (e), all the existing usable mappings
are detached. Finally, R5ST is the only usable mapping instance.
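The bipartite-graph view of a mapping object and the evolution steps above can be sketched in a few lines. The class and node names are illustrative assumptions, and the states reuse the lifecycle labels from Fig. 4-3:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class MappingObject:
    """Sketch of mapObj_ST as a bipartite graph between QF_S and QF_T."""
    nodes_s: Set[str] = field(default_factory=set)   # Set<N_S>: element lists of QF_S
    nodes_t: Set[str] = field(default_factory=set)   # Set<N_T>: element lists of QF_T
    edges: List[list] = field(default_factory=list)  # [instance id, src, dst, state]

    def add_instance(self, iid: int, n_s: str, n_t: str, direction: str = "ST"):
        self.nodes_s.add(n_s)
        self.nodes_t.add(n_t)
        src, dst = (n_s, n_t) if direction == "ST" else (n_t, n_s)
        self.edges.append([iid, src, dst, "Usable"])

    def detach_usable(self):
        # e.g., an external change in QF_S invalidates all usable instances
        for e in self.edges:
            if e[3] == "Usable":
                e[3] = "Detached"

    def usable_ids(self) -> List[int]:
        return [e[0] for e in self.edges if e[3] == "Usable"]

# Replaying panels (a)-(c) of Fig. 4-4 with hypothetical node names:
obj = MappingObject()
obj.add_instance(1, "N1S", "N1T")                   # (a): first instance, direction ST
obj.add_instance(2, "N1S", "N1T", direction="TS")   # (b): inverse-direction instance
obj.detach_usable()                                 # (c): external change in QF_S
obj.add_instance(3, "N2S", "N1T")                   # (c): machine-discovered instance
```

After these steps only the newly discovered instance remains usable, while the earlier two are kept in the Detached state as resources for later discovery.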
4.3 Ontology-based Mapping Modeling
In the context of meta-querier customization, the scale of such mappings might be
considerably large since a potentially large number of meta-queriers need to be built to
meet various user needs. We develop a novel semantics-based approach to modeling
mappings [77]. This model can be employed to organize unstructured domain-specific
mappings into a well-defined ontology. Such modeling makes mappings easier for
humans to manage, especially non-technical users. Many-to-many complex
mappings can also be discovered more efficiently by leveraging the ontology abstracted
from the existing mappings.
4.3.1 Motivating Scenario
To motivate our proposed model, we use a simple scenario illustrated in Fig. 4-5(a).
Consider two simple meta-queriers MS1 and MS2 built for job seekers with different
preferences. MS1 and MS2 provide users with different query forms (i.e., mediated schemas)
by integrating their own data sources. There exists an overlap between their two sets of
sources, such as LS . The contents of LS can be accessed by a query form (called a local
schema). In a query form, the components (e.g., textboxes and menus) and the related
descriptive texts and potential instances are regarded as schema elements.
In order to integrate LS into MS1 and MS2, the system designers need to specify the
mappings between schemas (called schema-level mappings) MS1 → LS and MS2 → LS
for translating user queries. Since the schema-level mappings hide the mapping details,
we decompose them into two sets of mappings between schema elements (called
element-level mappings) so that reusing element-level mappings becomes possible. The
directed edges in Fig. 4-5(a) indicate the element-level mappings from MS1 and MS2 to
LS .
Without considering the structural information, the schemas of LS , MS1 and
MS2 can be transformed to a flat schema as shown in Fig. 4-5(b). Each schema is
represented as a dotted-line rectangle, in which a solid-line rectangle corresponds
to a schema element. A mapping edge and its connected elements constitute an
element-level mapping between schemas.
There are two common approaches to modeling element-level mappings. The first
simply uses a table to model mappings as shown in Fig. 4-5(c). Its columns are
employed to represent the properties of mappings, such as source elements, target
elements and mapping expressions. The second uses a mapping graph as shown
in Fig. 4-5(d) to connect all the mappings together. That is, each node corresponds
to a source or target element and each edge represents a mapping.

Figure 4-5. Three job search forms with the mappings.

This graph-view
structure is more explicit and straightforward for discovering new mappings through
composition of mappings [17, 48, 88].
However, the above two approaches become almost impractical for managing a
large volume of mappings, particularly when the mappings are shared and operated
by a community. Thus, we introduce a semantics-based mapping modeling to make
mappings easier to manage. In the MS1 and MS2 meta-queriers, we observed that some
mappings are highly similar. For example, two mappings {E4,E1,m1} and {E6, E1,m1}
share the same target elements E1 and mapping expressions m1. From these similar
mappings, some concepts can be identified by grouping the schema elements based
on their semantics; the related relations can be transformed from the corresponding
element-level mappings. These concepts and relations can be used to form a mapping
ontology. For instance, as shown in Fig. 4-5(e), the original pair of mappings
{E4,E1,m1} and {E6, E1,m1} can be represented by a single relation {C4,C1,m1}, where
the concepts C4 and C1 are formed respectively by {E4,E6} and {E1}. In this setting,
mapping management becomes more intelligible to human users. Manipulations on
individual mappings can be replaced by more straightforward and convenient operations
on concepts and relations. Mapping redundancy and incorrectness are easier to
detect by humans or even machines. Repetitive operations on semantically equivalent
mapping instances can also be avoided, eliminating potential update anomalies.
Furthermore, semantic modeling also benefits the discovery of new mappings by both
humans and machines. Since it is a semantics-based abstraction of mappings, the
reuse of previous mappings becomes more straightforward and effective. Based on the
above discussion, it is evident that semantic modeling is a better approach with several
advantages. The rest of this section presents our proposed ontology-based modeling.
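The collapse of {E4,E1,m1} and {E6,E1,m1} into the relation {C4,C1,m1} can be sketched by grouping mappings that share a target element and expression. This is a deliberate simplification: real concept formation in the model also uses element semantics, not just shared targets and rules, and the function name is an assumption:

```python
from collections import defaultdict

def lift_to_concepts(mappings):
    """Group element-level (source, target, rule) triples into
    concept-level relations by shared target and rule.

    Returns triples (source concept, target concept, rule), where each
    concept is a frozenset of schema elements.
    """
    groups = defaultdict(set)
    for src, tgt, rule in mappings:
        groups[(tgt, rule)].add(src)
    return [(frozenset(srcs), frozenset({tgt}), rule)
            for (tgt, rule), srcs in groups.items()]

# The mappings of Fig. 4-5(c), abridged:
relations = lift_to_concepts([
    ("E4", "E1", "m1"), ("E6", "E1", "m1"),   # collapse into one relation
    ("E5", "E2", "m2"),
])
```

Here the two m1 mappings collapse into a single relation whose source concept {E4, E6} plays the role of C4 and whose target concept {E1} plays the role of C1.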
4.3.2 Modeling
In this model, element-level mappings are considered first-class entities.
Mappings between two schemas can be decomposed into separate element-level
mappings; element-level mappings between two schemas can also be combined to
generate a schema-level mapping [53]. In Section 4.3.1, we used a simple example to
explain the basic idea of the proposed element-level mapping modeling. The example
shown in Fig. 4-5(e) comprises only a few simple one-to-one mappings, but real-world
mappings can be much more complex. Section 4.2 introduces a semantically richer
representation of element-level mappings. With this mapping representation, we can
present a graph model for the proposed mapping ontology.
The mapping ontology is modeled as a directed acyclic graph where a node
represents a concept with a set of associated instances; an edge corresponds to a
relationship between two concepts. Both nodes and edges have some properties that
describe their semantics and constraints. Based on the composition of concepts, all
the nodes are classified into three different types: Elementary, Generalization and
Aggregation. Edges also have three categories for different purposes: Part-of, Is-a and
Transformation. Fig. 4-6 and 4-7 depict an ontology fragment about departure dates in
the domain of air-ticket booking. The graph model is explained in detail with this
example as follows.
Figure 4-6. A fragment of a mapping ontology (E-Nodes and G-Nodes).
Elementary Nodes: An Elementary node (called an E-Node, shown as a rectangle in
Fig. 4-6) represents the concept of a schema element using a tuple 〈URI, D-Attributes,
Instances, I-Constraints〉, where URI identifies the location of the schema, D-Attributes
refer to the descriptive attributes associated with this element (e.g., name, id, value,
class, domain), and Instances and I-Constraints are the known data instances and the
instance constraints, such as the instance type and domain.
In ontology-based modeling, E-Nodes are the most fundamental concept units
for constructing high-level concepts, i.e., the concepts represented by Generalization
nodes and Aggregation nodes. E-Nodes are introduced to eliminate language
heterogeneity, which refers to differences in the languages
used to represent schemas, such as XHTML or HTML. Each schema element can
be transformed to an E-Node without semantic loss. The semantics of an E-Node is
reflected by the D-Attributes of the corresponding schema element. Schema designers
always attempt to use a schema element to convey a certain concept. Intuitively,
such a concept can be instantiated by the data instances under the corresponding
schema element. For example, the E-Node E1 in Fig. 4-6 corresponds to a schema
element whose D-Attributes are 〈name: outbound departMonth〉, 〈id: monthCalendar〉,
〈class: month〉 and 〈text: Depart〉, and whose instances are represented as numbers
from “1” to “12”. It can be correctly inferred from its D-Attributes and instances that this
element indicates a concept about departure months.
Generalization Nodes: A Generalization node (called a G-Node, shown as a circle in
Fig. 4-6 and 4-7) represents a concept by a tuple 〈SetE−Node , D-Attributes, D-Labels,
Instances, I-Constraints〉, where SetE−Node refers to an unordered set of E-Nodes as the
internal constitution of this G-Node. D-Attributes and D-Labels are used to describe the
semantics of the G-Node, and Instances denote a potential instance set Set INSgn under
the constraints of I-Constraints.
The semantics of a G-Node is equivalent to that of its inclusive E-Nodes. The
D-Attributes and D-Labels of a G-Node gn are two sets of word tuples, SetDAgn and
SetDLgn . Each word tuple (wi , wfi ) consists of a distinct word wi and its corresponding
appearance frequency wfi in gn. SetDAgn is generated from the normalized D-Attributes
of the included E-Nodes whose insertion is already verified by humans. The D-Labels
can be automatically derived from frequent appearance of D-Attributes or manually
input by humans. The detailed algorithms for finding the D-Attributes and D-Labels of a
G-Node are discussed in the next chapter. In addition, the Instances and I-Constraints
of a G-Node are directly obtained from the inclusive E-Nodes. For example, G3 has the
same instance set and I-Constraints as E1, E2 and E3.
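The construction of SetDA as (word, frequency) tuples can be sketched as follows. The normalization here (lower-casing and splitting on non-alphanumerics) is an illustrative assumption; the dissertation's actual normalization and the full algorithm are deferred to the next chapter:

```python
from collections import Counter
import re

def g_node_d_attributes(e_node_attr_values):
    """Build Set_DA of a G-Node: (word, frequency) tuples over the
    normalized D-Attribute values of its verified E-Nodes."""
    words = []
    for value in e_node_attr_values:
        words += [w for w in re.split(r"[^a-z0-9]+", value.lower()) if w]
    return sorted(Counter(words).items())

# Abridged D-Attribute values of the E-Nodes E1-E3 from Fig. 4-6:
attrs = ["outbound_departMonth", "monthCalendar", "dateLeavingMonth",
         "AIR_frommonth", "Depart", "Depart"]
print(g_node_d_attributes(attrs))
```

Frequently occurring words (here "depart") are the natural candidates from which D-Labels could be derived automatically, per the text above.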
The purpose of G-Nodes is to address syntactic heterogeneity. The E-Nodes in SetE−Node
share the same semantics and instance formats, but their syntactic representations
are different. That is, the instances of these E-Nodes share the same format with
identical semantics if they represent the same object, but the D-Attributes of these
E-Nodes could be totally different. For example, in Fig. 4-6, the concept of
G-Node G3 is generalized from three verified E-Nodes. Although these E-Nodes
have equivalent semantics and instance formats, their D-Attributes differ in
the number, types and values of attributes. It is not necessary to impose any syntactic
transformation when these E-Nodes exchange their instances.
Aggregation Nodes: An Aggregation node (called an A-Node, shown as a triangle
in Fig. 4-7) represents a concept by a triple 〈ListG−Nodes , D-Labels, Instances〉, where
ListG−Nodes denotes an ordered list of G-Nodes, D-Labels and Instances respectively
refer to its descriptive attributes and possible instances.
For the purpose of representing many-to-many mappings, A-Nodes are generated
by aggregating an ordered list of G-Nodes. That is, an A-Node saves the order of its
inclusive concepts. Its Instances and D-Labels can be fetched from its inclusive nodes,
and then combined in that order. For example, the two A-Nodes A1 and A2 in Figure
4-7 represent a concept “departure dates” by aggregating the same three G-Nodes
G1, G2 and G3. These G-Nodes respectively represent a calendar year, day and month
as numbers, and their formats are shown in Fig. 4-7. Although their semantics are
equivalent, their instance formats are different due to different element orders. A1 uses
the middle-endian form “month-day-year” (i.e., starting with the month). A2 uses the
big-endian form “year-month-day” (i.e., starting with the year). Note that aggregation
cannot be directly applied to E-Nodes. In other words, A-Nodes only connect with
G-Nodes. This requirement forces E-Nodes to be encapsulated into G-Nodes, which
indicates that more concepts can be generated by clustering schema elements.
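Ordered aggregation can be sketched in a few lines: an A-Node stores an ordered list of G-Nodes and joins their instances in that order. The function name and the joining convention (hyphen-separated) are illustrative assumptions:

```python
def a_node_instance(g_node_order, values):
    """Combine one instance per G-Node in the A-Node's stored order.

    `g_node_order` is the A-Node's ordered list of G-Node names;
    `values` maps each G-Node name to one instance value.
    """
    return "-".join(str(values[g]) for g in g_node_order)

# G1, G2, G3 represent year, day and month as numbers (Fig. 4-7):
values = {"G1": 2011, "G2": 25, "G3": 3}
a1 = a_node_instance(["G3", "G2", "G1"], values)  # A1: month-day-year
a2 = a_node_instance(["G1", "G3", "G2"], values)  # A2: year-month-day
print(a1, a2)  # 3-25-2011 2011-3-25
```

The same three G-Nodes thus yield semantically equivalent A-Nodes whose instance formats differ only in element order, which is exactly the structural heterogeneity that T-Edges address below.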
Is-a Edges and Part-of Edges: Is-a edges and Part-of edges (shown as single-arrow
edges separately in Fig. 4-6 and 4-7) represent hierarchical relations in mapping
ontologies. They respectively correspond to two distinct forms of concept abstraction:
generalization and aggregation. Both attempt to conceptualize E-Nodes. The main
benefit of generalization is to share semantic information among schema elements:
1) In a G-Node, all the E-Nodes have the same semantics and instance formats;
their descriptive labels (such as name, id, class) and instances can be exchanged for
understanding their semantics more precisely. 2) In a G-Node, all the inclusive E-Nodes
share the mappings. That is, as long as an E-Node is assigned to a G-Node, all the
pre-configured mappings associated with this G-Node are also applied to this E-Node.

Figure 4-7. A fragment of a mapping ontology (G-Nodes and A-Nodes). Instance
formats: G0: month:English; G1: year:num; G2: day:num; G3: month:num; G4:
month-day (month:num); A0: month-day-year (month:English); A1: month-day-year
(month:num); A2: year-month-day (month:num).

In addition, the introduction of aggregation is necessary for modeling n:m complex
mappings which include multiple schema elements in a certain order.
Transformation Edges: A Transformation edge (called a T-Edge, shown as a directed
edge with double arrows in Fig. 4-7) connecting two nodes represents a transformation
relation. Its mapping expression is stored as an attribute of the T-Edge. T-Edges
cannot connect E-Nodes.
In combination with G-Nodes and A-Nodes, T-Edges aim to address three different
types of heterogeneity between two semantically overlapping concepts: instance
heterogeneity, structural heterogeneity and semantic heterogeneity. 1) Instance
heterogeneity refers to the variety in instance formats for the same entity. For example,
both G0 and G3 represent departure months, but they use different formats, i.e.,
English and number respectively. Thus, a T-Edge TR5 might be created for transforming
instances from G3 to G0, if necessary. 2) Structural heterogeneity often occurs when the
aggregation orders of concepts are different. For example, A1 and A2 represent a date
in different orders, i.e., “month-day-year” and “year-month-day”. A T-Edge TR4 is used
to link A1 and A2. 3) Semantic heterogeneity represents differences in concept
coverage and granularity. For instance, TR3 connects two concepts, A1 and G4,
with different coverage. Unlike A1, G4 represents the departure date without the
“year” component.
In addition to the above explicit relations, there exist some implicit transformation
relations stored in G-Nodes. In each G-Node, the E-Nodes have an equivalence relation
with each other. As mentioned above, the instances of these E-Nodes are shared
among them without any change. This kind of relation is common in real applications.
4.4 Metadata in Mapping Modeling
The last two sections respectively present the change-based and ontology-based
approaches to modeling mappings. Besides the internal composition of mappings, their
related context information (i.e., the metadata) also plays a critical role, especially in
the context of community-based meta-querier construction. The metadata is mainly used
to facilitate community cooperation and enhance system effectiveness and robustness.
The metadata covers the who, what, where, when and how aspects associated with
the whole lifecycle of mappings, including their creation, verification, modification
and deletion. To represent the metadata, the proposed mapping modeling assigns the
following metadata to each node and edge:
• Creation: Nodes/edges can be created manually or automatically. If humans are
the creators, the user names and the creation time are recorded in the related
nodes/edges. If they are discovered by machines, the algorithm names, the
credibility values (ranging between 0 and 1 and indicating the confidence that
the mapping holds) and the creation time are stored. In addition, the creation
process is also recorded: when a mapping map is inserted, all the related nodes
and edges are annotated with map.
• Verification: Verification checks for accuracy and completeness of nodes/edges.
In the current implementation, each node/edge is accompanied by a set of
attribute triples: 〈VS , Name, Time〉, where VS refers to the validation status of
this node/edge, Name and Time respectively represent the user names and time
points. VS has three types of values: Undetermined, Incorrect, and Correct.
Human users are able to view and verify all the components of mapping ontologies
manually. Although the current solution does not automatically evaluate the
correctness of mappings, it is not difficult to support such functionality if machines
can detect the usage of mappings.
• Modification and Deletion: Mappings are not static. Their nodes/edges need
to be updated or deleted. Since they might also be maliciously or accidentally
revised or removed, some version mechanisms [33, 66] should be merged into
ontology-based modeling.

Figure 4-8. The life cycle of a node/edge.

Although the current implementation does not include
such functionalities, it does record the related information, such as who modified or
deleted the nodes/edges and when.
• Annotation: Not all mappings can be automatically discovered if there exist
hidden contexts. For example, the scale factor of instances can be set to
any integer value, such as 1, 1000 or 1000000. Monetary amounts can be
reported in different currencies, such as Dollar, Euro and Yuan. Normally, such
contexts are almost impossible for machines to infer from individual mapping
instances. Thus, our solution is to enable users to annotate nodes and edges
with additional information that helps others understand them.
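The per-node/edge metadata described above can be sketched as one record. Field names are illustrative; VS takes the three values Undetermined, Incorrect and Correct from the Verification bullet, and the annotation list covers hidden contexts such as scale factors or currencies:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Metadata:
    """Sketch of the metadata attached to a node or edge."""
    creator: str                           # user name or algorithm name
    created_at: str                        # creation time
    credibility: Optional[float] = None    # in [0, 1]; machine-created only
    verifications: List[Tuple[str, str, str]] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)  # hidden contexts

    def verify(self, status: str, user: str, time: str) -> None:
        assert status in {"Undetermined", "Incorrect", "Correct"}
        self.verifications.append((status, user, time))  # <VS, Name, Time>

    @property
    def validation_status(self) -> str:
        # A node/edge with no verification record is still Undetermined.
        return self.verifications[-1][0] if self.verifications else "Undetermined"
```

For example, a machine-discovered edge might carry credibility 0.8 and an annotation "amounts reported in Euro", then move from Undetermined to Correct once a community member verifies it.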
4.5 Related Work
4.5.1 Mapping Modeling for Data Integration
Mapping modeling is the foundation of mapping repositories. The existing methods
can be classified based on the extent to which mappings are abstracted: instance-level
(IL) and abstract-level (AL) modeling. IL modeling is a method that regards mapping
instances as separate entities without any abstraction, while AL modeling considers
mapping instances as a whole body under some hidden templates/rules, which can be
discovered from manual or automatic analysis and reasoning on the mapping instances.
IL modeling is widely utilized in many data integration systems. Most mapping
repositories [93, 101, 155] simply employ a large table to store schema-level or
element-level mappings. Its columns represent different properties of mappings, such
as schema/element names, mapping expressions, and similarity values. Others use
graph models in which nodes denote schemas [16, 37] or elements [13, 96] and edges
represent mappings. In a mapping graph, a new mapping edge between two nodes
could be discovered by combining all the edges in a mapping path between these two
nodes [17, 48, 88].
Compared with IL modeling, AL modeling facilitates mapping management and
re-utilization by providing the patterns/heuristics hidden in mapping instances. Patterns
are useful to assist humans/machines in reviewing previous mappings and finding new
mappings. A study in [86] employs machine-learning techniques for finding the patterns.
It leverages a large number of the existing mapping instances to train classifiers.
Essentially, these classifiers can be viewed as a media to store the implicit patterns
for finding new mappings. However, patterns in real life are too complex to learn by
machines especially when the volume of training data is not large enough. Another
problem is that these patterns are not straightforward to users. It is difficult for human
users to view, verify and modify them. Consequently, this method is often applied to
discover approximate mappings [87], rather than exact mappings with expressions.
Other studies [26, 43, 144] rely on human efforts to find templates, which can be
viewed as explicit mapping patterns. These templates are separately hardcoded into
the systems. They simply assume that the templates are static and separate. It is
highly desirable that these templates can be semantically organized and dynamically
managed.
Our proposed semantic modeling is a type of AL modeling. Different from the above
learning-based and template-based methods, our semantics-based method
forms a domain-specific ontology by conceptualizing mapping instances. The ontology
is a container that stores a large number of templates. Users can dynamically manage
these templates, which are organized based on their semantics.
Compared with the learning-based methods, our proposed semantic modeling is
more straightforward to humans, especially non-technical users. Since user
interaction is inevitable in mapping management and discovery, ease of interaction
is of much significance, particularly for large-scale mappings. In this context, the
ontology-based model enhances human understanding. In addition, these
mapping ontologies are visualizable, and thus the operations in mapping management
can be reduced to a few simple operations on graphs.
Compared with the template-based methods, our method is abstract and dynamic.
Each template represents a group of mappings with some specific patterns. In
our model, individual templates are merged into domain ontologies based on their
meanings. The semantics-based organization facilitates the discovery of new templates
by maximally reusing the previous ones. Furthermore, these domain ontologies can be
easily re-configured through periodic updates, such as insertion and modification.
4.5.2 Schema Element Clustering for Data Integration
Clustering schema elements is not a new approach in the field of data integration.
Element clustering is the process of organizing similar elements into the same groups
and dissimilar elements into different groups. Its main applications in data integration
include schema merging, result merging and schema matching. To support schema
merging, mediated schemas are generated by abstracting a representative schema
element from each element cluster, an approach adopted in ARTEMIS [24] and by
Dragut et al. [41]. In result merging, element clustering facilitates detection of the
duplicated results returned from different data sources, as in IWIZ [111].
Most relevant to our work are the clustering algorithms in schema matching, including
Corpus-based [86], IceQ [144], and Zhao & Ram [156]. Hierarchical agglomerative
methods are employed by all the above references in constructing multiple concept
clusters. However, all these algorithms operate in a batch mode. That is, they assume
that all the schema elements are already collected before clustering starts. In
fact, this is not feasible in a dynamic context: for dynamic reconfiguration, schema
elements to be clustered arrive incrementally. Furthermore, Zhao & Ram [156]
also attempt to use K-means and SOM neural networks as element clustering methods.
Like the above methods, they do not consider user-interaction issues during
cluster construction.
Our clustering algorithm identifies concepts and relations for constructing a
mapping ontology. It is a type of representative-based method: the degree of
similarity of an element to a cluster is equal to its similarity value to the representative
of that cluster. The representatives (i.e., D-Labels) of a cluster are abstracted from the
existing nodes based on the term-frequency statistics of their D-Attributes, which are
treated as a bag of words.
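Incremental representative-based clustering can be sketched as follows. The cosine measure, the threshold value and the way the representative is refined are illustrative assumptions, not the dissertation's exact algorithm:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words representations."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def insert_element(clusters, words, threshold=0.5):
    """Assign an arriving element (as a word list) to the most similar
    cluster representative, or start a new cluster below the threshold."""
    bag = Counter(words)
    best, best_sim = None, 0.0
    for rep in clusters:
        sim = cosine(bag, rep)
        if sim > best_sim:
            best, best_sim = rep, sim
    if best is not None and best_sim >= threshold:
        best.update(bag)        # refine the representative incrementally
    else:
        clusters.append(bag)    # the element fits no existing cluster
    return clusters

clusters = []
insert_element(clusters, ["depart", "month"])
insert_element(clusters, ["depart", "month", "outbound"])  # joins cluster 1
insert_element(clusters, ["passenger", "age"])             # new cluster
print(len(clusters))  # 2
```

Because each cluster is summarized by a small bag of words, a human can inspect or correct a representative directly, which is the "human friendly" property argued below.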
Our algorithm mainly differs from the above work in several respects. 1) Inputs:
the inputs to our system are mappings rather than schemas, since our system
stores, manages and discovers mappings with their expressions. 2) Outputs: their
clustering algorithms only output concept clusters without any relations among these
clusters, whereas our construction method builds domain ontologies including concepts
and relations. 3) Algorithm: our algorithm is incremental and “human friendly”. First,
the representative object-based clustering is more straightforward to non-technical
users. The cluster representatives offer users simple and straightforward descriptions
of clusters that are easier for human understanding. The performance of our algorithm
can be more easily improved by manually changing the representatives, e.g., adding
or removing several D-Labels and D-Attributes. Second, to adapt to the evolution
of DI systems, it classifies the to-be-matched elements as soon as they arrive. This
incremental mechanism is very important for a long-running and shared mapping
repository.
CHAPTER 5
MAPPING REPOSITORY
User queries through global forms are translated to respective local queries. To
translate user queries, meta-queriers must find the element mappings from global forms
to local ones. In the context of meta-querier customization, the scale of such mappings
might be considerably large since a potentially large number of meta-queriers need to
be built to meet various user needs. Thus, it is highly desirable to design mechanisms
to manage these mappings. This chapter introduces two management mechanisms,
M-Ontology and MO-Repository, based on the aforementioned ontology-based and
change-oriented mapping modeling.
5.1 M-Ontology
M-Ontology is an ontology-based mapping repository. Mappings are the main
information sources for constructing such a domain-based mapping ontology. The
construction of the ontology is through incremental insertions of element-level mappings.
The incremental approach is critical for building a long-running, shared mapping
repository. Since both users and data sources are autonomous, source selection and
local schemas are dynamic: new mappings continue to arrive as new data sources are
added or old ones evolve, so M-Ontology requires periodic updates. It is more
effective to update the existing ontology incrementally rather than re-learn the ontology
from scratch, especially when the ontology has been manually corrected.
Like schema matching in data integration, ontology construction is also a labor-intensive,
time-consuming and error-prone problem [110, 121]. To reduce the human effort
involved in construction, M-Ontology provides a semi-automatic algorithm [76] for
Figure 5-1. The flowchart of ontology construction.
inserting mappings. Each insertion involves three main tasks: a) concept identification,
including E-Nodes, G-Nodes and A-Nodes; b) relationship identification, including Is-a
edges, Part-of edges and T-Edges; c) metadata storage, including all the metadata
of concepts and relationships. Recording the metadata is generally straightforward,
as it happens alongside the manipulation of edges and nodes throughout the life cycle
of the ontology. Hence, the remainder of this subsection focuses on how to
deal with the first two tasks when inserting a mapping (map〈EListS ,EListT ,ExpST 〉).
M-Ontology splits the semi-automatic mapping insertion into three sequential phases as
illustrated in Figure 5-1: 1) inserting the element lists, EListS and EListT ; 2) inserting the
expression ExpST ; 3) validating the insertion by humans.
Phase 1: Element Insertion. The purpose of this phase is to obtain E-Nodes,
G-Nodes, A-Nodes, Is-a edges and Part-of edges when inserting the element lists. The
two element lists (EListS and EListT, labeled ListE1 and ListE2 in Figure 5-1) are
inserted separately. Algorithm 5-1 illustrates the
whole process of inserting an element list EList into a mapping ontology MO. Besides
updating the ontology MO, the algorithm returns an abstract node cn corresponding to
EList for the next phase. (For simplicity of description, the term "abstract node" is used
hereafter for a node that may be either a G-Node or an A-Node.)
In the algorithm, each schema element e in EList is checked to see whether it
already exists in the ontology MO (lines 2-3). If so, its G-Node becomes (part of) the
target abstract node (line 4). If not, e is encapsulated into a new E-Node en whose
initial validation status is UNKNOWN (line 6). Line 7 then seeks the most suitable
G-Node gn in MO for inserting en. If none is found, en is inserted into a newly created
G-Node; otherwise, en is inserted directly into the found gn (lines 8-9). If EList contains
only a single element, gn is the abstract node cn that will be returned (lines 8, 15, 17);
otherwise, each found G-Node gn is inserted into a newly created A-Node an, in the
same order as the original schema elements in EList (lines 1, 2, 11). Line 12 then
searches MO for an A-Node anInMO with the same contents as an. If such an A-Node exists, anInMO
is the node to be returned; otherwise, an is returned (lines 13-14, 17). Before cn is
returned, the insertion is applied to MO in line 16, with the relevant metadata recorded.
Algorithm 5-1: Insert an element list into a mapping ontology.
Data: A mapping ontology MO, an element list EList
Result: MO ← MO ∪ {EList}, a returned abstract node cn corresponding to EList
 1  an = new ANode();
 2  foreach Schema Element e in EList do
 3      if e ∈ MO then
 4          gn = e.getENode().getGNode();
 5      else
 6          en = new ENode(e);
 7          gn = MO.searchGNode(en);
 8          if gn = ∅ then gn = new GNode();
 9          gn.insertENode(en);
10      if EList.getElementNum() > 1 then
11          an.insertGNode(gn);
12  anInMO = MO.searchANode(an);
13  if anInMO ≠ ∅ then cn = anInMO;
14  else if an.getNodeNum() ≠ 0 then cn = an;
15  else cn = gn;
16  MO.insertCNode(cn);
17  return cn;
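As an illustration, the control flow of Algorithm 5-1 can be rendered as a rough Python sketch. All class and method names here are hypothetical; exact label matching stands in for the semantic searchGNode discussed below, every element is wrapped as a fresh E-Node (a simplification of lines 3-6), and metadata recording is omitted:

```python
class ENode:
    """Wraps one raw schema element; validation status starts as UNKNOWN."""
    def __init__(self, element):
        self.element = element
        self.status = "UNKNOWN"

class GNode:
    """A cluster of semantically equivalent E-Nodes."""
    def __init__(self):
        self.enodes = []

class ANode:
    """An ordered aggregation of G-Nodes, used for multi-element lists."""
    def __init__(self):
        self.gnodes = []

class MappingOntology:
    """Toy M-Ontology; exact label matching stands in for semantic search."""
    def __init__(self):
        self.gnodes = []
        self.anodes = []

    def find_gnode(self, element):
        # stands in for searchGNode (line 7): here, exact element match
        for gn in self.gnodes:
            if any(en.element == element for en in gn.enodes):
                return gn
        return None

    def find_anode(self, an):
        # searchANode (line 12): equivalent iff same G-Nodes in the same order
        for existing in self.anodes:
            if existing.gnodes == an.gnodes:
                return existing
        return None

def insert_element_list(mo, elist):
    """Sketch of Algorithm 5-1: insert elist into mo, return its abstract node."""
    an = ANode()
    gn = None
    for e in elist:
        gn = mo.find_gnode(e)           # lines 3-4, 7: existing cluster?
        if gn is None:
            gn = GNode()                # line 8: no match, start a new cluster
            mo.gnodes.append(gn)
        gn.enodes.append(ENode(e))      # lines 6, 9 (simplified: always wrap)
        if len(elist) > 1:
            an.gnodes.append(gn)        # line 11: preserve the original order
    found = mo.find_anode(an)           # line 12
    if found is not None:
        cn = found                      # line 13: reuse the identical A-Node
    elif an.gnodes:
        cn = an                         # line 14: register the new A-Node
        mo.anodes.append(cn)
    else:
        cn = gn                         # line 15: single-element list
    return cn
```

Inserting the same single-element list twice returns the same G-Node, and inserting the same multi-element list twice returns the same A-Node, mirroring the deduplication behavior of lines 12-15.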
The core of Algorithm 5-1 consists of two search problems (lines 12 and 7,
respectively): MO.searchANode(an) and MO.searchGNode(en).
a) searchANode can be implemented by directly evaluating the equivalence
between an and the A-Nodes in MO. Two A-Nodes are equivalent only when they
include the same G-Nodes in the same order.
b) searchGNode is implemented by comparing the semantic similarities of all the
G-Nodes in MO with the E-Node en. The target G-Node gn is the one with the highest
similarity value simgn−en that must exceed a given threshold. If no G-Node is eligible,
searchGNode returns a null value to the caller. The similarity simgn−en between gn and
en is determined by two steps:
Step 1: Preprocessing. The D-Attributes and Instances of en are collected and then
normalized by NLP (natural language processing) techniques: tokenization, stop-word
removal and stemming. The resulting texts are treated as two bags of words, that is,
two unordered sets of words, Set^DA_en and Set^INS_en.
Step 2: Calculating. Resulting from previously verified insertions of E-Nodes, the
G-Node gn has an instance set Set^INS_gn and two word sets Set^DL_gn and
Set^DA_gn, corresponding to its D-Labels and D-Attributes, respectively. If gn and en
are constraint-compatible, their semantic similarity is calculated by

    sim_{gn-en} = w1 × sim1(Set^DL_gn, Set^DA_en) + w2 × sim2(Set^DA_gn, Set^DA_en)
                + w3 × sim3(Set^INS_gn, Set^INS_en)

where w1, w2 and w3 are weights in [0,1] such that w1 + w2 + w3 = 1. The similarity
functions sim1, sim2 and sim3 are implemented by a hybrid matcher that combines
several linguistics-based matchers, such as WordNet-synonyms distances, 3-gram
distances and Jaccard distances. Many algorithms for schema/ontology matching have
been proposed (see the surveys [46, 116]), and most can easily be employed in this
phase. As matcher design is not the focus of our research, the details are omitted
here.
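The two steps can be sketched minimally in Python. The stop-word list and the crude suffix-stripping stemmer below are illustrative stand-ins for full NLP tooling, and a single Jaccard coefficient stands in for the hybrid matcher combining WordNet, 3-gram and Jaccard distances; the weights are placeholders:

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or"}  # tiny illustrative list

def normalize(text):
    """Step 1: tokenize, drop stop words, and crudely stem by suffix stripping."""
    words = set()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        for suffix in ("ing", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        words.add(token)
    return words

def jaccard(a, b):
    """One of the linguistics-based matchers named in the text."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def sim_gn_en(gn_dl, gn_da, gn_ins, en_da, en_ins, w=(0.4, 0.4, 0.2)):
    """Step 2: weighted combination
    sim = w1*sim1(DL_gn, DA_en) + w2*sim2(DA_gn, DA_en) + w3*sim3(INS_gn, INS_en)."""
    w1, w2, w3 = w
    return (w1 * jaccard(gn_dl, en_da)
            + w2 * jaccard(gn_da, en_da)
            + w3 * jaccard(gn_ins, en_ins))
```

In searchGNode, this score would be computed for every constraint-compatible G-Node, and the highest-scoring one returned if it exceeds the threshold.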
Phase 2: Expression Insertion. The target of this phase is to identify and create
T-Edges. The inputs of phase 2 include an expression Exp, and two abstract nodes cn1
and cn2 that respectively correspond to ListE1 and ListE2. Phase 2 checks if there exists
a T-Edge with the same expression as Exp from cn1 to cn2. If it does not exist, a T-Edge
containing Exp is created with a validation status of UNKNOWN; otherwise, this phase is
finished.
Phase 3: Human Interaction. If M-Ontology is only used to store mappings, the
whole process is fully automatic. However, M-Ontology also supports reuse-oriented
mapping discovery. Since mapping discovery is known as an AI-complete problem
[54, 91], human involvement cannot be avoided in the construction of M-Ontology, during
which many new mappings are discovered. Humans are responsible for the validation
and correction of nodes and edges. To improve the performance of M-Ontology, three
intervention strategies are designed:
• Validation. To alleviate possible redundancy and error propagation, only the nodes
and edges that have been validated by humans can affect future mapping insertions.
VerifyENode and VerifyTEdge are two of the main verification functions.
a) VerifyENode (Verification of E-Nodes): After an E-Node en is verified, its two
normalized bags of words, Set^DA_en and Set^INS_en, start to affect the
corresponding G-Node gn. The D-Attributes Set^DA_gn and Instances Set^INS_gn of
gn are updated to incorporate the words occurring in the E-Node. The machine-found
D-Labels Set^DL_gn may also evolve by selecting the D-Attributes with the top-k
frequencies (e.g., k = 5). When the G-Node is updated, the related A-Nodes are also
updated if necessary.
b) VerifyTEdge (Verification of T-Edges): The verification of a new T-Edge affects
not only its originally associated mapping, but also all the indirectly connected E-Nodes
inserted in the past and future. Verifying one T-Edge thus validates and stores n × m
mappings, where n and m are the numbers of possible E-Node combinations in the two
connected abstract nodes. Even more mappings can be found from VerifyTEdge if
mapping composition [17, 48, 88] is supported.
• Avoidance. Validation does not always guarantee the correctness and
non-redundancy of the mapping ontology, especially in the context of mass
collaboration. Thus, automatic detectors are proposed based on the structural
properties of M-Ontology. Alarms are sent to human users for assistance if any of the
following cases occurs:
(i) two or more verified T-Edges are found from one abstract node to another; (ii) a
new abstract node is created for mapping insertion; (iii) the found cn1 and cn2 are not
identical if the original mapping expression is “equivalence”.
• Prevention. Even when the current contents of M-Ontology are correct and
non-redundant, human users are encouraged to participate in enriching the mapping
ontology to improve construction performance. Manually modifying the attributes of
abstract nodes (such as D-Labels and Instances) can improve the precision and recall
of the construction algorithm, since the algorithm relies on these attributes to insert
E-Nodes and expressions. In addition, creators and modifiers are encouraged to leave
comments on nodes and edges; the information stored in metadata (such as
annotations) is critical for others' understanding.
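The D-Label refresh performed by VerifyENode can be sketched as follows. The helper name is hypothetical; a Counter tracks D-Attribute word frequencies, and k = 5 mirrors the example in the text:

```python
from collections import Counter

def verify_enode(gn_da, gn_ins, en_words, en_instances, k=5):
    """VerifyENode sketch: fold a verified E-Node's normalized words and
    instances into its G-Node, then recompute the top-k D-Labels.

    gn_da is a Counter of D-Attribute word frequencies for the G-Node;
    gn_ins is the G-Node's instance set."""
    gn_da.update(en_words)           # D-Attributes accumulate word frequencies
    gn_ins.update(en_instances)      # instances are unioned
    # machine-found D-Labels: the k most frequent descriptive words
    return [word for word, _ in gn_da.most_common(k)]
```

Because only verified E-Nodes feed this update, unvalidated (UNKNOWN) insertions cannot perturb the representatives that guide future matching.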
Following the above mapping insertion algorithm, M-Ontology is constructed
and enriched as more and more mappings are inserted. The main functionalities of
M-Ontology are mapping storage, management and discovery. First, the basic mapping
retrieval can be achieved in a fully automatic way: even if the algorithm does not find
the correct abstract nodes for insertion, the inserted mappings can still be accurately
retrieved based on their URIs and related metadata. Second, the management of a
large volume of mappings becomes easier for humans: (i) the mapping ontology can be
represented as a graph, so the operations in mapping management can be transformed
into several simple graph operations; (ii) the semantic modeling makes sense of
unorganized mapping information and thus enhances human understanding; (iii) the
representative-object-based clustering algorithm is more accessible to human users,
since they can easily improve its performance by manually changing the
representatives, e.g., adding or removing several D-Labels and D-Attributes. Third,
based on the semantic modeling and construction mechanism, a reuse-oriented
algorithm for mapping discovery can be efficiently implemented as discussed in the next
chapter.
Digging a little deeper into the structure of the mapping ontology, we can see two
important considerations:
• 1) Traditional domain ontology construction aims at semantic richness [121]
that is important for performing complex reasoning. However, richer semantics
requires more human involvement in ontology construction, which contradicts one
of our original goals: the best-effort reduction of human effort during the construction
and maintenance of DI systems. Therefore, the contents that M-Ontology maintains
are restricted to a sufficiently rich level. It
only includes the mapping-related information, since it is used only for mapping
storage and reutilization. The following chapter discusses how to construct such an
ontology.
• 2) Most traditional mapping repositories [93, 101, 155] store many-to-many
mapping instances directly without organizing them based on their semantics.
In contrast, M-Ontology attempts to identify and store the hidden concepts from
instances in a straightforward way and, at the same time, to address five
types of heterogeneity: language heterogeneity, syntactic heterogeneity, instance
heterogeneity, structural heterogeneity and semantic heterogeneity. It aims at
better mapping management and reutilization not only by machines but also by
normal users. The following chapter explains how to reuse mappings in such an
ontology.
5.2 MO-Repository and M-Table
Complementary to M-Ontology, MO-Repository is a repository of mapping
evolution. It records the evolution of mappings that serve the meta-queriers in a specific
application domain. That is, it stores the corresponding mapping objects Set〈mapObj〉;
each mapping object mapObjST denotes a concrete relation between two query forms
QFS and QFT, as defined in Chapter 4.2, and consists of the different versions of that
relation. Each version corresponds to a mapping instance.
An M-Table is a logical partition of MO-Repository. It stores the mapping evolution
of a specific query form; that is, it contains only the mapping instances that connect
to that query form. Different M-Tables may share the same mapping objects. M-Tables
are introduced to discover new mappings by exploiting the evolution process of
mappings (i.e., mapping reuse), as discussed in the next chapter.
Construction of MO-Repository can be viewed as the process of incrementally
building mapping objects (i.e., recording the mapping evolution). The evolution can
be categorized into two groups: 1) traceable evolution and 2) untraceable evolution,
distinguished by the availability of the step-by-step changes to query forms. If these
changes are available, we can obtain the correspondences of schema elements
between two versions of a query form, and the mapping instances can easily be
inserted into the corresponding mapping objects. Thus, the key problem is how to
construct MO-Repository by inserting mapping instances whose step-by-step evolution
is unavailable.
Algorithm 5-2 presents an incremental algorithm for constructing MO-Repository.
The algorithm shows the whole process of inserting a mapping instance mapST into an
MO-Repository MOR. Besides updating MOR, the algorithm returns a mapping object
mapObj that contains the instance mapST .
Algorithm 5-2: Insert a mapping instance into an MO-Repository.
Data: A mapping instance map〈EListS, EListT, ExpST〉, an MO-Repository MOR.
Result: MOR ← MOR ∪ {map}, a returned mapping object mapObj that contains map.
 1  Set〈mapObjST〉 = MOR.findMOBJ(S.URI, T.URI);
 2  Set〈SmapObj〉 = synEquvMOBJ(EListS, Set〈mapObjST〉);
 3  Set〈TmapObj〉 = synEquvMOBJ(EListT, Set〈mapObjST〉);
 4  if (countMOBJ(Set〈SmapObj〉) == 1 && countMOBJ(Set〈TmapObj〉) == 1) then
 5      if equalMOBJ(Set〈SmapObj〉, Set〈TmapObj〉) then
 6          mapObj = MOR.insertEdge(map);
 7  else
 8      switch (countMOBJ(Set〈SmapObj〉) + countMOBJ(Set〈TmapObj〉)) do
 9          case 0
10              mapObj = MOR.createMOBJ(map);
11          case 1
12              mapObj = MOR.insertNodeEdge(map);
13          otherwise
14              MOR.mergeMOBJ(Set〈SmapObj〉, Set〈TmapObj〉);
15              mapObj = MOR.insertNodeEdge(map);
16  return mapObj;
As shown in Figure 5-2 (a), each mapObjST can be represented by a bipartite graph
G = 〈Set〈NS〉, Set〈NT 〉, Set〈RST/RTS 〉〉. In the bipartite graph G , a mapping instance
〈EListS ,EListT ,ExpST 〉 is represented as an edge RST directed from a node NS to another
NT . For a to-be-inserted mapping mapST , the target mapObj must connect query forms
Figure 5-2. (a) An example of a mapping object. (b) The evolution of a global query form. (c) The evolution of a local query form.
QFS and QFT . In the algorithm, the first step is to obtain a set of mapping objects
Set〈mapObjST 〉 by imposing the URI conditions on the connected query forms (line 1 in
Algorithm 5-2).
The second step is to search Set〈mapObjST 〉 to identify a target mapping object
mapObj , into which mapST is to be inserted. The relation represented by mapObj
should be semantically equivalent to that of mapST . As discussed in Section 5.1,
semantics-based approaches cannot be fully automated, which means that
MO-Repository construction might require additional involvement of community
members. To avoid such side effects, we design an automatic approach to searching for
mapObj . Our approach makes the following assumption: a mapping instance is
contained in a mapping object if and only if the element lists (EListS or EListT ) of
this instance are evolved from or syntactically equivalent to the ones of another instance
in this mapping object. Following this assumption, two subsequent issues need to be
discussed before the description of our algorithm. First, the generated mapping objects
(i.e., relations) between two query forms are independent of each other. If the mapping
objects share syntactically identical element lists, they should be merged to form
a new mapping object. The merging is feasible and reasonable in the customized
meta-queriers. Second, there might exist several generated mapping objects that
represent the same mapping object. However, this will not occur in the customized
meta-queriers, since the evolution of global query forms is traceable (as shown in Figure
5-2 (b) and (c)). Therefore, two sets Set〈SmapObj〉 and Set〈TmapObj〉 are respectively
selected from Set〈mapObjST 〉 by syntactically comparing EListS or EListT with the nodes
in the sets Set〈NS〉, Set〈NT 〉 (line 2-3 in Algorithm 5-2).
If both sets Set〈SmapObj〉 and Set〈TmapObj〉 contain exactly one mapping object
and these objects are equivalent, this object is the target one, and an edge is inserted
into it to represent the inserted mapping instance map (lines 4-6). If both sets are
empty, a new mapping object is created, with two nodes and a single edge incorporated
to represent map (lines 7-10). If the two sets together contain only one mapping object,
that object is returned after a node and an edge are created to contain map (lines
11-12). Otherwise, the mapping objects in Set〈SmapObj〉 and Set〈TmapObj〉 are
merged to form a single mapping object (lines 13-14), and line 15 inserts the
corresponding nodes and edge for map.
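The decision logic of Algorithm 5-2 can be sketched in Python. Here a plain list of mapping objects stands in for MOR, and exact equality of element lists stands in for the syntactic equivalence test synEquvMOBJ; URI filtering, edge states and versions are omitted, and all names are illustrative:

```python
class MappingObject:
    """Bipartite graph of element-list versions; edges carry expressions."""
    def __init__(self):
        self.source_nodes = []   # Set<N_S>: source element-list versions
        self.target_nodes = []   # Set<N_T>: target element-list versions
        self.edges = []          # mapping instances (elist_s, elist_t, exp)

def insert_mapping(mor, elist_s, elist_t, exp):
    """Sketch of Algorithm 5-2; mor is a plain list of MappingObject."""
    s_objs = [m for m in mor if elist_s in m.source_nodes]   # line 2
    t_objs = [m for m in mor if elist_t in m.target_nodes]   # line 3
    if len(s_objs) == 1 and len(t_objs) == 1 and s_objs[0] is t_objs[0]:
        obj = s_objs[0]                          # lines 4-6: just add an edge
    else:
        # deduplicate while preserving order
        matched = list({id(m): m for m in s_objs + t_objs}.values())
        if not matched:
            obj = MappingObject()                # lines 9-10: brand-new object
            mor.append(obj)
        elif len(matched) == 1:
            obj = matched[0]                     # lines 11-12: extend one object
        else:
            obj = matched[0]                     # lines 13-15: merge objects
            for other in matched[1:]:
                obj.source_nodes += other.source_nodes
                obj.target_nodes += other.target_nodes
                obj.edges += other.edges
                mor.remove(other)
    if elist_s not in obj.source_nodes:
        obj.source_nodes.append(elist_s)
    if elist_t not in obj.target_nodes:
        obj.target_nodes.append(elist_t)
    obj.edges.append((elist_s, elist_t, exp))
    return obj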
CHAPTER 6
REUSE-ORIENTED MAPPING DISCOVERY
The holistic integration of construction and maintenance of meta-queriers through
component-level reuse greatly eases the task of large-scale meta-querier construction.
This chapter presents a novel approach [79] to schema matching that employs both
the semantics and the evolution of mappings, since the previous versions of mappings
and the mappings of peer query forms are potentially numerous and similar.
6.1 Problem Definition
Mapping discovery is a critical operation in meta-querier construction and
maintenance. To build a new meta-querier, the mappings among local query forms are
necessary for automatic generation of a global query form, and the mappings between
global and local query forms are required for query translation at runtime. If the global
and local query forms are updated, the existing meta-querier should be reconfigured
with new mappings.
The goal of mapping discovery is to find the active mapping instances from one
query form QFS to another QFT . However, existing techniques [46, 116] are not
adequate for discovering the mappings automatically, especially for instances with
non-equivalence expressions. Thus, our proposed solution aims at decreasing the
complexity and workload of mapping discovery for community members. Our
algorithm outputs not only the active mapping instances mapST but also their previous
versions (i.e., the mapping object mapObjST ). If the mapping expressions cannot be
found, the correspondences corrST are offered to facilitate mapping discovery.
In addition, in the pursuit of wide and active participation of non-technical volunteers,
our solution is designed intentionally to be straightforward to ordinary users. That is,
it is easier for them to understand the process, validate the results, and improve the
performance.
Algorithm 6-1: Reuse-oriented mapping discovery.
Data: A source query form QFS , a target query form QFT , an M-Ontology MO, an M-Table MT .
Result: Set〈mapST , corrST , mapObjST 〉.
1  Set〈eS〉 = extract(QFS);
2  Set〈eT〉 = extract(QFT);
3  Set〈eMT, eS/T〉 = MT.matchElement2Element(Set〈eS〉, Set〈eT〉, uri(QFS), uri(QFT));
4  Set〈mapST, corrST, mapObjST〉 = MT.reuseMTable(Set〈eMT, eS/T〉);
5  Set〈cG, cA, eS/T〉 = MO.matchElement2Concept(Set〈eS〉, Set〈eT〉, Set〈eMT, eS/T〉);
6  Set〈mapST, corrST, mapObjST〉 = MO.reuseMOntology(Set〈cG, cA, eS/T〉, Set〈mapST, corrST, mapObjST〉);
7  verify(Set〈mapST, corrST, mapObjST〉);
8  return Set〈mapST, corrST, mapObjST〉;
The core of our approach is mapping reuse. Reusing mappings implies modeling
the mappings at some level of abstraction. Instead of directly reusing individual
mappings as in previous work [35, 43, 116, 144], our strategy is to reuse the conceptual
abstraction of the mappings and their evolution for better performance. M-Ontology and
M-Table are built on the ontology-based and change-oriented modeling of mappings,
respectively. With a domain-specific M-Ontology MO and a system-specific M-Table
MT , our reuse-oriented algorithm discovers Set〈mapST , corrST , mapObjST 〉 from a
query form QFS to another query form QFT , as illustrated in Algorithm 6-1. The
overall algorithm can be split into three phases: 1) mapping discovery through the
mapping table MT (lines 1-4); 2) mapping discovery through the mapping ontology MO
(lines 5-6); 3) human validation and correction of the discovered mappings (line 7). The
details are discussed below.
6.2 Discovery through M-Table
In the first phase, discovery through the MO-Repository MOR is a search process
for reusing the existing mapping instances between the same pair of query forms. To
match the query forms from QFS to QFT , we first seek two sets of one-to-one element
correspondences Set〈(eS , eMOR)〉 and Set〈(eT , eMOR)〉, where the eMOR are elements
from MOR. Through these correspondences, the eligible mapping instances mapST ,
correspondences corrST and mapping objects mapObjST are selected by reasoning on
MOR. The undetermined elements in Set〈eS〉 and Set〈eT 〉 are passed to the second
phase together with the element correspondences Set〈(eS , eMOR)〉 and
Set〈(eT , eMOR)〉. The core is the implementation of two functions:
matchElement2Element and reuseMTable.
• matchElement2Element (line 3) selects and returns a set of element
correspondences Set〈(eX , eMOR)〉 for each side (that is, X can be S or T ). The
objective of correspondence selection is to maximize the utility value

    Σ_{eX ∈ X} Σ_{eMOR ∈ MOR} R(eX , eMOR) × a(eX , eMOR),

subject to the following four constraints:

    Σ_{eX ∈ X} a(eX , eMOR) ≤ 1, for each eMOR ∈ MOR
    Σ_{eMOR ∈ MOR} a(eX , eMOR) ≤ 1, for each eX ∈ X
    a(eX , eMOR) ∈ {0, 1}, for eX ∈ X , eMOR ∈ MOR
    R(eX , eMOR) ∈ {0, 1}, for eX ∈ X , eMOR ∈ MOR
where all the elements eX come from the same query form X (i.e., QFS or QFT ), and
the elements eMOR come from the MO-Repository MOR and belong to the same query
form.
a(eX , eMOR) takes the value 1 if a correspondence between eX and eMOR is selected,
and 0 otherwise. The feasible result must guarantee that each eMOR is assigned to
at most one eX and each eX is assigned to at most one eMOR . The similarity value
R(eX , eMOR) is equal to 1 only when the similarity value between eX and eMOR is
larger than a pre-defined threshold δ; otherwise it is 0. The similarity values are
calculated by the formula

    w1 × sim_DA-DA(eX , eMOR) + w2 × sim_INS-INS(eX , eMOR)

where w1 and w2 are weights in [0,1] such that w1 + w2 = 1. The similarity values
returned by sim_DA-DA and sim_INS-INS indicate how similar elements eX and eMOR
are with respect to their Descriptive Attributes and INStances, respectively. The
element attributes and instances are first normalized by NLP (natural language
processing) techniques: tokenization, stop-word removal and stemming. The resulting
texts are then treated as bags of words, that is, unordered sets of words. To compare
the similarities of these word sets, a hybrid matcher is employed that combines several
typical linguistics-based matchers, such as WordNet-synonym distances, 3-gram
distances and Jaccard distances. Many algorithms for schema/ontology matching have
been proposed (see the surveys [46, 116]), and most can easily be employed in this
phase. As this is not the major target of this research, the details are not explained
here.
Determining these element correspondences is a generalized assignment problem,
which is NP-hard. However, a simple greedy solution can quickly attain good results in
this context. First, the similarity matrices are small and few [28]; for example, query
forms in flight-ticket searching normally contain around 10 functional components, and
each query form is compared only with the query form that has the same identifier.
Second, the similarity matrix is sparse: very few similarity values are non-zero.
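A sketch of this greedy correspondence selection: since R(eX , eMOR) is 0/1, the problem reduces to a one-to-one matching that a similarity-descending greedy pass handles well on small, sparse matrices. The function name and threshold value are illustrative:

```python
def greedy_match(sim, threshold=0.8):
    """Greedy one-to-one correspondence selection.

    sim: dict mapping (eX, eMOR) pairs to a similarity value in [0, 1].
    Returns the selected pairs; each eX and each eMOR is used at most
    once, mirroring the two assignment constraints in the text."""
    # R(eX, eMOR) = 1 only when the similarity reaches the threshold
    candidates = sorted(
        ((s, ex, em) for (ex, em), s in sim.items() if s >= threshold),
        key=lambda t: -t[0],          # most similar pairs first
    )
    used_x, used_m, selected = set(), set(), []
    for s, ex, em in candidates:
        if ex not in used_x and em not in used_m:
            selected.append((ex, em))
            used_x.add(ex)
            used_m.add(em)
    return selected
```

On a sparse 0/1 similarity structure, this greedy pass and the exact optimum rarely differ, which is why the NP-hardness of the general problem is not a practical obstacle here.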
• reuseMTable (line 4) determines the mapping instances, correspondences and
mapping objects (i.e., Set〈mapST , corrST , mapObjST 〉) through the identified element
correspondences. 1) A mapping object is selected only if its bipartite graph contains at
least one fully hooked node in Set〈NS〉 or Set〈NT 〉, that is, a node all of whose
elements appear in the element correspondences Set〈(eS , eMOR)〉 or Set〈(eT , eMOR)〉.
For example, in Fig. 4-4(f), if all the elements in N1S are fully hooked, the mapping
object is included in the return set. 2) corrST is determined based on a property of
MO-Repository: the elements in every pair of nodes, one from Set〈NS〉 and one from
Set〈NT 〉, constitute a correspondence. For example, in Fig. 4-4(f), the elements in N1S
and N2T constitute a correspondence. Thus, if both node sets Set〈NS〉 and Set〈NT 〉
have at least one fully hooked node, the elements eS/T linked to these node pairs
constitute new correspondences. 3) A mapping instance is discovered by examining
each identified correspondence: i) whether an edge connects the nodes in the
corresponding bipartite graph; ii) whether the mapping state of this edge is Detached or
Usable; iii) whether the edge direction is QFS→QFT . If all three conditions hold, the
correspondence with this edge can be regarded as a mapping instance mapST . Finally,
Set〈mapST , corrST , mapObjST 〉 is found by reusing the previous mappings in MOR.
6.3 Discovery through M-Ontology
The second phase is to discover the mappings through exploiting the conceptual
abstraction from the existing mappings in the same domain. Our approach is based
on a feature of M-Ontology: when an E-Node that encapsulates a schema element
is classified into an existing G-Node, all the mappings associated with this G-Node
will automatically be assigned to this E-Node. That is, all the E-Nodes in the same
G-Node share their mapping information, since they have identical semantics in the
same format. A-Nodes are selected if all their constituent G-Nodes are already
identified.
The mapping information stored in T-Edges can be reused if these edges connect the
selected A/G-Nodes.
To determine the mappings from QFS to QFT , all the schema elements in Set〈eS〉
and Set〈eT 〉 are first classified into the appropriate G/A-Nodes in M-Ontology MO.
Based on classification results, we can discover new mappings by reusing the mappings
associated with the corresponding G/A-Nodes. The implementation is through two major
functions: matchElement2Concept and reuseMOntology.
• matchElement2Concept (line 5) classifies each undetermined element eS/T in
Set〈eS〉 and Set〈eT 〉 into appropriate G/A-Nodes from the M-Ontology MO. The
resulting G/A-Nodes constitute two concept sets ASetS and ASetT . A-Nodes are
selected once all their constituent G-Nodes are identified, so the major focus is how to
find an appropriate G-Node gn for a given element eS/T . Five strategies [80] can be
used to measure the suitability of a G-Node gn for eS/T :
• 1) MIN: This strategy returns the minimum similarity value between eS/T and
all E-Nodes associated with gn. It is a pessimistic expectation, since the match with
gn is more likely to fail if any one of its E-Nodes is less similar to eS/T .

    Similarity_E2GN(eS/T , gn) = MIN_{eni ∈ gn} { Similarity_E2EN(eS/T , eni) }
• 2) MAX: This strategy is an optimistic forecast. It returns the maximum similarity
value between eS/T and all E-Nodes in gn: as long as there exists an E-Node highly
similar to eS/T , gn is also considered very similar.

    Similarity_E2GN(eS/T , gn) = MAX_{eni ∈ gn} { Similarity_E2EN(eS/T , eni) }
• 3) AVG: This strategy returns the average over all similarity values between eS/T
and the E-Nodes in gn. It is a compromise between the two strategies above.

    Similarity_E2GN(eS/T , gn) = AVG_{eni ∈ gn} { Similarity_E2EN(eS/T , eni) }
All three strategies above rely on the similarity between eS/T and eni . This
similarity function Similarity_E2EN(eS/T , eni) is defined as follows, where the
comparisons are based on the properties/attributes of the E-Nodes:

    Similarity_E2EN(eS/T , eni) = Aggregation_{propi ∈ eS/T , propj ∈ eni} ( Sim_Prop2Prop(propi , propj) )
• 4) HISTORY: This strategy uses the similarity values between eS/T and its
previous version eH . It assumes that query-form evolution is not a comprehensive
change: at least some components remain identical, especially in design, functionality
and representation. Normally, this strategy is employed as a complementary solution
to improve the performance of the other strategies. The core question is how to find
and select the previous version eH . In our solution, the selection of eH depends on the
utility maximization of the function matchElement2Element (discussed in Section 6.2).
• 5) ABSTRACT: This strategy returns the semantic similarity between eS/T and
the representative object ro of gn. ro is represented by a tuple 〈DA,DL, INS , IT 〉,
where DA is a bag of descriptive words that can be generated through normalizing
the descriptive attributes of all human-verified E-Nodes in gn using the aforementioned
NLP techniques; DL is a set of descriptive labels that consist of the terms with the
top-k term frequency from DA; INS and IT respectively represent the instances
and their constraint types obtained from the verified E-Nodes the G-Node contains. In a
sense, the element classification problem is converted into a one-to-one semantic
matching problem between eS/T and ro. The implementation of semantic matching is
the same as the algorithms mentioned for searchGNode (Section 5.1):

    Similarity_E2GN(eS/T , gn) = Similarity_E2RO(eS/T , ro)
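The MIN, MAX and AVG strategies above can be sketched as a single aggregation function over the per-E-Node similarity values (the helper name is hypothetical; the per-pair similarities would come from the Similarity_E2EN comparison described above):

```python
def classify_similarity(e_sims, strategy="AVG"):
    """Aggregate the similarities between an element e and the E-Nodes of a
    G-Node, following the MIN / MAX / AVG strategies from the text.

    e_sims: list of per-E-Node values Similarity_E2EN(e, en_i)."""
    if not e_sims:
        return 0.0                  # empty G-Node: nothing to match against
    if strategy == "MIN":           # pessimistic: any dissimilar E-Node vetoes
        return min(e_sims)
    if strategy == "MAX":           # optimistic: one similar E-Node suffices
        return max(e_sims)
    if strategy == "AVG":           # compromise between the two extremes
        return sum(e_sims) / len(e_sims)
    raise ValueError(f"unknown strategy: {strategy}")
```

The element would then be classified into the G-Node with the highest aggregated score, provided it exceeds a threshold, exactly as in searchGNode.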
To make the algorithm easier to understand and control, we design a hybrid
solution that combines the last two strategies. First, the results Set〈eMT , eS/T 〉 of
matchElement2Element (line 3) are reused to determine eH : for a specific pair
〈eMT , eS/T 〉, the corresponding eMT is the target eH . For elements eS/T without
previous versions, the representative-based strategy is employed to discover the
corresponding concept nodes. Non-technical users can easily understand the whole
process and be involved in validating the results of element classification. The
outcomes can also be improved through manual changes to representative objects,
and these operations can be implemented as straightforward graph operations using
emerging HCI technologies (e.g., [30, 98, 107, 108]).
• reuseMOntology (line 6) discovers Set〈mapST , corrST 〉 (i.e., the mapping instances
and correspondences) from M-Ontology. First, we find two G/A-Node sets RSetS and
CSetS whose nodes are reachable from the nodes in the concept set ASetS . RSetS
consists of the nodes that can be reached through a single T-Edge from any node in
ASetS . CSetS is the node union of maximal connected subgraphs of the nodes in ASetS
(T-Edge direction is ignored in this calculation). Second, the desired concept-level
mappings are the overlap ASetS ∩ ASetT and RSetS ∩ ASetT . The first overlap denotes
the semantics-equivalence mappings between two concepts. The second one indicates
the concept mappings whose expressions are in the connected T-Edges. We can also
deduce the target concept correspondences from the overlap CSetS ∩ ASetT , although
M-Ontology does not have a T-Edge to link them together. Finally, we can easily discover
Set〈mapST , corrST 〉 through the element classification (from matchElement2Concept) with
these concept mappings and correspondences.
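Under the assumption that T-Edges are stored as directed node pairs, the overlap computation above can be sketched as follows (all function and variable names are illustrative, not the dissertation's implementation):

```python
# Sketch of the reuseMOntology overlap computation: RSetS via single T-Edges,
# CSetS via maximal connected subgraphs (T-Edge direction ignored).
from collections import deque

def t_edge_neighbors(t_edges, node):
    """Nodes reachable via a single T-Edge from `node` (directed)."""
    return {dst for src, dst in t_edges if src == node}

def connected_component(t_edges, start):
    """Maximal connected subgraph containing `start`, ignoring edge direction."""
    undirected = {}
    for src, dst in t_edges:
        undirected.setdefault(src, set()).add(dst)
        undirected.setdefault(dst, set()).add(src)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in undirected.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reuse_m_ontology(aset_s, aset_t, t_edges):
    rset_s = set().union(*(t_edge_neighbors(t_edges, n) for n in aset_s))
    cset_s = set().union(*(connected_component(t_edges, n) for n in aset_s))
    equiv = aset_s & aset_t   # semantics-equivalence mappings
    mapped = rset_s & aset_t  # mappings whose expressions are in T-Edges
    corr = cset_s & aset_t    # correspondences without a direct T-Edge
    return equiv, mapped, corr
```

The three returned sets correspond to the overlaps ASetS ∩ ASetT, RSetS ∩ ASetT and CSetS ∩ ASetT described above.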
6.4 Validating & Correcting Mappings
The final phase is to verify and correct the machine-discovered mappings by human
beings. Supporting meta-querier customization is likely to result in a large number
of mappings to be discovered. It is impractical for a small set of domain experts and
system designers to validate and correct these mappings. We propose the following
strategies to decrease their workloads:
Mass collaboration. Although the to-be-discovered mappings might be numerous,
meta-querier customization will also attract more users. We believe these users not
only have enough motivation to be involved in the validation, but also are willing to
volunteer to assist others. In our design, these users along with a small number of
domain experts can practically form a collaborative domain-specific community for
meta-querier construction. We divide the error-prone procedure into three subsequent
procedures with different difficulty levels. The tasks in the easiest level include two
verification jobs: verifying if the classified schema elements are assigned to the correct
G-Nodes and checking if the discovered T-Edges are appropriate to represent the
relations among the schema elements. Normally, these basic tasks can be assigned to
novices. In the second difficulty level, the tasks are determining unfound mappings and
searching M-Ontology to seek correct G-Nodes, if they exist, for unclassified elements.
The experienced users are able to handle these jobs. At the most difficult level of tasks,
M-Ontology needs to be updated (e.g., creating a new G-Node) for supporting new
mapping instances. The maintenance of M-Ontology should be one of the major jobs of
the domain experts.
Human friendliness. For any community-based system to be successful, it
is absolutely necessary that the system be easy to control. In our proposed
discovery algorithms, we can see the following special considerations in the design.
First, in the function matchElement2Element, correspondence selection is viewed
as a utility maximization problem on similarity matrix R(eX , eMOR). Thus, the results
can be improved through manual or automatic changes on R(eX , eMOR). For example,
verification on mappings might trigger changes on the corresponding values. Second,
in the function matchElement2Concept, manual modification on the representative
objects of concept nodes (such as D-Labels and Instances) can improve the precision
and recall. Third, graph-oriented interface is an essential requirement for systems with
a large community of novice users. Thus, both mapping objects and mapping ontology
can be represented by graphs. The related operations can be implemented with more
straightforward graph operations using emerging HCI technologies (e.g., [30, 98]). Last
but not least, even if the current contents of M-Ontology are correct and non-redundant,
domain experts are also encouraged to be involved in the enrichment of M-Ontology to
improve its performance. Comments on nodes/edges are welcomed from the
creators/modifiers. The information stored in metadata (such as annotations) is critical for
others' understanding.
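In the function matchElement2Element described above, correspondence selection is a utility maximization over the similarity matrix R. A minimal pure-Python sketch (the function name, matrix values and the 0.5 threshold are illustrative assumptions; a production system would use the Hungarian algorithm instead of brute force):

```python
from itertools import permutations

def select_correspondences(R, threshold=0.5):
    """Brute-force utility maximization over a square similarity matrix R.
    Returns the one-to-one pairing with the largest total similarity,
    keeping only pairs whose similarity exceeds `threshold`."""
    n = len(R)
    best, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(R[i][perm[i]] for i in range(n))
        if score > best_score:
            best, best_score = perm, score
    return [(i, best[i]) for i in range(n) if R[i][best[i]] > threshold]
```

Manual verification of a mapping can then simply raise or lower an entry of R and re-run the selection, which is the feedback loop the text describes.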
6.5 Related Work
Although schema matching has been widely researched for thirty years, most of the
current solutions (as in surveys [46, 116]) only consider one-to-one mappings. In reality,
many-to-many mappings are pervasive in real-world applications. Current solutions to
complex mappings can be classified into three categories as follows:
Learning-based approaches learn mapping templates from the existing mapping
instances. iMAP [35], also called COMAP [38], is proposed to solve several types of
1-to-n complex mappings by comparing the content or statistical properties of data
instances from two schemas. However, the number of its hardcoded rules limits the
types of complex mappings that can be found. HSM [135] and DCM [56] employ
correlation-mining techniques to find the grouping relations (i.e., co-occurrence) among
elements, but their solutions do not take into account how these elements correspond to
each other (i.e., mapping expressions).
Template-based approaches search for the most appropriate mapping templates
in a template set. The main drawback is their limited capability of finding complex
mappings: the templates are separate and static, and their number is
normally very limited. IceQ [144] integrates two human-specified rules into their systems
for partially finding Part-of and Is-a type mappings (i.e., two common one-to-many
mappings). QuickMig [43], as an extension of COMA [9], summarizes ten common
mapping templates, some of which are complex mappings. Several possible combinations
of templates are also presented. It also designs an instance-level matcher for finding
splitting or concatenation relationships.
Ontology-based approaches employ external ontologies for mapping discovery.
Two schemas to be matched are first matched to a single ontology separately, and
then element-level mappings can be found by deducing from the intermediary ontology.
However, its performance is largely affected by the coverage and modeling of ontologies.
For example, the ontology defined in SCROL [117] does not consider the syntactic
heterogeneity (defined in Chapter 4.3) caused by various semantics-equivalent
representations. In addition, the existing ontology-based solutions [117, 148] simply
assume that the shared ontologies are already built, but building such well-organized
ontologies is as hard as finding complex mappings. Ontology construction is also a
labor-intensive, time-consuming and error-prone problem [36, 110, 121].
Our proposed mapping discovery algorithm is a hybrid solution. First, the evolution
history of a specific mapping can be viewed as a template/rule. In a sense, the
discovery through MO-Repository can be viewed as finding rules to apply. Second,
discovery through M-Ontology is a typical ontology-based approach. We provide an
integrated solution to three indispensable subproblems: ontology design, ontology
construction and mapping discovery. The design of M-Ontology addresses five kinds of
heterogeneity: language heterogeneity, syntactic heterogeneity, instance heterogeneity,
structural heterogeneity and semantic heterogeneity. Third, ontology construction
employs learning-based approaches, i.e. incremental representative-based clustering.
Different from the classical ontology construction [36][110], the proposed ontology
is generated from schemas and mappings, which are abundant in our proposed
MQ-Customizer. As more schemas and mappings are inserted into the ontology,
more mappings can be correctly discovered from the ontology.
6.6 Conclusion
Current research proves that mapping discovery cannot be fully automated in
the foreseeable future, especially for those many-to-many mappings with expressions.
We believe that reuse of existing mappings is one of the best (or maybe the only)
solutions. In our proposed solution, reusing the similar mappings in the same domain
can avoid the repetitive tasks, when a large number of customized meta-queriers need
to be built and maintained. To enhance reuse potential from the existing mappings, we
introduce the ontology-based and change-oriented mapping models. Based on these
two models, a user-friendly discovery algorithm is provided. In our solution, discovery
of many-to-many mappings is converted into the discovery of one-to-one correspondences
from their evolution history and a domain-based mapping ontology. The proposed
algorithms for finding correspondences are straightforward to non-technical users. Our
experimental results on real-world data sets confirm the feasibility and effectiveness of
our reuse-oriented solution.
CHAPTER 7
ONTOLOGY-CENTRIC SOURCE SELECTION
Our work [78] investigates how to exploit the query capabilities [27, 74, 149]
of semi-structured sources for achieving more accurate selection. We propose an
ontology-centric approach to source selection. A domain-based ontology (M-Ontology)
was designed for meta-querier customization. With the assistance of the concepts and
relations in M-Ontology, user demands and source capabilities are modeled as concept
sets, and identified through query-form annotation, matched by an additive utility
function. The experiments on real-world data (in Chapter 8.3) illustrate the potential of
this ontology-centric method.
7.1 Related Work
Source selection refers to the process that deals with the selection of a list of
relevance-ranked data sources for meta-queriers. To distinguish our work from the
current solutions to source selection, we discuss the solutions to source selection based
on the following two aspects:
• Source types: unstructured and (semi-)structured data sources.
In the fields of distributed information retrieval systems, prior research on
source selection mainly focuses on the selection of sources containing unstructured data
(a.k.a., texts). To acquire the contents of data sources, randomly generated queries are
sent to obtain the sample texts. Through these fetched samples, the data sources can
be represented as a single big document (such as in CORI [21], CVV [150] and KL [147])
or a set of big documents (such as in ReDDE [131], CRCS [129] and SUSHI [138]).
With the rapid growth of the deep Web, (semi-)structured information sources
have been experiencing a remarkable increase. Normally, the query interfaces of
structured information sources are more complex than those of text sources. First, the
query-based sampling becomes impractical since it is difficult to retrieve the sample
documents through randomly generated queries. The automatically generated queries
usually cannot satisfy the hidden constraints on the inputs to the query forms. Second,
the topics of sources can be directly acquired through semantic analysis of the
query forms. At the same time, more complex query forms often carry more semantics.
Several studies [57, 84, 134] have been conducted on cross-domain source
selection. They cluster structured Web sources based on the query-form similarity
so that each cluster corresponds to a single domain. However, the domain-oriented
clustering is a coarse classification without considering the capability distinction.
• Selection types (selection-per-query vs. selection-per-engine).
The personalization of meta-queriers can be achieved through the source selection.
The selection can be made when users issue a query (called selection-per-query) or
when the meta-queriers are constructed (called selection-per-engine).
In some domains, the query forms are considerably simple. For example, most
news search engines only include a single keyword box and a click button. Thus, it is
not difficult to construct a global query form that integrates all the forms in the same
domain. In this context, the data sources can be selected based on the user inputs to
the global interface (i.e., selection-per-query).
However, in the other domains, it is not practical to build such a single interface
(or even a few) to encompass all the functionalities provided by the query interfaces in
the same domain. For instance, in the air-ticket booking domain, we can observe that
various meta-queriers provide different query interfaces with different functionalities.
Most differences are caused by the selection of sources in their construction (i.e.,
selection-per-engine).
Our work proposes a selection-per-engine strategy in the customization of
meta-queriers. Users are allowed to select their preferred data sources. To better
reuse the pre-integrated data sources, we provide a capability-based source selection
algorithm to recommend to users their potentially desired data sources. Complementary
to the query-based solutions, our approach is based on the query capabilities of data
sources, instead of the sampled source contents. Unlike the prior work on query-form
clustering, our approach is able to distinguish the query capabilities of the data sources
whose query forms have been clustered in the same domain.
7.2 Capability-based Recommendation
Capability-based source selection is a typical content-based recommendation
problem [4]. In our capability-based recommendation, users can declare their
capability needs by inputting the domains DMI and their preferred data sources
DSI . Based on the user needs, our system recommends users a ranked list of the
pre-integrated sources DSO . Such a capability-based recommendation problem can be
simplified to a utility maximization problem. More formally, let u be a utility function that
quantifies the desirability of a meta-querier mq or a data source ds for a specific user
need, i.e., u : DMI × DSI × DSO → R, where R denotes the nonnegative real numbers. To
best match the user needs, the recommendation system should maximize the utility by
selecting the meta-queriers mq and the data sources ds. That is, ∀dmI ∈ DMI, ∀dsI ∈
DSI, ds* = argmax_{ds ∈ DSO} u(dmI, dsI, ds). The effectiveness of such a solution relies on
the correct understanding of the user demands and data sources. Section 7.2.1 first
explains our ontology-centric modeling for source capabilities and the user preferences,
and then Section 7.2.2 presents the corresponding source-selection algorithms.
7.2.1 Modeling
In the context of meta-querier customization, query capabilities refer to the abstract
abilities of sources to retrieve information. The information of these sources can be
accessed through their associated query forms. Generally, the contents of query forms
decide the possible queries that can be posed by users, and thus they also dictate the
capabilities of the meta-queriers and data sources. Understanding the query forms
is a cornerstone of the capability-based recommendation. Therefore, we propose an
ontology-based approach to capture their capabilities by analyzing and understanding
the query forms.
As described in Chapter 4, each query form can be regarded as a set of query
conditions [62, 151], which we call schema elements. An individual element or an ordered
element list can represent a specific query capability [27, 74, 149] that a resource
possesses. However, the information carried in individual components is very limited.
It does not contain the context or knowledge concerning the textual description.
Methodologies of ontology are commonly used to address such representation
heterogeneity. Ontologies store well-defined concepts and relations including context
knowledge. If query forms are properly annotated using concepts in the same ontology,
machines can understand their semantics. As described in Chapter 5, M-Ontology is
designed for the storage, management and discovery of mappings that are employed to
transform the user inputs (for queries) among query forms. In the following, we briefly
introduce the nodes/edges and their correspondence with query capabilities.
(1) E-Nodes, Is-a Edges and G-Nodes. Each E-Node encapsulates a schema
element in a specific query form. Through generalization of a set of E-Nodes with the
same semantics and instance formats, a new concept node (i.e., a G-Node) is formulated,
with an Is-a Edge linking each inclusive E-Node to it. In essence, each G-Node corresponds to an abstract query capability.
For describing the semantics of such a query capability, a representative object ro
is automatically abstracted from the associated attributes of the inclusive schema
elements.
(2) A-Nodes and Part-of Edges. Each A-Node is formed through aggregating a
list of G-Nodes. A Part-of Edge is used to represent such an aggregation relation by
linking the A-Node to its inclusive G-Nodes. The A-Node also denotes an abstract query
capability, whose semantics can be approximated by a list of representative objects
List_ro.
(3) T-Edges. A T-Edge corresponds to a transformation relation between two
concept nodes (i.e., G/A-Nodes) which can be used to fetch similar contents. The
transformation relation contains a specific rule to convert the instances of the connected
concept nodes. Since their instances are convertible, all these connected nodes
represent similar query capabilities. In a sense, M-Ontology is a domain-specific query
capability repository, where each connected G/A-Node sub-graph generally corresponds
to an abstracted query capability.
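As a rough illustration of this node/edge vocabulary, the following data model sketches E-Nodes, G-Nodes, A-Nodes and T-Edges; the field names are assumptions, not the dissertation's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ENode:                  # one schema element of one query form
    form_id: str
    label: str

@dataclass
class GNode:                  # generalization of same-semantics E-Nodes
    rep_labels: List[str]     # representative object: descriptive labels
    members: List[ENode] = field(default_factory=list)  # Is-a edges

@dataclass
class ANode:                  # aggregation of a list of G-Nodes
    parts: List[GNode] = field(default_factory=list)    # Part-of edges

@dataclass
class TEdge:                  # transformation rule between two concept nodes
    source: object            # a G-Node or A-Node
    target: object
    rule: str                 # e.g. a splitting/concatenation expression
```

Under this encoding, a connected G/A-Node subgraph (linked by T-Edges) is exactly the "abstracted query capability" described above.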
Capability modeling: By using the proposed M-Ontology, we model the capabilities of
data sources based on their own query forms. Each data source has its own query form,
whose inclusive schema elements can be clustered into a set of G-Nodes in M-Ontology.
Let M-Ontology include a set of G-Nodes denoted by GN = {N1, N2, ..., Nn},
numbered from 1 to n by creation time. Since each G-Node denotes an
abstract capability in a specific domain, we use a subset of GN (called Capability Set
CS) to represent the capabilities that a data source is able to provide. That is, CS can be
denoted by a G-Node set {Ni |Ni ∈ GN}.
To obtain such a capability set, we need to seek correct G-Nodes to annotate the
schema elements in the corresponding query form. More precisely, the process of
capability capture can be viewed as schema annotation. In the context of meta-querier
customization, it is straightforward to capture the capabilities of the pre-integrated data
sources from their query forms. M-Ontology functions as a mapping repository, and thus
the mappings linked to these query forms should have been inserted into M-Ontology.
That means, all the schema elements have been clustered into the corresponding
G-Nodes. These G-Nodes are the components of the corresponding capability sets. The
detailed algorithms are presented in Section 7.2.2.
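Assuming the clustering of elements into G-Nodes is available as a lookup table, capability capture for a pre-integrated source reduces to a simple set construction (names are illustrative):

```python
def capability_set(query_form_elements, element_to_gnode):
    """Capability set CS of a data source: the G-Nodes its schema elements
    were clustered into, as recorded in M-Ontology."""
    return {element_to_gnode[e]
            for e in query_form_elements
            if e in element_to_gnode}
```

Elements missing from the lookup correspond to the unclassified case handled later by Annotation Verification.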
Preference modeling: The widely used preference model is based on keywords.
However, in the real-world scenarios, a few keywords are often unable to represent
the exact semantics. For the purpose of accurately understanding user demands, our
solution relies on the mapping-generated M-Ontology. In our solution, a specific user
need is modeled by a vector PV , called a preference vector. Given that M-Ontology
contains G-Nodes {N1, N2, ..., Nn}, the vector PV has n corresponding entries. The i-th
entry of vector PV, denoted by PV(i), is a preference value that indicates how much the user
prefers this capability in the target.
To construct such a preference vector, our solution provides two types of customization
mechanisms: i) Needs-oriented. Users are allowed to specify their explicit demands
through inputting the preferred data sources DSI and the corresponding application
domain DMI . Based on the inputs, our system identifies the potential user-preferred
entries in the PV . ii) Feature-oriented. Users can modify the preference values of the
identified entries. The implementation of these two mechanisms is explained in the next
section.
7.2.2 Demand Capture and Matching
This section presents the procedure of demand capture and matching. As illustrated
in Figure 7-1, the whole procedure consists of six phases. Based on user selection
(i.e., DMI ) from a list of data domains that exist in the system, Ontology Selection is
to choose an appropriate M-Ontology MO for understanding the user-preferred data
sources (i.e., DSI ). Only for the data sources that have not been integrated into any
meta-querier, Q-Form Normalization is invoked to unify their query form representation.
By analyzing the normalized query forms, Q-Form Analysis can output the capability
set CS for each data source. From the analyzed query forms, Demand Identification
constructs a preference vector PV . Demand Matching generates a ranked list of
resources for user selection. After user selection, if necessary, Annotation Verification
verifies the correctness of analysis on query forms of un-integrated data sources. The
details are described as follows.
• Ontology Selection is to choose an appropriate M-Ontology MO. In different query
forms, there might exist many schema elements with highly similar representations
whose semantics are nevertheless completely different due to their contexts. For
example, the word "keywords" has various meanings in different domains: in job
seeking, it represents job titles or fields, while in music search it might denote a
song name. Thus, our designed M-Ontology is domain-specific.
This assumption is reasonable for the customization of meta-queriers, which combine
the data sources in the same domain.
• Query-form Normalization is to unify the representation of the query form for each
data source in DSI that is not pre-integrated into our system. The data sources can be
categorized as two types, pre-integrated and new data sources. Since the pre-integrated
data sources have been analyzed, the results can be directly reused without repetitive
normalization. This phase only processes the query forms qformI that are not included
in M-Ontology. For each query form, its inclusive schema elements are associated
with some descriptive attributes and a set of potential instances (if they exist). We
first perform natural language processing techniques on the descriptive attributes and
instances in the following order: tokenization, stop-word removal and stemming. Then,
we treat the normalized texts as two unordered bags of words, Set〈w_i^DA〉 and
Set〈w_i^INS〉. These two bags of words constitute a word-bag pair, which can indicate
the semantic characteristics of this element. Finally, each query form corresponds to a
set of word-bag pairs, Set〈(Set〈w_i^DA〉, Set〈w_i^INS〉)〉.
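A minimal sketch of this normalization pipeline; the stop-word list and toy suffix stemmer are placeholders for the Porter/Krovetz stemmers used in our implementation (Chapter 8):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "for"}

def toy_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def word_bag(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # tokenization
    return {toy_stem(t) for t in tokens if t not in STOP_WORDS}

def normalize_element(descriptive_attrs, instances):
    """Return the word-bag pair (Set<w_DA>, Set<w_INS>) for one schema element."""
    return (word_bag(" ".join(descriptive_attrs)),
            word_bag(" ".join(instances)))
```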
• Query-form Analysis is to understand the query forms by analyzing the semantics of
each inclusive schema element e. Our solution is to annotate each e by semantics-equivalent
concepts (i.e., G-Nodes from the M-Ontology MO). The involved G-Nodes constitute the
capability set CS of the corresponding data source. G-Node searching can be divided
Figure 7-1. The ontology-centric source selection algorithm.
into two separate types: a) For the pre-integrated data sources, their query forms have
been included in the M-Ontology, and thus each schema element of these forms should
have been clustered to a certain G-Node. That is, such an association can be directly
reused. b) For the new data sources, each schema element in their query forms is
normalized to two word bags Set〈w_i^DA〉 and Set〈w_i^INS〉. As described in Chapter 5.1, the
suitability of a G-Node gn can be measured by the constraint equivalence and semantic
similarity between e and the corresponding representative object ro of gn. The semantic
similarity can be calculated as follows.
λ1 Σ_{i=0}^{n1} Σ_{j=0}^{m1} sim(w_i^DA, t_j^DA) + λ2 Σ_{i=0}^{n2} Σ_{j=0}^{m2} sim(w_i^DA, t_j^DL) + λ3 Σ_{i=0}^{n3} Σ_{j=0}^{m3} sim(w_i^INS, t_j^INS)
where λ_i is a scale factor, the two word sets Set〈w_i^DA〉 and Set〈w_i^INS〉 are the
semantic characteristics of e, and the tuple 〈Set〈t_i^DA〉, Set〈t_i^DL〉, Set〈t_i^INS〉〉 is the
representative object ro. The function sim determines the semantic similarity between
two terms (respectively from e and ro). Our implementation relies on the
WordNet-synonyms distance, a linguistic-based matcher.
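The weighted score above can be sketched as follows, with a toy exact-match `sim` standing in for the WordNet-synonyms matcher; the weights are illustrative assumptions:

```python
def toy_sim(w, t):
    """Placeholder term similarity: 1.0 on exact match, else 0.0."""
    return 1.0 if w == t else 0.0

def element_gnode_similarity(e_da, e_ins, ro, weights=(0.4, 0.3, 0.3), sim=toy_sim):
    """e_da/e_ins: word bags of the element e;
    ro: the representative object (Set<t_DA>, Set<t_DL>, Set<t_INS>)."""
    l1, l2, l3 = weights
    t_da, t_dl, t_ins = ro
    return (l1 * sum(sim(w, t) for w in e_da for t in t_da)
            + l2 * sum(sim(w, t) for w in e_da for t in t_dl)
            + l3 * sum(sim(w, t) for w in e_ins for t in t_ins))
```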
• Demand Identification constructs a preference vector PV based on the selected data
sources and user interaction. The whole procedure is composed of two steps:
1) Automatic discovery: Each G-Node corresponds to an entry PV (i) whose value
indicates the preference degree for a specific capability. In M-Ontology, G-Nodes
that are connected via one or more T-Edges and Part-of Edges (with edge directions
ignored) constitute a maximal connected G-Node sub-graph. Such a sub-graph
corresponds to an abstracted query capability, and within it the preference values
of the G-Nodes are correlated. Since the conversion through T-Edges and Part-of
Edges might lose semantics, the distance between two G-Nodes i and j indicates
their dissimilarity. Here, the distance refers to the minimum hop number from one
node to another, and the value 1/(distance(i, j) + 1) is used to represent the potential
difference between their capabilities. Assume that GSetM is a multiset containing the
G-Nodes that encapsulate the schema elements of the data sources in DSI; it might
contain duplicates.
The occurrence number of a G-Node in GSetM is equal to the appearance frequency
of its encapsulated schema elements in DSI. Preference vectors can be decided by two
modes: combination and accumulation.
• The combination mode is preferred when users want to find the data sources
containing all the capabilities (with the same desirability) that are supported by
the user-inputted sources DSI . The values can be obtained through the following
procedure: First, all the entries whose G-Nodes are in GSetM are set to one, i.e.,
PV (i) = 1 if i ∈ GSetM ; otherwise, the values are initialized to zero. Second, for the
G-Nodes that are not in GSetM , their values are calculated based on the distance
to the nearest node in GSetM . The procedure can be represented as eq. (7-1).
PV(i) = 1, if i ∈ GSetM; PV(i) = 1 / min_{j ∈ GSetM} {distance(i, j) + 1}, if i ∉ GSetM. (7–1)
• The accumulation mode assumes that the most critical capabilities desired by the
user are the ones that appear most frequently in the user-inputted sources. For the
entries whose G-Nodes are in GSetM, the values of PV(i) are equal to the
multiplicity (i.e., occurrence number) of their corresponding G-Nodes in GSetM.
For the other entries, their values are accumulated based on the distances to the
nodes in GSetM and their preference values, as shown in eq. (7-2).
PV(i) = multiplicity(i), if i ∈ GSetM; PV(i) = Σ_{j ∈ GSetM} PV(j) / (distance(i, j) + 1), if i ∉ GSetM. (7–2)
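Both modes can be sketched with a breadth-first-search distance helper; the graph encoding (an undirected adjacency dict over G-Nodes) and all names are assumptions:

```python
from collections import deque

def distances_from(graph, sources):
    """Minimum hop count from any node in `sources` (undirected adjacency dict)."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def preference_vector(graph, gset_m, mode="combination"):
    """gset_m: multiset (list) of G-Nodes drawn from the user-selected sources."""
    members = set(gset_m)
    pv = {}
    if mode == "combination":                       # eq. (7-1)
        dist = distances_from(graph, members)
        for node in graph:
            pv[node] = 1.0 if node in members else (
                1.0 / (dist[node] + 1) if node in dist else 0.0)
    else:                                           # eq. (7-2), accumulation
        for node in graph:
            if node in members:
                pv[node] = float(gset_m.count(node))        # multiplicity
            else:
                pv[node] = sum(
                    gset_m.count(j) / (d + 1)
                    for j in members
                    for d in [distances_from(graph, {j}).get(node)]
                    if d is not None)
    return pv
```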
2) Manual correction (optional): Since the automatic decision of preference values
might not accurately reflect the user demands, optional manual correction is needed to
correct the values by the users themselves. However, it is often inappropriate or
impractical to present the whole preference vector, especially when the vector is very
long. In our design, the users are able to modify the values that are not equal to zero.
This is the so-called feature-oriented mechanism.
• Demand Matching: This phase generates a ranked list of data sources that best
match the identified demands. The list is ranked by comparing the numeric values
of the utility function u, which quantifies the desirability of a data source for a specific
user need. More precisely, in the application domain DMI, a resource R1 ∈ DMI is
preferred to a resource R2 ∈ DMI if and only if the expected utility of R1 is at least
that of R2: ∀R1, R2 : R1 ≽ R2 ⇔ u(R1) ≥ u(R2). Such a rational preference relation
≽ is transitive, reflexive and complete.
The preference ranking is a typical multi-criteria decision making problem. In
meta-querier customization, each criterion corresponds to a query capability identified
in preference vectors (i.e., a G-Node). To make the ranking outcomes manageable
by users, we assume additive independence exists among the maximal connected
subgraphs, which is a normal assumption [67]. The utility of a resource R can be
approximated by using an additive value function that breaks one n-criteria function
into n individual one-criterion functions. Such an approximation not only simplifies
the automatic adjustment and manual correction, but also performs well, even if the
assumption does not strictly hold [120]. We construct an additive utility function u to
aggregate the utility cu(R[Ni ]) of each individual capability Ni provided by the resource
R. The utility cu(R[Ni ]) is 1 when the capability set CS of R contains Ni ; otherwise it
is zero. The additive weight of Ni is decided by its preference value PV(i), which is
generated from user inputs. For the nodes in any maximal connected subgraph sg,
the sum of their weighted utility values should be less than or equal to δ: δ is 1 if the
combination mode is used, and δ = Σ_{Ni ∈ V(sg) ∩ GSetM} PV(i) if the accumulation
mode is used.
Let MCSG be the set of maximal connected subgraphs; the weighted utility function can
be represented as follows:

u(R[N1, N2, ..., Nn]) = Σ_{i=0}^{n} (PV(i) × cu(R[Ni])) (7–3)

subject to the constraint: ∀sg ∈ MCSG, Σ_{Ni ∈ V(sg)} (PV(i) × cu(R[Ni])) ≤ δ.
• Annotation Verification (after run-time): This phase verifies whether the new
data sources are annotated with the correct G-Nodes. For those schema elements
that cannot match with any concept node, the manual annotation is invoked to update
and maintain M-Ontology (e.g., by inserting new mappings among the related query
forms). As an optional phase, the manual verification can be conducted after the source
selection. Although the new data sources might not be supported immediately, the
verified annotation can be reused for the future recommendation.
7.3 Conclusion
Capability-based customization is desired to meet various user needs. In our
customization strategy, we design a capability-based solution to source selection
based on user inputs. Modeling of user demands is still an open problem in the
selection of semi-structured sources, especially for the capability-based selection.
Without adequately accurate understanding of source capabilities, the selection in the
previous research [57, 84, 134] is normally coarse-grained and unable to distinguish the
functionality differences among the sources. Our solution is based on M-Ontology, which is
viewed as a repository of capabilities. Both the user demands and source capabilities
are modeled by using the concepts in M-Ontology. A semi-automatic solution to
demand elicitation is also proposed through semantic annotation on the query forms
of user-preferred data sources.
In the recommendation results, data sources are ranked by their matching values
between needs and sources. Each matching value quantifies the desirability of a
specific data source for a particular user need. It is calculated through measuring the
similarity of the users’ preference vector with the capability sets of the sources. The
value calculation is treated as a multi-criteria decision making problem. Each criterion
corresponds to the desirability of a specific capability. An additive utility function is
proposed to combine all the criteria through comparison of the preference vector with
the capability set.
CHAPTER 8
IMPLEMENTATION AND EXPERIMENTS
8.1 System Structure of Mapping Repositories
Mapping repositories are built on top of an open-source system “Alignment Server”
[47]. This system provides some basic services, like mapping storage and ontology
matching. It employs a big table to store mappings, each of which is treated as a tuple
in the table. We extend it to support our proposed mapping repositories. As shown in
Figure 8-1, the mapping repositories can be employed by different applications such as
mapping retrieval, schema matching and source selection. The repositories are roughly
divided into four levels: mapping level, element level, structure level and system level.
Mapping level is responsible for management of mapping instances and the
associated metadata. These instances can be retrieved, inserted, edited or deleted
by community members. The query forms connected by the instances are stored in a
separate repository, called the schema repository. Each form is represented as an OWL file.
The file name can be used to identify and retrieve the corresponding query form for
mapping management.
Element level implements the concept of E-Nodes and offers the basic operations
on E-Nodes.

[Figure: the layered repository architecture. Applications (schema matching, source selection, mapping retrieval) access the mapping repositories through the Repository API; the repositories comprise the system level (views, operations), structure level (M-Ontology, MO-Repository), element level (E-Nodes) and mapping level (mappings), backed by a relational database.]

Figure 8-1. The Repository Structure.

Schema elements in mapping instances are extracted individually and
normalized by NLP (natural language processing) techniques: tokenization, stop-word
removal and stemming. Different algorithms can be selected for different application
domains. For example, we support two types of stemmer: Porter Stemmer [112] and
Krovetz Stemmer [69]. At this level, no relations between E-Nodes are considered; all the
E-Nodes are operated on separately.
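A minimal sketch of this element-level normalization pipeline follows. The stop-word list and the suffix rules here are illustrative stand-ins, not the Porter or Krovetz algorithms cited above.

```python
import re

# Illustrative stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"the", "of", "a", "an", "in", "to", "and", "or"}

def tokenize(label):
    """Split a form-field label such as 'DepartureDate' or 'departure_date'."""
    # Break camelCase boundaries, then split on non-alphanumeric characters.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]

def stem(token):
    """Crude suffix stripping standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(label):
    """Tokenization, stop-word removal, then stemming, as in the text."""
    return [stem(t) for t in tokenize(label) if t not in STOP_WORDS]

print(normalize("Date of Departure"))  # → ['date', 'departure']
```

Normalized token lists like these become the comparable form of E-Node labels.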
Structure level implements the mechanisms of MO-Repository and M-Ontology.
Mapping objects, G-Nodes, A-Nodes and T-Edges are defined and implemented in this
level. The operations (e.g., creation, merging, modification, insertion and deletion) on
these nodes and edges are conducted in this level, while the associated metadata are
also recorded.
System level offers the support of customized meta-querier construction and
maintenance. First, it offers a mechanism to define customized views to adapt
meta-querier customization. Second, it provides various syntactic difference and
similarity calculations among E-Nodes, G-Nodes and A-Nodes through a combination of
third-party software components, e.g., the calculation of 3-gram distances, WordNet-synonym
distances and Jaccard distances.
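For instance, the 3-gram and Jaccard distances named above can be computed as in the following sketch; the actual third-party components used by the system may normalize strings differently.

```python
def ngrams(s, n=3):
    # Character n-grams of a lower-cased label.
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 0))}

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B| over two sets (tokens, grams, instances, ...).
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def trigram_distance(s1, s2):
    # 3-gram distance between two labels as a Jaccard distance on gram sets.
    return jaccard_distance(ngrams(s1), ngrams(s2))
```

Identical labels yield a distance of 0, completely dissimilar labels a distance of 1.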
8.2 Experiments for Ontology Construction
To evaluate the feasibility and effectiveness of M-Ontology, we conduct experiments
to simulate ontology construction. Mapping diversity and repetition depend on user
selection of data sources, the composition of local schemas and the degree and
frequency of their changes. Thus, it is hard to simulate the complete processes and
performances of real applications. The following experiments focus on the effectiveness
of two functions (i.e., SearchGNode and VerifyENode) that are the main factors
determining the performance of the proposed algorithms in ontology construction and
mapping discovery. More specifically, two questions are addressed in the experiments.
i) Feasibility: in real-world meta-queriers, how high is the percentage of schema
elements that have corresponding concepts stored in M-Ontology during the incremental
construction process? ii) Effectiveness: how well do our algorithms perform in
searching out suitable G-Nodes for a schema (i.e., the function SearchGNode) after the
insertion results of a set of schemas have been verified (i.e., the function VerifyENode)?
Performance Metrics. We use four metrics to evaluate the performance of the
algorithms: Hit-rate_e, Precision_e, Recall_e and Fmeasure_e. Hit-rate_e is designed to measure
the feasibility of M-Ontology. Precision and Recall are widely used in evaluating the
effectiveness of information retrieval systems. Precision_e measures the degree of
correctness of G-Node searching. Recall_e measures the degree of completeness of
G-Node searching. By combining Precision_e and Recall_e, Fmeasure_e [37] examines
the overall effectiveness of the proposed algorithm. In the setting of M-Ontology, these
metrics can be defined for a specific value m as:

• Hit-rate_e = MO_e / NUM_e

• Precision_e = Crt_e / Total_e

• Recall_e = Crt_e / MO_e

• Fmeasure_e = 2 × (Precision_e × Recall_e) / (Precision_e + Recall_e)

where NUM_e is the total element number; MO_e is the number of elements that
can be correctly identified with the G-Nodes; Crt_e is the number of elements that are
matched to the correct G-Nodes; Total_e is the number of elements that are matched to
a specific G-Node. Fmeasure_e is the weighted harmonic mean of Precision_e and Recall_e.
Note that all the above element numbers are from the schema to be inserted/matched.
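Under these definitions the metrics reduce to a few ratios. A sketch, assuming the four counts have already been tallied for the inserted/matched schema:

```python
def element_metrics(num_e, mo_e, crt_e, total_e):
    # num_e: total elements in the schema; mo_e: elements whose correct
    # G-Node exists in M-Ontology; crt_e: elements matched to the correct
    # G-Node; total_e: elements matched to some G-Node.
    hit_rate = mo_e / num_e
    precision = crt_e / total_e
    recall = crt_e / mo_e
    fmeasure = 2 * precision * recall / (precision + recall)
    return hit_rate, precision, recall, fmeasure
```

For example, a 10-element schema with 9 concepts present in M-Ontology, 10 elements matched and 8 matched correctly gives Hit-rate 0.9, Precision 0.8 and Recall 8/9.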
Data Sets. To examine the performance of M-Ontology in real-world data integration
problems, we collect 141 query-form URLs from the UIUC web integration repository [2]
after removing the inactive websites. All these query-form URLs are from three domains:
Books (47), Movies (49) and Music Records (45), where the value in parentheses
denotes the number of URLs in the domain. The information in the query forms is manually
extracted from the HTML source code to generate the corresponding Web schemas.
Domain    GN no.   Rare GN pct.   Rare Schema pct.
Movies    30       43.3%          16.3%
Books     35       49.0%          12.8%
Records   23       56.5%          15.6%
Table 8-1. Statistics of the domains
These schemas are represented in OWL to accommodate the mapping format of
Alignment Server.
Given these schemas in OWL, we identify the concepts (i.e. G-Nodes) by manually
clustering the schema elements for each domain. As illustrated in Table 8-1, 30
G-Nodes are generated from the data set “Movies”. Among them, 43.3% of the G-Nodes
have only a single schema element, and these schema elements come from 16.3%
of the schemas. That is, the “rare” concepts occur in only a few schemas. This indicates that most
to-be-classified schema elements can find their corresponding concepts in the other
schemas of the same domain. The other two domains show the same patterns.
Experiment Scenario. Assume that m Web schemas from the same domain are
already inserted into M-Ontology based on our proposed algorithms. The insertions
of these schemas have already been corrected and verified by human beings, so that
the attributes (e.g., Set_DA^gn) of the corresponding G-Nodes are updated to combine the
verified schema elements. We attempt to find the correct G-Nodes for a single schema X
from M-Ontology. To estimate the performance, we design two sets of experiments: 1)
X is not in M-Ontology (illustrated in Figure 8-2); 2) X might be in M-Ontology (illustrated
in Figure 8-3).
Each experimental result shows eleven values of m ranging from 1 to 45. The
performance measures for each m are calculated by the average of 200 samples, each
of which is automatically generated from the schema sets in that domain. Each sample
includes m different schemas that exist in M-Ontology and the schema X. All these
schemas are randomly selected.
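The sample generation just described can be sketched as follows; the function and parameter names are hypothetical, chosen for illustration.

```python
import random

def make_samples(schemas, m, n_samples=200, exclude_x=True, seed=0):
    """Draw experiment samples: m schemas for M-Ontology plus a test schema X.

    With exclude_x=True, X is never among the m inserted schemas
    (Experiment 1); otherwise X may coincide with an inserted schema
    (Experiment 2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        if exclude_x:
            chosen = rng.sample(schemas, m + 1)
            inserted, x = chosen[:m], chosen[m]
        else:
            inserted = rng.sample(schemas, m)
            x = rng.choice(schemas)
        samples.append((inserted, x))
    return samples
```

Averaging a metric over the 200 returned samples then yields one point on each performance curve.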
Experiment 1: Without Schema Repetition. In this set of experiments, the schema X
is randomly selected from the schemas that are not stored in M-Ontology. That is, X is
different from any schema in M-Ontology for the purpose of removing the influence of
possible repetitiveness. As shown in Figure 8-2, the same trends can be observed in all
three domains.
In Figure 8-2(a), when m reaches 15, at least 90% of the elements in X can find the
correct G-Nodes, so these elements can directly reuse the associated mappings.
The importance of existing schemas for the Hit-rate is clearly indicated, as a sharp
increase of Hit-rate can be seen when m is smaller than 15. The remaining curve seems
to indicate that additional schemas (> 15) barely affect the Hit-rate, since
90% of the concepts are already collected from the first 15 schemas. However, this is not
always true, since more mapping edges might be added as more schemas are inserted.
[Figure: four panels — (a) Hit-Rate, (b) Precision, (c) Recall, (d) Fmeasure — plotted against the number of schemas included in M-Ontology (0 to 45) for the books, movies and records domains.]

Figure 8-2. Experiment results of schema element classification without schema repetition.
The experimental results also match the conclusion by a related study [28] showing
that the aggregated vocabularies used to describe schema elements are “clustering in
localities and converging in size”.
In Figure 8-2(b), the values of Precision stay around 90% after the initial learning
process (m > 10). Although more G-Nodes become available for classification as m
increases (Figure 8-2(a)), our proposed searching algorithm can still classify most
schema elements into the correct concepts.
In Figure 8-2(c), a sharp increase of Recall can be seen before m reaches 10,
followed by a steady and slow improvement as m increases. This phenomenon
indicates that the D-Labels of G-Nodes cannot be correctly identified when the included
schema elements are very few. As m increases, the contents of the D-Labels become
steady, but the Instances and D-Attributes can accumulate more useful information from the
newly verified schema insertions. Thus, Recall increases slowly and steadily after m is
larger than 10.
In Figure 8-2(d), the values of Fmeasure are higher than 85% when m > 15. As an
overall performance measure, Fmeasure values indicate that the algorithms are effective
in the classification of schema elements. Its real-world performance should improve
further if we also employ the other two human intervention strategies, i.e., avoidance
and prevention, discussed in the end of Section 5.1.
Experiment 2: With Schema Repetition. When building a large number of meta-queriers,
it is highly possible that X has already been inserted into M-Ontology since the ontology
is shared by all the meta-queriers in the same domain. One typical example is given
in the motivating scenario of Section 4.3.1. Thus, we also conducted the second set of
experiments in which X is randomly selected from the whole schema set and it might be
identical to one of the existing schemas in M-Ontology.
Intuitively, the results of Experiment 2 should be better than those of Experiment 1,
and the results clearly confirm this expectation. Figure 8-3(a) reveals that the Hit-rate
reaches 90% when m is around 10 and almost achieves the highest value (i.e., 100%)
when m is around 45. Figure 8-3(d) shows that the Fmeasure values are above 90% after
m is greater than 20, and also reach the highest value. In a real implementation, the
algorithms should perform better than the experiment results in Figure 8-3, since most
users will choose the data sources with more functionalities (i.e., complex query forms)
and better data quality.

[Figure: four panels — (a) Hit-Rate, (b) Precision, (c) Recall, (d) Fmeasure — plotted against the number of schemas included in M-Ontology (0 to 45) for the books, movies and records domains.]

Figure 8-3. Experiment results of schema element classification with schema repetition.
From the above two sets of experiments, we see that schema elements in the
same domain can be clustered to generate a relatively “small” mapping ontology. That
indicates it is feasible to construct a mapping ontology as a knowledge base shared
by all the meta-queriers in the same domain. The performance results in terms of
Fmeasure demonstrate the effectiveness of the algorithms even without considering the
avoidance and prevention human-intervention strategies. Furthermore, as the number of
data sources on the Web has been steadily increasing [20, 28, 87], better performance
of M-Ontology can be expected when more mappings are included.
8.3 Experiments for Mapping Discovery
We evaluate our reuse-oriented mapping discovery by conducting several sets
of experiments on real-world query forms. The goal of the experiments is to examine
the feasibility and effectiveness of our approach. Specifically, the experiments are
designed to answer the following three research questions in the context of meta-querier
customization: 1) Is it practical to find the target mappings/correspondences in
M-Ontology? 2) Is our algorithm effective in identifying the mappings from M-Ontology?
3) Is our algorithm able to improve the effectiveness of mapping discovery with the
introduction of mapping evolution?
Data sets: We first collected 38 query-form URLs from the air-ticket booking data set in
the UIUC web integration repository [2] after removing the inactive forms in May 2009.
These query forms are manually extracted from their HTML code and represented
in the Web Ontology Language (OWL). These OWL files form the first data set AIRO. After 22
months, we revisited the web pages containing these query forms, 23 of which had been
syntactically changed; 23 corresponding OWL files are generated for them. The resulting
61 query forms constitute the second data set AIR. From these forms, 730 schema
elements are manually classified based on their semantics. Each schema element is
classified to a single G-Node and a single G/A-Subgraph. These classification results
are used in the performance evaluation of the following four sets of experiments.
Experiment setup: In the following experiments, we assume that m Web query forms
are already contained in M-Ontology/MO-Repository. For M-Ontology, the representative
objects are automatically generated from the verified schema elements, without manual
correction on the D-Attributes, D-Labels and Instances. We also assume that T-Edges
have been created to connect the semantically equivalent concept nodes. In the following
experiment results, each value is obtained by the average of 100 samples. To ensure
measurement fairness and accuracy, each sample is randomly generated from the
data sets without any duplicate. Four measures are used in each experiment: Hit-rate
measures the proportion of the expected concepts/mappings that exist in a mapping
repository (i.e., M-Ontology or MO-Repository). Precision measures the proportion
of the correctly identified concepts/mappings over the total identified ones. Recall
measures the proportion of the correctly identified concepts/mappings over the total
correct and identifiable ones. Fmeasure is the weighted harmonic mean of precision
and recall.
Experiment 1 evaluates the performance of mapping discovery by using only M-Ontology.
Only the abstract-based element classification is used without the history-based
strategy. The experiment is to match two query forms (from the data set AIR) that are
not in M-Ontology. As illustrated by red dotted lines in Fig. 8-4, our ontology-based
approach to mapping discovery looks promising in both feasibility and effectiveness.
(The blue solid lines show the performance of abstract-based element classification; the
comparison and discussion are provided in Experiment 2.)

[Figure: four panels — (a) Hit-Rate, (b) Precision, (c) Recall, (d) Fmeasure — plotted against the number of schemas included in M-Ontology (0 to 60).]

Figure 8-4. Experiment results of mapping discovery through M-Ontology.

First, Fig. 8-4(a) shows that
Hit-Rate reaches almost the highest value 1.0 when m is above 10. That indicates that
most of the mappings can be found in M-Ontology when at least 10 query forms are fully
integrated in meta-queriers. That is, the reuse of mappings is practical. Second, in Fig.
8-4(b), we observe that Fmeasure is above 0.8 if m is larger than 5 and above 0.9 when m
is larger than 35. As expected, the effectiveness of our mapping discovery algorithm
improves as M-Ontology is enriched. When there is enough information
in M-Ontology, most of the mappings can be effectively identified by our algorithm. The
following two experiments will present more observations and analyses.
Experiment 2 evaluates the performance of abstract-based element classification,
which is a core operation in mapping discovery through M-Ontology. This operation
is to classify the elements in a specific query form (from the data set AIR) to suitable
concept nodes based on their similarity with the representative objects of these concept
nodes. The corresponding performance is shown by the blue solid lines in Fig. 8-4. As
illustrated, similar trends can be observed for both element classification and
mapping discovery. Discovery of correct mappings requires exact classification of the
elements in both query forms. Thus, ideally, element classification should perform
noticeably better than mapping discovery with respect to Hit-Rate and Precision, and
worse with respect to Recall. However, the difference in Recall values is relatively minor
compared with the corresponding Precision values, especially when m is smaller than 10
or larger than 35. Based on our analysis, the major reason is that some concept nodes
appear in only a few (though more than one) schemas. The schema elements that are classified
to these concept nodes often do not correspond to any element in the query form
that is to be matched. The total number of target mappings is lower than the average
number of schema elements in our data sets. Thus, the Recall difference is not large. The
existence of such concept nodes can be identified in Fig. 8-4(a). When m is above
10, almost all the mappings can be found, but more than 10% of the schema elements
still cannot be classified to the concept nodes.

[Figure: four panels — (a) Hit-Rate, (b) Precision, (c) Recall, (d) Fmeasure — plotted against the number of schemas included in M-Ontology (0 to 60).]

Figure 8-5. Experiment results of concept searching for schema elements.

In our experiments, we do not consider
the mismatches caused by T-Edge incompleteness and errors. The performance of
mapping discovery might be worse than the numbers shown in Experiment 1, but we
expect that the performance can be improved through manual enrichment of M-Ontology.
For example, humans can manually correct the auto-generated information in the
representative objects (e.g., representative labels).
Experiment 3 examines the effects of query-form evolution on the discovery through
M-Ontology without history-based element classification. The data set used in
Experiments 1 and 2 is AIR, which consists of the original query forms and the
evolved ones. To identify the possible influence of inclusion of the evolved forms,
we conduct another set of experiments for element classification using the data set
AIRO (without the evolved forms). As illustrated in Fig. 8-5, the blue solid lines and
the green dashed lines represent the performance values of abstract-based element
classification respectively using AIR and AIRO. All these four charts indicate that these
curve lines almost coincide. Although similarities in terms of design, naming and
functionality can be observed between the original query forms and the evolved ones,
our abstract-based classification ignores these similarities and does not gain any benefit
from them.

[Figure: two panels — (a) Precision and (b) Recall — plotted against the number of schemas included in the mapping repositories (0 to 60).]

Figure 8-6. Experiment results of mapping discovery through M-Ontology and MO-Repository.
Experiment 4 evaluates the performance improvement from using MO-Repository.
Due to the limits of the real-world data set (AIR), there do not exist enough evolved
mappings to obtain a complete evaluation. We can still conduct a set of experiments
to show the benefits gained from the function matchElement2Element, which returns
a set of element correspondences from MO-Repository. These correspondences are
used in the history-based element classification. As illustrated in Fig. 8-6, the red dotted
lines and black solid lines represent the performance values of mapping discovery
respectively using only M-Ontology and both repositories. They share the same
experiment samples. We see a significant increase in Recall and almost no change
in Precision when fewer than 25 query forms are integrated into the mapping repositories.
That means more correct mappings are found without loss of precision. If the number
of integrated query forms is larger than 25, the Recall increase appears modest and
even negligible. That is, most mappings can be discovered using M-Ontology (without
MO-Repository), when sufficient mappings have been inserted.
From the above four sets of experiments, we observe several important properties
of our reuse-oriented mapping discoverer. The data set AIR from real-world query forms
shows almost all the mappings can be found in the mapping repositories, when there
exist sufficient T-Edges to connect the G/A-Nodes. Our proposed hybrid solution is
capable of effectively discovering most of the mappings, as demonstrated by the promising
results for Precision, Recall and Fmeasure. We expect that better performance values can be
obtained through manual correction of the mapping repositories.
8.4 Experiments for Source Selection
To show the effectiveness of our approach in real-world scenarios, we design and
conduct two sets of experiments in the domain of air-ticket booking. Since the quality of
ranking is subjective, it is hard to measure its correctness. Given that our primary goal is
to find suitable data sources from a source repository SR to satisfy user demands (their
preferred data sources DSI ) on capabilities, the focus of our experiments is to evaluate
whether our approach can correctly identify capability matches between user inputs
DSI and the data sources in SR. Specifically, the experiments are designed to evaluate
the effectiveness of capability matching, which is the most critical factor affecting the
recommendation performance.
Experiment setup: From the UIUC web integration repository [2], we collect 38 query
forms for air-ticket booking after eliminating the inactive webpages. First, we manually
extract all the query forms from the webpages. They are expressed in Web Ontology
Language (OWL) and follow a query-form ontology that was designed based on
the HTML specification [1]. Second, we manually classify the schema elements of
these forms to generate 54 maximal connected G/A-Node sub-graphs, based on their
capabilities. Each schema element is associated with a G-Node and a G/A-Subgraph.
The manual classification is utilized in the initial construction of M-Ontology and the final
evaluation of our algorithm.
Experiment scenarios: Assume that users input three preferred data sources DSI , n of
which are not in the repository SR. Our experiments will examine how well the algorithm
can correctly find the data sources with the desired capabilities (that are possessed by
the sources in DSI ) from SR.
The first experiment is for our proposed solution (referred to as MOM). Except for the
n excluded data sources in DSI, all the remaining sources are used to construct
a domain-based M-Ontology. We assume users can correctly choose an appropriate
domain DMI for their queries. That is, an appropriate M-Ontology MO is chosen. With
assistance of MO, the query forms in DSI are normalized and analyzed to identify the
user demands.
The second experiment evaluates the performance of a reference solution that
is a classical nearest neighbor method (referred to as NNM). To find the capability
correspondences, it compares the query forms in DSI with the form of each source
in SR. For the performance comparison, we use the same algorithms of query-form
normalization and similarity calculation in both NNM and MOM.
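The NNM baseline can be sketched as a direct pairwise comparison; here a hypothetical set-based similarity stands in for the shared normalization and similarity code.

```python
def jaccard_similarity(a, b):
    # Similarity of two normalized element-label sets.
    return len(a & b) / len(a | b) if a or b else 1.0

def nearest_neighbor_match(user_form, source_forms):
    # NNM: compare the user-preferred query form with the form of each
    # source in SR individually and return the most similar source.
    best = max(source_forms,
               key=lambda s: jaccard_similarity(user_form, source_forms[s]))
    return best, jaccard_similarity(user_form, source_forms[best])
```

MOM differs in that it compares each form against the domain's M-Ontology rather than against every individual source form, which is why it is less sensitive to the name ambiguity of any single form.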
To evaluate the performance, we use the following three measures: precision
measures the proportion of the identified capabilities that are actually desired by users;
recall measures the proportion of the desired and identified capabilities out of all the
desired and identifiable capabilities; Fmeasure is the weighted harmonic mean of
precision and recall.
Experiment results: The performance values per n in Tables 8-2 and 8-3 are calculated
as the average of 100 samples. All the samples are randomly generated. The first and
second set of experiments share the same samples.
Unintegrated sources (n)   precision   recall   Fmeasure
3 of 3                     90.1%       86.6%    88.3%
2 of 3                     92.8%       90.6%    91.7%
1 of 3                     96.5%       95.8%    96.1%
Table 8-2. Capability-based matching by MOM
The first set of experiment results, in Table 8-2, shows promising evidence of the
effectiveness of MOM. Even if none of the data sources in the user inputs (DSI) has been
integrated by any existing meta-querier, the Fmeasure still reaches 88%. If only
one data source is unintegrated, the capabilities of almost all the data sources can be
correctly and completely identified. This indicates a high probability that our solution can
make an accurate recommendation.
Unintegrated sources (n)   precision   recall   Fmeasure
3 of 3                     76.0%       35.4%    48.3%
2 of 3                     83.9%       56.8%    67.7%
1 of 3                     92.0%       78.5%    84.7%
Table 8-3. Capability-based matching by NNM
The second set of experiment results are shown in Table 8-3. Clearly, our method
MOM outperforms the reference method NNM by a large margin, especially when
the user-preferred sources have not been integrated. The major reason is the name
ambiguity in HTML code, which makes it difficult to find the capability similarity between two
individual data sources. Unlike NNM, our method MOM performs better
by utilizing regular patterns that are learned from the integrated data sources.
From these two experiment sets, we observe that our approach MOM performs
better than NNM in all the measures, in all cases. MOM is able
to correctly identify capability matches between user inputs and the data sources in the
repository. Through the capability matches, we believe our approach can effectively
provide appropriate recommendation for users in terms of the capabilities.
CHAPTER 9
CONCLUSION AND FUTURE DIRECTIONS
Due to the explosive growth of data on the Internet, one of the major challenges
for future computing is how to build effective infrastructures to facilitate user-friendly
information access and knowledge discovery from the ever-increasing number of
searchable databases over the Web. Answering this challenge is the vision of this
research. We tackle the challenge through the design of meta-queriers that address the
issues of interoperability and scalability in accessing the hidden web.
Meta-queriers are a general information-access tool for any application with
multiple heterogeneous data sources. Although in this dissertation we
use e-commerce examples such as e-travel and book sales, the applications can
be extended to any domain with similar characteristics, for example, e-library,
e-job, e-newspaper and e-science (such as bioinformatics and physics with
data-intensive sharing). The shared M-Ontology is easily exportable. We believe that early
bootstrapping of the shared knowledge base is a key to the snowballing success of a
community-based system. Another significant impact of the project lies in the fact that
the mapping repositories developed can be used for dealing with general interoperability
problems between heterogeneous data sources as well.
The following interesting directions are suggested for future research topics:
• Schema generation is to construct a global schema by integrating a set of local
schemas. The output schema should satisfy three important principles [12]: 1)
Completeness: the output schema encompasses all the local schema elements.
2) Minimality: each group of overlapping elements has minimal representative
elements in the output schema. 3) Understandability: the output schemas are
easy to understand by users. The potential approach to schema generation is
different from the traditional approaches [12, 14, 18, 92, 114] in a number of
aspects. In the context of community-based customization, there exist a large
number of similar global schemas that can be shared and reused. The knowledge
of schema generation is cooperatively contributed by the community members
and the workload of schema generation is distributed among community members
as well, and both are achieved in a user-friendly manner. Moreover, the issue of
understandability can also be addressed through the composition and layout of the
output schema.
• Source selection is to intelligently select meta-queriers and data sources in
the domain to meet specific user needs. Our solution is ontology-centric. To
reduce the user interaction, ontology/domain selection should be automated by
calculating the similarity between user inputs and ontology contents. Additionally,
the resource preferences of a specific user can also be learned from the behaviors
and preference vectors of self and peers through classical recommendation
algorithms [4, 139]: 1) content filtering algorithms [10, 104–106] that recommend
the user resources similar to those that the user liked in the past; 2) collaborative
filtering algorithms [61, 81, 124, 137] that recommend the user resources that the
other users with similar capability vectors preferred.
• A communication platform is absolutely necessary in collaborative data
integration. Conventional communication methods (e.g., emails and discussion
forums) and emerging wiki systems are widely used as information-sharing tools.
However, these “informal” methods often lead to inefficiency and inconsistency
due to the ambiguity of unstructured natural-language representation, especially in
a long-running platform where people may join and leave over time. Often a small
ambiguity by one could cause a large adverse effect on others. It is highly
desirable to augment these information-sharing methods with a facility that
enhances the clarity of communication in a collaboration platform. In addition to
supporting meta-querier construction, our M-Ontology can be extended with a strong
emphasis on decreasing the degree of informality in the collaboration platform. In
M-Ontology, the shared components (i.e., mappings and schema elements) are
organized based on their semantics and changing history to form a semantic
space (i.e., domain-specific ontologies) and a version space (i.e., versioning trees),
respectively. When building/maintaining a specific meta-querier, the member
can easily track the previous work and find the similar/repetitive tasks with better
understanding of the construction process.
REFERENCES
[1] “HTML 4.01 Specification: Forms.” http://www.w3.org/TR/html4/interact/forms.html, 1999.
[2] “The UIUC Web Integration Repository.” http://metaquerier.cs.uiuc.edu/repository, 2003.
[3] Aboulnaga, Ashraf and Gebaly, Kareem El. “µBE: User Guided Source Selection and Schema Mediation for Internet Scale Data Integration.” ICDE. 2007, 186–195.
[4] Adomavicius, Gediminas and Tuzhilin, Alexander. “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions.” IEEE Trans. Knowl. Data Eng. 17 (2005).6: 734–749.
[5] Anand, Sarabjot S., Kearney, Patricia, and Shapcott, Mary. “Generating semantically enriched user profiles for Web personalization.” ACM Trans. Internet Techn. 7 (2007).4.
[6] Arasu, Arvind and Garcia-Molina, Hector. “Extracting Structured Data from Web Pages.” SIGMOD Conference. 2003, 337–348.
[7] Arens, Yigal, Chee, Chin Y., Hsu, Chun-Nan, and Knoblock, Craig A. “Retrieving and Integrating Data from Multiple Information Sources.” Int. J. Cooperative Inf. Syst. 2 (1993).2: 127–158.
[8] Aslam, Javed A. and Montague, Mark H. “Models for Metasearch.” SIGIR. 2001, 275–284.
[9] Aumueller, David, Do, Hong-Hai, Massmann, Sabine, and Rahm, Erhard. “Schema and ontology matching with COMA++.” SIGMOD Conference. 2005, 906–908.
[10] Balabanovic, Marko and Shoham, Yoav. “Content-Based, Collaborative Recommendation.” Commun. ACM 40 (1997).3: 66–72.
[11] Basili, Victor R. “Viewing Maintenance as Reuse-Oriented Software Development.” IEEE Software 7 (1990).1: 19–25.
[12] Batini, Carlo, Lenzerini, Maurizio, and Navathe, Shamkant B. “A Comparative Analysis of Methodologies for Database Schema Integration.” ACM Comput. Surv. 18 (1986).4: 323–364.
[13] Bergamaschi, Sonia, Castano, Silvana, and Vincini, Maurizio. “Semantic Integration of Semistructured and Structured Data Sources.” SIGMOD Record 28 (1999).1: 54–59.
[14] Bergamaschi, Sonia, Castano, Silvana, Vincini, Maurizio, and Beneventano, Domenico. “Semantic integration of heterogeneous information sources.” Data Knowl. Eng. 36 (2001).3: 215–249.
[15] Bergman, Michael K. “The Deep Web: Surfacing Hidden Value.” The Journal of Electronic Publishing 7 (2001).1.
[16] Bernstein, Philip A. “Applying Model Management to Classical Meta Data Problems.” CIDR. 2003.
[17] Bernstein, Philip A., Green, Todd J., Melnik, Sergey, and Nash, Alan. “Implementing mapping composition.” VLDB J. 17 (2008).2: 333–353.
[18] Buneman, Peter, Davidson, Susan B., and Kosky, Anthony. “Theoretical Aspects of Schema Merging.” EDBT. 1992, 152–167.
[19] Buyukkokten, Orkut, Kaljuvee, Oliver, Garcia-Molina, Hector, Paepcke, Andreas, and Winograd, Terry. “Efficient web browsing on handheld devices using page and form summarization.” ACM Trans. Inf. Syst. 20 (2002).1: 82–115.
[20] Cafarella, Michael J., Halevy, Alon Y., Zhang, Yang, Wang, Daisy Zhe, and Wu, Eugene. “Uncovering the Relational Web.” WebDB. 2008.
[21] Callan, James P. “Document Filtering With Inference Networks.” SIGIR. 1996, 262–269.
[22] Callan, James P. and Connell, Margaret E. “Query-based sampling of text databases.” ACM Trans. Inf. Syst. 19 (2001).2: 97–130.
[23] Callan, James P., Lu, Zhihong, and Croft, W. Bruce. “Searching Distributed Collections with Inference Networks.” SIGIR. 1995, 21–28.
[24] Castano, Silvana, Antonellis, Valeria De, and di Vimercati, Sabrina De Capitani. “Global Viewing of Heterogeneous Data Sources.” IEEE Trans. Knowl. Data Eng. 13 (2001).2: 277–297.
[25] Chang, Chia-Hui, Kayed, Mohammed, Girgis, Moheb R., and Shaalan, Khaled F. “A Survey of Web Information Extraction Systems.” IEEE Trans. Knowl. Data Eng. 18 (2006).10: 1411–1428.
[26] Chang, Kevin Chen-Chuan and Garcia-Molina, Hector. “Conjunctive Constraint Mapping for Data Translation.” ACM DL. 1998, 49–58.
[27] Chang, Kevin Chen-Chuan, Garcia-Molina, Hector, and Paepcke, Andreas. “Boolean Query Mapping Across Heterogeneous Information Sources.” IEEE Trans. Knowl. Data Eng. 8 (1996).4: 515–521.
[28] Chang, Kevin Chen-Chuan, He, Bin, Li, Chengkai, Patel, Mitesh, and Zhang, Zhen. “Structured Databases on the Web: Observations and Implications.” SIGMOD Record 33 (2004).3: 61–70.
[29] Chang, Kevin Chen-Chuan, He, Bin, and Zhang, Zhen. “Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web.” CIDR. 2005, 44–55.
[30] Chapuis, Olivier, Labrune, Jean-Baptiste, and Pietriga, Emmanuel. “DynaSpot: speed-dependent area cursor.” CHI. 2009.
[31] Chuang, Shui-Lung and Chang, Kevin Chen-Chuan. “Integrating web query results: holistic schema matching.” CIKM. 2008, 33–42.
[32] Chuang, Shui-Lung, Chang, Kevin Chen-Chuan, and Zhai, ChengXiang. “Context-Aware Wrapping: Synchronized Data Extraction.” VLDB. 2007, 699–710.
[33] Conradi, Reidar and Westfechtel, Bernhard. “Version Models for Software Configuration Management.” ACM Comput. Surv. 30 (1998).2: 232–282.
[34] Davis, Stanley M. Future Perfect. Reading, MA: Addison Wesley, 1987, 1st ed.
[35] Dhamankar, Robin, Lee, Yoonkyong, Doan, AnHai, Halevy, Alon, and Domingos, Pedro. “iMAP: discovering complex semantic matches between database schemas.” SIGMOD Conference. 2004, 383–394.
[36] Ding, Ying and Foo, Schubert. “Ontology research and development. Part 1 - a review of ontology generation.” Journal of Information Science 28 (2002).2: 123–136.
[37] Do, Hong Hai. Schema Matching and Mapping-based Data Integration. Ph.D. thesis, University of Leipzig, http://lips.informatik.uni-leipzig.de/pub/2006-4, 2006.
[38] Doan, Anhai. Learning to map between structured representations of data. Ph.D. thesis, University of Washington, 2002.
[39] Doan, AnHai, Domingos, Pedro, and Halevy, Alon Y. “Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach.” SIGMOD. 2001, 509–520.
[40] Doan, AnHai and Halevy, Alon Y. “Semantic Integration Research in the Database Community: A Brief Survey.” AI Magazine 26 (2005).1: 83–94.
[41] Dragut, Eduard C., Wu, Wensheng, Sistla, A. Prasad, Yu, Clement T., and Meng, Weiyi. “Merging Source Query Interfaces on Web Databases.” ICDE. 2006, 46.
[42] Dreilinger, Daniel and Howe, Adele E. “Experiences with Selecting Search Engines Using Metasearch.” ACM Trans. Inf. Syst. 15 (1997).3: 195–222.
[43] Drumm, Christian, Schmitt, Matthias, Do, Hong Hai, and Rahm, Erhard. “Quickmig: automatic schema matching for data migration projects.” CIKM. 2007, 107–116.
[44] Dwork, Cynthia, Kumar, Ravi, Naor, Moni, and Sivakumar, D. “Rank aggregation methods for the Web.” WWW. 2001, 613–622.
[45] Embley, David W., Campbell, Douglas M., Jiang, Y. S., Liddle, Stephen W., Ng, Yiu-Kai, Quass, Dallan, and Smith, Randy D. “Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages.” Data Knowl. Eng. 31 (1999).3: 227–251.
[46] Euzenat, Jerome and Shvaiko, Pavel. Ontology Matching. Springer-Verlag New York, Inc., 2007.
[47] Euzenat, Jerome, Valtchev, Petko, Duc, Chan Le, David, Jerome, Pierson, Jerome, Lee, Seunkeun, Troncy, Raphael, Sharma, Arun, Stoilos, Giorgos, Stamou, Georges, Bechhofer, Sean, Voltz, Raphael, Elonen, Jarno, and Nedas, Konstantinos A. “Alignment api and alignment server.” http://alignapi.gforge.inria.fr/, 2009.
[48] Fagin, Ronald, Kolaitis, Phokion G., Popa, Lucian, and Tan, Wang Chiew. “Composing schema mappings: Second-order dependencies to the rescue.” ACM Trans. Database Syst. 30 (2005).4: 994–1055.
[49] Ferragina, Paolo and Gulli, Antonio. “A personalized search engine based on Web-snippet hierarchical clustering.” WWW. 2005, 801–810.
[50] Gravano, Luis, Garcia-Molina, Hector, and Tomasic, Anthony. “GlOSS: Text-Source Discovery over the Internet.” ACM Trans. Database Syst. 24 (1999).2: 229–264.
[51] Gravano, Luis, Ipeirotis, Panagiotis G., and Sahami, Mehran. “QProber: A system for automatic classification of hidden-Web databases.” ACM Trans. Inf. Syst. 21 (2003).1: 1–41.
[52] Grossman, David A. and Frieder, Ophir. Information Retrieval: Algorithms and Heuristics. The Kluwer International Series of Information Retrieval. Springer, 2004, second ed.
[53] Haas, Laura M., Hernandez, Mauricio A., Ho, Howard, Popa, Lucian, and Roth, Mary. “Clio grows up: from research prototype to industrial tool.” SIGMOD Conference. 2005, 805–810.
[54] Halevy, Alon Y., Rajaraman, Anand, and Ordille, Joann J. “Data Integration: The Teenage Years.” VLDB. 2006, 9–16.
[55] He, Bin and Chang, Kevin Chen-Chuan. “Statistical Schema Matching across Web Query Interfaces.” SIGMOD Conference. 2003, 217–228.
[56] ———. “Automatic complex schema matching across Web query interfaces: A correlation mining approach.” ACM Trans. Database Syst. 31 (2006).1: 346–395.
[57] He, Bin, Tao, Tao, and Chang, Kevin Chen-Chuan. “Organizing structured web sources by query schemas: a clustering approach.” CIKM. 2004, 22–31.
[58] He, Hai, Meng, Weiyi, Lu, Yiyao, Yu, Clement T., and Wu, Zonghuan. “Towards Deeper Understanding of the Search Interfaces of the Deep Web.” World Wide Web (2007): 133–155.
[59] He, Hai, Meng, Weiyi, Yu, Clement T., and Wu, Zonghuan. “Automatic integration of Web search interfaces with WISE-Integrator.” VLDB J. 13 (2004).3: 256–273.
[60] ———. “Constructing Interface Schemas for Search Interfaces of Web Databases.” WISE. 2005, 29–42.
[61] Herlocker, Jonathan L., Konstan, Joseph A., Terveen, Loren G., and Riedl, John. “Evaluating collaborative filtering recommender systems.” ACM Trans. Inf. Syst. 22 (2004).1: 5–53.
[62] Hong, Jun, He, Zhongtian, and Bell, David A. “Extracting Web Query Interfaces Based on Form Structures and Semantic Similarity.” ICDE. 2009, 1259–1262.
[63] Howe, Adele E. and Dreilinger, Daniel. “SAVVYSEARCH: A Metasearch Engine That Learns Which Search Engines to Query.” AI Magazine 18 (1997).2: 19–25.
[64] Ipeirotis, Panagiotis G. and Gravano, Luis. “Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection.” VLDB. 2002, 394–405.
[65] Kashyap, Vipul and Sheth, Amit P. “Semantic and Schematic Similarities Between Database Objects: A Context-Based Approach.” VLDB J. 5 (1996).4: 276–304.
[66] Katz, Randy H. “Towards a Unified Framework for Version Modeling in Engineering Databases.” ACM Comput. Surv. 22 (1990).4: 375–408.
[67] Keeney, R.L. and Raiffa, H. Decisions with multiple objectives: Preferences and value tradeoffs. J. Wiley, New York, 1976.
[68] Kerschberg, Larry, Kim, Wooju, and Scime, Anthony. “Intelligent Web Search via Personalizable Meta-search Agents.” CoopIS/DOA/ODBASE. 2002, 1345–1358.
[69] Krovetz, Robert. “Viewing Morphology as an Inference Process.” SIGIR. 1993, 191–202.
[70] Kushmerick, Nicholas. “Learning to Invoke Web Forms.” CoopIS/DOA/ODBASE. 2003, 997–1013.
[71] Laender, Alberto H. F., Ribeiro-Neto, Berthier A., da Silva, Altigran Soares, and Teixeira, Juliana S. “A Brief Survey of Web Data Extraction Tools.” SIGMOD Record 31 (2002).2: 84–93.
[72] Larson, James A., Navathe, Shamkant B., and Elmasri, Ramez. “A Theory of Attribute Equivalence in Databases with Application to Schema Integration.” IEEE Trans. Software Eng. 15 (1989).4: 449–463.
[73] Lenzerini, Maurizio. “Data Integration: A Theoretical Perspective.” PODS. 2002, 233–246.
[74] Levy, Alon Y., Rajaraman, Anand, and Ordille, Joann J. “Querying Heterogeneous Information Sources Using Source Descriptions.” VLDB (1996): 251–262.
[75] Li, Chen, Yerneni, Ramana, Vassalos, Vasilis, Garcia-Molina, Hector, Papakonstantinou, Yannis, Ullman, Jeffrey D., and Valiveti, Murty. “Capability Based Mediation in TSIMMIS.” SIGMOD Conference. 1998, 564–566.
[76] Li, Xiao and Chow, Randy. “An Ontology-based Mapping Repository for Dynamic and Customized Data Integration.” Tech. Rep. REP-2009-483, Dept. of CISE at University of Florida, 2009. http://www.cise.ufl.edu/~chow/techreport483.pdf.
[77] ———. “An Ontology-based Mapping Repository for Meta-querier Customization.” SEKE. 2010, 325–330.
[78] ———. “Ontology-centric Source Selection for Meta-querier Customization.” Under Review. 2011.
[79] ———. “Reuse-oriented Mapping Discovery for Meta-querier Customization.” Under Review. 2011.
[80] Li, Xiao, Chow, Randy, and Chen, Lu. “Dynamic Personalization of Meta-Queriers.” IRI. 2009, 361–365.
[81] Li, Xiao, Wang, Ye-Yi, and Acero, Alex. “Learning query intent from regularized click graphs.” SIGIR. 2008, 339–346.
[82] Lin, Jinxin and Mendelzon, Alberto O. “Merging Databases Under Constraints.” Int. J. Cooperative Inf. Syst. 7 (1998).1: 55–76.
[83] Liu, Fang, Yu, Clement T., and Meng, Weiyi. “Personalized Web Search For Improving Retrieval Effectiveness.” IEEE Trans. Knowl. Data Eng. 16 (2004).1: 28–40.
[84] Lu, Yiyao, He, Hai, Peng, Qian, Meng, Weiyi, and Yu, Clement T. “Clustering e-commerce search engines based on their search interface pages using WISE-Cluster.” Data Knowl. Eng. 59 (2006).2: 231–246.
[85] Lu, Yiyao, Wu, Zonghuan, Zhao, Hongkun, Meng, Weiyi, Liu, King-Lup, Raghavan, Vijay, and Yu, Clement T. “MySearchView: a customized metasearch engine generator.” SIGMOD Conference. 2007, 1113–1115.
[86] Madhavan, Jayant, Bernstein, Philip A., Doan, AnHai, and Halevy, Alon Y. “Corpus-based Schema Matching.” ICDE. 2005, 57–68.
[87] Madhavan, Jayant, Cohen, Shirley, Dong, Xin Luna, Halevy, Alon Y., Jeffery, Shawn R., Ko, David, and Yu, Cong. “Web-scale Data Integration: You can only afford to Pay As You Go.” CIDR. 2007, 342–350.
[88] Madhavan, Jayant and Halevy, Alon Y. “Composing Mappings Among Data Sources.” VLDB. 2003, 572–583.
[89] Madhavan, Jayant, Ko, David, Kot, Lucja, Ganapathy, Vignesh, Rasmussen, Alex, and Halevy, Alon Y. “Google’s Deep Web crawl.” PVLDB 1 (2008).2: 1241–1252.
[90] McCann, Robert, Shen, Warren, and Doan, AnHai. “Matching Schemas in Online Communities: A Web 2.0 Approach.” ICDE. 2008, 110–119.
[91] Melnik, Sergey. Generic Model Management: Concepts and Algorithms. Ph.D. thesis, University of Leipzig, 2004.
[92] Melnik, Sergey, Bernstein, Philip A., Halevy, Alon Y., and Rahm, Erhard. “Supporting Executable Mappings in Model Management.” SIGMOD Conference. 2005, 167–178.
[93] Melnik, Sergey, Rahm, Erhard, and Bernstein, Philip A. “Rondo: A Programming Platform for Generic Model Management.” SIGMOD Conference. 2003, 193–204.
[94] Mena, Eduardo, Illarramendi, Arantza, Kashyap, Vipul, and Sheth, Amit P. “OBSERVER: An Approach for Query Processing in Global Information Systems Based on Interoperation Across Pre-Existing Ontologies.” Distributed and Parallel Databases 8 (2000).2: 223–271.
[95] Meng, Weiyi, Yu, Clement T., and Liu, King-Lup. “Building efficient and effective metasearch engines.” ACM Comput. Surv. 34 (2002).1: 48–89.
[96] Mitra, Prasenjit, Wiederhold, Gio, and Kersten, Martin L. “A Graph-Oriented Model for Articulation of Ontology Interdependencies.” EDBT. 2000, 86–100.
[97] Modica, Giovanni A., Gal, Avigdor, and Jamil, Hasan M. “The Use of Machine-Generated Ontologies in Dynamic Information Seeking.” CoopIS. 2001, 433–448.
[98] Moscovich, Tomer, Chevalier, Fanny, Henry, Nathalie, Pietriga, Emmanuel, and Fekete, Jean-Daniel. “Topology-aware navigation in large networks.” CHI. 2009.
[99] Ng, Wilfred, Deng, Lin, and Lee, Dik Lun. “Mining User preference using Spy voting for search engine personalization.” ACM Trans. Internet Techn. 7 (2007).4.
[100] Noy, Natalya Fridman. “Semantic Integration: A Survey Of Ontology-Based Approaches.” SIGMOD Record 33 (2004).4: 65–70.
[101] Noy, Natalya Fridman, Griffith, Nicholas, and Musen, Mark A. “Collecting Community-Based Mappings in an Ontology Repository.” ISWC. 2008, 371–386.
[102] Noy, Natalya Fridman and Klein, Michel C. A. “Ontology Evolution: Not the Same as Schema Evolution.” Knowl. Inf. Syst. 6 (2004).4: 428–440.
[103] Parent, Christine and Spaccapietra, Stefano. “Issues and Approaches of Database Integration.” Commun. ACM 41 (1998).5: 166–178.
[104] Pazzani, Michael J. “A Framework for Collaborative, Content-Based and Demographic Filtering.” Artif. Intell. Rev. 13 (1999).5-6: 393–408.
[105] Pazzani, Michael J. and Billsus, Daniel. “Learning and Revising User Profiles: The Identification of Interesting Web Sites.” Machine Learning 27 (1997).3: 313–331.
[106] ———. “Content-Based Recommendation Systems.” The Adaptive Web. 2007, 325–341.
[107] Pietriga, Emmanuel and Appert, Caroline. “Sigma lenses: focus-context transitions combining space, time and translucence.” CHI. 2008.
[108] Pietriga, Emmanuel, Appert, Caroline, and Beaudouin-Lafon, Michel. “Pointing and beyond: an operationalization and preliminary evaluation of multi-scale searching.” CHI. 2007.
[109] Pine, B. Joseph and Davis, Stan. Mass customization: the new frontier in business competition. Boston, Mass.: Harvard Business School Press, 1999.
[110] Pinto, Helena Sofia Andrade N. P. and Martins, Joao Pavao. “Ontologies: How can They be Built?” Knowl. Inf. Syst. 6 (2004).4: 441–464.
[111] Pluempitiwiriyawej, Charnyote and Hammer, Joachim. “Element matching across data-oriented XML sources using a multi-strategy clustering model.” Data Knowl. Eng. 48 (2004).3: 297–333.
[112] Porter, M. F. “An algorithm for suffix stripping.” Readings in information retrieval. Morgan Kaufmann Publishers Inc., 1997, 313–316.
[113] Pottinger, Rachel and Bernstein, Philip A. “Merging Models Based on Given Correspondences.” VLDB. 2003, 826–873.
[114] ———. “Schema merging and mapping creation for relational sources.” EDBT. 2008, 73–84.
[115] Raghavan, Sriram and Garcia-Molina, Hector. “Crawling the Hidden Web.” VLDB. 2001, 129–138.
[116] Rahm, Erhard and Bernstein, Philip A. “A survey of approaches to automatic schema matching.” VLDB J. 10 (2001).4: 334–350.
[117] Ram, Sudha and Park, Jinsoo. “Semantic Conflict Resolution Ontology (SCROL): An Ontology for Detecting and Resolving Data and Schema-Level Semantic Conflicts.” IEEE Trans. Knowl. Data Eng. 16 (2004).2: 189–202.
[118] Rasolofo, Yves, Hawking, David, and Savoy, Jacques. “Result merging strategies for a current news metasearcher.” Inf. Process. Manage. 39 (2003).4: 581–609.
[119] Roddick, John F. “A survey of schema versioning issues for database systems.” Information and Software Technology 37 (1995).7: 383–393.
[120] Russell, Stuart J. and Norvig, Peter. Artificial Intelligence: A Modern Approach, chap. 16.4. Prentice Hall, 2009, 3 ed., 622–626.
[121] Sabou, Marta, Wroe, Chris, Goble, Carole A., and Stuckenschmidt, Heiner. “Learning domain ontologies for semantic Web service descriptions.” J. Web Sem. 3 (2005).4: 340–365.
[122] Salton, Gerard and Buckley, Chris. “Term-Weighting Approaches in Automatic Text Retrieval.” Inf. Process. Manage. 24 (1988).5: 513–523.
[123] Sarma, Anish Das, Dong, Xin, and Halevy, Alon Y. “Bootstrapping pay-as-you-go data integration systems.” SIGMOD Conference. 2008, 861–874.
[124] Sarwar, Badrul M., Karypis, George, Konstan, Joseph A., and Riedl, John. “Item-based collaborative filtering recommendation algorithms.” WWW. 2001, 285–295.
[125] Seligman, Leonard J., Rosenthal, Arnon, Lehner, Paul E., and Smith, Angela. “Data Integration: Where Does the Time Go?” IEEE Data Eng. Bull. 25 (2002).3: 3–10.
[126] Shen, Warren, DeRose, Pedro, McCann, Robert, Doan, AnHai, and Ramakrishnan, Raghu. “Toward best-effort information extraction.” SIGMOD Conference. 2008, 1031–1042.
[127] Shestakov, Denis, Bhowmick, Sourav S., and Lim, Ee-Peng. “DEQUE: querying the deep Web.” Data Knowl. Eng. 52 (2005).3: 273–311.
[128] Sheth, Amit P. and Larson, James A. “Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases.” ACM Comput. Surv. 22 (1990).3: 183–236.
[129] Shokouhi, Milad. “Central-Rank-Based Collection Selection in Uncooperative Distributed Information Retrieval.” ECIR. 2007, 160–172.
[130] Shokouhi, Milad, Zobel, Justin, Scholer, Falk, and Tahaghoghi, Seyed M. M. “Capturing collection size for distributed non-cooperative retrieval.” SIGIR. 2006, 316–323.
[131] Si, Luo and Callan, James P. “Relevant document distribution estimation method for resource selection.” SIGIR. 2003, 298–305.
[132] Si, Luo and Callan, Jamie. “A semisupervised learning method to merge search engine results.” ACM Trans. Inf. Syst. 21 (2003).4: 457–491.
[133] Spaccapietra, Stefano and Parent, Christine. “View Integration: A Step Forward in Solving Structural Conflicts.” IEEE Trans. Knowl. Data Eng. 6 (1994).2: 258–274.
[134] Su, Weifeng, Wang, Jiying, and Lochovsky, Frederick H. “Automatic Hierarchical Classification of Structured Deep Web Databases.” WISE. 2006, 210–221.
[135] ———. “Holistic Schema Matching for Web Query Interfaces.” EDBT. 2006, 77–94.
[136] ———. “ODE: Ontology-assisted data extraction.” ACM Trans. Database Syst. 34 (2009).2.
[137] Sugiyama, Kazunari, Hatano, Kenji, and Yoshikawa, Masatoshi. “Adaptive Web search based on user profile constructed without any effort from users.” WWW. 2004, 675–684.
[138] Thomas, Paul and Shokouhi, Milad. “SUSHI: scoring scaled samples for server selection.” SIGIR. 2009, 419–426.
[139] Uchyigit, Gulden and Ma, Matthew Y., eds. Personalization Techniques and Recommender Systems, vol. 70. World Scientific Publishing, 2008.
[140] Wang, Jiying, Wen, Ji-Rong, Lochovsky, Frederick H., and Ma, Wei-Ying. “Instance-based Schema Matching for Web Databases by Domain-specific Query Probing.” VLDB. 2004, 408–419.
[141] Wu, Ping, Wen, Ji-Rong, Liu, Huan, and Ma, Wei-Ying. “Query Selection Techniques for Efficient Crawling of Structured Web Sources.” ICDE. 2006, 47.
[142] Wu, Wensheng, Doan, AnHai, and Yu, Clement T. “Merging Interface Schemas on the Deep Web via Clustering Aggregation.” ICDM. 2005, 801–804.
[143] ———. “WebIQ: Learning from the Web to Match Deep-Web Query Interfaces.” ICDE. 2006, 44.
[144] Wu, Wensheng, Yu, Clement, Doan, AnHai, and Meng, Weiyi. “An interactive clustering-based approach to integrating source query interfaces on the deep Web.” SIGMOD Conference. 2004, 95–106.
[145] Wu, Zonghuan, Raghavan, Vijay, Du, Chun C., Komanduru, Sai, Meng, Weiyi, He, Hai, and Yu, Clement T. “SE-LEGO: creating metasearch engines on demand.” SIGIR. 2003, 464.
[146] Xin, Dong, Han, Jiawei, and Chang, Kevin Chen-Chuan. “Progressive and selective merge: computing top-k with ad-hoc ranking functions.” SIGMOD Conference. 2007, 103–114.
[147] Xu, Jinxi and Croft, W. Bruce. “Cluster-Based Language Models for Distributed Retrieval.” SIGIR. 1999, 254–261.
[148] Xu, Li and Embley, David W. “Discovering Direct and Indirect Matches for Schema Elements.” DASFAA. 2003, 39–46.
[149] Yerneni, Ramana, Li, Chen, Garcia-Molina, Hector, and Ullman, Jeffrey D. “Computing Capabilities of Mediators.” SIGMOD Conference. 1999, 443–454.
[150] Yuwono, Budi and Lee, Dik Lun. “Server Ranking for Distributed Text Retrieval Systems on the Internet.” DASFAA. 1997, 41–50.
[151] Zhang, Zhen, He, Bin, and Chang, Kevin Chen-Chuan. “Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax.” SIGMOD Conference. 2004, 107–118.
[152] ———. “Light-weight Domain-based Form Assistant: Querying Web Databases On the Fly.” VLDB. 2005, 97–108.
[153] Zhao, Hongkun, Meng, Weiyi, Wu, Zonghuan, Raghavan, Vijay, and Yu, Clement T. “Fully automatic wrapper generation for search engines.” WWW. 2005, 66–75.
[154] Zhao, Hongkun, Meng, Weiyi, and Yu, Clement T. “Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages.” VLDB. 2006, 989–1000.
[155] Zhdanova, Anna V. and Shvaiko, Pavel. “Community-Driven Ontology Matching.” ESWC. 2006, 34–49.
[156] Zhou, Huimin and Ram, Sudha. “Clustering Schema Elements for Semantic Integration of Heterogeneous Data Sources.” J. Database Manag. 15 (2004).4: 88–106.
[157] Ziegler, Patrick and Dittrich, Klaus R. “User-Specific Semantic Integration of Heterogeneous Data: The SIRUP Approach.” ICSNW. 2004, 44–64.
[158] Ziegler, Patrick, Dittrich, Klaus R., and Hunt, Ela. “A call for personal semantic data integration.” ICDE Workshops. 2008, 250–253.
BIOGRAPHICAL SKETCH
Xiao Li received a bachelor’s degree from Nanjing University of Science and
Technology in 2004 and continued his graduate studies there in pattern recognition
until 2005. He then transferred to the University of Florida in Gainesville, where he
graduated with honors in computer science.