iTrails: Pay-as-you-go Information Integration in Dataspaces
Presented By Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
2008-02-22
Summarized by Sungchan Park
Copyright 2008 by CEBT
Problem: Querying Several Sources
Center for E-Business Technology
Solution #1: Use a Search Engine
Solution #2: Use an Information Integration System
iTrail Core Idea
Is there an integration solution in-between these two extremes?
Declaratively add lightweight ‘hints’ to a search engine thus allowing gradual enrichment of loosely integrated data sources
Example Scenario
Query
“pdf yesterday”
Hints (Trails)
1. The date attribute is mapped to the modified attribute
2. The date attribute is mapped to the received attribute
3. The yesterday keyword is mapped to a query for values of the date attribute equal to the date of yesterday
4. The pdf keyword is mapped to a query for elements whose names end in pdf
Where do hints come from?
Given by the user
Explicitly
Via Relevance Feedback
(Semi-)Automatically
Information extraction techniques
Automatic schema matching
Ontologies and thesauri (e.g., WordNet)
User communities (e.g., trails on gene data, bookmarks)
All these aspects are beyond the scope of this paper
Data and Query Model
Data Model
Assume that all data is represented by a logical graph G
Queries are also represented by a graph
Query Syntax
Query Example
“//Home/projects//*[“Mike”]”
Basic Form of a Trail
A unidirectional trail
A bidirectional trail
Trail Example
Trails in an example scenario
Trails
Given query
– “pdf yesterday”
Transformed query
– “//*.pdf[modified=yesterday() OR received=yesterday()]”
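The rewrite above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the iTrails implementation: the two lookup tables, the function name rewrite, and the string-based query representation are all hypothetical, chosen to mirror the four hints from the example scenario.

```python
# Hypothetical trail tables mirroring the four hints of the example scenario.
# Keyword trails map a keyword either to a name pattern or to a predicate.
KEYWORD_TRAILS = {
    "pdf": ("name", "//*.pdf"),                     # hint 4
    "yesterday": ("predicate", "date=yesterday()"), # hint 3
}
# Attribute trails map one attribute to the attributes it stands for.
ATTRIBUTE_TRAILS = {"date": ["modified", "received"]}  # hints 1 and 2


def rewrite(keyword_query: str) -> str:
    """Rewrite a keyword query into a path query (string-level sketch)."""
    path = "//*"
    predicates = []
    for kw in keyword_query.split():
        # Keywords without a trail stay plain keyword predicates.
        kind, fragment = KEYWORD_TRAILS.get(kw, ("keyword", f'"{kw}"'))
        if kind == "name":
            path = fragment
        else:
            predicates.append(fragment)

    expanded = []
    for pred in predicates:
        attr, sep, rest = pred.partition("=")
        if sep and attr in ATTRIBUTE_TRAILS:
            # Replace the attribute by a disjunction over its mapped attributes.
            targets = ATTRIBUTE_TRAILS[attr]
            expanded.append(" OR ".join(f"{a}={rest}" for a in targets))
        else:
            expanded.append(pred)
    return path + "".join(f"[{p}]" for p in expanded)


print(rewrite("pdf yesterday"))
# //*.pdf[modified=yesterday() OR received=yesterday()]
```

A real system would rewrite a query plan rather than a string; the string form is used here only because it matches the notation on the slide.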
iTrail Query Processing
1. Matching
2. Transforming
3. Merging
iTrail Query Processing Example
Given Query
Q1 = //home/projects//*[“Mike”]
Trail
Ψ8 := //home/*.name -> //calendar//*.tuple.category
Resulting Query
Q1{Ψ8} = //home/projects//*[“Mike”] U
//calendar//*[category=“project”]//*[“Mike”]
Utilizing G. Miklau and D. Suciu. Containment and Equivalence for an Xpath Fragment. In PODS, 2002.
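The three steps (matching, transforming, merging) can be illustrated with a deliberately simplified Python sketch. Several things here are assumptions rather than the authors' method: matching is reduced to a string-prefix test instead of the query containment test the slide cites, and the trail is written in an already instantiated form (left side //home/projects, right side //calendar//*[category="project"]) so that the result reproduces the resulting query shown above. The names Trail, match, transform, merge and apply_trail are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    name: str
    left: str   # path prefix the trail matches (simplification)
    right: str  # path that replaces the matched prefix


def match(query: str, trail: Trail) -> bool:
    # Step 1, matching. Assumption: a prefix test stands in for containment.
    return query.startswith(trail.left)


def transform(query: str, trail: Trail) -> str:
    # Step 2, transforming: swap the matched part for the trail's right side.
    return trail.right + query[len(trail.left):]


def merge(original: str, transformed: str) -> str:
    # Step 3, merging: the original query is kept and unioned with the rewrite.
    return f"{original} U {transformed}"


def apply_trail(query: str, trail: Trail) -> str:
    if not match(query, trail):
        return query
    return merge(query, transform(query, trail))


# Hypothetical, pre-instantiated form of trail 8 from the slide.
psi8 = Trail("psi8", "//home/projects", '//calendar//*[category="project"]')
q1 = '//home/projects//*["Mike"]'
print(apply_trail(q1, psi8))
# //home/projects//*["Mike"] U //calendar//*[category="project"]//*["Mike"]
```

Keeping the original query in the union is what makes the rewrite additive: a trail can only widen the result set, never discard what the plain search would have found.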
Applying Multiple Trails
MMCA (Multiple Match Colouring Algorithm)
Without a safeguard, a trail could be applied infinitely often
To prevent infinite recursion, a trail should not be rematched to nodes in a logical plan generated by itself
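The colouring idea can be sketched as follows. This is a coarse approximation and not MMCA itself: it colours the whole plan with the names of the trails already applied, whereas the slide describes colouring individual plan nodes. The Trail class, the string-based plan and the two example trails are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    name: str
    left: str   # text the trail matches
    right: str  # text it is rewritten to


def expand(query: str, trails: list[Trail]) -> str:
    """Apply trails repeatedly, never rematching a trail on its own output."""
    plan = query
    colours: set[str] = set()  # names of trails that already produced this plan
    changed = True
    while changed:
        changed = False
        for trail in trails:
            if trail.name in colours:
                continue  # trail generated (part of) this plan: skip it
            if trail.left in plan:
                plan = plan.replace(trail.left, trail.right)
                colours.add(trail.name)
                changed = True
    return plan


# t1's right side contains its own left side, so without colouring it
# would match its own output forever.
t1 = Trail("t1", "date", "(date OR modified)")
t2 = Trail("t2", "modified", "(modified OR changed)")
print(expand("date=yesterday()", [t1, t2]))
# (date OR (modified OR changed))=yesterday()
```

Because every trail can be coloured at most once, the loop terminates after at most as many rewrites as there are trails, while trails can still chain (t2 matches the output of t1).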
Other Issues
Trail Pruning
Problem: MMCA is exponential in number of levels
Solution: Trail Pruning
– Prune by number of levels
– Prune by top-K trails matched in each level
Assign weights and probabilities to trails
– Prune by both top-K trails and number of levels
Trail Indexing
Precompute trail expressions in order to speed up query processing
Trail materialization
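The two pruning strategies can be combined in a short Python sketch. It rests on assumptions not given in the slides: each trail carries a single numeric weight, candidates at a level are ranked by that weight alone, and plans are plain strings. The names Trail and expand_pruned and the example weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    name: str
    left: str
    right: str
    weight: float  # assumed ranking score; higher means more trusted


def expand_pruned(query: str, trails: list[Trail],
                  max_levels: int, top_k: int) -> list[str]:
    """Return the original query plus rewrites, pruned by level and top-K."""
    plans = [query]
    frontier = [(query, frozenset())]  # (plan text, trails already used)
    for _ in range(max_levels):        # prune by number of levels
        candidates = [
            (trail, text, used)
            for text, used in frontier
            for trail in trails
            if trail.name not in used and trail.left in text
        ]
        # Prune by top-K: keep only the K heaviest matches at this level.
        candidates.sort(key=lambda c: c[0].weight, reverse=True)
        frontier = [
            (text.replace(trail.left, trail.right), used | {trail.name})
            for trail, text, used in candidates[:top_k]
        ]
        if not frontier:
            break
        plans.extend(text for text, _ in frontier)
    return plans


trails = [
    Trail("t1", "date", "modified", 0.9),
    Trail("t2", "date", "received", 0.8),
    Trail("t3", "date", "created", 0.2),
]
print(expand_pruned("date=yesterday()", trails, max_levels=1, top_k=2))
# ['date=yesterday()', 'modified=yesterday()', 'received=yesterday()']
```

With top_k fixed, each level adds at most K plans, so the total number of rewrites grows linearly in the number of levels instead of exponentially.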
Experiments
Setting
Configured iMeMex to act in three modes
– Baseline: Graph / IR search engine
– iTrails: Rewrite search queries with trails
– Perfect Query: Semantics-aware query
Data
Experiment, Quality
Compare with baseline
Experiment, Overhead
Compare with perfect query
Overhead is not negligible
However, this can be fixed by exploiting trail materializations
Experiment, Scalability #1
Rewrite Time
Query-rewrite time can be controlled with pruning
Experiment, Scalability #2
Quality
Pruning improves precision
Conclusion
Our Contributions
iTrails: a generic method to model semantic relationships (e.g., implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
We propose a framework and algorithms for Pay-as-you-go Information Integration
Smooth transition between search and data integration
Future Work
Trail Creation
– Use collections (ontologies, thesauri, Wikipedia)
– Work on automatic mining of trails from the dataspace
Other types of trails