Discovering Data Sources in a Dynamic Grid Environment
Jürgen Göres
Heterogeneous Information Systems Group
University of [email protected]
2nd VLDB Workshop on Data Management in Grids
Seoul, South Korea
September 11th, 2006
Outline
• Motivation – The Role of Data on the Grid
• The Discovery Problem
• Data Source Utility
• Conclusion & Outlook
The Role of Data in the Grid: A Database Perspective
• From: Moving input and output data for number crunching
• Via: File-oriented bulk data storage
– Store large volumes of unstructured data (“BLOBs”)
– Retrieved and used in its original format and context
• To: Reuse and sharing of existing data
– Data becomes a resource in its own right
– The Grid is aware of the structure of the data
• Individual data sources will rarely fulfill all application requirements
– Data from different sources has to be combined!
• Problem: Data sources are highly heterogeneous
– Effective use of data requires an application-specific integrated view
Goals of Information Integration in Brief
• Provide an integrated, homogeneous view over a number of heterogeneous data sources, i.e.
– Create a mapping from the sources to an integrated schema
• Resolve heterogeneity:
– Technical Issues
– Different data models and structuring
– Uncertainties in the semantics of data
– Duplicate/ambiguous/contradictory records
Integration is difficult! (“AI-complete”)
To this day a largely manual task
• Bad news: Integration in the Grid won’t get any easier
• Good news: Lots of new research opportunities
Conventional Information Integration
[Figure: the conventional integration process. Phases: Analysis, Discovery, Planning, Deployment. Artifacts: user/application requirements, concrete requirements (target schema), data sources, candidate data sources (10^1 to 10^2), integration plan, integration system. Annotations for the Grid setting: dynamic; autonomous and changing sources; more sources (10^3 to 10^6).]
The Challenge of Data Source Discovery in the Grid
• Number of potential sources is several orders of magnitude larger
– Informal manual discovery is not an option
– Cannot start integration planning with all sources
– Idea: Only consider the most useful sources
• What makes a data source useful?
– Source must have the same “universe of discourse” as the target
– Source and target must deal with identical or related concepts
– A concept is represented by tables, classes, (XML) elements, ...
– Concept coverage
– Choose Top-N sources
• Problems:
– No support for concept-oriented search in current registries
– How to identify identical or related concepts?
Schema Matching
• Identify schema elements that are in some way similar
• Result: semantic correspondences (a.k.a. "matches")
– Usually have a confidence ranking [0..1]
– Can be basic or complex
• Problem: this is really hard to do automatically
– Lots of automatic matching approaches
• Linguistic
• Structural
• Hybrid
• …
– Quality and performance are limited
– User needs to review/correct/amend matches
– Matching remains semi-automatic
Schema matching against 10^3 to 10^6 sources?!
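As a rough illustration of a linguistic matcher and its limits, the following sketch compares element names with Python's standard `difflib`. The function names, the 0.6 threshold and the choice of string similarity are assumptions made for this example; the slides do not prescribe a particular matcher.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_schemas(source, target, threshold=0.6):
    """For each target element, keep the most similar source element
    if its similarity (used here as match confidence) reaches the threshold."""
    matches = []
    for t in target:
        best = max(source, key=lambda s: name_similarity(s, t))
        confidence = name_similarity(best, t)
        if confidence >= threshold:
            matches.append((best, t, round(confidence, 2)))
    return matches

# Element names taken from the procurement scenario (S3 and the target schema).
source = ["Product", "UPC", "Name", "Information", "Price"]
target = ["Product", "GTIN", "Name", "Description", "Category", "Price"]
matches = match_schemas(source, target)
for m in matches:
    print(m)
```

Only identically named elements are matched here; pairs such as UPC/GTIN or Information/Description are missed, which is the kind of gap a user would have to correct by hand.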
Indirect Schema Matching
• Idea: Provide reference schemas for schema matching
– Any “good” schema that models a given domain
– Purpose-built domain schemas (cf. “Ontologies”)
• Deployment
– Match sources against domain schema(s)
– Store matches in the registry
• Discovery
– Only match target schema against selected domain schema(s)
– Semi-automatic matching feasible
– Assuming transitivity, infer matches between source and target via domain schema
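The transitive inference step can be sketched as a join over shared domain concepts. This is a minimal sketch, not the author's implementation: the domain concept names (`ProductCode` and so on), the confidence values and the rule of multiplying the two confidences are assumptions chosen for illustration.

```python
def compose(source_to_domain, target_to_domain):
    """Infer source-target matches through shared domain concepts.

    Both inputs are lists of (element, domain_concept, confidence).
    The inferred confidence is the product of the two confidences (assumed rule).
    """
    by_concept = {}
    for src_elem, concept, conf in source_to_domain:
        by_concept.setdefault(concept, []).append((src_elem, conf))

    inferred = []
    for tgt_elem, concept, tgt_conf in target_to_domain:
        for src_elem, src_conf in by_concept.get(concept, []):
            inferred.append((src_elem, tgt_elem, round(src_conf * tgt_conf, 2)))
    return inferred

# Stored in the registry at deployment time (hypothetical matches for S3).
s3_to_domain = [("UPC", "ProductCode", 0.9),
                ("Information", "ProductDescription", 0.8)]
# Produced at discovery time by matching only the target schema.
target_to_domain = [("GTIN", "ProductCode", 1.0),
                    ("Description", "ProductDescription", 0.9),
                    ("Category", "ProductCategory", 0.9)]

inferred = compose(s3_to_domain, target_to_domain)
print(inferred)  # [('UPC', 'GTIN', 0.9), ('Information', 'Description', 0.72)]
```

The expensive source-side matching happens once per source; discovery only needs one target-to-domain matching plus this lookup.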
[Figure, built up over five slides titled “Indirect Schema Matching”. Schemas shown: Data Source 1, Data Source 2, Target Schema, Domain Schema; concepts labeled A, B and C. Legend: equivalent concept, superconcept, related concept.]
Thoughts about Utility
• Isn’t utility just a similarity measure?
– Similarity is intuitively symmetric: sim(A, B) = sim(B, A)
– Utility is asymmetric/directed:
– Schema A is very useful for Schema B: util(A, B) ≈ 1
– Schema B is not as useful for Schema A: util(B, A) ≈ 0.4
• Basic Utility Measure (Base)
– # corresponding concepts in source / # concepts in target
• Weighted Utility Measure (Weighted)
– Reduced weight for concepts farther away from schema root
– Consider match types
– Consider match confidence
[Figure: Schema A and Schema B]
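The two measures can be sketched as follows. The base measure follows the definition on the slide; for the weighted measure the slides name the ingredients (depth, match type, confidence) but give no formula, so the `1 / (1 + depth)` decay and the per-type factors below are assumptions, as are all identifiers.

```python
from dataclasses import dataclass

# Assumed factors per match type; the slides name the types but give no numbers.
TYPE_FACTOR = {"equivalent": 1.0, "superconcept": 0.7, "related": 0.4}

@dataclass(frozen=True)
class Match:
    target_concept: str        # path of the matched concept in the target schema
    confidence: float = 1.0    # matcher confidence in [0, 1]
    kind: str = "equivalent"   # key into TYPE_FACTOR

def base_utility(target_concepts, matches):
    """# target concepts with a correspondence in the source / # target concepts."""
    covered = {m.target_concept for m in matches} & set(target_concepts)
    return len(covered) / len(target_concepts)

def weighted_utility(target_depths, matches):
    """Each concept counts 1 / (1 + depth) (assumed decay), scaled by the
    best available match confidence and match-type factor."""
    weight = {c: 1.0 / (1 + d) for c, d in target_depths.items()}
    best = {}
    for m in matches:
        if m.target_concept in weight:
            score = m.confidence * TYPE_FACTOR[m.kind]
            best[m.target_concept] = max(best.get(m.target_concept, 0.0), score)
    return sum(weight[c] * s for c, s in best.items()) / sum(weight.values())

# Asymmetry: both concepts of B occur in A, but A has five concepts.
schema_a = ["a1", "a2", "a3", "a4", "a5"]
schema_b = ["b1", "b2"]
correspondences = [("a1", "b1"), ("a2", "b2")]  # (concept in A, concept in B)
util_a_for_b = base_utility(schema_b, [Match(b) for _, b in correspondences])
util_b_for_a = base_utility(schema_a, [Match(a) for a, _ in correspondences])
print(util_a_for_b, util_b_for_a)  # 1.0 0.4

# Weighted measure on a small target tree (concept path -> depth below the root).
depths = {"Product": 0, "Product/Name": 1,
          "Product/Supplier": 1, "Product/Supplier/URL": 2}
weighted = weighted_utility(depths, [Match("Product"), Match("Product/Name", 0.8)])
```

Concept paths rather than bare names are used as keys because a target schema may contain the same element name twice (for example a product name and a supplier name).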
Scenario “Procurement”: Data Source 1
Procurement Department: Purchase pencils, paper, toner, envelopes, … = “office supplies”
//Product[Category = “Office Supplies”]

Sample target data:
    <Product>
      <GTIN>01234567891011</GTIN>
      <Name>Typewriter X-1000</Name>
      ...
      <Description>The X-1000 represents the culmination of typewriter development... </Description>
      <Category>Office Supplies</Category>
      <Supplier>
        <Name>Office World</Name>
        <URL>www.officeworld.com</URL>
        <Price>99.99</Price>
        <DeliveryTime>5 min</DeliveryTime>
      </Supplier>
      <Supplier> ... </Supplier>
    </Product>
    <Product> ...

Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime

Data Source S1: Price Search Engine (schema PriceSearch)

Commodity
EAN | Name | Spec | Group
00930… | X-1000… | … | Typewriters
… | … | … | …

avail_at
PID | Price | SID
00930… | 109.99 | 4711
… | … | …

Shop
ID | Name | Address
4711 | WriteTypers | www.write...
… | … | …

Base = 7 / 11, given as 0.73; Weighted ≈ 0.61
Scenario “Procurement”: Data Source 2
Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime
Data Source S2: Grocery Store (schema GroSupply, annotation: = "groceries")
Elements: Product, Barcode, Name, Description, Delivery, URL, Address, Phone, Contact, Price, Type
Base = 9 / 11 ≈ 0.82, Weighted ≈ 0.74?
Scenario “Procurement”: Data Source 3
Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime
Data Source S3: Office Supply Store (schema OfficeWorld)
Elements: Product, UPC, Name, Information, Price
Base = 5 / 11 ≈ 0.45, Weighted ≈ 0.45?
Evaluation of the basic measures
• Ranking:
Rank | Source | Base | Weighted
1 | Grocery Store (S2) | 0.82 | 0.74
2 | Price Search (S1) | 0.73 | 0.61
3 | Office Supply (S3) | 0.45 | 0.45
• Basic measures only consider similar concepts
– Instances of concepts can be completely disjoint!
• Utility measure should consider instance properties
– Using constraints
• Satisfiability is NP-complete
• Satisfiability does not indicate presence of useful instances
– Using histograms
• Independent for each atomic feature/attribute
• No information about the combination of values (complex objects)
• But useful as a filter: lower weight to 0 if constraint is not satisfied
Instance-based measure Inst
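The histogram filter can be sketched as below. The slides only state that a weight is lowered to 0 when a constraint is not satisfied; which weights are affected is not spelled out, so this sketch zeroes only the constrained concept. Treating a missing histogram as unsatisfied is a further assumption, and all names and weights are hypothetical.

```python
def constraint_satisfiable(histogram, predicate):
    """True if some value with a nonzero count satisfies the predicate."""
    return any(count > 0 and predicate(value)
               for value, count in histogram.items())

def apply_histogram_filter(weights, constraints, histograms):
    """Return a copy of the concept weights in which every constrained
    concept whose constraint no recorded value satisfies is set to 0."""
    filtered = dict(weights)
    for concept, predicate in constraints.items():
        histogram = histograms.get(concept)  # histogram of the matched source attribute
        if histogram is None or not constraint_satisfiable(histogram, predicate):
            filtered[concept] = 0.0
    return filtered

def wants_office_supplies(value):
    return value.lower() == "office supplies"

weights = {"Category": 0.5, "Name": 0.5}          # hypothetical concept weights
constraints = {"Category": wants_office_supplies}  # from the target query

# Values listed for the grocery store's Type attribute (further rows are elided on the slide).
grocery = {"Category": {"beverages": 45, "cereals": 21, "sweets": 263}}
office = {"Category": {"office supplies": 3454}}

print(apply_histogram_filter(weights, constraints, grocery))  # {'Category': 0.0, 'Name': 0.5}
print(apply_histogram_filter(weights, constraints, office))   # {'Category': 0.5, 'Name': 0.5}
```

Because each histogram covers a single attribute, the filter can only rule sources out; it cannot confirm that the required combination of values occurs in one instance.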
Scenario “Procurement”: Data Source 2
Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime
Data Source S2: Grocery Store (schema GroSupply)
Elements: Product, Barcode, Name, Description, Delivery, Name, URL, Address, Phone, Contact, Price, Type
P = //Product[Category = “office supplies”] = ∅
Histogram for Type:
value | count
“beverages” | 45
“cereals” | 21
... | ...
“sweets” | 263
Inst ≈ 0.3
Evaluation of Instance Completeness Measure
• Ranking:
Rank | Source | Inst
1 | Price Search (S1) | 0.61
2 | Office Supply (S3) | 0.45
3 | Grocery Store (S2) | 0.3
• Instance completeness
– Devalues false positives
• What about the Office Supply Store (S3)?
Scenario “Procurement”: Data Source 3
Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime
Data Source S3: Office Supply Store (schema OfficeWorld)
Elements: Product, UPC, Name, Information, Price
Schema Augmentation: attributes added, with their histograms
attribute | value | count
Category | “office supplies” | 3454
ShopName | “Office World” | 1
URL | “www.officew... | 1
Inst+ ≈ 0.77
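Schema augmentation can be pictured as the data provider attaching constant context attributes, each with a one-value histogram, to the published schema. The sketch below uses the values shown for S3; the function name and the data layout are assumptions, and it does not attempt to reproduce the Inst+ score.

```python
def augment(elements, histograms, context):
    """Return copies of a schema's element list and histograms with
    context attributes added.

    context maps attribute name -> (constant value, number of instances).
    """
    new_elements = list(elements)
    new_histograms = {name: dict(h) for name, h in histograms.items()}
    for attribute, (value, count) in context.items():
        if attribute not in new_elements:
            new_elements.append(attribute)
        new_histograms[attribute] = {value: count}
    return new_elements, new_histograms

s3_elements = ["Product", "UPC", "Name", "Information", "Price"]
s3_context = {
    "Category": ("office supplies", 3454),
    "ShopName": ("Office World", 1),
    "URL": ("www.officew...", 1),  # truncated on the slide, kept as-is
}

elements, histograms = augment(s3_elements, {}, s3_context)
print(elements)
# The implicit context is now visible to a histogram-based check.
category_ok = "office supplies" in histograms["Category"]
```

After augmentation the source exposes Category, ShopName and URL, so concepts that were only implicit in the original usage can take part in matching and in instance-based filtering.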
Ranking with Augmentation
• Ranking:
Rank | Source | Inst+
1 | Office Supply (S3) | 0.77
2 | Price Search (S1) | 0.61
3 | Grocery Store (S2) | 0.3
• Augmentation and instance completeness reproduce the intuitive ranking
Conclusion
• Data source discovery as a grid-specific problem
– Very large number of data sources
– Only the most useful sources should be considered
• Basic utility measure based on concept coverage
– Use schema matching to identify similar concepts
– Use indirect schema matching during deployment
• Limitations of the basic measure
• Instance completeness
– Use histograms to filter out sources that cannot possibly be useful
• Missing context information in data sources
– Implicitly known in original usage
– Schema augmentation by data provider
Outlook & Open Questions
• Who provides domain schemas?
• Instance-based utility: Caveats
– Record matching problem
– “Like schema matching with data” instead of metadata
– Scalability?
– Concept hierarchies on values
• Limitations of the “Top-N” approach
– Sources which are very specific to a subset of concepts might be filtered out
  Partition target schema
– The best n sources might not provide all concepts
  Repeat discovery with the missing concepts only
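The second remedy, repeating discovery for the concepts still missing, could look roughly like this. It is a sketch under stated assumptions: the registry is modeled as a plain mapping from source name to the set of target concepts it covers, coverage is the basic measure restricted to the missing concepts, and the round limit is arbitrary.

```python
import heapq

def coverage(source_concepts, wanted):
    """Fraction of the wanted concepts that the source covers."""
    return len(source_concepts & wanted) / len(wanted) if wanted else 0.0

def discover(registry, target_concepts, n, max_rounds=3):
    """Pick the top-n sources, then repeat with the still-missing concepts only."""
    selected, missing = [], set(target_concepts)
    for _ in range(max_rounds):
        if not missing:
            break
        candidates = {s: c for s, c in registry.items() if s not in selected}
        ranked = heapq.nlargest(
            n, candidates, key=lambda s: coverage(candidates[s], missing))
        ranked = [s for s in ranked if candidates[s] & missing]  # drop useless picks
        if not ranked:
            break
        selected.extend(ranked)
        for s in ranked:
            missing -= registry[s]
    return selected, missing

# Hypothetical registry: source name -> target concepts it has matches for.
registry = {
    "s1": {"A", "B"},
    "s2": {"A", "B", "C"},
    "s3": {"D"},
    "s4": {"A"},
}
selected, missing = discover(registry, {"A", "B", "C", "D"}, n=1)
print(selected, missing)  # ['s2', 's3'] set()
```

In a single Top-1 pass the narrow source s3 would never be chosen; restricting the second round to the missing concept D lets it rank first.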
Thank you!
Questions?