Discovering Data Sources in a Dynamic Grid Environment

Jürgen Göres
Heterogeneous Information Systems Group

University of [email protected]

2nd VLDB Workshop on Data Management in Grids
Seoul, South Korea

September 11th, 2006


Outline

• Motivation – The Role of Data on the Grid

• The Discovery Problem

• Data Source Utility

• Conclusion & Outlook


The Role of Data in the Grid: A Database Perspective

• From: Moving input and output data for number crunching

• Via: File-oriented bulk data storage

– Store large volumes of unstructured data (“BLOBs”)

– Retrieved and used in its original format and context

• To: Reuse and sharing of existing data

– Data becomes a resource in its own right

– The Grid is aware of the structure of the data

• Individual data sources will rarely fulfill all application requirements

Data from different sources has to be combined!

• Problem: Data sources are highly heterogeneous

Effective use of data requires an application-specific integrated view


Goals of Information Integration in Brief

• Provide an integrated, homogeneous view over a number of heterogeneous data sources, i.e.

– Create a mapping from the sources to an integrated schema

• Resolve heterogeneity:

– Technical Issues

– Different data models and structuring

– Uncertainties in the semantics of data

– Duplicate/ambiguous/contradictory records

Integration is difficult! (“AI-complete”)

To this day a largely manual task

• Bad news: Integration in the Grid won’t get any easier

• Good news: Lots of new research opportunities


Conventional Information Integration

[Figure: phases of the integration process (Analysis, Discovery, Planning, Deployment) and their artifacts: user/application requirements, concrete requirements (target schema), candidate data sources (10^1 to 10^2), integration plan, integration system. Grid-specific additions: autonomous and changing sources, more sources (10^3 to 10^6), dynamic.]


The Challenge of Data Source Discovery in the Grid

• Number of potential sources is several orders of magnitude larger
– Informal manual discovery is not an option
– Cannot start integration planning with all sources
– Idea: only consider the most useful sources

• What makes a data source useful?
– Source must have the same “universe of discourse” as the target
– Source and target must deal with identical or related concepts
– A concept is represented by tables, classes, (XML) elements, ...
– Concept coverage
– Choose the top-N sources

• Problems:
– No support for concept-oriented search in current registries
– How to identify identical or related concepts?


Schema Matching

• Identify schema elements that are in some way similar
• Result: semantic correspondences (a.k.a. "matches")
– Usually have a confidence ranking [0..1]
– Can be basic or complex

• Problem: this is really hard to do automatically
– Lots of automatic matching approaches
  • Linguistic
  • Structural
  • Hybrid
  • …
– Quality and performance are limited
– User needs to review/correct/amend matches
– Matching is therefore done semi-automatically

• Schema matching against 10^3 to 10^6 sources?!
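As a rough illustration of the linguistic family of matchers, the following minimal sketch compares element names by string similarity and reports candidates with a confidence in [0, 1]. The function names, the threshold of 0.6, and the choice of Python's `difflib` are assumptions made for this example; they are not taken from the talk.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a: str, b: str) -> float:
    """Confidence in [0, 1] from case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(source, target, threshold=0.6):
    """Return (source_element, target_element, confidence) candidates."""
    matches = []
    for s, t in product(source, target):
        confidence = name_similarity(s, t)
        if confidence >= threshold:
            matches.append((s, t, round(confidence, 2)))
    # Highest-confidence candidates first, ready for human review.
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Element names borrowed from the procurement scenario later in the deck.
source = ["UPC", "Name", "Information", "Price"]
target = ["GTIN", "Name", "Description", "Category", "Price"]
candidates = match_schemas(source, target)
```

Identical names such as Name and Price are found, while synonyms such as UPC and GTIN share no characters and are missed. This is one reason the user still has to review and amend the matches.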


Indirect Schema Matching

• Idea: Provide reference schemas for schema matching
– Any “good” schema that models a given domain
– Purpose-built domain schemas (cf. “ontologies”)

• Deployment
– Match sources against domain schema(s)
– Store matches in the registry

• Discovery
– Only match target schema against selected domain schema(s)
– Semi-automatic matching feasible
– Assuming transitivity, infer matches between source and target via domain schema
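The transitive inference step can be sketched as a join over shared domain concepts. The data layout, the domain concept names (`ProductCode`, `ProductText`), the confidence values, and the rule of multiplying the two confidences are all assumptions of this sketch; the slides only state that transitivity is assumed.

```python
def infer_matches(source_to_domain, target_to_domain):
    """Compose source->domain and target->domain matches via shared concepts.

    Both inputs map an element name to a (domain_concept, confidence) pair.
    Multiplying the two confidences is an assumption of this sketch.
    """
    by_concept = {}
    for src_elem, (concept, conf) in source_to_domain.items():
        by_concept.setdefault(concept, []).append((src_elem, conf))

    inferred = []
    for tgt_elem, (concept, tgt_conf) in target_to_domain.items():
        for src_elem, src_conf in by_concept.get(concept, []):
            inferred.append((src_elem, tgt_elem, round(src_conf * tgt_conf, 2)))
    return inferred


# Deployment time: source matches stored in the registry (hypothetical values).
registry = {
    "S3": {"UPC": ("ProductCode", 0.9), "Information": ("ProductText", 0.8)},
}
# Discovery time: only the target is matched against the domain schema.
target = {"GTIN": ("ProductCode", 1.0), "Description": ("ProductText", 0.9)}

inferred = infer_matches(registry["S3"], target)
```

The expensive matching of every source happens once at deployment; at discovery time only the single target-to-domain matching needs human review.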



[Figure: indirect schema matching example (animation over several frames). Labels: Data Source 1, Data Source 2, Target Schema, Domain Schema; concepts A, B, C. Legend of match types: equivalent concept, superconcept, related concept.]

Thoughts about Utility

• Isn’t utility just a similarity measure?
– Similarity is intuitively symmetric: sim(A, B) = sim(B, A)
– Utility is asymmetric/directed:
  – Schema A is very useful for Schema B: util(A, B) ≈ 1
  – Schema B is not as useful for Schema A: util(B, A) ≈ 0.4

• Basic Utility Measure (Base)
– # corresponding concepts in source / # concepts in target

• Weighted Utility Measure (Weighted)
– Reduced weight for concepts farther away from schema root
– Consider match types
– Consider match confidence

[Figure: Schema A and Schema B]
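Both measures can be sketched in a few lines. The base measure follows the definition on the slide. For the weighted measure the slides name the ingredients (depth, match type, confidence) but give no formula, so the depth weight 1 / (depth + 1), the per-type factors, and all names below are assumptions of this sketch.

```python
from dataclasses import dataclass

# Assumed factors per match type; the slides name the types but give no numbers.
TYPE_FACTOR = {"equivalent": 1.0, "superconcept": 0.7, "related": 0.4}


@dataclass(frozen=True)
class Match:
    target_concept: str
    match_type: str    # key of TYPE_FACTOR
    confidence: float  # in [0, 1]


def base_utility(target_concepts, matches):
    """# target concepts with a correspondence in the source / # target concepts."""
    covered = {m.target_concept for m in matches} & set(target_concepts)
    return len(covered) / len(target_concepts)


def weighted_utility(target_depths, matches):
    """Coverage where each concept counts 1 / (depth + 1), scaled by type and confidence.

    target_depths maps each target concept to its distance from the schema root.
    """
    weight = {c: 1.0 / (d + 1) for c, d in target_depths.items()}
    best = {}
    for m in matches:
        if m.target_concept in weight:
            score = TYPE_FACTOR[m.match_type] * m.confidence
            best[m.target_concept] = max(best.get(m.target_concept, 0.0), score)
    return sum(weight[c] * s for c, s in best.items()) / sum(weight.values())


# Asymmetry: schema B has 2 concepts, both also present in schema A (5 concepts).
shared = [Match("x", "equivalent", 1.0), Match("y", "equivalent", 1.0)]
util_a_for_b = base_utility(["x", "y"], shared)                 # A as source, B as target
util_b_for_a = base_utility(["x", "y", "z", "u", "v"], shared)  # B as source, A as target
```

With these toy schemas the larger schema A covers all of B, while B covers only two of the five concepts of A, which mirrors the directed nature of utility.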


Scenario “Procurement”: Data Source 1

Procurement Department: purchase pencils, paper, toner, envelopes, … = “office supplies”

Target schema: Product (GTIN, Name, Description, Category = "office supplies", Supplier (Name, URL, OrderNo, Price, DeliveryTime))

//Product[Category = “Office Supplies”]

Example instance of the target schema:

<Product>
  <GTIN>01234567891011</GTIN>
  <Name>Typewriter X-1000</Name>
  ...
  <Description>The X-1000 represents the culmination of typewriter development... </Description>
  <Category>Office Supplies</Category>
  <Supplier>
    <Name>Office World</Name>
    <URL>www.officeworld.com</URL>
    <Price>99.99</Price>
    <DeliveryTime>5 min</DeliveryTime>
  </Supplier>
  <Supplier> ... </Supplier>
</Product>
<Product> ...

Data Source S1, Price Search Engine (“PriceSearch”):

Commodity
EAN      Name      Spec  Group
00930…   X-1000…   …     Typewriters
…        …         …     …

avail_at
PID      Price    SID
00930…   109.99   4711
…        …        …

Shop
ID     Name          Address
4711   WriteTypers   www.write...
…      …             …

Base = 7 / 11 ≈ 0.73, Weighted ≈ 0.61



[Figure: target schema (Product with Category = "office supplies") matched against Data Source S2, Grocery Store (“GroSupply”). Source elements: Product, Barcode, Name, Description, Type, Price, Delivery, URL, Address, Phone, Contact; annotation = “groceries”.]

Base = 9 / 11 ≈ 0.82, Weighted ≈ 0.74?



[Figure: target schema (Product with Category = "office supplies") matched against Data Source S3, Office Supply Store (“OfficeWorld”). Source elements: Product, UPC, Name, Information, Price.]

Base = 5 / 11 ≈ 0.45, Weighted ≈ 0.45?


Evaluation of the Basic Measures

• Ranking:

Rank   Source               Base   Weighted
1      Grocery Store (S2)   0.82   0.74
2      Price Search (S1)    0.73   0.61
3      Office Supply (S3)   0.45   0.45

• Basic measures only consider similar concepts
– Instances of concepts can be completely disjoint!

• Utility measure should consider instance properties
– Using constraints
  • Satisfiability is NP-complete
  • Satisfiability does not indicate presence of useful instances
– Using histograms
  • Independent for each atomic feature/attribute
  • No information about the combination of values (complex objects)
  • But useful as a filter: lower weight to 0 if constraint is not satisfied

• Instance-based measure Inst



[Figure: target schema (Product with Category = "office supplies") matched against Data Source S2, Grocery Store (“GroSupply”), now with instance information.]

P = //Product[Category = “office supplies”] =

Histogram for Type:

value         count
“beverages”   45
“cereals”     21
“sweets”      263
...           ...

Inst ≈ 0.3
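The histogram filter ("lower weight to 0 if constraint is not satisfied") can be sketched as follows. The function names and the case-insensitive comparison are assumptions of this sketch, as is the premise that the elided histogram rows contain no “office supplies” entry. The sketch does not attempt to reproduce the Inst values shown on the slides, since the full definition of Inst is not given.

```python
def satisfies(histogram, required_value):
    """True if the histogram records at least one instance with the required value.

    Histogram keys are expected in lower case; comparing case-insensitively
    is an assumption of this sketch.
    """
    return histogram.get(required_value.lower(), 0) > 0


def filtered_weight(weight, histogram, required_value):
    """Lower the weight of a matched concept to 0 if the constraint is not satisfied."""
    return weight if satisfies(histogram, required_value) else 0.0


# Histogram for Type in the grocery store (S2); the slide elides further rows.
type_histogram = {"sweets": 263, "cereals": 21, "beverages": 45}
# Histogram of a Category attribute as shown for the office supply store (S3).
category_histogram = {"office supplies": 3454}

s2_weight = filtered_weight(1.0, type_histogram, "office supplies")
s3_weight = filtered_weight(1.0, category_histogram, "Office Supplies")
```

Because each histogram covers a single attribute, the check says nothing about combinations of values, which is why it serves only as a filter.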


Evaluation of Instance Completeness Measure

• Ranking:

Rank   Source               Inst
1      Price Search (S1)    0.61
2      Office Supply (S3)   0.45
3      Grocery Store (S2)   0.3

• Instance completeness
– Devalues false positives

• What about the Office Supply Store (S3)?



Schema Augmentation

[Figure: target schema (Product with Category = "office supplies") matched against Data Source S3, Office Supply Store (“OfficeWorld”), with source elements Product, UPC, Name, Information, Price, augmented with Category, ShopName, and URL.]

Histograms of the augmented elements:

Element    Value               Count
Category   “office supplies”   3454
ShopName   “Office World”      1
URL        “www.officew...     1

Inst+ ≈ 0.77


Ranking with Augmentation

• Ranking:

Rank   Source               Inst+
1      Office Supply (S3)   0.77
2      Price Search (S1)    0.61
3      Grocery Store (S2)   0.3

• Augmentation and instance completeness reproduce the intuitive ranking


Conclusion

• Data source discovery as a grid-specific problem
– Very large number of data sources
– Only the most useful sources should be considered

• Basic utility measure based on concept coverage
– Use schema matching to identify similar concepts
– Use indirect schema matching during deployment

• Limitations of the basic measure

• Instance completeness
– Use histograms to filter out sources that cannot possibly be useful

• Missing context information in data sources
– Implicitly known in original usage
– Schema augmentation by data provider


Outlook & Open Questions

• Who provides domain schemas?

• Instance-based utility: caveats
– Record matching problem
– “Like schema matching with data” instead of metadata
– Scalability?
– Concept hierarchies on values

• Limitations of the “Top-N” approach
– Sources which are very specific to a subset of concepts might be filtered out
  • Partition target schema
– The best n sources might not provide all concepts
  • Repeat discovery with the missing concepts only
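The idea of repeating discovery with only the missing concepts can be sketched as a loop around a top-n selection. The data layout (each source mapped to the set of target concepts it matches), the coverage-count ranking, the function names, and the round limit are assumptions of this sketch, not the method presented in the talk.

```python
def discover(target_concepts, sources, n):
    """Return up to n sources covering the most of the given target concepts."""
    wanted = set(target_concepts)
    ranked = sorted(sources, key=lambda s: len(sources[s] & wanted), reverse=True)
    return [s for s in ranked[:n] if sources[s] & wanted]


def discover_until_covered(target_concepts, sources, n, max_rounds=5):
    """Repeat top-n discovery, each round asking only for still-missing concepts."""
    missing = set(target_concepts)
    selected = []
    for _ in range(max_rounds):
        if not missing:
            break
        remaining = {s: c for s, c in sources.items() if s not in selected}
        found = discover(missing, remaining, n)
        if not found:
            break  # no remaining source offers any missing concept
        selected.extend(found)
        for s in found:
            missing -= sources[s]
    return selected, missing


# Hypothetical sources, each mapped to the target concepts it matches.
sources = {
    "S1": {"Product", "Name", "Price", "Supplier"},
    "S2": {"Product", "Name", "Price"},
    "S3": {"DeliveryTime"},
}
target = {"Product", "Name", "Price", "Supplier", "DeliveryTime"}

selected, missing = discover_until_covered(target, sources, n=1)
```

In this toy setup a plain top-1 selection would never return the highly specific source S3; the second round, restricted to the missing concept, picks it up.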


Thank you!

Questions?
