Page 1: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University.

Merging Ranks from Heterogeneous Internet Sources

Hector Garcia-Molina

Luis Gravano

Stanford University

Page 2:

Users Have Many Available Information Sources

User Query: “Houses near Palo Alto for around $300K.”

Query Results:

Source 1: h_11, h_12, h_13, ...

Source 2

...

Nothing!

Page 3:

Challenges

• Sources are too numerous
• Sources are heterogeneous (query language, model, results)
• Users want a single query result

Page 4:

Metasearcher

• Selects the good sources for a query
• Extracts and combines the query results from the sources

Page 5:

Text Sources Rank Query Results

Query: “Distributed Databases”

Text Source:

Doc 1: 0.8
Doc 2: 0.6
...

Page 6:

Structured Sources on the Internet also Rank Results

A real-estate agent receives queries on Location and Price:

Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”

Page 7:

The Agent Ranks its Houses Based on its Own Scoring Function

Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”

Rank  House ID  Source Score  Location       Price
1     MV1       0.43          Mountain View  $350K
2     MV2       0.42          Mountain View  $360K
3     PA1       0.28          Palo Alto      $600K

Page 8:

A Metasearcher then Faces Two Problems

• Extracting the top objects from the underlying sources
• Merging the results from the various sources

Page 9:

Merging Query Results is Easy with Enough Information

Given a record like:

Rank  House ID  Source Score  Location       Price
1     MV1       0.43          Mountain View  $350K

the metasearcher ignores the Source score and computes its Target score from the Location and Price.
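
One way to picture this merging step is the sketch below. It assumes every source returns the Location and Price along with its own score. The `Record` type, the second agent with its house `X1`, and the scoring rule inside `target_score` are all hypothetical; they are not taken from the slides, so the scores they produce do not reproduce the Target scores shown in the talk.

```python
from dataclasses import dataclass


@dataclass
class Record:
    source: str          # which agent returned the record
    house_id: str
    source_score: float  # the agent's own score; ignored when merging
    location: str
    price_k: int         # price in thousands of dollars


def target_score(rec, want_location="Palo Alto", want_price_k=300):
    """Hypothetical metasearcher scoring rule (an assumption, not from the talk)."""
    loc = 1.0 if rec.location == want_location else 0.5
    price = max(0.0, 1.0 - abs(rec.price_k - want_price_k) / want_price_k)
    return max(loc, price)


def merge(*result_lists):
    """Pool the records of all sources and rank them by Target score only."""
    pooled = [rec for results in result_lists for rec in results]
    return sorted(pooled, key=target_score, reverse=True)


agent_a = [
    Record("A", "MV1", 0.43, "Mountain View", 350),
    Record("A", "MV2", 0.42, "Mountain View", 360),
    Record("A", "PA1", 0.28, "Palo Alto", 600),
]
agent_b = [Record("B", "X1", 0.90, "Menlo Park", 310)]  # hypothetical second source

ranking = [rec.house_id for rec in merge(agent_a, agent_b)]
print(ranking)
```

Because the Source scores never enter `target_score`, scores from different agents do not need to be comparable with each other.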

Page 10:

Extracting the Top Objects from a Source is Hard

The metasearcher’s scoring function might be different from the source’s!

Rank  House ID  Target Score  Location       Price
1     PA1       1             Palo Alto      $600K
2     MV1       0.51          Mountain View  $350K
3     MV2       0.5           Mountain View  $360K
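
The mismatch between the two rankings can be checked directly with the scores from the Source and Target tables in the slides. The helper names `ranking` and `prefix_needed` are illustrative only.

```python
# (house id, Source score, Target score), copied from the two tables in the slides.
houses = [("MV1", 0.43, 0.51), ("MV2", 0.42, 0.50), ("PA1", 0.28, 1.00)]


def ranking(rows, column):
    """House ids sorted by the given score column, best first."""
    return [row[0] for row in sorted(rows, key=lambda row: row[column], reverse=True)]


def prefix_needed(rows, k=1):
    """Smallest number of top Source objects that contains the top-k Target objects."""
    by_source = ranking(rows, 1)
    wanted = set(ranking(rows, 2)[:k])
    return max(by_source.index(h) for h in wanted) + 1


print(ranking(houses, 1))     # order by Source score
print(ranking(houses, 2))     # order by Target score
print(prefix_needed(houses))  # how deep into the Source ranking the best Target object sits
```

In this three-house example the metasearcher's best house is the agent's last one, so asking the agent for its top object alone would miss it.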

Page 11:

We Want to Avoid Extracting All the Source’s Contents

Assume a house h with:

• Source(Q, h) = 0 (worst for source)
• Target(Q, h) = 1 (best for metasearcher)

Problem!

Page 12:

The Example Query is Not Manageable at the Agent

A query Q is manageable at a source if ∃ ε < 1 such that, for every object h:

Source(Q, h) ≥ Target(Q, h) - ε

[Figure: plot with axes Source and Target, running from (0,0) to (1,1).]
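
A minimal sketch of this condition, assuming it reads Source(Q, h) ≥ Target(Q, h) - ε for every object h. The function names are hypothetical. Since manageability is a statement about all possible objects, a finite sample of observed scores can only give a lower bound on ε; it cannot prove that a query is manageable.

```python
def epsilon_lower_bound(pairs):
    """Smallest eps with Source >= Target - eps on the given (Source, Target) pairs."""
    return max(0.0, max((target - source for source, target in pairs), default=0.0))


def looks_manageable(pairs):
    """True if the sample alone does not already force eps >= 1."""
    return epsilon_lower_bound(pairs) < 1.0


# (Source, Target) scores of the three houses in the slides' example.
observed = [(0.43, 0.51), (0.42, 0.50), (0.28, 1.00)]

# Adding the problematic house with Source = 0 and Target = 1.
worst_case = observed + [(0.0, 1.0)]

print(epsilon_lower_bound(observed), looks_manageable(observed))
print(epsilon_lower_bound(worst_case), looks_manageable(worst_case))
```

The three observed houses alone are consistent with some ε below 1, but a single house with Source 0 and Target 1 pushes ε to 1, which is the situation that makes the example query unmanageable.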

Page 13:

Single-Attribute Queries Are More Likely to be Manageable

Single-attribute queries for Q:

• Q_1: Location = Palo Alto
• Q_2: Price = $300K

Page 14:

The Example Becomes Tractable!

… if the top Target objects for Q are among the top Source objects for Q_1 and Q_2

Page 15:

A Cover Bounds the Target Scores for Q

Q_1, …, Q_m single-attribute queries form a cover for Q if ∃ g_1, …, g_m, G such that:

Target(Q_i, h) ≤ g_i for i = 1, …, m  ⇒  Target(Q, h) ≤ G
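
As an illustration, the sketch below assumes that Target(Q, h) is either the Min or the Max of the single-attribute Target scores, the two combinations named in the performance slides. Under that assumption the bound G can be obtained by applying the same combination to the thresholds g_1, …, g_m. The names `cover_bound` and `check_cover` are hypothetical.

```python
import random


def cover_bound(combine, g):
    """Bound G for thresholds g when Target(Q, h) = combine(single-attribute scores).

    Valid for monotone combinations such as min and max.
    """
    return combine(g)


def check_cover(combine, g, trials=10_000, seed=0):
    """Sample objects with Target(Q_i, h) <= g_i and test Target(Q, h) <= G."""
    rng = random.Random(seed)
    G = cover_bound(combine, g)
    for _ in range(trials):
        scores = [rng.uniform(0.0, g_i) for g_i in g]  # each score is at most g_i
        if combine(scores) > G:
            return False
    return True


g = [0.3, 0.7]
print(cover_bound(min, g), check_cover(min, g))
print(cover_bound(max, g), check_cover(max, g))
```

The random check is only a sanity test; for min and max the implication follows directly from monotonicity.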

Page 16:

Having a Manageable Cover for a Query is Sufficient...

Manageable Cover for query Q at source S  ⇒  “Efficient” Executions Possible at S

Page 17:

Having a Manageable Cover for a Query is Sufficient...

(1) Pick a manageable cover C = {Q_1, ..., Q_m} for Q at S
(2) For i = 1 to m: Find ε_i for Q_i
(3) Pick 0 ≤ g_1, ..., g_m, G < 1 for cover C
(4) For i = 1 to m
(5)     Retrieve all objects t with Source(Q_i, t) ≥ G_i = g_i - ε_i
(6) Compute Target(Q, t) for all objects t retrieved
(7) If ∃ i such that G_i ≤ 0 Then Go to Step (11)
(8) If for all t retrieved, Target(Q, t) < G Then
(9)     Find new, lower 0 ≤ g_1, ..., g_m, G < 1 for C
(10)    Go to Step (4)
(11) Output those objects retrieved with the highest Target score
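
The steps above can be sketched in Python as follows. This is a sketch under several assumptions that the slides do not state: the source is simulated by in-memory dictionaries, Target(Q, t) is the Min or Max of the single-attribute Target scores, every g_i takes the same value, and the thresholds are lowered by a fixed `step`. The names `extract_top` and `make_objects` are hypothetical, and only the single best object is returned.

```python
import random


def extract_top(source, target, combine, eps, g_start=0.9, step=0.1):
    """Return (best object id, its Target score, number of objects retrieved).

    source[t][i] and target[t][i] hold Source(Q_i, t) and Target(Q_i, t);
    combine (min or max) turns the m single-attribute scores into Target(Q, t).
    """
    m = len(eps)
    g = g_start                                   # step (3): one g for every Q_i
    retrieved = set()
    while True:
        G = combine([g] * m)                      # cover bound, valid for min and max
        thresholds = [g - e for e in eps]         # G_i = g_i - eps_i
        for i, G_i in enumerate(thresholds):      # steps (4)-(5)
            retrieved |= {t for t, s in source.items() if s[i] >= G_i}
        scores = {t: combine(target[t]) for t in retrieved}  # step (6)
        if any(G_i <= 0 for G_i in thresholds):   # step (7): everything was retrieved
            break
        if scores and max(scores.values()) >= G:  # test of step (8) fails: done
            break
        g -= step                                 # steps (9)-(10): lower and retry
    if not scores:
        return None, None, 0
    best = max(scores, key=scores.get)            # step (11)
    return best, scores[best], len(retrieved)


def make_objects(n, m, eps, seed=0):
    """Synthetic scores that satisfy Source(Q_i, t) >= Target(Q_i, t) - eps_i."""
    rng = random.Random(seed)
    target = {t: [rng.random() for _ in range(m)] for t in range(n)}
    source = {
        t: [max(0.0, x - rng.uniform(0.0, eps[i])) for i, x in enumerate(xs)]
        for t, xs in target.items()
    }
    return source, target


eps = [0.1, 0.1]
source, target = make_objects(1000, 2, eps)
best, score, n_retrieved = extract_top(source, target, min, eps)
print(best, score, n_retrieved)
```

The loop stops as soon as some retrieved object reaches the bound G: every object that was not retrieved scores below each g_i and therefore at most G, so it cannot beat the best retrieved one.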

Page 18:

Algorithm to Extract Top Target Objects

[Figure: score scales from 0 to 1 for Q_1 and Q_2, with thresholds g_1 and g_2; label: Target(Q, h) ≤ G.]

Page 19:

Algorithm to Extract Top Target Objects

[Figure: score scales from 0 to 1 for Q_1 and Q_2, with lowered thresholds g_1’ and g_2’ and an object h’; labels: Target(Q, h) ≤ G’ and Target(Q, h’) ≥ G’!]

Page 20:

Preliminary Performance Results for our Algorithm

• Target=Min: 14% objects retrieved
• Target=Max: 4% objects retrieved

10,000 objects, 4 query attributes, ε = 0

Page 21:

Preliminary Performance Results for our Algorithm

• Target=Min: 25% objects retrieved
• Target=Max: 44% objects retrieved

10,000 objects, 4 query attributes, ε = 0.10

Page 22:

Having a Manageable Cover for a Query is Also Necessary...

No Manageable Cover for query Q at source S  ⇒  Efficient Executions Impossible at S

Page 23:

A Manageable Cover is Necessary: Proof

Consider Q_1, Q_2, Q_3 minimal cover for Q with: Q_1, Q_2 manageable, Q_3 not manageable

For any “efficient” execution, build h such that:

• h is not retrieved
• Target(Q, h) > G = max{Target(Q, o) | o retrieved}

Page 24:

A Manageable Cover is Necessary: Proof

[Figure: score scales from 0 to 1 for Q_1, Q_2 and Q_3, with thresholds g_1, g_2 and g_3.]

Page 25:

[Figure, continued: an object h’ marked three times; label: Target(Q, h’) > G!]

Page 26:

[Figure, continued: objects h and h’ marked on the scales; labels: Target(Q, h) > G!, Target(Q_3, h), Target(Q, h’), Target(Q, h’) > G.]

Page 27:

We Studied Two Metasearching Problems

• Extracting the top objects from the underlying sources
• Merging the results from the various sources

Page 28:

Related Work: Collection Fusion

• Voorhees et al.
• Callan/Lu/Croft
• Gauch/Wang