Optimizing Recursive Information Gathering Plans Eric Lambrecht, Subbarao Kambhampati Senthil...

Post on 21-Dec-2015


Optimizing Recursive Information Gathering Plans

Eric Lambrecht, Subbarao Kambhampati

Senthil Gnanaprakasam

Arizona State University

Tempe, USA

rakaposhi.eas.asu.edu/yochan.html

Information Gathering

[Figure: user, Gatherer, wrappers, and sources (HTML page, CGI script, database)]

Build query plan using source inversion

Logical Optimizations: Redundancy removal

Execution Optimizations: Source call ordering

Execute query plan

[Duschka (with Genesereth & Levy) 97]

EMERAC Query Planning System

[Optimization steps]

Organization

•Optimization challenges in EMERAC

•Building Source Complete Plans: Review

•Logical optimization

•Minimization of recursive IG plans by removing redundant source calls

•Execution optimization

•Ordering source calls to minimize both access and tuple-transfer costs

•Implementation and Results

•Contributions

Modeling Information Gathering

Information sources:

•relational

•answer ‘select’ queries (possibly a restricted set of query patterns)

•autonomous

World model:

•relational

Query on the world model:

Reformulate the query as calls on information sources. Optimize. Execute.

[Figure: user, Gatherer, wrappers, and sources (HTML page, CGI script, database)]

“Local as View” model

Modeling Sources

Sources related to world model by describing them as views over world model:

Required binding: the $ prefix marks an attribute that must be bound

movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)

house-of-movies($X, Y) -> title-time(X, Y), title-actor(X, Z)

query(X, Y) :- title-time(X, Y)

Optimization challenges in EMERAC

Traditional:

• Each relation is exported in toto by a single database

• All sources are assumed to be fully relational

Information Gathering:

• Multiple sources export partial and overlapping portions of a relation

– Need to minimize plans to remove redundancy

• Sources are rarely fully relational

– Only limited types of queries allowed

• Wrapped web-pages

• Form-interfaced databases

• Certain forms of join computation may be precluded

– Need to model query capabilities

Optimization challenges in EMERAC [Continued]

Traditional:

• Tuple-transfer costs are assumed to dominate the query-execution costs

– Use of “Bound-is-easier” assumption

• Assume availability of full source statistics

– Selectivity indices, histograms etc.

Information Gathering:

• Access cost & source latencies tend to equal or dominate the transfer cost

– Need to consider number of source calls

– Need for considering bushy joins (instead of just left-linear join trees)

• Full statistics are rarely available about internet sources

– Sources are decentralized and autonomous

– Difficult to do systematic optimization

Source Access Limitations

• Sources can have a variety of access limitations

– Form-interfaced databases may require certain attributes to be bound

• Whitepages may require the name of the person

– To get the numbers of a set of n people, we will have to access the source n times

– and may be unable to handle bindings of other attributes

• A Whitepages database may not take the address of a person as a bound attribute

– To get the number of John Doe, who lives on Lemon St, we will have to get the numbers of all John Does, and locally filter out the ones not living on Lemon Street

– Wrapped web-pages cannot select over any attributes

Representing Source Access Limitations

• Use annotations on the attributes of the source relation

– “$” annotation identifies attributes that must be bound

– “%” annotation identifies un-selectable attributes

• S($X,%Y,Z)

– A form-interfaced web-page that requires bindings for X and is able to do selections only on Z.

• $ and % annotations help identify feasible binding patterns for sources

– S^b-- are feasible; S^f-- are infeasible;

– S^bbf must be modeled as S^bff filtered locally with binding on Y
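The feasibility rules on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not EMERAC code: the helper names `parse_signature` and `plan_call` are hypothetical, and a binding pattern is written as a string of `b` (bound) and `f` (free) characters, one per attribute.

```python
def parse_signature(sig):
    """Split an annotated signature such as 'S($X,%Y,Z)' into its parts."""
    name, rest = sig.split("(", 1)
    attrs = [a.strip() for a in rest.rstrip(")").split(",")]
    must_bind = [a.startswith("$") for a in attrs]      # "$": must be bound
    unselectable = [a.startswith("%") for a in attrs]   # "%": source cannot select on it
    return name.strip(), must_bind, unselectable


def plan_call(sig, pattern):
    """Return (pattern sent to the source, positions filtered locally), or None if infeasible."""
    _, must_bind, unselectable = parse_signature(sig)
    if len(pattern) != len(must_bind):
        raise ValueError("pattern length does not match the source arity")
    # A "$" attribute left free makes the call infeasible.
    if any(req and p == "f" for req, p in zip(must_bind, pattern)):
        return None
    # A binding on a "%" attribute is not sent to the source; it is applied locally.
    sent = "".join("f" if (p == "b" and unsel) else p
                   for p, unsel in zip(pattern, unselectable))
    local = [i for i, (p, unsel) in enumerate(zip(pattern, unselectable))
             if p == "b" and unsel]
    return sent, local


print(plan_call("S($X,%Y,Z)", "bbf"))  # pattern bbf is sent as bff, Y filtered locally
print(plan_call("S($X,%Y,Z)", "fbb"))  # X is free, so the call is infeasible
```

Returning the locally filtered positions separately keeps the source call and the post-processing step distinct, which mirrors the slide's "S^bff filtered locally" wording.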

Properties of optimal information gathering plans

• Source-complete: no other plan returns more information using the available sources

• Source-minimal: a plan for which no information source can be removed, yet the plan returns the same answer.

• Access-cost minimal: a plan which reduces the number of separate accesses to individual sources

• Bandwidth-minimal: a plan that, when executed, transfers the smallest amount of data over the network yet is still source complete

Ensuring properties of optimal information gathering plans

Build query plan [Source completeness]

Logical Optimizations [Source-minimality]

Execution Optimizations [Access cost and bandwidth minimality]

Execute query plan

Building Source Complete Plans

movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)

house-of-movies($X, Y) -> title-time(X, Y), title-actor(X, Z)

Source Inversion Rules [Duschka, Genesereth 97]

title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)

Binding restrictions lead to recursion in the plan
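To see why the binding restriction makes the plan recursive, the following Python sketch evaluates the plan bottom-up. It is an illustration under assumptions, not the EMERAC executor: `run_recursive_plan` is a hypothetical name, the source contents are made up, and house-of-movies is simulated by a lookup function that only answers when its first argument is bound.

```python
def run_recursive_plan(movie_hut_rows, house_of_movies_lookup):
    """Naive bottom-up evaluation of the inverse-rule plan for query(X, Y)."""
    title_time = set(movie_hut_rows)                   # title-time(X, Y) :- movie-hut(X, Y)
    dom = {v for row in movie_hut_rows for v in row}   # dom(X), dom(Y) :- movie-hut(X, Y)
    called = set()
    while dom - called:
        for x in dom - called:                         # each new constant is tried as a binding
            called.add(x)
            for y in house_of_movies_lookup(x):        # house-of-movies needs X bound
                title_time.add((x, y))                 # title-time(X, Y) :- dom(X), house-of-movies(X, Y)
                dom.add(y)                             # dom(Y) :- dom(X), house-of-movies(X, Y)
    return title_time, len(called)


# Made-up data: "Brazil" is never retrieved because no binding for it is ever derived.
movie_hut = {("Alien", "7pm")}
house_of_movies = {"Alien": ["9pm"], "Brazil": ["8pm"]}
answers, calls = run_recursive_plan(movie_hut, lambda x: house_of_movies.get(x, []))
print(sorted(answers), calls)
```

Every constant that enters dom triggers one more call to the bound-argument source, which is why removing a redundant recursive source can save many accesses.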

Problems with Plans derived from source inversion rules

title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)

If both movie-hut and house-of-movies have the same information:

• both sources are not necessary

• the recursion is not necessary

Every source that is remotely relevant to the query is made part of the plan

•Many of these sources may be overlapping

Minimizing information gathering plans

Model source overlaps

– Use LCW statements

Rewrite the source-complete plan:

– Greedily remove rules from plan with uniform equivalence and LCW statements (= make the plan source-minimal)

• Uniform containment checks [Sagiv, 88]

• Use heuristics to guide removal and pull out recursion first

LCW Statements

View: movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)

LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)

To check if one rule, r1, with information source predicates contains another rule, r2, see if

r1[s ↦ s ∨ l] contains r2[s ↦ s ∧ v]

(v: the view description of source s; l: the LCW statement of source s)

[Etzioni et al 97], [Duschka 97]

Inter-source subsumption relations [Mirror sources] can also be handled

Uniform Equivalence

Equivalence:

• Two datalog programs X and Y are equivalent if, for every set of extensional predicates, the two programs produce the same output.

• Undecidable

Uniform Equivalence:

• X and Y are uniformly equivalent if, for every set of extensional and intensional predicates, the two programs produce the same output

• Decidable

• Implies equivalence [Sagiv 88]

Testing for Uniform Containment

Does

p(X, Y) :- q(X, Y)
q(X, Y) :- r(X, Y)

uniformly contain

p(W, X) :- r(W, X) ?

Assert r(“W”, “X”) and try to derive p(“W”, “X”)
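The test on this slide (freeze the body of the rule into facts, run the program, look for the frozen head) can be sketched with a tiny bottom-up Datalog evaluator. This is a minimal sketch under assumptions: atoms are `(predicate, args)` tuples, variables are strings starting with an uppercase letter, rules are safe and function-free, and all function names are hypothetical. It requires Python 3.8 or later.

```python
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()


def match(atom, fact, env):
    """Extend env so that atom equals fact, or return None if impossible."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    env = dict(env)
    for a, f in zip(args, fargs):
        if is_var(a):
            if env.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return env


def substitute(atom, env):
    pred, args = atom
    return (pred, tuple(env.get(a, a) for a in args))


def fixpoint(rules, facts):
    """Apply the rules bottom-up until no new fact is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            envs = [{}]
            for atom in body:
                envs = [e2 for e in envs for f in facts
                        if (e2 := match(atom, f, e)) is not None]
            for env in envs:
                new = substitute(head, env)
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts


def uniformly_contains(program, rule):
    """True if the frozen head of `rule` is derivable from its frozen body."""
    head, body = rule
    # Freeze each variable into a fresh constant ("#W" is not treated as a variable).
    frozen = {t: "#" + t for atom in [head, *body] for t in atom[1] if is_var(t)}
    facts = {substitute(a, frozen) for a in body}
    return substitute(head, frozen) in fixpoint(program, facts)


program = [(("p", ("X", "Y")), [("q", ("X", "Y"))]),
           (("q", ("X", "Y")), [("r", ("X", "Y"))])]
rule = (("p", ("W", "X")), [("r", ("W", "X"))])
print(uniformly_contains(program, rule))
```

For the slide's example the frozen fact r(#W, #X) yields q(#W, #X) and then p(#W, #X), so the containment holds; the naive join over all facts is fine for a sketch but is not how a production evaluator would do it.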

Greedily Minimizing Information Gathering Plans

Remove non-recursive IDB predicates

Sort the rules so those with dom predicates come before those without dom predicates

for each rule r do

    let r be a rule of P that has not yet been considered

    let P̂ be the program obtained by deleting rule r from P

    if P̂[s ↦ s ∨ l] uniformly contains r[s ↦ s ∧ v] then

        replace P with P̂. Prune unreachable rules.

Source costs can be used

Uniform containment check is exponential in the worst case

Minimization example

title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)

LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)

EMERAC

Build query plan [Source completeness]

Logical Optimizations [Source-minimality]

Execution Optimizations [Access cost and bandwidth minimality]

Execute query plan

Issues in ordering source calls

• Execution cost is a function of both access cost and the tuple-transfer cost (ignoring local processing costs…)

• Tension between access costs & traffic costs

– E.g. Execute “S1(W,X) & S2(X,Y)” where the query binds W

– Tuple-transfer cost reduction motivates calling sources with the least general binding patterns possible

• Bound-is-easier (S1 first, and then feed X bindings to S2)

– Access cost reduction motivates calling sources with the most general binding patterns possible

• Feeding X bindings for S2 will generate many separate accesses, increasing the access cost

\[ \text{Minimize} \quad \sum_{s} \left( C_a^{s} \cdot n_s \; + \; C_t^{s} \cdot D_s \right) \]

(first term: access cost; second term: transfer cost)
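The tension between access cost and transfer cost can be made concrete with a small Python sketch of the objective: total cost is the sum, over sources, of access cost times number of calls plus per-tuple transfer cost times tuples shipped. The field names, the reading of n as call count and D as tuple count, and every number below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceUsage:
    access_cost: int     # C_a: cost of one call to the source (assumed unit)
    calls: int           # n: number of separate calls made
    transfer_cost: int   # C_t: cost per tuple shipped (assumed unit)
    tuples: int          # D: tuples transferred in total


def execution_cost(usages):
    """Access cost plus transfer cost, summed over all sources in the plan."""
    return sum(u.access_cost * u.calls + u.transfer_cost * u.tuples for u in usages)


# Plan A (bound-is-easier): call S1 once, then call S2 once per X binding (50 here).
plan_a = [SourceUsage(100, 1, 1, 50), SourceUsage(100, 50, 1, 200)]
# Plan B: call S1 and S2 once each with general patterns and join locally.
plan_b = [SourceUsage(100, 1, 1, 50), SourceUsage(100, 1, 1, 3000)]

print(execution_cost(plan_a), execution_cost(plan_b))
```

With these made-up numbers the single general call to S2 ships far more tuples yet is cheaper overall because it avoids 49 extra accesses; with a lower access cost the comparison would flip.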

Our Approach: Assumptions

• Exact optimization is not worth it…

– Lack of full source statistics

– NP-hardness of the optimization problem

• Join-ordering, which is a special case, is already NP-Complete

• Source access costs dominate tuple-transfer costs by default

– Reasonable given the large setup and latency costs for internet sources

Our Approach: Overview

• A greedy approach (along the lines of “bound-is-easier” type procedures)

• By default, attempts to access each source with the most general feasible binding pattern

– Reasonable given the assumption that access costs dominate transfer costs

• The default is over-ridden if a binding pattern is known to produce too much traffic

– Binding patterns producing high traffic are stored in a table called HTBP

• Implicitly produces bushy join trees

The HTBP Table

• The HTBP table contains, for every source S, the least general binding patterns of S which are known to produce “high” traffic

– A call to source S with binding pattern B is considered high-traffic producing, if HTBP contains S^B’ and B is either equal to or more general than B’

– E.g. Book(Author,Title,ISBN,Subj,Price,Pages)

• HTBP may contain all binding patterns that do not bind at least one of the first four attributes

– Book^ffffbb listed explicitly in HTBP

– Book^fffffb, Book^ffffbf, Book^ffffff would be considered to be implicitly in HTBP

• Advantage: HTBP should be easy to specify even if full source statistics are not available
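The membership rule for HTBP (a pattern counts as high-traffic if it is equal to or more general than a listed pattern) can be sketched as follows. The function names and the dictionary layout are assumptions for illustration; patterns are strings of `b` and `f`, and "more general" is read as "binds a subset of the positions".

```python
def at_least_as_general(pattern, listed):
    """True if `pattern` binds no position that `listed` leaves free."""
    return len(pattern) == len(listed) and all(
        p == "f" or q == "b" for p, q in zip(pattern, listed))


def is_high_traffic(source, pattern, htbp):
    """True if the pattern is listed in HTBP explicitly or implicitly."""
    return any(at_least_as_general(pattern, listed)
               for listed in htbp.get(source, ()))


# Book(Author,Title,ISBN,Subj,Price,Pages): only ffffbb is stored explicitly.
htbp = {"Book": ["ffffbb"]}
for bp in ["ffffbb", "fffffb", "ffffbf", "ffffff", "bfffff"]:
    print(bp, is_high_traffic("Book", bp, htbp))
```

Storing only the least general patterns keeps the table small: the three more general patterns from the slide are recognised without being listed, while a call that binds Author is not flagged.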

The Algorithm

For each stage i from 1 to m do

    For each unchosen subgoal S

        Pick the most general & feasible BP B of S w.r.t. V & FBP such that B is not in HTBP.

        If such a B exists, [Default case: Reduce accesses]

            Push S^B into C[i]. Mark S chosen. Add all variables of S to V

        If no such B exists, but there is a feasible binding pattern for S [HTBP case: Reduce transfer costs]

            Pick the BP B’ with most bound variables (in terms of #(.))

            Push S^B’ into P[i]

    If no subgoal has been chosen at this level (C[i] is empty), and there are some postponed sources (P[i] is non-empty)

        Choose S_k^B in P[i] with the maximum #(B) value

        Push S_k^B into C[i]. Add all variables of S_k to V

Return the array C[1…m]

Example

•Sources: DP(A:Author,T:Title,Y:Year)

SM98(T:Title,U:URL)

•Query: Q(A,T,U,1998)

•Plan: Q(A,T,U,1998) :- DP(A,T,1998) & SM98(T,U)

(XX marks a crossed-out candidate)

HTBP: {DP^bbb, SM98^bb} (Bound-is-easier)

Step 1. V={Y}
Cand: DP^fff XX, DP^ffb XX, SM98^ff XX
P[1] = {DP^ffb, SM98^ff}
C[1] = DP^ffb

Step 2. V={A,T,Y}
Cand: SM98^ff XX, SM98^bf XX
P[2] = {SM98^bf}
C[2] = SM98^bf

HTBP: {DP^ffb}

Step 1. V={Y}
Cand: DP^fff XX, DP^ffb XX, SM98^ff
C[1] = SM98^ff

Step 2. V={Y, U, T}
Cand: DP^fff XX, DP^ffb XX, DP^fbf, DP^fbb XX
C[2] = DP^fbf

HTBP: {}

Step 1. V={Y}
Cand: DP^fff, DP^ffb, SM98^ff
C[1] = SM98^ff, DP^fff
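The three traces in this example can be reproduced with a short Python sketch of the greedy ordering procedure. It is a reading of the slides, not the EMERAC implementation: all names are hypothetical, patterns are `b`/`f` strings, candidates in a stage are computed against the variables known at the start of that stage (an assumption that matches the traces), and `required` stands in for the `$` annotations (empty here).

```python
from itertools import product


def bound_count(bp):
    return bp.count("b")


def high_traffic(name, bp, htbp):
    # bp is high-traffic if it is equal to or more general than a listed pattern.
    return any(all(p == "f" or q == "b" for p, q in zip(bp, listed))
               for listed in htbp.get(name, ()))


def feasible_patterns(args, required, known):
    """Patterns binding every required position and only positions with known values."""
    options = []
    for i, arg in enumerate(args):
        if i in required:
            if arg not in known:
                return []
            options.append("b")
        else:
            options.append("fb" if arg in known else "f")
    return ["".join(p) for p in product(*options)]


def order_source_calls(subgoals, bound_vars, htbp, required=None):
    required = required or {}
    known, remaining, stages = set(bound_vars), dict(subgoals), []
    while remaining:
        chosen, postponed = [], []
        for name, args in remaining.items():
            pats = feasible_patterns(args, required.get(name, set()), known)
            ok = [p for p in pats if not high_traffic(name, p, htbp)]
            if ok:        # default case: most general pattern, fewest accesses
                chosen.append((name, min(ok, key=bound_count)))
            elif pats:    # HTBP case: postpone with the most bound pattern
                postponed.append((name, max(pats, key=bound_count)))
        if not chosen:
            if not postponed:
                raise ValueError("no feasible binding pattern for remaining subgoals")
            chosen.append(max(postponed, key=lambda c: bound_count(c[1])))
        for name, _ in chosen:
            known.update(remaining.pop(name))
        stages.append(chosen)
    return stages


subgoals = {"DP": ("A", "T", "Y"), "SM98": ("T", "U")}
print(order_source_calls(subgoals, {"Y"}, {"DP": ["bbb"], "SM98": ["bb"]}))
print(order_source_calls(subgoals, {"Y"}, {"DP": ["ffb"]}))
print(order_source_calls(subgoals, {"Y"}, {}))
```

Subgoals chosen in the same stage do not depend on each other's bindings, which is how the procedure ends up with bushy rather than left-linear plans when HTBP is empty.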

Implementation

The Emerac Information Gatherer

•written in Java

•incorporates rewriting and execution ordering techniques

•executes plans in parallel

•returns partial results during plan execution

•object-oriented design makes it easy to modify

Experiments

• Experimented with simulated sources derived from DBLP data

– Our minimization approach reduces access costs by removing redundant recursive sources

• Minimization cost offset by the improvements in execution time

– Our source ordering approach tended to reduce the total cost over the bound-is-easier approach whenever there were a significant number of binding patterns that are not subsumed by HTBP

LCW vs. Naïve [Artificial Sources]

[Figure: x-axis: # redundant constrained sources (1 to 16); y-axis: Time to plan & execute (ms), log scale, 1.00E+03 to 1.00E+05; series: Naïve d=1, LCW d=1, Naïve d=3, LCW d=3, Naïve d=5, LCW d=5]

LCW vs. Naïve [DBLP Sources]

[Figure: x-axis: # redundant constrained sources (1 to 8); y-axis: Time to plan & execute (ms), log scale, 1.00E+03 to 1.00E+08; series: Naive 256 (1), LCW 256 (1), Naive 256 (3), LCW 256 (3); annotation: Graceful degradation]

Contributions

•An approach for minimizing recursive information gathering plans

•An approach for ordering source calls in information gathering plans

•Attempts at minimizing both access cost and tuple-transfer cost

•Implementation & Evaluation in EMERAC

Current directions

• Integrate minimization & source-call ordering phases

• Model cost-quality tradeoffs

• Handling run-time exceptions

– unavailability of sources etc.

• Tracking time and solution quality statistics

– Improve the granularity of the HTBP table