Kossmann Survey Cs561

7/30/2019 Kossmann Survey Cs561

1/31

Distributed Query Processing

Based on The state of the art in distributed query processing

Donald Kossmann (ACM Computing Surveys, 2000)


2/31

Motivation

Cost and scalability: network of off-the-shelf machines

Integration of different software vendors (each with its own DBMS)

Integration of legacy systems

Applications inherently distributed, such as workflow or collaborative design

State-of-the-art distributed information technologies (e-businesses)


3/31

Part 1 : Basics

Query Processing Basics

centralized query processing

distributed query processing


4/31

Problem Statement

Input: Query, such as "Biological objects in study A referenced in literature in journal Y."

Output: Answer

Objectives:

response time, throughput, first answers, little IO, ...

Centralized vs. Distributed Query Processing

same basic problem

but with more and different parameters (such as data sites or available machine power) and objectives


5/31

Steps in Query Processing

Input: Declarative Query

SQL, XQuery, ...

Step 1: Translate Query into Algebra

Tree of operators (query plan generation)

Step 2: Optimize Query

Tree of operators (logical); also select partitions of table

Tree of operators (physical); also site annotations (Compilation)

Step 3: Execution

Interpretation; Query result generation


6/31

Algebra

relational algebra for SQL very well understood

algebra for XQuery mostly understood

SELECT A.d
FROM A, B
WHERE A.a = B.b
AND A.c = 35

Corresponding operator tree (root first):

project A.d
  select A.a = B.b, A.c = 35
    X (cross product)
      A
      B


7/31

Query Optimization

logical, e.g., push down cheap predicates

enumerate alternative plans, apply cost model

use search heuristics to find cheapest plan

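Logical plan (root first):

project A.d
  select A.a = B.b, A.c = 35
    X (cross product)
      A
      B

Physical plan (root first):

project A.d
  hashjoin
    index A.c
    project B.b
      B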


8/31

Basic Query Optimization

Classical Dynamic Programming algorithm

Performs join order optimization

Input : Join query on n relations

Output : Best join order


9/31

The Dynamic Prog. Algorithm

for i = 1 to n do {
    optPlan({Ri}) = accessPlans(Ri)
    prunePlans(optPlan({Ri}))
}
for i = 2 to n do
    for all S ⊆ {R1, R2, ..., Rn} such that |S| = i do {
        optPlan(S) = ∅
        for all O ⊂ S do {
            optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S - O))
            prunePlans(optPlan(S))
        }
    }
return optPlan({R1, R2, ..., Rn})
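The dynamic programming idea can be sketched in Python under some simplifying assumptions that are not in the slides: the cost of a plan is taken to be the sum of its estimated intermediate result sizes, pruning keeps only the single cheapest plan per subset, and the names dp_join_order, cards, and selectivity are hypothetical.

```python
from itertools import combinations


def dp_join_order(cards, selectivity):
    """Return (cost, rows, plan) for the cheapest join order found.

    cards: dict mapping relation name -> cardinality.
    selectivity: dict mapping frozenset({a, b}) -> join selectivity;
        a missing pair is treated as a cross product (selectivity 1.0).
    Assumed cost model: sum of estimated intermediate result sizes.
    """
    rels = sorted(cards)
    # Base case: single-relation "access plans" with zero join cost.
    opt = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}

    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            s = frozenset(subset)
            best = None
            # Try every split of S into a non-empty O and S - O.
            for k in range(1, size):
                for left in combinations(subset, k):
                    o = frozenset(left)
                    rest = s - o
                    lcost, lrows, lplan = opt[o]
                    rcost, rrows, rplan = opt[rest]
                    sel = 1.0
                    for a in o:
                        for b in rest:
                            sel *= selectivity.get(frozenset((a, b)), 1.0)
                    rows = lrows * rrows * sel
                    cost = lcost + rcost + rows
                    # Pruning: keep only the cheapest plan per subset.
                    if best is None or cost < best[0]:
                        best = (cost, rows, (lplan, rplan))
            opt[s] = best
    return opt[frozenset(rels)]


cards = {"A": 1000, "B": 100, "C": 10}
selectivity = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
cost, rows, plan = dp_join_order(cards, selectivity)
print(cost, rows, plan)
```

With these made-up statistics, joining B and C first is cheapest, because it produces the smallest intermediate result. A real optimizer would keep several plans per subset (for example, one per interesting order) instead of a single one.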


10/31

Query Execution

library of operators (hash join, merge join, ...)

exploit indexes and clustering in database

pipelining (iterator model)

Example: tuples flowing through the physical plan (root first)

project A.d -> John
  hashjoin -> (John, 35, CS)
    index A.c -> (John, 35, CS), (Mary, 35, EE)
    project B.b -> (CS), (AS)
      B -> (Edinburgh, CS, 5.0), (Edinburgh, AS, 6.0)
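The iterator model can be illustrated with Python generators, where each operator pulls tuples from its children on demand. This is a minimal sketch using the example tuples from the slide; the column names x and y, the choice of B as the build side, and the emulation of the index lookup by a filter are assumptions made for illustration.

```python
def scan(table):
    """Full table scan: yield every row."""
    yield from table


def index_scan(table, column, value):
    """Stand-in for an index lookup (emulated here by a filter)."""
    for row in table:
        if row[column] == value:
            yield row


def project(child, columns):
    """Keep only the listed columns of each input row."""
    for row in child:
        yield {c: row[c] for c in columns}


def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input, then stream the other input."""
    table = {}
    for row in build:
        table.setdefault(row[build_key], []).append(row)
    for row in probe:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}


def run_example():
    a = [{"d": "John", "c": 35, "a": "CS"},
         {"d": "Mary", "c": 35, "a": "EE"}]
    b = [{"x": "Edinburgh", "b": "CS", "y": 5.0},
         {"x": "Edinburgh", "b": "AS", "y": 6.0}]
    plan = project(
        hash_join(project(scan(b), ["b"]), index_scan(a, "c", 35), "b", "a"),
        ["d"],
    )
    return list(plan)


print(run_example())  # [{'d': 'John'}]
```

Nothing is computed until the root is consumed, which is what lets tuples flow through the plan in a pipeline. The hash join is the exception: it must read its whole build input before it can emit the first result.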


11/31

Summary : Centralized Queries

Basic SQL (SPJG, nesting) well understood

Very good extensibility

spatial joins, time series, UDF, XQuery, etc.

Current problems

Better statistics: cost model for optimization

Physical database design expensive & complex

Some trends

interactiveness during execution

approximate answers, top-k

self-tuning capabilities (adaptive; robust; etc.)


12/31

Distributed Query Processing: Basics

Idea:

Extension of centralized query processing (System R* et al. in the 80s)

What is different?

extend physical algebra: send & receive operators

other metrics: optimize for response time

resource vectors, network interconnect matrix

caching and replication

less predictability in cost model (adaptive algorithms)

heterogeneity in data formats and data models


13/31

Issues in Distributed Databases

Plan enumeration

The time and space complexity of the traditional dynamic programming algorithm is very large

Iterative Dynamic Programming (heuristic for large queries)

Cost Models

Classic Cost Model

Response Time Model

Economic Models


14/31

Distributed Query Plan

[Figure: the physical plan (A.d, hashjoin, B.b, index A.c, B) with a send/receive operator pair on each input of the hashjoin. Callout: Forms of parallelism?]


15/31

Cost : Resource Utilization

[Figure: distributed query plan annotated with per-operator costs: 1, 8, 2, 5, 10, 1, 6, 1, 6]

Total Cost = Sum of Cost of Ops

Cost = 40


16/31

Another Metric : Response Time

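[Figure: distributed query plan annotated with (first tuple, last tuple) times per operator: (25, 33), (24, 32), (0, 12), (0, 5), (0, 10), (0, 7), (0, 24), (0, 6), (0, 18)]

Total Cost = 40

Callouts: first tuple = 25, last tuple = 33; first tuple = 0, last tuple = 10

Pipelined parallelism

Independent parallelism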


17/31

Query Execution Techniques for Distributed Databases

Row Blocking

Multi-cast optimization

Multi-threaded execution

Joins with horizontal partitioning

Semi joins

Top n queries


18/31

Query Execution Techniques for DD

Row Blocking

SEND and RECEIVE operators in query plan to model communication

Implemented by TCP/IP, UDP, etc.

Ship tuples in block-wise fashion (batch); smooth burstiness
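Row blocking can be sketched as a pair of generators: the sender groups tuples into fixed-size blocks, and the receiver unpacks them again. The function names and the block size are hypothetical, and the network is only simulated by passing the blocks in memory; a real system would write each block to a socket.

```python
def send_blocks(tuples, block_size):
    """SEND side: group tuples into blocks of at most block_size."""
    block = []
    for t in tuples:
        block.append(t)
        if len(block) == block_size:
            yield block  # one "message" per full block
            block = []
    if block:
        yield block  # final, possibly partial block


def receive(blocks):
    """RECEIVE side: unpack blocks back into a stream of tuples."""
    for block in blocks:
        yield from block


blocks = list(send_blocks(range(7), 3))
print(blocks)                                # [[0, 1, 2], [3, 4, 5], [6]]
print(list(receive(send_blocks(range(7), 3))))  # [0, 1, 2, 3, 4, 5, 6]
```

Seven tuples travel in three messages instead of seven. The receiver can also keep working on a buffered block while the next one is in transit, which is how blocking smooths bursty arrival.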


19/31


Multi-cast Optimization

Location of sending/receiving may affect communication costs; forwarding versus multi-casting

Multi-threaded execution

Several threads for operators at the same site (intra-query parallelism)

May be useful to enable concurrent reads from diverse machines (while continuing query processing)

Must consider if resources warrant concurrent operator execution (say two sorts each needing all memory)


20/31


Joins with Data (horizontal) partitioning:

Hash-based partitioning to conduct joins on independent partitions

Semi Joins:

Reduce communication costs; send only join keys instead of complete tuples to the site to extract relevant join partners

Double-pipelined hash joins:

Non-blocking join operators to deliver first results quickly; fully exploit pipelined parallelism, and reduce overall response time

Top n queries:

Isolate top n tuples quickly and only perform other expensive operations (like sort, join, etc.) on those few (use stop operators)
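The semi-join idea can be sketched as three steps between two sites. In this illustration, both "sites" are plain Python lists, shipping is simulated by passing values, and the function name and the returned counters are hypothetical.

```python
def semi_join(r_rows, s_rows, r_key, s_key):
    """Join R (local site) with S (remote site) using a semi-join reduction.

    Returns the join result plus how many keys and how many S tuples
    would have crossed the network.
    """
    # Step 1 (site of R): ship only the distinct join keys to S's site.
    keys = {row[r_key] for row in r_rows}

    # Step 2 (site of S): keep only the S tuples that have a join partner
    # and ship just those back.
    reduced_s = [row for row in s_rows if row[s_key] in keys]

    # Step 3 (site of R): join R with the reduced S.
    index = {}
    for row in reduced_s:
        index.setdefault(row[s_key], []).append(row)
    result = [{**r, **s} for r in r_rows for s in index.get(r[r_key], [])]

    return result, len(keys), len(reduced_s)


r = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 2, "name": "c"}]
s = [{"rid": 2, "val": "x"}, {"rid": 3, "val": "y"}, {"rid": 2, "val": "z"}]
result, keys_shipped, tuples_shipped = semi_join(r, s, "id", "rid")
print(len(result), keys_shipped, tuples_shipped)  # 4 2 2
```

The reduction pays off when the keys are much smaller than the tuples and many S tuples have no join partner. Otherwise the extra round trip can cost more than shipping S whole.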


21/31

Adaptive Algorithms

Deal with unpredictable events at run time

delays in arrival of data, burstiness of network

autonomy of nodes, changes in policies

Example: double pipelined hash joins

build hash table for both input streams

read inputs in separate threads

good for bursty arrival of data

Re-optimization at run time (LEO, etc.)

adjust estimates of cost model

re-optimize if delta is too large
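A double pipelined (symmetric) hash join can be sketched as follows. For simplicity, this illustration does not use separate threads as the slide suggests. Instead it consumes a single interleaved stream of arrival events, which is an assumption made to keep the example deterministic; the function name and event format are hypothetical.

```python
def symmetric_hash_join(events, left_key, right_key):
    """Non-blocking join over tuples arriving in arbitrary order.

    events: iterable of ("L", row) or ("R", row) in arrival order.
    Each arriving row is inserted into its own side's hash table and
    immediately probed against the other side's table.
    """
    left_table, right_table = {}, {}
    for side, row in events:
        if side == "L":
            key = row[left_key]
            left_table.setdefault(key, []).append(row)
            for match in right_table.get(key, []):
                yield {**row, **match}
        else:
            key = row[right_key]
            right_table.setdefault(key, []).append(row)
            for match in left_table.get(key, []):
                yield {**match, **row}


events = [
    ("L", {"a": 1, "x": "l1"}),
    ("R", {"b": 2, "y": "r2"}),
    ("R", {"b": 1, "y": "r1"}),  # first result can be emitted here
    ("L", {"a": 2, "x": "l2"}),
]
for joined in symmetric_hash_join(events, "a", "b"):
    print(joined)
```

Because neither input has to be read completely before results appear, a delay on one input does not stall output that can already be produced from the other. The price is memory for two hash tables instead of one.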


22/31

Special Techniques for Client-Server Architectures

Shipping techniques

Query shipping

Data shipping

Hybrid shipping

Query Optimization

Site Selection

Where to optimize

Two Phase Optimization


23/31

Special Techniques for Federated Database Systems

Wrapper architecture

Query optimization

Query capabilities

Cost estimation

Calibration Approach

Wrapper Cost Model

Parameter Binding


24/31

Heterogeneity

Use Wrappers to hide heterogeneity

Wrappers take care of data format, packaging

Wrappers map from local to global schema

Wrappers carry out caching

connections, cursors, data, ...

Wrappers map queries into local dialect

Wrappers participate in query planning!!!

define the subset of queries that can be handled

give cost information, statistics

capability-based rewriting


25/31

Middleware

Two kinds of middleware

data warehouses

virtual integration

Data Warehouses

good: query response times

good: materializes results of data cleaning

bad: high resource requirements in middleware

bad: staleness of data

Virtual Integration

the opposite

caching possible to improve response times


26/31

Virtual Integration

[Figure: architecture, top to bottom]

Query

Middleware (query decomposition, result composition)

subquery -> wrapper -> DB1

subquery -> wrapper -> DB2
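The interplay of middleware and wrappers can be sketched in a few lines. Everything here is a hypothetical illustration, not the API of any product: the Wrapper class, its supports_filter capability flag, and middleware_query are invented names. Each wrapper maps its local schema to the global one, and the middleware pushes a predicate down only to sources that can evaluate it.

```python
class Wrapper:
    """Hides one data source behind the global schema (hypothetical)."""

    def __init__(self, rows, rename, supports_filter):
        self.rows = rows                # local data, local column names
        self.rename = rename            # local column -> global column
        self.supports_filter = supports_filter  # declared query capability

    def subquery(self, predicate=None):
        # Map local rows to the global schema.
        out = [{self.rename[k]: v for k, v in row.items()} for row in self.rows]
        if predicate is not None:
            if not self.supports_filter:
                raise ValueError("source cannot evaluate predicates")
            out = [r for r in out if predicate(r)]
        return out


def middleware_query(wrappers, predicate):
    """Decompose the query into subqueries and compose the results."""
    result = []
    for w in wrappers:
        if w.supports_filter:
            # Capability available: push the predicate to the source.
            result.extend(w.subquery(predicate))
        else:
            # Capability missing: fetch everything, filter in the middleware.
            result.extend(r for r in w.subquery() if predicate(r))
    return result


db1 = Wrapper([{"nm": "John", "age": 35}, {"nm": "Ann", "age": 20}],
              {"nm": "name", "age": "age"}, supports_filter=True)
db2 = Wrapper([{"fullname": "Mary", "years": 35}],
              {"fullname": "name", "years": "age"}, supports_filter=False)
print(middleware_query([db1, db2], lambda r: r["age"] == 35))
```

The caller sees one schema and one result, regardless of where the data lives or which source did the filtering.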


27/31

IBM Data Joiner

[Figure: architecture, top to bottom]

SQL Query

Data Joiner

subquery -> wrapper -> SQL DB1

subquery -> wrapper -> SQL DB2


28/31

Adding XML

[Figure: architecture]

Query

Middleware (SQL)

subquery -> wrapper -> DB1

subquery -> wrapper -> DB2

XML Publishing


29/31

XML Data Integration

[Figure: architecture, top to bottom]

XML Query

Middleware (XML)

XML query -> wrapper -> DB1

XML query -> wrapper -> DB2


30/31

XML Data Integration

Example: BEA Liquid Data

Advantage

Availability of XML wrappers for all major databases

Problems

XML - SQL mapping is very difficult

XML is not always the right language

(e.g., decision support style queries)


31/31

Summary

Middleware looks like a homogeneous centralized database

location transparency

data model transparency

Middleware provides global schema

data sources map local schemas to global schema

Various kinds of middleware (SQL, XML)

Stacks of middleware possible

Data cleaning requires special attention
