Kossmann Survey Cs561
7/30/2019 Kossmann Survey Cs561
Distributed Query Processing
Based on "The State of the Art in Distributed Query Processing"
Donald Kossmann (ACM Computing Surveys, 2000)
-
Motivation
Cost and scalability: network of off-the-shelf machines
Integration of software from different vendors (each with its own DBMS)
Integration of legacy systems
Applications that are inherently distributed, such as workflow or collaborative design
State-of-the-art distributed information technologies (e-businesses)
-
Part 1 : Basics
Query Processing Basics
centralized query processing
distributed query processing
-
Problem Statement
Input: a query such as "Biological objects in study A referenced in literature in journal Y"
Output: Answer
Objectives:
response time, throughput, first answers, little IO, ...
Centralized vs. Distributed Query Processing
same basic problem
but with more and different parameters (such as data sites or available machine power) and objectives
-
Steps in Query Processing
Input: Declarative Query
SQL, XQuery, ...
Step 1: Translate Query into Algebra
Tree of operators (query plan generation)
Step 2: Optimize Query
Tree of operators (logical); also select partitions of tables
Tree of operators (physical); also site annotations (Compilation)
Step 3: Execution
Interpretation; Query result generation
-
Algebra
relational algebra for SQL very well understood
algebra for XQuery mostly understood
SELECT A.d
FROM A, B
WHERE A.a = B.b
AND A.c = 35
Operator tree (root first):
  project A.d
    select A.a = B.b, A.c = 35
      X (Cartesian product)
        A    B
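A minimal sketch of what this operator tree computes, evaluated naively over tiny in-memory relations. The sample rows and the helper names (`cross`, `select`, `project`) are illustrative assumptions, not part of the survey.

```python
from itertools import product

# Hypothetical sample relations; each row is a dict keyed by "Relation.column".
A = [
    {"A.a": 1, "A.c": 35, "A.d": "x"},
    {"A.a": 2, "A.c": 35, "A.d": "y"},
    {"A.a": 1, "A.c": 40, "A.d": "z"},
]
B = [{"B.b": 1}, {"B.b": 3}]


def cross(left, right):
    """Cartesian product: merge every pair of rows into one row."""
    return [{**l, **r} for l, r in product(left, right)]


def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Keep only the named columns, as tuples."""
    return [tuple(row[c] for c in columns) for row in rows]


# project A.d ( select A.a = B.b AND A.c = 35 ( A X B ) )
plan_result = project(
    select(cross(A, B), lambda r: r["A.a"] == r["B.b"] and r["A.c"] == 35),
    ["A.d"],
)
print(plan_result)
```

Evaluating the tree literally builds the full Cartesian product first, which is exactly the cost that the optimization step tries to avoid.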
-
Query Optimization
logical, e.g., push down cheap predicates
enumerate alternative plans, apply cost model
use search heuristics to find cheapest plan
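Logical plan (root first):
  project A.d
    select A.a = B.b, A.c = 35
      X (Cartesian product)
        A    B
Optimized physical plan (root first):
  project A.d
    hashjoin
      index A.c
      project B.b
        B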
-
Basic Query Optimization
Classical Dynamic Programming algorithm
Performs join order optimization
Input : Join query on n relations
Output : Best join order
-
The Dynamic Prog. Algorithm
for i = 1 to n do {
  optPlan({Ri}) = accessPlans(Ri); prunePlans(optPlan({Ri}))
}
for i = 2 to n do
  for all S ⊆ {R1, R2, ..., Rn} such that |S| = i do {
    optPlan(S) = ∅
    for all O ⊂ S, O ≠ ∅ do {
      optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S \ O))
      prunePlans(optPlan(S))
    }
  }
return optPlan({R1, R2, ..., Rn})
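The enumeration can be sketched in Python as follows. This is a simplified illustration, not the survey's algorithm verbatim: pruning keeps only the single cheapest plan per subset, and the cost model (sum of intermediate result sizes, with independent join selectivities) is an assumption chosen to keep the example short. The name `dp_join_order` and its parameters are hypothetical.

```python
from itertools import combinations


def dp_join_order(cards, selectivity):
    """Return (cost, cardinality, plan) for the cheapest join order.

    cards: {relation name: cardinality}
    selectivity: {frozenset({r1, r2}): join selectivity}; missing pairs
                 default to 1.0, i.e. a Cartesian product.
    """
    rels = sorted(cards)
    # Access plans for single relations: cost 0 in this toy model.
    opt = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}

    for size in range(2, len(rels) + 1):
        for combo in combinations(rels, size):
            s = frozenset(combo)
            best = None
            # Try every split of S into a non-empty O and S \ O.
            for k in range(1, size):
                for left in combinations(combo, k):
                    o = frozenset(left)
                    rest = s - o
                    lcost, lcard, lplan = opt[o]
                    rcost, rcard, rplan = opt[rest]
                    sel = 1.0
                    for a in o:
                        for b in rest:
                            sel *= selectivity.get(frozenset((a, b)), 1.0)
                    card = lcard * rcard * sel
                    cost = lcost + rcost + card
                    # Pruning: remember only the cheapest plan for S.
                    if best is None or cost < best[0]:
                        best = (cost, card, (lplan, rplan))
            opt[s] = best
    return opt[frozenset(rels)]


cost, card, plan = dp_join_order(
    {"A": 1000, "B": 100, "C": 10},
    {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1},
)
print(cost, card, plan)
```

With these made-up statistics, joining B and C first is cheapest because it produces the smallest intermediate result. A real optimizer would keep several plans per subset (for example, one per interesting order or site) rather than a single winner.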
-
Query Execution
library of operators (hash join, merge join, ...)
exploit indexes and clustering in database
pipelining (iterator model)
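Plan annotated with sample tuples (leaves first):
  index A.c yields (John, 35, CS), (Mary, 35, EE)
  B holds (Edinburgh, CS, 5.0), (Edinburgh, AS, 6.0)
  project B.b yields (CS), (AS)
  hashjoin yields (John, 35, CS)
  project A.d yields John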
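The iterator model can be sketched with Python generators, where each operator pulls tuples from its children on demand. The operator functions below are illustrative; the extra row (Anne, 40, CS) is an assumption added so that the selection on A.c has something to filter, and a plain filter stands in for the index lookup.

```python
def scan(rows):
    """Leaf operator: produce stored tuples one at a time."""
    for row in rows:
        yield row


def select(child, predicate):
    """Pass through only the tuples that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project(child, positions):
    """Keep only the listed column positions."""
    for row in child:
        yield tuple(row[i] for i in positions)


def hash_join(build, probe, build_pos, probe_pos):
    """Build a hash table on one input, then stream the other through it."""
    table = {}
    for row in build:
        table.setdefault(row[build_pos], []).append(row)
    for row in probe:
        for match in table.get(row[probe_pos], []):
            yield row + match


# A tuples are (d, c, a); B tuples are (name, b, score).
A = [("John", 35, "CS"), ("Mary", 35, "EE"), ("Anne", 40, "CS")]
B = [("Edinburgh", "CS", 5.0), ("Edinburgh", "AS", 6.0)]

plan = project(
    hash_join(
        build=project(scan(B), [1]),                 # project B.b
        probe=select(scan(A), lambda r: r[1] == 35),  # stands in for index A.c
        build_pos=0,
        probe_pos=2,
    ),
    [0],                                              # project A.d
)
result = list(plan)
print(result)
```

Nothing runs until `list(plan)` starts pulling tuples from the root, which is how pipelining avoids materializing intermediate results (the hash table on the build side is the exception).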
-
Summary : Centralized Queries
Basic SQL (SPJG, nesting) well understood
Very good extensibility
spatial joins, time series, UDF, XQuery, etc.
Current problems
Better statistics: cost model for optimization
Physical database design expensive & complex
Some trends
interactiveness during execution
approximate answers, top-k
self-tuning capabilities (adaptive, robust, etc.)
-
Distributed Query Processing: Basics
Idea:
Extension of centralized query processing (System R* et al. in the 80s)
What is different?
extend physical algebra: send & receive operators
other metrics: optimize for response time
resource vectors, network interconnect matrix
caching and replication
less predictability in cost model (adaptive algos)
heterogeneity in data formats and data models
-
Issues in Distributed Databases
Plan enumeration
The time and space complexity of the traditional dynamic programming algorithm is very large
Iterative Dynamic Programming (heuristic for large queries)
Cost Models
Classic Cost Model
Response Time Model
Economic Models
-
Distributed Query Plan
[Figure: the hash-join plan (A.d, hashjoin, B.b, index A.c, B) extended with two send/receive operator pairs]
Forms of parallelism?
-
Cost : Resource Utilization
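[Figure: query plan annotated with per-operator costs, by level from the root: 1; 8; 2; 5, 10; 1, 6; 1, 6]
Total cost = sum of the costs of all operators
Cost = 40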
-
Another Metric : Response Time
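[Figure: query plan annotated with (first tuple, last tuple) times per operator, by level from the root: (25, 33); (24, 32); (0, 12); (0, 5), (0, 10); (0, 7), (0, 24); (0, 6), (0, 18)]
Total cost = 40
At the root: first tuple = 25, last tuple = 33
At the operator marked (0, 10): first tuple = 0, last tuple = 10
Forms of parallelism shown: pipelined parallelism, independent parallelism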
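The difference between the two metrics can be sketched on a small plan tree. This is a deliberately simplified model, not the survey's response time model: it assumes that the children of an operator run fully in parallel at different sites (independent parallelism) and that an operator starts only after its slowest child has finished (no pipelining). The `Op` class and all costs below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    cost: float
    children: list = field(default_factory=list)


def total_cost(op):
    """Resource utilization: sum the cost of every operator in the plan."""
    return op.cost + sum(total_cost(c) for c in op.children)


def response_time(op):
    """Elapsed time when sibling subtrees run in parallel (no pipelining)."""
    slowest_child = max((response_time(c) for c in op.children), default=0.0)
    return op.cost + slowest_child


# Hypothetical plan: two remote scans shipped to a join site.
left = Op("receive", 1, [Op("send", 1, [Op("scan A", 5)])])
right = Op("receive", 6, [Op("send", 6, [Op("scan B", 10)])])
plan = Op("project", 1, [Op("hashjoin", 8, [left, right])])

print(total_cost(plan), response_time(plan))
```

Under these assumptions the total cost counts both branches, while the response time counts only the slower one. Pipelined parallelism would lower the response time further, because a consumer can start working before its producer has delivered the last tuple.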
-
Query Execution Techniques for Distributed Databases
Row Blocking
Multi-cast optimization
Multi-threaded execution
Joins with horizontal partitioning
Semi joins
Top n queries
-
Query Execution Techniques for DD
Row Blocking
SEND and RECEIVE operators in query plan to model communication
Implemented by TCP/IP, UDP, etc.
Ship tuples in block-wise fashion (batch); smooths burstiness
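Row blocking can be sketched as a pair of generators: the sender groups tuples into fixed-size blocks, so that each network message carries many tuples, and the receiver unpacks them again. The function names and the block size are illustrative assumptions, and the network itself is not modeled.

```python
from itertools import islice


def send_in_blocks(tuples, block_size):
    """SEND side: yield lists of up to block_size tuples (one per message)."""
    it = iter(tuples)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def receive(blocks):
    """RECEIVE side: hand tuples to the consumer one at a time."""
    for block in blocks:
        yield from block


messages = list(send_in_blocks(range(7), block_size=3))
print(messages)
print(list(receive(messages)))
```

Seven tuples travel in three messages instead of seven, and the receiver can keep consuming from a buffered block while the next one is still in transit.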
-
Query Execution Techniques for DD
Multi-cast Optimization
Location of sending/receiving may affect communication costs; forwarding versus multi-casting
Multi-threaded execution
Several threads for operators at the same site (intra-query parallelism)
May be useful to enable concurrent reads from diverse machines (while continuing query processing)
Must consider whether resources warrant concurrent operator execution (say, two sorts each needing all memory)
-
Query Execution Techniques for DD
Joins with data (horizontal) partitioning:
Hash-based partitioning to conduct joins on independent partitions
Semi joins:
Reduce communication costs; send only join keys instead of complete tuples to the site to extract relevant join partners
Double-pipelined hash joins:
Non-blocking join operators to deliver first results quickly; fully exploit pipelined parallelism and reduce overall response time
Top n queries:
Isolate top n tuples quickly and only perform other expensive operations (like sort, join, etc.) on those few (use stop operators)
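The semi-join idea can be sketched for two relations R and S stored at different sites. Only R's join keys travel to S's site, and only the matching S tuples travel back. The function names and sample data are illustrative assumptions, and shipping is simulated by counting what would cross the network.

```python
def semijoin_reduce(s_rows, s_pos, keys):
    """Runs at S's site: keep only S tuples that have a join partner in R."""
    return [row for row in s_rows if row[s_pos] in keys]


def join_with_semijoin(r_rows, r_pos, s_rows, s_pos):
    """Runs at R's site; returns (result, keys shipped, S tuples shipped)."""
    keys = {row[r_pos] for row in r_rows}             # shipped R-site -> S-site
    reduced_s = semijoin_reduce(s_rows, s_pos, keys)  # shipped S-site -> R-site
    index = {}
    for s in reduced_s:
        index.setdefault(s[s_pos], []).append(s)
    result = [r + s for r in r_rows for s in index.get(r[r_pos], [])]
    return result, len(keys), len(reduced_s)


R = [(1, "a"), (2, "b")]
S = [(1, "x"), (3, "y"), (1, "z"), (4, "w")]
result, keys_shipped, tuples_shipped = join_with_semijoin(R, 0, S, 0)
print(result, keys_shipped, tuples_shipped)
```

Here two keys and two S tuples are shipped instead of all four S tuples. The technique pays off when keys are small relative to full tuples and the join is selective.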
-
Adaptive Algorithms
Deal with unpredictable events at run time
delays in arrival of data, burstiness of network
autonomy of nodes, changes in policies
Example: double-pipelined hash joins
build hash table for both input streams
read inputs in separate threads
good for bursty arrival of data
Re-optimization at run time (LEO, etc.)
monitor execution of query
adjust estimates of cost model
re-optimize if delta is too large
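A double-pipelined hash join can be sketched as follows. To keep the example deterministic, it is single-threaded: instead of reading the inputs in separate threads, it consumes one interleaved stream of `(side, row)` pairs in arrival order. That interface and the function name are assumptions made for illustration.

```python
def double_pipelined_hash_join(arrivals, left_pos, right_pos):
    """Join two streams without blocking on either one.

    arrivals: iterable of ("L" or "R", row) in the order tuples arrive.
    Each arriving tuple is inserted into its own side's hash table and
    immediately probed against the other side's table.
    """
    left_table, right_table = {}, {}
    for side, row in arrivals:
        if side == "L":
            key = row[left_pos]
            left_table.setdefault(key, []).append(row)
            for match in right_table.get(key, []):
                yield row + match
        else:
            key = row[right_pos]
            right_table.setdefault(key, []).append(row)
            for match in left_table.get(key, []):
                yield match + row


arrivals = [
    ("L", (1, "a")),
    ("R", (2, "x")),
    ("R", (1, "y")),
    ("L", (2, "b")),
]
output = list(double_pipelined_hash_join(arrivals, 0, 0))
print(output)
```

The first result is produced as soon as the third tuple arrives, without waiting for either input to finish, which is why the operator tolerates bursty or delayed sources. The price is memory for two hash tables instead of one.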
-
Special Techniques for Client-Server Architectures
Shipping techniques
Query shipping
Data shipping
Hybrid shipping
Query Optimization
Site Selection
Where to optimize
Two Phase Optimization
-
Special Techniques for Federated Database Systems
Wrapper architecture
Query optimization
Query capabilities
Cost estimation
Calibration Approach
Wrapper Cost Model
Parameter Binding
-
Heterogeneity
Use wrappers to hide heterogeneity
Wrappers take care of data format, packaging
Wrappers map from local to global schema
Wrappers carry out caching
connections, cursors, data, ...
Wrappers map queries into local dialect
Wrappers participate in query planning!!!
define the subset of queries that can be handled
give cost information, statistics
capability-based rewriting
-
Middleware
Two kinds of middleware
data warehouses
virtual integration
Data Warehouses
good: query response times
good: materializes results of data cleaning
bad: high resource requirements in middleware
bad: staleness of data
Virtual Integration
the opposite
caching possible to improve response times
-
Virtual Integration
[Figure: a query enters the middleware (query decomposition, result composition), which sends one subquery through a wrapper to each of DB1 and DB2]
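The architecture can be sketched with two toy sources whose local formats differ. Every name here (`Wrapper`, `Middleware`, `run_subquery`, the sample data) is a hypothetical illustration, and decomposition is reduced to sending the same filter to every source and taking the union of the answers.

```python
class Wrapper:
    """Hides one data source behind a common interface."""

    def __init__(self, rows, to_global):
        self.rows = rows
        self.to_global = to_global  # maps a local row to the global schema

    def run_subquery(self, predicate):
        """Evaluate the subquery at the source, answering in global schema."""
        return [g for g in map(self.to_global, self.rows) if predicate(g)]


class Middleware:
    """Decomposes a query into subqueries and composes their results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        results = []
        for wrapper in self.wrappers:  # query decomposition
            results.extend(wrapper.run_subquery(predicate))
        return results                 # result composition (union)


# DB1 stores tuples (name, age); DB2 stores dicts with different field names.
db1 = Wrapper([("John", 35)], lambda r: {"name": r[0], "age": r[1]})
db2 = Wrapper(
    [{"full_name": "Mary", "age": 35}, {"full_name": "Ann", "age": 50}],
    lambda r: {"name": r["full_name"], "age": r["age"]},
)

answer = Middleware([db1, db2]).query(lambda person: person["age"] == 35)
print(answer)
```

The caller sees a single global schema and never learns where each row came from. A real system would also consult each wrapper's capabilities and cost information before deciding which subqueries to push down.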
-
IBM Data Joiner
[Figure: an SQL query enters Data Joiner, which sends one subquery through a wrapper to each of SQL DB1 and SQL DB2]
-
Adding XML
[Figure: a query enters the SQL middleware, which sends one subquery through a wrapper to each of DB1 and DB2; the diagram also shows an XML Publishing component]
-
XML Data Integration
[Figure: an XML query enters the XML middleware, which sends one XML query through a wrapper to each of DB1 and DB2]
-
XML Data Integration
Example: BEA Liquid Data
Advantage
Availability of XML wrappers for all major databases
Problems
XML - SQL mapping is very difficult
XML is not always the right language
(e.g., decision support style queries)
-
Summary
Middleware looks like a homogeneous centralized database
location transparency
data model transparency
Middleware provides global schema
data sources map local schemas to global schema
Various kinds of middleware (SQL, XML)
Stacks of middleware possible
Data cleaning requires special attention