Kossmann Survey Cs561

  • 7/30/2019 Kossmann Survey Cs561


    Distributed Query Processing

    Based on "The state of the art in distributed query processing"

    Donald Kossmann (ACM Computing Surveys, 2000)


    Motivation

    Cost and scalability: network of off-the-shelf machines

    Integration of different software vendors (with own DBMS)

    Integration of legacy systems

    Applications inherently distributed, such as workflow or collaborative design

    State-of-the-art distributed information technologies (e-businesses)


    Part 1 : Basics

    Query Processing Basics

    centralized query processing

    distributed query processing


    Problem Statement

    Input: Query such as "Biological objects in study A referenced in a literature in journal Y."

    Output: Answer

    Objectives:

    response time, throughput, first answers, little IO, ...

    Centralized vs. Distributed Query Processing

    same basic problem

    but more and different parameters (such as data sites or available machine power) and objectives


    Steps in Query Processing

    Input: Declarative Query

    SQL, XQuery, ...

    Step 1: Translate Query into Algebra

    Tree of operators (query plan generation)

    Step 2: Optimize Query

    Tree of operators (logical); also select partitions of table

    Tree of operators (physical); also site annotations (Compilation)

    Step 3: Execution

    Interpretation; Query result generation


    Algebra

    relational algebra for SQL very well understood

    algebra for XQuery mostly understood

    SELECT A.d
    FROM A, B
    WHERE A.a = B.b
    AND A.c = 35

    Operator tree for this query (root first):

    π A.d
    σ A.a = B.b, A.c = 35
    ×
    A   B


    Query Optimization

    logical, e.g., push down cheap predicates

    enumerate alternative plans, apply cost model

    use search heuristics to find cheapest plan

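    Logical plan before optimization (root first):

    π A.d
    σ A.a = B.b, A.c = 35
    ×
    A   B

    Physical plan after optimization (root first):

    π A.d
    hashjoin
    index A.c        π B.b
                     B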


    Basic Query Optimization

    Classical Dynamic Programming algorithm

    Performs join order optimization

    Input : Join query on n relations

    Output : Best join order


    The Dynamic Prog. Algorithm

    for i = 1 to n do {
        optPlan({Ri}) = accessPlans(Ri)
        prunePlans(optPlan({Ri}))
    }
    for i = 2 to n do
        for all S ⊆ {R1, R2, ..., Rn} such that |S| = i do {
            optPlan(S) = ∅
            for all O ⊂ S do {
                optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S − O))
                prunePlans(optPlan(S))
            }
        }
    return optPlan({R1, R2, ..., Rn})
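The enumeration above can be sketched in Python. This is a minimal illustration, not the survey's implementation: the cardinalities in `CARD`, the selectivities in `SEL`, and the cost function (sum of estimated intermediate result sizes) are all assumptions made up for the example, and pruning is reduced to keeping the single cheapest plan per subset.

```python
from itertools import combinations

# Hypothetical statistics, not taken from the slides.
CARD = {"A": 1000, "B": 100, "C": 10}
SEL = {frozenset({"A", "B"}): 0.01, frozenset({"B", "C"}): 0.1}


def result_size(rels):
    """Estimated cardinality of joining all relations in `rels`."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for pair, selectivity in SEL.items():
        if pair <= rels:
            size *= selectivity
    return size


def best_join_order(relations):
    """Return (cost, plan); cost is the sum of intermediate result sizes."""
    # Access plans: scanning a base relation is treated as free here.
    opt = {frozenset({r}): (0.0, r) for r in relations}
    for i in range(2, len(relations) + 1):
        for subset in combinations(relations, i):
            s = frozenset(subset)
            best = None
            # Try every split of S into O and S - O.
            for k in range(1, i):
                for left in combinations(subset, k):
                    o = frozenset(left)
                    left_cost, left_plan = opt[o]
                    right_cost, right_plan = opt[s - o]
                    cost = left_cost + right_cost + result_size(s)
                    # Pruning: keep only the cheapest plan for this subset.
                    if best is None or cost < best[0]:
                        best = (cost, (left_plan, right_plan))
            opt[s] = best
    return opt[frozenset(relations)]


cost, plan = best_join_order(["A", "B", "C"])
print(cost, plan)
```

With these made-up statistics the cheap B-C join is done first, because joining A and C directly would be a cross product.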


    Query Execution

    library of operators (hash join, merge join, ...)

    exploit indexes and clustering in database

    pipelining (iterator model)

    Example (tuples flowing bottom-up through the plan):

    index A.c: (John, 35, CS), (Mary, 35, EE)
    B: (Edinburgh, CS, 5.0), (Edinburgh, AS, 6.0)
    π B.b: (CS), (AS)
    hashjoin: (John, 35, CS)
    π A.d: John
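The iterator model can be sketched with Python generators, where each operator pulls tuples from its input on demand. This is only an illustration under assumptions: the column names `city` and `score` are invented, the "index scan" is a plain filter standing in for an index lookup, and the projection of B onto B.b is folded into the join for brevity.

```python
def index_scan(table, column, value):
    """Yield rows whose `column` equals `value` (stands in for an index lookup)."""
    for row in table:
        if row[column] == value:
            yield row


def hash_join(build_rows, build_key, probe_rows, probe_key):
    """Build a hash table on one input, then stream the other input through it."""
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**row, **match}


def project(rows, column):
    """Yield a single column of each incoming row."""
    for row in rows:
        yield row[column]


A = [{"d": "John", "c": 35, "a": "CS"}, {"d": "Mary", "c": 35, "a": "EE"}]
B = [{"city": "Edinburgh", "b": "CS", "score": 5.0},
     {"city": "Edinburgh", "b": "AS", "score": 6.0}]

# SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35
plan = project(hash_join(B, "b", index_scan(A, "c", 35), "a"), "d")
result = list(plan)
print(result)
```

No operator materializes its full output; the probe side of the join and the projection run in a pipeline, one tuple at a time.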


    Summary : Centralized Queries

    Basic SQL (SPJG, nesting) well understood

    Very good extensibility

    spatial joins, time series, UDF, XQuery, etc.

    Current problems

    Better statistics: cost model for optimization

    Physical database design expensive & complex

    Some trends

    interactiveness during execution

    approximate answers, top-k

    self-tuning capabilities (adaptive; robust; etc.)


    Distributed Query Processing: Basics

    Idea:

    Extension of centralized query processing (System R* et al. in the 80s)

    What is different?

    extend physical algebra: send & receive operators

    other metrics: optimize for response time

    resource vectors, network interconnect matrix

    caching and replication

    less predictability in cost model (adaptive algorithms)

    heterogeneity in data formats and data models


    Issues in Distributed Databases

    Plan enumeration

    The time and space complexity of the traditional dynamic programming algorithm is very large

    Iterative Dynamic Programming (heuristic for large queries)

    Cost Models

    Classic Cost Model

    Response Time Model

    Economic Models


    Distributed Query Plan

    Plan with communication operators (root first):

    π A.d
    hashjoin
    receive          receive
    send             send
    index A.c        π B.b
                     B

    Forms of parallelism?


    Cost : Resource Utilization

    Operator costs annotated on the example plan: 1, 8, 2, 5, 10, 1, 6, 1, 6

    Total Cost = Sum of Cost of Ops

    Cost = 40


    Another Metric : Response Time

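    Operators of the example plan annotated with (first tuple, last tuple) times:

    (25, 33); (24, 32); (0, 12); (0, 5), (0, 10); (0, 7), (0, 24); (0, 6), (0, 18)

    Total Cost = 40

    first tuple = 25, last tuple = 33

    first tuple = 0, last tuple = 10

    Pipelined parallelism

    Independent parallelism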
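The difference between the two metrics can be illustrated with a small sketch. It is a deliberately simplified model and not the one behind the figure: the plan and its costs are hypothetical, sibling subtrees are assumed to run fully in parallel at different sites (independent parallelism), and each operator is assumed to start only after its slowest input has finished (no pipelining).

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    cost: int
    children: list = field(default_factory=list)


def total_cost(op):
    """Resource-utilization metric: sum of the costs of all operators."""
    return op.cost + sum(total_cost(c) for c in op.children)


def response_time(op):
    """Simplified response time: own cost plus the slowest input subtree."""
    slowest_input = max((response_time(c) for c in op.children), default=0)
    return op.cost + slowest_input


# Hypothetical plan: two scans at remote sites feeding a join.
plan = Op("join", 8, [
    Op("receive", 2, [Op("send", 1, [Op("scan A", 5)])]),
    Op("receive", 2, [Op("send", 1, [Op("scan B", 10)])]),
])

print(total_cost(plan), response_time(plan))
```

Under this model the response time is lower than the total cost whenever two subtrees overlap in time, which is why a plan that is best for one metric need not be best for the other.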


    Query Execution Techniques for Distributed Databases

    Row Blocking

    Multi-cast optimization

    Multi-threaded execution

    Joins with horizontal partitioning

    Semi joins

    Top n queries


    Query Execution Techniques for DD

    Row Blocking

    SEND and RECEIVE operators in query plan to model communication

    Implemented by TCP/IP, UDP, etc.

    Ship tuples in block-wise fashion (batch); smooths burstiness
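Row blocking can be sketched as a pair of generators: the sender groups tuples into fixed-size blocks, and the receiver unpacks them again. The function names and the block size are illustrative assumptions; a real system would hand each block to the network layer instead of yielding it in-process.

```python
from itertools import islice


def send_in_blocks(tuples, block_size):
    """Group a tuple stream into lists of at most `block_size` tuples."""
    it = iter(tuples)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block  # one "message" per block instead of one per tuple


def receive(blocks):
    """Turn a stream of blocks back into a stream of tuples."""
    for block in blocks:
        yield from block


blocks = list(send_in_blocks(range(7), 3))
tuples = list(receive(blocks))
print(len(blocks), tuples)
```

Seven tuples travel in three messages here; the receiver can keep consuming from a buffered block while the next one is still in transit.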


    Query Execution Techniques for DD

    Multi-cast Optimization

    Location of sending/receiving may affect communication costs; forwarding versus multi-casting

    Multi-threaded execution

    Several threads for operators at the same site (intra-query parallelism)

    May be useful to enable concurrent reads from diverse machines (while continuing query processing)

    Must consider whether resources warrant concurrent operator execution (say, two sorts each needing all memory)


    Query Execution Techniques for DD

    Joins with data (horizontal) partitioning:

    Hash-based partitioning to conduct joins on independent partitions

    Semi joins:

    Reduce communication costs; send only join keys instead of complete tuples to the site to extract relevant join partners

    Double-pipelined hash joins:

    Non-blocking join operators to deliver first results quickly; fully exploit pipelined parallelism and reduce overall response time

    Top n queries:

    Isolate top n tuples quickly and only perform other expensive operations (like sort, join, etc.) on those few (use stop operators)
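The semi-join idea can be sketched in three steps. This is a single-process illustration with hypothetical relation contents and key names; the comments mark which site each step would run on in a distributed setting.

```python
def semi_join_reduce(r_rows, s_rows, r_key, s_key):
    """Join R and S while shipping only join keys and matching S tuples."""
    # Step 1 (site of R): project the distinct join keys; only these are shipped.
    keys = {row[r_key] for row in r_rows}
    # Step 2 (site of S): keep only the S tuples that have a join partner in R.
    s_reduced = [row for row in s_rows if row[s_key] in keys]
    # Step 3 (site of R): join R with the reduced S that was shipped back.
    by_key = {}
    for row in s_reduced:
        by_key.setdefault(row[s_key], []).append(row)
    joined = [{**r, **s} for r in r_rows for s in by_key.get(r[r_key], [])]
    return keys, s_reduced, joined


R = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
S = [{"rid": 2, "y": "p"}, {"rid": 3, "y": "q"}, {"rid": 2, "y": "r"}]

keys, s_reduced, joined = semi_join_reduce(R, S, "id", "rid")
print(len(S), len(s_reduced), joined)
```

The saving depends on how selective the join is: if most S tuples find a partner, shipping the keys first is extra work and not a gain.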


    Adaptive Algorithms

    Deal with unpredictable events at run time

    delays in arrival of data, burstiness of network

    autonomy of nodes, changes in policies

    Example: double pipelined hash joins

    build hash table for both input streams

    read inputs in separate threads

    good for bursty arrival of data

    Re-optimization at run time (LEO, etc.)

    monitor execution of query

    adjust estimates of cost model

    re-optimize if delta is too large
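A double pipelined (symmetric) hash join can be sketched as follows. To keep the example deterministic, the two input threads mentioned above are replaced by a single interleaved stream of `(side, row)` arrival events; that event format and the key names are assumptions made for illustration.

```python
from collections import defaultdict


def symmetric_hash_join(arrivals, left_key, right_key):
    """Emit join results as soon as both partners have arrived."""
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for side, row in arrivals:
        if side == "L":
            key = row[left_key]
            left_table[key].append(row)         # insert into own hash table
            for match in right_table.get(key, []):
                yield (row, match)              # probe the other hash table
        else:
            key = row[right_key]
            right_table[key].append(row)
            for match in left_table.get(key, []):
                yield (match, row)


arrivals = [
    ("L", {"a": 1}),
    ("R", {"b": 1}),
    ("R", {"b": 2}),
    ("L", {"a": 2}),
    ("L", {"a": 1}),
]
results = list(symmetric_hash_join(arrivals, "a", "b"))
print(results)
```

Neither input has to be read completely before the first result appears, so a stall on one input does not block progress on the other. The price is that both hash tables are held in memory.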


    Special Techniques for Client-Server Architectures

    Shipping techniques

    Query shipping

    Data shipping

    Hybrid shipping

    Query Optimization

    Site Selection

    Where to optimize

    Two Phase Optimization


    Special Techniques for Federated Database Systems

    Wrapper architecture

    Query optimization

    Query capabilities

    Cost estimation

    Calibration Approach

    Wrapper Cost Model

    Parameter Binding


    Heterogeneity

    Use wrappers to hide heterogeneity

    Wrappers take care of data format, packaging

    Wrappers map from local to global schema

    Wrappers carry out caching

    connections, cursors, data, ...

    Wrappers map queries into local dialect

    Wrappers participate in query planning!!!

    define the subset of queries that can be handled

    give cost information, statistics

    capability-based rewriting
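How a wrapper's declared capabilities can drive planning is sketched below. The `Wrapper` class, its `can_handle` and `execute` methods, and the `(column, op, value)` predicate format are all hypothetical; they are not the interface of any particular federated system. The middleware pushes down the predicates the source supports and evaluates the rest itself.

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def holds(row, predicate):
    """Evaluate one (column, op, value) predicate on a row."""
    column, op, value = predicate
    return OPS[op](row[column], value)


class Wrapper:
    """Hypothetical wrapper around one data source."""

    def __init__(self, rows, supported_ops):
        self.rows = rows
        self.supported_ops = set(supported_ops)

    def can_handle(self, predicate):
        # Capability description: which comparison operators the source supports.
        return predicate[1] in self.supported_ops

    def execute(self, predicates):
        return [r for r in self.rows if all(holds(r, p) for p in predicates)]


def run_query(wrapper, predicates):
    """Push supported predicates to the source; apply the rest in the middleware."""
    pushed = [p for p in predicates if wrapper.can_handle(p)]
    residual = [p for p in predicates if not wrapper.can_handle(p)]
    shipped = wrapper.execute(pushed)
    answer = [r for r in shipped if all(holds(r, p) for p in residual)]
    return pushed, residual, answer


source = Wrapper(
    [{"name": "John", "age": 35, "dept": "CS"},
     {"name": "Mary", "age": 45, "dept": "CS"},
     {"name": "Ann", "age": 30, "dept": "EE"}],
    supported_ops={"="},  # this source can only evaluate equality predicates
)
pushed, residual, answer = run_query(
    source, [("dept", "=", "CS"), ("age", "<", 40)]
)
print(pushed, residual, answer)
```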


    Middleware

    Two kinds of middleware

    data warehouses

    virtual integration

    Data Warehouses

    good: query response times

    good: materializes results of data cleaning

    bad: high resource requirements in middleware

    bad: staleness of data

    Virtual Integration

    the opposite

    caching possible to improve response times


    Virtual Integration

    Query → Middleware (query decomposition, result composition)

    Middleware → subquery → wrapper → DB1

    Middleware → subquery → wrapper → DB2


    IBM Data Joiner

    SQL Query → Data Joiner

    Data Joiner → subquery → wrapper → SQL DB1

    Data Joiner → subquery → wrapper → SQL DB2


    Adding XML

    Components shown: Query, XML Publishing, Middleware (SQL)

    Middleware (SQL) → subquery → wrapper → DB1

    Middleware (SQL) → subquery → wrapper → DB2


    XML Data Integration

    XML Query → Middleware (XML)

    Middleware (XML) → XML query → wrapper → DB1

    Middleware (XML) → XML query → wrapper → DB2


    XML Data Integration

    Example: BEA Liquid Data

    Advantage

    Availability of XML wrappers for all major databases

    Problems

    XML - SQL mapping is very difficult

    XML is not always the right language

    (e.g., decision support style queries)


    Summary

    Middleware looks like a homogeneous centralized database

    location transparency

    data model transparency

    Middleware provides global schema

    data sources map local schemas to global schema

    Various kinds of middleware (SQL, XML)

    Stacks of middleware possible

    Data cleaning requires special attention