Generating the Data Cube (Shared Disk) Andrew Rau-Chaplin Faculty of Computer Science Dalhousie University Joint Work with F. Dehne T. Eavis S. Hambrusch

Page 1:

Generating the Data Cube (Shared Disk)

Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University

Joint work with F. Dehne, T. Eavis, S. Hambrusch

Page 2:

Data Cube Generation

Proposed by Gray et al. in 1995
Can be generated from a relational DB but…

[Figure: a three-dimensional cube with axes A, B and C, labelled "The cuboid ABC (or CAB)", next to the lattice of cuboids below]

ABC
AB AC BC
A C B
ALL

Page 3:

As a table

Data cube:

Model  Year  Colour  Sales
Chevy  1990  Blue       87
Chevy  1990  Red         5
Chevy  1990  ALL        92
Chevy  ALL   Blue       87
Chevy  ALL   Red         5
Chevy  ALL   ALL        92
Ford   1990  Blue       99
Ford   1990  Green      64
Ford   1990  ALL       163
Ford   1991  Blue        7
Ford   1991  Red         8
Ford   1991  ALL        15
Ford   ALL   Blue      106
Ford   ALL   Green      64
Ford   ALL   Red         8
ALL    1990  Blue      186
ALL    1990  Green      64
ALL    1991  Blue        7
ALL    1991  Red         8
Ford   ALL   ALL       178
ALL    1990  ALL       255
ALL    1991  ALL        15
ALL    ALL   Blue      193
ALL    ALL   Green      64
ALL    ALL   Red        13
ALL    ALL   ALL       270

Source table:

Model  Year  Colour  Sales
Chevy  1990  Red         5
Chevy  1990  Blue       87
Ford   1990  Green      64
Ford   1990  Blue       99
Ford   1991  Red         8
Ford   1991  Blue        7
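
The relationship between the six-row source table and the cube rows can be sketched in a few lines of Python. This is a naive illustration, not one of the algorithms discussed in the talk: it touches all 2^d roll-ups for every input row. The function name `data_cube` is made up for this example.

```python
from collections import defaultdict
from itertools import product

# Source table from the slide: (Model, Year, Colour, Sales).
ROWS = [
    ("Chevy", 1990, "Red", 5),
    ("Chevy", 1990, "Blue", 87),
    ("Ford", 1990, "Green", 64),
    ("Ford", 1990, "Blue", 99),
    ("Ford", 1991, "Red", 8),
    ("Ford", 1991, "Blue", 7),
]


def data_cube(rows, n_dims):
    """Sum the measure for every combination of kept and rolled-up dimensions."""
    cube = defaultdict(int)
    for *dims, measure in rows:
        # Each mask says, per dimension, whether to keep the value or replace it by ALL.
        for mask in product((True, False), repeat=n_dims):
            key = tuple(v if keep else "ALL" for v, keep in zip(dims, mask))
            cube[key] += measure
    return dict(cube)


cube = data_cube(ROWS, 3)
for key in sorted(cube, key=str):
    print(*key, cube[key])
```

Every group-by of the three dimensions appears as a set of keys in `cube`, with "ALL" marking a rolled-up dimension.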

Page 4:

The Challenge

Input data set, R: |R| is typically in the millions and usually will not fit into memory.
Number of dimensions, d: 10-30
2^d cuboids in the Data Cube

How to solve this highly data- and computation-intensive problem in parallel?
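
To get a feel for the growth in the number of cuboids, the short sketch below enumerates the lattice for three dimensions and evaluates 2^d for a few values of d in the range quoted on the slide. The helper name `cuboids` is hypothetical.

```python
from itertools import combinations


def cuboids(dims):
    """Return every subset of the dimension list, largest first."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]


lattice = cuboids("ABC")                     # ABC, AB, AC, BC, A, B, C, ()
counts = {d: 2 ** d for d in (10, 20, 30)}   # cuboids in a d-dimensional cube

print(len(lattice))
for d, n in counts.items():
    print(d, n)
```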

Page 5:

Existing Parallel Results

Goil & Choudhary: MOLAP approach
Parallelize the generation of each cuboid
Challenge: > 2^d communication rounds

Page 6:

Overview

1) Data cubes
2) Review sequential cubing algorithms
3) Our Top-down parallel algorithm
4) Conclusions and open problems

Page 7:

Optimizations based on computing multiple cuboids

Smallest-parent - compute a cuboid from the smallest previously computed cuboid.
Cache-results - cache in memory the results of a cuboid from which other cuboids are computed, to reduce disk I/O.
Amortize-scans - amortize disk reads by computing as many cuboids as possible together.
Share-sorts - share the sorting cost across multiple cuboids.

ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
All
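
As an illustration of the smallest-parent rule only, the sketch below picks, for a given cuboid, the cheapest cuboid one level up in the lattice. The size estimates in `SIZES` are invented for the example, and `smallest_parent` is a hypothetical helper name.

```python
def smallest_parent(cuboid, sizes):
    """Pick the smallest cuboid that has exactly one more dimension than `cuboid`."""
    parents = [p for p in sizes
               if len(p) == len(cuboid) + 1 and set(cuboid) <= set(p)]
    return min(parents, key=lambda p: sizes[p])


# Hypothetical estimated row counts for part of the ABCD lattice.
SIZES = {
    "ABCD": 1000,
    "ABC": 900, "ABD": 400, "ACD": 700, "BCD": 650,
    "AB": 120, "AC": 80, "AD": 60, "BC": 150, "BD": 90, "CD": 110,
}

print(smallest_parent("AB", SIZES))   # parents are ABC and ABD
print(smallest_parent("A", SIZES))    # parents are AB, AC and AD
```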

Page 8:

Many Algorithms

Pipesort – [AADGNRS’96]
PipeHash – [SAG’96]
Overlap – [DANR’96]
ArrayCube – [ZDN’97]
Bottom-up-cube – [BR’99]
Partition Cube – [RS’97]
Memory Cube – [RS’97]

Page 9:

Approaches

Top Down
  Pipesort – [AADGNRS’96]
  PipeHash – [SAG’96]
  Overlap – [DANR’96]

Bottom up
  Bottom-up-cube – [BR’99]
  Partition Cube – [RS’97]
  Memory Cube – [RS’97]

Array Based
  ArrayCube – [ZDN’97]

Page 10:

Our results

A framework for parallelization of existing sequential data cube algorithms
  Top-down
  Bottom-up
  Array based

Architecture independent
Communication efficient
  Avoids irregular communication patterns
  Few large messages
  Overlap computation and communication

Today’s Focus: Top down approach

Page 11:

Top Down Algorithms

Find a “least cost” spanning tree
Use estimators of cuboid size
Exploit
  Data shrinking
  Pipelining
  Cuts vs. Sorts

ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
All

Page 12:

Cut vs. Sort

Ordering ABCD
Cutting: ABCD -> ABC, linear time
Sorting: ABCD -> ABD, sort time
Size: ABC may be much smaller than ABCD

[Figure: records in ABCD order drawn as a tree over the values A1-A3, B1-B4, C1-C2, D1-D2]
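
The difference between a cut and a sort can be sketched as follows. The rows, the helper names `cut` and `sort_then_cut`, and the use of a sum as the aggregate are assumptions made for illustration. A cut needs only one linear scan because the input is already ordered on the prefix; any other target order needs a re-sort first.

```python
from itertools import groupby

# Hypothetical rows (A, B, C, D, measure), already sorted in ABCD order.
ROWS = [
    ("a1", "b1", "c1", "d1", 3),
    ("a1", "b1", "c1", "d2", 4),
    ("a1", "b1", "c2", "d1", 2),
    ("a1", "b2", "c1", "d1", 5),
    ("a2", "b1", "c2", "d1", 6),
]


def cut(sorted_rows, prefix_len):
    """Aggregate rows that are already sorted down to a prefix: one linear scan."""
    return [(key, sum(r[-1] for r in grp))
            for key, grp in groupby(sorted_rows, key=lambda r: r[:prefix_len])]


def sort_then_cut(rows, dims):
    """Project onto another dimension order, re-sort, then aggregate."""
    reordered = sorted(tuple(r[i] for i in dims) + (r[-1],) for r in rows)
    return cut(reordered, len(dims))


abc = cut(ROWS, 3)                      # ABCD -> ABC, no sort needed
abd = sort_then_cut(ROWS, (0, 1, 3))    # ABCD -> ABD, needs a sort
```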

Page 13:

Pipesort [AADGNRS’96]

Minimize sorting while seeking to compute each cuboid from its smallest parent
Pipeline sorts with common prefixes

[Figure: the lattice before and after an attribute order has been fixed for each cuboid]

ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
All

CBAD
CBA BAD ACD BCD
BA AC AD CB DB CD
A B C D
All

Page 14:

Level-by-level Optimization

Minimum cost matching in a bipartite graph
Scan edges solid, Sort edges dashed
Establish dimension ordering working up the lattice

[Figure, three panels over the lattice levels AB AC BC and A B C, with cost labels on the upper level:
(a) Possible Pathways
(b) Transformed Search Lattice
(c) Minimum Cost Matching]
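
A minimal sketch of the matching step for one pair of levels is given below. It uses brute force instead of a real bipartite matching algorithm, which is only workable for tiny levels. It also simplifies the transformed lattice: each parent offers one scan edge and an unlimited number of sort edges. The (scan, sort) costs in `COSTS` and the function name `match_level` are assumptions for illustration.

```python
from itertools import product


def match_level(children, parents):
    """Brute-force minimum-cost assignment of each child cuboid to a parent.

    `parents` maps a parent name to (scan_cost, sort_cost). A parent can feed
    at most one child through its scan edge; other children need a sort edge.
    """
    options = [[(p, mode) for p in parents if set(c) <= set(p)
                for mode in ("scan", "sort")] for c in children]
    best_cost, best_plan = None, None
    for plan in product(*options):
        scans = [p for p, mode in plan if mode == "scan"]
        if len(scans) != len(set(scans)):
            continue  # two children cannot share one parent's scan edge
        cost = sum(parents[p][0 if mode == "scan" else 1] for p, mode in plan)
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(children, plan))
    return best_cost, best_plan


# Hypothetical (scan cost, sort cost) per parent cuboid.
COSTS = {"AB": (2, 10), "AC": (5, 12), "BC": (13, 20)}
cost, plan = match_level(["A", "B", "C"], COSTS)
print(cost, plan)
```

With these costs A and B both prefer the cheap scan of AB, but only one of them can have it, so the other falls back to a sort edge.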

Page 15:

Overview

1) Data cubes
2) Review sequential cubing algorithms
3) Our Top-down parallel algorithm
4) Conclusions and open problems

Page 16:

Top-down parallel: The Idea

Cut the process tree into p “equal weight” subtrees
Each processor generates the cuboids of one subtree independently
Load balance/stripe the output

CBAD
CBA BAD ACD BCD
BA AC AD CB DB CD
A B C D
All

Page 17:

The Basic Algorithm

(1) Construct a lattice housing all 2^d views.
(2) Estimate the size of each of the views in the lattice.
(3) To determine the cost of using a given view to directly compute its children, use its estimated size to calculate (a) the cost of scanning the view and (b) the cost of sorting it.
(4) Using the bipartite matching technique presented in the original IBM paper, reduce the lattice to a spanning tree that identifies the appropriate set of prefix-ordered sort paths.
(5) Add the I/O estimates to the spanning tree.
(6) Partition the tree into p sub-trees.
(7) Distribute the sub-tree lists to each of the p compute nodes.
(8) On each node, use the sequential Pipesort algorithm to build the set of local views.
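
Step (3) can be illustrated with a deliberately simple cost model. The slides do not give the actual formulas, so the model below is an assumption: scanning a view of n rows costs n, and sorting it costs n log2 n. The function name `edge_costs` is hypothetical.

```python
import math


def edge_costs(estimated_rows):
    """Return (scan_cost, sort_cost) for a view, under the assumed cost model."""
    scan = float(estimated_rows)
    if estimated_rows > 1:
        sort = estimated_rows * math.log2(estimated_rows)
    else:
        sort = scan  # nothing to sort for an empty or one-row view
    return scan, sort


print(edge_costs(1024))
```

Pairs like these would label the scan and sort edges that the matching in step (4) chooses between.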

Page 18:

Tree Partitioning

What does “Equal Weight” mean?
Want to minimize the max weight partition!

O(Rk(k + log d)+n) time - Becker, Perl and Schach ‘82
O(n) time - Frederickson 1990

Page 19:

Tree Partitioning

Min-max tree k-partitioning. Given a tree T with n vertices and a positive weight assigned to each vertex, delete k edges in the tree to obtain k+1 connected components T1, T2, … Tk+1 such that the largest total weight of a resulting sub-tree is minimized.

O(Rk(k + log d)+n) time - Becker, Perl and Schach ‘82
O(n) time - Frederickson 1990
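
The sketch below solves the min-max partitioning problem on a small tree. It is neither of the cited algorithms. It is a simpler and slower approach, assuming integer weights: binary search on the weight bound, with a greedy bottom-up pass that counts how many edges must be cut to respect a given bound. The tree, its weights and the function names are made up for the example.

```python
def cuts_needed(children, weight, root, bound):
    """Fewest edge deletions so that no component weighs more than `bound`."""
    cuts = 0

    def visit(node):
        nonlocal cuts
        loads = sorted((visit(c) for c in children.get(node, [])), reverse=True)
        total = weight[node] + sum(loads)
        for load in loads:           # cut off the heaviest child components first
            if total <= bound:
                break
            total -= load
            cuts += 1
        return total                 # weight still attached to this node

    visit(root)
    return cuts


def min_max_partition(children, weight, root, k):
    """Smallest achievable maximum component weight using at most k cuts."""
    lo, hi = max(weight.values()), sum(weight.values())
    while lo < hi:
        mid = (lo + hi) // 2
        if cuts_needed(children, weight, root, mid) <= k:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical spanning tree with estimated costs as vertex weights.
CHILDREN = {"ABC": ["AB", "BC"], "AB": ["A"], "BC": ["C"]}
WEIGHT = {"ABC": 8, "AB": 5, "BC": 4, "A": 3, "C": 2}

print(min_max_partition(CHILDREN, WEIGHT, "ABC", 1))   # two components
print(min_max_partition(CHILDREN, WEIGHT, "ABC", 2))   # three components
```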

Page 20:

Dynamic min-max

[Figure: a tree rooted at the raw data with the nodes ABC, AB, BC and A, annotated with weights]

Page 21:

Over-sampling

[Figure: instead of cutting the tree directly into p subtrees, cut it into s * p subtrees and combine them into p subsets]
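
How the s * p subtrees are combined into p subsets is not spelled out on the slide. One plausible choice, shown here purely as an assumption, is the greedy rule of handing each subtree, heaviest first, to the currently lightest processor. The weights and the name `pack_subtrees` are hypothetical.

```python
import heapq


def pack_subtrees(weights, p):
    """Assign subtree indices to p processors, heaviest subtree first."""
    heap = [(0, proc, []) for proc in range(p)]   # (load, processor id, members)
    heapq.heapify(heap)
    for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, proc, members = heapq.heappop(heap)  # lightest processor so far
        members.append(idx)
        heapq.heappush(heap, (load + weights[idx], proc, members))
    return sorted(heap, key=lambda entry: entry[1])


# p = 2 processors, over-sampling factor s = 4, so 8 subtrees.
WEIGHTS = [9, 7, 6, 5, 4, 3, 2, 2]
assignment = pack_subtrees(WEIGHTS, 2)
for load, proc, members in assignment:
    print(proc, load, members)
```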

Page 22:

Implementation Issues

1) Sort Optimization
2) Minimizing Data Movement
3) Efficient Aggregation Operations
4) Disk Optimizations

Page 23:

1) Sort Optimization

qSort is SLOW
  May be O(n^2) when there are duplicates

When cardinality is small, the range of keys is small: use Radix sort

Dynamically select between well optimized Radix and Quick Sorts
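
The selection idea can be sketched as below, with two substitutions that should be kept in mind: a counting sort (a single-pass radix sort) stands in for the optimized Radix sort, and Python's built-in `sorted` stands in for the optimized Quick sort. The threshold `max_range_factor` and both function names are assumptions.

```python
def counting_sort(keys, key_range):
    """Sort non-negative integer keys smaller than key_range in linear time."""
    counts = [0] * key_range
    for k in keys:
        counts[k] += 1
    out = []
    for value, n in enumerate(counts):
        out.extend([value] * n)
    return out


def adaptive_sort(keys, max_range_factor=4):
    """Use counting sort when the key range is small relative to the input size."""
    if not keys:
        return []
    lo, hi = min(keys), max(keys)
    key_range = hi - lo + 1
    if key_range <= max_range_factor * len(keys):
        shifted = counting_sort([k - lo for k in keys], key_range)
        return [k + lo for k in shifted]
    return sorted(keys)              # comparison sort for wide key ranges


print(adaptive_sort([3, 1, 2, 3, 1]))
print(adaptive_sort([10**9, 5, 7]))
```

Duplicates cost nothing extra on the counting path, which is the case the slide flags as slow for qSort.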

Page 24:

2) Minimizing Data Movement

Sort pointers to the records!

Never reorder the columns
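
A minimal sketch of both points: the records stay where they are, and a list of indices (standing in for pointers) is sorted using a key that reads the columns in the wanted order. The sample records and the name `sorted_view` are hypothetical.

```python
def sorted_view(records, dim_order):
    """Return record indices ordered by `dim_order`; the records are not moved."""
    return sorted(range(len(records)),
                  key=lambda i: tuple(records[i][d] for d in dim_order))


RECORDS = [
    ("a2", "b1", "c1"),
    ("a1", "b2", "c1"),
    ("a1", "b1", "c2"),
]

abc_order = sorted_view(RECORDS, (0, 1, 2))   # sort order ABC
cab_order = sorted_view(RECORDS, (2, 0, 1))   # sort order CAB, same stored columns
print(abc_order, cab_order)
```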

Page 25:

3) Efficient Aggregation Operations

One pass for each pipeline
Do lazy aggregation

[Figure: records in ABCD order (values A1-A3, B1-B4, C1-C2, D1-D2) feeding the pipeline ABCD -> ABC -> AB -> A -> all]
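
One way to read "one pass for each pipeline" is sketched below: a single scan over rows sorted in ABCD order keeps one running total per prefix length and emits a result only when that prefix changes. Treating this as the lazy aggregation meant on the slide is an assumption, as are the sample rows and the function name `pipeline`.

```python
def pipeline(sorted_rows, n_dims):
    """One scan over fully sorted rows; returns every prefix cuboid."""
    out = {k: [] for k in range(n_dims + 1)}   # k = number of prefix dimensions
    keys = [None] * (n_dims + 1)
    sums = [0] * (n_dims + 1)
    for row in sorted_rows:
        dims, measure = row[:n_dims], row[n_dims]
        for k in range(n_dims + 1):
            prefix = dims[:k]
            if keys[k] is not None and keys[k] != prefix:
                out[k].append((keys[k], sums[k]))   # flush only on a prefix change
                sums[k] = 0
            keys[k] = prefix
            sums[k] += measure
    for k in range(n_dims + 1):                     # flush the final groups
        if keys[k] is not None:
            out[k].append((keys[k], sums[k]))
    return out


# Hypothetical rows (A, B, C, D, measure), sorted in ABCD order.
ROWS = [
    ("a1", "b1", "c1", "d1", 3),
    ("a1", "b1", "c1", "d2", 4),
    ("a1", "b1", "c2", "d1", 2),
    ("a1", "b2", "c1", "d1", 5),
    ("a2", "b1", "c2", "d1", 6),
]

views = pipeline(ROWS, 4)   # views[4] is ABCD, views[3] is ABC, ..., views[0] is all
```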

Page 26:

4) Disk Optimizations

Avoid OS buffering
Implemented I/O Manager
  Manages buffers to avoid thrashing
  Does I/O in separate process to overlap with computation
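
The overlap of I/O and computation can be illustrated with a small background writer. This is not the I/O Manager from the talk: it uses a thread instead of a separate process, writes to an in-memory sink, and does nothing to bypass OS buffering. The class name `BackgroundWriter` and the `max_pending` parameter are hypothetical.

```python
import io
import queue
import threading


class BackgroundWriter:
    """Hands finished buffers to a writer thread so the producer keeps computing."""

    def __init__(self, sink, max_pending=4):
        self._sink = sink
        self._queue = queue.Queue(maxsize=max_pending)  # bounded: caps buffer memory
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        while True:
            chunk = self._queue.get()
            if chunk is None:        # sentinel: no more data
                return
            self._sink.write(chunk)

    def write(self, chunk):
        self._queue.put(chunk)       # blocks only when max_pending buffers wait

    def close(self):
        self._queue.put(None)
        self._thread.join()


sink = io.BytesIO()
writer = BackgroundWriter(sink)
for i in range(3):
    writer.write(f"view-{i}\n".encode())   # stand-in for a finished output buffer
writer.close()
```

The bounded queue keeps the number of in-flight buffers fixed, which is one simple way to avoid the thrashing the slide mentions.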

Page 27:

Speedup - Cluster

Page 28:

Efficiency - Cluster

Page 29:

Speedup - SunFire

Page 30:

Efficiency - SunFire

Page 31:

Increasing Data Size

Page 32:

Varying Over Sampling Factor

Page 33:

Varying Skew

Page 34:

Conclusions

New communication efficient parallel cubing framework for
  Top-down
  Bottom up
  Array based

Easy to implement (sort of), architecture independent

Page 35:

Thank you!

Questions?