Parallel Multi-Dimensional ROLAP Indexing Andrew Rau-Chaplin Faculty of Computer Science Dalhousie University Joint work with Frank Dehne, Carleton Univ. Todd Eavis, Dalhousie Univ.

Parallel Multi-Dimensional ROLAP Indexing

Andrew Rau-Chaplin

Faculty of Computer Science

Dalhousie University

Joint work with

Frank Dehne, Carleton Univ.

Todd Eavis, Dalhousie Univ.

Data Warehousing for Decision Support

Operational data collected into DW

DW used to support multi-dimensional views

Views form the basis of OLAP processing

Our focus: the OLAP server

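Figure: data warehouse architecture. External sources and operational databases are extracted, cleaned, transformed, loaded, and refreshed into the data warehouse and data marts, alongside a metadata repository with monitoring and administration. OLAP servers deliver output to front-end tools for query/reports, analysis, and data mining. Tier labels: data cleaning and integration, data storage, OLAP engines, front-end tools.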

Multi-dimensional views

Collection of feature attributes

Aggregate along one or more measure attributes

Reduce the granularity by “collapsing” dimensions

Points generated by:

distributive functions (e.g., sum)

algebraic functions (e.g., average)

holistic functions (e.g., median)

Figure: data cube over Make (Chevy, Ford), Year (1990, 1991, 1992, 1993), and Colour (Red, White, Blue), with the group-by views By Make, By Year, By Colour, By Make & Year, By Make & Colour, and By Colour & Year.
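
The three function classes behave differently when a view is derived from another view. The sketch below is illustrative only: the fact rows and sales values are made up, and `group_by` is a hypothetical helper, not part of the system described in these slides. It collapses dimensions of a small Make/Year/Colour table and shows that a distributive function (sum) can be rolled up from a finer view, while a holistic function (median) needs all the underlying values.

```python
from collections import defaultdict
from statistics import median

# Hypothetical fact rows: (make, year, colour, sales). Values are invented.
ROWS = [
    ("Chevy", 1990, "Red", 5),
    ("Chevy", 1990, "Blue", 3),
    ("Ford", 1990, "Red", 4),
    ("Ford", 1991, "White", 6),
    ("Chevy", 1991, "Red", 2),
]
DIMS = ("make", "year", "colour")


def group_by(rows, keep, agg):
    """Collapse every dimension not in `keep`, aggregating the measure with `agg`."""
    idx = [DIMS.index(d) for d in keep]
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in idx)].append(row[-1])
    return {key: agg(values) for key, values in groups.items()}


by_make_sum = group_by(ROWS, ("make",), sum)                          # distributive
by_make_avg = group_by(ROWS, ("make",), lambda v: sum(v) / len(v))    # algebraic
by_make_median = group_by(ROWS, ("make",), median)                    # holistic

# A distributive function can be computed from a finer view:
# summing the (make, year) totals gives the same answer as summing raw rows.
by_make_year = group_by(ROWS, ("make", "year"), sum)
rolled_up = defaultdict(int)
for (make, _year), total in by_make_year.items():
    rolled_up[make] += total

print(by_make_sum)      # {('Chevy',): 10, ('Ford',): 10}
print(dict(rolled_up))  # {'Chevy': 10, 'Ford': 10}
```

An average can be rolled up in the same way if each view stores a (sum, count) pair; a median cannot be rebuilt from per-group medians.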

Data Cube Generation

Proposed by Gray et al. in 1995

Can be generated “manually” from a relational DB but this is very inefficient

Exploit the relationship between cuboids to compute all 2^d cuboids

In OLAP environments, we typically pre-compute these views to improve query response time

ABC

AB AC BC

A C B

ALL
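
A minimal sketch of the lattice for d = 3, assuming dimensions named A, B, C as in the diagram. The helper names `cuboids` and `parents` are hypothetical. It enumerates the 2^d group-bys and, for a given cuboid, lists the cuboids with one extra dimension from which it could be aggregated.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every group-by (cuboid) over `dimensions`, widest first."""
    dims = tuple(dimensions)
    views = []
    for k in range(len(dims), -1, -1):
        views.extend(combinations(dims, k))
    return views


def parents(view, dimensions):
    """Cuboids with exactly one more dimension than `view`."""
    return [
        tuple(d for d in dimensions if d in view or d == extra)
        for extra in dimensions
        if extra not in view
    ]


lattice = cuboids("ABC")
print(len(lattice))                            # 8, i.e. 2**3
print(["".join(v) or "ALL" for v in lattice])  # ['ABC', 'AB', 'AC', 'BC', 'A', 'B', 'C', 'ALL']
print(parents(("A",), "ABC"))                  # [('A', 'B'), ('A', 'C')]
```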

Existing Parallel Results

Goil & Choudhary

MOLAP solution

in-memory structures

global partition + d communication rounds

distributed views

Limitations:

Memory for multi-dimensional arrays

Expensive communication for larger d

J. of Data Mining & Knowledge Discovery 1(4), 1997

Our Approach

ROLAP solution

Construct and cost the data cube lattice

Find a “least cost” spanning tree

Partition the spanning tree over the processors equally, construct views and distribute

Can handle partial cubes

Limitations: What about indexing?

ABCD

ABC ABD ACD BCD

AB AC AD BC BD CD

A B C D

All

CCGrid’01 + J. Dist. & Parallel Databases 11(2), 2001
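
One simple way to picture a “least cost” spanning tree over the lattice is sketched below. This is not the costing used in the cited papers, which the slides do not spell out. It assumes, purely for illustration, that the cost of computing a view is the estimated size of the parent it is aggregated from, and that size is the product of hypothetical per-dimension cardinalities. Under that assumption each view independently picks its smallest parent.

```python
from itertools import combinations

# Hypothetical dimension cardinalities, invented for this example.
CARD = {"A": 10, "B": 100, "C": 1000, "D": 5}


def estimated_size(view):
    """Assumed cost model: product of the cardinalities of the view's dimensions."""
    size = 1
    for d in view:
        size *= CARD[d]
    return size


def smallest_parent_tree(dimensions, size):
    """Map every view except the root to its cheapest parent (one extra dimension)."""
    dims = tuple(dimensions)
    tree = {}
    for k in range(len(dims) - 1, -1, -1):
        for view in combinations(dims, k):
            candidates = [
                tuple(d for d in dims if d in view or d == extra)
                for extra in dims
                if extra not in view
            ]
            tree[view] = min(candidates, key=size)
    return tree


tree = smallest_parent_tree("ABCD", estimated_size)
print(tree[("A",)])      # ('A', 'D'): AD has 50 rows, AB 1000, AC 10000
print(tree[("B", "C")])  # ('B', 'C', 'D')
print(tree[()])          # ('D',)
```

Because every non-root view has exactly one parent, the result is a tree rooted at ABCD. Partitioning it over processors would then amount to splitting its subtrees into groups of roughly equal total cost.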

Parallel Multi-dimensional Indexing

Query specifies a range on multiple dimensions

Forms a hypercube in the point space
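
As a minimal sketch, a range query can be represented by a low corner and a high corner, one coordinate per dimension. The function names below are hypothetical. The first two functions test points against the query box; the third is the box-against-box test an index would use to skip buckets whose bounding box misses the query.

```python
def in_range(point, low, high):
    """True if `point` lies inside the axis-aligned box [low, high] on every dimension."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, low, high))


def range_query(points, low, high):
    """Brute-force range query: keep the points that fall inside the box."""
    return [p for p in points if in_range(p, low, high)]


def boxes_intersect(low1, high1, low2, high2):
    """True if two axis-aligned boxes overlap on every dimension."""
    return all(
        l1 <= h2 and l2 <= h1
        for l1, h1, l2, h2 in zip(low1, high1, low2, high2)
    )


points = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
print(range_query(points, (0, 0, 0), (5, 5, 6)))       # [(1, 2, 3), (4, 5, 6)]
print(boxes_intersect((0, 0), (2, 2), (2, 2), (4, 4)))  # True
print(boxes_intersect((0, 0), (1, 1), (2, 2), (3, 3)))  # False
```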

General Approach

No multidimensional index is universally successful

Exploit domain specific information and the features of a particular index

OLAP:

Data is provided up front

Updates are batch oriented

Design Goals

A framework for distributed high-performance indexing of ROLAP cubes

Practical to implement

Low communication volume

Fully adapted to external memory (disks)

No shared disk required

Incrementally maintainable

Efficient for high D spatial searches

Scalable in terms of data size, dimensions, processors

Challenge

How to order and partition data such that:

Number of records retrieved per node is as balanced as possible

Minimize the number of disk seeks required in answering a query

Figure: view ABC partitioned across processors P1, P2, P3, P4

Indexing the Data Cube

Combine the strengths of a space-filling curve and an R-tree index

Use Hilbert curve to load buckets

Index buckets with r-tree

Update indexes with merge/sort
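
The steps above can be sketched in two dimensions. This is an illustration under stated assumptions, not the RCUBE implementation: the slides target many dimensions, while `hilbert_index` here is the standard 2-D Hilbert mapping on an n x n grid (n a power of two). The helpers `pack_buckets` and `merge_update` and the bucket capacity are hypothetical. Points are sorted by Hilbert position, cut into fixed-size buckets, and each bucket gets a bounding box of the kind an R-tree leaf would store. A batch update is merged into the sorted run before repacking.

```python
import heapq


def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate or flip the quadrant so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_sort(points, n):
    """Order points by their position on the Hilbert curve."""
    return sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))


def pack_buckets(ordered, capacity):
    """Cut a Hilbert-ordered run into buckets and record each bucket's bounding box."""
    buckets = []
    for i in range(0, len(ordered), capacity):
        chunk = ordered[i:i + capacity]
        low = (min(p[0] for p in chunk), min(p[1] for p in chunk))
        high = (max(p[0] for p in chunk), max(p[1] for p in chunk))
        buckets.append({"low": low, "high": high, "points": chunk})
    return buckets


def merge_update(ordered, batch, n):
    """Batch update: sort the new points, then merge them into the existing run."""
    def key(p):
        return hilbert_index(n, p[0], p[1])
    return list(heapq.merge(ordered, sorted(batch, key=key), key=key))


N = 4
run = hilbert_sort([(x, y) for x in range(N) for y in range(N)], N)
buckets = pack_buckets(run, capacity=4)
print(buckets[0])  # {'low': (0, 0), 'high': (1, 1), 'points': [(0, 0), (1, 0), (1, 1), (0, 1)]}

updated = merge_update([(0, 0), (3, 0)], [(1, 1), (0, 1)], N)
print(updated)     # [(0, 0), (1, 1), (0, 1), (3, 0)]
```

Because consecutive Hilbert positions are neighbouring cells, each bucket covers a compact region, which keeps the bounding boxes small.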

Space Filling Curves & Striping

Query Retrieval

Figure: processors P1, P2, P3, P4, each holding its own portion of view ABC

Example

Original space: 8 points to be reported

Processor 1 reports: 2 consecutive blocks & 4 points

Processor 2 reports: 2 consecutive blocks & 4 points
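
The arithmetic behind this example can be sketched as follows. The figure itself is not recoverable, so the numbers are assumptions chosen to match the counts on the slide: blocks are numbered in Hilbert order, striped round-robin over 2 processors, each block holds 2 points, and the query touches 4 consecutive blocks. The function names are hypothetical.

```python
def stripe(num_blocks, p):
    """Round-robin striping: global block b goes to processor b % p."""
    return [list(range(proc, num_blocks, p)) for proc in range(p)]


def local_hits(query_blocks, p):
    """For the global blocks a query touches, list each processor's local block positions."""
    hits = {proc: [] for proc in range(p)}
    for b in query_blocks:
        hits[b % p].append(b // p)
    return hits


POINTS_PER_BLOCK = 2            # assumed
layout = stripe(num_blocks=8, p=2)
print(layout)                   # [[0, 2, 4, 6], [1, 3, 5, 7]]

hits = local_hits([2, 3, 4, 5], p=2)
print(hits)                     # {0: [1, 2], 1: [1, 2]}
print({proc: len(blocks) * POINTS_PER_BLOCK for proc, blocks in hits.items()})  # {0: 4, 1: 4}
```

A run of consecutive blocks in the global order stays consecutive on each processor's local disk, so every processor reads the same number of blocks with a single seek.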

The Parallel Framework

A single view is partitioned across p processors

Partial Hilbert/r-tree indexes are computed locally

Queries are answered concurrently

Queries answered individually or “piggy-backed”
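
A toy sketch of the framework, with loud assumptions: threads stand in for cluster nodes, each partition is a plain list scanned linearly instead of a local Hilbert/r-tree index, and all function names are hypothetical. `parallel_query` sends one query to every partition and concatenates the partial answers. `piggybacked` answers several queries in a single pass over one partition.

```python
from concurrent.futures import ThreadPoolExecutor


def inside(point, low, high):
    """True if `point` lies in the box [low, high] on every dimension."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, low, high))


def local_query(partition, low, high):
    """What one processor does: search only its own portion of the view."""
    return [p for p in partition if inside(p, low, high)]


def parallel_query(partitions, low, high):
    """Run the same query on every partition concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        parts = list(pool.map(lambda part: local_query(part, low, high), partitions))
    return [p for part in parts for p in part]


def piggybacked(partition, queries):
    """Answer a batch of (low, high) queries in one pass over the partition."""
    results = [[] for _ in queries]
    for p in partition:
        for i, (low, high) in enumerate(queries):
            if inside(p, low, high):
                results[i].append(p)
    return results


partitions = [[(1, 1), (5, 5)], [(2, 2), (9, 9)]]
print(parallel_query(partitions, (0, 0), (5, 5)))
# [(1, 1), (5, 5), (2, 2)]
print(piggybacked(partitions[0], [((0, 0), (1, 1)), ((0, 0), (9, 9))]))
# [[(1, 1)], [(1, 1), (5, 5)]]
```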

The Virtual Data Cube

Problem: Full cube often too large to materialize

Solution: Use surrogate views
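
The slides do not say how a surrogate is chosen, so the sketch below makes its own assumptions: the surrogate for a missing view is the smallest materialized view containing all of its dimensions, and the measure is a sum, so extra dimensions can simply be aggregated away. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def answer_from_surrogate(materialized, wanted):
    """Answer group-by `wanted` from the smallest materialized view that covers it.

    `materialized` maps a view (tuple of dimension names) to {key tuple: summed measure}.
    """
    candidates = [v for v in materialized if set(wanted) <= set(v)]
    if not candidates:
        raise KeyError(f"no materialized view covers {wanted}")
    surrogate = min(candidates, key=lambda v: len(materialized[v]))
    keep = [surrogate.index(d) for d in wanted]
    result = defaultdict(int)
    for key, total in materialized[surrogate].items():
        result[tuple(key[i] for i in keep)] += total  # collapse the extra dimensions
    return surrogate, dict(result)


# Hypothetical partial cube: only ABC and AB are materialized.
materialized = {
    ("A", "B", "C"): {
        ("a1", "b1", "c1"): 1,
        ("a1", "b1", "c2"): 2,
        ("a1", "b2", "c1"): 3,
        ("a2", "b1", "c1"): 4,
    },
    ("A", "B"): {("a1", "b1"): 3, ("a1", "b2"): 3, ("a2", "b1"): 4},
}

print(answer_from_surrogate(materialized, ("A",)))
# (('A', 'B'), {('a1',): 6, ('a2',): 4})
print(answer_from_surrogate(materialized, ("A", "C")))
# (('A', 'B', 'C'), {('a1', 'c1'): 4, ('a1', 'c2'): 2, ('a2', 'c1'): 4})
```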

Surrogate Processing

Other issues…

Dimension ordering

Query piggybacking

Batch updating

Managing hierarchies of views

Experimental Results

Machine: 17 node cluster

Node = 1.8 GHz Xeon, 1 GB RAM, 2 * 40 GB IDE drives, running Linux

Interconnect = Intel Fast Ethernet switch

Test Data: 10 dimensions and 1,000,000 records

RCUBE index Construction

Output: ~640 million rows, 16 Gigabytes

Distributed Query Resolution

Test: Random queries returning ~15% of points (10 experiments per point)

Disk blocks retrieved vs. Disk Seeks

Test: Random queries returning 5-15% of points (15 experiments per point)

Distributed Query Resolution in Surrogate Group-bys

Thank You

Questions?