An Array-Based Algorithm for Simultaneous Multidimensional Aggregates Y. Zhao, P. Deshpande, J. Naughton
  • Date posted: 22-Dec-2015
An Array-Based Algorithm for Simultaneous Multidimensional Aggregates

Y. Zhao, P. Deshpande, J. Naughton

Motivation

• Previous papers showed the usefulness of the CUBE operator. There are several algorithms for computing the CUBE in Relational OLAP systems.

• This paper proposes an algorithm for computing the CUBE in Multidimensional OLAP systems.

CUBE in ROLAP

In ROLAP systems, three main ideas are used to compute the CUBE efficiently:

1. Group related tuples together (using sorting or hashing)

2. Use the grouping performed for one sub-aggregate to speed up the computation of other sub-aggregates

3. Compute an aggregate from another aggregate rather than the base table

CUBE in MOLAP

• ROLAP algorithms cannot be transferred directly to MOLAP because of how the data is stored

• In ROLAP, data is stored in tuples that can be sorted and reordered by value

• In MOLAP, data cannot be rearranged, because the position of the data determines the attribute values

Multidimensional Array Storage

Data is stored in large, sparse arrays, which leads to certain problems:

1. The array may be too big for memory

2. Many of the cells may be empty and the array will be too sparse

Chunking Arrays

Why chunk?

• A simple row-major layout (partitioning by dimension) will favor certain dimensions over others.

What is chunking?

• A method for dividing an n-dimensional array into smaller n-dimensional chunks and storing each chunk as one object on disk
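
As a rough illustration of chunking, the sketch below maps a cell's coordinates to a chunk number and an offset inside that chunk. The name `locate_cell`, the row-major numbering of chunks and of cells within a chunk, and the requirement that chunk sizes divide the dimension sizes evenly are all assumptions made for this example, not details taken from the slides.

```python
def locate_cell(coords, dim_sizes, chunk_sizes):
    """Return (chunk_number, offset_in_chunk) for one cell.

    Assumes row-major numbering of chunks and of cells inside a chunk,
    and that each chunk size divides its dimension size evenly.
    """
    chunk_number = 0
    offset = 0
    for x, d, c in zip(coords, dim_sizes, chunk_sizes):
        chunks_along_dim = d // c
        # Which chunk the coordinate falls into along this dimension.
        chunk_number = chunk_number * chunks_along_dim + x // c
        # Position of the coordinate inside that chunk.
        offset = offset * c + x % c
    return chunk_number, offset


# A 16 x 16 x 16 array split into 4 x 4 x 4 chunks (64 chunks of 64 cells).
print(locate_cell((5, 2, 9), (16, 16, 16), (4, 4, 4)))
```

Because every chunk covers a small range of every dimension, no single dimension is favored the way it would be in a plain row-major layout.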

Chunks

[Figure: a two-dimensional array over Dimension A and Dimension B divided into chunks, with labels CA and CB]

Array Compression

• Chunk-offset compression: for each valid entry, we store (offsetInChunk, data) where offsetInChunk is the offset from the start of the chunk.

• Compression is applied to sparse chunks only; a chunk is considered dense, and stored uncompressed, when more than 40% of its cells hold data
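
A minimal sketch of chunk-offset compression follows. It assumes a chunk is held as a flat Python list in which `None` marks an empty cell; the function names and that representation are illustrative choices, not part of the original slides.

```python
def density(cells, empty=None):
    """Fraction of cells in the chunk that hold a valid value."""
    return sum(value is not empty for value in cells) / len(cells)


def compress_chunk(cells, empty=None):
    """Keep one (offsetInChunk, data) pair per valid cell."""
    return [(offset, value) for offset, value in enumerate(cells)
            if value is not empty]


def decompress_chunk(pairs, chunk_len, empty=None):
    """Rebuild the flat chunk from its (offsetInChunk, data) pairs."""
    cells = [empty] * chunk_len
    for offset, value in pairs:
        cells[offset] = value
    return cells


chunk = [None, 7, None, None, 3, None, None, None]
pairs = compress_chunk(chunk)
print(density(chunk), pairs)
```

The compressed form stores two numbers per valid cell, so it only saves space when the chunk is sufficiently empty.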

Naïve Array Cubing Algorithm

Similar to ROLAP, each aggregation is computed from its parent in the lattice.

Each chunk is aggregated completely and then written to disk before moving on to the next chunk.

ABC

AB AC BC

A B C

{}

Illustrative example

Example for BC:

Start with the BC face at position 1 of dimension A and “sweep” through dimension A, aggregating as you go.

Problems with Naïve approach

• Each sub-aggregate is calculated independently

• E.g. this algorithm will compute AB from ABC, then rescan ABC to calculate AC, then rescan ABC to calculate BC

• We need a method to simultaneously compute all children of a parent in a single pass over the parent
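
The idea of updating all children during one scan of the parent can be sketched as follows. This is a simplified illustration that ignores chunking and memory management; the SUM aggregate, the function name `aggregate_children`, and the `((a, b, c), value)` cell format are assumptions.

```python
from collections import defaultdict


def aggregate_children(cells):
    """One pass over ABC cells that updates AB, AC and BC together.

    cells: iterable of ((a, b, c), value) pairs for the non-empty cells.
    """
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    for (a, b, c), value in cells:
        # Each cell is read once and contributes to all three children.
        ab[(a, b)] += value
        ac[(a, c)] += value
        bc[(b, c)] += value
    return dict(ab), dict(ac), dict(bc)


cells = [((0, 0, 0), 1), ((0, 0, 1), 2), ((1, 0, 0), 3), ((1, 1, 1), 4)]
ab, ac, bc = aggregate_children(cells)
print(ab, ac, bc)
```

The naive approach would instead loop over `cells` three times, once per child.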

Single-Pass Multi-Way Array Cubing Algorithm

• The order of scanning is vitally important in determining how much memory is needed to compute the aggregates.

• A dimension order O = (Dj1, Dj2, … Djn) defines the order in which dimensions are scanned.

• |Di| = size of dimension i

• |Ci| = size of the chunk for dimension i

• |Ci| << |Di| in general

Example of Multi-way Array

Concrete Example

• |Ci| = 4, |Di| = 16

• For BC group-bys, we need 1 chunk (4x4)

• For AC, we need 4 chunks (16x4)

• For AB, we need to keep track of the whole slice of the AB plane, so (16x16)

How much memory?

A formula for the minimum amount of memory can be generalized.

Define p = size of the largest common prefix between the current group-by and its parent

$$\prod_{i=1}^{p} |D_i| \times \prod_{i=p+1}^{n-1} |C_i|$$

Example calculation

O = {A B C D}, |Ci| = 10, |Di| = {100, 200, 300, 400}

For the ABD group-by, the max common prefix is AB. Therefore the minimum amount of memory necessary is:

|DA| x |DB| x |CD| = 100 x 200 x 10 = 200,000 array cells
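
The prefix rule can be checked with a few lines of Python. The function name `min_memory` and the dictionary-based inputs are illustrative assumptions; the arithmetic follows the rule as stated in the slides, with results counted in array cells.

```python
from math import prod


def min_memory(group_by, order, dim_sizes, chunk_sizes):
    """Minimum memory (in array cells) for one group-by.

    Dimensions in the common prefix of `group_by` and `order` contribute
    their full size |D|; the remaining group-by dimensions contribute |C|.
    """
    p = 0
    while p < len(group_by) and group_by[p] == order[p]:
        p += 1
    full = prod(dim_sizes[d] for d in group_by[:p])
    chunked = prod(chunk_sizes[d] for d in group_by[p:])
    return full * chunked


sizes = {"A": 100, "B": 200, "C": 300, "D": 400}
chunks = {d: 10 for d in sizes}
print(min_memory("ABD", "ABCD", sizes, chunks))
```

Applied to the three-dimensional case with |Di| = 16 and |Ci| = 4, the same function gives 16 cells for BC, 64 for AC and 256 for AB.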

More Memory Notes

• In simple terms, every element q in the common prefix contributes |Dq| while every other element r not in the prefix contributes |Cr|

• Since |Ci| << |Di|, to minimize the memory usage, we should minimize the max common prefix and reorder the dimension order so that the largest dimensions appear in the fewest prefixes

Minimum Memory Spanning Trees

O = { A B C }

Why is the cost of B=4?

Minimum Memory Spanning Trees (cont.)

• Using the formula for the minimum amount of memory, we can build a minimum memory spanning tree (MMST) whose total memory requirement is minimal for a given dimension order.

• For different dimension orders, the MMSTs may be very different with very different memory requirements

Effects of Dimension Order

More Effects of Dimension Order

• The early elements in O (particularly the first one) appear in the most prefixes and therefore, contribute their dimension sizes to the memory requirements.

• The last element in O can never appear in any prefix. Therefore, the total memory requirement for computing the CUBE is independent of the size of the last dimension.

Optimal Dimension Order

• Based on the previous two ideas, the optimal dimension order sorts the dimensions by increasing size.

• The total memory requirement will be minimized and will be independent of the size of the largest dimension.
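
The effect of the dimension order can be explored with a small brute-force check. As a simplification, the sketch only sums the memory of the (n-1)-dimensional group-bys rather than the whole MMST, and it assumes one chunk size for every dimension; `total_memory` and `best_order` are hypothetical names.

```python
from itertools import permutations
from math import prod


def total_memory(order, dim_sizes, chunk_size):
    """Memory for all (n-1)-dimensional group-bys under `order`.

    Simplification: only the first level below the full array is counted.
    """
    n = len(order)
    total = 0
    for k in range(n):  # k is the position of the dimension aggregated away
        # Dimensions before position k form the common prefix (full size);
        # the n - 1 - k dimensions after it contribute one chunk size each.
        total += prod(dim_sizes[d] for d in order[:k]) * chunk_size ** (n - 1 - k)
    return total


def best_order(dim_sizes, chunk_size):
    """Try every dimension order and return the cheapest one."""
    return min(permutations(dim_sizes),
               key=lambda order: total_memory(order, dim_sizes, chunk_size))


sizes = {"A": 100, "B": 200, "C": 300, "D": 400}
print(best_order(sizes, 10), total_memory(("A", "B", "C", "D"), sizes, 10))
```

For these sizes the increasing order needs 6,211,000 cells while the reverse order needs 25,241,000, and the first figure does not change if the size of D, the last dimension, is increased.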

Graphs And Results

ROLAP vs. MOLAP

MOLAP wins

MOLAP for ROLAP system

• The last chart demonstrates one of the unexpected results from this paper.

• We can use the MOLAP algorithm with ROLAP systems in three steps:

1. Scan the table and load it into an array.

2. Compute the CUBE on the array.

3. Convert the results into tables.
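
The three steps can be sketched end to end in Python. This is only an illustration of the data flow: the array is a sparse dictionary, every group-by is computed directly from it rather than with the multi-way algorithm, SUM is assumed as the aggregate, and the column names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cube_via_array(rows, dims, measure="measure"):
    # Step 1: scan the table and load it into a sparse array,
    # mapping each attribute value to a position along its dimension.
    domains = [sorted({row[d] for row in rows}) for d in dims]
    position = [{value: i for i, value in enumerate(dom)} for dom in domains]
    array = defaultdict(int)
    for row in rows:
        cell = tuple(position[i][row[d]] for i, d in enumerate(dims))
        array[cell] += row[measure]

    # Step 2: compute every group-by of the CUBE on the array.
    cube = {}
    for k in range(len(dims) + 1):
        for keep in combinations(range(len(dims)), k):
            totals = defaultdict(int)
            for cell, value in array.items():
                totals[tuple(cell[i] for i in keep)] += value
            # Step 3: convert array positions back into relational rows.
            cube[tuple(dims[i] for i in keep)] = sorted(
                tuple(domains[i][p] for i, p in zip(keep, key)) + (total,)
                for key, total in totals.items()
            )
    return cube


rows = [
    {"product": "pen", "store": "N", "measure": 2},
    {"product": "pen", "store": "S", "measure": 3},
    {"product": "ink", "store": "N", "measure": 5},
]
print(cube_via_array(rows, ("product", "store")))
```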

MOLAP for ROLAP (cont.)

• The results show that even with the additional cost of conversion between data structures, the MOLAP algorithm runs faster than directly computing the CUBE on the ROLAP tables and it scales much better.

• In this scheme, the multidimensional array is used as a query evaluation data structure rather than a persistent storage structure.

Summary

• The multidimensional array of MOLAP should be chunked and compressed.

• The Single-Pass Multi-Way Array method is a simultaneous algorithm that allows the CUBE to be calculated with a single pass over the data.

• By minimizing the overlap in prefixes and sorting dimensions in order of increasing size, we can build an MMST that gives a plan for computing the CUBE.

More Summary

• On MOLAP systems, the CUBE is calculated much faster than on ROLAP systems.

• The most surprising (and useful) result is that the MOLAP algorithm is so much faster that it can be used on ROLAP systems as an intermediate step in computing the CUBE.

Caching Multidimensional Queries Using Chunks

P. Deshpande, K. Ramasamy, A. Shukla, J. Naughton

Caching

• Caching is very important in OLAP systems, since queries are complex yet must be answered quickly.

• Previous work in caching dealt with table-level caching and query-level caching.

• This paper will propose another level of granularity using chunks.

Chunk-based caching

• Benefits:

1. Frequently accessed chunks will stay in the cache.

2. A new query need not be “contained” within a cached query to benefit from the cache

More on Chunking

• More benefits:

3. Closure property of chunks: we can aggregate chunks on one level to obtain chunks at different levels of aggregation.

4. Less redundant storage leads to a better hit ratio of the cache.

Chunking the Multi-dimensional Space

• To divide the space into chunks, distinct values along each dimension have to be divided into ranges.

• Hierarchical locality in OLAP queries suggests that ordering by the hierarchy level is the best option.

Ordering on Dimensions

Chunk Ranges

• Uniform chunk ranges do not work so well with hierarchical data.

Hierarchical Chunk Ranges

Caching Query Results

• When a new query is issued, chunks needed to answer the query are determined.

• The list of chunks is broken into 2 parts:

1) Relevant chunks from the cache

2) Missing chunks that have to be computed from the backend
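
The split into cached and missing chunks can be sketched as follows. The cache is modeled as a plain dictionary keyed by chunk number, and `compute_from_backend` is a hypothetical callback standing in for the backend query; a real cache key would presumably also carry the level of aggregation.

```python
def split_chunks(needed, cache):
    """Partition the chunk numbers a query needs into hits and misses."""
    hits = [c for c in needed if c in cache]
    missing = [c for c in needed if c not in cache]
    return hits, missing


def answer_query(needed, cache, compute_from_backend):
    """Assemble a query result from cached chunks plus backend chunks."""
    hits, missing = split_chunks(needed, cache)
    result = {c: cache[c] for c in hits}
    for c in missing:
        # Only the missing chunks are computed at the backend;
        # they are added to the cache for later queries.
        cache[c] = compute_from_backend(c)
        result[c] = cache[c]
    return result


cache = {1: "chunk-1-data", 2: "chunk-2-data"}
print(answer_query([1, 3, 2, 4], cache, lambda c: f"chunk-{c}-data"))
```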

Chunked File Organization

• The cost of a chunk miss can be reduced by organizing data in chunks at the backend.

• One possible method is to use multi-dimensional arrays, but these require changing the data structures a great deal and may result in the loss of relational access to data.

Chunk Indexing

• A chunk index is created so that given a chunk number, it is possible to access all tuples in that chunk.

• The chunked file will have two interfaces: a relational interface for SQL statements and a chunk-based interface for direct access to chunks.

Query Processing

• How to determine whether a cached chunk can be used to answer a query

1) Level of aggregation – cached chunks at the same level are used.

2) Condition Clause – selection on non group-by predicates must match exactly.

Implementation of Chunked Files

1) Add a new chunked file type to the backend database.

2) Add a level of abstraction:

   a) Add a new attribute for the chunk number

   b) Sort based on chunk number

   c) Create a chunk index with a B-tree on the chunk number
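
The abstraction-layer approach (chunk number attribute, sort, index) might look like the sketch below. A sorted key list searched with `bisect` stands in for the B-tree, and `chunk_number_of` is a hypothetical function that assigns a tuple to its chunk; neither is the authors' implementation.

```python
from bisect import bisect_left, bisect_right


def build_chunked_file(tuples, chunk_number_of):
    """Tag each tuple with its chunk number and sort on that attribute."""
    tagged = sorted(((chunk_number_of(t), t) for t in tuples),
                    key=lambda pair: pair[0])
    # The sorted key list plays the role of the B-tree chunk index.
    keys = [number for number, _ in tagged]
    return keys, tagged


def read_chunk(keys, tagged, chunk_number):
    """Chunk-based interface: fetch all tuples of one chunk."""
    lo = bisect_left(keys, chunk_number)
    hi = bisect_right(keys, chunk_number)
    return [t for _, t in tagged[lo:hi]]


# Tuples are (x, y, value) in an 8 x 8 space cut into 4 x 4 chunks.
def chunk_of(t):
    return (t[0] // 4) * 2 + t[1] // 4


keys, tagged = build_chunked_file(
    [(0, 0, 5), (5, 1, 7), (1, 1, 2), (6, 6, 1)], chunk_of)
print(read_chunk(keys, tagged, 0))
```

Since the tuples of a chunk are stored contiguously, one index lookup followed by a sequential read retrieves the whole chunk.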

Replacement Schemes

• LRU is not viable, because chunks at different levels have different costs.

• Benefit of a chunk is measured by fraction of base table it represents

• Use benefits of chunks as weights when determining which chunk to replace in the cache.
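
One simple way to use benefits as weights is sketched below: the cache evicts the chunk with the lowest benefit and falls back to least recent use only to break ties. This is a deliberately simplified stand-in, not the replacement scheme evaluated in the paper; the class name and the `benefit` values are assumptions.

```python
class BenefitCache:
    """Chunk cache that evicts the lowest-benefit chunk first."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed to be at least 1
        self.entries = {}         # key -> (benefit, last_used, data)
        self.clock = 0

    def get(self, key):
        if key not in self.entries:
            return None
        benefit, _, data = self.entries[key]
        self.clock += 1
        self.entries[key] = (benefit, self.clock, data)
        return data

    def put(self, key, data, benefit):
        self.clock += 1
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Lowest benefit goes first; ties go to the least recently used.
            victim = min(self.entries, key=lambda k: self.entries[k][:2])
            del self.entries[victim]
        self.entries[key] = (benefit, self.clock, data)


cache = BenefitCache(capacity=2)
cache.put("detail", "d", benefit=0.01)   # covers 1% of the base table
cache.put("summary", "s", benefit=0.5)   # covers 50% of the base table
cache.get("detail")                      # most recently used, still low benefit
cache.put("other", "o", benefit=0.1)     # forces one eviction
print(sorted(cache.entries))
```

Plain LRU would have evicted "summary" here, even though recomputing it is far more expensive than recomputing "detail".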

Cost Saving Ratio

• Defined as the percentage of the total cost of the queries saved due to hits in the cache.

• Better than a normal hit ratio, since chunks at different levels have different benefits.
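
The difference between the two metrics shows up in a small calculation. The sketch assumes each query is summarized as a pair (total cost, cost saved by cache hits) in arbitrary cost units; that representation and the function names are illustrative.

```python
def cost_saving_ratio(queries):
    """Percentage of total query cost saved by cache hits.

    queries: list of (total_cost, cost_saved) pairs.
    """
    total = sum(cost for cost, _ in queries)
    saved = sum(s for _, s in queries)
    return 100.0 * saved / total if total else 0.0


def hit_ratio(queries):
    """Percentage of queries answered entirely from the cache."""
    if not queries:
        return 0.0
    full_hits = sum(1 for cost, s in queries if s == cost)
    return 100.0 * full_hits / len(queries)


# One expensive query fully served from the cache, one cheap query missed.
workload = [(100, 100), (10, 0)]
print(hit_ratio(workload), cost_saving_ratio(workload))
```

Both queries count equally toward the hit ratio (50%), while the cost saving ratio (about 90.9%) reflects that the expensive query was the one served from the cache.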

Summary

• Chunk-based caching allows fine granularity and queries to be partially answered from the cache.

• A chunked file organization can reduce the cost of a chunk miss with minimal cost in implementation.

• Performance depends heavily on choosing the right chunk range and a good replacement policy