PrefixCube: Prefix-sharing Condensed Data Cube
Jianlin Feng Qiong Fang Hulin Ding
Huazhong Univ. of Sci. & Tech.
Nov 12, 2004
DOLAP 2004 2 Jianlin Feng
Outline
– Introduction
– Related Work
– ODM: Ordered Datacube Model
– BST-Condensed Cube
– Prefix-sharing Condensed Cube
– Comparisons
– Conclusions
Introduction
Data Cube (ICDE’96)
– N-dimensional cube(A1, A2, …, AN)
– 2^N cuboids, i.e. GROUP-BYs
The Huge Size Problem
– When the source relation R is sparse, the size of a cuboid can be close to the size of R.
– Even the I/O cost of storing the cube result tuples becomes dominant.
Related Work
Condensed Cube (ICDE’02)
Dwarf (SIGMOD’02)
Quotient Cube (VLDB’02)
QC-Tree (SIGMOD’03)
Basic idea: remove redundancies existing among cube tuples.
– prefix redundancy
– suffix redundancy
Prefix redundancy
Given an example cube(A, B, C)
– Each value of dimension A occurs in 4 cuboids: cuboid(A), (AB), (AC) and (ABC)
– Possibly many times in each cuboid except cuboid(A)
Inter-cuboid and intra-cuboid prefix redundancy
Suffix Redundancy
Occurs when cube tuples belonging to different cuboids are actually aggregated from the same group of base relation tuples.
An extreme case
– Let the source relation R have only a single tuple r(a1, a2, …, an, m);
– the 2^n cube tuples can be condensed into one physical tuple (a1, a2, …, an, V), where V = aggr(r),
– together with some information indicating that it is a representative tuple.
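As a small illustration of the extreme case, the sketch below enumerates the 2^n cube tuples of a one-tuple relation and shows that they all carry the same aggregate. The helper name and the sample values are made up for this example, and the aggregate is assumed to be one (such as SUM) that returns m for a single tuple.

```python
from itertools import product

def cube_of_single_tuple(dims, measure):
    """All cube tuples of a relation holding one tuple (hypothetical helper)."""
    cube = []
    for mask in product([True, False], repeat=len(dims)):
        # keep the value where mask is True, otherwise roll up to ALL ('*')
        row = tuple(v if keep else "*" for v, keep in zip(dims, mask))
        cube.append((row, measure))
    return cube

cube = cube_of_single_tuple((3, 7, 2), 40)
print(len(cube))               # number of cube tuples, 2**n for n = 3
print({m for _, m in cube})    # set of distinct aggregate values
```

Since every cube tuple has the same aggregate, storing the base tuple once plus a marker is enough to restore all of them.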
Thinking…
Condensed cube
– It condenses the cube tuples aggregated from one single base tuple into one physical tuple in order to reduce the cube’s size.
Dwarf
– Besides suffix coalescing, i.e. multi-base-tuple condensing, it also implements full prefix-sharing and so reduces the cube size very effectively.
Motivation
How can we further reduce the condensed cube’s size while taking into account the kind of query we intend to answer, namely the range query?
Augment BST condensing with the removal of intra-cuboid prefix redundancy!
Ordered Datacube Model
Value ALL (or *) is encoded as 0.
A dimension D and its cardinality C
– each dimension value is mapped one-to-one to an integer between 1 and C inclusive.
N dimensions form an N-dimensional space.
The origin O(0, 0, …, 0) represents the grand total.
Ordered Datacube Model
Under ODM, a range query against a data cube can actually be reduced to a sub-query against only one particular cuboid in the cube or a union of such sub-queries.
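One way to read this reduction: a range query is a box in the ODM space, and every point with coordinate 0 on a dimension belongs to a cuboid that does not group by that dimension. The sketch below splits a box along that line. The function name, the (lo, hi) query format and the three-dimension setup are assumptions made for illustration, not taken from the paper.

```python
from itertools import product

DIMS = "ABC"  # example dimensions, assumed for this sketch

def split_range_query(ranges):
    """ranges: one inclusive (lo, hi) pair per dimension; 0 encodes ALL."""
    options = []
    for lo, hi in ranges:
        choices = []
        if lo == 0:
            choices.append((0, 0))             # the ALL point on this axis
        if hi >= 1:
            choices.append((max(lo, 1), hi))   # real dimension values
        options.append(choices)
    subqueries = []
    for combo in product(*options):
        # the target cuboid groups by every dimension with a non-ALL range
        cuboid = "".join(d for d, (lo, _) in zip(DIMS, combo) if lo != 0)
        subqueries.append((cuboid or "ALL", combo))
    return subqueries

one = split_range_query([(1, 5), (0, 0), (2, 3)])
two = split_range_query([(0, 3), (0, 0), (1, 1)])
print([c for c, _ in one])   # a single target cuboid
print([c for c, _ in two])   # a union of sub-queries
```

A range that stays at 1 or above on a dimension, or is exactly ALL, touches one cuboid; a range that starts at 0 and extends past it is what produces the union.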
BST-Condensed Cube
Base Single Tuple (BST)
– t1 is a BST on SD {A} and {B}
– t2 is a BST on SD {B}
A unique minimal BST-condensed cube, MinCube, is obtained by fully exploiting each BST with all of its SDs.
     A  B  C  M
t1   8  1  1  100
t2   1  8  1  50
t3   1  2  3  60
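The BST claims for the three-tuple table can be checked mechanically. The sketch below assumes the usual reading that a tuple is a BST on a dimension set SD when it is the only base tuple with its values on SD; the function name is hypothetical. Any superset of a qualifying SD also qualifies, so filtering to one-dimension sets gives the SDs listed on the slide.

```python
from itertools import combinations

DIMS = "ABC"
BASE = {"t1": (8, 1, 1), "t2": (1, 8, 1), "t3": (1, 2, 3)}

def single_dimension_sets(name):
    """Every non-empty dimension set on which the named tuple is a BST."""
    target = BASE[name]
    found = []
    for size in range(1, len(DIMS) + 1):
        for idx in combinations(range(len(DIMS)), size):
            # base tuples that agree with the target on all chosen dimensions
            same = [t for t in BASE.values()
                    if all(t[i] == target[i] for i in idx)]
            if len(same) == 1:
                found.append("".join(DIMS[i] for i in idx))
    return found

for name in BASE:
    sds = single_dimension_sets(name)
    print(name, [s for s in sds if len(s) == 1])
```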
BU-BST Condensed Cube
BottomUpBST algorithm (ICDE’02)
Each BST corresponds to only one SD.
Compared with MinCube, it is easier to compute, and normal cube tuples are easier to restore from the condensed cube.
Note: BST condensing is a special kind of prefix-sharing!
A group of cube tuples sharing a prefix is represented by one BST:
A  B  C  M
8  *  *  10
8  1  *  10
8  *  1  10
8  1  1  10
is represented by
      A  B  C  M   SD
ct7   8  1  1  10  {A}
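Restoring the normal cube tuples from a BST can be sketched as follows. The rule used here is an assumption that matches the ct7 example: dimensions in the SD (and dimensions already set to *) are kept, and every other dimension is either kept or rolled up to *. The function name is hypothetical.

```python
from itertools import product

DIMS = "ABC"

def expand_bst(tup, measure, sd):
    """Normal cube tuples represented by one BST with the given SD."""
    # dimensions outside the SD that still hold a real value are optional
    free = [i for i, d in enumerate(DIMS) if d not in sd and tup[i] != "*"]
    restored = []
    for keep in product([False, True], repeat=len(free)):
        row = list(tup)
        for i, k in zip(free, keep):
            if not k:
                row[i] = "*"          # roll this dimension up to ALL
        restored.append((tuple(row), measure))
    return restored

for row, m in expand_bst((8, 1, 1), 10, "A"):
    print(row, m)
```

With SD {A} on (8, 1, 1), dimensions B and C are both optional, which gives the four tuples of the example.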
A BU-BST Condensed Cube Example
Base relation:
     A  B  C  M
t1   8  1  1  100
t2   1  8  1  50
t3   1  2  3  60
BU-BST condensed cube:
       A  B  C  M    SID  CID
ct1    *  *  *  210  -    ALL
ct2    1  *  *  110  -    A
ct3    1  2  3  60   AB   -
ct4    1  8  1  50   AB   -
ct5    1  *  1  50   -    AC
ct6    1  *  3  60   -    AC
ct7    8  1  1  100  A    -
ct8    *  1  1  100  B    -
ct9    *  2  3  60   B    -
ct10   *  8  1  50   B    -
ct11   *  *  1  150  -    C
ct12   *  *  3  60   -    C
Note:
Intra-cuboid prefix redundancy: ct3 and ct4
Inter-cuboid prefix redundancy: ct2, ct3 and ct5
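The twelve tuples above can be reproduced with a short bottom-up recursion. This is a simplified sketch in the spirit of BottomUpBST, not the published algorithm: it partitions on each remaining dimension in A, B, C order, and when a partition holds a single base tuple with dimensions still unprocessed it emits that tuple as a BST and stops recursing. The function name and the SID/CID tagging are assumptions for illustration; the measure is assumed to be summed.

```python
from collections import defaultdict

DIMS = "ABC"
BASE = [((8, 1, 1), 100), ((1, 8, 1), 50), ((1, 2, 3), 60)]

def bu_bst(rows, start=0, prefix=("*", "*", "*"), grouped=""):
    label = grouped or "ALL"
    if len(rows) == 1 and start < len(DIMS):
        # single base tuple: fill the remaining dimensions and prune
        dims, m = rows[0]
        return [(prefix[:start] + dims[start:], m, "SID", label)]
    # normal cube tuple for this partition
    out = [(prefix, sum(m for _, m in rows), "CID", label)]
    for d in range(start, len(DIMS)):
        parts = defaultdict(list)
        for row in rows:
            parts[row[0][d]].append(row)
        for value in sorted(parts):
            child = prefix[:d] + (value,) + prefix[d + 1:]
            out += bu_bst(parts[value], d + 1, child, grouped + DIMS[d])
    return out

condensed = bu_bst(BASE)
for i, ct in enumerate(condensed, start=1):
    print(f"ct{i}", *ct)
```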
Prefix-sharing Condensed Cube - PrefixCube
BST Condensing + Intra-cuboid prefix-sharing
Prefix-sharing
PrefixCube
A PrefixCube Example
[Figure: PrefixCube for the example above, with tuple groups labelled SID = A, SID = AB, SID = B and CID = ALL, CID = A, CID = AC, reached from V-Roots and N-Roots]
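To make the intra-cuboid sharing concrete, the sketch below groups the twelve condensed tuples of the running example by their SID or CID label and stores each group as a nested-dict prefix tree. The nested dicts are only a stand-in for the actual node layout, which the slides do not spell out; the SID/CID assignment is one reading of the example table, and all names are made up for illustration.

```python
from collections import defaultdict

S = "*"
# (tuple, measure, kind, label): one reading of the condensed cube example
CONDENSED = [
    ((S, S, S), 210, "CID", "ALL"), ((1, S, S), 110, "CID", "A"),
    ((1, 2, 3), 60, "SID", "AB"),   ((1, 8, 1), 50, "SID", "AB"),
    ((1, S, 1), 50, "CID", "AC"),   ((1, S, 3), 60, "CID", "AC"),
    ((8, 1, 1), 100, "SID", "A"),   ((S, 1, 1), 100, "SID", "B"),
    ((S, 2, 3), 60, "SID", "B"),    ((S, 8, 1), 50, "SID", "B"),
    ((S, S, 1), 150, "CID", "C"),   ((S, S, 3), 60, "CID", "C"),
]

def build_tries(cube):
    """One prefix tree per (kind, label) group, keyed by non-ALL values."""
    tries = defaultdict(dict)
    for tup, measure, kind, label in cube:
        node = tries[(kind, label)]
        for value in (v for v in tup if v != S):
            node = node.setdefault(value, {})   # reuse a shared prefix
        node["M"] = measure                     # leaf holds the aggregate
    return tries

def count_cells(node):
    """Number of dimension values stored in a tree."""
    return sum(1 + count_cells(c) for k, c in node.items() if k != "M")

tries = build_tries(CONDENSED)
flat = sum(sum(v != S for v in tup) for tup, *_ in CONDENSED)
shared = sum(count_cells(t) for t in tries.values())
print(flat, shared)   # stored dimension values without and with sharing
```

In this tiny example only the SID = AB and CID = AC groups contain tuples with a common prefix (A = 1), so the saving is small; the sharing happens within a group only, never across groups.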
Corresponding Dwarf
[Figure: the corresponding Dwarf for the same example, with node1 to node4 spanning the A, B and C dimensions]
PrefixCube vs. Dwarf
                             PrefixCube       Dwarf
Prefix-sharing               Intra-cuboid     Inter- and intra-cuboid
Suffix coalescing            BST condensing   Multi-tuple condensing
Compression ratio            Lower            Higher
Saving extra value ALL?      No               Yes
Tuples clustered by cuboid?  Yes              No
PrefixCube does not blindly aim at the highest compression ratio; it is intended to strike a good compromise among cube size reduction ratio, restoring and updating costs, and query characteristics!
Effectiveness of Size Reduction
Datasets
– synthetic datasets with uniform distribution
– # of tuples: 1,000,000
[Figure: size ratio (0% to 100%) vs. number of dimensions (2 to 9) for BU-BST and PrefixCube; (a) Cardinality = 100, (b) Cardinality = 1000]
Effectiveness of Size Reduction
PrefixBUC
– Full Cube (computed by BUC)
– Prefix-sharing
[Figure: size ratio (0% to 100%) vs. number of dimensions (2 to 9) for C=100 and C=1000]
Impact of Data Density
Datasets
– Uniform distribution
– # of dimensions: 6
– Cardinality of dimensions: 100
– # of tuples: ranging from 1,000 to 1,000,000
[Figure: size ratio (0% to 100%) vs. number of tuples (1.E+03 to 1.E+06) for BU-BST, PrefixCube and PrefixBUC]
Impact of Data Skewness
Datasets
– Zipf distribution
– # of tuples: 1,000,000
– Cardinality of dimensions: ranging from 1,000 down to 500 in steps of 100
– Zipf factor: ranging from 0 to 0.8 in steps of 0.2
[Figure: size ratio (0% to 100%) vs. Zipf factor (0 to 0.8) for BU-BST, PrefixCube and PrefixBUC]
Real-world Dataset
Datasets
– Weather Datasets
– # of tuples: 1,015,367
[Figure: time in seconds (0 to 700) vs. number of dimensions (2 to 9) for BUC, BU-BST and PrefixCube]
[Figure: size ratio (0% to 100%) vs. number of dimensions (2 to 9) for BU-BST, PrefixCube and PrefixBUC]
Conclusion
A new cube structure, PrefixCube, was proposed by augmenting BU-BST condensing with intra-cuboid prefix-sharing.
– It can greatly reduce the data cube’s size compared with the BU-BST condensed cube.
– It can also reduce the impact of data skew on BU-BST condensing.
– It achieves a fairly stable size reduction on both dense and sparse datasets.
The End
Thank you!
Any questions?