1 Fast Computation of Sparse Datacubes Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong...
1
Fast Computation of Sparse Datacubes
Vicky :: Cao Hui Ping
Sherman :: Chow Sze Ming
CTH :: Chong Tsz Ho
Ronald :: Woo Lok Yan
Ken :: Yiu Man Lung
2
Content
Introduction
Existing Methods
Proposed Method: Partitioned-Cube
Memory-Cube
Experiment
Conclusion
3
Introduction
Datacube queries compute aggregates over database relations at a variety of granularities.
Cube by: Product, Country, Date
Aggregation function: Sum(Sales)
[Figure: three-dimensional datacube. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), Country (U.S.A, Canada, Mexico, sum). Highlighted cell: total annual sales of TV in U.S.A.]
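As a point of reference, the CUBE BY query on this slide can be sketched as a brute-force computation that groups by every subset of the CUBE BY attributes. The rows below are invented for illustration, and naive_cube is a hypothetical helper name; this is the naive baseline, not the method the slides propose.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # marker for an attribute that has been aggregated away

def naive_cube(rows, dims, measure):
    """SUM(measure) for every subset of dims; one full pass over rows per subset."""
    cube = {}
    for r in range(len(dims) + 1):
        for group_by in combinations(dims, r):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] if d in group_by else ALL for d in dims)
                totals[key] += row[measure]
            cube.update(totals)
    return cube

# Hypothetical sample data (not taken from the slides).
sales = [
    {"Product": "TV", "Country": "U.S.A", "Date": "1Qtr", "Sales": 10},
    {"Product": "TV", "Country": "U.S.A", "Date": "2Qtr", "Sales": 20},
    {"Product": "TV", "Country": "Canada", "Date": "1Qtr", "Sales": 5},
    {"Product": "PC", "Country": "U.S.A", "Date": "1Qtr", "Sales": 7},
]
cube = naive_cube(sales, ("Product", "Country", "Date"), "Sales")
print(cube[("TV", "U.S.A", ALL)])  # total sales of TV in U.S.A. over all dates
```

With k CUBE BY attributes this makes 2^k passes over the data, which is the cost the later slides set out to avoid.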
4
Sparseness
A relation is sparse when its cardinality is a small fraction of the size of the cross product of the attribute domains.
Sparse relations are of particular interest, so computing datacubes over them efficiently is important.
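The definition of sparseness can be made concrete with a small sketch. It assumes that each attribute's domain is simply the set of values observed in the relation (a real schema may declare larger domains), and density is a hypothetical helper name.

```python
def density(rows, dims):
    """Fraction of the cross product of the attribute domains that actually occurs."""
    distinct_tuples = {tuple(row[d] for d in dims) for row in rows}
    cross_product_size = 1
    for d in dims:
        # Assumption: the domain of d is the set of values seen in rows.
        cross_product_size *= len({row[d] for row in rows})
    return len(distinct_tuples) / cross_product_size

# Hypothetical relation: 4 tuples over three attributes with 4 values each.
rows = [{"A": i, "B": i, "C": i} for i in range(4)]
print(density(rows, ("A", "B", "C")))  # 4 occupied cells out of 4 * 4 * 4
```

A value far below 1 indicates a sparse relation in the sense used on this slide.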
5
Problem
CUBE BY attributes with large domains
Large number of CUBE BY attributes
Existing methods are not efficient
We need something new: Partitioned-Cube
6
Existing Methods
PIPESORT
Optimizes overall cost by evaluating each path
Poor performance when the relation is sparse
Lower bound on the number of sorts is $\binom{k}{\lceil k/2 \rceil}$, where k is the number of CUBE BY attributes
Large I/O cost for huge cuboids
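To give a feel for how quickly the lower bound on the number of sorts grows, the following sketch tabulates the binomial coefficient k choose ceil(k/2) next to the total number of cuboids, 2^k. The name min_sorts is a hypothetical helper, not something from the slides.

```python
from math import ceil, comb

def min_sorts(k):
    """Lower bound on the number of sorts for k CUBE BY attributes."""
    return comb(k, ceil(k / 2))

for k in (2, 4, 8, 16):
    # columns: k, number of cuboids (2^k), lower bound on sorts
    print(k, 2 ** k, min_sorts(k))
```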
7
OVERLAP
Minimizes disk access by overlapping the computation of cuboids
But I/O cost is at least quadratic in k, even given memory-sized partitions
Classifies the cuboids into “Partition” and “SortRun” states
I/O depends on the partition size and the number of sorted runs
8
Array-Based Algorithms
Partition the data and store fragments in memory. Data compression may be applied
Allow direct access to the memory cells
For sparse data, array fragments may not fit into memory; a more costly data structure would then be required
9
Partitioned-Cube
Partitions large relations into fragments that fit in memory
It follows the recursive structure of datacubes
A sub-datacube is obtained by fixing each possible value of a CUBE BY attribute
10
Partitioned-Cube(cont.)
Algorithm Partitioned-Cube(R, {B1, …, Bm}, A, G)
  R: a set of tuples
  {B1, …, Bm}: CUBE BY attributes
  A: attribute to be aggregated
  G: aggregate function
  F: finest-granularity datacube tuples
  D: remaining datacube tuples
Step 1: if (R fits in memory) then return Memory-Cube(R, {B1, …, Bm}, A, G)
Step 2: scan R, partition on some Bj in {B1, …, Bm} into fragments R1, …, Rn
Step 3: for (i = 1 to n)
          (Fi, Di) = Partitioned-Cube(Ri, {B1, …, Bm}, A, G)
Step 4: let F = union of the Fi’s
Step 5: let (F’, D’) = Partitioned-Cube(F, {B1, …, Bm} without Bj, A, G)
Step 6: let D = union of F’, D’ and the Di’s
Step 7: return (F, D)
Country Year Sales
US 2000 10
US 2001 5
US 2000 8
US 2002 6
HK 2000 6
HK 2001 8
HK 2001 7
HK 2002 7
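The steps of the algorithm can be sketched in Python on the sample relation above. Several things here are assumptions rather than the authors' implementation: "fits in memory" is modelled by a hypothetical row-count threshold memory_limit, each fragment holds a single value of the partitioning attribute (so that attribute is treated as fixed inside the fragment), the first remaining attribute is always chosen for partitioning, the aggregate is SUM, and memory_cube is a brute-force stand-in for Memory-Cube.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

def group_sum(rows, dims, rolled_up, measure):
    """SUM(measure) after replacing the attributes in rolled_up by ALL."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(ALL if d in rolled_up else row[d] for d in dims)] += row[measure]
    return [dict(zip(dims, key), **{measure: total}) for key, total in totals.items()]

def memory_cube(rows, dims, attrs, measure):
    """Brute-force stand-in for Memory-Cube; returns (F, D)."""
    finest = group_sum(rows, dims, (), measure)
    rest = []
    for r in range(1, len(attrs) + 1):
        for rolled_up in combinations(attrs, r):
            rest += group_sum(rows, dims, rolled_up, measure)
    return finest, rest

def partitioned_cube(rows, dims, attrs, measure, memory_limit):
    # Step 1 (also taken when no CUBE BY attribute is left to partition on).
    if len(rows) <= memory_limit or not attrs:
        return memory_cube(rows, dims, attrs, measure)
    # Step 2: partition on Bj, one fragment per value (a simplification).
    bj, remaining = attrs[0], attrs[1:]
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[bj]].append(row)
    # Steps 3 and 4: cube each fragment with Bj fixed; F is the union of the Fi.
    finest, rest = [], []
    for fragment in fragments.values():
        f_i, d_i = partitioned_cube(fragment, dims, remaining, measure, memory_limit)
        finest += f_i
        rest += d_i
    # Step 5: roll Bj up to ALL and cube F over the remaining attributes.
    projected = [{**row, bj: ALL} for row in finest]
    f_prime, d_prime = partitioned_cube(projected, dims, remaining, measure, memory_limit)
    # Steps 6 and 7.
    return finest, f_prime + d_prime + rest

R = [
    {"Country": "US", "Year": 2000, "Sales": 10}, {"Country": "US", "Year": 2001, "Sales": 5},
    {"Country": "US", "Year": 2000, "Sales": 8}, {"Country": "US", "Year": 2002, "Sales": 6},
    {"Country": "HK", "Year": 2000, "Sales": 6}, {"Country": "HK", "Year": 2001, "Sales": 8},
    {"Country": "HK", "Year": 2001, "Sales": 7}, {"Country": "HK", "Year": 2002, "Sales": 7},
]
F, D = partitioned_cube(R, ("Country", "Year"), ("Country", "Year"), "Sales", memory_limit=4)
for t in F + D:
    print(t["Country"], t["Year"], t["Sales"])
```

With memory_limit=4 the eight-row relation does not fit, so it is split into the US and HK fragments, each of which is handed to the in-memory routine.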
11
Partitioned-Cube(cont.)
STEP 1: Partition the large relation into fragments that fit in memory
R:
Country Year Sales
US 2000 10
US 2001 5
US 2000 8
US 2002 6
HK 2000 6
HK 2001 8
HK 2001 7
HK 2002 7
R1:
Country Year Sales
US 2000 10
US 2001 5
US 2000 8
US 2002 6
R2:
Country Year Sales
HK 2000 6
HK 2001 8
HK 2001 7
HK 2002 7
12
Partitioned-Cube(cont.)
STEP 2: Compute the tuples in the corresponding sub-datacube
R1:
Country Year Sales
US 2000 10
US 2001 5
US 2000 8
US 2002 6
F1:
Country Year Sales
US 2000 18
US 2001 5
US 2002 6
D1:
Country Year Sales
US ALL 29
13
Partitioned-Cube(cont.)
STEP 3: In the same way, compute F2 and D2
R2:
Country Year Sales
HK 2000 6
HK 2001 8
HK 2001 7
HK 2002 7
F2:
Country Year Sales
HK 2000 6
HK 2001 15
HK 2002 7
D2:
Country Year Sales
HK ALL 28
14
Partitioned-Cube(cont.)
Step 4: $F = F_1 \cup F_2$
Step 5: recursively call this function on F to get F’ and D’
F:
Country Year Sales
US 2000 18
US 2001 5
US 2002 6
HK 2000 6
HK 2001 15
HK 2002 7
F’:
Country Year Sales
ALL 2000 24
ALL 2001 20
ALL 2002 13
D’:
Country Year Sales
ALL ALL 57
15
Partitioned-Cube(cont.)
Step 6: $D = (F' \cup D') \cup \bigcup_{i=1}^{n} D_i$
Step 7: return F, D
F:
Country Year Sales
US 2000 18
US 2001 5
US 2002 6
HK 2000 6
HK 2001 15
HK 2002 7
F’:
Country Year Sales
ALL 2000 24
ALL 2001 20
ALL 2002 13
D’:
Country Year Sales
ALL ALL 57
D1:
Country Year Sales
US ALL 29
D2:
Country Year Sales
HK ALL 28
D: the union of F’, D’, D1 and D2
16
Partitioned-Cube(cont.)
Recursively execute STEP 2 if there are more than 2 attributes
R1:
Country Year Sales
US 2000 10
US 2001 5
US 2000 8
US 2002 6
F1:
Country Year Sales
US 2000 18
US 2001 5
US 2002 6
D1:
Country Year Sales
US ALL 29
17
Memory-Cube
Performs the complex operations over each fragment independently
Minimizes the total number of paths through the search lattice
Shares the sort work
Computes the tuples in the corresponding sub-datacube
Computes the datacube tuples with the value ALL for the attributes
18
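Memory-Cube
Minimize the total number of paths through the search lattice
G(1) = D → ∅
G(2) = CD → C → ∅
       D
G(3) = BCD → BC → B → ∅
       BD → D
       CD → C
G(4) = ABCD → ABC → AB → A → ∅
       ABD → AD → D
       ACD → AC → C
       BCD → BC → B
       BD
       CD
Number of paths: $\binom{k}{\lceil k/2 \rceil}$; for k = 4, $\binom{4}{2} = 6$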
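The path sets G(1) to G(4) listed above follow a doubling pattern that can be sketched as follows. The construction is inferred from the listing itself: take two copies of the previous path set, prefix the new attribute to every node of the first copy, then move the last node of each path in the second copy onto the end of the matching path in the first. The function name paths and the use of an empty string for the empty grouping are choices made for this sketch.

```python
from math import comb

def paths(attrs):
    """Build the path set G(k) for a string of single-letter attributes, e.g. "ABCD"."""
    result = [[""]]  # G(0): one path holding only the empty grouping
    for attr in reversed(attrs):  # attributes are added right to left: D, C, B, A
        left = [[attr + node for node in path] for path in result]
        right = [list(path) for path in result]
        # Move the end node of each right-hand path onto the matching left-hand path.
        merged = [l_path + [r_path.pop()] for l_path, r_path in zip(left, right)]
        # Right-hand paths that became empty disappear.
        result = merged + [path for path in right if path]
    return result

g4 = paths("ABCD")
for path in g4:
    print(" -> ".join(node or "(empty)" for node in path))
print(len(g4), comb(4, 2))  # number of paths next to the binomial coefficient
```

Every one of the 2^k groupings appears on exactly one path, so the number of paths is the number of sort orders needed.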
19
Memory-Cube
Share sort work
Reordering the sort sequence can improve performance
The result of sorting on a shorter attribute list can be reused when sorting on a longer one
E.g. S6 = CD, S3 = CAD
After sorting S6, the entire relation does not have to be re-sorted for S3; only each block of tuples that shares a C value needs to be independently sorted in AD order.
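The CD / CAD example can be illustrated with a short sketch: once the relation is sorted on CD, the CAD order is obtained by sorting each block of equal C values on AD, without a full re-sort. The helper name resort_within_blocks and the sample tuples are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def resort_within_blocks(rows_sorted, shared, rest):
    """rows_sorted is already ordered with `shared` as leading key; sort each block on `rest`."""
    out = []
    for _, block in groupby(rows_sorted, key=itemgetter(*shared)):
        out.extend(sorted(block, key=itemgetter(*rest)))
    return out

# Hypothetical tuples over attributes A, C, D.
rows = [
    {"A": 2, "C": 1, "D": 1},
    {"A": 1, "C": 1, "D": 2},
    {"A": 1, "C": 2, "D": 1},
    {"A": 2, "C": 2, "D": 2},
    {"A": 1, "C": 1, "D": 1},
]
sorted_cd = sorted(rows, key=itemgetter("C", "D"))                   # sort order S6 = CD
sorted_cad = resort_within_blocks(sorted_cd, ["C"], ["A", "D"])      # reuse it for S3 = CAD
print(sorted_cad == sorted(rows, key=itemgetter("C", "A", "D")))
```

Each block is much smaller than the whole relation, so the per-block sorts are cheaper than sorting everything again.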
20
Memory-Cube
Sorts the in-memory relation on the attributes of the path
Like PIPESORT, makes a single scan through the data
Aggregates all the small fragments on the path
Outputs the datacube result by combining these small fragments
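The single-scan idea can be sketched as follows: once the relation is sorted on the attribute order of a path, one pass produces the aggregates for every prefix of that order, because a group is complete as soon as a later row differs on one of its attributes. The function name scan_path, the SUM aggregate and the dictionary output format are assumptions of this sketch, not details from the slides.

```python
ALL = "ALL"

def scan_path(sorted_rows, order, measure):
    """One pass over rows sorted on `order`; SUM(measure) for every prefix of `order`."""
    k = len(order)
    out = {}
    current = None        # attribute values of the previous row
    sums = [0] * (k + 1)  # sums[i]: running total of the open group on the first i attributes

    def flush(from_level):
        # Groups on prefixes of length >= from_level are complete: emit and reset them.
        for i in range(from_level, k + 1):
            out[current[:i] + (ALL,) * (k - i)] = sums[i]
            sums[i] = 0

    for row in sorted_rows:
        values = tuple(row[a] for a in order)
        if current is not None and values != current:
            # First position where this row differs from the previous one.
            changed = next(i for i in range(k) if values[i] != current[i])
            flush(changed + 1)
        current = values
        for i in range(k + 1):
            sums[i] += row[measure]
    if current is not None:
        flush(0)
    return out

R = [
    {"Country": "US", "Year": 2000, "Sales": 10}, {"Country": "US", "Year": 2001, "Sales": 5},
    {"Country": "US", "Year": 2000, "Sales": 8}, {"Country": "US", "Year": 2002, "Sales": 6},
    {"Country": "HK", "Year": 2000, "Sales": 6}, {"Country": "HK", "Year": 2001, "Sales": 8},
    {"Country": "HK", "Year": 2001, "Sales": 7}, {"Country": "HK", "Year": 2002, "Sales": 7},
]
order = ("Country", "Year")
result = scan_path(sorted(R, key=lambda r: (r["Country"], r["Year"])), order, "Sales")
for key, total in result.items():
    print(key, total)
```

Here the path is Country-Year, then Country, then the grand total, and all three cuboids come out of the same scan.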
21
Solution Analysis
I/O cost is linear in k
CPU cost (in-memory sorts) is exponential in k
CPU cost should be dominated by the I/O time
22
Experiment
CPU time vs. number of tuples
Exponential in the number of CUBE BY attributes
23
Experiment
CPU, I/O, and CPU usage % vs. number of CUBE BY attributes
CPU usage % drops for a large number of CUBE BY attributes
24
Experiment
Share sorting work
CPU time is dominated by I/O time
25
Conclusion
Partitioned-Cube computes datacubes quickly over large sparse relations
Minimizes the number of sort orders
Shows the advantages of sharing sort orders in datacube computation
First solution with linear I/O cost
26
Reference
Kenneth A. Ross, Divesh Srivastava: Fast Computation of Sparse Datacubes. VLDB 1997: 116-125
27
Q & A Section