qCube: Efficient integration of range query operators over a high dimension data cube
-
Upload
rodrigo-rocha-silva -
Category
Technology
-
view
272 -
download
1
description
Transcript of qCube: Efficient integration of range query operators over a high dimension data cube
qCube: Efficient integration of range query operators over a high
dimension data cube
Rodrigo Rocha SilvaDoctorate Student
Prof. Dr. Celso Massaki HirataAdvisor
Prof. Dr. Joubert de Castro LimaCo-Advisor
ITA – INSTITUTO TECNOLÓGICO DE AERONÁUTICAElectronic Engineering and Computer Science Division - EEC/I
Department of Computer ScienceBrazil
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
228º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Goal
Present a new cube approach, designed for high dimension range queries. Our cube approach, named Query Cube (qCube), implements Equal, Not Equal, Greater or Less than, Some, Between and Similar
range query operators and Distinct, Sub-cube and Top-k Similar inquire query
operators
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
328º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Topics
– Motivation
– Data Cube
– Related Work
– Query Cube (qCube)
– Experiments
– Results
– Conclusions
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
428º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Motivation
Users need to view data in a tangible way, such as reports,
cross tables and histograms
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
528º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
• Suppose that at some decision-makingprocess it is necessary the followinginformation :
Motivation
“What is the women journal research papers variance impact, using months {1, 3, 5, 7, 11}, year 2012 and ages varying from 25-40
years? Return results for all countries”
“The average temperatures above 30 degrees
Celsius on the weekends of leap years in the last
200 years.”
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
628º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
A data cube, introduced by Gray et al., 1996, is
a generalization of the group-by operator over all
possible combinations of dimensions with
various granularity aggregates.
Data Cube
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
728º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
A data cube has exponential complexity with respect to the
number of dimensions
For an input with size d the output has size 2d
Data Cube
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
828º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Data Cube
• Hierarchies
DisciplineDiscipline
DepartmentDepartment
YearYear
YearYear
DayDay
HourHour
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
928º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Base Relation R – 11 tuples
A B C COUNT
a1 b1 c1 1
a3 b3 c2 1
a2 b3 c2 1
a3 b1 c1 1
a2 b1 c1 1
a2 b2 c2 1
a1 b1 c2 1
a2 b2 c1 1
a3 b1 c2 1
a1 b3 c2 1
a2 b1 c2 1
A B C COUNT
* * * 11
a1 * * 3
a2 * * 5
a3 * * 3
* b1 * 6
* b2 * 2
* b3 * 3
* * c1 4
* * c2 7
a1 b1 * 2
a1 b3 * 1
a2 b1 * 2
a2 b2 * 2
a2 b3 * 1
a3 b1 * 2
a3 b3 * 1
a1 * c1 1
a1 * c2 2
a2 * c1 2
a2 * c2 3
a3 * c1 1
a3 * c2 2
* b1 c1 3
* b1 c2 3
FULL 3D CUBE
A B C COUNT
* b2 c1 1
* b2 c2 1
* b3 c2 3
a1 b1 c1 1
a3 b3 c2 1
a2 b3 c2 1
a3 b1 c1 1
a2 b1 c1 1
a2 b2 c2 1
a1 b1 c2 1
a2 b2 c1 1
a3 b1 c2 1
a1 b3 c2 1
a2 b1 c2 1
+
38 tuples
Data Cube
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1028º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Related Work – Frag-Cubing Approach
From book Han and Kamber: Data Mining Concepts and Techniques
• Partitions the data vertically
• Reduces high-dimensional cube into a set of lower
dimensional cubes
• Lossless reduction
• Offers tradeoffs between the amount of pre-processing
and the speed of online computation
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1128º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
• Let the cube aggregation function be count
• Divide the 5 dimensions into 2 shell fragments: – (A, B, C) and (D, E)
tid A B C D E
1 a1 b1 c1 d1 e1
2 a1 b2 c1 d2 e1
3 a1 b2 c1 d1 e2
4 a2 b1 c1 d1 e2
5 a2 b1 c1 d1 e3
Related Work – Frag-Cubing Example
From book Han and Kamber: Data Mining Concepts and Techniques
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1228º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
• Build traditional invert index or RID list
Attribute Value TID List List Size
a1 1 2 3 3
a2 4 5 2
b1 1 4 5 3
b2 2 3 2
c1 1 2 3 4 5 5
d1 1 3 4 5 4
d2 2 1
e1 1 2 2
e2 3 4 2
e3 5 1
Related Work – Frag-Cubing 1-D Inverted Indices
From book Han and Kamber: Data Mining Concepts and Techniques
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1328º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
• Generalize the 1-D inverted indices to multi-dimensional
ones in the data cube sense
• Compute all cuboids for data cubes ABC and DE while
retaining the inverted indices
• For example, shell
fragment cube ABC
contains 7 cuboids:
– A, B, C
– AB, AC, BC
– ABC
• This completes the offline
computation stage
111 2 3 1 4 5a1 b1
04 5 2 3a2 b2
24 54 5 1 4 5a2 b1
22 31 2 3 2 3a1 b2
List SizeTID ListIntersectionCell
∩
∩
∩
∩ ⊗
Related Work – Frag-Cubing Approach
From book Han and Kamber: Data Mining Concepts and Techniques
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1428º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
• If measures other than count are present, store in ID_measure table separate from the shell fragments
tid count sum
1 5 70
2 3 10
3 8 20
4 5 40
5 2 30
Related Work – Frag-Cubing Measure Table
From book Han and Kamber: Data Mining Concepts and Techniques
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1528º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
• Given the fragment cubes, process a query as follows
1. Divide the query into fragment, same as the shell
2. Fetch the corresponding TID list for each fragment
from the fragment cube
3. Intersect the TID lists from each fragment to construct
instantiated base table
4. Compute the data cube using the base table with any
cubing algorithm
Related Work – Frag-Cubing Query
From book Han and Kamber: Data Mining Concepts and Techniques
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1628º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
…NMLKJIHGFEDCBA
Online
ComputationBase Table
Related Work – Frag-Cubing Approach
From book Han and Kamber: Data Mining Concepts and Techniques
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1728º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
qCube ApproachImplements a set of tuple identifiers per dimension attribute, similar to Frag-Cubing;
Therefore, qCube can answer point queries using tuple identifiers intersections and range queries using unions plus intersections algorithms, regardless measure function types.
Frag-Cubing just implements point and some inquire queries.
There is no Frag-Cubing solution for queries like
“What is the women journal research papers variance impact, using months {1, 3, 5, 7, 11}, year 2012 and ages varying
from 25-40 years? Return results for all countries”
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1828º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
qCube Approach
Implements the range query operators:• Equal; • Not Equal; • Greater or Less than; • Some; • Between and Similar.
Also implements inquire query operators: • Distinct; • Sub-cube; • Top-k Similar.
Over a high dimension data cube.
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
1928º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
qCube Architecture
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2028º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
TID A B C D E
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 a1 b1 c1 d1 e1
4 a3 b3 c3 d2 e2
5 a1 b1 c4 d1 e2
6 a5 b5 c5 d2 e2
Attribute Value TID List Attribute Value TID List
a1 1, 3, 5 c2 2
a2 2 c3 4
a3 4 c4 5
a5 6 c5 6
b1 1, 3, 5 d1 1, 3, 5
b2 2 d2 2, 4, 6
b3 4 e1 1, 3
b5 6 e2 2, 4, 5, 6
c1 1, 3
Function Variance Count Average Skewness Standard deviation
tid M1 M2 M3 M4 M5
1 2.56 1 10 1 877686769698
2 3.14 1 20 0 7986676867.99
3 2.45 1 10 1 -7878789.8777
4 6.7 1 11 1 -99974333.23
5 9 1 3 1 100045.655
6 1 1 1 1 1
qCube Computation
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2128º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
The same qCube Computation algorithm
qCube Update
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2228º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
qCube UpdateTID A B C D E
1 a1 b1 c1 d1 e1
2 a2 b2 c2 d2 e2
3 a1 b1 c1 d1 e1
4 a3 b3 c3 d2 e2
5 a1 b1 c4 d1 e2
6 a5 b5 c5 d2 e2
tid A B C D E F
5 a3 b1 c4 d1 e2
7 a3 b2 c3 d3 e3
8 a2 b3 c4 d3 e2 f1
9 a5 b5 c5 d1 e1 f2
Attribute Value TID List Attribute Value TID List
a1 1, 3 c4 5, 8
a2 2, 8 c5 6, 9
a3 4, 5, 7 d1 1, 3, 5, 9
a5 6, 9 d2 2, 4, 6
b1 1, 3, 5 d3 7, 8
b2 2, 7 e1 1, 3, 9
b3 4, 8 e2 2, 4, 5, 6, 8
b5 6, 9 e3 7
c1 1, 3 f1 8
c2 2 f2 9
c3 4, 7
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2328º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
pQ= a1:*:*:*:e1
qCube Query
Attribute Value TID List Attribute Value TID List
a1 1, 3 c4 5, 8
a2 2, 8 c5 6, 9
a3 4, 5 d1 1, 3, 5, 9
a5 6, 9 d2 2, 4, 6
b1 1, 3, 5 d3 7, 8
b2 2, 7 e1 1, 3, 9
b3 4, 8 e2 2, 4, 5, 6, 8
b5 6, 9 e3 7
c1 1, 3 f1 8
c2 2 f2 9
c3 4, 7
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2428º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
rOp= (greater than + less than + between + some + different +
similar x (fv1 … fvn))
qCube Range and Inquire Query
iOp =(sub-cube + distinct + top-k similar x (fv1 … fvn))
qCube rearranges Q sub-queries in order to improve query
response times
a result of Q we have qR=(TID1, TID2 … TIDk), where TIDi is the ith tuple identifier of relation R.
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2528º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
qCube Query - example
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2628º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
qCube Query - example
“What is the women journal research papers variance impact,
using months {1, 3, 5, 7, 11}, year 2012 and ages varying from 25-
40 years? Return results for all countries”
In Q, they are (sex = women, paperType=journal, year=2012).
The range queries (month = (1,3,5,7,11), age <>25-40) are also sorted according to their cardinalities.
In Q, there is inquire query (country=distinct).
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2728º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
qCube Query - example
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2828º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Experiments• We tested qCube Computation and Query algorithms against Frag-Cubing
algorithm used in [Li et al. 2004];
• The qCube algorithms were coded in Java 64 bits;
• Frag-Cubing is a free and open source C++ application(http://illimine.cs.uiuc.edu/);
• The synthetic base relations were created using data generator provided by the IlliMine project;
• The IlliMine project is an open-source project to provide various approaches for data mining and machine learning.
• Frag-Cubing approach is part of IlliMine project.
• We ran the algorithms in two Intel Xeon six-core processors with 2.4GHz each core, 12MB cache and 128GB of RAM DDR3 1333MHz.
• The system runs Windows Server 2008 64 bits, High Performance version.
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
2928º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Results - Performance Evaluation of Point Queries and Skewed Relations
Response time per query over 100 trials: TTTT=107; CCCC=5000;
DDDD=30, SSSS=0
Response time per query over 100
trials: TTTT=107; CCCC=5000; DDDD=30,
SSSS=2.5
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
3028º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Results - Performance Evaluation of Range Query Operators and Skewed Relations
Response time queries with one infrequent point operator: T=107; C=5000; D=30, S=2.5
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
3128º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Results - Performance Evaluation of of Inquire Operators and Skewed Relations
Response time queries with inquire operators: T = 107; C = 5000; D = 30, S = 2.5.
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
3228º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Results - Runtime and Memory Consumption
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
3328º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Conclusions
• qCube has linear runtime and memory consumption, similar to Frag-Cubing;
• It implements Not Equal, Greater or Less than, Some, Between
and Similar range query operators and Distinct, Sub-cube and
Top-k Similar inquire query operators;
• When compared with Frag-Cubing, qCube is faster to answer
point and inquire queries with sub-cube operators.
• It introduces a different cube representation with less empty cells
than Frag-Cubing;
• Frag-Cubing cannot answer two sub-cube operators in a data
cube with 107 tuples, C=5000, D=30 and S=2.5.
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
3428º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
ConclusionsInteresting research directions to further extend qCube:
First, we must experiment it with holistic measures. Update and computation experiments with many holistic measures are a hard problem;
TIDs can become huge, thus memory consumption and intersection costs can become impracticable, and therefore we must address an efficient solution to partition TIDs with fast data retrieval.
Multicore and multicomputer versions of qCube must be implemented.
qCube must be improved to answer top-k queries combined with range, point and inquire queries.
Experiments with high dimensional text cubes must be made to evaluate qCube , specially its text measures computing.
Wednesday, October 02, 2012
qCube: Efficient integration of range query operators over a high dimension data cube
3528º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva
Acknowlegements