An In-Depth Analysis of Data Aggregation Cost Factors in...
Transcript of An In-Depth Analysis of Data Aggregation Cost Factors in...
An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Stephan Müller, Hasso Plattner Enterprise Platform and Integration Concepts Hasso Plattner Institute, Potsdam (Germany) DOLAP 2012 – November 2nd
Agenda
■ Data aggregation
■ Aggregations in Hyrise
■ Cost factors and benchmarks
■ Summary and future work
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Motivation
■ Aggregation
□ Abstractions for conceptualizing the real world
□ Resource-intensive operation in OLAP workloads
■ Columnar in-memory databases
□ Optimized for set processing
□ Enables reunification of OLTP and OLAP [1]
[1] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD, 2009.
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Data Aggregation
■ SELECT product, SUM (amount) AS revenue FROM sales WHERE year=2011 GROUP BY product
■ Two phases
□ Grouping (hash-based, sorting, nested loops)
□ Calculation
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Data Aggregation in Hyrise
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
1
1
2
1
1
3
4
1
amount
1
2
3
1
3
5
2
4
fruit
"fruit" dictionary
1 apples
2 bananas
3 cherries
4 grapes
5 melons
1
2
3
4
5
6
7
8
id
apples
bananas
cherries
apples
cherries
melons
bananas
grapes
fruit
10
10
20
10
30
50
20
40
amount
1
2
3
4
5
6
7
8
id
"amount" dictionary"id" dictionary
apples 10 bananas 10 cherries 20
"fruit" dictionary
2 31 5 64 7 8
...
1 1 2 1 3 1
2 31 5 64 7 8
...
Partition Iid
Partition IIfruit, amount
Partition Iid
Partition IIfruit, amount
■ Columnar Storage
■ Dictionary compression
Data Aggregation – Grouping
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
1
1
2
1
1
3
4
1
amount
1
2
3
4
5
6
7
8
id
1
2
3
1
3
5
2
4
fruit
dictionary
1 apples
2 bananas
3 cherries
4 grapes
5 melons
position on "fruit"
1
2
3
4
5
0, 3
table
1, 6
2, 4
7
5
Data Aggregation - Calculation
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
1
1
2
1
1
3
4
1
amount
1
2
3
4
5
6
7
8
id
1
2
3
1
3
5
2
4
fruit
position on "fruit"
1
2
3
4
5
0, 3
table
1, 6
2, 4
7
5
"amount" dictionary
1 10
2 20
3 30
4 40
5 50
20
50
30
10
40
fruit sum(amount)
apples
bananas
cherries
grapes
melons
Cost Modeling in IMDBs
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
■ Cost model based on cache misses [2]
□ Basic patterns for each cache level
□ Combine patterns to model complex operations
[2] Manegold, S. et al. 2002. Generic database cost models for hierarchical memory systems. VLDB. (2002).
Cost Modeling of the Aggregation
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Cost Factors
■ Data-dependent
□ Dataset size
□ Distinct grouping values
□ Distinct grouping values distribution
□ Sorting of grouping values
■ Data-independent
□ Aggregation function
□ Hash implementation
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Hash Implementations
■ Ordering
□ Sorted tree (std::map)
□ Unsorted hash table (std::unordered_map) ■ Pre-allocation strategies
□ Naïve implementation (automatic growth)
□ Distinct value setup (scaffold of hash map)
□ Histogram-based setup (scaffold and position list)
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Data Set Size
■ Number of bytes processed during query
□ Input table data
□ Intermediate result data
□ Final result data
■ Influencing factors
□ Relation size
□ Relational expression
□ Grouping attributes
□ Aggregation attributes
□ Aggregation functions
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Distinct Values – L1 Cache Misses
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Distinct Values – CPU Cycles
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Distinct Values Distribution
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
●●
●
●
●
●
1 10 100 1000 10000 100000
0e+0
02e
+06
4e+0
66e
+06
8e+0
6
Influence of distribution type − 1mio records, random sorting
Number of distinct values
Num
ber o
f cac
he m
isse
s
● Uniform − L1 cache missesExponential − L1 cache missesUniform − L2 cache missesExponential − L2 cache misses
Impact of Sorted Attributes – L1 Misses
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Hash Implementations – L1 Misses
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Hash Implementations – CPU Cycles
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Summary
■ High impact of aggregation operation in a mixed OLTP and OLAP workload
■ Data characteristics influence aggregation performance
■ Hash implementation choice matters
■ Pre-allocation can reduce cache misses
■ Future work
□ Multithreaded aggregations
□ Extension of existing cost models
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Thank you!
http://epic.hpi.uni-potsdam.de
20
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database