-
Embarrassingly Scalable Database Systems
Anastasia Ailamaki
Data-Intensive Applications and Systems (DIAS)
Computer and Communication Sciences, EPFL
-
From Wikipedia: An embarrassingly parallel workload is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.
-
Parallelism = the way forward
• Implicit parallelism
– Simple cores offer multiprogramming, pipelining
– Sophisticated cores are superscalar, multithreaded
• Explicit parallelism
– Many-chip machines
– Many-core chips in many-chip machines
*core = processor
-
Where is parallelism?
[Figure: timeline from 1970 to 2020; labels: pipelining, ILP+, multiple cores, cluster, datacenter]
Objective: all processing smoothly exploits available parallelism
-
Scalability takes a LOT of effort
Contention-free workload!
[Figure: two plots against concurrent threads (0 to 32) for shore, BerkeleyDB, mysql, postgres, and commercialDB; one panel shows TPS (0 to 600), the other TPS/thread (log scale, 0.1 to 10)]
Our systems should be future-proof
-
One-slide summary
• New hardware: implicit AND explicit parallelism
– 1990: “parallelize as you go”
– 2011: “parallelize as you go” × #ctx on chip × #chips
• Communication is no longer simple
– 1990: “local” and “remote”
– Today: “local”, “not so local”, “somewhat remote”, …
• 1D philosophy (shared-nothing only, or shared-everything only) will no longer work
• Must adapt to available parallelism
Answer: embarrassingly scalable DBMS
-
An embarrassingly scalable system is one for which little or no effort is required to perform proportionally on very small to very large numbers of hardware contexts.
-
Outline
• Hardware evolution
– New hardware = new form of parallelism
• Efficient use of memory hierarchy
• Keeping hardware contexts busy
• Lessons for the future
-
Multiprocessor platforms
[Figure: 1970: one node with core, memory, and disk; 1980: four nodes, each with its own core, memory, and disk]
Shared-nothing parallelism natural to database processing!
-
90’s: fine-grain, implicit parallelism
• Moore’s law single-core performance
– 2x faster cores every 18 months
• Instruction-Level Parallelism (ILP)
– Pipelines, superscalar, OOO, branch prediction, overlapping cache misses
• Simultaneous multithreading
– Implements threads in a superscalar processor
DB code: >60% read/write instructions, tight instruction dependencies [Ail99]
[Figure: node with core, L2 cache, memory, and disk]
-
Implicit = parallelize-as-you-go
Gloomy news for database workloads:
• Not much ILP opportunity
• Hurt by growing processor/memory speed gap
-
Chip multiprocessors (multicore)
• Single-processor performance has stalled…
– Power, heat, design/verification complexity
– Diminishing returns (esp. for DBMS!)
• … Moore’s law has not
– 2x transistors per 18-24 mo.
• Now: multicore
– Slower, power-saving
– Lots of cores, big caches
– Throughput-oriented
[Figure: contexts/chip (0 to 16) by year (1990 to 2010) for Pentium, Itanium, Intel Core2, UltraSparc, IBM Power, and AMD]
[Figure: node with core, cache, memory, and disk]
-
Today’s picture
[Figure: CHIP 1 and CHIP 2, each with CPU0 and CPU1 (each CPU with L1I and L1D caches) and an L2 cache; L3 cache; main memory]
Non-uniform cache access
Exponentially many available hardware contexts
-
Multi-core technology trends [Har07]
• Fat Camp (FC): wide-issue, OOO, e.g., IBM Power5
• Lean Camp (LC): in-order, multi-threaded, e.g., Sun UltraSparc T1
[Figure: one core in each camp]
FC: parallelism within thread (ILP)
LC: parallelism across threads
-
So how much do we use those cores?
[Figure: execution time breakdowns under two conditions (some contexts busy, all contexts busy); panel annotations follow]
25% useful work, 65% wait for cache
25% useful work, 70% wait for cache
25% useful work, 70% wait for cache
75% useful work, 10% wait for cache
Efficient use of cache = maximize sharing
All contexts busy = parallelize
-
Outline
• Hardware evolution
• Efficient use of memory hierarchy
– Maximizing sharing potential
• Keeping hardware contexts busy
• Lessons for the future
-
Cache-conscious algorithms
• Minimize unnecessary trips to slow memory [e.g. Ail01, Sto05, Bon05]
– Data layout optimizations
– Bunch-of-tuples-at-a-time query execution
• Hide impact of cache misses [e.g. Che04, Gho05]
– New algorithms that trade accuracy for prefetching
– Make common case (sorting, hashing, etc.) efficient
• Reduce dependencies/help prediction
– Compiler-based techniques
Very important but not enough
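As a rough illustration of bunch-of-tuples-at-a-time execution, the sketch below pushes a column through scan, filter, and aggregate operators one batch per call rather than one tuple per call. The column values, batch size, and function names are hypothetical, and Python only illustrates the control flow, not the cache behavior a native engine would see.

```python
from typing import Iterator, List


def scan_batches(column: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield the column in fixed-size batches (the last one may be short)."""
    for start in range(0, len(column), batch_size):
        yield column[start:start + batch_size]


def filter_batches(batches: Iterator[List[int]], threshold: int) -> Iterator[List[int]]:
    """Apply the predicate to a whole batch per call instead of per tuple."""
    for batch in batches:
        yield [value for value in batch if value > threshold]


def sum_batches(batches: Iterator[List[int]]) -> int:
    """Aggregate batch by batch."""
    return sum(sum(batch) for batch in batches)


# Hypothetical column of values; one operator call handles up to 4 tuples.
years = [4, 6, 5, 7, 9, 5, 6, 8, 12, 5]
total = sum_batches(filter_batches(scan_batches(years, batch_size=4), threshold=5))
```

Each operator call here amortizes its overhead over a batch, which is the property the bullet above relies on; the batch size would normally be tuned so a batch fits in cache.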
-
Running data analysis queries
• Queries handled by independent threads
• Threads have large instruction/data footprint
• Lots of interference at the memory/cache level
[Figure: database system with a thread pool; operators labeled S and J run with no coordination]
Eliminate interference and expose locality
-
Service-Oriented Architecture
Conventional (monolithic server) vs. SOA-style (staged) server
• One server vs. many services
• Request-level parallelism vs. operator-level parallelism
• Very large footprint vs. much smaller footprint
[Figure: queries entering a monolithic server; queries passing through Stage 1, Stage 2, and Stage 3 of a staged server]
Orthogonal to algorithmic optimizations
-
“Classic” DB query engine
Longest anyone ever took to earn a PhD?
Average time to finish a PhD in CS?
[Figure: two query plans, each operator shown as a processing thread; one plan: scan Student, scan Dept, join, average, output; the other: scan Student, max, output; annotation: 4 + 2]
70% of execution time is data cache stalls
-
Service-oriented approach [Har05]
[Figure: a dispatcher feeding per-operator services (scan, join, average), each with a queue of queries (Q); labels: read, write]
Maximum opportunity for sharing!
-
Work sharing example
[Figure: the two query plans (scan Student, scan Dept, join, average, output; scan Student, max, output)]
I/O bound on uniprocessor: >2x speedup
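The work-sharing idea can be sketched as a single scan of Student that feeds both queries from the slides (the average time to finish a PhD in CS, and the longest time anyone took). This is a minimal sketch, not the engine described in [Har05]: the table contents, column names, and class names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudentRow:
    name: str
    dept_id: int
    years_to_phd: float


# Hypothetical tables: Dept maps id -> name, Student holds a few sample rows.
DEPT = {1: "CS", 2: "EE"}
STUDENT = [
    StudentRow("John", 1, 5.0),
    StudentRow("Anne", 2, 7.5),
    StudentRow("Chris", 1, 6.0),
    StudentRow("Niki", 2, 4.5),
]


class AverageInCS:
    """Average time to finish a PhD in CS (join with Dept, then average)."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def consume(self, row: StudentRow) -> None:
        if DEPT.get(row.dept_id) == "CS":
            self.total += row.years_to_phd
            self.count += 1

    def result(self) -> float:
        return self.total / self.count if self.count else float("nan")


class LongestPhD:
    """Longest time anyone took to earn a PhD (max over all students)."""

    def __init__(self) -> None:
        self.longest = float("-inf")

    def consume(self, row: StudentRow) -> None:
        self.longest = max(self.longest, row.years_to_phd)

    def result(self) -> float:
        return self.longest


def shared_scan(table: List[StudentRow], consumers: list) -> int:
    """Read each row once and hand it to every consumer; return rows read."""
    rows_read = 0
    for row in table:
        rows_read += 1
        for consumer in consumers:
            consumer.consume(row)
    return rows_read


avg_query, max_query = AverageInCS(), LongestPhD()
rows_read = shared_scan(STUDENT, [avg_query, max_query])
```

With two independent scans the table would be read twice; here it is read once, which is where an I/O-bound uniprocessor gains.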
-
To share, or not to share? [Jon07]
[Figure: speedup from work sharing (read-only queries), 0.0 to 2.0, versus number of shared queries (0 to 45), for 1 CPU and 8 CPU]
Great.
Now let’s run on 8 processors.
Ouch.
How can sharing destroy parallelism?
-
Work sharing in the critical path
[Figure: Query 1 and Query 2 as pipelines of Scan, Join, and Aggregate, with each query’s response time and critical path marked; P = 4.33]
-
Work sharing lengthens critical path
[Figure: Query 1 and Query 2 with shared work (Scan, Join, Aggregate); the critical path is now longer and a penalty is added to the response times; P = 2.75 versus P = 4.33]
Work sharing eliminated 60% of work but reduced available parallelism by 1.6x
-
Predicting critical paths
• Work sharing trade-off: $\mathit{Perf} = f\left(\frac{1}{|\mathit{Work}|}, \frac{1}{|\mathit{CPath}|}\right)$
• Model-guided sharing
– Predict impact of sharing
– Identify bad combinations
– Inform work sharing policy
[Figure: queries/min (0 to 250) versus query mix ratio (share-haters, 1:1, share-lovers) for three policies: always share, never share, balanced]
Balance between sharing and parallelism
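To make the trade-off concrete, here is a toy model, not the model from [Jon07]: it assumes run time is bounded by either total work divided across CPUs or the critical path, whichever is larger. All costs (`scan`, `rest`, `fanout_cost`) are hypothetical units, and the assumption that a shared scan hands results to its consumers serially is ours.

```python
def run_time(work: float, cpath: float, cpus: int) -> float:
    """Toy estimate: limited by work per CPU or by the critical path."""
    return max(work / cpus, cpath)


def plan_costs(queries: int, scan: float, rest: float,
               fanout_cost: float, share: bool) -> tuple:
    """Return (total work, critical path) for `queries` identical queries."""
    if not share:
        # Every query scans on its own; queries run fully in parallel.
        return queries * (scan + rest), scan + rest
    # One scan, but (by assumption) it serially hands output to each consumer.
    work = scan + queries * (fanout_cost + rest)
    cpath = scan + queries * fanout_cost + rest
    return work, cpath


def sharing_speedup(cpus: int, queries: int = 8, scan: float = 10.0,
                    rest: float = 2.0, fanout_cost: float = 1.0) -> float:
    """Speedup of the shared plan over the unshared plan on `cpus` CPUs."""
    unshared = run_time(*plan_costs(queries, scan, rest, fanout_cost, False), cpus)
    shared = run_time(*plan_costs(queries, scan, rest, fanout_cost, True), cpus)
    return unshared / shared


def should_share(cpus: int) -> bool:
    """A 'balanced' policy in this toy model: share only if predicted to help."""
    return sharing_speedup(cpus) > 1.0
```

With these made-up costs, sharing cuts total work from 96 to 34 units but stretches the critical path from 12 to 20, so the model predicts a gain on one CPU and a loss on eight.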
-
Summary: Implicit parallelism
• WE NEED cache-conscious query processing
– To exploit instruction-level parallelism
• Create sharing opportunities
– Share data, instructions, and work
• But, NEVER lengthen critical path
– Trade sharing for parallelism
• Program with scalability in mind
– Think global / act local
-
Outline
• Hardware evolution
• Efficient use of memory hierarchy
• Keeping hardware contexts busy
– Turn concurrency into parallelism
• Lessons for the future
-
On-line transaction processing [Jon09a]
Concurrency != parallelism
[Figure: throughput (tps/thread, log scale, 0.1 to 10) versus concurrent threads (0 to 32) for shore, BerkeleyDB, mysql, postgres, and commercialDB]
Contention-free workload!
-
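Bitten by Amdahl’s law
Amdahl’s Law: the maximum benefit from a parallel system is given by
$$\frac{\text{time}_{\text{new}}}{\text{time}_{\text{old}}} \ge (1-p) + \frac{p}{N}, \qquad \text{scaleup} = \frac{\text{time}_{\text{old}}}{\text{time}_{\text{new}}} \le \frac{1}{(1-p) + \frac{p}{N}}$$
where p = parallel fraction of work, N = hardware parallelism
[Figure: scaleup for N = 32 (0 to 10) versus degree of serialization (1-p, 0% to 100%); marked points for PostgreSQL, MySQL, and BerkeleyDB at 59%, 80%, and 8%]
For p = 92% and N = 32: scaleup = 1/(1 - 0.92 + 0.92/32) ≈ 9.19
Even a little serial code hurts a lot!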
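The bound is easy to evaluate numerically. The short sketch below computes the scaleup limit for the three degrees of serialization marked on the chart (8%, 59%, 80%) at N = 32; the function name is ours, and which system sits at which point is not recoverable from this transcript.

```python
def amdahl_scaleup(p: float, n: int) -> float:
    """Upper bound on scaleup for parallel fraction p on n hardware contexts."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Degrees of serialization (1 - p) marked on the chart, evaluated at N = 32.
for serial in (0.08, 0.59, 0.80):
    bound = amdahl_scaleup(1.0 - serial, 32)
    print(f"{serial:.0%} serial code: scaleup bound {bound:.2f} on 32 contexts")
```

Even the best case here, 8% serial code, caps out below a third of the 32x that the hardware offers.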
-
Shared-everything [Jon08, Jon09]
Lots of critical sections protect shared data
[Figure: two CPUs of eight cores each]
[Figure: critical sections (CSs) per transaction (0 to 80) for Shared-Everything, DORA, and PLP, broken down into Lock mgr, Page Latches, Log mgr, Xct mgr, Message passing, and Other]
-
Transaction processing engine
Locking logical entities (e.g., records)
[Figure: data records John, Anne, Chris, Niki]
Locking = serial code
-
Typical Lock Manager
[Figure: a lock hash table pointing to lock heads (L1, L2); each lock head has a queue of lock requests (EX); transaction T1 keeps a list of its own lock requests]
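The structure in the figure can be approximated by the following minimal sketch of a centralized lock manager. It is an illustration under simplifying assumptions, not Shore’s implementation: only exclusive (EX) locks exist, requests never block (a denied request is queued and the caller is told), and one mutex guards the whole table. Class and method names are hypothetical.

```python
import threading
from collections import defaultdict, deque


class LockManager:
    """Centralized lock table: a single mutex guards every lock head."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()         # the one critical section
        self._holders = {}                     # lock id -> xct holding it (EX)
        self._queues = defaultdict(deque)      # lock id -> waiting xcts, FIFO
        self._held_by_xct = defaultdict(list)  # xct -> lock ids it holds

    def acquire(self, xct: str, lock_id: str) -> bool:
        """Grant the EX lock if free; otherwise enqueue the request."""
        with self._mutex:
            holder = self._holders.get(lock_id)
            if holder is None:
                self._holders[lock_id] = xct
                self._held_by_xct[xct].append(lock_id)
                return True
            if holder == xct:
                return True  # already held by this transaction
            if xct not in self._queues[lock_id]:
                self._queues[lock_id].append(xct)
            return False

    def release_all(self, xct: str) -> list:
        """Release every lock of xct; return (lock id, new holder) grants."""
        granted = []
        with self._mutex:
            for lock_id in self._held_by_xct.pop(xct, []):
                waiters = self._queues[lock_id]
                if waiters:
                    new_holder = waiters.popleft()
                    self._holders[lock_id] = new_holder
                    self._held_by_xct[new_holder].append(lock_id)
                    granted.append((lock_id, new_holder))
                else:
                    del self._holders[lock_id]
        return granted
```

Every acquire and release from every thread passes through `_mutex`, which is the serial code the preceding slide warns about.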
-
Time inside the lock manager
Sun Niagara T2, TPC-B
[Figure: time breakdown (%) versus number of HW contexts (1 to 64); components: LM Acquire, LM Acquire Cont, LM Release, LM Release Cont]
Higher HW parallelism → longer request queues → longer CSs → higher contention
-
Unpredictable access patterns
[Figure: transaction processing threads accessing database records]
Data partitioning?
-
Shared-Nothing: Physical partitioning [Sto07, Dew90]
Explicit contention control; no logging, locking, latching
Physically separated data; distributed transactions; high repartitioning cost; very sensitive to skew; redundancy: memory pressure
• Partitioning 1024-way?? [Cur10]
• Concurrency control? [Jones10]
[Figure: multicore multisocket machine, two CPUs of eight cores each]
Can we make shared-everything scale?
-
Shared-everything - Logical partitioning
[Figure: data records John, Anne, Chris, Niki assigned to logical partitions of the data]
Move contention away from critical path
-
Data-Oriented Architecture (DORA) [Pan10]
• Shared-everything, logically partitioned
– Got rid of centralized lock manager
– Very fast repartitioning against load imbalances
– Still contention at the physical layers
[Figure: 16 cores]
[Figure: critical sections (CSs) per transaction (0 to 80) for Shared-Everything, DORA, and PLP, broken down into Lock mgr, Page Latches, Log mgr, Xct mgr, Message passing, and Other]
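A rough sketch of logical partitioning, loosely in the spirit of the slide rather than the design in [Pan10]: each worker thread owns a key range, and every action on a record is routed to its owner’s queue, so the owner can update its records without any centralized lock manager. The split key, record names, and the `Router`/`PartitionWorker` classes are hypothetical.

```python
import bisect
import queue
import threading


class PartitionWorker(threading.Thread):
    """Owns one logical partition; only this thread touches its records."""

    def __init__(self, name: str) -> None:
        super().__init__(name=name, daemon=True)
        self.inbox = queue.Queue()
        self.balances = {}  # records of this partition only

    def run(self) -> None:
        while True:
            action = self.inbox.get()
            if action is None:  # shutdown marker
                break
            key, delta, done = action
            # No lock needed: a single owner thread serializes all accesses.
            self.balances[key] = self.balances.get(key, 0) + delta
            done.set()


class Router:
    """Routes each action to the worker that owns the key's range."""

    def __init__(self, split_keys: list) -> None:
        self.split_keys = sorted(split_keys)  # n-1 split points, n partitions
        self.workers = [PartitionWorker(f"worker-{i}")
                        for i in range(len(self.split_keys) + 1)]
        for worker in self.workers:
            worker.start()

    def owner(self, key: str) -> PartitionWorker:
        return self.workers[bisect.bisect_right(self.split_keys, key)]

    def submit(self, key: str, delta: int) -> threading.Event:
        done = threading.Event()
        self.owner(key).inbox.put((key, delta, done))
        return done

    def shutdown(self) -> None:
        for worker in self.workers:
            worker.inbox.put(None)
            worker.join()


router = Router(["D"])  # keys below "D" go to worker 0, the rest to worker 1
pending = [router.submit(name, 10)
           for name in ["John", "Anne", "Chris", "Niki", "John"]]
for event in pending:
    event.wait()
router.shutdown()
```

Because the routing rule is just a list of split keys, changing the partitioning means changing `split_keys` rather than moving data, which hints at why repartitioning can be fast in a shared-everything setting; the sketch does not implement that step.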
-
Predictable access patterns
[Figure: access pattern over database records]
[Figure: throughput (kTpS, 0 to 120) versus real CPU load (%, 0 to 100)]
Looming problem: physical page latches
-
Page latch contention
• Shared-everything, physiological partitioning (PLP) [Pan11]
– Eliminates most of the contention at the physical layers
– Fast repartitioning
[Figure: 16 cores]
[Figure: critical sections (CSs) per transaction (0 to 80) for Shared-Everything, DORA, and PLP, broken down into Lock mgr, Page Latches, Log mgr, Xct mgr, Message passing, and Other]
-
Logging is crucial for OLTP
• Transactions must
– Write a log record describing every update
– When ready to commit, write log to disk!
• Great for single-thread performance
– But not scalable!
– Compromise performance or recoverability: $$$ (e.g., Amazon outage*)
Need efficient and scalable logging solution
* http://www.datacenterknowledge.com/archives/2010/05/13/car-crash-triggers-amazon-power-outage/
-
Why logging hurts scalability
(1) At commit, must yield for log flush
– Synchronous I/O at critical path
– Locks held for long time
– Two context switches per commit
(2) Must insert records to the log buffer
– Centralized main-memory structure
– Source of contention
• Working around the bottlenecks:
– Asynchronous commit
– Replace logging with replication and fail-over
[Figure: CPU-1 to CPU-N, each with an L1 cache, plus L2, RAM, and an HDD holding data and log]
Workarounds compromise durability
-
Attempts to scale Shore’s logmgr [Jon10a]
[Figure: throughput (ktps, 0 to 12) versus concurrent threads (0 to 32) for MCS mutex, T&T&S mutex, and Baseline]
Cannot scale by improving 1-thread performance
-
Does “correct” logging have to be so slow? [Jon10]
• Locks held for long time
– Not actually used during the flush
– Indirect way to enforce isolation (Early Lock Release)
• Two context switches per commit
– Transactions nearly stateless at commit time
– Easy to migrate transactions between threads
• Log buffer is source of contention
– Log orders incoming requests, not threads
– Log records can be combined
Compose scalability by solving each problem
-
Alleviating contention
[Figure: thread timelines for four log-insert designs: (B) Baseline, (C) Consolidation array, (D) Decoupled buffer insert, (CD) All together; legend: mutex held, start/finish, copy into buffer, waiting]
contention(work) = O(1)
contention(# threads) = O(1)
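The difference between the baseline (B) and the decoupled buffer insert (D) can be sketched as follows. This is a simplified illustration, not the Aether code from [Jon10]: `LogBuffer` and its methods are hypothetical names, the buffer never wraps or flushes, and the consolidation array (C) is not shown.

```python
import threading


class LogBuffer:
    """In-memory log buffer with two insert strategies."""

    def __init__(self, capacity: int) -> None:
        self._buf = bytearray(capacity)
        self._next = 0  # next free offset, playing the role of an LSN
        self._mutex = threading.Lock()

    def _reserve(self, size: int) -> int:
        """Claim `size` bytes; the caller must hold the mutex."""
        start = self._next
        if start + size > len(self._buf):
            raise BufferError("log buffer full")
        self._next = start + size
        return start

    def insert_baseline(self, record: bytes) -> int:
        """Baseline: the mutex is held for the reservation and the copy."""
        with self._mutex:
            start = self._reserve(len(record))
            self._buf[start:start + len(record)] = record
        return start

    def insert_decoupled(self, record: bytes) -> int:
        """Decoupled: the mutex covers only the reservation."""
        with self._mutex:
            start = self._reserve(len(record))
        # Each thread copies into its own disjoint range, outside the mutex.
        self._buf[start:start + len(record)] = record
        return start

    def read(self, start: int, size: int) -> bytes:
        return bytes(self._buf[start:start + size])

    def bytes_used(self) -> int:
        return self._next
```

In the decoupled variant the time spent holding the mutex no longer grows with record size. A real log would also have to know when every copy into a region has finished before flushing it, which this sketch leaves out.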
-
Performance as contention increases
[Figure: log insert rate (GB/s, log scale, 0.01 to 10) versus number of threads (1 to 64) for Baseline, Decoupled (D), Consolidation (C), and Hybrid (CD)]
Hybrid solution combines benefits of both
-
Log redesign = scalability
[Figure: throughput (ktps, 0 to 12) versus concurrent threads (0 to 32) for Aether, MCS mutex, T&T&S mutex, and Baseline]
Scalability >> performance
-
How far can we go? [Jon09]
Sun Niagara T1, insert-only workload
[Figure: throughput (tps/thread, log scale, 0.1 to 10) versus concurrent threads (0 to 32) for shore-mt*, shore, and commercialDB]
[Figure: two chips of four cores each]
Scalability implies performance!
*Shore-MT available at dias.epfl.ch
-
Summary: Explicit parallelism
• Keeping hardware contexts busy
– There’s no escaping from Amdahl
– Make it scale if you want it to run fast
• Partitioning eliminates contention
– Shared-nothing carries overhead
– Shared-everything made fast with logical partitioning
– Employ shared-everything on shared-nothing “islands”
• Concurrency ≠ parallelism
– Find right dimension/decouple logically unrelated operations
-
Future: The rise of the power wall
• ILP era (ca. 1990)
• Multicore era (ca. 2000+)
• Heterogeneous era (e.g., AMD Fusion)
[Figure: chips mixing CPU, NPU, GPU, and FPGA units with caches] [Mul09] [He08] [Gol05]
Think global, act local
-
Thank you!
Special thanks to Mike Carey, Alkis Polyzotis, and Divesh Srivastava for their comments!
-
References - I
[Ail99] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood: DBMSs on a Modern Processor: Where Does Time Go? VLDB 1999
[Ail01] A. Ailamaki, D. J. DeWitt, M. D. Hill, M. Skounakis: Weaving Relations for Cache Performance. VLDB 2001
[Bon05] P. Boncz, M. Zukowski, N. Nes: MonetDB/X100: Hyper-Pipelining Query Execution. CIDR 2005
[Che04] S. Chen, A. Ailamaki, P. B. Gibbons, T. C. Mowry: Improving Hash Join Performance through Prefetching. ICDE 2004
[Dew90] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. Hsiao, R. Rasmussen: The Gamma Database Machine Project. IEEE TKDE 1990
[Gho05] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y. Chen, P. Dubey: Cache-conscious Frequent Pattern Mining on a Modern Processor. VLDB 2005
[Gol05] B. T. Gold, A. Ailamaki, L. Huston, B. Falsafi: Accelerating Database Operations Using a Network Processor. DaMoN 2005
[Har07] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, A. Ailamaki, B. Falsafi: Database Servers on Chip Multiprocessors: Limitations and Opportunities. CIDR 2007
-
References - II
[Har08] S. Harizopoulos, D. J. Abadi, S. Madden, M. Stonebraker: OLTP through the looking glass, and what we found there. SIGMOD 2008
[Har05] S. Harizopoulos, V. Shkapenyuk, A. Ailamaki: QPipe: A Simultaneously Pipelined Relational Query Engine. SIGMOD 2005
[He08] B. He, K. Yang, R. Fang, M. Lu, N. K. Govindaraju, Q. Luo, P. V. Sander: Relational joins on graphics processors. SIGMOD 2008
[Jon07] R. Johnson, N. Hardavellas, I. Pandis, N. Mancheril, S. Harizopoulos, K. Sabirli, A. Ailamaki, B. Falsafi: To Share or Not To Share? VLDB 2007
[Jon08] R. Johnson, I. Pandis, A. Ailamaki: Critical sections: re-emerging scalability concerns for database storage engines. DaMoN 2008
[Jon09] R. Johnson, I. Pandis, A. Ailamaki: Improving OLTP Scalability using Speculative Lock Inheritance. VLDB 2009
[Jon09a] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, B. Falsafi: Shore-MT: a scalable storage manager for the multicore era. EDBT 2009
[Jon10] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, A. Ailamaki: Aether: A Scalable Approach to Logging. PVLDB 2010
-
References - III
[Jon10a] R. Johnson, R. Stoica, A. Ailamaki, T. C. Mowry: Decoupling contention management from scheduling. ASPLOS 2010
[Mul09] R. Müller, J. Teubner, G. Alonso: Data Processing on FPGAs. PVLDB 2009
[Pan10] I. Pandis, R. Johnson, N. Hardavellas, A. Ailamaki: Data-Oriented Transaction Execution. PVLDB 2010
[Pan11] I. Pandis, P. Tözün, R. Johnson, A. Ailamaki: PLP: Page Latch-free Shared-everything OLTP. Technical Report, EPFL DIAS, 2011 (available upon request)
[Sto05] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, S. B. Zdonik: C-Store: A Column-oriented DBMS. VLDB 2005
[Sto07] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, P. Helland: The End of an Architectural Era (It's Time for a Complete Rewrite). VLDB 2007
[Cur10] C. Curino, Y. Zhang, E. P. C. Jones, S. Madden: Schism: a Workload-Driven Approach to Database Replication and Partitioning. PVLDB 2010
[Jones10] E. P. C. Jones, D. J. Abadi, S. Madden: Low overhead concurrency control for partitioned main memory databases. SIGMOD 2010