MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
(Transcript of the conference talk)
Hyeontaek Lim (Carnegie Mellon University), Dongsu Han (KAIST),
David G. Andersen (Carnegie Mellon University), Michael Kaminsky (Intel Labs)
Goal: Fast In-Memory Key-Value Store
• Improve per-node performance (ops/sec/node)
  • Less expensive
  • Easier hotspot mitigation
  • Lower latency for multi-key queries
• Target: small key-value items (fit in a single packet)
• Non-goals: cluster architecture, durability
Q: How Good (or Bad) Are Current Systems?
• Workload: YCSB [SoCC 2010]
  • Single-key operations
  • In-memory storage; logging turned off in our experiments
• End-to-end performance over the network
• Single server node
End-to-End Performance Comparison

Throughput (M operations/sec):

System      95% GET (published)   95% GET (optimized)   50% GET (optimized)
Memcached   1.4                   1.3                   0.7
RAMCloud    0.1                   5.8                   1.0
MemC3       4.4                   5.7                   0.9
Masstree    8.9                   16.5                  6.5
MICA        --                    65.6                  70.4

• Published results have logging on for RAMCloud/Masstree; optimized runs use Intel DPDK (kernel bypass I/O) with no logging.
• 50% GET is a write-intensive workload; the other systems' performance collapses under heavy writes.
• Slide callouts mark a 13.5x gap to MICA and a "4x" line for the maximum packets/sec attainable using UDP.
MICA Approach
• MICA: redesigning in-memory key-value storage
• Applies a new software architecture and data structures to general-purpose hardware in a holistic way

[Figure: a server node (client → NIC → CPUs → memory) annotated with the three design areas: 1. parallel data access, 2. request direction, 3. key-value data structures (cache & store)]
Parallel Data Access
• Modern CPUs have many cores (8, 15, …)
• How to exploit CPU parallelism efficiently?
Parallel Data Access Schemes

Concurrent Read / Concurrent Write (all cores share all memory):
+ Good load distribution
- Limited CPU scalability (e.g., synchronization)
- Cross-NUMA latency

Exclusive Read / Exclusive Write (each core owns its partition):
+ Good CPU scalability
- Potentially low performance under skewed workloads
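The exclusive scheme can be sketched as follows. This is a toy dict-backed illustration, not MICA's actual API: each key hashes to exactly one partition, and each partition is served by exactly one core, so no locking is needed.

```python
# Minimal sketch of exclusive-access partitioning (all names illustrative).

NUM_PARTITIONS = 4  # MICA typically uses one partition per CPU core

def partition_of(key: bytes) -> int:
    # MICA derives the partition from bits of the key hash;
    # Python's built-in hash() is a stand-in here.
    return hash(key) % NUM_PARTITIONS

# Per-partition stores; each is touched only by its owning core.
stores = [dict() for _ in range(NUM_PARTITIONS)]

def put(key: bytes, value: bytes) -> None:
    stores[partition_of(key)][key] = value

def get(key: bytes):
    return stores[partition_of(key)].get(key)
```

Because a partition is never shared, the hot path needs no synchronization, which is where the scalability advantage comes from.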
In MICA, Exclusive Outperforms Concurrent

End-to-end throughput (Mops) with kernel bypass I/O:

Workload            Concurrent access   Exclusive access
Uniform, 50% GET    35.1                76.9
Uniform, 95% GET    69.3                76.3
Skewed, 50% GET     21.6                70.4
Skewed, 95% GET     69.4                65.6
Request Direction
• Send requests to the appropriate CPU cores for better data access locality
• Exclusive access benefits from correct delivery: each request must be sent to the corresponding partition's core
Request Direction Schemes

Flow-based affinity: the NIC classifies packets by their 5-tuple.
+ Good locality for flows (e.g., HTTP over TCP)
- Suboptimal for small key-value processing

Object-based affinity: classification depends on the request content (the key), so requests for Key 1 and Key 2 can reach different cores.
+ Good locality for key access
- Client assist or special hardware support needed for efficiency
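One way client-assisted, object-based affinity can work is sketched below. This is a hypothetical illustration, not MICA's exact protocol: the client hashes the key and picks a server UDP port, and the server NIC steers each port (e.g., via per-port flow-steering rules) to the RX queue of the core owning that key's partition. Port numbers and helper names are made up.

```python
# Hypothetical client-side port selection for object-based affinity.

NUM_SERVER_CORES = 4
BASE_UDP_PORT = 8000  # assumption: server listens on one UDP port per core

def keyhash(key: bytes) -> int:
    # Stand-in for the real 64-bit key hash shared by client and server.
    return hash(key) & 0xFFFFFFFFFFFFFFFF

def dest_port(key: bytes) -> int:
    # The chosen port determines which core's RX queue receives the
    # request, so it lands on the core that owns the key's partition.
    return BASE_UDP_PORT + (keyhash(key) % NUM_SERVER_CORES)
```

The point of the slide is that this mapping must be enforced in NIC hardware: doing the same classification in server software burns the very CPU cycles the design is trying to save.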
Crucial to Use NIC Hardware for Request Direction

End-to-end throughput (Mops), using exclusive access for parallel data access:

Request direction                          Uniform   Skewed
Done solely by software                    33.9      28.1
Client-assisted, hardware-based (NIC)      76.9      70.4
Key-Value Data Structures
• Significant impact on key-value processing speed
• A new design is required for very high op/sec for both reads and writes
• “Cache” and “store” modes
MICA’s “Cache” Data Structures
• Each partition has:
  • A circular log (for memory allocation)
  • A lossy concurrent hash index (for fast item access)
• Exploit Memcached-like cache semantics
  • Lost data is easily recoverable (not free, though)
  • Favor fast processing
• Provide good memory efficiency & item eviction
Circular Log
• Allocates space for key-value items of any length
• Combines conventional logs with circular queues (fixed log size)
• Simple garbage collection / free-space defragmentation
• New items are appended at the tail
• Insufficient space for a new item? Evict the oldest item at the head (FIFO)
• LRU is supported by reinserting recently accessed items
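The append-and-evict behavior above can be sketched as follows. This is a minimal model assuming simple string items tracked in a deque; real MICA appends variable-length binary items into a fixed-size contiguous memory region and wraps offsets around.

```python
from collections import deque

class CircularLog:
    """Toy model of a fixed-size, FIFO-evicting circular log."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.items = deque()  # head (oldest) ... tail (newest)

    def append(self, key: str, value: str) -> None:
        size = len(key) + len(value)
        # Insufficient space for the new item? Evict the oldest
        # items at the head (FIFO) until the new item fits.
        while self.items and self.used + size > self.capacity:
            _, _, old_size = self.items.popleft()
            self.used -= old_size
        # The new item is appended at the tail.
        self.items.append((key, value, size))
        self.used += size

    def get(self, key: str):
        for k, v, _ in self.items:
            if k == key:
                return v
        return None
```

Reinserting an item on access would move it back to the tail, which is how the slide's LRU approximation falls out of the same append path.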
Lossy Concurrent Hash Index
• Indexes key-value items stored in the circular log
• Set-associative table: hash(key) selects one of buckets 0 … N-1
• Full bucket? Evict the oldest entry from it
• Fast indexing of new key-value items
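The lossy eviction policy can be sketched like this, assuming 8-way buckets (MICA sizes a bucket to fit a cache line) and Python lists standing in for the packed bucket entries; names and sizes are illustrative only.

```python
# Toy model of a lossy, set-associative hash index. Each entry pairs
# the key hash with the item's offset in the circular log.

NUM_BUCKETS = 1024
WAYS = 8  # entries per bucket

buckets = [[] for _ in range(NUM_BUCKETS)]

def index_insert(key_hash: int, log_offset: int) -> None:
    bucket = buckets[key_hash % NUM_BUCKETS]
    if len(bucket) == WAYS:
        bucket.pop(0)  # full bucket: evict its oldest entry (lossy!)
    bucket.append((key_hash, log_offset))

def index_lookup(key_hash: int):
    # Returns the log offset, or None if the entry was never
    # inserted or has since been evicted.
    for h, offset in buckets[key_hash % NUM_BUCKETS]:
        if h == key_hash:
            return offset
    return None
```

Dropping an index entry is acceptable under cache semantics: a lookup miss for an evicted entry is indistinguishable from a normal cache miss, and the client simply refetches the item.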
MICA’s “Store” Data Structures
• Required to preserve stored items
• Achieve similar performance by trading memory:
  • Circular log → segregated fits
  • Lossy index → lossless index (with bulk chaining)
• See our paper for details
Evaluation
• Going back to the end-to-end evaluation…
• Throughput & latency characteristics
Throughput Comparison

End-to-end throughput (Mops) with kernel bypass I/O:

System      Uniform, 50% GET   Uniform, 95% GET   Skewed, 50% GET   Skewed, 95% GET
Memcached   0.7                1.3                0.7               1.3
MemC3       0.9                5.4                0.9               5.7
RAMCloud    0.9                6.1                1.0               5.8
Masstree    5.7                12.2               6.5               16.5
MICA        76.9               76.3               70.4              65.6

The other systems are bad at high write ratios, leaving a large performance gap; MICA performs similarly regardless of skew and write ratio.
Throughput-Latency on Ethernet

[Figure: average latency (μs, 0-100) vs. throughput (Mops) for original Memcached (standard socket I/O, reaching only ~0.3 Mops) and MICA (reaching 70+ Mops); both use UDP. MICA delivers 200x+ the throughput at comparable latency.]
MICA
• Redesigning in-memory key-value storage
• 65.6+ Mops/node even under heavy skew/writes
• Source code: github.com/efficient/mica
References
• [DPDK] Intel Data Plane Development Kit. http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/packet-processing-is-enhanced-with-software-from-intel-dpdk.html
• [FacebookMeasurement] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload Analysis of a Large-Scale Key-Value Store. In Proc. SIGMETRICS 2012.
• [Masstree] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache Craftiness for Fast Multicore Key-Value Storage. In Proc. EuroSys 2012.
• [MemC3] Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In Proc. NSDI 2013.
• [Memcached] http://memcached.org/
• [RAMCloud] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. Fast Crash Recovery in RAMCloud. In Proc. SOSP 2011.
• [YCSB] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proc. SoCC 2010.