LLNL-PRES-738931
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Near Memory Key/Value Lookup Acceleration
MemSys 2017
Scott Lloyd, Maya Gokhale
Center for Applied Scientific Computing
October 3, 2017
Key-Value Store Introduction

▪ Used in
— Data analytics
— Scientific workloads
▪ Implementations
— Redis
— Memcached
— Riak
▪ Applications
— URL or web object caching
— Log data analysis
— De-duplication
— Genomic sequence analysis
— Access to objects stored under SHA-1 or other cryptographic hashes
Example of Open Address Hashing

[Figure: a hash function maps keys (Fred, John, Sue, Mike, Mary, Bob, Kelly) into a hash table of N buckets (0–N) holding key/value pairs under open addressing; colliding entries occupy subsequent buckets, with probe sequence lengths (PSL) of 1, 2, and 3 shown.]
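The diagram above can be sketched in code. This is an illustrative sketch (not the paper's hardware): an open-addressed hash table with linear probing that records each entry's probe sequence length (PSL); the table size, key set, and helper names are ours.

```python
# Illustrative open addressing with linear probing; each slot records
# its probe sequence length (PSL), as in the diagram.

N = 11                     # number of buckets (illustrative)
table = [None] * N         # each slot holds (key, value, psl) or None

def insert(key, value):
    i = hash(key) % N
    psl = 1
    while table[i] is not None:
        if table[i][0] == key:               # key already present: update
            table[i] = (key, value, table[i][2])
            return
        i = (i + 1) % N                      # probe the next bucket
        psl += 1
    table[i] = (key, value, psl)

def lookup(key):
    i = hash(key) % N
    for _ in range(N):                       # walk the probe sequence
        slot = table[i]
        if slot is None:                     # empty slot ends the sequence
            return None
        if slot[0] == key:
            return slot[1]
        i = (i + 1) % N
    return None

for name, v in [("Fred", 1), ("John", 2), ("Sue", 3), ("Mike", 4)]:
    insert(name, v)
```

A lookup for a missing key (say, one never inserted) walks the probe sequence until it reaches an empty slot, which is what makes open addressing streamable: the bytes to examine are contiguous.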
Open Addressing Features: Supportive of Hardware Implementation

▪ Attractive for near memory hash tables
— Allows probe sequence bytes to be streamed sequentially
— Enables a constant rate, deterministic pipeline of lookups
▪ Beneficial for small keys and values
— No extra space needed for linked lists or pointers
▪ Avoids indirection
— Chaining approaches need extra memory accesses
▪ Table space must be reserved at the outset
— Avoids the overhead of allocating memory for each record
— No need for a memory allocator
▪ Susceptible to clustering
— Requires a high quality hash algorithm
— Our hash algorithm is adapted from SpookyHash
Insertion Algorithms

▪ Lookup pipeline is compatible with multiple open addressing insertion algorithms
▪ Linear insertion
— Collisions are resolved by searching subsequent table locations for an empty slot
— Can produce large clusters (long probe sequences) at higher load factors
▪ Robin Hood hashing
— Scans forward from the initial probe point
— The entry being inserted is swapped with a resident entry that has a shorter probe sequence length
— Reduces the maximum probe sequence length
— Our experiments use Robin Hood hashing
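The Robin Hood swap rule above can be sketched as follows. This is a hedged sketch, not the paper's implementation: the inserted entry displaces ("robs") any resident entry with a shorter probe sequence length (PSL), which bounds the maximum PSL. Integer keys are used so collisions are deterministic; the table size and names are ours.

```python
# Robin Hood insertion over linear probing (illustrative sketch).

N = 16
table = [None] * N         # slots hold (key, value, psl) or None

def rh_insert(key, value):
    i = hash(key) % N
    entry = (key, value, 1)
    for _ in range(N):
        slot = table[i]
        if slot is None:
            table[i] = entry
            return
        if slot[2] < entry[2]:               # resident is "richer": swap
            table[i], entry = entry, slot
        i = (i + 1) % N                      # keep probing with the
        entry = (entry[0], entry[1], entry[2] + 1)  # displaced entry
    raise RuntimeError("table full")

for k, v in [(0, "a"), (16, "b"), (32, "c")]:  # all collide at bucket 0
    rh_insert(k, v)
```

After these three colliding inserts the entries sit in buckets 0, 1, and 2 with PSLs 1, 2, and 3; under plain linear insertion the layout is the same here, but under skewed insertion orders Robin Hood keeps the worst-case PSL smaller.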
Near Memory Key/Value Building Blocks

[Figure: a CPU with cores 0–N connects through a memory interconnect and memory channels to the memory subsystem. Hardware building blocks — load/store units 0–N, hash units 0–N, a compare unit, an SRAM scratchpad, and a host control interface — are joined by a stream interconnect with master and slave ports, supporting lookup, traversal, and reorganization.]
Lookup Pipeline Configuration

[Figure: under control from the CPU, keys are read by LSU0-R and hashed into indices; LSU1-R fetches buckets, a compare unit matches keys against buckets, a select/split stage separates values, and LSU1-W writes results. FIFOs join the stages over the stream interconnect, and the memory interconnect fans out to four memory channels.]
Emulated Memory Subsystem Architecture

▪ Multiple memory channels
▪ Up to 16 concurrent memory requests
▪ Lookup accelerators are located in the memory subsystem
▪ Scratchpad is used to communicate parameters and results between CPU and accelerator

[Figure: CPU cores with private caches and a shared cache connect through a switch and links to the memory subsystem, where lookup accelerators (LA) and a scratchpad sit alongside four memory channels.]
LiME (Logic in Memory Emulator) Implementation

▪ Actual
— Memory: 1.6 GB/s, 180 ns
— CPU: 1–800 MB/s, 1–800 MHz
— Accel.: 2–1600 MB/s, 1–200 MHz
▪ Target
— Memory: 40 GB/s, 100 ns
— CPU: 5 GB/s, 2.5 GHz
— Accel.: 10 GB/s, 1.25 GHz

[Figure: a Zynq SoC; the Processing System (PS) hosts ARM cores with L1/L2 caches and program DRAM, while the Programmable Logic (PL) hosts the lookup accelerator and other accelerators, AXI and AXI peripheral interconnects, an AXI Performance Monitor (APM), programmable delay units, BRAM/SRAM, and a trace subsystem with a trace capture device, device monitor, and trace DRAM.]

Open Source: http://bitbucket.org/perma/emulator_st/
Example: Emulator Scaling by 20

Component                 Actual      Emulated
Memory Bandwidth          1.6 GB/s    32 GB/s
Memory Latency            180 ns      9 ns (too low)
Memory Latency w/ delay   180 ns      9 + 91 = 100 ns
CPU Frequency             128.6 MHz   2.57 GHz
CPU Bandwidth             257 MB/s    5.1 GB/s
Accelerator Frequency     62.5 MHz    1.25 GHz (base logic)
Accelerator Bandwidth     500 MB/s    10 GB/s

Delay is programmable over a wide range: 0–262 us in 0.25 ns increments
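The emulated column follows mechanically from scaling the actual platform by 20: bandwidths and frequencies are multiplied, latency is divided, and the programmable delay restores a realistic latency. A quick check of the arithmetic (dictionary names are ours):

```python
# Verifying the emulator scaling table: emulated = actual scaled by 20.
SCALE = 20

actual = {
    "mem_bw_GBps": 1.6, "mem_latency_ns": 180.0,
    "cpu_freq_MHz": 128.6, "cpu_bw_MBps": 257.0,
    "accel_freq_MHz": 62.5, "accel_bw_MBps": 500.0,
}

emulated = {
    "mem_bw_GBps": actual["mem_bw_GBps"] * SCALE,               # 32 GB/s
    "mem_latency_ns": actual["mem_latency_ns"] / SCALE,         # 9 ns (too low)
    "cpu_freq_GHz": actual["cpu_freq_MHz"] * SCALE / 1000,      # ~2.57 GHz
    "cpu_bw_GBps": actual["cpu_bw_MBps"] * SCALE / 1000,        # ~5.1 GB/s
    "accel_freq_GHz": actual["accel_freq_MHz"] * SCALE / 1000,  # 1.25 GHz
    "accel_bw_GBps": actual["accel_bw_MBps"] * SCALE / 1000,    # 10 GB/s
}

# The raw emulated latency is unrealistically low, so the programmable
# delay restores the 100 ns target: 9 + 91 = 100 ns.
emulated["mem_latency_with_delay_ns"] = emulated["mem_latency_ns"] + 91
```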
Experiment Design

▪ Key/value table is filled with a scientific data set
▪ Table entries consist of
— 64-bit keys: k-length genomic sequences, i.e. k-mers
— 32-bit values: sequence numbers (32 bits due to the 32-bit ARM on the emulator)
▪ 32 million entry table is allocated at first and filled to varying degrees
— Fits within 1 GB of memory on the emulator
▪ Explore a wide range of workload characteristics

Parameter              Values
Load factor            10%–90%
Hit ratio              10%, 50%, 90%
Key repeat frequency   Uniform, Zipf
Memory latency (ns)    85R/106W, 200R/400W
Query block size       1024 keys
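A Zipfian key repeat frequency, as used above, can be generated in many ways; one simple sketch (the paper's query generator is not specified, so the function name, seed, and sampling method are ours) weights key rank r by 1/r^s with skew s = 0.99:

```python
# Illustrative generator for a 1024-key query block with Zipfian
# key repeat frequency (skew s = 0.99, as in the experiments).
import random

def zipf_queries(keys, block_size=1024, s=0.99, seed=0):
    """Sample a query block whose key repeat frequency is Zipfian."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range((1), len(keys) + 1)]
    return rng.choices(keys, weights=weights, k=block_size)

block = zipf_queries(list(range(1000)))
```

With this skew, the highest-ranked key appears far more often than the lowest-ranked one, which is why the software baseline benefits from CPU caching on Zipfian workloads.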
Lookup Algorithms Evaluated

▪ Accel
— Near memory hardware lookup accelerator
— Collision resolution: open addressing and Robin Hood hashing
— Hash function: adapted from SpookyHash (hardware friendly)
— Lookup uses linear probing
▪ Soft
— Software version of the hardware lookup algorithm
— Collision resolution: same as Accel
— Hash function: same as Accel
— Unlike the hardware, the software algorithm terminates the probe sequence search as soon as the key has been found
▪ STL
— Hash table uses the Standard Template Library (STL) unordered map
— Collision resolution: separate chaining with linked lists
— Hash function: simple
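The termination difference between Accel and Soft can be sketched as follows; this is a hedged illustration, not the evaluated implementations: the hardware pipeline streams a fixed-length probe region at a constant rate, while the software exits early on a match or an empty slot. The table layout, `MAX_PSL` bound, and names are ours.

```python
# Illustrative contrast: full-region scan (hardware-style) vs.
# early-terminating probe (software-style).

N = 16
table = [None] * N         # slots hold (key, value) or None
MAX_PSL = 4                # bound on the probe sequence length

def lookup_accel(key):
    """Always reads MAX_PSL buckets: constant, deterministic work."""
    base = hash(key) % N
    result = None
    for j in range(MAX_PSL):                 # fixed-length streamed scan
        slot = table[(base + j) % N]
        if slot is not None and slot[0] == key:
            result = slot[1]                 # note: scan continues anyway
    return result

def lookup_soft(key):
    """Stops at the first match or first empty slot: serialized accesses."""
    base = hash(key) % N
    for j in range(MAX_PSL):
        slot = table[(base + j) % N]
        if slot is None:
            return None
        if slot[0] == key:
            return slot[1]
    return None

table[0] = (0, "a")        # integer keys hash to themselves
table[1] = (16, "b")       # collided entry, PSL 2
```

The constant work per lookup is what makes the hardware's throughput insensitive to hit rate and key distribution, as the results below show.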
Lookup Performance (90% hit rate)

[Chart: Accelerator vs. Software — "ARM_32 - R85,W106 - Uniform - Hit 90%": lookups/s (millions) vs. load factor 0.1–0.9 for Accel, Soft, and STL; labeled points at 64.32, 9.13, 5.02, and 2.60 M lookups/s.]
[Chart: Low vs. Moderate Latency — "ARM_32 - Accel - Zipf=.99 - Hit 90%": lookups/s (millions) vs. load factor for R85,W106 and R200,W400; labeled points at 64.46, 30.42, 9.13, and 8.24 M lookups/s.]
▪ Accel. performance does not vary with hit rate or key repeat frequency (scans entire PSL)
▪ Accel. performance decreases with increasing load (PSL) and memory latency
▪ Accel. performance comes from parallelism and more outstanding near memory requests
▪ Software is slower because of serialization and fewer outstanding far memory requests
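The parallelism argument in the last two bullets can be made concrete with a back-of-the-envelope Little's law model: sustainable request rate ≈ outstanding requests / latency. The 16 comes from the emulated subsystem's concurrent-request limit and 100 ns from the target memory latency; treating the software as having roughly one outstanding request is our simplifying assumption, not a measured figure.

```python
# Little's law sketch: why more outstanding near-memory requests
# translate into higher lookup throughput.

def peak_rate(outstanding, latency_ns):
    """Requests/s sustainable with `outstanding` requests in flight."""
    return outstanding / (latency_ns * 1e-9)

accel = peak_rate(16, 100)   # deep pipelining: ~160 M requests/s
soft = peak_rate(1, 100)     # serialized accesses: ~10 M requests/s
```

The model overstates absolute rates (each lookup needs several memory accesses), but the ratio captures why the accelerator's advantage grows with memory latency.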
Speedup of Uniform and Zipfian Key Distributions (90% hit rate)

[Chart: Low Latency (DRAM) — "ARM_32 - Accel/Soft - R85,W106 - Hit 90%": speedup vs. load factor 0.1–0.9 for Uniform and Zipf; labeled points at 12.80, 10.19, 3.51, and 2.90.]
▪ Zipfian has less speedup because software has more query hits in CPU cache (lower)
▪ At higher load factors, the software is disadvantaged with more cache misses (convergence)
[Chart: Moderate Latency (SCM) — "ARM_32 - Accel/Soft - R200,W400 - Hit 90%": speedup vs. load factor for Uniform and Zipf; labeled points at 9.47, 6.85, 5.54, and 4.33.]
Speedup of 10% and 90% Hit Rate (Zipf skew factor 0.99)

[Chart: Low Latency (DRAM) — "ARM_32 - Accel/Soft - R85,W106 - Zipf=.99": speedup vs. load factor 0.1–0.9 for hit 10% and hit 90%; labeled points at 6.62, 10.19, and 2.90.]
▪ Hit rate does not affect speedup at low load factors since probe sequence is short
▪ Software is challenged on longer searches (low hit, high load) with more sequential memory accesses
▪ Higher memory latency amplifies this trend
[Chart: Moderate Latency (SCM) — "ARM_32 - Accel/Soft - R200,W400 - Zipf=.99": speedup vs. load factor for hit 10% and hit 90%; labeled points at 10.91, 6.85, and 4.33.]
CPU–Accelerator Communication Overhead

▪ Cache flush and invalidate operations on shared buffers
— 2.72 us per 1K block of keys
— 2.65 ns avg. per key
▪ Command messages between CPU and accelerator
— Indicate the start and end of accelerator activity
— Contain parameters and results
— 1 us per 1K block of keys (ARM on emulator, AXI adaptor function calls)
— 1.5 us per message (x86 Linux platform, PCIe user-space driver)
▪ Together, the communication overhead for a 1K block of keys ranges between 3% (high load) and 25% (low load)
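The per-key figure above follows directly from the per-block cost; a quick check of the arithmetic (variable names are ours):

```python
# Verifying the overhead figures: 2.72 us of cache flush/invalidate
# per 1024-key block works out to ~2.66 ns per key.
BLOCK = 1024               # keys per query block

flush_us = 2.72            # cache maintenance per block
command_us = 1.0           # command messages per block (ARM on emulator)

per_key_ns = flush_us * 1000 / BLOCK       # ~2.66 ns, matching the slide
total_overhead_us = flush_us + command_us  # 3.72 us of overhead per block
```

The fixed 3.72 us per block explains the stated range: fast lookups at low load factors make the fixed overhead a larger fraction of total time, while slow lookups at high load factors amortize it.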
Conclusions

▪ Given the demand for and prevalence of K/V stores, there may be sufficient commercial interest to overcome the expense of near memory hardware
▪ Our simple, re-usable, interconnecting hardware building blocks are suitable for constructing a near memory lookup accelerator
▪ The building blocks have been composed in a synchronous high performance pipeline capable of delivering a result every few clock cycles
▪ Over a wide range of parameters, the accelerator has shown excellent speedup on both HMC/HBM memory and storage class memory
▪ Our lookup accelerator is tolerant to some memory latency, demonstrating up to an order of magnitude speedup with a pipelined architecture optimized for streaming, overlapped memory requests
Future Work

▪ Evaluate a general K/V accelerator capable of inserts, deletes, and lookups
▪ Simulate multiple K/V accelerators in a near memory subsystem
▪ Analyze the energy profile
▪ Explore region-based cache-memory synchronization
— Current cache management requires iterating over a buffer with flush and invalidate instructions
▪ Evaluate performance on the next generation of LiME
— Based on the Xilinx ZCU102 Evaluation Board with Zynq UltraScale+
— 64-bit architecture with ARM Cortex-A53 cores
— More memory: up to 4 GB of DDR4
— Potential for a deeper, configurable cache hierarchy
References

▪ LiME Open Source Release, "Logic in Memory Emulator" with benchmark applications, available at http://bitbucket.org/perma/emulator_st
▪ M. Gokhale, S. Lloyd, and C. Hajas, "Near memory data structure rearrangement," International Symposium on Memory Systems, pp. 283–290, Washington, DC, Oct 2015.
▪ M. Gokhale, S. Lloyd, and C. Macaraeg, "Hybrid memory cube performance characterization on data-centric workloads," Workshop on Irregular Applications: Architectures and Algorithms, pp. 7:1–7:8, Austin, TX, Nov 2015.
▪ S. Lloyd and M. Gokhale, "In-memory data rearrangement for irregular, data intensive computing," IEEE Computer, 48(8):18–25, Aug 2015.