LLNL-PRES-738931
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Near Memory Key/Value Lookup Acceleration
MemSys 2017
Scott Lloyd, Maya Gokhale
Center for Applied Scientific Computing
October 3, 2017
Key-Value Store Introduction

▪ Used in
— Data analytics
— Scientific workloads
▪ Implementations
— Redis
— Memcached
— Riak
▪ Applications
— URL or web object caching
— Log data analysis
— De-duplication
— Genomic sequence analysis
— Access to objects stored under SHA-1 or other cryptographic hashes
Example of Open Address Hashing

[Figure: a hash function maps keys (Fred, John, Sue, Mike, Mary, Bob, Kelly) into a hash table of N buckets (0–N) holding key/value pairs under open addressing; colliding entries occupy subsequent buckets, with probe sequence lengths (PSL) of 1, 2, and 3 shown.]
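The diagram above can be sketched in code. This is an illustrative sketch (not the paper's hardware): an open-addressed hash table with linear probing that records each entry's probe sequence length (PSL); the table size, key set, and helper names are ours.

```python
# Illustrative open addressing with linear probing; each slot records
# its probe sequence length (PSL), as in the diagram.

N = 11                     # number of buckets (illustrative)
table = [None] * N         # each slot holds (key, value, psl) or None

def insert(key, value):
    i = hash(key) % N
    psl = 1
    while table[i] is not None:
        if table[i][0] == key:               # key already present: update
            table[i] = (key, value, table[i][2])
            return
        i = (i + 1) % N                      # probe the next bucket
        psl += 1
    table[i] = (key, value, psl)

def lookup(key):
    i = hash(key) % N
    for _ in range(N):                       # walk the probe sequence
        slot = table[i]
        if slot is None:                     # empty slot ends the sequence
            return None
        if slot[0] == key:
            return slot[1]
        i = (i + 1) % N
    return None

for name, v in [("Fred", 1), ("John", 2), ("Sue", 3), ("Mike", 4)]:
    insert(name, v)
```

A lookup for a missing key (say, one never inserted) walks the probe sequence until it reaches an empty slot, which is what makes open addressing streamable: the bytes to examine are contiguous.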
Open Addressing Features: Supportive of Hardware Implementation

▪ Attractive for near memory hash tables
— Allows probe sequence bytes to be streamed sequentially
— Enables a constant rate, deterministic pipeline of lookups
▪ Beneficial for small keys and values
— No extra space needed for linked lists or pointers
▪ Avoids indirection
— Chaining approaches need extra memory accesses
▪ Table space must be reserved at the outset
— Avoids the overhead of allocating memory for each record
— No need for a memory allocator
▪ Susceptible to clustering
— Requires a high quality hash algorithm
— Our hash algorithm is adapted from SpookyHash
Insertion Algorithms

▪ Lookup pipeline is compatible with multiple open addressing insertion algorithms
▪ Linear insertion
— Collisions are resolved by searching subsequent table locations for an empty slot
— Can produce large clusters (long probe sequences) at higher load factors
▪ Robin Hood hashing
— Scans forward from the initial probe point
— The entry being inserted is swapped with a resident entry that has a shorter probe sequence length
— Reduces the maximum probe sequence length
— Our experiments use Robin Hood hashing
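The Robin Hood swap rule above can be sketched as follows. This is a hedged sketch, not the paper's implementation: the inserted entry displaces ("robs") any resident entry with a shorter probe sequence length (PSL), which bounds the maximum PSL. Integer keys are used so collisions are deterministic; the table size and names are ours.

```python
# Robin Hood insertion over linear probing (illustrative sketch).

N = 16
table = [None] * N         # slots hold (key, value, psl) or None

def rh_insert(key, value):
    i = hash(key) % N
    entry = (key, value, 1)
    for _ in range(N):
        slot = table[i]
        if slot is None:
            table[i] = entry
            return
        if slot[2] < entry[2]:               # resident is "richer": swap
            table[i], entry = entry, slot
        i = (i + 1) % N                      # keep probing with the
        entry = (entry[0], entry[1], entry[2] + 1)  # displaced entry
    raise RuntimeError("table full")

for k, v in [(0, "a"), (16, "b"), (32, "c")]:  # all collide at bucket 0
    rh_insert(k, v)
```

After these three colliding inserts the entries sit in buckets 0, 1, and 2 with PSLs 1, 2, and 3; under plain linear insertion the layout is the same here, but under skewed insertion orders Robin Hood keeps the worst-case PSL smaller.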
Near Memory Key/Value Building Blocks

[Figure: a CPU with cores 0–N connects through a memory interconnect and memory channels to the memory subsystem. Hardware building blocks — load/store units 0–N, hash units 0–N, a compare unit, an SRAM scratchpad, and a host control interface — are joined by a stream interconnect with master and slave ports, supporting lookup, traversal, and reorganization.]
Lookup Pipeline Configuration

[Figure: under control from the CPU, keys are read by LSU0-R and hashed into indices; LSU1-R fetches buckets, a compare unit matches keys against buckets, a select/split stage separates values, and LSU1-W writes results. FIFOs join the stages over the stream interconnect, and the memory interconnect fans out to four memory channels.]
Emulated Memory Subsystem Architecture

▪ Multiple memory channels
▪ Up to 16 concurrent memory requests
▪ Lookup accelerators are located in the memory subsystem
▪ Scratchpad is used to communicate parameters and results between CPU and accelerator

[Figure: CPU cores with private caches and a shared cache connect through a switch and links to the memory subsystem, where lookup accelerators (LA) and a scratchpad sit alongside four memory channels.]
LiME (Logic in Memory Emulator) Implementation

▪ Actual
— Memory: 1.6 GB/s, 180 ns
— CPU: 1–800 MB/s, 1–800 MHz
— Accel.: 2–1600 MB/s, 1–200 MHz
▪ Target
— Memory: 40 GB/s, 100 ns
— CPU: 5 GB/s, 2.5 GHz
— Accel.: 10 GB/s, 1.25 GHz

[Figure: a Zynq SoC; the Processing System (PS) hosts ARM cores with L1/L2 caches and program DRAM, while the Programmable Logic (PL) hosts the lookup accelerator and other accelerators, AXI and AXI peripheral interconnects, an AXI Performance Monitor (APM), programmable delay units, BRAM/SRAM, and a trace subsystem with a trace capture device, device monitor, and trace DRAM.]

Open Source: http://bitbucket.org/perma/emulator_st/
Example: Emulator Scaling by 20

Component                 Actual      Emulated
Memory Bandwidth          1.6 GB/s    32 GB/s
Memory Latency            180 ns      9 ns (too low)
Memory Latency w/ delay   180 ns      9 + 91 = 100 ns
CPU Frequency             128.6 MHz   2.57 GHz
CPU Bandwidth             257 MB/s    5.1 GB/s
Accelerator Frequency     62.5 MHz    1.25 GHz (base logic)
Accelerator Bandwidth     500 MB/s    10 GB/s

Delay is programmable over a wide range: 0–262 us in 0.25 ns increments
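The emulated column follows mechanically from scaling the actual platform by 20: bandwidths and frequencies are multiplied, latency is divided, and the programmable delay restores a realistic latency. A quick check of the arithmetic (dictionary names are ours):

```python
# Verifying the emulator scaling table: emulated = actual scaled by 20.
SCALE = 20

actual = {
    "mem_bw_GBps": 1.6, "mem_latency_ns": 180.0,
    "cpu_freq_MHz": 128.6, "cpu_bw_MBps": 257.0,
    "accel_freq_MHz": 62.5, "accel_bw_MBps": 500.0,
}

emulated = {
    "mem_bw_GBps": actual["mem_bw_GBps"] * SCALE,               # 32 GB/s
    "mem_latency_ns": actual["mem_latency_ns"] / SCALE,         # 9 ns (too low)
    "cpu_freq_GHz": actual["cpu_freq_MHz"] * SCALE / 1000,      # ~2.57 GHz
    "cpu_bw_GBps": actual["cpu_bw_MBps"] * SCALE / 1000,        # ~5.1 GB/s
    "accel_freq_GHz": actual["accel_freq_MHz"] * SCALE / 1000,  # 1.25 GHz
    "accel_bw_GBps": actual["accel_bw_MBps"] * SCALE / 1000,    # 10 GB/s
}

# The raw emulated latency is unrealistically low, so the programmable
# delay restores the 100 ns target: 9 + 91 = 100 ns.
emulated["mem_latency_with_delay_ns"] = emulated["mem_latency_ns"] + 91
```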
Experiment Design

▪ Key/value table is filled with a scientific data set
▪ Table entries consist of
— 64-bit keys: k-length genomic sequences, i.e. k-mers
— 32-bit values: sequence numbers (32 bits due to the 32-bit ARM on the emulator)
▪ 32 million entry table is allocated at first and filled to varying degrees
— Fits within 1 GB of memory on the emulator
▪ Explore a wide range of workload characteristics

Parameter              Values
Load factor            10%–90%
Hit ratio              10%, 50%, 90%
Key repeat frequency   Uniform, Zipf
Memory latency (ns)    85R/106W, 200R/400W
Query block size       1024 keys
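A Zipfian key repeat frequency, as used above, can be generated in many ways; one simple sketch (the paper's query generator is not specified, so the function name, seed, and sampling method are ours) weights key rank r by 1/r^s with skew s = 0.99:

```python
# Illustrative generator for a 1024-key query block with Zipfian
# key repeat frequency (skew s = 0.99, as in the experiments).
import random

def zipf_queries(keys, block_size=1024, s=0.99, seed=0):
    """Sample a query block whose key repeat frequency is Zipfian."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range((1), len(keys) + 1)]
    return rng.choices(keys, weights=weights, k=block_size)

block = zipf_queries(list(range(1000)))
```

With this skew, the highest-ranked key appears far more often than the lowest-ranked one, which is why the software baseline benefits from CPU caching on Zipfian workloads.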
Lookup Algorithms Evaluated

▪ Accel
— Near memory hardware lookup accelerator
— Collision resolution: open addressing and Robin Hood hashing
— Hash function: adapted from SpookyHash (hardware friendly)
— Lookup uses linear probing
▪ Soft
— Software version of the hardware lookup algorithm
— Collision resolution: same as Accel
— Hash function: same as Accel
— Unlike the hardware, the software algorithm terminates the probe sequence search as soon as the key has been found
▪ STL
— Hash table uses the Standard Template Library (STL) unordered map
— Collision resolution: separate chaining with linked lists
— Hash function: simple
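The termination difference between Accel and Soft can be sketched as follows; this is a hedged illustration, not the evaluated implementations: the hardware pipeline streams a fixed-length probe region at a constant rate, while the software exits early on a match or an empty slot. The table layout, `MAX_PSL` bound, and names are ours.

```python
# Illustrative contrast: full-region scan (hardware-style) vs.
# early-terminating probe (software-style).

N = 16
table = [None] * N         # slots hold (key, value) or None
MAX_PSL = 4                # bound on the probe sequence length

def lookup_accel(key):
    """Always reads MAX_PSL buckets: constant, deterministic work."""
    base = hash(key) % N
    result = None
    for j in range(MAX_PSL):                 # fixed-length streamed scan
        slot = table[(base + j) % N]
        if slot is not None and slot[0] == key:
            result = slot[1]                 # note: scan continues anyway
    return result

def lookup_soft(key):
    """Stops at the first match or first empty slot: serialized accesses."""
    base = hash(key) % N
    for j in range(MAX_PSL):
        slot = table[(base + j) % N]
        if slot is None:
            return None
        if slot[0] == key:
            return slot[1]
    return None

table[0] = (0, "a")        # integer keys hash to themselves
table[1] = (16, "b")       # collided entry, PSL 2
```

The constant work per lookup is what makes the hardware's throughput insensitive to hit rate and key distribution, as the results below show.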
Lookup Performance (90% hit rate)

[Chart: Accelerator vs. Software — "ARM_32 - R85,W106 - Uniform - Hit 90%": lookups/s (millions) vs. load factor 0.1–0.9 for Accel, Soft, and STL; labeled points at 64.32, 9.13, 5.02, and 2.60 M lookups/s.]
[Chart: Low vs. Moderate Latency — "ARM_32 - Accel - Zipf=.99 - Hit 90%": lookups/s (millions) vs. load factor for R85,W106 and R200,W400; labeled points at 64.46, 30.42, 9.13, and 8.24 M lookups/s.]
▪ Accel. performance does not vary with hit rate or key repeat frequency (scans entire PSL)
▪ Accel. performance decreases with increasing load (PSL) and memory latency
▪ Accel. performance comes from parallelism and more outstanding near memory requests
▪ Software is slower because of serialization and fewer outstanding far memory requests
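The parallelism argument in the last two bullets can be made concrete with a back-of-the-envelope Little's law model: sustainable request rate ≈ outstanding requests / latency. The 16 comes from the emulated subsystem's concurrent-request limit and 100 ns from the target memory latency; treating the software as having roughly one outstanding request is our simplifying assumption, not a measured figure.

```python
# Little's law sketch: why more outstanding near-memory requests
# translate into higher lookup throughput.

def peak_rate(outstanding, latency_ns):
    """Requests/s sustainable with `outstanding` requests in flight."""
    return outstanding / (latency_ns * 1e-9)

accel = peak_rate(16, 100)   # deep pipelining: ~160 M requests/s
soft = peak_rate(1, 100)     # serialized accesses: ~10 M requests/s
```

The model overstates absolute rates (each lookup needs several memory accesses), but the ratio captures why the accelerator's advantage grows with memory latency.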
Speedup of Uniform and Zipfian Key Distributions (90% hit rate)

[Chart: Low Latency (DRAM) — "ARM_32 - Accel/Soft - R85,W106 - Hit 90%": speedup vs. load factor 0.1–0.9 for Uniform and Zipf; labeled points at 12.80, 10.19, 3.51, and 2.90.]
▪ Zipfian has less speedup because software has more query hits in CPU cache (lower)
▪ At higher load factors, the software is disadvantaged with more cache misses (convergence)
[Chart: Moderate Latency (SCM) — "ARM_32 - Accel/Soft - R200,W400 - Hit 90%": speedup vs. load factor for Uniform and Zipf; labeled points at 9.47, 6.85, 5.54, and 4.33.]
Speedup of 10% and 90% Hit Rate (Zipf skew factor 0.99)

[Chart: Low Latency (DRAM) — "ARM_32 - Accel/Soft - R85,W106 - Zipf=.99": speedup vs. load factor 0.1–0.9 for hit 10% and hit 90%; labeled points at 6.62, 10.19, and 2.90.]
▪ Hit rate does not affect speedup at low load factors since probe sequence is short
▪ Software is challenged on longer searches (low hit, high load) with more sequential memory accesses
▪ Higher memory latency amplifies this trend
[Chart: Moderate Latency (SCM) — "ARM_32 - Accel/Soft - R200,W400 - Zipf=.99": speedup vs. load factor for hit 10% and hit 90%; labeled points at 10.91, 6.85, and 4.33.]
CPU–Accelerator Communication Overhead

▪ Cache flush and invalidate operations on shared buffers
— 2.72 us per 1K block of keys
— 2.65 ns avg. per key
▪ Command messages between CPU and accelerator
— Indicate the start and end of accelerator activity
— Contain parameters and results
— 1 us per 1K block of keys (ARM on emulator, AXI adaptor function calls)
— 1.5 us per message (x86 Linux platform, PCIe user-space driver)
▪ Together, the communication overhead for a 1K block of keys ranges between 3% (high load) and 25% (low load)
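The per-key figure above follows directly from the per-block cost; a quick check of the arithmetic (variable names are ours):

```python
# Verifying the overhead figures: 2.72 us of cache flush/invalidate
# per 1024-key block works out to ~2.66 ns per key.
BLOCK = 1024               # keys per query block

flush_us = 2.72            # cache maintenance per block
command_us = 1.0           # command messages per block (ARM on emulator)

per_key_ns = flush_us * 1000 / BLOCK       # ~2.66 ns, matching the slide
total_overhead_us = flush_us + command_us  # 3.72 us of overhead per block
```

The fixed 3.72 us per block explains the stated range: fast lookups at low load factors make the fixed overhead a larger fraction of total time, while slow lookups at high load factors amortize it.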
Conclusions

▪ Given the demand for and prevalence of K/V stores, there may be sufficient commercial interest to overcome the expense of near memory hardware
▪ Our simple, re-usable, interconnecting hardware building blocks are suitable for constructing a near memory lookup accelerator
▪ The building blocks have been composed in a synchronous high performance pipeline capable of delivering a result every few clock cycles
▪ Over a wide range of parameters, the accelerator has shown excellent speedup on both HMC/HBM memory and storage class memory
▪ Our lookup accelerator is tolerant to some memory latency, demonstrating up to an order of magnitude speedup with a pipelined architecture optimized for streaming, overlapped memory requests
Future Work

▪ Evaluate a general K/V accelerator capable of inserts, deletes, and lookups
▪ Simulate multiple K/V accelerators in a near memory subsystem
▪ Analyze the energy profile
▪ Explore region-based cache-memory synchronization
— Current cache management requires iterating over a buffer with flush and invalidate instructions
▪ Evaluate performance on the next generation of LiME
— Based on the Xilinx ZCU102 Evaluation Board with Zynq UltraScale+
— 64-bit architecture with ARM Cortex-A53 cores
— More memory: up to 4 GB of DDR4
— Potential for a deeper, configurable cache hierarchy
References

▪ LiME Open Source Release, "Logic in Memory Emulator" with benchmark applications, available at http://bitbucket.org/perma/emulator_st
▪ M. Gokhale, S. Lloyd, and C. Hajas, "Near memory data structure rearrangement," International Symposium on Memory Systems, pp. 283–290, Washington, DC, Oct 2015.
▪ M. Gokhale, S. Lloyd, and C. Macaraeg, "Hybrid memory cube performance characterization on data-centric workloads," Workshop on Irregular Applications: Architectures and Algorithms, pp. 7:1–7:8, Austin, TX, Nov 2015.
▪ S. Lloyd and M. Gokhale, "In-memory data rearrangement for irregular, data intensive computing," IEEE Computer, 48(8):18–25, Aug 2015.