Data Processing on Modern Hardware

1 / 28

Storage

2 / 28


Motivation

• so far, we only talked about CPU caches and DRAM
• both are volatile (data is lost on power failure)
• traditional persistent storage:
  ◦ disk
  ◦ tape
• modern persistent storage:
  ◦ solid state drive (flash)
  ◦ storage class memory

3 / 28


Moving Mountains of Data

[Figure: the memory/storage hierarchy, from per-core registers, L1, and L2 caches through the shared L3 cache to DRAM and HDD. Source: Western Digital estimates.]

4 / 28


Storage Class Memory (SCM)

• also called non-volatile memory (NVM), NVRAM, or persistent memory
• semiconductor-based technologies: Phase Change Memory (PCM), ReRAM, etc.
• promises:
  ◦ similar performance as DRAM
  ◦ byte-addressability
  ◦ persistence
• not yet commercially available, but Intel has announced NVM-DIMMs

5 / 28


SCM Data Access

• connected like DRAM DIMMs (PCIe is too slow)
• mapped into the address space of the application
• automatically benefits from CPU caches and can be used as a (slower but cheaper) form of RAM, without rewriting software
• how to utilize its persistence?
  ◦ since writes are cached, they are not immediately persisted
  ◦ write order is also a problem
  ◦ solution: explicit write-back instruction: void _mm_clwb(void const* p) (see the sketch below)
  ◦ hardware guarantees that once a cache line is written back from the cache, it is persisted
• SCM-aware data structures are an active research topic
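To make the write-back discipline concrete, here is a minimal sketch (assumptions: a persistent-memory-mapped region, CLWB support, 64-byte cache lines, compile with -mclwb; persist_range is an illustrative name, not from the slides) of persisting a range of bytes with _mm_clwb followed by a store fence:

    // Sketch only: persist_range is a hypothetical helper, not an API from the
    // slides. It writes back every cache line covering [p, p+len) and fences so
    // the write-backs are ordered before later stores.
    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_LINE 64   // assumption: 64-byte cache lines

    static void persist_range(const void *p, size_t len) {
        uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end  = (uintptr_t)p + len;
        for (; addr < end; addr += CACHE_LINE)
            _mm_clwb((const void *)addr);   // write back one cache line towards SCM
        _mm_sfence();                       // order the write-backs
    }

Ordering across dependent updates (e.g., persisting the data before the pointer or flag that publishes it) remains the application's responsibility.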

6 / 28


Game Changer? Non-Volatile Memory (NVRAM)

ADVANTAGES
• does not consume energy if not used
• is persistent and byte-addressable
• x-times denser than DRAM

DRAWBACKS
• has higher latency than DRAM
  ◦ read latency ~2x slower than DRAM
  ◦ write latency ~10x slower than DRAM
• number of writes is limited

Estimates provided by David Wang (Samsung)

7 / 28


Solid State Drive

• based on semiconductors (no moving parts)
• built from NAND flash
• available as SATA- or PCIe/M.2-attached devices (NVMe interface)
• e.g., Intel P3700:
  ◦ 2.8/1.9 GB/s sequential read/write bandwidth
  ◦ 450K/150K random 4KB read/write IOs/s (requires many concurrent accesses)

8 / 28


SSD Data Access

• data is stored on pages (e.g., 4KB, 8KB, 16KB); access is through the normal OS block device abstraction, i.e., read and write (see the sketch below)
• physical properties:
  ◦ pages are combined into blocks (e.g., 2MB)
  ◦ it is not possible to overwrite a page in place; the containing block must be erased first
  ◦ erasing is fairly expensive
• these properties are usually not exposed
• SSDs implement a flash translation layer (FTL) that emulates a traditional read/write interface (including in-place updates)
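As an illustration of the block-device interface (a sketch under assumptions, not from the slides; the device path /dev/nvme0n1 is only an example), the following reads a single 4KB page with pread; O_DIRECT bypasses the OS page cache and therefore requires an aligned buffer:

    /* Sketch only: reading one 4KB page from an NVMe block device through the
       normal OS interface. The FTL hides blocks and erase operations entirely. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        const size_t page = 4096;
        void *buf;
        if (posix_memalign(&buf, page, page) != 0) return 1;   // O_DIRECT needs alignment

        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);    // example device path
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = pread(fd, buf, page, 0);                   // read the page at offset 0
        if (n < 0) perror("pread");
        else printf("read %zd bytes\n", n);

        close(fd);
        free(buf);
        return 0;
    }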

9 / 28


Cost Scaling for SCM Market Adoption

[Figure: projected cost ($/GB, log scale) of DRAM, SCM, and BiCS flash from 2016 to 2025, with gaps of roughly 2x and 20x annotated between the curves. Source: Western Digital estimates. Note: technology transition cadence assumed 18 months for all technologies; ReRAM & 3DXP greenfield fabs, NAND & DRAM existing fabs. * DRAM data source: IDC ASP forecast with 45% GM assumed. ** 13 nm: assumes EUV @ 1.4x i-ArF capex cost, 2160 w/day.]

12 / 28


Managing Larger-than-RAM Data Sets

• traditional databases use a buffer pool to support very large data sets:
  ◦ store everything on fixed-size pages (e.g., 8KB)
  ◦ pages are fixed/unfixed
  ◦ transparently handles arbitrary data (relations, indexes)
• traditional systems are very good at minimizing the number of disk IOs
• however, in-memory systems are much faster than disk-based systems (as long as all data fits into RAM)

13 / 28


Canonical Buffer Management

[Figure: canonical buffer management. A hash table maps page identifiers (P1, P2, P3, P4, ...) to page frames in the buffer pool; the root page is P1, and every page access goes through this hash table indirection.]

• Effelsberg and Härder, TODS 1984
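A minimal sketch of this canonical design (illustrative names such as Frame, fix_page, and unfix_page; eviction, disk I/O, and latching are elided):

    /* Sketch only: a hash table from page IDs to frames; callers fix a page
       before use and unfix it afterwards. */
    #include <stdint.h>
    #include <stdlib.h>

    #define PAGE_SIZE  8192
    #define POOL_PAGES 1024

    typedef struct Frame {
        uint64_t page_id;
        int      fix_count;          // frame may only be evicted when this is 0
        uint8_t  data[PAGE_SIZE];
        struct Frame *next;          // hash chain
    } Frame;

    static Frame *buckets[2 * POOL_PAGES];   // hash table: page_id -> frame

    static Frame *fix_page(uint64_t page_id) {
        Frame **slot = &buckets[page_id % (2 * POOL_PAGES)];
        for (Frame *f = *slot; f; f = f->next)
            if (f->page_id == page_id) { f->fix_count++; return f; }   // hit

        /* miss: a real buffer manager would evict an unfixed victim and read
           the requested page from disk/SSD into the freed frame */
        Frame *f = calloc(1, sizeof(Frame));
        if (!f) return NULL;
        f->page_id = page_id;
        f->fix_count = 1;
        f->next = *slot;
        *slot = f;
        return f;
    }

    static void unfix_page(Frame *f) { f->fix_count--; }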

14 / 28


Managing Larger-Than-Memory Data Sets

• paging/mmap
  ◦ no control over eviction (needed for OLTP/HTAP systems)
• Anti-Caching
  ◦ each relation maintains an LRU list of its tuples
  ◦ cold tuples are moved to disk/SSD
  ◦ indexes refer to both hot and cold tuples
• Siberia
  ◦ tuple accesses are recorded in a log
  ◦ cold tuples are identified from the log (offline)
  ◦ cold tuples are moved to disk or SSD
  ◦ the in-memory index only stores hot (in-memory) tuples
  ◦ a Bloom (or adaptive range) filter covers cold tuples
• buffer managers
  ◦ have full control over eviction
  ◦ handle arbitrary data structures (tables, indexes, etc.) transparently

15 / 28


Hardware-Driven Desiderata

• fast and cheap PCIe-attached NVMe SSDs (1/10th of the price and bandwidth of DRAM)
  ⇒ store everything (relations, indexes) on pages (e.g., 16 KB)
• huge main memory capacities
  ⇒ make in-memory accesses very fast
• dozens (soon hundreds?) of cores
  ⇒ use smart synchronization techniques

16 / 28


Pointer Swizzling

[Figure: pointer swizzling. The root and inner pages reference their children through swips: swips to in-memory pages (e.g., P1, P2, P4) hold direct pointers to the corresponding page frames, while a swip to a page that is not in memory (e.g., P3) holds its page identifier.]
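A minimal sketch of a swip as a tagged 64-bit word (the tag bit, the helper names, and fix_page_by_id are assumptions for illustration; LeanStore's actual encoding differs):

    /* Sketch only: a swip is either a direct pointer to an in-memory page frame
       (swizzled) or a page identifier with a tag bit set (unswizzled). */
    #include <stdint.h>

    typedef struct Page Page;                 // a page frame in the buffer pool
    typedef struct { uint64_t word; } Swip;

    #define UNSWIZZLED_TAG ((uint64_t)1 << 63)   // assumes bit 63 is unused in pointers

    static int      swip_is_swizzled(Swip s) { return (s.word & UNSWIZZLED_TAG) == 0; }
    static Page    *swip_as_pointer(Swip s)  { return (Page *)(uintptr_t)s.word; }
    static uint64_t swip_as_page_id(Swip s)  { return s.word & ~UNSWIZZLED_TAG; }
    static Swip     swizzle_ptr(Page *frame) { return (Swip){ (uint64_t)(uintptr_t)frame }; }
    static Swip     unswizzle_id(uint64_t id){ return (Swip){ id | UNSWIZZLED_TAG }; }

    extern Page *fix_page_by_id(uint64_t page_id);   // hypothetical: loads the page from SSD

    /* Following a swip: hot pages need a single pointer dereference; cold pages
       go through the buffer manager once and are then swizzled in place. */
    static Page *resolve(Swip *s) {
        if (swip_is_swizzled(*s))
            return swip_as_pointer(*s);
        Page *frame = fix_page_by_id(swip_as_page_id(*s));
        *s = swizzle_ptr(frame);
        return frame;
    }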

17 / 28


Replacement Strategy (1)

[Figure: page states and transitions. Pages are either hot (RAM), cooling (RAM), or cold (SSD): hot pages are unswizzled into the cooling state, cooling pages are either swizzled back to hot on access or evicted to cold, and cold pages are loaded and swizzled back to hot.]

1. a random page is selected as a potential candidate for eviction
2. cooling pages are organized in a FIFO queue (e.g., 10% of all pages are in the cooling state)

18 / 28


Replacement Strategy (2)

[Figure: a B-tree of hot pages (P1, P2, P3, P5, P8, P9, ...); the randomly selected candidate P4 has a swizzled child P6.]

1. P4 is randomly selected for speculative unswizzling
2. the buffer manager iterates over all swips on the page
3. it finds the swizzled child page P6 and unswizzles it instead (see the sketch below)
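Roughly, these steps could look like the following sketch; it reuses the Swip/Page helpers from the pointer-swizzling sketch above, random_buffer_page, page_swip, page_swip_count, and move_to_cooling_fifo are hypothetical helpers, and the real implementation differs:

    /* Sketch only: a page with swizzled children is never unswizzled; one of its
       children becomes the candidate instead, so inner pages tend to stay hot. */
    extern Page *random_buffer_page(void);        // step 1: random candidate
    extern int   page_swip_count(Page *p);
    extern Swip *page_swip(Page *p, int i);
    extern void  move_to_cooling_fifo(Page *p);   // also unswizzles the parent's swip

    static void choose_cooling_candidate(void) {
        Page *candidate = random_buffer_page();
        int   restart   = 1;
        while (restart) {
            restart = 0;
            for (int i = 0; i < page_swip_count(candidate); i++) {   // step 2
                Swip *child = page_swip(candidate, i);
                if (swip_is_swizzled(*child)) {                      // step 3
                    candidate = swip_as_pointer(*child);             // descend to the child
                    restart = 1;
                    break;
                }
            }
        }
        move_to_cooling_fifo(candidate);   // candidate has no swizzled children
    }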

19 / 28


Scalable Optimistic Synchronization Primitives

• no hash table synchronization for accessing hot pages
• per-page optimistic latches (sketched below)
• epoch-based memory reclamation and eviction
• global latch for the cooling stage and the I/O manager
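A per-page optimistic latch can be sketched as a version counter that readers validate and writers increment (illustrative OptLatch type and function names, not the actual implementation):

    /* Sketch only: readers never write to the latch; they read the version, do
       their work speculatively, and re-check the version. Writers make the
       version odd while they hold the latch. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { _Atomic uint64_t version; } OptLatch;   // even = free, odd = write-locked

    /* Optimistic read: returns false if a writer interfered; the caller retries
       (or falls back to a pessimistic latch). */
    static bool optimistic_read(OptLatch *l, void (*reader)(void *), void *arg) {
        uint64_t v = atomic_load(&l->version);
        if (v & 1) return false;                  // currently write-locked
        reader(arg);                              // speculative access to the page
        return atomic_load(&l->version) == v;     // valid only if the version is unchanged
    }

    static void write_lock(OptLatch *l) {
        uint64_t expected;
        do {
            expected = atomic_load(&l->version) & ~(uint64_t)1;   // wait for an even version
        } while (!atomic_compare_exchange_weak(&l->version, &expected, expected + 1));
    }

    static void write_unlock(OptLatch *l) {
        atomic_fetch_add(&l->version, 1);         // even again, but a new version
    }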

20 / 28


Epochs

[Figure: epoch management. The epoch manager keeps a global epoch (e9) and thread-local epochs (thread 1: e9, thread 2: ∞, thread 3: e5); the pages in the cooling stage (P2, P8), reachable via the FIFO queue and the hash table, are tagged with the epoch in which they entered the cooling stage (e7 and e4, respectively).]
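A minimal sketch of such an epoch scheme (illustrative names; enter_epoch, exit_epoch, and the eviction check are assumptions about how this is typically wired up, and the buffer manager is assumed to advance global_epoch periodically):

    /* Sketch only: workers publish the global epoch while they operate on pages
       and "infinity" when idle; a page that entered the cooling stage in epoch e
       may be evicted/reclaimed only once every thread-local epoch exceeds e. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_THREADS 64
    #define EPOCH_INF   UINT64_MAX

    static _Atomic uint64_t global_epoch = 1;
    static _Atomic uint64_t thread_epoch[MAX_THREADS];   // set to EPOCH_INF via exit_epoch() at thread start

    static void enter_epoch(int tid) {    // before touching any page
        atomic_store(&thread_epoch[tid], atomic_load(&global_epoch));
    }

    static void exit_epoch(int tid) {     // when the worker goes idle
        atomic_store(&thread_epoch[tid], EPOCH_INF);
    }

    static bool safe_to_evict(uint64_t page_epoch) {
        uint64_t min = EPOCH_INF;
        for (int i = 0; i < MAX_THREADS; i++) {
            uint64_t e = atomic_load(&thread_epoch[i]);
            if (e < min) min = e;
        }
        return page_epoch < min;           // no active thread can still reference the page
    }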

21 / 28


Putting It All Together

[Figure: the overall design. The buffer pool holds hot pages (e.g., P1, P4, P7, and more hot pages) reachable from the root via swizzled swips; the cooling stage keeps cooling pages (e.g., P2, P8) in a FIFO queue indexed by a hash table; a second hash table tracks in-flight I/O operations for cold pages currently being loaded from the OS/SSD (e.g., P3).]

22 / 28


Evaluation

• 10-core Haswell EP
• Linux 4.8
• Intel NVMe P3700 SSD
• storage managers compared (no logging, no transactions, 16 KB pages):
  ◦ LeanStore: B-tree on top of the buffer manager
  ◦ in-memory B-tree: the same B-tree without buffer management
  ◦ BerkeleyDB: B-tree on top of a traditional buffer manager

23 / 28


How Large Is the Overhead? (Single-Threaded)

[Figure: single-threaded TPC-C throughput (txns/sec; y-axis from 0 to 60K) for BerkeleyDB, LeanStore, an in-memory B-tree, and Silo, grouping the systems into buffer managers and in-memory-only systems.]

24 / 28


How Well Does It Scale?

[Figure: TPC-C throughput (txns/sec; y-axis up to 600K) with 1 to 20 threads for the in-memory B-tree, LeanStore, and BerkeleyDB.]

25 / 28


What Happens If The Data Does Not Fit Into RAM?

[Figure: TPC-C throughput (txns/sec; y-axis up to 600K) over time (0-60 sec) when the data does not fit into RAM, comparing BerkeleyDB, the in-memory B-tree (with OS swapping), and LeanStore.]

26 / 28


TPC-C Ramp Up With Cold Cache

[Figure: TPC-C ramp-up with a cold cache; throughput (txns/sec; log scale from 10 to 1M) over the first 40 seconds for a PCIe SSD, a SATA SSD, and a disk.]

27 / 28


Replacement Strategy Parameter

[Figure: normalized throughput (0.7-1.1) as a function of the percentage of pages in the cooling stage (2%, 10%, 50%; log scale), for a uniform access distribution and for skewed distributions with z = 1 to 2.]

28 / 28


Conclusions

• in-memory performance is competitive with the fastest main-memory systems
• transparent management of arbitrary data structures (e.g., B-trees, hash tables, column-wise storage)
• scales well on multi-core CPUs