Using High Capacity Flash Storage In Extremely Large Database Systems
Keith Muller
Halıcıoğlu Data Science Institute, UCSD
Technology, Research and Innovation, Teradata
April 3, 2019

Page 1: Using High Capacity Flash Storage In Extremely Large ...

Using High Capacity Flash Storage In Extremely Large Database Systems

Keith Muller
Halıcıoğlu Data Science Institute, UCSD
Technology, Research and Innovation, Teradata

April 3, 2019

Page 2: Using High Capacity Flash Storage In Extremely Large ...

Agenda

• Targeted Market Review
• Some Basic Background
• Relevant System Trends
• Where we are today – approach and measurements
• NVMe 1.4 Sets and Endurance Groups overview
• Does it make sense to implement storage tiering using NVMe sets and endurance groups on large capacity flash devices?

Page 3: Using High Capacity Flash Storage In Extremely Large ...

Large Systems: Effective Size Scaling Tradeoffs

[Figure labels: Performance & Capacity Density; Various Costs (+OPEX); Availability; Storage Capacity]

• Minimize Stranded Resources
  • Inefficiencies are very significant $ at scale
• Focus is on the role of Large Capacity SSDs
• Single database system starting at around
  • 500 SSD Storage devices and up
  • 60 2-Socket Servers (2160 CPUs) and up
• Touch on various optimizations, with an emphasis on:
  • Rack space Density
    • Performance/U
    • Capacity/U
    • …
  • Stranded Performance
  • Impact of Technology Implementations
    • Minimizing Degraded Performance

Some Obvious Examples

Page 4: Using High Capacity Flash Storage In Extremely Large ...

Why Focus On High Capacity Storage? Costs!

[Figure: cost breakdown charts with segments labeled Storage and Server]

Page 5: Using High Capacity Flash Storage In Extremely Large ...

Quick Refresh: Architecture & OP

Write Performance: paced by available flash blocks/pages plus other overhead operations (wear leveling, etc.)

[Figure: Representative Flash Architecture. NVMe interface, two Host Interface Cores, and four Flash FTL Cores, each with 16 Flash devices]

[Figure: Capacity Distribution. Inherent Overhead; Base Reserved Capacity; OP % (Unmapped Reserved Capacity); Advertised Capacity]

Trend labels on the figure:
• Lower Write Performance
• Lower Write Endurance
• Higher Write Amplification
• Larger Advertised Capacity
• Lower $/Advertised Capacity
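The capacity distribution on this slide can be made concrete with a little arithmetic. The slide gives no formula, so the sketch below assumes the commonly used definition OP% = (raw flash - advertised capacity) / advertised capacity, and folds the inherent overhead and base reserved capacity into a single raw figure. The function names and the 16 TB sample drive are hypothetical, not taken from the talk.

```python
def op_percent(raw_tb: float, advertised_tb: float) -> float:
    """Over-provisioning as a percentage of advertised capacity (assumed definition)."""
    if advertised_tb <= 0 or raw_tb < advertised_tb:
        raise ValueError("need 0 < advertised_tb <= raw_tb")
    return 100.0 * (raw_tb - advertised_tb) / advertised_tb


def advertised_tb_for(raw_tb: float, op_pct: float) -> float:
    """Advertised capacity that remains when op_pct of it is held back as spare flash."""
    if raw_tb <= 0 or op_pct < 0:
        raise ValueError("need raw_tb > 0 and op_pct >= 0")
    return raw_tb / (1.0 + op_pct / 100.0)


if __name__ == "__main__":
    raw = 16.0  # hypothetical raw flash in TB, not a figure from the slides
    for op in (7.0, 28.0, 100.0):
        print(f"{op:5.1f}% OP -> {advertised_tb_for(raw, op):.2f} TB advertised")
```

Under these assumptions, lowering OP% raises the advertised capacity obtained from the same flash, which is the "Larger Advertised Capacity" and "Lower $/Advertised Capacity" side of the trade; the slide's other labels (write performance, endurance, write amplification) name what is given up.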

Page 6: Using High Capacity Flash Storage In Extremely Large ...

Large Cap Enterprise SSDs – All Looks Good… Right?

Averaged: 4K IOPS, 256 KB sequential

Page 7: Using High Capacity Flash Storage In Extremely Large ...

Large Cap Enterprise SSDs – All Looks Good… Right?

Averaged: 4K IOPS, 256 KB sequential

Page 8: Using High Capacity Flash Storage In Extremely Large ...

Looking Forward: Enterprise Tiering Estimates

[Figure: tier hierarchy with axes Latency, Performance, Capacity, and Cost/Capacity]

Tier: interface; latency; read/write metric (read/write ratio); capacity
• Processor
• L1/2 Cache: On Core CPU; ~1 ns; ~1/1 ns (1:1); Fixed, KBs
• L3 Cache: On Die; ~10 ns; ~10/10 ns (1:1); Fixed, MBs
• Main Memory: Memory; ~100 ns; ~100/~100 ns (1:1); Fixed, TBs
• NVDimm: ~500 ns; ~500/~500 ns (1:1)
• Performance NVMe: PCIe Gen 4; ~1xxxK/~5xxK IOPS (~1.5:1 to ~3:1); Each ~1-3+ TBs; 5 DWPD and greater
• Capacity NVMe: PCIe Gen 4; ~1xxxK/~1xxK IOPS (~6:1 to ~12:1); Each ~30 TBs; ~1 DWPD and less
• Capacity HDD: SAS; ~15,000,000 ns (~15 ms); ~0.12K/~0.11K IOPS (~1:1); Each >12 TBs

NVMe latency: ~200,000 ns (~200 us)
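The DWPD ratings in the tiering estimate become a daily write budget once multiplied by drive capacity. A minimal sketch, assuming DWPD means full writes of the advertised capacity per day, and taking 3 TB and 30 TB as representative points from the slide's "~1-3+ TB" and "~30 TB" ranges; the function name is hypothetical.

```python
def daily_write_budget_tb(capacity_tb: float, dwpd: float) -> float:
    """TB that may be written per day at the rated drive-writes-per-day."""
    if capacity_tb < 0 or dwpd < 0:
        raise ValueError("capacity_tb and dwpd must be non-negative")
    return capacity_tb * dwpd


# Representative points chosen from the slide's ranges (assumptions, not measurements).
performance_drive = daily_write_budget_tb(capacity_tb=3.0, dwpd=5.0)
capacity_drive = daily_write_budget_tb(capacity_tb=30.0, dwpd=1.0)

print(f"Performance NVMe: {performance_drive:.0f} TB/day")
print(f"Capacity NVMe:    {capacity_drive:.0f} TB/day")
```

Because the rating is relative to capacity, a low-DWPD drive is not necessarily a low absolute write budget: with these sample points the 30 TB drive at 1 DWPD absorbs more bytes per day than the 3 TB drive at 5 DWPD.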

Page 9: Using High Capacity Flash Storage In Extremely Large ...

The Past: HDD Performance: R/W Ratio By Disk Extent

[Chart: Normalized IO Rate (0.00 to 1.20) versus Read % (100 to 0), for % of Capacity Accessed = 10, 40, 70, 100]

Callouts (normalized IO rate):
• 100% Read, 10% Capacity: 1.0
• 100% Read, Full Capacity: 0.68
• 100% Write, 10% Capacity: 0.91
• 100% Write, Full Capacity: 0.65

Many filesystems were designed assuming this model

Single Disk, 96 KB, Random, QD=4

Page 10: Using High Capacity Flash Storage In Extremely Large ...

High Capacity HDD’s in this market segment? ...PAIN

[Figure labels: ~50 C; ~35 C]

Page 11: Using High Capacity Flash Storage In Extremely Large ...

Current NVMe SSD Performance: R/W Ratio By OP%

Single Disk, 32 KB, Random, QD=16

Page 12: Using High Capacity Flash Storage In Extremely Large ...

[Chart: axis label OP %; series labeled 10 DWPD, 3 DWPD, and 1 DWPD]

Note: OP% to DWPD varies by supplier

Page 13: Using High Capacity Flash Storage In Extremely Large ...

[Chart: two panels, each with series labeled 140% OP, 7% OP, and 10% OP]

Page 14: Using High Capacity Flash Storage In Extremely Large ...

Did Storage Tiering Mitigate Performance & Capacity Tradeoffs?

Slide from 2002 – looking at HDD’s

Page 15: Using High Capacity Flash Storage In Extremely Large ...

Basic Tiered Storage Example – Shared 24 Drive Tray

[Diagram: 24-drive tray, Drive No. 1 to 12 and 13 to 24]

Performance Tier: RAID-1 with 10 DWPD drives
Protected: 1.6 TB -> 9.6 TB; 3.2 TB -> 19.2 TB

Capacity Tier: Distributed RAID 8+2 with 7% OP
Protected: 15.36 TB -> 145.2 TB; 7.68 TB -> 74.2 TB

• Capacity Tier - Distributed RAID 8+2: Optimized for Lower $/GB (+ Capacity Density)
  • Challenged write performance when not full stripe & not flash block aligned
  • Significant degradation in performance with drive loss
• Performance Tier - RAID 1: Optimized for Write traffic
  • Low degradation in write performance with drive loss
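The "Protected" figures on this slide can be approximated with simple ratios. The sketch below assumes the 24-drive tray is split evenly, 12 drives per tier, and uses idealized efficiencies: one half for RAID-1 and 8/10 for RAID 8+2. The function names are hypothetical, and spare space or metadata reserved by a distributed RAID implementation is not modeled.

```python
def raid1_protected_tb(drives: int, drive_tb: float) -> float:
    """Mirrored pairs: half of the raw capacity is usable."""
    if drives <= 0 or drives % 2:
        raise ValueError("RAID-1 needs a positive, even number of drives")
    return drives * drive_tb / 2


def parity_protected_tb(drives: int, drive_tb: float,
                        data: int = 8, parity: int = 2) -> float:
    """Idealized data/(data+parity) efficiency; ignores spares and metadata."""
    if drives <= 0 or data <= 0 or parity < 0:
        raise ValueError("invalid RAID geometry")
    return drives * drive_tb * data / (data + parity)


DRIVES_PER_TIER = 12  # assumption: 24-drive tray split evenly between the tiers

for size in (1.6, 3.2):
    print(f"RAID-1   {size:5.2f} TB drives -> {raid1_protected_tb(DRIVES_PER_TIER, size):6.1f} TB")
for size in (15.36, 7.68):
    print(f"RAID 8+2 {size:5.2f} TB drives -> {parity_protected_tb(DRIVES_PER_TIER, size):6.1f} TB")
```

With 12 drives the RAID-1 results agree with the slide (9.6 TB and 19.2 TB). The idealized 8+2 results come out near 147.5 TB and 73.7 TB, close to but not equal to the slide's 145.2 TB and 74.2 TB, so the 8/10 factor should be read as an approximation of whatever the actual implementation reserves.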

Page 16: Using High Capacity Flash Storage In Extremely Large ...

R/W IOPS Ratios Under RAID

96 KB, Random, 32 KB Aligned, QD = ~8/device

Page 17: Using High Capacity Flash Storage In Extremely Large ...

Page 18: Using High Capacity Flash Storage In Extremely Large ...

Aside: Mitigating Degraded Write Performance Impact

• Stripe and flash block aligned I/O helps performance
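One way to reason about stripe and flash alignment is to classify each write before it is issued. The sketch below is illustrative only: the 32 KiB chunk size and 32 KiB flash alignment unit are assumptions (the latter echoes the "32 KB Aligned" test condition on the RAID slide), the 8 data drives come from the 8+2 layout, and the function names are hypothetical.

```python
KIB = 1024
CHUNK = 32 * KIB                    # assumed per-drive chunk size
DATA_DRIVES = 8                     # data members of an 8+2 stripe
FULL_STRIPE = CHUNK * DATA_DRIVES   # 256 KiB of data per full stripe
FLASH_UNIT = 32 * KIB               # assumed flash alignment unit


def is_aligned(offset: int, length: int, unit: int) -> bool:
    """True when the I/O starts on and spans whole multiples of unit bytes."""
    if offset < 0 or length <= 0 or unit <= 0:
        raise ValueError("offset must be >= 0; length and unit must be > 0")
    return offset % unit == 0 and length % unit == 0


def classify_write(offset: int, length: int) -> str:
    """Label a write by the strongest alignment it satisfies."""
    if is_aligned(offset, length, FULL_STRIPE):
        return "full stripe"
    if is_aligned(offset, length, FLASH_UNIT):
        return "partial stripe, flash aligned"
    return "unaligned"


print(classify_write(0, 256 * KIB))        # full stripe
print(classify_write(32 * KIB, 96 * KIB))  # partial stripe, flash aligned
print(classify_write(4 * KIB, 96 * KIB))   # unaligned
```

On parity RAID, a partial-stripe write generally has to read old data or parity before the new parity can be written, which is why full-stripe, aligned writes are the favorable case.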

Page 19: Using High Capacity Flash Storage In Extremely Large ...

Motivation: Multi-Tenant Noisy Neighbor Example

[Figure: Representative Flash Architecture (NVMe interface, two Host Interface Cores, four Flash FTL Cores each with 16 Flash devices) shared by User 1, User 2, User 3, and User 4]

• Collisions on Flash Die, uneven distributions of work, Read/Write mixes, etc.

Reference: Solving Latency Challenges with NVM Express SSD’s at Scale; Petersen & Huffman; FMS 2017

Page 20: Using High Capacity Flash Storage In Extremely Large ...

NVMe 1.4 Sets and Endurance Groups

• Defined in the NVMe 1.4 spec (2H 2019)
• NVMe Set
  • NVM that is physically and logically isolated from NVM in other NVM Sets
  • Dedicated NAND resources, channels, FTL, etc. (device architecture dependent)
  • Workload isolation: one set has no impact on other sets (hopefully)
  • Carries out its own writes and background operations independently
  • Drive appears like several smaller drives
• Endurance group: wear level management
  • Set independent levels of OP and usable capacity (hopefully)
  • May contain one or more NVMe sets

[Figure: Representative Flash Architecture divided into NVMe Set 1 / Endurance group A, NVMe Set 2 / Endurance group B, NVMe Set 3 / Endurance group C, and NVMe Set 4 / Endurance group D]

Example: Four uniform NVMe sets & groups

Page 21: Using High Capacity Flash Storage In Extremely Large ...

Can We Do Storage Tiering Using NVMe Sets and Groups?

• Write Optimized Tier
  • Example: OP in group is: 5 - 10 DWPD
• Capacity Optimized Tier
  • Example: OP in group is: 1 – 3 DWPD
• How many sets/groups?
  • Measurements and future advisories suggest the NVMe interface may be over-subscribed
  • Trade offs (architecture dependent):
    • Endurance for workloads
    • Write IOPS/capacity
    • Tier capacity ratios
• Will the implementation allow partial failures to be isolated within an endurance group?
  • Example: loss of a flash die or FTL core

[Figure: Representative Flash Architecture divided into NVMe Set 1 / Endurance group A, NVMe Set 2 / Endurance group B, NVMe Set 3 / Endurance group C, and NVMe Set 4 / Endurance group D]
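If each endurance group can carry its own OP level, the usable capacity per group follows from its share of raw flash. The sketch below is a toy model, not the NVMe data structures: the `EnduranceGroup` class, the 16 TB raw drive, the uniform four-way split, and the 100% OP figure for the write optimized groups are all hypothetical (7% OP is borrowed from the capacity tier example earlier in the deck), and OP is assumed to be measured relative to usable capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnduranceGroup:
    name: str
    raw_tb: float       # raw flash assigned to the group (hypothetical)
    op_percent: float   # over-provisioning chosen for the group

    @property
    def usable_tb(self) -> float:
        """Usable capacity, assuming OP% is relative to usable capacity."""
        return self.raw_tb / (1.0 + self.op_percent / 100.0)


def uniform_groups(drive_raw_tb: float, op_by_group: dict) -> list:
    """Split the drive's raw flash evenly across the named groups."""
    share = drive_raw_tb / len(op_by_group)
    return [EnduranceGroup(name, share, op) for name, op in op_by_group.items()]


# Groups A and B write optimized, C and D capacity optimized (illustrative values).
groups = uniform_groups(16.0, {"A": 100.0, "B": 100.0, "C": 7.0, "D": 7.0})
for g in groups:
    print(f"group {g.name}: {g.op_percent:5.1f}% OP -> {g.usable_tb:.2f} TB usable")
```

The tier capacity ratio listed among the trade-offs falls out directly: in this toy split the write optimized groups expose about half the usable capacity of the capacity optimized groups for the same raw flash.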

Page 22: Using High Capacity Flash Storage In Extremely Large ...

[Diagram: 24-drive tray (Drive No. 1 to 12 and 13 to 24); each drive divided into four set/group pairs labeled 1-A, 2-B, 3-C, and 4-D]

• Example: 24 Drives X 4 Sets/drive = 48 Performance + 48 Capacity; Single drive to stock
• Opportunity for lower set/group loss impact
  • Whole drive loss impacts two storage tiers but with likely lower impact each
  • Less capacity to rebuild
• Opportunity for more efficient utilization of NVMe interface on many workloads if the NVMe interface QOS allows stranded bandwidth to float among NVMe sets
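The set arithmetic and the drive-loss argument on this slide can be checked with a short count. The sketch assumes that two of the four sets on every drive serve the performance tier and the other two serve the capacity tier, which is one assignment consistent with "48 Performance + 48 Capacity"; the set-to-tier mapping and function names are hypothetical.

```python
DRIVES = 24
SETS_PER_DRIVE = 4
# Assumption: sets 1-2 on every drive are performance, sets 3-4 are capacity.
TIER_OF_SET = {1: "performance", 2: "performance", 3: "capacity", 4: "capacity"}


def tier_counts(drives: int = DRIVES) -> dict:
    """Number of NVMe sets contributed to each tier by the given number of drives."""
    counts = {"performance": 0, "capacity": 0}
    for _drive in range(drives):
        for set_id in range(1, SETS_PER_DRIVE + 1):
            counts[TIER_OF_SET[set_id]] += 1
    return counts


def drive_loss_fraction(drives: int = DRIVES) -> dict:
    """Fraction of each tier's sets lost when one whole drive fails."""
    totals = tier_counts(drives)
    per_drive = tier_counts(1)
    return {tier: per_drive[tier] / totals[tier] for tier in totals}


print(tier_counts())          # sets per tier across the tray
print(drive_loss_fraction())  # share of each tier lost with one drive
```

Under this assignment a whole-drive loss removes 2 of 48 sets (about 4.2%) from each tier. If instead each tier owned 12 whole drives, as the earlier tray example appears to, one drive loss would remove 1 of 12 drives (about 8.3%) from a single tier.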

Page 23: Using High Capacity Flash Storage In Extremely Large ...

Questions? Suggestions? Things I got wrong or missed?

Thank You!