1
Using High Capacity Flash Storage In Extremely Large Database Systems
Keith Muller
Halıcıoğlu Data Science Institute, UCSD
Technology, Research and Innovation, Teradata
April 3, 2019
2
Agenda
• Targeted Market Review
• Some Basic Background
• Relevant System Trends
• Where we are today – approach and measurements
• NVMe 1.4 Sets and Endurance Groups overview
• Does it make sense to implement storage tiering using NVMe sets and endurance groups on large capacity flash devices?
3
Large Systems: Effective Size Scaling Tradeoffs
[Figure labels: Performance & Capacity Density; Various Costs (+OPEX); Availability; Storage Capacity]
• Minimize Stranded Resources
  • Inefficiencies are very significant $ at scale
• Focus is on the role of Large Capacity SSDs
• Single database system starting at around
  • 500 SSD storage devices and up
  • 60 2-socket servers (2160 CPUs) and up
• Touch on various optimizations, with an emphasis on:
  • Rack space density
    • Performance/U
    • Capacity/U
    • …
  • Stranded performance
  • Impact of technology implementations
  • Minimizing degraded performance
Some Obvious Examples
4
Why Focus On High Capacity Storage? Costs!
[Figure: cost comparison with segments labeled Storage and Server]
5
Quick Refresh: Architecture & OP
Write Performance: paced by available flash blocks/pages plus other overhead operations (wear leveling, etc.)
[Figure: Representative Flash Architecture. Two host interface cores on the NVMe side feed four flash FTL cores, each shown with 16 flash devices.]
[Figure: Capacity Distribution. Segments: Inherent Overhead; Base Reserved Capacity; OP % (Unmapped Reserved Capacity); Advertised Capacity.]
Tradeoff arrows (all consistent with reducing OP %):
• Lower Write Performance
• Lower Write Endurance
• Higher Write Amplification
• Larger Advertised Capacity
• Lower $/Advertised Capacity
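The capacity side of the OP tradeoff can be sketched with a little arithmetic. This is a minimal illustration, assuming the common definition OP % = (raw − advertised) / advertised; the 8 TB raw figure and the function names are hypothetical, and inherent overhead and base reserved capacity are ignored for simplicity.

```python
def advertised_capacity_tb(raw_tb: float, op_percent: float) -> float:
    """Advertised capacity left after reserving op_percent over-provisioning.

    Assumes OP % = (raw - advertised) / advertised * 100.
    """
    if op_percent < 0:
        raise ValueError("op_percent must be non-negative")
    return raw_tb / (1.0 + op_percent / 100.0)


def implied_op_percent(raw_tb: float, advertised_tb: float) -> float:
    """Over-provisioning percentage implied by raw and advertised capacity."""
    if advertised_tb <= 0 or advertised_tb > raw_tb:
        raise ValueError("advertised capacity must be in (0, raw]")
    return (raw_tb - advertised_tb) / advertised_tb * 100.0


if __name__ == "__main__":
    raw = 8.0  # hypothetical raw NAND capacity in TB, not a figure from the slides
    for op in (7.0, 28.0, 100.0):
        adv = advertised_capacity_tb(raw, op)
        print(f"OP {op:5.1f}% -> advertised {adv:.2f} TB")
```

Lowering OP on the same raw NAND raises the advertised capacity (and so lowers $/advertised capacity), which is the direction the arrows on this slide point; the write performance, endurance, and write amplification costs are not modeled here.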
6
Large Cap Enterprise SSD’s – All Looks Good… Right?
Averaged: 4K IOPS, 256KB sequential
8
Looking Forward: Enterprise Tiering Estimates
[Figure: tiering pyramid; axis labels: Latency, Performance, Capacity, Cost/Capacity]
Tier (location): latency; read/write metric and ratio; capacity
• Processor / L1/2 Cache (on core CPU): ~1 ns; ~1/1 ns, 1:1; fixed, KBs
• L3 Cache (on die): ~10 ns; ~10/10 ns, 1:1; fixed, MBs
• Main Memory (memory): ~100 ns; ~100/~100 ns, 1:1; fixed, TBs
• NVDIMM: ~500 ns; ~500/~500 ns, 1:1
• Performance NVMe (PCIe Gen 4): ~1xxxK/~5xxK IOPS, ~1.5:1 to ~3:1; each ~1-3+ TB, 5 DWPD and greater
• Capacity NVMe (PCIe Gen 4): ~1xxxK/~1xxK IOPS, ~6:1 to ~12:1; each ~30 TB, ~1 DWPD and less
• NVMe latency: ~200,000 ns (~200 us)
• Capacity HDD (SAS): ~15,000,000 ns (~15 ms); ~0.12K/~0.11K IOPS, ~1:1; each >12 TB
9
The Past: HDD Performance: R/W Ratio By Disk Extent
[Figure: Normalized IO Rate (0.00 to 1.20) plotted against Read % (100 down to 0) and % of Capacity Accessed (10, 40, 70, 100). Single disk, 96 KB, random, QD=4.]
Callouts (normalized IO rate):
• 100% Read, Full Capacity: 0.68
• 100% Read, 10% Capacity: 1.0
• 100% Write, Full Capacity: 0.65
• 100% Write, 10% Capacity: 0.91
Many filesystems were designed assuming this model
10
High Capacity HDD’s in this market segment? ...PAIN
[Figure labels: ~50 C, ~35 C]
11
Current NVME SSD Performance: R/W Ratio By OP%
Single Disk, 32 KB, Random, QD=16
12
[Figure: OP % chart with series labeled 10 DWPD, 3 DWPD, and 1 DWPD]
Note: The mapping from OP % to DWPD varies by supplier
13
[Figure: two panels, each with labels 140% OP, 7% OP, and 10% OP]
14
Did Storage Tiering Mitigate Performance & Capacity Tradeoffs?
Slide from 2002 – looking at HDD’s
15
Basic Tiered Storage Example – Shared 24 Drive Tray
[Figure: shared tray of 24 drives, numbered 1-12 and 13-24]
• Performance Tier: RAID 1 with 10 DWPD drives
  • Protected: 1.6 TB -> 9.6 TB; 3.2 TB -> 19.2 TB
  • Optimized for write traffic
  • Low degradation in write performance with drive loss
• Capacity Tier: Distributed RAID 8+2 with 7% OP
  • Protected: 15.36 TB -> 145.2 TB; 7.68 TB -> 74.2 TB
  • Optimized for lower $/GB (+ capacity density)
  • Challenged write performance when not full stripe and not flash block aligned
  • Significant degradation in performance with drive loss
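One way to see why the capacity tier struggles with writes that are not full stripe is to count device I/Os per host write. The sketch below is an idealized model, assuming the distributed RAID 8+2 behaves like a dual-parity layout that uses read-modify-write for a single-chunk update; real implementations (caching, reconstruct-writes, flash block effects) may behave differently, and all function names are hypothetical.

```python
def raid1_ios_per_chunk_write(mirrors: int = 2) -> float:
    """Device I/Os to write one chunk to a mirror set: one write per copy."""
    return float(mirrors)


def parity_partial_write_ios(parity: int = 2) -> float:
    """Device I/Os for a one-chunk read-modify-write update.

    Reads the old data chunk and each old parity chunk, then writes the
    new data chunk and each new parity chunk.
    """
    reads = 1 + parity
    writes = 1 + parity
    return float(reads + writes)


def parity_full_stripe_ios_per_chunk(data: int = 8, parity: int = 2) -> float:
    """Device I/Os per data chunk when an entire stripe is written at once."""
    return (data + parity) / data


if __name__ == "__main__":
    print("RAID 1, per chunk:            ", raid1_ios_per_chunk_write())
    print("8+2 partial stripe, per chunk:", parity_partial_write_ios())
    print("8+2 full stripe, per chunk:   ", parity_full_stripe_ios_per_chunk())
```

Under these assumptions a small update costs the 8+2 layout six device I/Os against two for RAID 1, while a full-stripe write costs 1.25 device writes per data chunk.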
16
R/W IOPS Ratios Under RAID
96 KB, Random, 32 KB Aligned, QD = ~8/device
17
18
Aside: Mitigating Degraded Write Performance Impact
• Stripe and flash block aligned I/O helps performance
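As a rough illustration of what "stripe and flash block aligned" means for a single write, the check below tests whether an I/O starts and ends on both full-stripe and flash-unit boundaries. The chunk size, the 8 data chunks per stripe, and the flash unit size are hypothetical parameters chosen for the example, not values taken from the measurements.

```python
def is_aligned_io(offset: int, length: int, chunk_bytes: int,
                  data_chunks: int, flash_unit_bytes: int) -> bool:
    """Return True if the I/O covers whole stripes and whole flash units.

    offset and length are in bytes. A full stripe holds
    chunk_bytes * data_chunks bytes of user data (parity excluded).
    """
    if length <= 0 or offset < 0:
        return False
    stripe_bytes = chunk_bytes * data_chunks
    for unit in (stripe_bytes, flash_unit_bytes):
        # Both the start and the size must be multiples of the unit.
        if offset % unit != 0 or length % unit != 0:
            return False
    return True


if __name__ == "__main__":
    KIB = 1024
    chunk = 32 * KIB        # hypothetical chunk size per data drive
    flash_unit = 32 * KIB   # hypothetical flash alignment unit
    print(is_aligned_io(512 * KIB, 256 * KIB, chunk, 8, flash_unit))
    print(is_aligned_io(96 * KIB, 96 * KIB, chunk, 8, flash_unit))
```

With these hypothetical sizes a 256 KiB write at a 512 KiB offset is a full, aligned stripe, while a 96 KiB write at a 96 KiB offset is not and would take the slower partial-stripe path.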
19
Motivation: Multi-Tenant Noisy Neighbor Example
[Figure: Representative Flash Architecture. Two host interface cores on the NVMe side feed four flash FTL cores, each shown with 16 flash devices; User 1, User 2, User 3, and User 4 share the device.]
• Collisions on Flash Die, uneven distributions of work, Read/Write mixes, etc.
Reference: Solving Latency Challenges with NVM Express SSD’s at Scale; Petersen & Huffman; FMS 2017
20
NVMe 1.4 Sets and Endurance Groups
• Defined in the NVMe 1.4 spec (2H 2019)
• NVMe Set
  • NVM that is physically and logically isolated from NVM in other NVM Sets
  • Dedicated NAND resources, channels, FTL, etc. (device architecture dependent)
  • Workload isolation: one set has no impact on other sets (hopefully)
  • Carries out its own writes and background operations independently
  • Drive appears like several smaller drives
• Endurance group: wear level management
  • Set independent levels of OP and usable capacity (hopefully)
  • May contain one or more NVMe sets
[Figure: Representative Flash Architecture (two host interface cores, four flash FTL cores, each shown with 16 flash devices) partitioned into NVMe Set 1 / Endurance group A, NVMe Set 2 / Endurance group B, NVMe Set 3 / Endurance group C, and NVMe Set 4 / Endurance group D.]
Example: Four uniform NVMe sets & groups
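The relationship between sets and endurance groups on this slide can be modeled with two small record types. This is only an illustrative data model, not an NVMe driver or spec API: the class names, fields, and the 1.92 TB set capacity are hypothetical, and the validation encodes just the rule that a group contains one or more sets and a set belongs to one group.

```python
from dataclasses import dataclass, field


@dataclass
class NvmSet:
    set_id: int
    capacity_tb: float  # hypothetical usable capacity of this set


@dataclass
class EnduranceGroup:
    group_id: str
    op_percent: float  # hypothetical per-group over-provisioning level
    sets: list[NvmSet] = field(default_factory=list)


def validate(groups: list[EnduranceGroup]) -> None:
    """Check each group has at least one set and no set is shared."""
    seen: set[int] = set()
    for group in groups:
        if not group.sets:
            raise ValueError(f"group {group.group_id} has no sets")
        for nvm_set in group.sets:
            if nvm_set.set_id in seen:
                raise ValueError(f"set {nvm_set.set_id} is in more than one group")
            seen.add(nvm_set.set_id)


def four_uniform_groups(set_capacity_tb: float, op_percent: float) -> list[EnduranceGroup]:
    """Build the slide's example: sets 1-4 paired with groups A-D."""
    return [EnduranceGroup(gid, op_percent, [NvmSet(i, set_capacity_tb)])
            for i, gid in enumerate("ABCD", start=1)]


if __name__ == "__main__":
    layout = four_uniform_groups(set_capacity_tb=1.92, op_percent=7.0)
    validate(layout)
    for g in layout:
        print(g.group_id, [s.set_id for s in g.sets], g.op_percent)
```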
21
Can We Do Storage Tiering Using NVMe Sets and Groups?
• Write Optimized Tier
  • Example: OP in group is: 5-10 DWPD
• Capacity Optimized Tier
  • Example: OP in group is: 1-3 DWPD
• How many sets/groups?
  • Measurements and future advisories suggest the NVMe interface may be over-subscribed
• Trade offs (architecture dependent):
  • Endurance for workloads
  • Write IOPS/capacity
  • Tier capacity ratios
• Will the implementation allow partial failures to be isolated within an endurance group?
  • Example: loss of a flash die or FTL core
[Figure: Representative Flash Architecture partitioned into NVMe Set 1 / Endurance group A, NVMe Set 2 / Endurance group B, NVMe Set 3 / Endurance group C, and NVMe Set 4 / Endurance group D.]
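The "endurance for workloads" tradeoff can be made concrete by converting a DWPD rating into a write budget. The sketch below uses the usual definition (drive writes per day times capacity times days); the 5-year rating period is an assumption, and pairing 1.6 TB with 10 DWPD and 15.36 TB with 1 DWPD borrows drive sizes from the earlier tray example purely for illustration.

```python
def daily_write_budget_tb(capacity_tb: float, dwpd: float) -> float:
    """Host writes per day (TB) allowed by a DWPD rating."""
    if capacity_tb <= 0 or dwpd <= 0:
        raise ValueError("capacity_tb and dwpd must be positive")
    return dwpd * capacity_tb


def total_writes_tb(capacity_tb: float, dwpd: float, years: float = 5.0) -> float:
    """Total host writes (TB) allowed over the rating period.

    Uses DWPD x capacity x 365 days x years; the 5-year default is an
    assumption, not a value from the slides.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    return daily_write_budget_tb(capacity_tb, dwpd) * 365.0 * years


if __name__ == "__main__":
    tiers = {
        "write optimized (1.6 TB, 10 DWPD)": (1.6, 10.0),
        "capacity optimized (15.36 TB, 1 DWPD)": (15.36, 1.0),
    }
    for name, (cap, dwpd) in tiers.items():
        print(f"{name}: {daily_write_budget_tb(cap, dwpd):.2f} TB/day, "
              f"{total_writes_tb(cap, dwpd):.0f} TB over 5 years")
```

Under these assumptions the two tiers end up with a similar lifetime write volume, so the difference between them is mainly in writes per unit of capacity, which is one way to read the "Write IOPS/capacity" and "Tier capacity ratios" tradeoffs.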
22
[Figure: tray of 24 drives, numbered 1-12 and 13-24, with each drive divided into four set/group pairs labeled 1-A, 2-B, 3-C, and 4-D]
• Example: 24 drives x 4 sets/drive = 96 sets = 48 Performance + 48 Capacity; only a single drive type to stock
• Opportunity for lower set/group loss impact
  • Whole drive loss impacts two storage tiers, but likely with lower impact on each
  • Less capacity to rebuild
• Opportunity for more efficient utilization of the NVMe interface on many workloads, if the NVMe interface QoS allows stranded bandwidth to float among NVMe sets
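The drive-loss argument can be checked with simple fractions. This sketch compares the earlier shared tray (12 whole drives per tier) with the set-based layout, assuming every drive contributes two sets to each tier and all sets in a tier have equal capacity; both assumptions and the function names are illustrative rather than taken from the slides.

```python
from fractions import Fraction


def whole_drive_tier_loss(drives_in_tier: int) -> Fraction:
    """Fraction of a tier lost when one of its whole drives fails."""
    return Fraction(1, drives_in_tier)


def set_based_tier_loss(drives: int, sets_per_tier_per_drive: int) -> Fraction:
    """Fraction of each tier lost when one drive fails and every drive
    contributes sets_per_tier_per_drive equal-sized sets to that tier."""
    sets_in_tier = drives * sets_per_tier_per_drive
    return Fraction(sets_per_tier_per_drive, sets_in_tier)


if __name__ == "__main__":
    drives, sets_per_drive = 24, 4
    total_sets = drives * sets_per_drive
    print("total sets:", total_sets, "=", total_sets // 2, "+", total_sets // 2)
    print("whole-drive tiers, one tier loses:", whole_drive_tier_loss(12))
    print("set-based tiers, each tier loses: ", set_based_tier_loss(drives, 2))
```

With whole drives, one failure removes 1/12 of a single tier; with sets, the same failure removes 1/24 of each of the two tiers, which matches the bullet about a lower impact per tier.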
23
Questions? Suggestions? Things I got wrong or missed?
Thank You!