Transcript of DDN and Flash presentation · 2020-01-15 · © 2016 DataDirect Networks, Inc.
ddn.com © 2016 DataDirect Networks, Inc. * Other names and brands may be claimed as the property of others. Any statements or representations around future events are subject to change.
Slide 1: DDN and Flash: GRIDScaler, Flashscale, Infinite Memory Engine
T. Cecchi, September 21st 2016, HPC Advisory Council
Slide 2: DDN End-to-End Data Lifecycle Management
• 1000X APP & FILE SYSTEM SPEED UP
• PREDICTIVE BURST BUFFER
• FASTEST, LOW LATENCY & BEST COST SSD • 600GB/S & 60M IOPS, 5+PB/RACK
• FASTEST, SCALABLE OBJECT STORAGE • GLOBAL MULTI-SITE SECURE TIERING
• OPEN SOURCE & HYBRID CLOUD • BRIDGE S3, SWIFT, GPFS™ & LUSTRE®
BURST & COMPUTE
SSD, DISK & FILE SYSTEM
LIVE ARCHIVE, SHARING & CLOUD
FlashScale
• MOST VERSATILE EMBEDDED FILE & BLOCK STORAGE • EDR IB, 100 GbE, Omnipath, PCI-E
Slide 3: Using Flash
1) Conventional(ish) caching: Lustre examples
2) All-flash file systems: FlashScale
3) Burst buffers: Infinite Memory Engine
Slide 4: Why SSD Cache? Don't blow the power/space/management budget on spindles
[Chart: Write MB/s per PB (log scale) and 4k read IOPS per PB for Seagate Makara (4 TB), HGST Ultrastar (1.8 TB, 10k rpm), Seagate Tardis (4 TB, 10k rpm), SanDisk CloudSpeed (SATA SSD), Toshiba PX04 (SAS SSD), Intel DC P3700 (NVMe SSD)]
SSDs are still pricey, so: ▶ Optimise data for SSDs ▶ Optimise SSDs for data
Slide 5: Lustre and Flash
Slide 6: DDN | ES14K, Designed for Flash and NVMe
Industry-leading performance in 4U: § up to 40 GB/s throughput § up to 6 million IOPS to cache § up to 3.5 million IOPS to storage § 1 PB+ capacity (with 16 TB SSDs) § 100 µs latency
Configuration options: § 72 SAS SSDs or 48 NVMe § SSDs or HDDs only § HDDs with SSD caching § SSDs with HDD tier
Connectivity: § FDR/EDR § Omni-Path § 40/100 GbE
Slide 7: SSD Options for Lustre
[Diagram: OSS servers with three SSD options]
1. SFX: block-level read cache; instant commit, DSS, fadvise(); millions of read IOPS
2. L2RC: OSS-level read cache; heuristics with FileHeat; millions of read IOPS
3. All-SSD Lustre: Lustre HSM for data tiering to an HDD namespace; generic Lustre I/O; millions of read and write IOPS
Slide 8: 1. Rack Performance: Lustre
[Charts: IOR file-per-process throughput (GB/s), write and read; 4k random read IOPS]
350 GB/s read and write (IOR); 4 million IOPS; up to 4 PB flash capacity
Slide 9: 2. SFX & ReACT: Accelerating Reads, Integrated with Lustre DSS
[Diagram: OSS with DRAM cache, SFX tier (DSS / SFX API) and HDD tier; labels: Large Reads, Small Reads, Small Rereads, Cache Warm]
Slide 10: 2. 4 KiB Random I/O
[Chart: 4 KiB random IOPS, read vs. write, for four cases: no SFX, SSD metadata; no SFX, metadata mix; with SFX, metadata mix (first-time I/O); SFX read hit (second-time I/O)]
Read IOPS: 15,587; 14,184; 17,070; 174,344. Write IOPS: 14,486; 14,984; 13,008; 13,001.
Slide 11: 3. Lustre L2RC and File Heat
[Diagram: OSS servers maintain a heap of object heat values and prefetch accordingly; higher heat means higher access frequency, lower heat means lower access frequency]
OSS-based read caching:
• Uses SSDs (or SFA SSD pools) on the OSS as a read cache
• Automatic prefetch management based on file heat
• File heat is a relative (tunable) attribute that reflects file access frequency
• Indexes are kept in memory (worst case: 10 GB of memory per 1 TB of SSD)
• Efficient space management for the SSD cache space (4 KB to 1 MB extents)
• Full support for ladvise in Lustre
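The heat-driven caching described above can be sketched in a few lines. This is a toy model, not DDN's or Lustre's implementation: the decay-and-blend update rule and the `replacement_pct` default are assumptions that mirror the `heat_period_second` / `heat_replacement_percentage` tunables shown on the next slide.

```python
import heapq

class ObjectHeat:
    """Toy model of file heat: per-period exponential smoothing (a sketch)."""

    def __init__(self, replacement_pct=25):
        self.replacement_pct = replacement_pct
        self.heat = {}      # object id -> smoothed heat value
        self.current = {}   # accesses observed in the current period

    def record_access(self, obj_id, nbytes=1):
        self.current[obj_id] = self.current.get(obj_id, 0) + nbytes

    def end_period(self):
        # At each period boundary, old heat decays and recent activity
        # is blended in (the replacement percentage controls the blend).
        w = self.replacement_pct / 100.0
        for obj_id in set(self.heat) | set(self.current):
            old = self.heat.get(obj_id, 0.0)
            new = self.current.get(obj_id, 0.0)
            self.heat[obj_id] = old * (1 - w) + new * w
        self.current.clear()

    def hottest(self, n):
        # Heap of object heat values, as on the OSS: prefetch candidates.
        return heapq.nlargest(n, self.heat.items(), key=lambda kv: kv[1])
```

Objects accessed often in recent periods float to the top of the heap and become prefetch candidates; untouched objects decay toward zero and age out of the cache.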
Slide 12: 3. File Heat Utility
▶ Tune the file-heat parameters via proc interfaces:
/proc/fs/lustre/heat_period_second
/proc/fs/lustre/heat_replacement_percentage
▶ Utility to get file heat values: lfs heat_get <file>
▶ Utility to switch heat accounting on/off:
lfs heat_set [--clear|-c] [--off|-o] [--on|-O] <file>
▶ Heaps on OSTs can be used to dump lists of FIDs sorted by heat:
[root@server9-Centos6-vm01 cache]# cat /proc/fs/lustre/obdfilter/lustre-OST0000/heat_top
[0x200000400:0x1:0x0] [0x100000000:0x2:0x0]: 0 740 0 775946240
[0x200000400:0x9:0x0] [0x100000000:0x6:0x0]: 0 300 0 314572800
[0x200000400:0x8:0x0] [0x100000000:0x5:0x0]: 0 199 0 208666624
[0x200000400:0x7:0x0] [0x100000000:0x4:0x0]: 0 100 0 104857600
[0x200000400:0x6:0x0] [0x100000000:0x3:0x0]: 0 100 0 104857600
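For scripting around heat_top, output like the sample above can be parsed with a short helper. The line shape (two bracketed FIDs followed by four counters) is inferred from the sample only; the meaning of the individual counters is not stated on the slide, so the parser returns them uninterpreted.

```python
import re

# Matches lines like:
#   [0x200000400:0x1:0x0] [0x100000000:0x2:0x0]: 0 740 0 775946240
# Format inferred from the heat_top sample above; counters are kept raw.
LINE_RE = re.compile(
    r"\[(?P<fid>[0-9a-fx:]+)\]\s+\[(?P<parent>[0-9a-fx:]+)\]:\s+(?P<counters>[\d\s]+)"
)

def parse_heat_top(text):
    """Return heat_top entries as dicts, preserving hottest-first order."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            entries.append({
                "fid": m.group("fid"),
                "parent": m.group("parent"),
                "counters": [int(c) for c in m.group("counters").split()],
            })
    return entries
```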
Slide 13: DDN | Flashscale
What is Flashscale? The fastest scale-out & scale-up all-flash array, for both IOPS & bandwidth acceleration: SSD for mixed-workload block & file system; fastest, low-latency & best-cost SSD; 600 GB/s & 6M IOPS, 3+ PB/rack.
Some Flashscale use cases:
- Large unstructured data set acceleration: real-time analytics (dropped calls in telco; fraud detection in web and security), IoT analytics on sensor-generated data
- Complex modeling for scientific computing: chip design, car or airplane simulation, weather
- Webscale data lakes
- Latency-sensitive large databases
Slide 14: DDN Flashscale™: Game-Changing Performance, Cost-Optimized All-Flash Array
Industry-leading performance in 4U: § 48 dual-port NVMe or 72 SAS SSDs § 6 million IOPS, 60 GB/s throughput § as low as 100 µs latency
Broad connectivity options: FC16/32, IB FDR/EDR, 10/40/100 GbE*, Omni-Path
Hyperconverged architecture: § Intel Broadwell processors § PCIe 3 embedded fabric § embedded file systems & applications
Flexible scalability: § add 6 million IOPS, 60 GB/s and 576 TB capacity with each 4U performance node § add 672 TB with each 4U capacity node
Slide 15: Burst Buffering with IME
Slide 16: Anatomy of an I/O
• Read or write?
• How large? Small or large file?
• Random or sequential? Highly threaded?
• To many files, or one? To many directories, or one?
• Aligned or not? Buffered or direct?
Slide 17: IME vs Parallel File System: Shared File Performance
[Chart]
Slide 18: IME vs Parallel File System: Shared File Performance (continued)
[Chart]
Slide 19: DDN | IME Application I/O Workflow
[Diagram: COMPUTE → NVM tier → Lustre]
1. A lightweight IME client intercepts application I/O and places fragments into buffers, plus parity
2. The IME client sends fragments to IME servers
3. IME servers write buffers to NVM and manage internal metadata
4. IME servers write aligned, sequential I/O to the SFA backend
5. The parallel file system operates at maximum efficiency
Slide 20: 4. IME Write Dataflow
[Diagram: applications with IME clients on compute nodes, multiple IME servers, parallel file system]
1. Application issues fragmented, misaligned I/O
2. IME clients send fragments to IME servers
3. Fragments arrive at IME servers and are accessible via a DHT to all clients
4. Fragments to be flushed from IME are assembled into PFS stripes
5. The PFS receives complete, aligned PFS stripes
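The fragment-to-stripe assembly at the heart of this dataflow can be sketched as follows. This is a toy single-file model, not DDN's implementation: the 1 MiB stripe size, the in-memory "NVM" dictionary, and the assumption of non-overlapping fragments are all illustrative.

```python
STRIPE = 1 << 20  # 1 MiB PFS stripe size (an assumed value)

class ImeServer:
    def __init__(self):
        self.fragments = {}  # file offset -> bytes; the DHT-visible store

    def write_fragment(self, offset, data):
        # Step 3: fragments land in NVM, indexed by file offset.
        self.fragments[offset] = data

    def flush(self, pfs):
        # Steps 4-5: coalesce fragments into full, aligned stripes and
        # hand only complete stripes to the PFS.
        by_stripe = {}
        for off, data in sorted(self.fragments.items()):
            by_stripe.setdefault(off // STRIPE, []).append((off, data))
        flushed = set()
        for stripe_no, frags in by_stripe.items():
            buf = bytearray(STRIPE)
            covered = 0  # assumes non-overlapping fragments
            for off, data in frags:
                start = off - stripe_no * STRIPE
                buf[start:start + len(data)] = data
                covered += len(data)
            if covered == STRIPE:  # only complete, aligned stripes go out
                pfs.write(stripe_no * STRIPE, bytes(buf))
                flushed.add(stripe_no)
        # Incomplete stripes stay resident until more fragments arrive.
        self.fragments = {o: d for o, d in self.fragments.items()
                          if o // STRIPE not in flushed}

class Pfs:
    def __init__(self):
        self.writes = []  # records (offset, length) of each aligned write
    def write(self, offset, data):
        self.writes.append((offset, len(data)))
```

The point of the design shows up in `Pfs.writes`: however misaligned the application's fragments are, the backend only ever sees full, stripe-aligned writes.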
Slide 21: DDN | IME Stage-In, Stage-Out
[Diagram: COMPUTE / IME / PFS]
1. The scheduler executes a Stage-In for the job's data: ime-ctl -i $INPUT_FILE
2. The scheduler then sets the job status to "ready to run"
3. The user job is scheduled and executes; all reads/writes go to IME
4. Temporary data that is not required can be purged from IME: ime-ctl -p $TMP_FILE
5. Completion of the user job initiates a sync of the output data to the PFS: ime-ctl -r $OUTPUT_FILE
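A scheduler prolog/epilog wrapper for the sequence above might look like this sketch. The ime-ctl flags (-i, -p, -r) are the ones shown on the slide; the wrapper itself and the injectable `run_cmd` hook are illustrative assumptions, not a DDN-provided interface.

```python
import subprocess

def run_job(input_file, tmp_file, output_file, job, run_cmd=subprocess.run):
    """Stage in, run, purge temp data, and sync output, per the slide."""
    run_cmd(["ime-ctl", "-i", input_file])   # 1. stage input into IME
    job()                                    # 3. job reads/writes via IME
    run_cmd(["ime-ctl", "-p", tmp_file])     # 4. purge unneeded temp data
    run_cmd(["ime-ctl", "-r", output_file])  # 5. sync output to the PFS
```

Injecting `run_cmd` keeps the sequence testable on machines without an IME installation; in production the default `subprocess.run` issues the real commands.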
Slide 22: DDN | IME Data Residency Control
• flush_threshold_ratio [0%..100%]: the maximum percentage of dirty data resident in IME before the data is automatically synchronized to the PFS
• Once synchronized, the data is marked clean
• Clean data is kept in IME until min_free_space_ratio [0%..100%] is reached
[Diagram: compute writes dirty data into IME; dirty data is synced to the PFS per flush_threshold_ratio, and clean data is purged per min_free_space_ratio]
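The interplay of the two thresholds can be modeled in a few lines. This is a toy simulation under stated assumptions (a fixed byte capacity, and an eager policy that syncs and purges immediately on every write); the real IME policy engine is more sophisticated.

```python
class ImeCache:
    """Toy model of IME residency control with the two slide ratios."""

    def __init__(self, capacity, flush_threshold_ratio, min_free_space_ratio):
        self.capacity = capacity
        self.flush_threshold = flush_threshold_ratio / 100.0
        self.min_free = min_free_space_ratio / 100.0
        self.dirty = 0      # bytes not yet synchronized to the PFS
        self.clean = 0      # bytes already synchronized, kept for reuse
        self.pfs_bytes = 0  # bytes the backend PFS has received

    def write(self, nbytes):
        self.dirty += nbytes
        self._enforce()

    def _enforce(self):
        # Sync dirty data down to flush_threshold_ratio; synced data
        # is marked clean rather than discarded.
        max_dirty = int(self.capacity * self.flush_threshold)
        if self.dirty > max_dirty:
            synced = self.dirty - max_dirty
            self.dirty -= synced
            self.clean += synced
            self.pfs_bytes += synced
        # Purge clean data until min_free_space_ratio is restored.
        min_free = int(self.capacity * self.min_free)
        free = self.capacity - self.dirty - self.clean
        if free < min_free:
            self.clean -= min(self.clean, min_free - free)
```

Note the asymmetry the slide describes: syncing is driven by how much *dirty* data has accumulated, while purging is driven by how little *free* space remains, and only clean (already-synchronized) data is ever purged.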
Slide 23: 4. IME Erasure Coding
▶ Data protection against IME server or SSD failure is optional (the lost data is "just cache")
▶ Erasure coding is calculated at the client: great scaling with extremely high client counts, and the servers don't get clogged up
▶ Erasure coding does reduce usable client bandwidth and usable IME capacity:
• 3+1: 56 Gb/s → 42 Gb/s
• 5+1: 56 Gb/s → 47 Gb/s
• 7+1: 56 Gb/s → 49 Gb/s
• 8+1: 56 Gb/s → 50 Gb/s
[Diagram: compute nodes assemble data buffers plus a parity buffer, distributed across IME servers before flushing to the Lustre PFS]
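The k+1 bandwidth figures above follow directly from parity overhead: with k data buffers per parity buffer, only a fraction k/(k+1) of the client link carries payload. The 56 Gb/s link (FDR InfiniBand) is taken from the slide; the helper below just reproduces the arithmetic.

```python
LINK_GBPS = 56  # client link rate from the slide (FDR InfiniBand)

def usable_bandwidth(k, m=1, link=LINK_GBPS):
    """Usable client bandwidth under k-data + m-parity erasure coding."""
    return link * k / (k + m)

for k in (3, 5, 7, 8):
    # Prints 42, 47, 49 and 50 Gb/s, matching the slide's figures.
    print(f"{k}+1: {usable_bandwidth(k):.0f} Gb/s")
```

The same k/(k+1) factor applies to usable IME capacity, which is why wider stripes (8+1) trade a little resilience granularity for noticeably less overhead than 3+1.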
Slide 24: Aggregate IME Adaptive vs. Non-Adaptive WRITE Performance
Amdahl's Law in action!
[Chart: aggregate write bandwidth for an ideal, healthy system; one degraded IME server with adaptive placement; one degraded IME server, non-adaptive]
Slides 25-27: IME vs. Lustre with Different I/O Sizes
[Chart, built up across three slides; annotations: small I/O size penalty, flat behaviour]
Slide 28: IOR (MPI-IO, Lustre and IME): 32 clients, 512 processes, 3.3 TB file size, 7000 segments
[Two charts: throughput (MB/s) for write and read, file-per-process (FPP) and single-shared-file (SSF), at 10K, 100K and 1000K transfer sizes, with the hardware limit marked on each]
IOR (Lustre, MPI-IO): FPP I/O efficiency ~84% (write), ~90% (read); SSF I/O efficiency ~5% (write), ~22% (read)
IOR (IME, MPI-IO): FPP I/O efficiency ~97% (write), ~78% (read); SSF I/O efficiency ~97% (write), ~81% (read); still under optimization
Slide 29: Results: 3x Full-Application Speed-Up
▶ Test results reflect elapsed times for computing a single "shot", which includes all the computation, MPI communication and I/O time; one-time initialization time is excluded
▶ Speed-ups are normalized relative to Lustre-only performance
▶ IME approaches the performance of smaller-scale in-memory runs, and the benefit of IME increases with scale
▶ IME yields a 3x full-application speed-up over Lustre alone
[Chart: elapsed time for a single shot for different test cases. In the large case the data no longer fits within memory; IME still allows a 3x full-application speed-up over Lustre alone]
Slide 30: 4. Rack Performance: IME
[Charts: IOR file-per-process throughput (GB/s), write and read; 4k random IOPS (log scale), write and read]
500 GB/s read and write; 50 million IOPS; 768 TB flash capacity
Slide 31: JCAHPC System: DDN | Omni-Path | NVMe
• University of Tokyo & University of Tsukuba
• 25 PF system with 8208 KNL nodes, provided by Fujitsu
• I/O system by DDN: ► Intel Omni-Path Architecture ► 26 PB ExaScaler at 400 GB/s ► 1 PB of IME burst buffer with NVMe at 1400 GB/s
Slide 32: Summary
▶ SSDs can today be seamlessly introduced into a Lustre file system
• Modest investment in SSDs
• Intelligent, policy-driven placement moves the most appropriate blocks/files to the SSD cache
• Block-level and Lustre-object-level data placement schemes
▶ IME is a ground-up NVM distributed cache which adds:
• Write performance optimisation (not just read)
• Small, random I/O optimisations
• Shared (many-to-one) file optimisations
• Improved SSD lifetime
• Back-end Lustre I/O optimisation
Slide 33: Thank You! Keep in touch with us
9351 Deering Avenue, Chatsworth, CA 91311
1.800.837.2298 | 1.818.700.4000
company/datadirect-networks | @ddn_limitless