Transcript of DDN and Flash presentation · 2020-01-15 · © 2016 DataDirect Networks, Inc.
ddn.com © 2016 DataDirect Networks, Inc. * Other names and brands may be claimed as the property of others. Any statements or representations around future events are subject to change.
Slide 1: DDN and Flash: GRIDScaler, Flashscale, Infinite Memory Engine
T. Cecchi, September 21st 2016, HPC Advisory Council
Slide 2: DDN End-to-End Data Lifecycle Management
• 1000X APP & FILE SYSTEM SPEED UP
• PREDICTIVE BURST BUFFER
• FASTEST, LOW LATENCY & BEST COST SSD • 600GB/S & 60M IOPS, 5+PB/RACK
• FASTEST, SCALABLE OBJECT STORAGE • GLOBAL MULTI-SITE SECURE TIERING
• OPEN SOURCE & HYBRID CLOUD • BRIDGE S3, SWIFT, GPFS™ & LUSTRE®
BURST & COMPUTE
SSD, DISK & FILE SYSTEM
LIVE ARCHIVE, SHARING & CLOUD
FlashScale
• MOST VERSATILE EMBEDDED FILE & BLOCK STORAGE • EDR IB, 100 GbE, Omnipath, PCI-E
Slide 3: Using Flash
1) Conventional(ish) caching: Lustre examples
2) All-flash file systems: FlashScale
3) Burst buffers: Infinite Memory Engine
Slide 4: Why SSD Cache? Don't blow the power/space/management budget on spindles
[Chart: Write MB/s per PB (log scale) and 4k read IOPS per PB for Seagate Makara (4 TB), HGST Ultrastar (1.8 TB, 10k rpm), Seagate Tardis (4 TB, 10k rpm), SanDisk CloudSpeed (SATA SSD), Toshiba PX04 (SAS SSD), Intel DC P3700 (NVMe SSD)]
SSDs are still pricey, so: ▶ Optimise data for SSDs ▶ Optimise SSDs for data
Slide 5: Lustre and Flash
Slide 6: DDN | ES14K, Designed for Flash and NVMe
Industry-leading performance in 4U: § up to 40 GB/s throughput § up to 6 million IOPS to cache § up to 3.5 million IOPS to storage § 1 PB+ capacity (with 16 TB SSDs) § 100 µs latency
Configuration options: § 72 SAS SSDs or 48 NVMe § SSDs or HDDs only § HDDs with SSD caching § SSDs with HDD tier
Connectivity: § FDR/EDR § Omni-Path § 40/100 GbE
Slide 7: SSD Options for Lustre
[Diagram: OSS servers with three SSD options]
1. SFX: block-level read cache; instant commit, DSS, fadvise(); millions of read IOPS
2. L2RC: OSS-level read cache; heuristics with FileHeat; millions of read IOPS
3. All-SSD Lustre: Lustre HSM for data tiering to an HDD namespace; generic Lustre I/O; millions of read and write IOPS
Slide 8: 1. Rack Performance: Lustre
[Charts: IOR file-per-process throughput (GB/s), write and read; 4k random read IOPS]
350 GB/s read and write (IOR); 4 million IOPS; up to 4 PB flash capacity
Slide 9: 2. SFX & ReACT: Accelerating Reads, Integrated with Lustre DSS
[Diagram: OSS with DRAM cache, SFX tier (DSS / SFX API) and HDD tier; labels: Large Reads, Small Reads, Small Rereads, Cache Warm]
Slide 10: 2. 4 KiB Random I/O
[Chart: 4 KiB random IOPS, read vs. write, for four cases: no SFX, SSD metadata; no SFX, metadata mix; with SFX, metadata mix (first-time I/O); SFX read hit (second-time I/O)]
Read IOPS: 15,587; 14,184; 17,070; 174,344. Write IOPS: 14,486; 14,984; 13,008; 13,001.
Slide 11: 3. Lustre L2RC and File Heat
[Diagram: OSS servers maintain a heap of object heat values and prefetch accordingly; higher heat means higher access frequency, lower heat means lower access frequency]
OSS-based read caching:
• Uses SSDs (or SFA SSD pools) on the OSS as a read cache
• Automatic prefetch management based on file heat
• File heat is a relative (tunable) attribute that reflects file access frequency
• Indexes are kept in memory (worst case: 10 GB of memory per 1 TB of SSD)
• Efficient space management for the SSD cache space (4 KB to 1 MB extents)
• Full support for ladvise in Lustre
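The heat-driven caching described above can be sketched in a few lines. This is a toy model, not DDN's or Lustre's implementation: the decay-and-blend update rule and the `replacement_pct` default are assumptions that mirror the `heat_period_second` / `heat_replacement_percentage` tunables shown on the next slide.

```python
import heapq

class ObjectHeat:
    """Toy model of file heat: per-period exponential smoothing (a sketch)."""

    def __init__(self, replacement_pct=25):
        self.replacement_pct = replacement_pct
        self.heat = {}      # object id -> smoothed heat value
        self.current = {}   # accesses observed in the current period

    def record_access(self, obj_id, nbytes=1):
        self.current[obj_id] = self.current.get(obj_id, 0) + nbytes

    def end_period(self):
        # At each period boundary, old heat decays and recent activity
        # is blended in (the replacement percentage controls the blend).
        w = self.replacement_pct / 100.0
        for obj_id in set(self.heat) | set(self.current):
            old = self.heat.get(obj_id, 0.0)
            new = self.current.get(obj_id, 0.0)
            self.heat[obj_id] = old * (1 - w) + new * w
        self.current.clear()

    def hottest(self, n):
        # Heap of object heat values, as on the OSS: prefetch candidates.
        return heapq.nlargest(n, self.heat.items(), key=lambda kv: kv[1])
```

Objects accessed often in recent periods float to the top of the heap and become prefetch candidates; untouched objects decay toward zero and age out of the cache.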
Slide 12: 3. File Heat Utility
▶ Tune the file-heat parameters via proc interfaces:
/proc/fs/lustre/heat_period_second
/proc/fs/lustre/heat_replacement_percentage
▶ Utility to get file heat values: lfs heat_get <file>
▶ Utility to switch heat accounting on/off:
lfs heat_set [--clear|-c] [--off|-o] [--on|-O] <file>
▶ Heaps on OSTs can be used to dump lists of FIDs sorted by heat:
[root@server9-Centos6-vm01 cache]# cat /proc/fs/lustre/obdfilter/lustre-OST0000/heat_top
[0x200000400:0x1:0x0] [0x100000000:0x2:0x0]: 0 740 0 775946240
[0x200000400:0x9:0x0] [0x100000000:0x6:0x0]: 0 300 0 314572800
[0x200000400:0x8:0x0] [0x100000000:0x5:0x0]: 0 199 0 208666624
[0x200000400:0x7:0x0] [0x100000000:0x4:0x0]: 0 100 0 104857600
[0x200000400:0x6:0x0] [0x100000000:0x3:0x0]: 0 100 0 104857600
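For scripting around heat_top, output like the sample above can be parsed with a short helper. The line shape (two bracketed FIDs followed by four counters) is inferred from the sample only; the meaning of the individual counters is not stated on the slide, so the parser returns them uninterpreted.

```python
import re

# Matches lines like:
#   [0x200000400:0x1:0x0] [0x100000000:0x2:0x0]: 0 740 0 775946240
# Format inferred from the heat_top sample above; counters are kept raw.
LINE_RE = re.compile(
    r"\[(?P<fid>[0-9a-fx:]+)\]\s+\[(?P<parent>[0-9a-fx:]+)\]:\s+(?P<counters>[\d\s]+)"
)

def parse_heat_top(text):
    """Return heat_top entries as dicts, preserving hottest-first order."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            entries.append({
                "fid": m.group("fid"),
                "parent": m.group("parent"),
                "counters": [int(c) for c in m.group("counters").split()],
            })
    return entries
```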
Slide 13: DDN | Flashscale
What is Flashscale? The fastest scale-out & scale-up all-flash array, for both IOPS & bandwidth acceleration: SSD for mixed-workload block & file system; fastest, low-latency & best-cost SSD; 600 GB/s & 6M IOPS, 3+ PB/rack.
Some Flashscale use cases:
- Large unstructured data set acceleration: real-time analytics (dropped calls in telco; fraud detection in web and security), IoT analytics on sensor-generated data
- Complex modeling for scientific computing: chip design, car or airplane simulation, weather
- Webscale data lakes
- Latency-sensitive large databases
Slide 14: DDN Flashscale™: Game-Changing Performance, Cost-Optimized All-Flash Array
Industry-leading performance in 4U: § 48 dual-port NVMe or 72 SAS SSDs § 6 million IOPS, 60 GB/s throughput § as low as 100 µs latency
Broad connectivity options: FC16/32, IB FDR/EDR, 10/40/100 GbE*, Omni-Path
Hyperconverged architecture: § Intel Broadwell processors § PCIe 3 embedded fabric § embedded file systems & applications
Flexible scalability: § add 6 million IOPS, 60 GB/s and 576 TB capacity with each 4U performance node § add 672 TB with each 4U capacity node
Slide 15: Burst Buffering with IME
Slide 16: Anatomy of an I/O
• Read or write?
• How large? Small or large file?
• Random or sequential? Highly threaded?
• To many files, or one? To many directories, or one?
• Aligned or not? Buffered or direct?
Slide 17: IME vs Parallel File System: Shared File Performance
[Chart]
Slide 18: IME vs Parallel File System: Shared File Performance (continued)
[Chart]
Slide 19: DDN | IME Application I/O Workflow
[Diagram: COMPUTE → NVM tier → Lustre]
1. A lightweight IME client intercepts application I/O and places fragments into buffers, plus parity
2. The IME client sends fragments to IME servers
3. IME servers write buffers to NVM and manage internal metadata
4. IME servers write aligned, sequential I/O to the SFA backend
5. The parallel file system operates at maximum efficiency
Slide 20: 4. IME Write Dataflow
[Diagram: applications with IME clients on compute nodes, multiple IME servers, parallel file system]
1. Application issues fragmented, misaligned I/O
2. IME clients send fragments to IME servers
3. Fragments arrive at IME servers and are accessible via a DHT to all clients
4. Fragments to be flushed from IME are assembled into PFS stripes
5. The PFS receives complete, aligned PFS stripes
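The fragment-to-stripe assembly at the heart of this dataflow can be sketched as follows. This is a toy single-file model, not DDN's implementation: the 1 MiB stripe size, the in-memory "NVM" dictionary, and the assumption of non-overlapping fragments are all illustrative.

```python
STRIPE = 1 << 20  # 1 MiB PFS stripe size (an assumed value)

class ImeServer:
    def __init__(self):
        self.fragments = {}  # file offset -> bytes; the DHT-visible store

    def write_fragment(self, offset, data):
        # Step 3: fragments land in NVM, indexed by file offset.
        self.fragments[offset] = data

    def flush(self, pfs):
        # Steps 4-5: coalesce fragments into full, aligned stripes and
        # hand only complete stripes to the PFS.
        by_stripe = {}
        for off, data in sorted(self.fragments.items()):
            by_stripe.setdefault(off // STRIPE, []).append((off, data))
        flushed = set()
        for stripe_no, frags in by_stripe.items():
            buf = bytearray(STRIPE)
            covered = 0  # assumes non-overlapping fragments
            for off, data in frags:
                start = off - stripe_no * STRIPE
                buf[start:start + len(data)] = data
                covered += len(data)
            if covered == STRIPE:  # only complete, aligned stripes go out
                pfs.write(stripe_no * STRIPE, bytes(buf))
                flushed.add(stripe_no)
        # Incomplete stripes stay resident until more fragments arrive.
        self.fragments = {o: d for o, d in self.fragments.items()
                          if o // STRIPE not in flushed}

class Pfs:
    def __init__(self):
        self.writes = []  # records (offset, length) of each aligned write
    def write(self, offset, data):
        self.writes.append((offset, len(data)))
```

The point of the design shows up in `Pfs.writes`: however misaligned the application's fragments are, the backend only ever sees full, stripe-aligned writes.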
Slide 21: DDN | IME Stage-In, Stage-Out
[Diagram: COMPUTE / IME / PFS]
1. The scheduler executes a Stage-In for the job's data: ime-ctl -i $INPUT_FILE
2. The scheduler then sets the job status to "ready to run"
3. The user job is scheduled and executes; all reads/writes go to IME
4. Temporary data that is not required can be purged from IME: ime-ctl -p $TMP_FILE
5. Completion of the user job initiates a sync of the output data to the PFS: ime-ctl -r $OUTPUT_FILE
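A scheduler prolog/epilog wrapper for the sequence above might look like this sketch. The ime-ctl flags (-i, -p, -r) are the ones shown on the slide; the wrapper itself and the injectable `run_cmd` hook are illustrative assumptions, not a DDN-provided interface.

```python
import subprocess

def run_job(input_file, tmp_file, output_file, job, run_cmd=subprocess.run):
    """Stage in, run, purge temp data, and sync output, per the slide."""
    run_cmd(["ime-ctl", "-i", input_file])   # 1. stage input into IME
    job()                                    # 3. job reads/writes via IME
    run_cmd(["ime-ctl", "-p", tmp_file])     # 4. purge unneeded temp data
    run_cmd(["ime-ctl", "-r", output_file])  # 5. sync output to the PFS
```

Injecting `run_cmd` keeps the sequence testable on machines without an IME installation; in production the default `subprocess.run` issues the real commands.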
Slide 22: DDN | IME Data Residency Control
• flush_threshold_ratio [0%..100%]: the maximum percentage of dirty data resident in IME before the data is automatically synchronized to the PFS
• Once synchronized, the data is marked clean
• Clean data is kept in IME until min_free_space_ratio [0%..100%] is reached
[Diagram: compute writes dirty data into IME; dirty data is synced to the PFS per flush_threshold_ratio, and clean data is purged per min_free_space_ratio]
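The interplay of the two thresholds can be modeled in a few lines. This is a toy simulation under stated assumptions (a fixed byte capacity, and an eager policy that syncs and purges immediately on every write); the real IME policy engine is more sophisticated.

```python
class ImeCache:
    """Toy model of IME residency control with the two slide ratios."""

    def __init__(self, capacity, flush_threshold_ratio, min_free_space_ratio):
        self.capacity = capacity
        self.flush_threshold = flush_threshold_ratio / 100.0
        self.min_free = min_free_space_ratio / 100.0
        self.dirty = 0      # bytes not yet synchronized to the PFS
        self.clean = 0      # bytes already synchronized, kept for reuse
        self.pfs_bytes = 0  # bytes the backend PFS has received

    def write(self, nbytes):
        self.dirty += nbytes
        self._enforce()

    def _enforce(self):
        # Sync dirty data down to flush_threshold_ratio; synced data
        # is marked clean rather than discarded.
        max_dirty = int(self.capacity * self.flush_threshold)
        if self.dirty > max_dirty:
            synced = self.dirty - max_dirty
            self.dirty -= synced
            self.clean += synced
            self.pfs_bytes += synced
        # Purge clean data until min_free_space_ratio is restored.
        min_free = int(self.capacity * self.min_free)
        free = self.capacity - self.dirty - self.clean
        if free < min_free:
            self.clean -= min(self.clean, min_free - free)
```

Note the asymmetry the slide describes: syncing is driven by how much *dirty* data has accumulated, while purging is driven by how little *free* space remains, and only clean (already-synchronized) data is ever purged.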
Slide 23: 4. IME Erasure Coding
▶ Data protection against IME server or SSD failure is optional (the lost data is "just cache")
▶ Erasure coding is calculated at the client: great scaling with extremely high client counts, and the servers don't get clogged up
▶ Erasure coding does reduce usable client bandwidth and usable IME capacity:
• 3+1: 56 Gb/s → 42 Gb/s
• 5+1: 56 Gb/s → 47 Gb/s
• 7+1: 56 Gb/s → 49 Gb/s
• 8+1: 56 Gb/s → 50 Gb/s
[Diagram: compute nodes assemble data buffers plus a parity buffer, distributed across IME servers before flushing to the Lustre PFS]
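The k+1 bandwidth figures above follow directly from parity overhead: with k data buffers per parity buffer, only a fraction k/(k+1) of the client link carries payload. The 56 Gb/s link (FDR InfiniBand) is taken from the slide; the helper below just reproduces the arithmetic.

```python
LINK_GBPS = 56  # client link rate from the slide (FDR InfiniBand)

def usable_bandwidth(k, m=1, link=LINK_GBPS):
    """Usable client bandwidth under k-data + m-parity erasure coding."""
    return link * k / (k + m)

for k in (3, 5, 7, 8):
    # Prints 42, 47, 49 and 50 Gb/s, matching the slide's figures.
    print(f"{k}+1: {usable_bandwidth(k):.0f} Gb/s")
```

The same k/(k+1) factor applies to usable IME capacity, which is why wider stripes (8+1) trade a little resilience granularity for noticeably less overhead than 3+1.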
Slide 24: Aggregate IME Adaptive vs. Non-Adaptive WRITE Performance
Amdahl's Law in action!
[Chart: aggregate write bandwidth for an ideal, healthy system; one degraded IME server with adaptive placement; one degraded IME server, non-adaptive]
Slides 25-27: IME vs. Lustre with Different I/O Sizes
[Chart, built up across three slides; annotations: small I/O size penalty, flat behaviour]
Slide 28: IOR (MPI-IO, Lustre and IME): 32 clients, 512 processes, 3.3 TB file size, 7000 segments
[Two charts: throughput (MB/s) for write and read, file-per-process (FPP) and single-shared-file (SSF), at 10K, 100K and 1000K transfer sizes, with the hardware limit marked on each]
IOR (Lustre, MPI-IO): FPP I/O efficiency ~84% (write), ~90% (read); SSF I/O efficiency ~5% (write), ~22% (read)
IOR (IME, MPI-IO): FPP I/O efficiency ~97% (write), ~78% (read); SSF I/O efficiency ~97% (write), ~81% (read); still under optimization
Slide 29: Results: 3x Full-Application Speed-Up
▶ Test results reflect elapsed times for computing a single "shot", which includes all the computation, MPI communication and I/O time; one-time initialization time is excluded
▶ Speed-ups are normalized relative to Lustre-only performance
▶ IME approaches the performance of smaller-scale in-memory runs, and the benefit of IME increases with scale
▶ IME yields a 3x full-application speed-up over Lustre alone
[Chart: elapsed time for a single shot for different test cases. In the large case the data no longer fits within memory; IME still allows a 3x full-application speed-up over Lustre alone]
Slide 30: 4. Rack Performance: IME
[Charts: IOR file-per-process throughput (GB/s), write and read; 4k random IOPS (log scale), write and read]
500 GB/s read and write; 50 million IOPS; 768 TB flash capacity
Slide 31: JCAHPC System: DDN | Omni-Path | NVMe
• University of Tokyo & University of Tsukuba
• 25 PF system with 8208 KNL nodes, provided by Fujitsu
• I/O system by DDN: ► Intel Omni-Path Architecture ► 26 PB ExaScaler at 400 GB/s ► 1 PB of IME burst buffer with NVMe at 1400 GB/s
Slide 32: Summary
▶ SSDs can today be seamlessly introduced into a Lustre file system
• Modest investment in SSDs
• Intelligent, policy-driven placement moves the most appropriate blocks/files to the SSD cache
• Block-level and Lustre-object-level data placement schemes
▶ IME is a ground-up NVM distributed cache which adds:
• Write performance optimisation (not just read)
• Small, random I/O optimisations
• Shared (many-to-one) file optimisations
• Improved SSD lifetime
• Back-end Lustre I/O optimisation
Slide 33: Thank You! Keep in touch with us
9351 Deering Avenue, Chatsworth, CA 91311
1.800.837.2298 | 1.818.700.4000
company/datadirect-networks | @ddn_limitless