SFA12KX and Lustre Update
Sep 2014
Maria Perez Gutierrez, HPC Specialist
HPC Advisory Council
Agenda
▶ SFA12KX – Feature update
  • Partial rebuilds
  • QoS on reads
▶ Lustre metadata performance update
Big Data & Cloud Infrastructure: DDN FY14 Focus Areas
▶ EXAScaler™ – Petascale Lustre® Storage
  • 10Ks of clients, 1TB/s+, HSM
  • Linux HPC clients, NFS
▶ GRIDScaler™ – Enterprise Scale-Out File Storage
  • ~10K clients, 1TB/s+, HSM
  • Linux/Windows HPC clients, NFS & CIFS
▶ Storage Fusion Architecture™ Core Storage Platforms
  • SFA12KX/E: 40GB/s, 1.7M IOPS, 1,680 drives supported, embedded computing
  • SFA7700: 13GB/s, 600K IOPS (7700x, 7700E*)
  • Flexible drive configuration: SAS, SATA, SSD
  • SFX automated flash caching: Read, Context Commit, Instant Commit
▶ WOS® 3.0 – Geo-Replicated Cloud Storage
  • 32 trillion unique objects, 256 million objects/second
  • Self-healing cloud, parallel Boolean search, cloud foundation
  • WOS7000: 60 drives in 4U, self-contained servers, S3/Swift
▶ DirectMon – Big Data Platform Management, Cloud Tiering
▶ Infinite Memory Engine™ – Distributed File System Buffer Cache* [Demo]
FY14 Highlights
  • Platform: SFA hardening, higher-speed embedded computing, WolfCreek, 7700E*, 7700x*
  • IME: full speed on IME, PFS acceleration, more use cases under review
  • WOS: S3/Swift WOS access, cost reduction, performance improvements
* GA = future release
SFA12KX | GA: Q2 2014
SFA 12K Family addition: SFA12KX High-Scale Active/Active Block Storage Appliance, Available May 2014
Specifications
  • Appliance: Active/Active block controllers
  • CPU: dual-socket 8-core Intel Ivy Bridge
  • Memory & battery-backed cache: 128GB DDR3-1866, 64GB mirrored
  • SFX flash cache: up to 12.8TB of write-intensive SSDs
  • RAID levels: 1/5/6
  • IB host connectivity: 16 x 56Gb/s (FDR)
  • FC host connectivity: 32 x 16Gb/s
  • Drive support: 2.5" and 3.5" SAS, SATA, SSD
  • Max JBOD expansion: up to 20
  • Supported JBODs: SS7000, SS8460
[Chart: SFA12KX with 20x SS8460 enclosures – InfiniBand SRP bandwidth (GB/s): ~41GB/s write, ~48GB/s read. *Estimated performance; large-block, full-stripe I/O; read I/O includes parity verification]
Introducing 12KX (Q2 2014)
[Chart: Sequential Write Throughput – rate (MB/s, 0-45,000) vs. pool count (1-56) for 64K, 256K, 512K, 1M, 2M, 4M and 8M I/O sizes]
✓ Over 40 GB/s reads AND writes
✓ Over 20 GB/s MIRRORED writes
✓ Linear scalability
✓ SFX ready
✓ Latest Intel processor
✓ 32 ports FC-16
[Chart: Random Write Throughput – rate (MB/s, 0-25,000) vs. pool count (1-56) for 512K, 1M, 2M, 4M and 8M I/O sizes]
SFA Feature Baseline
▶ High-density array: up to 840 HDDs in a single rack; 84 HDDs/SSDs in 4U
▶ Storage Fusion Fabric: highly over-provisioned, RAIDed backend fabric; withstands failures of drives, enclosures, cabling, etc.
▶ Reliability, availability & monitoring/management: RAID 10 (8+2), fast pool rebuilds, partial rebuilds, adjustable rebuild priority, SSD life counter
▶ Priority, queuing, real-time: QoS for read operations, critical for striped-file performance consistency; ReACT; I/O fairness; lowest latency for small I/O; highest bandwidth for large & streaming I/O; dialed-in host I/O priority during rebuilds
▶ Performance & flash: real-time, multi-CPU RAID engine; interrupt-free, massively parallel I/O delivery system; RAID rebuild acceleration; SFX cache, an automated, data-driven caching system enabling hybrid SSD & HDD systems
▶ Data integrity & security: DirectProtect™ RAID 1/5/6 real-time data-integrity verification for every I/O; SED management; at-rest encryption of all user data; instant crypto-erase
SFA Partial Rebuild Example: Persistent Partial Rebuild
1. Complete enclosure removed for F/W upgrade.
2. Controllers send I/O destined for the failed enclosure to available drives and hold the corresponding metadata in synchronous mirrored cache.
3. Controller 1 fails; I/O continues.
4. Complete system outage due to power failure; controller 2 writes its cache to onboard drives.
5. Power restored; controller cache restored; the upgraded enclosure undergoes a partial rebuild of the cached data in minutes.
84 x 4TB disks (336TB) removed for hours, rebuilt in minutes.
SFA Quality of Service: Read-Retry Timeouts Coupled with DirectProtect DIF
In a RAID 6 (8+2) pool of NL-SAS disks, one member has ~100% higher average latency than the others; production is not impacted, thanks to DDN's QoS.
Lustre Metadata Benchmarks and Performance: How to Scale Lustre Metadata Performance
Lustre Metadata Performance
▶ Lustre metadata performance is a crucial metric for many Lustre users
  • LU-56 SMP scaling (Lustre-2.3)
  • DNE (Lustre-2.4)
▶ Metadata performance is closely related to small-file performance on Lustre
▶ But metadata performance is still a little mysterious :)
  • How does performance differ by metadata operation type and access pattern?
  • What is the impact of hardware resources on metadata performance?
▶ This presentation uses standard metadata benchmark tools to analyze metadata performance on Lustre today
Lustre Metadata Benchmark Tools
▶ mds-survey
  • Built into the Lustre code
  • Similar to obdfilter-survey
  • Generates load on the MDS to simulate Lustre metadata performance
▶ mdtest
  • The major metadata benchmark tool, used by many large HPC sites
  • Runs on clients using MPI
  • Several metadata operations and access patterns are supported
Example invocations of both tools are sketched below.
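A minimal sketch of how each tool is typically driven; the thread counts, file counts, and paths below are illustrative assumptions, not the configuration used in these tests:

# mds-survey (shipped with lustre-iokit) runs directly on the MDS and is
# controlled through environment variables:
thrlo=4 thrhi=16 file_count=100000 dir_count=4 \
  tests_str="create lookup destroy" sh mds-survey

# mdtest runs on the clients under MPI; -n is items per process,
# -u gives every process its own unique working directory:
mpirun -np 16 mdtest -n 10000 -u -d /lustre/mdtest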
Single Client Metadata Performance Limitation
▶ Single-client metadata performance does not scale with threads.
lustre/include/lustre_mdc.h:

/**
 * Serializes in-flight MDT-modifying RPC requests to preserve idempotency.
 *
 * This mutex is used to implement execute-once semantics on the MDT.
 * The MDT stores the last transaction ID and result for every client in
 * its last_rcvd file. If the client doesn't get a reply, it can safely
 * resend the request and the MDT will reconstruct the reply being aware
 * that the request has already been executed. Without this lock,
 * execution status of concurrent in-flight requests would be
 * overwritten.
 *
 * This design limits the extent to which we can keep a full pipeline of
 * in-flight requests from a single client. This limitation could be
 * overcome by allowing multiple slots per client in the last_rcvd file.
 */
struct mdc_rpc_lock {
        /** Lock protecting in-flight RPC concurrency. */
        struct mutex rpcl_mutex;
        /** Intent associated with currently executing request. */
        struct lookup_intent *rpcl_it;
        /** Used for MDS/RPC load testing purposes. */
        int rpcl_fakes;
};
▶ LU-5319 adds support for multiple slots per client in the last_rcvd file (under development by Intel and Bull).
[Chart: Single Client Metadata Performance (Unique) – ops/sec (0-60,000) vs. number of threads (1, 2, 4, 8, 16) for file creation, file stat, and file removal]
A client can send many metadata requests to the MDS simultaneously, but the MDS must record each client's last transaction ID, and that update is serialized.
Modified mdtest for Lustre
▶ Basic function
  • Supports multiple mount points on a single client (see the mount sketch below)
  • Helps generate heavy metadata load from a single client
▶ Background
  • Originally developed by Liang Zhen for the LU-56 work
  • We rebased and cleaned up the code and made a few enhancements
▶ Enables metadata benchmarks on a small number of clients
  • Regression testing
  • MDS server sizing
  • Performance optimization
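As an illustration of the multi-mount-point setup (the MGS node name and paths are assumptions for this sketch), the same filesystem is simply mounted at many points on one client:

# Mount one Lustre filesystem at 32 mount points on a single client.
for i in $(seq 0 31); do
  mkdir -p /lustre_$i
  mount -t lustre mgs@o2ib:/lustre /lustre_$i
done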
Performance Comparison
A single Lustre client mounts /lustre_0, /lustre_1, ..., /lustre_31 for a single filesystem.
[Chart: Single Client Metadata Performance (Unique, multi-mountpoint) – ops/sec (0-60,000) vs. number of threads (1-16) for file creation, file stat, and file removal]
[Chart: Single Client Metadata Performance (Unique, single mountpoint) – same axes and series, for comparison]
# mdtest -n 10000 -u -d /lustre_{0-15}
Benchmark Configuration
  • 2 x MDS (2 x E5-2676v2, 128GB memory)
  • 4 x OSS (2 x E5-2680v2, 128GB memory)
  • 32 x clients (2 x E5-2680, 128GB memory)
  • SFA12K-40: 400 x NL-SAS for OSTs, 8 x SSD for MDT
  • Lustre-2.5 for servers; Lustre-2.6.52 and Lustre-1.8.9 for clients
Metadata Benchmark Method
▶ Tested metadata operations
  • Directory/file creation
  • Directory/file stat
  • Directory/file removal
▶ Access patterns (see the mdtest sketch below)
  • Unique directory vs. shared directory
    o P0 -> /lustre/Dir0/file.0.0, P1 -> /lustre/Dir1/file.0.1 (Unique)
    o P0 -> /lustre/Dir/file.0.0, P1 -> /lustre/Dir/file.1.0 (Shared)
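A sketch of how the two patterns map onto mdtest (process count and paths are illustrative assumptions; 1024 processes x 1250 items matches the 1.28M files used in the following slides):

# Unique: -u gives each process its own working directory.
mpirun -np 1024 mdtest -n 1250 -u -d /lustre/unique

# Shared: without -u, all processes operate in one shared directory.
mpirun -np 1024 mdtest -n 1250 -d /lustre/shared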
Lustre Metadata Performance: Impact of MDS CPU Speed
▶ Metadata performance comparison (unique directory)
  • 32 clients (1024 mount points), 1024 processes, 1.28M files
  • Tested on 16 CPU cores at MDS CPU speeds of 2.1, 2.5, 2.8, 3.3 and 3.6GHz
[Chart: Directory Operations (Unique) – relative performance (0-180%) vs. MDS CPU speed (2.1-3.6GHz) for directory creation, stat, and removal]
[Chart: File Operations (Unique) – relative performance (0-180%) vs. MDS CPU speed (2.1-3.6GHz) for file creation, stat, and removal; annotated gains range from roughly 20% to 70%]
Lustre Metadata Performance: Impact of MDS CPU Speed
▶ Metadata performance comparison (shared directory)
  • 32 clients (1024 mount points), 1024 processes, 1.28M files
  • Tested on 16 CPU cores at MDS CPU speeds of 2.1, 2.5, 2.8, 3.3 and 3.6GHz
[Chart: Directory Operations (Shared) – relative performance (0-180%) vs. MDS CPU speed (2.1-3.6GHz) for directory creation, stat, and removal]
[Chart: File Operations (Shared) – relative performance (0-180%) vs. MDS CPU speed (2.1-3.6GHz) for file creation, stat, and removal; annotated gains range from roughly 22% to 58%]
Lustre Metadata Performance: Impact of MDS CPU Cores
▶ Metadata performance comparison (unique directory)
  • 32 clients (1024 mount points), 1024 processes, 1.28M files
  • Tested at 3.3GHz CPU speed with 8, 12 and 16 CPU cores, with and without logical processors
[Chart: Directory Operations (Unique) – relative performance (0-250%) vs. MDS core count for directory creation, stat, and removal]
[Chart: File Operations (Unique) – relative performance (0-250%) vs. MDS core count for file creation, stat, and removal; annotated gains range from roughly 25% to 100%, and 80-120% in places]
Lustre Metadata Performance: Impact of MDS CPU Cores
▶ Metadata performance comparison (shared directory)
  • 32 clients (1024 mount points), 1024 processes, 1.28M files
  • Tested at 3.3GHz CPU speed with 8, 12 and 16 CPU cores, with and without logical processors
[Chart: Directory Operations (Shared) – relative performance (0-160%) vs. MDS core count for directory creation, stat, and removal]
[Chart: File Operations (Shared) – relative performance (0-180%) vs. MDS core count for file creation, stat, and removal]
Shared-directory creation and stat do not scale, with no gain from 12 to 16 CPU cores (the largest annotated gain is ~60%).
Lustre Metadata Performance: MDS Scalability (Unique Directory)
[Chart: Lustre Metadata Scalability (Unique; 32 clients, 1024 mount points, multiple mount points per client) – ops/sec (0-250,000) vs. number of threads (16-1024) for file creation, file removal, directory creation, and directory removal, with and without DNE. Callouts: DNE roughly doubles file creation/removal (+100%); directory creation reaches about 50% of file-creation performance]
Lustre Metadata Performance: MDS Scalability (Shared Directory)
[Chart: Lustre Metadata Scalability (Shared; 32 clients, 1024 mount points, multiple mount points per client) – ops/sec (0-100,000) vs. number of threads (16-1024) for file creation, file removal, directory creation, and directory removal. Callout: roughly 80% of the performance of the unique-directory pattern]
Lustre Metadata Performance: File Creation and Removal for Small Files
[Chart: Small File Performance (Unique Directory; 32 clients, 1024 mount points) – ops/sec (0-180,000) vs. file size (0-131,072 bytes) for file creation, file read, and file removal]
[Chart: Small File Performance (Shared Directory; 32 clients, 1024 mount points) – same axes and series]
Files were created with actual data (4K, 8K, 16K, 32K, 64K and 128K; stripe count = 1). Small-file performance is bounded by metadata performance, but file size has no measurable impact in this range (a sketch of the setup follows).
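A sketch of the small-file variant of the benchmark (paths and process count are illustrative assumptions): mdtest's -w/-e options write and read real data in each file, and the stripe count is pinned to 1 on the target directory first.

# Pin stripe count to 1, then create/read/remove files with 4KB of data each.
lfs setstripe -c 1 /lustre/smallfile
mpirun -np 1024 mdtest -n 1250 -u -w 4096 -e 4096 -d /lustre/smallfile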
Summary Observations
▶ MDS server resources significantly affect Lustre metadata performance
  • Performance scales well with the number of CPU cores and with CPU speed for the unique-directory access pattern, but the shared-directory access pattern is not CPU bound
  • Baseline results were collected with 16 CPU cores; more tests across core counts are needed
▶ Performance is highly dependent on the metadata access pattern
  • Example: directory creation vs. file creation
  • With actual file sizes (instead of zero-byte files) there is little impact for a small number of OSTs (e.g. up to 40), but testing on a large number of OSTs is needed
Metadata Performance: Future Work
▶ Known issues and optimizations
  • Client-side metadata optimization, especially single-client metadata performance
  • Various performance regressions in Lustre 2.5/2.6 that need to be addressed (e.g. LU-5608)
▶ Areas of future investigation
  • Real-world metadata use scenarios and metadata problems
  • Real-world small-file performance (e.g. life sciences)
  • Impact of OST data structures on real-world metadata performance
  • DNE scalability on very large systems with many MDSs/MDTs and many OSSs/OSTs