Transcript of Dilger Lustre HPCS May Workshop
8/6/2019 Dilger Lustre HPCS May Workshop
Andreas Dilger, Senior Staff Engineer, Lustre Group, Sun Microsystems
Lustre HPCS Design Overview
Scalability Workshop ORNL May 2009
Topics
HPCS Goals
HPCS Architectural Improvements
Performance Enhancements
Conclusion
HPC Center of the Future
[Diagram: Shared Storage Network connecting User Data (1000's of OSTs) and Metadata (100's of MDTs) at 10 TB/sec to Capability (500,000 nodes), Capacity 1 (250,000 nodes), Capacity 2 (150,000 nodes), Capacity 3 (50,000 nodes), Test (25,000 nodes), Viz1, Viz2, WAN Access, and an HPSS Archive]
HPCS Goals - Capacity
Filesystem Limits
- 100 PB+ maximum file system size (10 PB)
- 1 trillion (10^12) files per file system (4 billion files)
- > 30k client nodes
Single File/Directory Limits
- 10 billion files per directory (15M)
- 0 to 1 PB file size range (360 TB)
- 1 B to 1 GB I/O request size range
- 100,000 open shared files per process (10,000 files)
- Long file names and 0-length files as data records
HPCS Goals - Performance
Aggregate
- 10,000 metadata operations per second
- 1500 GB/s file-per-process/shared-file I/O (180 GB/s)
- No impact of RAID rebuilds on performance (variable)
Single Client
- 40,000 creates/s, up to 64 kB data (5k/s, 0 kB)
- 30 GB/s full-duplex streaming I/O (2 GB/s)
Miscellaneous
- POSIX I/O API extensions proposed at OpenGroup* (partial)
* High End Computing Extensions Working Group http://www.opengroup.org/platform/hecewg/
HPCS Goals - Reliability
- End-to-end resiliency: T10 DIF equivalent (net only)
- No impact of RAID rebuild on performance (variable)
- Uptime of 99.99% (99%?)
HPCS Architectural Improvements
Use Available ZFS Functionality
End-to-End Data Integrity
RAID Rebuild Performance Impact
Filesystem Integrity Checking
Clustered Metadata
Recovery Improvements
Performance Enhancements
Use Available ZFS Functionality
Capacity
- Single filesystem 100 TB+ (2^64 LUNs × 2^64 bytes)
- Trillions of files in a single file system (2^48 files)
- Dynamic addition of capacity
Reliability and resilience
- Transaction based, copy-on-write
- Internal data redundancy (double parity, 3 copies)
- End-to-end checksum of all data/metadata
- Online integrity verification and reconstruction
Functionality
- Snapshots, filesets, compression, encryption
- Online incremental backup/restore
- Hybrid storage pools (HDD + SSD)
End-to-End Data Integrity
Current Lustre checksumming
- Detects data corruption over the network
- Ext3/4 does not checksum data on disk
ZFS stores data/metadata checksums
- Fast (Fletcher-4 default, or none)
- Strong (SHA-256)
HPCS Integration
- Integrate Lustre and ZFS checksums
- Avoid recomputing the full checksum on data
- Always overlap checksum coverage
- Use a scalable tree hash method
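ZFS's default Fletcher-4 checksum runs four cascading accumulators over the data viewed as 32-bit words. A minimal sketch of the idea (not the actual ZFS implementation, which operates on raw buffers in C with 64-bit accumulators):

```python
def fletcher4(words):
    """Fletcher-4 over a sequence of 32-bit words: four cascading sums.

    Each accumulator folds in the one before it, so the result depends
    on both the values and the order of the words.
    """
    a = b = c = d = 0
    for w in words:
        a += w
        b += a
        c += b
        d += c
    return (a, b, c, d)

# Swapping two words changes the higher-order sums even though a plain
# sum (accumulator `a`) would be identical.
print(fletcher4([1, 2]))  # (3, 4, 5, 6)
print(fletcher4([2, 1]))  # (3, 5, 7, 9)
```

The cascading structure is what gives Fletcher-4 its sensitivity to reordered and multi-bit errors at near-memcpy speed.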
Hash Tree and Multiple Block Sizes
[Diagram: hash tree over a 1024 kB Lustre RPC. Leaf hashes cover data segments at the minimum page size (4 kB); interior hashes combine them through the 8 kB to 512 kB levels, spanning the maximum page size and the ZFS block size, up to a single root hash for the whole RPC]
Hash Tree For Non-contiguous Data
[Diagram: hash tree for non-contiguous data segments 0, 1, 4, 5, 7]

C0 = LH(data segment 0)
C1 = LH(data segment 1)
C4 = LH(data segment 4)
C5 = LH(data segment 5)
C7 = LH(data segment 7)
C01 = IH(C0 + C1)
C45 = IH(C4 + C5)
C457 = IH(C45 + C7)
K = IH(C01 + C457) = ROOT hash

where LH(x) = hash of data x (leaf), IH(x) = hash of data x (interior), and x + y = concatenation of x and y.
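The construction above can be sketched directly: leaf hashes over the segments that are present, interior hashes over concatenated children. A toy version using SHA-256 for both LH and IH (the real leaf/interior hash functions and any domain separation are implementation details not specified here):

```python
import hashlib

def LH(data):
    """Leaf hash of one data segment."""
    return hashlib.sha256(data).digest()

def IH(x, y):
    """Interior hash of two child hashes, concatenated."""
    return hashlib.sha256(x + y).digest()

# A non-contiguous write touching only segments 0, 1, 4, 5 and 7.
seg = {i: bytes([i]) * 16 for i in (0, 1, 4, 5, 7)}

C0, C1, C4, C5, C7 = (LH(seg[i]) for i in (0, 1, 4, 5, 7))
C01 = IH(C0, C1)
C45 = IH(C4, C5)
C457 = IH(C45, C7)
K = IH(C01, C457)   # ROOT hash: covers exactly the segments present
```

Only the segments actually transferred contribute to K, so client and server can agree on a single root hash without ever hashing the holes between segments.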
End-to-End Integrity Client Write
End-to-End Integrity Server Write
HPC Center of the Future
[Diagram: Shared Storage Network as on the earlier slide, now detailing User Data: 1224 96 TB OSTs (3 × 32 TB LUNs each), 36720 4 TB disks in RAID-6 8+2, 115 PB of data]
RAID Failure Rates (115 PB filesystem, 1.5 TB/s)
- 36720 4 TB disks in RAID-6 8+2 LUNs, 4 year Mean Time To Failure
- 1 disk fails every hour on average
- 4 TB disk @ 30 MB/s: 38 hr rebuild
- 30 MB/s is 50%+ of disk bandwidth (seeks); may reduce aggregate throughput by 50%+
- 1 disk failure may cost 750 GB/s aggregate
- 38 hr × 1 disk/hr = 38 disks/OSTs degraded
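The back-of-the-envelope numbers above are easy to reproduce (decimal TB/MB, and a 4-year MTTF assumed uniform across the disk population):

```python
DISKS = 36720
MTTF_HOURS = 4 * 365 * 24              # 4-year mean time to failure
DISK_BYTES = 4e12                      # 4 TB per disk
REBUILD_BPS = 30e6                     # 30 MB/s rebuild rate

failures_per_hour = DISKS / MTTF_HOURS           # about one disk per hour
rebuild_hours = DISK_BYTES / REBUILD_BPS / 3600  # about 37-38 hours each

# With ~1 failure/hour and ~37-38 h rebuilds, roughly this many OSTs
# are degraded at any moment: ~3% of the 1224 OSTs in the example.
degraded_osts = failures_per_hour * rebuild_hours
print(round(failures_per_hour, 2), round(rebuild_hours, 1),
      round(degraded_osts, 1))
```

The steady-state degradation is simply failure rate times rebuild time, which is why the later charts sweep MTTF, stripe width, and rebuild speed as the three levers.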
RAID Failure Rates
[Chart: Failure Rates vs. MTTF (36720 4 TB disks, RAID-6 8+2, 1224 OSTs, 30 MB/s rebuild). X axis: MTTF in years, 0 to 5. Series: disks failed during rebuild; percent OST degraded; days to OST double failure; years to OST triple failure]
[Chart: Failure Rates vs. RAID disks (36720 4 TB disks, RAID-6 N+2, 1224 OSTs, MTTF 4 years). X axis: disks per RAID-6 stripe, 0 to 35. Series: disks failed during rebuild; percent OST degraded; days to OST double failure; years to OST triple failure]
[Chart: Failure Rates vs. Rebuild Speed (36720 4 TB disks, RAID-6 8+2, 1224 OSTs, MTTF 4 years). X axis: rebuild speed in MB/s, 0 to 100. Series: disks failed during rebuild; percent OST degraded; days to OST double failure; years to OST triple failure]
Lustre-level Rebuild Mitigation
Several related problems
- Avoid global impact from degraded RAID
- Avoid load on the rebuilding RAID set
Mitigation
- Avoid degraded OSTs for new files
- Little or no load on the degraded RAID set maximizes rebuild performance
- Minimal global performance impact: 38 of 1224 OSTs degraded = 3% aggregate cost (30 disks, 3 LUNs per OST)
- Degraded OSTs remain available for existing files
ZFS-level RAID-Z Rebuild
RAID-Z/Z2 is not the same as RAID-5/6
- NEVER does read-modify-write
- Supports arbitrary block size/alignment
- RAID layout is stored in the block pointer
ZFS metadata traversal for RAID rebuild
- Good: only rebuilds used storage
RAID-Z Rebuild Improvements
RAID-Z optimized rebuild
- ~3% of storage is metadata
- Scan metadata first, build an ordered list
- Data reads mostly linear
- Bookmark to restart rebuild
- ZFS itself is not tied to RAID-Z
Distributed hot space
- Spread hot-spare rebuild space over all disks
- All disks' bandwidth/IOPS for normal I/O
- All disks' bandwidth/IOPS for rebuilding
HPC Center of the Future
[Diagram: Shared Storage Network as on the earlier slide, now detailing Metadata: 128 MDTs on 16 TB LUNs, 2 PB total, 5120 800 GB SAS/SSD drives in RAID-1]
Clustered Metadata
100s of metadata servers
Distributed inodes
- Files normally local to the parent directory
- Subdirectories often non-local
Split directories
- Split directory: by name hash
- Striped file: by offset
Distributed Operation Recovery
- Cross-directory mkdir, rename, link, unlink
- Strictly ordered distributed updates
- Ensures namespace coherency and recovery
- At worst an inode refcount is too high (inode leaked)
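How a split directory routes an entry to an MDT is not spelled out here; one plausible sketch is hashing the entry name to pick a shard. The hash function and shard count below are illustrative, not Lustre's actual CMD internals:

```python
import zlib

N_MDTS = 128   # e.g. the 128 MDTs from the metadata diagram

def mdt_for_name(name: str, stripe_count: int = N_MDTS) -> int:
    """Pick the MDT shard holding a directory entry by hashing its name."""
    return zlib.crc32(name.encode()) % stripe_count

# Lookups for different names fan out across shards, so a huge directory
# is served by many MDTs instead of bottlenecking on one.
shards = {mdt_for_name(f"file.{i}") for i in range(10_000)}
print(len(shards))   # nearly every shard receives some entries
```

Because the shard is a pure function of the name, any client can route a lookup without consulting a central directory server first.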
Filesystem Integrity Checking
The problem is among the hardest to solve
- 1 trillion files in 100 hours
- 2-4 PB of MDT filesystem metadata
- ~3 million files/sec, 3+ GB/s for one pass
- 3M × stripes checks/sec from MDSes to OSSes
- 860 × stripes random metadata IOPS on OSTs
Need to handle CMD coherency as well
- Link count on files, directories
- Directory parent/child relationship
- Filename to FID to inode mapping
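The required scan rate follows directly from the goal. The 1 kB-per-file figure below is an assumption chosen to match the slide's ~3 GB/s estimate, not a measured number:

```python
FILES = 1e12        # 1 trillion files to verify
WINDOW_H = 100      # complete one check pass in 100 hours

files_per_sec = FILES / (WINDOW_H * 3600)   # ~2.8 million files/sec

# Assuming roughly 1 kB of metadata read per file, one pass sustains
# about 3 GB/s of metadata I/O before any per-stripe OST checks.
bytes_per_sec = files_per_sec * 1e3
print(f"{files_per_sec:,.0f} files/s, {bytes_per_sec / 1e9:.1f} GB/s")
```

Sustaining millions of random metadata operations per second for four days straight is why this is called one of the hardest problems on the list.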
Recovery Improvements
Version Based Recovery
- Independent recovery stream per file
- Isolates the recovery domain to dependent operations
Commit on Share
- Avoid a client getting any dependent state
- Avoid sync for single-client operations
- Avoid sync for independent operations
Imperative Recovery
Server-driven notification of failover
- Server notifies clients when failover completes
- Client replies immediately to the server
- Avoids clients waiting on RPC timeouts
- Avoids the server waiting for dead clients
- Can tell a slow server from a dead one: no waiting for an RPC timeout to start recovery
- Can use external or internal notification
Performance Enhancements
SMP Scalability
Network Request Scheduler
Channel Bonding
SMP Scalability
Future nodes will have 100s of cores
- Need excellent SMP scaling on client and server
- Need to handle NUMA imbalances
Remove contention on servers
- Per-CPU resources (queues, locks)
- Fine-grained locking
- Avoid cross-node memory access
- Bind requests to a specific CPU deterministically (by client NID, object ID, parent directory)
Remove contention on clients
- Parallel copy_{to,from}_user, checksums
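Deterministic CPU binding just means hashing stable request attributes to a partition, so all work for a given client/object pair lands on one core and its queues and locks are never touched from another. A sketch (the hash choice and attribute set are illustrative, not Lustre's):

```python
import zlib

NCPUS = 64   # illustrative server core count

def cpu_for_request(client_nid: str, object_id: int) -> int:
    """Deterministically map a request to one CPU partition.

    Requests for the same client NID and object always hash to the
    same CPU, so that CPU's per-partition queues and locks are only
    ever accessed locally: no cross-core contention or cache bouncing.
    """
    key = f"{client_nid}:{object_id}".encode()
    return zlib.crc32(key) % NCPUS
```

The same trick extends to any stable key, e.g. the parent directory FID for metadata operations.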
Network Request Scheduler
- Much larger working set than a disk elevator
- Higher-level information: client NID, file/offset
- Read and write queue for each object on the server
- Requests sorted in object queues by offset
- Queues serviced round-robin, by operation count (variable)
- Deadline for request service time
- Scheduling input: opcount, offset, fairness, delay
Future enhancements
- Job ID, process rank
- Gang scheduling across servers
- Quality of service: per UID/GID or per cluster min/max bandwidth
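A toy model of the scheduler described above: requests enter per-object queues kept in offset order, and the scheduler drains the queues round-robin with a small per-turn quantum. The data structures and quantum are illustrative, not Lustre's NRS implementation (which also enforces service deadlines):

```python
import heapq

class NRS:
    def __init__(self, quantum=2):
        self.queues = {}        # object id -> offset-ordered heap
        self.quantum = quantum  # requests served per object per turn

    def enqueue(self, obj, offset):
        heapq.heappush(self.queues.setdefault(obj, []), offset)

    def drain(self):
        """Serve queues round-robin; within each object, by offset."""
        order = []
        while self.queues:
            for obj in list(self.queues):
                for _ in range(self.quantum):
                    if self.queues.get(obj):
                        order.append((obj, heapq.heappop(self.queues[obj])))
                if not self.queues.get(obj):
                    self.queues.pop(obj, None)
        return order

nrs = NRS()
for obj, off in [("A", 8), ("A", 0), ("B", 4), ("A", 4), ("B", 0)]:
    nrs.enqueue(obj, off)
# Offsets come out ascending per object, interleaved fairly between objects.
print(nrs.drain())
```

Sorting by offset turns scattered client requests into near-sequential disk I/O, while the round-robin pass keeps one hot object from starving the rest.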
[Diagram: Network Request Scheduler. Requests arrive in a global FIFO queue Q, are sorted into per-object data queues q1 ... qn, and are drained into the work queue WQ]
HPC Center of the Future
[Diagram: Shared Storage Network as on the earlier slide, highlighting the 10 TB/sec WAN Access path]
Channel Bonding
Combine multiple network interfaces
- Increased performance. Shared: balance load across all interfaces
- Improved reliability. Failover: use backup links if the primary is down
- Flexible configuration: network interfaces of different types/speeds
- Peers do not need to share all networks
- Configuration server per network
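The two bonding behaviors named above, shared load and failover, can be shown in a few lines. This is a toy selection policy; LNET's actual multi-rail interface selection is considerably more involved:

```python
def pick_ni(interfaces, seq):
    """Round-robin over the interfaces currently up; skip failed links."""
    up = [ni for ni in interfaces if ni["up"]]
    if not up:
        raise RuntimeError("no network interface available")
    return up[seq % len(up)]["name"]

# Two links share the load; when one fails, traffic moves to the other.
nis = [{"name": "ib0", "up": True}, {"name": "ib1", "up": True}]
print([pick_ni(nis, i) for i in range(4)])   # alternates ib0 / ib1
nis[0]["up"] = False                         # primary link goes down
print([pick_ni(nis, i) for i in range(4)])   # everything on ib1
```

Because selection is per-message rather than per-connection, a recovered link rejoins the rotation with no reconfiguration.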
Conclusion
Lustre can meet the HPCS filesystem goals
- The scalability roadmap is reasonable
- Incremental, orthogonal improvements
HPCS provides impetus to grow Lustre
- Accelerates the development roadmap
- Ensures that Lustre will meet future needs
Questions?
HPCS Filesystem Overview online at: http://wiki.lustre.org/index.php/Learn:Lustre_Publications
THANK YOU
Andreas Dilger, Senior Staff Engineer, Lustre Group, Sun Microsystems
T10-DIF vs. Hash Tree
T10-DIF
- CRC-16 Guard Word: all 1-bit errors; all adjacent 2-bit errors; single 16-bit burst error; 10^-5 bit error rate
- 32-bit Reference Tag: catches misplaced writes and reads (except misplacement by a multiple of 2 TB, where the tag wraps)
Hash Tree
- Fletcher-4 Checksum: all 1-, 2-, 3-, 4-bit errors; all errors affecting 4 or fewer 32-bit words; single 128-bit burst error; 10^-13 bit error rate
- Hash tree also catches: misplaced reads; misplaced writes; phantom writes; bad RAID reconstruction
Metadata Improvements
Metadata Writeback Cache
- Avoids unnecessary server communication
- Operations logged/cached locally
- Performance of a local file system when uncontended
Aggregated distributed operations
- Server updates batched and transferred using bulk protocols (RDMA)
- Reduced network and service overhead
Sub-Tree Locking
- Lock aggregation: a single lock protects a whole subtree
- Reduces lock traffic and server load
Metadata Improvements
Metadata Protocol
Size on MDT (SOM)
> Avoid multiple RPCs for attributes derived from OSTs
> OSTs remain definitive while the file is open
> Compute on close and cache on the MDT
Readdir+
> Aggregation of directory I/O, getattrs, and locking
LNET SMP Server Scaling
[Chart: RPC throughput vs. total client processes for varying numbers of client nodes, comparing a single lock against finer-grained locking]
Communication Improvements
Flat Communications Model
- Stateful client/server connection required for coherence and performance
- Every client connects to every server
- O(n) lock conflict resolution
Hierarchical Communications Model
- Aggregate connections, locking, I/O, metadata ops
- Caching clients
> Lustre system calls
> Aggregate local processes (cores)
> I/O forwarders scale another 32x or more
- Caching proxies
> Lustre to Lustre
> Aggregate whole clusters
> Implicit broadcast: scalable conflict resolution
Architectural Improvements
Scalable Health Network
- Burden of monitoring clients is distributed, not replicated
- Fault-tolerant status reduction/broadcast network
> Servers and LNET routers
- LNET high-priority small message support
> Health network stays responsive
- Prompt, reliable detection
> Time constants in seconds
> Failed servers, clients and routers
> Recovering servers and routers
- Interface with existing RAS infrastructure
> Receive and deliver status notifications
[Diagram: Health Monitoring Network. Clients report through a primary health monitor, with a failover health monitor standing by]
Operations support
Lustre HSM
- Interface from Lustre to hierarchical storage
> Initially HPSS
> SAM/QFS soon afterward
Tiered storage
- Combine HSM support with ZFS's SSD support and a policy manager to provide tiered storage management