Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
UNCLASSIFIED
"There and back again"
The Battle of Lustre at LANL
Susan Coulter, Kyle Lamb, Mike Mason
April 13-15, 2015
Slide 1
Overview
Background
Current Lustre status at LANL
Future developments at LANL
Fine Grain Routing
FGR at LANL
Slide 2
Background of LANL and Lustre
LANL was primarily Panasas
Lustre came about with Cielo
Lustre taking presence as FS of choice for future systems
A few changes in order to adjust to Lustre vs. Panasas
– Purging
– User load balancing
– Monitoring
Slide 3
Cielo Lustre Deployment
Lustre 1.8 *if it ain’t broke don’t fix it
3 file systems (2PB, 2PB, and 4PB)
Aggregate of 160GB/s across all 3
Fat tree topology for IB
Only connected to Cielo
Slide 4
Turquoise Lustre File Systems
L1
– DDN SF12K system
– Lustre Version 2.5.19
– DDN OS and Stack (will likely change to TOSS early next year)
– 3PB in aggregate
– 35GB/s with direct IB connectivity
Slide 5
Turquoise Lustre File Systems
L2
– DDN SF12K
– TOSS OS
– Lustre Version 2.5.3
– ZFS on OSTs, LDISKFS on MDT
– 1PB aggregate
– Small deployment that we plan to increase over time
– Current compression ratio of 1.5
Slide 6
The Future of Lustre at LANL
Open
– Current
• Lustre 4.3 PB
• Other 2 PB
– End of this year
• Lustre 15.1 PB
• Other 2 PB
Secure
– Current
• Lustre 8.3 PB
• Other 3.7 PB
– End of this year
• Lustre 106.3 PB
• Other 3.7 PB
Almost a 10x increase
Slide 7
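The "almost a 10x increase" figure can be checked from the Lustre capacities on this slide. The following is a minimal sketch, assuming the factor refers to combined Open and Secure Lustre capacity (the slide does not say which totals it compares); the variable and function names are illustrative only.

```python
# Lustre capacities in PB, copied from the slide.
lustre_now = {"open": 4.3, "secure": 8.3}
lustre_end_of_year = {"open": 15.1, "secure": 106.3}


def growth_factor(before, after):
    """Ratio of total capacity after the build-out to total capacity before."""
    return sum(after.values()) / sum(before.values())


total_now = sum(lustre_now.values())
total_later = sum(lustre_end_of_year.values())
factor = growth_factor(lustre_now, lustre_end_of_year)

# Assumption: "almost 10x" compares combined Open + Secure Lustre totals.
print(f"{total_now:.1f} PB -> {total_later:.1f} PB, {factor:.1f}x")
```

Under that assumption the combined Lustre capacity grows from 12.6 PB to 121.4 PB, a factor of about 9.6.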
LANL Open Infrastructure
Current Open Infrastructure
– L1 [3.5PB]
– L2 [836TB]
• One rack
• At least double by FY16
– /scratch (Panasas) [740TB]
– /scratch3 (Panasas) [1.1PB]
• Removal by early 2016
L3 RFP
– Minimum 5 PB (likely ~10PB)
– 80-100GB/s @ 70% capacity
• Sized for our FY16 Open System
– Few Mandatory Requirements, more Target Requirements
• Allows us to see what vendors have to offer
• Gives us flexibility with our purchasing department
– Target date Dec 2015
Slide 8
Open Lustre Infrastructure
Diagram labels: Lustre IB; L1-master, L1 servers, L1 DDN Disks; L2-master, L2 servers, L2 DDN Disks; L3-master, L3 servers, L3 JBODS; LNET routers L1-lnet00 … L1-lnet31 and L3-lnet00 ???; I/O nodes wf-io01 … wf-io24; Wolf Cluster; Ethernet BB Cluster; New Ethernet BB Cluster; New IB Cluster; TurqBE; CM-Box
Slide 9
LANL Secure Infrastructure
Current Red Infrastructure
– Cielo
• /lscratch2,3,4 [2.1, 4.1, 2.1 PB]
– /scratch6 (Panasas) [410TB]
– /scratch8 (Panasas) [1.7PB]
– /scratch9 (Panasas) [1.6PB]
• Removal by early 2016
L3 RFP – Piggyback
– 2 new Red systems
– Consolidate FS types
– Target date Dec 2015
• Build (test/debug) 3 FSes in 5 months
• 3 admins + 4 others
Slide 10
ACES Supercomputer: Trinity
Partial HPC system and complete file systems
– Summer 2015
– 2 PB memory
– Burst buffer
• 3.7 PB @ 3.3TB/s
– LNET routers: 222
Cray Hardware administration
LANL/SNL Software administration
Slide 11
Trinity File System: Sonexion
Two Cray Sonexion 2000 File Systems
– 39PB each (78PB total usable)
– 1.33 TB/s (80% memory in 20 minutes)
– 19 Racks per file system (38 total)
– 108 SSU per file system (216 total)
• 216 OSSs with 1 OST each
– 6 TB drives; GridRAID 41 drives
– Lustre 2.7
– DNE phase 2
• 5 MDS
Slide 12
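The 1.33 TB/s figure follows from the sizing rule quoted on the Sonexion slide (80% of memory in 20 minutes) and the 2 PB of memory quoted for Trinity. The sketch below is a worked check, assuming decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes); the function name is illustrative and not from the slides.

```python
PB = 10**15  # assumption: decimal petabyte
TB = 10**12  # assumption: decimal terabyte


def required_bandwidth(memory_bytes, fraction, seconds):
    """Bytes per second needed to write `fraction` of memory in `seconds`."""
    return memory_bytes * fraction / seconds


# 2 PB of memory, 80% of it, written within 20 minutes.
bandwidth = required_bandwidth(2 * PB, 0.80, 20 * 60)
print(f"{bandwidth / TB:.2f} TB/s")
```

This prints 1.33 TB/s, matching the slide: 1.6 PB written in 1200 seconds.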
Trinity Test Environment: Trinitite and Gadget
Application Regression Test: Trinitite
– 2 racks, 200 nodes
– 38TB Burst Buffer @ 34GB/s
– 3 LNET Routers
– 720TB Sonexion 2000 @ 15GB/s
System Development Test: Gadget
– 1 rack, 40 nodes
– 13TB Burst Buffer @ 11GB/s
– 2 LNET Routers
– 360TB Sonexion 2000 @ 7.5GB/s
Posters
– Early Performance and Scaling of Sonexion 2000
– Bottom-up Performance Estimation for a Cray Sonexion 2000
Slide 13
FGR – Decision
Long term configuration
– 3 Lustre file systems
– 3 IB-connected clusters
Close impedance match between hosts and uplinks for non-FGR option
Hardware requirements
Ease of expansion
Support model
References: I/O Congestion Avoidance via Routing and Object Placement, D. A. Dillow, G. M. Shipman, S. Oral, and Z. Zhang (CUG 2011)
Acknowledgements: Bob Pearson, Dave McMillen (Cray); Steve Valimaki, Oz Rentas (DDN); David Sherrill (LANL)
Slide 14
FGR – Decision / Configuration without FGR
Diagram labels: 4x Spine-A, Spine-B; New Spine (two); New Leafs (two)
Legend: Lustre1 + Wolf; Lustre2 + 2nd IB-connected cluster; Lustre3 + 3rd IB-connected cluster
Slide 15
FGR – Decision / Configuration with FGR
Diagram labels: 8 uplinks (two groups); 2x (eight links)
PROs
– Simpler IB fabric
– Much easier to expand
– Significantly lower hardware costs
– Translatable to secure implementation
CONs
– FGR relatively new
– LANL Lustre experience minimal
– DDN support of FGR tentative
– More complex LNET configuration
Slide 16
FGR – Implementation
Diagram labels: L1 IB Core; LNET networks o2ib0 through o2ib5; Leaf Switch, OSSs, and LNETs (four of each); MDSs (same 2); annotations "(contains entire FS)" and "(contains LNETs & MDSs)"
Slide 17
FGR – Expansion
Diagram labels: tcp0 Ethernet Core (TCP connected); o2ib10 Native IB (wolf); LNET networks o2ib0, o2ib5, o2ib20, o2ib25; LNETs and OSSs (four of each); 2 MDSs (three groups); (same 2) MDSs
Slide 18
FGR – Future
Diagram labels: IB Fabric; L1-master, L1 servers, L1 disks; L2-master, L2 servers, L2 disks; Future L3; Ethernet LNETs; Wolf Cluster, Wolf IO nodes; Future Cluster, IO nodes; Ethernet connected clusters; Turquoise Backend Ethernet
Slide 19
FGR – Lessons
Modprobe configuration file
– Single common file vs custom files
– Failover complexity
• Limitation on number of characters
Knowledge translation from turquoise to red
– Use of FGR informed IB backbone Damselfly design
Diagram labels: Isolation Groups (IG) 1 through 8; Spanning Leaf 1; Spanning Leaf 2
Slide 20
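The modprobe lesson above (route lists that grow with FGR and failover until they hit a character limit) can be illustrated with a small generator. This is a sketch under stated assumptions, not the configuration LANL used: the router addresses, the choice of o2ib0 as the client network, and the `ASSUMED_MAX_CHARS` value are all hypothetical placeholders, and the slides do not state the actual limit. Only the general `options lnet networks=... routes=...` form is standard LNET module syntax.

```python
# Placeholder limit for illustration only; the slides do not give the real value.
ASSUMED_MAX_CHARS = 1024


def lnet_options(local_net, interface, routes):
    """Build one 'options lnet' line for a modprobe configuration file.

    `routes` maps a remote LNET network name to the NID pattern of the
    routers that forward to it; entries are joined with ';'.
    """
    route_str = "; ".join(f"{net} {gw}" for net, gw in sorted(routes.items()))
    return f'options lnet networks="{local_net}({interface})" routes="{route_str}"'


def fits(line, max_chars):
    """True if the generated line stays within the assumed character limit."""
    return len(line) <= max_chars


# Hypothetical layout: a client on o2ib0 reaches four FGR groups
# (o2ib1 through o2ib4), each through its own pair of routers.
fgr_routes = {f"o2ib{n}": f"10.0.{n}.[1-2]@o2ib0" for n in range(1, 5)}
line = lnet_options("o2ib0", "ib0", fgr_routes)

print(line)
print(len(line), fits(line, ASSUMED_MAX_CHARS))
```

Generating the line from one shared table is one way to keep a single common file manageable; checking the length at generation time surfaces the character limit before the module fails to load.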
Thank You
Slide 21