Lustre User Group 2010 - Aptos, CA
Lustre deployment and early experiences
Florent Parent, Coordinator, Québec site
Outline
• What is CLUMEQ?
• CLUMEQ's new HPC cluster: Colosse
• Lustre experience
What is CLUMEQ?
• Consortium of 11 universities in the province of Québec, Canada
• Part of the Compute Canada national platform
• Two HPC sites:
✓ Montréal
✓ Québec City
Who is CLUMEQ?
CLUMEQ's mission
• To serve the HPC needs of its member institutions in all fields of research
• To reach out to non-traditional and emerging HPC fields
• To train "highly qualified personnel" (HQP)
Compute Canada – Calcul Canada
A proposal to the Canada Foundation for Innovation – National Platforms Fund
Hugh Couchman (McMaster University, SHARCNET), Robert Deupree (Saint Mary's University, ACEnet), Ken Edgecombe (Queen's University, HPCVL), Wagdi Habashi (McGill University, CLUMEQ), Richard Peltier (University of Toronto, SciNet), Jonathan Schaeffer (University of Alberta, WestGrid), David Senechal (Université de Sherbrooke, RQCHP)
Executive Summary
The Compute/Calcul Canada (CC) initiative unites the academic high-performance computing (HPC) organizations in Canada. The seven regional HPC consortia in Canada (ACEnet, CLUMEQ, RQCHP, HPCVL, SciNet, SHARCNET and WestGrid) represent over 50 institutions and over one thousand university faculty members doing computationally-based research. The Compute Canada initiative is a coherent and comprehensive proposal to build a shared distributed HPC infrastructure across Canada to best meet the needs of the research community and enable leading-edge, world-competitive research. This proposal is requesting an investment of 60 M$ from CFI (150 M$ with matching money) to put the necessary infrastructure in place for four of the consortia for the 2007-2010 period. It is also requesting operating funds from Canada's research councils, for all seven consortia. Compute Canada has developed a consensus on national governance, resource planning, and resource sharing models, allowing for effective usage and management of the proposed facilities. Compute Canada represents a major step forward in moving from a regional to a national HPC collaboration. Our vision is the result of extensive consultations with the Canadian research community.
Lustre User Group 2010 - Aptos, CA
Outline
• What is CLUMEQ?
• CLUMEQ's new HPC cluster: Colosse
• Lustre experience
CLUMEQ Colosse
• Sun Constellation system
✓ 10 fully loaded Sun Blade 6048 chassis, with X6275 modules (dual Nehalem-EP blades, 2.8 GHz, 24 GB of RAM)
✓ full-bisection IB QDR interconnect (2x M9 switches)
✓ 1 PB of Lustre storage in a high-availability configuration, using 2 MDS and 9x2 OSS
✓ Sun J4400 storage arrays
• 86 Tflops peak
✓ 77 Tflops max (preliminary run)
✓ → 80 Tflops?
Infiniband Architecture
[Diagram: 40 shelves in 10 racks, 960 two-socket nodes; each shelf of 24 nodes connects through a QNEM IB leaf switch to two M9 648-port core switches. The Lustre OSS HA pairs (4 and 5), the Lustre HA MDS and 11 infrastructure nodes attach through two M2 36-port leaf switches. Uplink counts shown in the diagram: 12, 21 and 24 links.]
• Second floor contains all compute racks + core networking switches
• First floor contains file system & infrastructure nodes
• Racks aligned in a circle around a central hot core; outside ring is a cold aisle
Street view...
Satellite view...
[Diagram: free air cooling system, main cooling system, air blowers, cooling coils, cold air plenum (32 m²), hot air core (25 m²)]
View inside hot air core
Main specifications
• Rack capacity: 56
• Cooling capacity: ~1.5 MW
• Electrical capacity: 1.1 MW (1.6 MW)
• Blowing capacity: 132,500 CFM
• Maximum air velocity: 2.4 m/s
• Floor loading capacity: 940 lb/ft²
Outline
• What is CLUMEQ?
• CLUMEQ's new HPC cluster: Colosse
• Lustre experience
Timeline
• New staff hired in Apr 2009 and July 2009
• Nov 12: System acceptance signed
• Dec 16: First “Beta users” on machine
• Nov - now: Learning, debugging, patching, tuning, helping users
Experience so far...
• Many technologies to get up to speed on: installing, monitoring, debugging, tuning...
✓ Lustre
✓ Infiniband
✓ Grid Engine
• So far, pretty much everything is learned as we go
✓ asking questions to SMEs
✓ reading documentation, mailing list discussions
CLUMEQ Lustre deployment
• At acceptance:
✓ CentOS 5.3 (2.6.18-128.2.1.el5)
✓ OFED 1.4.1, Lustre 1.8.1.1
• Now:
✓ CentOS 5.3 (2.6.18-164.11.1.el5_lustre.1.8.2)
✓ OFED 1.4.1(?) stock from RH
✓ Lustre 1.8.2 + patch
Lustre fixes
• OSS crash during heartbeat/failover
✓ "CPU hog/soft lockups" bug (21612, 19557)
✓ Patched in 1.8.2
• 1.8.2 installed when GA
✓ Started to see high load on MDT, Lustre hanging on clients
✓ inode link count fix (22177)
✓ Installed patched version
Filesystem structure
• Lustre is the only FS on Colosse
✓ /home for user accounts
✓ /rap for group-shared space
✓ /scratch for temporary data
• Lustre striping
✓ /home and /rap use a stripe count of 1 (typically small files)
✓ /scratch uses a stripe count of 72 (parallel I/O performance)
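On Lustre 1.8 a per-directory layout like this is set with `lfs setstripe`; the sketch below uses the stripe counts from the slide, but the invocations are illustrative rather than our recorded commands:

```shell
# Small files: one OST per file is enough (stripe count 1)
lfs setstripe -c 1 /home
lfs setstripe -c 1 /rap

# Large parallel I/O: spread each file across 72 OSTs
lfs setstripe -c 72 /scratch

# New files created under a directory inherit its layout
lfs getstripe /scratch
```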
!"#$%&'()!'(*+,-*+.'!/01234+*52/0'
!
! ! ! !
!
!
!
!
"#$!%$&'!())*!
! +,-.!(/! !
!
012345-&$! ! 6!789!:5#41;3;$.<;!1=!0,9,>,'! ! ())*!
! ?8$&14!@!A1B.4$!C,=1;;.!
!
!
789!019=5>.9$5,D!
'
!"!#$%&'()%*#+',-#.'/01&2#
!
!"#$%&'(')**'+,-./'+01+('
10 4
589 USB
3
2 67
45 12
14
10 13
11
15
TOPFAN
REARPS
SunFire X 4250
OK
10 4
589 USB
3
2 67
45 12
14
10 13
11
15
TOPFAN
REARPS
SunFire X 4250
OK
10 4
589 USB
3
2 67
45 12
14
10 13
11
15
TOPFAN
REARPS
SunFire X 4250
OK
10 4
589 USB
3
2 67
45 12
14
10 13
11
15
TOPFAN
REARPS
SunFire X 4250
OK
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
OSS1OSS2
OSS4OSS3
a b c de f g hi j k l
m n o pq r s tu v w x
ax ay az babb bc bd bebf bg bh bi
y z aa abac ad ae afag ah ai aj
bv bw bx bybz ca cb cccd ce cf cg
bj bk bl bmbn bo bp bqbr bs bt bu
ak al am anao ap aq aras at au av
ch ci cj ckcl cm cn cocp cq cr cs
md12,14 md10
md11md13,15
md30md32,34
md31md33,35
md20md22,24
md21md23,25
md40md42,44
md41md43,45
OSS HA and RAID setup
• OSS HA pairs
✓ 4 OSTs per OSS
✓ Linux heartbeat used to signal failure
✓ 8 OSTs on an OSS in failure mode
• RAID 6
✓ 10x 1 TB disks per OST
✓ Software RAID (md)
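Each OST in this layout is a 10-disk software RAID 6 array formatted as a Lustre OST. A minimal sketch of building one, with device names, filesystem name and MGS node as illustrative assumptions (not taken from the slides):

```shell
# One OST: RAID 6 across 10 x 1 TB disks (8 data + 2 parity)
mdadm --create /dev/md10 --level=6 --raid-devices=10 /dev/sd[b-k]

# Format the array as a Lustre 1.8 OST pointing at the MGS, then mount it
mkfs.lustre --ost --fsname=colosse --mgsnode=mds1@o2ib /dev/md10
mount -t lustre /dev/md10 /mnt/lustre/ost10
```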
HA and heartbeat
• Linux heartbeat/HA
✓ Does not always work; seen issues where a node is not able to “kill” its neighbor (IPMI issues, investigating)
✓ Currently working in manual HA mode
• Observing that it takes a long time for clients to use the new OSS taking over OSTs
✓ Not sure yet why. Needs investigating. todo++
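In manual HA mode, a failover boils down to fencing the failed OSS by hand, then importing its OSTs on the surviving pair member. A hedged sketch with hypothetical hostnames, credentials and device names:

```shell
# 1. Fence: make sure the failed OSS is really off (avoid split-brain)
ipmitool -H oss1-sp -U admin -P secret chassis power off

# 2. On the surviving OSS, assemble the failed node's md arrays
mdadm --assemble /dev/md10 /dev/sd[b-k]

# 3. Mount the OSTs; clients reconnect once recovery completes
mount -t lustre /dev/md10 /mnt/lustre/ost10
```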
Finding a failed disk
• “Interesting” experience
• Device name to physical location mapping is not reliable
✓ need to double-check with md* commands
• Then came “blinkenlights”
✓ http://wikis.sun.com/display/HPCSoftware/JBOD+Troubleshooting+Utilities
✓ REALLY useful!
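A typical hunt for a failed member starts from the md status and ends with a blinking slot LED; the device names below are illustrative:

```shell
# Which array is degraded, and which member failed?
cat /proc/mdstat
mdadm --detail /dev/md10

# Mark the member failed and remove it from the array
mdadm /dev/md10 --fail /dev/sdf --remove /dev/sdf
# ...then blink the matching slot LED with the JBOD troubleshooting
# utilities from the URL above before physically pulling the drive.
```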
Performance
• IOR used for parallel I/O measurements
✓ IOR read performance = 33.6 GB/s
✓ IOR write performance = 17.3 GB/s
✓ all over IB
• Performance monitoring
✓ Is there a BCP out there?
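A run of this shape can be reproduced with a stock IOR binary under MPI; the process count, block and transfer sizes below are illustrative, not the exact benchmark configuration behind the numbers above:

```shell
# File-per-process POSIX run against the 72-way striped /scratch
mpirun -np 512 ./IOR -a POSIX -F \
    -b 4g -t 1m \
    -w -r \
    -o /scratch/ior-test/file
```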
Conclusion
• Learned quite a lot in the past few months
• Lustre support team is key
• Looking to share experience