John Morrison CCN Division Leader Nicholas C. Metropolis Center for Modeling and Simulation 7th...
John Morrison
CCN Division Leader
Nicholas C. Metropolis Center for Modeling and Simulation
7th Workshop on Distributed Supercomputing
March 4, 2003
ASCI Q
LA-UR-03-0541
The ASCI Q System at Los Alamos
Q is operational for stewardship applications (1st 10T)
Many ASCI applications are experiencing significant performance increases over Blue Mountain.
Linpack performance run of 7.727 TeraOps (more than 75% efficiency)
Initial user response is very positive (with some issues!)
(Users want more cycles…)
Users from the tri-lab community are also using the system
Available to users for Classified ASCI codes since August 2002
Smaller initial system available since April 2002
Los Alamos has run its December 2002 ASCI Milestone calculation on Q
Question 1:
Is your machine living up to the performance expectations? If yes, how? If not, what is the root cause?
Performance Comparison: Q vs. White vs. Blue Mountain
[Figure: SAGE (timing.input). Cycle time (s), 0.0 to 5.0, versus # PEs, 1 to 10000 (logarithmic axis). Series: Blue Mountain, ASCI White, ASCI Q.]
Cycle time: lower is better
Weak scaling of SAGE (problem size per processor is constant), so the ideal cycle time is constant for all PE counts (but parallel overheads add to it)
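A minimal sketch of how a weak-scaling curve like this can be read: with the problem size per PE held constant, efficiency at a given PE count can be taken as the cycle time at the smallest PE count divided by the cycle time at that count. The function name and the cycle times below are hypothetical placeholders, not values read from the SAGE figure.

```python
def weak_scaling_efficiency(cycle_times):
    """Map each PE count to t_base / t_pes, where t_base is the
    cycle time measured at the smallest PE count."""
    base = cycle_times[min(cycle_times)]
    return {pes: base / t for pes, t in sorted(cycle_times.items())}


# Hypothetical cycle times in seconds, keyed by PE count.
example = {1: 0.50, 64: 0.60, 1024: 0.80}
for pes, eff in weak_scaling_efficiency(example).items():
    print(f"{pes:>5} PEs: efficiency {eff:.2f}")
```

An efficiency of 1.0 corresponds to the ideal flat line; values below 1.0 reflect parallel overheads.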
Modeled and Measured Performance
Unique capability for performance prediction developed in the Performance and Architecture Lab (PAL) at Los Alamos
Latest two sets of measurements are consistent (~70% longer than model)
[Figure: SAGE on QB, 1-rail (timing.input). Cycle time (s), 0 to 1.4, versus # PEs, 1 to 10000 (logarithmic axis). Series: Model, 21-Sep, 25-Nov.]
Lower is better!
There is a difference. Why?
Using fewer PEs per Node
Test performance using 1,2,3 and 4 PEs per node
Reduces the number of compute processors available
Performance degradation appears when using all 4 procs in a node!
[Figure: SAGE on QB (timing.input). Cycle time (s), 0 to 1.4, versus # PEs, 1 to 10000 (logarithmic axis). Series: 1, 2, 3, and 4 PEs per node.]
Performance Variability
Lots of noise on the nodes: daemons and kernel activity
This noise was analyzed, quantified, modeled, and included back in the application model
This system activity has structure: it was identified and modeled
Cycle time varies from cycle to cycle
[Figure: SAGE on QB, 3584 PEs (timing.input). Cycle time (s), 0.0 to 3.0, versus cycle number, 100 to 1000. Series: Cyc_sec, Model.]
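A toy illustration of why node-level noise can inflate cycle time in a bulk-synchronous code: each cycle ends only when the slowest PE finishes, so one interrupted PE delays the whole job, and the chance of at least one interruption grows with the PE count. This is not the PAL model; the function name, noise probability, delay length, and compute time are all hypothetical.

```python
import random


def simulate_cycle_times(num_pes, num_cycles, compute_s=0.70,
                         noise_prob=0.001, noise_delay_s=0.2, seed=0):
    """Return one cycle time per cycle for a job spanning num_pes PEs.

    Each PE is independently interrupted with probability noise_prob
    per cycle (hypothetical value). A cycle takes compute_s, plus
    noise_delay_s if at least one PE was interrupted.
    """
    rng = random.Random(seed)
    times = []
    for _ in range(num_cycles):
        worst_delay = 0.0
        for _ in range(num_pes):
            if rng.random() < noise_prob:
                # One delayed PE is enough to hold up the whole cycle.
                worst_delay = noise_delay_s
                break
        times.append(compute_s + worst_delay)
    return times


for pes in (4, 256, 3584):
    times = simulate_cycle_times(pes, num_cycles=200)
    mean = sum(times) / len(times)
    print(f"{pes:>5} PEs: mean {mean:.3f} s, min {min(times):.3f} s")
```

In this sketch the minimum cycle time stays at the noise-free value while the mean drifts upward with PE count, which is the qualitative pattern the slides describe.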
Performance Variability (2)
Histogram of cycle-time over 1000 cycles
Minimum cycle-time is very close to model! (0.75 vs 0.70)
[Figure: SAGE on QB, 3584 PEs (timing.input). Histogram of cycle time: frequency, 0 to 120, versus histogram bins (s), 0.7 to 2.0.]
Performance is variable (some cycles are not affected!)
Modeled and Experimental Data
The model is a close approximation of the experimental data
The primary bottleneck is the noise generated by the compute nodes (Tru64)
[Figure: Barrier, 1 ms Granularity, Modelled and Experimental Data. Latency (ms), 1 to 8, versus nodes, 0 to 1000. Series: experiment; model; without 0; without 1; without 31; without 0, 1 and 31; without background noise.]
Lower is better
Performance after System Optimization
[Figure: SAGE on QB (timing.input). Cycle time (s), 0 to 1.4, versus # PEs, 0 to 5000. Series: Sept 21st, Nov 25th, Jan 27th (Average), Jan 27th (Min), Model.]
After system mods (kernel, daemons, and Quadrics RMS): right on target!
After these optimizations, Q will deliver the performance that it’s supposed to.
Modeling works!
Resources
Performance and Architecture Lab (PAL) in CCS-3
Work by Petrini, Kerbyson, Hoisie
Publications on this work and other architecture and performance topics at
www.c3.lanl.gov/par_arch
Plan of Attack
Find low hanging fruit (common problems with high payback) to attack first
1. Kill unnecessary daemons
2. Look at members 1 and 2 for CFS-related activities
3. Member 31 noise!!
Kill Daemons
HP SC engineering is checking that there are no operational problems with permanently switching them off.
Daemon     Status
envmond    /sbin/init.d/envmon stop
insightd   /sbin/init.d/insightd stop
snmpd      /sbin/init.d/snmpd stop
advfsd     Not running at LANL
smsd       Not running at LANL
lat        Already off
lpd        /sbin/init.d/lpd stop
xlogin     Not running at LANL
niff       /sbin/init.d/niffd stop
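The stop commands listed on this slide could be collected into a small dry-run helper, sketched below. This is only an illustration: the commands are copied from the slide (a leading slash is assumed for the insightd entry, matching the others), the helper name is hypothetical, and nothing is executed unless `dry_run=False` is passed on a system where these init scripts actually exist.

```python
import shlex
import subprocess

# Daemons the slide lists with an explicit stop command.
STOP_COMMANDS = {
    "envmond": "/sbin/init.d/envmon stop",
    "insightd": "/sbin/init.d/insightd stop",  # leading slash assumed
    "snmpd": "/sbin/init.d/snmpd stop",
    "lpd": "/sbin/init.d/lpd stop",
    "niff": "/sbin/init.d/niffd stop",
}


def stop_daemons(commands=STOP_COMMANDS, dry_run=True):
    """Print (dry run) or execute each stop command; return the argv lists."""
    planned = []
    for name, command in commands.items():
        argv = shlex.split(command)
        planned.append(argv)
        if dry_run:
            print(f"[dry run] {name}: {command}")
        else:
            # Raises CalledProcessError if a stop script fails.
            subprocess.run(argv, check=True)
    return planned


stop_daemons()
```

Keeping the default as a dry run reflects the slide's caution that permanently switching daemons off was still being checked for operational problems.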
Summary on Performance
Performance of Q machine is meeting and exceeding performance expectations
Performance modeling is an integral part of Q machine system deployment
Performance testing done at each major contractual milestone
FS-QB used in the unclassified environment for performance variability testing.
Approach is to systematically evaluate and implement recommendations of performance variability testing
Question 2: What is the MTBI? What are the topmost reasons for interrupts? What is the average utilization rate?
Machine Q Interrupts and Overall MTBI
ASCI QA Categorized Failures (Unscheduled Interrupts) per Month, Data Thru: 02/22/2003

Month                     Hardware  Other  Total
May                             21      4     25
June                            60     14     74
July                           102     50    152
August                          92     52    144
September                       75     25    100
October                        108     34    142
November                        70     23     93
December                        99     17    116
January                         84     23    107
February (month to date)        67     33    100

ASCI QA System MTBI per Month, Data Thru: 02/22/2003

Month                     MTBI (hours)
May                               29.8
June                               9.7
July                               4.9
August                             5.2
September                          7.2
October                            5.3
November                           7.7
December                           6.4
January                            7
February (month to date)           5.3
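The slides do not state how MTBI is calculated. As a sketch, assuming it is the wall-clock hours in the period divided by the total number of unscheduled interrupts, the monthly totals from the failures chart reproduce the charted MTBI values to within 0.1 hour. The function name is hypothetical, and the years are inferred from the "Data Thru: 02/22/2003" label.

```python
import calendar

# (year, month, total unscheduled interrupts, days covered or None for full month)
MONTHLY_TOTALS = [
    (2002, 5, 25, None), (2002, 6, 74, None), (2002, 7, 152, None),
    (2002, 8, 144, None), (2002, 9, 100, None), (2002, 10, 142, None),
    (2002, 11, 93, None), (2002, 12, 116, None), (2003, 1, 107, None),
    (2003, 2, 100, 22),  # month to date, data through 02/22/2003
]


def mtbi_hours(year, month, interrupts, days=None):
    """Mean time between interrupts, assuming hours in period / interrupts."""
    if days is None:
        days = calendar.monthrange(year, month)[1]
    return days * 24 / interrupts


for year, month, interrupts, days in MONTHLY_TOTALS:
    mtbi = mtbi_hours(year, month, interrupts, days)
    print(f"{calendar.month_name[month]:<10} {mtbi:5.1f} h")
```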
Topmost Reasons for HW interrupts
Detailed Scheduled and Unscheduled Categorized Hardware Interrupts
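Component                   July  August  September  October  November  December  January
1 GBit Ethernet Card        1, 1, 2 (months for these entries not recoverable)
CPU                           70      23         33       65        48        62       61
Memory DIMM                   15      47         32       34        19        28       20
PCI Fibre Channel Adaptor   2, 1, 4, 1, 1 (months for these entries not recoverable)
PCI GBIT Ethernet Board     3 (month for this entry not recoverable)
System Board                2, 2, 1, 1 (months for these entries not recoverable)
…
Total                         91      75         69      103        67        91       83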
Interrupts for CPUs
QA Number of CPU Failures (Unscheduled Interrupts) Per Week, Data Thru: 02/22/2003

Week             Failures
December Wk3           13
December Wk4           17
January Wk1            15
January Wk2            10
January Wk3            13
January Wk4            15
January Wk5            11
February Wk1           13
February Wk2           12
February Wk3           10
Scientific Investigation of Cosmic Rays Impact on CPU Failures
• L2 Btag memory parity checked but not corrected
  At altitude at Los Alamos the number of neutrons is about 6-10 times higher than at sea level
  With large number of ES45 systems and altitude we could be finding neutron induced CPU failures due to single bit soft errors
• Neutron Monitors installed with Q to measure neutron flux
• LANSCE Beam line testing of different memories
  Two classes of programs used
  Some discrepancies between results, trying to figure out
  Only testing for neutron impact, other particles being evaluated
• Statistical analysis for predicted error rates
• Attempting to map beam line test output to predicted # of CPU failures on Q based on neutron flux at SCC
Scientific Investigation of Cosmic Rays Impact on CPU Failures - continued
• Initial results seem to indicate that the system is being impacted by neutrons hitting the L2 btag memory
• Mapping of beam line results to predict # of CPU failures is not yet fully understood
• We are managing around this problem from an applications perspective as demonstrated by the recent success of the milestone runs.
Memory Interrupts
QA and QB Unscheduled Memory Dimm Failures per Week for Previous Ten Weeks thru 02/22/2003

Week             QA   QB
December Wk3      7    4
December Wk4      8    3
January Wk1       8    1
January Wk2       5    1
January Wk3       5    5
January Wk4       2    2
January Wk5       4    0
February Wk1      1    0
February Wk2      9    0
February Wk3     16    1
QA YTD 1/1/03 - 2/8/03

Job size (PEs)   # jobs   Proc Hrs   % of Proc Hrs
4                 16426      48266            2.8%
8                   107        486              0%
16                 1377      10782            0.6%
32                  194       7538            0.4%
64                  212      11738            0.7%
128                 462      38953            2.2%
256                 251     123013            7.1%
512                 287     344321           19.8%
1024                379     856987           49.4%
2048                184     292796           16.9%
4096                  0          0              0%

Overall utilization rate for initial few months is between 50-60%
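The percentage labels on the QA chart are not explained on the slide. A short sketch, assuming each label is that job-size bin's share of the total processor hours, reproduces values such as 2.8% for 4-PE jobs and 49.4% for 1024-PE jobs. The function name is hypothetical; the numbers are the QA "Proc Hrs" figures from the chart.

```python
JOB_SIZES = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
QA_PROC_HOURS = [48266, 486, 10782, 7538, 11738, 38953,
                 123013, 344321, 856987, 292796, 0]


def share_of_hours(proc_hours):
    """Return each bin's percentage share of the total processor hours."""
    total = sum(proc_hours)
    return [100.0 * hours / total for hours in proc_hours]


for size, share in zip(JOB_SIZES, share_of_hours(QA_PROC_HOURS)):
    print(f"{size:>5} PEs: {share:5.1f}%")
```

Read this way, the chart shows that jobs of 512 PEs and larger account for most of the processor hours even though small jobs dominate the job count.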
QB Final 11/1/02 - 1/22/03

Job size (PEs)   # jobs   Proc Hrs   % of Proc Hrs
4                  4694      26052            0.5%
8                  4382      75050            1.5%
16                  142       1266              0%
32                   82        684              0%
64                  549      63379            1.3%
128                 925     987403           19.7%
256                 254     150619            3.0%
512                1062     649458           13.0%
1024               1031      3E+06           50.8%
2048                277     239710            4.8%
4096                559     272862            5.4%

Over 4.3 Million processor hours for Science Runs
System Utilization over 85% some weeks
Question 3:
What is the primary complaint, if any, from the users?
Historical Top Issues
Reliability and Availability
Message Passing Interface (MPI)
LSF integration
File systems
Code development tools
Current Top User Issues
October 2002 Q Technical Quarterly Review
Highest Priority
  File system problems
  System Availability & Reliability
  HPSS “store” performance
Note the absence of MPI issues
Current Top User Issues
October, 2002
Medium Priority
Serial jobs (& login sessions) on QA
Q file system performance poor for many small serial files
Formal change control and preventative maintenance on all Q systems
QA viz users need non-purged file system
Current Top User Issues
October, 2002
Lower Priority
LSF configurations on all Q systems
Early-Beta nature of QA versus User count
White-to-LANL (Q & HPSS) connectivity
DFS on Q (for Sandia users)
MPI CRC on Q
Q “devq” 2-login limit
Highest Priority
File system problems
  Loss of all /scratch files (multiple times)
  Local component failures impact entire file system
  Files not always visible (PFS & NFS)
  Slow performance (e.g. simple “ls” command)
System Availability & Reliability
  Whole machine impact
  Long (4-8 hr) reboot time!
  Many hung “services” require reboots
Highest Priority
HPSS “store” performance
  HPSS rates too low for QA capability
    < 50MB/s
    100’s GB (not unusual) require hours to store
  SW & HW upgrades (relief is coming)
    150MB/s Nov. target; 600MB/s Jan. target
    Parallel clients; new HW & 4- & 16-way stripes
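To make the "hours to store" point concrete, the arithmetic can be sketched as follows for the three rates quoted above. The 500 GB data set size is a hypothetical stand-in for "100's GB", the function name is hypothetical, and decimal units (1 GB = 1000 MB) are assumed.

```python
def store_time_hours(size_gb, rate_mb_per_s, mb_per_gb=1000):
    """Hours needed to store size_gb at a sustained rate_mb_per_s."""
    seconds = size_gb * mb_per_gb / rate_mb_per_s
    return seconds / 3600.0


SIZE_GB = 500  # hypothetical data set size
for label, rate in (("current (upper bound)", 50),
                    ("Nov. target", 150),
                    ("Jan. target", 600)):
    hours = store_time_hours(SIZE_GB, rate)
    print(f"{label:<22} {rate:>4} MB/s: {hours:5.2f} h")
```

Since the current rate is quoted as below 50 MB/s, the first figure is a lower bound on the actual store time.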
Totalview & F90 data in modules on Q
  Can’t see F90 data located in modules
  Workaround cumbersome & sometimes even crashes
  Issue is over 1yr old!
Medium Priority
Serial jobs (& login sessions) on QA
  4 PE minimum due to RMS/LSF config
Q file system performance poor for many small serial files
  Many codes write serial files from 1 PE
  Some codes write 1 serial file per PE per dump time
  Some codes write multiple sets of files at each dump time
Formal change control and preventative maintenance on all Q systems
  Machine needs to move to more production-like status
QA viz users need non-purged file system
  Interactive viz requires all files be resident simultaneously
  No special “viz” file systems as on BlueMtn