NCCS User Forum
11 December 2008
Agenda
Welcome & Introduction – Phil Webster, CISTO Chief
Current System Status – Fred Reitz, Operations Manager
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Questions and Comments – Phil Webster, CISTO Chief
User Services Updates – Bill Ward, User Services Lead
Key Accomplishments
∙ SCU4 added to Discover and currently running in “pioneer” mode
∙ Explore decommissioned and removed
∙ Discover filesystems converted to GPFS 3.2 native mode
Discover Utilization Past Year
[Chart: Discover utilization over the past year. Utilization points shown: 67.1%, 64.4%, 73.3%; CPU hours shown: 1,320,683 and 2,446,365; an annotation marks when the SCU3 cores were added.]
Discover Utilization
[Chart]
Discover Queue Expansion Factor
Expansion Factor = (Eligible Time + Run Time) / Run Time, weighted over all jobs in all queues (Background and Test queues excluded).
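For example (hypothetical numbers): a job that waits 6 hours in the eligible state and then runs for 2 hours has an expansion factor of (6 + 2) / 2 = 4; a factor of 1.0 means no time spent waiting in the queue.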
Discover Availability
[Charts: outage duration in hours (0–25) per event, and monthly availability (90%–100%), for September, October, and November. Outage bars, in order: GPFS hang; Electrical maintenance, Discover reprovisioning; SCU4 integration; Subnet Manager hang; GPFS hang; GPFS hang; GPFS hang; SCU4 integration, Switch reconfiguration; Subnet Manager hang; Subnet Manager maint.; GPFS hang.]
September through November availability
∙ 13 outages
▶ 9 unscheduled
◆ 0 hardware failures
◆ 7 software failures
◆ 2 extended maintenance windows
▶ 4 scheduled
∙ 104.3 hours total downtime
▶ 68.3 unscheduled
▶ 36.0 scheduled
Longest outages
∙ 11/28-29 – GPFS hang, 21 hrs
∙ 11/12 – Electrical maintenance, Discover reprovisioning, 18 hrs
▶ Scheduled outage
∙ 10/1 – SCU4 integration, 11.5 hrs
▶ Scheduled outage plus extension
∙ 9/2-3 – Subnet Manager hang, 11.3 hrs
∙ 11/6 – GPFS hang, 10.9 hrs
Current Issues on Discover: GPFS Hangs
∙ Symptom: GPFS hangs resulting from users running nodes out of memory.
∙ Impact: Users cannot log in or use the filesystem. System admins must reboot the affected nodes.
∙ Status: Implemented additional monitoring and reporting tools.
Current Issues on Discover: Problems with PBS –V Option
∙ Symptom: Jobs with large environments do not start.
∙ Impact: Jobs are placed on hold by PBS.
∙ Status: Consulting with Altair. In the interim, don’t use –V to pass the full environment; instead, use –v or define the necessary variables within job scripts (see the sketch below).
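For example, a minimal sketch of a job script that avoids –V (the variable names, path, and resource request are illustrative, not NCCS defaults):

    #!/bin/csh
    #PBS -l select=4:ncpus=4
    #PBS -l walltime=2:00:00
    #PBS -v EXPERIMENT,RUNDIR   # pass only the named variables, not the full environment
    # ...or define what the job needs directly in the script:
    setenv EXPERIMENT test01
    setenv RUNDIR /discover/nobackup/$USER/test01
    cd $RUNDIR
    mpirun -np 16 ./model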
Resolved Issues on Discover: InfiniBand Subnet Manager
∙ Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems on other nodes.
∙ Impact: Job failures due to node removal.
∙ Status: Modified several subnet manager configuration parameters on 9/17 based on IBM recommendations. The problem has not recurred.
Resolved Issues on Discover: PBS Hangs
∙ Symptom: PBS server experiencing 3-minute hangs several times per day.
∙ Impact: PBS-related commands (qsub, qstat, etc.) hang.
∙ Status: Identified periodic use of two communication ports that are also used for hardware management functions. Implemented a workaround on 9/17 to prevent conflicting use of these ports. No further occurrences.
Resolved Issues on Discover: Intermittent NFS Problems
∙ Symptom: Inability to access archive filesystems.
∙ Impact: Hung commands and sessions when attempting to access $ARCHIVE.
∙ Status: Identified a hardware problem with the Force10 E600 network switch. Implemented a workaround and replaced the line card. No further occurrences.
Future Enhancements
∙ Discover Cluster
▶ Hardware platform
▶ Additional storage
∙ Data Portal
▶ Hardware platform
∙ Analysis environment
▶ Hardware platform
∙ DMF
▶ Hardware platform
What to Expect in FY09 (Very High Level)
[Timeline chart, Oct 2008 – Sep 2009]
Major Initiatives:
∙ Discover SW Stack Upgrade
∙ Cluster Upgrade (Nehalem)
∙ Analysis System
∙ DMF from IRIX to Linux
∙ Data Management Initiative
∙ New Tape Drives
Other Activities:
∙ Discover FC and Disk Addition
∙ Additional Discover Disk
∙ Continued Scalability Testing
∙ Delivery of IBM Cell
Adapting the Overall Architecture
∙ Services will have
▶ More independent SW stacks
▶ Consistent user environment
▶ Fast access to the GPFS file systems
▶ Large additional disk capacity for longer storage of files within GPFS
∙ This will result in
▶ Fewer downtimes
▶ Rolling outages (not everything at once)
Conceptual Architecture Diagram
[Diagram: Discover (batch: Base, SCU1–SCU4, plus Viz), the FY09 Compute Upgrade (Nehalem), the interactive Analysis Nodes, the Data Portal, and the DMF Archive each connect through their own GPFS I/O Servers over InfiniBand (IB) to SAN storage, tied together by a 10 GbE LAN.]
What is the Analysis Environment?
∙ Initial technical implementation plan
▶ Large shared-memory nodes (at least 256 GB)
◆ 16-core nodes with 16 GB/core
▶ Interactive (not batch); direct logins
▶ Fast access to GPFS
▶ 10 GbE network connectivity
▶ Software stack consistent with Discover
▶ Independent of the compute stack (coupled only by GPFS)
∙ Additional storage for staging data from the archive, dedicated to analysis
∙ Visibility and easy access to the archive and data portal (NFS)
Excited about Intel Nehalem
∙ Quick Specs
▶ Core i7 – 45 nm
▶ 731 million transistors per quad-core
▶ 2.66 GHz to 2.93 GHz
▶ Private L1 (32 KB) and L2 (256 KB) caches per core
▶ Shared L3 cache (up to 8 MB) across all cores
▶ 1,066 MHz DDR3 memory (3 channels per socket)
∙ Important Features
▶ Intel QuickPath Interconnect
▶ Turbo Boost
▶ Hyper-Threading
∙ Learn more at:
▶ http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
▶ http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
Nehalem versus Harpertown
∙ Single-thread improvement (will vary based on application)
∙ Larger cache, with the 8 MB L3 shared across all cores
∙ Memory-to-processor bandwidth dramatically increased over Harpertown
▶ Initial measurements have shown a 3–4x memory-to-processor bandwidth increase
Issues from Last User Forum: Shared Project Space
∙ Implementation of shared project space on Discover
∙ Status: resolved (see the example below)
▶ Available for projects by request
▶ Accessible via /share (deprecated usage)
▶ Accessible via $SHARE (correct usage)
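For example (the project name is hypothetical):

    cd $SHARE/myproject               # preferred: reference the space via $SHARE
    cp results.nc $SHARE/myproject/   # avoid hard-coding the /share path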
Issues from Last User Forum: Increase Queue Limits
∙ Increase CPU & time limits in queues
∙ Status: resolved

Queue          Priority   Max CPUs   Max Hours
test              101       2064        12
general_hi         80        512        24
debug              70         32         1
general_long       55        256        24
general            50        256        12
general_small      50         16        12
background          1        256         4
Issues from Last User Forum: Commands to Access DMF
∙ Implementation of dmget and dmput
∙ Status: test version ready to be enabled on Discover login nodes
▶ Reason for delay: dmget on non-DMF-managed files would hang
▶ There may still be stability issues
▶ E-mail will be sent soon notifying users of availability
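Once enabled, usage should follow standard DMF commands (a sketch; the file name is illustrative, and -r assumes DMF's standard behavior of freeing the online copy after migration):

    dmget bigrun_output.nc      # recall a migrated file from tape to disk
    dmput -r bigrun_output.nc   # migrate a file and release its disk blocks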
Issues from Last User Forum: Enabling Sentinel Jobs
∙ Running a “sentinel” subjob to watch a main parallel compute subjob within a single PBS job (see the sketch below)
∙ Status: under investigation
▶ Requires an NFS mount of the data portal file system on Discover gateway nodes
▶ Requires some special PBS usage to specify how subjobs will land on nodes
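As a heavily simplified sketch of the idea (sentinel_watch.csh, the resource request, and the core counts are all hypothetical; the actual placement mechanism is what is under investigation):

    #!/bin/csh
    #PBS -l select=17:ncpus=4
    # Run the sentinel in the background on the head node...
    ./sentinel_watch.csh >& sentinel.log &
    # ...while the main parallel compute subjob uses the remaining cores.
    mpirun -np 64 ./model
    wait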
Other Issues: Poor Interactive Response
∙ Slow interactive response on Discover
∙ Status: resolved
▶ Router line card replaced
▶ Automatic monitoring instituted to promptly detect future problems
Other Issues: Parallel Jobs > ~300-400 CPUs
∙ Some users experiencing problems running on more than ~300-400 CPUs on Discover
∙ Status: resolved (see the line below)
▶ “stacksize unlimited” needed in the .cshrc file
▶ Intel MPI passes the environment, including settings in startup files
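The fix, in ~/.cshrc (standard csh syntax):

    limit stacksize unlimited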
Other Issues: Parallel Jobs > 1500 CPUs
∙ Many jobs won’t run on more than 1500 CPUs
∙ Status: under investigation
▶ Some simple jobs will run
▶ NCCS is consulting with IBM and Intel to resolve the issue
▶ Software upgrades probably required
▶ The solution may also fix slow Intel MPI startup
Other Issues: Visibility of the Archive
∙ Visibility of the archive from Discover
∙ Current status (see the copy example below)
▶ Compute/viz nodes don’t have external network connections
▶ “Hard” NFS mounts guarantee data integrity, but if there is an NFS hang, the node hangs
▶ Login/gateway nodes may use a “soft” NFS mount, but with a risk of data corruption
▶ bbftp or scp (to Dirac) is preferred over cp when copying data
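For example, with scp (the file and target directory are illustrative):

    # push the file to the archive host rather than cp across the NFS mount
    scp bigrun_output.nc dirac:staging/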
DMF Transition
∙ Dirac due to be replaced in Q2 CY09
▶ Interactive host for GrADS, IDL, Matlab, etc.
▶ Much larger memory
▶ GPFS shared with Discover
▶ Significant increase in GPFS storage
∙ Impacts to Dirac users:
▶ Source code must be recompiled
▶ COTS software must be relicensed/rehosted
∙ The old Dirac will remain up until the migration is complete
Help Us Help You
∙ Don’t use “PBS –V” (jobs hang with the error “too many failed attempts to start”)
∙ Direct stdout and stderr to specific files, or you will fill up the PBS spool directory (see the example below)
∙ Use an interactive batch session instead of an interactive session on a login node
∙ If you suspect your job is crashing nodes, call us before running again
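For example (standard PBS options; the file names and resource request are illustrative):

    # In the job script: capture stdout and stderr in named files
    #PBS -o run01.out
    #PBS -e run01.err

    # At the command line: request an interactive batch session
    qsub -I -l select=1:ncpus=4 -l walltime=1:00:00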
Help Us Help You (continued)
∙ Try to be specific when reporting problems, for example:
▶ If the archive is broken, describe the symptoms
▶ If files are inaccessible or can’t be recalled, please send us the file names
Plans
∙ Implement a better scheduling policy
∙ Implement integrated job performance monitoring
∙ Implement better job metrics reporting
∙ Or…
Feedback
∙ Now – Voice your…
▶ Praises?
▶ Complaints?
▶ Suggestions?
∙ Later – NCCS Support
▶ [email protected]
▶ (301) 286-9120
∙ Later – USG Lead (me!)
▶ [email protected]
▶ (301) 286-2954
Open Discussion
Questions and Comments