NCCS User Forum

11 December 2008

GSFC NCCS

Agenda

Welcome & Introduction: Phil Webster, CISTO Chief

Current System Status: Fred Reitz, Operations Manager

NCCS Compute Capabilities: Dan Duffy, Lead Architect

Questions and Comments: Phil Webster, CISTO Chief

User Services Updates: Bill Ward, User Services Lead

Key Accomplishments

∙ SCU4 added to Discover and currently running in “pioneer” mode

∙ Explore decommissioned and removed

∙ Discover filesystems converted to GPFS 3.2 native mode

Discover Utilization Past Year

[Chart: Discover utilization over the past year. Labels on the chart: 67.1%, 64.4%, 73.3%; 2,446,365 CPU hours; 1,320,683 CPU hours; annotation “SCU3 cores added”.]

Discover Utilization

Discover Queue Expansion Factor

Expansion Factor = (Eligible Time + Run Time) / Run Time

Weighted over all queues for all jobs (Background and Test queues excluded)
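The expansion factor above can be illustrated with a short sketch. The slide does not say how the weighting is done, so this example assumes an aggregate ratio (total eligible plus run time divided by total run time), which amounts to weighting each job's factor by its run time. The `Job` record and the sample jobs are hypothetical.

```python
from dataclasses import dataclass

# Queues the slide says are excluded from the metric.
EXCLUDED_QUEUES = {"background", "test"}


@dataclass
class Job:
    queue: str
    eligible_hours: float  # time spent eligible to run but waiting
    run_hours: float       # time spent running


def expansion_factor(jobs):
    """Aggregate (eligible + run) / run over all counted jobs.

    Assumption: aggregate weighting; the slide does not specify the scheme.
    """
    counted = [j for j in jobs if j.queue not in EXCLUDED_QUEUES]
    total_run = sum(j.run_hours for j in counted)
    if total_run <= 0:
        raise ValueError("no run time to normalize by")
    total_eligible = sum(j.eligible_hours for j in counted)
    return (total_eligible + total_run) / total_run


# Hypothetical sample data, for illustration only.
sample_jobs = [
    Job("general", eligible_hours=2.0, run_hours=4.0),
    Job("debug", eligible_hours=0.5, run_hours=0.5),
    Job("background", eligible_hours=10.0, run_hours=1.0),  # excluded
]
print(f"expansion factor: {expansion_factor(sample_jobs):.2f}")
```

A factor of 1.0 would mean jobs never wait; larger values mean more queue wait relative to run time.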

Discover Availability

[Chart: outage duration in hours, axis from 0 to 25]

September through November availability

∙ 13 outages
  ▶ 9 unscheduled
    ◆ 0 hardware failures
    ◆ 7 software failures
    ◆ 2 extended maintenance windows
  ▶ 4 scheduled

∙ 104.3 hours total downtime
  ▶ 68.3 unscheduled
  ▶ 36.0 scheduled

Longest outages

∙ 11/28-29 – GPFS hang, 21 hrs

∙ 11/12 – Electrical maintenance, Discover reprovisioning, 18 hrs
  ▶ Scheduled outage

∙ 10/1 – SCU4 integration, 11.5 hrs
  ▶ Scheduled outage plus extension

∙ 9/2-3 – Subnet Manager hang, 11.3 hrs

∙ 11/6 – GPFS hang, 10.9 hrs
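The downtime totals above can be turned into an availability percentage with a little arithmetic. This is a rough sketch that assumes availability is measured against all calendar hours from 1 September through 30 November (91 days of 24 hours); the slides do not state the method actually used for the chart.

```python
# Downtime figures from the slide, in hours.
UNSCHEDULED_HOURS = 68.3
SCHEDULED_HOURS = 36.0

# Assumption: the period is September (30) + October (31) + November (30) days.
PERIOD_HOURS = (30 + 31 + 30) * 24


def availability_percent(downtime_hours, period_hours=PERIOD_HOURS):
    """Percentage of the period during which the system was up."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


total_downtime = UNSCHEDULED_HOURS + SCHEDULED_HOURS
print(f"total downtime: {total_downtime:.1f} h of {PERIOD_HOURS} h")
print(f"overall availability: {availability_percent(total_downtime):.2f}%")
print(f"counting unscheduled only: {availability_percent(UNSCHEDULED_HOURS):.2f}%")
```

Under these assumptions the quarter comes out at roughly 95.2% overall, or about 96.9% if only unscheduled outages are counted.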

[Chart labels, outage causes: GPFS hang; Electrical maintenance, Discover reprovisioning; SCU4 integration; Subnet Manager hang; SCU4 integration, switch reconfiguration; Subnet Manager maintenance. Second chart: availability by month (September, October, November) on an axis from 90.0% to 100.0%.]

Current Issues on Discover: GPFS Hangs

∙ Symptom: GPFS hangs resulting from users running nodes out of memory.

∙ Impact: Users cannot log in or use the filesystem. System admins reboot affected nodes.

∙ Status: Implemented additional monitoring and reporting tools.

Current Issues on Discover: Problems with PBS -V Option

∙ Symptom: Jobs with large environments not starting

∙ Impact: Jobs placed on hold by PBS

∙ Status: Consulting with Altair. In the interim, don’t use -V to pass the full environment; instead use -v or define the necessary variables within job scripts.
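As an illustration of the interim advice, the sketch below builds a `qsub` command line that passes only a named list of variables with `-v` rather than exporting the whole environment with `-V`. The helper name, the variable names, and the script name `run_model.csh` are hypothetical; the sketch only constructs the argument list and does not submit anything.

```python
import os


def qsub_command(script, variable_names, env=None):
    """Build a qsub argument list that passes selected variables via -v.

    Naming a variable without a value asks qsub to take its current value
    from the submitting environment, so each name is checked for presence.
    """
    env = os.environ if env is None else env
    missing = [name for name in variable_names if name not in env]
    if missing:
        raise KeyError(f"not set in the environment: {', '.join(missing)}")
    return ["qsub", "-v", ",".join(variable_names), script]


# Hypothetical environment and job script, for illustration only.
example_env = {"EXPID": "run42", "NX": "288", "UNRELATED_BIG_VAR": "x" * 10000}
print(" ".join(qsub_command("run_model.csh", ["EXPID", "NX"], env=example_env)))
```

The design point is simply that the job receives the few variables it needs, so a large login environment never reaches PBS.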

Resolved Issues on Discover: InfiniBand Subnet Manager

∙ Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems with other nodes.

∙ Impact: Job failures due to node removal

∙ Status: Modified several subnet manager configuration parameters on 9/17 based on IBM recommendations. Problem has not recurred.

Resolved Issues on Discover: PBS Hangs

∙ Symptom: PBS server experiencing 3-minute hangs several times per day

∙ Impact: PBS-related commands (qsub, qstat, etc.) hang

∙ Status: Identified periodic use of two communication ports also used for hardware management functions. Implemented work-around on 9/17 to prevent conflicting use of these ports. No further occurrences.

Resolved Issues on Discover: Intermittent NFS Problems

∙ Symptom: Inability to access archive filesystems

∙ Impact: Hung commands and sessions when attempting to access $ARCHIVE

∙ Status: Identified hardware problem with Force10 E600 network switch. Implemented workaround and replaced line card. No further occurrences.

Future Enhancements

∙ Discover Cluster
  ▶ Hardware platform
  ▶ Additional storage

∙ Data Portal
  ▶ Hardware platform

∙ Analysis environment
  ▶ Hardware platform

∙ DMF
  ▶ Hardware platform

Very High Level of What to Expect in FY09

[Timeline chart spanning October through September. Rows are grouped on the slide as “Major Initiatives” and “Other Activities”; the grouping of individual rows is not recoverable from the extracted text.]

∙ Discover SW Stack Upgrade
∙ Cluster Upgrade (Nehalem)
∙ Analysis System
∙ DMF from IRIX to Linux
∙ Data Management Initiative
∙ New Tape Drives
∙ Discover FC and Disk Addition
∙ Additional Discover Disk
∙ Continued Scalability Testing
∙ Delivery of IBM Cell

Adapting the Overall Architecture

∙ Services will have
  ▶ More independent SW stacks
  ▶ Consistent user environment
  ▶ Fast access to the GPFS file systems
  ▶ Large additional disk capacity for longer storage of files within GPFS

∙ This will result in
  ▶ Fewer downtimes
  ▶ Rolling outages (not everything at once)

Conceptual Architecture Diagram

[Diagram. Component labels: Discover (batch): Base, SCU1, SCU2, SCU3, SCU4; Viz; Analysis Nodes (interactive); FY09 Compute Upgrade (Nehalem); Data Portal; Archive (DMF). Connection labels: GPFS I/O Servers, IB, SAN, 10 GbE LAN.]

What is the Analysis Environment?

∙ Initial technical implementation plan
  ▶ Large shared-memory nodes (at least 256 GB)
    ◆ 16-core nodes with 16 GB/core
  ▶ Interactive (not batch); direct logins
  ▶ Fast access to GPFS
  ▶ 10 GbE network connectivity
  ▶ Software stack consistent with Discover
  ▶ Independent of compute stack (coupled only by GPFS)

∙ Additional storage, specific to analysis, for staging from the archive

∙ Visibility and easy access to the archive and data portal (NFS)

Excited about Intel Nehalem

∙ Quick Specs
  ▶ Core i7, 45 nm
  ▶ 731 million transistors per quad-core
  ▶ 2.66 GHz to 2.93 GHz
  ▶ Private L1 cache (32 KB) and L2 (256 KB) per core
  ▶ Shared L3 cache (up to 8 MB) across all the cores
  ▶ 1,066 MHz DDR3 memory (3 channels per processor)

∙ Important Features
  ▶ Intel QuickPath Interconnect
  ▶ Turbo Boost
  ▶ Hyper-Threading

∙ Learn more at:
  ▶ http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
  ▶ http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
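As a back-of-the-envelope check on the memory figures above, the sketch below computes the theoretical peak bandwidth of three DDR3 channels at 1,066 million transfers per second. It assumes a standard 64-bit (8-byte) channel width, which the slide does not state, and it reads the slide's 1,066 MHz figure as the transfer rate. Sustained bandwidth in real applications will be lower.

```python
# Figures taken from the slide.
TRANSFERS_PER_SECOND = 1_066e6  # DDR3 at 1,066 million transfers per second
CHANNELS = 3

# Assumption: each DDR3 channel is 64 bits (8 bytes) wide.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gb_per_s(transfers_per_second=TRANSFERS_PER_SECOND,
                            bytes_per_transfer=BYTES_PER_TRANSFER,
                            channels=CHANNELS):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfers_per_second * bytes_per_transfer * channels / 1e9


print(f"per channel: {peak_bandwidth_gb_per_s(channels=1):.1f} GB/s")
print(f"three channels: {peak_bandwidth_gb_per_s():.1f} GB/s")
```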

Nehalem versus Harpertown

∙ Single-thread improvement (will vary based on application)

∙ Larger cache, with the 8 MB cache shared across all cores

∙ Memory-to-processor bandwidth dramatically increased over Harpertown
  ▶ Initial measurements have shown a 3 to 4x increase in memory-to-processor bandwidth

Issues from Last User Forum: Shared Project Space

∙ Implementation of shared project space on Discover

∙ Status: resolved
  ▶ Available for projects by request
  ▶ Accessible via /share (deprecated usage)
  ▶ Accessible via $SHARE (correct usage)

Issues from Last User Forum: Increase Queue Limits

∙ Increase CPU & time limits in queues

∙ Status: resolved

Queue          Priority  Max CPUs  Max Hours
test           101       2064      12
general_hi     80        512       24
debug          70        32        1
general_long   55        256       24
general        50        256       12
general_small  50        16        12
background     1         256       4
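To show how the limits in the table might be read, here is a small sketch that lists which queues can hold a job of a given size and length, highest priority first. The table values come from the slide; the function name is hypothetical, and the sketch ignores access policy (for example, who may submit to test or general_hi), which the slide does not cover.

```python
# (queue, priority, max CPUs, max hours), copied from the slide's table.
QUEUE_LIMITS = [
    ("test", 101, 2064, 12),
    ("general_hi", 80, 512, 24),
    ("debug", 70, 32, 1),
    ("general_long", 55, 256, 24),
    ("general", 50, 256, 12),
    ("general_small", 50, 16, 12),
    ("background", 1, 256, 4),
]


def queues_that_fit(cpus, hours):
    """Return names of queues whose CPU and time limits cover the request."""
    by_priority = sorted(QUEUE_LIMITS, key=lambda row: row[1], reverse=True)
    return [name for name, _priority, max_cpus, max_hours in by_priority
            if cpus <= max_cpus and hours <= max_hours]


# Example request: 128 CPUs for 18 hours.
print(queues_that_fit(cpus=128, hours=18))
```

An empty list means no queue in the table accommodates the request as stated.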

Issues from Last User Forum: Commands to Access DMF

∙ Implementation of dmget and dmput

∙ Status: test version ready to be enabled on Discover login nodes
  ▶ Reason for delay was that dmgets on non-dm files would hang
  ▶ There may still be stability issues
  ▶ E-mail will be sent soon notifying users of availability

Issues from Last User Forum: Enabling Sentinel Jobs

∙ Running a “sentinel” subjob to watch a main parallel compute “subjob” in a single PBS job

∙ Status: under investigation
  ▶ Requires an NFS mount of data portal file system on Discover gateway nodes
  ▶ Requires some special PBS usage to specify how subjobs will land on nodes

Other Issues: Poor Interactive Response

∙ Slow interactive response on Discover

∙ Status: resolved
  ▶ Router line card replaced
  ▶ Automatic monitoring instituted to promptly detect future problems

Other Issues: Parallel Jobs > ~300-400 CPUs

∙ Some users experiencing problems running > ~300-400 CPUs on Discover

∙ Status: resolved
  ▶ “stacksize unlimited” in .cshrc file needed
  ▶ Intel MPI passes environment, including settings in startup files

Other Issues: Parallel Jobs > 1500 CPUs

∙ Many jobs won’t run at > 1500 CPUs

∙ Status: under investigation
  ▶ Some simple jobs will run
  ▶ NCCS consulting with IBM and Intel to resolve the issue
  ▶ Software upgrades probably required
  ▶ Solution may fix slow Intel MPI startup

Other Issues: Visibility of the Archive

∙ Visibility of the archive from Discover

∙ Current Status
  ▶ Compute/viz nodes don’t have external network connections
  ▶ “Hard” NFS mounts guarantee data integrity, but if there is an NFS hang, the node hangs
  ▶ Login/gateway nodes may use a “soft” NFS mount, but at the risk of data corruption
  ▶ bbftp or scp (to Dirac) preferred over cp when copying data

DMF Transition

∙ Dirac due to be replaced in Q2 CY09
  ▶ Interactive host for GrADS, IDL, MATLAB, etc.
  ▶ Much larger memory
  ▶ GPFS shared with Discover
  ▶ Significant increase in GPFS storage

∙ Impacts to Dirac users:
  ▶ Source code must be recompiled
  ▶ COTS must be relicensed/rehosted

∙ Old Dirac up until migration complete

Help Us Help You

∙ Don’t use “PBS -V” (job hangs with error “too many failed attempts to start”)

∙ Direct stdout, stderr to specific files, or you will fill up the PBS spool directory

∙ Use an interactive batch session instead of an interactive session on a login node

∙ If you suspect your job is crashing nodes, call us before running again
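One way to follow the stdout/stderr advice, assuming the job's driver is written in Python and that the advice means redirecting output from inside the job (the slide does not say how), is to send each command's output streams to named files. The helper and file names below are hypothetical.

```python
import subprocess
import sys


def run_logged(command, stdout_path, stderr_path):
    """Run a command with stdout and stderr written to the given files.

    Returns the command's exit status so the caller can decide what to do.
    """
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        completed = subprocess.run(command, stdout=out, stderr=err, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a real model executable; file names are illustrative.
    status = run_logged(
        [sys.executable, "-c", "print('model step done')"],
        "model.out",
        "model.err",
    )
    print(f"exit status: {status}")
```

In practice the log files would live in the user's own project or scratch space so that large outputs land there directly.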

Help Us Help You (continued)

∙ Try to be specific when reporting problems, for example:
  ▶ If the archive is broken, specify symptoms
  ▶ If files are inaccessible or can’t be recalled, please send us the file names

Plans

∙ Implement a better scheduling policy

∙ Implement integrated job performance monitoring

∙ Implement better job metrics reporting

∙ Or…

Feedback

∙ Now – Voice your …
  ▶ Praises?
  ▶ Complaints?
  ▶ Suggestions?

∙ Later – NCCS Support
  ▶ [email protected]
  ▶ (301) 286-9120

∙ Later – USG Lead (me!)
  ▶ [email protected]
  ▶ (301) 286-2954

Open Discussion

Questions

Comments