NCCS User Forum
11 December 2008
Agenda
Welcome & Introduction – Phil Webster, CISTO Chief
Current System Status – Fred Reitz, Operations Manager
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Questions and Comments – Phil Webster, CISTO Chief
User Services Updates – Bill Ward, User Services Lead
Key Accomplishments
∙ SCU4 added to Discover and currently running in “pioneer” mode
∙ Explore decommissioned and removed
∙ Discover filesystems converted to GPFS 3.2 native mode
Discover Utilization Past Year
[Chart: Discover utilization over the past year. Utilization points shown: 67.1%, 64.4%, 73.3%; CPU hours shown: 1,320,683 and 2,446,365; an annotation marks when the SCU3 cores were added.]
Discover Utilization
[Chart]
Discover Queue Expansion Factor
Expansion Factor = (Eligible Time + Run Time) / Run Time, weighted over all jobs in all queues (Background and Test queues excluded).
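For example (hypothetical numbers): a job that waits 6 hours in the eligible state and then runs for 2 hours has an expansion factor of (6 + 2) / 2 = 4; a factor of 1.0 means no time spent waiting in the queue.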
Discover Availability
[Charts: outage duration in hours (0–25) per event, and monthly availability (90%–100%), for September, October, and November. Outage bars, in order: GPFS hang; Electrical maintenance, Discover reprovisioning; SCU4 integration; Subnet Manager hang; GPFS hang; GPFS hang; GPFS hang; SCU4 integration, Switch reconfiguration; Subnet Manager hang; Subnet Manager maint.; GPFS hang.]
September through November availability
∙ 13 outages
▶ 9 unscheduled
◆ 0 hardware failures
◆ 7 software failures
◆ 2 extended maintenance windows
▶ 4 scheduled
∙ 104.3 hours total downtime
▶ 68.3 unscheduled
▶ 36.0 scheduled
Longest outages
∙ 11/28-29 – GPFS hang, 21 hrs
∙ 11/12 – Electrical maintenance, Discover reprovisioning, 18 hrs
▶ Scheduled outage
∙ 10/1 – SCU4 integration, 11.5 hrs
▶ Scheduled outage plus extension
∙ 9/2-3 – Subnet Manager hang, 11.3 hrs
∙ 11/6 – GPFS hang, 10.9 hrs
Current Issues on Discover: GPFS Hangs
∙ Symptom: GPFS hangs resulting from users running nodes out of memory.
∙ Impact: Users cannot log in or use the filesystem. System admins must reboot the affected nodes.
∙ Status: Implemented additional monitoring and reporting tools.
Current Issues on Discover: Problems with PBS –V Option
∙ Symptom: Jobs with large environments do not start.
∙ Impact: Jobs are placed on hold by PBS.
∙ Status: Consulting with Altair. In the interim, don’t use –V to pass the full environment; instead, use –v or define the necessary variables within job scripts (see the sketch below).
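For example, a minimal sketch of a job script that avoids –V (the variable names, path, and resource request are illustrative, not NCCS defaults):

    #!/bin/csh
    #PBS -l select=4:ncpus=4
    #PBS -l walltime=2:00:00
    #PBS -v EXPERIMENT,RUNDIR   # pass only the named variables, not the full environment
    # ...or define what the job needs directly in the script:
    setenv EXPERIMENT test01
    setenv RUNDIR /discover/nobackup/$USER/test01
    cd $RUNDIR
    mpirun -np 16 ./model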
Resolved Issues on Discover: InfiniBand Subnet Manager
∙ Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems on other nodes.
∙ Impact: Job failures due to node removal.
∙ Status: Modified several subnet manager configuration parameters on 9/17 based on IBM recommendations. The problem has not recurred.
Resolved Issues on Discover: PBS Hangs
∙ Symptom: PBS server experiencing 3-minute hangs several times per day.
∙ Impact: PBS-related commands (qsub, qstat, etc.) hang.
∙ Status: Identified periodic use of two communication ports that are also used for hardware management functions. Implemented a workaround on 9/17 to prevent conflicting use of these ports. No further occurrences.
Resolved Issues on Discover: Intermittent NFS Problems
∙ Symptom: Inability to access archive filesystems.
∙ Impact: Hung commands and sessions when attempting to access $ARCHIVE.
∙ Status: Identified a hardware problem with the Force10 E600 network switch. Implemented a workaround and replaced the line card. No further occurrences.
Future Enhancements
∙ Discover Cluster
▶ Hardware platform
▶ Additional storage
∙ Data Portal
▶ Hardware platform
∙ Analysis environment
▶ Hardware platform
∙ DMF
▶ Hardware platform
What to Expect in FY09 (Very High Level)
[Timeline chart, Oct 2008 – Sep 2009]
Major Initiatives:
∙ Discover SW Stack Upgrade
∙ Cluster Upgrade (Nehalem)
∙ Analysis System
∙ DMF from IRIX to Linux
∙ Data Management Initiative
∙ New Tape Drives
Other Activities:
∙ Discover FC and Disk Addition
∙ Additional Discover Disk
∙ Continued Scalability Testing
∙ Delivery of IBM Cell
Adapting the Overall Architecture
∙ Services will have
▶ More independent SW stacks
▶ Consistent user environment
▶ Fast access to the GPFS file systems
▶ Large additional disk capacity for longer storage of files within GPFS
∙ This will result in
▶ Fewer downtimes
▶ Rolling outages (not everything at once)
Conceptual Architecture Diagram
[Diagram: Discover (batch: Base, SCU1–SCU4, plus Viz), the FY09 Compute Upgrade (Nehalem), the interactive Analysis Nodes, the Data Portal, and the DMF Archive each connect through their own GPFS I/O Servers over InfiniBand (IB) to SAN storage, tied together by a 10 GbE LAN.]
What is the Analysis Environment?
∙ Initial technical implementation plan
▶ Large shared-memory nodes (at least 256 GB)
◆ 16-core nodes with 16 GB/core
▶ Interactive (not batch); direct logins
▶ Fast access to GPFS
▶ 10 GbE network connectivity
▶ Software stack consistent with Discover
▶ Independent of the compute stack (coupled only by GPFS)
∙ Additional storage for staging data from the archive, dedicated to analysis
∙ Visibility and easy access to the archive and data portal (NFS)
Excited about Intel Nehalem
∙ Quick Specs
▶ Core i7 – 45 nm
▶ 731 million transistors per quad-core
▶ 2.66 GHz to 2.93 GHz
▶ Private L1 (32 KB) and L2 (256 KB) caches per core
▶ Shared L3 cache (up to 8 MB) across all cores
▶ 1,066 MHz DDR3 memory (3 channels per socket)
∙ Important Features
▶ Intel QuickPath Interconnect
▶ Turbo Boost
▶ Hyper-Threading
∙ Learn more at:
▶ http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
▶ http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
Nehalem versus Harpertown
∙ Single-thread improvement (will vary based on application)
∙ Larger cache, with the 8 MB L3 shared across all cores
∙ Memory-to-processor bandwidth dramatically increased over Harpertown
▶ Initial measurements have shown a 3–4x memory-to-processor bandwidth increase
Issues from Last User Forum: Shared Project Space
∙ Implementation of shared project space on Discover
∙ Status: resolved (see the example below)
▶ Available for projects by request
▶ Accessible via /share (deprecated usage)
▶ Accessible via $SHARE (correct usage)
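For example (the project name is hypothetical):

    cd $SHARE/myproject               # preferred: reference the space via $SHARE
    cp results.nc $SHARE/myproject/   # avoid hard-coding the /share path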
Issues from Last User Forum: Increase Queue Limits
∙ Increase CPU & time limits in queues
∙ Status: resolved

Queue          Priority   Max CPUs   Max Hours
test              101       2064        12
general_hi         80        512        24
debug              70         32         1
general_long       55        256        24
general            50        256        12
general_small      50         16        12
background          1        256         4
Issues from Last User Forum: Commands to Access DMF
∙ Implementation of dmget and dmput
∙ Status: test version ready to be enabled on Discover login nodes
▶ Reason for delay: dmget on non-DMF-managed files would hang
▶ There may still be stability issues
▶ E-mail will be sent soon notifying users of availability
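Once enabled, usage should follow standard DMF commands (a sketch; the file name is illustrative, and -r assumes DMF's standard behavior of freeing the online copy after migration):

    dmget bigrun_output.nc      # recall a migrated file from tape to disk
    dmput -r bigrun_output.nc   # migrate a file and release its disk blocks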
Issues from Last User Forum: Enabling Sentinel Jobs
∙ Running a “sentinel” subjob to watch a main parallel compute subjob within a single PBS job (see the sketch below)
∙ Status: under investigation
▶ Requires an NFS mount of the data portal file system on Discover gateway nodes
▶ Requires some special PBS usage to specify how subjobs will land on nodes
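As a heavily simplified sketch of the idea (sentinel_watch.csh, the resource request, and the core counts are all hypothetical; the actual placement mechanism is what is under investigation):

    #!/bin/csh
    #PBS -l select=17:ncpus=4
    # Run the sentinel in the background on the head node...
    ./sentinel_watch.csh >& sentinel.log &
    # ...while the main parallel compute subjob uses the remaining cores.
    mpirun -np 64 ./model
    wait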
Other Issues: Poor Interactive Response
∙ Slow interactive response on Discover
∙ Status: resolved
▶ Router line card replaced
▶ Automatic monitoring instituted to promptly detect future problems
Other Issues: Parallel Jobs > ~300-400 CPUs
∙ Some users experiencing problems running on more than ~300-400 CPUs on Discover
∙ Status: resolved (see the line below)
▶ “stacksize unlimited” needed in the .cshrc file
▶ Intel MPI passes the environment, including settings in startup files
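The fix, in ~/.cshrc (standard csh syntax):

    limit stacksize unlimited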
Other Issues: Parallel Jobs > 1500 CPUs
∙ Many jobs won’t run on more than 1500 CPUs
∙ Status: under investigation
▶ Some simple jobs will run
▶ NCCS is consulting with IBM and Intel to resolve the issue
▶ Software upgrades probably required
▶ The solution may also fix slow Intel MPI startup
Other Issues: Visibility of the Archive
∙ Visibility of the archive from Discover
∙ Current status (see the copy example below)
▶ Compute/viz nodes don’t have external network connections
▶ “Hard” NFS mounts guarantee data integrity, but if there is an NFS hang, the node hangs
▶ Login/gateway nodes may use a “soft” NFS mount, but with a risk of data corruption
▶ bbftp or scp (to Dirac) is preferred over cp when copying data
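For example, with scp (the file and target directory are illustrative):

    # push the file to the archive host rather than cp across the NFS mount
    scp bigrun_output.nc dirac:staging/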
DMF Transition
∙ Dirac due to be replaced in Q2 CY09
▶ Interactive host for GrADS, IDL, Matlab, etc.
▶ Much larger memory
▶ GPFS shared with Discover
▶ Significant increase in GPFS storage
∙ Impacts to Dirac users:
▶ Source code must be recompiled
▶ COTS software must be relicensed/rehosted
∙ The old Dirac will remain up until the migration is complete
Help Us Help You
∙ Don’t use “PBS –V” (jobs hang with the error “too many failed attempts to start”)
∙ Direct stdout and stderr to specific files, or you will fill up the PBS spool directory (see the example below)
∙ Use an interactive batch session instead of an interactive session on a login node
∙ If you suspect your job is crashing nodes, call us before running again
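For example (standard PBS options; the file names and resource request are illustrative):

    # In the job script: capture stdout and stderr in named files
    #PBS -o run01.out
    #PBS -e run01.err

    # At the command line: request an interactive batch session
    qsub -I -l select=1:ncpus=4 -l walltime=1:00:00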
Help Us Help You (continued)
∙ Try to be specific when reporting problems, for example:
▶ If the archive is broken, describe the symptoms
▶ If files are inaccessible or can’t be recalled, please send us the file names
Plans
∙ Implement a better scheduling policy
∙ Implement integrated job performance monitoring
∙ Implement better job metrics reporting
∙ Or…
Feedback
∙ Now – Voice your…
▶ Praises?
▶ Complaints?
▶ Suggestions?
∙ Later – NCCS Support
▶ [email protected]
▶ (301) 286-9120
∙ Later – USG Lead (me!)
▶ [email protected]
▶ (301) 286-2954
Open Discussion
Questions and Comments