University of California, San Diego
SAN DIEGO SUPERCOMPUTER CENTER
San Diego Supercomputer Center: Best practices, policies
Giri Chukkapalli
Supercomputer best practices symposium, May 11, 2005
Center’s Mission
• Computational science vs computer science research
• Supporting a single code
• Supporting a single field
• Supporting a broad spectrum of fields
• Target existing users or grow new users
• Capacity vs capability computing
• Can’t be everything to everybody
• Mission statement and policy document
User awareness
• Publicize well, to the target user community, the existing and upcoming compute and data capabilities of the center
• This enables the user community to plan the problems they want to solve and develop codes that take advantage of the resources
• Otherwise, only the people who happen to know about a resource will make use of it
More than just a large supercomputer
• To support a broad computational science research community, peripheral hardware, software, and personnel with a wide range of expertise are necessary
• A sizable shared-memory machine for pre- and post-processing
• A large compute farm to run embarrassingly parallel jobs
• Visualization engines
• SAN
Computing: One Size Doesn’t Fit All
[Figure: applications plotted by data capability (increasing I/O and storage) vs compute capability (increasing FLOPS). The SDSC Data Science Environment occupies the high-data region, the traditional HEC environment the high-compute region, and campus, departmental, and desktop computing the low end. Example applications: QCD, protein folding, CPMD, NVO, EOL, CIPRes, SCEC visualization, ENZO (3D + time simulation, out-of-core visualization), CFD, climate, SCEC simulation. The data axis is anchored by data storage/preservation and extreme I/O; the highest-I/O workloads can’t be done on the Grid (I/O exceeds WAN), while others are distributed-I/O capable.]
Data Movement
• Into and out of the center
• SAN file system
• SAN to/from the compute platform’s parallel file system
• Movement of data between compute, visualization, and pre/post-processing engines
• Automatic migration of data to/from archive
• Bottleneck-free data flow
Pushing the Data-Intensive Envelope

Tier                   Today’s leading edge (15 TF)   Tomorrow’s demands (100 TF)
Memory                 4 TB at 2 TB/s                 10 TB at 10 TB/s
Parallel file system   60 TB at 1 GB/s                3 PB at 100 GB/s
Data parking           100 TB at 1 GB/s               10 PB at 100 GB/s
Archival tape system   10 PB at 100 MB/s              100 PB at 10 GB/s
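The capacities and bandwidths in the hierarchy imply how long a full tier takes to drain. A quick back-of-envelope sketch (tier figures are from the slide; the drain-time framing and function name are my own):

```python
# How long does it take to move a storage tier's full capacity
# at its stated bandwidth? Figures taken from the slide.

def drain_time_hours(capacity_tb, bandwidth_gb_per_s):
    """Hours to move `capacity_tb` terabytes at `bandwidth_gb_per_s`."""
    seconds = capacity_tb * 1000 / bandwidth_gb_per_s  # 1 TB = 1000 GB
    return seconds / 3600

# Today's leading edge: 60 TB parallel file system at 1 GB/s
today_pfs = drain_time_hours(60, 1)          # about 16.7 hours
# Tomorrow's demands: 3 PB (3000 TB) parallel file system at 100 GB/s
tomorrow_pfs = drain_time_hours(3000, 100)   # about 8.3 hours

print(f"today: {today_pfs:.1f} h, tomorrow: {tomorrow_pfs:.1f} h")
```

Note that capacity grows 50x while bandwidth grows 100x, so tomorrow's much larger file system actually drains faster.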
Various file systems
• Small, backed-up /home file system
• Periodically purged fast parallel file system
• Parking file system
• SAN file system with auto-migration to archive
• Possibly a non-backed-up, non-purged intermediate-size file system
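A minimal sketch of what "periodically purged" means in practice: a pass that removes scratch files not accessed in N days. The policy here (7 days, access-time based, dry-run by default) is illustrative, not SDSC's actual purger:

```python
# Illustrative scratch-purge pass: find (and optionally delete)
# files whose last access time is older than `max_age_days`.
import os
import time

def purge(root, max_age_days=7, dry_run=True):
    """Return paths not accessed in `max_age_days`; delete them unless dry_run."""
    cutoff = time.time() - max_age_days * 86400
    victims = []
    for dirpath, _, filenames in os.walk(root):
        for fn in filenames:
            path = os.path.join(dirpath, fn)
            if os.stat(path).st_atime < cutoff:
                victims.append(path)
                if not dry_run:
                    os.remove(path)
    return victims
```

Real purgers also honor exclusion lists and announce deletions in advance; this shows only the core policy check.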
Cyber Infrastructure
[Figure: layered cyberinfrastructure stack. Hardware layer: vector/SMP machines, MPPs, loosely coupled clusters, workstations, data engines, web servers, sensors and instruments. Above it: the network/data transport layer; the Globus layer; grid middleware (bridge software, schedulers, etc.); resource-specific tools and libraries (operating systems, compilers, Oracle, Tomcat, A/D); problem-solving environments (portals, UIs, web services); and domain-specific complex-systems layers: life sciences (bioinformatics), engineering (automotive/aircraft), environmental (climate/weather), astrophysics, etc.]
SDSC DataStar
[Figure: 187 total nodes, comprising 11 p690 and 176 p655 nodes with 1.5 and 1.7 GHz Power4+ processors.]
SANergy Data Movement
[Figure: data path from Orion and the TeraGrid network through the p690 (SANergy client) and Federation switch to the SAN switch infrastructure and SAM-QFS disk, via the SANergy MDC; link speeds of 1 Gb x 4, 2 Gb, and 2 Gb x 4 are shown. Metadata operations go over NFS; data operations go directly over the SAN.]
[Figure: SDSC SAN topology. ~400 Sun FC disk arrays (~4,100 disks, 540 TB total), 32 FC tape drives, a Sun Fire 15K running SAM-QFS and the ETF DB, DataStar’s 176 p655s (SAN-GPFS) and 11 p690s (SANergy client and server), 5 Brocade 12000 switches (1,408 2 Gb ports), a Force10 12000, and HPSS.]
Compute platform: setup
• Small, identical test system
• Perform all upgrades on the test system first
• Shared interactive pool
• Batch pool
• Setting up a common environment
• Copydefaults
• Softenv
• Setting up third-party tools, libraries, helper apps, community codes
Compute platform: setup
• Providing example codes, scripts, and configure files
• /usr/local/apps/examples
• Providing a user interface to allocation management
Compute platform: Allocations
• Compute and data allocations
• Understanding space-time resolution relationships
• Peer (rotating body) review process
• Online system
• I am currently part of an NSF review committee; can provide more info if needed
Criteria for machine access
• Preliminary access for porting, benchmarking, and optimizing the user’s code
• Single-CPU performance criterion (15%?)
• Scaling criterion (half the machine with 90% efficiency)
• If not met, provide help and consulting
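The two access criteria can be checked mechanically. A minimal sketch, assuming the thresholds are 15% of single-CPU peak and 90% parallel efficiency; function names and the example numbers are illustrative, not SDSC policy:

```python
# Illustrative checks for the two machine-access criteria:
# sustained fraction of single-CPU peak, and parallel efficiency.

def meets_single_cpu_criterion(achieved_flops, peak_flops, threshold=0.15):
    """True if the code sustains at least `threshold` of single-CPU peak."""
    return achieved_flops / peak_flops >= threshold

def parallel_efficiency(t_serial, t_parallel, nprocs):
    """Speedup (t_serial / t_parallel) divided by processor count."""
    return (t_serial / t_parallel) / nprocs

def meets_scaling_criterion(t_serial, t_parallel, nprocs, threshold=0.90):
    return parallel_efficiency(t_serial, t_parallel, nprocs) >= threshold

# 1.2 GFLOPS sustained on a 6.0 GFLOPS-peak CPU -> 20%, passes
print(meets_single_cpu_criterion(1.2e9, 6.0e9))   # True
# 1000 s serial vs 1.3 s on 880 CPUs -> efficiency ~0.87, fails
print(meets_scaling_criterion(1000.0, 1.3, 880))  # False
```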
Compute platform: scheduling
• Higher priority to large-PE jobs
• Allowing longer run times for larger-PE jobs
• Weighting based on allocation size
• A good API for users to probe and interact with the scheduler
• Prologue and epilogue scripts to bring the system to a clean state
• Express, high, low, and backfill queues
• Optimizing for maximum throughput vs quick turnaround
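A minimal sketch of the kind of weighting described here: priority grows with PE count, allocation size, and queue wait. The formula and weights are invented for illustration and are not DataStar's actual scheduler policy:

```python
# Illustrative job-priority function favoring large-PE jobs,
# large allocations, and jobs that have waited longer.
import math

def job_priority(pe_count, allocation_sus, queue_wait_hours,
                 w_pe=10.0, w_alloc=1.0, w_wait=5.0):
    """Higher score = scheduled sooner. Weights are made up."""
    return (w_pe * math.log2(max(pe_count, 1))
            + w_alloc * math.log10(max(allocation_sus, 1))
            + w_wait * queue_wait_hours)

# A 256-PE job outranks a 4-PE job from the same allocation
big = job_priority(256, 100_000, 1.0)
small = job_priority(4, 100_000, 1.0)
print(big > small)  # True
```

Logarithmic terms keep any one factor from dominating; a real scheduler would also cap priority growth and enforce fair-share limits.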
Regression tests
• A well-designed set of benchmarks and regression tests to monitor system correctness and performance
• Preventive maintenance
• Compiler/OS upgrades
• Provide access to login/interactive nodes during PM
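One way to automate such a regression check: compare post-maintenance benchmark timings against stored baselines and flag slowdowns beyond a tolerance. The benchmark names and 5% tolerance are illustrative:

```python
# Illustrative regression check: flag benchmarks whose runtime
# grew more than `tolerance` (fractional) versus the baseline,
# or which failed to produce a timing at all.

def check_regressions(baseline, current, tolerance=0.05):
    """Return names of benchmarks that regressed or went missing."""
    failures = []
    for name, base_t in baseline.items():
        cur_t = current.get(name)
        if cur_t is None or (cur_t - base_t) / base_t > tolerance:
            failures.append(name)
    return failures

baseline = {"stream": 10.0, "linpack": 120.0, "mpi_alltoall": 4.0}
current  = {"stream": 10.2, "linpack": 140.0, "mpi_alltoall": 4.1}
print(check_regressions(baseline, current))  # ['linpack']
```

Run after every preventive maintenance or compiler/OS upgrade, before returning the machine to users.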
Compute platform: life cycle
• Friendly-user phase
• A few expert users who can cope with instabilities
• Production phase
• Criteria for a machine to be production: uptime, documentation, accounting, stability
• Terminal phase
• When the next system goes to production
• 2 or 3 users who can use the whole machine
Communicating to Users
• User guide, FAQ
• Periodic articles on tool usage, example apps
• Yearly week-long training
• Email, motd alerts
Consulting
• Ticketing system, phone consulting
• Quick analysis and optimization help
• TOPs (Targeted Optimization and Porting) program
• Extended collaboration
• Strategic Applications Collaboration (SAC)
• Modern tools like IM
Listening to users
• Periodic, well-designed surveys
• User advisory committee
• Local internal users
• Listening while consulting
• Application space is moving from monolithic, single-component analysis codes to multi-scale, multi-physics systems simulation codes
Usage Analysis
• To see how well we are following the policies we set
DS p655 Usage by node count (4/1/04-5/1/05)

Nodes     Share
1         6%
2-3       4%
4         6%
5-7       2%
8         15%
9-15      5%
16        9%
17-31     15%
32        9%
33-63     8%
64        10%
65-123    4%
128       6%
129-176   1%

There have been recent increases in the # of 128-node jobs.
SDSC User Snapshot: 2004
• 286 active projects
• 90 institutions
• 7 million SUs consumed on DataStar
• PIs funded by NSF, NIH, DOE, NASA, DOD, DARPA, AFOSR, ONR
[Chart: Time Awarded, by Discipline]
PIs by Discipline
[Chart: distribution of PIs by discipline]
Time Awarded, by Discipline
[Chart: time awarded by discipline]
Users Span the Nation
[Map: states with SDSC-allocated PIs]
SDSC Compute Resources
• DataStar: 1,628 Power4+ processors, IBM p655 and p690 nodes, 4 TB total memory, up to 2 GB/s I/O to disk
• TeraGrid Cluster: 512 Itanium2 IA-64 processors, 1 TB total memory
• Intimidata: 2,048 PowerPC processors, 128 I/O nodes, half a petabyte of GPFS
SDSC Data Resources
• 1 PB storage-area network (SAN)
• 6 PB StorageTek tape library
• DB2, Oracle, MySQL
• Storage Resource Broker
• HPSS
• 72-CPU Sun Fire 15K
• 96-CPU IBM p690s
SDSC Top 10 Users (SUs consumed in 2004)
• Marvin Cohen, UC Berkeley: DataStar, 846,397 SUs
• Michael Norman, UC San Diego: DataStar, 551,969
• Juri Toomre, U Colorado: DataStar, 361,633
• Richard Klein, UC Berkeley: DataStar, 315,240
• J. Andrew McCammon, UCSD: DataStar, 310,909
• Klaus Schulten, UIUC: TeraGrid Cluster, 287,188
• George Karniadakis, Brown U: DataStar, 284,430
• Richard Klein, UC Berkeley: DataStar, 279,766
• Pui-Kuen Yeung, Ga Tech: DataStar, 220,172
• Parviz Moin, Stanford U: DataStar, 188,391
SAC: ENZO (Robert Harkness)
• “Reconstructing the first billion years”
• 3D cosmological hydrodynamics code
• Generates TBs of data now
• Stresses network and data-movement limits
• Run anywhere, write data to SDSC with SRB
SAC: TeraShake (Yifeng Cui)
• Estimating the potential damage of a magnitude 7.7 Southern California earthquake
• Large-scale simulation of seismic wave propagation on the San Andreas Fault
• 1.8 billion gridpoints
• 240 DataStar processors
• 1 TB memory
• 5 days
• 2 GB/s continuous I/O
• 47 TB output
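A back-of-envelope check on the TeraShake figures (all inputs are from the slide; the derived rates are my own arithmetic):

```python
# Derived rates from the TeraShake run figures:
# 47 TB output over 5 days, 1 TB memory over 1.8 billion gridpoints.

TB = 10**12  # bytes, decimal

output_bytes = 47 * TB
run_seconds = 5 * 86400
avg_mb_per_s = output_bytes / run_seconds / 10**6
# ~109 MB/s sustained average output, well under the 2 GB/s peak I/O rate

bytes_per_point = (1 * TB) / 1.8e9
# ~556 bytes of state held per gridpoint

print(f"{avg_mb_per_s:.0f} MB/s avg, {bytes_per_point:.0f} B/point")
```

The gap between the ~109 MB/s average and the 2 GB/s peak shows why the run needs burst I/O capability rather than just sustained bandwidth.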
NVO Montage (Leesa Brieger)
• Compute-intensive service to deliver science-grade custom mosaics on demand, with requests made through existing portals
• 2MASS: 10-TB, three-band infrared frequency archive of the entire sky
• Compute-intensive generation of custom mosaics
• Possible to mosaic the whole sky into five-degree squares with ~1 week of TeraGrid time
Bluegene specific
• Better development environment
• Eliminate the need for cross-compilation (pretty ancient)
• Run the BGL kernel as a VM on the front end?
• BGL’s special need for packing jobs onto contiguous chunks of nodes
• Special map files, mapping codes
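A BG/L map file lists one torus coordinate ("x y z t") per MPI rank. A hedged sketch that emits a simple XYZT-order mapping for an illustrative small partition; real map files depend on the partition shape, and the function names here are my own:

```python
# Generate a simple XYZT-order mapping (x varies fastest) for a
# torus of nx * ny * nz nodes with nt CPUs per node, one line per rank.

def xyzt_map(nx, ny, nz, nt=2):
    """Yield (x, y, z, t) tuples in XYZT order, one per MPI rank."""
    for t in range(nt):
        for z in range(nz):
            for y in range(ny):
                for x in range(nx):
                    yield (x, y, z, t)

def write_map(path, nx, ny, nz, nt=2):
    """Write the mapping in the four-column "x y z t" map-file form."""
    with open(path, "w") as f:
        for coord in xyzt_map(nx, ny, nz, nt):
            f.write("%d %d %d %d\n" % coord)

# Illustrative 4x4x2 partition with 2 CPUs per node -> 64 ranks
ranks = list(xyzt_map(4, 4, 2))
print(len(ranks))  # 64
```

Codes with nearest-neighbor communication patterns can gain substantially from a mapping matched to the torus topology.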
Bluegene: experience
• Extremely reproducible times
• Extremely stable hardware
• Very poor single-processor performance (compiler? double hummer, SIMD)
• Still have not tested computation/communication overlap
• Would like to operate in single-boot, multi-user mode
Bluegene: experience
• Several SDSC codes ported:
• Mpcugles: LES turbulence code
• PK’s DNS turbulence code
• POP ocean model
• SPECFEM3D: seismic wave propagation
• Amber: MD chemistry code
• ENZO: astrophysics code
• NAMD, CPMD came from IBM
Bluegene: latest
• Half a petabyte of SATA file system attached to BGL
• 64 IA-64 server nodes
• 3.2 GB/s reads and 2.8 GB/s writes
• 700 MB/s from a production code using 512 nodes