
Page 1: Oak Ridge Leadership Computing Facility

Oak Ridge Leadership Computing Facility

Don Maxwell, HPC Technical Coordinator

October 8, 2010

Presented to: HPC User Forum, Stuttgart

www.olcf.ornl.gov

Page 2: Oak Ridge Leadership Computing Facility


Oak Ridge Leadership Computing Facility

• Mission: Deploy and operate the computational resources required to tackle global challenges

– Provide world-class computational resources and specialized services for the most computationally intensive problems

– Provide a stable hardware/software path of increasing scale to maximize productive application development

– Deliver transforming discoveries in materials, biology, climate, energy technologies, etc.

– Provide the ability to investigate otherwise inaccessible systems, from supernovae to nuclear reactors to energy grid dynamics


Page 3: Oak Ridge Leadership Computing Facility


Our vision for sustained leadership and scientific impact

• Provide the world’s most powerful open resource for capability computing

• Follow a well-defined path for maintaining world leadership in this critical area

• Attract the brightest talent and partnerships from all over the world

• Deliver cutting-edge science relevant to the missions of DOE and key federal and state agencies

• Unique opportunity for multi-agency collaboration for science based on synergy of requirements and technology

Page 4: Oak Ridge Leadership Computing Facility


With UT, we are NSF’s National Institute for Computational Sciences for academia


· 1 PF system to the UT-ORNL Joint Institute for Computational Sciences
– Largest grant in UT history
– Other partners: Texas Advanced Computing Center, National Center for Atmospheric Research, ORAU, and core universities
– 1 of up to 4 leading-edge computing systems planned to increase the availability of computing resources to U.S. researchers

· A new phase in our relationship with UT
– Computational Science Initiative
– Governor’s Chair and joint faculty
– Engagement with the scientific community
– Research, education, and training mission

Page 5: Oak Ridge Leadership Computing Facility


Oak Ridge National Laboratory Leadership Computing Systems

Jaguar (world's most powerful computer)
Peak performance: 2.33 PF/s
Memory: 300 TB
Disk bandwidth: > 240 GB/s
Square feet: 5,000
Power: 7 MW

Kraken (NSF's most powerful computer)
Peak performance: 1.03 PF/s
Memory: 132 TB
Disk bandwidth: > 50 GB/s
Square feet: 2,300
Power: 3 MW

NOAA CMRS (NOAA's most powerful computer)
Peak performance: 1.1 PF/s
Memory: 248 TB
Disk bandwidth: 104 GB/s
Square feet: 1,600
Power: 2.2 MW

Page 6: Oak Ridge Leadership Computing Facility


Jaguar History

Jan 2005: XT3 development cabinet
Mar 2005: 10 cabinets, single core
Apr 2005: +30 XT3 cabinets
Jun 2005: +16 cabinets for a total of 56 XT3 cabinets; 25 TF
Jul 2006: XT3 dual core, 2.6 GHz; 50 TF
Nov 2006: XT4 dual core, 2.6 GHz; 32 then 36 cabinets
Mar 2007: XT3 and XT4 combined for a total of 124 cabinets; 100 TF
May 2008: XT4, 68 cabinets, quad core; 250 TF
Dec 2008: 200-cabinet quad-core XT5; 1 PF
Nov 2009: 200-cabinet six-core XT5; 2 PF

Page 7: Oak Ridge Leadership Computing Facility


What is Jaguar Today?

Jaguar combines a 263 TF Cray XT4 system at ORNL's OLCF with a 2,332 TF Cray XT5 to create a 2.5 PF system.

System attribute            XT5                  XT4
AMD Opteron processors      37,376 hex-core      7,832 quad-core
Memory DIMMs                75,772               31,776
Node architecture           Dual-socket SMP      Single socket
Memory per core/node (GB)   1.3 / 16             2 / 8
Total system memory (TB)    300                  62
Disk capacity (TB)          10,000               750
Disk bandwidth (GB/s)       240                  44
Interconnect                SeaStar2+ 3D torus   SeaStar2+ 3D torus
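As a quick consistency check, the per-node and per-core memory figures follow from the processor counts in the table. The sketch below is illustrative arithmetic only, not from the slide:

```python
# Derive node counts and memory per node/core from the table above
# (illustrative arithmetic only).
xt5_sockets, xt5_cores_per_socket, xt5_sockets_per_node = 37_376, 6, 2
xt4_sockets, xt4_cores_per_socket, xt4_sockets_per_node = 7_832, 4, 1

xt5_nodes = xt5_sockets // xt5_sockets_per_node   # 18,688 nodes
xt4_nodes = xt4_sockets // xt4_sockets_per_node   # 7,832 nodes

print(round(300e12 / xt5_nodes / 1e9), "GB per XT5 node")                                 # ~16
print(round(300e12 / (xt5_sockets * xt5_cores_per_socket) / 1e9, 1), "GB per XT5 core")   # ~1.3
print(round(62e12 / xt4_nodes / 1e9), "GB per XT4 node")                                  # ~8
print(round(62e12 / (xt4_sockets * xt4_cores_per_socket) / 1e9), "GB per XT4 core")       # ~2
```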

Page 8: Oak Ridge Leadership Computing Facility


“Spider”: Center-wide High Speed Parallel File System

• “Spider” provides a shared, parallel file system for all systems
– Based on the Lustre file system

• Demonstrated bandwidth of over 240 GB/s

• Over 10 PB of RAID-6 capacity
– DDN 9900 storage controllers with 8+2 disks per RAID group
– 13,440 1-TB SATA drives

• 192 Dell PowerEdge storage servers
– 3 TB of memory

• Available from all systems via our high-performance, scalable I/O network
– Over 3,000 InfiniBand ports
– Over 3 miles of cables
– Scales as storage grows

• Spider is the parallel file system for Jaguar

• Spider uses approximately 400 kW of power
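The headline capacity and bandwidth figures follow from the component counts above; a quick derivation (illustrative arithmetic only, not from the slide):

```python
# Derive Spider's usable capacity and per-server bandwidth share from the
# component counts above (illustrative arithmetic only).
drives, drive_tb = 13_440, 1
data_disks, parity_disks = 8, 2            # RAID-6 (8+2) groups
servers, total_bw_gbs = 192, 240

raid_groups = drives // (data_disks + parity_disks)    # 1,344 groups
usable_tb = raid_groups * data_disks * drive_tb        # 10,752 TB, i.e. > 10 PB

print("usable capacity:", usable_tb / 1000, "PB")
print("bandwidth per server:", total_bw_gbs / servers, "GB/s")   # 1.25 GB/s
print("drives per server:", drives // servers)                   # 70
```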

Page 9: Oak Ridge Leadership Computing Facility


Jaguar combines a 2.33 PF Cray XT5 with a 263 TF Cray XT4

System components are linked by 4×-DDR InfiniBand (IB) using three Cisco 7024D switches

• XT5 has 192 IB links

• XT4 has 48 IB links

• Spider has 192 IB links

[Diagram: Spider, the Cray XT4, the Cray XT5, and external login nodes linked by the InfiniBand I/O network]

Page 10: Oak Ridge Leadership Computing Facility


Building an Exabyte Archive

• Supercomputers addressing Grand Challenges need to quickly store massive amounts of data

• The High-Performance Storage System meets the big-storage demands of big science

• 25 PB of tape storage

• Planning for 750 PB by 2012

High-Performance Storage System adds capacity and speed

“Fifteen years ago, [national] labs realized they needed something of this size. They recognized Grand Challenge problems were coming up that would require petaflops of computing power. And they realized those jobs had to have a place to put the data.”
Stanley White, National Center for Computational Sciences

Page 11: Oak Ridge Leadership Computing Facility


Scheduling to Maximize Capability Computing

Factor               Unit of weight   Actual weight (minutes)   Value
Quality of Service   # of days        1440                      Highest (90), High (12), Medium (2)
Account Priority     # of days        1440                      Allocated project (1), No allocation (staff) (0), No hours (-365)
Job Size             # of days        1440                      0 (90), >120,000 (15), >80,000 & <120,000 (10), >40,000 & <80,000 (5), <40,000 (0)
Fairshare            # of minutes     1440                      (<>) 5% user: +/- 30 minutes; (<>) 10% acct: +/- 1 hour
Queue Time           1 minute         1                         Provided by Moab

Capability jobs get maximum priority and walltime

Jobs are prioritized using several factors to meet DOE goals and to provide flexibility
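A minimal sketch of how this kind of weighting might combine into a single priority value, assuming each day-based factor converts at 1,440 minutes per day and is added to the fairshare adjustment and one point per minute of queue time. The function and constant names are illustrative, not Moab's actual configuration:

```python
# Illustrative sketch of the priority weighting in the table above.
# Assumptions: day-based factors convert at 1440 minutes per day; fairshare
# and queue time contribute minutes directly. Not Moab's actual config.
MINUTES_PER_DAY = 1440

QOS_DAYS = {"highest": 90, "high": 12, "medium": 2}
ACCOUNT_DAYS = {"allocated": 1, "staff": 0, "no_hours": -365}

def job_size_days(cores: int) -> int:
    """Priority days awarded by job size (capability-scale jobs rank higher)."""
    if cores > 120_000:
        return 15
    if cores > 80_000:
        return 10
    if cores > 40_000:
        return 5
    return 0

def priority(qos: str, account: str, cores: int,
             fairshare_minutes: int, queue_minutes: int) -> int:
    """Total priority in minutes, mirroring the factor table above."""
    days = QOS_DAYS[qos] + ACCOUNT_DAYS[account] + job_size_days(cores)
    return days * MINUTES_PER_DAY + fairshare_minutes + queue_minutes

# Example: an allocated 150,000-core job at high QoS that has waited 6 hours.
print(priority("high", "allocated", 150_000,
               fairshare_minutes=-30, queue_minutes=360))
```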

Page 12: Oak Ridge Leadership Computing Facility


Job Failure Trends

[Chart: Failures Due to Hardware by Job Size; x-axis: job size in cores (2,000 to 225,000); y-axis: percentage of jobs failing (0% to 6%)]

MPI Forum • OpenMPI • HWPOISON
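Larger jobs see more hardware-caused failures simply because they use more components: if nodes fail independently, the chance that at least one node in a job fails during a run grows with the node count. A back-of-the-envelope sketch of that relationship, assuming an exponential failure model and a hypothetical per-node MTBF (neither figure comes from the slide):

```python
import math

# Toy model: probability that a job of `cores` cores running for `hours`
# hours is hit by at least one node failure, assuming independent,
# exponentially distributed node failures. The MTBF below is hypothetical,
# not a measured Jaguar number.
NODE_MTBF_HOURS = 500 * 365 * 24   # assume ~500-year MTBF per node
CORES_PER_NODE = 12                # XT5 node: dual-socket hex-core

def failure_probability(cores: int, hours: float) -> float:
    nodes = cores / CORES_PER_NODE
    rate = nodes / NODE_MTBF_HOURS           # expected node failures per hour
    return 1.0 - math.exp(-rate * hours)     # P(at least one failure)

for cores in (2_000, 40_000, 120_000, 225_000):
    print(f"{cores:>7} cores: {100 * failure_probability(cores, 12):.1f}%")
```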

Page 13: Oak Ridge Leadership Computing Facility


ORNL’s Current and Planned Data Centers

Computational Sciences Building (40,000 ft2)
• Maximum building power to 25 MW
• 6,600-ton chiller plant
• 1.5 MW UPS and 2.25 MW generator
• LEED certified

Multiprogram Research Facility (30,000 ft2)
• Capability computing for national defense
• 25 MW of power and 8,000-ton chillers
• LEED Gold certification

Multiprogram Computing & Data Center (140,000 ft2)
• Up to 100 MW of power
• Lights-out facility
• Planned for LEED Gold certification

Page 14: Oak Ridge Leadership Computing Facility


[Organization chart: National Center for Computational Sciences; J. Hack, Director; approximately 78 FTEs. ORNL is managed and operated by UT-Battelle, LLC under contract with the DOE.]

Leadership: A. Bland, OLCF Project Director; K. Boudwin, Deputy Project Director; A. Geist, Chief Technology Officer; J. Rogers, Director of Operations; S. Poole, OLCF System Architect; B. Messer, Director of Science (Acting); J. White, INCITE Program; S. Tichenor, Industrial Partnerships; D. Maxwell, Technical Coordinator.

Groups: High-Performance Computing Operations (A. Baker); Technology Integration (G. Shipman); Scientific Computing (R. Kendall); User Assistance and Outreach (A. Barker); Application Performance Tools (R. Graham); Cray Supercomputing Center of Excellence (J. Levesque, N. Wichmann, J. Larkin, D. Kiefer, L. DeRose).

Advisory Committee: J. Dongarra, T. Dunning, K. Droegemeier, S. Karin, D. Reed, J. Tomkins.

Page 15: Oak Ridge Leadership Computing Facility


Scientific Computing


Scientific Computing facilitates the delivery of leadership science by partnering with users to effectively utilize computational science, visualization, and workflow technologies on OLCF resources through:

• Serving as liaisons to science teams

• Developing, tuning, and scaling current and future applications

• Providing visualizations to present scientific results and augment discovery processes

Page 16: Oak Ridge Leadership Computing Facility


We allocate time on the DOE systems through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) Program

INCITE provides awards to academic, government, and industry organizations worldwide that need large allocations of computer time, supporting resources, and data storage to pursue transformational advances in science and industrial competitiveness.

Page 17: Oak Ridge Leadership Computing Facility


User Demographics

Active Users by Sponsor

System time is allocated to each project. We do not charge for time except for proprietary work by commercial companies.

Page 18: Oak Ridge Leadership Computing Facility


Some INCITE research topics
• Glimpse into dark matter
• Supernovae ignition
• Protein structure
• Creation of biofuels
• Replicating enzyme functions
• Protein folding
• Chemical catalyst design
• Efficient coal gasifiers
• Combustion
• Algorithm development
• Global cloudiness
• Regional earthquakes
• Carbon sequestration
• Airfoil optimization
• Turbulent flow
• Propulsor systems
• Nano-devices
• Batteries
• Solar cells
• Reactor design

Next INCITE Call for Proposals: April 2011

Awards for 1, 2, or 3 years

Average award > 20 million processor hours per year

Contact us about discretionary time for INCITE preparation

Contact information: Julia C. White, INCITE Manager
[email protected]

Page 19: Oak Ridge Leadership Computing Facility


Three of six Gordon Bell finalists ran on Jaguar

Gordon Bell Prize Awarded to ORNL Team

• A team led by ORNL’s Thomas Schulthess received the prestigious 2008 Association for Computing Machinery (ACM) Gordon Bell Prize at SC08

• For attaining fastest performance ever in a scientific supercomputing application

• Simulation of superconductors achieved 1.352 petaflops on ORNL’s Cray XT Jaguar supercomputer

• By modifying the algorithms and software design of the DCA++ code, the team was able to boost its performance tenfold

Gordon Bell finalists:
• DCA++ (ORNL)
• LS3DF (LBNL)
• SPECFEM3D (SDSC)
• RHEA (TACC)
• SPaSM (LANL)
• VPIC (LANL)

UPDATE: with upgraded Jaguar, DCA++ has exceeded 1.9 PF

Page 20: Oak Ridge Leadership Computing Facility


OLCF is working with users to produce scalable, high-performance apps for the petascale

Science Area     Code        Contact         Cores     Total Performance        Notes
Materials        DCA++       Schulthess      213,120   1.9 PF*                  2008 Gordon Bell Winner
Materials        WL-LSMS     Eisenbach       223,232   1.8 PF                   2009 Gordon Bell Winner
Chemistry        NWChem      Apra            224,196   1.4 PF                   2009 Gordon Bell Finalist
Nano Materials   OMEN        Klimeck         222,720   >1 PF                    2010 Gordon Bell Submission
Seismology       SPECFEM3D   Carrington      149,784   165 TF                   2008 Gordon Bell Finalist
Weather          WRF         Michalakes      150,000   50 TF
Combustion       S3D         Chen            144,000   83 TF
Fusion           GTC         PPPL            102,000   20 billion particles/s
Materials        LS3DF       Lin-Wang Wang   147,456   442 TF                   2008 Gordon Bell Winner
Chemistry        MADNESS     Harrison        140,000   550+ TF
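For a rough sense of per-core efficiency, the sustained figures can be divided by the core counts in the table; a quick illustrative calculation (not from the slide):

```python
# Sustained performance per core for a few of the codes above
# (illustrative arithmetic only).
apps = {
    "DCA++":   (213_120, 1.9e15),   # cores, sustained flop/s
    "WL-LSMS": (223_232, 1.8e15),
    "NWChem":  (224_196, 1.4e15),
    "LS3DF":   (147_456, 4.42e14),
}
for name, (cores, flops) in apps.items():
    print(f"{name:8s} ~{flops / cores / 1e9:.1f} GF/s per core")
```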


Page 21: Oak Ridge Leadership Computing Facility


Scientific Progress at the Petascale

Nuclear Energy: High-fidelity predictive simulation tools for the design of next-generation nuclear reactors to safely increase operating margins.

Fusion Energy: Substantial progress in the understanding of anomalous electron energy loss in the National Spherical Torus Experiment (NSTX).

Nano Science: Understanding the atomic and electronic properties of nanostructures in next-generation photovoltaic solar cell materials.

Turbulence: Understanding the statistical geometry of turbulent dispersion of pollutants in the environment.

Energy Storage: Understanding the storage and flow of energy in next-generation nanostructured carbon-tube supercapacitors.

Biofuels: A comprehensive simulation model of lignocellulosic biomass to understand the bottleneck to sustainable and economical ethanol production.


Page 22: Oak Ridge Leadership Computing Facility


Nanoscience / Nanotechnology: Petascale simulations of nano-electronic devices

Research Team: M. Luisier and G. Klimeck, Purdue University
3-year INCITE award, with 20 million hours in 2010
OMEN: 3D, 2D, and 1D atomistic devices

Science Objectives and Impact
• Identify next-generation nano-transistor architectures to reduce power consumption and increase manufacturability
• Model, understand, and design carrier flow in nano-scale semiconductor transistors

Science Results
• Coherent transport simulations in band-to-band tunneling devices with simulation times of less than an hour => rapidly explore the design space
• Incoherent transport simulations coupling all energies through phonon interactions; production runs on 70,000 cores in 12 hours => first atomistic incoherent transport simulations

Page 23: Oak Ridge Leadership Computing Facility


Computational Fluid Dynamics: Smart-Truck Optimization

Research Team: Mike Henderson, BMI Corp.
Participant in the Industrial Partnerships Program

Science Objectives and Impact
• Apply advanced computational techniques from the aerospace industry to substantially improve fuel efficiency and reduce emissions of trucks by reducing drag / increasing aerodynamic efficiency
• If all 1.3 million long-haul trucks operated with the drag of a passenger car, the US would annually save 6.8 billion gallons of diesel, reduce CO2 by 75 million tons, and save $19 billion in fuel costs

Science Results
Unprecedented detail and accuracy of a Class 8 tractor-trailer aerodynamic simulation:
• Minimizes drag associated with the trailer underside
• Compresses and accelerates incoming air flow, injecting high-energy air into the trailer wake
=> The UT-6 Trailer Under Tray System reduces tractor-trailer drag by 12%

Aerodynamic performance testing methods: Jaguar CFD analysis of truck and mirrors

Page 24: Oak Ridge Leadership Computing Facility


Examples of OLCF Industrial Projects

Developing new add-on parts to reduce drag and increase fuel efficiency of Class 8 (18-wheeler) long-haul trucks. This will reduce fuel consumption by up to 3,700 gallons per truck per year and reduce CO2 by up to 41 tons (82,000 lb) per truck per year. BMI is using NASA FUN3D, and the NASA team is assisting BMI with code refinement. (OLCF Director’s Discretionary Award)

Analyzing unsteady versus steady flows in low-pressure turbomachinery and their potential effects on more energy-efficient designs. (OLCF Director’s Discretionary Award)

Studying, at the nanoscale, catalysts that can selectively produce hydrogen from biomass (hydrogen to be used as energy for fuel cells). (OLCF Director’s Discretionary Award)

Developing a unique CO2 compression technology for significantly lower cost carbon sequestration (ALCC award)

INCITE awards

Page 25: Oak Ridge Leadership Computing Facility


10-Year Strategy: Moving to the Exascale

• The U.S. Department of Energy requires exaflops computing by 2018 to meet the needs of the science communities that depend on leadership computing

• Our vision: Provide a series of increasingly powerful computer systems and work with the user community to scale applications to each new system

– OLCF-3 Project: New 10-20 petaflops computer based on early hybrid multi-core technology

[Roadmap figure from the 10-year plan: timeline from 2008 to 2019 showing today's 1 PF and 2 PF six-core systems in the ORNL Computational Sciences Building, the OLCF-3 system (10-20 PF) in the ORNL Multiprogram Research Facility, and future 100 PF, 300 PF, and 1 EF systems in the planned ORNL Extreme Scale Computing Facility (140,000 ft2)]

Page 26: Oak Ridge Leadership Computing Facility


OLCF-3 “Titan” System Description

• Similar number of cabinets, cabinet design, and cooling as Jaguar
• Operating system upgrade of today’s Cray Linux Environment
• New Gemini interconnect
– 3-D torus
– Globally addressable memory
– Advanced synchronization features
• New accelerated node design using GPUs
• 20 PF peak performance
– 9x the performance of today’s XT5
• 3x larger memory
• 3x larger and 4x faster file system
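A quick sanity check of the headline targets against today's Jaguar and Spider figures from the earlier slides (illustrative arithmetic only; the exact OLCF-3 specifications were not final at the time):

```python
# Rough check of the OLCF-3 targets against today's XT5 and Spider figures
# quoted on earlier slides (illustrative arithmetic only).
xt5_peak_pf = 2.33      # PF
xt5_memory_tb = 300     # TB
spider_bw_gbs = 240     # GB/s

print("~peak:", round(xt5_peak_pf * 9, 1), "PF (target: 20 PF)")
print("~memory:", xt5_memory_tb * 3, "TB")
print("~file system bandwidth:", spider_bw_gbs * 4, "GB/s")
```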