"Future Exascale Supercomputers"
Mexico DF, November 2011
Prof. Mateo Valero, Director, Barcelona Supercomputing Center
Top10 (November 2011)

Rank  Site                                                        Computer                                                                            Procs    Rmax (Gflop/s)  Rpeak (Gflop/s)
1     RIKEN Advanced Institute for Computational Science (AICS)   Fujitsu K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect                       705024   10510000        11280384
2     Tianjin, China                                              Xeon X5670 + NVIDIA                                                                 186368   2566000         4701000
3     Oak Ridge Nat. Lab.                                         Cray XT5, 6 cores                                                                   224162   1759000         2331000
4     Shenzhen, China                                             Xeon X5670 + NVIDIA                                                                 120640   1271000         2984300
5     GSIC Center, Tokyo                                          Xeon X5670 + NVIDIA                                                                 73278    1192000         2287630
6     DOE/NNSA/LANL/SNL                                           Cray XE6, 8-core 2.4 GHz                                                            142272   1110000         1365811
7     NASA/Ames Research Center/NAS                               SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon 5570/5670 2.93 GHz, Infiniband     111104   1088000         1315328
8     DOE/SC/LBNL/NERSC                                           Cray XE6, 12 cores                                                                  153408   1054000         1288627
9     Commissariat a l'Energie Atomique (CEA)                     Bull bullx super-node S6010/S6030                                                   138368   1050000         1254550
10    DOE/NNSA/LANL                                               QS22/LS21 Cluster, PowerXCell 8i / Opteron, Infiniband                              122400   1042000         1375776
Parallel Systems

[Diagram: a parallel system as a set of nodes connected by an interconnect (Myrinet, IB, GE, 3D torus, tree, …); each node is an SMP with memory and one or more multicore chips, linked internally by a network-on-chip (bus, ring, direct, …)]

● homogeneous multicore (BlueGene-Q chip)
● heterogeneous multicore
● general-purpose accelerator (e.g. Cell)
● GPU
● FPGA
● ASIC (e.g. Anton for MD)
Riken's Fujitsu K with SPARC64 VIIIfx
● Homogeneous architecture
● Compute node:
  ● One SPARC64 VIIIfx processor: 2 GHz, 8 cores per chip, 128 Gigaflops per chip
  ● 16 GB memory per node
● Number of nodes and cores:
  ● 864 cabinets * 102 compute nodes/cabinet * (1 socket * 8 CPU cores) = 705024 cores …. 50 by 60 meters
● Peak performance (DP):
  ● 705024 cores * 16 Gflops per core = 11280384 Gflops = 11.28 Pflops
● Linpack: 10510 Tflops (10.51 Pflops), 93% efficiency. Matrix: more than 13725120 rows!! 29 hours and 28 minutes
● Power consumption 12.6 MWatt, 0.8 Gigaflops/W
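The peak figure follows directly from the counts above; a minimal sketch in C, using only the numbers given on this slide:

    #include <stdio.h>

    int main(void) {
        /* Figures from the slide: 864 cabinets, 102 compute nodes per
           cabinet, 1 socket per node, 8 cores per socket, 16 Gflops
           per core (double precision). */
        long cores = 864L * 102 * 1 * 8;        /* = 705024 */
        double peak_gflops = cores * 16.0;      /* = 11280384 Gflops */
        double linpack_gflops = 10510000.0;     /* Rmax from the Top10 list */

        printf("cores = %ld, peak = %.2f Pflops, HPL efficiency = %.1f%%\n",
               cores, peak_gflops / 1e6, 100.0 * linpack_gflops / peak_gflops);
        return 0;
    }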
Looking at the Gordon Bell Prize
● 1 Gflop/s; 1988; Cray Y-MP; 8 processors
  ● Static finite element analysis
● 1 Tflop/s; 1998; Cray T3E; 1024 processors
  ● Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
● 1 Pflop/s; 2008; Cray XT5; 1.5x10^5 processors
  ● Superconductive materials
● 1 Eflop/s; ~2018; ?; 1x10^8 processors?? (10^9 threads)

(Thanks to Jack Dongarra)
Nvidia GPU instruction execution

[Diagram: instruction issue across multiprocessors MP1-MP4: instruction 1, instruction 2, a long-latency operation (3), instruction 4]
Potential System Architecture for Exascale Supercomputers

System attribute            2010        "2015"               "2018"                 Difference 2010-18
System peak                 2 Pflop/s   200 Pflop/s          1 Eflop/s              O(1000)
Power                       6 MW        15 MW                ~20 MW
System memory               0.3 PB      5 PB                 32-64 PB               O(100)
Node performance            125 GF      0.5 TF / 7 TF        1 TF / 10 TF           O(10) - O(100)
Node memory BW              25 GB/s     0.1 TB/s / 1 TB/s    0.4 TB/s / 4 TB/s      O(100)
Node concurrency            12          O(100) / O(1,000)    O(1,000) / O(10,000)   O(100) - O(1000)
Total concurrency           225,000     O(10^8)              O(10^9)                O(10,000)
Total node interconnect BW  1.5 GB/s    20 GB/s              200 GB/s               O(100)
MTTI                        days        O(1 day)             O(1 day)               -O(10)

(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
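One way to read the total-concurrency row: with core clocks staying at roughly O(1 GHz), sustaining 1 Eflop/s = 10^18 flop/s requires about 10^18 / 10^9 = 10^9 operations in flight on every cycle, which is where the O(10^9) entry, and the 10^9 threads on the Gordon Bell slide, come from.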
2. Towards faster airplane design

Boeing: number of wing prototypes prepared for wind-tunnel testing

Date                1980        1995   2005
Airplane            B757/B767   B777   B787
# wing prototypes   77          11     11

Plateau due to RANS limitations. Further decrease expected from LES with exaflop systems.
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Design of the Airbus 380 [image slide]
Airbus: "More simulation, less testsMore simulation, less tests“
From A380A380 to A350A350
2. To faster air plane design
- 40%40% less wind-tunnel days - 25%25% saving in aerodynamics development time - 20%20% saving on wind-tunnel tests cost
th k t HPC bl d CFD i ll i hi h d i idi
11Mexico DF, November, 2011
Acknowledgements: E. CHAPUT (AIRBUS)
thanks to HPC-enabled CFD runs, especially in high-speed regime, providing even better representation of aerodynamics phenomenon turned into better design choices.
EESI Final Conference10-11 Oct. 2011, Barcelona
2. Oil industry [image slide]
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Design of ITER: TOKAMAK (JET) [image slide]

Fundamental Sciences
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Materials: a new path to competitiveness

● On-demand materials for effective commercial use
● Conductivity: energy-loss reduction
● Lifetime: corrosion protection, e.g. chrome
● Fissures: safety insurance from molecular design
● Optimisation of materials / lubricants: less friction, longer lifetime, fewer energy losses

Industrial need to speed up simulation from months to days.
From all-atom to multi-scale: exascale enables simulation of larger and more realistic systems and devices.
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Life Sciences and Health

Scales of simulation: Atom → Small molecule → Macromolecule → Cell → Tissue → Organ → Population
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Supercomputing, theory and experimentation [image slides, shown in Spanish and English; courtesy of IBM]
Holistic approach … towards exaflop

[Diagram: a holistic view spanning applications, job scheduling, programming model, run time, interconnection, and processor/node architecture, with concerns such as dependencies, resource awareness, moldability, computational complexity, user satisfaction, address space, work generation, asynchronous algorithms, load balancing, concurrency extraction, NIC design, hw counters, core structure, locality optimization, memory subsystem, run-time support, topology and routing, and external contention]
10+ Pflop/s systems planned

● Fujitsu Kei
  ● 80,000 8-core SPARC64 VIIIfx processors at 2 GHz (16 Gflops/core, 58 watts, 3.2 Gflops/watt), 16 GB/node, 1 PB memory, 6D mesh-torus, 10 Pflops
● Cray's Titan at DOE, Oak Ridge National Laboratory
  ● Hybrid system with Nvidia GPUs: 1 Pflop/s in 2011, 20 Pflop/s in 2012, late-2011 prototype
  ● $100 million
10+ Pflop/s systems planned

● IBM Blue Waters at Illinois
  ● 40,000 8-core Power7, 1 PB memory, 18 PB disk, 500 PB archival storage, 10 Pflop/s, 2012, $200 million
● IBM Blue Gene/Q systems:
  ● Mira to DOE, Argonne National Lab: 49,000 nodes, 16-core Power A2 processor (1.6 GHz), 750 K cores, 750 TB memory, 70 PB disk, 5D torus, 10 Pflop/s
  ● Sequoia to Lawrence Livermore National Lab: 98,304 nodes (96 racks), 16-core A2 processor, 1.6 M cores (1 GB/core), 1.6 petabytes memory, 6 MWatt, 3 Gflops/watt, 20 Pflop/s, 2012

Heterogeneous, distributed-memory GigaHz x KiloCore x MegaNode system (10^9 Hz x 10^3 cores/node x 10^6 nodes ≈ 10^18 flop/s)
Japan plan for exascale

2012: K Machine (10 PF) → 2015: "10K" Machine (100 PF) → 2018-2020: "100K" Machine (Exaflops)

Feasibility study (2012-2013); Exascale project (2014-2020); post-petascale projects
[Two image slides: thanks to S. Borkar, Intel]
Nvidia: Chip for the Exaflop Computer [image slide]
Nvidia: Node for the Exaflop Computer [image slide]
Exascale Supercomputer [image slide]
(Thanks to Bill Dally)
BSC-CNS: International Initiatives (IESP)

● Build an international plan for developing the next generation of open-source software for scientific high-performance computing
● Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment
Back to Babel?

"Now the whole earth had one language and the same words" … (Book of Genesis)

The computer age: Fortran & MPI

…"Come, let us make bricks, and burn them thoroughly."…
…"Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves"…

And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."

Today's tower of languages: Fortress, StarSs, OpenMP, MPI, X10, Sequoia, CUDA, Sisal, CAF, SDK, UPC, Cilk++, Chapel, HPF, ALF, RapidMind

(Thanks to Jesus Labarta)
[Image: Picasso -- Don Quixote]

"You will see…. in 400 years from now people will get crazy": parallel programming models, multicore/manycore architectures, a new generation of programmers, new usage models.

Dr. Avi Mendelson (Microsoft). Keynote at ISC-2007
Different models of computation …

● The dream of automatically parallelizing compilers has not come true …
● … so the programmer needs to express opportunities for parallel execution in the application
  ● OpenMP 2.5: nested fork-join, SPMD
  ● OpenMP 3.0: DAG – data flow
● And … asynchrony (MPI and OpenMP are too synchronous):
  ● Collectives/barriers multiply the effects of microscopic load imbalance, OS noise, …
  ● Huge lookahead & reuse …. latency/EBW/scheduling
StarSs: … generates task graph at run time …

    #pragma css task input(A, B) output(C)
    void vadd3 (float A[BS], float B[BS], float C[BS]);

    #pragma css task input(sum, A) output(B)
    void scale_add (float sum, float A[BS], float B[BS]);

    #pragma css task input(A) inout(sum)
    void accum (float A[BS], float *sum);

    for (i = 0; i < N; i += BS)      // C = A+B
       vadd3 (&A[i], &B[i], &C[i]);
    ...
    for (i = 0; i < N; i += BS)      // sum(C[i])
       accum (&C[i], &sum);
    ...
    for (i = 0; i < N; i += BS)      // B = sum*E
       scale_add (sum, &E[i], &B[i]);
    ...
    for (i = 0; i < N; i += BS)      // A = C+D
       vadd3 (&C[i], &D[i], &A[i]);
    ...
    for (i = 0; i < N; i += BS)      // E = C+F
       vadd3 (&C[i], &F[i], &E[i]);

[Diagram: the resulting task graph, 20 task instances (4 per loop) ordered by their data dependences]
StarSs: … and executes as efficiently as possible …

[Same code and task-graph annotations as above; diagram: the runtime executes the graph, scheduling tasks in waves as their dependences are satisfied]
StarSs: … benefiting from data access information

● Flat global address space seen by the programmer
● Flexibility to dynamically traverse the dataflow graph, "optimizing":
  ● Concurrency. Critical path
  ● Memory access
● Opportunities for:
  ● Prefetch
  ● Reuse
  ● Eliminating antidependences (renaming)
  ● Replication management
StarSs: enabler for exascale

● Can exploit very unstructured parallelism
  ● Not just loop/data parallelism
  ● Easy to change structure
● Supports large amounts of lookahead
  ● Not stalling for dependence satisfaction
  ● Allows locality optimizations to tolerate latency
  ● Overlap data transfers, prefetch, reuse
● Support for heterogeneity
  ● Any # and combination of CPUs, GPUs
  ● Including autotuning
● Malleability: decouple program from resources
  ● Allowing dynamic resource allocation and load balance
  ● Tolerate noise
● Nicely hybridizes into MPI/StarSs (see the sketch after this list)
  ● Propagates the node-level dataflow characteristics to large scale
  ● Overlap communication and computation
  ● A chance against Amdahl's law
● Data flow; asynchrony
  ● The potential is there; can blame the runtime
  ● Compatible with proprietary low-level technologies
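To make the hybridization point concrete, here is a generic sketch (an editor's illustration, not BSC's actual MPI/SMPSs code; it assumes MPI initialized with MPI_THREAD_MULTIPLE and hypothetical helpers compute_interior/compute_boundary): wrapping the halo exchange in a task lets the runtime overlap it with interior computation.

    #include <mpi.h>

    void compute_interior (double *u, int n);         /* touches u[1..n-2] only */
    void compute_boundary (double *u, double *halo);  /* needs the received halo */

    void step (double *u, double *halo, int n, int left, int right)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: halo[0])
            {   /* exchange edges with neighbours; blocks only this task */
                MPI_Sendrecv (&u[n-1], 1, MPI_DOUBLE, right, 0,
                              &halo[0], 1, MPI_DOUBLE, left, 0,
                              MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }

            #pragma omp task                          /* independent of the halo */
            compute_interior (u, n);

            #pragma omp task depend(in: halo[0])      /* waits only for the exchange */
            compute_boundary (u, halo);
        }
    }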
StarSs: history/strategy/versions

Basic SMPSs:
● C/Fortran
● Must provide directionality for every argument
● Contiguous, non partially overlapped arguments
● Renaming
● Several schedulers (priority, locality, …)
● No nesting
● MPI/SMPSs optimizations

SMPSs regions:
● C, no Fortran
● Must provide directionality for every argument
● Overlapping & strided arguments; reshaping of strided accesses
● Priority- and locality-aware scheduling

OMPSs:
● C/C++; Fortran under development
● OpenMP compatibility (~)
● Dependences based only on arguments with directionality
● Contiguous arguments (address used as sentinels)
● Separate dependences/transfers
● Inlined/outlined pragmas
● Nesting
● SMP/GPU/Cluster
● No renaming
● Several schedulers: "simple" locality-aware scheduler, …
Multidisciplinary top-down approach

[Diagram: investigate solutions to problems spanning applications and algorithms, programming models, performance analysis and prediction tools, load balancing, power, processor and node architecture, and interconnect]

[Chart: "Computer Center Power Projections", 2005-2011: power (MW) rising from a few MW toward 80-90 MW, split between cooling and computers, with annual cost annotations growing from $3M through $9M, $17M, and $23M to $31M]
Green/Top 500, November 2011

Green500  Top500  Mflops/W  Power (kW)  Site                                                            Computer
1         64      2026.48   85.12       IBM - Rochester                                                 BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
2         65      2026.48   85.12       IBM Thomas J. Watson Research Center                            BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
3         29      1996.09   170.25      IBM - Rochester                                                 BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
4         17      1988.56   340.50      DOE/NNSA/LLNL                                                   BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
5         284     1689.86   38.67       IBM Thomas J. Watson Research Center                            NNSA/SC Blue Gene/Q Prototype 1
6         328     1378.32   47.05       Nagasaki University                                             DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR
7         114     1266.26   81.50       Barcelona Supercomputing Center                                 Bullx B505, Xeon E5649 6C 2.53 GHz, Infiniband QDR, NVIDIA 2090
8         102     1010.11   108.80      TGCC / GENCI                                                    Curie Hybrid Nodes - Bullx B505, Xeon E5640 2.67 GHz, Infiniband QDR
9         21      963.70    515.20      Institute of Process Engineering, Chinese Academy of Sciences   Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050
10        5       958.35    1243.80     GSIC Center, Tokyo Institute of Technology                      HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows
11        96      928.96    126.27      Virginia Tech                                                   SuperServer 2026GT-TRF, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2050
12        111     901.54    117.91      Georgia Institute of Technology                                 HP ProLiant SL390s G7 Xeon 6C X5660 2.8 GHz, nVidia Fermi, Infiniband QDR
13        82      891.88    160.00      CINECA / SCS - SuperComputing Solution                          iDataPlex DX360M3, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2070
14        256     891.87    76.25       Forschungszentrum Juelich (FZJ)                                 iDataPlex DX360M3, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2070
15        61      889.19    198.72      Sandia National Laboratories                                    Xtreme-X GreenBlade GB512X, Xeon E5 (Sandy Bridge - EP) 8C 2.60 GHz, Infiniband QDR
32        1       830.18    12659.89    RIKEN Advanced Institute for Computational Science (AICS)       K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
47        2       635.15    4040.00     National Supercomputing Center in Tianjin                       NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
149       3       253.09    6950.00     DOE/SC/Oak Ridge National Laboratory                            Cray XT5-HE, Opteron 6-core 2.6 GHz
56        4       492.64    2580.00     National Supercomputing Centre in Shenzhen (NSCS)               Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050
Green/Top 500, November 2011

[Scatter plot: Top500 rank vs. Mflops/watt, with bands at 100-500 Mflops/watt, 500-1000 Mflops/watt, and >1 Gflop/watt; labeled points include the IBM and NNSA Blue Gene/Q systems, Nagasaki U. (Intel i5, ATI Radeon GPU), and BSC (Xeon 6C, NVIDIA 2090 GPU)]

Mflops/watt   MWatts/Exaflop
2026.48       493
1689.86       592
1378.32       726
1266.26       790
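The right-hand column is just the reciprocal of the left, scaled to an exaflop: 1 Eflop/s = 10^12 Mflop/s, so MW = 10^6 / (Mflops/W). A minimal check in C:

    #include <stdio.h>

    /* MW needed to sustain 1 Eflop/s at a given efficiency:
       1 Eflop/s = 1e12 Mflop/s, so watts = 1e12 / (Mflops/W),
       i.e. megawatts = 1e6 / (Mflops/W). */
    int main(void) {
        double eff[] = { 2026.48, 1689.86, 1378.32, 1266.26 };
        for (int i = 0; i < 4; i++)
            printf ("%8.2f Mflops/W -> %4.0f MW per Eflop/s\n",
                    eff[i], 1e6 / eff[i]);
        return 0;
    }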