"Future Exascale Supercomputers"
Mexico DF, November 2011
Prof. Mateo Valero, Director, Barcelona Supercomputing Center
Top10 (November 2011)

Rank  Site                                                        Computer                                                                            Procs    Rmax (Gflop/s)  Rpeak (Gflop/s)
1     RIKEN Advanced Institute for Computational Science (AICS)   Fujitsu K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect                       705024   10510000        11280384
2     Tianjin, China                                              Xeon X5670 + NVIDIA                                                                 186368   2566000         4701000
3     Oak Ridge Nat. Lab.                                         Cray XT5, 6 cores                                                                   224162   1759000         2331000
4     Shenzhen, China                                             Xeon X5670 + NVIDIA                                                                 120640   1271000         2984300
5     GSIC Center, Tokyo                                          Xeon X5670 + NVIDIA                                                                 73278    1192000         2287630
6     DOE/NNSA/LANL/SNL                                           Cray XE6, 8-core 2.4 GHz                                                            142272   1110000         1365811
7     NASA/Ames Research Center/NAS                               SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon 5570/5670 2.93 GHz, Infiniband     111104   1088000         1315328
8     DOE/SC/LBNL/NERSC                                           Cray XE6, 12 cores                                                                  153408   1054000         1288627
9     Commissariat a l'Energie Atomique (CEA)                     Bull bullx super-node S6010/S6030                                                   138368   1050000         1254550
10    DOE/NNSA/LANL                                               QS22/LS21 Cluster, PowerXCell 8i / Opteron, Infiniband                              122400   1042000         1375776
Parallel Systems

[Diagram: a parallel system as a set of nodes connected by an interconnect (Myrinet, IB, GE, 3D torus, tree, …); each node is an SMP with memory and one or more multicore chips, linked internally by a network-on-chip (bus, ring, direct, …)]

● homogeneous multicore (BlueGene-Q chip)
● heterogeneous multicore
● general-purpose accelerator (e.g. Cell)
● GPU
● FPGA
● ASIC (e.g. Anton for MD)
Riken's Fujitsu K with SPARC64 VIIIfx
● Homogeneous architecture
● Compute node:
  ● One SPARC64 VIIIfx processor: 2 GHz, 8 cores per chip, 128 Gigaflops per chip
  ● 16 GB memory per node
● Number of nodes and cores:
  ● 864 cabinets * 102 compute nodes/cabinet * (1 socket * 8 CPU cores) = 705024 cores …. 50 by 60 meters
● Peak performance (DP):
  ● 705024 cores * 16 Gflops per core = 11280384 Gflops = 11.28 Pflops
● Linpack: 10510 Tflops (10.51 Pflops), 93% efficiency. Matrix: more than 13725120 rows!! 29 hours and 28 minutes
● Power consumption 12.6 MWatt, 0.8 Gigaflops/W
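The peak figure follows directly from the counts above; a minimal sketch in C, using only the numbers given on this slide:

    #include <stdio.h>

    int main(void) {
        /* Figures from the slide: 864 cabinets, 102 compute nodes per
           cabinet, 1 socket per node, 8 cores per socket, 16 Gflops
           per core (double precision). */
        long cores = 864L * 102 * 1 * 8;        /* = 705024 */
        double peak_gflops = cores * 16.0;      /* = 11280384 Gflops */
        double linpack_gflops = 10510000.0;     /* Rmax from the Top10 list */

        printf("cores = %ld, peak = %.2f Pflops, HPL efficiency = %.1f%%\n",
               cores, peak_gflops / 1e6, 100.0 * linpack_gflops / peak_gflops);
        return 0;
    }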
Looking at the Gordon Bell Prize
● 1 Gflop/s; 1988; Cray Y-MP; 8 processors
  ● Static finite element analysis
● 1 Tflop/s; 1998; Cray T3E; 1024 processors
  ● Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
● 1 Pflop/s; 2008; Cray XT5; 1.5x10^5 processors
  ● Superconductive materials
● 1 Eflop/s; ~2018; ?; 1x10^8 processors?? (10^9 threads)

(Thanks to Jack Dongarra)
Nvidia GPU instruction execution

[Diagram: instruction issue across multiprocessors MP1-MP4: instruction 1, instruction 2, a long-latency operation (3), instruction 4]
Potential System Architecture for Exascale Supercomputers

System attribute            2010        "2015"               "2018"                 Difference 2010-18
System peak                 2 Pflop/s   200 Pflop/s          1 Eflop/s              O(1000)
Power                       6 MW        15 MW                ~20 MW
System memory               0.3 PB      5 PB                 32-64 PB               O(100)
Node performance            125 GF      0.5 TF / 7 TF        1 TF / 10 TF           O(10) - O(100)
Node memory BW              25 GB/s     0.1 TB/s / 1 TB/s    0.4 TB/s / 4 TB/s      O(100)
Node concurrency            12          O(100) / O(1,000)    O(1,000) / O(10,000)   O(100) - O(1000)
Total concurrency           225,000     O(10^8)              O(10^9)                O(10,000)
Total node interconnect BW  1.5 GB/s    20 GB/s              200 GB/s               O(100)
MTTI                        days        O(1 day)             O(1 day)               -O(10)

(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
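One way to read the total-concurrency row: with core clocks staying at roughly O(1 GHz), sustaining 1 Eflop/s = 10^18 flop/s requires about 10^18 / 10^9 = 10^9 operations in flight on every cycle, which is where the O(10^9) entry, and the 10^9 threads on the Gordon Bell slide, come from.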
2. Towards faster airplane design

Boeing: number of wing prototypes prepared for wind-tunnel testing

Date                1980        1995   2005
Airplane            B757/B767   B777   B787
# wing prototypes   77          11     11

Plateau due to RANS limitations. Further decrease expected from LES with exaflop systems.
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Design of the Airbus 380 [image slide]
Airbus: "More simulation, less testsMore simulation, less tests“
From A380A380 to A350A350
2. To faster air plane design
- 40%40% less wind-tunnel days - 25%25% saving in aerodynamics development time - 20%20% saving on wind-tunnel tests cost
th k t HPC bl d CFD i ll i hi h d i idi
11Mexico DF, November, 2011
Acknowledgements: E. CHAPUT (AIRBUS)
thanks to HPC-enabled CFD runs, especially in high-speed regime, providing even better representation of aerodynamics phenomenon turned into better design choices.
EESI Final Conference10-11 Oct. 2011, Barcelona
2. Oil industry [image slide]
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Design of ITER: TOKAMAK (JET) [image slide]

Fundamental Sciences
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Materials: a new path to competitiveness

● On-demand materials for effective commercial use
● Conductivity: energy-loss reduction
● Lifetime: corrosion protection, e.g. chrome
● Fissures: safety insurance from molecular design
● Optimisation of materials / lubricants: less friction, longer lifetime, fewer energy losses

Industrial need to speed up simulation from months to days.
From all-atom to multi-scale: exascale enables simulation of larger and more realistic systems and devices.
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Life Sciences and Health

Scales of simulation: Atom → Small molecule → Macromolecule → Cell → Tissue → Organ → Population
(EESI Final Conference, 10-11 Oct. 2011, Barcelona)
Supercomputing, theory and experimentation [image slides, shown in Spanish and English; courtesy of IBM]
Holistic approach … towards exaflop

[Diagram: a holistic view spanning applications, job scheduling, programming model, run time, interconnection, and processor/node architecture, with concerns such as dependencies, resource awareness, moldability, computational complexity, user satisfaction, address space, work generation, asynchronous algorithms, load balancing, concurrency extraction, NIC design, hw counters, core structure, locality optimization, memory subsystem, run-time support, topology and routing, and external contention]
10+ Pflop/s systems planned

● Fujitsu Kei
  ● 80,000 8-core SPARC64 VIIIfx processors at 2 GHz (16 Gflops/core, 58 watts, 3.2 Gflops/watt), 16 GB/node, 1 PB memory, 6D mesh-torus, 10 Pflops
● Cray's Titan at DOE, Oak Ridge National Laboratory
  ● Hybrid system with Nvidia GPUs: 1 Pflop/s in 2011, 20 Pflop/s in 2012, late-2011 prototype
  ● $100 million
10+ Pflop/s systems planned

● IBM Blue Waters at Illinois
  ● 40,000 8-core Power7, 1 PB memory, 18 PB disk, 500 PB archival storage, 10 Pflop/s, 2012, $200 million
● IBM Blue Gene/Q systems:
  ● Mira to DOE, Argonne National Lab: 49,000 nodes, 16-core Power A2 processor (1.6 GHz), 750 K cores, 750 TB memory, 70 PB disk, 5D torus, 10 Pflop/s
  ● Sequoia to Lawrence Livermore National Lab: 98,304 nodes (96 racks), 16-core A2 processor, 1.6 M cores (1 GB/core), 1.6 petabytes memory, 6 MWatt, 3 Gflops/watt, 20 Pflop/s, 2012

Heterogeneous, distributed-memory GigaHz x KiloCore x MegaNode system (10^9 Hz x 10^3 cores/node x 10^6 nodes ≈ 10^18 flop/s)
Japan plan for exascale

2012: K Machine (10 PF) → 2015: "10K" Machine (100 PF) → 2018-2020: "100K" Machine (Exaflops)

Feasibility study (2012-2013); Exascale project (2014-2020); post-petascale projects
[Two image slides: thanks to S. Borkar, Intel]
Nvidia: Chip for the Exaflop Computer [image slide]
Nvidia: Node for the Exaflop Computer [image slide]
Exascale Supercomputer [image slide]
(Thanks to Bill Dally)
BSC-CNS: International Initiatives (IESP)

● Build an international plan for developing the next generation of open-source software for scientific high-performance computing
● Improve the world's simulation and modeling capability by improving the coordination and development of the HPC software environment
Back to Babel?

"Now the whole earth had one language and the same words" … (Book of Genesis)

The computer age: Fortran & MPI

…"Come, let us make bricks, and burn them thoroughly."…
…"Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves"…

And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."

Today's tower of languages: Fortress, StarSs, OpenMP, MPI, X10, Sequoia, CUDA, Sisal, CAF, SDK, UPC, Cilk++, Chapel, HPF, ALF, RapidMind

(Thanks to Jesus Labarta)
[Image: Picasso -- Don Quixote]

"You will see…. in 400 years from now people will get crazy": parallel programming models, multicore/manycore architectures, a new generation of programmers, new usage models.

Dr. Avi Mendelson (Microsoft). Keynote at ISC-2007
Different models of computation …

● The dream of automatically parallelizing compilers has not come true …
● … so the programmer needs to express opportunities for parallel execution in the application
  ● OpenMP 2.5: nested fork-join, SPMD
  ● OpenMP 3.0: DAG – data flow
● And … asynchrony (MPI and OpenMP are too synchronous):
  ● Collectives/barriers multiply the effects of microscopic load imbalance, OS noise, …
  ● Huge lookahead & reuse …. latency/EBW/scheduling
StarSs: … generates task graph at run time …

    #pragma css task input(A, B) output(C)
    void vadd3 (float A[BS], float B[BS], float C[BS]);

    #pragma css task input(sum, A) output(B)
    void scale_add (float sum, float A[BS], float B[BS]);

    #pragma css task input(A) inout(sum)
    void accum (float A[BS], float *sum);

    for (i = 0; i < N; i += BS)      // C = A+B
       vadd3 (&A[i], &B[i], &C[i]);
    ...
    for (i = 0; i < N; i += BS)      // sum(C[i])
       accum (&C[i], &sum);
    ...
    for (i = 0; i < N; i += BS)      // B = sum*E
       scale_add (sum, &E[i], &B[i]);
    ...
    for (i = 0; i < N; i += BS)      // A = C+D
       vadd3 (&C[i], &D[i], &A[i]);
    ...
    for (i = 0; i < N; i += BS)      // E = C+F
       vadd3 (&C[i], &F[i], &E[i]);

[Diagram: the resulting task graph, 20 task instances (4 per loop) ordered by their data dependences]
StarSs: … and executes as efficiently as possible …

[Same code and task-graph annotations as above; diagram: the runtime executes the graph, scheduling tasks in waves as their dependences are satisfied]
StarSs: … benefiting from data access information

● Flat global address space seen by the programmer
● Flexibility to dynamically traverse the dataflow graph, "optimizing":
  ● Concurrency. Critical path
  ● Memory access
● Opportunities for:
  ● Prefetch
  ● Reuse
  ● Eliminating antidependences (renaming)
  ● Replication management
StarSs: enabler for exascale

● Can exploit very unstructured parallelism
  ● Not just loop/data parallelism
  ● Easy to change structure
● Supports large amounts of lookahead
  ● Not stalling for dependence satisfaction
  ● Allows locality optimizations to tolerate latency
  ● Overlap data transfers, prefetch, reuse
● Support for heterogeneity
  ● Any # and combination of CPUs, GPUs
  ● Including autotuning
● Malleability: decouple program from resources
  ● Allowing dynamic resource allocation and load balance
  ● Tolerate noise
● Nicely hybridizes into MPI/StarSs (see the sketch after this list)
  ● Propagates the node-level dataflow characteristics to large scale
  ● Overlap communication and computation
  ● A chance against Amdahl's law
● Data flow; asynchrony
  ● The potential is there; can blame the runtime
  ● Compatible with proprietary low-level technologies
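To make the hybridization point concrete, here is a generic sketch (an editor's illustration, not BSC's actual MPI/SMPSs code; it assumes MPI initialized with MPI_THREAD_MULTIPLE and hypothetical helpers compute_interior/compute_boundary): wrapping the halo exchange in a task lets the runtime overlap it with interior computation.

    #include <mpi.h>

    void compute_interior (double *u, int n);         /* touches u[1..n-2] only */
    void compute_boundary (double *u, double *halo);  /* needs the received halo */

    void step (double *u, double *halo, int n, int left, int right)
    {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: halo[0])
            {   /* exchange edges with neighbours; blocks only this task */
                MPI_Sendrecv (&u[n-1], 1, MPI_DOUBLE, right, 0,
                              &halo[0], 1, MPI_DOUBLE, left, 0,
                              MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }

            #pragma omp task                          /* independent of the halo */
            compute_interior (u, n);

            #pragma omp task depend(in: halo[0])      /* waits only for the exchange */
            compute_boundary (u, halo);
        }
    }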
StarSs: history/strategy/versions

Basic SMPSs:
● C/Fortran
● Must provide directionality for every argument
● Contiguous, non partially overlapped arguments
● Renaming
● Several schedulers (priority, locality, …)
● No nesting
● MPI/SMPSs optimizations

SMPSs regions:
● C, no Fortran
● Must provide directionality for every argument
● Overlapping & strided arguments; reshaping of strided accesses
● Priority- and locality-aware scheduling

OMPSs:
● C/C++; Fortran under development
● OpenMP compatibility (~)
● Dependences based only on arguments with directionality
● Contiguous arguments (address used as sentinels)
● Separate dependences/transfers
● Inlined/outlined pragmas
● Nesting
● SMP/GPU/Cluster
● No renaming
● Several schedulers: "simple" locality-aware scheduler, …
Multidisciplinary top-down approach

[Diagram: investigate solutions to problems spanning applications and algorithms, programming models, performance analysis and prediction tools, load balancing, power, processor and node architecture, and interconnect]

[Chart: "Computer Center Power Projections", 2005-2011: power (MW) rising from a few MW toward 80-90 MW, split between cooling and computers, with annual cost annotations growing from $3M through $9M, $17M, and $23M to $31M]
Green/Top 500, November 2011

Green500  Top500  Mflops/W  Power (kW)  Site                                                            Computer
1         64      2026.48   85.12       IBM - Rochester                                                 BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
2         65      2026.48   85.12       IBM Thomas J. Watson Research Center                            BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
3         29      1996.09   170.25      IBM - Rochester                                                 BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
4         17      1988.56   340.50      DOE/NNSA/LLNL                                                   BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
5         284     1689.86   38.67       IBM Thomas J. Watson Research Center                            NNSA/SC Blue Gene/Q Prototype 1
6         328     1378.32   47.05       Nagasaki University                                             DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR
7         114     1266.26   81.50       Barcelona Supercomputing Center                                 Bullx B505, Xeon E5649 6C 2.53 GHz, Infiniband QDR, NVIDIA 2090
8         102     1010.11   108.80      TGCC / GENCI                                                    Curie Hybrid Nodes - Bullx B505, Xeon E5640 2.67 GHz, Infiniband QDR
9         21      963.70    515.20      Institute of Process Engineering, Chinese Academy of Sciences   Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050
10        5       958.35    1243.80     GSIC Center, Tokyo Institute of Technology                      HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows
11        96      928.96    126.27      Virginia Tech                                                   SuperServer 2026GT-TRF, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2050
12        111     901.54    117.91      Georgia Institute of Technology                                 HP ProLiant SL390s G7 Xeon 6C X5660 2.8 GHz, nVidia Fermi, Infiniband QDR
13        82      891.88    160.00      CINECA / SCS - SuperComputing Solution                          iDataPlex DX360M3, Xeon E5645 6C 2.40 GHz, Infiniband QDR, NVIDIA 2070
14        256     891.87    76.25       Forschungszentrum Juelich (FZJ)                                 iDataPlex DX360M3, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2070
15        61      889.19    198.72      Sandia National Laboratories                                    Xtreme-X GreenBlade GB512X, Xeon E5 (Sandy Bridge - EP) 8C 2.60 GHz, Infiniband QDR
32        1       830.18    12659.89    RIKEN Advanced Institute for Computational Science (AICS)       K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect
47        2       635.15    4040.00     National Supercomputing Center in Tianjin                       NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050
149       3       253.09    6950.00     DOE/SC/Oak Ridge National Laboratory                            Cray XT5-HE, Opteron 6-core 2.6 GHz
56        4       492.64    2580.00     National Supercomputing Centre in Shenzhen (NSCS)               Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050
Green/Top 500, November 2011

[Scatter plot: Top500 rank vs. Mflops/watt, with bands at 100-500 Mflops/watt, 500-1000 Mflops/watt, and >1 Gflop/watt; labeled points include the IBM and NNSA Blue Gene/Q systems, Nagasaki U. (Intel i5, ATI Radeon GPU), and BSC (Xeon 6C, NVIDIA 2090 GPU)]

Mflops/watt   MWatts/Exaflop
2026.48       493
1689.86       592
1378.32       726
1266.26       790
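The right-hand column is just the reciprocal of the left, scaled to an exaflop: 1 Eflop/s = 10^12 Mflop/s, so MW = 10^6 / (Mflops/W). A minimal check in C:

    #include <stdio.h>

    /* MW needed to sustain 1 Eflop/s at a given efficiency:
       1 Eflop/s = 1e12 Mflop/s, so watts = 1e12 / (Mflops/W),
       i.e. megawatts = 1e6 / (Mflops/W). */
    int main(void) {
        double eff[] = { 2026.48, 1689.86, 1378.32, 1266.26 };
        for (int i = 0; i < 4; i++)
            printf ("%8.2f Mflops/W -> %4.0f MW per Eflop/s\n",
                    eff[i], 1e6 / eff[i]);
        return 0;
    }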