Post on 13-Jan-2016
Combining the Power of Computer and Computational
Sciences to Fly to Peta-Scale
— a Case Study —
Hiroshi Nakashima
Academic Center for Computing and Media Studies
Kyoto University
special thanks to: Y. Omura & H. Usui (RISH, Kyoto U.)
Contents
Introduction: Combining CS2 Power
  Why Need to Fly to Peta-Scale?
  What Kind of Power to Be Combined?
Case Study: Plasma Simulation on DM Systems
  Why Plasma Simulation?
  Why for DM Systems?
  How for DM Systems?
  How Efficient?
Fly from Case Study
  Took off Successfully?
  How Can We Fly Higher?
Conclusions
Why Need to Combine CS2 Power?
Fly to Peta: How High? (1/2)
T2K Open Supercomputer in Kyoto: Rpeak/Rmax = 61.2/50.5 TFLOPS (#34)
  core: (mul+add) x 2 + (L1+L2)
  socket: core x 4 + L3
  node: (socket + mem.bank) x 4 + IB x 4
  node group: node x 6 + 24p-sw x 2
  system: node-group x 70 + 288p-sw x 6 + ...
already large enough (16 x 416 nodes = 6656 cores)
already layered deeply and complicatedly enough (core → socket → node → node-group → system)
Why Need to Combine CS2 Power?
Fly to Peta: How High? (2/2)
Peta-scale system should be:
  much larger (1,000,000 cores ≈ 6656 x 150)
  much more deeply/complicatedly layered (core → core-group → socket → socket-group → node → node-group → node-supergroup → system)
BTW, how large is Peta?
  1 Peta meter > 0.1 light-year
  1 Peta second > 30 million years
  1 Peta kg > 1/2 x Deimos
  1 Peta Hz > violet
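The sizes quoted on these two slides follow from the hierarchy by plain multiplication. The sketch below recomputes them using only figures stated on the slides. The variable names are illustrative, and reading "(mul+add) x 2" as four floating-point operations per core per cycle is an assumption made here.

```python
# Figures taken from the slides; names are illustrative only.
CORES_PER_SOCKET = 4
SOCKETS_PER_NODE = 4
NODES = 416
RPEAK_TFLOPS = 61.2
RMAX_TFLOPS = 50.5
FLOPS_PER_CYCLE = 4        # assumption: "(mul+add) x 2" means 4 flops per cycle
PETA_SCALE_FACTOR = 150    # the "6656 x 150" on the slide

cores_per_node = CORES_PER_SOCKET * SOCKETS_PER_NODE   # 16
total_cores = cores_per_node * NODES                   # 6656
peta_cores = total_cores * PETA_SCALE_FACTOR           # close to one million

# Fraction of the theoretical peak that was actually measured.
efficiency = RMAX_TFLOPS / RPEAK_TFLOPS

# Clock rate implied by Rpeak if every core retires FLOPS_PER_CYCLE each cycle.
implied_clock_ghz = RPEAK_TFLOPS * 1e12 / (total_cores * FLOPS_PER_CYCLE) / 1e9

print(total_cores, peta_cores)
print(f"Rmax/Rpeak = {efficiency:.3f}, implied clock = {implied_clock_ghz:.2f} GHz")
```

The factor of 150 lands slightly under one million cores, which is why the slide's figure is best read as approximate.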
Why Need to Combine CS2 Power?
What Are Combined to Fly?
Computational scientists have deep knowledge of:
  physics, chemistry, biology, ...
  their own problems, algorithms, programs, ...
  (sometimes) their own supercomputers
and (often?) have Nature/Science papers.
Computer scientists have deep knowledge of:
  a wide variety of computers, software, tools, ...
  a wide variety of algorithms, techniques, tricks, ...
  (sometimes) a few scientific problems
but never dream of authoring a Nature/Science paper.
Combining the two yields:
  a much more efficient way to fully exploit peta-scale computing power
  more Nature/Science papers and a chance to win a Nobel Prize
  a chance to co-author a Nature/Science paper and to attend the Nobel Prize Ceremony
Case Study: Plasma Simulation on DM
Why Plasma Simulation?
power/money hungry large scale (128 cores, 1 TB, 1.28 TFlops) shared memory nodes
A big user group of plasma simulation insisted that our new system should include this power/money hungry subsystem for their memory hungry SM-parallel application.
I failed to persuade them to build an Open-Supercomputer-only system. So I swore revenge on them by coding a much more efficient DM-parallel program to run on Open Supercomputer.
Case Study: Plasma Simulation on DM
Why for DM Systems? (1/2)
a large number of (e.g. > 1 billion) charged particles
a large scale (e.g. 2000x2000x2000 grid) electromagnetic field (e.g. magnetosphere)
simulate particle movement by
Case Study: Plasma Simulation on DM
Why for DM Systems? (2/2)
particle parallelization (only)
  very simple, esp. on SM
  large #particle: memory short in SM
  large #grid-point: memory short even in DM
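To see why particle-only parallelization runs out of memory, it helps to put a rough number on the field, which each process would have to hold in full under that scheme. The estimate below is a sketch only. The grid and particle counts are the slide's examples, but six field components per grid point, six values per particle, and 8-byte values are assumptions made here for illustration.

```python
GRID = (2000, 2000, 2000)   # example grid from the slide
N_PARTICLES = 10**9         # "> 1 billion" on the slide; lower bound used here
FIELD_COMPONENTS = 6        # assumption: 3 electric + 3 magnetic components
PARTICLE_VALUES = 6         # assumption: 3 position + 3 velocity components
BYTES_PER_VALUE = 8         # assumption: double precision

def field_bytes(grid=GRID):
    """Memory for one full copy of the field."""
    nx, ny, nz = grid
    return nx * ny * nz * FIELD_COMPONENTS * BYTES_PER_VALUE

def particle_bytes(n=N_PARTICLES):
    """Memory for all particles together."""
    return n * PARTICLE_VALUES * BYTES_PER_VALUE

def per_process_bytes(n_procs):
    """Particle-only parallelization: particles are split, the field is replicated."""
    return field_bytes() + particle_bytes() // n_procs

GB = 10**9
print(f"field: {field_bytes() / GB:.0f} GB, particles: {particle_bytes() / GB:.0f} GB")
for procs in (1, 16, 128):
    print(procs, f"{per_process_bytes(procs) / GB:.1f} GB per process")
```

Under these assumptions the per-process footprint never drops below the size of the field, however many processes share the particles.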
Case Study: Plasma Simulation on DM
How for DM Systems? (1/3)
OhHelp: One-handed Help
primary subspaces
  uniform block decomposition
  well-balanced: #particles-in-subspace ≤ (#p / #nodes) x (1 + α)
  simulate primary particles, neighboring comm. only
secondary subspaces
  each node helps another node having a dense subspace
result
  balanced #particles
  balanced subspace size
  simple boundary comp/comm
  well-balanced → stable secondary space assignment
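The well-balanced condition on this slide bounds every subspace's particle count by the average per node times one plus a tolerance. A minimal sketch of that test follows. The function name, the parameter name `alpha`, and the sample counts are all illustrative rather than taken from the OhHelp code.

```python
def well_balanced(counts, alpha):
    """True if no subspace holds more than (1 + alpha) times the average."""
    limit = sum(counts) / len(counts) * (1 + alpha)
    return max(counts) <= limit

uniform = [100, 95, 105, 100]     # particles per primary subspace, nearly even
clustered = [10, 20, 30, 340]     # same total, concentrated in one subspace

print(well_balanced(uniform, alpha=0.1))
print(well_balanced(clustered, alpha=0.1))
```

When the test passes, primary subspaces alone suffice. When it fails, nodes with sparse subspaces take on a secondary subspace to help a dense one.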
Case Study: Plasma Simulation on DM
How for DM Systems? (2/3)
Secondary Space Assignment
33 00 32 01 30 10 13 03 23 20 31 02 11 21 12 22
move p from heaviest to lightest so that lightest has av. #p
give p even if becoming less than average; get from somebody afterward
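One possible reading of the rule above is a greedy loop: the lightest node takes particles from the heaviest until it reaches the average, then is settled; the heaviest stays in the pool and may itself need help later. The sketch below implements that reading, which is not the actual OhHelp procedure. Function names are hypothetical, and balancing to the exact average (rather than to a tolerance limit) is a simplification chosen here.

```python
def assign_secondaries(primary_counts):
    """Return {helper: (helped_node, n_particles)}, at most one entry per helper."""
    n = len(primary_counts)
    quota, extra = divmod(sum(primary_counts.values()), n)
    remaining = dict(primary_counts)   # primaries each unsettled node still holds
    helps = {}
    for step in range(n - 1):
        # The first `extra` settled nodes absorb the remainder of the division.
        target = quota + 1 if step < extra else quota
        lightest = min(remaining, key=remaining.get)
        need = target - remaining.pop(lightest)
        if need > 0:
            heaviest = max(remaining, key=remaining.get)
            remaining[heaviest] -= need   # may drop below average for now
            helps[lightest] = (heaviest, need)
    return helps

def final_loads(primary_counts, helps):
    """Particles each node ends up simulating (primary plus secondary)."""
    loads = dict(primary_counts)
    for helper, (helped, n) in helps.items():
        loads[helper] += n
        loads[helped] -= n
    return loads

counts = {"00": 0, "01": 5, "10": 7}
helps = assign_secondaries(counts)
print(helps)
print(final_loads(counts, helps))
```

In the example, node 10 first gives four particles to node 00 and falls below the average, then receives one particle from node 01. Each node helps at most one other node, matching the "one-handed" idea.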
Case Study: Plasma Simulation on DM
How for DM Systems? (3/3)
Well-Balancing Check with Primary/Secondary Tree
leaf: must have all primaries; covers secondaries up to well-balancing limit
non-leaf: must have all primaries not covered by children; covers secondaries up to well-balancing limit
check recursively from leaves to root
OK if no overflow detected
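The leaves-to-root check can be sketched as a recursion that passes spare capacity upward. This is an interpretation of the slide, not the OhHelp source. It assumes that a node's children are the nodes helping with its primary subspace and that `limit` is the well-balancing limit per node. All names are hypothetical.

```python
def check_well_balanced(tree, primaries, limit, root):
    """Bottom-up check; True if no node overflows `limit`.

    tree[v] lists the children of v, i.e. the helpers whose secondary
    subspace is v's primary subspace.
    """
    def spare(v):
        # Capacity the children can devote to v's primary particles.
        covered = 0
        for child in tree.get(v, []):
            s = spare(child)
            if s is None:
                return None              # overflow somewhere below
            covered += s
        own = max(0, primaries[v] - covered)   # primaries v must keep itself
        if own > limit:
            return None                  # overflow detected at v
        return limit - own               # room left for v's own secondary
    return spare(root) is not None

tree = {"22": ["12", "21"], "12": ["11"]}
primaries = {"22": 30, "12": 12, "21": 4, "11": 2}

print(check_well_balanced(tree, primaries, limit=12, root="22"))
print(check_well_balanced(tree, primaries, limit=11, root="22"))
```

With 48 particles on four nodes the average is 12, so a limit of 12 just fits, while a limit of 11 makes the root overflow.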
Case Study: Plasma Simulation on DM
How Efficient?
performance @ 16-128 proc on HPC2500
[Figure: throughput (10^6 particle/sec, 0 to 120) vs. # of processes (0 to 128); series: balanced, unbalanced, original; annotations: x3.20, x11.71, x8.76, x10.7]
T2K Open Supercomputer, 4 nodes (64 cores): x1.66, x4.02
Fly from Case Study
Took off Successfully?
Plasma simulation group now:
  appreciates OhHelp and Open Supercomputer (but has not published Nature/Science papers yet).
  is planning to port codes to Open Supercomputer.
  hopes for our help in recoding a variety of simulators.
We supercomputer guys now:
  are happy with accomplishing the revenge.
  are generously pursuing cooperative research with them (hoping at least to have an SC paper).
  but cannot find time to do everything they want.
Fly from Case Study
How Can We Fly Higher?
Plasma guys have a wide variety of simulators. Other guys have other varieties of other simulators.
We supercomputer guys have OhHelp, which needs to be adapted to each simulator by modifying not only itself but also the simulator. Expectedly we will find other computer-scientific tricks for other types of simulators.
Parallelization Method Library: generated from a method skeleton and an AP-specific stub, and linked to simulators.
Conclusions
Flying to Peta-scale needs CS2 collaboration
  offering various (non-numerical) tricks from computer guys.
  taking the opportunity to play in larger, real-world application fields from computational guys.
Took off from OhHelp
  simple but efficient load balancing for plasma simulations.
  (non-numerical) computer-scientific tricks can greatly improve numerical simulations.
  fly higher by parallelization method libraries.
Other ways to elevate
  adaptation of linear equation solvers to applications w.r.t. memory layout.
  parallel script programming language for large parameter space exploration.