Updates of AICS and the Next Step for Post-Petascale Computing in Japan

Mitsuhisa Sato
University of Tsukuba
Team Leader, Programming Environment Research Team,
Advanced Institute for Computational Science (AICS), RIKEN
RIKEN Advanced Institute for Computational Science (AICS)

The institute was established at the site of the K computer in Kobe (started in October 2010).

Missions:
• Run the K computer efficiently for users across a wide range of research areas
• Carry out leading-edge computational science and technology, and contribute to a center of excellence (COE) for computational science in Japan
• Propose future directions of HPC in Japan and lead their realization

Organization:
• Operations division, to run and manage the K computer
• Research division: started with 5 computational science research teams and 3 computer science research teams; in 2012, expanded to 10 computational science and 6 computer science research teams
• Promotes strong collaboration between computational and computer scientists, working together with the core organizations of each field

http://www.aics.riken.jp/
Divisions:
• AICS Policy Planning Division
• AICS Research Support Division
• Research Division
• Operations and Computer Technologies Division
Research Division (16 teams + 3 units)
• System Software Research Team
• Programming Environment Research Team
• Processor Research Team
• Large-scale Parallel Numerical Computing Technology Research Team
• HPC Usability Research Team
• Field Theory Research Team
• Discrete Event Simulation Research Team
• Computational Molecular Science Research Team
• Computational Materials Science Research Team
• Computational Biophysics Research Team
• Particle Simulator Research Team
• Computational Climate Science Research Team
• Complex Phenomena Unified Simulation Research Team
• HPC Programming Framework Research Team
• Advanced Visualization Research Team
• Data Assimilation Research Team
• Computational Chemistry Research Unit
• Computational Disaster Mitigation and Reduction Research Unit
• Computational Structural Biology Research Unit
The Status of the K Computer
• The first racks of the K computer were delivered to Kobe in September 2010.
• Racks: 864 (+54); compute nodes (CPUs): 82,944 (88,128); number of cores: 663,552 (705,024)
• It achieved its primary target of over 10 petaflops (10.51 PF Linpack at 12.66 MW) in November 2011.
• Installation and adjustment of K were completed, and public use started at the end of September 2012.

[Photo: first delivery, September 28, 2010]
K Computer: Compute Nodes and Network
• Racks: 864 (+54)
• Compute nodes (CPUs): 82,944 (88,128)
• Number of cores: 663,552 (705,024)
• Peak performance: 10.6 PFLOPS (11.28 PF)
• 10.51 PF Linpack (12.66 MW), November 2011
• Memory: 1.33 PB (16 GB/node)
• CPU: SPARC64™ VIIIfx
(Courtesy of FUJITSU Ltd.)
[Figure: compute node block diagram — one 8-core CPU at 128 GFLOPS; each core is SIMD (4 FMA units) at 16 GFLOPS; shared 5 MB L2 cache; 64 GB/s to 16 GB of memory; Tofu links of 5 GB/s (peak) x 2 in each of the x, y, and z directions of the logical 3-D torus network for programming.]
• High-performance/low-power CPU with 8 cores: 128 GFLOPS @ 2 GHz, 58 W
• High-throughput/low-latency torus network (Tofu): logical 3-dimensional torus
• Peak bandwidth: 5 GB/s x 2 for each direction of the logical 3-D torus network
• Bisection bandwidth: 30 TB/s
The K Computer ("京", Kei)
Projects to Organize Users of the K Computer
• SPIRE (Strategic Programs for Innovative Research): a committee in MEXT identified five application areas expected to create breakthroughs using the K computer from a national viewpoint.
• Nationwide High Performance Computing Infrastructure project (HPCI): organizes computing resources and users nationwide, including university supercomputers and the K computer.
SPIRE (Strategic Programs for Innovative Research)

Purpose:
• To produce scientific results as soon as HPCI starts operation
• To establish several core institutes for computational science

Outline of the program:
• Identify five strategic research areas that will contribute results to scientific and social issues
• Nationwide research groups are formed by funding core organizations designated by MEXT
• The groups promote R&D using the K computer and build research structures for their own areas
• 50% of the computing resources of the K computer are dedicated to this program
Five strategic areas of SPIRE:
• Life science / drug manufacture — Toshio YANAGIDA (RIKEN)
• New material / energy creation — Shinji TSUNEYUKI (University of Tokyo)
• Global change prediction for disaster prevention/mitigation — Shiro IMAWAKI (JAMSTEC)
• Monodukuri (manufacturing technology) — Chisachi KATO (University of Tokyo)
• The origin of matter and the universe — Shinya AOKI (University of Tsukuba)

[Figure: multi-scale life phenomena — genome, whole body, proteins, cells, tissues and organs]
Nationwide High Performance Computing Infrastructure Project (HPCI)

Background: after re-evaluation of the project at the change of governing party in 2011, the NGS (next-generation supercomputer) project was restarted as "Creation of the Innovative High Performance Computing Infrastructure (HPCI)".

Building HPCI:
• Provide seamless access to the K computer, other supercomputers, and users' machines
• Set up a large-scale storage system for the K computer and other supercomputers
• Joint selection of proposals for the K computer and other supercomputers

Organizing the HPCI Consortium:
• Organize users and computer centers, and provide proposals/suggestions to the government and related organizations
• Planning and operation of the HPCI system
• Promotion of computational science
• Future supercomputing

[Figure: the HPCI consortium connecting the K computer at AICS, RIKEN with institutional/university computer centers and computational science communities]

The consortium was organized and started in June 2012.
The Conceptual View of HPCI

[Figure: the consortium and users connected to the K computer, large-scale storage, and supercomputers in universities]

HPCI is a comprehensive advanced computing infrastructure in which supercomputers and large-scale storage are connected through a high-speed network.
Computing Resources in HPCI (as of June 2012)
• AICS, RIKEN: K computer (11.28 PF, 1.27 PiB)
• Hokkaido Univ.: SR16000/M1 (172 TF, 22 TB), BS2000 (44 TF, 14 TB)
• Tohoku Univ.: SX-9 (29.4 TF, 18 TB), Express5800 (1.74 TF, 3 TB)
• Univ. of Tokyo: FX10 (1.13 PF, 150 TB), SR16000/M1 (54.9 TF, 10.94 TB), T2K (140 TF, 31.25 TB)
• Univ. of Tsukuba: T2K (95.4 TF, 20 TB), HA-PACS (802 TF, 34.3 TB), FIRST (36.1 TF, 1.6 TB)
• Tokyo Institute of Technology: TSUBAME2.0 (2.4 PF, 100 TB)
• Nagoya Univ.: FX1 (30.72 TF, 24 TB), HX600 (25.6 TF, 10 TB), M9000 (3.84 TF, 3 TB)
• Kyoto Univ.: XE6 (300.8 TF, 59 TB), GreenBlade8000 (242.5 TF, 38 TB), 2548X (10.6 TF, 24 TB)
• Osaka Univ.: SX-9 (16 TF, 10 TB), SX-8R (5.3 TF, 3.3 TB), PC cluster (6.1 TF, 2.0 TB)
• Kyushu Univ.: FX10 (181.6 TF, 24 TB), CX400 (510.1 TF, 184.5 TB), SR16000 L2 (25.3 TF, 5.5 TB)

11.28 PF + 0.757496 PF + 2.21248 PF
Storage System in the First Phase of HPCI

[Figure: sites on the network — Hokkaido University, Tohoku University, University of Tokyo, University of Tsukuba, Tokyo Institute of Technology, Nagoya University, Kyoto University, Osaka University, Kyushu University, and AICS, RIKEN]

• Two hubs, HPCI WEST HUB and HPCI EAST HUB: 12 PB of storage (52 OSS) with an 87-node cluster for data analysis, and 10 PB of storage (30 OSS) with an 88-node cluster for data analysis
• Gfarm2 is used as the global shared file system
HPCI Offers to Computational Science Users…

• Computing resources: researchers can use an appropriate amount of computing resources more effectively and efficiently, on the K computer and university supercomputers
• Authentication: users can access these computers and storage systems with a single-sign-on account
• Storage: users can share large amounts of data, and analyze or visualize results simulated by other researchers on different supercomputers
Users and Jobs
• About 100 active users and 1,000 jobs per day
• Period I: Sep. 2012 - Mar. 2013; Period II: Apr. 2013 - Sep. 2013; Period III: Oct. 2013 - Mar. 2014
Job Properties (1/2)
• Larger jobs (more than 5,000 nodes, i.e. roughly 0.5 PF and up) consume about 40-50% of the resources.
• The used/serviced ratio reaches 80%.
Job Properties (2/2)
• The average sustained performance and job scale are gradually increasing.
Activities for Future HPC Development

[Timeline, FY2011-FY2013: SDHPC WGs (technical discussion by two groups) publish the SDHPC white paper; the "Feasibility Study" (FS) for future HPC follows, with a review; in parallel, discussion on HPC policy and the basic concept leads to a decision on future HPC R&D. We are here now (FY2013).]
The SDHPC White Paper and the "Feasibility Study" Project

WGs were organized to draft the white paper on the Strategic Direction/Development of HPC in Japan, written by young Japanese researchers with senior advisers.

Contents:
• Science roadmap to 2020 and a list of applications for the 2020s
• Four types of hardware architectures identified, with performance in 2018 projected from present technology trends
• The need for further research and development to realize the science roadmap

For the "Feasibility Study" project, 4 research teams were accepted:
• Application study team led by RIKEN AICS (Tomita)
• System study team led by U Tokyo (Ishikawa): next-generation "general-purpose" supercomputer
• System study team led by U Tsukuba (Sato): exascale heterogeneous systems with accelerators
• System study team led by Tohoku U (Kobayashi)

The projects started in July 2012 (1.5 years).
System Requirement Analysis for Target Sciences

• System performance — FLOPS: 800-2500 PFLOPS; memory capacity: 10 TB - 500 PB; memory bandwidth: 0.001 - 1.0 B/F
• Example applications:
  • Small capacity requirement: MD, climate, space physics, …
  • Small bandwidth requirement: quantum chemistry, …
  • High capacity/bandwidth requirement: incompressible fluid dynamics, …
• Interconnection network: not enough analysis has been carried out; some applications need latencies below 1 µs and large bisection bandwidth
• Storage: no very large demand
[Figure: application requirements plotted as required B/F (10^-4 to 10^1) versus required memory capacity in PB (10^-3 to 10^3), clustering into four groups: low BW / middle capacity, high BW / small capacity, high BW / middle capacity, and high BW / high capacity. From the SDHPC white paper.]
Alternatives for Exascale Architecture (from the SDHPC white paper)

Four types of architectures are identified for exascale:
• General Purpose (GP): ordinary CPU-based MPPs, e.g. the K computer, Blue Gene, x86-based PC clusters
• Capacity-Bandwidth oriented (CB): investment in the memory interface rather than computing capability, e.g. vector machines
• Reduced Memory (RM): with embedded (main) memory, e.g. SoC, MDGRAPE-4, Anton
• Compute Oriented (CO): many processing units, e.g. ClearSpeed, GRAPE-DR, GPU?
Mapping of Architectures

[Figure: the four architectures (GP, CB, RM, CO) mapped onto the requirement plane of B/F (10^-4 to 10^1) versus memory capacity in PB (10^-3 to 10^3), with the trade-off axes of FLOPS, memory bandwidth, and memory capacity; CB is bandwidth-oriented, CO compute-oriented, RM reduced-memory, GP general-purpose. From the SDHPC white paper.]
Issues for Exascale Computing

Two important aspects of post-petascale computing:
• Power limitation: less than 20-30 MW
• Strong scaling: fewer than 10^6 nodes (for fault tolerance), which requires more than 10 TFLOPS per node via accelerators and many-core processors

Solution: accelerated computing — by GPGPU, by application-specific accelerators, by future acceleration devices, …
[Figure: simple projection of node count versus peak FLOPS on log-log axes, with PACS-CS (14 TF), T2K-Tsukuba (95 TF), the NGS target (>10 PF), and the K computer; petaflops are reached with 100-1000 nodes, while an exaflops system runs into the limitation on the number of nodes (~10^6).]
Study on Exascale Heterogeneous Systems with Accelerators (U Tsukuba project)

Two keys for exascale computing: power and strong scaling.

We study "exascale" heterogeneous systems with many-core accelerators. We are interested in:
• Architecture of accelerators: core and memory architecture
• Special-purpose functions
• Direct connection between accelerators in a group
• Power estimation and evaluation
• Programming models and computational science applications
• Requirements for the general-purpose part of the system, etc.
[Figure: system organization — nodes connected by a system network, with storage attached; each node couples a general-purpose processor and memory with multiple accelerators; accelerators within a group are directly connected by a dedicated accelerator network; each accelerator chip is an array of cores with local memory, attached to memory controllers over a 2-D mesh.]
PACS-G: A Straw-Man Architecture
• SIMD architecture, for compute-oriented apps (N-body, MD) and stencil apps
• 4096 cores (64x64), 2 FMA @ 1 GHz: 4 GFLOPS x 4096 = 16 TFLOPS/chip
• 2-D mesh (+ broadcast/reduction) on-chip network for stencil apps
• We expect 14 nm technology in the 2018-2020 time frame; chip die size 20 mm x 20 mm
• Mainly working on on-chip memory (512 MB/chip, 128 KB/core), plus module memory of 3D-stacked / Wide I/O DRAM (via 2.5D TSV): bandwidth 1000 GB/s, size 16-32 GB/chip; no external (DIMM/DDR) memory
• 250 W/chip expected; 64K chips for 1 EFLOPS (at peak)
[Figure: PACS-G chip block diagram — a host processor and controller with data and instruction memory; a PE array connected through broadcast memories and communication buffers; a result-reduction network; links to the inter-accelerator network and to 3D-stacked or Wide I/O DRAM.]
• A group of 1024-2048 chips is connected via the accelerator network (inter-chip network, 2-D mesh).
• 25-50 Gbps per link for inter-chip connections: if we extend the on-chip 2-D mesh to the external (2-D mesh) network within a group, we need 200-400 GB/s (= 32 channels x 25-50 Gbps x 2 directions).
• For 50 Gbps data transfer, we may need direct optical interconnects from the chip.
• I/O interface to the host: PCI Express Gen 4 x16 (not enough!)
Programming model: XcalableMP + OpenACC
• Use OpenACC to specify the offloaded fragments of code and the data movement.
• To align data and computation to cores, we use the "template" (virtual index space) concept of XcalableMP, so that we can generate code for each core.
• (Also considering data-parallel languages like C*.)
An Example Implementation (1U rack)

[Figure: board layout — accelerator LSIs in accelerator modules on silicon interposers, with memory modules (HMC or Wide I/O DRAM), optical interconnect modules, connectors to other accelerator units, and PSUs.]
Project organization:
• Joint project with Titech (Makino), Aizu U (Nakazato), RIKEN (Taiji), U Tokyo, KEK, Hiroshima U, and Hitachi as a supercomputer vendor.
• Target apps: QCD in particle physics, tree N-body and HMD in astrophysics, MD in life science, FDM for earthquake simulation, FMO in chemistry, NICAM in climate science.
Team structure:
• Accelerator and network system design (Titech, U Tsukuba, Aizu): basic accelerator architecture (Makino), accelerator network (Boku), processor core architecture (Nakazato)
• Programming model and simulation tools (U Tsukuba): simulator and evaluation tools (Kodama), programming model (Sato)
• Study on implementation and power (Hitachi)
• Application study (U Tsukuba, RIKEN, U Tokyo, KEK, Hiroshima U): particle physics (Kuramashi, Ishikawa, Matsufuru), astrophysics (Umemura, Yoshikawa), nano-material science (Oshiyama), life science (Taiji, Umeda), global climate science (Tanaka, Yashiro), earth science (Okamoto)
• The application and system teams are coupled through the API, the programming model, and simulation, with feedback in both directions.
Current Status and Schedule

We are now working on performance estimation through a co-design process:
• 2012 (done): QCD, N-body, MD, HMD
• 2013: earthquake simulation, NICAM (climate), FMO (chemistry)
• When all data fits in on-chip memory, the ratio is 4 B/F, with 1 TB of total memory per group; when data fits in module memory, the ratio is 0.05 B/F, with 32 TB per group
• Also developing simulators (clock-level/instruction-level) for more precise and quantitative performance evaluation
• Compiler development (XMP and OpenACC)
• (Re-)design and investigation of the network topology: is a 2-D mesh sufficient, or are there better alternatives?
• Code development for apps using host and accelerator, including I/O
• More precise and detailed estimation of power consumption
AICS Development of International Partnerships
• NCSA, US, under MoU (1st meeting in April 2013)
• NCI (National Computational Infrastructure), Australia, under MoU
• JSC, Germany, under MoU
• SISSA (Scuola Internazionale Superiore di Studi Avanzati), Italy, under agreement
• University of Maryland, US, with an agreement for collaboration on modeling and data assimilation
• Maison de la Simulation (INRIA/CNRS), France, under discussion
• ANL, US, under discussion

Recently, Japan and the US agreed on collaboration on system software for supercomputing at the "U.S.-Japan Joint High-Level Committee Meeting on Science and Technology Cooperation":
http://www.state.gov/r/pa/prs/ps/2013/04/208651.htm

A workshop on international collaboration for exascale computing will be organized at ISC2013 next week, moving towards JLESC (especially for exascale software development).