Updates of AICS and the Next Step for Post-Petascale Computing in Japan

Mitsuhisa Sato
University of Tsukuba
Team Leader, Programming Environment Research Team,
Advanced Institute for Computational Science (AICS), RIKEN
RIKEN Advanced Institute for Computational Science (AICS)

The institute was established at the site of the K computer in Kobe (started in October 2010).

Missions:
• Run the K computer efficiently for users across a wide range of research areas
• Carry out leading-edge computational science and technology, and contribute to a center of excellence (COE) for computational science in Japan
• Propose future directions of HPC in Japan and lead their realization

Organization:
• Operations division, to run and manage the K computer
• Research division: started with 5 computational science research teams and 3 computer science research teams; in 2012, expanded to 10 computational science and 6 computer science research teams
• Promotes strong collaboration between computational and computer scientists, working together with the core organizations of each field

http://www.aics.riken.jp/
Divisions:
• AICS Policy Planning Division
• AICS Research Support Division
• Research Division
• Operations and Computer Technologies Division
Research Division (16 teams + 3 units)
• System Software Research Team
• Programming Environment Research Team
• Processor Research Team
• Large-scale Parallel Numerical Computing Technology Research Team
• HPC Usability Research Team
• Field Theory Research Team
• Discrete Event Simulation Research Team
• Computational Molecular Science Research Team
• Computational Materials Science Research Team
• Computational Biophysics Research Team
• Particle Simulator Research Team
• Computational Climate Science Research Team
• Complex Phenomena Unified Simulation Research Team
• HPC Programming Framework Research Team
• Advanced Visualization Research Team
• Data Assimilation Research Team
• Computational Chemistry Research Unit
• Computational Disaster Mitigation and Reduction Research Unit
• Computational Structural Biology Research Unit
The Status of the K Computer
• The first racks of the K computer were delivered to Kobe in September 2010.
• Racks: 864 (+54); compute nodes (CPUs): 82,944 (88,128); number of cores: 663,552 (705,024)
• It achieved its primary target of over 10 petaflops (10.51 PF Linpack at 12.66 MW) in November 2011.
• Installation and adjustment of K were completed, and public use started at the end of September 2012.

[Photo: first delivery, September 28, 2010]
K Computer: Compute Nodes and Network
• Racks: 864 (+54)
• Compute nodes (CPUs): 82,944 (88,128)
• Number of cores: 663,552 (705,024)
• Peak performance: 10.6 PFLOPS (11.28 PF)
• 10.51 PF Linpack (12.66 MW), November 2011
• Memory: 1.33 PB (16 GB/node)
• CPU: SPARC64™ VIIIfx
(Courtesy of FUJITSU Ltd.)
[Figure: compute node block diagram — one 8-core CPU at 128 GFLOPS; each core is SIMD (4 FMA units) at 16 GFLOPS; shared 5 MB L2 cache; 64 GB/s to 16 GB of memory; Tofu links of 5 GB/s (peak) x 2 in each of the x, y, and z directions of the logical 3-D torus network for programming.]
• High-performance/low-power CPU with 8 cores: 128 GFLOPS @ 2 GHz, 58 W
• High-throughput/low-latency torus network (Tofu): logical 3-dimensional torus
• Peak bandwidth: 5 GB/s x 2 for each direction of the logical 3-D torus network
• Bisection bandwidth: 30 TB/s
The K Computer ("京", Kei)
Projects to Organize Users of the K Computer
• SPIRE (Strategic Programs for Innovative Research): a committee in MEXT identified five application areas expected to create breakthroughs using the K computer from a national viewpoint.
• Nationwide High Performance Computing Infrastructure project (HPCI): organizes computing resources and users nationwide, including university supercomputers and the K computer.
SPIRE (Strategic Programs for Innovative Research)

Purpose:
• To produce scientific results as soon as HPCI starts operation
• To establish several core institutes for computational science

Outline of the program:
• Identify five strategic research areas that will contribute results to scientific and social issues
• Nationwide research groups are formed by funding core organizations designated by MEXT
• The groups promote R&D using the K computer and build research structures for their own areas
• 50% of the computing resources of the K computer are dedicated to this program
Five strategic areas of SPIRE:
• Life science / drug manufacture — Toshio YANAGIDA (RIKEN)
• New material / energy creation — Shinji TSUNEYUKI (University of Tokyo)
• Global change prediction for disaster prevention/mitigation — Shiro IMAWAKI (JAMSTEC)
• Monodukuri (manufacturing technology) — Chisachi KATO (University of Tokyo)
• The origin of matter and the universe — Shinya AOKI (University of Tsukuba)

[Figure: multi-scale life phenomena — genome, whole body, proteins, cells, tissues and organs]
Nationwide High Performance Computing Infrastructure Project (HPCI)

Background: after re-evaluation of the project at the change of governing party in 2011, the NGS (next-generation supercomputer) project was restarted as "Creation of the Innovative High Performance Computing Infrastructure (HPCI)".

Building HPCI:
• Provide seamless access to the K computer, other supercomputers, and users' machines
• Set up a large-scale storage system for the K computer and other supercomputers
• Joint selection of proposals for the K computer and other supercomputers

Organizing the HPCI Consortium:
• Organize users and computer centers, and provide proposals/suggestions to the government and related organizations
• Planning and operation of the HPCI system
• Promotion of computational science
• Future supercomputing

[Figure: the HPCI consortium connecting the K computer at AICS, RIKEN with institutional/university computer centers and computational science communities]

The consortium was organized and started in June 2012.
The Conceptual View of HPCI

[Figure: the consortium and users connected to the K computer, large-scale storage, and supercomputers in universities]

HPCI is a comprehensive advanced computing infrastructure in which supercomputers and large-scale storage are connected through a high-speed network.
Computing Resources in HPCI (as of June 2012)
• AICS, RIKEN: K computer (11.28 PF, 1.27 PiB)
• Hokkaido Univ.: SR16000/M1 (172 TF, 22 TB), BS2000 (44 TF, 14 TB)
• Tohoku Univ.: SX-9 (29.4 TF, 18 TB), Express5800 (1.74 TF, 3 TB)
• Univ. of Tokyo: FX10 (1.13 PF, 150 TB), SR16000/M1 (54.9 TF, 10.94 TB), T2K (140 TF, 31.25 TB)
• Univ. of Tsukuba: T2K (95.4 TF, 20 TB), HA-PACS (802 TF, 34.3 TB), FIRST (36.1 TF, 1.6 TB)
• Tokyo Institute of Technology: TSUBAME2.0 (2.4 PF, 100 TB)
• Nagoya Univ.: FX1 (30.72 TF, 24 TB), HX600 (25.6 TF, 10 TB), M9000 (3.84 TF, 3 TB)
• Kyoto Univ.: XE6 (300.8 TF, 59 TB), GreenBlade8000 (242.5 TF, 38 TB), 2548X (10.6 TF, 24 TB)
• Osaka Univ.: SX-9 (16 TF, 10 TB), SX-8R (5.3 TF, 3.3 TB), PC cluster (6.1 TF, 2.0 TB)
• Kyushu Univ.: FX10 (181.6 TF, 24 TB), CX400 (510.1 TF, 184.5 TB), SR16000 L2 (25.3 TF, 5.5 TB)

11.28 PF + 0.757496 PF + 2.21248 PF
Storage System in the First Phase of HPCI

[Figure: sites on the network — Hokkaido University, Tohoku University, University of Tokyo, University of Tsukuba, Tokyo Institute of Technology, Nagoya University, Kyoto University, Osaka University, Kyushu University, and AICS, RIKEN]

• Two hubs, HPCI WEST HUB and HPCI EAST HUB: 12 PB of storage (52 OSS) with an 87-node cluster for data analysis, and 10 PB of storage (30 OSS) with an 88-node cluster for data analysis
• Gfarm2 is used as the global shared file system
HPCI Offers to Computational Science Users…

• Computing resources: researchers can use an appropriate amount of computing resources more effectively and efficiently, on the K computer and university supercomputers
• Authentication: users can access these computers and storage systems with a single-sign-on account
• Storage: users can share large amounts of data, and analyze or visualize results simulated by other researchers on different supercomputers
Users and Jobs
• About 100 active users and 1,000 jobs per day
• Period I: Sep. 2012 - Mar. 2013; Period II: Apr. 2013 - Sep. 2013; Period III: Oct. 2013 - Mar. 2014
Job Properties (1/2)
• Larger jobs (more than 5,000 nodes, i.e. roughly 0.5 PF and up) consume about 40-50% of the resources.
• The used/serviced ratio reaches 80%.
Job Properties (2/2)
• The average sustained performance and job scale are gradually increasing.
Activities for Future HPC Development

[Timeline, FY2011-FY2013: SDHPC WGs (technical discussion by two groups) publish the SDHPC white paper; the "Feasibility Study" (FS) for future HPC follows, with a review; in parallel, discussion on HPC policy and the basic concept leads to a decision on future HPC R&D. We are here now (FY2013).]
The SDHPC White Paper and the "Feasibility Study" Project

WGs were organized to draft the white paper on the Strategic Direction/Development of HPC in Japan, written by young Japanese researchers with senior advisers.

Contents:
• Science roadmap to 2020 and a list of applications for the 2020s
• Four types of hardware architectures identified, with performance in 2018 projected from present technology trends
• The need for further research and development to realize the science roadmap

For the "Feasibility Study" project, 4 research teams were accepted:
• Application study team led by RIKEN AICS (Tomita)
• System study team led by U Tokyo (Ishikawa): next-generation "general-purpose" supercomputer
• System study team led by U Tsukuba (Sato): exascale heterogeneous systems with accelerators
• System study team led by Tohoku U (Kobayashi)

The projects started in July 2012 (1.5 years).
System Requirement Analysis for Target Sciences

• System performance — FLOPS: 800-2500 PFLOPS; memory capacity: 10 TB - 500 PB; memory bandwidth: 0.001 - 1.0 B/F
• Example applications:
  • Small capacity requirement: MD, climate, space physics, …
  • Small bandwidth requirement: quantum chemistry, …
  • High capacity/bandwidth requirement: incompressible fluid dynamics, …
• Interconnection network: not enough analysis has been carried out; some applications need latencies below 1 µs and large bisection bandwidth
• Storage: no very large demand
[Figure: application requirements plotted as required B/F (10^-4 to 10^1) versus required memory capacity in PB (10^-3 to 10^3), clustering into four groups: low BW / middle capacity, high BW / small capacity, high BW / middle capacity, and high BW / high capacity. From the SDHPC white paper.]
Alternatives for Exascale Architecture (from the SDHPC white paper)

Four types of architectures are identified for exascale:
• General Purpose (GP): ordinary CPU-based MPPs, e.g. the K computer, Blue Gene, x86-based PC clusters
• Capacity-Bandwidth oriented (CB): investment in the memory interface rather than computing capability, e.g. vector machines
• Reduced Memory (RM): with embedded (main) memory, e.g. SoC, MDGRAPE-4, Anton
• Compute Oriented (CO): many processing units, e.g. ClearSpeed, GRAPE-DR, GPU?
Mapping of Architectures

[Figure: the four architectures (GP, CB, RM, CO) mapped onto the requirement plane of B/F (10^-4 to 10^1) versus memory capacity in PB (10^-3 to 10^3), with the trade-off axes of FLOPS, memory bandwidth, and memory capacity; CB is bandwidth-oriented, CO compute-oriented, RM reduced-memory, GP general-purpose. From the SDHPC white paper.]
Issues for Exascale Computing

Two important aspects of post-petascale computing:
• Power limitation: less than 20-30 MW
• Strong scaling: fewer than 10^6 nodes (for fault tolerance), which requires more than 10 TFLOPS per node via accelerators and many-core processors

Solution: accelerated computing — by GPGPU, by application-specific accelerators, by future acceleration devices, …
[Figure: simple projection of node count versus peak FLOPS on log-log axes, with PACS-CS (14 TF), T2K-Tsukuba (95 TF), the NGS target (>10 PF), and the K computer; petaflops are reached with 100-1000 nodes, while an exaflops system runs into the limitation on the number of nodes (~10^6).]
Study on Exascale Heterogeneous Systems with Accelerators (U Tsukuba project)

Two keys for exascale computing: power and strong scaling.

We study "exascale" heterogeneous systems with many-core accelerators. We are interested in:
• Architecture of accelerators: core and memory architecture
• Special-purpose functions
• Direct connection between accelerators in a group
• Power estimation and evaluation
• Programming models and computational science applications
• Requirements for the general-purpose part of the system, etc.
[Figure: system organization — nodes connected by a system network, with storage attached; each node couples a general-purpose processor and memory with multiple accelerators; accelerators within a group are directly connected by a dedicated accelerator network; each accelerator chip is an array of cores with local memory, attached to memory controllers over a 2-D mesh.]
PACS-G: A Straw-Man Architecture
• SIMD architecture, for compute-oriented apps (N-body, MD) and stencil apps
• 4096 cores (64x64), 2 FMA @ 1 GHz: 4 GFLOPS x 4096 = 16 TFLOPS/chip
• 2-D mesh (+ broadcast/reduction) on-chip network for stencil apps
• We expect 14 nm technology in the 2018-2020 time frame; chip die size 20 mm x 20 mm
• Mainly working on on-chip memory (512 MB/chip, 128 KB/core), plus module memory of 3D-stacked / Wide I/O DRAM (via 2.5D TSV): bandwidth 1000 GB/s, size 16-32 GB/chip; no external (DIMM/DDR) memory
• 250 W/chip expected; 64K chips for 1 EFLOPS (at peak)
[Figure: PACS-G chip block diagram — a host processor and controller with data and instruction memory; a PE array connected through broadcast memories and communication buffers; a result-reduction network; links to the inter-accelerator network and to 3D-stacked or Wide I/O DRAM.]
• A group of 1024-2048 chips is connected via the accelerator network (inter-chip network, 2-D mesh).
• 25-50 Gbps per link for inter-chip connections: if we extend the on-chip 2-D mesh to the external (2-D mesh) network within a group, we need 200-400 GB/s (= 32 channels x 25-50 Gbps x 2 directions).
• For 50 Gbps data transfer, we may need direct optical interconnects from the chip.
• I/O interface to the host: PCI Express Gen 4 x16 (not enough!)
Programming model: XcalableMP + OpenACC
• Use OpenACC to specify the offloaded fragments of code and the data movement.
• To align data and computation to cores, we use the "template" (virtual index space) concept of XcalableMP, so that we can generate code for each core.
• (Also considering data-parallel languages like C*.)
An Example Implementation (1U rack)

[Figure: board layout — accelerator LSIs in accelerator modules on silicon interposers, with memory modules (HMC or Wide I/O DRAM), optical interconnect modules, connectors to other accelerator units, and PSUs.]
Project organization:
• Joint project with Titech (Makino), Aizu U (Nakazato), RIKEN (Taiji), U Tokyo, KEK, Hiroshima U, and Hitachi as a supercomputer vendor.
• Target apps: QCD in particle physics, tree N-body and HMD in astrophysics, MD in life science, FDM for earthquake simulation, FMO in chemistry, NICAM in climate science.
Team structure:
• Accelerator and network system design (Titech, U Tsukuba, Aizu): basic accelerator architecture (Makino), accelerator network (Boku), processor core architecture (Nakazato)
• Programming model and simulation tools (U Tsukuba): simulator and evaluation tools (Kodama), programming model (Sato)
• Study on implementation and power (Hitachi)
• Application study (U Tsukuba, RIKEN, U Tokyo, KEK, Hiroshima U): particle physics (Kuramashi, Ishikawa, Matsufuru), astrophysics (Umemura, Yoshikawa), nano-material science (Oshiyama), life science (Taiji, Umeda), global climate science (Tanaka, Yashiro), earth science (Okamoto)
• The application and system teams are coupled through the API, the programming model, and simulation, with feedback in both directions.
Current Status and Schedule

We are now working on performance estimation through a co-design process:
• 2012 (done): QCD, N-body, MD, HMD
• 2013: earthquake simulation, NICAM (climate), FMO (chemistry)
• When all data fits in on-chip memory, the ratio is 4 B/F, with 1 TB of total memory per group; when data fits in module memory, the ratio is 0.05 B/F, with 32 TB per group
• Also developing simulators (clock-level/instruction-level) for more precise and quantitative performance evaluation
• Compiler development (XMP and OpenACC)
• (Re-)design and investigation of the network topology: is a 2-D mesh sufficient, or are there better alternatives?
• Code development for apps using host and accelerator, including I/O
• More precise and detailed estimation of power consumption
AICS Development of International Partnerships
• NCSA, US, under MoU (1st meeting in April 2013)
• NCI (National Computational Infrastructure), Australia, under MoU
• JSC, Germany, under MoU
• SISSA (Scuola Internazionale Superiore di Studi Avanzati), Italy, under agreement
• University of Maryland, US, with an agreement for collaboration on modeling and data assimilation
• Maison de la Simulation (INRIA/CNRS), France, under discussion
• ANL, US, under discussion

Recently, Japan and the US agreed on collaboration on system software for supercomputing at the "U.S.-Japan Joint High-Level Committee Meeting on Science and Technology Cooperation":
http://www.state.gov/r/pa/prs/ps/2013/04/208651.htm

A workshop on international collaboration for exascale computing will be organized at ISC2013 next week, moving towards JLESC (especially for exascale software development).