Efforts on Programming Environment and Tools in China's High-tech R&D Program
Depei Qian, Sino-German Joint Software Institute (JSI), Beihang University
Email: [email protected]
CScADS tools workshop, 2011
China’s High-tech Program
The National High-tech R&D Program (863 Program) was proposed by four senior Chinese scientists and approved by former leader Deng Xiaoping in March 1986
One of the most important national science and technology R&D programs in China
Now a regular national R&D program planned in five-year terms; the term just finished belongs to the 11th five-year plan
863 key projects on HPC and Grid
“High performance computer and core software”
4-year project, May 2002 to Dec. 2005
100 million Yuan funding from the MOST, plus more than 2× associated funding from local governments, application organizations, and industry
Outcome: China National Grid (CNGrid)
“High productivity computer and Grid service environment”
Period: 2006-2010
940 million Yuan from the MOST, and more than 1 billion Yuan matching money from other sources
HPC development (2006-2010)
First phase: developing two 100TFlops machines
Dawning 5000A for SSC
Lenovo DeepComp 7000 for SC of CAS
Second phase: three 1000TFlops machines
Tianhe-1A: CPU+GPU, NUDT/Tianjin Supercomputing Center
Dawning 6000: CPU+GPU, ICT/Dawning/South China Supercomputing Center (Shenzhen)
Sunway: CPU-only, Jiangnan/Shandong Supercomputing Center
CNGrid development
11 sites:
CNIC, CAS (Beijing, major site)
Shanghai Supercomputer Center (Shanghai, major site)
Tsinghua University (Beijing)
Institute of Applied Physics and Computational Mathematics (Beijing)
University of Science and Technology of China (Hefei, Anhui)
Xi’an Jiaotong University (Xi’an, Shaanxi)
Shenzhen Institute of Advanced Technology (Shenzhen, Guangdong)
Hong Kong University (Hong Kong)
Shandong University (Jinan, Shandong)
Huazhong University of Science and Technology (Wuhan, Hubei)
Gansu Provincial Computing Center
The CNGrid Operation Center (based on CNIC, CAS)
CNGrid GOS Architecture
[Architecture diagram: CNGrid GOS is organized as a hosting environment (Tomcat/Apache + Axis, interoperating with GT4, gLite and OMII); a core layer (Agora with user, resource and Agora management; Grip runtime with Grip instance management; naming; security; resource access control and sharing); a system layer (message service, CA service, dynamic deploy service, batch job management, metascheduler, account/file/metainfo management, DB service, workflow engine); and a tool/application layer (system management portal, HPCG application and management portal with HPCG backend, GSML browser/composer/workshop, IDE, compiler, debugger, Gsh and command-line tools, VegaSSH, GridWorkflow, DataGrid, and other domain-specific applications built over the GOS library (batch, message, file, etc.) and GOS system calls (resource, Agora, user and Grip management), plus other third-party software and tools on Java J2SE.]
Deployment stack: Grid portal, Gsh+CLI, GSML Workshop and Grid applications run over core-, system- and application-level services; Axis handlers provide message-level security; the services run on Tomcat (5.0.28) + Axis (1.2 rc2) over J2SE (1.4.2_07, 1.5.0_07), on PC servers (grid servers) running Linux/Unix/Windows.
Parallel middlewares for scientific computing
[Diagram: a common infrastructure sits between application codes and computers. Data dependencies are extracted to form data structures; data structures promote communications and load balancing, which support parallel computing models; models, stencils and algorithms are separated into special libraries and the common infrastructure.]
Basic ideas
Hides parallel programming over millions of cores and the hierarchy of parallel computers;
Integrates efficient implementations of parallel fast numerical algorithms;
Provides efficient data structures and solver libraries;
Supports software engineering for code extensibility.
Basic ideas
[Diagram: serial programming on a personal computer is scaled up, using the infrastructures, to TeraFlops clusters and PetaFlops MPPs running the same application codes.]
Basic Ideas
Application domains: inertial confinement fusion, global climate modeling, CFD, material simulations, particle simulation, etc., on structured and unstructured grids.
JASMIN: J parallel Adaptive Structured Mesh INfrastructure
http://www.iapcm.ac.cn/jasmin, registration no. 2010SR050446, developed 2003-now
JASMIN
Architecture: multilayered, modularized, object-oriented; Codes: C++/C/F90/F77 + MPI/OpenMP, 500,000 lines; Installation: personal computers, clusters, MPP.
JASMIN V2.0
User provides: physics, parameters, numerical methods, expert experience, special algorithms, etc.
HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc.
Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc.
User interfaces: component-based parallel programming models (C++ classes)
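To make the component idea concrete, here is a minimal, hypothetical sketch of what such a component-based user interface can look like; the class and method names are illustrative assumptions, not the real JASMIN API:

```cpp
// Hypothetical sketch of a JASMIN-style component interface; names are
// illustrative, not the real JASMIN API. The framework owns the mesh,
// parallelization, load balancing and adaptivity; the user supplies only
// the single-patch numerics as a C++ class.
class PatchComponent {
public:
    virtual ~PatchComponent() {}
    // Called by the framework for every local patch of the distributed mesh.
    virtual void computeOnPatch(const double* u, double* unew,
                                int nx, int ny, double dt) = 0;
};

// User-side code: pure serial numerics on one structured patch.
class HeatDiffusion : public PatchComponent {
public:
    void computeOnPatch(const double* u, double* unew,
                        int nx, int ny, double dt) override {
        // Explicit 5-point stencil update; ghost cells are assumed to be
        // filled by the framework, so the loop skips the patch boundary.
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i)
                unew[j * nx + i] = u[j * nx + i]
                    + dt * (u[j * nx + (i - 1)] + u[j * nx + (i + 1)]
                          + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]
                          - 4.0 * u[j * nx + i]);
    }
};
```

The point of the design is that the same user class runs unchanged from a personal computer to a PetaFlops MPP, because all scaling concerns live in the infrastructure.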
JASMIN
[Figure: mesh types supported by JASMIN]
13 codes, 46 researchers, concurrent development
Simulation cycle of the ICF application codes: different combinations of numerical methods, physical parameters, and expert experience.
Hides parallel computing and adaptive implementations using tens of thousands of CPU cores;
Provides efficient data structures, algorithms and solvers;
Supports software engineering for code extensibility.
Inertial Confinement Fusion: 2004-now
Codes and CPU cores used:
LARED-S: 32,768      RH2D: 1,024
LARED-P: 72,000      HIME3D: 3,600
LAP3D: 16,384        PDD3D: 4,096
MEPH3D: 38,400       LARED-R: 512
MD3D: 80,000         LARED Integration: 128
RT3D: 1,000
Simulation duration: several hours to tens of hours.
Numerical simulations on TianHe-1A
Codes: status in 2004 vs. 2010:
LARED-H (2-D radiation hydrodynamics Lagrange code): serial, single block, without capsule → parallel, multiblock, NIF ignition target
LARED-R (2-D radiation transport code): serial → parallel (2,048 cores)
LARED-S (3-D radiation hydrodynamics Euler code): 2-D single group, 3-D without radiation → MPI parallel (32,768 cores), single-level SAMR, 2-D multi-group diffusion, 3-D radiation multigroup diffusion
LARED-P (3-D laser plasma interaction code): → MPI parallel (36,000 cores), terascale numbers of particles
Scaled up by a factor of 1,000.
GPU programming support and performance optimization
Contact: Prof. Xiaoshe Dong, Xi’an Jiaotong University
Email: [email protected]
GPU program optimization
Three approaches for GPU program optimization: memory-access level, kernel-speedup level, and data-partition level.
Source-to-source translation for GPU
Developed a source-to-source translator, GPU-S2S, for GPU
Facilitates the development of parallel programs on GPU by combining automatic mapping and static compilation
Source-to-source translation for GPU
Insert directives into the source program (illustrated below)
to guide implicit calling of CUDA runtime libraries
to enable the user to control the mapping of compute-intensive applications from the homogeneous CPU platform to the GPU's streaming platform
Optimization based on runtime profiling
takes full advantage of the GPU according to the characteristics of the application by collecting runtime dynamic information.
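As an illustration of the directive-driven flow (the actual GPU-S2S directive syntax is not shown in the slides, so the pragma below is an assumed placeholder):

```c
/* Hypothetical example of an annotated source program; "#pragma gpus2s"
 * is an assumed placeholder for the real GPU-S2S directive syntax. */
#define N 4096
float a[N], b[N], c[N];

void vec_add(void) {
    /* The translator would recognize this annotated loop as a computing
     * kernel and replace it with generated CUDA host code (cudaMalloc,
     * cudaMemcpy, kernel launch) plus a device kernel. */
    #pragma gpus2s kernel copyin(a, b) copyout(c)  /* assumed syntax */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```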
The GPU-S2S architecture
[Architecture diagram: GPU-S2S comprises a layer of software productivity (Pthread thread model, MPI message-transfer model, PGAS programming model) and a layer of performance discovery (runtime performance collection, calling of shared libraries, profile information), built on the GPU supporting library, the user standard library, the operating system and the GPU platform.]
Program translation by GPU-S2S
[Diagram: the source code before translation is a homogeneous platform program framework: homogeneous platform code with directives, plus the computing functions it calls, the user standard library and calls to shared libraries. The source code after translation is a GPU streaming-architecture program framework: a CPU control program, GPU kernel programs generated according to templates, a general-purpose computing interface, and a user-defined part, drawing on a template library of optimized compute-intensive applications and a profile library.]
[Workflow diagram: homogeneous platform code (*.c, *.h) passes through pretreatment and first-level dynamic instrumentation; compiling and running it extracts the first-level profile information (the computing kernel), after which directives are inserted automatically. Second-level dynamic instrumentation, compiled and run again, extracts data block size and shared-memory configuration parameters and judges whether streams can be used; CUDA code containing the optimized kernel is then generated. If further optimization is needed, third-level dynamic instrumentation in the CUDA code extracts the number of streams and the data size of every stream, and CUDA code using streams is generated; otherwise the process terminates. The resulting CUDA code (*.h, *.cu, *.c) is built with the CUDA compiler tools into executable code (*.o) for the GPU.]
Runtime optimization based on profiling
First level profiling (function level)
Second level profiling (memory access and kernel improvement)
Third level profiling (data partition)
First level profiling
Scans the source code before translation, finds each function and inserts instrumentation before and after it, computes the execution time of every function, and finally identifies the computing kernels.
[Diagram: the source-to-source compiler takes homogeneous platform code (allocate address space, initialization, function0 … functionN, free address space) and brackets every functionK with instrumentationK probes.]
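In effect, the inserted probes reduce to timers around every candidate function. A minimal self-contained sketch (probe placement assumed; POSIX gettimeofday for timing; the functions are stand-ins):

```c
#include <stdio.h>
#include <sys/time.h>

static double now_ms(void) {          /* wall-clock time in milliseconds */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e3 + tv.tv_usec / 1e3;
}

static void function0(void) {         /* stand-in for a candidate kernel */
    volatile double s = 0;
    for (long i = 0; i < 50000000; i++) s += (double)i;
}
static void function1(void) {         /* stand-in for a cheap function */
    volatile double s = 0;
    for (long i = 0; i < 1000000; i++) s += (double)i;
}

int main(void) {
    double t0 = now_ms();             /* instrumentation0, before */
    function0();
    double t1 = now_ms();             /* instrumentation0 after / 1 before */
    function1();
    double t2 = now_ms();             /* instrumentation1, after */
    /* The function with the dominant share of time is the computing kernel. */
    printf("function0: %.1f ms, function1: %.1f ms\n", t1 - t0, t2 - t1);
    return 0;
}
```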
Second level profiling
GPU-S2S scans the code and inserts instrumentation at the corresponding places in the computing kernels;
it extracts profile information, analyzes the code, performs optimizations, expands the templates according to the features of the application, and finally generates CUDA code with optimized kernels.
Using shared memory is a general approach; it involves 13 parameters, and performance differs with different parameter values.
[Diagram: the source-to-source compiler inserts instrumentation around computing kernel 1, 2, 3, … in the homogeneous platform code.]
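For reference, the generic shape of the shared-memory optimization is the tiled kernel below. This is the standard CUDA pattern, not the actual GPU-S2S template (whose 13 tunable parameters are not listed in the slides); TILE stands in for one such parameter:

```cuda
#define TILE 16

/* Generic shared-memory tiling for matrix multiply; assumes n is a
 * multiple of TILE so every thread maps to a valid element. */
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        /* Stage one tile of A and one tile of B in on-chip shared memory. */
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               /* done with this tile pair */
    }
    C[row * n + col] = acc;
}
```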
Third level profiling
GPU-S2S scans the code, finds each computing kernel and its copy functions, and inserts instrumentation at the corresponding places to obtain the copy time and the computing time. From these times it computes the number of streams and the data size of each stream, and finally generates optimized CUDA code with streams.
[Diagram: the source-to-source compiler brackets the CUDA control code (allocate address space, initialization, allocate global address space, function0--copyin, function0--kernel, function0--copyout, free address space) with instrumentation probes around the copy-in, kernel and copy-out phases.]
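The generated stream code follows the standard copy/compute overlap pattern sketched below (a minimal sketch: the kernel is only declared, pinned host buffers and divisibility of n by nStreams are assumed):

```cuda
__global__ void kernel(const float* in, float* out);  /* some computing kernel */

/* nStreams and the chunk size are exactly the values the third-level
 * profile derives from measured copy time vs. computing time. h_in/h_out
 * must be pinned (cudaMallocHost) for the async copies to overlap. */
void run_with_streams(const float* h_in, float* h_out, float* d_in,
                      float* d_out, int n, int nStreams) {
    cudaStream_t streams[16];              /* assumes nStreams <= 16 */
    int chunk = n / nStreams;              /* assumes n % nStreams == 0 */
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);
    for (int i = 0; i < nStreams; ++i) {
        int off = i * chunk;
        /* Copy-in, kernel and copy-out of chunk i all queue on stream i,
         * so chunk i's copies overlap chunk j's computation. */
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        kernel<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_in + off,
                                                            d_out + off);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < nStreams; ++i) {   /* wait for all chunks, clean up */
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```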
Verification and experiment
Experiment platform: a server with a 4-core Xeon CPU and 12GB memory, an NVIDIA Tesla C1060, Red Hat Enterprise Linux Server 5.3, and CUDA 2.3.
Test examples: matrix multiplication and fast Fourier transform (FFT).
Matrix multiplication: performance comparison before and after profiling, and execution performance comparison on different platforms
The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory-access optimization, and a 91% improvement over the CUDA code using only global memory for computing.
[Charts: execution time (ms) versus input sizes 1024-8192, comparing the three-level profile optimization against CPU execution, and comparing global-memory-only, memory-access optimization, second-level and third-level profile optimization.]
FFT (1,048,576 points): performance comparison before and after profiling, and execution performance comparison on different platforms
The CUDA code after three-level profile optimization achieves a 38% improvement over the CUDA code with memory-access optimization, and a 77% improvement over the CUDA code using only global memory for computing.
[Charts: execution time (ms) versus number of batches (15-60), comparing global-memory-only, memory-access optimization, second-level and third-level profile optimization, and comparing the three-level profile optimization against CPU execution.]
Programming Multi-GPU system
The traditional programming models, MPI and PGAS, are not directly suitable for the new CPU+GPU platform, and legacy applications cannot exploit the power of GPUs.
Programming model for the CPU-GPU architecture:
combine a traditional programming model with a GPU-specific programming model, forming a mixed programming model;
obtain better performance on the CPU-GPU architecture, making more efficient use of the computing power.
[Diagram: nodes pairing CPUs with multiple GPUs. The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models for this new kind of system: message passing (MPI) or shared data (PGAS private/shared spaces) handles communication between parallel tasks, while each task drives its own GPU through device memory.]
Programming Multi-GPU system
Mixed Programming Model
[Diagram: the program starts on the MPI/UPC runtime; each parallel task runs its primary control on a CPU, chooses a device, copies source data from main memory to device memory (cudaMemcpy), calls the computing kernel on the GPU through the CUDA runtime, copies the result data back, and communicates with the other tasks through the communication interface of the upper programming model.]
NVIDIA GPU: CUDA. Traditional programming model: MPI/UPC. Combined: MPI+CUDA / UPC+CUDA.
Mixed programming model:
The primary control of an application is implemented in the MPI or UPC programming model; the computing kernels of the application are implemented in CUDA, using the GPU to accelerate computing.
Optimizing the computing kernel makes better use of the GPUs; using GPU-S2S to generate the computing kernel program hides the CPU+GPU heterogeneity from the user and improves the portability of the application.
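A minimal MPI+CUDA sketch of this division of labor (illustrative only; it assumes two GPUs per node, as in the testbed described later):

```cuda
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale_kernel(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                  /* the CUDA computing kernel */
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                /* MPI: primary control */
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 2);               /* one GPU per task, 2 per node */

    const int n = 1 << 20;
    float *h = (float*)malloc(n * sizeof(float)), *d, local, sum;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); /* copy in */
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost); /* copy out */

    local = h[0];                          /* MPI: communication between tasks */
    MPI_Reduce(&local, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", sum);
    cudaFree(d); free(h);
    MPI_Finalize();
    return 0;
}
```

The control program is compiled with mpicc and the kernel with nvcc, matching the compiling process shown next.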
Compiling process:
[Diagram: the primary control program, with declarations of the computing kernels (#include), is compiled with mpicc/upcc; the computing kernel programs are compiled with nvcc; the objects are linked with nvcc, and the result is run with mpirun/upcrun.]
MPI+CUDA experiment
Platform: 2 NF5588 servers, each equipped with:
1 Xeon CPU (2.27GHz), 12GB main memory
2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4GB device memory)
1Gbit Ethernet
RedHat Linux 5.3, CUDA Toolkit 2.3 and CUDA SDK, OpenMPI 1.3, Berkeley UPC 2.1
MPI+CUDA experiment (cont’)
Matrix multiplication program: uses block matrix multiplication for UPC programming, with the data spread across the UPC threads. The computing kernel multiplies two blocks at a time and is implemented in CUDA.
Total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel, where
Tcom: UPC thread communication time
Tcuda: CUDA program execution time
Tcopy: data transmission time between host and device
Tkernel: GPU computing time
MPI+CUDA experiment (cont’)
For 4096×4096 matrices, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184× that of the 8-MPI-task case.
For small-scale data, such as 256 or 512, the execution time using 2 GPUs is even longer than using 1 GPU: the computing scale is too small, and the communication between the two tasks overwhelms the reduction in computing time.
[Charts: 2 servers with at most 8 MPI tasks vs. 1 server with 2 GPUs; matrix sizes 8192×8192 and 16384×16384.]
Tcuda is reduced as the number of tasks increases, but the Tsum of 4 tasks is larger than that of 2. Reason: the latency of the Ethernet between the 2 servers is much higher than the latency of the bus inside one server.
If the computing scale is larger, or a faster network (e.g. InfiniBand) is used, multiple nodes with multiple GPUs will still improve application performance.
Advanced Compiler Technology (ACT) Group at the ICT, CAS
The Institute of Computing Technology (ICT), founded in 1956, is the first and leading institute on computing technology in China
ACT was founded in the early 1960s and has over 40 years of experience with compilers:
compilers for most of the mainframes developed in China
compiler and binary translation tools for Loongson processors
parallel compilers and tools for the Dawning series (SMP/MPP/cluster)
Advanced Compiler Technology (ACT) Group at the ICT, CAS
ACT's current research: parallel programming languages and models; optimizing compilers and tools for HPC (Dawning) and multi-core processors (Loongson)
Advanced Compiler Technology (ACT) Group at the ICT, CAS
• PTA model (Process-based TAsk parallel programming model)
– new process-based task construct, with properties of isolation, atomicity and deterministic submission
– annotates a loop into two parts, prologue and task segment:
#pragma pta parallel [clauses]
#pragma pta task
#pragma pta propagate (varlist)
– suitable for expressing coarse-grained, irregular parallelism in loops (a hedged sketch follows below)
• Implementation and performance
– PTA compiler, runtime system and assistant tool (helps write correct programs)
– speedup: 4.62 to 43.98 (average 27.58 on 48 cores); 3.08 to 7.83 (average 6.72 on 8 cores)
– code changes are within 10 lines, much smaller than with OpenMP
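A hedged sketch of an annotated loop using the three pragmas named above; everything beyond the pragma names themselves (clause use, placement) is assumed:

```c
extern long expensive_irregular_work(int i);   /* user function, varying cost */
long result[1000];

void process_items(int n) {
    /* The prologue runs in the parent; each task segment becomes an
     * isolated process-based task whose writes take effect atomically
     * at deterministic submission. Placement of the pragmas is assumed. */
    #pragma pta parallel
    for (int i = 0; i < n; i++) {
        int item = i;                          /* prologue part of the loop */
        #pragma pta task
        {
            result[item] = expensive_irregular_work(item);
        }
        #pragma pta propagate(result)          /* commit the task's writes */
    }
}
```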
UPC-H: A Parallel Programming Model for Deep Parallel Hierarchies
Hierarchical UPC: provides multi-level data distribution, with implicit and explicit hierarchical loop parallelism
Hybrid execution model: SPMD with fork-join
Multi-dimensional data distribution and super-pipelining
Implementations on CUDA clusters and the Dawning 6000 cluster, based on Berkeley UPC
Enhanced optimizations such as localization and communication optimization
Supports SIMD intrinsics
CUDA cluster: 72% of the hand-tuned version's performance, with code reduced to 68%
Multi-core cluster: better process mapping and cache reuse than UPC
OpenMP and Runtime Support for Heterogeneous Platforms
Heterogeneous platforms consist of CPUs and GPUs; multiple GPUs, or CPU-GPU cooperation, bring extra data transfers that hurt the performance gain, so programmers need a unified data management system
OpenMP extension (see the sketch below):
specify the partitioning ratio to optimize data transfer globally
specify heterogeneous blocking sizes to reduce false sharing among computing devices
Runtime support:
DSM system based on the specified blocking size
intelligent runtime prefetching with the help of compiler analysis
Implementation and results: on the OpenUH compiler; gains a 1.6× speedup through prefetching on NPB/SP (class C)
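Rendered as code, the two extensions might look like the following; the clause names devices(), ratio() and blocksize() are invented placeholders, since the slides do not give the group's actual syntax:

```c
void stencil_step(const float* a, float* b, int n) {
    /* Hypothetical extended OpenMP: run 30% of the iterations on the CPU
     * and 70% on the GPU, with a 4096-element blocking size so the runtime
     * DSM avoids false sharing between the two devices. Invented syntax. */
    #pragma omp parallel for devices(cpu, gpu) ratio(0.3, 0.7) blocksize(4096)
    for (int i = 1; i < n - 1; i++)
        b[i] = 0.5f * (a[i - 1] + a[i + 1]);
}
```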
Analyzers based on Compiling Techniques for MPI programs
Communication slicing and process mapping tool
Compiler part: PDG graph building and slice generation; iteration-set transformation for approximation
Optimized mapping tool: weighted graphs, hardware characteristics, graph partitioning and feedback-based evaluation
Memory bandwidth measuring tool for MPI programs: detects bursts of bandwidth requirements
Enhanced performance of MPI error checking: redundant error checking removed by dynamically turning the global error checking on/off, with the help of compiler analysis of communicators; integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT)
LoongCC: An Optimizing Compiler for Loongson Multicore Processors
Based on Open64-4.2, supporting C/C++/Fortran; open source at http://svn.open64.net/svnroot/open64/trunk/
Powerful optimizer and analyzer with better performance: SIMD intrinsic support, memory locality optimization, data layout optimization, data prefetching, and load/store grouping for 128-bit memory access instructions
Integrated with an Aggressive Auto-Parallelization Optimization (AAPO) module: dynamic privatization, a parallel model with dynamic alias optimization, and array reduction optimization
DigitalBridge: A Binary Translation System for Loongson Multicore Processors
Fully utilizes the hardware characteristics of Loongson CPUs: handles return instructions via a shadow stack; handles Eflags operations via flag patterns; emulates the x86 FPU with local FP registers; combines static and dynamic translation; handles indirect-jump tables; handles misaligned data accesses via dynamic profiling and an exception handler; improves data locality by pool allocation; promotes stack variables
Software Tools for High Performance Computing
Contact: Prof. Yi Liu, JSI, Beihang University. Email: [email protected]
LSP3AS: large-scale parallel program performance analysis system
[Diagram: source code is instrumented through the TAU instrumentation and measurement API; the instrumented code is compiled and linked with external libraries into an executable; running it in the target environment produces performance data files, which profiling and tracing tools, dynamic compensation, RDMA transmission and buffer management (RDMA library), and iteration-based clustering analysis with hierarchy-based clustering visualization turn into analysis and visualization results.]
Innovations over the traditional process of performance analysis (where each step depends on the previous one): analysis based on hierarchical clustering.
– Designed for performance tuning on peta-scale HPC systems
– The overall method is conventional:
• source code is instrumented by inserting specified function calls
• the instrumented code is executed while performance data are collected, generating profiling and tracing data files
• the profiling and tracing data are analyzed and a visualization report is generated
– Instrumentation: based on TAU from the University of Oregon
[Diagram: on each compute node, user processes write performance data into shared memory, where a sender thread forwards it; receiver threads on the I/O nodes write the data to the storage system through a Lustre client or GFS.]
LSP3AS: large-scale parallel program performance analysis system
With roughly 10 thousand nodes in a peta-scale system, massive performance data will be generated, transmitted and stored.
Scalable structure for performance data collection:
distributed data collection and transmission eliminate bottlenecks in the network and in data processing
a dynamic compensation algorithm reduces the influence of the performance data volume
efficient data transmission uses Remote Direct Memory Access (RDMA) to achieve high bandwidth and low latency
• Analysis & visualization: two approaches to deal with the huge amount of data
• data analysis: an iteration-based clustering approach from data mining
• visualization: clustering visualization based on hierarchy classification
SimHPC: Parallel Simulator
Challenge for HPC simulation: performance. Target systems exceed 1,000 nodes and processors, which is difficult for traditional architecture simulators such as Simics.
Our solution: parallel simulation, using a cluster to simulate a cluster
Use the same nodes in the host system as in the target. Basis: HPC systems use commercial processors, and the same blades are available to the simulator, so the execution time of an instruction sequence is the same on host and target
(processes make things a little more complicated; we discuss this below)
Advantage: no need to model and simulate detailed components such as processor pipelines and caches
Execution-driven, full-system simulation; supports execution of Linux and applications, including benchmarks (e.g. Linpack)
SimHPC: Parallel Simulator (cont’)
Analysis
The execution time of a process in the target system is composed of:
Tprocess = Trun + TIO + Tready
– Trun: execution time of instruction sequences (equal on host and target)
– TIO: I/O blocking time, such as reading/writing files and sending/receiving messages (unequal to the host; needs to be simulated)
– Tready: waiting time in the ready state (can be obtained from the Linux kernel on the host, but is unequal on the target and needs to be recalculated)
So our simulator needs to:
① capture system events: process scheduling and I/O operations (read/write files, MPI send()/recv())
② simulate the I/O and interconnection network subsystems
③ synchronize the timing of each application process
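The recalculation can be summarized in a few lines (a sketch; the field and function names are illustrative):

```c
/* Target-time reconstruction per process: Trun transfers from the host
 * unchanged, TIO is replaced by the simulated I/O/network value, and
 * Tready is recomputed by the simulated scheduler. Names illustrative. */
typedef struct {
    double t_run;     /* measured on host; identical on target (same CPUs) */
    double t_io;      /* host value not transferable to the target */
    double t_ready;   /* host value not transferable to the target */
} ProcTime;

double target_process_time(const ProcTime* host_t,
                           double sim_io, double sim_ready) {
    /* Tprocess = Trun + TIO + Tready */
    return host_t->t_run + sim_io + sim_ready;
}
```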
SimHPC: Parallel Simulator (cont’)
System architecture: the application processes of multiple target nodes are allocated to one host node (number of host nodes << number of target nodes). Events are captured on the host nodes while the application is running and are sent to a central node for analysis, time-axis synchronization, and simulation.
[Diagram: parallel application processes of several target nodes run on each host node (host hardware platform, host Linux, simulator); event-capture modules on the host nodes feed event collection, architecture simulation (interconnection network, disk I/O), analysis and time-axis synchronization, and control on the central node, producing the simulation results.]
SimHPC: Parallel Simulator (cont’)
• Experiment results
– Host: 5 IBM Blade HS21 (2-way Xeon); Target: 32-1024 nodes; OS: Linux; App: Linpack HPL
[Charts: simulation slowdown; Linpack performance and communication time for fat-tree and 2D-mesh interconnection networks; simulation error test.]
System-level Power Management
Power-aware job scheduling algorithm. Idea:
① suspend a node if its idle time exceeds a threshold
② wake up nodes when there are not enough awake nodes to execute jobs, while
③ avoiding node thrashing between the busy and suspended states: since suspend and wakeup operations themselves consume power, do not wake a suspended node that has only just gone to sleep
The algorithm is integrated into OpenPBS; a sketch follows below.
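A compact sketch of the three rules (thresholds and bookkeeping are illustrative assumptions; the real implementation lives inside OpenPBS):

```c
#include <stdbool.h>

typedef struct {
    bool   suspended;
    double idle_seconds;     /* time since the node last ran a job */
    double asleep_seconds;   /* time since the node was suspended  */
} Node;

#define IDLE_THRESHOLD 600.0 /* rule 1: suspend after 10 min idle (assumed) */
#define MIN_SLEEP      300.0 /* rule 3: a node that just slept stays asleep */

bool should_suspend(const Node* n) {
    return !n->suspended && n->idle_seconds > IDLE_THRESHOLD;
}

/* Called only when queued jobs need more nodes than are currently awake
 * (rule 2); rule 3 avoids thrashing, since suspend/wakeup costs power. */
bool may_wakeup(const Node* n) {
    return n->suspended && n->asleep_seconds >= MIN_SLEEP;
}
```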
System-level Power Management
• Power management tool
– monitors the power-related status of the system
– reduces the runtime power consumption of the machine
– multiple power management policies: manual control, on-demand control, suspend-enable, …
Layers of power management:
– node level: node sleep/wakeup, node on/off, CPU frequency control, fan speed control, power control of I/O equipment, …
– management/interface level: power management software and interfaces, a power management agent in each node
– policy level: power management policies
• Power management test
– on 5 IBM HS21 blades; power measurement via control and monitor commands, with status and power data collected from the system
Power management test for different task loads (compared to no power management):

Task load (tasks/hour) | Policy    | Task exec. time (s) | Power consumption (J) | Performance slowdown | Power saving
20                     | On-demand | 3.55                | 1,778,077             | 5.15%                | -1.66%
20                     | Suspend   | 3.60                | 1,632,521             | 9.76%                | -12.74%
200                    | On-demand | 3.55                | 1,831,432             | 4.62%                | -3.84%
200                    | Suspend   | 3.65                | 1,683,161             | 10.61%               | -10.78%
800                    | On-demand | 3.55                | 2,132,947             | 3.55%                | -7.05%
800                    | Suspend   | 3.66                | 2,123,577             | 11.25%               | -9.34%
System-level Power Management
Parallel Programming Platform for Astrophysics
Contact: Yunquan Zhang, ISCAS, Beijing
Joint work of the Shanghai Astronomical Observatory, CAS (SHAO), the Institute of Software, CAS (ISCAS), and the Shanghai Supercomputer Center (SSC)
Goal: build a high-performance parallel computing software platform for astrophysics research, focusing on planetary fluid dynamics and N-body problems
New parallel computing models and parallel algorithms are studied, validated and adopted to achieve high performance.
Parallel Computing Software Platform for Astrophysics
Software architecture
[Diagram: a web portal on CNGrid sits atop the software platform for astrophysics (fluid dynamics and the N-body problem), which builds on physical and mathematical models, parallel computing models and numerical methods; supporting components include an improved preconditioner, an improved library for collective communication, SpMV, PETSc, Aztec, FFTW, GSL, MPI, OpenMP, Fortran, C, Lustre, and software development and data processing/scientific visualization tools, all running on a 100T supercomputer.]
The PETSc optimized version 1 for astrophysics numerical simulation has been finished; an early performance evaluation of the Aztec code and the PETSc code on Dawning 5000A is shown.
For an 80×80×50 mesh, the execution time of the Aztec program is 4-7 times that of the PETSc version (average 6 times); for a 160×160×100 mesh, 2-5 times (average 4 times).
PETSc optimized version 1 (speedup 4-6):
Method 1: domain decomposition ordering method for field coupling; Method 2: preconditioner for the domain decomposition method; Method 3: PETSc multi-physics data structure
PETSc optimized version 2 (speedup 15-26):
meshes 128×128×96 (left) and 192×192×128 (right); computation speedup 15-26; strong scalability: original code normal, new code ideal; test environment: BlueGene/L at NCAR (HPCA2009)
[Charts: strong scalability of the rotmplinear case (mesh 192×192×128) on 64-8192 processor cores, comparing BG/L, Dawning 5000A and DeepComp 7000, measured on Dawning 5000A and on TianHe-1A.]
CLeXML Math Library
[Diagram: BLAS, FFT, LAPACK, an iterative solver and a task-parallel layer are built on a computational model of the CPU, using self-adaptive tuning, instruction reordering, software pipelining and multi-core parallelism.]
[Charts: BLAS2, BLAS3 and FFT performance, MKL vs. CLeXML.]
HPC Software Support for Earth System Modeling
Contact: Prof. Guangwen Yang, Tsinghua University
[Earth system model development workflow diagram: a development wizard and editor produce source code for the parallel algorithms of the earth system model; a compiler/debugger/optimizer builds the executable; runs in the running environment combine standard data sets, initial fields and boundary conditions, and other data through the data management subsystem; the computation output feeds result evaluation and result visualization via the data visualization and analysis tools.]
Expected results:
an integrated high-performance computing environment for earth system models
model application systems and demonstrative applications supporting research on global change
building on existing tools (compiler, system monitor, version control, editor) and new development tools (data conversion, diagnosis, debugging, performance analysis, high availability, template library, module library), on high-performance computers in China, following international software standards and resources
provide simplified APIs for locating model data paths
provide reliable metadata management and support user-defined metadata
support the DAP data access protocol and provide model data queries
web-based data access portal
provide an SQL-like query interface for climate model semantics
support parallel data aggregation and extraction
support online and offline conversion between different data formats
support graphic workflow operations
data processing service based on 'cloud' methods
provide fast and reliable parallel I/O for climate modeling
support compressed storage for earth science data
data storage service on a parallel file system
Integration and Management of Massive Heterogeneous Data
Technical route:
[Layered diagram: a storage layer (parallel file system PVFS2, memory file system, key-value storage system, compressed archive file system); a support layer (Hadoop, MPI, pNetCDF, HDF5, OpenDAP, GPU CUDA SDK, PIO, data grid middleware); a service layer of data access, data processing and data storage services (aggregation, extraction, conversion, query, publish, read, write, archive, share, visualization, browse, transfer); and a presentation layer with a request parsing engine behind APIs (C & Fortran), web services (REST & SOAP), a shell command line, an Eclipse client and a web browser.]
Fast Visualization and Diagnosis of Earth System Model Data
Research topics:
design and implementation of parallel visualization algorithms: parallel volume rendering algorithms that scale to hundreds of cores with efficient data sampling and composition, and parallel contour surface algorithms for quick extraction and composition of contour surfaces
software and hardware acceleration for graphics and imaging
performance optimization for TB-scale data field visualization
visual representation methods for earth system models
[Diagram: raw data (TB scale, netCDF/NC) from the HPC system is preprocessed and fed over a high-speed internal bus to computing nodes running a parallel visualization engine library (PVE launcher); graphical nodes turn OpenGL streams (via Chromium) into pixel streams (giga-bps) for a DMX-driven high-resolution renderer and display wall; a data processor and metadata manager serve web remote users and local users through GUI/CLI viewers on graphical workstations.]
MPMD Program Debugging and Analysis:
MPMD parallel program debugging
MPMD parallel program performance measurement and analysis
support for efficient execution of MPMD parallel programs
fault-tolerance technologies for MPMD parallel programs
Components: an MPMD program debugging and analysis environment, runtime support, high availability, and debugging and performance analysis, on the basic hardware/software environment
Technical route:
[Layered diagram: on the hardware (nodes and network), operating system, file system and libraries sit job and resource management (job control, resource management, job scheduling, management middleware) and an abstraction service; a service layer provides parallel debugging (instrumentation, group/interrupt control, system monitor, controller, plug-ins and commands, track analysis), performance analysis (data collection, analysis, representation) and reliability services; a presentation layer offers an IDE integration framework, shell command line, Eclipse client and browser, with job management, debug and performance-analysis plug-ins.]
Debugging and Optimization IDE for Earth System Model Programs
[Diagram: an earth system model abstraction service platform connects the debugging window and the performance analysis window to debugging services, resource management and job scheduling, and performance optimization services for earth system model MPMD programs; it covers debugging monitoring, program event collection, hierarchical scheduling, debugging replay, system failure notification and fault-tolerant scheduling, reliable monitoring of the execution environment, and performance sampling data.]
Expected results:
a plug-in-based expandable development platform
a template-based development supporting environment
a tool library for earth system model development
typical earth system model applications developed using the integrated development environment
Integrated Development Environment (IDE): plug-in integration method
[Diagram: the Eclipse platform (workbench with JFace and SWT, workspace, platform runtime, help, team and debug components), the Java Development Tools (JDT) and the Plug-in Development Environment (PDE), with third-party tools and your own tools plugged in alongside.]
Encapsulation of reusable modules:
radiation module, time integration module, boundary layer module, coupler module, solver module, …
Module units are high-performance and reusable; they are encapsulated following the module encapsulation specification into a model module library.
Thank You !