Efforts on Programming Environment and Tools in China's High-tech R&D Program
Depei Qian, Sino-German Joint Software Institute (JSI), Beihang University
Email: [email protected]
CScADS tools workshop, 2011
China’s High-tech Program
The National High-tech R&D Program (863 Program) was proposed by four senior Chinese scientists and approved by former leader Deng Xiaoping in March 1986
One of the most important national science and technology R&D programs in China
Now a regular national R&D program planned in five-year terms; the term just finished belongs to the 11th five-year plan
863 key projects on HPC and Grid
“High performance computer and core software”
4-year project, May 2002 to Dec. 2005
100 million Yuan funding from the MOST, plus more than 2× associated funding from local governments, application organizations, and industry
Outcome: China National Grid (CNGrid)
“High productivity computer and Grid service environment”
Period: 2006-2010
940 million Yuan from the MOST, and more than 1 billion Yuan matching money from other sources
HPC development (2006-2010)
First phase: developing two 100TFlops machines
Dawning 5000A for SSC
Lenovo DeepComp 7000 for SC of CAS
Second phase: three 1000TFlops machines
Tianhe-1A: CPU+GPU, NUDT/Tianjin Supercomputing Center
Dawning 6000: CPU+GPU, ICT/Dawning/South China Supercomputing Center (Shenzhen)
Sunway: CPU-only, Jiangnan/Shandong Supercomputing Center
CNGrid development
11 sites:
CNIC, CAS (Beijing, major site)
Shanghai Supercomputer Center (Shanghai, major site)
Tsinghua University (Beijing)
Institute of Applied Physics and Computational Mathematics (Beijing)
University of Science and Technology of China (Hefei, Anhui)
Xi’an Jiaotong University (Xi’an, Shaanxi)
Shenzhen Institute of Advanced Technology (Shenzhen, Guangdong)
Hong Kong University (Hong Kong)
Shandong University (Jinan, Shandong)
Huazhong University of Science and Technology (Wuhan, Hubei)
Gansu Provincial Computing Center
The CNGrid Operation Center (based on CNIC, CAS)
CNGrid GOS Architecture
[Architecture diagram: CNGrid GOS is organized as a hosting environment (Tomcat/Apache + Axis, interoperating with GT4, gLite and OMII); a core layer (Agora with user, resource and Agora management; Grip runtime with Grip instance management; naming; security; resource access control and sharing); a system layer (message service, CA service, dynamic deploy service, batch job management, metascheduler, account/file/metainfo management, DB service, workflow engine); and a tool/application layer (system management portal, HPCG application and management portal with HPCG backend, GSML browser/composer/workshop, IDE, compiler, debugger, Gsh and command-line tools, VegaSSH, GridWorkflow, DataGrid, and other domain-specific applications built over the GOS library (batch, message, file, etc.) and GOS system calls (resource, Agora, user and Grip management), plus other third-party software and tools on Java J2SE.]
Deployment stack: Grid portal, Gsh+CLI, GSML Workshop and Grid applications run over core-, system- and application-level services; Axis handlers provide message-level security; the services run on Tomcat (5.0.28) + Axis (1.2 rc2) over J2SE (1.4.2_07, 1.5.0_07), on PC servers (grid servers) running Linux/Unix/Windows.
Parallel middlewares for scientific computing
[Diagram: a common infrastructure sits between application codes and computers. Data dependencies are extracted to form data structures; data structures promote communications and load balancing, which support parallel computing models; models, stencils and algorithms are separated into special libraries and the common infrastructure.]
Basic ideas
Hides parallel programming over millions of cores and the hierarchy of parallel computers;
Integrates efficient implementations of parallel fast numerical algorithms;
Provides efficient data structures and solver libraries;
Supports software engineering for code extensibility.
Basic ideas
[Diagram: serial programming on a personal computer is scaled up, using the infrastructures, to TeraFlops clusters and PetaFlops MPPs running the same application codes.]
Basic Ideas
Application domains: inertial confinement fusion, global climate modeling, CFD, material simulations, particle simulation, etc., on structured and unstructured grids.
JASMIN: J parallel Adaptive Structured Mesh INfrastructure
http://www.iapcm.ac.cn/jasmin, registration no. 2010SR050446, developed 2003-now
JASMIN
Architecture: multilayered, modularized, object-oriented; Codes: C++/C/F90/F77 + MPI/OpenMP, 500,000 lines; Installation: personal computers, clusters, MPP.
JASMIN V2.0
User provides: physics, parameters, numerical methods, expert experience, special algorithms, etc.
HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc.
Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc.
User interfaces: component-based parallel programming models (C++ classes)
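To make the component idea concrete, here is a minimal, hypothetical sketch of what such a component-based user interface can look like; the class and method names are illustrative assumptions, not the real JASMIN API:

```cpp
// Hypothetical sketch of a JASMIN-style component interface; names are
// illustrative, not the real JASMIN API. The framework owns the mesh,
// parallelization, load balancing and adaptivity; the user supplies only
// the single-patch numerics as a C++ class.
class PatchComponent {
public:
    virtual ~PatchComponent() {}
    // Called by the framework for every local patch of the distributed mesh.
    virtual void computeOnPatch(const double* u, double* unew,
                                int nx, int ny, double dt) = 0;
};

// User-side code: pure serial numerics on one structured patch.
class HeatDiffusion : public PatchComponent {
public:
    void computeOnPatch(const double* u, double* unew,
                        int nx, int ny, double dt) override {
        // Explicit 5-point stencil update; ghost cells are assumed to be
        // filled by the framework, so the loop skips the patch boundary.
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i)
                unew[j * nx + i] = u[j * nx + i]
                    + dt * (u[j * nx + (i - 1)] + u[j * nx + (i + 1)]
                          + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]
                          - 4.0 * u[j * nx + i]);
    }
};
```

The point of the design is that the same user class runs unchanged from a personal computer to a PetaFlops MPP, because all scaling concerns live in the infrastructure.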
JASMIN
[Figure: mesh types supported by JASMIN]
13 codes, 46 researchers, concurrent development
Simulation cycle of the ICF application codes: different combinations of numerical methods, physical parameters, and expert experience.
Hides parallel computing and adaptive implementations using tens of thousands of CPU cores;
Provides efficient data structures, algorithms and solvers;
Supports software engineering for code extensibility.
Inertial Confinement Fusion: 2004-now
Codes and CPU cores used:
LARED-S: 32,768      RH2D: 1,024
LARED-P: 72,000      HIME3D: 3,600
LAP3D: 16,384        PDD3D: 4,096
MEPH3D: 38,400       LARED-R: 512
MD3D: 80,000         LARED Integration: 128
RT3D: 1,000
Simulation duration: several hours to tens of hours.
Numerical simulations on TianHe-1A
Codes: status in 2004 vs. 2010:
LARED-H (2-D radiation hydrodynamics Lagrange code): serial, single block, without capsule → parallel, multiblock, NIF ignition target
LARED-R (2-D radiation transport code): serial → parallel (2,048 cores)
LARED-S (3-D radiation hydrodynamics Euler code): 2-D single group, 3-D without radiation → MPI parallel (32,768 cores), single-level SAMR, 2-D multi-group diffusion, 3-D radiation multigroup diffusion
LARED-P (3-D laser plasma interaction code): → MPI parallel (36,000 cores), terascale numbers of particles
Scaled up by a factor of 1,000.
GPU programming support and performance optimization
Contact: Prof. Xiaoshe Dong, Xi’an Jiaotong University
Email: [email protected]
GPU program optimization
Three approaches for GPU program optimization: memory-access level, kernel-speedup level, and data-partition level.
Source-to-source translation for GPU
Developed a source-to-source translator, GPU-S2S, for GPU
Facilitates the development of parallel programs on GPU by combining automatic mapping and static compilation
Source-to-source translation for GPU
Insert directives into the source program (illustrated below)
to guide implicit calling of CUDA runtime libraries
to enable the user to control the mapping of compute-intensive applications from the homogeneous CPU platform to the GPU's streaming platform
Optimization based on runtime profiling
takes full advantage of the GPU according to the characteristics of the application by collecting runtime dynamic information.
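As an illustration of the directive-driven flow (the actual GPU-S2S directive syntax is not shown in the slides, so the pragma below is an assumed placeholder):

```c
/* Hypothetical example of an annotated source program; "#pragma gpus2s"
 * is an assumed placeholder for the real GPU-S2S directive syntax. */
#define N 4096
float a[N], b[N], c[N];

void vec_add(void) {
    /* The translator would recognize this annotated loop as a computing
     * kernel and replace it with generated CUDA host code (cudaMalloc,
     * cudaMemcpy, kernel launch) plus a device kernel. */
    #pragma gpus2s kernel copyin(a, b) copyout(c)  /* assumed syntax */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```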
The GPU-S2S architecture
[Architecture diagram: GPU-S2S comprises a layer of software productivity (Pthread thread model, MPI message-transfer model, PGAS programming model) and a layer of performance discovery (runtime performance collection, calling of shared libraries, profile information), built on the GPU supporting library, the user standard library, the operating system and the GPU platform.]
Program translation by GPU-S2S
[Diagram: the source code before translation is a homogeneous platform program framework: homogeneous platform code with directives, plus the computing functions it calls, the user standard library and calls to shared libraries. The source code after translation is a GPU streaming-architecture program framework: a CPU control program, GPU kernel programs generated according to templates, a general-purpose computing interface, and a user-defined part, drawing on a template library of optimized compute-intensive applications and a profile library.]
[Workflow diagram: homogeneous platform code (*.c, *.h) passes through pretreatment and first-level dynamic instrumentation; compiling and running it extracts the first-level profile information (the computing kernel), after which directives are inserted automatically. Second-level dynamic instrumentation, compiled and run again, extracts data block size and shared-memory configuration parameters and judges whether streams can be used; CUDA code containing the optimized kernel is then generated. If further optimization is needed, third-level dynamic instrumentation in the CUDA code extracts the number of streams and the data size of every stream, and CUDA code using streams is generated; otherwise the process terminates. The resulting CUDA code (*.h, *.cu, *.c) is built with the CUDA compiler tools into executable code (*.o) for the GPU.]
Runtime optimization based on profiling
First level profiling (function level)
Second level profiling (memory access and kernel improvement)
Third level profiling (data partition)
First level profiling
Scans the source code before translation, finds each function and inserts instrumentation before and after it, computes the execution time of every function, and finally identifies the computing kernels.
[Diagram: the source-to-source compiler takes homogeneous platform code (allocate address space, initialization, function0 … functionN, free address space) and brackets every functionK with instrumentationK probes.]
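In effect, the inserted probes reduce to timers around every candidate function. A minimal self-contained sketch (probe placement assumed; POSIX gettimeofday for timing; the functions are stand-ins):

```c
#include <stdio.h>
#include <sys/time.h>

static double now_ms(void) {          /* wall-clock time in milliseconds */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e3 + tv.tv_usec / 1e3;
}

static void function0(void) {         /* stand-in for a candidate kernel */
    volatile double s = 0;
    for (long i = 0; i < 50000000; i++) s += (double)i;
}
static void function1(void) {         /* stand-in for a cheap function */
    volatile double s = 0;
    for (long i = 0; i < 1000000; i++) s += (double)i;
}

int main(void) {
    double t0 = now_ms();             /* instrumentation0, before */
    function0();
    double t1 = now_ms();             /* instrumentation0 after / 1 before */
    function1();
    double t2 = now_ms();             /* instrumentation1, after */
    /* The function with the dominant share of time is the computing kernel. */
    printf("function0: %.1f ms, function1: %.1f ms\n", t1 - t0, t2 - t1);
    return 0;
}
```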
Second level profiling
GPU-S2S scans the code and inserts instrumentation at the corresponding places in the computing kernels;
it extracts profile information, analyzes the code, performs optimizations, expands the templates according to the features of the application, and finally generates CUDA code with optimized kernels.
Using shared memory is a general approach; it involves 13 parameters, and performance differs with different parameter values.
[Diagram: the source-to-source compiler inserts instrumentation around computing kernel 1, 2, 3, … in the homogeneous platform code.]
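For reference, the generic shape of the shared-memory optimization is the tiled kernel below. This is the standard CUDA pattern, not the actual GPU-S2S template (whose 13 tunable parameters are not listed in the slides); TILE stands in for one such parameter:

```cuda
#define TILE 16

/* Generic shared-memory tiling for matrix multiply; assumes n is a
 * multiple of TILE so every thread maps to a valid element. */
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        /* Stage one tile of A and one tile of B in on-chip shared memory. */
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               /* done with this tile pair */
    }
    C[row * n + col] = acc;
}
```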
Third level profiling
GPU-S2S scans the code, finds each computing kernel and its copy functions, and inserts instrumentation at the corresponding places to obtain the copy time and the computing time. From these times it computes the number of streams and the data size of each stream, and finally generates optimized CUDA code with streams.
[Diagram: the source-to-source compiler brackets the CUDA control code (allocate address space, initialization, allocate global address space, function0--copyin, function0--kernel, function0--copyout, free address space) with instrumentation probes around the copy-in, kernel and copy-out phases.]
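The generated stream code follows the standard copy/compute overlap pattern sketched below (a minimal sketch: the kernel is only declared, pinned host buffers and divisibility of n by nStreams are assumed):

```cuda
__global__ void kernel(const float* in, float* out);  /* some computing kernel */

/* nStreams and the chunk size are exactly the values the third-level
 * profile derives from measured copy time vs. computing time. h_in/h_out
 * must be pinned (cudaMallocHost) for the async copies to overlap. */
void run_with_streams(const float* h_in, float* h_out, float* d_in,
                      float* d_out, int n, int nStreams) {
    cudaStream_t streams[16];              /* assumes nStreams <= 16 */
    int chunk = n / nStreams;              /* assumes n % nStreams == 0 */
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);
    for (int i = 0; i < nStreams; ++i) {
        int off = i * chunk;
        /* Copy-in, kernel and copy-out of chunk i all queue on stream i,
         * so chunk i's copies overlap chunk j's computation. */
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        kernel<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_in + off,
                                                            d_out + off);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < nStreams; ++i) {   /* wait for all chunks, clean up */
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```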
Verification and experiment
Experiment platform: a server with a 4-core Xeon CPU and 12GB memory, an NVIDIA Tesla C1060, Red Hat Enterprise Linux Server 5.3, and CUDA 2.3.
Test examples: matrix multiplication and fast Fourier transform (FFT).
Matrix multiplication: performance comparison before and after profiling, and execution performance comparison on different platforms
The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory-access optimization, and a 91% improvement over the CUDA code using only global memory for computing.
[Charts: execution time (ms) versus input sizes 1024-8192, comparing the three-level profile optimization against CPU execution, and comparing global-memory-only, memory-access optimization, second-level and third-level profile optimization.]
FFT (1,048,576 points): performance comparison before and after profiling, and execution performance comparison on different platforms
The CUDA code after three-level profile optimization achieves a 38% improvement over the CUDA code with memory-access optimization, and a 77% improvement over the CUDA code using only global memory for computing.
[Charts: execution time (ms) versus number of batches (15-60), comparing global-memory-only, memory-access optimization, second-level and third-level profile optimization, and comparing the three-level profile optimization against CPU execution.]
Programming Multi-GPU system
The traditional programming models, MPI and PGAS, are not directly suitable for the new CPU+GPU platform, and legacy applications cannot exploit the power of GPUs.
Programming model for the CPU-GPU architecture:
combine a traditional programming model with a GPU-specific programming model, forming a mixed programming model;
obtain better performance on the CPU-GPU architecture, making more efficient use of the computing power.
[Diagram: nodes pairing CPUs with multiple GPUs. The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models for this new kind of system: message passing (MPI) or shared data (PGAS private/shared spaces) handles communication between parallel tasks, while each task drives its own GPU through device memory.]
Programming Multi-GPU system
Mixed Programming Model
[Diagram: the program starts on the MPI/UPC runtime; each parallel task runs its primary control on a CPU, chooses a device, copies source data from main memory to device memory (cudaMemcpy), calls the computing kernel on the GPU through the CUDA runtime, copies the result data back, and communicates with the other tasks through the communication interface of the upper programming model.]
NVIDIA GPU: CUDA. Traditional programming model: MPI/UPC. Combined: MPI+CUDA / UPC+CUDA.
Mixed programming model:
The primary control of an application is implemented in the MPI or UPC programming model; the computing kernels of the application are implemented in CUDA, using the GPU to accelerate computing.
Optimizing the computing kernel makes better use of the GPUs; using GPU-S2S to generate the computing kernel program hides the CPU+GPU heterogeneity from the user and improves the portability of the application.
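A minimal MPI+CUDA sketch of this division of labor (illustrative only; it assumes two GPUs per node, as in the testbed described later):

```cuda
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale_kernel(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                  /* the CUDA computing kernel */
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                /* MPI: primary control */
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 2);               /* one GPU per task, 2 per node */

    const int n = 1 << 20;
    float *h = (float*)malloc(n * sizeof(float)), *d, local, sum;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); /* copy in */
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost); /* copy out */

    local = h[0];                          /* MPI: communication between tasks */
    MPI_Reduce(&local, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", sum);
    cudaFree(d); free(h);
    MPI_Finalize();
    return 0;
}
```

The control program is compiled with mpicc and the kernel with nvcc, matching the compiling process shown next.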
Compiling process:
[Diagram: the primary control program, with declarations of the computing kernels (#include), is compiled with mpicc/upcc; the computing kernel programs are compiled with nvcc; the objects are linked with nvcc, and the result is run with mpirun/upcrun.]
MPI+CUDA experiment
Platform: 2 NF5588 servers, each equipped with:
1 Xeon CPU (2.27GHz), 12GB main memory
2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4GB device memory)
1Gbit Ethernet
RedHat Linux 5.3, CUDA Toolkit 2.3 and CUDA SDK, OpenMPI 1.3, Berkeley UPC 2.1
MPI+CUDA experiment (cont’)
Matrix multiplication program: uses block matrix multiplication for UPC programming, with the data spread across the UPC threads. The computing kernel multiplies two blocks at a time and is implemented in CUDA.
Total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel, where
Tcom: UPC thread communication time
Tcuda: CUDA program execution time
Tcopy: data transmission time between host and device
Tkernel: GPU computing time
MPI+CUDA experiment (cont’)
For 4096×4096 matrices, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184× that of the 8-MPI-task case.
For small-scale data, such as 256 or 512, the execution time using 2 GPUs is even longer than using 1 GPU: the computing scale is too small, and the communication between the two tasks overwhelms the reduction in computing time.
[Charts: 2 servers with at most 8 MPI tasks vs. 1 server with 2 GPUs; matrix sizes 8192×8192 and 16384×16384.]
Tcuda is reduced as the number of tasks increases, but the Tsum of 4 tasks is larger than that of 2. Reason: the latency of the Ethernet between the 2 servers is much higher than the latency of the bus inside one server.
If the computing scale is larger, or a faster network (e.g. InfiniBand) is used, multiple nodes with multiple GPUs will still improve application performance.
Advanced Compiler Technology (ACT) Group at the ICT, CAS
The Institute of Computing Technology (ICT), founded in 1956, is the first and leading institute on computing technology in China
ACT was founded in the early 1960s and has over 40 years of experience with compilers:
compilers for most of the mainframes developed in China
compiler and binary translation tools for Loongson processors
parallel compilers and tools for the Dawning series (SMP/MPP/cluster)
Advanced Compiler Technology (ACT) Group at the ICT, CAS
ACT's current research: parallel programming languages and models; optimizing compilers and tools for HPC (Dawning) and multi-core processors (Loongson)
Advanced Compiler Technology (ACT) Group at the ICT, CAS
• PTA model (Process-based TAsk parallel programming model)
– new process-based task construct, with properties of isolation, atomicity and deterministic submission
– annotates a loop into two parts, prologue and task segment:
#pragma pta parallel [clauses]
#pragma pta task
#pragma pta propagate (varlist)
– suitable for expressing coarse-grained, irregular parallelism in loops (a hedged sketch follows below)
• Implementation and performance
– PTA compiler, runtime system and assistant tool (helps write correct programs)
– speedup: 4.62 to 43.98 (average 27.58 on 48 cores); 3.08 to 7.83 (average 6.72 on 8 cores)
– code changes are within 10 lines, much smaller than with OpenMP
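A hedged sketch of an annotated loop using the three pragmas named above; everything beyond the pragma names themselves (clause use, placement) is assumed:

```c
extern long expensive_irregular_work(int i);   /* user function, varying cost */
long result[1000];

void process_items(int n) {
    /* The prologue runs in the parent; each task segment becomes an
     * isolated process-based task whose writes take effect atomically
     * at deterministic submission. Placement of the pragmas is assumed. */
    #pragma pta parallel
    for (int i = 0; i < n; i++) {
        int item = i;                          /* prologue part of the loop */
        #pragma pta task
        {
            result[item] = expensive_irregular_work(item);
        }
        #pragma pta propagate(result)          /* commit the task's writes */
    }
}
```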
UPC-H: A Parallel Programming Model for Deep Parallel Hierarchies
Hierarchical UPC: provides multi-level data distribution, with implicit and explicit hierarchical loop parallelism
Hybrid execution model: SPMD with fork-join
Multi-dimensional data distribution and super-pipelining
Implementations on CUDA clusters and the Dawning 6000 cluster, based on Berkeley UPC
Enhanced optimizations such as localization and communication optimization
Supports SIMD intrinsics
CUDA cluster: 72% of the hand-tuned version's performance, with code reduced to 68%
Multi-core cluster: better process mapping and cache reuse than UPC
OpenMP and Runtime Support for Heterogeneous Platforms
Heterogeneous platforms consist of CPUs and GPUs; multiple GPUs, or CPU-GPU cooperation, bring extra data transfers that hurt the performance gain, so programmers need a unified data management system
OpenMP extension (see the sketch below):
specify the partitioning ratio to optimize data transfer globally
specify heterogeneous blocking sizes to reduce false sharing among computing devices
Runtime support:
DSM system based on the specified blocking size
intelligent runtime prefetching with the help of compiler analysis
Implementation and results: on the OpenUH compiler; gains a 1.6× speedup through prefetching on NPB/SP (class C)
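Rendered as code, the two extensions might look like the following; the clause names devices(), ratio() and blocksize() are invented placeholders, since the slides do not give the group's actual syntax:

```c
void stencil_step(const float* a, float* b, int n) {
    /* Hypothetical extended OpenMP: run 30% of the iterations on the CPU
     * and 70% on the GPU, with a 4096-element blocking size so the runtime
     * DSM avoids false sharing between the two devices. Invented syntax. */
    #pragma omp parallel for devices(cpu, gpu) ratio(0.3, 0.7) blocksize(4096)
    for (int i = 1; i < n - 1; i++)
        b[i] = 0.5f * (a[i - 1] + a[i + 1]);
}
```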
Analyzers based on Compiling Techniques for MPI programs
Communication slicing and process mapping tool
Compiler part: PDG graph building and slice generation; iteration-set transformation for approximation
Optimized mapping tool: weighted graphs, hardware characteristics, graph partitioning and feedback-based evaluation
Memory bandwidth measuring tool for MPI programs: detects bursts of bandwidth requirements
Enhanced performance of MPI error checking: redundant error checking removed by dynamically turning the global error checking on/off, with the help of compiler analysis of communicators; integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT)
LoongCC: An Optimizing Compiler for Loongson Multicore Processors
Based on Open64-4.2, supporting C/C++/Fortran; open source at http://svn.open64.net/svnroot/open64/trunk/
Powerful optimizer and analyzer with better performance: SIMD intrinsic support, memory locality optimization, data layout optimization, data prefetching, and load/store grouping for 128-bit memory access instructions
Integrated with an Aggressive Auto-Parallelization Optimization (AAPO) module: dynamic privatization, a parallel model with dynamic alias optimization, and array reduction optimization
DigitalBridge: A Binary Translation System for Loongson Multicore Processors
Fully utilizes the hardware characteristics of Loongson CPUs: handles return instructions via a shadow stack; handles Eflags operations via flag patterns; emulates the x86 FPU with local FP registers; combines static and dynamic translation; handles indirect-jump tables; handles misaligned data accesses via dynamic profiling and an exception handler; improves data locality by pool allocation; promotes stack variables
Software Tools for High Performance Computing
Contact: Prof. Yi Liu, JSI, Beihang University. Email: [email protected]
LSP3AS: large-scale parallel program performance analysis system
[Diagram: source code is instrumented through the TAU instrumentation and measurement API; the instrumented code is compiled and linked with external libraries into an executable; running it in the target environment produces performance data files, which profiling and tracing tools, dynamic compensation, RDMA transmission and buffer management (RDMA library), and iteration-based clustering analysis with hierarchy-based clustering visualization turn into analysis and visualization results.]
Innovations over the traditional process of performance analysis (where each step depends on the previous one): analysis based on hierarchical clustering.
– Designed for performance tuning on peta-scale HPC systems
– The overall method is conventional:
• source code is instrumented by inserting specified function calls
• the instrumented code is executed while performance data are collected, generating profiling and tracing data files
• the profiling and tracing data are analyzed and a visualization report is generated
– Instrumentation: based on TAU from the University of Oregon
[Diagram: on each compute node, user processes write performance data into shared memory, where a sender thread forwards it; receiver threads on the I/O nodes write the data to the storage system through a Lustre client or GFS.]
LSP3AS: large-scale parallel program performance analysis system
With roughly 10 thousand nodes in a peta-scale system, massive performance data will be generated, transmitted and stored.
Scalable structure for performance data collection:
distributed data collection and transmission eliminate bottlenecks in the network and in data processing
a dynamic compensation algorithm reduces the influence of the performance data volume
efficient data transmission uses Remote Direct Memory Access (RDMA) to achieve high bandwidth and low latency
• Analysis & visualization: two approaches to deal with the huge amount of data
• data analysis: an iteration-based clustering approach from data mining
• visualization: clustering visualization based on hierarchy classification
SimHPC: Parallel Simulator
Challenge for HPC simulation: performance. Target systems exceed 1,000 nodes and processors, which is difficult for traditional architecture simulators such as Simics.
Our solution: parallel simulation, using a cluster to simulate a cluster
Use the same nodes in the host system as in the target. Basis: HPC systems use commercial processors, and the same blades are available to the simulator, so the execution time of an instruction sequence is the same on host and target
(processes make things a little more complicated; we discuss this below)
Advantage: no need to model and simulate detailed components such as processor pipelines and caches
Execution-driven, full-system simulation; supports execution of Linux and applications, including benchmarks (e.g. Linpack)
SimHPC: Parallel Simulator (cont’)
Analysis
The execution time of a process in the target system is composed of:
Tprocess = Trun + TIO + Tready
– Trun: execution time of instruction sequences (equal on host and target)
– TIO: I/O blocking time, such as reading/writing files and sending/receiving messages (unequal to the host; needs to be simulated)
– Tready: waiting time in the ready state (can be obtained from the Linux kernel on the host, but is unequal on the target and needs to be recalculated)
So our simulator needs to:
① capture system events: process scheduling and I/O operations (read/write files, MPI send()/recv())
② simulate the I/O and interconnection network subsystems
③ synchronize the timing of each application process
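The recalculation can be summarized in a few lines (a sketch; the field and function names are illustrative):

```c
/* Target-time reconstruction per process: Trun transfers from the host
 * unchanged, TIO is replaced by the simulated I/O/network value, and
 * Tready is recomputed by the simulated scheduler. Names illustrative. */
typedef struct {
    double t_run;     /* measured on host; identical on target (same CPUs) */
    double t_io;      /* host value not transferable to the target */
    double t_ready;   /* host value not transferable to the target */
} ProcTime;

double target_process_time(const ProcTime* host_t,
                           double sim_io, double sim_ready) {
    /* Tprocess = Trun + TIO + Tready */
    return host_t->t_run + sim_io + sim_ready;
}
```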
SimHPC: Parallel Simulator (cont’)
System architecture: the application processes of multiple target nodes are allocated to one host node (number of host nodes << number of target nodes). Events are captured on the host nodes while the application is running and are sent to a central node for analysis, time-axis synchronization, and simulation.
[Diagram: parallel application processes of several target nodes run on each host node (host hardware platform, host Linux, simulator); event-capture modules on the host nodes feed event collection, architecture simulation (interconnection network, disk I/O), analysis and time-axis synchronization, and control on the central node, producing the simulation results.]
SimHPC: Parallel Simulator (cont’)
• Experiment results
– Host: 5 IBM Blade HS21 (2-way Xeon); Target: 32-1024 nodes; OS: Linux; App: Linpack HPL
[Charts: simulation slowdown; Linpack performance and communication time for fat-tree and 2D-mesh interconnection networks; simulation error test.]
System-level Power Management
Power-aware job scheduling algorithm. Idea:
① suspend a node if its idle time exceeds a threshold
② wake up nodes when there are not enough awake nodes to execute jobs, while
③ avoiding node thrashing between the busy and suspended states: since suspend and wakeup operations themselves consume power, do not wake a suspended node that has only just gone to sleep
The algorithm is integrated into OpenPBS; a sketch follows below.
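A compact sketch of the three rules (thresholds and bookkeeping are illustrative assumptions; the real implementation lives inside OpenPBS):

```c
#include <stdbool.h>

typedef struct {
    bool   suspended;
    double idle_seconds;     /* time since the node last ran a job */
    double asleep_seconds;   /* time since the node was suspended  */
} Node;

#define IDLE_THRESHOLD 600.0 /* rule 1: suspend after 10 min idle (assumed) */
#define MIN_SLEEP      300.0 /* rule 3: a node that just slept stays asleep */

bool should_suspend(const Node* n) {
    return !n->suspended && n->idle_seconds > IDLE_THRESHOLD;
}

/* Called only when queued jobs need more nodes than are currently awake
 * (rule 2); rule 3 avoids thrashing, since suspend/wakeup costs power. */
bool may_wakeup(const Node* n) {
    return n->suspended && n->asleep_seconds >= MIN_SLEEP;
}
```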
System-level Power Management
• Power management tool
– monitors the power-related status of the system
– reduces the runtime power consumption of the machine
– multiple power management policies: manual control, on-demand control, suspend-enable, …
Layers of power management:
– node level: node sleep/wakeup, node on/off, CPU frequency control, fan speed control, power control of I/O equipment, …
– management/interface level: power management software and interfaces, a power management agent in each node
– policy level: power management policies
• Power management test
– on 5 IBM HS21 blades; power measurement via control and monitor commands, with status and power data collected from the system
Power management test for different task loads (compared to no power management):

Task load (tasks/hour) | Policy    | Task exec. time (s) | Power consumption (J) | Performance slowdown | Power saving
20                     | On-demand | 3.55                | 1,778,077             | 5.15%                | -1.66%
20                     | Suspend   | 3.60                | 1,632,521             | 9.76%                | -12.74%
200                    | On-demand | 3.55                | 1,831,432             | 4.62%                | -3.84%
200                    | Suspend   | 3.65                | 1,683,161             | 10.61%               | -10.78%
800                    | On-demand | 3.55                | 2,132,947             | 3.55%                | -7.05%
800                    | Suspend   | 3.66                | 2,123,577             | 11.25%               | -9.34%
System-level Power Management
Parallel Programming Platform for Astrophysics
Contact: Yunquan Zhang, ISCAS, Beijing
Joint work of the Shanghai Astronomical Observatory, CAS (SHAO), the Institute of Software, CAS (ISCAS), and the Shanghai Supercomputer Center (SSC)
Goal: build a high-performance parallel computing software platform for astrophysics research, focusing on planetary fluid dynamics and N-body problems
New parallel computing models and parallel algorithms are studied, validated and adopted to achieve high performance.
Parallel Computing Software Platform for Astrophysics
Software architecture
[Diagram: a web portal on CNGrid sits atop the software platform for astrophysics (fluid dynamics and the N-body problem), which builds on physical and mathematical models, parallel computing models and numerical methods; supporting components include an improved preconditioner, an improved library for collective communication, SpMV, PETSc, Aztec, FFTW, GSL, MPI, OpenMP, Fortran, C, Lustre, and software development and data processing/scientific visualization tools, all running on a 100T supercomputer.]
The PETSc optimized version 1 for astrophysics numerical simulation has been finished; an early performance evaluation of the Aztec code and the PETSc code on Dawning 5000A is shown.
For an 80×80×50 mesh, the execution time of the Aztec program is 4-7 times that of the PETSc version (average 6 times); for a 160×160×100 mesh, 2-5 times (average 4 times).
PETSc optimized version 1 (speedup 4-6):
Method 1: domain decomposition ordering method for field coupling; Method 2: preconditioner for the domain decomposition method; Method 3: PETSc multi-physics data structure
PETSc optimized version 2 (speedup 15-26):
meshes 128×128×96 (left) and 192×192×128 (right); computation speedup 15-26; strong scalability: original code normal, new code ideal; test environment: BlueGene/L at NCAR (HPCA2009)
[Charts: strong scalability of the rotmplinear case (mesh 192×192×128) on 64-8192 processor cores, comparing BG/L, Dawning 5000A and DeepComp 7000, measured on Dawning 5000A and on TianHe-1A.]
CLeXML Math Library
[Diagram: BLAS, FFT, LAPACK, an iterative solver and a task-parallel layer are built on a computational model of the CPU, using self-adaptive tuning, instruction reordering, software pipelining and multi-core parallelism.]
[Charts: BLAS2, BLAS3 and FFT performance, MKL vs. CLeXML.]
HPC Software Support for Earth System Modeling
Contact: Prof. Guangwen Yang, Tsinghua University
[Earth system model development workflow diagram: a development wizard and editor produce source code for the parallel algorithms of the earth system model; a compiler/debugger/optimizer builds the executable; runs in the running environment combine standard data sets, initial fields and boundary conditions, and other data through the data management subsystem; the computation output feeds result evaluation and result visualization via the data visualization and analysis tools.]
Expected results:
an integrated high-performance computing environment for earth system models
model application systems and demonstrative applications supporting research on global change
building on existing tools (compiler, system monitor, version control, editor) and new development tools (data conversion, diagnosis, debugging, performance analysis, high availability, template library, module library), on high-performance computers in China, following international software standards and resources
provide simplified APIs for locating model data paths
provide reliable metadata management and support user-defined metadata
support the DAP data access protocol and provide model data queries
web-based data access portal
provide an SQL-like query interface for climate model semantics
support parallel data aggregation and extraction
support online and offline conversion between different data formats
support graphic workflow operations
data processing service based on 'cloud' methods
provide fast and reliable parallel I/O for climate modeling
support compressed storage for earth science data
data storage service on a parallel file system
Integration and Management of Massive Heterogeneous Data
Technical route:
[Layered diagram: a storage layer (parallel file system PVFS2, memory file system, key-value storage system, compressed archive file system); a support layer (Hadoop, MPI, pNetCDF, HDF5, OpenDAP, GPU CUDA SDK, PIO, data grid middleware); a service layer of data access, data processing and data storage services (aggregation, extraction, conversion, query, publish, read, write, archive, share, visualization, browse, transfer); and a presentation layer with a request parsing engine behind APIs (C & Fortran), web services (REST & SOAP), a shell command line, an Eclipse client and a web browser.]
Fast Visualization and Diagnosis of Earth System Model Data
Research topics:
design and implementation of parallel visualization algorithms: parallel volume rendering algorithms that scale to hundreds of cores with efficient data sampling and composition, and parallel contour surface algorithms for quick extraction and composition of contour surfaces
software and hardware acceleration for graphics and imaging
performance optimization for TB-scale data field visualization
visual representation methods for earth system models
[Diagram: raw data (TB scale, netCDF/NC) from the HPC system is preprocessed and fed over a high-speed internal bus to computing nodes running a parallel visualization engine library (PVE launcher); graphical nodes turn OpenGL streams (via Chromium) into pixel streams (giga-bps) for a DMX-driven high-resolution renderer and display wall; a data processor and metadata manager serve web remote users and local users through GUI/CLI viewers on graphical workstations.]
MPMD Program Debugging and Analysis:
MPMD parallel program debugging
MPMD parallel program performance measurement and analysis
support for efficient execution of MPMD parallel programs
fault-tolerance technologies for MPMD parallel programs
Components: an MPMD program debugging and analysis environment, runtime support, high availability, and debugging and performance analysis, on the basic hardware/software environment
Technical route:
[Layered diagram: on the hardware (nodes and network), operating system, file system and libraries sit job and resource management (job control, resource management, job scheduling, management middleware) and an abstraction service; a service layer provides parallel debugging (instrumentation, group/interrupt control, system monitor, controller, plug-ins and commands, track analysis), performance analysis (data collection, analysis, representation) and reliability services; a presentation layer offers an IDE integration framework, shell command line, Eclipse client and browser, with job management, debug and performance-analysis plug-ins.]
Debugging and Optimization IDE for Earth System Model Programs
[Diagram: an earth system model abstraction service platform connects the debugging window and the performance analysis window to debugging services, resource management and job scheduling, and performance optimization services for earth system model MPMD programs; it covers debugging monitoring, program event collection, hierarchical scheduling, debugging replay, system failure notification and fault-tolerant scheduling, reliable monitoring of the execution environment, and performance sampling data.]
Expected results:
a plug-in-based expandable development platform
a template-based development supporting environment
a tool library for earth system model development
typical earth system model applications developed using the integrated development environment
Integrated Development Environment (IDE): plug-in integration method
[Diagram: the Eclipse platform (workbench with JFace and SWT, workspace, platform runtime, help, team and debug components), the Java Development Tools (JDT) and the Plug-in Development Environment (PDE), with third-party tools and your own tools plugged in alongside.]
Encapsulation of reusable modules:
radiation module, time integration module, boundary layer module, coupler module, solver module, …
Module units are high-performance and reusable; they are encapsulated following the module encapsulation specification into a model module library.
Thank You !