PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi Kyusyu University, ISIT, IST


Page 2:

Background

• “Peta” is tremendous!
  – Compared with “Giga” or “Tera” scale machines

How are you, Mr. Tera?

I am fine! How about you, Mr. Peta?


• To develop a “Peta-Scale” supercomputer, we need to…
  – Explore the design space of both the computation nodes and the interconnection network!
  – Verify the effective performance to be achieved!
• So, we need a performance evaluation environment for peta-scale supercomputers!

Page 4:

Our Goal!

• Problem…
  – Simulations are 3 orders of magnitude slower than real machines!
  – “Peta-scale” is 3 orders of magnitude larger than “Tera-scale” (i.e., available machines)!
  – How can we bridge the gap?
• Develop an efficient performance evaluation environment: PSI-SIM
  – Separate compute-node simulation from network simulation!
  – Abstract the target application program to accelerate simulation!

Page 5:

Performance-Evaluation Flow of PSI-SIM

[Figure: flow diagram. Labels: Parallelized Application (e.g. Peta-scale), DB for Processors, Performance Info., BSIM-Parser, Skeleton Code, BSIM-Logger, Comm. Profile (w/o Latency), Interconnect Arch., Interconnect Configuration, NSIM, Comm. Profile (w/ Latency), ANA, Visualization, Hints for Optimization, Target machine]

Step 1: Generate a skeleton code
Step 2: Execute on an existing machine
Step 3: Simulate the interconnection network
Step 4: Visualize and analyze the results


Page 7:

What is the Skeleton Code?

Original code

foo( ) {
  Inst. Block A
  for (i=0; i<n; i++) {
    Inst. Block B
    if (hoge) {
      Inst. Block C
    } else {
      Inst. Block D
    }
    Inst. Block E
  }
  MPI_Comm.
  Inst. Block F
  for (j=0; j<n; j++)
    for (k=0; k<n; k++)
      Func( );
}

Skeleton code

foo( ) {
  BSIM_ADD_TIME(10ms)
  MPI_Comm.
  BSIM_ADD_TIME(1ms)
  BSIM_ADD_TIME(15s)
}

• Computation blocks are replaced by “Estimated” execution times!
• Other modifications (e.g. reducing required memory size)
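
The skeleton idea can be sketched in a few lines of Python. This is only an illustration, not the BSIM implementation: `VirtualClock`, `add_time`, and `foo_skeleton` are hypothetical names, and the mapping of each estimate to a particular block of the original code is an assumption read off the slide.

```python
class VirtualClock:
    """Accumulates estimated compute time instead of executing the work."""

    def __init__(self):
        self.now = 0.0  # virtual time in seconds

    def add_time(self, seconds):
        # Plays the role of BSIM_ADD_TIME: only the timer advances.
        self.now += seconds


def foo_skeleton(clock, communicate):
    clock.add_time(0.010)  # assumed: stands in for the code before MPI_Comm. (10 ms)
    communicate()          # the communication call is kept
    clock.add_time(0.001)  # assumed: stands in for Inst. Block F (1 ms)
    clock.add_time(15.0)   # assumed: stands in for the j/k loop nest (15 s)


clock = VirtualClock()
comm_calls = []  # virtual timestamps at which communication happened
foo_skeleton(clock, lambda: comm_calls.append(clock.now))
print(f"virtual time: {clock.now:.3f} s, communications: {len(comm_calls)}")
```

The function returns almost immediately on the host, while the virtual clock reports roughly 15 s of target-machine time.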


Page 9:

Generating Communication Profile

• BSIM-Logger
  – Executes the skeleton code on an existing machine
  – Emulates the behavior of the target machine
  – Generates a communication profile under the assumption of a ZERO-latency ideal network
• Why Fast?
  – Abstracted computation blocks are NOT executed (just update virtual timers)
  – Masks real communications, but generates accurate logs
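
A rough sketch of zero-latency logging with per-rank virtual timers, in Python. The record layout and the names `Rank` and `send_recv` are assumptions made for illustration; the actual BSIM-Logger format is not shown in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Rank:
    rank_id: int
    clock: float = 0.0                      # virtual time in seconds
    log: list = field(default_factory=list)

    def add_time(self, seconds):
        # Abstracted computation: advance the timer, execute nothing.
        self.clock += seconds


def send_recv(sender, receiver, nbytes):
    """Log one message under a zero-latency network; no payload is moved."""
    send_time = sender.clock
    # Zero latency: the message is available the instant it is sent,
    # so the receiver only waits if it reached the receive call first.
    receiver.clock = max(receiver.clock, send_time)
    record = {"src": sender.rank_id, "dst": receiver.rank_id,
              "bytes": nbytes, "send": send_time, "recv": receiver.clock}
    sender.log.append(record)
    return record


r0, r1 = Rank(0), Rank(1)
r0.add_time(0.010)   # rank 0 "computes" for 10 ms
r1.add_time(0.004)   # rank 1 "computes" for 4 ms, then waits for rank 0
rec = send_recv(r0, r1, nbytes=4096)
print(rec)
```

Here rank 1 arrives early and is advanced to rank 0's send time, which is the kind of dependency a later network simulation can stretch by adding latency.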

Page 10:

How Fast? How Accurate?

[Figure: ERI (Electron Repulsion Integral). Two charts, “Time for logging (s)” and “Exe. Time Predicted (s)”, each comparing Original and Skeleton]

[Figure: NAS PARALLEL FT. Two charts, “Time for logging (s)” and “Exe. Time Predicted (s)”, each comparing Original and Skeleton]


Page 12:

Fast, Flexible Interconnection Network Simulator

• NSIM
  – Inputs the communication profile and a network configuration file
  – Generates a communication profile with estimated interconnect latency
• Why Fast? Why Flexible?
  – Parallelized implementation
  – Supports a number of parameters
    • Topology, spec. of routers/switches, buffer size, and so on
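
To make the input/output relationship concrete, here is a toy Python sketch that takes a zero-latency profile plus a configuration and emits a profile with latency. The latency model (hop count on a ring plus size over bandwidth) and every name and parameter below are assumptions chosen for illustration. NSIM's actual model is not described in the slides, and a real simulator would also handle contention, buffering, and the knock-on delay of later events.

```python
def hops_on_ring(src, dst, n_nodes):
    """Shortest hop count between two nodes on a bidirectional ring."""
    d = abs(src - dst)
    return min(d, n_nodes - d)


def add_latency(profile, config):
    """Return a copy of the profile with an estimated receive time per message."""
    timed = []
    for rec in profile:
        hops = hops_on_ring(rec["src"], rec["dst"], config["nodes"])
        latency = (hops * config["hop_latency_s"]
                   + rec["bytes"] / config["bandwidth_Bps"])
        timed.append({**rec, "recv": rec["send"] + latency})
    return timed


# Hypothetical network configuration and a one-message zero-latency profile.
config = {"nodes": 8, "hop_latency_s": 1e-6, "bandwidth_Bps": 1e9}
profile = [{"src": 0, "dst": 6, "bytes": 1000, "send": 0.5, "recv": 0.5}]

timed = add_latency(profile, config)
print(timed[0]["recv"] - timed[0]["send"])  # added latency in seconds
```

Swapping `hops_on_ring` for another topology function, or changing the configuration values, is the kind of parameter exploration the slide refers to.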

Page 13:

Performance of BSIM + NSIM

• Performance prediction for HPL execution on a 16-node PC cluster
• <120s (problem size = 5,000) @8CPU
• About 9,000 MPI-Comm./s @8CPU

[Figure: Execution Time (s), Measured vs. Predicted; Error = 5.3%; not skeleton execution]
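
The slides do not define how the error is computed. One plausible reading, assumed here, is the signed relative difference between predicted and measured time, with accuracy taken as 100% minus its magnitude; this matches the 5.3% error and the 94.7% accuracy quoted elsewhere in the deck. The timing values in the sketch are hypothetical and chosen only to reproduce those percentages.

```python
def signed_error_pct(measured, predicted):
    """Relative prediction error in percent; negative means under-prediction."""
    return (predicted - measured) / measured * 100.0


def accuracy_pct(measured, predicted):
    """Accuracy defined (by assumption) as 100% minus the absolute error."""
    return 100.0 - abs(signed_error_pct(measured, predicted))


measured, predicted = 100.0, 105.3  # hypothetical times in seconds
print(f"error = {signed_error_pct(measured, predicted):.1f}%")
print(f"accuracy = {accuracy_pct(measured, predicted):.1f}%")
```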


Page 15:

ANA GroupWork Viewer

• Group Work
  – Indicates load balance
• Performance Indicator
  – Execution time after load-balance optimization
• Communication Indicator
  – Amount of communications per second

Page 16:

Conclusions

• PSI-SIM
  – Performance evaluation environment for supercomputers
  – BSIM + NSIM + ANA
• Ongoing Work: Performance Prediction for
  – “Tera-Scale” machine (1K CPU Cores) by using a “Giga-scale” machine (e.g. 32 CPU Cores)
  – “Peta-Scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine

Page 17:

Backup Slides

Page 18:

Peta-scale Performance Prediction

• Assumption
  – HPL problem size: 3 Million
  – # of nodes: 4K (PSI-SIMD)
  – BSIM: use 32 CPUs (3GHz Xeon)
  – NSIM: 10,000 MPI-Comm./s @8CPU
• How long do we need to spend?
  – BSIM: about 300h (<2 weeks)
  – NSIM: about ??
• under estimation…

Page 19:

Predicted Execution Time (FT)

Error: -11.6%

Error: -11.3%

Target machine?: rscc
Used machine?: rscc

Page 20:

Communication Profile Time (FT)

86% reduction

19% reduction

Target machine?: rscc
Used machine?: rscc

Page 21:

Predicted Execution Time (ERI)

Error: -0.2%

Error: 1.5%

Error: -0.6%

Target machine?: rscc
Used machine?: rscc

Page 22:

Communication Profile Generation Time (ERI)

91% reduction

96% reduction

97% reduction

Target machine?: rscc
Used machine?: rscc

Page 23:

Prediction Performance for Execution Time

Communication latency

Larger evaluated application ⇒ higher prediction accuracy

Prediction accuracy: 94.7%

Page 24:

Simulation Time (fixed problem size: 2000)

More processes in the evaluated application ⇒ higher parallel processing efficiency

Recent results (speedup)

[Axis unit: minutes]

16 processes

256 processes

1,024 processes

Page 25:

Performance of NSIM

Accuracy: 94.7%

7.92, 8.36, 8.04

114s

Target machine?: PSI-hexa
Used machine?: PSI-hexa