Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting...

17
IBM Research © 2014 IBM Corporation Predicting GPU Performance from CPU Runs Using Machine Learning 1 IBM T. J. Watson Research Center Yorktown Heights, NY USA Ioana Baldini Stephen Fink Erik Altman

Transcript of Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting...

Page 1: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation

Predicting GPU Performance from CPU Runs Using Machine Learning

1

IBM T. J. Watson Research Center

Yorktown Heights, NY USA

Ioana Baldini Stephen Fink Erik Altman

Page 2: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 2

Pitfalls: Porting costs time/money

Sometimes GPU speedup is not worth the effort

GPU

To exploit GPGPU acceleration … … need to port code to GPGPU programming model

Problem: Can a non-expert predict whether a data-parallel application is likely to perform well on a GPU, without incurring the costs of porting?

Page 3: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 3

Maybe this is easy?

0

1

2

3

4

5

6

7

8

9

0 1 2 3 4 5 6 7 8 9

Suppose we already have data parallel CPU Version, e.g.

Hypothesis : Since data parallelism is data parallelism, •  SMP speedup predicts GPU speedup

OpenMP speedup

GPU speedup

Page 4: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 4

What is correlation between OpenMP and GPU performance?

ρ = 0.146

Page 5: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 5

Selected results, OpenMP vs. GPU

Similar scaling on CPU/OpenMP GPU results vary based on device/code

Speedup vs. single-thread CPU

Page 6: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation

GPU

6

GPU performance prediction: traditional approaches

Detailed analytic model of algorithm

Detailed model of GPU architecture

Accurate Performance Models +

0 2 4 6 8

10

0 5 10

Relatively simple code structures Reasoning about non-trivial transformations

Expert knowledge Complex mappings Device – Specific Models

Page 7: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 7

Desired Solution: Fully automatic, for non-experts •  No static analysis •  No detailed architectural models •  Automatically tune for new device

Our Approach: Use machine learning to learn from past experiences porting

sequential code to GPUs

Input: •  corpus of past ports from sequential (C) to GPU/parallel code •  features from sequential code

Output: •  model of GPU speedup

Notes: •  Does not explicitly model specific transformations or architectural detail

- Unlikely to produce highly accurate model, but rough estimate may be useful •  Input to model includes human expert knowledge embodied in previous ports

Page 8: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 8

System architecture

Page 9: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 9

Application Features – Dynamic Instrumentation (Pin)

Category Feature Mnemonic Computation Arithmetic and logic

instructions ALU

SIMD-based instructions SIMD

Memory Memory loads LD Memory stores ST Memory fences FENCE

Control Flow Conditional and unconditional branches

BR

OpenMP Speedup of 12 threads over sequential execution

OMP

Aggregates Total number of instructions TOTAL

Ratio of computation over memory

ALU-MEM

Ratio of computation over GPU communication

ALU-COMM

Page 10: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 10

Supervised Learning: [WEKA 3] Binary/Ternary classifiers

Source: [Martin 95]

Nearest Neighbors with Generalized Exemplars (NNGE) Support Vector Machines (SVM)

Page 11: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 11

Methodology

Runs •  48 runs total •  Cross validation:

-  Leave one out -  Leave two out

Benchmarks •  Parboil 2.0 •  Rodinia 2.2

System •  Dual-processor Intel Xeon X5690 •  12 cores total @ 3.47 GHz •  ATI FirePro v9800 •  Nvidia Tesla C2050

Page 12: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 12

Learning whether GPU acceleration is beneficial

Binary classifier: “GPU Speedup > 1”

Classifier Accuracy

SVM NNGE

Tesla ALU, LD, BR, TOTAL ALU, LD, ST, ALU-MEM, BR, TOTAL

FirePro ALU, LD, BR, TOTAL ALU, LD, BR, TOTAL, OMP

Features Selected

Page 13: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 13

Learning the speedup factor of GPU execution (SVM)

5 Binary classifiers: “GPU Speedup > 1” “GPU Speedup > 2” “GPU Speedup > 3” “GPU Speedup > 4” “GPU Speedup > 5”

Classifier Accuracy

Page 14: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 14

Predicting the Best Device for OpenCL runs

Which of 3 devices is best for a particular program? •  3-way classifier using NNGE: “CPU”, “Tesla”, “FirePro”

Correct 39 of 46 runs. Effective accuracy = 91%

Classifer prediction

Nor

mal

ized

Perfo

rman

ce

Page 15: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 15

Summary of Results

•  Classifiers are ~80% effective at deciding GPU speedup class for k = 1, …, 5

•  Small set of features (ALU, LD, BR) are sufficient

•  Predictions are effective for heterogeneous scheduling problem

Bottom Line

•  Is 80% classifier accuracy good enough? Useful?

•  Machine Learning doesn’t illuminate cause-and-effect

•  Early days – results are promising, but much more exploration needed

Page 16: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation 16

Backup

Page 17: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction

IBM Research

© 2014 IBM Corporation

Related Work

–  Jia et al. [14] used regression techniques to predict the results of simulations of GPU architectures.

– Grewe et al. [7], [8] use machine learning to help split code between cores of a CPU and a GPU.

– GROPHECY [23] predict GPU performance for loops. •  Uses a detailed analytic model with pattern-matching static analysis.

–  Kremlin [6] provides a methodology to discover opportunities for parallelization in sequential code •  Primarily by deriving upper bounds on potential parallelism from a dynamic critical path

measurement. –  Parallel Prophet [16] estimates parallel performance in the presence of memory contention.

•  Based on instrumented, annotated serial programs. – Hong et al. [11] model GPU performance from GPU memory access patterns and

computational density per memory request •  It requires a detailed model of the code to run on the device.