High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance...

53
1 High-Performance Execution of Multithreaded Workloads on CMPs M. Aater Suleman Advisor: Yale Patt HPS Research Group The University of Texas at Austin

Transcript of High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance...

Page 1: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

1

High-Performance Execution of Multithreaded Workloads on CMPs

M. Aater SulemanAdvisor: Yale Patt

HPS Research GroupThe University of Texas at Austin

Page 2: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

2

How do we use the transistors?

• More transistors Higher performance core– Performance increases without programmer effort – Larger cores are complex and consume more power

• More transistors Bigger cache– Assist the core by reducing memory accesses– Easier to design and consume less power– Pentium M: 50M out of the 77M were cache

• More transistors More cores– Chip Multiprocessors (CMPs)– Less complex – Run at lower frequency (Power α frequency3)

But, do CMPs improveperformance?

Page 3: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

3

Multithreading

To leverage CMPs, applications must be split into threads

Single-Threaded

But, can we do this forall applications?

Page 4: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

4

Easy-to-parallelize Kernels

Kernel from ImageMagick

GrayscaleToMonochrome (picture)

foreach (OldPixel in picture)if( OldPixel > Threshold)

NewPixel = 1

elseNewPixel = 0

Page 5: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

5

Serial Kernels

Smooth(Picture)for i = 1 to N

Pixel[i] = (Pixel[i-1] + Pixel[i])/2

avg

avg

avg

Kernel from ImageMagick

Old pixels:

New pixels:

1 2 3 4

Page 6: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

6

Amdahl’s Law

As the number of cores increase, even a small serial part can have significant impact on overall performance

Future CMPs must improve performanceof both parallel and serial parts

Page 7: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

7

Outline

• Background

• Speeding up serial part– Asymmetric Chip Multiprocessor (ACMP)

• Speeding up parallel part– Feedback-Driven Threading (FDT)

• Summary

Page 8: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

8

Current CMP Architectures

Page 9: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

9

Current CMP Architectures

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

“Niagara” Approach

• Tile many small cores• Sun Niagara Processor• High throughput on the parallel part• Low performance on the serial part

Page 10: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

10

Current CMP Architectures

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

“Niagara” Approach

Page 11: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

11

Current CMP Architectures

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

“Niagara” Approach

• Tile a few large cores• IBM Power 5, AMD Barcelona, Intel Core2Quad• High performance on the serial part• Low throughput on the parallel part

Largecore

Largecore

Largecore

Largecore

“Tile-Large”Approach

Page 12: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

12

Current CMP Architectures

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

“Niagara” Approach

Largecore

Largecore

Largecore

Largecore

“Tile-Large”Approach

Page 13: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

13

The Asymmetric Chip Multiprocessor (ACMP)

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

“Niagara” Approach

• Provide one large core and many small cores• Accelerate serial part using the large core• Execute parallel part on small cores

for high throughput

Largecore

Largecore

Largecore

Largecore

“Tile-Large”Approach

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Largecore

ACMP Approach

Page 14: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

14

The Asymmetric Chip Multiprocessor (ACMP)

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

“Niagara” Approach

Largecore

Largecore

Largecore

Largecore

“Tile-Large”Approach

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Largecore

ACMP Approach

Page 15: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

15

The Asymmetric Chip Multiprocessor (ACMP)

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

“Niagara” Approach

• Analytical experiment details– One large core replaces four small cores– Large core provides 2x performance

Largecore

Largecore

Largecore

Largecore

“Tile-Large”Approach

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Largecore

ACMP Approach

Page 16: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

16

0123456789

0 0.2 0.4 0.6 0.8 1Degree of Parallelism

Spee

dup

vs. 1

Lar

ge C

ore Niagara

Tile-LargeACMP

Performance vs. Parallelism

Both ACMP and Tile-Large outperform Niagara

At high parallelism, Niagara outperforms ACMP

At medium parallelism, ACMP wins

Niagara beats ACMP at 97% parallelism

Page 17: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

17

Throughput of ACMP vs. Niagara

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Largecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Page 18: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

18

ACMP Scheduling

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Niagara-likecore

Largecore

ACMP Approach

Page 19: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

19

Data Transfers in ACMP

• Data is transferred if the serial part requires the data generated by the parallel part or vice-versa

• ACMP– Data is transferred from all small cores

• Niagara/Tile-Large– Data is transferred from all but one core

• Number of data transfers increases by only 3.8%

Page 20: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

20

Experimental Methodology

• Configurations:– Niagara: 16 small cores– Tile-Large: 4 large cores– ACMP: 1 large core, 12 small cores

• Simulated existing multithreaded applications without modification

• Simulation parameters:– x86 cycle accurate processor simulator– Large core: 2GHz, out-of-order, 128-entry window, 4-wide issue, 12-stage pipeline– Small core: 2GHz, in-order, 2-wide, 5-stage pipeline – Private 32 KB L1, private 256KB L2– On-chip interconnect: Bi-directional ring

Page 21: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

21

Performance Results

0

0.2

0.4

0.6

0.8

1

1.2

1.4

mcfis_

nasp

fft_sp

lash

cg_n

asp

ep_n

asp

art_o

mpmg_

nasp

fmm_s

plash

chole

sky

page

conv

erth.2

64 ed

Spee

dup

over

Nia

gara

Tile-LargeACMP

Low

Parallelism

Medium

Parallelism

High

Parallelism

Page 22: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

22

Impact of ACMP on Programmer Effort

• ACMP makes performance less dependent on length of the serial part

• Programmers parallelize the easy-to-parallelize kernels

• Hardware accelerates the difficult-to-parallelize serial part

• Higher performance can be achieved with less effort

Page 23: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

23

Outline

• Background

• Speeding up serial part– Asymmetric Chip Multiprocessor (ACMP)

• Speeding up parallel part– Feedback-Driven Threading (FDT)

• Summary

Page 24: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

24

How Many Threads?

• Some applications:– As many threads as the number of cores

• Other applications:– Performance saturates– Fewer threads than cores

The number of threads must be chosen carefully

Page 25: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

25

Two Important Limitations

• Contention for shared data– Data synchronization: Critical section

• Contention for shared resources– Off-chip bus

Page 26: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

26

Contention for Critical Section

Kernel from PageMine

All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.

a b c d

occu

rrenc

es

Page 27: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

27

Contention for Critical Section

Critical Section: Add local histogram to global histogram

GetPageHistogram(Page *P)

UpdateLocalHistogram(Fraction of Page)

Barrier

Parallel Part

Serial Part

All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.All work and no play makes Jack a dull boy.

a b c d

occu

rrenc

es

Kernel from PageMine

Page 28: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

28

N = 1N = 2

N = 4

N = 8

Contention for Critical Section

Time outside Critical Section (CS)Time inside CS

Page 29: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

29

Two Important Limitations

• Contention for shared data– Data-synchronization: Critical section

• Contention for shared resources– Off-chip bus

Page 30: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

30

Off-Chip Bandwidth

MainMemoryOff-Chip

Bus

Page 31: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

31

Contention for Off-chip Bus

Kernel from ED

EuclideanDistance (Point A)for i = 1 to num_dimensions

sum = sum + A[i] * A[i]

Page 32: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

32

Contention for Off-chip Bus

N = 1N = 2

N = 4

N = 8N=4 and N=8 take

same time to execute

Page 33: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

33

Who Chooses Number of Threads?

• Programmer– No! Not for general-purpose workloads

Large variation in input data and machines• User

– No! I do not want Windows media player to ask me the number of threads

• Set equal to the number of cores– Assumption:

More threads More performance

Goal: A run-time mechanism to estimatethe best number of threads

Page 34: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

34

Outline

• Background

• Speeding up serial part– Asymmetric Chip Multiprocessor (ACMP)

• Speeding up parallel part– Feedback-Driven Threading (FDT)

• Synchronization-Aware Threading (SAT)• Bandwidth-Aware Threading (BAT)• Combining SAT and BAT (SAT+BAT)

• Summary

Page 35: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

35

Feedback-Driven Threading (FDT)

ConventionalMultithreading

Feedback-DrivenThreading

N = K

N = No. of threadsK = No. of cores

Page 36: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

36

Outline

• Background

• Speeding up serial part– Asymmetric Chip Multiprocessor (ACMP)

• Speeding up parallel part– Feedback-Driven Threading (FDT)

• Synchronization-Aware Threading (SAT)• Bandwidth-Aware Threading (BAT)• Combining SAT and BAT (SAT+BAT)

• Summary

Page 37: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

37

Synchronization-Aware Threading (SAT)

N = 1N = 2

N = 4

N = 8

Time outside C.S. NTN = + N x Time inside C.S.

Time outside C.S.Time inside C.S√NCS =

Page 38: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

38

Implementing SAT using FDT

• Train– Measure the time inside and outside the critical

section using cycle counter

• Compute NCS =

• Execute

Time outside C.S.Time inside C.S √

Page 39: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

39

Machine Configuration

• CMP: 32 in-order cores (2-wide, 5-stage deep) • Caches: L1: 8-KB, L2: 64KB. Shared L3: 8MB• Off-chip bus: 64-bit wide, 4x slower than cores • Memory: 200 cycle minimum latency

Page 40: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

40

Results of SATN

orm

. Exe

c. T

ime

Nor

m. E

xec.

Tim

e

PageMine (Data Mining) ISort (NAS)

EP (NAS)GSearch (OSR)

SAT decreases execution time

and saves power

Page 41: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

41

Adaptation of SAT to Input Data

• Time inside and outside the critical section depends on the input to program

• For PageMine, the best number of threads changes with the page size

Page 42: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

42

Outline

• Background

• Speeding up serial part– Asymmetric Chip Multiprocessor (ACMP)

• Speeding up parallel part– Feedback-Driven Threading (FDT)

• Synchronization-Aware Threading (SAT)• Bandwidth-Aware Threading (BAT)• Combining SAT and BAT (SAT+BAT)

• Summary

Page 43: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

43

Bandwidth-Aware Threading (BAT)

N = 1N = 2

N = 4

N = 8N=4 and N=8 take

same time to execute

Total BandwidthBandwidth used by a single threadNBW =

Page 44: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

44

Implementation BAT using FDT

• Train– Measure bandwidth utilization using performance

counters

• Compute NBW =

• Execute

Total BandwidthBandwidth used by a single thread

Page 45: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

45

Results of BATN

orm

. Exe

c. T

ime

Nor

m. E

xec.

Tim

e

ED (Numerical) convert (ImageMagick)

MTwister (nVIDIA)Transpose (nVIDIA)

BAT saves power withoutincreasing execution time

Page 46: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

46

Adaptation of BAT to System Configuration

• The best number of threads is a function of off-chip bandwidth

• BAT correctly predicts the best number of threads for systems with different bandwidth

convert (ImageMagick)

Page 47: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

47

Outline

• Background

• Speeding up serial part– Asymmetric Chip Multiprocessor (ACMP)

• Speeding up parallel part– Feedback-Driven Threading (FDT)

• Synchronization-Aware Threading (SAT)• Bandwidth-Aware Threading (BAT)• Combining SAT and BAT (SAT+BAT)

• Summary

Page 48: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

48

Combining SAT and BAT

• Train– Train for both SAT and BAT

• ComputeNSAT+BAT = MIN (NCS, NBW, Num. cores)

• Execute

Page 49: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

49

Results of SAT+BATFewer threads fewer cache misses

(SAT+BAT) reduces power and execution time

On average, (SAT+BAT) reduces the execution time by 17% and power by 59%

Nor

m. t

o 32

thre

ads

Page 50: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

50

Comparison with Static-Best

Simulate all possible number of threads and choose the best

Two kernels: First needs 12 threads, second needs 32. Static-Best uses 32 for both.

Nor

m. t

o 32

thre

ads

(SAT+BAT) Exec. TimeStatic-Best Exec. Time(SAT+BAT) powerStatic-Best power

Page 51: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

51

Outline

• Background

• Speeding up serial part– Asymmetric Chip Multiprocessor (ACMP)

• Speeding up parallel part– Feedback-Driven Threading (FDT)

• Synchronization-Aware Threading (SAT)• Bandwidth-Aware Threading (BAT)• Combining SAT and BAT (SAT+BAT)

• Summary

Page 52: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

52

Summary

• CMPs have increased the importance of multithreading

• Performance of both serial and parallel parts is important

• Asymmetric Chip Multiprocessor (ACMP) – Accelerates the serial portion using a high-performance core– Provides high throughput on the parallel portion using multiple small cores

• Feedback-Driven Threading (FDT)– Estimates best number of threads at run-time – Adapts to input sets and machine configurations– Does not require programmer/user intervention

Page 53: High-Performance Execution of Multithreaded Workloads on …– Large core provides 2x performance Large core Large core Large core Large core “Tile-Large”Approach Niagara-like

53

• Thank You