Diverge-Merge Processor (DMP)

Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt
HPS Research Group, University of Texas at Austin
*Microsoft Research


Page 1: Diverge-Merge Processor (DMP)

Diverge-Merge Processor (DMP)

Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt

HPS Research Group, University of Texas at Austin
*Microsoft Research

Page 2: Diverge-Merge Processor (DMP)

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 3: Diverge-Merge Processor (DMP)

Predicated Execution

Convert control flow dependence to data dependence

Source code:

    if (cond) { b = 0; } else { b = 1; }

(normal branch code)

    A:  p1 = (cond)
        branch p1, TARGET
    B:  mov b, 1
        jmp JOIN
    C:  TARGET: mov b, 0
    D:  (join point)

[Figure: control-flow graph. A branches to C when taken (T) and falls through to B when not taken (N); B and C merge at D.]

(predicated code)

    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
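The conversion on this slide can be mimicked in software. The following is a minimal sketch, not taken from the slides: `predicated_mov` is a hypothetical helper that models a predicated `mov` (the instruction always flows through the pipeline, but only updates its destination when its predicate is true), and the initial value of `b` is an arbitrary assumption.

```python
def branch_version(cond: bool) -> int:
    """Normal branch code: only one of the two moves is ever fetched."""
    if cond:
        b = 0          # TARGET: mov b, 0
    else:
        b = 1          # mov b, 1
    return b


def predicated_mov(pred: bool, old_value: int, new_value: int) -> int:
    """Hypothetical model of a predicated mov: keep the old value if pred is false."""
    return new_value if pred else old_value


def predicated_version(cond: bool, b: int = -1) -> int:
    """Predicated code: both moves are executed, the predicate picks the result."""
    p1 = bool(cond)                    # p1 = (cond)
    b = predicated_mov(not p1, b, 1)   # (!p1) mov b, 1
    b = predicated_mov(p1, b, 0)       # (p1)  mov b, 0
    return b
```

Both versions produce the same value of `b` for either outcome of `cond`; the predicated version simply has no control-flow decision left for a branch predictor to get wrong.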

Page 4: Diverge-Merge Processor (DMP)

Benefit of Predicated Execution

Predicated Execution can be high performance and energy-efficient.

[Figure: pipeline diagram with stages Fetch, Decode, Rename, Schedule, Register Read, Execute, shown once for predicated execution and once for branch prediction on blocks A through F. Labels: "Pipeline flush!!" and "nop".]

Page 5: Diverge-Merge Processor (DMP)

Limitations/Problems of Predication

ISA: Predicate registers and predicated instructions
    Dynamic-Hammock Predication [Klauser’98] can solve this problem, but it is only applicable to simple hammocks.

Adaptivity: Static predication is not adaptive to run-time branch behavior.
    Branch behavior changes based on input set, phase, control-flow path.
    Wish Branches [Kim’05]

Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
    Function calls, loops, many instructions inside a region, and complex CFGs
    Hyperblock [Mahlke’92] cannot adapt to frequently-executed paths dynamically.

Page 6: Diverge-Merge Processor (DMP)

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 7: Diverge-Merge Processor (DMP)

Diverge-Merge Processor (DMP)

DMP can dynamically predicate complex branches (in addition to simple hammocks).

The compiler identifies
    Diverge branches
    Control-flow merge (CFM) points

The microarchitecture decides when and what to predicate dynamically.
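The division of labor on this slide can be sketched as a small decision function. This is an illustrative assumption, not the hardware described in the talk: `BranchInfo`, `choose_fetch_mode`, and the mode names are hypothetical, and the low-confidence input stands in for the output of the confidence estimator listed later under hardware support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchInfo:
    """Compiler-provided marking for one static branch (hypothetical layout)."""
    is_diverge: bool            # marked as a diverge branch by the compiler
    cfm_points: tuple = ()      # addresses of its control-flow merge (CFM) points


def choose_fetch_mode(info: BranchInfo, low_confidence: bool) -> str:
    """Decide at run time whether to predicate a fetched branch.

    The compiler only says which branches *may* be predicated; the
    microarchitecture predicates them only when the prediction is
    estimated to be low-confidence.
    """
    if info.is_diverge and info.cfm_points and low_confidence:
        return "dynamic-predication"
    return "branch-prediction"
```

For example, `choose_fetch_mode(BranchInfo(True, (0x40,)), low_confidence=True)` selects dynamic predication, while the same branch with a high-confidence prediction is handled by ordinary branch prediction.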

Page 8: Diverge-Merge Processor (DMP)

Dynamic Predication

Klauser et al. [PACT’98]: Dynamic-hammock predication

    A:  p1 = (cond)
        branch p1, TARGET          (low-confidence)
    B:  mov R1, 1                  renamed: PR10 = 1
        jmp JOIN
    C:  TARGET: mov R1, 0          renamed: PR11 = 0
    H:  JOIN: add R5, R1, 1

    select-µop (φ-node in SSA): PR12 = (cond) ? PR11 : PR10

[Figure: control-flow graph. A branches to C when taken (T) and falls through to B when not taken (N); B and C merge at H.]
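The effect of the select-µop in this example can be checked with a short sketch. It is only a software analogy under the register names shown on the slide (PR10, PR11, PR12); the function names are hypothetical.

```python
def run_with_branch(cond: bool) -> int:
    """Reference behavior: execute only the path the branch actually takes."""
    r1 = 0 if cond else 1        # TARGET: mov R1, 0   /   mov R1, 1
    return r1 + 1                # JOIN: add R5, R1, 1


def run_dynamically_predicated(cond: bool) -> int:
    """Both paths execute into different physical registers, then a
    select-µop picks the value that the join block should see."""
    pr10 = 1                             # block B: mov R1, 1 renamed to PR10
    pr11 = 0                             # block C: mov R1, 0 renamed to PR11
    pr12 = pr11 if cond else pr10        # select-µop: PR12 = (cond) ? PR11 : PR10
    return pr12 + 1                      # JOIN reads R1 through PR12
```

The two functions agree for both outcomes of `cond`, which is the property the select-µop has to guarantee: the join block always sees the register value from the path that was actually correct.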

Page 9: Diverge-Merge Processor (DMP)

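Diverge-Merge Processor

[Figure: control-flow graph with blocks A through H. A ends in the diverge branch and H is the CFM point. Edges are marked as frequently executed or not frequently executed paths. Highlighted blocks: A, C, E, B, H. Annotation: "Insert select-µops".]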

Page 10: Diverge-Merge Processor (DMP)

Diverge-Merge Processor

[Figure: the control-flow graph with blocks A through H and its frequently executed and not frequently executed paths, with a legend for diverge branch, executed block, and CFM point. Labeled blocks: A, H.]

Page 11: Diverge-Merge Processor (DMP)

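Control-Flow Graphs

[Table: CFG types (columns: simple hammock, nested hammock, frequently-hammock, loop, non-merging) against the mechanisms that can handle them (rows: DMP, Dynamic-hammock, SW pred, Wish br., Dual-path). The cell marks are not recoverable from the transcript.]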

Page 12: Diverge-Merge Processor (DMP)

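Dual-path Execution vs. DMP

[Figure: two panels, each starting from a low-confidence branch in block A that leads to blocks B and C, followed by D, E, F, with the CFM point marked.
Dual-path: path 1 = C, D, E, F; path 2 = B, D, E, F.
DMP: path 1 = C, D, E, F; path 2 = B.]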

Page 13: Diverge-Merge Processor (DMP)

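Control-Flow Graphs

[Table: CFG types (columns: simple hammock, nested hammock, frequently-hammock, loop, non-merging) against the mechanisms that can handle them (rows: DMP, Dynamic-hammock, SW pred, Wish br., Dual-path). Two entries read "sometimes"; the remaining cell marks are not recoverable from the transcript.]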

Page 14: Diverge-Merge Processor (DMP)

Distribution of Mispredicted Branches

[Figure: stacked bar chart. Y axis: mispredictions per kilo instructions (MPKI), 0 to 12. X axis: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, amean. Stack categories: simple, nested, frequently, loop, non-merging.]

66% of mispredicted branches can be dynamically predicated in DMP.


Page 16: Diverge-Merge Processor (DMP)

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 17: Diverge-Merge Processor (DMP)

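Fetch Mechanism

[Figure: control-flow graph with blocks A through H. A ends in the diverge branch (low confidence) and H is the CFM point; the predicted path is marked. Highlighted blocks: A, C, E, B, H. Annotation: "Round-robin fetch".]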

Page 18: Diverge-Merge Processor (DMP)

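Dynamic Predication

Forks RAT, RAS, and GHR

Original code and renamed code (blocks A, C, E, B, H):

    A:  branch r0, C          ->  branch pr10, C         p1 = pr10
    C:  add r1 <- r3, #1      ->  add pr21 <- pr13, #1   (p1)
    B:  add r1 <- r2, #-1     ->  add pr31 <- pr12, #-1  (!p1)
    H:  add r4 <- r1, r3      ->  add pr24 <- pr41, pr13

    select-µop: pr41 = p1 ? pr21 : pr31

Register alias tables (Arch. -> Phy., with M bit), kept as two copies, RAT1 and RAT2:

    Before the diverge branch:  R1 -> PR11,          R2 -> PR12, R3 -> PR13
    Path through C:             R1 -> PR21 (M = 1),  R2 -> PR12, R3 -> PR13
    Path through B:             R1 -> PR31 (M = 1),  R2 -> PR12, R3 -> PR13
    After the select-µop:       R1 -> PR41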
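The merge step in this example can be sketched as a comparison of the two forked register alias tables at the CFM point. This is a simplified illustration, not the hardware algorithm: `generate_select_uops` and its arguments are hypothetical names, and it detects modified registers by comparing mappings directly, whereas the slide shows a per-entry M bit for that purpose.

```python
from typing import Callable, Dict, List, Tuple

# (destination, predicate, source if predicate is true, source if false)
SelectUop = Tuple[str, str, str, str]


def generate_select_uops(
    rat_taken: Dict[str, str],
    rat_not_taken: Dict[str, str],
    predicate: str,
    alloc: Callable[[], str],
) -> Tuple[List[SelectUop], Dict[str, str]]:
    """Merge two RAT copies, inserting one select-µop per diverging register."""
    uops: List[SelectUop] = []
    merged: Dict[str, str] = {}
    for arch_reg, taken_phys in rat_taken.items():
        not_taken_phys = rat_not_taken[arch_reg]
        if taken_phys == not_taken_phys:
            merged[arch_reg] = taken_phys          # untouched on both paths
        else:
            dest = alloc()                         # fresh physical register
            uops.append((dest, predicate, taken_phys, not_taken_phys))
            merged[arch_reg] = dest
    return uops, merged


# Usage with the register names from the slide; the one-entry free list is an assumption.
free_list = iter(["PR41"])
uops, merged = generate_select_uops(
    {"R1": "PR21", "R2": "PR12", "R3": "PR13"},    # path through C (p1 true)
    {"R1": "PR31", "R2": "PR12", "R3": "PR13"},    # path through B (p1 false)
    "p1",
    lambda: next(free_list),
)
```

With these inputs only R1 differs between the two tables, so a single select-µop is produced and block H reads R1 through the newly allocated register, matching `pr41 = p1 ? pr21 : pr31` on the slide.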

Page 19: Diverge-Merge Processor (DMP)

DMP Support

ISA Support
    Mark diverge branches/CFM points.

Compiler Support [CGO’07]
    The compiler identifies diverge branches and the corresponding CFM points.

Hardware Support
    Confidence estimator
    Fetch mechanisms
    Load/store processing
    Instruction retirement
    Dynamic predication

Page 20: Diverge-Merge Processor (DMP)

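Hardware Complexity Analysis

[Table: hardware components (rows: Confidence Estimator, Front-End, Rename Support, Predicate Registers, Select-Uop Gen., ST-LD Forwarding, Check Flush/no Flush) against mechanisms (columns: DMP, Dyn. ham., Multipath, Dual-path, SW pred., Wish br.). The cell marks are not recoverable from the transcript.]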

Page 21: Diverge-Merge Processor (DMP)

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 22: Diverge-Merge Processor (DMP)

Simulation Methodology

12 SPEC 2000 INT, 5 SPEC 95 INT
    Different input sets for profiling and evaluation

Alpha ISA execution driven simulator

Baseline processor configuration
    64KB perceptron predictor/O-GEHL (paper)
    Minimum 30-cycle branch misprediction penalty
    8-wide, 512-entry instruction window
    2 KB 12-bit history enhanced JRS confidence estimator

Less aggressive processor (paper)

Power model using Wattch
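For readers unfamiliar with JRS-style confidence estimation, here is a minimal sketch of the basic resetting-counter idea. It does not model the "enhanced" variant named on the slide. The 4-bit counter width, the threshold, and the PC-xor-history index are assumptions chosen so that a 12-bit index gives 4096 counters, which at 4 bits each happens to total 2 KB.

```python
class ResettingCounterConfidence:
    """Basic JRS-style estimator: a table of resetting counters."""

    def __init__(self, index_bits: int = 12, counter_bits: int = 4, threshold: int = 15):
        self.index_mask = (1 << index_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)

    def _index(self, pc: int, history: int) -> int:
        # Assumed hash: xor the branch address with the global history.
        return (pc ^ history) & self.index_mask

    def is_high_confidence(self, pc: int, history: int) -> bool:
        """High confidence only after a long run of correct predictions."""
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc: int, history: int, prediction_correct: bool) -> None:
        """Count up (saturating) on a correct prediction, reset on a misprediction."""
        i = self._index(pc, history)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.counter_max)
        else:
            self.table[i] = 0
```

A branch whose counter is below the threshold is treated as low-confidence, which is the condition under which a diverge branch would be dynamically predicated rather than predicted.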

Page 23: Diverge-Merge Processor (DMP)

Different CFG types

[Figure: bar chart. Y axis: IPC improvement (%), 0 to 60. X axis: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean. Series: simple; simple, nested; simple, nested, frequently; simple, nested, frequently, loop.]

Page 24: Diverge-Merge Processor (DMP)

Performance Improvement

[Figure: bar chart. Y axis: performance improvement (%), 0 to 25. Bars: DMP, dynamic-hammock, dual-path, multipath, limited software predication, wish branches.]

Page 25: Diverge-Merge Processor (DMP)

Energy Consumption

[Figure: bar chart. Y axis: reduction (%), -5 to 10. Bars: DMP, dynamic-hammock, dual-path, multipath, limited software predication, wish branches.]

Page 26: Diverge-Merge Processor (DMP)

Outline

Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion

Page 27: Diverge-Merge Processor (DMP)

Conclusion

DMP introduces the concept of frequently-hammocks and it dynamically predicates complex CFGs.

DMP can overcome the three major limitations of software predication: ISA support, adaptivity, complex CFG.

DMP reduces branch mispredictions energy efficiently
    19% performance improvement, 9% less energy

DMP divides the work between the compiler and the microarchitecture:
    The compiler analyzes the control-flow graphs.
    The microarchitecture decides when and what to predicate dynamically.

Page 28: Diverge-Merge Processor (DMP)

Thank You!!

Page 29: Diverge-Merge Processor (DMP)

Questions?