
A Performance-Correctness Explicitly-Decoupled Architecture

Alok Garg and Michael Huang

Department of Electrical & Computer Engineering

University of Rochester

11/11/2008

"A Performance-Correctness Explicitly-Decoupled Architecture (EDA)", Alok Garg, MICRO 2008


Motivation

Performance optimization in a monolithic micro-architecture is difficult

Conservativeness in design reduces the common case efficacy

Want to explicitly decouple correctness & performance

[Figure: pipeline stages IF, ID, EX, MEM, WB, annotated with Optimization 1 (e.g. branch prediction) and Optimization 2 (e.g. out-of-order execution)]


Explicitly-Decoupled Architecture (EDA)

Design separated into performance and correctness domains

Implementation decoupled as well

Optimistic design of entire system stack

Economic correctness guarantee

Custom software-hardware interface

[Figure: the performance domain (optimistic core, spanning the software, architectural, and device layers) sends hints to the correctness domain (correctness core: simple, throughput-oriented)]


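ILP lookahead using EDA

[Figure: in the performance domain, the optimistic core acts as the lookahead agent and runs a skeleton derived from the program (semantic) binary by static binary transformation; in the correctness domain, the correctness core acts as the throughput engine and runs the program (semantic) binary. A Branch Outcome Queue connects the two cores. Cache labels: L0, L1, L2. Annotations: autonomy, managing deviance, minimal mutual dependence.]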
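The role of the Branch Outcome Queue can be illustrated with a minimal sketch. It assumes that the lookahead core pushes branch outcomes as hints, that the correctness core consumes them in order, and that a disagreement flushes the queue and restarts lookahead just past the offending branch. The class and function names, and the restart policy, are illustrative assumptions rather than the authors' design; state transfer between the cores is not modeled.

```python
from collections import deque


class BranchOutcomeQueue:
    """Bounded FIFO of branch-outcome hints (hypothetical model)."""

    def __init__(self, capacity=512):  # 512 entries, as in the evaluated configuration
        self.capacity = capacity
        self.entries = deque()

    def push(self, taken):
        """Add a hint; return False if the queue is full (lookahead must wait)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(taken)
        return True

    def pop(self):
        """Return the oldest hint, or None if the queue is empty."""
        return self.entries.popleft() if self.entries else None

    def flush(self):
        self.entries.clear()


def count_recoveries(actual, hinted, capacity=512):
    """Count how often the correctness core finds a hint that is wrong.

    actual: branch outcomes as resolved by the correctness core.
    hinted: outcomes the lookahead core would produce for the same branches.
    """
    boq = BranchOutcomeQueue(capacity)
    produced = 0  # index of the next branch the lookahead core will resolve
    recoveries = 0
    for i, outcome in enumerate(actual):
        # The lookahead core runs ahead until the queue fills or the trace ends.
        while produced < len(hinted) and boq.push(hinted[produced]):
            produced += 1
        if boq.pop() != outcome:
            # Assumed recovery: drop stale hints, restart lookahead after branch i.
            recoveries += 1
            boq.flush()
            produced = i + 1
    return recoveries
```

A small queue limits how far the lookahead core can run ahead, while a flush discards every hint produced past the wrong one, which is why infrequent deviance matters for this style of decoupling.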


Outline

Architectural and software support needed

Performance optimization opportunities

Complexity reduction opportunities

Evaluation

Conclusion


Avoiding L2-miss stalls in lookahead

Feed an arbitrary value; the exact value may not matter

Conventional mechanism: planning against contingency

Tagging the entire dependency chain as invalid

State check-pointing and recovery

Types of value substitution

Value predictor

Explicitly flush the dependence chain of the load

Opportunity: simple "0" value substitution

Only used when the optimistic core is not too far ahead

Zero is the most frequently occurring value

[Figure: example containing compare (x>f0)]
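The zero-substitution policy can be sketched as a small decision function. This is only an illustration under stated assumptions: the dictionary stands in for data the optimistic core can reach without an L2 miss, and `lead` and `max_lead_for_substitution` are hypothetical names for how far ahead the optimistic core is and the threshold below which substitution is applied.

```python
def lookahead_load(addr, near_data, lead, max_lead_for_substitution):
    """Return (value, stalled) for a load issued by the lookahead core.

    near_data: dict mapping address -> value for loads that do not miss in L2.
    lead: how far the optimistic core is ahead (hypothetical unit).
    max_lead_for_substitution: hypothetical threshold on lead.
    """
    if addr in near_data:
        # No L2 miss: use the real value.
        return near_data[addr], False
    if lead < max_lead_for_substitution:
        # L2 miss while the core is not far ahead: feed 0 instead of stalling.
        return 0, False
    # L2 miss with ample lead: waiting for the real value is affordable.
    return None, True
```

For example, `lookahead_load(0x40, {}, lead=10, max_lead_for_substitution=100)` returns `(0, False)`: the lookahead core keeps going with a substituted zero. A wrong value here can only degrade the hints, since correctness is guaranteed by the other domain.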


Purging stale data

Sources of stale data

Performance optimizations

Binary optimizations

Potential solutions

Timer-based eviction mechanism

Selective L0 invalidations from skeleton

Choice: do nothing

Simply rely on cache replacement

[Figure: OC and CC above the cache hierarchy, with labels L0, L1, L2]


Complexity reduction

Optimistic core – trade off complexity to improve performance

E.g., Load Store Queue

Correctness core – throughput-oriented design

Accurate branch prediction from OC

No check-pointing and selective pipeline flush required

Cache misses are significantly mitigated

Latency of various operations is less critical


Load-Hit speculation

[Figure: processor pipeline stages (Issue, Reg, Reg, Ex, Ex) following a load (ld), with the load-miss case marked]
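Load-hit speculation in general can be illustrated with a simplified timing model: the scheduler issues a load's dependent on the assumption that the load hits, and replays the dependent if the load misses. The sketch below is a generic textbook-style model, not the pipeline on the slide; all latencies and the `sched_delay` parameter are hypothetical values chosen for illustration.

```python
def dependent_execute_cycle(load_issue_cycle, load_hits, speculate,
                            hit_latency=2, miss_latency=400, sched_delay=2):
    """Return (cycle the dependent executes with valid data, number of replays).

    hit_latency, miss_latency: hypothetical load-to-use latencies in cycles.
    sched_delay: hypothetical cycles between the issue decision and execution.
    """
    if speculate:
        if load_hits:
            # Dependent was issued early and meets the data exactly on time.
            return load_issue_cycle + hit_latency, 0
        # Speculation failed: the dependent is squashed and replayed once
        # the miss data returns.
        return load_issue_cycle + miss_latency, 1
    # No speculation: the dependent is issued only after the hit or miss is
    # known, so the scheduling delay is exposed on every load.
    latency = hit_latency if load_hits else miss_latency
    return load_issue_cycle + latency + sched_delay, 0
```

In this model, removing speculation costs `sched_delay` cycles on every load that hits, in exchange for dropping the replay machinery.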


Outline

Architectural and software support needed

Performance optimization opportunities

Complexity reduction opportunities

Evaluation

Conclusion


Evaluation Environment

Simulation strives to model EDA very faithfully

Value-driven execution for optimistic core

Data values in the caches

Faithful simulation of branches

Scheduling replays

Prefetch modeling fidelity

Stream prefetcher

Power modeling – both switching and leakage

SPEC CPU2000 and SPLASH-2 benchmark suites

System configuration – loosely based on Power4

ROB/Register (INT, FP) – 128/(32, 32)

L0 cache – 16KB, 4-way, 2 cycle

L1 cache – 32KB, 4-way, 2 cycle

L2 cache – 1MB, 8-way, 400 cycle

BOQ – 512 entry

Register copy latency – 32 cycles


Performance gain of optimism

[Figure: two plots with speedup on the vertical axis]


Effect on explicitly parallel programs

[Figure: plot with speedup on the vertical axis]

Exploiting ILP is not guaranteed to be less effective than exploiting thread level parallelism


Energy Implications

Reasons

Skeleton not the entire program

Few wrong path instructions in CC

Smaller cache hierarchy in OC

Reduce energy waste due to idling


Performance impact with reduction in in-flight capacity


Impact of simplifications and conservativeness

Removing load-hit speculation

Making the out-of-order INT issue queue in-order

10% clock frequency reduction


Other details in the paper

Related work discussion

Quantitative comparison with past works

Details on skeleton construction

Eliminating useless branches

Delayed release of prefetches

Understanding sensitivity to performance domain errors

System diagnosis

* More details left in the technical report version


Conclusion

Performance-correctness explicitly-decoupled architecture

Independent focus on performance and correctness goals

Each goal can be achieved more efficiently with less complexity

Demonstrated a concrete design with efficient lookahead

Achieves good performance boosting

Does not consume excessive energy

Better tolerance to conservatism

Future work

Optimization beyond ILP lookahead

Custom design of optimistic and correctness core


Link to technical report: http://www.ece.rochester.edu/~garg/documents/micro08tr.pdf


Related Work

Dynamic verification using DIVA checker [austin99]

Lookahead techniques

Two-pass execution [sundaramoorthy00], [purser00], [zhou05], [barnes03], [mesa-martinez07], [greskamp07]

Helper-threading [dubois98], [annavaram01], [luk01], [zilles01], [chappell99], [collins01], [roth01], [moshovos01], [farcy98]

Enhancing processor’s capability to buffer more in-flight instructions [balasubramonian00], [lebeck02], [torres05], [gandhi05], [akkary03], [sethumadhavan03]

Runahead execution [mutlu03], [dundas97], [ceze04], [kirman05]

Parallelization oriented techniques [zilles02], [balakrishnan06]


Differences from DIVA

[Figure: comparison of DIVA decoupling and explicit decoupling (EDA). Labels: traditional core; DIVA checker & commit; communication of decoded instruction, input and output values; low-bandwidth hints; have to produce correct output; frequent repair; infrequent repair; free to perform risky optimizations]


Comparison with DCE


Sensitivity to performance domain circuit errors


Load-Store queue simplification

[Figure: instructions ordered by age from oldest to youngest (1 st, 2 …, 3 st, 4 …, 5 st, 6 …, 7 ld); at dispatch, ld7 is placed in the load queue and st1, st3, st5 in the store queue; further labels: store-load replay, ld7, st5]

Load queue removed

Store-load replay support not required

Priority logic replaced with simpler forwarding logic
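For context, the age-prioritized search that a conventional store queue performs can be sketched as follows. This shows the conventional behavior that the slide calls priority logic, not the simplified forwarding logic, which the slide does not detail. The tuple layout and function name are assumptions made for illustration.

```python
def forward_from_store_queue(store_queue, load_age, load_addr):
    """Conventional store-to-load forwarding (illustrative sketch).

    store_queue: iterable of (age, addr, value); a smaller age means older.
    Returns the value of the youngest store that is older than the load and
    writes the same address, or None if the load must read the cache.
    """
    best = None
    for age, addr, value in store_queue:
        if age < load_age and addr == load_addr:
            # Priority logic: among all matches, keep the youngest store.
            if best is None or age > best[0]:
                best = (age, value)
    return None if best is None else best[1]
```

With stores st1, st3, st5 and load ld7 as in the figure, `forward_from_store_queue([(1, 0xA0, 11), (3, 0xB0, 22), (5, 0xA0, 33)], 7, 0xA0)` returns `33`, the value from st5, because st5 is the youngest older store to that address.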