Assertion-Based Microarchitecture Design for Improved Fault Tolerance

40
1 NC STATE UNIVERSITY Assertion-Based Microarchitecture Design for Improved Fault Tolerance Vimal K. Reddy Ahmed S. Al-Zawawi, Eric Rotenberg Center for Embedded Systems Research (CESR) Department of Electrical & Computer Engineering North Carolina State University

description

Assertion-Based Microarchitecture Design for Improved Fault Tolerance. Vimal K. Reddy Ahmed S. Al-Zawawi, Eric Rotenberg. Center for Embedded Systems Research (CESR) Department of Electrical & Computer Engineering North Carolina State University. Motivation. Technology scaling - PowerPoint PPT Presentation

Transcript of Assertion-Based Microarchitecture Design for Improved Fault Tolerance

Page 1: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

1

NC STATE UNIVERSITY

Assertion-Based Microarchitecture Design for Improved Fault Tolerance

Vimal K. Reddy

Ahmed S. Al-Zawawi, Eric Rotenberg

Center for Embedded Systems Research (CESR)Department of Electrical & Computer Engineering

North Carolina State University

Page 2: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

2

NC STATE UNIVERSITY

Motivation

• Technology scaling– Smaller, faster transistors– Transistors more susceptible to transient faults

• How to build reliable processors using unreliable transistors?

Page 3: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

3

NC STATE UNIVERSITY

Motivation

Need efficient fault tolerance solutions

Technology Scaling

Per

form

ance

No fault tolerance, erroneous operation

Susceptible to transient faults

Inefficient fault tolerance, correct operation

No fault tolerance, correct operation

Due to high overheads of inefficient fault tolerance

Page 4: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

4

NC STATE UNIVERSITY

Redundant Multithreading (RMT)

• Duplicate a program and compare outcomes to detect transient faults

• Positives– Simple and straightforward– General solution– Complete fault coverage

• Negatives– High overhead

Page 5: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

5

NC STATE UNIVERSITY

RMT using an extra processor

Processor Processor

==

Program

PowerCost

Page 6: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

6

NC STATE UNIVERSITY

RMT using simultaneous multithreading

PROGRAM

Processor

====

Performance

Power

Pipelinestages

Page 7: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

7

NC STATE UNIVERSITY

Alternate solution: Targeted fault checks

• Add regimen of fault checks• Specific to logic block

• Arbitrary latches: Robust latch design• Arbitrary gates: Self-checking logic• FSM: Self-checking FSM designs• ALU: Self-checking ALUs, RESO• Storage and buses: Parity, ECC

• Positives– No overhead of duplicating the program

• Negatives– Not general, i.e., many types of checks needed

Page 8: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

8

NC STATE UNIVERSITY

Our contribution: Microarchitectural Assertions

• Novel class of fault checks

• Key Idea: Confirm µarch. “truths” within processor pipeline

• Catch-all checks for microarchitecture mechanisms

• Positives– Broad coverage (catch-all checks)– Very low-overhead solution (no redundant execution)

Page 9: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

9

NC STATE UNIVERSITYSpectrum of fault checks

(RMT)

ArchitectureLevel

==

Processor

==

Processor

Page 10: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

10

NC STATE UNIVERSITYSpectrum of fault checks

(RMT)

ArchitectureLevel

Processor

==

Processor

parity/ECC

s0

s1

s2 self-checkFSM

==

==

==

Q

QSET

CLR

S

R

robustflip-flop

==

self-checkALU

== ==

==

====

==

==

==

== ==

==

==

==

==

==

==

==

==

====

==

==

==

==

==

==

== ==

==

==

==

==

==

==

==

====

==

Logic CircuitLevel

Page 11: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

11

NC STATE UNIVERSITYSpectrum of fault checks

(RMT)

ArchitectureLevel

==

Processor

Logic CircuitLevel

Processor==

== == ==

==

==

====

==

==

====

==

==

====

==

====

==

====

==

== ==

==

==

==

==

==

==

==

==

==

==

====

==

== ==

==

====

==

==

==

==

==

==

==

==

==

==

==

==

==

====

==

==

==

==

==

==

== ==

==

==

==

==

==

==

==

====

==

Processor

Page 12: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

12

NC STATE UNIVERSITYSpectrum of fault checks

(RMT)

ArchitectureLevel

==

Processor

Logic CircuitLevel

Processor==

== == ==

==

==

====

==

==

====

==

==

====

==

====

==

====

==

== ==

==

==

==

==

==

==

==

==

==

==

====

==

MicroarchitectureLevel

==

==

==

==

==

==

====

==

==

==

==

==

==

== ==

==

==

==

==

==

==

==

====

==

Processor

μarchassert

==

μarchassert

==

Page 13: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

13

NC STATE UNIVERSITY

Examples of Microarchitectural Assertions

• Register Name Authentication (RNA)– Aims to detect faults in renaming unit– Asserts consistencies in renaming unit

• Exploits inherent redundancy in renaming structures• Asserts expected physical register states

• Timestamp-based Assertion Checking (TAC)– Aims to detect faults in issue unit– Asserts sequential order among data dependent

instructions– Uses timestamps

Page 14: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

14

NC STATE UNIVERSITYp1

RenameLogic

p2p3p4

ReorderBuffer

r1r2r3

ArchitectureMapTable

I1 (r1)

Freelist

I1 (r1:p1) old_r1:p5

Program

I2 (r1)

I3 (r1)

I1 (r1) p1

I1 (r1:p1)

I1 (r1:p1) old_r1:p5 p1p5p6p7

p5p6p7

r1r2r3

RenameMapTable

p1I1 retires

p2p3p4

Page 15: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

15

NC STATE UNIVERSITY

RenameLogic

p2p3p4

ReorderBuffer

r1r2r3

ArchitectureMapTable

FreelistProgram

I2 (r1)

I3 (r1)

I2 (r1)

I2 (r1:p2)

I2 (r1:p2) old_r1:p1 p1p5p6p7

p6p7

p2

p2

I2 (r1:p2) old_r1:p1

r1r2r3

RenameMapTable

p5

p1

Page 16: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

16

NC STATE UNIVERSITY

RenameLogic

ReorderBuffer

r1r2r3

ArchitectureMapTable

FreelistProgram

I3 (r1)

p1p5p6p7

p1p6p7

p2

I2 (r1:p2) old_r1:p1Observation

r1’s old == r1’s archmapping mapping

RenameMapTable

r1r2r3

p3p4p5

Page 17: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

17

NC STATE UNIVERSITY

RenameLogic

ReorderBuffer

r1r2r3

ArchitectureMapTable

Freelist

r1r2r3

RenameMapTable

Program

I3 (r1)

p1p5p6p7

p6p7

p2

I2 (r1:p2) old_r1:p1RNA prev mapping check

old_r1 == arch_r1?

p1 == p1?

p6

p1

p3p4p5

Page 18: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

18

NC STATE UNIVERSITY

RenameLogic

ReorderBuffer

r1r2r3

ArchitectureMapTable

FreelistProgram

I3 (r1)

p1p6p7

I2 (r1:p2) old_r1:p1

p3

r1r2r3

RenameMapTable

p1p5p6p7

p6

I3 (r1:p3)

I3 (r1:p3) old_r1:p6 p3

I3 (r1:p3) old_r1:p6

p4p5

Page 19: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

19

NC STATE UNIVERSITY

RenameLogic

ReorderBuffer

r1r2r3

ArchitectureMapTable

FreelistProgram

p1p6p7

I2 (r1:p2) old_r1:p1RNA prev mapping check

old_r1 == arch_r1?

p1 == p1?

r1r2r3

RenameMapTable

p1p5p6p7

p6 p3

I3 (r1:p3) old_r1:p6

p4p5

p2

p4p5

I2 retires

Page 20: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

20

NC STATE UNIVERSITY

RenameLogic

ReorderBuffer

r1r2r3

ArchitectureMapTable

FreelistProgram

p6p7

RNA prev mapping checkold_r1 == arch_r1?

p6 == p2?

r1r2r3

RenameMapTable

p1p5p6p7

p6 p3

I3 (r1:p3) old_r1:p6

p5p1

p2

FAULT DETECTED

p4

At I3 retirement

Page 21: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

21

NC STATE UNIVERSITYp1p2p3

RenameLogic

p4

ReorderBuffer

r1r2r3

ArchitectureMapTable

Freelist

r1r2r3

RenameMapTable

Program

I1 (r1)

I2 (r1)I1 (r1) p1

I1 (r1:p6)

I1 (r1:p6 old_r1:p5)p6p7

p5p6p7

p5 p6

I1 (r1:p6) old_r1:p5

Page 22: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

22

NC STATE UNIVERSITYp2p3

RenameLogic

p4

ReorderBuffer

r1r2r3

ArchitectureMapTable

Freelist

r1r2r3

RenameMapTable

Program

I2 (r1)I2 (r1) p2

I2 (r1:p2)

I2 (r1:p2 old_r1:p6)p6p7

p5p6p7

p5 p6

I1 (r1:p6) old_r1:p5

p2

I2 (r1:p2) old_r1:p6

Page 23: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

23

NC STATE UNIVERSITY

RenameLogic

p3p4

ReorderBuffer

r1r2r3

ArchitectureMapTable

Freelist

r1r2r3

RenameMapTable

Program

p6p7

p5p6p7

p5 p6

I1 (r1:p6) old_r1:p5

p2

I2 (r1:p2) old_r1:p6

p3p4

I1 retires

p6

RNA prev mapping checkold_r1 == arch_r1?

p5 == p5?

Page 24: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

24

NC STATE UNIVERSITY

RenameLogic

p3p4

ReorderBuffer

r1r2r3

ArchitectureMapTable

Freelist

r1r2r3

RenameMapTable

Program

p6p7

p6p7

p5 p6 p2

I2 (r1:p2) old_r1:p6

p3p4

At I2 retirement

p6

RNA prev mapping checkold_r1 == arch_r1?

p6 == p6?FAULT NOT DETECTED

Page 25: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

25

NC STATE UNIVERSITYp1

RenameLogic

p2p3p4

Rdy/FreeBit Array

RegisterFile

Freelist

I2 (r2 )

r1r2r3

RenameMapTable

p1

I1 (r1 )

FU FU FU FU

p1p2p3p4

Rdy Free p1p2p3p4

Issue Queue

issue

writeback

11110

000 0

Free bit unset

I3 (r1 )

ProgramI1 (r1 ) p1

I1(r1:p1 )

I1(r1:p1 )

I1(r1:p1 )

Page 26: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

26

NC STATE UNIVERSITYp2

RenameLogic

p3p4

Rdy/FreeBit Array

RegisterFile

r1r2r3

RenameMapTable

p2

I2 (r2 )

FU FU FU FU

p1p2p3p4

Rdy Free p1p2p3p4

Issue Queue

issue

writeback

1110

000

0

I3 (r1 )

ProgramI2 (r2 ) p2

I2(r2:p2 )

I2(r2:p2 )

I1(r1:p1 ) I2(r2:p2 )

p1

0

Freelist

Page 27: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

27

NC STATE UNIVERSITY

RenameLogic

p3p4

Rdy/FreeBit Array

RegisterFile

Freelist

r1r2r3

RenameMapTable

p1

FU FU FU FU

p1p2p3p4

Rdy Free p1p2p3p4

Issue Queue

issue

writeback

00110

0001

XXX

Ready bit set

Program

I1(r1:p1 ) I2(r2:p2 )

Observation

Ready bit == 0 (not executed)

Free bit == 0 (not in freelist)

1

p2

XXX

I3 (r1 )

Page 28: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

28

NC STATE UNIVERSITY

RenameLogic

p3p4

Rdy/FreeBit Array

RegisterFile

Freelist

r1r2r3

RenameMapTable

p1

FU FU FU FU

p1p2p3p4

Rdy Free p1p2p3p4

Issue Queue

issue

writeback

00110

011

XXX

I3 (r1 )

Program

p2

I3 (r1 ) p3

I3(r1:p2 )

I3(r1:p2 ) p2

I3(r1:p2 )

0

Ready of p2 == 0?

FAULT DETECTED

Free of p2 == 0?

RNA writeback check

XXX

Page 29: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

29

NC STATE UNIVERSITY

RNA summary

• RNA previous mapping check– Detects faults in renaming structures

• Rename map table, Architecture map table• Branch checkpoint tables• Active list (renaming state)

• RNA writeback state check– Detects faults in renaming logic and freelist

Page 30: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

30

NC STATE UNIVERSITY

Source renaming

• Pure source renaming faults undetected– E.g., fault in source renaming logic– Researching solutions similar to RNA

• However, faults causing deadlock are detectable– Faulty source name causing cyclic dependency– Faulty source name points to unpopped freelist entry– Other faults that cause phantom producers

• Use watchdog timer to detect deadlocks

Page 31: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

31

NC STATE UNIVERSITY

Timestamp-based Assertion Checking (TAC)

• Confirm data dependent instructions issued sequentially– Assign timestamps to instructions at issue– At retirement, confirm instruction timestamp greater than

producers’ timestamps

• TAC checkInstruction timestamp >= Producer’s timestamp + latency

• Faults on checking logic can only cause false alarms

Page 32: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

32

NC STATE UNIVERSITY

Scheduler

FU FU

HT

I1 (r1)

I3 (r1, r1)

I4 (r1, r1)

I2 (r3)

I5 (r3, r3)

I6 (r3, r3)I1I2I3I4I5I6

issue

writeback

Dataflow

r1

r2

r3

ReorderBuffer

IssueQueue

retire

I1I2I3I4I5I6

TimestampCounter

1

0 143251 11111 Lat.

TS

234560

TS Lat.

TAC check

Instr TS >= Prod TS + Prod Lat

ArchitecturalState

0 0

00

0 0

Page 33: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

33

NC STATE UNIVERSITY

Scheduler

FU FU

HT

I1 (r1)

I3 (r1, r1)

I4 (r1, r1)

I2 (r3)

I5 (r3, r3)

I6 (r3, r3)I1I2I3I4I5I6

issue

writeback

Dataflow

ReorderBuffer

IssueQueue

HHHHHH retire

25 01

11

41

3111 Lat.

TS

r1

r2

r3

TS Lat.

1 1

0 1

4 1

TAC check

I1: 1 >= 0 + 0I2: 0 >= 0 + 0I3: 4 >= 1 + 1I4: 3 >= 4 + 1

FAULT DETECTED

ArchitecturalState00

0 0

0 0

Page 34: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

34

NC STATE UNIVERSITYExperiments

• Randomly injected faults in timing simulator– 1000 faults per benchmark– Faults target issue and rename state– Simulation ends 1 million cycles following fault injection

• Observations– Fault detected by an assert (Assert) or not (Undet)– Fault corrupts architectural state (SDC) or not (Masked)

• Possible outcomes– Assert+SDC– Undet+SDC– Assert+Masked– Undet+Masked

Page 35: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

35

NC STATE UNIVERSITY

TAC - Fault Injection Experiments

• Type of faults injected– Ready bits prematurely set– Speculatively issued cache-missing load dependents not

reissued

Page 36: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

36

NC STATE UNIVERSITY

TAC - Fault Injection Experiments

80% of faults cause SDC20% of faults aremasked

All SDC faults are detected

Zero undetected SDC faults

0

10

20

30

40

50

60

70

80

90

100

bzip gap gcc gzip parser perl twolf vortex vpr

Undet+SDC

Undet+Masked

Assert+Masked

Assert+SDC% o

f to

tal i

nje

cte

d f

au

lts

SPEC2K Integer BenchmarksAVG

Page 37: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

37

NC STATE UNIVERSITY

RNA - Fault Injection Experiments

• Faults injected– Flip bits of an entry in architecture map table– Flip bits of an entry in rename map table– Flip bits of an entry in freelist– Flip destination register bits at dispatch

• Watchdog timer included to detect deadlocks• Additional possible outcomes

– Assert+Wdog : RNA detected a future deadlock– Undet+Wdog : Deadlock undetected by RNA (possibly

would have been caught by RNA in future)

Page 38: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

38

NC STATE UNIVERSITY

RNA - Fault Injection Experiments

68% SDC faults24% deadlocks8% masked faults

0

10

20

30

40

50

60

70

80

90

100

bzip gap gcc gzip parser perl twolf vortex vpr

% o

f tot

al fa

ults

inje

cted

Undet+SDCUndet+MaskedUndet+WdogAssert+MaskedAssert+WdogAssert+SDC

SPEC2K Integer BenchmarksAVG

88% of SDC faults detected by RNA

12% of SDC faults undetected

75% of deadlocks not detected by RNA (but detected by watchdog)

25% of deadlocks detected by RNA beforehand

Page 39: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

39

NC STATE UNIVERSITY

0

20

40

60

80

100

arch

_map

rena

me_

map

free

list

dest

arch

_map

rena

me_

map

free

list

dest

arch

_map

rena

me_

map

free

list

dest

arch

_map

rena

me_

map

free

list

dest

arch

_map

rena

me_

map

free

list

dest

arch

_map

rena

me_

map

free

list

dest

bzip gap gcc gzip parser perl

fau

lt o

utc

om

e d

istr

ibu

tio

n

Assert+SDC Assert+Wdog Assert+MaskUndet+Wdog Undet+Mask Undet+SDC

AVG

Arc

h_M

apR

en_M

apfr

eelis

tde

st

RNA - Fault outcome distribution

“dest” faults cause deadlocks because of phantom producers

Deadlock blocks retirementThus, RNA check can’t complete

“arch_map” faults cause most undetected SDC

Long live register range

System trap before RNA check can complete

Page 40: Assertion-Based Microarchitecture Design for Improved Fault Tolerance

40

NC STATE UNIVERSITYConclusions

(RMT)

ArchitectureLevel

==

Processor

Logic CircuitLevel

Processor==

== == ==

==

==

====

==

==

====

==

==

====

==

====

==

====

==

== ==

==

==

==

==

==

==

==

==

==

==

====

==

MicroarchitectureLevel

==

==

==

==

==

==

====

==

==

==

==

==

==

== ==

==

==

==

==

==

==

==

====

==

Processor

μarchassert

==

μarchassert

==