FORMAL DIAGNOSIS OF HARDWARE TRANSIENT ERRORS IN PROGRAMS
Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan
THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT
THE UNIVERSITY OF BRITISH COLUMBIA


Contributions

• Software-driven diagnosis of hardware transient errors
  – Diagnosis: “isolate the first affected instruction”
• Program-level analysis
  – Guarantees on the diagnosis
    • Completeness
    • Accuracy


Why Software-Driven Diagnosis?

• No expensive hardware modifications.
• Minimal software instrumentation.
• Diagnose faults which manifest at the program-level only.
• Direct access to the affected device is not required.


Diagnosis Approach

[Diagram: Transient Error → Detector Triggered → Dump File (e.g. failing detector, register file) → Error Diagnosis → Faulty instruction]


Diagnosis Approach

[Diagram: Transient Error → Detector Triggered → Dump File (e.g. failing detector, register file) → Model Checking → Faulty instruction]


Model Checking Using SymPLFIED

• Formal model for analyzing programs [DSN’08]
  – Evaluate the effect of transient hardware errors on programs.
• Symbolic error propagation technique
  – Represent errors using a single symbol (err) to avoid state space explosion.
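The idea of collapsing every possible corrupted value into one symbol can be sketched in a few lines of Python. This is only an illustration, not SymPLFIED's implementation: the `ERR` sentinel and the helpers `sym_mul` and `sym_gt` are hypothetical names introduced here.

```python
# Illustrative sketch of symbolic error propagation (not SymPLFIED's code).
ERR = "err"  # hypothetical sentinel standing for "any possibly corrupted value"

def sym_mul(a, b):
    """Multiply two values; the result is erroneous if either operand is."""
    if a == ERR or b == ERR:
        return ERR
    return a * b

def sym_gt(a, b):
    """Return the list of possible outcomes of the comparison a > b.

    A concrete comparison has exactly one outcome. If an operand is err,
    the comparison could go either way, so both outcomes are reported and
    the caller has to explore both paths.
    """
    if a == ERR or b == ERR:
        return [True, False]
    return [a > b]

print(sym_mul(3, ERR))   # the error propagates to the product
print(sym_gt(ERR, 1))    # both branch outcomes remain possible
```

Tracking one symbol instead of every concrete corrupted value is what keeps the number of explored states manageable: a register is either a known value or `err`.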


Example: Factorial Program

     1        movi  $2, #1
     2        read  $1
     3        mov   $3, $1
     4        movi  $4, #1
     5  loop: setgt $5, $3, $4
     6        beq   $5, #0, exit
     7        mult  $2, $2, $3
     8        subi  $3, $3, #1
     9        assert($3 < $1 + 1)
    10        beq   $0, #0, loop
    11  exit: prints "Factorial = "
    12        print $2

• Line 1: $2 is the result variable
• Line 2: $1 is the user input
• Lines 5-6: loops while $3 > $4
• Line 9: error detector


Example: Error Propagation

(Same factorial program as in the previous example.)

• User input: $1 = 5
• A transient fault, $3 = 13
• Detector is triggered


Dump file:
    Detector triggered
    $1 = 5
    $2 = 13
    $3 = 12
    $4 = 1
    $5 = 1
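The propagation can be replayed with a small Python simulation of the 12-line factorial program. This is a sketch written for this transcript, not the SimpleScalar setup used in the talk; the function name `run_factorial` and the `fault` triple (line, register, value) are assumptions made for illustration.

```python
def run_factorial(user_input, fault=None):
    """Simulate the 12-line factorial program.

    fault is an optional (line, register, value) triple: the first time the
    instruction on `line` has executed, `register` is overwritten with
    `value` (a one-shot, i.e. transient, corruption).
    Returns (status, registers), status being "ok" or "detector triggered".
    """
    regs = {n: 0 for n in range(6)}  # registers $0..$5; $0 is never written
    pending = fault                  # cleared once the fault has been injected
    pc = 1
    while True:
        line, pc = pc, pc + 1
        if line == 1:
            regs[2] = 1                       # movi $2, #1
        elif line == 2:
            regs[1] = user_input              # read $1
        elif line == 3:
            regs[3] = regs[1]                 # mov $3, $1
        elif line == 4:
            regs[4] = 1                       # movi $4, #1
        elif line == 5:
            regs[5] = int(regs[3] > regs[4])  # setgt $5, $3, $4
        elif line == 6:
            if regs[5] == 0:                  # beq $5, #0, exit
                pc = 11
        elif line == 7:
            regs[2] = regs[2] * regs[3]       # mult $2, $2, $3
        elif line == 8:
            regs[3] = regs[3] - 1             # subi $3, $3, #1
        elif line == 9:
            if not regs[3] < regs[1] + 1:     # assert($3 < $1 + 1)
                return "detector triggered", regs
        elif line == 10:
            pc = 5                            # beq $0, #0, loop (always taken)
        elif line == 12:
            return "ok", regs                 # print $2
        # line 11 only prints a string, so it needs no register update
        if pending is not None and pending[0] == line:
            regs[pending[1]] = pending[2]
            pending = None

status, regs = run_factorial(5, fault=(3, 3, 13))
print(status, regs)
```

Calling `run_factorial(5, fault=(3, 3, 13))` reproduces the register values listed in the dump file, while `run_factorial(5)` ends normally with `$2 = 120`.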


Example: Error Diagnosis

(Same factorial program as in the previous example.)

• A transient fault, $3 = err
• Branch on line 6: False → Line 7, True → Exit
• $2 = err
• Detector on line 9: True → Line 10, False → Detector triggered


SymPLFIED’s Solution:
    Instruction 3 Injected
    Detector triggered
    $1 = 5
    $2 = err
    $3 = err
    $4 = 1
    $5 = 1

Dump file:
    Detector triggered
    $1 = 5
    $2 = 13
    $3 = 12
    $4 = 1
    $5 = 1


The crash dump file can be used to identify the faulty instruction.
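One way to read this step: a symbolic solution explains a dump if every register either agrees with the dumped value or is `err`. The sketch below encodes that check in Python. The `matches` helper and the dictionary layout are assumptions made for illustration, and the mismatching candidate is invented only to show a rejection.

```python
ERR = "err"  # hypothetical sentinel for a symbolic erroneous value

def matches(symbolic, dump):
    """True if the concrete dump is consistent with the symbolic state.

    A register holding err is compatible with any dumped value; every other
    register must agree exactly.
    """
    return all(sym == ERR or sym == dump[reg] for reg, sym in symbolic.items())

# Values copied from the slide.
solution = {"$1": 5, "$2": ERR, "$3": ERR, "$4": 1, "$5": 1}
dump = {"$1": 5, "$2": 13, "$3": 12, "$4": 1, "$5": 1}

# Invented candidate that does not explain the dump ($2 is concrete and wrong).
other = {"$1": 5, "$2": 1, "$3": ERR, "$4": 1, "$5": 1}

print(matches(solution, dump))  # solution for the fault at instruction 3
print(matches(other, dump))
```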


Experimental Methodology

[Flowchart: Instructions that trigger a detector → Inject at a random bit in SimpleScalar → Detector triggered? (Y) → Create a dump file → Error diagnosis → More inst? (Y: inject again, N: Done)]

• Enhance SymPLFIED to diagnose errors.
• Modify SimpleScalar simulator to inject faults.
• Evaluate for Matrix Multiply and Insertion Sort.
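The "inject at a random bit" step amounts to a single-bit flip in a value. A minimal Python sketch follows. It does not use or model SimpleScalar's interfaces; the helper names and the 32-bit register width are assumptions.

```python
import random

def flip_bit(value, bit):
    """Return value with the given bit position inverted."""
    return value ^ (1 << bit)

def flip_random_bit(value, width=32, rng=random):
    """Flip one uniformly chosen bit of a width-bit value.

    Returns the corrupted value together with the chosen bit position.
    The 32-bit default width is an assumption, not taken from the slides.
    """
    bit = rng.randrange(width)
    return flip_bit(value, bit), bit

corrupted, bit = flip_random_bit(5, rng=random.Random(0))
print(f"bit {bit} flipped: 5 -> {corrupted}")
```

For instance, `flip_bit(5, 3)` gives 13, the corrupted value of `$3` used in the factorial example.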


Results for Matrix Multiply

    Number of detectors                  1     4     6
    Number of faults injected in SS    167   275   286
    Number of faults detected in SS     74   135   150
    Diagnosed faults (%)               100    77    80
    Undiagnosed faults (%)               0    23    20

(SS = SimpleScalar)


Results for Matrix Multiply (1)

• The proposed technique diagnoses 77%-100% of the detected errors for the matrix multiply program.

• The undiagnosed errors are implementation artifacts of the SymPLFIED tool.


Results for Matrix Multiply (2)

• The number of faults injected in SimpleScalar grows with the number of detectors (167, 275 and 286 faults for 1, 4 and 6 detectors).

• Adding more detectors increases the diagnosis accuracy: going from 4 to 6 detectors raises the diagnosed fraction from 77% to 80%.
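The counts in the matrix multiply table also give the share of injected faults that triggered any detector, which the slides do not state directly. The short Python calculation below derives it; the name `detection_rate` is introduced here for illustration.

```python
# Counts copied from the matrix multiply results table.
detectors = [1, 4, 6]
injected = [167, 275, 286]
detected = [74, 135, 150]

def detection_rate(n_detected, n_injected):
    """Percentage of injected faults that triggered a detector."""
    return round(100.0 * n_detected / n_injected, 1)

rates = [detection_rate(d, i) for d, i in zip(detected, injected)]
for n, rate in zip(detectors, rates):
    print(f"{n} detector(s): {rate}% of injected faults detected")
```

Under these numbers, roughly 44%, 49% and 52% of the injected faults were detected with 1, 4 and 6 detectors.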


Conclusions and Future Work

• Software diagnosis of hardware faults is possible and can be automated using formal techniques.
  – Our diagnosis method is able to diagnose a significant number of errors using a few detectors.
• Future Work
  – Investigate improvements with limited hardware support.
  – Improve scalability using heuristics.
  – Extend to intermittent & permanent faults.


Backup Slides


Related Work

Hardware Fault Diagnosis
• Hardware-Based Techniques
• Probabilistic Techniques
• Formal Methods
• Periodic-Testing Techniques


Results for Insertion Sort

    Number of detectors                  1     4     7
    Number of faults injected in SS     11   165   198
    Number of faults detected in SS      8    64    83
    Diagnosed faults (%)               100    87    89
    Undiagnosed faults (%)               0    13    11

(SS = SimpleScalar)