Slides: blogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf
FORMAL DIAGNOSIS OF HARDWARE TRANSIENT ERRORS IN PROGRAMS
Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan
THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT
THE UNIVERSITY OF BRITISH COLUMBIA
Contributions
• Software-driven diagnosis of hardware transient errors
– Diagnosis: “isolate the first affected instruction”
• Program-level analysis
– Guarantees on the diagnosis: completeness and accuracy
Why Software-Driven Diagnosis?
• No expensive hardware modifications.
• Minimal software instrumentation.
• Diagnoses only those faults that manifest at the program level.
• Direct access to the affected device is not required.
Diagnosis Approach
[Flow diagram. Stages: Detector Triggered → Dump File (e.g. failing detector, register file) → Error Diagnosis, which is carried out by Model Checking. Other labels: Transient Error, Faulty inst.]
Model Checking Using SymPLFIED
• Formal model for analyzing programs [DSN’08]
– Evaluate the effect of transient hardware errors on programs.
• Symbolic error propagation technique
– Represent errors using a single symbol (err) to avoid state space explosion.
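The single-symbol idea can be illustrated with a minimal Python sketch. It is not SymPLFIED's implementation; the names `ERR`, `propagate` and `possible_outcomes` are hypothetical and chosen only for this illustration. Arithmetic on `err` yields `err`, and a comparison involving `err` is treated as possibly true and possibly false, so no concrete corrupted values have to be enumerated.

```python
import operator

class _Err:
    """Marker object standing for any unknown, corrupted value."""
    def __repr__(self):
        return "err"

ERR = _Err()  # the one symbol used for every erroneous value

def propagate(op, a, b):
    """Arithmetic: any operand that is err makes the result err."""
    if a is ERR or b is ERR:
        return ERR
    return op(a, b)

def possible_outcomes(op, a, b):
    """Comparison: with an err operand both outcomes are possible."""
    if a is ERR or b is ERR:
        return [True, False]
    return [op(a, b)]

print(propagate(operator.mul, 1, ERR))         # err taints the product
print(possible_outcomes(operator.gt, ERR, 1))  # both branches must be explored
print(possible_outcomes(operator.gt, 5, 1))    # concrete comparison, one outcome
```

Collapsing every corrupted value into one symbol keeps the number of explored states small, at the price of losing the exact value, which is why comparisons have to fork.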
Example: Factorial Program
1 movi $2, #1
2 read $1
3 mov $3, $1
4 movi $4, #1
5 loop: setgt $5, $3, $4
6 beq $5, #0, exit
7 mult $2, $2, $3
8 subi $3, $3, #1
9 assert($3 < $1 + 1)
10 beq $0, #0, loop
11 exit: prints "Factorial = "
12 print $2
Annotations:
• $2: result variable
• $1: user input
• Lines 5-6: loops while $3 > $4
• Line 9: error detector
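Example: Error Propagation
• User input: $1 = 5
• A transient fault, $3 = 13
• Line 7 then computes $2 = 1 × 13 = 13 and line 8 leaves $3 = 12, so the assertion on line 9 (12 < 5 + 1) fails
• Detector is triggered
Dump file:
Detector triggered
$1 = 5
$2 = 13
$3 = 12
$4 = 1
$5 = 1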
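The propagation above can be reproduced with a small Python sketch that executes the twelve-line factorial program directly. The fault model is an assumption made for illustration: `fault=(line, register, value)` overwrites a register right after the given line executes for the first time. The names `run` and `DetectorTriggered` are hypothetical.

```python
class DetectorTriggered(Exception):
    """Raised when the assertion on line 9 fails; carries the register dump."""
    def __init__(self, line, regs):
        super().__init__(f"detector triggered at line {line}")
        self.line = line
        self.dump = {r: v for r, v in regs.items() if r != 0}

def run(user_input, fault=None, max_steps=10_000):
    """Execute the factorial program; pc is the slide's line number."""
    regs = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
    pc, injected = 1, False
    for _ in range(max_steps):
        next_pc = pc + 1
        if pc == 1:
            regs[2] = 1                        # movi $2, #1
        elif pc == 2:
            regs[1] = user_input               # read $1
        elif pc == 3:
            regs[3] = regs[1]                  # mov $3, $1
        elif pc == 4:
            regs[4] = 1                        # movi $4, #1
        elif pc == 5:
            regs[5] = int(regs[3] > regs[4])   # setgt $5, $3, $4
        elif pc == 6:
            if regs[5] == 0:                   # beq $5, #0, exit
                next_pc = 11
        elif pc == 7:
            regs[2] = regs[2] * regs[3]        # mult $2, $2, $3
        elif pc == 8:
            regs[3] = regs[3] - 1              # subi $3, $3, #1
        elif pc == 9:
            if not regs[3] < regs[1] + 1:      # assert($3 < $1 + 1)
                raise DetectorTriggered(pc, regs)
        elif pc == 10:
            next_pc = 5                        # beq $0, #0, loop (always taken)
        elif pc == 12:
            return regs[2]                     # print $2
        # Assumed fault model: corrupt one register after its first write.
        if fault is not None and not injected and fault[0] == pc:
            regs[fault[1]] = fault[2]
            injected = True
        pc = next_pc
    raise RuntimeError("step limit exceeded")

print(run(5))  # fault-free run
try:
    run(5, fault=(3, 3, 13))
except DetectorTriggered as exc:
    print(exc.line, exc.dump)
```

With input 5 and `$3` corrupted to 13 after line 3, the run stops at the assertion on line 9 with the same register values as the dump file on the slide.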
Example: Error Diagnosis
• A transient fault, $3 = err
• Branch on line 6: False → Line 7, True → Exit
• After line 7: $2 = err
• Assertion on line 9: True → Line 10, False → Detector triggered
SymPLFIED’s solution:
Instruction 3 injected
Detector triggered
$1 = 5
$2 = err
$3 = err
$4 = 1
$5 = 1
Dump file:
Detector triggered
$1 = 5
$2 = 13
$3 = 12
$4 = 1
$5 = 1
The crash dump file can be used to identify the faulty instruction.
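How a symbolic solution can be compared with a dump file is sketched below in Python. This is a simplified illustration, not SymPLFIED itself: it forks on every comparison that involves `err` but does not track constraints on `err`, and the names `detector_states` and `matches` as well as the step bound are assumptions. The injection point (instruction 3, register `$3`) is taken from the slide.

```python
ERR = "err"  # single symbol standing for any corrupted value

def arith(op, a, b):
    """Arithmetic on err yields err; otherwise compute concretely."""
    return ERR if ERR in (a, b) else op(a, b)

def outcomes(test, *operands):
    """A comparison involving err may go either way, so return both."""
    return [True, False] if ERR in operands else [test(*operands)]

def detector_states(user_input, inject_line, dest, max_steps=40):
    """Symbolically run the factorial program with register `dest` set to
    err right after `inject_line` first executes. Returns the register
    states in which the assertion on line 9 fails."""
    found = []
    work = [(1, {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, False, 0)]
    while work:
        pc, regs, injected, steps = work.pop()
        if steps > max_steps or pc >= 11:
            continue  # path left the loop or hit the exploration bound
        nxt = [(pc + 1, dict(regs))]  # default: fall through to next line
        r = nxt[0][1]
        if pc == 1:
            r[2] = 1
        elif pc == 2:
            r[1] = user_input
        elif pc == 3:
            r[3] = regs[1]
        elif pc == 4:
            r[4] = 1
        elif pc == 5:
            gt = outcomes(lambda a, b: a > b, regs[3], regs[4])
            nxt = [(6, {**regs, 5: int(t)}) for t in gt]
        elif pc == 6:
            zero = outcomes(lambda a: a == 0, regs[5])
            nxt = [(11 if t else 7, dict(regs)) for t in zero]
        elif pc == 7:
            r[2] = arith(lambda a, b: a * b, regs[2], regs[3])
        elif pc == 8:
            r[3] = arith(lambda a, b: a - b, regs[3], 1)
        elif pc == 9:
            nxt = []
            for ok in outcomes(lambda a, b: a < b + 1, regs[3], regs[1]):
                if ok:
                    nxt.append((10, dict(regs)))
                else:
                    found.append(dict(regs))  # detector triggered
        elif pc == 10:
            nxt = [(5, dict(regs))]
        for npc, nregs in nxt:
            inj = injected
            if not injected and pc == inject_line:
                nregs[dest] = ERR
                inj = True
            work.append((npc, nregs, inj, steps + 1))
    return found

def matches(state, dump):
    """Consistent if every register is err or equals the dumped value."""
    return all(v == ERR or v == dump[r] for r, v in state.items())

dump = {1: 5, 2: 13, 3: 12, 4: 1, 5: 1}
states = detector_states(user_input=5, inject_line=3, dest=3)
print(states[0])
print(any(matches(s, dump) for s in states))
```

Because this sketch ignores constraints on `err`, the consistency check is necessary but not sufficient: it may accept injection points that a full model checker would rule out.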
Experimental Methodology
• Enhance SymPLFIED to diagnose errors.
• Modify SimpleScalar simulator to inject faults.
• Evaluate for Matrix Multiply and Insertion Sort.
[Flowchart: take the instructions that trigger a detector → inject at a random bit in SimpleScalar → Detector triggered? If yes, create a dump file and run error diagnosis → More inst? If yes, repeat; if no, done.]
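The injection-and-diagnosis loop of this methodology can be sketched in Python as follows. Everything here is hypothetical scaffolding: `run_with_fault` stands in for a fault-injecting simulator run (which could use `flip_random_bit` to corrupt a value) and `diagnose` stands in for the diagnosis step; neither reflects the actual SimpleScalar or SymPLFIED interfaces.

```python
import random

def flip_random_bit(value, width=32, rng=random):
    """Invert one uniformly chosen bit of `value` (single-bit transient fault)."""
    bit = rng.randrange(width)
    return value ^ (1 << bit)

def campaign(instructions, run_with_fault, diagnose):
    """For each candidate instruction: inject a fault, and if a detector
    fires, hand the resulting dump to the diagnosis step.

    run_with_fault(inst) returns a dump, or None when no detector fires.
    diagnose(dump) returns the instruction it blames, or None.
    """
    injected = detected = diagnosed = 0
    for inst in instructions:
        injected += 1
        dump = run_with_fault(inst)
        if dump is None:
            continue              # fault was not detected; nothing to diagnose
        detected += 1
        if diagnose(dump) == inst:
            diagnosed += 1        # diagnosis isolated the injected instruction
    return injected, detected, diagnosed

# Toy stand-ins, for illustration only.
toy_run = lambda inst: {"inst": inst} if inst % 2 == 0 else None
toy_diagnose = lambda dump: dump["inst"] if dump["inst"] != 4 else None
print(campaign([1, 2, 3, 4], toy_run, toy_diagnose))
```

The three counters correspond to the rows reported in the results tables: faults injected, faults detected, and detected faults that were diagnosed.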
Results for Matrix Multiply

| Number of detectors | 1 | 4 | 6 |
|---|---|---|---|
| Number of faults injected in SimpleScalar (SS) | 167 | 275 | 286 |
| Number of faults detected in SS | 74 | 135 | 150 |
| Diagnosed faults (%) | 100 | 77 | 80 |
| Undiagnosed faults (%) | 0 | 23 | 20 |
• The proposed technique diagnoses 77%-100% of the detected errors for the matrix multiply program.
• The undiagnosed errors are implementation artifacts of the SymPLFIED tool.
• The number of faults injected in SimpleScalar grows with the number of detectors (167, 275 and 286 faults for 1, 4 and 6 detectors).
• Adding more detectors increases the diagnosis accuracy (77% with 4 detectors, 80% with 6).
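Two derived quantities make the matrix multiply numbers easier to compare: the fraction of injected faults that were detected, and the approximate number of diagnosed faults. Neither appears on the slides. The short Python sketch below computes them from the table; the function name `summarize` is hypothetical, and because the slide percentages are rounded, the diagnosed counts are approximate.

```python
def summarize(injected, detected, diagnosed_pct):
    """Return (detected share of injected faults in %, approx. diagnosed count)."""
    detection_pct = round(100 * detected / injected, 1)
    diagnosed_count = round(diagnosed_pct * detected / 100)
    return detection_pct, diagnosed_count

# detectors -> (faults injected, faults detected, diagnosed %), from the table
matrix_multiply = {
    1: (167, 74, 100),
    4: (275, 135, 77),
    6: (286, 150, 80),
}

for detectors, row in matrix_multiply.items():
    detection_pct, diagnosed_count = summarize(*row)
    print(f"{detectors} detector(s): {detection_pct}% detected, "
          f"about {diagnosed_count} diagnosed")
```

Under these numbers the detected share rises from about 44% with one detector to about 52% with six.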
Conclusions and Future Work
• Software diagnosis of hardware faults is possible and can be automated using formal techniques.
– Our diagnosis method diagnoses a significant number of errors using only a few detectors.
• Future Work
– Investigate improvements with limited hardware support.
– Improve scalability using heuristics.
– Extend to intermittent & permanent faults.
Backup Slides
Related Work
Hardware Fault Diagnosis
• Hardware-Based Techniques
• Probabilistic Techniques
• Formal Methods
• Periodic-Testing Techniques
Results for Insertion Sort

| Number of detectors | 1 | 4 | 7 |
|---|---|---|---|
| Number of faults injected in SimpleScalar (SS) | 11 | 165 | 198 |
| Number of faults detected in SS | 8 | 64 | 83 |
| Diagnosed faults (%) | 100 | 87 | 89 |
| Undiagnosed faults (%) | 0 | 13 | 11 |