IFRA
Instruction Footprint Recording & Analysis
for Post-Silicon Bug Localization
Sung-Boem Park
Subhasish Mitra
Robust Systems Group
Departments of Electrical Eng. & Computer Sc.
Stanford University
Key Message
Post-silicon bug localization – Major bottleneck
Pinpoint from system failure
Bug location, exposing stimulus
Existing schemes – Expensive & not scalable
IFRA – New technique for processors
Eliminates limitations of existing techniques
96% accuracy
1% area, ~0% performance impact
Outline
Motivation
IFRA Overview
Simulation Results
Conclusion
Microprocessor Development Flow
“Post-silicon cost & complexity is rising faster than design cost”
S. Yerramilli, VP, Intel, ITC06 Invited Address
Pre-Silicon
Post-Silicon
Pre-Silicon Verification
Design
Manufacturing Test
POST-SILICON VALIDATION
Post-Silicon Validation Costs: 35% of Development Time, 25% of Design Resources
Post-Silicon Validation Steps
Detect – Run test content in system
e.g., OS, games, functional tests
Localize – Pinpoint from system failure (e.g., crash)
Bug location – e.g., ALU, decoder, scheduler
Exposing stimulus – e.g., instruction sequence
Dominates cost [Josephson DAC06]
Root cause & Fix
Optical probing, patch / circuit edit / respin
Post-Silicon Bug Types [Josephson DAC06]
Functional bugs – Incorrect logic implementation
e.g., design errors
Short localization time – e.g., hours to days
Electrical bugs / circuit marginalities
e.g., speed-path, noise, races, hold time
Some voltage / temp / frequency corners
LONG localization time – e.g., days to weeks
Our focus
Existing Post-Silicon Bug Localization Flows
Tester-based
Detect in system
Reproduce failure on tester: 2 days
Localize on tester: 3 days
Not always possible
System-based
Detect in system
Localize failure in system: 1 to 4 weeks
Major problems: failure reproduction, system-level simulation
IFRA vs. Existing Techniques
Techniques: Trace buffers, Clock manipulation, Checkpoint + replay, Scan techniques, IFRA
Intrusive? ? Yes No
Failure reproduction? Yes No
System-level simulation? Yes No
Area impact? Yes No 1%
Instruction Footprint Recording & Analysis
Design phase
Insert recorders inside chip design
Post-Si validation
Record special info. in recorders / Run tests
Failure detected? Yes / No
Scan out recorder contents
Post-analyze offline
Localized bug: (location, stimulus)
Non-intrusive
No failure reproduction – Single test run sufficient
No system simulation – Self-consistency against test program binary
Outline
Motivation
IFRA Overview
Hardware Support
Automated Post-Analysis Techniques
Simulation Results
Conclusion
IFRA Hardware in Superscalar Processor
[Figure: Alpha 21264 superscalar pipeline with IFRA hardware added]
Pipeline stages: FETCH, DECODE, DISPATCH, ISSUE, EXECUTE, COMMIT, separated by pipeline registers
Blocks shown: Branch Predictor, I-Cache, I-TLB, Fetch Queue; Decoders; Reg Rename, Phys Regfile, Reg Map, Reg Free; Instruction Window; 2x Br, 2x ALU, MUL, 2x LSU, D-Cache, D-TLB, FPU; Reorder Buffer, Reg Map
IFRA additions: ID assignment, Recorders, Post-Trigger Generator
Recorders are part of the scan chain
Slow wire, no at-speed routing
Recording Operation Example
[Figure: INST1 and INST2 moving through FETCH (Branch Predictor, I-Cache, I-TLB, Fetch Queue) and DECODE (Decoder); each instruction carries its ID (ID1, ID2) through the pipeline registers]
ID Assignment
Recorder 1 (FETCH): ID1 + Auxiliary Info: PC1; ID2 + Auxiliary Info: PC2
Recorder 2 (DECODE): ID1 + Auxiliary Info: Decoded bits1; ID2 + Auxiliary Info: Decoded bits2
Instruction Footprints
Special ID assignment rule
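The recording operation on this slide can be sketched in a few lines of Python. This is only an illustration, not the IFRA hardware: the `Recorder` class, the `run_example` helper, and the plain wrapping counter used to hand out IDs are assumptions made for the sketch (the actual special ID assignment rule is the one defined in the paper).

```python
from dataclasses import dataclass, field


@dataclass
class Recorder:
    """Hypothetical software stand-in for one per-stage recorder."""
    stage: str
    footprints: list = field(default_factory=list)

    def record(self, inst_id, aux):
        # A footprint is simply (instruction ID, auxiliary info).
        self.footprints.append((inst_id, aux))


def run_example(program, id_space=256):
    """Push (pc, decoded_bits) pairs through FETCH and DECODE."""
    fetch_rec = Recorder("FETCH")
    decode_rec = Recorder("DECODE")
    next_id = 0
    for pc, decoded_bits in program:
        inst_id = next_id                      # ID assigned once, at fetch
        next_id = (next_id + 1) % id_space     # placeholder rule (assumption)
        fetch_rec.record(inst_id, pc)              # Recorder 1: ID + PC
        decode_rec.record(inst_id, decoded_bits)   # Recorder 2: ID + decoded bits
    return fetch_rec, decode_rec


fetch_rec, decode_rec = run_example([(0x1000, 0b0101), (0x1004, 0b0011)])
print(fetch_rec.footprints)   # [(0, 4096), (1, 4100)]
print(decode_rec.footprints)  # [(0, 5), (1, 3)]
```

The same ID appears in both recorders, which is what later lets the post-analysis tie the two footprints back to one instruction.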
Special Rule for Instruction ID Assignment
Simplistic ID assignment inadequate
Speculation + flushes, out-of-order execution
PC does not work for loops
Special ID assignment rule – formal proof in paper
ID width: log2(4n) bits
n = max. instructions in flight
e.g., 8 bits for Alpha-like processor (n=64)
No timestamp or global synchronization required
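The ID width on this slide can be checked with a short calculation. The helper name `id_width_bits` is hypothetical, and rounding up for values of n that are not powers of two is an assumption of this sketch.

```python
def id_width_bits(n_in_flight):
    """Bits needed to name 4n distinct IDs, i.e. ceil(log2(4n))."""
    id_space = 4 * n_in_flight
    # (x - 1).bit_length() is an exact integer ceil(log2(x)) for x >= 1.
    return (id_space - 1).bit_length()


print(id_width_bits(64))  # 8
```

With n = 64 the ID space is 4 x 64 = 256 values, so log2(256) = 8 bits, matching the Alpha-like example above.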
Instruction Footprint Recorder Design
Dominated by memory
Simple control logic
Idle cycle compaction
Circular buffer control
Serialization
Stop / Start recording
No high-speed global routing
Contents scanned out after failure detection
[Figure: recorder block diagram. Inputs: Instruction ID + Auxiliary info., Post-trigger signal. Blocks: Control Logic, Circular Buffer. Output: to slow scan chain]
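A behavioural sketch of such a recorder, assuming a software model rather than the actual RTL: the class name `FootprintRecorder`, its `clock` and `scan_out` methods, and the buffer depth are all hypothetical. It models the circular buffer, idle cycle compaction, and stopping on a post-trigger; serialization is not modelled.

```python
from collections import deque


class FootprintRecorder:
    """Hypothetical behavioural model of one recorder."""

    def __init__(self, depth):
        # A bounded deque drops the oldest entry when full,
        # which mimics a circular buffer being overwritten.
        self.buf = deque(maxlen=depth)
        self.recording = True

    def clock(self, entry=None, post_trigger=False):
        if post_trigger:
            self.recording = False   # freeze the recorded history
        if not self.recording or entry is None:
            return                   # idle cycles consume no storage
        self.buf.append(entry)

    def scan_out(self):
        # Contents are read out only after a failure is detected.
        return list(self.buf)


rec = FootprintRecorder(depth=3)
for entry in [(1, "a"), None, (2, "b"), (3, "c"), None, (4, "d")]:
    rec.clock(entry)
rec.clock(post_trigger=True)   # stop recording
rec.clock((5, "e"))            # ignored: recording is stopped
print(rec.scan_out())          # [(2, 'b'), (3, 'c'), (4, 'd')]
```

Because idle cycles are skipped, the three-entry buffer holds the last three footprints rather than the last three cycles.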
What to Record?
Pipeline stage | Auxiliary information | Bits per recorder | Number of recorders
Fetch | PC | 32 | 4
Decode | Decoding results | 4 | 4
Dispatch | 2-bit residue of reg. name | 6 | 4
Issue | 3-bit residue of operands | 6 | 4
Execution (ALU, MUL) | 3-bit residue of result | 3 | 4
Execution (Branch) | None | 0 | 2
Execution (Load/Store unit) | 3-bit residue of result, 32-bit memory address | 35 | 2
Commit | Exceptions | ~0 | 4
Total required storage for all recorders: 60 KBytes
Post-Trigger Generation
[Figure: code execution timeline starting at t=0]
Error after a billion cycles (e.g., speedpath)
Failure after 2 billion cycles (e.g., crash)
Too much storage overhead to store 1 billion cycles
Need to capture in recorder storage
Early failure detection techniques (post-triggers)
Classical error detection – residue, parity
Deadlock & segfault detection
Special early warnings to pause recording
Details in paper
Early failure detection necessary
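Two of the post-trigger ideas listed above, a parity check and deadlock detection, can be illustrated as follows. This is a sketch under stated assumptions: the `parity` helper, the `DeadlockWatchdog` class, and the cycle limit of 3 are hypothetical, and the actual post-trigger conditions are the ones described in the paper.

```python
def parity(word):
    """Even/odd parity bit of an integer word."""
    return bin(word).count("1") & 1


class DeadlockWatchdog:
    """Fires when no instruction commits for `limit` consecutive cycles."""

    def __init__(self, limit):
        self.limit = limit   # hypothetical threshold
        self.idle = 0

    def clock(self, committed):
        self.idle = 0 if committed else self.idle + 1
        return self.idle >= self.limit


# Parity post-trigger: a single bit-flip changes the parity.
stored = 0b1011
stored_parity = parity(stored)
corrupted = stored ^ 0b0100
parity_trigger = parity(corrupted) != stored_parity
print(parity_trigger)  # True

# Deadlock post-trigger: fires on the third cycle without a commit.
dog = DeadlockWatchdog(limit=3)
fired = [dog.clock(c) for c in (True, False, False, False)]
print(fired)  # [False, False, False, True]
```

Either signal would serve as the post-trigger that stops the recorders while the erroneous instructions are still inside the circular buffers.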
IFRA Area Impact
1% chip-level area impact
Synopsys Design Compiler synthesis
Alpha 21264-like processor: 2MB L2 cache
TSMC 130nm technology
No global at-speed routing
Area dominated by circular buffers in recorders
Total recorder storage: 60 KBytes
Outline
Motivation
IFRA Overview
Hardware Support
Post-Analysis Techniques
Simulation Results
Conclusion
Post-Analysis Overview
Link footprints
Test program binary
Footprints from recorders
Run high-level analysis
Run low-level analysis
List of bug location-stimulus pairs
Control-flow analysis
Data-dependency analysis
Decoding analysis
Load/Store analysis
Residue consistency check
(Not covered today – Details in paper)
21
Linking Footprints from Recorder ContentsCommit-stage
recorderFetch-stage
recorderExecution-stage
recorderTest program
binary
INST6 INST5 INST4 INST3 INST2
INST0
ID: 7 ID: 6 ID: 5 ID: 4 ID: 7 ID: 6 ID: 5
AUX7 AUX6 AUX5 AUX4 AUX3 AUX2 AUX1
PC4 PC3 PC2 PC1 PC3 PC2 PC1
ID: 6 ID: 5 ID: 4 ID: 7 ID: 6 ID: 5
AUX17 AUX16 AUX15 AUX14 AUX12 AUX11
ID: 7 ID: 6 ID: 5 ID: 4 ID: 7 ID: 6 ID: 5
PC6 PC5 PC4 PC3 PC2 PC0…
… ……
ID: 0 AUX13
ID: 0 AUX0
ID: 0 AUX8
ID: 0 PC0
ID: 0 PC5
PC1 INST1
PC7 INST7
time
ID: 0 AUX10
Special ID assignment rule ensures: Uncommitted instructions uniquely identified Relative orders of identical IDs maintained
Even under flushes & out-of-order execution
ID: 0 AUX18
… … … …
ID: 0 PC4
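A simplified sketch of the linking step: because relative orders of identical IDs are maintained, the k-th occurrence of an ID in one recorder can be paired with the k-th occurrence of that ID in another. This is not the algorithm from the paper. The function name `link_footprints` and the data layout are hypothetical, and the sketch assumes no flushes and that all recorders cover the same window of instructions.

```python
from collections import defaultdict


def link_footprints(recorders):
    """recorders: dict of stage name -> list of (inst_id, aux), oldest first.

    Returns a dict keyed by (inst_id, occurrence) whose value maps each
    stage to the auxiliary info recorded for that instruction.
    """
    linked = defaultdict(dict)
    for stage, entries in recorders.items():
        seen = defaultdict(int)
        for inst_id, aux in entries:
            k = seen[inst_id]          # k-th time this ID shows up here
            seen[inst_id] += 1
            linked[(inst_id, k)][stage] = aux
    return dict(linked)


recorders = {
    # Fetch order: ID 0, ID 1, then ID 0 again after wrap-around.
    "fetch": [(0, "PC0"), (1, "PC1"), (0, "PC2")],
    # Out-of-order execution: ID 1 finishes before the first ID 0.
    "execute": [(1, "AUX_b"), (0, "AUX_a"), (0, "AUX_c")],
}
links = link_footprints(recorders)
print(links[(0, 0)])  # {'fetch': 'PC0', 'execute': 'AUX_a'}
print(links[(0, 1)])  # {'fetch': 'PC2', 'execute': 'AUX_c'}
```

The tiny ID space of two values is used only to force ID reuse within a short example.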
Debug Example
Link footprints
High-level analysis
Low-level analysis
Bug locations + exposing stimulus
Debug Example – Decision 1
R0 ← R3 + R6
R5 ← R0 + R6
……
R0 ← R1 + R2
Test Program Binary
Fetch-stage recorder
Serial execution trace
Debug Example – Question 1
R0 ← R3 + R6
R5 ← R0 + R6
……
RAW hazard
R0 ← R1 + R2
R0=3
Issue-stage recorder
R0=5
Execute-stage recorder
Residue of values mismatch?
Serial execution trace
Producer of R0
Consumer of R0
Debug Example – Question 2
R0 ← R3 + R6
R5 ← R0 + R6
……
RAW hazard
R0 ← R1 + R2
Residue of phys. reg. names mismatch?
R0=P5
Dispatch-stage recorder
R0=P2
Serial execution trace
Producer of R0
Consumer of R0
Debug Example – Question 3
R0 ← R3 + R6
R5 ← R0 + R6
……
RAW hazard
R0 ← R1 + R2
Serial execution trace
Producer of R0
Consumer of R0
Residue of phys. reg. name match with previous producer?
R0=P5
Dispatch-stage recorder
R0=P5
Previous producer
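The three questions can be read as simple comparisons over the linked footprints. The sketch below is one interpretation, not the decision procedure from the paper: the function name, the dictionary field names, and the assignment of the slide values (3 and 5, P2 and P5) to producer and consumer are all assumptions.

```python
def low_level_questions(producer, consumer, prev_producer):
    """Answer the three slide questions from recorded residues.

    Each argument is a dict of values read from linked footprints;
    the field names are hypothetical.
    """
    return {
        # Q1: does the value the consumer read differ from what the producer wrote?
        "q1_value_residue_mismatch":
            producer["result_residue"] != consumer["operand_residue"],
        # Q2: does the consumer read a different physical register than the producer wrote?
        "q2_phys_reg_mismatch":
            producer["dest_preg"] != consumer["src_preg"],
        # Q3: does the consumer's register match the previous producer of R0 instead?
        "q3_matches_previous_producer":
            consumer["src_preg"] == prev_producer["dest_preg"],
    }


answers = low_level_questions(
    producer={"result_residue": 5, "dest_preg": "P2"},   # R0 <- R3 + R6
    consumer={"operand_residue": 3, "src_preg": "P5"},   # R5 <- R0 + R6
    prev_producer={"dest_preg": "P5"},                   # R0 <- R1 + R2
)
print(answers)
# {'q1_value_residue_mismatch': True, 'q2_phys_reg_mismatch': True,
#  'q3_matches_previous_producer': True}
```

Under this reading, three "yes" answers mean the consumer picked up a stale register mapping, which points the analysis at the dispatch stage.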
Debug Example – Result
[Figure: dispatch-stage blocks: Decoder; Pipeline Register (Arch. Dest. Reg field, rest of pipeline reg.); Reg. Mapping (Read Circuit, Write Circuit); rest of modules in dispatch stage]
R0 ← R1 + R2
R0 ← R3 + R6
R5 ← R0 + R6
……
Stimulates bug
Bug location
Propagates to failure
Outline
Motivation
IFRA Overview
Simulation Results
Conclusion
Experimental Setup
SimpleScalar architectural simulator
Alpha 21264 configuration
Augmented with ~1K error injection points
Error model – single bit-flips
Hard-to-repeat electrical bugs
Both flip-flops & combinational logic
Stimulus
SPECint 2000 benchmarks
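The single bit-flip error model can be illustrated with a small helper. This is a sketch, not the injection code used in the experiments: the function name `flip_bit`, the default 64-bit width, and the seeded random choice of bit position are assumptions.

```python
import random


def flip_bit(value, bit, width=64):
    """Return `value` with exactly one bit inverted."""
    if not 0 <= bit < width:
        raise ValueError("bit position outside the signal width")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


print(flip_bit(0b1010, 0, width=4))  # 11, i.e. 0b1011

# Picking a random bit of a 64-bit signal at one injection point.
rng = random.Random(0)               # seeded only to make runs repeatable
signal = 0x1234
faulty = flip_bit(signal, rng.randrange(64))
print(bin(signal ^ faulty).count("1"))  # 1: exactly one bit differs
```

Flipping the same bit twice restores the original value, which is a quick sanity check for the helper.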
Experimental Flow
Warm up for a million cycles
Inject error
Any failure detected? Yes / No
Short error latency? Yes / No
Masked / silent error
Post-analyze
Outcomes: Exact localization, Localization with candidates, Complete miss
100K simulation runs
800 post-analysis runs
IFRA Bug Localization Results
Localization resolution
Bug exposing stimulus
One of 200 erroneous design blocks
Avg. block size: 10K 2-input NAND gates
Correct localization (96%)
Exact localization (78%)
Localization with avg. 6 candidates (22%)
Complete miss (4%)
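The percentages above can be combined with a little arithmetic. The sketch assumes that the 78% and 22% figures are fractions of the correct localizations (they sum to 100%), which is an interpretation of the slide rather than something it states; the variable names are hypothetical.

```python
correct = 0.96               # correct localization
exact_given_correct = 0.78   # exact localization, among correct cases
cand_given_correct = 0.22    # localization with candidates, among correct cases
avg_candidates = 6
total_blocks = 200

exact_overall = correct * exact_given_correct
cand_overall = correct * cand_given_correct
# Average number of blocks left to inspect when localization is correct.
avg_blocks = exact_given_correct * 1 + cand_given_correct * avg_candidates

print(f"exact, all runs:      {exact_overall:.2%}")              # 74.88%
print(f"candidates, all runs: {cand_overall:.2%}")               # 21.12%
print(f"avg blocks to check:  {avg_blocks:.2f}")                 # 2.10
print(f"share of design:      {avg_blocks / total_blocks:.2%}")  # 1.05%
```

Under that assumption, a correct localization leaves about 2 of the 200 design blocks to inspect on average.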
Outline
Motivation
IFRA Overview
Simulation Results
Conclusion
Conclusion
IFRA
Inexpensive
1% area, no expensive logic analyzers
No failure reproduction or system simulation
Effective
96% accuracy
Practical
Alpha processor demonstration
Acknowledgement
Bob Gottlieb, Intel
Nagib Hakim, Intel
Ted Hong, Stanford University
Doug Josephson, Intel
Onur Mutlu, Microsoft Research
Priyadarshan Patra, Intel
Eric Rentschler, AMD
Jason Stinson, Intel