DACOTA: Post-silicon validation of the memory subsystem in...
Transcript of DACOTA: Post-silicon validation of the memory subsystem in...
![Page 1: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/1.jpg)
DACOTA: Post-silicon validation of the
memory subsystem in multi-core designs
Andrew DeOrio Ilya Wagner
Valeria Bertacco
Advanced Computer
Architecture Laboratory
University of Michigan, Ann Arbor
HPCA 2009
![Page 2: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/2.jpg)
2/22
Multi-core Designs
• Many simple processors
• Communicate through interconnect network
Intel Polaris Tilera TILE64
![Page 3: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/3.jpg)
3/22
Complex Multi-core: Memory Subsystem
• Cache coherence: the ordering of operations to a single
cache line
• Memory consistency: controls the ordering of
operations among different memory addresses
L2 Cache
Interconnect
...L1 Cache
Core 0
L1 Cache
Core N-1
The memory subsystem is hard to verify
![Page 4: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/4.jpg)
4/22
The Verification Landscape
Pre-Silicon Post-Silicon Runtime
• Fast: at-speed
• Early HW prototypes
• Hard-to-find bugs
• Relatively new technology
– Ad-hoc
• Slow: ~Hz
• Stimuli generators
• Random testers
• Formal verification
• Fast: at-speed
• Research ideas
– Austin, Malik, Sorin
• Microcode patching
– Intel, AMD
bugs exposed: 98% bugs exposed: 2% bugs exposed: <1%
effort: 70% effort: 30% effort: 0%
Logic
Sim.
Stimuli
module dut;
assign x = ~y | z;
always @ * begin
…
end RTL
![Page 5: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/5.jpg)
5/22
Post-Silicon Validation Today
=
Silicon prototype
Sim
ulat
ion
serv
ers
Test
g
ener
ato
r
Prototype’s final state
Simulation final state
Simulation is the bottleneck of the validation process
Critical path
![Page 6: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/6.jpg)
6/22
• 10% of the bugs that made it to product are related to
the memory subsystem
Escaped Bugs in the Memory Subsystem
bug AW38 No Fix Instruction fetch may cause a livelock during
snoops of the L1 data cache
Excerpt from Specification Update
[Nov. 2007]
Memory related bugs are hard to find
![Page 7: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/7.jpg)
7/22
Post-Silicon Design Goals
• High coverage
– Enable self-detection of memory ordering errors
– Coherence and consistency errors
• Low area impact
• No performance impact after shipment
=
Test
g
ener
ato
r
Prototype’s final state
Simulation final state
Critical path
self
check
![Page 8: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/8.jpg)
8/22
L2 Cache
Interconnect
...L1 Cache
Core 0
L1 Cache
Core N-1
• Logging
– stores ordering info
– uses cache storage
temporarily
• Checking
– starts when storage fills
– distributed algorithm on
individual cores
DACOTA: Data Coloring for Consistency Testing and Analysis
Post-silicon validation for the memory subsystem
benchmark execution check time
![Page 9: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/9.jpg)
9/22
• DACOTA controller augments cache controller logic
• Reconfigures a portion of cache for activity log
Low Overhead Logging Architecture
dacota
components
L 2 Cache
Interconnect
data access vector
c o
n t r o
l D
a c o t a
Core 0
c o
n t r o
l D
a c o t a
Core N - 1
![Page 10: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/10.jpg)
10/22
0 0 0 0 …
• Attach access vector to each cache line
– Tracks the order of memory accesses to one line
– Entry for each core stores a sequence ID
data access vector
• Allocate space for activity log
– Stores a sequence of access vectors
in program order
one entry for
each core counter
1 2 1 2
1. core 0 store
2. core 1 store
3. core 0 store
3 3
Low Overhead Logging Architecture
L2 Cache
Interconnect
data access vector
co
ntro
lD
aco
ta
Core 0
co
ntro
lD
aco
ta
Core N-1
![Page 11: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/11.jpg)
11/22
L2 Cache
Interconnect
...L1 Cache
Core 0
L1 Cache
Core N-1
Checking Algorithm – On Site
• Compares activity logs from L1 caches
• Distributed algorithm runs on cores
1. Aggregate logs
2. Construct graph (protocol specific)
• many protocol supported: SC, TSO, processor C., weak C.
3. Search graph for cycles, indicating ordering violation
ST A1
ST B1
ST A2
LD B1
ST C1
ST D1
ST C2
ST E1
ST B2
ST B1 ST A2
ST C2
ST E1
…
ST A1
![Page 12: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/12.jpg)
12/22
c o
n t r o
l D
a c o t a
Core 0
c o
n t r o
l D
a c o t a
Core 1
Example – Sequential Consistency
Issue Order
[C1] store to address 0xC
[C0] load from address 0xC
[C1] load from address 0xB
[C0] store to address 0xA
[C0] store to address 0xB
[C1] load from address 0xA
<data> 0xB
1 1 0 0xC
tag data log
<data> 0xA 1 1 0 0xB
<data> 0xB
1 1 0 0xB
<data> 0xA 0 0 0 0xA <data> 0xA
<data> 0xC <data> 0xC
1 1 0 0xA 1 1 0 0xC
Actual Order
[C1] store to address 0xC
[C0] load from address 0xC
[C1] load from address 0xA
[C0] store to address 0xA
[C0] store to address 0xB
[C1] load from address 0xB
![Page 13: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/13.jpg)
13/22
Example – Sequential Consistency
1 1 0 0xA 1 1 0 0xB
ST 0xA
ST 0xB LD 0xA
LD 0xB
Activity Logs
1 1 0 0xB 0 0 0 0xA
cycle indicates
violation
1 1 0 0xC 1 1 0 0xC
LD 0xC ST 0xC
address reference edges
program order edges
![Page 14: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/14.jpg)
14/22
L2 Cache
data access vector
co
ntro
lD
aco
ta
Core 0
co
ntro
lD
aco
ta
Core N-1
Experimental Setup
• Implemented checkers in GEMs simulator
• Created buggy versions of cache controllers
• TSO consistency model
directory-based
MOESI cache
coherence
4MB
Mesh network
… 16 cores
![Page 15: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/15.jpg)
15/22
Experimental Setup
• Bugs inspired by bugs found in processor errata
• Injected one at a time
shared-store store to a shared line may not invalidate other caches
invisible-store store message may not reach all cores
store-alloc1 store allocation in any core may not occur properly
store-alloc2 store allocation in one core may not occur properly
reorder1 invalid store reordering (all cores)
reorder2 invalid store reordering (one core)
reorder3 invalid store reordering (single address, all cores)
reorder4 invalid store reordering (single address, one core)
• Testbenches
– Directed random stimulus: memory intensive
– SPLASH2 Benchmarks
Cycles to
Expose
Bug
0.3M
1.3M
1.9M
2.3M
1.4M
2.8M
2.9M
5.6M
![Page 16: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/16.jpg)
16/22
Performance Impact - Random P
erf
orm
ance o
verh
ead (
%)
299
0
20
40
60
80
100
120
Computation overhead
Communication overhead
average
![Page 17: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/17.jpg)
17/22
0
10
20
30
40
50
Performance Impact – SPLASH2 P
erf
orm
ance o
verh
ead (
%)
Computation overhead
Communication overhead
average
Pre-Silicon 100,000,000 %
Traditional Post-Silicon 10,000 %
DACOTA Post-Silicon 60 %
100x more tests!
![Page 18: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/18.jpg)
18/22
Area Impact
Area Overhead - Storage
DACOTA 544 B
Chen, et al., 2008 617,472 B
Meixner, et al., 2006 940,032 B
Pre-Silicon 100,000,000 %
Traditional Post-Silicon 10,000 %
DACOTA Post-Silicon 60 %
Runtime 0 %
• Implemented DACOTA in Verilog
• 0.01% overhead in OpenSPARC T1
![Page 19: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/19.jpg)
19/22
Communication Overhead O
verh
ead d
ue to c
om
munic
ation (
%)
Core activity log entries Core activity log entries
0
50
100
150
200
250
300
350
64 128 256 512 1024 2048
large_1000_shared
barrier
locks
small_0_shared
average
0
5
10
15
20
25
30
35
40
64 128 256 512 1024 2048
radix lu
cholesky fft
average
SPLASH2 benchmarks Random benchmarks
![Page 20: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/20.jpg)
20/22
Checking Algorithm Overhead O
verh
ead d
ue to C
heckin
g A
lg. (%
)
Core activity log entries
0
20
40
60
80
100
120
64 128 256 512 1024 2048
radix
lu
cholesky
fft
average
0
100
200
300
400
500
600
700
64 128 256 512 1024 2048
large_1000_shared
barrier
locks
small_0_shared
average
Core activity log entries
SPLASH2 benchmarks Random benchmarks
ideal trade-off
![Page 21: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/21.jpg)
21/22
Related Work
Pre-Silicon Post-Silicon Runtime
Meixner, et al., 2006;
Chen, et al., 2008
• Effective for
protection against
transient faults
• Problematic for
functional errors
• High area overhead
Dill, et al., 1992;
Abts, et al., 1993;
Pong, et al., 1997;
German, et al., 2003
• Formal verification
possible for
abstract protocol
• Insufficient for
implementation
Josephson, et al., 2006
Paniccia, et al., 1998
Whetsel, et al., 1991
Tsang, et al., 2000
• Post-Si testing
DeOrio, et al., 2008
• Post-Si verification
• Verifies coherence,
but not consistency
![Page 22: DACOTA: Post-silicon validation of the memory subsystem in ...andrewdeorio.com/research/assets/HPCA09Dacota-slides.pdfDACOTA: Post-silicon validation of the memory subsystem in multi-core](https://reader033.fdocuments.us/reader033/viewer/2022041907/5e6401c6c5e0af27ea454032/html5/thumbnails/22.jpg)
22/22
Conclusions
• DACOTA is an on-chip post-silicon debugging solution
for detecting errors in memory ordering
– Enables self-detection of memory ordering errors
• Effective at catching bugs
– 100x more coverage than traditional post-silicon
• Very low area overhead
– 0.01% area overhead on OpenSPARC T1
• No performance impact to end user
– Disable on shipment