DACOTA: Post-silicon validation of the memory subsystem in multi-core designs

Andrew DeOrio, Ilya Wagner, Valeria Bertacco

Advanced Computer Architecture Laboratory

University of Michigan, Ann Arbor

HPCA 2009

2/22

Multi-core Designs

• Many simple processors

• Communicate through interconnect network

[Figures: Intel Polaris and Tilera TILE64]

3/22

Complex Multi-core: Memory Subsystem

• Cache coherence: controls the ordering of operations to a single cache line

• Memory consistency: controls the ordering of operations among different memory addresses

[Figure: Core 0 through Core N-1, each with an L1 cache, connected through an interconnect to the L2 cache]

The memory subsystem is hard to verify

4/22

The Verification Landscape

Pre-Silicon

• Slow: ~Hz

• Stimuli generators

• Random testers

• Formal verification

• Bugs exposed: 98%

• Effort: 70%

[Figure: RTL (module dut; assign x = ~y | z; always @ * begin … end) and stimuli feeding a logic simulator]

Post-Silicon

• Fast: at-speed

• Early HW prototypes

• Hard-to-find bugs

• Relatively new technology

– Ad-hoc

• Bugs exposed: 2%

• Effort: 30%

Runtime

• Fast: at-speed

• Research ideas

– Austin, Malik, Sorin

• Microcode patching

– Intel, AMD

• Bugs exposed: <1%

• Effort: 0%

5/22

Post-Silicon Validation Today

[Figure: a test generator drives both the silicon prototype and the simulation servers; the prototype's final state is compared (=) with the simulation final state. The simulation servers are on the critical path.]

Simulation is the bottleneck of the validation process

6/22

Escaped Bugs in the Memory Subsystem

• 10% of the bugs that made it to product are related to the memory subsystem

Excerpt from Specification Update [Nov. 2007]:

bug AW38, No Fix: Instruction fetch may cause a livelock during snoops of the L1 data cache

Memory related bugs are hard to find

7/22

Post-Silicon Design Goals

• High coverage

– Enable self-detection of memory ordering errors

– Coherence and consistency errors

• Low area impact

• No performance impact after shipment

[Figure: the validation flow of the previous slide (test generator, prototype's final state compared with simulation final state), with a self-check added on the critical path]

8/22

DACOTA: Data Coloring for Consistency Testing and Analysis

Post-silicon validation for the memory subsystem

• Logging

– stores ordering info

– uses cache storage temporarily

• Checking

– starts when storage fills

– distributed algorithm on individual cores

[Figure: Core 0 through Core N-1 with L1 caches, interconnect, and L2 cache; a timeline alternates between benchmark execution and check time]

9/22

Low Overhead Logging Architecture

• DACOTA controller augments cache controller logic

• Reconfigures a portion of cache for activity log

[Figure: DACOTA components: a DACOTA controller in each of Core 0 through Core N-1 and an access vector attached to the data of each cache line, with the cores connected through the interconnect to the L2 cache]

10/22

Low Overhead Logging Architecture

• Attach access vector to each cache line

– Tracks the order of memory accesses to one line

– Entry for each core stores a sequence ID

• Allocate space for activity log

– Stores a sequence of access vectors in program order

[Figure: a cache line's data with its access vector (one entry for each core, initially 0 0 0 0 …) and a counter, updated over an example sequence: 1. core 0 store, 2. core 1 store, 3. core 0 store]
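The logging scheme on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design: the class `AccessLogger`, its method names, and the update rule (a per-line counter hands out the next sequence ID on each store, and loads log the current vector without changing it) are hypothetical choices made for illustration. The slides only state that each line carries a vector with one sequence ID per core and that the activity log stores access vectors in program order.

```python
class AccessLogger:
    """Toy software model of access-vector logging (illustrative only)."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.vectors = {}    # address -> access vector, one entry per core
        self.counters = {}   # address -> per-line sequence counter (assumption)
        self.logs = [[] for _ in range(num_cores)]  # per-core activity logs

    def _vector(self, addr):
        # A line starts with an all-zero access vector.
        if addr not in self.vectors:
            self.vectors[addr] = [0] * self.num_cores
            self.counters[addr] = 0
        return self.vectors[addr]

    def store(self, core, addr):
        # Assumed rule: a store takes the next sequence ID for this line
        # and records it in the storing core's entry.
        vector = self._vector(addr)
        self.counters[addr] += 1
        vector[core] = self.counters[addr]
        self.logs[core].append((addr, tuple(vector)))

    def load(self, core, addr):
        # Assumed rule: a load logs a snapshot of the vector unchanged.
        vector = self._vector(addr)
        self.logs[core].append((addr, tuple(vector)))


# The slide's sequence: core 0 store, core 1 store, core 0 store to one line.
logger = AccessLogger(num_cores=4)
logger.store(0, 0xA)
logger.store(1, 0xA)
logger.store(0, 0xA)
print(logger.vectors[0xA])
print(logger.logs[0])
```

Because each log entry is a snapshot of the line's vector, a checker can later compare snapshots from different cores to recover the relative order of their accesses to that line.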

11/22

Checking Algorithm – On Site

• Compares activity logs from L1 caches

• Distributed algorithm runs on cores

1. Aggregate logs

2. Construct graph (protocol specific)

• Many consistency models supported: SC, TSO, processor consistency, weak consistency

3. Search graph for cycles, which indicate an ordering violation

[Figure: per-core activity logs of store and load entries (ST A1, ST B1, ST A2, LD B1, ST C1, ST D1, ST C2, ST E1, ST B2, …) aggregated into a graph]
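Step 3 above, searching the graph for cycles, can be sketched with a standard depth-first search. This is a generic cycle finder, not the distributed algorithm that runs on the cores; the function name `find_cycle`, the adjacency-dictionary representation, and the node labels are assumptions made for illustration.

```python
def find_cycle(graph):
    """Return a list of nodes that forms a cycle, or None if the graph is acyclic.

    graph maps each node to an iterable of its successor nodes.
    """
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)

    # 0 = unvisited, 1 = on the current DFS path, 2 = fully explored
    state = {n: 0 for n in nodes}
    path = []

    def visit(n):
        state[n] = 1
        path.append(n)
        for m in graph.get(n, ()):
            if state[m] == 1:
                # Back edge: the cycle is the path from m to n, closed at m.
                return path[path.index(m):] + [m]
            if state[m] == 0:
                found = visit(m)
                if found:
                    return found
        path.pop()
        state[n] = 2
        return None

    for n in sorted(nodes):
        if state[n] == 0:
            found = visit(n)
            if found:
                return found
    return None


# Hypothetical ordering graphs: the first is consistent, the second is not.
ordered = {"ST A1": ["ST B1"], "ST B1": ["LD B1"]}
violating = {"ST A1": ["ST B1"], "ST B1": ["LD B1"], "LD B1": ["ST A1"]}
print(find_cycle(ordered))
print(find_cycle(violating))
```

An acyclic graph means the logged operations admit at least one total order that respects every edge; a cycle means no such order exists.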

12/22

Example – Sequential Consistency

Issue Order

[C1] store to address 0xC

[C0] load from address 0xC

[C1] load from address 0xB

[C0] store to address 0xA

[C0] store to address 0xB

[C1] load from address 0xA

Actual Order

[C1] store to address 0xC

[C0] load from address 0xC

[C1] load from address 0xA

[C0] store to address 0xA

[C0] store to address 0xB

[C1] load from address 0xB

In the actual order, Core 1's loads from 0xB and 0xA are swapped relative to the issue order.

[Figure: DACOTA-augmented cache controllers of Core 0 and Core 1, with the tag, data, and log contents of each cache as the accesses execute]

13/22

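Example – Sequential Consistency

Activity Logs

• Core 0: 1 1 0 0xC (LD 0xC), 1 1 0 0xA (ST 0xA), 1 1 0 0xB (ST 0xB)

• Core 1: 1 1 0 0xC (ST 0xC), 1 1 0 0xB (LD 0xB), 0 0 0 0xA (LD 0xA)

[Figure: graph over the nodes LD 0xC, ST 0xC, ST 0xA, ST 0xB, LD 0xB, and LD 0xA, connected by program order edges and address reference edges; a cycle indicates the violation]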
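The example from these two slides can be replayed in a few lines. The sketch below is an illustration under assumptions, not the on-chip algorithm: it takes the per-address order directly from the "Actual Order" list instead of reconstructing it from access vectors, it links only consecutive accesses to the same address, and it detects the cycle with a topological sort (Kahn's algorithm). The names `build_edges` and `ops_blocked_by_cycle` are hypothetical.

```python
from collections import defaultdict, deque

# (core, kind, address) tuples copied from the slide.
ISSUE = [("C1", "ST", "0xC"), ("C0", "LD", "0xC"), ("C1", "LD", "0xB"),
         ("C0", "ST", "0xA"), ("C0", "ST", "0xB"), ("C1", "LD", "0xA")]
ACTUAL = [("C1", "ST", "0xC"), ("C0", "LD", "0xC"), ("C1", "LD", "0xA"),
          ("C0", "ST", "0xA"), ("C0", "ST", "0xB"), ("C1", "LD", "0xB")]


def build_edges(issue, actual):
    edges = set()
    # Program order edges: consecutive operations of the same core.
    last_by_core = {}
    for op in issue:
        core = op[0]
        if core in last_by_core:
            edges.add((last_by_core[core], op))
        last_by_core[core] = op
    # Address reference edges: consecutive accesses to the same address
    # in the observed order, when at least one of the two is a store.
    last_by_addr = {}
    for op in actual:
        addr = op[2]
        prev = last_by_addr.get(addr)
        if prev is not None and "ST" in (prev[1], op[1]):
            edges.add((prev, op))
        last_by_addr[addr] = op
    return edges


def ops_blocked_by_cycle(nodes, edges):
    """Kahn's algorithm: return the nodes that can never be scheduled."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    while ready:
        n = ready.popleft()
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    # Anything left with a positive in-degree lies on or after a cycle.
    return {n for n in nodes if indegree[n] > 0}


blocked = ops_blocked_by_cycle(ISSUE, build_edges(ISSUE, ACTUAL))
for op in ISSUE:
    print(op, "in cycle" if op in blocked else "ordered")
```

With the slide's actual order, the four accesses to 0xA and 0xB are left unscheduled, which is the cycle ST 0xA, ST 0xB, LD 0xB, LD 0xA. Passing `ISSUE` as the observed order instead yields an empty set.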

14/22

Experimental Setup

• Implemented checkers in GEMS simulator

• Created buggy versions of cache controllers

• TSO consistency model

[Figure: simulated system]

– 16 cores, each with a DACOTA-augmented cache controller

– Mesh network

– L2 cache, 4MB

– Directory-based MOESI cache coherence
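Since the experiments use TSO while the earlier example uses sequential consistency, it may help to see how the choice of model changes the program order edges of the graph. The sketch below is a simplified assumption, not DACOTA's actual edge rules: under SC every earlier operation of a core is ordered before every later one, while under TSO a later load is allowed to bypass an earlier store to a different address, so that edge is omitted. The function name `program_order_edges` and the tuple encoding are hypothetical.

```python
def program_order_edges(ops, model="SC"):
    """Return ordered index pairs (i, j) for one core's operations.

    ops is a list of (kind, address) tuples in program order, where kind
    is "ST" or "LD". model is "SC" or "TSO".
    """
    edges = set()
    for i in range(len(ops)):
        for j in range(i + 1, len(ops)):
            kind_i, addr_i = ops[i]
            kind_j, addr_j = ops[j]
            # TSO relaxes only store -> load ordering; this sketch keeps
            # the edge when both operations touch the same address.
            relaxed = (model == "TSO" and kind_i == "ST"
                       and kind_j == "LD" and addr_i != addr_j)
            if not relaxed:
                edges.add((i, j))
    return edges


core_ops = [("ST", "0xA"), ("LD", "0xB"), ("ST", "0xC")]
print(sorted(program_order_edges(core_ops, "SC")))
print(sorted(program_order_edges(core_ops, "TSO")))
```

Enumerating all pairs rather than only consecutive ones keeps the store-to-store ordering between the first and last operations even when the intermediate load edge is dropped. Fewer edges under TSO mean fewer executions are flagged as violations, which is why graph construction has to match the model being checked.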

15/22

Experimental Setup

• Bugs inspired by bugs found in processor errata

• Injected one at a time

Bug              Description                                                Cycles to expose bug
shared-store     store to a shared line may not invalidate other caches     0.3M
invisible-store  store message may not reach all cores                      1.3M
store-alloc1     store allocation in any core may not occur properly        1.9M
store-alloc2     store allocation in one core may not occur properly        2.3M
reorder1         invalid store reordering (all cores)                       1.4M
reorder2         invalid store reordering (one core)                        2.8M
reorder3         invalid store reordering (single address, all cores)       2.9M
reorder4         invalid store reordering (single address, one core)        5.6M

• Testbenches

– Directed random stimulus: memory intensive

– SPLASH2 Benchmarks

16/22

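Performance Impact – Random

[Figure: bar chart of performance overhead (%) for the random benchmarks, split into computation overhead and communication overhead, with an average bar; the axis runs from 0 to 120 and one off-scale bar is labeled 299]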

17/22

Performance Impact – SPLASH2

[Figure: bar chart of performance overhead (%) for the SPLASH2 benchmarks, split into computation overhead and communication overhead, with an average bar; the axis runs from 0 to 50]

Pre-Silicon 100,000,000 %

Traditional Post-Silicon 10,000 %

DACOTA Post-Silicon 60 %

100x more tests!

18/22

Area Impact

Area Overhead - Storage

DACOTA 544 B

Chen, et al., 2008 617,472 B

Meixner, et al., 2006 940,032 B

Pre-Silicon 100,000,000 %

Traditional Post-Silicon 10,000 %

DACOTA Post-Silicon 60 %

Runtime 0 %

• Implemented DACOTA in Verilog

• 0.01% area overhead in OpenSPARC T1

19/22

Communication Overhead

[Figure: two plots of overhead due to communication (%) versus core activity log entries (64, 128, 256, 512, 1024, 2048). SPLASH2 benchmarks: radix, lu, cholesky, fft, and average, on an axis from 0 to 40. Random benchmarks: large_1000_shared, barrier, locks, small_0_shared, and average, on an axis from 0 to 350.]

20/22

Checking Algorithm Overhead

[Figure: two plots of overhead due to the checking algorithm (%) versus core activity log entries (64, 128, 256, 512, 1024, 2048), with the ideal trade-off marked. SPLASH2 benchmarks: radix, lu, cholesky, fft, and average, on an axis from 0 to 120. Random benchmarks: large_1000_shared, barrier, locks, small_0_shared, and average, on an axis from 0 to 700.]

21/22

Related Work

Pre-Silicon

Dill, et al., 1992; Abts, et al., 1993; Pong, et al., 1997; German, et al., 2003

• Formal verification possible for abstract protocol

• Insufficient for implementation

Post-Silicon

Josephson, et al., 2006; Paniccia, et al., 1998; Whetsel, et al., 1991; Tsang, et al., 2000

• Post-Si testing

DeOrio, et al., 2008

• Post-Si verification

• Verifies coherence, but not consistency

Runtime

Meixner, et al., 2006; Chen, et al., 2008

• Effective for protection against transient faults

• Problematic for functional errors

• High area overhead

22/22

Conclusions

• DACOTA is an on-chip post-silicon debugging solution for detecting errors in memory ordering

– Enables self-detection of memory ordering errors

• Effective at catching bugs

– 100x more coverage than traditional post-silicon

• Very low area overhead

– 0.01% area overhead on OpenSPARC T1

• No performance impact on the end user

– Disabled on shipment
