
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures

Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader

Challenges of Design Verification

• Contemporary hardware designs require millions of lines of RTL code

– More lines of code written for verification than for the implementation itself

• Tradeoff between performance and design complexity

– Speculative execution, shared caches, instruction reordering

– Performance wins out

GTC 2016, San Jose, CA 2

Performance vs. Design Complexity

• Programmer burden

– Requires correct usage of synchronization

• Time to market

– Earlier remediation of bugs is less costly

– Re-spins on tapeout are expensive

• Significant time spent on verification

– Verification problems are often NP-complete


Memory Consistency Models

• Contract between SW and HW regarding the semantics of memory operations

• Classic example: Sequential Consistency (SC)

– All processors observe the same ordering of operations serviced by memory

– Too strict for modern optimizations/architectures

• Nomenclature

– ST[A] → 1 “Wrote a value of 1 to location A”

– LD[B] ← 2 “Read a value of 2 from location B”


ARM Idiosyncrasies

• Our focus: ARMv8

• Speculative Execution is allowed

• Threads can reorder reads and writes

– Assuming no dependency exists

• Writes are not guaranteed to be simultaneously visible to other cores


Problem Setup

1. Construct an initial graph

– Vertices represent load, store, and barrier insts

– Edges represent memory ordering

• Based on architectural rules

2. Iteratively infer additional edges to the graph

– Based on existing relationships

3. Check for cycles

– If one exists: contradiction!

[Figure: example instruction graph for CPU 0 and CPU 1 with vertices ST[B] → 90, ST[B] → 92, LD[B] ← 92, LD[A] ← 2, LD[B] ← 93, and LD[B] ← 92]

• Given an inst. trace from a simulator, RTL, or silicon
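Step 3 of the setup can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the talk: vertices are assumed to be numbered 0..n-1, edges are (before, after) pairs, and the cycle check uses Kahn's topological sort. The function name `has_cycle` and the tiny example graphs are hypothetical.

```python
from collections import defaultdict, deque

def has_cycle(num_vertices, edges):
    """Return True if the directed ordering graph contains a cycle."""
    succ = defaultdict(list)
    indegree = [0] * num_vertices
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
    # Kahn's algorithm: repeatedly remove vertices with no unprocessed predecessor.
    ready = deque(v for v in range(num_vertices) if indegree[v] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    # Vertices that never reach in-degree 0 sit on or behind a cycle.
    return visited != num_vertices

# Hypothetical three-instruction graph: 0 before 1 before 2 is consistent.
consistent = [(0, 1), (1, 2)]
# Adding "2 before 0" makes the ordering contradictory.
contradictory = consistent + [(2, 0)]
print(has_cycle(3, consistent), has_cycle(3, contradictory))
```

A cycle means some instruction would have to be ordered before itself, which is the contradiction the slide refers to.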

TSOtool

• Hangal et al., ISCA ’04

– Designed for SPARC, but portable to ARM

• Each store writes a unique value to memory

– Easily map a load to the store that wrote its data

• Tradeoff between accuracy and runtime

– Polynomial time, but false positives are possible

– If a cycle is found, a bug indeed exists

– If no cycles are found, execution appears consistent
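Because each store writes a unique value, the store a load read from can be looked up directly. A minimal sketch of that lookup, assuming a trace of hypothetical `(kind, location, value)` tuples (this record layout is an illustration, not TSOtool's actual format):

```python
def map_loads_to_stores(trace):
    """Return {load_index: store_index or None} for a list of (kind, location, value)."""
    writer = {}
    # First pass: index every store by (location, value); values are assumed unique.
    for idx, (kind, loc, val) in enumerate(trace):
        if kind == "ST":
            if (loc, val) in writer:
                raise ValueError(f"value {val} stored to {loc} more than once")
            writer[(loc, val)] = idx
    # Second pass: each load finds its writer; None means no store in the trace
    # wrote that value (for example, the load saw the initial memory contents).
    reads_from = {}
    for idx, (kind, loc, val) in enumerate(trace):
        if kind == "LD":
            reads_from[idx] = writer.get((loc, val))
    return reads_from

# Hypothetical trace in the slides' notation: ST[B] → 90, ST[B] → 92, LD[B] ← 92, LD[B] ← 90.
trace = [("ST", "B", 90), ("ST", "B", 92), ("LD", "B", 92), ("LD", "B", 90)]
reads_from = map_loads_to_stores(trace)
print(reads_from)
```

Two passes are used because, in a multi-CPU trace, a load may appear in the list before the store it read from.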


Need for Scalability

• Must run many tests to maximize coverage

– Stress different portions of the memory subsystem

• Longer tests put supporting logic in more interesting states

– Many instructions are required to build history in an LRU cache, for instance

• Using a CPU cluster does not suffice

– The results of one set of tests dictate the structure of the ensuing tests

– Faster tests help with interactivity!

• Solution: Efficient algorithms and parallelism


Inferred Edge Insertions (Rule 6)

• S can reach X

• X does not load data from S

• S comes before the node that stored X’s data

[Figure: W: ST[A] → 2, X: LD[A] ← 2, S: ST[A] → 1; the inferred Rule 6 edge runs from S to W]


Inferred Edge Insertions (Rule 7)

• S can reach X

• Loads read data from S, not X

• Loads came before X

[Figure: L: LD[A] ← 1, X: ST[A] → 2, S: ST[A] → 1, M: LD[A] ← 1; the inferred Rule 7 edges run from L and M to X]

Initial Algorithm for Inferring Edges

for_each(store vertex S)
{
    for_each(vertex X reachable from S) // Getting this set is expensive!
    {
        if(location[S] == location[X])
        {
            if((type[X] == LD) && (data[S] != data[X]))
            {
                // Add Rule 6 edge from S to W, the store that X read from
            }
            else if(type[X] == ST)
            {
                for_each(load vertex L that reads data from S)
                {
                    // Add Rule 7 edge from L to X
                }
            } // End if instruction type is store
        } // End if location
    } // End for each reachable vertex
} // End for each store
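The pseudocode above can be turned into a small runnable sketch. This is one possible reading, not the authors' code: vertices are hypothetical string names, `kind`, `loc`, and `data` are dictionaries keyed by vertex, reachability is computed with a plain depth-first search (the expensive step the slide warns about), and the reads-from edge W → X in the first example is assumed to be part of the initial graph.

```python
from collections import defaultdict

def reachable(succ, start):
    """All vertices reachable from start via one or more edges (iterative DFS)."""
    seen, stack = set(), list(succ[start])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(succ[v])
    return seen

def infer_edges(kind, loc, data, edges):
    """One pass of Rule 6 / Rule 7 inference; returns only edges not already present."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    # Unique store values let (location, value) identify the writing store.
    writer = {(loc[v], data[v]): v for v in kind if kind[v] == "ST"}
    readers = defaultdict(list)
    for v in kind:
        if kind[v] == "LD" and (loc[v], data[v]) in writer:
            readers[writer[(loc[v], data[v])]].append(v)
    new_edges = set()
    for s in kind:
        if kind[s] != "ST":
            continue
        for x in reachable(succ, s):
            if loc[x] != loc[s]:
                continue
            if kind[x] == "LD" and data[x] != data[s]:
                w = writer.get((loc[x], data[x]))
                if w is not None:
                    new_edges.add((s, w))        # Rule 6: S before X's writer W
            elif kind[x] == "ST" and x != s:
                for load in readers[s]:
                    new_edges.add((load, x))     # Rule 7: readers of S before X
    return new_edges - set(edges)

# Rule 6 example from the slides: S reaches X, but X read from W.
rule6 = infer_edges({"W": "ST", "X": "LD", "S": "ST"},
                    {"W": "A", "X": "A", "S": "A"},
                    {"W": 2, "X": 2, "S": 1},
                    {("W", "X"), ("S", "X")})
# Rule 7 example from the slides: L and M read from S, and S reaches store X.
rule7 = infer_edges({"S": "ST", "X": "ST", "L": "LD", "M": "LD"},
                    {"S": "A", "X": "A", "L": "A", "M": "A"},
                    {"S": 1, "X": 2, "L": 1, "M": 1},
                    {("S", "X"), ("S", "L"), ("S", "M")})
print(rule6, rule7)
```

In a full checker this pass would be repeated until no new edges appear, followed by the cycle check.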


Virtual Processors (vprocs)

• Split instructions from physical to virtual processors

• Each vproc is sequentially consistent

– Program order ↔ Memory order


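[Figure: CPU 0 executes ST[B] → 91, ST[A] → 1, LD[A] ← 2, ST[B] → 92; its instructions are split across vprocs (labeled VPROC 0, VPROC 1, VPROC 2), with ST[A] → 1 and LD[A] ← 2 grouped together and ST[B] → 91 and ST[B] → 92 grouped together]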
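One way to picture the split is a greedy partition of a CPU's program-order instruction list into chains. The sketch below is only an illustration: the slides do not state the exact splitting rule, so the predicate `must_follow` is hypothetical, and the stand-in used here (two accesses stay ordered when they touch the same location) is an assumption chosen because it reproduces the grouping in the figure.

```python
def split_into_vprocs(instructions, must_follow):
    """Greedily place each instruction on the first vproc whose last
    instruction it is required to follow; otherwise open a new vproc.
    must_follow is assumed to be transitive so each vproc stays a chain."""
    vprocs = []
    for inst in instructions:
        for chain in vprocs:
            if must_follow(chain[-1], inst):
                chain.append(inst)
                break
        else:
            vprocs.append([inst])
    return vprocs

def same_location(a, b):
    """Hypothetical ordering rule: accesses to the same location stay ordered."""
    return a[1] == b[1]

# CPU 0 from the figure: ST[B] → 91, ST[A] → 1, LD[A] ← 2, ST[B] → 92.
cpu0 = [("ST", "B", 91), ("ST", "A", 1), ("LD", "A", 2), ("ST", "B", 92)]
vprocs = split_into_vprocs(cpu0, same_location)
for i, chain in enumerate(vprocs):
    print(i, chain)
```

Within each resulting chain, program order and memory order coincide, which is the property the slide asks of a vproc.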

Reverse Time Vector Clocks (RTVC)

• Consider the RTVC of ST[B] = 90

Purple: ST[B] = 92

Blue: NULL

Green: LD[B] = 92

Orange: LD[B] = 92

• Track the earliest successor from each vertex to each vproc

– Captures transitivity

[Figure: the example instruction graph for CPU 0 and CPU 1 (ST[B] → 90, ST[B] → 92, LD[B] ← 92, LD[A] ← 2, LD[B] ← 93, LD[B] ← 92), with vprocs distinguished by color]

Complexity of inferring edges: $O(n^2 p^2 d_{\max})$

Updating RTVCs

• Computing RTVCs once is fast

– Process vertices in the reverse order of a topological sort

– Check neighbors directly, then their RTVCs

• Every time a new edge is inserted, RTVC values need to change

– # of edge insertions ≈ 𝑚
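The one-shot computation can be sketched as follows. This is a minimal illustration under stated assumptions, not TSOtool's data layout: `vproc[v]` and `pos[v]` (hypothetical names) give each vertex's vproc and its position within that vproc, the ordering between consecutive vproc instructions is passed in as explicit edges, and the graph is assumed to be acyclic.

```python
from collections import defaultdict, deque

INF = float("inf")

def compute_rtvcs(vproc, pos, edges, num_vprocs):
    """rtvc[v][q] = earliest position in vproc q reachable from v (INF if none)."""
    succ = defaultdict(list)
    indegree = {v: 0 for v in vproc}
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
    # Topological order via Kahn's algorithm.
    order, ready = [], deque(v for v in vproc if indegree[v] == 0)
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    rtvc = {v: [INF] * num_vprocs for v in vproc}
    # Reverse topological order: every successor's RTVC is final before it is read.
    for v in reversed(order):
        for u in succ[v]:
            rtvc[v][vproc[u]] = min(rtvc[v][vproc[u]], pos[u])  # the neighbor itself
            for q in range(num_vprocs):                         # then its RTVC
                rtvc[v][q] = min(rtvc[v][q], rtvc[u][q])
    return rtvc

# Hypothetical graph: vproc 0 holds a, b; vproc 1 holds c, d.
vproc = {"a": 0, "b": 0, "c": 1, "d": 1}
pos = {"a": 0, "b": 1, "c": 0, "d": 1}
edges = [("a", "b"), ("c", "d"), ("a", "d"), ("b", "c")]
rtvc = compute_rtvcs(vproc, pos, edges, num_vprocs=2)
print(rtvc["a"])
```

In this example `a` has a direct edge only to `d` in vproc 1, yet its RTVC entry for vproc 1 is position 0 (vertex `c`), reached through `b`. That is the transitivity the clocks capture.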


• TSOtool implements both vprocs and RTVCs

Facilitating Parallelism

• Repeatedly updating RTVCs is expensive

– For $k$ edge insertions, RTVC updates take $O(kpn)$ time

• $k = O(n^2)$, but usually is a small multiple of $n$

• Idea: Update RTVCs once per iteration rather than per edge insertion

– For $i$ iterations, RTVC updates take $O(ipn)$ time

• $i \ll k$ (less than 10 for all test cases)

– Less communication between threads

• Complexity of inferring edges: $O(n^2 p)$
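The per-iteration update can be pictured as a fixed-point loop that freezes the graph, infers against that frozen snapshot, and only then applies the whole batch. The sketch below shows just that control structure. The driver name `infer_to_fixed_point` is hypothetical, and `toy_infer` is a deliberately simple placeholder rule (add a → c whenever a → b and b → c), standing in for the Rule 6 / Rule 7 inference and the RTVC recomputation.

```python
def infer_to_fixed_point(edges, infer_batch):
    """Repeat: snapshot the edges, infer against the snapshot, apply the batch."""
    edges = set(edges)
    iterations = 0
    while True:
        # Anything derived from the graph (e.g. RTVCs) would be rebuilt here,
        # once per iteration, and is therefore stale within the iteration.
        snapshot = frozenset(edges)
        new_edges = infer_batch(snapshot) - edges
        iterations += 1
        if not new_edges:
            return edges, iterations
        edges |= new_edges

def toy_infer(snapshot):
    """Placeholder rule for illustration only: compose pairs of adjacent edges."""
    return {(a, d) for a, b in snapshot for c, d in snapshot if b == c}

chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
final_edges, iterations = infer_to_fixed_point(chain, toy_infer)
print(len(final_edges), iterations)
```

Because every inference in an iteration sees the same snapshot, the work inside one iteration is independent and can be divided among threads, at the cost of possibly needing edges that an eager update would have exposed sooner.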


Correctness

• Inferred edges found by our approach will not be the same as the edges found by TSOtool

– Might not infer an edge that TSOtool does

• RTVC for TSOtool can change mid-iteration

– Might infer an edge that TSOtool does not

• Our approach will have “stale” RTVC values

• Both approaches make forward progress

– Number of edges monotonically increases

• Any edge inserted by our approach could have been inserted by the naïve approach [Thm 1]

• If TSOtool finds a cycle, we will also find a cycle [Thm 2]


Parallel Implementations

• OpenMP

– Each thread keeps its own partition of added edges

– After each iteration of inferring edges, reduce the per-thread partitions

• CUDA

– Assign threads to each store instruction

– Threads independently traverse the vprocs of this store

– Atomically add edges to a preallocated array in global memory
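The partition-then-reduce pattern described for OpenMP can be mimicked with Python threads. This is a structural sketch only: Python threads stand in for OpenMP threads and will not reproduce the reported speedups, and `infer_one` is a hypothetical per-store inference callback.

```python
from concurrent.futures import ThreadPoolExecutor

def infer_for_stores(stores, infer_one):
    """Worker: gather edges for its share of the stores in a private list."""
    local_edges = []
    for s in stores:
        local_edges.extend(infer_one(s))
    return local_edges

def parallel_infer(stores, infer_one, num_threads=4):
    """Split stores across workers, then reduce the private partitions."""
    chunks = [stores[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partitions = list(pool.map(lambda c: infer_for_stores(c, infer_one), chunks))
    merged = set()
    for part in partitions:   # reduction step at the end of the iteration
        merged.update(part)
    return merged

# Hypothetical per-store rule for illustration: store s yields the edge (s, s + 1).
merged_edges = parallel_infer(list(range(8)), lambda s: [(s, s + 1)])
print(sorted(merged_edges))
```

Keeping each partition private avoids synchronization while edges are being inferred. The CUDA variant on the slide instead appends to one shared preallocated array using atomics.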


Experimental Setup

• Intel Core i7-2600K CPU

– Quad core, 3.4GHz, 8MB LLC, 16GB DRAM

• NVIDIA GeForce GTX Titan

– 14 SMs, 837 MHz base clock, 6GB DRAM

• ARM system under test

– Cortex-A57, quad core

• Instruction graphs range from $n = 2^{18}$ to $n = 2^{22}$ vertices, $n \approx m$

– Sparse, high-diameter, low-degree

– Tests vary by their distribution of LD/ST/DMB instructions, # of vprocs, and inst dependencies


Importance of Scaling


• 512K instructions per core

• 2M total instructions

Speedup over TSOtool (Application)

Graph Size       # of tests   Lazy RTVC   OMP 2    OMP 4     GPU
64K*4 = 256K     27           5.64x       7.62x    9.43x     10.79x
128K*4 = 512K    27           5.31x       7.12x    8.90x     10.76x
256K*4 = 1M      23           6.30x       9.05x    12.13x    15.47x
512K*4 = 2M      10           3.68x       6.41x    10.81x    24.55x
1M*4 = 4M        2            3.05x       5.58x    9.97x     37.64x


• GPU is always best; scales much better to larger tests

• Extreme case: 9 hours using TSOtool → under 10 minutes using our GPU approach

• Avg. parallel speedups over our improved sequential approach:

– 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)

Summary

• Relaxing the updates to RTVCs led to a better sequential approach and facilitated parallel implementations

– Tradeoff between redundant work and parallelism

• Faster execution leads to interactive bug-finding

• The GPU scales well to larger problem instances

– Helpful for corner case bugs that slip through pre-silicon verification

• For the twelve largest test cases our GPU implementation achieves a 26.36x average application speedup


Acknowledgments

• Shankar Govindaraju and Tom Hart, for their help in understanding NVIDIA’s implementation of TSOtool for ARM


Questions

“To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science.” – Albert Einstein


Backup


Sequential Consistency Examples

• Valid

P1: ST[x] → 1
P2: LD[x] ← 1, LD[x] ← 2
P3: LD[x] ← 1, LD[x] ← 2
P4: ST[x] → 2
(time axis: t=0, t=1, t=2)

– ST[x] → 1 handled before ST[x] → 2

• Invalid

P1: ST[x] → 1
P2: LD[x] ← 1, LD[x] ← 2
P3: LD[x] ← 2, LD[x] ← 1
P4: ST[x] → 2
(time axis: t=0, t=1, t=2)

– Writes propagate to P2 and P3 in a different order

– Valid for weaker memory models
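The two examples can be checked mechanically for tiny traces. The sketch below is a brute-force illustration, not the technique used in the talk: it searches for a single interleaving of all processors' operations in which every load returns the most recent store to its location. The tuple layout and function name are hypothetical, and the search is exponential, so it only suits toy inputs like these.

```python
from functools import lru_cache

def sequentially_consistent(programs):
    """programs: tuple of per-processor tuples of (kind, location, value)."""
    @lru_cache(maxsize=None)
    def search(positions, memory):
        if all(p == len(prog) for p, prog in zip(positions, programs)):
            return True                      # every operation has been placed
        mem = dict(memory)
        for i, prog in enumerate(programs):
            if positions[i] == len(prog):
                continue
            kind, loc, val = prog[positions[i]]
            if kind == "LD" and mem.get(loc) != val:
                continue                     # this load cannot go next
            next_mem = dict(mem)
            if kind == "ST":
                next_mem[loc] = val
            nxt = positions[:i] + (positions[i] + 1,) + positions[i + 1:]
            if search(nxt, tuple(sorted(next_mem.items()))):
                return True
        return False
    return search((0,) * len(programs), ())

valid = ((("ST", "x", 1),),
         (("LD", "x", 1), ("LD", "x", 2)),
         (("LD", "x", 1), ("LD", "x", 2)),
         (("ST", "x", 2),))
invalid = ((("ST", "x", 1),),
           (("LD", "x", 1), ("LD", "x", 2)),
           (("LD", "x", 2), ("LD", "x", 1)),
           (("ST", "x", 2),))
print(sequentially_consistent(valid), sequentially_consistent(invalid))
```

In the second trace P2 needs the store of 1 to come first while P3 needs the store of 2 to come first, so no single total order exists.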

Weaker Models

• SC is intuitive, but is too strict

– Prevents common compiler/arch. optimizations

• Commercial products use weaker models

– x86: Total Store Order (TSO)

– Power/ARM: Relaxed Memory Ordering (RMO)

• Weaker models allow for greater optimization opportunities

– Cost: More complicated semantics


Initial Algorithm: Weaknesses

• Expensive to compute

– $O(n^3)$, assuming edges can be inserted in $O(1)$ time

– Repeated iteratively until a fixed point is reached

• Requires the transitive closure of the graph

– Expensive to store

– Capturing $n^2$ relationships (does vertex $i$ reach vertex $j$?)

• Adds lots of redundant edges

– Should leverage transitivity when possible

[Figure: example graph with vertices A, B, and C]
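To put the storage cost in perspective, here is a rough calculation for the largest graphs in the talk (about 4M vertices, i.e. $n = 2^{22}$). It assumes one bit per (i, j) reachability pair. That encoding is an assumption made for illustration, not something the slides specify.

```latex
n^2 = \left(2^{22}\right)^2 = 2^{44}\ \text{pairs}
\;\Rightarrow\; 2^{44}\ \text{bits} = 2^{41}\ \text{bytes} = 2\ \text{TiB}
```

Even at this optimistic density the closure is far larger than the 6GB of GPU DRAM and 16GB of host DRAM listed in the experimental setup.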

Reverse Time Vector Clocks (RTVCs)

• vprocs provide implicit orderings

• Reverse Time Vector Clock

– Track the earliest successor from each vertex to each vproc

– Captures transitivity

• Traverse vprocs rather than the graph itself

– No need to check every reachable vertex

• Bounds the number of reachable edges to be inspected by $p$, the number of vprocs

– No need to compute or store the transitive closure!

[Figure: example vproc instructions ST[B] → 91, ST[B] → 92, ST[A] → 1]


Superfluous work?

• Our approach tends to add more edges than TSOtool, some of which are redundant

– Worst case: 36% additional edges

– The redundancy is well worth the performance benefits


Test Info

n = |V|      m = |E|      TSOtool Inferred   Iterations   ST/LD/BAR (%)
2,097,963    3,799,254    4,487,224          5            76/24/0
2,098,219    3,686,624    4,411,887          4            79/21/0
1,977,832    4,453,340    5,179,108          5            46/53/1
2,097,741    3,875,831    4,635,852          7            77/23/0
1,936,321    5,109,990    5,236,671          5            44/54/2
2,098,321    2,491,062    4,257,077          6            80/20/0
2,097,809    4,321,793    4,404,753          7            78/21/1
1,871,831    3,660,617    4,861,044          6            44/54/2
2,097,809    4,434,120    4,418,555          5            80/20/0
4,195,405    6,934,725    9,338,902          7            76/23/1
4,194,961    7,960,567    8,963,281          6            78/22/0


Speedup over TSOtool (Inferring edges)

Graph Size       # of tests   Lazy RTVC   OMP 2     OMP 4     GPU
64K*4 = 256K     27           15.09x      29.31x    53.45x    57.90x
128K*4 = 512K    27           16.41x      31.49x    57.34x    76.98x
256K*4 = 1M      23           14.51x      27.98x    51.68x    72.32x
512K*4 = 2M      10           4.01x       7.52x     14.19x    42.90x
1M*4 = 4M        2            3.08x       5.70x     10.39x    45.16x


• Number of tests decreases with test size because of industrial time constraints

– Motivation for this work

• Avg. parallel speedups over our improved sequential approach:

– 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)





