Parallel Methods for Verifying the Consistency of
Weakly-Ordered Architectures
Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader
Challenges of Design Verification
• Contemporary hardware designs require
millions of lines of RTL code
– More lines of code written for verification than for
the implementation itself
• Tradeoff between performance and design
complexity
– Speculative execution, shared caches, instruction
reordering
– Performance wins out
GTC 2016, San Jose, CA 2
Performance vs. Design Complexity
• Programmer burden
– Requires correct usage of
synchronization
• Time to market
– Earlier remediation of bugs is less costly
– Re-spins on tapeout are expensive
• Significant time spent on verification
– Verification techniques are often NP-complete
Memory Consistency Models
• Contract between SW and HW regarding the
semantics of memory operations
• Classic example: Sequential Consistency (SC)
– All processors observe the same ordering of
operations serviced by memory
– Too strict for modern optimizations/architectures
• Nomenclature
– ST[A] → 1 “Wrote a value of 1 to location A”
– LD[B] ← 2 “Read a value of 2 from location B”
ARM Idiosyncrasies
• Our focus: ARMv8
• Speculative Execution is allowed
• Threads can reorder reads and writes
– Assuming no dependency exists
• Writes are not guaranteed to be
simultaneously visible to other cores
Problem Setup
1. Construct an initial graph
– Vertices represent load, store,
and barrier insts
– Edges represent memory
ordering
• Based on architectural rules
2. Iteratively infer additional
edges to the graph
– Based on existing
relationships
3. Check for cycles
– If one exists: contradiction!
[Figure: example instruction graph built from a two-CPU trace (CPU 0, CPU 1); vertices shown: ST[B] → 90, ST[B] → 92, LD[A] ← 2, LD[B] ← 92 (twice), LD[B] ← 93]
• Given an inst. trace from a simulator, RTL, or silicon
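Step 3 above, the cycle check, can be sketched with Kahn's algorithm. This is a minimal illustration, not the implementation from the talk: the edge-list representation and the name `has_cycle` are assumptions.

```python
from collections import deque

def has_cycle(num_vertices, edges):
    """Return True if the directed graph has a cycle (Kahn's algorithm)."""
    succ = [[] for _ in range(num_vertices)]
    indegree = [0] * num_vertices
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1

    # Repeatedly remove vertices that have no remaining predecessors.
    ready = deque(v for v in range(num_vertices) if indegree[v] == 0)
    removed = 0
    while ready:
        u = ready.popleft()
        removed += 1
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    # Any vertex that could not be removed lies on, or behind, a cycle.
    return removed != num_vertices
```

A cycle means the observed orderings contradict each other, so the trace cannot be explained by the memory model.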
TSOtool
• Hangal et al., ISCA ’04
– Designed for SPARC, but portable to ARM
• Each store writes a unique value to memory
– Easily map a load to the store that wrote its data
• Tradeoff between accuracy and runtime
– Polynomial time, but false positives are possible
– If a cycle is found, a bug indeed exists
– If no cycles are found, execution appears consistent
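Because every store writes a unique value, the load-to-store mapping reduces to a dictionary lookup. The sketch below assumes a hypothetical trace format of `(kind, location, value)` tuples; the function name and the `None` convention for loads of the initial memory value are illustrative choices, not part of TSOtool.

```python
def build_reads_from(trace):
    """Map each load's index to the index of the store that wrote its value.

    trace: list of (kind, location, value) tuples, kind in {"ST", "LD"}.
    Loads whose value matches no store (e.g. the initial memory value)
    map to None.
    """
    writer = {}  # (location, value) -> index of the unique store
    for i, (kind, loc, val) in enumerate(trace):
        if kind == "ST":
            if (loc, val) in writer:
                raise ValueError(f"value {val} stored twice to {loc}")
            writer[(loc, val)] = i
    return {
        i: writer.get((loc, val))
        for i, (kind, loc, val) in enumerate(trace)
        if kind == "LD"
    }

trace = [("ST", "A", 1), ("ST", "A", 2), ("LD", "A", 2)]
print(build_reads_from(trace))  # {2: 1}
```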
Need for Scalability
• Must run many tests to maximize coverage
– Stress different portions of the memory subsystem
• Longer tests put supporting logic in more interesting
states
– Many instructions are required to build history in an LRU
cache, for instance
• Using a CPU cluster does not suffice
– The results of one set of tests dictate the structure of the
ensuing tests
– Faster tests help with interactivity!
• Solution: Efficient algorithms and parallelism
Inferred Edge Insertions (Rule 6)
• S can reach X
• X does not load data
from S
• S comes before the
node that stored X’s
data
[Figure: S: ST[A] → 1, W: ST[A] → 2, X: LD[A] ← 2; the inferred edge runs from S to W]
Inferred Edge Insertions (Rule 7)
• S can reach X
• Loads read data from S, not X
• Loads came before X
[Figure: S: ST[A] → 1, X: ST[A] → 2, L: LD[A] ← 1, M: LD[A] ← 1; the inferred edges run from L and M to X]
Initial Algorithm for Inferring Edges
for_each(store vertex S)
{
    for_each(vertex X reachable from S) // Getting this set is expensive!
    {
        if(location[S] == location[X])
        {
            if((type[X] == LD) && (data[S] != data[X]))
            {
                // Add Rule 6 edge from S to W, the store that X read from
            }
            else if(type[X] == ST)
            {
                for_each(load vertex L that reads data from S)
                {
                    // Add Rule 7 edge from L to X
                }
            } // End if instruction type is store
        } // End if location
    } // End for each reachable vertex
} // End for each store
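The pseudocode above can be made executable for small graphs. The following is a sketch under stated assumptions: operations are hypothetical `(kind, location, value)` tuples, the graph is an edge list, reachability is computed with a plain depth-first search (the expensive step the slide warns about), and the function names are illustrative.

```python
def reachable_from(succ, start):
    """Return every vertex reachable from start by one or more edges (iterative DFS)."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def infer_edges_once(ops, edges):
    """One pass of Rule 6 / Rule 7 inference; returns only the newly inferred edges.

    ops[i] = (kind, location, value) with kind in {"ST", "LD", "BAR"}.
    """
    succ = [set() for _ in ops]
    for u, v in edges:
        succ[u].add(v)

    # Unique store values let us map each (location, value) to its store.
    writer = {(loc, val): i for i, (k, loc, val) in enumerate(ops) if k == "ST"}
    readers = {}  # store index -> loads that read its value
    for i, (k, loc, val) in enumerate(ops):
        if k == "LD" and (loc, val) in writer:
            readers.setdefault(writer[(loc, val)], []).append(i)

    inferred = set()
    for s, (kind, loc, val) in enumerate(ops):
        if kind != "ST":
            continue
        for x in reachable_from(succ, s):
            x_kind, x_loc, x_val = ops[x]
            if x == s or x_loc != loc:
                continue
            if x_kind == "LD" and x_val != val:
                w = writer.get((x_loc, x_val))
                if w is not None:
                    inferred.add((s, w))       # Rule 6: S before X's writer W
            elif x_kind == "ST":
                for load in readers.get(s, []):
                    inferred.add((load, x))    # Rule 7: readers of S before X
    return inferred - {tuple(e) for e in edges}
```

Applied to the Rule 6 figure (S, W, X with S reaching X and W feeding X), the pass yields the single edge from S to W. Applied to the Rule 7 figure, it yields the edges from L and M to X.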
Virtual Processors (vprocs)
• Split instructions from physical to virtual processors
• Each vproc is sequentially consistent
– Program order ↔ Memory order
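[Figure: CPU 0's instruction stream (ST[B] → 91, ST[A] → 1, LD[A] ← 2, ST[B] → 92) split across VPROC 0, VPROC 1, and VPROC 2; ST[A] → 1 and LD[A] ← 2 are placed on VPROC 0, and the two stores to B on the other vprocs]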
Reverse Time Vector Clocks (RTVC)
• Consider the RTVC of ST[B] → 90
– Purple: ST[B] → 92
– Blue: NULL
– Green: LD[B] ← 92
– Orange: LD[B] ← 92
• Track the earliest
successor from each
vertex to each vproc
– Captures transitivity
[Figure: the two-CPU example graph (CPU 0, CPU 1) with vertices ST[B] → 90, ST[B] → 92, LD[A] ← 2, LD[B] ← 92 (twice), LD[B] ← 93]
Complexity of inferring edges: $O(n^2 p^2 d_{max})$
Updating RTVCs
• Computing RTVCs once is fast
– Process vertices in the reverse
order of a topological sort
– Check neighbors directly, then
their RTVCs
• Every time a new edge is
inserted, RTVC values need to
change
– # of edge insertions ≈ 𝑚
• TSOtool implements both vprocs and RTVCs
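The one-shot RTVC computation described on this slide can be sketched as follows. The data layout is an assumption made for illustration: `vproc[v]` is the vproc that vertex `v` belongs to, `pos[v]` is its position within that vproc, and `None` stands in for the NULL entry. Python's standard `graphlib` module supplies the topological sort.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def compute_rtvcs(vproc, pos, edges, num_vprocs):
    """rtvc[v][q] = earliest vertex on vproc q reachable from v, or None."""
    n = len(vproc)
    succ = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)

    # TopologicalSorter expects node -> predecessors. Passing successors
    # instead yields the reverse of a topological order, so every successor
    # is finished before the vertex that points to it.
    order = TopologicalSorter({v: succ[v] for v in range(n)}).static_order()

    rtvc = [[None] * num_vprocs for _ in range(n)]

    def offer(v, candidate):
        """Keep candidate in v's clock if it is earlier on its vproc."""
        q = vproc[candidate]
        best = rtvc[v][q]
        if best is None or pos[candidate] < pos[best]:
            rtvc[v][q] = candidate

    for v in order:
        for w in succ[v]:
            offer(v, w)                  # check the neighbor directly...
            for x in rtvc[w]:            # ...then the neighbor's RTVC
                if x is not None:
                    offer(v, x)
    return rtvc
```

For two vprocs holding vertices (0, 1) and (2, 3), with edges 0→1, 2→3, 0→3, and 1→2, vertex 0 ends up with the clock `[1, 2]`: the direct edge to 3 is superseded by the earlier vertex 2, which is reached transitively through 1.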
Facilitating Parallelism
• Repeatedly updating RTVCs is expensive
– For $k$ edge insertions, RTVC updates take $O(kpn)$ time
• $k = O(n^2)$, but is usually a small multiple of $n$
• Idea: Update RTVCs once per iteration rather than
per edge insertion
– For $i$ iterations, RTVC updates take $O(ipn)$ time
• $i \ll k$ (fewer than 10 for all test cases)
– Less communication between threads
• Complexity of inferring edges: $O(n^2 p)$
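The batching idea can be sketched as a fixed-point loop that rebuilds its reachability index once per iteration and lets each inference pass work from that possibly stale snapshot. The function and parameter names below are hypothetical. The two-hop rule in the usage example is only a toy stand-in for Rules 6 and 7, chosen so the loop can be run end to end.

```python
def infer_to_fixed_point(edges, rebuild_index, infer_pass):
    """Repeat inference passes until no new edge appears.

    rebuild_index(edges)     -> index built once per iteration (e.g. RTVCs)
    infer_pass(edges, index) -> set of candidate edges, computed against
                                the snapshot index rather than a live one
    Returns (final edge set, number of iterations).
    """
    edges = set(edges)
    iterations = 0
    while True:
        index = rebuild_index(edges)             # one rebuild per iteration
        new_edges = infer_pass(edges, index) - edges
        iterations += 1
        if not new_edges:
            return edges, iterations
        edges |= new_edges                       # batch-insert, no per-edge update

# Toy stand-in rule: add u -> w whenever u -> v -> w already exists.
def successors(edges):
    index = {}
    for u, v in edges:
        index.setdefault(u, set()).add(v)
    return index

def two_hop(edges, index):
    return {(u, w) for u, v in edges for w in index.get(v, ())}

final, rounds = infer_to_fixed_point({(0, 1), (1, 2), (2, 3)}, successors, two_hop)
print(len(final), rounds)  # 6 3
```

Because no pass mutates shared state while it runs, the body of `infer_pass` is free to be split across threads, which is the property the parallel implementations rely on.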
Correctness
• The edges inferred by our approach are not necessarily the
same as those found by TSOtool
– Might not infer an edge that TSOtool does
• RTVC for TSOtool can change mid-iteration
– Might infer an edge that TSOtool does not
• Our approach will have “stale” RTVC values
• Both approaches make forward progress
– Number of edges monotonically increases
• Any edge inserted by our approach could have been
inserted by the naïve approach [Thm 1]
• If TSOtool finds a cycle, we will also find a cycle [Thm 2]
Parallel Implementations
• OpenMP
– Each thread keeps its own partition of added
edges
– After each iteration of inferring edges, reduce
• CUDA
– Assign threads to each store instruction
– Threads independently traverse the vprocs of this
store
– Atomically add edges to a preallocated array in
global memory
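The OpenMP structure (private edge partitions per thread, then a reduction) can be illustrated in Python with a thread pool. This only mirrors the shape of the approach: the names are hypothetical, `infer_for_store` stands in for the per-store Rule 6 / Rule 7 work, and CPython threads will not deliver the speedups reported in this talk.

```python
from concurrent.futures import ThreadPoolExecutor

def infer_edges_parallel(stores, infer_for_store, num_workers=4):
    """Each worker collects edges for its own slice of stores; merge afterwards."""
    chunks = [stores[w::num_workers] for w in range(num_workers)]

    def work(chunk):
        local = set()                 # private partition, so no locking is needed
        for s in chunk:
            local |= infer_for_store(s)
        return local

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(work, chunks))

    return set().union(*partials)     # reduction step after the iteration
```

The CUDA variant described above replaces the private sets with atomic appends into one preallocated global array, trading the reduction for contention on a shared counter.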
Experimental Setup
• Intel Core i7-2600K CPU
– Quad core, 3.4GHz, 8MB LLC, 16GB DRAM
• NVIDIA GeForce GTX Titan
– 14 SMs, 837 MHz base clock, 6GB DRAM
• ARM system under test
– Cortex-A57, quad core
• Instruction graphs range from $n = 2^{18}$ to $n = 2^{22}$
vertices, $n \approx m$
– Sparse, high-diameter, low-degree
– Tests vary by their distribution of LD/ST/DMB
instructions, # of vprocs, and inst dependencies
Importance of Scaling
• 512K instructions
per core
• 2M total
instructions
Speedup over TSOtool (Application)
Graph Size       # of tests   Lazy RTVC   OMP 2    OMP 4     GPU
64K*4 = 256K     27           5.64x       7.62x    9.43x     10.79x
128K*4 = 512K    27           5.31x       7.12x    8.90x     10.76x
256K*4 = 1M      23           6.30x       9.05x    12.13x    15.47x
512K*4 = 2M      10           3.68x       6.41x    10.81x    24.55x
1M*4 = 4M        2            3.05x       5.58x    9.97x     37.64x
• GPU is always best; scales much better to larger tests
• Extreme case: 9 hours using TSOtool → under 10
minutes using our GPU approach
• Avg. Parallel speedups over our improved sequential
approach:
– 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)
Summary
• Relaxing the updates to RTVCs led to a better
sequential approach and facilitated parallel
implementations
– Tradeoff between redundant work and parallelism
• Faster execution leads to interactive bug-finding
• The GPU scales well to larger problem instances
– Helpful for corner case bugs that slip through pre-silicon
verification
• For the twelve largest test cases our GPU
implementation achieves a 26.36x average
application speedup
Acknowledgments
• Shankar Govindaraju and Tom Hart, for their
help in understanding NVIDIA’s
implementation of TSOtool for ARM
Questions
“To raise new questions, new possibilities, to
regard old problems from a new angle, requires
creative imagination and marks real advance in
science.” – Albert Einstein
Backup
Sequential Consistency Examples
• Valid (time axis: t=0, t=1, t=2)
P1: ST[x]→1
P2: LD[x]←1 LD[x]←2
P3: LD[x]←1 LD[x]←2
P4: ST[x]→2
– ST[x]→1 handled before ST[x]→2
• Invalid (time axis: t=0, t=1, t=2)
P1: ST[x]→1
P2: LD[x]←1 LD[x]←2
P3: LD[x]←2 LD[x]←1
P4: ST[x]→2
– Writes propagate to P2 and P3 in a different order
(valid for weaker memory models)
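For traces this small, sequential consistency can be checked by brute force: search for one interleaving that respects each processor's program order and in which every load returns the most recent store to its location. The sketch below is exponential and meant only to illustrate the definition. The tuple format and the initial memory value of 0 are assumptions.

```python
def is_sequentially_consistent(programs, initial=0):
    """programs: one list of (kind, location, value) ops per processor."""

    def search(pcs, memory):
        if all(pc == len(prog) for pc, prog in zip(pcs, programs)):
            return True                          # every op has been placed
        for i, prog in enumerate(programs):
            if pcs[i] == len(prog):
                continue
            kind, loc, val = prog[pcs[i]]
            advanced = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            if kind == "ST":
                if search(advanced, {**memory, loc: val}):
                    return True
            elif memory.get(loc, initial) == val:   # load must see latest store
                if search(advanced, memory):
                    return True
        return False

    return search((0,) * len(programs), {})

valid = [
    [("ST", "x", 1)],
    [("LD", "x", 1), ("LD", "x", 2)],
    [("LD", "x", 1), ("LD", "x", 2)],
    [("ST", "x", 2)],
]
invalid = [
    [("ST", "x", 1)],
    [("LD", "x", 1), ("LD", "x", 2)],
    [("LD", "x", 2), ("LD", "x", 1)],
    [("ST", "x", 2)],
]
print(is_sequentially_consistent(valid), is_sequentially_consistent(invalid))  # True False
```

In the invalid case P2 forces ST[x]→1 before ST[x]→2 while P3 forces the opposite, so no single total order exists.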
Weaker Models
• SC is intuitive, but is too strict
– Prevents common compiler/arch. optimizations
• Commercial products use weaker models
– x86: Total Store Order (TSO)
– Power/ARM: Relaxed Memory Ordering (RMO)
• Weaker models allow for greater optimization
opportunities
– Cost: More complicated semantics
Initial Algorithm: Weaknesses
• Expensive to compute
– $O(n^3)$, assuming edges can be inserted in $O(1)$ time
– Repeated iteratively until a fixed point is reached
• Requires the transitive closure of the graph
– Expensive to store
– Capturing $n^2$ relationships (does vertex $i$ reach
vertex $j$?)
• Adds lots of redundant edges
– Should leverage transitivity when possible
Reverse Time Vector Clocks (RTVCs)
• vprocs provide implicit orderings
• Reverse Time Vector Clock (RTVC)
– Track the earliest successor from each vertex to each
vproc
• Bounds the number of reachable edges to be
inspected by 𝑝, the number of vprocs
– No need to compute or store the transitive closure!
[Figure: vertices ST[B] → 91, ST[B] → 92, ST[A] → 1]
Reverse Time Vector Clocks (RTVCs)
• Track the earliest successor from each vertex to
each vproc
– Captures transitivity
• Traverse vprocs rather than the graph itself
– No need to check every reachable vertex
• Bounds the number of reachable edges to be
inspected by 𝑝, the number of vprocs
– No need to compute or store the transitive closure!
Superfluous work?
• Our approach tends
to add more edges
than TSOtool, some of
which are redundant
– Worst case: 36%
additional edges
– The redundancy is well
worth the
performance benefits
Test Info
n = |V|      m = |E|      TSOtool Inferred   Iterations   ST/LD/BAR (%)
2,097,963    3,799,254    4,487,224          5            76/24/0
2,098,219    3,686,624    4,411,887          4            79/21/0
1,977,832    4,453,340    5,179,108          5            46/53/1
2,097,741    3,875,831    4,635,852          7            77/23/0
1,936,321    5,109,990    5,236,671          5            44/54/2
2,098,321    2,491,062    4,257,077          6            80/20/0
2,097,809    4,321,793    4,404,753          7            78/21/1
1,871,831    3,660,617    4,861,044          6            44/54/2
2,097,809    4,434,120    4,418,555          5            80/20/0
4,195,405    6,934,725    9,338,902          7            76/23/1
4,194,961    7,960,567    8,963,281          6            78/22/0
Speedup over TSOtool (Inferring edges)
Graph Size       # of tests   Lazy RTVC   OMP 2     OMP 4     GPU
64K*4 = 256K     27           15.09x      29.31x    53.45x    57.90x
128K*4 = 512K    27           16.41x      31.49x    57.34x    76.98x
256K*4 = 1M      23           14.51x      27.98x    51.68x    72.32x
512K*4 = 2M      10           4.01x       7.52x     14.19x    42.90x
1M*4 = 4M        2            3.08x       5.70x     10.39x    45.16x
• Number of tests decreases with test size because
of industrial time constraints
– Motivation for this work
• Avg. Parallel speedups over our improved
sequential approach:
– 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)
Importance of Scaling
• 128K instructions
per core
• 512K total
instructions
Importance of Scaling
• 256K instructions
per core
• 1M total
instructions