Distributed Dynamic Partial Order Reduction based Verification of Threaded Software

Yu Yang (PhD student; summer intern at CBL)

Xiaofang Chen (PhD student; summer intern at IBM)

Ganesh Gopalakrishnan, Robert M. Kirby

School of Computing, University of Utah

SPIN 2007 Workshop Presentation

Supported by: Microsoft HPC Institutes

NSF CNS 0509379

Thread Programming will become more prevalent

FV of thread programs will grow in importance

Why FV for Threaded Programs

> 80% of chips shipped will be multi-core

(photo courtesy of

Intel Corporation.)

Model Checking will Increasingly be thru Dynamic Methods

Also known as Runtime or In-Situ methods

Why Dynamic Verification Methods

• Even after early life-cycle modeling and validation, the final code will have far more details

• Early life-cycle modeling is often impossible

- Use of libraries (API) such as MPI, OpenMP, Shmem, …

- Library function semantics can be tricky

- The bug may be in the library function implementation

Model Checking will often be “stateless”

Why Stateless

• One may not be able to access a lot of the state

- e.g. state of the OS

• It is expensive to hash and look up revisits

• Stateless is easier to parallelize

Partial Order Reduction is Crucial !

Why POR?

Process P0:
0: MPI_Init
1: MPI_Win_lock
2: MPI_Accumulate
3: MPI_Win_unlock
4: MPI_Barrier
5: MPI_Finalize

Process P1:
0: MPI_Init
1: MPI_Win_lock
2: MPI_Accumulate
3: MPI_Win_unlock
4: MPI_Barrier
5: MPI_Finalize

ONLY DEPENDENT OPERATIONS

• 504 interleavings without POR: (2 * (10!)) / (5!)^2

• 2 interleavings with POR !!
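The count on this slide can be checked by evaluating its own expression. The sketch below computes the binomial part, 10! / (5!)^2, as the number of order-preserving interleavings of two five-step sequences; the function name `count_interleavings` is illustrative, and the leading factor 2 is taken from the slide as-is rather than derived here.

```c
#include <assert.h>
#include <stdint.h>

/* Number of ways to interleave two sequences of m and n steps while keeping
 * each sequence's internal order: C(m + n, m) = (m + n)! / (m! * n!). */
uint64_t count_interleavings(unsigned m, unsigned n)
{
    assert(m + n <= 60);  /* keeps every intermediate product inside 64 bits */
    uint64_t result = 1;
    for (unsigned i = 1; i <= m; i++) {
        /* Before this step result == C(n + i - 1, i - 1); the product
         * result * (n + i) is always divisible by i. */
        result = result * (n + i) / i;
    }
    return result;
}
```

With m = n = 5 the function returns 252, and doubling it gives the slide's 504.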

Dynamic POR is almost a “must” !

( Dynamic POR as in Flanagan and Godefroid, POPL 2005)

Why Dynamic POR ?

a[ j ]++ a[ k ]--

• Ample Set depends on whether j == k

• Can be very difficult to determine statically

• Can determine dynamically
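A minimal sketch of why the run-time check is easy: once the program is executing, j and k are concrete values, so dependence reduces to an address comparison. The type `array_update` and the function `updates_dependent` are hypothetical names used only for illustration, not part of any tool described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One observed array update: the array and the concrete index seen at run time. */
typedef struct {
    const int *base;   /* address of the array being updated */
    size_t     index;  /* concrete value of j or k in this execution */
} array_update;

/* Two updates are dependent only if they touch the same element. */
bool updates_dependent(array_update x, array_update y)
{
    assert(x.base != NULL && y.base != NULL);
    return &x.base[x.index] == &y.base[y.index];
}
```

For a[ j ]++ and a[ k ]-- this reports dependence exactly when j == k in the execution being observed, which is the information a static analysis often cannot obtain.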

Why Dynamic POR ?

The notion of action dependence (crucial to POR methods) is a function of the execution

Computation of “ample” sets in Static POR versus in DPOR

Ample determined using “local” criteria

Current State

Next move of Red process

Nearest Dependent Transition Looking Back

Add Red Process to “Backtrack Set”

This builds the ample set incrementally based on observed dependencies

Blue is in “Done” set

{ BT }, { Done }
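The "look back for the nearest dependent transition" step can be sketched as follows. This is a simplified illustration, not the tool's code: it assumes the only dependence is between two lock acquisitions of the same mutex by different threads, and it omits the happens-before and enabledness checks of the full DPOR algorithm. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_LOCK, OP_UNLOCK } op_kind;

/* One executed transition: which thread moved, what it did, on which object. */
typedef struct {
    int     thread;
    op_kind kind;
    int     object;
} transition;

/* Per-state bookkeeping; bit i of each mask stands for thread i. */
typedef struct {
    unsigned backtrack;
    unsigned done;
} state_sets;

/* Assumed dependence relation: two lock acquisitions of the same object
 * by different threads. */
static bool dependent(transition a, transition b)
{
    return a.thread != b.thread && a.object == b.object &&
           a.kind == OP_LOCK && b.kind == OP_LOCK;
}

/* stack[i] is the state from which trace[i] was executed (0 <= i < depth).
 * Scans back from the current state for the nearest transition dependent
 * with `next` and adds next's thread to the backtrack set of that state.
 * Returns the index of that state, or -1 if no dependent transition exists. */
int update_backtrack(const transition *trace, state_sets *stack,
                     int depth, transition next)
{
    assert(next.thread >= 0 && next.thread < 32);
    for (int i = depth - 1; i >= 0; i--) {
        if (dependent(trace[i], next)) {
            stack[i].backtrack |= 1u << next.thread;
            return i;
        }
    }
    return -1;
}
```

Under these assumptions, a trace of t0: lock, t0: unlock followed by a pending t1: lock puts t1 into the backtrack set of the initial state, matching the {t1}, {t0} annotation used in the example slides.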

• We target C/C++ PThread programs
• Instrument the given program (largely automated)
• Run the concurrent program “till the end”
• Record interleaving variants while advancing
• When # recorded backtrack points reaches a soft limit, spill work to other nodes
• In one larger example, an 11-hour run was finished in 11 minutes using 64 nodes
• Heuristic to avoid recomputations was essential for speed-up
• First known distributed DPOR

Putting it all together …

A Simple DPOR Example

Program: three threads (t0, t1, t2), each executing

lock(t)
unlock(t)

Each frame lists the transitions executed so far and the { BT }, { Done } annotations on the stack, in slide order.

Frame 1. Trace: (none). Sets: {}, {}
Frame 2. Trace: t0: lock. Sets: {}, {}
Frame 3. Trace: t0: lock; t0: unlock. Sets: {}, {}
Frame 4. Trace: t0: lock; t0: unlock; t1: lock. Sets: {}, {}
Frame 5. Trace: t0: lock; t0: unlock; t1: lock. Sets: {t1}, {t0}
Frame 6. Trace: t0: lock; t0: unlock; t1: lock; t1: unlock; t2: lock. Sets: {t1}, {t0} and {}, {}
Frame 7. Trace: t0: lock; t0: unlock; t1: lock; t1: unlock; t2: lock. Sets: {t1}, {t0} and {t2}, {t1}
Frame 8. Trace: t0: lock; t0: unlock; t1: lock; t1: unlock; t2: lock; t2: unlock. Sets: {t1}, {t0} and {t2}, {t1}
Frame 9. Trace: t0: lock; t0: unlock; t1: lock; t1: unlock; t2: lock. Sets: {t1}, {t0} and {t2}, {t1}
Frame 10. Trace: t0: lock; t0: unlock. Sets: {t1}, {t0} and {t2}, {t1}
Frame 11. Trace: t0: lock; t0: unlock; t2: lock. Sets: {t1,t2}, {t0} and {}, {t1, t2}
Frame 12. Trace: t0: lock; t0: unlock; t2: lock; t2: unlock. Sets: {t1,t2}, {t0} and {}, {t1, t2}
Frame 13. Trace: t0: lock; t0: unlock. Sets: {t1,t2}, {t0} and {}, {t1, t2}
Frame 14. Trace: t1: lock; t1: unlock. Sets: {t2}, {t0,t1}
Frame 15. Trace: (none). Sets: {t2}, {t0, t1}

For this example, DPOR ends up exploring all the paths

For other programs, the explored paths will be a proper subset

Idea for parallelization: Explore computations from the backtrack set in other processes.

“Embarrassingly Parallel” – it seems so, anyway !

We first built a sequential DPOR explorer for C / Pthreads programs, called “Inspect”

(Figure: Inspect workflow. Multithreaded C/C++ program → instrumentation → instrumented program + thread library wrapper → compile → executable. At run time, thread 1 … thread n each exchange request/permit messages with the scheduler.)

• Stateless search does not maintain search history
• Different branches of an acyclic space can be explored concurrently
• Simple master-slave scheme can work here
– one load balancer + workers

We then made the following observations

(Figure: message flow between the load balancer and workers a and b: request unloading, idle node id, work description, report result.)

We then devised a work-distribution scheme…

We got zero speedup! Why?

Deeper investigation revealed that multiple nodes ended up exploring the same interleavings

Illustration of the problem (1 of 5)

t0: lock
t0: unlock
t1: lock
t1: unlock
t2: lock
t2: unlock

{t1}, {t0}
{t2}, {t1}

Heuristic : Handoff DEEPEST backtrack point for another node to explore

Reason : Largest number of paths emanate from there

To Node 1

Detail of (2 of 5)

t0: lock
t0: unlock
t1: lock
t1: unlock
t2: lock
t2: unlock

{t1}, {t0}
{t2}, {t1}

Node 0

t0: lock
t0: unlock
t1: lock
t1: unlock
t2: lock
t2: unlock

{ }, {t0,t1}
{t2}, {t1}

Node 1

t0: lock
{t1}, {t0}

t1 is forced into DONE set before work is handed to Node 1

Node 1 keeps t1 in backtrack set
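The handoff described on this slide can be sketched as a small bookkeeping step. This is an illustrative assumption about how such a step might look, not the tool's actual code: `work_item`, `hand_off`, and the bitmask encoding of the sets are all hypothetical.

```c
#include <assert.h>

/* Per-state bookkeeping; bit i of each mask stands for thread i. */
typedef struct {
    unsigned backtrack;
    unsigned done;
} state_sets;

/* Hypothetical work description sent to the receiving node. */
typedef struct {
    int        depth;  /* index of the state the receiver must reach by replay */
    state_sets sets;   /* sets the receiver starts from at that state */
} work_item;

/* Hands the backtrack point `thread` at stack[depth] to another node.
 * The receiver's copy keeps the thread in its backtrack set; locally the
 * thread is moved to the done set so this node does not explore it too. */
work_item hand_off(state_sets *stack, int depth, int thread)
{
    assert(thread >= 0 && thread < 32);
    unsigned bit = 1u << thread;
    assert(stack[depth].backtrack & bit);  /* must be a pending backtrack point */

    work_item w;
    w.depth = depth;
    w.sets = stack[depth];                 /* snapshot before the local update */

    stack[depth].backtrack &= ~bit;
    stack[depth].done |= bit;
    return w;
}
```

Starting from {t1}, {t0} at the handed-off state, the local node is left with { }, {t0,t1} while the work item still carries {t1}, {t0}, which mirrors the Node 0 and Node 1 annotations on the slide.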

t0: lock
t0: unlock
t1: lock
t1: unlock
t2: lock
t2: unlock

{t1}, {t0}

{t2}, {t1}

To Node 1

Decide to do THIS work at Node 0 itself…

t0: lock

t0: unlock

{}, {t0,t1}

{t2}, {t1}

{t1}, {t0}

Being expanded by Node 0

Being expanded by Node 1

t0: lock

t0: unlock

{t2}, {t0,t1}

{}, {t2}

t2: lock

t2: unlock

t0: lock

t0: unlock

{t2}, {t0,t1}

{}, {t2}

{t1}, {t0}

t1: lock

t1: unlock

t2: lock

t0: lock

t0: unlock

{t2}, {t0,t1}

{}, {t2}

{t2}, {t0, t1}

t1: lock

t1: unlock

t2: lock

{}, {t2}

Redundancy!

New Backtrack Set Computation: Aggressively mark up the stack!

t0: lock
t0: unlock
t1: lock
t1: unlock
t2: lock
t2: unlock

{t1,t2}, {t0}
{t2}, {t1}

• Update the backtrack sets of ALL dependent operations!
• Forms a good allocation scheme
• Does not involve any synchronizations
• Redundant work may still be performed
• Likelihood is reduced because a node aggressively “owns” one operation and all its dependents
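A minimal sketch of the aggressive variant, under the same simplifying assumption that only lock acquisitions of the same mutex by different threads are dependent. Instead of stopping at the nearest dependent transition, the scan marks every dependent state on the stack. Names and the bitmask encoding are hypothetical, and the full algorithm's happens-before and enabledness checks are omitted.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_LOCK, OP_UNLOCK } op_kind;

typedef struct {
    int     thread;
    op_kind kind;
    int     object;
} transition;

/* Per-state bookkeeping; bit i of each mask stands for thread i. */
typedef struct {
    unsigned backtrack;
    unsigned done;
} state_sets;

/* Assumed dependence relation: two lock acquisitions of the same object
 * by different threads. */
static bool dependent(transition a, transition b)
{
    return a.thread != b.thread && a.object == b.object &&
           a.kind == OP_LOCK && b.kind == OP_LOCK;
}

/* stack[i] is the state from which trace[i] was executed (0 <= i < depth).
 * Adds next's thread to the backtrack set of EVERY state whose outgoing
 * transition is dependent with `next`. Returns the number of states marked. */
int update_backtrack_all(const transition *trace, state_sets *stack,
                         int depth, transition next)
{
    assert(next.thread >= 0 && next.thread < 32);
    int marked = 0;
    for (int i = depth - 1; i >= 0; i--) {
        if (dependent(trace[i], next)) {
            stack[i].backtrack |= 1u << next.thread;
            marked++;
        }
    }
    return marked;
}
```

For the trace t0: lock, t0: unlock, t1: lock, t1: unlock with t2: lock pending, and {t1} already in the first backtrack set, this yields {t1,t2} at the first state and {t2} at the state before t1: lock, the same annotations as on the slide.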

Implementation and Evaluation

• Using MPI for communication among nodes
• Did experiments on a 72-node cluster
– 2.4 GHz Intel XEON processors, 2GB memory/node
– Two (small) benchmarks: Indexer & file system benchmark used in Flanagan and Godefroid’s DPOR paper
– Aget -- a multithreaded ftp client
– Bbuf -- an implementation of bounded buffer

Sequential Checking Time

Benchmark   Threads   Runs        Time (sec)
fsbench     26        8,192       291.32
indexer     16        32,768      1188.73
aget        6         113,400     5662.96
bbuf        8         1,938,816   39710.43
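As a rough sanity check on the headline claim, the arithmetic below assumes that the "11-hour run" mentioned earlier is the bbuf benchmark (its sequential time is the only one near 11 hours) and that the 64-node time is the rounded "11 minutes" from the same slide. Both are assumptions, so the resulting speedup is only approximate.

```latex
\begin{align*}
T_{1}  &= 39710.43\ \text{s} \approx 11.03\ \text{h} \\
T_{64} &\approx 11\ \text{min} = 660\ \text{s} \\
S_{64} &= T_{1} / T_{64} \approx 39710.43 / 660 \approx 60.2 \\
E_{64} &= S_{64} / 64 \approx 0.94
\end{align*}
```

Under these assumptions the run would be close to linear speedup on 64 nodes.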

Speedup on indexer & fs (small examples, so diminishing returns beyond 40 nodes…)

Speedup on aget

Speedup on bbuf

Conclusions and Future Work

• Method described is VERY promising
• We have an in-situ model checker for MPI programs also! (EuroPVM / MPI 2007)
– Will be parallelized using MPI for work distribution!
• The C/PThread work needs to be pushed a lot more:
– Automate instrumentation
– Try many new examples
– Improve work-distribution heuristic in response to findings
– Release tool

Questions?

Answers !

Properties: Currently

– Local “assert”s

– Deadlocks

– Uninitialized Variables

No plans for liveness

Tool release likely in 6 months

That is a very good question. Let’s talk!

Extra Slides

Concurrent operations on some database

Class A operations:

    pthread_mutex_lock(mutex);
    a_count++;
    if (a_count == 1)
        pthread_mutex_lock(res);     /* first A operation acquires res */
    pthread_mutex_unlock(mutex);
    …
    pthread_mutex_lock(mutex);
    a_count--;
    if (a_count == 0)
        pthread_mutex_unlock(res);   /* last A operation releases res */
    pthread_mutex_unlock(mutex);

Class B operations:

    pthread_mutex_lock(mutex);
    b_count++;
    if (b_count == 1)
        pthread_mutex_lock(res);     /* first B operation acquires res */
    pthread_mutex_unlock(mutex);
    …
    pthread_mutex_lock(mutex);
    b_count--;
    if (b_count == 0)
        pthread_mutex_unlock(res);   /* last B operation releases res */
    pthread_mutex_unlock(mutex);

Initial random execution

a1 : acquire mutex
a2 : a_count++
a3 : a_count == 1
a4 : acquire res
a5 : release mutex
a6 : acquire mutex
a7 : a_count--
a8 : a_count == 0
a9 : release res
a10 : release mutex
b1 : acquire mutex
b2 : b_count++
b3 : b_count == 1
b4 : acquire res
b5 : release mutex
b6 : acquire mutex
b7 : b_count--
b8 : b_count == 0
b9 : release res
b10 : release mutex

Dependent operations?

Start an alternative execution

Get a deadlock!

a1 : acquire mutex
a2 : a_count++
a3 : a_count == 1
a4 : acquire res
a5 : release mutex
b1 : acquire mutex
b2 : b_count++
b3 : b_count == 1
a6 : acquire mutex
a7 : a_count--
a8 : a_count == 0
a9 : release res
a10 : release mutex
b4 : acquire res
b5 : release mutex
b6 : acquire mutex
b7 : b_count--
b8 : b_count == 0
b9 : release res
b10 : release mutex

After b3, Class B holds mutex and blocks at b4 (acquire res), while Class A holds res and blocks at a6 (acquire mutex).

Class A operations:

    pthread_mutex_lock(mutex);
    a_count++;
    if (a_count == 1)
        pthread_mutex_lock(res);
    pthread_mutex_unlock(mutex);
    pthread_mutex_lock(mutex);       /* blocked here: Class B holds mutex */

Class B operations:

    pthread_mutex_lock(mutex);
    b_count++;
    if (b_count == 1)
        pthread_mutex_lock(res);     /* blocked here: Class A holds res */
