On-the-Fly Data-Race Detection in Multithreaded Programs


Prepared by Eli Pozniansky under Supervision of Prof. Assaf Schuster

Table of Contents

What is a Data-Race?
Why Data-Races are Undesired?
How Data-Races Can be Prevented?
Can Data-Races be Easily Detected?
Feasible and Apparent Data-Races
Complexity of Data-Race Detection
    NP and Co-NP
    Program Execution Model & Ordering Relations
    Complexity of Computing Ordering Relations
    Proof of NP/Co-NP Hardness
So How Data-Races Can be Detected?
    Lamport’s Happens-Before Approximation
Approaches to Detection of Apparent Data-Races:
    Static Methods
    Dynamic Methods:
        Post-Mortem Methods
        On-The-Fly Methods
Closer Look at Dynamic Methods:
    DJIT+
        Local Time Frames
        Vector Time Frames
        Logging Mechanism
        Data-Race Detection Using Vector Time Frames
        Which Accesses to Check?
        Which Time Frames to Check?
        Access History & Algorithm
        Coherency
        Results
    Lockset
        Locking Discipline
        The Basic Algorithm & Explanation
        Which Accesses to Check?
        Improving Locking Discipline
            Initialization
            Read-Sharing
            Barriers
        False Alarms
        Results
    Combining DJIT+ and Lockset
Summary
References

What is a Data Race?

Concurrent accesses to a shared location by two or more threads, where at least one of the accesses is a write.

Example (variable X is global and shared):

    Thread 1        Thread 2
    X=1             T=Y
    Z=2             T=X

Usually indicative of a bug!

Why Data-Races are Undesired?

Programs with data-races:
- Usually demonstrate unexpected and even non-deterministic behavior.
- The outcome might depend on the specific execution order (a.k.a. threads’ interleaving).
- Re-executing may not always produce the same results or the same data-races.

Thus, they are hard to debug, and it is hard to write correct programs.

Why Data Races are Undesired? – Example

Machine code for ‘X++’, executed once by each thread.

First interleaving:

    Thread 1            Thread 2
    1. reg1 ← X
    2. incr reg1
    3. X ← reg1
                        4. reg2 ← X
                        5. incr reg2
                        6. X ← reg2

Second interleaving:

    Thread 1            Thread 2
    1. reg1 ← X
    2. incr reg1
                        3. reg2 ← X
                        4. incr reg2
                        5. X ← reg2
    6. X ← reg1

At the beginning: X=0. At the end: X=1 or X=2?

Depends on the scheduling order: the first interleaving ends with X=2, the second with X=1.
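The two interleavings can be replayed deterministically. The following is a minimal sketch, not part of the original slides: the function name `run` and the schedule encoding (a list of thread numbers, one entry per machine step) are assumptions made for illustration.

```python
def run(schedule):
    """Replay the three-step machine code of X++ for threads 1 and 2.

    `schedule` lists which thread executes its next step; each thread
    must appear exactly three times (load, increment, store).
    """
    x = 0                      # shared variable X, initially 0
    reg = {1: 0, 2: 0}         # private registers reg1 and reg2
    step = {1: 0, 2: 0}        # next step of each thread
    for t in schedule:
        if step[t] == 0:
            reg[t] = x         # reg <- X
        elif step[t] == 1:
            reg[t] += 1        # incr reg
        else:
            x = reg[t]         # X <- reg
        step[t] += 1
    return x


first = run([1, 1, 1, 2, 2, 2])    # Thread 1 finishes before Thread 2 starts
second = run([1, 1, 2, 2, 2, 1])   # Thread 1's store comes last
print(first, second)
```

In the second schedule both threads load X=0, so one of the two increments is lost.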

Execution Order

Each thread has a different execution speed.

The speed may change over time.

For an external observer of the time axis, instructions appear in execution order.

Any order is legal.

Execution order for a single thread is called program order.

[Figure: instructions of threads T1 and T2 laid out along a time axis.]

How Data Races Can be Prevented?

Explicit synchronization between threads:
- Locks
- Critical Sections
- Barriers
- Mutexes
- Semaphores
- Monitors
- Events
- Etc.

    Thread 1        Thread 2
    Lock(m)
    X++
    Unlock(m)       Lock(m)
                    T=X
                    Unlock(m)

Synchronization – “Bad” Bank Account Example

    Thread 1                    Thread 2
    Deposit( amount ) {         Withdraw( amount ) {
        balance += amount;          if (balance < amount)
    }                                   print( “Error” );
                                    else
                                        balance -= amount;
                                }

‘Deposit’ and ‘Withdraw’ are not “atomic”!

What is the final balance after a series of concurrent deposits and withdrawals?

Synchronization – “Good” Bank Account Example

    Thread 1                    Thread 2
    Deposit( amount ) {         Withdraw( amount ) {
        Lock( m );                  Lock( m );
        balance += amount;          if (balance < amount)
        Unlock( m );                    print( “Error” );
    }                               else
                                        balance -= amount;
                                    Unlock( m );
                                }

The code between Lock( m ) and Unlock( m ) in each function is a critical section.

Since critical sections can never execute concurrently, this version exhibits no data-races.
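The locked version can be written as a small runnable sketch. This is an illustration only: the `Account` class, the `errors` counter and the amounts used are assumptions, and Python's `threading.Lock` stands in for the generic lock m.

```python
import threading


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.errors = 0                 # counts rejected withdrawals (illustrative)
        self._m = threading.Lock()      # the lock m

    def deposit(self, amount):
        with self._m:                   # critical section
            self.balance += amount

    def withdraw(self, amount):
        with self._m:                   # critical section
            if self.balance < amount:
                self.errors += 1
            else:
                self.balance -= amount


acct = Account(10_000)


def depositor():
    for _ in range(10_000):
        acct.deposit(1)


def withdrawer():
    for _ in range(10_000):
        acct.withdraw(1)


threads = [threading.Thread(target=depositor), threading.Thread(target=withdrawer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(acct.balance, acct.errors)
```

The initial balance is large enough that no withdrawal can fail, so the final balance is the same under every schedule.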

Is This Enough?

Theoretically – YES

Practically – NO

What if the programmer accidentally forgets to place correct synchronization?

How can all such data-race bugs be detected in a large program?

How can redundant synchronization be eliminated?

Can Data Races be Easily Detected? – No!

The problem of deciding whether a given program contains potential data races (called feasible) is NP-hard [Netzer&Miller 1990]:
- Input size = # instructions performed
- Even for 2 threads only
- Even with no loops/recursion

Lots of execution orders: (#threads)^(thread_length × #threads)

Also, all possible inputs should be tested.

Side effects of the detection code can eliminate all data races.

[Figure: two code fragments, one containing an access a and lock(m) ... unlock(m), the other containing an access b between lock(m) and unlock(m).]

Feasible Data-Races

Based on the possible behavior of the program (i.e. semantics of the program’s computation).

The actual (!) data-races that can possibly happen in some program execution.

Require a full analysis of the program’s semantics to determine whether the execution could have allowed accesses to the same shared variable to execute concurrently.

Apparent Data Races

Approximations of the feasible data races.

Based only on the behavior of the program’s explicit synchronization (and not on program semantics).

Important, since data-races are usually the result of improper synchronization.

- Easier to locate
- Less accurate
- Exist iff at least one feasible data race exists
- Exhaustively locating all apparent data races is still NP-hard (and, in fact, undecidable)

Apparent Data-Races Cont.

Accesses a and b to the same shared variable in some execution are ordered if there is a chain of corresponding explicit synchronization events between them.

a and b are said to have potentially executed concurrently if no explicit synchronization prevented them from doing so.

    Initially: grades = oldDatabase; updated = false;

    Thread T.A.                     Thread Lecturer
    grades := newDatabase;
    updated := true;
                                    while (updated == false);
                                    X := grades.gradeOf(lecturersSon);

Feasible vs. Apparent

    Initially: F = false

    Thread 1            Thread 2
    X++
    F = true
                        if (F == true)
                            X--

[Figure labels: race 1 is between ‘F=true’ and the test of ‘F’; race 2 is between ‘X++’ and ‘X--’.]

Apparent data-races in the execution above – 1 & 2.

Feasible data-races – 1 only! No feasible execution exists in which ‘X--’ is performed before ‘X++’ (suppose ‘F’ is false at start).

Protecting ‘F’ only will protect ‘X’ as well.

Feasible vs. Apparent

    Initially: F = false

    Thread 1            Thread 2
    X++                 Lock( m )
    Lock( m )           T = F
    F = true            Unlock( m )
    Unlock( m )         if (T == true)
                            X--

No feasible or apparent data-races exist under any execution order!

‘F’ is protected by a lock. ‘X++’ and ‘X--’ are always ordered and properly synchronized.

Either there is a sync’ chain of Unlock(m)-Lock(m) between ‘X++’ and ‘X--’, or only ‘X++’ executes.

Complexity of Data-Race Detection

Exactly locating the feasible data-races is an NP-hard problem.

The apparent races, which are simpler to locate, must be detected for debugging.

Apparent data-races exist if and only if at least one feasible data-race exists somewhere in the execution.

The problem of exhaustively locating all apparent data-races is still NP-hard.

Reminder: NP and Co-NP

There is a set of NP problems for which:
- No polynomial solution is known.
- There is an exponential solution.

A problem is NP-hard if there is a polynomial reduction from any of the problems in NP to this problem.

A problem is NP-complete if it is NP-hard and it resides in NP.

Intuitively - if the answer for the problem can be only ‘yes’/‘no’, we can either answer ‘yes’ and stop, or never stop (at least not in polynomial time).

Reminder: NP and Co-NP Cont.

The set of Co-NP problems is complementary to the set of NP problems.

Intuitively, for a Co-NP-hard problem we can only answer ‘no’ and stop.

Every problem in P (i.e. one with a polynomial solution) is in both NP and Co-NP. Whether every problem that is in both NP and Co-NP is also in P is not known.

The problem of checking whether a boolean formula is satisfiable is NP-complete.
- Answer ‘yes’ if a satisfying assignment for the variables was found.
- The same problem for non-satisfiability is Co-NP-complete.

Why Data-Race Detection is NP-Hard?

Question: How can we know that in a program P two accesses, a and b, to the same shared variable are concurrent?

Answer: We must check all execution orders of P and see.
- If we discover an execution order in which a and b are concurrent, we can report a data-race and stop.
- Otherwise we should continue checking.

Program Execution Model

Consider a class of multi-threaded programs that synchronize by counting semaphores.

Program execution is described by a collection of events and two relations over the events.

Synchronization event – an instance of some synchronization operation (e.g. signal, wait).

Computation event – an instance of a group of statements in the same thread, none of which are synchronization operations (e.g. x=x+1).

Program Execution Model – Events’ Relations

Temporal ordering relation – a T→ b means that a completes before b begins (i.e. the last action of a can affect the first action of b).

Shared data dependence relation – a D→ b means that a accesses a shared variable that b later accesses, and at least one of the accesses is a modification of the variable.
- Indicates when one event causally affects another.

Program Execution Model – Program Execution

Program execution P – a triple <E,T→,D→>, where E is a finite set of events, and T→ and D→ are the above relations that satisfy the following axioms:
- A1: T→ is an irreflexive partial order (a T↛ a).
- A2: If a T→ b T↮ c T→ d then a T→ d.
- A3: If a D→ b then b T↛ a.

Notes:
- a ↛ b is a shorthand for ¬(a→b).
- a ↮ b is a shorthand for ¬(a→b)⋀¬(b→a).
- Notice that A1 and A2 imply transitivity of T→.

Program Execution Model – Feasible Program Execution

Feasible program execution for P – an execution of the program that:
- performs exactly the same events as P
- may exhibit a different temporal ordering.

Definition: P’=<E’,T’→,D’→> is a feasible program execution for P=<E,T→,D→> (i.e. it potentially occurred) if
- F1: E’=E (i.e. exactly the same events), and
- F2: P’ satisfies the axioms A1 - A3 of the model, and
- F3: a D→ b ⇒ a D’→ b (i.e. same data dependencies)

Note: Any execution with the same shared-data dependencies as P will execute exactly the same events as P.

Program Execution Model – Ordering Relations

Given a program execution, P=<E,T→,D→>, and the set, F(P), of feasible program executions for P, the following relations are defined. They summarize the temporal orderings present in the feasible program executions.

                     Must-Have                                  Could-Have
    Happened-Before  a MHB→ b ⇔ ∀<E,T→,D→>∈F(P), a T→ b         a CHB→ b ⇔ ∃<E,T→,D→>∈F(P), a T→ b
    Concurrent-With  a MCW↔ b ⇔ ∀<E,T→,D→>∈F(P), a T↮ b         a CCW↔ b ⇔ ∃<E,T→,D→>∈F(P), a T↮ b
    Ordered-With     a MOW↔ b ⇔ ∀<E,T→,D→>∈F(P), ¬(a T↮ b)      a COW↔ b ⇔ ∃<E,T→,D→>∈F(P), ¬(a T↮ b)

Program Execution Model – Ordering Relations - Explanation

The must-have relations describe orderings that are guaranteed to be present in all feasible program executions in F(P).

The could-have relations describe orderings that could potentially occur in at least one of the feasible program executions in F(P).

The happened-before relations show events that execute in a specific order.

The concurrent-with relations show events that execute concurrently.

The ordered-with relations show events that execute in either order but not concurrently.

Complexity of Computing Ordering Relations

The problem of computing any of the must-have ordering relations (MHB, MCW, MOW) is Co-NP-hard.

The problem of computing any of the could-have relations (CHB, CCW, COW) is NP-hard.

Theorem 1: Given a program execution, P=<E,T→,D→>, that uses counting semaphores, the problem of deciding whether a MHB→ b, a MCW↔ b or a MOW↔ b (any of the must-have orderings) is Co-NP-hard.


Proof of Theorem 1 – Notes

The proof is a reduction from 3CNFSAT such that any boolean formula is not satisfiable iff a MHB→ b for two events, a and b, defined in the reduction.

The problem of checking whether a 3CNFSAT formula is not satisfiable is Co-NP-complete.

The presented proof is only for the must-have-happened-before (MHB) relation. Proofs for the other relations are analogous.

The proof can also be extended to programs that use binary semaphores, event-style synchronization and other synchronization primitives (and even a single counting semaphore).

Proof of Theorem 1 – 3CNFSAT

An instance of 3CNFSAT is given by:
- A set of n variables, V={X1,X2,…,Xn}.
- A boolean formula B consisting of a conjunction of m clauses, B=C1⋀C2⋀…⋀Cm.
- Each clause Cj=(L1⋁L2⋁L3) is a disjunction of three literals.
- Each literal Lk is any variable from V or its negation: Lk=Xi or Lk=¬Xi.

Example:

    B=(X1⋁X2⋁¬X3)⋀(¬X2⋁¬X5⋁X6)⋀(X1⋁X4⋁¬X5)

Proof of Theorem 1 – Idea of the Proof

Given an instance of 3CNFSAT formula, B, we construct a program consisting of 3n+3m+2 threads which use 3n+m+1 semaphores (assumed to be initialized to 0).

The execution of this program simulates a nondeterministic evaluation of B.

Semaphores are used to represent the truth values of each variable and clause.

The execution exhibits certain orderings iff B is not satisfiable.

Proof of Theorem 1 – The Construction per Variable

For each variable, Xi, the following three threads are constructed:

    wait( Ai )          wait( Ai )              signal( Ai )
    signal( Xi )        signal( not-Xi )        wait( Pass2 )
    . . .               . . .                   signal( Ai )
    signal( Xi )        signal( not-Xi )

“. . .” indicates as many signal(Xi) (or signal(not-Xi)) operations as the number of occurrences of the literal Xi (or ¬Xi) in the formula B.

Proof of Theorem 1 – The Construction per Variable

The semaphores Xi and not-Xi are used to represent the truth value of variable Xi.

Signaling the semaphore Xi (or not-Xi) represents the assignment of True (or False) to variable Xi.

The assignment is accomplished by allowing either signal(Xi) or signal(not-Xi) to proceed, but not both (due to the concurrent wait(Ai) operations in the two leftmost threads).

Proof of Theorem 1 – The Construction per Clause

For each clause, Cj, the following three threads are constructed:

    wait( L1 )          wait( L2 )          wait( L3 )
    signal( Cj )        signal( Cj )        signal( Cj )

L1, L2 and L3 are the semaphores corresponding to the literals in clause Cj (i.e. Xi or not-Xi).

The semaphore Cj represents the truth value of clause Cj. It is signaled iff the truth assignments to the variables cause the clause Cj to evaluate to True.

Proof of Theorem 1 – Explanation of Construction

The first 3n threads operate in two passes.

The first pass is a non-deterministic guessing phase in which:
- Each variable used in the boolean formula B is assigned a unique truth value.
- Only one of the Xi and not-Xi semaphores is signaled.

The second pass (which begins after semaphore Pass2 is signaled) is used to ensure that the program doesn’t deadlock:
- The semaphore operations that were not allowed to execute during the first pass are allowed to proceed.

Proof of Theorem 1 – The Final Construction

Two additional threads are created:

    a: skip                 wait( C1 )
    signal( Pass2 )         . . .
    . . .                   wait( Cm )
    signal( Pass2 )         b: skip

There are n ‘signal(Pass2)’ operations – one for each variable.

There are m ‘wait(Cj)’ operations – one for each clause.

Proof of Theorem 1 – Putting All Together

Event b is reached only after semaphore Cj, for each clause j, has been signaled.

The program contains no conditional statements or shared variables.
- Every execution of the program executes the same events and exhibits the same shared-data dependencies (i.e. none).

Claim: For any execution a MHB→ b iff B is not satisfiable.
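The claim can be checked by brute force on small formulas. The sketch below is an illustration under a simplifying assumption: instead of simulating the semaphore program, it enumerates the truth-value guesses of the first pass directly, and treats b as able to run before a exactly when some guess signals every clause semaphore Cj. The function names and the literal encoding (+i for Xi, -i for not-Xi) are hypothetical.

```python
from itertools import product


def b_can_precede_a(n, clauses):
    """True if some first-pass guess signals every clause semaphore Cj.

    `clauses` is a list of 3-tuples of non-zero ints: +i stands for the
    semaphore Xi and -i for the semaphore not-Xi (1-based variable index).
    """
    for guess in product([False, True], repeat=n):
        def signaled(lit):
            # In the first pass exactly one of Xi / not-Xi is signaled.
            return guess[abs(lit) - 1] == (lit > 0)

        if all(any(signaled(lit) for lit in clause) for clause in clauses):
            return True
    return False


def a_mhb_b(n, clauses):
    """a MHB-> b iff no first-pass guess lets b run before a."""
    return not b_can_precede_a(n, clauses)


# The example formula from the 3CNFSAT slide (satisfiable).
example = [(1, 2, -3), (-2, -5, 6), (1, 4, -5)]
# All eight sign patterns over X1, X2, X3: an unsatisfiable 3-CNF formula.
unsat = [tuple(s * v for s, v in zip(signs, (1, 2, 3)))
         for signs in product([1, -1], repeat=3)]

print(a_mhb_b(6, example), a_mhb_b(3, unsat))
```

The enumeration is exponential in n, which is consistent with the hardness result: no shortcut is claimed here.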

Proof of Theorem 1 – Proving the “if” Part

Assume that B is not satisfiable.

Then there is always some clause, Cj, that is not satisfied by the truth values guessed during the first pass. Thus, no signal(Cj) operation is performed during the first pass.

Event b can’t execute until this signal(Cj) operation is performed, which can then only be done during the second pass.

The second pass doesn’t occur until after event a executes, so event a must precede event b.

Therefore, a MHB→ b.

Proof of Theorem 1 – Proving the “only if” Part

Assume that a MHB→ b.

This means that there is no execution in which b either precedes a or executes concurrently with a.

Assume by way of contradiction that B is satisfiable.

Then some truth assignment can be guessed during the first pass that satisfies all of the clauses.

Event b can then execute before event a, contradicting the assumption.

Therefore, B is not satisfiable.

Complexity of Computing Ordering Relations – Cont.

Since a MHB→ b iff B is not satisfiable, the problem of deciding a MHB→ b is Co-NP-hard.

By similar reductions, programs can be constructed such that the non-satisfiability of B can be determined from the MCW or MOW relations. The problem of deciding these relations is therefore also Co-NP-hard.

Theorem 2: Given a program execution, P=<E,T→,D→>, that uses counting semaphores, the problem of deciding whether a CHB→ b, a CCW↔ b or a COW↔ b (any of the could-have orderings) is NP-hard.

Proof by similar reductions …


Complexity of Race Detection - Conditions, Loops and Input

The presented model is too simplistic.

What if the “if” and “while” statements are used?

What if the user’s input is allowed?

    Thread 1                        Thread 2
    Y = ReadFromInput( );           X++;  [2]
    while ( Y < 0 )
        Print( Y );
    X++;  [1]

If Y≥0 there is a data-race. Otherwise it is not possible, since [1] is never reached.

Complexity of Race Detection - “NP-Harder”?

The proof above does not use conditional statements, loops or input from outside.

The problem of data-race detection is much harder than deciding an NP-complete problem.
- Intuitively - there is no exponential solution, since it’s not known whether the program will stop.

Thus, in the general case, it’s undecidable.

So How Data-Races Can be Detected? – Approximations

Deciding whether a CHB→ b or a CCW↔ b will reveal feasible data-races.

Since this is an intractable problem, the temporal ordering relation T→ should be approximated and apparent data-races located instead.

Recall that apparent data-races exist if and only if at least one feasible race exists.

Yet, it remains a hard problem to locate all apparent data-races.

Approximation Example – Lamport’s Happens-Before

The happens-before partial order, denoted a hb→ b, is defined for access events (reads, writes, releases and acquires) that happen in a specific execution, as follows:

- Program Order: a and b are events performed by the same thread, with a preceding b.
- Release and Acquire: a is a release of some sync’ object S and b is a corresponding acquire.
- Transitivity: a hb→ c and c hb→ b.

Shared accesses a and b are concurrent, a hb↮ b, if neither a hb→ b nor b hb→ a holds.

    Thread 1        Thread 2
    a
    unlock(L)
                    lock(L)
                    b

    a hb→ b
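The three rules can be applied mechanically to a recorded trace. The following is a minimal sketch, not a production detector: the trace encoding `(thread, op, obj)` and the function names are assumptions, and a release is paired with every later acquire of the same object, which is one simple reading of "corresponding acquire" for locks.

```python
def happens_before(trace):
    """Return hb[i][j] == True iff event i happens-before event j.

    `trace` is a list of (thread, op, obj) tuples in observed order,
    where op is "acc" (shared access), "rel" (release) or "acq" (acquire).
    """
    n = len(trace)
    hb = [[False] * n for _ in range(n)]
    last_in_thread = {}      # thread -> index of its previous event
    last_release = {}        # sync object -> index of its latest release
    for i, (thread, op, obj) in enumerate(trace):
        if thread in last_in_thread:
            hb[last_in_thread[thread]][i] = True     # program order
        if op == "acq" and obj in last_release:
            hb[last_release[obj]][i] = True          # release -> acquire
        if op == "rel":
            last_release[obj] = i
        last_in_thread[thread] = i
    for k in range(n):                               # transitivity
        for i in range(n):
            if hb[i][k]:
                for j in range(n):
                    if hb[k][j]:
                        hb[i][j] = True
    return hb


def concurrent(hb, i, j):
    return not hb[i][j] and not hb[j][i]


# The figure above: a; unlock(L) in Thread 1, then lock(L); b in Thread 2.
synced = [(1, "acc", "X"), (1, "rel", "L"), (2, "acq", "L"), (2, "acc", "X")]
# The same two accesses with no synchronization at all.
unsynced = [(1, "acc", "X"), (2, "acc", "X")]

print(happens_before(synced)[0][3], concurrent(happens_before(unsynced), 0, 1))
```

The cubic transitive closure keeps the sketch short; it is only practical for small traces.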

Approaches to Detection of Apparent Data-Races – Static

There are two main approaches to detection of apparent data-races (sometimes a combination of both is used).

Static – perform a compile-time analysis of the code.

– Too conservative:
    - Can’t know or understand the semantics of the program.
    - Results in excessive false alarms that hide the real data-races.

+ Tests the program globally:
    - Sees the whole code of the tested program.
    - Can warn about all possible errors in all possible executions.

Approaches to Detection of Apparent Data-Races – Dynamic

Dynamic – use a tracing mechanism to detect whether a particular execution actually exhibited data-races.

+ Detects only those apparent data-races that actually occur during a feasible execution.

– Tests the program locally:
    - Considers only one specific execution path of the program each time.

Post-Mortem Methods – after the execution terminates, analyze the trace of the run and warn about possible data-races that were found.

On-The-Fly Methods – buffer partial trace information in memory, analyze it and detect races as they occur.

Approaches to Detection of Apparent Data-Races

No “silver bullet” exists.

The accuracy is of great importance (especially in large programs).

There is always a tradeoff between the amount of false negatives (undetected races) and false positives (false alarms).

The space and time overheads imposed by the techniques are significant as well.

Closer Look at Dynamic Methods

We show two dynamic methods for on-the-fly detection of apparent data-races in multi-threaded programs with locks and barriers:

DJIT+ – based on Lamport’s happens-before partial order relation and Mattern’s virtual time (vector clocks). Implemented in the Millipede and MultiRace systems.

Lockset – based on locking discipline and lockset refinement. Implemented in the Eraser tool and the MultiRace system.

DJIT+ Description

Detects the apparent data-races in a program execution when they actually occur.

Based on the happens-before partial order.

Can announce data-races race-by-race.
- After the cause of the race is verified, the search for other races can proceed.

The main disadvantage of the technique is that it is highly dependent on the scheduling order.

DJIT+ Local Time Frames (LTF)

The execution of each thread is split into a sequence of time frames.

A new time frame starts on each release (unlock/barrier).

For every access there is a time stamp = a vector built from the LTFs of all threads at the moment of the access.

    Thread          LTF
    x = 1           1
    lock( L1 )
    z = 2           1
    lock( L2 )
    y = 3           1
    unlock( L2 )
    z = 4           2
    barrier( B )
    x = 5           3
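The frame numbering in this example can be reproduced with a short sketch. The string encoding of operations and the function name `local_time_frames` are assumptions for illustration; the only rule applied is the one stated above, that each release (unlock or barrier) starts a new time frame.

```python
def local_time_frames(ops):
    """Return (access, LTF) pairs for one thread's operation sequence."""
    ltf = 1                                   # the first time frame
    stamped = []
    for op in ops:
        kind = op.split("(")[0].strip()
        if kind in ("unlock", "barrier"):     # a release starts a new frame
            ltf += 1
        elif kind != "lock":                  # anything else is an access
            stamped.append((op, ltf))
    return stamped


thread_ops = ["x = 1", "lock(L1)", "z = 2", "lock(L2)", "y = 3",
              "unlock(L2)", "z = 4", "barrier(B)", "x = 5"]
for access, frame in local_time_frames(thread_ops):
    print(access, frame)
```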

DJIT+ Local Time Frames

Claim 1: Let a in thread ta and b in thread tb be two accesses, where a occurs at time frame Ta, and the release in ta, corresponding to the latest acquire in tb which precedes b, occurs at time frame Tsync in ta. Then a hb→ b iff Ta < Tsync.

[Figure: threads ta and tb along the time-frame axis TFa of ta. Access a occurs at Ta; later releases in ta, at Trelease and Tsync, are paired with acquires in tb that precede b.]

DJIT+ Local Time Frames

Proof:
- If Ta < Tsync then (a hb→ release), and since (release hb→ acquire) and (acquire hb→ b), we get (a hb→ b).
- If (a hb→ b), then since a and b are in distinct threads, by definition there exists a pair of a corresponding release and acquire, so that (a hb→ release) and (acquire hb→ b). It follows that Ta < Trelease ≤ Tsync.

DJIT+ Vector Time Frames (VTF)

A vector st_t[·] for each thread t
- Vector size = maxthreads (the maximum number of threads to execute)
- Thread ID = thread index

st_t[t] is the LTF of t
- Holds the number of releases actually made by t

st_t[u] stores the latest LTF of thread u known to t

If u is an acquirer of t’s release, then u’s vector is updated:

    for k=0 to maxthreads-1
        st_u[k] = max( st_u[k], st_t[k] )

DJIT+ Vector Time Frames

In this way, the vector of u is notified of:
- The latest time frame of t.
- The latest time frames of other threads according to the knowledge of t.

Note that a thread can learn about a release performed by another thread through “gossip”, when this information is transferred through a chain of corresponding release-acquire pairs.

DJIT+ Vector Time Frames

    Thread 1            Thread 2            Thread 3
    (1 1 1)             (1 1 1)             (1 1 1)
    write X
    release( m1 )
    (2 1 1)
    read Z              acquire( m1 )
                        (2 1 1)
                        read Y
                        release( m2 )
                        (2 2 1)
                        write X             acquire( m2 )
                                            (2 2 1)
                                            write X
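The vector update can be sketched in a few lines. This is an illustration under stated assumptions: thread IDs are 0-based, every thread starts in time frame 1, a release first advances the releaser's own LTF and then leaves a copy of its vector on the sync object, and the names `VectorTimeFrames` and `carried` are hypothetical. The trace replays the three-thread example of this slide.

```python
class VectorTimeFrames:
    """Per-thread vectors st[t][u], indexed by 0-based thread IDs."""

    def __init__(self, max_threads):
        # Every thread starts in local time frame 1.
        self.st = [[1] * max_threads for _ in range(max_threads)]
        self.carried = {}        # sync object -> vector left by its last release

    def release(self, t, obj):
        self.st[t][t] += 1                     # a release starts a new frame
        self.carried[obj] = list(self.st[t])   # handed to the next acquirer

    def acquire(self, u, obj):
        released = self.carried.get(obj)
        if released is not None:
            # st_u[k] = max(st_u[k], st_t[k]) for every k
            self.st[u] = [max(a, b) for a, b in zip(self.st[u], released)]

    def happened_before(self, ta, frame_a, tb):
        """An access of ta in frame_a precedes tb's current access iff frame_a < st_tb[ta]."""
        return frame_a < self.st[tb][ta]


vtf = VectorTimeFrames(3)
vtf.release(0, "m1")     # Thread 1: release( m1 )
vtf.acquire(1, "m1")     # Thread 2: acquire( m1 )
vtf.release(1, "m2")     # Thread 2: release( m2 )
vtf.acquire(2, "m2")     # Thread 3: acquire( m2 )
print(vtf.st)
```

With these vectors, Thread 1's write to X (frame 1) is ordered before Thread 3's write, while Thread 2's write to X (frame 2, after its release) is not.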

DJIT+ Vector Time Frames

Claim 2: Let a and b be two accesses in respective threads ta and tb, which happened during respective local time frames Ta and Tb. Let f denote the value of st_tb[ta] at the time when b occurs. Then a hb→ b iff Ta < f.

[Figure: threads ta, tc and tb with time-frame axes TFa and TFb. Access a occurs in ta at Ta and is followed by a release; tc acquires and later releases; tb acquires and then performs b at Tb.]

DJIT+ Vector Time Frames

Proof:
- If (a hb→ b), then since a and b are in distinct threads, there exists a chain of releases and corresponding acquires, with the first release in ta and the last acquire in tb, such that (a hb→ first release) and (first release hb→ last acquire). The information on ta’s local time frame is transferred through that chain, reaches tb and is stored in st_tb[ta] (=f). Thus it follows that Ta < Tfirst release ≤ f.
- If Ta < f then there is a sequence of corresponding release-acquire pairs, which transfer the local time frame from ta to tb, finally resulting in tb “hearing” that ta entered a time frame which is later than Ta. This same sequence can be used to transitively apply the hb→ relation from a to b.

DJIT+ Logging Mechanism

We assume the existence of some logging mechanism, which:
- Is capable of logging all the accesses to all shared locations as they occur.
- Logs accesses ‘atomically’ (no data-races on the accesses to the log).
- Agrees with the happens-before partial order: if a hb→ b, then a is logged prior to b.

It also follows that if a and b are accesses to the same shared location v and a is logged prior to b, then b hb↛ a.

DJIT+ Data-Race Detection Using VTF

Theorem 1: Let a and b be two accesses to the same shared variable in respective threads ta and tb during respective local time frames Ta and Tb. Suppose that at least one of a or b is a write. Assume that a was logged and tested for races prior to b. Then a and b form a data-race iff at the time when b is logged it holds that st_tb[ta] ≤ Ta.

DJIT+ Data-Race Detection Using VTF

Proof:
- If st_tb[ta] ≤ Ta then, by Claim 2, a hb→ b doesn’t hold. Since b is only currently being logged, it cannot hold that b hb→ a. Thus a and b are concurrent and form a data race (since at least one of them is a write).
- If a and b form a data race then a hb→ b doesn’t hold. Thus, by Claim 2, st_tb[ta] ≤ Ta.

DJIT+ Data Race Detection Predicate

P(a,b) ≜ ( a.type = write ⋁ b.type = write ) ⋀ ( a.time_frame ≥ st_b.thread_id[a.thread_id] )

P gets two accesses, a and b, such that:
- a and b are in different threads
- a and b access the same shared location
- a was logged and tested earlier
- b is currently logged

P returns TRUE iff a and b form a data race.

Obviously, very expensive
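The predicate translates almost literally into code. In the sketch below the `Access` record, the function name `is_race` and the 0-based thread IDs are assumptions; `st` is the table of per-thread vectors at the moment b is logged, here filled with illustrative values in which thread 2 has heard of frame 2 of both thread 0 and thread 1.

```python
from collections import namedtuple

# Hypothetical record for a logged access.
Access = namedtuple("Access", ["thread_id", "type", "time_frame"])


def is_race(a, b, st):
    """P(a, b): a was logged earlier, b is being logged now.

    st[t][u] is the latest time frame of thread u known to thread t.
    """
    return ((a.type == "write" or b.type == "write")
            and a.time_frame >= st[b.thread_id][a.thread_id])


st = [[2, 1, 1], [2, 2, 1], [2, 2, 1]]        # illustrative vectors

b = Access(thread_id=2, type="write", time_frame=1)
ordered = Access(thread_id=0, type="write", time_frame=1)    # 1 < st[2][0]
racing = Access(thread_id=1, type="write", time_frame=2)     # 2 >= st[2][1]
print(is_race(ordered, b, st), is_race(racing, b, st))
```

Evaluating P for every pair of logged accesses is what makes the naive scheme expensive.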


DJIT+ Which Accesses to Check?

We have assumed that there is a logging mechanism, which records all accesses.

Logging all accesses in all threads and testing the predicate P for each pair of them will impose a great overhead on the system.

Actually some of the accesses can be discarded.

Claim 3: Consider an access a in thread ta during time frame Ta, and accesses b and c in thread tb=tc during time frame Tb=Tc. Assume that c precedes b in the program order. If a and b are concurrent, then a and c are concurrent as well.

[Figure: two time-frame diagrams of threads ta and tb. In each, a occurs in ta at time frame Ta, and in tb a release is followed by c and then b within the same time frame (Tc = Tb).]

Proof:
- Let fb and fc denote the respective values of st_tb[ta] when b and c happen. Since st_tb[ta] is monotonically increasing, and c precedes b, we know that fb ≥ fc. Since a hb→ b does not hold, we know by Claim 2 that Ta ≥ fb. Thus, Ta ≥ fc and again by Claim 2 we get that a hb→ c is false.
- Let fa denote the value of st_ta[tb] when a happens. Since b hb→ a does not hold, we know by Claim 2 that Tb ≥ fa. Since Tb=Tc we get that Tc ≥ fa. Thus by Claim 2, c hb→ a is false.

DJIT+ Which Accesses to Check? - Example

Accesses b and c were previously logged in thread t1 in the same time frame.

Access b precedes access c in the program order.

Access a is currently being logged in thread t2.

If a and b are synchronized, then a and c are synchronized as well.

It is sufficient to log and test only the first read access and the first write access to every variable in each time frame!

[Figure: a two-thread trace of reads and writes to X, some inside lock( L ) ... unlock( L ) sections and one unlocked read. Accesses b and c (Thread 1) and a (Thread 2) are marked, repeated accesses within a time frame are marked “No logging”, and one access is marked DR (data race).]

DJIT+ Which Time Frames to Check?

Assume that in thread ta an access a is currently being logged, and in thread tb we previously logged a write b in time frame Tb and another previous write c in time frame Tc, so that Tb < Tc.

[Figure: threads ta and tb with time-frame axes TFa and TFb. In tb, b occurs at Tb, followed by a release, and c occurs later at Tc; in ta, an acquire precedes a at Ta.]


DJIT+ Which Time Frames to Check?

Claim 4: If a forms a data-race with b then it certainly forms a data-race with c.

Proof: Easy, since Tc > Tb ≥ st_ta[tb].

Either pair a-b or a-c can be considered to be the apparent data-race to be reported.

Also, if there is no data-race between a and c, then there is also no data-race between a and b. Therefore, the a-b pair should not be checked.


DJIT+ Which Time Frames to Check?

For the current read access to a shared variable v, it is enough to check it against the last time frame in each of the other threads which wrote to v.

For the current write access to v, it is enough to check it against the last time frame in each of the other threads which read from v, and the last time frame in each of the other threads which wrote to v.

DJIT+ Access History & Algorithm

Each variable v holds for each of the threads:
- The last time frame in which the thread read from v
- The last time frame in which the thread wrote to v

    v:  r-tf1  r-tf2  ...  r-tfn    time frames of recent reads from v – one for each thread
        w-tf1  w-tf2  ...  w-tfn    time frames of recent writes to v – one for each thread

On each first read and first write to v in a time frame, every thread updates the access history of v with its LTF.

If the access to v is a read, the thread checks all recent writes by other threads to v.

If the access is a write, the thread checks all recent reads as well as all recent writes by other threads to v.

To support a weak memory model, the history should be atomic and coherent.
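Putting the vectors and the access history together gives a compact detector. The following is a single-process sketch under stated assumptions, not the Millipede or MultiRace implementation: thread IDs are 0-based, time frames start at 1 (so 0 can mean "never accessed"), a release advances the releaser's LTF before its vector is handed over, and all names (`DjitSketch`, `carried`, `races`) are hypothetical.

```python
class DjitSketch:
    def __init__(self, n):
        self.n = n
        self.st = [[1] * n for _ in range(n)]   # per-thread vectors
        self.carried = {}                        # sync object -> released vector
        self.r_tf = {}                           # var -> last read frame per thread
        self.w_tf = {}                           # var -> last write frame per thread
        self.races = []                          # (var, earlier thread, frame, current thread)

    def release(self, t, obj):
        self.st[t][t] += 1
        self.carried[obj] = list(self.st[t])

    def acquire(self, t, obj):
        if obj in self.carried:
            self.st[t] = [max(a, b) for a, b in zip(self.st[t], self.carried[obj])]

    def access(self, t, var, kind):
        reads = self.r_tf.setdefault(var, [0] * self.n)
        writes = self.w_tf.setdefault(var, [0] * self.n)
        own = reads if kind == "read" else writes
        ltf = self.st[t][t]
        if own[t] == ltf:
            return                     # not the first such access in this frame
        histories = [writes] if kind == "read" else [reads, writes]
        for hist in histories:
            for u in range(self.n):
                # Race iff the earlier frame is not yet known to t.
                if u != t and hist[u] != 0 and hist[u] >= self.st[t][u]:
                    self.races.append((var, u, hist[u], t))
        own[t] = ltf


d = DjitSketch(3)
d.access(0, "X", "write"); d.release(0, "m1"); d.access(0, "Z", "read")
d.acquire(1, "m1"); d.access(1, "Y", "read"); d.release(1, "m2"); d.access(1, "X", "write")
d.acquire(2, "m2"); d.access(2, "X", "write")
print(d.races)
```

On this trace only the pair of writes to X by threads 1 and 2 is reported; thread 0's write is ordered before both through the m1 and m2 release-acquire chain.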

DJIT+ Coherency

In fact, the presented algorithm uses only the coherency assumption on the access history.

Coherency means that:
- For each variable v there is an agreed-among-all-threads global order Rv on all accesses to v.
- The reads always return the most recently written value.

Hence, the algorithm described above is correct also for weakly ordered systems.
- E.g., the data-race-free-1 memory model only requires that in total absence of data-races the program executes as if it was sequentially consistent.

    Thread 1            Thread 2
    write v1, 1
    write v2, 2
                        read v2, 2
                        read v1, 0

The history is coherent, but not sequentially consistent.

DJIT+ Results

The DJIT+ algorithm was implemented in several academic systems – Millipede and MultiRace.

- No false alarms
- No missed races in a given feasible execution
- Very sensitive to differences in threads’ scheduling
- Should be applied each time the program executes (and not only in debug mode)
- Requires an enormous number of runs, yet cannot prove that the tested program is race free

Lockset – Locking Discipline

Lockset detects violations of locking discipline.

The locking discipline is a programming policy that ensures total absence of data races.

A common and simple locking discipline is that every shared location is consistently protected by the same lock on each access.

The main drawback is a possibly excessive number of false alarms.

Lockset – What is the Difference?

    Thread 1                Thread 2
    Z = Z + 1;  [1]
    Lock( m );
    V = V + 1;
    Unlock( m );
                            Lock( m );
                            V = V + 1;
                            Unlock( m );
                            Z = Z + 1;  [2]

[1] hb→ [2], yet there is a feasible data-race under different scheduling.

    Thread 1                Thread 2
    Z = Z + 1;  [1]
    Lock( m );
    Flag = true;
    Unlock( m );
                            Lock( m );
                            T = Flag;
                            Unlock( m );
                            if ( T == true )
                                Z = Z + 1;  [2]

There is no locking discipline on Z. Yet [1] and [2] are ordered under all possible schedulings.

Lockset – The Basic Algorithm

C(v) – the set of all locks that consistently protected v in the execution so far

locks_held(t) – the set of all locks currently acquired by thread t

The algorithm:
- For each v, init C(v) to the set of all possible locks
- On each access to v by thread t:
    - lhv ← locks_held(t)
    - if this is a read, then lhv ← lhv ∪ {readers_lock}
    - C(v) ← C(v) ∩ lhv
    - if C(v)=∅, issue a warning

Lockset – Explanation

The process is called lockset refinement.

It ensures that any lock that consistently protected v is contained in C(v).

A lock m is in C(v) if, in the execution up to that point, every thread that has accessed v was holding m at the moment of access.

If some lock m consistently protects v, it will remain in C(v) till the termination of the program.

The addition of the fake readers_lock lock ensures that concurrent reads are not interpreted as data races.

The first write to v permanently removes readers_lock from C(v).

Lockset – Example

    Thread 1        Thread 2        lhv         C(v)
                                    { }         {L1, L2, RL}
    lock( L1 )
    read v                          {L1, RL}    {L1, RL}
    unlock( L1 )                    { }
                    lock( L2 )
                    write v         {L2}        { }
                    unlock( L2 )    { }

Warning: the locking discipline for v is violated!

RL = readers_lock. It prevents multiple reads from generating false alarms.
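The refinement step is small enough to sketch in full. This is an illustration of the basic algorithm only, without the later extensions for initialization, read-sharing or barriers; the class name `LocksetSketch` and its fields are hypothetical, and "the set of all possible locks" is passed in explicitly.

```python
READERS_LOCK = "readers_lock"     # the fake lock RL


class LocksetSketch:
    def __init__(self, all_locks):
        self.all_locks = set(all_locks) | {READERS_LOCK}
        self.C = {}               # v -> candidate lockset C(v)
        self.held = {}            # t -> locks_held(t)
        self.warnings = []        # (thread, variable, kind) of each violation

    def lock(self, t, m):
        self.held.setdefault(t, set()).add(m)

    def unlock(self, t, m):
        self.held.setdefault(t, set()).discard(m)

    def access(self, t, v, kind):
        lhv = set(self.held.get(t, set()))          # lhv <- locks_held(t)
        if kind == "read":
            lhv.add(READERS_LOCK)                   # lhv <- lhv U {readers_lock}
        current = self.C.setdefault(v, set(self.all_locks))
        self.C[v] = current & lhv                   # C(v) <- C(v) n lhv
        if not self.C[v]:
            self.warnings.append((t, v, kind))      # discipline violated


ls = LocksetSketch(["L1", "L2"])
ls.lock(1, "L1"); ls.access(1, "v", "read"); ls.unlock(1, "L1")
after_read = set(ls.C["v"])
ls.lock(2, "L2"); ls.access(2, "v", "write"); ls.unlock(2, "L2")
print(after_read, ls.C["v"], ls.warnings)
```

The run follows the table above: the read leaves C(v) = {L1, RL}, and the write under L2 empties it and triggers the warning.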

Extended Lockset Which Accesses to Check?

Two accesses, a and b, to v:
- Both in the same thread
- Both in the same time frame
- Access a precedes access b

Then: Locksa(v) ⊆ Locksb(v)

Locksu(v) is the set of real locks acquired by the thread during access u to v.

    Thread              Locksu(v)
    unlock
    …
    lock(L1)
    write x [1]         {L1}
    write x [2]         {L1} = {L1}
    lock(L2)
    write x [3]         {L1,L2} ⊇ {L1}
    unlock( L2 )
    unlock( L1 )

Accesses [1], [2], [3] are all in the same time frame.

Extended Lockset Which Accesses to Check?

It follows that: 1) [C(v) ∩ Locksa(v)] ⊆ [C(v) ∩ Locksb(v)] 2) If C(v) ∩ Locksa(v)≠∅ then C(v) ∩ Locksb(v)≠∅ Only first access in each time frame need

be logged and checked!!! The addition of readers_lock forces us

to check both first read and first write in each time frame

Lockset needs same logging mechanism as Djit+!
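A minimal sketch of such a filter follows. It assumes that a thread starts a new time frame at each release (unlock), which matches the example in which accesses [1], [2] and [3] share one frame. The class `FirstAccessFilter` and its methods are hypothetical names, and a real tool would discard old entries instead of keeping them forever.

```python
class FirstAccessFilter:
    """Decides which accesses must be logged and checked (hypothetical helper)."""

    def __init__(self):
        self.frame = {}    # thread -> number of its current time frame
        self.seen = set()  # (thread, frame, variable, kind) already logged

    def release(self, t):
        # Assumption: a release (unlock) starts a new time frame for thread t.
        self.frame[t] = self.frame.get(t, 0) + 1

    def should_log(self, t, v, kind):
        """kind is 'read' or 'write'.

        Returns True only for the first access of that kind to v by thread t
        in its current time frame; later ones cannot empty C(v) if the first
        did not, so they are skipped.
        """
        key = (t, self.frame.get(t, 0), v, kind)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True
```

Reads and writes are tracked separately because readers_lock is added only on reads, so a first write may refine C(v) further than a first read in the same frame.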

Extended Lockset – Improving Locking Discipline

The locking discipline described above is too strict. There are common programming practices that violate the discipline, yet are free from data-races:
- Initialization: Shared variables are usually initialized without holding any locks.
- Read-Shared Data: Some shared variables are written during initialization only and are read-only thereafter.
- Barriers: Threads can synchronize through barriers, which are not supported by the notion of locking discipline. For data-race-free programs using barriers only, the basic Lockset will report false alarms on every pair of accesses from different threads.

Extended Lockset – Initialization

When initializing newly allocated data there is no need to lock it, since other threads cannot hold a reference to it yet.

Unfortunately, there is no easy way of knowing when initialization is complete.

Therefore, a shared variable is considered initialized once it is first accessed by a second thread.

As long as a variable is accessed by a single thread, reads and writes don’t update C(v).
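The rule can be sketched as follows: refinement of C(v) is switched on only when a thread other than the first one touches the variable. `VarState` and `on_access` are hypothetical names, and barriers are ignored in this sketch.

```python
READERS_LOCK = "readers_lock"


class VarState:
    """Per-variable bookkeeping (hypothetical structure)."""

    def __init__(self, all_locks):
        self.owner = None    # first thread that accessed the variable
        self.shared = False  # True once a second thread has accessed it
        self.candidates = frozenset(all_locks) | {READERS_LOCK}  # C(v)


def on_access(state, thread, held, is_read):
    """Returns True if a warning should be issued for this access."""
    if state.owner is None:
        state.owner = thread
    if not state.shared:
        if thread == state.owner:
            return False     # still initializing: C(v) is not updated
        state.shared = True  # a second thread: initialization is over
    lhv = set(held)
    if is_read:
        lhv.add(READERS_LOCK)
    state.candidates = state.candidates & lhv
    return not state.candidates
```

Because the initializing writes never touch C(v), readers_lock is still in C(v) when other threads later read the variable without locks, so such reads raise no warning.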

Extended Lockset – Read-Shared Data

There is no need to protect a variable if it’s initialized once and thereafter is read-only.

To support unlocked read-sharing, the fake readers_lock was added.

Still, some additional mechanism is needed so that the initialization will not permanently remove the readers_lock from C(v).

Note: The fake lock does not prevent threads from executing the reads concurrently.

Extended Lockset – Supporting Barriers

A barrier is a global synchronization primitive, whereas locks are 2-way.

In order to pass the barrier, all threads must reach it first and only then continue.

Observations:
- Reaching a barrier ≅ starting a new execution.
- There are no races between accesses from different sides of a barrier.

Idea – restart Lockset detection each time a barrier is reached by all threads.

Extended Lockset – Supporting Barriers

Variable v is supposed to be initialized when:
- It is first accessed by a second thread
- The thread that first accessed v reaches a barrier

[Figure: state transition diagram employed for each variable. States: Virgin, Initializing, Shared, Empty, Clean, Exclusive. Transition labels: write by first thread; read by any thread; read/write by same thread; read/write by new thread (plain, with C(v) empty, and with C(v) not empty); read/write by any thread (plain, with C(v) empty, and with C(v) not empty); read/write by some thread; barrier.]

Extended Lockset – States Explanation

Virgin – The variable is new and has not been referenced by any thread.

Initializing – The variable is being initialized by only one thread. C(v) is not updated in this state.

Shared – The data is accessed by more than one thread. C(v) is updated on each access.

Empty – C(v) became empty. A data race warning is announced only the first time this state is reached.

Clean – A barrier was reached by all threads. C(v) is initialized to hold the set of all possible locks.

Exclusive – Similar to the Initializing state: after reaching the barrier, the variable is accessed by only one thread. It is supposed to be already initialized. Thus, C(v) is updated on each access, but a data race is announced only if another thread accesses v and C(v) is empty.
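The six states can be sketched as a small per-variable state machine. The transitions below are reconstructed from the state descriptions and the diagram labels, so several details are assumptions rather than the original design: a Virgin variable stays Virgin on reads and on barriers, and C(v) is checked for emptiness already on the access that leaves Initializing. All names (`State`, `Variable`, `access`, `barrier`) are hypothetical.

```python
from enum import Enum, auto

READERS_LOCK = "readers_lock"


class State(Enum):
    VIRGIN = auto()
    INITIALIZING = auto()
    SHARED = auto()
    EMPTY = auto()
    CLEAN = auto()
    EXCLUSIVE = auto()


class Variable:
    def __init__(self, all_locks):
        self.all = frozenset(all_locks) | {READERS_LOCK}
        self.state = State.VIRGIN
        self.owner = None
        self.c = self.all  # C(v)

    def barrier(self):
        # All threads reached a barrier: restart detection for this variable.
        if self.state is not State.VIRGIN:  # assumption: Virgin stays Virgin
            self.state, self.owner, self.c = State.CLEAN, None, self.all

    def access(self, thread, held, is_read):
        """Returns True when a data-race warning is announced."""
        s = self.state
        if s is State.VIRGIN:
            if not is_read:  # write by first thread
                self.state, self.owner = State.INITIALIZING, thread
            return False
        if s is State.INITIALIZING and thread == self.owner:
            return False     # C(v) is not updated while initializing
        if s is State.EMPTY:
            return False     # the warning was already announced once
        lhv = set(held) | ({READERS_LOCK} if is_read else set())
        self.c = self.c & lhv
        if s is State.CLEAN:
            self.state, self.owner = State.EXCLUSIVE, thread
            return False
        if s is State.EXCLUSIVE and thread == self.owner:
            return False     # C(v) updated, but no warning for the same thread
        # New thread in Initializing/Exclusive, or any thread in Shared.
        self.state = State.SHARED if self.c else State.EMPTY
        return self.state is State.EMPTY
```

With this sketch, a program in which threads touch v without locks but only on different sides of a barrier raises no warning, while two unlocked writes by different threads between two barriers do.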

Lockset – Still False Alarms

The refined algorithm will still produce a false alarm in the following simple case:

Thread 1                                Thread 2                                C(v)
Lock( m1 ); v = v + 1; Unlock( m1 );                                            {m1}
                                        Lock( m1 ); Lock( m2 ); v = v + 1;
                                        Unlock( m2 ); Unlock( m1 );             {m1}
Lock( m2 ); v = v + 1; Unlock( m2 );                                            { }
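The case can be replayed with plain set intersections. The sketch assumes one reading of the example: the accesses under m1 alone and under m2 alone belong to Thread 1, the access under both locks belongs to Thread 2, and they occur in the order that yields the C(v) column {m1}, {m1}, { }. The pairwise check in `unprotected_pairs` is only an illustration of why the alarm is false; it is not part of Lockset.

```python
from itertools import combinations

# (thread, locks held during the access), in time order (assumed ordering)
accesses = [
    ("T1", {"m1"}),
    ("T2", {"m1", "m2"}),
    ("T1", {"m2"}),
]


def lockset_after(accesses):
    """C(v) after each access, starting from the locks of the first access."""
    c = None
    history = []
    for _, held in accesses:
        c = set(held) if c is None else c & held
        history.append(set(c))
    return history


def unprotected_pairs(accesses):
    """Index pairs of accesses from different threads that share no lock."""
    return [
        (i, j)
        for (i, (t1, h1)), (j, (t2, h2)) in combinations(enumerate(accesses), 2)
        if t1 != t2 and not (h1 & h2)
    ]
```

Here `lockset_after(accesses)` ends with the empty set, so Lockset warns, while `unprotected_pairs(accesses)` is empty: every pair of accesses from different threads is serialized by a common lock, and the remaining pair belongs to a single thread.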

Lockset – Additional False Alarms

Additional possible false alarms are:
- A queue that implicitly protects its elements by accessing the queue through locked head and tail fields.
- A thread that passes arguments to a worker thread. Since the main thread and the worker thread never access the arguments concurrently, they do not use any locks to serialize their accesses.
- Privately implemented locks, which don’t communicate with Lockset.
- True data races that don’t affect the correctness of the program (benign races), for example:

    if (f == 0) {
        lock(m);
        if (f == 0)
            f = 1;
        unlock(m);
    }

Lockset – Results

The basic Lockset was implemented in a full-scale testing tool, Eraser, which is used in industry (not “on paper only”).

The extended Lockset was implemented in the MultiRace academic system.

- Less sensitive to differences in threads’ scheduling.
- Detects a superset of all apparently raced locations in an execution of a program, so possible races are rarely missed.
- Our extension for barriers can be used to check programs that employ barriers only and no locks.
- Still lots of false alarms.
- Still dependent on scheduling.
- Cannot prove the tested program is race free.

Combining Djit+ and Lockset

[Figure: set diagram of locations. S: all shared locations in some program P. F: all feasibly raced locations in program P. A: all apparently raced locations in program P. L: violations detected by Lockset in execution E of P. D: raced locations detected by DJIT+ in execution E of P.]

Lockset can detect suspected races in more execution orders

DJIT+ can filter out the spurious warnings reported by Lockset

Every completed data race is also a locking discipline violation

For many types of programs L tends to cover A – we detect a subset and a superset of all raced locations!!!

The number of checks performed by DJIT+ can be reduced with the help of Lockset: as long as C(v) is not empty, DJIT+ need not check v for races.

The implementation overhead comes mainly from the access logging mechanism, which can be shared by both algorithms.
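The combination can be sketched with one shared access log. The happens-before part below is a plain vector-clock check used as a simplified stand-in for DJIT+, not DJIT+ itself, and readers_lock is left out for brevity. The class `CombinedDetector` and all its members are hypothetical names.

```python
class CombinedDetector:
    """Lockset first; the happens-before check runs only once C(v) is empty."""

    def __init__(self, threads, locks):
        self.all_locks = frozenset(locks)
        self.held = {t: set() for t in threads}
        # clock[t][u] = latest time frame of thread u that t knows about
        self.clock = {t: {u: 0 for u in threads} for t in threads}
        for t in threads:
            self.clock[t][t] = 1
        self.lock_clock = {m: {u: 0 for u in threads} for m in locks}
        self.c = {}         # v -> C(v)
        self.history = {}   # v -> [(thread, time frame, is_write), ...]
        self.hb_checks = 0  # how often the expensive check actually ran

    def acquire(self, t, m):
        self.held[t].add(m)
        for u, frame in self.lock_clock[m].items():
            self.clock[t][u] = max(self.clock[t][u], frame)

    def release(self, t, m):
        self.held[t].discard(m)
        self.lock_clock[m] = dict(self.clock[t])
        self.clock[t][t] += 1  # assumption: a release starts a new time frame

    def access(self, t, v, is_write):
        """Returns (lockset_warning, race_confirmed)."""
        self.c[v] = self.c.get(v, self.all_locks) & self.held[t]
        warning = not self.c[v]
        race = False
        if warning:  # while C(v) is not empty the check is skipped
            self.hb_checks += 1
            for u, frame, was_write in self.history.get(v, []):
                conflicting = u != t and (is_write or was_write)
                # An earlier access is concurrent if t has not yet seen u's frame.
                if conflicting and frame > self.clock[t][u]:
                    race = True
        self.history.setdefault(v, []).append((t, self.clock[t][t], is_write))
        return warning, race
```

Skipping the check is safe because a non-empty C(v) means all accesses so far held a common lock and are therefore ordered. When the locking discipline is violated but the accesses are still ordered through other locks, the result is `(True, False)`: Lockset warns and the happens-before check filters the warning out.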

Dynamic Data-Race Detection – Summary

- The solutions are not universal.
- Not all located apparent data races are feasible.
- Still requires a large number of runs to check as many execution paths as possible.
- Still cannot prove the program to be data-race free.
- Since slowdowns can be high, satisfactory testing can take months.
- Different (or new) types of synchronization might require different detection techniques.
- Inserting detection code in a program can perturb the threads’ interleaving so that races disappear (Lockset is less sensitive to this).
- Maybe combine with some static analysis?
- Maybe better approximations can be found...?

The End

References

S. Adve and M. D. Hill. A Unified Formalization of Four Shared-Memory Models. Technical Report, University of Wisconsin, Sept. 1992.

A. Itzkovitz, A. Schuster, and O. Zeev-Ben-Mordehai. Toward Integration of Data Race Detection in DSM Systems. In The Journal of Parallel and Distributed Computing (JPDC), 59(2): pp. 180-203, Nov. 1999.

L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. In Communications of the ACM, 21(7): pp. 558-565, Jul. 1978.

F. Mattern. Virtual Time and Global States of Distributed Systems. In Parallel & Distributed Algorithms, pp. 215-226, 1989.

R. H. B. Netzer and B. P. Miller. What Are Race Conditions? Some Issues and Formalizations. In ACM Letters on Programming Languages and Systems, 1(1): pp. 74-88, Mar. 1992.

R. H. B. Netzer and B. P. Miller. On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions. In 1990 International Conference on Parallel Processing, 2: pp. 93-97, Aug. 1990.

R. H. B. Netzer and B. P. Miller. Detecting Data Races in Parallel Program Executions. In Advances in Languages and Compilers for Parallel Processing, MIT Press 1991, pp. 109-129.

E. Pozniansky. Efficient On-The-Fly Data Race Detection in Multithreaded C++ Programs. Research Thesis, May 2003.

S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T.E. Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. In ACM Transactions on Computer Systems, 15(4): pp. 391-411, 1997

O. Zeev-Ben-Mordehai. Efficient Integration of On-The-Fly Data Race Detection in Distributed Shared Memory and Symmetric Multiprocessor Environments. Research Thesis, May 2001.
