
Dynamic Data-Race Detection in Lock-Based Multi-Threaded Programs

Prepared by Eli Pozniansky under Supervision of Prof. Assaf Schuster


Table of Contents

What is a Data-Race?
Why Data-Races are Undesired?
How Data-Races Can be Prevented?
Can Data-Races be Easily Detected?
Feasible and Apparent Data-Races
Complexity of Data-Race Detection
    - Program Execution Model
    - Complexity of Computing Ordering Relations
    - Proof of NP/Co-NP Hardness
So How Data-Races Can be Detected?
    - Lamport’s Happens-Before Approximation
Approaches to Detection of Apparent Data-Races:
    - Static Methods
    - Dynamic Methods: Post-Mortem Methods, On-The-Fly Methods
Closer Look at Dynamic Methods:
    - DJIT: Local Time Frames, Vector Time Frames, Predicate for Data-Race Detection, Which Accesses to Check?, Which Time Frames to Check?, Access History, First Data-Race, Results
    - Lockset: Locking Discipline, The Basic Algorithm, Improving Locking Discipline (Initialization, Read-Sharing), Refinement for Read-Write Locks, False Alarms, Results
Summary
References

What is a Data-Race?

A data-race is an anomaly in which two or more threads concurrently access a shared variable and at least one of the accesses is a write.

Example (variable X is global and shared):

Thread 1:
    X=1
    Z=2

Thread 2:
    T=Y
    T=X

Why Data-Races are Undesired?

Programs that contain data-races usually demonstrate unexpected and even non-deterministic behavior.

The outcome might depend on the specific execution order (a.k.a. the threads’ interleaving).

Re-running the program may not always produce the same results.

Thus, such programs are hard to debug, and correct programs are hard to write.

Why Data-Races are Undesired? – Example

First Interleaving:
1. X=0
2. T=X
3. X++

Second Interleaving:
1. X=0
2. X++
3. T=X

T==0 or T==1?
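The two interleavings can also be enumerated mechanically. The Python sketch below assumes, for illustration, that X=0 has already executed, that one thread performs X++ and the other performs T=X, and that each statement executes atomically; the helper names are hypothetical.

```python
from itertools import combinations


def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that preserves each thread's program order."""
    n = len(t1) + len(t2)
    for chosen in combinations(range(n), len(t1)):  # slots taken by t1
        it1, it2 = iter(t1), iter(t2)
        yield [next(it1) if k in chosen else next(it2) for k in range(n)]


def inc_x(state):   # X++ (modeled as one atomic step)
    state["X"] += 1


def read_x(state):  # T=X
    state["T"] = state["X"]


def possible_t_values():
    results = set()
    for schedule in interleavings([inc_x], [read_x]):
        state = {"X": 0, "T": None}  # X=0 has already executed
        for step in schedule:
            step(state)
        results.add(state["T"])
    return results


print(sorted(possible_t_values()))  # [0, 1]
```

Each value of T corresponds to one of the two interleavings shown above.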

Execution Order

Each thread has a different execution speed, which may change over time.

For an external observer of the time axis, the instructions’ executions are ordered in a single execution order.

Any order is legal. The execution order within a single thread is called program order.

[Figure: instructions of threads T1 and T2 placed along a common time axis]

How Data-Races Can be Prevented? – Explicit Synchronization

Idea: In order to prevent undesired concurrent accesses to shared locations, we must explicitly synchronize between threads.

The means for explicit synchronization are:
- Locks, Mutexes and Critical Sections
- Barriers
- Binary Semaphores and Counting Semaphores
- Monitors
- Single-Writer/Multiple-Readers (SWMR) Locks
- Others

Synchronization – “Bad” Bank Account Example

Thread 1:
Deposit( amount ) {
    balance+=amount;
}

Thread 2:
Withdraw( amount ) {
    if (balance<amount)
        print( “Error” );
    else
        balance-=amount;
}

‘Deposit’ and ‘Withdraw’ are not “atomic”!!!

What is the final balance after a series of concurrent deposits and withdrawals?

Synchronization – “Good” Bank Account Example

Thread 1:
Deposit( amount ) {
    Lock( m );
    balance+=amount;
    Unlock( m );
}

Thread 2:
Withdraw( amount ) {
    Lock( m );
    if (balance<amount)
        print( “Error” );
    else
        balance-=amount;
    Unlock( m );
}

Since critical sections can never execute concurrently, this version exhibits no data-races.
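To see why the unprotected version misbehaves, the Python sketch below models each update of the “Bad” version as a separate read step and write step, enumerates every interleaving of one Deposit(10) and one Withdraw(5) starting from a balance of 100, and compares the result with the locked version, where each critical section is modeled as one indivisible step. The amounts, the read/write split (Withdraw reads the balance only once here) and all function names are assumptions made for illustration.

```python
from itertools import combinations


def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that preserves each thread's program order."""
    n = len(t1) + len(t2)
    for chosen in combinations(range(n), len(t1)):  # slots taken by t1
        it1, it2 = iter(t1), iter(t2)
        yield [next(it1) if k in chosen else next(it2) for k in range(n)]


def final_balances(deposit_steps, withdraw_steps, initial=100):
    """Set of final balances over all interleavings of the two threads' steps."""
    results = set()
    for schedule in interleavings(deposit_steps, withdraw_steps):
        state = {"balance": initial}
        for step in schedule:
            step(state)
        results.add(state["balance"])
    return results


# "Bad" version: each update is a read step followed by a write step.
def d_read(s):
    s["d_tmp"] = s["balance"]


def d_write(s):
    s["balance"] = s["d_tmp"] + 10          # Deposit( 10 )


def w_read(s):
    s["w_tmp"] = s["balance"]


def w_write(s):
    if s["w_tmp"] < 5:                      # Withdraw( 5 )
        print("Error")
    else:
        s["balance"] = s["w_tmp"] - 5


# "Good" version: Lock( m ) ... Unlock( m ) makes each body one indivisible step.
def deposit_locked(s):
    d_read(s)
    d_write(s)


def withdraw_locked(s):
    w_read(s)
    w_write(s)


print(sorted(final_balances([d_read, d_write], [w_read, w_write])))  # [95, 105, 110]
print(sorted(final_balances([deposit_locked], [withdraw_locked])))   # [105]
```

In the unprotected version one of the two updates can be lost, so 95 and 110 appear next to the correct 105; with the critical sections only 105 is possible.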


Is This Enough?

Theoretically – YES. Practically – NO.

What if the programmer accidentally forgets to place the correct synchronization?

How can all such data-race bugs be detected in a large program?

Can Data-Races be Easily Detected? – No!

Unfortunately, the problem of deciding whether a given program contains potential data-races is computationally hard!!!

There are a lot of execution orders. For t threads of n instructions each, the number of possible orders is about $t^{n \cdot t}$.
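As a rough check of this estimate: when every instruction is treated as atomic, the exact number of interleavings that preserve each thread’s program order is the multinomial coefficient (n·t)!/(n!)^t. By the multinomial theorem it never exceeds t^(n·t), and it still grows exponentially in n·t. A small Python sketch (the function name is our own):

```python
from math import factorial


def count_interleavings(t, n):
    """Exact number of interleavings of t threads with n atomic instructions each:
    the multinomial coefficient (n*t)! / (n!)^t."""
    return factorial(n * t) // factorial(n) ** t


for t, n in [(2, 2), (2, 10), (3, 2), (4, 5)]:
    exact = count_interleavings(t, n)
    bound = t ** (n * t)
    print(f"t={t}, n={n}: {exact} orders (t^(n*t) = {bound})")
```

Even two threads of ten instructions each already admit 184756 distinct orders.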

In addition to all the different schedules, all possible inputs should be tested as well.

To compound the problem, inserting detection code into a program can perturb its execution schedule enough to make all errors disappear.

Feasible Data-Races

Feasible Data-Races: races that are based on the possible behavior of the program (i.e. semantics of the program’s computation).

These are the actual (!) data-races that can possibly happen in any specific execution.

Locating feasible data-races requires a full analysis of the program’s semantics to determine whether the execution could have allowed a and b (accesses to the same shared variable) to execute concurrently.

Apparent Data-Races

Apparent Data-Races: approximations (!) of feasible data-races that are based on only the behavior of the explicit synchronization performed by some feasible execution (and not the semantics of the program’s computation, i.e. ignoring all conditional statements).

Apparent data-races are important, since data-races are usually a result of improper synchronization. They are thus easier to detect, but less accurate.

Apparent Data-Races Cont.

For example, a and b, accesses to the same shared variable in some execution, are said to be ordered if there is a chain of corresponding explicit synchronization events between them.

Similarly, a and b are said to have potentially executed concurrently if no explicit synchronization prevented them from doing so.

Feasible vs. Apparent – Example 1

Initially F = false.

Thread 1:
    X++;
    F=true;

Thread 2:
    while (F==false) {};
    X--;

Apparent data-races in the execution above – [1] on F (F=true vs. the reads of F in the while loop) and [2] on X (X++ vs. X--); there is no synchronization chain between the racing accesses.

Feasible data-race – [1] only!!! No feasible execution exists in which ‘X--’ is performed before ‘X++’ (given that F is false at start).

Note that protecting ‘F’ alone will protect X as well.

Feasible vs. Apparent – Example 2

Initially F = false.

Thread 1:
    X++;
    Lock( m );
    F=true;
    Unlock( m );

Thread 2:
    while( 1 ) {
        Lock( m );
        if ( F == true ) break;
        Unlock( m );
    }
    X--;

No feasible or apparent data-races exist under any execution order!!! F is protected by means of a lock. The accesses to X are always ordered and properly synchronized.

Complexity of Data-Race Detection

Exactly locating the feasible data-races is an NP-hard problem. Thus, the apparent races, which are simpler to locate, must be detected for debugging.

Fortunately, apparent data-races exist if and only if at least one feasible data-race exists somewhere in the execution.

Yet, the problem of exhaustively locating all apparent data-races still remains NP-hard.

Reminder: NP and Co-NP

There is a set of NP problems for which no polynomial-time solution is known, while an exponential-time solution exists.

A problem is NP-hard if every problem in NP can be reduced to it in polynomial time. A problem is NP-complete if, in addition, it resides in NP.

Intuitively, if the answer to the problem can only be ‘yes’ or ‘no’, a ‘yes’ answer can be confirmed once a suitable witness is found (and we can stop), while confirming a ‘no’ answer may require exhausting all candidates (no polynomial-time method is known).

Reminder: NP and Co-NP Cont.

There is also the set of Co-NP problems: the problems whose complements are in NP.

Intuitively, for a Co-NP problem with answers ‘yes’ or ‘no’, it is a ‘no’ answer that can be confirmed once a suitable witness is found.

Every problem in P (i.e. with a polynomial solution) is in both NP and Co-NP; whether every problem in both NP and Co-NP is in P is an open question.

The problem of checking whether a boolean formula is satisfiable is NP-complete (answer ‘yes’ if a satisfying assignment of the variables is found).

The same problem, but checking that the formula is not satisfiable, is Co-NP-complete.

Why Data-Race Detection is NP-Hard?

How can we know that in a program P two accesses, a and b, to the same shared variable are concurrent?

Intuitively – we must check all execution orders of P and see. If we discover an execution order in which a and b are concurrent, we can report a data-race and stop. Otherwise we should continue checking.

Program Execution Model

Consider a class of multi-threaded programs that synchronize by counting semaphores.

Program execution is described by a collection of events and two relations over the events.

Synchronization event – an instance of some synchronization operation (e.g. signal, wait).

Computation event – an instance of a group of statements in the same thread, none of which are synchronization operations (e.g. x=x+1).

Program Execution Model – Events’ Relations

Temporal ordering relation – a T→ b means that a completes before b begins (i.e. the last action of a can affect the first action of b).

Shared data dependence relation – a D→ b means that a accesses a shared variable that b later accesses, and at least one of the accesses is a modification of the variable. It indicates when one event causally affects another.

Program Execution Model – Program Execution

Program execution P – a triple <E,T→,D→>, where E is a finite set of events, and T→ and D→ are the above relations that satisfy the following axioms:
A1: T→ is an irreflexive partial order (a T↛ a).
A2: If a T→ b T↮ c T→ d then a T→ d.
A3: If a D→ b then b T↛ a.

Notes: a ↛ b is a shorthand for ¬(a→b). a ↮ b is a shorthand for ¬(a→b)⋀¬(b→a). Notice that A1 and A2 imply transitivity of the T→ relation.

Program Execution Model – Feasible Program Execution

Feasible program execution for P – an execution of the program that performs exactly the same events as P, but may exhibit a different temporal ordering.

Definition: P’=<E’,T’→,D’→> is a feasible program execution for P=<E,T→,D→> (i.e. it could potentially have occurred) if:
F1: E’=E (i.e. exactly the same events), and
F2: P’ satisfies the axioms A1 - A3 of the model, and
F3: a D→ b ⇒ a D’→ b (i.e. the same data dependencies).

Note: Any execution that exhibits the same shared-data dependencies as P will execute exactly the same events as P.

Program Execution Model – Ordering Relations

Given a program execution, P=<E,T→,D→>, and the set, F(P), of feasible program executions for P, the following relations (which summarize the temporal orderings present in the feasible program executions) are defined:

Happened-Before:
    Must-Have: a MHB→ b ⇔ ∀<E’,T’→,D’→>∈F(P), a T’→ b
    Could-Have: a CHB→ b ⇔ ∃<E’,T’→,D’→>∈F(P), a T’→ b

Concurrent-With:
    Must-Have: a MCW↔ b ⇔ ∀<E’,T’→,D’→>∈F(P), a T’↮ b
    Could-Have: a CCW↔ b ⇔ ∃<E’,T’→,D’→>∈F(P), a T’↮ b

Ordered-With:
    Must-Have: a MOW↔ b ⇔ ∀<E’,T’→,D’→>∈F(P), ¬(a T’↮ b)
    Could-Have: a COW↔ b ⇔ ∃<E’,T’→,D’→>∈F(P), ¬(a T’↮ b)

Program Execution Model – Ordering Relations – Explanation

The must-have relations describe orderings that are guaranteed to be present in all feasible program executions in F(P).

The could-have relations describe orderings that could potentially occur in at least one of the feasible program executions in F(P).

The happened-before relations show events that execute in a specific order, the concurrent-with relations show events that execute concurrently, and the ordered-with relations show events that execute in either order but not concurrently.

Complexity of Computing Ordering Relations

The problem of computing any of the must-have ordering relations (MHB, MCW, MOW) is Co-NP-hard and the problem of computing any of the could-have relations (CHB, CCW, COW) is NP-hard.

Theorem 1: Given a program execution, P=<E,T→,D→>, that uses counting semaphores, the problem of deciding whether a MHB→ b, a MCW↔ b or a MOW↔ b (any of the must-have orderings) is Co-NP-hard.


Proof of Theorem 1 – Notes

The presented proof is only for the must-have-happened-before (MHB) relation. Proofs for the other relations are analogous.

The proof is a reduction from 3CNFSAT such that a given boolean formula is not satisfiable iff a MHB→ b for two events, a and b, defined in the reduction.

The problem of checking whether a 3CNFSAT formula is not satisfiable is Co-NP-complete.

The proof can also be extended to programs that use binary semaphores, event-style synchronization and other synchronization primitives (and even a single counting semaphore).

Proof of Theorem 1 – 3CNFSAT

An instance of 3CNFSAT is given by:
- A set of n variables, V={X1,X2, …,Xn}.
- A boolean formula B consisting of a conjunction of m clauses, B=C1⋀C2⋀…⋀Cm.
- Each clause Cj=(L1⋁L2⋁L3) is a disjunction of three literals.
- Each literal Lk is any variable from V or its negation – Lk=Xi or Lk=⌐Xi.

Example:
B=(X1⋁X2⋁⌐X3)⋀(⌐X2⋁⌐X5⋁X6)⋀(X1⋁X4⋁⌐X5)
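For intuition about the two sides of the reduction, here is a brute-force Python sketch that decides 3CNFSAT by trying all 2^n assignments. Encoding literals as signed integers (as in the DIMACS format) is our own choice and not part of the construction. Answering “satisfiable” needs only one witness; answering “not satisfiable”, the Co-NP side used in the proof, requires ruling out every assignment.

```python
from itertools import product

# The example formula B in DIMACS-style notation: literal i stands for Xi, -i for ⌐Xi.
B = [(1, 2, -3), (-2, -5, 6), (1, 4, -5)]


def literal_value(lit, values):
    """Truth value of a literal under an assignment (values[0] is X1)."""
    v = values[abs(lit) - 1]
    return v if lit > 0 else not v


def satisfying_assignment(clauses, n):
    """Try all 2**n assignments; return the first satisfying one, or None."""
    for values in product([False, True], repeat=n):
        if all(any(literal_value(l, values) for l in clause) for clause in clauses):
            return values
    return None


print(satisfying_assignment(B, 6))  # all-False already satisfies every clause of B
print(satisfying_assignment([(1, 1, 1), (-1, -1, -1)], 1))  # None: X1 ⋀ ⌐X1
```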

Proof of Theorem 1 – Idea of the Proof

Given a 3CNFSAT formula, B, we construct a program consisting of 3n+3m+2 threads which use 3n+m+1 semaphores (assumed to be initialized to 0).

The execution of this program simulates a nondeterministic evaluation of B.

Semaphores are used to represent the truth values of each variable and clause.

The execution exhibits certain orderings iff B is not satisfiable.

Proof of Theorem 1 – The Construction per Variable

For each variable, Xi, the following three threads are constructed:

First thread:
    wait( Ai )
    signal( Xi )
    . . .
    signal( Xi )

Second thread:
    wait( Ai )
    signal( not-Xi )
    . . .
    signal( not-Xi )

Third thread:
    signal( Ai )
    wait( Pass2 )
    signal( Ai )

“. . .” indicates as many signal(Xi) (or signal(not-Xi)) operations as the number of occurrences of the literal Xi (or ⌐Xi) in the formula B.

Proof of Theorem 1 – The Construction per Variable

The semaphores Xi and not-Xi are used to represent the truth value of variable Xi.

Signaling the semaphore Xi (or not-Xi) represents the assignment of True (or False) to variable Xi.

The assignment is accomplished by allowing either signal(Xi) or signal(not-Xi) to proceed, but not both (due to the concurrent wait(Ai) operations in the first two threads, while Ai is signaled only once before Pass2 is signaled).

Proof of Theorem 1 – The Construction per Clause

For each clause, Cj, the following three threads are constructed:

First thread:
    wait( L1 )
    signal( Cj )

Second thread:
    wait( L2 )
    signal( Cj )

Third thread:
    wait( L3 )
    signal( Cj )

L1, L2 and L3 are the semaphores corresponding to the literals in clause Cj (i.e. Xi or not-Xi).

The semaphore Cj represents the truth value of clause Cj. It is signaled iff the truth assignment to the variables causes the clause Cj to evaluate to True.

Proof of Theorem 1 – Explanation of Construction

The first 3n threads operate in two phases:
- The first pass is a non-deterministic guessing phase in which each variable used in the boolean formula B is assigned a unique truth value. Only one of the Xi and not-Xi semaphores is signaled.
- The second pass, which begins after semaphore Pass2 is signaled, is used to ensure that the program doesn’t deadlock – the semaphore operations that were not allowed to execute during the first pass are allowed to proceed.

Proof of Theorem 1 – The Final Construction

Two additional threads are created:

First thread:
    wait( C1 )
    . . .
    wait( Cm )
    b: skip

Second thread:
    a: skip
    signal( Pass2 )
    . . .
    signal( Pass2 )

There are m ‘wait(Cj)’ operations – one for each clause.

There are n ‘signal(Pass2)’ operations – one for each variable.

Proof of Theorem 1 – Putting All Together

Event b is reached only after semaphore Cj, for each clause j, has been signaled.

Since the program contains no conditional statements or shared variables, every execution of the program executes the same events and exhibits the same shared-data dependencies (i.e. none).

Claim: For any execution a MHB→ b iff B is not satisfiable.
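The claim can be checked mechanically on very small formulas. The Python sketch below builds the 3n+3m+2 threads of the construction and, treating executions as interleavings of atomic semaphore operations (all semaphores start at 0 and wait blocks at 0), searches exhaustively for an execution in which b runs while a has not yet run. The function names and the literal encoding are illustrative assumptions, and the search is exponential, as one would expect for a Co-NP-hard question.

```python
from collections import Counter


def sem(lit):
    """Semaphore name for a literal: Xi for i > 0, not-Xi for -i."""
    return f"X{lit}" if lit > 0 else f"not-X{-lit}"


def build_program(clauses, n):
    """Threads of the reduction; each op is ("wait"|"signal", semaphore) or ("event", name)."""
    occ = Counter(lit for clause in clauses for lit in clause)
    threads = []
    for i in range(1, n + 1):
        threads.append([("wait", f"A{i}")] + [("signal", sem(i))] * occ[i])
        threads.append([("wait", f"A{i}")] + [("signal", sem(-i))] * occ[-i])
        threads.append([("signal", f"A{i}"), ("wait", "Pass2"), ("signal", f"A{i}")])
    for j, clause in enumerate(clauses, start=1):
        for lit in clause:
            threads.append([("wait", sem(lit)), ("signal", f"C{j}")])
    threads.append([("wait", f"C{j}") for j in range(1, len(clauses) + 1)] + [("event", "b")])
    threads.append([("event", "a")] + [("signal", "Pass2")] * n)
    return threads


def b_can_precede_a(threads):
    """Explore every interleaving in which the thread holding a never starts.
    Reaching b in such an interleaving means b can execute before a."""
    b_thread, a_thread = len(threads) - 2, len(threads) - 1
    start = (tuple(0 for _ in threads), ())
    seen, stack = {start}, [start]
    while stack:
        pcs, sem_items = stack.pop()
        if pcs[b_thread] == len(threads[b_thread]):
            return True  # b executed while a has not
        sems = dict(sem_items)
        for t, ops in enumerate(threads):
            if t == a_thread or pcs[t] == len(ops):
                continue  # a is held back; finished threads cannot move
            kind, name = ops[pcs[t]]
            new_sems = dict(sems)
            if kind == "wait":
                if new_sems.get(name, 0) == 0:
                    continue  # wait() would block
                new_sems[name] -= 1
                if new_sems[name] == 0:
                    del new_sems[name]  # keep the state canonical
            elif kind == "signal":
                new_sems[name] = new_sems.get(name, 0) + 1
            new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
            state = (new_pcs, tuple(sorted(new_sems.items())))
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return False


def a_mhb_b(clauses, n):
    return not b_can_precede_a(build_program(clauses, n))


print(a_mhb_b([(1, 1, 1)], 1))                # False: (X1⋁X1⋁X1) is satisfiable
print(a_mhb_b([(1, 1, 1), (-1, -1, -1)], 1))  # True: X1 ⋀ ⌐X1 is not satisfiable
```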


Proof of Theorem 1 – Proving the “if” Part

Assume that B is not satisfiable. Then there is always some clause, Cj, that is not satisfied by the truth values guessed during the first pass. Thus, no signal(Cj) operation is performed during the first pass.

Event b can’t execute until this signal(Cj) operation is performed, which can then only be done during the second pass.

The second pass doesn’t occur until after event a executes, so event a must precede event b.

Therefore, a MHB→ b.

Proof of Theorem 1 – Proving the “only if” Part

Assume that a MHB→ b. This means that there is no execution in which b either precedes a or executes concurrently with a.

Assume by way of contradiction that B is satisfiable. Then some truth assignment that satisfies all of the clauses can be guessed during the first pass.

Event b can then execute before event a, contradicting the assumption.

Therefore, B is not satisfiable.

Complexity of Computing Ordering Relations – Cont.

Since a MHB→ b iff B is not satisfiable, the problem of deciding a MHB→ b is Co-NP-hard.

By similar reductions, programs can be constructed such that the non-satisfiability of B can be determined from the MCW or MOW relations. The problem of deciding these relations is therefore also Co-NP-hard.

Theorem 2: Given a program execution, P=<E,T→,D→>, that uses counting semaphores, the problem of deciding whether a CHB→ b, a CCW↔ b or a COW↔ b (any of the could-have orderings) is NP-hard.

Proof by similar reductions …


Complexity of Race Detection – Conditions, Loops and Input

The presented model is too simplistic. What if conditional statements, like “if” and “while”, are used? What if input from the user is allowed?

Thread 1:
    Y = ReadFromInput( );
    while ( Y < 0 ) Print( Y );
    X--;

Thread 2:
    X++;

If Y≥0 there is a data-race on X. Otherwise it is not possible, since ‘X--’ is never reached.

Complexity of Race Detection – “NP-Harder”?

The proof above does not use conditional statements, loops or input from outside. This suggests that the problem of data-race detection may be even harder than deciding an NP-complete problem.

With loops and recursion, we do not know whether potentially concurrent accesses will indeed be executed, so the question becomes equivalent to the halting problem.

Thus, in the general case, race detection is undecidable.

So How Data-Races Can be Detected? – Approximations

Since it is an intractable problem to decide whether a CHB→ b or a CCW↔ b (needed to detect feasible data-races), the temporal ordering relation T→ should be approximated and apparent data-races located instead.

Recall that apparent data-races exist if and only if at least one feasible race exists.

Yet, it remains a hard problem to locate all apparent data-races.

Approximation Example – Lamport’s Happens-Before

The happens-before partial order, denoted hb→, is defined for access events (reads, writes, releases and acquires) that happen in a specific execution, as follows:
- Program Order: If a and b are events performed by the same thread, with a preceding b in program order, then a hb→ b.
- Release and Acquire: Let a be a release and b be an acquire. If a and b take part in the same synchronization event, then a hb→ b.
- Transitivity: If a hb→ b and b hb→ c, then a hb→ c.

Shared accesses a and b are concurrent (denoted by a hb↮ b) if neither a hb→ b nor b hb→ a holds.
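For a recorded trace, hb→ can be computed directly from these three rules. The Python sketch below is a naive O(n³) illustration, not DJIT: it assumes lock-based synchronization, pairs each acquire of a lock with the most recent release of that lock in the trace, closes the relation transitively, and reports conflicting accesses from different threads that are concurrent. The trace format and the function names are hypothetical.

```python
def happens_before(trace):
    """trace: list of (thread, kind, target) in execution order, kind in
    {"read", "write", "acq", "rel"}. Returns the set of index pairs (i, j) with i hb-> j."""
    hb = set()
    last_in_thread = {}  # thread -> index of its previous event
    last_release = {}    # lock -> index of its most recent release
    for j, (t, kind, target) in enumerate(trace):
        if t in last_in_thread:                       # program order
            hb.add((last_in_thread[t], j))
        if kind == "acq" and target in last_release:  # release -> matching acquire
            hb.add((last_release[target], j))
        if kind == "rel":
            last_release[target] = j
        last_in_thread[t] = j
    n = len(trace)
    for k in range(n):                                # transitivity (Warshall closure)
        for i in range(n):
            if (i, k) in hb:
                for j in range(n):
                    if (k, j) in hb:
                        hb.add((i, j))
    return hb


def apparent_races(trace):
    """Pairs of conflicting accesses from different threads not ordered by hb->."""
    hb = happens_before(trace)
    races = []
    for i, (ti, ki, vi) in enumerate(trace):
        for j in range(i + 1, len(trace)):
            tj, kj, vj = trace[j]
            conflicting = {ki, kj} <= {"read", "write"} and "write" in (ki, kj)
            if conflicting and vi == vj and ti != tj and (i, j) not in hb:
                races.append((i, j))
    return races


trace = [
    (1, "acq", "m"), (1, "write", "X"), (1, "rel", "m"),
    (2, "acq", "m"), (2, "write", "X"), (2, "rel", "m"),
    (1, "write", "Y"), (2, "read", "Y"),
]
print(apparent_races(trace))  # [(6, 7)]: only the accesses to Y are concurrent
```

The two writes to X are ordered through the release and acquire of m, so only the unprotected accesses to Y are reported.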

Approaches to Detection of Apparent Data-Races – Static

There are two main approaches to the detection of apparent data-races (sometimes a combination of both is used):

Static Methods – perform a compile-time analysis of the code.
– Too conservative. They can’t know or understand the semantics of the program, and result in an excessive number of false alarms that hide the real data-races.
+ Test the program globally – they see the full code of the tested program and can warn about all possible errors in all possible executions.

Approaches to Detection of Apparent Data-Races – Dynamic

Dynamic Methods – use a tracing mechanism to detect whether a particular execution of a program actually exhibited data-races.
+ Detect only those apparent data-races that occur during a feasible execution.
– Test the program locally – consider only one specific execution path of the program each time.

Post-Mortem Methods – after the execution terminates, analyze the trace of the run and warn about possible data-races that were found.

On-The-Fly Methods – buffer partial trace information in memory, analyze it and detect races as they occur.

Approaches to Detection of Apparent Data-Races

No “silver bullet” exists.

The accuracy is of great importance (especially in large programs).

Yet, there is always a tradeoff between the number of false negatives (undetected races) and false positives (false alarms).

The space and time overheads imposed by the techniques are significant as well.

Closer Look at Dynamic Methods

We will see two dynamic methods for on-the-fly detection of apparent data-races in lock-based multi-threaded programs:
- DJIT – based on Lamport’s happens-before partial order relation and Mattern’s virtual time (vector clocks). Implemented in the Millipede and Multipage systems.
- Lockset – based on a locking discipline and lockset refinement. Implemented in the Eraser tool.

DJIT (1) Description

Detects the first apparent data-race in a program when it actually occurs.

It is enough to announce only the very first data-race, since later races can be after-effects of the first one.

After the race (or its cause) is fixed, the search for other races can proceed.

The main disadvantage of the technique is that it is highly dependent on the scheduling order.

DJIT(2) Logical Token

Observation – each synchronization event involves some logical token.

The token is released by one set of threads that reach a certain point in their execution and is acquired by another set of threads.

Once all the members of the corresponding releasing set have released their tokens, the members of the acquiring set are allowed to proceed with their execution.

DJIT(3) Local Time Frames

The execution of each thread is split into a sequence of time frames.

A new time frame starts on each release.

Note that, according to the above observation concerning logical tokens, Lock corresponds to an Acquire (or acq) and Unlock corresponds to a Release (or rel).

Thread          TF
X = 1           1
Lock( m1 )
Z = 2           1
Lock( m2 )
Y = 3           1
Unlock( m2 )
Z = 4           2
Unlock( m1 )
X = 5           3
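The frame numbering in this table follows mechanically from the rule that a new time frame starts on each release. A minimal Python sketch, assuming the thread’s operations are given as strings in program order (a representation chosen only for this illustration):

```python
def local_time_frames(ops):
    """Assign DJIT local time frames to one thread's operations in program order.
    Frames start at 1 and a new frame starts on each release (Unlock); Lock and
    Unlock themselves are reported with frame None."""
    frame, result = 1, []
    for op in ops:
        if op.startswith("Unlock"):
            result.append((op, None))
            frame += 1
        elif op.startswith("Lock"):
            result.append((op, None))
        else:
            result.append((op, frame))
    return result


thread = ["X = 1", "Lock( m1 )", "Z = 2", "Lock( m2 )", "Y = 3",
          "Unlock( m2 )", "Z = 4", "Unlock( m1 )", "X = 5"]
for op, tf in local_time_frames(thread):
    print(f"{op:<14}{'' if tf is None else tf}")
```

Acquires do not start a new frame, which is why Lock( m1 ) and Lock( m2 ) leave the frame at 1.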

DJIT(4) Local Time Frames

Claim 1: Let a in thread ta and b in thread tb be two accesses, where a occurs at time frame Ta, and the release in ta, corresponding to the latest acquire in tb which precedes b, occurs at time frame Tsync in ta. Then a hb→ b iff Ta < Tsync.

[Figure: possible sequence of release-acquire between thread ta (access a in time frame Ta; releases in time frames Trelease and Tsync, the latter being rel(m)) and thread tb (acq(m) preceding access b).]

DJIT(5) Local Time Frames

Proof:
- If Ta < Tsync then a hb→ release, and since release hb→ acquire and acquire hb→ b, we get a hb→ b.
- If a hb→ b, and since a and b are in distinct threads, then by definition there exists a pair of corresponding release and acquire, so that a hb→ release and acquire hb→ b. It follows that Ta < Trelease ≤ Tsync.

DJIT(6) Vector Time Frames (VTF)

For each thread t a vector st_t[.] exists, whose size is the maximum number of threads (maxthreads).

st_t[t] is the local time frame of thread t. It is advanced by one on every ‘release’ made by thread t.

st_t[u] stores the latest local time frame of u whose release is known by t (to have happened before t’s latest acquire).

If u is an acquirer of t’s release, then u’s vector is updated in the following way:

for k = 0 to maxthreads - 1
    st_u[k] = max( st_u[k], st_t[k] )

DJIT(7) Vector Time Frames

In such a way, the vector of u is notified of:
- The latest time frame of t.
- The latest time frames of other threads, according to the knowledge of t.

Note that a thread can learn about a release performed by another thread through “gossip”, when this information is transferred through a chain of corresponding release-acquire pairs.

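DJIT(8) Vector Time Frames

Each thread starts with the vector (1 1 1); a vector written after an operation is the thread’s vector after that operation.

Thread 1: write X; release( m1 ) (2 1 1); read Z
Thread 2: acquire( m1 ) (2 1 1); read Y; release( m2 ) (2 2 1); write X
Thread 3: acquire( m2 ) (2 2 1); write X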
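The example can be replayed with the update rule from DJIT(6). In this Python sketch, whose names and event encoding are illustrative assumptions, vectors start at (1 1 1), a release increments the releasing thread’s own entry, and an acquire takes the element-wise maximum with the vector left by the matching release. One global order consistent with the example is assumed.

```python
def replay_vtf(n_threads, events):
    """Apply the DJIT vector-time-frame rules to events given in a global order.
    events: (thread, op, lock) with threads numbered 1..n_threads and op in
    {"release", "acquire", "access"}. Returns each thread's vector after each event."""
    st = {t: [1] * n_threads for t in range(1, n_threads + 1)}
    released = {}  # lock -> vector of the releasing thread just after its release
    history = []
    for t, op, lock in events:
        if op == "release":
            st[t][t - 1] += 1               # the release starts a new local time frame
            released[lock] = list(st[t])
        elif op == "acquire" and lock in released:
            # st_u[k] = max(st_u[k], st_t[k]) for every k
            st[t] = [max(mine, theirs) for mine, theirs in zip(st[t], released[lock])]
        history.append((t, op, tuple(st[t])))
    return history


events = [
    (1, "access", None), (1, "release", "m1"), (1, "access", None),    # write X, release( m1 ), read Z
    (2, "acquire", "m1"), (2, "access", None), (2, "release", "m2"),   # acquire( m1 ), read Y, release( m2 )
    (2, "access", None), (3, "acquire", "m2"), (3, "access", None),    # write X; acquire( m2 ), write X
]
for t, op, vector in replay_vtf(3, events):
    print(t, op, vector)
```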

DJIT(9) Vector Time Frames

Claim 2: Let a and b be two accesses in respective threads ta and tb, which happened during respective local time frames Ta and Tb. Let f denote the value of st_tb[ta] at the time when b occurs. Then a hb→ b iff Ta < f.

[Figure: a chain of release-acquire pairs from thread ta (access a in time frame Ta, then rel) through an intermediate thread tc (acq, then rel) to thread tb (acq, then access b in time frame Tb).]

DJIT(10) Vector Time Frames

Proof:
- If a hb→ b, and since a and b are in distinct threads, then there exists a chain of releases and corresponding acquires, with the first release in ta and the last acquire in tb, such that a hb→ first release and first release hb→ last acquire. The information on ta’s local time frame is transferred through that chain, reaches tb and is stored in st_tb[ta] (=f). Thus it follows that Ta < T_first release ≤ f.
- If Ta < f then there is a sequence of corresponding release-acquire pairs which transfers the local time frame from ta to tb, finally resulting in tb “hearing” that ta entered a time frame which is later than Ta. This same sequence can be used to transitively apply the hb→ relation from a to b.

DJIT(11) Sequential Consistency

The proposed algorithm assumes a sequential consistency model (SC), which is common in multi-threaded environments.

This means that there exists a global order, R, on all the events in the execution, where R conforms to the view of all processes, and all reads see the most recently written values.

The definition of the hb→ partial relation is consistent with R, in the sense that if a hb→ b then a precedes b in R (otherwise an acquire could precede its corresponding release in the global order, contradicting the view of the acquirer).

DJIT(12) Data-Race Detection Using VTF

Theorem 1: Let a and b be two accesses to the same shared variable in respective threads ta and tb during respective local time frames Ta and Tb. Suppose that at least one of a or b is a write. Assume that a is performed in the global order R prior to b and that it doesn’t constitute a data-race with any of the preceding accesses in R. Then a and b form a data-race iff at the time when b occurs it holds that st_tb[ta] ≤ Ta.

DJIT(13) Data-Race Detection Using VTF

Proof:
- If st_tb[ta] ≤ Ta then, by Claim 2, a hb→ b doesn’t hold. Since a precedes b in R, it cannot hold that b hb→ a. Thus a and b are concurrent and form a data-race (since at least one of them is a write).
- If a and b form a data-race then a hb→ b doesn’t hold. Thus, by Claim 2, st_tb[ta] ≤ Ta.

DJIT(14) Predicate for Data-Race Detection

The algorithmic aspect of Theorem 1 is encapsulated in the following predicate P:

P(a,b) ≜ ( a.type = write ⋁ b.type = write ) ⋀ ( a.time_frame ≥ st_{b.thread_id}[a.thread_id] )

P gets two accesses, a and b, to the same shared variable, where a occurred earlier (according to the global order R) and b has just been performed.

P returns True iff a and b form a data-race.
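A direct transcription of P as a hedged Python sketch: the Access record mirrors the predicate’s field names but is otherwise our own, threads 1..3 of the DJIT(8) example are mapped to indices 0..2, and the sample values assume a global order in which Thread 2’s write X is performed before Thread 3’s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    thread_id: int   # 0-based index into the vector time frames
    type: str        # "read" or "write"
    time_frame: int  # local time frame of the accessing thread


def P(a, b, st_b):
    """a occurred earlier in R, b was just performed by a thread whose vector is st_b.
    True iff a and b form a data-race (Theorem 1)."""
    return (a.type == "write" or b.type == "write") and a.time_frame >= st_b[a.thread_id]


t1_write = Access(0, "write", 1)   # Thread 1: write X in frame 1
t2_write = Access(1, "write", 2)   # Thread 2: write X in frame 2, after release( m2 )
t3_write = Access(2, "write", 1)   # Thread 3: write X after acquire( m2 )
st_3 = (2, 2, 1)                   # Thread 3's vector when it writes X

print(P(t1_write, t3_write, st_3))  # False: ordered through m1 and m2
print(P(t2_write, t3_write, st_3))  # True: not ordered, a data-race
```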

DJIT(15) Which Accesses to Check?

We have assumed that there is a logging mechanism, which records all accesses.

Logging all accesses in all threads and testing the predicate P for each pair of them will impose a great overhead on the system.

Actually some of the accesses can be discarded.

Claim 3: Consider an access a in thread ta during time frame Ta, and accesses b and c in thread tb=tc during time frame Tb=Tc. Assume that c precedes b in the program order. If a and b are concurrent, then a and c are concurrent as well.

[Figure: two cases of access a in thread ta (time frame Ta) relative to thread tb, which performs a release followed by accesses c and b within one time frame (Tc = Tb).]

Proof:
- Let fb and fc denote the respective values of st_tb[ta] when b and c happen. Since st_tb[ta] is monotonically increasing, and c precedes b, we know that fb ≥ fc. Since a hb→ b does not hold, we know by Claim 2 that Ta ≥ fb. Thus, Ta ≥ fc, and again by Claim 2 we get that a hb→ c is false.
- Let fa denote the value of st_ta[tb] when a happens. Since b hb→ a does not hold, we know by Claim 2 that Tb ≥ fa. Since Tb=Tc we get that Tc ≥ fa. Thus, by Claim 2, c hb→ a is false.


Recall that we are interested in recording only the first apparent data race which occurs during the execution.

Claim 3 implies that for this purpose, it is sufficient to record only the first read access and the first write access to a variable in each time frame.

In addition it’s sufficient to apply the predicate P to pairs of accesses which are the first in their respective time frames.
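The recording rule can be sketched in a few lines of Python. The operation encoding and the sample thread below are hypothetical; only the rule itself (record the first read and the first write of a variable in each time frame, where a release starts a new frame) comes from the text.

```python
def accesses_to_log(ops):
    """ops: one thread's operations in program order, e.g. ("write", "X") or ("release", "m").
    Returns the indices of the accesses to record: the first read and the first write
    of each variable in each local time frame (a release starts a new frame)."""
    frame, seen, logged = 1, set(), []
    for i, (op, target) in enumerate(ops):
        if op == "release":
            frame += 1
        elif op in ("read", "write") and (frame, op, target) not in seen:
            seen.add((frame, op, target))
            logged.append(i)
    return logged


thread_ops = [("acquire", "m"), ("write", "X"), ("read", "X"), ("write", "X"), ("release", "m"),
              ("acquire", "m"), ("write", "X"), ("write", "X"), ("release", "m")]
print(accesses_to_log(thread_ops))  # [1, 2, 6]
```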


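[Figure: Thread 1 and Thread 2 access X; the operation sequences shown are “acquire( m ) write X read X write X release( m )”, “read X”, “acquire( m ) write X write X release( m )” and “acquire( m ) read X write X write X release( m )”, with a data-race (DR) marked. Only the accesses marked with ‘!!!’ (the first read and the first write in each time frame) are checked.]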

DJIT(20) Which Time Frames to Check?

Assume that in thread ta an access a occurs, and thread tb = tc performed a previous (according to the global order R) access b in time frame Tb and another previous access c in time frame Tc, so that Tb < Tc.

[Figure: thread ta with access a in time frame Ta; thread tb = tc with access b in time frame Tb and, after a release, access c in time frame Tc.]


We want to find only the very first data-race, when it actually occurs (assuming that all previous accesses didn’t form a data-race).

Claim 4: If a is concurrent with b then it is certainly concurrent with c.

Proof: Easy, since Tc > Tb ≥ st_ta[tb] = st_ta[tc]. Thus, either pair (a-b or a-c) can be considered to be the first apparent data-race (since there were no races until a occurred).

This also means that if there is no race between a and c, then there is also no data-race between a and b. Therefore, the pair a-b need not be checked.


We want to support the common SWMR (Single-Writer/Multiple-Readers) semantics, allowing concurrent reads but not writes.

Thus, developing the observation above, we need to check the current write access to a shared variable v against the last time frame in each of the other threads that recently read from v, and against the last time frame in the thread that most recently wrote to v.

For the current read access to v, it is enough to check it against the last time frame in the thread that most recently wrote to v.


More formally – let a be the current access to a shared variable v in thread ta:
- If there was a prior write to v in ta, and since that write there were no accesses to v in other threads, then there is no need to check anything.
- If there was a prior write to v in another thread tb (according to the global order R), and since that write there were no accesses to v in threads other than ta and tb, then it’s sufficient to check a only against the latest access to v in tb (since otherwise we would have found the race earlier, according to Claims 3 & 4).


- If there were prior reads from v in other threads t1, t2,…,tk (according to the global order R), then, if a is a write, it should be checked against each of the most recent reads in t1, t2,…,tk.
- If a is a read then it should be checked against the most recent write to v (according to R).

DJIT(25) Access History

Applying the above observations (concerning which accesses to check and which time frames to check), it is easy to see that the cost of checking whether a given access races with previous accesses is small.

Each variable v holds, for each of the threads, the last time frame in which that thread read from v, and the last time frame in which any of the threads wrote to v. The IDs of the accessing threads are saved as well.

DJIT(26) Access History

On each first read and first write to v in a time frame, every thread updates the access history of v.

If the access to variable v is a read, the thread checks the recent write to v.

If the access is a write, the thread checks all reads from v by other threads and the recent write to v.

[Figure: access history of v – the time frames of recent reads from v, one for each thread (tf1/id1, tf2/id2, ..., tfn/idn), and the time frame of the recent write to v (tfk/idk).]
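Putting the pieces together, the following Python sketch keeps, per variable, the access history described above (the latest read frame of each thread, and the latest write together with its thread ID), maintains vector time frames on acquire and release, applies the predicate P, and stops at the first race. It is an illustration under simplifying assumptions, not the Millipede or Multipage implementation: events arrive in a single global order, threads are numbered from 0, and the history is updated on every access rather than only on the first read and first write of each time frame, which records the same time frames. The event format and function name are hypothetical.

```python
def djit_first_race(n_threads, events):
    """events: (thread, op, target) in the global order R, threads numbered 0..n_threads-1,
    op in {"read", "write", "acquire", "release"}.
    Returns (earlier access, current access, event index) for the first race, or None."""
    st = [[1] * n_threads for _ in range(n_threads)]  # vector time frames
    released = {}   # lock -> vector left by its most recent release
    reads = {}      # var -> {thread: time frame of that thread's latest read}
    writes = {}     # var -> (thread, time frame) of the latest write
    for i, (t, op, x) in enumerate(events):
        if op == "release":
            st[t][t] += 1
            released[x] = list(st[t])
        elif op == "acquire":
            if x in released:
                st[t] = [max(p, q) for p, q in zip(st[t], released[x])]
        else:
            tf = st[t][t]
            earlier = []
            if x in writes and writes[x][0] != t:        # check the recent write
                earlier.append(("write",) + writes[x])
            if op == "write":                            # a write also checks other threads' reads
                earlier += [("read", u, f) for u, f in reads.get(x, {}).items() if u != t]
            for kind, u, f in earlier:
                if f >= st[t][u]:                        # predicate P
                    return (kind, u, f), (op, t, tf), i
            if op == "read":
                reads.setdefault(x, {})[t] = tf
            else:
                writes[x] = (t, tf)
    return None


events = [
    (0, "write", "X"), (0, "release", "m1"), (0, "read", "Z"),
    (1, "acquire", "m1"), (1, "read", "Y"), (1, "release", "m2"), (1, "write", "X"),
    (2, "acquire", "m2"), (2, "write", "X"),
]
print(djit_first_race(3, events))  # (('write', 1, 2), ('write', 2, 1), 8)
```

On the DJIT(8) example this reports the unsynchronized write X of Thread 2 (index 1, frame 2) against the write X of Thread 3; moving Thread 2’s write before its release( m2 ) removes the race.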

DJIT(27) Coherency

Actually, the presented algorithm uses only coherency guarantees.

Coherency means that for each variable v there is a global order, Rv, on all operations performed on it.

Hence, the algorithm described above is correct also for coherent systems, which are not necessarily sequentially consistent.

In fact, the algorithm may also be applied to systems with even more relaxed consistency (a.k.a. weakly ordered systems).

Example:
Thread 1: write v1, 1; write v2, 2
Thread 2: read v2, 2; read v1, 0

The history is coherent, but not sequentially consistent.

DJIT(28) “First” Apparent Data-Race

Note that if a and b race each other, then a might also race with accesses that occurred in tb prior to b (as shown in the example of Claim 4).

It is impossible to find these data-races before a occurs.

By the definitions, although the corresponding accesses in tb precede b, their races with a occur simultaneously with the race of b and a, and thus are not considered “earlier”.

The definitions can be refined, defining the first apparent data-race to be the first access in tb with which a apparently races.

This will clearly require a bigger access history.

DJIT(29)Why Only “First Data-Race”?

Where in the proofs did we use the fact that there were no prior data-races?

Consider the following example:

Since the access history for each variable consists of only one recent write, the data-race [1]-[3] is not detected (though the accesses are concurrent).

This is due to a prior race [1]-[2] and the fact that [2] and [3] are in the same thread.

Hint: to locate more than the first data-race for each variable, the write history should contain the last write time frames of all other threads (and not only the most recent write).

Thread 1 Thread 2

write X[1]

write X[2]

release(m)write X[3]
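The hint can be illustrated with a hypothetical sketch. It replays the example above against two write-history designs: one that keeps only the most recent write, and one that keeps the last write time frame of every thread. Thread 1 and Thread 2 are numbered 0 and 1. Following the text, [2] and [3] are placed in Thread 2, and release(m) is assumed to be Thread 2's; the result does not change if Thread 1 performs it, since no thread acquires m afterward.

The function detect, the trace format, and the vector-clock conventions are assumptions. Reads are omitted because the hint concerns only the write history. Every write is checked, not only the first in its time frame.

```python
def detect(n_threads, trace, per_thread_writes):
    # V[t][u]: latest time frame of thread u known to precede t's current point
    V = [[1 if u == t else 0 for u in range(n_threads)] for t in range(n_threads)]
    lock_vc = {}  # lock -> vector stored at its last release
    writes = {}   # var -> {thread: time frame of that thread's last write}
    races = []    # (var, earlier thread, later thread, trace index of later access)
    for i, (op, t, x) in enumerate(trace):
        if op == "acquire":
            saved = lock_vc.get(x, [0] * n_threads)
            V[t] = [max(a, b) for a, b in zip(V[t], saved)]
        elif op == "release":
            lock_vc[x] = list(V[t])
            V[t][t] += 1  # start a new time frame
        elif op == "write":
            hist = writes.setdefault(x, {})
            for u, tf in hist.items():
                if u != t and V[t][u] < tf:
                    races.append((x, u, t, i))
            if not per_thread_writes:
                hist.clear()  # keep only the most recent write
            hist[t] = V[t][t]
    return races


# [1] by Thread 1; then [2], release(m), [3] by Thread 2.
trace = [("write", 0, "X"), ("write", 1, "X"),
         ("release", 1, "m"), ("write", 1, "X")]
single = detect(2, trace, per_thread_writes=False)    # finds only [1]-[2]
per_thread = detect(2, trace, per_thread_writes=True)  # finds [1]-[2] and [1]-[3]
```

With a single recent write, [3] is compared only with [2], which belongs to the same thread, so the race [1]-[3] is missed. Keeping one entry per thread lets [3] be compared with [1] directly.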


DJIT (30) More Than One Data-Race

Actually, DJIT can be extended to detect more than one data-race in a program.

Still, there are some good reasons for not doing so:

Recall that later data-races can be aftereffects of the first one (the program “goes crazy” after the first race).

Only the first data-race is guaranteed to be feasible (though it is not necessarily a crucial bug).

Later races can be apparent and hence irrelevant:

Thread 1                 Thread 2

X = 1;       [1]
F = true;                while ( !F );
                         X = 2;       [2]

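There is only one feasible data-race, on F (F is false at the start). The accesses [1] and [2] can never execute concurrently, because X = 2 runs only after Thread 2 observes F == true.

Thus, if we report all apparent races, false alarms are inevitable.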


DJIT (31) Results

The DJIT algorithm was implemented in several academic systems – Millipede and Multipage.

+ Currently, DJIT detects the very first apparent data-race. After the race (or its cause) is fixed (or marked to be ignored), the search for other races can proceed. The extended version of DJIT can detect all races that appear during the execution.

– Very sensitive to differences in threads’ interleaving. Thus it’s recommended to apply the algorithm every time the program executes (and not only in debug mode).

– Still requires an enormous number of runs to gain confidence that the tested program is race-free, and even then cannot prove it.


Lockset (1) Locking Discipline

A locking discipline is a programming policy that ensures the absence of data-races.

A simple, yet common locking discipline is to require that every shared variable is protected by a mutual-exclusion lock.

The Lockset algorithm detects violations of the locking discipline.

The main drawback is a possibly excessive number of false alarms.


Lockset (2) What is the Difference?

[1] hb→ [2], yet there is a feasible data-race under a different scheduling.

Thread 1                 Thread 2

Y = Y + 1;   [1]
Lock( m );
V = V + 1;
Unlock( m );
                         Lock( m );
                         V = V + 1;
                         Unlock( m );
                         Y = Y + 1;   [2]

Thread 1                 Thread 2

Y = Y + 1;   [1]
Lock( m );
Flag = true;
Unlock( m );
                         Lock( m );
                         T = Flag;
                         Unlock( m );
                         if ( T == true ) Y = Y + 1;   [2]

There is no locking discipline on Y, yet [1] and [2] are ordered under all possible schedulings.


Lockset (3) The Basic Algorithm

For each shared variable v, let C(v) be the set of locks that have protected v in the computation so far.

Let locks_held(t) be the set of locks held by thread t at any given moment.

The Lockset algorithm:
- for each v, init C(v) to the set of all possible locks
- on each access to v by thread t:
  - C(v) ← C(v) ∩ locks_held(t)
  - if C(v) = ∅, issue a warning
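The basic algorithm fits in a few lines of Python. In this sketch, the class BasicLockset and its method names are hypothetical. The "set of all possible locks" is passed in explicitly as an assumption of the sketch, and a missing C(v) entry stands for that full set. The usage at the end replays the example in Lockset (5).

```python
class BasicLockset:
    """Hypothetical sketch of the basic Lockset refinement."""

    def __init__(self, all_locks):
        self.all_locks = frozenset(all_locks)  # stands in for "all possible locks"
        self.C = {}          # var -> candidate lockset C(v)
        self.held = {}       # thread -> locks_held(t)
        self.warnings = []   # variables whose C(v) became empty

    def lock(self, t, m):
        self.held.setdefault(t, set()).add(m)

    def unlock(self, t, m):
        self.held.setdefault(t, set()).discard(m)

    def access(self, t, v):
        # C(v) <- C(v) ∩ locks_held(t); a missing entry means "all locks"
        cv = self.C.get(v, self.all_locks) & self.held.get(t, set())
        self.C[v] = cv
        if not cv:
            self.warnings.append(v)


ls = BasicLockset({"m1", "m2"})
ls.lock(0, "m1")
ls.access(0, "v")     # C(v) = {m1}
ls.unlock(0, "m1")
ls.lock(0, "m2")
ls.access(0, "v")     # C(v) = { }, so a warning is issued
ls.unlock(0, "m2")
```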


Lockset (4) Explanation

Clearly, a lock m is in C(v) if, in the execution up to that point, every thread that has accessed v was holding m at the moment of the access.

The process, called lockset refinement, ensures that any lock that consistently protects v is contained in C(v).

If some lock m consistently protects v, it will remain in C(v) until the program terminates.


Lockset (5) Example

The locking discipline for v is violated since no lock protects it consistently.

Program                  locks_held    C(v)

                         { }           {m1, m2}
Lock( m1 );              {m1}
v = v + 1;                             {m1}
Unlock( m1 );            { }
Lock( m2 );              {m2}
v = v + 1;                             { }  warning
Unlock( m2 );            { }


Lockset (6) Improving the Locking Discipline

The locking discipline described above is too strict. There are three very common programming practices that violate the discipline, yet are free from any data-races:

Initialization: Shared variables are usually initialized without holding any locks.

Read-Shared Data: Some shared variables are written during initialization only and are read-only thereafter.

Read-Write Locks: Read-write locks allow multiple readers to access a shared variable, but allow only a single writer to do so.


Lockset (7) Initialization

When initializing newly allocated data there is no need to lock it, since other threads cannot hold a reference to it yet.

Unfortunately, there is no easy way of knowing when initialization is complete.

Therefore, a shared variable is considered initialized when it is first accessed by a second thread.

As long as a variable is accessed by a single thread, reads and writes don’t update C(v).


Lockset (8) Read-Shared Data

There is no need to protect a variable if it’s read-only.

To support unlocked read-sharing, races are reported only after an initialized variable has become write-shared by more than one thread.


Lockset (9) Initialization and Read-Sharing

Newly allocated variables begin in the Virgin state. As various threads read and write the variable, its state changes according to the transitions shown in the figure.

Races are reported only for variables in the Shared-Modified state.

The algorithm becomes more dependent on the scheduler.

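[Figure: state transitions of a variable]
Virgin → Exclusive: write by first thread
Exclusive → Exclusive: read/write by first thread
Exclusive → Shared: read by new thread
Exclusive → Shared-Modified: write by new thread
Shared → Shared: read by any thread
Shared → Shared-Modified: write by any thread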


Lockset (10) Initialization and Read-Sharing

The states are:

Virgin – Indicates that the data is new and has not yet been referenced by any thread.

Exclusive – Entered after the data is first accessed (by a single thread). Subsequent accesses don’t update C(v) (handles initialization).

Shared – Entered after a read access by a new thread. C(v) is updated, but data-races are not reported. In this way, multiple threads can read the variable without causing a race to be reported (handles read-sharing).

Shared-Modified – Entered when more than one thread accesses the variable and at least one access is a write. C(v) is updated and races are reported as in the original algorithm.
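The states can be combined with the basic refinement as in the following hypothetical sketch:
- While v is Exclusive, C(v) is left untouched.
- In Shared, C(v) is refined but nothing is reported.
- Warnings are issued only in Shared-Modified.

The Virgin-to-Exclusive transition is taken on any first access, following the description of the Exclusive state. The class and method names are assumptions.

```python
from enum import Enum


class State(Enum):
    VIRGIN = "Virgin"
    EXCLUSIVE = "Exclusive"
    SHARED = "Shared"
    SHARED_MODIFIED = "Shared-Modified"


class LocksetWithStates:
    """Hypothetical Lockset variant handling initialization and read-sharing."""

    def __init__(self, all_locks):
        self.all_locks = frozenset(all_locks)
        self.state = {}     # var -> State (a missing entry means Virgin)
        self.owner = {}     # var -> thread that first accessed it
        self.C = {}         # var -> candidate lockset C(v)
        self.held = {}      # thread -> locks currently held
        self.warnings = []

    def lock(self, t, m):
        self.held.setdefault(t, set()).add(m)

    def unlock(self, t, m):
        self.held.setdefault(t, set()).discard(m)

    def access(self, t, v, is_write):
        s = self.state.get(v, State.VIRGIN)
        if s is State.VIRGIN:
            # First access: enter Exclusive; C(v) is not updated.
            self.state[v] = State.EXCLUSIVE
            self.owner[v] = t
            return
        if s is State.EXCLUSIVE:
            if t == self.owner[v]:
                return  # still initialization by the first thread
            s = State.SHARED_MODIFIED if is_write else State.SHARED
        elif s is State.SHARED and is_write:
            s = State.SHARED_MODIFIED
        self.state[v] = s
        # Both Shared and Shared-Modified refine C(v).
        cv = self.C.get(v, self.all_locks) & self.held.get(t, set())
        self.C[v] = cv
        if s is State.SHARED_MODIFIED and not cv:
            self.warnings.append(v)  # report only in Shared-Modified


d = LocksetWithStates({"m"})
d.access(0, "v", True)    # initialization without locks: Exclusive
d.access(1, "v", False)   # unlocked read by a new thread: Shared, not reported
d.access(2, "v", False)
d.access(2, "v", True)    # write: Shared-Modified with empty C(v), so a warning
```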


Lockset (11) Read-Write Locks

Many programs use Single Writer/Multiple Readers (SWMR) locks as well as simple locks.

The basic algorithm does not correctly support this style of synchronization.

Definition: For a variable v, some lock m protects v if m is held in write mode for every write of v, and m is held in some mode (read or write) for every read of v.


Lockset (12) Read-Write Locks – Final Refinement

When the variable enters the Shared-Modified state, the checking is different:

Let locks_held(t) be the set of locks held in any mode by thread t.

Let write_locks_held(t) be the set of locks held in write mode by thread t.


Lockset (13) Read-Write Locks – Final Refinement

The refined algorithm (for Shared-Modified):
- for each v, initialize C(v) to the set of all locks
- on each read of v by thread t:
  - C(v) ← C(v) ∩ locks_held(t)
  - if C(v) = ∅, issue a warning
- on each write of v by thread t:
  - C(v) ← C(v) ∩ write_locks_held(t)
  - if C(v) = ∅, issue a warning

Since locks held purely in read mode don’t protect against data-races between the writer and other readers, they are not considered when a write occurs and are thus removed from C(v).
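The refined rule can be sketched as follows, assuming v is already in the Shared-Modified state. The state machine is omitted for brevity. Locks held in read mode and in write mode are tracked separately, and all names are hypothetical.

The usage at the end shows two cases. First, a correct SWMR pattern produces no warning. Then a thread writes v while holding the lock only in read mode, which empties C(v).

```python
class RwLockset:
    """Hypothetical sketch of the read-write refinement for Shared-Modified."""

    def __init__(self, all_locks):
        self.all_locks = frozenset(all_locks)
        self.C = {}            # var -> candidate lockset C(v)
        self.read_mode = {}    # thread -> locks held in read mode
        self.write_mode = {}   # thread -> locks held in write mode
        self.warnings = []

    def acquire_read(self, t, m):
        self.read_mode.setdefault(t, set()).add(m)

    def acquire_write(self, t, m):
        self.write_mode.setdefault(t, set()).add(m)

    def release(self, t, m):
        self.read_mode.get(t, set()).discard(m)
        self.write_mode.get(t, set()).discard(m)

    def locks_held(self, t):
        # locks held in any mode
        return self.read_mode.get(t, set()) | self.write_mode.get(t, set())

    def write_locks_held(self, t):
        return self.write_mode.get(t, set())

    def access(self, t, v, is_write):
        held = self.write_locks_held(t) if is_write else self.locks_held(t)
        cv = self.C.get(v, self.all_locks) & held
        self.C[v] = cv
        if not cv:
            self.warnings.append(v)


rw = RwLockset({"rw"})
rw.acquire_write(0, "rw")
rw.access(0, "v", is_write=True)    # C(v) = {rw}
rw.release(0, "rw")
rw.acquire_read(1, "rw")
rw.access(1, "v", is_write=False)   # read in read mode: C(v) stays {rw}
rw.release(1, "rw")
rw.acquire_read(2, "rw")
rw.access(2, "v", is_write=True)    # write in read mode only: C(v) = { }, warning
rw.release(2, "rw")
```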


Lockset (14) Still False Alarms

The refined algorithm will still produce a false alarm in the following simple case:

Thread 1 Thread 2 C(v)

Lock( m1 ); v = v + 1; Unlock( m1 );

Lock( m2 ); v = v + 1; Unlock( m2 );

Lock( m1 ); Lock( m2 ); v = v + 1; Unlock( m2 ); Unlock( m1 );

{m1, m2}

{m1}

{ }

Lockset (15) Additional False Alarms

Additional possible false alarms are:

A queue that implicitly protects its elements by accessing the queue through locked head and tail fields.

A thread that passes arguments to a worker thread. Since the main thread and the worker thread never access the arguments concurrently, they do not use any locks to serialize their accesses.

Privately implemented SWMR locks, which don’t communicate with Lockset.

True data-races that don’t affect the correctness of the program (for example, “benign” races):

if (f == 0) {
    lock(m);
    if (f == 0)
        f = 1;
    unlock(m);
}


Lockset (16) Results

Lockset was implemented in a full-scale testing tool called Eraser, which is used in industry (not “on paper only”).

+ Eraser was found to be quite insensitive to differences in threads’ interleaving (if applied to programs that are “deterministic enough”).

– Since a superset of apparent data-races is located, false alarms are inevitable.

– Still requires an enormous number of runs to gain confidence that the tested program is race-free, and even then cannot prove it.

– The measured slowdowns are by a factor of 10 to 30.


Dynamic Data-Race Detection Summary

There is no single best solution.

DJIT reports one apparent data-race, which is the very first in the execution.

Lockset reports a batch of apparent data-races, some or even all of which may be false alarms.

Maybe combine both techniques? Maybe combine them with other known techniques? Maybe combine them with some static analysis? Maybe better approximations can be found?


Dynamic Data-Race Detection Summary – Cont.

The solutions are not universal.

The data-races that are found are apparent, and not necessarily feasible.

A large number of runs is still required to check as many execution paths as possible.

Since slowdowns can be large, satisfactory testing can take months.

Different (or new) types of synchronization require different detection techniques.

Inserting detection code into a program can perturb the threads’ interleaving so that races disappear (Lockset is less sensitive to this).


References

S. Adve, M. Hill, and R. Netzer. Detecting Data Races on Weak Memory Systems. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 234-243, May 1991.

A. Itzkovitz, A. Schuster, and O. Zeev-Ben-Mordechai. Towards Integration of Data Race Detection in DSM System. In The Journal of Parallel and Distributed Computing (JPDC), 59(2): pp. 180-203, Nov. 1999.

L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. In Communications of the ACM, 21(7): pp. 558-565, Jul. 1978.

F. Mattern. Virtual Time and Global States of Distributed Systems. In Parallel & Distributed Algorithms, pp. 215-226, 1989.


References – Cont.

R. H. B. Netzer and B. P. Miller. What Are Race Conditions? Some Issues and Formalizations. In ACM Letters on Programming Languages and Systems, 1(1): pp. 74-88, Mar. 1992.

R. H. B. Netzer and B. P. Miller. On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions. In 1990 International Conference on Parallel Processing, 2: pp. 93-97, Aug. 1990.

R. H. B. Netzer and B. P. Miller. Detecting Data Races in Parallel Program Executions. In Advances in Languages and Compilers for Parallel Processing, MIT Press 1991, pp. 109-129.


References – Cont.

S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T.E. Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. In ACM Transactions on Computer Systems, 15(4): pp. 391-411, 1997.

O. Zeev-Ben-Mordehai. Efficient Integration of On-The-Fly Data Race Detection in Distributed Shared Memory and Symmetric Multiprocessor Environments. Research Thesis, May 2001.


The End
