PADTAD, Nice, April 03

Efficient On-the-Fly Data Race Detection in Multithreaded C++ Programs

Eli Pozniansky & Assaf Schuster

What is a Data Race?

Two concurrent accesses to a shared location, at least one of them for writing. Indicative of a bug.

Thread 1        Thread 2
X++             T=Y
Z=2             T=X

How Can Data Races be Prevented?

Explicit synchronization between threads:
• Locks
• Critical Sections
• Barriers
• Mutexes
• Semaphores
• Monitors
• Events
• Etc.

Thread 1        Thread 2
Lock(m)
X++
Unlock(m)
                Lock(m)
                T=X
                Unlock(m)

Is This Sufficient?

Yes! No!

• Programmer dependent. Correctness: the programmer may forget to synchronize. Tools are needed to detect data races.
• Expensive. Efficiency: to achieve correctness, the programmer may overdo it. Tools are needed to remove excessive synchronization.

Detecting Data Races?

• NP-hard [Netzer & Miller 1990]
  • Input size = # instructions performed
  • Even for 3 threads only
  • Even with no loops/recursion
• Execution orders/scheduling: (#threads)^thread_length
• # inputs
• Detection code's side effects
• Weak memory, instruction reordering, atomicity

Where is Waldo?

#define N 100
Type* g_stack = new Type[N];
int g_counter = 0;
Lock g_lock;

void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
void pop( Type& obj ) { lock(g_lock); ... unlock(g_lock); }

void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;
    unlock(g_lock);
}

int find( Type& obj, int number ) {
    lock(g_lock);
    int i;  // declared outside the loop so it is still in scope below
    for (i = 0; i < number; i++)
        if (obj == g_stack[i]) break;  // Found!!!
    if (i == number) i = -1;           // Not found... Return -1 to caller
    unlock(g_lock);
    return i;
}

int find( Type& obj ) {
    return find( obj, g_counter );
}

Can You Find the Race?

A similar problem was found in java.util.Vector.

(Same code as on the "Where is Waldo?" slide; only the two racing accesses are shown.)

void popAll( ) {
    lock(g_lock);
    delete[] g_stack;
    g_stack = new Type[N];
    g_counter = 0;                   // write (under g_lock)
    unlock(g_lock);
}

int find( Type& obj ) {
    return find( obj, g_counter );   // read (g_lock is not held here)
}

Apparent Data Races

• Based only on the behavior of the explicit synchronization, not on program semantics
• Easier to locate
• Less accurate
• Exist iff a “real” (feasible) data race exists
• Detection is still NP-hard

Initially: grades = oldDatabase; updated = false;

Thread T.A.                 Thread Lecturer
grades = newDatabase;
updated = true;
                            while (updated == false);
                            X := grades.gradeOf(lecturersSon);

Detection Approaches

• Restricted programming model
  • Usually fork-join
• Static
  • Emrath, Padua 88
  • Balasundaram, Kennedy 89
  • Mellor-Crummey 93
  • Flanagan, Freund 01
• Postmortem
  • Netzer, Miller 90, 91
  • Adve, Hill 91
• On-the-fly
  • Dinning, Schonberg 90, 91
  • Savage et al. 97
  • Itzkovitz et al. 99
  • Perkovic, Keleher 00

Issues:
• programming model
• synchronization method
• memory model
• accuracy
• overhead
• granularity
• coverage

[Figure: fork-join diagram.]

MultiRace Approach

• On-the-fly detection of apparent data races
• Two detection algorithms (improved versions)
  • Lockset [Savage, Burrows, Nelson, Sobalvarro, Anderson 97]
  • Djit+ [Itzkovitz, Schuster, Zeev-Ben-Mordechai 99]
• Correct even for weak memory systems
• Flexible detection granularity
  • Variables and objects
  • Especially suited for OO programming languages
• Source-code (C++) instrumentation + memory mappings
  • Transparent
  • Low overhead

Djit+ [Itzkovitz et al. 1999]
Apparent Data Races

• Lamport’s happens-before partial order
• a, b are concurrent if neither a hb→ b nor b hb→ a
  • Concurrent accesses form an apparent data race
  • Otherwise, they are “synchronized”
• Djit+ basic idea: check each access performed against all “previously performed” accesses

Thread 1            Thread 2
a
Unlock(L)
                    Lock(L)
                    b

a hb→ b

Djit+
Local Time Frames (LTF)

• The execution of each thread is split into a sequence of time frames.
• A new time frame starts on each unlock.
• For every access there is a timestamp = a vector of LTFs known to the thread at the moment the access takes place.

Thread            LTF
x = 1             1
lock( m1 )
z = 2             1
lock( m2 )
y = 3             1
unlock( m2 )
z = 4             2
unlock( m1 )
x = 5             3
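The time-frame rule can be illustrated with a small sketch. It is not taken from MultiRace itself: the `Event` type, `local_time_frames` and `slide_trace` are hypothetical names, and the only rule assumed is the one stated on the slide (the frame counter starts at 1 and advances on every unlock).

```cpp
#include <string>
#include <vector>

// One event in a single thread's trace. All names here are illustrative.
enum class EventKind { Access, Lock, Unlock };

struct Event {
    EventKind kind;
    std::string what;  // variable or lock name, kept only for readability
};

// Returns the local time frame (LTF) of every Access event, in trace order.
// The frame counter starts at 1 and is incremented by each Unlock.
inline std::vector<int> local_time_frames(const std::vector<Event>& trace) {
    std::vector<int> ltfs;
    int ltf = 1;
    for (const Event& e : trace) {
        if (e.kind == EventKind::Access) {
            ltfs.push_back(ltf);
        } else if (e.kind == EventKind::Unlock) {
            ++ltf;
        }
    }
    return ltfs;
}

// The single-thread trace shown on the slide.
inline std::vector<Event> slide_trace() {
    return {
        {EventKind::Access, "x"}, {EventKind::Lock, "m1"},
        {EventKind::Access, "z"}, {EventKind::Lock, "m2"},
        {EventKind::Access, "y"}, {EventKind::Unlock, "m2"},
        {EventKind::Access, "z"}, {EventKind::Unlock, "m1"},
        {EventKind::Access, "x"},
    };
}
```

Calling `local_time_frames(slide_trace())` yields the LTF column of the slide: 1, 1, 1, 2, 3.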

Djit+
Checking Concurrency

P(a,b) ≜ ( a.type = write ⋁ b.type = write ) ⋀ ( a.ltf ≥ b.timestamp[a.thread_id] )

• a was logged earlier than b.
• P returns TRUE iff a and b are racing.

Problem: too much logging, too many checks.
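The predicate P translates almost directly into code. The sketch below is only an illustration: the `Access` struct and the `races` function are hypothetical, the field names follow the slide, and the concrete timestamp values in the usage note are made up.

```cpp
#include <cstddef>
#include <vector>

enum class AccessType { Read, Write };

// A logged access, using the fields named on the slide.
struct Access {
    AccessType type;
    std::size_t thread_id;
    int ltf;                     // local time frame of the issuing thread
    std::vector<int> timestamp;  // LTFs of all threads known to the issuer
};

// P(a, b): assumes a was logged earlier than b and by a different thread.
inline bool races(const Access& a, const Access& b) {
    const bool conflicting =
        a.type == AccessType::Write || b.type == AccessType::Write;
    // b's thread has not yet learned of a frame of a's thread later than a.ltf.
    const bool unordered = a.ltf >= b.timestamp[a.thread_id];
    return conflicting && unordered;
}
```

For example, with a write `a` by thread 0 in frame 1, a read `b` by thread 1 whose timestamp is {1, 1} is reported as racing, while a read whose timestamp is {2, 1} (thread 1 already knows that thread 0 reached frame 2) is not.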

Djit+
Which Accesses to Check?

• a in thread t1, and b and c in thread t2, in the same ltf
• b precedes c in the program order.
• If a and b are synchronized, then a and c are synchronized as well.

It is sufficient to record only the first read access and the first write access to a variable in each ltf.

[Figure: Thread 1 and Thread 2 read and write X inside and outside lock( m ) ... unlock( m ) sections. Accesses a, b and c are marked, some accesses are labeled "No logging", and one pair of accesses is marked as a race.]

Djit+
Which LTFs to Check?

• a occurs in t1
• b and c “previously” occur in t2
• If a is synchronized with c then it must also be synchronized with b.

It is sufficient to check a “current” access with the “most recent” accesses in each of the other threads.

[Figure: timelines of Thread 1 and Thread 2 with lock(m) and unlock(m) operations and the accesses a, b and c marked.]

Djit+
Access History

For every variable v, for each of the threads:
• The last ltf in which the thread read from v
• The last ltf in which the thread wrote to v

V:  r-ltf1  r-ltf2  ...  r-ltfn    (LTFs of recent reads from v, one for each thread)
    w-ltf1  w-ltf2  ...  w-ltfn    (LTFs of recent writes to v, one for each thread)

• On each first read and first write to v in an ltf, every thread updates the access history of v.
• If the access to v is a read, the thread checks all recent writes by other threads to v.
• If the access is a write, the thread checks all recent reads as well as all recent writes by other threads to v.
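A minimal sketch of such an access history follows. It is an illustration under stated assumptions rather than the MultiRace data structure: `AccessHistory`, `concurrent` and `on_first_access` are hypothetical names, 0 is used as a "never accessed" marker, and the concurrency test is the `a.ltf ≥ b.timestamp[a.thread_id]` comparison from the concurrency predicate.

```cpp
#include <cstddef>
#include <vector>

// Access history of one variable v: for each thread, the last LTF in which it
// read v and the last LTF in which it wrote v (0 means "never").
struct AccessHistory {
    std::vector<int> r_ltf;
    std::vector<int> w_ltf;
    explicit AccessHistory(std::size_t threads)
        : r_ltf(threads, 0), w_ltf(threads, 0) {}
};

// True if an access logged by thread `other` in frame `ltf` is concurrent with
// the current access, whose issuer holds timestamp vector `ts`.
inline bool concurrent(int ltf, std::size_t other, const std::vector<int>& ts) {
    return ltf != 0 && ltf >= ts[other];
}

// Called on the first read (is_write == false) or first write of v in a time
// frame by thread `t` with timestamp `ts`. Returns true if a race is found.
inline bool on_first_access(AccessHistory& h, std::size_t t,
                            const std::vector<int>& ts, bool is_write) {
    bool race = false;
    for (std::size_t u = 0; u < ts.size(); ++u) {
        if (u == t) continue;
        if (concurrent(h.w_ltf[u], u, ts)) race = true;              // recent writes
        if (is_write && concurrent(h.r_ltf[u], u, ts)) race = true;  // recent reads
    }
    (is_write ? h.w_ltf : h.r_ltf)[t] = ts[t];  // record this access
    return race;
}
```

Each check touches one entry per other thread, so the cost of a logged access is proportional to the number of threads rather than to the number of earlier accesses.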

Djit+
Pros and Cons

• No false alarms
• No missed races (in a given scheduling)
• Very sensitive to differences in scheduling
• Requires an enormous number of runs. Yet: cannot prove the tested program is race free.
• Can be extended to support other synchronization primitives, like barriers and counting semaphores
• Correct on relaxed memory systems [Adve+Hill, 1990 data-race-free-1]

Lockset [Savage et al. 1997]

• Locking discipline: every shared location is consistently protected by a lock.
• Lockset detects violations of this locking discipline.

Thread 1        Thread 2        Locks(v)    C(v)
                                { }         {m1, m2}
lock(m1)
read v                          {m1}        {m1}
unlock(m1)                      { }
                lock(m2)
                write v         {m2}        { }
                unlock(m2)      { }

Warning: locking discipline for v is violated.
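The refinement of C(v) shown in the table can be sketched as a set intersection. The `Candidate` type and its `on_access` method are hypothetical names chosen for this illustration; lock names are plain strings for readability.

```cpp
#include <set>
#include <string>
#include <utility>

using LockSet = std::set<std::string>;

// Candidate lock set C(v) for one shared variable v.
struct Candidate {
    LockSet c;
    explicit Candidate(LockSet all_locks) : c(std::move(all_locks)) {}

    // Refines C(v) by intersecting it with the locks held at an access to v.
    // Returns false when C(v) becomes empty, i.e. the discipline is violated.
    bool on_access(const LockSet& held) {
        LockSet kept;
        for (const std::string& lock : c) {
            if (held.count(lock) != 0) kept.insert(lock);
        }
        c = std::move(kept);
        return !c.empty();
    }
};
```

Starting from C(v) = {m1, m2}, a read with {m1} held leaves C(v) = {m1}, and a later write with only {m2} held empties C(v), which is the point at which the warning on the slide is issued.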

Lockset vs. Djit+

Thread 1            Thread 2
y++  [1]
lock( m )
v++
unlock( m )
                    lock( m )
                    v++
                    unlock( m )
                    y++  [2]

[1] hb→ [2], yet there might be a data race on y under a different scheduling: the locking discipline is violated.

Lockset
Which Accesses to Check?

• If a and b are in the same thread and the same time frame, and a precedes b, then Locks_a(v) ⊆ Locks_b(v).
• Locks_u(v) is the set of locks held during access u to v.

Thread              Locks(v)
unlock
...
lock(m1)
a: write v          {m1}
write v             {m1} = {m1}
lock(m2)
b: write v          {m1,m2} ⊇ {m1}
unlock(m2)
unlock(m1)

• Only first accesses need be checked in every time frame.
• Lockset can use the same logging (access history) as Djit+.

Lockset
Pros and Cons

• Less sensitive to scheduling
• Detects a superset of all apparently raced locations in an execution of a program: races cannot be missed
• Lots of false alarms
• Still dependent on scheduling: cannot prove the tested program is race free

Lockset
Supporting Barriers

[Figure: per-variable state transition diagram. States: Virgin, Initializing, Exclusive, Shared, Clean, Empty. Transition labels: write by first thread; read/write by same thread; read/write by new thread; read/write by new thread, C(v) is empty; read/write by new thread, C(v) not empty; read by any thread; read/write by any thread; read/write by some thread; read/write by any thread, C(v) is empty; read/write by any thread, C(v) not empty; barrier.]

Combining Djit+ and Lockset

[Figure: nested sets of locations in some program P]
S: All shared locations in P
F: All raced locations in P
A: All apparently raced locations in P
L: Violations detected by Lockset in P
D: Raced locations detected by Djit+ in P

• Lockset can detect suspected races in more execution orders
• Djit+ can filter out the spurious warnings reported by Lockset
• Lockset can help reduce the number of checks performed by Djit+
  • If C(v) is not empty yet, Djit+ should not check v for races
• The implementation overhead comes mainly from the access logging mechanism
  • Can be shared by the algorithms

Implementing Access Logging: Recording First LTF Accesses

[Figure: Threads 1 to 4 access objects X and Y in physical shared memory through three views in virtual memory: a No-Access view, a Read-Only view and a Read-Write view.]

• An access attempt with wrong permissions generates a fault
• The fault handler activates the logging and the detection mechanisms, and switches views

Swizzling Between Views

[Figure: a single thread accesses X and Y in physical shared memory through the No-Access, Read-Only and Read-Write views in virtual memory.]

Operations shown for the thread:
unlock(m)
write x
unlock(m)
read x
write x

Fault labels on the slide: read fault, write fault, write fault
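The view switching can be simulated without any memory protection at all. The sketch below is only a model of the idea: `ThreadView` and `slide_sequence` are hypothetical names, and it assumes that a thread falls back to the No-Access view on every unlock (the start of a new time frame), that the first read in a frame takes a read fault, and that the first write takes a write fault. The order of operations is taken from the slide as extracted; which fault label belongs to which access is an assumption.

```cpp
#include <string>
#include <vector>

enum class View { NoAccess, ReadOnly, ReadWrite };

// Hypothetical per-thread view state for a single minipage.
struct ThreadView {
    View view = View::NoAccess;
    std::vector<std::string> faults;  // faults taken so far, in order

    void read() {
        if (view == View::NoAccess) {   // first read in this time frame
            faults.push_back("read fault");
            view = View::ReadOnly;
        }
    }
    void write() {
        if (view != View::ReadWrite) {  // first write in this time frame
            faults.push_back("write fault");
            view = View::ReadWrite;
        }
    }
    void unlock() { view = View::NoAccess; }  // a new time frame starts
};

// Replays the operations listed on the slide, then two repeated accesses.
inline std::vector<std::string> slide_sequence() {
    ThreadView t;
    t.unlock(); t.write();            // write fault
    t.unlock(); t.read(); t.write();  // read fault, then write fault
    t.write(); t.read();              // repeated accesses: no further faults
    return t.faults;
}
```

Under these assumptions the replay takes exactly three faults (one read fault and two write faults), and later accesses in the same time frame run at full speed.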

Detection Granularity

• A minipage (= detection unit) can contain:
  • Objects of primitive types: char, int, double, etc.
  • Objects of complex types: classes and structures
  • Entire arrays of complex or primitive types
• An array can be placed on a single minipage or split across several minipages.
  • The array still occupies contiguous addresses.

[Figure: an array of elements 1 to 4, drawn three times.]

Playing with Detection Granularity to Reduce Overhead

• Larger minipages: reduced overhead
  • Fewer faults
• A minipage should be refined into smaller minipages when suspicious alarms occur
  • Replay technology can help (if available)
• When the suspicion is resolved: regroup
  • May disable detection on the accesses involved

Detection Granularity

[Chart: slowdowns of FFT using different granularities. x-axis: number of complex numbers in minipage (1, 2, 4, 8, 16, 32, 64, 128, 256, all). y-axis: slowdown (logarithmic scale, 0.1 to 1000). One series each for 1, 2, 4, 8, 16, 32 and 64 threads.]

Highlights of Instrumentation (1)

• All allocation routines, new and malloc, are overloaded. They get two additional arguments:
  • # requested elements
  • # elements per minipage
  Type* ptr = new(50, 1) Type[50];
• Every class Type inherits from the SmartProxy<Type> template class:
  class Type : public SmartProxy<Type> { ... }
• The SmartProxy<Type> class consists of functions that return a pointer (or a reference) to the instance through its correct view
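As a small worked example of the two extra allocation arguments, the number of minipages an array occupies can be computed as below. The helper `minipage_count` is hypothetical and assumes that a trailing, partially filled minipage still counts as one minipage; the slides do not state how a remainder is handled.

```cpp
#include <cassert>
#include <cstddef>

// Number of minipages needed for `elements` array elements when each minipage
// holds `per_minipage` elements (rounded up; assumption, see the lead-in).
inline std::size_t minipage_count(std::size_t elements, std::size_t per_minipage) {
    assert(per_minipage > 0);
    return (elements + per_minipage - 1) / per_minipage;
}
```

Under this assumption, `new(50, 1) Type[50]` would place every element on its own minipage (50 minipages), while an allocation of 20 elements with 2 per minipage would use 10 minipages.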

Highlights of Instrumentation (2)

• All occurrences of potentially shared primitive types are wrapped into fully functional classes:
  int → class _int_ : public SmartProxy<int> { int val;...}
• Global and static objects are copied to our shared space during initialization
  • All accesses are redirected to come from our copies
• No source code: the objects are ‘touched’:
  memcpy( dest->write(n),src->read(n),n*sizeof(Type) )
• All accesses to class data members are instrumented inside the member functions:
  data_mmbr=0; → smartPointer()->data_mmbr=0;

Example of Instrumentation

void func( Type* ptr, Type& ref, int num ) {
    for ( int i = 0; i < num; i++ ) {
        ptr->smartPointer()->data += ref.smartReference().data;
        ptr++;
    }
    Type* ptr2 = new(20, 2) Type[20];
    memset( ptr2->write(20*sizeof(Type)), 0, 20*sizeof(Type) );
    ptr = &ref;
    ptr2[0].smartReference() = *ptr->smartPointer();
    ptr->member_func( );
}

Callouts on the slide:
• Currently, the desired value is specified by the user through source code annotation
• No Change!!!

Loop Optimizations

Original code:
    for ( i = 0; i < num; i++)
        arr[i].data = i;

Instrumented code (very expensive):
    for ( i = 0; i < num; i++)
        arr[i].smartReference().data = i;

Optimize when the entire array is on a single minipage:
    if ( num > 0 )
        arr[0].smartReference().data = 0;  // Touch first element
    for ( i = 1; i < num; i++)             // i runs from 1 and not from 0
        arr[i].data = i;                   // Access the rest of array without faults

Optimize otherwise (no synchronization in loop):
    arr.write( num );                      // Touch for writing all minipages of array
    for ( i = 0; i < num; i++)             // Efficient if number of elements in array
        arr[i].data = i;                   // is high and number of minipages is low

Reporting Races in MultiRace

Benchmark Specifications (2 threads)

Benchmark | Input Set | Shared Memory | # Minipages | # Write/Read Faults | # Time-frames | Time in sec (No DR)
FFT | 2^8*2^8 | 3MB | 4 | 9/10 | 20 | 0.054
IS | 2^23 numbers, 2^15 values | 128KB | 3 | 60/90 | 98 | 10.68
LU | 1024*1024 matrix, block size 32*32 | 8MB | 5 | 127/186 | 138 | 2.72
SOR | 1024*2048 matrices, 50 iterations | 8MB | 2 | 202/200 | 206 | 3.24
TSP | 19 cities, recursion level 12 | 1MB | 9 | 2792/3826 | 678 | 13.28
WATER | 512 molecules, 15 steps | 500KB | 3 | 15438/15720 | 15636 | 9.55

Benchmark Overheads (4-way IBM Netfinity server, 550MHz, Win-NT)

[Chart: overheads (DR/No DR, axis from 0.4 to 2.4) vs. number of threads (1, 2, 4, 8, 16, 32, 64) for FFT, IS, LU, SOR, TSP and WATER.]

Overhead Breakdown

[Charts for TSP and WATER: overhead vs. number of threads (1 to 64), broken down into No DR, +Instrumentation, +Write Faults, +Read Faults, +Djit, +Lockset.]

Numbers above the bars (# write|read faults):

Threads | TSP | WATER
1 | 2794|3825 | 7788|7920
2 | 2792|3826 | 15438|15720
4 | 2788|3827 | 23178|23760
8 | 2808|3864 | 38658|39840
16 | 2820|3904 | 67708|70090
32 | 2834|3966 | 123881|128663
64 | 2876|4106 | 220864|230446

• Numbers above bars are # write/read faults.
• Most of the overhead comes from page faults.
• Overhead due to detection algorithms is small.

Summary

MultiRace is:
• Transparent
• Supports two-way and global synchronization primitives: locks and barriers
• Detects races that actually occurred (Djit+)
• Usually does not miss races that could occur with a different scheduling (Lockset)
• Correct for weak memory models
• Scalable
• Exhibits flexible detection granularity

Conclusions

• MultiRace makes it easier for programmers to trust their programs
  • No need to add synchronization “just in case”
• In case of doubt, MultiRace should be activated each time the program executes

Future/Ongoing work

• Implement instrumenting pre-compiler
• Higher transparency
• Higher scalability
• Automatic dynamic granularity adaptation
• Integrate with scheduling-generator
• Integrate with record/replay
• Integrate with the compiler/debugger
• May get rid of faults and views
• Optimizations through static analysis
• Implement extensions for semaphores and other synch primitives
• Etc.

The End

Minipages and Dynamic Granularity of Detection

• A minipage is a shared location that can be accessed using the approach of views.
• We detect races on minipages and not on a fixed number of bytes.
• Each minipage is associated with the access history of Djit+ and the Lockset state.
• The size of a minipage can vary.

Implementing Access Logging

• In order to record only the first accesses (reads and writes) to shared locations in each of the time frames, we use the concept of views.
• A view is a region in virtual memory.
• Each view has its own protection: NoAccess / ReadOnly / ReadWrite.
• Each shared object in physical memory can be accessed through each of the three views.
• Helps to distinguish between reads and writes.
• Enables the realization of the dynamic detection unit and avoids the false sharing problem.

Disabling Detection

• Obviously, Lockset can report false alarms.
• Also, Djit+ detects apparent races that are not necessarily feasible races:
  • Intentional races
  • Unrefined granularity
  • Private synchronization
• Detection can be disabled through the use of source code annotations.

Overheads

• The overheads are steady for 1-4 threads: we are scalable in the number of CPUs.
• The overheads increase for a high number of threads.
• The number of page faults (both read and write) increases linearly with the number of threads.
• In fact, any on-the-fly tool for data race detection will be unscalable in the number of threads when the number of CPUs is fixed.

Instrumentation Limitations

• Currently, there is no transparent solution for instrumenting global and static pointers.
  • In order to monitor all accesses to these pointers, they should be wrapped into classes; the compiler’s automatic pointer conversions are then lost.
  • Will not be a problem in Java.
• All data members of the same instance of a class always reside on the same minipage.
  • In the future: will split classes dynamically.

Breakdowns of Overheads

[Charts for FFT, IS, LU and SOR: overhead vs. number of threads (1 to 64), broken down into No DR, +Instrumentation, +Write Faults, +Read Faults, +Djit, +Lockset.]

Numbers above the bars (# write|read faults):

Threads | FFT | IS | LU | SOR
1 | 6|5 | 0|0 | 65|63 | 102|100
2 | 9|10 | 60|90 | 127|186 | 202|200
4 | 15|20 | 244|424 | 219|398 | 402|400
8 | 27|40 | 974|1814 | 396|805 | 802|800
16 | 51|80 | 3885|7485 | 688|1551 | 1602|1600
32 | 99|160 | 15452|30332 | 1218|2901 | 3202|3200
64 | 195|320 | 61919|122399 | 2158|5453 | 6402|6400

References

T. Brecht and H. Sandhu. The Region Trap Library: Handling traps on application-defined regions of memory. In USENIX Annual Technical Conference, Monterey, CA, June 1999.

A. Itzkovitz, A. Schuster, and O. Zeev-Ben-Mordechai. Towards Integration of Data Race Detection in DSM System. In The Journal of Parallel and Distributed Computing (JPDC), 59(2): pp. 180-203, Nov. 1999.

L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. In Communications of the ACM, 21(7): pp. 558-565, Jul. 1978.

F. Mattern. Virtual Time and Global States of Distributed Systems. In Parallel & Distributed Algorithms, pp. 215-226, 1989.

R. H. B. Netzer and B. P. Miller. What Are Race Conditions? Some Issues and Formalizations. In ACM Letters on Programming Languages and Systems, 1(1): pp. 74-88, Mar. 1992.

R. H. B. Netzer and B. P. Miller. On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions. In 1990 International Conference on Parallel Processing, 2: pp. 93-97, Aug. 1990.

R. H. B. Netzer and B. P. Miller. Detecting Data Races in Parallel Program Executions. In Advances in Languages and Compilers for Parallel Processing, MIT Press 1991, pp. 109-129.

S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. E. Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. In ACM Transactions on Computer Systems, 15(4): pp. 391-411, 1997.

E. Pozniansky. Efficient On-the-Fly Data Race Detection in Multithreaded C++ Programs. Research Thesis.

O. Zeev-Ben-Mordehai. Efficient Integration of On-The-Fly Data Race Detection in Distributed Shared Memory and Symmetric Multiprocessor Environments. Research Thesis, May 2001.

Overheads

[Charts for FFT, IS, LU and SOR: DR/No DR overhead ratio, DR time and No DR time vs. number of threads (1 to 64).]

DR/No DR ratios shown on the charts:

Threads | FFT | IS | LU | SOR
1 | 1.106 | 1.000 | 1.074 | 0.929
2 | 1.008 | 1.000 | 1.060 | 0.947
4 | 0.807 | 1.013 | 1.075 | 1.000
8 | 0.866 | 1.009 | 1.117 | 1.008
16 | 0.913 | 1.002 | 1.160 | 1.023
32 | 1.037 | 1.029 | 1.346 | 1.034
64 | 1.406 | 1.120 | 1.409 | 1.051

Overheads

[Charts for WATER and TSP: DR/No DR overhead ratio, DR time and No DR time vs. number of threads (1 to 64).]

DR/No DR ratios shown on the charts:

Threads | WATER | TSP
1 | 0.937 | 0.995
2 | 0.985 | 0.996
4 | 1.001 | 0.997
8 | 1.042 | 0.997
16 | 1.257 | 1.026
32 | 1.676 | 1.093
64 | 2.329 | 1.308

The testing platform:
• 4-way IBM Netfinity, 550 MHz
• 2GB RAM
• Microsoft Windows NT