Discovering and Understanding Performance Bottlenecks in Transactional Applications

Discovering and Understanding Performance Bottlenecks in Transactional Applications Ferad Zyulkyarov 1,2 , Srdjan Stipic 1,2 , Tim Harris 3 , Osman S. Unsal 1 , Adrián Cristal 1,4 , Ibrahim Hur 1 , Mateo Valero 1,2 1 BSC-Microsoft Research Centre 2 Universitat Politècnica de Catalunya 3 Microsoft Research Cambridge 4 IIIA - Artificial Intelligence Research Institute CSIC - Spanish National Research Council 19th International Conference on Parallel Architectures and Compilation Techniques 11-15 September 2010 – Vienna


Page 1: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Discovering and Understanding Performance Bottlenecks in Transactional Applications

Ferad Zyulkyarov (1,2), Srdjan Stipic (1,2), Tim Harris (3), Osman S. Unsal (1), Adrián Cristal (1,4), Ibrahim Hur (1), Mateo Valero (1,2)

(1) BSC-Microsoft Research Centre
(2) Universitat Politècnica de Catalunya
(3) Microsoft Research Cambridge
(4) IIIA - Artificial Intelligence Research Institute, CSIC - Spanish National Research Council

19th International Conference on Parallel Architectures and Compilation Techniques, 11-15 September 2010, Vienna

Page 2: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Abstract the TM Implementation

Thread 1:
for (i = 0; i < N; i++) { atomic { x[i]++; } }

Thread 2:
for (i = 0; i < N; i++) { atomic { y[i]++; } }

Accesses to different arrays.

We can observe overheads inherent to the TM implementation.

We are not interested in such bottlenecks.

Page 3: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Abstract the TM Implementation

Thread 1:
for (i = 0; i < N; i++) { atomic { x[i]++; } }

Thread 2:
for (i = 0; i < N; i++) { atomic { x[i]++; } }

Accesses to the same array.

Contention: a bottleneck common to all implementations of the TM programming model.

We are interested in this kind of bottleneck.

Page 4: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Can We Find This Kind of Bottleneck?

atomic {
    statement1;
    statement2;
    statement3;
    statement4;
}

Abort rate: 80%

Where do aborts happen?

Which variables conflict?

Are there false conflicts?

Page 5: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Can We Find This Kind of Bottleneck?

atomic {
    statement1;
    statement2;
    statement3;
    statement4;
}

counter1 = 0;
counter2 = 0;
counter3 = 0;
counter4 = 0;

Page 6: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Can We Find This Kind of Bottleneck?

atomic {
    statement1;
    statement2;
    statement3;
    statement4;
}

counter1 = 1;
counter2 = 0;
counter3 = 0;
counter4 = 0;

Page 7: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Can We Find This Kind of Bottleneck?

atomic {
    statement1;
    statement2;
    statement3;
    statement4;
}

counter1 = 1;
counter2 = 1;
counter3 = 0;
counter4 = 0;

Conflict between statement2 and statement4.

Goal: profiling techniques that find bottlenecks (important conflicting locations) and explain why these conflicts happen.

Page 8: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Outline

Profiling Techniques

Implementation

Case Studies


Page 9: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Profiling Techniques


Visualizing transactions

Conflict point discovery

Identifying conflicting data structures

Page 10: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Transaction Visualizer (Genome)

[Figure: transaction visualizer output for Genome. Labels: Garbage Collection, Wait on barrier, 14% Aborts.]

Aborts occur at the first and last atomic blocks in program order.

When do these aborts happen?

Page 11: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Aborts Graph (Bayes)

[Figure: aborts graph for Bayes with nodes AB1 to AB15. Labels: 93% Aborts, 73%, 20%.]

Page 12: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Number of Aborts vs Wasted Work

atomic { counter++; }
Aborts = 9, Wasted Work = 10%

atomic { hashtable.Rehash(); }
Aborts = 1, Wasted Work = 90%
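The contrast between abort counts and wasted work can be made concrete with a small calculation. The Python sketch below is illustrative only: it assumes wasted work is measured as time spent in attempts that later abort, and the `Attempt` record, the block labels, and the durations are made up for the example rather than taken from the profiler.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    block: str       # label of the atomic block (hypothetical)
    duration: float  # time spent in this attempt, arbitrary units
    aborted: bool    # True if the attempt was rolled back


def summarize(attempts):
    """Return {block: (abort_count, share_of_total_wasted_time)}."""
    wasted = {}
    aborts = {}
    for a in attempts:
        if a.aborted:
            wasted[a.block] = wasted.get(a.block, 0.0) + a.duration
            aborts[a.block] = aborts.get(a.block, 0) + 1
    total = sum(wasted.values())
    if total == 0:
        return {}
    return {b: (aborts[b], wasted[b] / total) for b in wasted}


# Made-up log: nine short aborted attempts of a counter increment,
# one long aborted attempt of a rehash, plus one commit of each.
log = [Attempt("counter", 1.0, True)] * 9
log += [Attempt("rehash", 81.0, True)]
log += [Attempt("counter", 1.0, False), Attempt("rehash", 81.0, False)]

summary = summarize(log)
for block, (n, share) in summary.items():
    print(f"{block}: aborts={n}, wasted work={share:.0%}")
```

With these made-up durations the block that aborts most often accounts for the smaller share of wasted time, which is why ranking by abort count alone can point at the wrong atomic block.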

Page 13: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Conflict Point Discovery

File:Line         #Conf.  Method    Line
Hashtable.cs:51   152     Add       if (_container[hashCode]…
Hashtable.cs:48   62      Add       uint hashCode = HashSdbm(…
Hashtable.cs:53   5       Add       _container[hashCode] = n …
Hashtable.cs:83   5       Add       while (entry != null) …
ArrayList.cs:79   3       Contains  for (int i = 0; i < count; i++ )
ArrayList.cs:52   1       Add       if (count == capacity - 1) …
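A table like this one can be produced by grouping abort records by source location. The following is a minimal Python sketch under stated assumptions: the record layout (a dict with `file`, `line`, and `method` keys) is hypothetical, and the sample counts are invented, not the profiler's data.

```python
from collections import Counter


def conflict_points(abort_records):
    """Count conflicts per (file, line, method), most frequent first."""
    counts = Counter(
        (r["file"], r["line"], r["method"]) for r in abort_records
    )
    return counts.most_common()


# Invented sample: three conflicts at one line, two at another, one elsewhere.
records = (
    [{"file": "Hashtable.cs", "line": 51, "method": "Add"}] * 3
    + [{"file": "Hashtable.cs", "line": 48, "method": "Add"}] * 2
    + [{"file": "ArrayList.cs", "line": 79, "method": "Contains"}]
)

table = conflict_points(records)
for (file, line, method), n in table:
    print(f"{file}:{line:<4} {n:>4}  {method}")
```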

Page 14: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Conflicts Context

increment() {
    counter++;
}

probability80 {
    probability = random() % 100;
    if (probability < 80) {
        atomic { increment(); }
    }
}

probability20 {
    probability = random() % 100;
    if (probability >= 80) {
        atomic { increment(); }
    }
}

Thread 1 and Thread 2 (each runs the same loop):
for (int i = 0; i < 100; i++) {
    probability80();
    probability20();
}

All conflicts happen in increment().

Bottom-up view
+ increment (100%)
  |---- probability80 (80%)
  |---- probability20 (20%)

Top-down view
+ main (100%)
  |---- probability80 (80%)
        |---- increment (80%)
  |---- probability20 (20%)
        |---- increment (20%)
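Both views can be derived from the call stacks recorded at each conflict. The Python sketch below is one possible way to do it, not the profiler's implementation: it assumes each conflict carries a call stack listed from `main` down to the conflicting function, and the 80/20 split is hard-coded to mirror the example instead of being sampled.

```python
from collections import Counter


def top_down(stacks):
    """Share of conflicts under each call-path prefix, starting at the root."""
    counts = Counter()
    for stack in stacks:
        for depth in range(1, len(stack) + 1):
            counts[tuple(stack[:depth])] += 1
    return {path: n / len(stacks) for path, n in counts.items()}


def bottom_up(stacks):
    """Same aggregation, starting at the conflicting function and walking to callers."""
    return top_down([list(reversed(s)) for s in stacks])


# Hard-coded stacks mirroring the example: 80 conflicts reached through
# probability80 and 20 through probability20.
stacks = [["main", "probability80", "increment"]] * 80
stacks += [["main", "probability20", "increment"]] * 20

td = top_down(stacks)
bu = bottom_up(stacks)
print(bu[("increment",)], bu[("increment", "probability80")])
print(td[("main",)], td[("main", "probability20", "increment")])
```

The bottom-up view answers "who calls the conflicting code", while the top-down view answers "which part of the program leads to conflicts"; both come from the same stacks read in opposite directions.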

Page 15: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Identifying multiple conflicts from a single run

Thread 1:
atomic {
    obj1.x = t1;
    obj2.x = t2;
    obj3.x = t3;
    ...
    ...
    ...
}

Thread 2:
atomic {
    ...
    ...
    ...
    obj1.x = t1;
    obj2.x = t2;
    obj3.x = t3;
}

Conflict detected at 1st iteration
Conflict detected at 2nd iteration
Conflict detected at 3rd iteration

Page 16: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Identifying Conflicting Objects

List list = new List();
list.Add(1);
list.Add(2);
list.Add(3);
...
atomic {
    list.Replace(3, 33);
}

[Figure: the list and its elements 1, 2, 3, with addresses 0x08, 0x10, 0x18, 0x20.]

GC, DbgEng: Object Addr 0x20 -> GC Root 0x08 -> Variable Name (list)

Memory Allocator, DbgEng: Instr Addr 0x446290 -> List.cs:1

Per-Object View
+ List.cs:1 "list" (42%)
  |--- ChangeNode (20%)
       +---- Replace (12%)
       +---- Add (8%)
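The lookup from a conflicting address to the object that owns it can be sketched with a sorted table of allocation ranges. This Python example is an assumption-laden illustration: `AllocationMap`, `per_object_view`, the object sizes, and the extra "other" allocation are hypothetical, and it takes for granted that each allocation has already been tagged with the variable that roots it (the step the slide attributes to the GC and DbgEng).

```python
import bisect
from collections import Counter


class AllocationMap:
    """Maps an address to the owner label of the allocation containing it."""

    def __init__(self):
        self._starts = []   # sorted start addresses
        self._records = []  # parallel list of (start, size, owner)

    def record(self, start, size, owner):
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._records.insert(i, (start, size, owner))

    def lookup(self, addr):
        # Rightmost allocation whose start is <= addr, if any.
        i = bisect.bisect_right(self._starts, addr) - 1
        if i >= 0:
            start, size, owner = self._records[i]
            if start <= addr < start + size:
                return owner
        return None


def per_object_view(alloc_map, conflict_addrs):
    """Share of conflicts attributed to each owner label."""
    hits = Counter(alloc_map.lookup(a) for a in conflict_addrs)
    return {owner: n / len(conflict_addrs) for owner, n in hits.items()}


amap = AllocationMap()
for start in (0x08, 0x10, 0x18, 0x20):      # addresses from the figure
    amap.record(start, 8, 'List.cs:1 "list"')  # size 8 is an assumption
amap.record(0x40, 16, "other")               # made-up unrelated object

view = per_object_view(amap, [0x20, 0x24, 0x18, 0x44])
print(view)
```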

Page 17: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Outline

Profiling Techniques

Implementation
- Bartok
- The data that we collect
- Probe effect and profiling

Case Studies

Page 18: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Bartok

• C# to x86 research compiler with language-level support for TM

• STM
  – Eager versioning (i.e. in-place update)
  – Detects write-write conflicts eagerly (i.e. immediately)
  – Detects read-write conflicts lazily (i.e. at commit)
  – Detects conflicts at object granularity
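The four STM properties listed above can be illustrated with a toy model. This Python sketch is not Bartok's implementation: it is single-threaded, interleavings are simulated by calling methods in a chosen order, and every name in it (`TObject`, `Transaction`, `Conflict`) is hypothetical. It only shows where each kind of conflict would surface.

```python
class Conflict(Exception):
    pass


class TObject:
    """A shared object; conflicts are tracked per object, not per field."""

    def __init__(self, value):
        self.value = value
        self.version = 0   # bumped on every committed write
        self.owner = None  # transaction currently holding write ownership


class Transaction:
    def __init__(self):
        self.read_versions = {}  # TObject -> version seen at first read
        self.undo_log = []       # (TObject, old value) for in-place updates

    def read(self, obj):
        clean = obj.owner is None or obj.owner is self
        # None marks a read of another transaction's uncommitted update.
        self.read_versions.setdefault(obj, obj.version if clean else None)
        return obj.value

    def write(self, obj, value):
        # Eager write-write detection: fail as soon as ownership is contested.
        if obj.owner is not None and obj.owner is not self:
            self.abort()
            raise Conflict("write-write")
        if obj.owner is None:
            obj.owner = self
            self.undo_log.append((obj, obj.value))
        obj.value = value  # eager versioning: update in place

    def commit(self):
        # Lazy read-write detection: validate the read set only now.
        for obj, seen in self.read_versions.items():
            if seen is None or seen != obj.version:
                self.abort()
                raise Conflict("read-write")
        for obj, _ in self.undo_log:
            obj.version += 1
            obj.owner = None
        self.undo_log.clear()
        self.read_versions.clear()

    def abort(self):
        # Roll back in-place updates in reverse order.
        for obj, old in reversed(self.undo_log):
            obj.value = old
            obj.owner = None
        self.undo_log.clear()
        self.read_versions.clear()
```

In this model a longer undo log means more to roll back on abort, which is the cost that the wasted-work measurements try to capture.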


Page 19: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Profiling Data That We Collect

• Timestamp
  – TX start
  – TX commit or TX abort

• Read and write set size

• On abort
  – The instructions of the read and write operations involved in the conflict
  – The conflicting memory address
  – The call stack

• Process data offline or during GC
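The items listed above suggest a per-attempt log record. The following Python sketch shows one plausible shape for it; the class and field names, the instruction addresses, and the sample log are assumptions made for illustration, not the format the profiler actually writes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AbortInfo:
    read_instr: int    # instruction address of the conflicting read
    write_instr: int   # instruction address of the conflicting write
    address: int       # conflicting memory address
    call_stack: List[str] = field(default_factory=list)


@dataclass
class TxRecord:
    start: int             # timestamp at TX start
    end: int               # timestamp at TX commit or TX abort
    read_set_size: int
    write_set_size: int
    abort: Optional[AbortInfo] = None  # None means the attempt committed


def abort_rate(records):
    """Fraction of logged attempts that aborted."""
    if not records:
        return 0.0
    return sum(r.abort is not None for r in records) / len(records)


# Made-up log with one commit and one abort.
log = [
    TxRecord(start=0, end=5, read_set_size=3, write_set_size=1),
    TxRecord(start=2, end=6, read_set_size=3, write_set_size=1,
             abort=AbortInfo(0x1000, 0x1008, 0x20, ["main", "Add"])),
]
print(abort_rate(log))
```

Keeping the record small and deferring aggregation to an offline pass (or to GC pauses, as the slide notes) is one way to limit how much the instrumentation perturbs the run.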


Page 20: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Probe Effect and Overheads

Normalized Execution Time (average 0.25)

Threads  Bayes  Genome  Intruder  Labyrinth  Vacation  WormBench
1        0.59   0.27    0.29      0.07       0.26      0.29
2        0.45   0.30    0.39      0.03       0.24      0.05
4        0.01   0.21    0.55      0.01       0.18      0.08
8        0.02   0.18    1.19      0.16       0.19      0.11

Normalized Abort Rates (average 0.016)

Threads  Bayes  Genome  Intruder  Labyrinth  Vacation  WormBench
2        0.00   0.00    0.00      0.00       0.00      0.00
4        0.11   0.00    0.01      0.00       0.00      0.00
8        0.12   0.00    0.02      0.00       0.00      0.00

Page 21: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Outline

Profiling Techniques

Implementation

Case Studies


Page 22: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Case Studies

Bayes

Intruder

Labyrinth


Page 23: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Bayes

Wrapper object for function arguments:

public class FindBestTaskArg {
    public int toId;
    public Learner learnerPtr;
    public Query[] queries;
    public Vector queryVectorPtr;
    public Vector parentQueryVectorPtr;
    public int numTotalParent;
    public float basePenalty;
    public float baseLogLikelihood;
    public Bitmap bitmapPtr;
    public Queue workQueuePtr;
    public Vector aQueryVectorPtr;
    public Vector bQueryVectorPtr;
}

Create wrapper object:

FindBestTaskArg arg = new FindBestTaskArg();
arg.learnerPtr = learnerPtr;
arg.queries = queries;
arg.queryVectorPtr = queryVectorPtr;
arg.parentQueryVectorPtr = parentQueryVectorPtr;
arg.bitmapPtr = visitedBitmapPtr;
arg.workQueuePtr = workQueuePtr;
arg.aQueryVectorPtr = aQueryVectorPtr;
arg.bQueryVectorPtr = bQueryVectorPtr;

Page 24: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Bayes

public class FindBestTaskArg {
    public int toId;
    public Learner learnerPtr;
    public Query[] queries;
    public Vector queryVectorPtr;
    public Vector parentQueryVectorPtr;
    public int numTotalParent;
    public float basePenalty;
    public float baseLogLikelihood;
    public Bitmap bitmapPtr;
    public Queue workQueuePtr;
    public Vector aQueryVectorPtr;
    public Vector bQueryVectorPtr;
}

Create wrapper object:

FindBestTaskArg arg = new FindBestTaskArg();
arg.learnerPtr = learnerPtr;
arg.queries = queries;
arg.queryVectorPtr = queryVectorPtr;
arg.parentQueryVectorPtr = parentQueryVectorPtr;
arg.bitmapPtr = visitedBitmapPtr;
arg.workQueuePtr = workQueuePtr;
arg.aQueryVectorPtr = aQueryVectorPtr;
arg.bQueryVectorPtr = bQueryVectorPtr;

Call the function using the wrapper object:

atomic {
    FindBestInsertTask(arg);
}

98% of wasted work is due to the wrapper object.

2 threads: 24% execution time
4 threads: 80% execution time

Page 25: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Bayes – Solution

atomic {
    FindBestInsertTaskArg(
        toId,
        learnerPtr,
        queries,
        queryVectorPtr,
        parentQueryVectorPtr,
        numTotalParent,
        basePenalty,
        baseLogLikelihood,
        bitmapPtr,
        workQueuePtr,
        aQueryVectorPtr,
        bQueryVectorPtr
    );
}

Pass the arguments directly and avoid using the wrapper object.

Page 26: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Intruder – Map Data Structure

[Figure: packet fragments arriving in a network stream and the map holding the assembled packet fragments. Labels: Network Stream, Assembled packet fragments.]

Page 27: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Intruder – Map Data Structure

[Figure: packet fragments arriving in a network stream and the map holding the assembled packet fragments. Labels: Network Stream, Assembled packet fragments.]

Aborts caused 68% wasted work.

The map was replaced with a chaining hashtable.

Page 28: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Intruder – Moving Code

Write-write conflicts are detected eagerly.

More to roll back, more wasted work.

atomic {
    Decoded decodedPtr = new Decoded();
    char[] data = new char[length];
    Array.Copy(packetPtr.Data, data, length);
    decodedPtr.flowId = flowId;
    decodedPtr.data = data;
}
this.decodedQueuePtr.Push(decodedPtr);

Little to roll back, less wasted work.

Page 29: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Labyrinth

atomic {
    localGrid.CopyFrom(globalGrid);
    if (this.PdoExpansion(myGrid, myExpansionQueue, src, dst)) {
        pointVector = PdoTraceback(grid, myGrid, dst, bendCost);
        success = true;
        raced = grid.addPathOfOffsets(pointVector);
    }
}

2 threads: 80% wasted work
4 threads: 98% wasted work

Watson (PACT'07): it is safe if localGrid is not up to date.

Don't instrument CopyFrom with transactional reads and writes.

Page 30: Discovering and Understanding Performance Bottlenecks in Transactional Applications

Summary

• Design principles
  – Abstract the underlying TM system
  – Report results in terms of source-language constructs
  – Low instrumentation probe effect and overhead

• Profiling techniques
  – Visualizing transactions
  – Conflict point discovery
  – Identifying conflicting data structures

Page 31: Discovering and Understanding Performance Bottlenecks in Transactional Applications

PPoPP'2010: Debugging Programs that use Atomic Blocks and Transactional Memory

ICS'2009: QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory

PPoPP'2008: Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server

The End