Executing Parallel Programs with Potential Bottlenecks Efficiently
![Page 1: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/1.jpg)
Executing Parallel Programs with Potential Bottlenecks Efficiently
University of Tokyo
Yoshihiro Oyama, Kenjiro Taura (visiting UCSD)
Akinori Yonezawa
![Page 2: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/2.jpg)
Programs We Consider
• Programs updating shared data frequently with mutex operations
• Context: implementation of concurrent OO languages on SMPs and DSM machines
– e.g., synchronized methods in Java
[Figure: a bottleneck object (e.g., a counter) receiving “update!” calls through exclusive methods]
![Page 3: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/3.jpg)
Amdahl’s Law
int foo(…) {
    int x = 0, y = 0;
    parallel for (…) { ... }
    lock();
    printf(…);
    unlock();
    parallel for (…) { c[i] = 0; }
    parallel for (…) { baz(5); }
    return x * 2 + y;
}
90% → can execute in parallel
10% → must execute sequentially (bottleneck)
10 times speedup, at most
You expect 10 times speedup, but...
Can you really gain 10 times speedup???
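The 10-times bound on this slide follows from Amdahl's law. As a worked sketch, write p for the parallel fraction and N for the number of PEs (both symbols are introduced here for illustration and do not appear on the slide):

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}.

% With p = 0.9:
\lim_{N \to \infty} S(N) = \frac{1}{0.1} = 10,
\qquad
S(50) = \frac{1}{0.1 + 0.018} \approx 8.5 .
```

The bound assumes the sequential 10% costs the same time on any number of PEs. The slides that follow are about the case where that assumption fails and the bottleneck itself gets slower as PEs are added.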
![Page 4: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/4.jpg)
Speedup Curves for Programs with Bottlenecks
[Figure: execution time vs. # of PEs; curves: ideal, real]
“Excessive” processors may be used!
∵ It is difficult to predict dynamic behavior
∵ Different phases need different numbers of PEs
![Page 5: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/5.jpg)
Preliminary Experiments using a Simple Counter Program in C
[Figure: time (msec) vs. # of PEs; series: spin, simple blocking lock, our scheme]
• Solaris threads & Ultra Enterprise 10000
• Each processor increments a shared counter in parallel
The time does not remain constant; it increases dramatically as PEs are added.
![Page 6: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/6.jpg)
Goal
• Efficient execution of programs with bottlenecks
– Focusing on synchronization of methods
• Making the time to execute a whole program in parallel closer to the time to execute only the bottlenecks sequentially
[Figure: execution time on 1 PE vs. 50 PEs, split into bottleneck parts and other parts]
![Page 7: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/7.jpg)
What Problem Should We Solve?
[Figure: execution time on 1 PE vs. 50 PEs, split into bottleneck parts and other parts, for a naïve implementation and an ideal implementation]
Stop the increase of the time consumed in bottlenecks!
![Page 8: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/8.jpg)
Put it in Prof. Ito’s terminology!
• He aims at keeping:
– the PP/M ≧ SP/S property
• Our work aims at keeping:
– the PP/M ≧ PP/S property
Performance on 100 PEs should be higher than that on 1 PE!
![Page 9: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/9.jpg)
Presentation Overview
• Examples of potential bottlenecks
• Two naïve schemes and their problems
– Local-based execution
– Owner-based execution
• Our scheme
– Detachment of requests
– Priority mechanism using compare-and-swap
– Two compile-time optimizations
• Performance evaluation & Related work
![Page 10: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/10.jpg)
Examples of Potential Bottleneck Objects
• Objects introduced to easily reuse MT-unsafe functions in MT environments
• Abstract I/O objects
• Stubs in distributed systems
– One stub conducts all communications in a site
• Shared global variables
– e.g., counters to collect statistics information
It is sometimes difficult to eliminate them.
![Page 11: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/11.jpg)
![Page 12: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/12.jpg)
Local-based Execution (e.g., Implementation with Spin-locks)
[Figure: an object with its instance variables; each PE runs its own method on the object]
Each PE executes methods by itself
↓
Each PE references/updates an object by itself
Advantage: No need to move “computation”
Disadvantage: Cache misses when referencing an object (due to invalidation/update of cache by other processors)
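A minimal sketch of local-based execution on a counter, assuming C11 atomics instead of the Solaris threads used in the experiments. The names `counter_t`, `counter_init`, and `counter_increment` are hypothetical and chosen only for illustration. Every caller takes the spin-lock and touches the instance variable itself, which is why the object's cache line moves between processors.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical bottleneck object: a counter guarded by a spin-lock. */
typedef struct {
    atomic_flag lock;   /* test-and-set spin-lock */
    long count;         /* instance variable updated by every caller */
} counter_t;

void counter_init(counter_t *c) {
    assert(c != NULL);
    atomic_flag_clear(&c->lock);
    c->count = 0;
}

/* Exclusive method: the calling PE executes the update by itself. */
void counter_increment(counter_t *c) {
    assert(c != NULL);
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
        ;  /* spin until the lock is released */
    c->count++;  /* this access misses in cache if another PE updated it last */
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
}
```

No request is created and nothing is handed to another processor; the cost is paid in cache misses on `count` and on the lock word itself.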
![Page 13: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/13.jpg)
Confirmation of Overhead in Local-based Execution
[Figure: time (msec) vs. # of PEs; series: empty method, counter; C program on Ultra Enterprise 10000]
Overhead of referencing/updating an object
• grows with the number of PEs
• occupies 1/3 of the whole execution time on 60 PEs
![Page 14: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/14.jpg)
Owner-based Execution
[Figure: an object, its owner, non-owners, and a request]
• request = a data structure containing method info
• owner = the processor currently holding the object’s lock
• owner present → creates and inserts a request
• owner absent → becomes the owner and executes the method
![Page 15: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/15.jpg)
Owner-based Execution with Simple Blocking Locks
[Figure: an object with its instance variables and a queue of requests]
Requests are dequeued
• one by one
• with auxiliary locks
One processor likely executes multiple methods consecutively
![Page 16: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/16.jpg)
Advantages/Disadvantages of Owner-based Execution
(focusing on owner’s execution, which typically gives a critical path)
Advantage: Fewer cache misses to reference an object
Disadvantages: Overhead to move “computation”
• Synchronization operations for a queue
• Waiting time to manipulate a queue
• Cache misses to read requests
Can they be reduced?
![Page 17: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/17.jpg)
![Page 18: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/18.jpg)
Overview of Our Scheme
• Improve simple blocking locks
– Detach requests
  • Reduce the frequency of mutex operations
– Give high priority to owner
  • Reduce the time required to take control of requests
– Prefetch requests
  • Reduce cache misses in reading requests
Our scheme is realized implicitly by the compiler and runtime of a concurrent object-oriented language, Schematic
![Page 19: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/19.jpg)
Data Structures
• Requests are managed with a list
– 1-word pointer area (lock area) is added to each object
– Non-owner: creates and inserts a request
– Owner: picks requests out and executes them
[Figure: an object and its request list]
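One possible layout for these data structures, sketched in C11. The slide only states that requests form a list and that each object gains a 1-word lock area; the field names, the method-pointer representation, and the counter instance variable below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct object object_t;

/* A request: a record describing one pending method invocation. */
typedef struct request {
    struct request *next;                     /* next request in the list */
    void (*method)(object_t *self, long arg); /* method to run (hypothetical encoding) */
    long arg;                                 /* its argument */
} request_t;

/* An object: the added 1-word lock area plus its instance variables. */
struct object {
    _Atomic(request_t *) lock_area;  /* head of the request list */
    long count;                      /* instance variable (a counter) */
};

/* Hypothetical method body: add arg to the counter. */
void counter_add(object_t *self, long arg) { self->count += arg; }

/* Fill in a request for counter_add; the caller owns the storage. */
void request_init(request_t *req, long arg) {
    assert(req != NULL);
    req->next = NULL;
    req->method = counter_add;
    req->arg = arg;
}
```

Keeping the list head in a single word is what lets both insertion and detachment be done with one atomic instruction each.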
![Page 20: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/20.jpg)
Design Policy
• Owner’s behavior determines a critical path
• We make owner’s execution fast, above all
• We allow non-owners’ execution to be slow
Battle in Bottleneck: 1 owner vs. 99 non-owners
We should help him!
![Page 21: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/21.jpg)
Non-owners Inserting a Request
[Figure: an object and requests labeled A, B, C and X, Y, Z]
![Page 22: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/22.jpg)
Update with compare-and-swap; retry if interrupted
![Page 23: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/23.jpg)
Non-owners repeat the loop until success
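The retry loop can be sketched with C11 `atomic_compare_exchange_weak`. The encoding of the lock area is an assumption of this sketch, not something the slides specify: `FREE` (a null pointer) means no owner, and a hypothetical sentinel `OWNED` means an owner exists and the list is empty. The function name `acquire_or_insert` is also hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct request { struct request *next; long arg; } request_t;
typedef struct { _Atomic(request_t *) lock_area; long count; } object_t;

/* Assumed lock-area encoding (not from the slides). */
static request_t owned_empty;               /* sentinel: owner present, no requests */
#define FREE  ((request_t *)NULL)           /* no owner */
#define OWNED (&owned_empty)

/* Returns true if the caller became the owner, false if req was inserted. */
bool acquire_or_insert(object_t *obj, request_t *req) {
    assert(obj != NULL && req != NULL);
    request_t *head = atomic_load(&obj->lock_area);
    for (;;) {
        if (head == FREE) {
            /* Owner absent: try to become the owner. */
            if (atomic_compare_exchange_weak(&obj->lock_area, &head, OWNED))
                return true;
        } else {
            /* Owner present: link in front of the current head. */
            req->next = head;
            if (atomic_compare_exchange_weak(&obj->lock_area, &head, req))
                return false;
        }
        /* CAS failed: head now holds the latest value, so retry. */
    }
}
```

Because new requests go to the front, a list built this way holds the most recently inserted request first and ends at the `OWNED` sentinel.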
![Page 24: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/24.jpg)
Owner Detaching Requests
[Figure: an object and requests labeled A, B, C and Y]
Important
• A whole list is detached
• Update with swap always succeeds → owner is never interrupted by other processors
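Detachment can be sketched with C11 `atomic_exchange`, which is an unconditional swap and therefore has no retry loop. The `FREE`/`OWNED` encoding, the names `detach_all` and `owner_drain`, and the release step at the end are assumptions of this sketch; the slides do not say how the owner gives up ownership.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct request { struct request *next; long arg; } request_t;
typedef struct { _Atomic(request_t *) lock_area; long count; } object_t;

/* Assumed lock-area encoding (not from the slides). */
static request_t owned_empty;               /* sentinel: owner present, no requests */
#define FREE  ((request_t *)NULL)           /* no owner */
#define OWNED (&owned_empty)

/* Take the whole list in one swap, leaving "owned, empty" behind. */
request_t *detach_all(object_t *obj) {
    assert(obj != NULL);
    return atomic_exchange(&obj->lock_area, OWNED);
}

/* Owner's loop: run detached batches, then release if nothing new arrived. */
void owner_drain(object_t *obj) {
    for (;;) {
        request_t *req = detach_all(obj);
        while (req != OWNED) {            /* the sentinel ends the list */
            request_t *next = req->next;
            obj->count += req->arg;       /* method body, run without mutex ops */
            req = next;
        }
        request_t *expected = OWNED;
        if (atomic_compare_exchange_strong(&obj->lock_area, &expected, FREE))
            return;                       /* released; otherwise detach again */
    }
}
```

Non-owners may keep inserting while a detached batch runs; they only touch the lock area, never the detached list, so the owner walks it privately.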
![Page 25: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/25.jpg)
![Page 26: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/26.jpg)
Owner Executing Requests
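[Figure: the owner runs detached requests while non-owners insert new requests into the object; requests labeled A, B, C and X, Y, Z]
1. No synchronization operations by owner
• Detached requests are executed in turn without mutex ops
• Non-owners insert requests without disturbing the owner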
![Page 27: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/27.jpg)
Giving Higher Priority to Owner
• Insertion by non-owner (compare-and-swap): may fail many times
• Detachment by owner (swap): always succeeds in a constant number of steps
2. Owner never spins to manipulate requests
![Page 28: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/28.jpg)
Compile-time Optimization (1/2)
• Prefetch requests
– While the current request is processed, the next request is prefetched
3. Reduce cache misses to read requests
...
while (req != NULL) {
    PREFETCH(req->next);
    EXECUTE(req);
    req = req->next;
}
...
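A compilable variant of the loop above, as a sketch. It assumes GCC or Clang, whose `__builtin_prefetch` stands in for `PREFETCH`, and a NULL-terminated detached list; `execute` and `process_detached` are hypothetical names, and the method body is a plain counter addition.

```c
#include <assert.h>
#include <stddef.h>

typedef struct request { struct request *next; long arg; } request_t;

/* Hypothetical method body: add the request's argument to the counter. */
void execute(long *counter, const request_t *req) {
    *counter += req->arg;
}

/* Walk a detached, NULL-terminated list, prefetching one request ahead. */
void process_detached(request_t *req, long *counter) {
    assert(counter != NULL);
    while (req != NULL) {
        request_t *next = req->next;
        if (next != NULL)
            __builtin_prefetch(next, 0, 3);  /* read access, keep in cache */
        execute(counter, req);
        req = next;
    }
}
```

Requests were written by other processors, so the first read of each one is likely a cache miss; issuing the prefetch before `execute` lets that miss overlap with useful work.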
![Page 29: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/29.jpg)
Compile-time Optimization (2/2)
• Caching instance variables in registers
– Non-owners do not reference/update an object while detached requests are processed
– Instance variables (IVs) are passed in registers
• Two versions of code are provided for one method
– Code to process requests: uses instance variables in memory
– Code to execute methods directly: uses instance variables in registers
![Page 30: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/30.jpg)
Achieving Similar Effects in Low-level Languages (e.g., in C)
• “Always spin-lock” approach
– Waste of CPU cycles, memory bandwidth
– Deadlocks
• “Finding bottlenecks → rewriting code” approach
– Implements owner-based execution only in bottlenecks
– Harder than “support of high-level lang” approach
  • Implementing owner-based execution is troublesome
  • Bottlenecks appear dynamically in some programs
![Page 31: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/31.jpg)
![Page 32: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/32.jpg)
Experimental Results (1/2)
RNA secondary structure prediction (with stat.) in Schematic on Ultra Enterprise 10000
[Figure: time (msec) vs. # of PEs; series: spin lock, blocking lock, our scheme]
![Page 33: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/33.jpg)
Experimental Results (2/2)
RNA secondary structure prediction (with stat.) in Schematic on Origin 2000
[Figure: time (msec) vs. # of PEs; series: spin lock, blocking lock, our scheme]
![Page 34: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/34.jpg)
Interesting Results using a Simple Counter Program in C
• Simple blocking locks: waiting time was the largest overhead
– 70% of owner’s whole execution time
• Our scheme is efficient also on uniprocessor (execution time)
– Spin-locks: 641 msec
– Simple blocking locks: 1025 msec
– Our scheme: 810 msec
![Page 35: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/35.jpg)
Related Work (1/3)
- execution of methods invoked in parallel -
• ICC++ [Chien et al. 96]
– Detects nonexclusive methods through static analysis
• Concurrent Aggregates [Chien 91]
– Realizes interleaving through explicit programming
• Cooperative Technique [Barnes 93]
– PE entering critical section later “helps” predecessors
• Focus on exposing parallelism among nonexclusive operations
• No remark on performance loss in bottlenecks
![Page 36: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/36.jpg)
Related Work (2/3)
- efficient spin-locks when contention occurs -
• MCS Lock [Mellor-Crummey et al. 91]
– Provides spin area for each processor
• Exponential Backoff [Anderson 90]
– A heuristic to “withdraw” processors which failed in lock acquisition
– Needs some skill to determine parameters
These locks give local-based execution
→ Low locality in referencing bottleneck objects
![Page 37: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/37.jpg)
Related Work (3/3)
- efficient Java monitors -
• Bimodal object-locking [Onodera et al. 98], Thin Locks [Bacon et al. 98]
– Influenced our low-level implementation
– Use unoptimized “fat locks” in contended objects
• Meta-locks [Agesen et al. 99]
– Clever technique similar to MCS locks
– No busy-waiting even in contended objects
• Their primary concern lies in uncontended cases
• They do not take locality of object references into account
![Page 38: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/38.jpg)
Summary
• Serious performance loss in existing schemes
– spin-locks: low locality of object references
– blocking locks: overhead in contended request queue
• Very fast execution in contended objects
– Highly-optimized owner-based execution
• Excellent performance
– Several times faster than simple schemes! (several hundred percent speedup!)
![Page 39: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/39.jpg)
Future Work
• Solving the problem of large memory use in some cases
– A long list of requests may be formed
– The problem is common to owner-based schemes
– This work focused on time-efficiency, not on space-efficiency
– Simple solution: memory used for requests ≧ some threshold ⇒ dynamic switch to local-based execution
• Increasing/decreasing PEs according to execution status
– System automatically decides the “best” number of PEs for each program point
– It eliminates excessive processors altogether
![Page 40: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/40.jpg)
The slides from here on are for the Q&A session
![Page 41: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/41.jpg)
More Detailed Measurements using a Counter Program in C
[Figure: time (msec) vs. # of PEs; legend labels: spin block block (detach) getone detach reg. + pref.]
• Solaris threads & Sun Ultra Enterprise 10000
• Each processor increments a shared counter
![Page 42: Executing Parallel Programs with Potential Bottlenecks Efficiently](https://reader035.fdocuments.us/reader035/viewer/2022062301/56812cec550346895d91b01f/html5/thumbnails/42.jpg)
No guarantee of FIFO order
• The method invoked later may be executed earlier
– Simple solution: “reverse” detached requests
– Better solution:
• Can we use a queue, instead of a list?
• Are 64-bit compare-and-swap/swap necessary?
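The simple solution of reversing detached requests can be sketched as an in-place list reversal. This assumes a NULL-terminated detached list in which the newest request comes first, as happens when requests are pushed at the head; `reverse_requests` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

typedef struct request { struct request *next; long arg; } request_t;

/* Reverse a detached list in place so the oldest request comes first. */
request_t *reverse_requests(request_t *head) {
    request_t *prev = NULL;
    while (head != NULL) {
        request_t *next = head->next;
        head->next = prev;   /* point this node back at the earlier part */
        prev = head;
        head = next;
    }
    return prev;             /* new head: the request inserted first */
}
```

The reversal is done by the owner on a list no other processor can reach, so it needs no atomic operations, but it adds one extra pass over every batch on the critical path. It restores insertion order only within one detached batch.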