SAĞNAK TAŞIRLAR, VIVEK SARKAR
Transcript of SAĞNAK TAŞIRLAR, VIVEK SARKAR
DATA-DRIVEN TASKS ���AND���
THEIR IMPLEMENTATION SAĞNAK TAŞIRLAR, VIVEK SARKAR
DEPARTMENT OF COMPUTER SCIENCE. RICE UNIVERSITY
1
Fork/Join graphs constraint ||-ism 2
Fork/Join models restrict task graphs to be series-parallel
Can not describe without hampering ||-ism
Fork/Join models constrain control and data dependences
Tasks can only be created after all data dependences satisfied
Necessitates ordering task creation to conform to that restriction
May hamper performance
Macro-dataflow for intuitive ||-ism 3
Kernel based programming
Build a task graph of kernel instantiations
Restrict dependences to true dependences race-freedom, determinism
Provides productivity
TaskA TaskB
TaskA TaskB
TaskA
TaskC
main
TaskB
TaskC
single-assignment data
Futures [Baker & Hewitt 1977] 4
future = (storage, resolvingProcess, waitingTasks)
Future F = {stmt1;…; return v;}!
! ! ! !task g = {stmt; F.get();…;}!
TaskF addressF TaskG TaskH TaskJ
Creation Create an empty Data-Driven Future (DDF) object
Resolution ( put ) Resolve what value a DDF is referring to
Data-Driven Tasks (DDTs) ( async await(…) ) A task provides a consumer list of DDFs on declaration A task can only read DDFs that it is registered to
Difference from futures: Creation of container (DDF) and computation (DDT) are
separate events
Data-Driven Futures (DDFs) &���Data-Driven Tasks (DDTs)
5
DataDrivenFuture = (storage, waitingTasks)
(resolvingProcess)
DDF/DDT Code Sample 6
DataDrivenFuture left = new DataDrivenFuture ();
DataDrivenFuture right = new DataDrivenFuture(); finish {
async await ( left ) useLeftChild(left); // Task1
async await ( right ) useRightChild(right); // Task2
async await ( left, right ) useBothChildren( left, right ); // Task3
async left.put(leftChildCreator()); // Task4
async right.put(rightChildCreator()); // Task5
}
Task5
Task4
Task2
Task1
Task3
DDTs provide
Non-series-parallel task dependence graph support Less restricted parallelism Better scheduling opportunities
Single assignment (SA) Race-freedom on DDF
accesses
Determinism if all shared data is expressed as DDFs
SA-value lifetime restriction
Smaller than graph lifetime
DDF creator: Provides DDF reference to
producers and consumers
DDF lifetime depends on Creator lifetime
Resolver lifetime
Consumers’ lifetimes
7
DDTA
DDTC
DDTD
DDTB
DDTE
DDTF
DDF
Data-Driven Scheduling 8
Steps register self to items wrapped into DDFs
PlaceHolderleft
DDFleft Task1
DDFleft
DDFright
Valueright
Task3
DDFleft DDFright
✕
DDF left = new DDF(); DDF right = new DDF();
TaskC
async await (left) use(left); // Task1 async await (right) use(right); // Task2
async builder(right); // Task5
PlaceHolderright ✕
Task4
resolve DDFleft async await (left,right) use(left,right); // Task3 async builder(left); // Task4
Task4
ready queue
Valueleft
Task2
DDFright
Task5
Task5
resolve DDFright
Task1 Task3 Task2
Mapping Macro-Dataflow to Task-Parallelism 9
Control & data dependences as first level constructs Task-parallel frameworks have them coupled e.g., OpenMP, Cilk
Kernel instantiations may have multiple predecessors Need to wait for all
Staged readiness concepts Created ( control dependence satisfied )
Data dependences satisfied
Schedulable / Ready
DDTs provide a natural implementation for Macro-Dataflow Every kernel instantiation is a DDT
Data dependences between DDTs are expressed through DDFs
Provides race freedom
Experimental Results 10
Compared DDT implementation with four macro-data schedulers from past work
that used Concurrent Collections (CnC)
CnC uses global data collections to synchronize tasks
DDT/DDF results obtained at task-parallel level
without allocating global data collections
CnC can be automatically translated to DDFs (ongoing work)
Use Java wait/notify for premature data access Blocking granularity
Instance level vs Collection level (fine-grain vs. coarse-grain)
A blocked task blocks an entire worker thread Need to create more worker threads to avoid deadlock
Blocking Schedulers 11
WorkerC
step1
Get (keyc) ItemCollectionΘ
keyα keyβ valueβ
valueα wait
WorkerD
step2
Put(keyc,valuec)
notify
time
Every kernel instantiation is a guarded execution Guard condition is the availability of input data
Task can be created eagerly before input data is available Promoted to ready when data provided
Delayed async Scheduling 12
Value left = new Value (); Value right = new Value (); finish { async when ( left.isReady() ) useLeftChild(left); // Task1
async when ( right.isReady()) useRightChild(right); // Task2
async when ( right.isReady() && left.isReady() ) useBothChildren( left, right ); // Task3 async left.put(leftChildCreator()); // Task4
async right.put(rightChildCreator()); // Task5
} Work Sharing Ready Task Queue
push
async1
async2
async5
Schedule
Yes No
Evaluate guard
Is true?
Yes
Requeue
No
async3
async4
pop Popped Task
Delayed?
Data Driven Rollback & Replay 13
WorkerC
step1
Get (keyc) ItemCollectionΘ
keya
keyb valueb
valuea
WorkerD
step2 Put(keyc,valuec)
waitlista
waitlistb
keyc empty waitlistc
Insert step1 to waitlistc Throw exception to unwind
step3
Re-execute steps in waitlistc on Put()
step1
Get (keyc)
Get (keyd)
valuec
step1 ✕
Experimental Setup 14
4-socket Xeon quad-core Intel E7730 2.4 GHz Shared 3MB L2 cache per pair of cores. Main memory 32 GBs. #worker threads:16
8-way SMT 8-core Niagara Sun UltraSPARC T2 Shared 4MB L2 cache #worker threads: 64
32-bit Sun Hotspot JDK 1.6 JVM GCC 4.1.2 for JNI
30 runs for statistical soundness Read ‘Serial’ as single-threaded execution of || code
Cholesky decomposition 15
10,081 10,010 10,305 10,309
8,748
2,472
1,197 979 853 790
0
2,000
4,000
6,000
8,000
10,000
12,000
Coarse Grain Blocking
Fine Grain Blocking
Delayed Async Data Driven Rollback&Replay
Data Driven Futures
Exe
cuti
on
in
mil
li-s
ecs
Serial Parallel
Average execution times and 90% confidence interval of 30 runs of single threaded and 16-threaded executions for blocked Cholesky decomposition CnC application with Habanero-Java steps on 16-core Xeon with input matrix size 2000 × 2000 and with tile size 125 × 125
Tasks
Black-Scholes formula ( PARSEC ) 16
33,871 33,966 34,311 34,121 34,729
4,300 4,309 4,279 5,061
2,353
0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
40,000
Coarse Grain Blocking
Fine Grain Blocking
Delayed Async Data Driven Rollback&Replay
Data Driven Futures
Exe
cuti
on
in
mil
li-s
ecs
Serial Parallel
Average execution times and 90% confidence interval of 30 runs of single threaded and 16-threaded executions for blocked Black-Scholes CnC application with Habanero-Java steps on 16-core Xeon with input size 1,000,000 and with tile size 62,500
Tasks
Rician Denoising ( Medical Imaging ) 17
498,776 499,666 483,770
349,051
81,502 58,313 53,569 53,817
0
100,000
200,000
300,000
400,000
500,000
Coarse Grain Blocking * Fine Grain Blocking * Delayed Async * Data Driven Futures
Exe
cuti
on
in
mil
li-s
ecs
Serial Parallel
Average execution times and 90% confidence interval of 30 runs of single threaded and 16-threaded executions for blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input image size 2937 × 3872 and with tile size 267 × 484
Tasks
* Explicit memory management required for non-DDT schedules to avoid out-of-memory exception
Heart Wall Tracking Dependence Graph 18
Step1
Step3
Step4
Step5
Step6
Step7
Step8
Step9
Step10
Step2
Step1
Step3
Step4
Step5
Step6
Step7
Step8
Step9
Step10
Step2
IterationJ IterationJ+1
Heart Wall Tracking ( Rodinia ) 19
162,248 157,554 156,159
47,989
11,076 9,897
0
20,000
40,000
60,000
80,000
100,000
120,000
140,000
160,000
180,000
Delayed Async Data Driven Rollback&Replay Data Driven Futures
Exe
cuti
on
in
mil
li-s
ecs
Serial Parallel
Minimum execution times of 13 runs of single threaded and 16-threaded executions for Heart Wall Tracking CnC application with C steps on Xeon with 104 frames
Tasks
Related Work 20
Futures Can build arbitrary task graphs
get()/force() is usually a blocking operation
future task creation is bound to container at creation time
Dataflow Typically blocks on one datum (Ivar) at a time, unlike async await (…)
Nabbit ( Cilk library ) Can build arbitrary task graphs, more explicit than DDTs
No garbage collection and unwinding of task graph
Concurrent Collections ( CnC ) Globalized data collections and general tags (keys) makes memory
management challenging
DDTs can be used to obtain more efficient implementations of CnC
Conclusions 21
Data-Driven Futures and Data-Driven Tasks
help build arbitrary task graphs and extend task-parallel frameworks
introduce the more-intuitive macro-dataflow to programmers on task-parallel frameworks
support Data-Driven scheduling that outperforms alternative schedulers in both execution time and memory requirements
help to implement blocking in tasks without blocking workers
Compile Concurrent Collections down to DDTs
Compiler optimizations to move DDF allocations to further reduce lifetimes
Hierarchical DDTs for granularity optimizations
Work-stealing support for DDTs
Use DDTs to implement all blocking synchronizations without blocking worker, i.e. replace each waiting continuation as a DDT
Locality aware scheduling with DDTs
Future Work 22
For a hands-on trial, visit http://habanero.rice.edu/hj
http://habanero.rice.edu/cnc