Windows-NT based Distributed Virtual Parallel Machine
http://www.cs.technion.ac.il/Labs/Millipede
The MILLIPEDE Project, Technion, Israel
What is Millipede?
A strong Virtual Parallel Machine that employs non-dedicated distributed environments.
Layered architecture (top to bottom):
• Programs / programming paradigms
• Implementations of parallel programming languages: ParC, Java, SPLASH, ParFortran90, C, ParPar, Cilk/Calypso, CC++, "Bare Millipede", other
• Millipede layer: Distributed Shared Memory (DSM), Events Mechanism (MJEC), Migration Services (MGS), user-mode threads
• Communication packages: U-Net, Transis, Horus, ...
• Operating system services and software packages: communication, threads, page protection, I/O
• Distributed environment
So, what's in a VPM?
Check list:
• Uses a non-dedicated cluster of PCs (+ SMPs)
• Multi-threaded
• Shared memory
• User-mode
• Strong support for weak memory
• Dynamic page- and job-migration
• Load sharing for maximal locality of reference
• Convergence to the optimal level of parallelism
Millipede inside
Using a non-dedicated cluster
• Dynamically identify idle machines
• Move work to idle machines
• Evacuate busy machines
• Do everything transparently to the native user
• Co-existence of several parallel applications
Multi-Threaded Environments
• Well known benefits:
  – Better utilization of resources
  – An intuitive, high level of abstraction
  – Latency hiding by overlapping computation and communication
• Natural for parallel programming paradigms & environments:
  – Programmer defines the maximal level of parallelism
  – Actual level of parallelism is set dynamically; applications scale up and down
  – Nested parallelism
  – SMPs
• The tradeoff: a higher level of parallelism vs. better locality of memory reference
• Optimal speedup is not necessarily achieved with the maximal number of computers
• The achieved level of parallelism depends on the program's needs and on the capabilities of the system
Convergence to Optimal Speedup
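The convergence claim can be illustrated with a toy cost model. Everything here is an assumption for illustration (the T1/p + c*p cost shape, the constants), not Millipede's actual placement policy: compute time shrinks as T1/p while communication overhead grows as c*p, so past some interior p adding hosts slows the program down.

```c
/* Toy cost model (assumed for illustration, not Millipede's policy):
 * on p hosts the program needs T1/p compute time plus c*p
 * communication overhead, so more hosts eventually mean less speed. */

double exec_time(double t1, double c, int p) {
    return t1 / p + c * p;          /* compute share + overhead */
}

/* Search p = 1..max_p for the fastest configuration. */
int best_parallelism(double t1, double c, int max_p) {
    int best = 1;
    for (int p = 2; p <= max_p; p++)
        if (exec_time(t1, c, p) < exec_time(t1, c, best))
            best = p;
    return best;
}
```

With t1 = 100 and c = 1 the optimum sits at p = 10, well below a 32-host cluster: exactly the "optimal speedup, not maximal number of computers" point above.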
PVM

    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);

    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }

    /* Do calculations with data */
    result = work(me, n, data, tids, nproc);

    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);

    /* Exit PVM before stopping */
    pvm_exit();
C-Linda

    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);

    /* Worker id is given at creation; no need to compute it now */

    /* Do calculation, put result in DSM */
    out("result", id, work(id, n, data, nproc));
“Bare” Millipede
result = work(milGetMyId(), n, data, milGetTotalIds());
No / Explicit / Implicit Access to Shared Memory
Relaxed Consistency (avoiding false sharing and ping-pong)
• Sequential, CRUW, Sync(var), Arbitrary-CW Sync
• Multiple relaxations for different shared variables within the same program
• No broadcast, no central address servers (so it can work efficiently on interconnected LANs)
• New protocols welcome (user-defined?!)
• Step-by-step optimization towards maximal parallelism
LU decomposition, 1024x1024 matrix, written in SPLASH. Advantages gained when reducing the consistency of a single variable (the Global structure):
[Chart: number of page migrations per host for page #4 (0-70 range) on 1-5 hosts, original vs. reduced consistency]
Reducing Consistency
[Chart: speedups (1-4 range) on 1-5 hosts, original vs. reduced consistency]
MJEC - Millipede Job Event Control
• A job has a unique system-wide id
• Jobs communicate and synchronize by sending events
• Although a job is mobile, its events follow it and reach its event queue wherever it goes
• Event handlers are context-sensitive
An open mechanism with which various synchronization methods can be implemented.
MJEC (con't)
• Modes:
  – In Execution Mode, arriving events are enqueued
  – In Dispatching Mode, events are dequeued and handled by a user-supplied dispatching routine
MJEC Interface
Registration and entering dispatch mode:
    milEnterDispatchingMode((FUNC)foo, void *context)
Posting an event:
    milPostEvent(id target, int event, int data)
Dispatcher routine syntax:
    int foo(id origin, int event, int data, void *context)
Execution Mode and Dispatching Mode (flowchart, as pseudocode):

    milEnterDispatchingMode(func, context):
        ret := func(INIT, context)
        loop:
            if ret == EXIT:
                ret := func(EXIT, context)
                return to Execution Mode
            wait for a pending event
            ret := func(event, context)
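The dispatch loop above can be sketched as a small plain-C simulation. Everything here is mocked for illustration (the queue, the constants, the mjec-prefixed names); it is not the real Millipede API, only its control flow:

```c
/* Plain-C simulation of the MJEC dispatch loop (mock, not the real
 * API): events are drained from a queue and fed to a user-supplied
 * dispatcher until it asks to leave dispatching mode. */

enum { INIT = -1, EXIT_DISPATCHER = 0, STAY_IN_DISPATCHER = 1 };
enum { ARR = 1, DEP = 2 };            /* sample events, as in the slides */

#define QMAX 16
int queue[QMAX];
int qhead, qtail;

int  event_pending(void)    { return qhead != qtail; }
int  next_event(void)       { return queue[qhead++]; }
void mjecPostEvent(int ev)  { queue[qtail++] = ev; }

/* Mocked milEnterDispatchingMode: run the dispatcher on INIT, then on
 * each queued event, until it returns EXIT_DISPATCHER (the real system
 * would block on an empty queue; here we simply stop). */
void mjecEnterDispatchingMode(int (*func)(int event, void *ctx), void *ctx) {
    int ret = func(INIT, ctx);
    while (ret != EXIT_DISPATCHER && event_pending())
        ret = func(next_event(), ctx);
}

/* Example dispatcher: count ARR events until a DEP event releases us. */
int arrivals;
int count_until_dep(int event, void *ctx) {
    (void)ctx;
    if (event == ARR) arrivals++;
    return (event == DEP) ? EXIT_DISPATCHER : STAY_IN_DISPATCHER;
}
```

Note how DEP events posted while a job executes simply wait in the queue until the job next enters dispatching mode, which is exactly what makes the barrier example below work.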
Experience with MJEC
• ParC: ~250 lines; SPLASH: ~120 lines
• Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
• Implementation of location-dependent services (e.g., graphical display)
Example - Barriers with MJEC
[Diagram: each job calls BARRIER(), posting an ARR event to the Barrier Server job, and enters dispatching mode]
Barrier() {
    milPostEvent(BARSERV, ARR, 0);
    milEnterDispatchingMode(wait_in_barrier, 0);
}

wait_in_barrier(src, event, data, context) {
    if (event == DEP)
        return EXIT_DISPATCHER;
    else
        return STAY_IN_DISPATCHER;
}
Example - Barriers with MJEC (con't)
[Diagram: once all jobs have arrived, the Barrier Server posts a DEP event back to each waiting job, releasing it from dispatching mode]
BarrierServer() {
    milEnterDispatchingMode(barrier_server, info);
}

barrier_server(src, event, data, context) {
    if (event == ARR) {
        enqueue(context.queue, src);
        context.cnt++;
    }
    if (should_release(context))
        while (context.cnt > 0) {
            milPostEvent(dequeue(context.queue), DEP, 0);
            context.cnt--;
        }
    return STAY_IN_DISPATCHER;
}
Dynamic Page- and Job-Migration
• Migration may occur in case of:
  – Remote memory access
  – Load imbalance
  – The user coming back from lunch
  – Improving locality by rearranging locations
• Sometimes migration should be disabled:
  – By the system: ping-pong, critical sections
  – By the programmer: control systems
Locality of memory reference is THE dominant efficiency factor
Migration can help locality. [Diagram panels: only job migration / only page migration / page & job migration]
Load Sharing + Max. Locality = Minimum-Weight multiway cut
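A toy illustration of the quantity being minimized (the graph and its weights are invented for this sketch): vertices are threads and pages, an edge weight counts the accesses between its endpoints, and a placement of vertices on hosts pays the weight of every edge it cuts.

```c
/* Toy multiway-cut illustration (graph and weights invented): a
 * placement of threads/pages on hosts pays the weight of every edge
 * whose endpoints land on different hosts. Lower cut weight means
 * better locality of reference. */

typedef struct { int u, v, w; } edge;

/* Total weight of edges whose endpoints are placed on different hosts. */
int cut_weight(const edge *e, int m, const int *host) {
    int total = 0;
    for (int i = 0; i < m; i++)
        if (host[e[i].u] != host[e[i].v])
            total += e[i].w;
    return total;
}
```

On a 4-vertex chain with weights 5, 1, 4, splitting at the weight-1 edge costs 1, while interleaving the vertices across two hosts costs 10: the placement, not the host count, determines locality.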
Problems with the multiway cut model
• NP-hard for #cuts > 2 (and we have n > X,000,000); polynomial 2-approximations are known
• Not optimized for load balancing
• Page replicas
• The graph changes dynamically
• Only external accesses are recorded ==> only partial information is available
Our Approach
• Record the history of remote accesses
• Use this information when making decisions about load balancing/load sharing
• Save old information to avoid repeating bad decisions (learn from mistakes)
• Detect and solve ping-pong situations
• Do everything by piggybacking on communication that is taking place anyway
Ping-Pong
Detection (local):
1. Local threads attempt to use the page a short time after it leaves the local host
2. The page leaves the host shortly after arrival
Treatment (by the ping-pong server):
• Collect information about all participating hosts and threads
• Try to locate an underloaded target host
• Stabilize the system by locking in pages/threads
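The two local detection rules can be sketched in C. The threshold, the tick units, and the page_history bookkeeping are all invented for illustration; the real detector is internal to Millipede's DSM layer:

```c
/* Sketch of the two local ping-pong detection rules (threshold, tick
 * units, and bookkeeping invented for illustration; the real detector
 * lives inside Millipede's DSM layer). */

typedef struct {
    long arrived_at;   /* when the page last arrived at this host */
    long left_at;      /* when the page last left this host */
} page_history;

enum { PP_THRESHOLD = 10 };  /* "a short time", in arbitrary ticks */

/* Rule 2: the page leaves the host shortly after arrival. */
int pp_on_departure(const page_history *h, long now) {
    return now - h->arrived_at < PP_THRESHOLD;
}

/* Rule 1: a local thread touches the page shortly after it left. */
int pp_on_local_access(const page_history *h, long now) {
    return now - h->left_at < PP_THRESHOLD;
}
```

The threshold is exactly the sensitivity knob studied in the TSP-1/TSP-2 measurements below: raise it and more pages are suspected of ping-pong.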
TSP - Effect of Locality (15 cities, Bare Millipede)
[Chart: execution time (0-4000 sec) on 1-6 hosts for NO-FS, OPTIMIZED-FS, and FS]
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by two threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.
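The NO-FS alignment trick can be sketched as a thin allocation wrapper that pads every shared allocation to a whole page, so no two threads' data ever share a DSM page. PAGE_SIZE is assumed to be 4096 here; a real implementation would query the system:

```c
/* NO-FS sketch: pad every shared allocation to a whole page so no two
 * threads' data share a DSM page. PAGE_SIZE is assumed (4096); a real
 * implementation would query the operating system. */
#include <stdlib.h>

#define PAGE_SIZE 4096

void *page_aligned_alloc(size_t size) {
    /* aligned_alloc (C11) requires the size to be a multiple of the
     * alignment, so round the request up to whole pages first. */
    size_t rounded = (size + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
    return aligned_alloc(PAGE_SIZE, rounded);
}
```

The cost is the padding: the table below shows that, when padding is not an option and k threads do share a page, the history mechanism recovers most of the lost performance.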
k   optimized?   # DSM-related   # ping-pong       # thread      execution
                 messages        treatment msgs    migrations    time (sec)
2   Yes              5100            290               68            645
2   No             176120              0               23           1020
3   Yes              4080            279               87            620
3   No             160460              0               32           1514
4   Yes              5060            343               99            690
4   No             155540              0               44           1515
5   Yes              6160            443              139            700
5   No             162505              0               55           1442

TSP on 6 hosts; k = number of threads falsely sharing a page.
Ping-Pong Detection Sensitivity
TSP-1: [Chart: execution time (0-1000 sec) vs. detection sensitivity 2-20]
Best results are achieved at maximal sensitivity, since all pages are accessed frequently.
TSP-2: [Chart: execution time (0-1100 sec) vs. detection sensitivity 2-20]
Since some pages are accessed frequently and others only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.
Applications
• Numerical computations: multigrid
• Model checking: BDDs
• Compute-intensive graphics: ray tracing, radiosity
• Games, search trees, pruning, tracking, CFD, ...
Performance Evaluation
Tuning parameters:
• L - underloaded, H - overloaded
• Delta (ms) - lock-in time
• t/o delta - polling (MGS, DSM)
• msg delta - system pages delta
• T_epoch - max history time
• ??? - remove old histories / refresh old histories
• L_epoch - history length
• Page histories vs. job histories
• Migration heuristic - which function?
• Ping-pong - what is the initial noise? at what frequency is it ping-pong?
LU decomposition, 1024x1024 matrix, written in SPLASH: performance improvements when there are few threads on each host.
[Chart: speedup (1-6) on 1-6 hosts, comparing 1 thread/host with 3 threads/host]
LU decomposition, 2048x2048 matrix, written in SPLASH: super-linear speedups due to the caching effect.
[Chart: speedup on 1, 3, and 6 hosts - 4.47 on 3 hosts, 7.18 on 6 hosts]
Jacobi relaxation, 512x512 matrix (using 2 matrices, no false sharing), written in ParC.
[Chart: execution time (0-180 sec) and speedup (up to ~3.5) on 1-4 hosts]
Overhead of ParC/Millipede on a single host, tested with a tracking algorithm:
[Chart: relative execution time for 1, 10, and 20 targets - Pure C, "Bare" Millipede, and ParC on Millipede all within roughly 0.97-1.05 of one another]
Info...
http://www.cs.technion.ac.il/Labs/Millipede
Release available at the Millipede site!