Windows-NT based Distributed Virtual Parallel Machine
http://www.cs.technion.ac.il/Labs/Millipede
The MILLIPEDE Project, Technion, Israel
What is Millipede?
A strong Virtual Parallel Machine that employs non-dedicated distributed environments.
Layered architecture (top to bottom):
• Programs / programming paradigms
• Implementations of parallel programming languages: ParC, Java, SPLASH, ParFortran90, C, ParPar, Cilk/Calypso, CC++, "Bare Millipede", other
• Millipede layer: Distributed Shared Memory (DSM), Events Mechanism (MJEC), Migration Services (MGS), user-mode threads
• Communication packages: U-Net, Transis, Horus, ...
• Operating system services and software packages: communication, threads, page protection, I/O
• Distributed environment
So, what's in a VPM?
Check list:
• Uses a non-dedicated cluster of PCs (+ SMPs)
• Multi-threaded
• Shared memory
• User-mode
• Strong support for weak memory
• Dynamic page- and job-migration
• Load sharing for maximal locality of reference
• Convergence to the optimal level of parallelism
Millipede inside
Using a non-dedicated cluster
• Dynamically identify idle machines
• Move work to idle machines
• Evacuate busy machines
• Do everything transparently to the native user
• Co-existence of several parallel applications
Multi-Threaded Environments
• Well known benefits:
  – Better utilization of resources
  – An intuitive, high level of abstraction
  – Latency hiding by overlapping computation and communication
• Natural for parallel programming paradigms & environments:
  – Programmer defines the maximal level of parallelism
  – Actual level of parallelism is set dynamically; applications scale up and down
  – Nested parallelism
  – SMPs
• The tradeoff: a higher level of parallelism vs. better locality of memory reference
• Optimal speedup is not necessarily achieved with the maximal number of computers
• The achieved level of parallelism depends on the program's needs and on the capabilities of the system
Convergence to Optimal Speedup
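The convergence claim can be illustrated with a toy cost model. Everything here is an assumption for illustration (the T1/p + c*p cost shape, the constants), not Millipede's actual placement policy: compute time shrinks as T1/p while communication overhead grows as c*p, so past some interior p adding hosts slows the program down.

```c
/* Toy cost model (assumed for illustration, not Millipede's policy):
 * on p hosts the program needs T1/p compute time plus c*p
 * communication overhead, so more hosts eventually mean less speed. */

double exec_time(double t1, double c, int p) {
    return t1 / p + c * p;          /* compute share + overhead */
}

/* Search p = 1..max_p for the fastest configuration. */
int best_parallelism(double t1, double c, int max_p) {
    int best = 1;
    for (int p = 2; p <= max_p; p++)
        if (exec_time(t1, c, p) < exec_time(t1, c, best))
            best = p;
    return best;
}
```

With t1 = 100 and c = 1 the optimum sits at p = 10, well below a 32-host cluster: exactly the "optimal speedup, not maximal number of computers" point above.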
PVM

    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);

    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }

    /* Do calculations with data */
    result = work(me, n, data, tids, nproc);

    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);

    /* Exit PVM before stopping */
    pvm_exit();
C-Linda

    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);

    /* Worker id is given at creation; no need to compute it now */

    /* Do calculation, put result in DSM */
    out("result", id, work(id, n, data, nproc));
“Bare” Millipede
result = work(milGetMyId(), n, data, milGetTotalIds());
No / Explicit / Implicit Access to Shared Memory
Relaxed Consistency (avoiding false sharing and ping-pong)
• Sequential, CRUW, Sync(var), Arbitrary-CW Sync
• Multiple relaxations for different shared variables within the same program
• No broadcast, no central address servers (so it can work efficiently on interconnected LANs)
• New protocols welcome (user-defined?!)
• Step-by-step optimization towards maximal parallelism
LU decomposition, 1024x1024 matrix, written in SPLASH. Advantages gained when reducing the consistency of a single variable (the Global structure):
[Chart: number of page migrations per host for page #4 (0-70 range) on 1-5 hosts, original vs. reduced consistency]
Reducing Consistency
[Chart: speedups (1-4 range) on 1-5 hosts, original vs. reduced consistency]
MJEC - Millipede Job Event Control
• A job has a unique system-wide id
• Jobs communicate and synchronize by sending events
• Although a job is mobile, its events follow it and reach its event queue wherever it goes
• Event handlers are context-sensitive
An open mechanism with which various synchronization methods can be implemented.
MJEC (con't)
• Modes:
  – In Execution Mode, arriving events are enqueued
  – In Dispatching Mode, events are dequeued and handled by a user-supplied dispatching routine
MJEC Interface
Registration and entering dispatch mode:
    milEnterDispatchingMode((FUNC)foo, void *context)
Posting an event:
    milPostEvent(id target, int event, int data)
Dispatcher routine syntax:
    int foo(id origin, int event, int data, void *context)
Execution Mode and Dispatching Mode (flowchart, as pseudocode):

    milEnterDispatchingMode(func, context):
        ret := func(INIT, context)
        loop:
            if ret == EXIT:
                ret := func(EXIT, context)
                return to Execution Mode
            wait for a pending event
            ret := func(event, context)
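The dispatch loop above can be sketched as a small plain-C simulation. Everything here is mocked for illustration (the queue, the constants, the mjec-prefixed names); it is not the real Millipede API, only its control flow:

```c
/* Plain-C simulation of the MJEC dispatch loop (mock, not the real
 * API): events are drained from a queue and fed to a user-supplied
 * dispatcher until it asks to leave dispatching mode. */

enum { INIT = -1, EXIT_DISPATCHER = 0, STAY_IN_DISPATCHER = 1 };
enum { ARR = 1, DEP = 2 };            /* sample events, as in the slides */

#define QMAX 16
int queue[QMAX];
int qhead, qtail;

int  event_pending(void)    { return qhead != qtail; }
int  next_event(void)       { return queue[qhead++]; }
void mjecPostEvent(int ev)  { queue[qtail++] = ev; }

/* Mocked milEnterDispatchingMode: run the dispatcher on INIT, then on
 * each queued event, until it returns EXIT_DISPATCHER (the real system
 * would block on an empty queue; here we simply stop). */
void mjecEnterDispatchingMode(int (*func)(int event, void *ctx), void *ctx) {
    int ret = func(INIT, ctx);
    while (ret != EXIT_DISPATCHER && event_pending())
        ret = func(next_event(), ctx);
}

/* Example dispatcher: count ARR events until a DEP event releases us. */
int arrivals;
int count_until_dep(int event, void *ctx) {
    (void)ctx;
    if (event == ARR) arrivals++;
    return (event == DEP) ? EXIT_DISPATCHER : STAY_IN_DISPATCHER;
}
```

Note how DEP events posted while a job executes simply wait in the queue until the job next enters dispatching mode, which is exactly what makes the barrier example below work.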
Experience with MJEC
• ParC: ~250 lines; SPLASH: ~120 lines
• Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
• Implementation of location-dependent services (e.g., graphical display)
Example - Barriers with MJEC
[Diagram: each job calls BARRIER(), posting an ARR event to the Barrier Server job, and enters dispatching mode]
Barrier() {
    milPostEvent(BARSERV, ARR, 0);
    milEnterDispatchingMode(wait_in_barrier, 0);
}

wait_in_barrier(src, event, data, context) {
    if (event == DEP)
        return EXIT_DISPATCHER;
    else
        return STAY_IN_DISPATCHER;
}
Example - Barriers with MJEC (con't)
[Diagram: once all jobs have arrived, the Barrier Server posts a DEP event back to each waiting job, releasing it from dispatching mode]
BarrierServer() {
    milEnterDispatchingMode(barrier_server, info);
}

barrier_server(src, event, data, context) {
    if (event == ARR) {
        enqueue(context.queue, src);
        context.cnt++;
    }
    if (should_release(context))
        while (context.cnt > 0) {
            milPostEvent(dequeue(context.queue), DEP, 0);
            context.cnt--;
        }
    return STAY_IN_DISPATCHER;
}
Dynamic Page- and Job-Migration
• Migration may occur in case of:
  – Remote memory access
  – Load imbalance
  – The user coming back from lunch
  – Improving locality by rearranging locations
• Sometimes migration should be disabled:
  – By the system: ping-pong, critical sections
  – By the programmer: control systems
Locality of memory reference is THE dominant efficiency factor
Migration can help locality. [Diagram panels: only job migration / only page migration / page & job migration]
Load Sharing + Max. Locality = Minimum-Weight multiway cut
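A toy illustration of the quantity being minimized (the graph and its weights are invented for this sketch): vertices are threads and pages, an edge weight counts the accesses between its endpoints, and a placement of vertices on hosts pays the weight of every edge it cuts.

```c
/* Toy multiway-cut illustration (graph and weights invented): a
 * placement of threads/pages on hosts pays the weight of every edge
 * whose endpoints land on different hosts. Lower cut weight means
 * better locality of reference. */

typedef struct { int u, v, w; } edge;

/* Total weight of edges whose endpoints are placed on different hosts. */
int cut_weight(const edge *e, int m, const int *host) {
    int total = 0;
    for (int i = 0; i < m; i++)
        if (host[e[i].u] != host[e[i].v])
            total += e[i].w;
    return total;
}
```

On a 4-vertex chain with weights 5, 1, 4, splitting at the weight-1 edge costs 1, while interleaving the vertices across two hosts costs 10: the placement, not the host count, determines locality.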
Problems with the multiway cut model
• NP-hard for #cuts > 2 (and we have n > X,000,000); polynomial 2-approximations are known
• Not optimized for load balancing
• Page replicas
• The graph changes dynamically
• Only external accesses are recorded ==> only partial information is available
Our Approach
• Record the history of remote accesses
• Use this information when making decisions about load balancing/load sharing
• Save old information to avoid repeating bad decisions (learn from mistakes)
• Detect and solve ping-pong situations
• Do everything by piggybacking on communication that is taking place anyway
Ping-Pong
Detection (local):
1. Local threads attempt to use the page a short time after it leaves the local host
2. The page leaves the host shortly after arrival
Treatment (by the ping-pong server):
• Collect information about all participating hosts and threads
• Try to locate an underloaded target host
• Stabilize the system by locking in pages/threads
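The two local detection rules can be sketched in C. The threshold, the tick units, and the page_history bookkeeping are all invented for illustration; the real detector is internal to Millipede's DSM layer:

```c
/* Sketch of the two local ping-pong detection rules (threshold, tick
 * units, and bookkeeping invented for illustration; the real detector
 * lives inside Millipede's DSM layer). */

typedef struct {
    long arrived_at;   /* when the page last arrived at this host */
    long left_at;      /* when the page last left this host */
} page_history;

enum { PP_THRESHOLD = 10 };  /* "a short time", in arbitrary ticks */

/* Rule 2: the page leaves the host shortly after arrival. */
int pp_on_departure(const page_history *h, long now) {
    return now - h->arrived_at < PP_THRESHOLD;
}

/* Rule 1: a local thread touches the page shortly after it left. */
int pp_on_local_access(const page_history *h, long now) {
    return now - h->left_at < PP_THRESHOLD;
}
```

The threshold is exactly the sensitivity knob studied in the TSP-1/TSP-2 measurements below: raise it and more pages are suspected of ping-pong.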
TSP - Effect of Locality (15 cities, Bare Millipede)
[Chart: execution time (0-4000 sec) on 1-6 hosts for NO-FS, OPTIMIZED-FS, and FS]
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by two threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.
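The NO-FS alignment trick can be sketched as a thin allocation wrapper that pads every shared allocation to a whole page, so no two threads' data ever share a DSM page. PAGE_SIZE is assumed to be 4096 here; a real implementation would query the system:

```c
/* NO-FS sketch: pad every shared allocation to a whole page so no two
 * threads' data share a DSM page. PAGE_SIZE is assumed (4096); a real
 * implementation would query the operating system. */
#include <stdlib.h>

#define PAGE_SIZE 4096

void *page_aligned_alloc(size_t size) {
    /* aligned_alloc (C11) requires the size to be a multiple of the
     * alignment, so round the request up to whole pages first. */
    size_t rounded = (size + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
    return aligned_alloc(PAGE_SIZE, rounded);
}
```

The cost is the padding: the table below shows that, when padding is not an option and k threads do share a page, the history mechanism recovers most of the lost performance.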
k   optimized?   # DSM-related   # ping-pong       # thread      execution
                 messages        treatment msgs    migrations    time (sec)
2   Yes              5100            290               68            645
2   No             176120              0               23           1020
3   Yes              4080            279               87            620
3   No             160460              0               32           1514
4   Yes              5060            343               99            690
4   No             155540              0               44           1515
5   Yes              6160            443              139            700
5   No             162505              0               55           1442

TSP on 6 hosts; k = number of threads falsely sharing a page.
Ping-Pong Detection Sensitivity
TSP-1: [Chart: execution time (0-1000 sec) vs. detection sensitivity 2-20]
Best results are achieved at maximal sensitivity, since all pages are accessed frequently.
TSP-2: [Chart: execution time (0-1100 sec) vs. detection sensitivity 2-20]
Since some pages are accessed frequently and others only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.
Applications
• Numerical computations: multigrid
• Model checking: BDDs
• Compute-intensive graphics: ray tracing, radiosity
• Games, search trees, pruning, tracking, CFD, ...
Performance Evaluation
Tuning parameters:
• L - underloaded, H - overloaded
• Delta (ms) - lock-in time
• t/o delta - polling (MGS, DSM)
• msg delta - system pages delta
• T_epoch - max history time
• ??? - remove old histories / refresh old histories
• L_epoch - history length
• Page histories vs. job histories
• Migration heuristic - which function?
• Ping-pong - what is the initial noise? at what frequency is it ping-pong?
LU decomposition, 1024x1024 matrix, written in SPLASH: performance improvements when there are few threads on each host.
[Chart: speedup (1-6) on 1-6 hosts, comparing 1 thread/host with 3 threads/host]
LU decomposition, 2048x2048 matrix, written in SPLASH: super-linear speedups due to the caching effect.
[Chart: speedup on 1, 3, and 6 hosts - 4.47 on 3 hosts, 7.18 on 6 hosts]
Jacobi relaxation, 512x512 matrix (using 2 matrices, no false sharing), written in ParC.
[Chart: execution time (0-180 sec) and speedup (up to ~3.5) on 1-4 hosts]
Overhead of ParC/Millipede on a single host, tested with a tracking algorithm:
[Chart: relative execution time for 1, 10, and 20 targets - Pure C, "Bare" Millipede, and ParC on Millipede all within roughly 0.97-1.05 of one another]
Info...
http://www.cs.technion.ac.il/Labs/Millipede
Release available at the Millipede site!