Parallel Algorithms


Page 1: Parallel Algorithms

Parallel Algorithms

Sung Yong Shin

TC Lab

CS Dept. KAIST

Page 2: Parallel Algorithms

Contents

1. Background

2. Parallel Computers

3. PRAM

4. Parallel Algorithms

Page 3: Parallel Algorithms

1. Background

• Von Neumann Machines

– sequential

– executing one instruction at a time

Inherent limitation:
“ not faster than electrical signals ” : about 1 ft / nanosecond ( 10⁻⁹ sec )

• Parallelism or Concurrency
Carrying out many operations simultaneously

– partition a complex problem in such a way that various parts of the work can be carried out independently and in parallel, and combine the results when all subcomputations are complete.

– need parallel computers to support this approach.
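A minimal sketch of this partition-and-combine idea in Python (the chunk size and worker count are illustrative assumptions, not from the slides):

from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each worker carries out one independent part of the work.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Partition the problem into roughly equal, independent parts.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Combine the results when all subcomputations are complete.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))   # 5050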

Page 4: Parallel Algorithms

Two approaches

• Hardware-oriented

– A parallel architecture of a specific kind is built.

– The parallel algorithms for solving different problems are developed to make use of these hardware features to the best advantage.

• Problem-oriented

– Ask whether parallel algorithms can truly enhance the speed of obtaining a solution to a given problem.

– If so, how much ?

Page 5: Parallel Algorithms

Problems

(i) The usefulness of parallel computers depends greatly on :

– suitable parallel algorithms

– parallel computer languages

“ A major rethinking needed”

(ii) Practical limitations imposed by parallel computers : “ too many factors to be considered ”

How to abstract the essential ingredients from complex reality !!!

Page 6: Parallel Algorithms

Which problems can be solved substantially faster using many processors rather than one processor ?

• Nicholas Pippenger (1979)

“ NC-class problems” ( Nick’s Class )

“ ultra-fast on a parallel computer with a feasible amount of hardware”

( independent of the particular parallel model chosen )

Inherent Parallelism

probably not possible now, but perhaps in the future !!!

“ fascinating research topics”

[Diagram: NC lies inside P, which also contains the P-complete problems; NC problems run in (log n)^m time using a polynomial number p(n) of processors. Open question: P = NC ?]

Page 7: Parallel Algorithms

Applications (needs )

• Computer vision / Image processing

• Computer Graphics

• Searching huge databases

• Artificial Intelligence

· · · · · · · ·

Page 8: Parallel Algorithms

2. Parallel Computers

SIMD ( Single Instruction Multiple Data Stream )

MIMD ( Multiple Instruction Multiple Data Stream )

What does SISD stand for ?

Page 9: Parallel Algorithms

SISD

[Diagram: SISD. One program feeds a single function unit with instructions ( Add, Subtract, Multiply, Divide, Branch ); a single data source supplies operands x and y, and the unit produces the result x + y.]

Page 10: Parallel Algorithms

SIMD
– array processors
– vector processors (pipelining)

[Diagram: SIMD. One program issues the same instruction ( Add ) to several function units at once; each unit applies it to its own operands, producing x + y, v + w, and s + q simultaneously.]

Page 11: Parallel Algorithms

MIMD

[Diagram: MIMD. Four processes, each with its own program ( different mixes of Add, Subtract, Multiply, Divide, Branch ) and its own data source, drive separate function units concurrently, producing results such as x · y, w + v, and s / q.]

Page 12: Parallel Algorithms

Array Processors

[Diagram: Array processor. A control processor ( the master ), with its own memory, broadcasts instructions ( for multiple data ) over a communication network to arithmetic processors ( the slaves ); each arithmetic processor, paired with its own memory, forms a PE.]

Page 13: Parallel Algorithms

tightly coupled multiprocessors

[Diagram: Tightly coupled multiprocessors. Identical processors P, …, P share the memory modules M, …, M through an interconnection network.]

Page 14: Parallel Algorithms

loosely coupled multiprocessors

[Diagram: Loosely coupled multiprocessors. Identical processing elements ( PEs ), each a processor P with its own memory M, communicate through an interconnection network.]

Page 15: Parallel Algorithms

Vector ( pipe-line ) processors

[Diagram: A simplified pipeline for floating-point multiplication. One functional unit is split into five stages ( Stage 1 … Stage 5 ): compare components, determine normalization factor, align operands accordingly, add exponents and multiply mantissas, and normalize results; operand one and operand two stream through the stages to produce the result.]

Page 16: Parallel Algorithms

3. PRAM (Parallel Random Access Machine)

(i) p general-purpose processors.
(ii) Each processor is connected to a large shared, random access memory M.
(iii) Each processor has a private (or local) memory for its own computation.
(iv) All communication among processors takes place via the shared memory.
(v) The input for an algorithm is assumed to be in the first n memory cells, and the output is to be placed in the first cell.
(vi) All memory cells are initialized to “0”.

[A PRAM: processors P1, P2, P3, …, Pp connected through an interconnection to a shared memory M with cells 1, …, m.]

Page 17: Parallel Algorithms

(vii) All processors run the same program.
(viii) Each processor knows its own index.
(ix) A PRAM program may instruct processors to do different things depending on their indices.

Each step has three phases: read, computation, write.
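A toy sequential simulation of one synchronous PRAM step (all names here are illustrative, not from the slides): every processor’s read completes before any computation, and all writes land together at the end of the step.

def pram_step(M, procs):
    # Read phase: all processors read shared memory at the same time.
    reads = [p["read"](M) for p in procs]
    # Computation phase: each processor computes in its private memory.
    results = [p["compute"](r) for p, r in zip(procs, reads)]
    # Write phase: all writes take effect together, ending the step.
    for p, v in zip(procs, results):
        if p["write_addr"] is not None:
            M[p["write_addr"]] = v
    return M

# Example: processor i reads M[i], doubles it, and writes it back.
M = [1, 2, 3, 4]
procs = [{"read": (lambda M, i=i: M[i]),
          "compute": (lambda x: 2 * x),
          "write_addr": i} for i in range(4)]
print(pram_step(M, procs))    # [2, 4, 6, 8]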

Page 18: Parallel Algorithms

Major Assumption

(i) PRAM processors are synchronized !!!

(1) All processors begin each step at the same time.

(2) All the processors that write at any step write at the same time.

(ii) Any number of processors may read the same memory cell concurrently !!!

Page 19: Parallel Algorithms

Variants of PRAM’s

CREW ( Concurrent Read Exclusive Write )

CRCW ( Concurrent Read Concurrent Write )

– Common-write

– Priority-write

Why not EREW ?

yes, if you want !!!
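The two CRCW write rules can be modeled directly in code; a sketch (the function names are mine, the rules follow the definitions above):

def common_write(writes):
    # writes: list of (processor_index, value) aimed at one cell.
    values = {v for _, v in writes}
    # Common-write is legal only if all writers agree on the value.
    assert len(values) <= 1, "common-write requires identical values"
    return values.pop() if values else None

def priority_write(writes):
    # Priority-write: the lowest-numbered processor wins.
    return min(writes)[1] if writes else None

print(common_write([(3, 1), (7, 1)]))     # 1
print(priority_write([(3, 8), (1, 5)]))   # 5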

Page 20: Parallel Algorithms

Other Models

[Other parallel architectures]

(a) A hypercube (dimension = 3)
(b) A bounded-degree network (degree = 4)
(c) Octree model

Page 21: Parallel Algorithms

4. Parallel Algorithms

• Binary Fan-in Technique

• Matrix multiplication

• Handling write conflicts

• Merging & Sorting

Page 22: Parallel Algorithms

Binary Fan-in Technique

[A parallel tournament ( finding Max ): for n = 8, processors P1, P3, P5, P7 read x1, …, x8 in pairs; each round is a read / compute ( comparison ) / write ( save the winner ) cycle, and after the final round M[1] = max.]

Page 23: Parallel Algorithms

[A tournament example showing the activity of all the processors.]

Each processor Pi runs:
Step 0 : read M[i] into big
Step 1 : read M[i+1] into temp ; big := max(big, temp) ; write big
Step 2 : read M[i+2] into temp ; big := max(big, temp) ; write big
Step 3 : read M[i+4] into temp ; big := max(big, temp) ; write big

Processors :      P1  P2  P3  P4  P5  P6  P7  P8
M initially :     16  12   1  17  23  19   4   8
M after Step 1 :  16  12  17  23  23  19   8   8
M after Step 2 :  17  23  23  23  23  19   8   8
M after Step 3 :  23  23  23  23  23  19   8   8   ( max = 23 in M[1] )

Page 24: Parallel Algorithms

read M[i] into big ;
incr := 1 ;
write –∞ { some very small value } into M[n+i] ;
for step := 1 to lg n do
    read M[i+incr] into temp ;
    big := max(big, temp) ;
    incr := 2 * incr ;
    write big into M[i]
end { for }

O( log n ) using n/2 processors

no write conflicts
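A sequential simulation of the binary fan-in steps in Python (a sketch; the padding value plays the role of the “very small value” above):

import math

def fanin_max(keys):
    n = len(keys)
    NEG_INF = float("-inf")        # some very small value
    M = keys + [NEG_INF] * n       # each Pi pads M[n+i]
    big = M[:n]                    # step 0: Pi reads M[i] into big
    incr = 1
    for _ in range(int(math.ceil(math.log2(n)))):
        # Read phase: every Pi reads M[i+incr] into temp together.
        temp = [M[i + incr] for i in range(n)]
        # Compute and write phases: Pi writes max(big, temp) into M[i].
        big = [max(b, t) for b, t in zip(big, temp)]
        M[:n] = big
        incr *= 2
    return M[0]

print(fanin_max([16, 12, 1, 17, 23, 19, 4, 8]))   # 23

Each Pi writes only to its own cell M[i], so there are no write conflicts, and the loop runs lg n times.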

Page 25: Parallel Algorithms

Matrix Multiplication

C = A · B :   c_ij = Σ_{k=1}^{n} a_ik · b_kj ,   for 1 ≤ i, j ≤ n

O(n) using n² processors

What if using n³ processors ?

O( log n )

Why ?
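A sketch of both budgets, simulated sequentially (illustrative only): with n² virtual processors, P_ij computes c_ij with an O(n) loop; with n³ processors, each product a_ik · b_kj takes one step, and the n products per entry are summed by binary fan-in in O(log n) steps.

def matmul_n2(A, B):
    # One virtual processor P_ij per output entry: an O(n) loop each.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_n3(A, B):
    # n^3 virtual processors: P_ijk forms A[i][k] * B[k][j] in one step,
    # then the n products for each (i, j) are combined by binary fan-in.
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            terms = [A[i][k] * B[k][j] for k in range(n)]   # 1 step
            while len(terms) > 1:                           # lg n steps
                terms = [terms[t] + terms[t + 1] if t + 1 < len(terms)
                         else terms[t] for t in range(0, len(terms), 2)]
            C[i][j] = terms[0]
    return C

A = [[1, 2], [3, 4]]
print(matmul_n2(A, A))   # [[7, 10], [15, 22]]
print(matmul_n3(A, A))   # [[7, 10], [15, 22]]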

Page 26: Parallel Algorithms

Handling write conflicts

Algorithm : Computing the OR of n Bits

Input : Bits x1, ···, xn in M[1], ···, M[n].

Output : x1 ∨ ··· ∨ xn in M[1].

Pi reads xi from M[i] ;

If xi=1, then Pi writes 1 in M[1].

O(1) using n processors

write conflict !!!

CRCW

– Common-write

– Priority-write
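Simulated sequentially, the single step looks like this (a sketch): every processor whose bit is 1 writes the same value 1 into M[1], so the collision is legal under the common-write rule.

def parallel_or(bits):
    # CRCW OR of n bits: Pi reads x_i from M[i]; if x_i = 1,
    # Pi writes 1 into M[1].
    M = list(bits)
    writers = [i for i, x in enumerate(bits) if x == 1]
    if writers:          # all colliding writers agree on the value 1
        M[0] = 1
    return M[0]

print(parallel_or([0, 0, 1, 0]))   # 1
print(parallel_or([0, 0, 0, 0]))   # 0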

Page 27: Parallel Algorithms

Fast algorithm for finding Max

[Example for the fast max-finding algorithm]

Initial memory contents ( n = 4 ) :   input : 2 7 3 6   loser : 0 0 0 0
After Step 2 ( P12, P13, P14, P23, P24, P34 have each marked a loser ) :
                                      input : 2 7 3 6   loser : 1 0 1 1
After Step 3 : P23, which read loser[2] = 0, writes its key 7 into M[1].

O(1) using n² processors, with common-writes.

Page 28: Parallel Algorithms

Algorithm : Finding the Largest of n Keys

Input : n keys x1, x2, ···, xn, initially in memory cells M[1], M[2], ···, M[n] (n > 2).
Output : The largest key will be left in M[1].
Comment : For clarity, the processors will be numbered Pi.j for 1 ≤ i < j ≤ n.

Step 1
Pi.j reads xi (from M[i]).

Step 2
Pi.j reads xj (from M[j]).
Pi.j compares xi and xj. Let k be the index of the smaller key. (If the keys are equal, let k be the smaller index.)
Pi.j writes 1 in loser[k]. { At this point, every key other than the largest has lost a comparison. }

Step 3
Pi.i+1 reads loser[i] ( and P1.n reads loser[n] ) ;
Any processor that read a 0 writes xi in M[1]. ( P1.n would write xn. )
{ Pi.i+1 already has xi in its local memory ; P1.n has xn. }
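A sequential simulation of the three steps, with one virtual processor P_ij for each pair 1 ≤ i < j ≤ n (a sketch):

def fast_max(keys):
    # O(1)-step CRCW max: one virtual processor P_ij per pair i < j.
    n = len(keys)
    loser = [0] * n
    # Step 2: every P_ij marks the smaller key (on ties, the smaller
    # index loses). Many processors write 1 together: a common-write.
    for i in range(n):
        for j in range(i + 1, n):
            k = i if keys[i] <= keys[j] else j
            loser[k] = 1
    # Step 3: the single unmarked key never lost, so it is the largest.
    for i in range(n):
        if loser[i] == 0:
            return keys[i]

print(fast_max([2, 7, 3, 6]))   # 7, with loser = [1, 0, 1, 1]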

Page 29: Parallel Algorithms

Merging and Sorting

merging

[Parallel merging]

O(log n) using n processors ; no write conflicts

[Parallel merging]
(a) Assignment of processors to keys : P1, …, Pn/2 take x1, …, xn/2 from M[1], …, M[n/2] ; Pn/2+1, …, Pn take y1, …, yn/2 from M[n/2+1], …, M[n].
(b) Binary search steps : Pi finds j such that yj-1 < xi < yj.
(c) Output step : Pi writes xi into M[i+j-1].

Page 30: Parallel Algorithms

Algorithm : Parallel Merging

Input : Two sorted lists of n/2 keys each, in the first n cells of memory.
Output : The merged list, in the first n cells of memory.
Comment : Each processor Pi has a local variable x (if i ≤ n/2) or y (if i > n/2) and other local variables for conducting its binary search. Each processor has a local variable position that will indicate where to write its key.

Initialization : Pi reads M[i] into x (if i ≤ n/2) or into y (if i > n/2). Pi does initialization for its binary search.

Binary search steps : Processors Pi, for 1 ≤ i ≤ n/2, do binary search in M[n/2+1], …, M[n] to find the smallest j such that x < M[n/2+j], and assign i+j–1 to position. If there is no such j, Pi assigns n/2+i to position. Processors Pn/2+i, for 1 ≤ i ≤ n/2, do binary search in M[1], …, M[n/2] to find the smallest j such that y < M[j], and assign i+j–1 to position. If there is no such j, Pn/2+i assigns n/2+i to position.

Output step : Each Pi (for 1 ≤ i ≤ n) writes its key (x or y) in M[position].
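Simulated sequentially with Python’s standard bisect module doing the binary searches (a sketch; the two tie-breaking searches keep the write positions distinct, as in the algorithm above):

from bisect import bisect_left, bisect_right

def parallel_merge(M):
    # M holds two sorted halves; each Pi binary-searches the other half
    # for its key's rank, then all processors write at once.
    n = len(M)
    half = n // 2
    xs, ys = M[:half], M[half:]
    out = [None] * n
    for i, x in enumerate(xs):          # processors P1, ..., Pn/2
        j = bisect_left(ys, x)          # number of y's below x
        out[i + j] = x                  # position = i + j (0-based)
    for i, y in enumerate(ys):          # processors Pn/2+1, ..., Pn
        j = bisect_right(xs, y)         # number of x's at or below y
        out[i + j] = y
    return out

print(parallel_merge([1, 4, 9, 2, 3, 8]))   # [1, 2, 3, 4, 8, 9]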

Page 31: Parallel Algorithms

Break the list into two halves.
Sort the two halves (recursively).
Merge the two sorted halves.

Algorithm : Sorting by Merging
Input : A list of n keys in M[1], …, M[n].
Output : The n keys sorted in nondecreasing order in M[1], …, M[n].
Comment : The indexing in the algorithm is easier if the number of keys is a power of 2, so the first step will “pad” the input with large keys at the end. We still use only n processors.

Pi writes ∞ ( some large key ) in M[n+i] ;
for t := 1 to lg n do
    k := 2^(t-1) ; { the size of the lists being merged }
    Pi, …, Pi+2k-1 merge the two sorted lists of size k beginning at M[i] ;
end { for }

O((log n)²) using n processors
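Putting the pieces together, a sequential simulation of the lg n merge rounds (a sketch; parallel_merge is the routine sketched after Page 30, and the padding mirrors the algorithm above):

import math

def merge_sort_rounds(keys):
    # lg n rounds; round t merges adjacent sorted runs of size 2^(t-1).
    n = len(keys)
    size = 1 << int(math.ceil(math.log2(max(n, 2))))
    M = keys + [float("inf")] * (size - n)   # pad with large keys
    k = 1                                    # size of the lists merged
    while k < size:
        for i in range(0, size, 2 * k):
            M[i:i + 2 * k] = parallel_merge(M[i:i + 2 * k])
        k *= 2
    return M[:n]

print(merge_sort_rounds([5, 2, 9, 4, 1, 3]))   # [1, 2, 3, 4, 5, 9]

Each of the lg n rounds costs O(log n) for its parallel merge, giving O((log n)²) overall with n processors.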