Parallel Algorithms
Transcript of Parallel Algorithms
Sung Yong Shin
TC Lab
CS Dept. KAIST
Contents
1. Background
2. Parallel Computers
3. PRAM
4. Parallel Algorithms
1. Background
• Von Neumann Machines
– sequential
– executing one instruction at a time
– inherent limitation: “not faster than electrical signals”, about 1 ft / 1 nanosecond (10⁻⁹ sec)
• Parallelism or Concurrency: carrying out many operations simultaneously
– partition a complex problem in such a way that various parts of the work can be carried out independently and in parallel, and combine the results when all subcomputations are complete.
– need parallel computers to support this approach.
Two approaches
• Hardware-oriented
– A parallel architecture of a specific kind is built.
– The parallel algorithms for solving different problems are developed to make use of these hardware features to the best advantage.
• Problem-oriented
– Ask whether parallel algorithms can truly enhance the speed of obtaining a solution to a given problem.
– If so, by how much ?
Problems
(i) The usefulness of parallel computers depends greatly on :
– suitable parallel algorithms
– parallel computer languages
“ A major rethinking needed”
(ii) Practical limitations of parallel computers: “too many factors to be considered”
How to abstract the essential ingredients from complex reality !!!
Which problems can be solved substantially faster using many processors rather than one processor ?
• Nicholas Pippenger (1976)
“ NC-class problems” ( Nick’s Class )
“ ultra-fast on a parallel computer with feasible amount of hardware”
( independent of the particular parallel model chosen )
Inherent Parallelism
probably not possible now, but for the future !!!
“fascinating research topics”
[Diagram: NC ⊆ P, with the P-complete problems inside P but (conjecturally) outside NC. NC problems run in (log n)^m time on P(n) processors, where P(n) is a polynomial.]
P = NC ?
Applications (needs )
• Computer vision / Image processing
• Computer Graphics
• Searching huge databases
• Artificial Intelligence
· · · · · · · ·
2. Parallel Computers
SIMD ( Single Instruction Multiple Data Stream )
MIMD ( Multiple Instruction Multiple Data Stream )
What does SISD stand for ?
SISD
[Diagram: a program (Multiply, Branch, Subtract, Divide, Add) feeds one function unit, which takes operands x and y from a single data source and produces the result x + y.]
SIMD
– array processors
– vector processors (pipelining)
[Diagram: one program drives several function units at once; one unit adds v and w while another adds q and s, each pair taken from its own data source.]
MIMD
[Diagram: four independent processes, each with its own program, its own data source, and its own function unit, concurrently producing results such as w + v, s / q, and x · y.]
Array Processors
[Diagram: a master control processor with its own memory broadcasts instructions (for multiple data) over a communication network to slave arithmetic processors, each a PE with its own memory.]
tightly coupled multiprocessors
[Diagram: identical processors P, P, …, P share memories M, M, …, M through an interconnection network.]
loosely coupled multiprocessors
[Diagram: identical processing elements (PEs), each pairing a processor P with its own memory M, connected by an interconnection network.]
Vector ( pipe-line ) processors
[A simplified pipeline for floating-point multiplication: operand one and operand two enter the functional unit and flow through five stages to the result.]
Stage 1 : Compare components
Stage 2 : Align operands accordingly
Stage 3 : Add exponents and multiply mantissas
Stage 4 : Determine normalization factor
Stage 5 : Normalize results
3. PRAM (Parallel Random Access Machine)
(i) p general-purpose processors
(ii) Each processor is connected to a large shared, random access memory M.
(iii) Each processor has a private (or local) memory for its own computation.
(iv) All communications among processors take place via the shared memory.
(v) The input for an algorithm is assumed to be in the 1st n memory cells, and the output is to be placed in the 1st cell.
(vi) All memory cells are initialized to “0”.
[A PRAM: processors P1, P2, P3, …, Pp connected through an interconnection to shared memory cells M[1], …, M[m].]
(vii) All processors run the same program.
(viii) Each processor knows its own index.
(ix) A PRAM program may instruct processors to do different things depending on their indices.
Each step has three phases: read, computation, write.
Major Assumption
(i) PRAM processors are synchronized !!!
(1) processors begin each step at the same time.
(2) All the processors that write at any step write at the same time.
(ii) Any number of processors may read the same memory cell
concurrently !!!
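These synchronization assumptions can be made concrete in a few lines. The sketch below (names like pram_step are illustrative, not from the slides) simulates one synchronous PRAM step: every processor reads the old memory, computes, and only then are all writes applied, so all processors in a step see the same memory snapshot.

```python
# Toy simulation of one synchronous PRAM step (illustrative names,
# not from the slides). All reads see the memory as it was at the
# start of the step; all writes take effect together at the end.
def pram_step(M, n, compute):
    reads = [M[i] for i in range(n)]                 # read phase (concurrent reads allowed)
    results = [compute(i, reads[i]) for i in range(n)]  # compute phase (i = processor index)
    for i in range(n):                               # write phase, applied all at once
        M[i] = results[i]
    return M

M = [1, 2, 3, 4]
pram_step(M, 4, lambda i, v: 2 * v)
print(M)  # -> [2, 4, 6, 8]
```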
Variants of PRAM’s
CREW ( Concurrent Read Exclusive Write )
CRCW ( Concurrent Read Concurrent Write )
– Common-write
– Priority-write
Why not EREW ?
yes, if you want !!!
Other Models
[Other parallel architectures]
(a) A hypercube (dimension = 3) (b) A bounded degree network (degree = 4)
(c) Octree model
4. Parallel Algorithms
• Binary Fan-in Technique
• Matrix multiplication
• Handling write conflicts
• Merging & Sorting
Binary Fan-in Technique
[A parallel tournament (finding Max): processors P1, P3, P5, P7 pair up x1, …, x8; each round consists of a read, a comparison, and a save (write); after the final round M[1] = max.]
[A tournament example showing the activity of all the processors.]
Processors :
Step 0 : read M[i] into big
Step 1 : read M[i+1] into temp ; big := max (big, temp) ; write big
Step 2 : read M[i+2] into temp ; big := max (big, temp) ; write big
Step 3 : read M[i+4] into temp ; big := max (big, temp) ; write big

                   P1  P2  P3  P4  P5  P6  P7  P8
M (initial)      : 16  12   1  17  23  19   4   8
M after Step 1   : 16  12  17  23  23  19   8   8
M after Step 2   : 17  23  23  23  23  19   8   8
M after Step 3   : 23  23  23  23  23  19   8   8   { M[1] = 23 = max }
read M[i] into big ;
incr := 1 ;
write –∞ { some very small value } into M[n+i] ;
for step := 1 to lg n do
    read M[i+incr] into temp ;
    big := max (big, temp) ;
    incr := 2 * incr ;
    write big into M[i]
end { for }
O( log n ) using n/2 processors
no write conflicts
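The binary fan-in loop can be checked with a small sequential simulation (a sketch, not the slides' code; it assumes n is a power of 2 and one logical processor per cell). Each round reads against the old memory before any write is applied, matching the synchronous PRAM step.

```python
# Sequential simulation of the binary fan-in (recursive doubling) max.
# Assumes n is a power of 2; memory is padded with -inf as in the slide.
import math

def parallel_max(keys):
    n = len(keys)
    M = list(keys) + [float("-inf")] * n   # each Pi pads M[n+i] with -inf
    big = list(M[:n])                      # local variable "big" of each Pi
    incr = 1
    for _ in range(int(math.log2(n))):     # lg n synchronous rounds
        temps = [M[i + incr] for i in range(n)]        # read phase
        big = [max(b, t) for b, t in zip(big, temps)]  # compute phase
        for i in range(n):                             # write phase
            M[i] = big[i]
        incr *= 2
    return M[0]

print(parallel_max([16, 12, 1, 17, 23, 19, 4, 8]))  # -> 23
```

After lg n rounds, M[1] holds the maximum of all n keys, matching the tournament example.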
Matrix Multiplication

C = A·B :  c_ij = Σ_{k=1}^{n} a_ik b_kj ,  for 1 ≤ i, j ≤ n

O(n) using n² processors
What if using n³ processors ?
O( log n )
Why ?
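One way to see the O(log n) bound (a sketch, not from the slides): processor P(i,j,k) computes the single product a_ik·b_kj in one step, and each of the n² groups of n processors then sums its n products by binary fan-in, exactly as in the max-finding tournament.

```python
# n^3-processor matrix multiplication, simulated sequentially.
# Assumes n is a power of 2. Step 1 is one parallel round of products;
# step 2 is lg n rounds of pairwise addition (binary fan-in).
import math

def parallel_matmul(A, B):
    n = len(A)
    # Step 1: all n^3 products "at once".
    prod = [[[A[i][k] * B[k][j] for k in range(n)]
             for j in range(n)] for i in range(n)]
    # Step 2: lg n fan-in rounds per (i, j) cell.
    for _ in range(int(math.log2(n))):
        for i in range(n):
            for j in range(n):
                p = prod[i][j]
                prod[i][j] = [p[2*k] + p[2*k+1] for k in range(len(p) // 2)]
    return [[prod[i][j][0] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(parallel_matmul(A, B))  # -> [[19, 22], [43, 50]]
```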
Handling write conflicts
Algorithm : Computing the OR of n Bits
Input : Bits x1, ···, xn in M[1], ···, M[n].
Output : x1 ∨ ··· ∨ xn in M[1].

Pi reads xi from M[i] ;
If xi = 1, then Pi writes 1 in M[1].
O(1) using n processors
write conflict !!!
CRCW
– Common-write
– Priority-write
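The common-write rule is enough here, because every processor that writes into M[1] writes the identical value. A minimal sketch (illustrative, not the slides' code):

```python
# Common-write CRCW simulation of the n-bit OR: every Pi whose bit is
# 1 writes the same value 1 into M[1], so the concurrent writes agree.
def parallel_or(bits):
    result = 0            # M[1] initialized to "0"
    for x in bits:        # all Pi act in the same parallel step
        if x == 1:
            result = 1    # all writers write the identical value 1
    return result

print(parallel_or([0, 0, 1, 0]))  # -> 1
print(parallel_or([0, 0, 0, 0]))  # -> 0
```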
Fast algorithm for finding Max
[Example for the fast max-finding algorithm (n = 4), with processors P12, P13, P14, P23, P24, P34 :]

Input x1 … x4 : 2 7 3 6     loser : 0 0 0 0   (initial memory contents)
After Step 2 :              loser : 1 0 1 1
After Step 3 :  M[1] = 7    (P23 holds x2 = 7 and reads loser[2] = 0, so it writes 7)

O(1) using n(n–1)/2 processors
common-write
Algorithm : Finding the Largest of n Keys
Input : n keys x1, x2, ···, xn, initially in memory cells M[1], M[2], ···, M[n] (n > 2).
Output : The largest key will be left in M[1].
Comment : For clarity, the processors will be numbered Pi.j for 1 ≤ i < j ≤ n.
Step 1 :
    Pi.j reads xi (from M[i]).
Step 2 :
    Pi.j reads xj (from M[j]).
    Pi.j compares xi and xj. Let k be the index of the smaller key. (If the keys are equal, let k be the smaller index.)
    Pi.j writes 1 in loser[k]. { At this point, every key other than the largest has lost a comparison. }
Step 3 :
    Pi.i+1 reads loser[i] ( and P1.n reads loser[n] ) ;
    Any processor that read a 0 writes xi in M[1]. ( P1.n would write xn. )
    { Pi.i+1 already has xi in its local memory ; P1.n has xn. }
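A sequential simulation of this constant-time algorithm (a sketch, not the slides' code; the two nested loops stand in for the n(n–1)/2 processors acting in one parallel step):

```python
# O(1)-time common-write max, simulated sequentially: each (i, j) pair
# plays the role of processor Pi.j. Every key except the largest loses
# some comparison, so exactly one index ends with loser == 0.
def fast_max(keys):
    n = len(keys)
    loser = [0] * n
    # Step 2: all n(n-1)/2 comparisons "in one parallel step".
    for i in range(n):
        for j in range(i + 1, n):
            if keys[i] < keys[j] or (keys[i] == keys[j] and i < j):
                loser[i] = 1   # common write: every writer writes 1
            else:
                loser[j] = 1
    # Step 3: the unique non-loser writes its key into M[1].
    for i in range(n):
        if loser[i] == 0:
            return keys[i]

print(fast_max([2, 7, 3, 6]))  # -> 7
```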
Merging and Sorting

merging
[Parallel merging : O(log n) using n processors, no write conflict]
(a) Assignment of processors to keys : P1, …, Pn/2 hold x1, …, xn/2 (from M[1 .. n/2]) ; Pn/2+1, …, Pn hold y1, …, yn/2 (from M[n/2+1 .. n]).
(b) Binary search steps : Pi finds j such that yj-1 < xi < yj.
(c) Output step : Pi writes xi into M[i+j–1], since x1, …, xi-1 and y1, …, yj-1 precede it in the merged list.
Algorithm : Parallel Merging
Input : Two sorted lists of n/2 keys each, in the first n cells of memory.
Output : The merged list, in the first n cells of memory.
Comment : Each processor Pi has a local variable x (if i ≤ n/2) or y (if i > n/2) and other local variables for conducting its binary search. Each processor has a local variable position that will indicate where to write its key.
Initialization : Pi reads M[i] into x (if i ≤ n/2) or into y (if i > n/2). Pi does initialization for its binary search.
Binary search steps : Processors Pi, for 1 ≤ i ≤ n/2, do binary search in M[n/2+1], …, M[n] to find the smallest j such that x < M[n/2+j], and assign i+j–1 to position. If there is no such j, Pi assigns n/2+i to position. Processors Pn/2+i, for 1 ≤ i ≤ n/2, do binary search in M[1], …, M[n/2] to find the smallest j such that y < M[j], and assign i+j–1 to position. If there is no such j, Pn/2+i assigns n/2+i to position.
Output step : Each Pi (for 1 ≤ i ≤ n) writes its key (x or y) in M[position].
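The rank computation can be sketched with Python's bisect functions standing in for the binary searches (illustrative, not the slides' code). The slides implicitly assume distinct keys; using bisect_left for the x's and bisect_right for the y's breaks ties so equal keys also land in distinct positions with no write conflict.

```python
# Parallel merge by rank: each key binary-searches the other half for
# its final position, then all keys are written with no conflicts.
from bisect import bisect_left, bisect_right

def parallel_merge(xs, ys):
    # xs and ys are the two sorted halves (n/2 keys each).
    out = [None] * (len(xs) + len(ys))
    for i, x in enumerate(xs):         # processors P1 .. Pn/2
        j = bisect_left(ys, x)         # number of ys strictly below x
        out[i + j] = x
    for i, y in enumerate(ys):         # processors Pn/2+1 .. Pn
        j = bisect_right(xs, y)        # ties: equal x's precede y
        out[i + j] = y
    return out

print(parallel_merge([1, 4, 9], [2, 4, 7]))  # -> [1, 2, 4, 4, 7, 9]
```

Each binary search takes O(log n) steps and all n searches run concurrently, giving the O(log n) bound.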
Break the list into two halves.
Sort the two halves (recursively).
Merge the two sorted halves.

Algorithm : Sorting by Merging
Input : A list of n keys in M[1], …, M[n].
Output : The n keys sorted in nondecreasing order in M[1], …, M[n].
Comment : The indexing in the algorithm is easier if the number of keys is a power of 2, so the first step will “pad” the input with large keys at the end. We still use only n processors.

Pi writes ∞ (some large key) in M[n+i] ;
for t := 1 to lg n do
    k := 2^(t–1) ; { the size of the lists being merged }
    Pi, …, Pi+2k-1 merge the two sorted lists of size k beginning at M[i] ;
end { for }

O((log n)²) using n processors
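The whole bottom-up sort can be sketched as follows (a self-contained illustration, not the slides' code). There are lg n rounds; in round t all adjacent runs of size k = 2^(t–1) are merged pairwise, and the merges within a round are independent, so on a PRAM each round costs O(log n), giving O((log n)²) overall.

```python
# Bottom-up "sorting by merging", simulated sequentially.
# Assumes n is a power of 2 (otherwise pad with large keys, as the
# slides do). merge_by_rank is the rank-based parallel merge.
from bisect import bisect_left, bisect_right

def merge_by_rank(xs, ys):
    # Each key finds its final slot by binary search in the other run.
    out = [None] * (len(xs) + len(ys))
    for i, x in enumerate(xs):
        out[i + bisect_left(ys, x)] = x
    for i, y in enumerate(ys):
        out[i + bisect_right(xs, y)] = y
    return out

def parallel_mergesort(keys):
    n = len(keys)
    M = list(keys)
    k = 1
    while k < n:
        # All merges in this round are independent of one another.
        M = [x for i in range(0, n, 2 * k)
               for x in merge_by_rank(M[i:i + k], M[i + k:i + 2 * k])]
        k *= 2
    return M

print(parallel_mergesort([16, 12, 1, 17, 23, 19, 4, 8]))
# -> [1, 4, 8, 12, 16, 17, 19, 23]
```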