Parallel Arch
Transcript of Parallel Arch
7/27/2019 Parallel Arch
1/11
Parallel Architectures
COMP375 Computer Architecture and Organization
Ever Faster
There is a continuing drive to create faster computers.
Some large engineering problems require massive computing resources.
Increases in clock speed alone will not continue to improve performance.
Parallelism provides an ability to get more done in the same amount of time.
Intel Microprocessor Performance
[Chart from the Stallings textbook: Intel processor performance, 1987-2003, showing gains from clock speed and from architectural improvements.]
Large Computing Challenges
(Memory Requirement)
Demands exceed the capabilities of even the fastest current uniprocessor systems.
from http://meseec.ce.rit.edu/eecc756-spring2002/756-3-8-2005.ppt
Parallelism
The key to making a computer fast is to do as much in parallel as possible.
Processors in modern home PCs do a lot in parallel:
Superscalar execution
Hyperthreading
Pipelining
Processes and Threads
All modern operating systems support multiple processes and threads.
Processes have their own memory space. Each program is a process.
Threads are a flow of control through a process. Each process has at least one thread.
Single processor systems support multiple processes and threads by sequentially sharing the processor.
Hyperthreading
Hyperthreading allows the apparently parallel execution of two threads.
Each thread has its own set of registers, but they share a single CPU.
The CPU alternates executing instructions.
Pipelining is improved because adjacent instructions come from different threads and therefore do not have data hazards.
Hyperthreading Performance
Also called multithreaded processors.
Overall throughput is increased.
Individual threads do not run as fast as they would without hyperthreading.
[Diagram: a pipeline alternating instructions from Thread 1 and Thread 2, compared with plain pipelining of a single thread.]
Quiz: Hyperthreading
1. Runs best in single-threaded applications
2. Replaces pipelining
3. Duplicates the user register set
4. …-processors
How Visible is Parallelism?
Superscalar
Programmer never notices.
Multiple threads
Programmer must create multiple threads in the program.
Multiple processes
Different programs: programmer never notices.
Parts of the same program: programmer must divide the work among different processes.
Flynn's Parallel Classification
SISD - Single Instruction Single Data
standard uniprocessors
SIMD - Single Instruction Multiple Data
vector and array processors
MISD - Multiple Instruction Single Data
systolic processors
MIMD - Multiple Instruction Multiple Data
MIMD Computers
Shared Memory
Attached processors
Multiple processor systems
Multiple instruction issue processors
Multi-Core processors
NUMA
Separate Memory
Clusters
Message passing multiprocessors
Hypercube
Mesh
Single Instruction Multiple Data
Many high performance computing programs perform calculations on large data sets. SIMD computers can be used to perform matrix operations.
The Intel Pentium MMX instructions perform SIMD operations on 8 one-byte integers.
Vector Processors
A vector is a one dimensional array.
Vector processors are SIMD machines that have instructions to perform operations on vectors.
A vector processor might have vector registers.
A single instruction could add two vectors.
Vector Example
/* traditional vector addition */
double a[64], b[64], c[64];
for (i = 0; i < 64; i++)
    c[i] = a[i] + b[i];

/* vector processor: in one instruction */
c[i] = a[i] + b[i] for all i
Non-contiguous Vectors
A simple vector is a one dimensional array with each element in successive addresses.
In a two dimensional array, the rows can be considered simple vectors. The columns are vectors, but they are not stored in consecutive addresses.
The distance between elements of a vector is called the stride.
Vector Stride
Matrix multiplication:

[ a11 a12 a13 ]   [ b11 b12 ]
[ a21 a22 a23 ] * [ b21 b22 ]
                  [ b31 b32 ]

Each row of A is a stride-1 vector; each column of B, stored row-major, is a vector whose stride is the row length (here 2).
Cray-1 Vector Processor
Sold to Los Alamos National Laboratory in 1976 for $8.8 million
80 MHz clock
160 megaflops
8 megabytes (1 M words) of memory
Freon cooled
Cray-2
Built in 1985 with ten times the performance of the Cray-1.
Quiz: Intel calls some of their new Pentium instructions SIMD because
1. They operate on multiple bytes
2. There is a separate processor for them
3. [Multiple] units operate on one piece of data
4. They are executed in the GPU
Data Flow Architecture
Data flow graph computing N = (A+B) * (B-4)
MISD Systolic Architectures
A computed value is passed from one processor to the next.
[Diagram: Memory feeding a chain of Processor Elements.]
Each processor may perform a different operation.
Minimizes memory accesses.
Systolic Architectures
Systolic arrays were meant to be special-purpose processors or coprocessors and were very fine-grained: computation units, usually called cells.
Communication is very fast; granularity is meant to be very fine (a small number of computational operations per communication).
Very fast clock rates due to regular, synchronous structure.
Data moved through pipelined computational units in a regular and rhythmic fashion.
Warp and iWarp were examples of systolic arrays.
from www-csag.ucsd.edu/teaching/cse160s05/lectures/Lecture15.pdf
Systolic Array Example: 3x3 Systolic Array Matrix Multiplication
[Sequence of figures, T = 0 through T = 7.]
Processors are arranged in a 2-D grid; each processor accumulates one element of the product.
Rows of A enter from the left and columns of B enter from the top, skewed in time so that matching operands meet at the right cell.
At each step a cell multiplies the pair of values passing through it and adds the result to its running sum; for example, the top-left cell accumulates a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0.
By T = 7 every cell holds its element of the product and the multiplication is done.
Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
Systolic Architectures
Never caught on for general-purpose computation.
Regular structures are HARD to achieve.
Real efficiency requires local communication, but flexibility of global communication is often critical.
Used mainly in structured settings: high speed multipliers, structured array computations.
from www-csag.ucsd.edu/teaching/cse160s05/lectures/Lecture15.pdf
Multiple Instruction Multiple Data
Multiple independent processors can execute different parts of a program.
The processors can be different or identical.
Different architectures give the processors:
the same memory
separate memory
local memory with access to all memory
Specialized Processors
Some architectures have a general purpose processor with additional specialized processors:
I/O processors - I/O controllers with DMA
Graphics processor
Intel 80387 floating point coprocessor for the Intel 80386
Vector processors
Separate Memory Systems
Each processor has its own memory that none of the other processors can access.
Each processor has unhindered access to its memory.
No cache coherence problems.
Communication is done by passing messages over special I/O channels.
Topologies
The CPUs can be interconnected in different ways.
Goals of an interconnection method are:
Minimize the number of connections per CPU
Minimize the average distance between nodes
Minimize the diameter, or maximum distance between any two nodes
Maximize the throughput
Simple routing of messages
Grid or Mesh
Each node has 2, 3, or 4 connections.
Routing is simple and many routes are available.
The diameter is approximately 2√N.
Torus
Similar to a mesh, with the end nodes connected.
Many similar routes are available.
The diameter is approximately √N.
Hypercube
Each node has log2 N connections.
Routing can be done by changing one bit of the address at a time.
The diameter is log2 N.
NUMA Systems
In NonUniform Memory Access systems, each node has local memory but can also access the memory of other nodes.
Most memory accesses are to local memory.
The memory of other nodes can be accessed, but the access is slower than access to local memory.
Cache coherence algorithms are used when shared memory is accessed.
More scalable than SMP.
[Diagram: NUMA system]