
Parallel Architectures

COMP375 Computer Architecture and Organization

    Ever Faster

There is a continuing drive to create faster computers.

Some large engineering problems require massive computing resources.

Increases in clock speed alone will not provide the needed performance.

Parallelism provides an ability to get more done in the same amount of time.

Intel Microprocessor Performance

[Figure: Intel microprocessor performance, 1987-2003, from the Stallings textbook; total performance is split into the contribution of clock speed and of architectural improvements.]


Large Computing Challenges

[Figure: computing speed and memory requirements of large applications; the demands exceed the capabilities of even the fastest current uniprocessor systems.]

from http://meseec.ce.rit.edu/eecc756-spring2002/756-3-8-2005.ppt

    Parallelism

The key to making a computer fast is to do as much in parallel as possible.

Processors in modern home PCs do a lot in parallel:

    Superscalar execution

    Hyperthreading

    Pipelining

Processes and Threads

All modern operating systems support multiple processes and threads.

Processes have their own memory space. Each program is a process.

Threads are a flow of control through a process. Each process has at least one thread.

Single processor systems support multiple processes and threads by sequentially sharing the processor.
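As a concrete illustration (not from the slides), here is a minimal POSIX-threads sketch in C: one process, two threads sharing its memory, scheduled by the operating system.

#include <pthread.h>
#include <stdio.h>

/* Each thread is a separate flow of control through this function. */
static void *worker(void *arg) {
    const char *name = arg;
    for (int i = 0; i < 3; i++)
        printf("%s: step %d\n", name, i);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;

    /* Both threads live in the same process and share its memory. */
    pthread_create(&t1, NULL, worker, "thread 1");
    pthread_create(&t2, NULL, worker, "thread 2");

    /* On a single processor the OS time-slices the two threads;
       on a multiprocessor they can run at the same time. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}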

Hyperthreading

Hyperthreading allows the apparently parallel execution of two threads.

Each thread has its own set of registers, but they share a single CPU.

The CPU alternates executing instructions.

Pipelining is improved because adjacent instructions come from different threads and therefore do not have data hazards.


Hyperthreading Performance

Also called multithreaded processors.

Overall throughput is increased, but individual threads do not run as fast as they would without hyperthreading.

[Figure: instruction slot diagrams comparing pipelining (one thread's instructions back to back) with hyperthreading (slots alternating between Thread 1 and Thread 2).]

Hyperthreading:
1. Runs best in single-threaded applications
2. Replaces pipelining
3. Duplicates the user register set
4. … processors

How Visible is Parallelism?

Superscalar
  Programmer never notices.

Multiple threads
  Programmer must create multiple threads in the program.

Multiple processes
  Different programs: programmer never notices.
  Parts of the same program: programmer must divide the work among different processes.


Flynn's Parallel Classification

SISD - Single Instruction Single Data
  standard uniprocessors

SIMD - Single Instruction Multiple Data
  vector and array processors

MISD - Multiple Instruction Single Data
  systolic processors

MIMD - Multiple Instruction Multiple Data

MIMD Computers

Shared Memory
  Attached processors
  Multiple processor systems
  Multiple instruction issue processors
  Multi-Core processors
  NUMA

Separate Memory
  Clusters
  Message passing multiprocessors
  Hypercube
  Mesh

Single Instruction Multiple Data

Many high performance computing programs perform calculations on large data sets. SIMD computers can be used to perform matrix operations.

The Intel Pentium MMX instructions perform SIMD operations on 8 one-byte integers.
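As a hedged illustration: the MMX intrinsic _mm_add_pi8 performs exactly the 8-byte addition described above, but MMX is legacy today, so this sketch uses the 128-bit SSE2 successor (assumed available; SSE2 is not named in the slides), which adds 16 one-byte integers with a single instruction.

/* packed-byte SIMD addition with Intel SSE2 intrinsics */
#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void) {
    unsigned char a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 10 * i; }

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi8(va, vb);   /* 16 byte-adds, one instruction */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 16; i++)
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}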

Vector Processors

A vector is a one-dimensional array.

Vector processors are SIMD machines that have instructions to perform operations on vectors.

A vector processor might have vector registers.

A single instruction could add two vectors.


    Vector Example

/* traditional scalar loop: 64 separate additions */
double a[64], b[64], c[64];

for (int i = 0; i < 64; i++)
    c[i] = a[i] + b[i];

/* a vector processor does the same work in one
   instruction: c = a + b, element by element */

    Non-contiguous Vectors

A simple vector is a one-dimensional array with each element in successive addresses.

In a two-dimensional array, the rows can be considered simple vectors. The columns are also vectors, but they are not stored in consecutive addresses.

The distance between elements of a vector is called the stride.
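A small C sketch (mine, not from the slides) showing the two strides in a row-major 3x4 array: row elements are 1 apart, column elements are a full row length apart.

#include <stdio.h>

#define ROWS 3
#define COLS 4

int main(void) {
    double m[ROWS][COLS];
    double *p = &m[0][0];              /* the matrix as flat memory */

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 10.0 * i + j;

    /* row 1 is a simple vector: stride 1 */
    double row_sum = 0.0;
    for (int j = 0; j < COLS; j++)
        row_sum += p[1 * COLS + j];

    /* column 1 is a vector too, but with stride COLS */
    double col_sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        col_sum += p[i * COLS + 1];

    printf("row 1 sum = %g, column 1 sum = %g\n", row_sum, col_sum);
    return 0;
}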

    Vector Stride

Matrix multiplication:

    [a11 a12 a13]   [b11 b12]
    [a21 a22 a23] * [b21 b22]
                    [b31 b32]

The rows of A are unit-stride vectors; the columns of B are vectors whose stride equals the row length.
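To make the stride concrete in code, here is a sketch (my own; sdot is a hypothetical helper modeled on the BLAS dot product) that forms each element of the product as a dot product of a unit-stride row of A with a stride-2 column of B.

#include <stdio.h>

/* hypothetical helper: dot product of n elements with arbitrary strides */
static double sdot(int n, const double *x, int incx,
                          const double *y, int incy) {
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += x[k * incx] * y[k * incy];
    return sum;
}

int main(void) {
    /* A is 2x3 and B is 3x2, both stored row-major */
    double A[2 * 3] = {1, 2, 3,
                       4, 5, 6};
    double B[3 * 2] = {1, 2,
                       3, 4,
                       5, 6};
    double C[2 * 2];

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            /* row i of A: stride 1; column j of B: stride 2 */
            C[i * 2 + j] = sdot(3, &A[i * 3], 1, &B[j], 2);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}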

    Cray-1 Vector Processor

Sold to Los Alamos National Laboratory in 1976 for $8.8 million

80 MHz clock

160 megaflops

8 megabytes (1M words) of memory

Freon cooled


Cray-2

Built in 1985 with ten times the performance of the Cray-1.

Intel calls some of their new Pentium instructions SIMD because:
1. They operate on multiple bytes
2. There is a separate processor for them
3. … units operate on one piece of data
4. They are executed in the GPU

    Data Flow Architecture

[Figure: data flow graph computing N = (A+B) * (B-4).]

MISD Systolic Architectures

A computed value is passed from one processor to the next.

[Figure: Memory -> Processor Element -> Processor Element -> Processor Element]

Each processor may perform a different operation.

Minimizes memory accesses.


Systolic Architectures

Systolic arrays were meant to be special-purpose processors or coprocessors, built from very fine-grained computational units, usually called cells.

Communication is very fast; granularity is meant to be very fine (a small number of computational operations per communication).

Very fast clock rates due to the regular, synchronous structure.

Data moves through pipelined computational units in a regular and rhythmic fashion.

Warp and iWarp were examples of systolic arrays.

from www-csag.ucsd.edu/teaching/cse160s05/lectures/Lecture15.pdf

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product.

[Figure: snapshots of the array at T = 0 through T = 7. The rows of A enter from the left and the columns of B from the top, skewed so that matching operands align in time; at each step every cell multiplies the pair of values passing through it, adds the product to its accumulator, and passes the values on. By T = 7 cell (i,j) holds a(i,0)*b(0,j) + a(i,1)*b(1,j) + a(i,2)*b(2,j) and the multiplication is done.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
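The rhythm of the figure can be checked in software. Below is a sketch (my own, not from the course) that simulates the 3x3 array: each cell takes an A operand from its left neighbor and a B operand from its top neighbor, multiply-accumulates, and passes the operands on, with inputs skewed exactly as in the snapshots.

#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double C[N][N] = {{0}};   /* one accumulator per cell        */
    double a[N][N] = {{0}};   /* A value currently in each cell  */
    double b[N][N] = {{0}};   /* B value currently in each cell  */

    /* after 3N-2 steps every operand has passed through the array */
    for (int t = 0; t < 3 * N - 2; t++) {
        /* update from the far corner back so each cell reads its
           neighbor's value from the previous step */
        for (int i = N - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                /* leftmost column takes skewed input from row i of A */
                a[i][j] = (j == 0)
                    ? ((t >= i && t < i + N) ? A[i][t - i] : 0.0)
                    : a[i][j - 1];
                /* top row takes skewed input from column j of B */
                b[i][j] = (i == 0)
                    ? ((t >= j && t < j + N) ? B[t - j][j] : 0.0)
                    : b[i - 1][j];
                C[i][j] += a[i][j] * b[i][j];
            }
        }
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%6.1f ", C[i][j]);   /* matches an ordinary matmul */
        printf("\n");
    }
    return 0;
}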

    Systolic Architectures

Never caught on for general-purpose computation
  Regular structures are HARD to achieve
  Real efficiency requires local communication, but the flexibility of global communication is often critical

Used mainly in structured settings
  High speed multipliers
  Structured array computations

from www-csag.ucsd.edu/teaching/cse160s05/lectures/Lecture15.pdf

Multiple Instruction Multiple Data

Multiple independent processors can execute different parts of a program at the same time.

The processors can be different or identical.

Different architectures allow the processors to use:
  the same memory
  separate memory
  local memory with access to all memory

    Specialized Processors

Some architectures have a general purpose processor with additional specialized processors:

I/O processors - I/O controllers with DMA

Graphics processors

Floating point coprocessors - the Intel 80387 for the Intel 80386

Vector processors


    Separate Memory Systems

Each processor has its own memory that none of the other processors can access.

Each processor has unhindered access to its memory.

No cache coherence problems.

Communication is done by passing messages over special I/O channels.
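The slides do not name an API, but as an illustration, here is a minimal message-passing sketch using MPI (the usual library on separate-memory clusters): process 0 sends an integer to process 1; neither process can read the other's memory directly. Run with at least two processes, e.g. mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        /* the message travels over the interconnect, not shared memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}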

Topologies

The CPUs can be interconnected in different ways.

Goals of an interconnection method are:
  Minimize the number of connections per CPU
  Minimize the average distance between nodes
  Minimize the diameter, or maximum distance between any two nodes
  Maximize the throughput
  Simple routing of messages

Grid or Mesh

Each node has 2, 3, or 4 connections.

Routing is simple, and many routes are available.

The diameter is 2(√N - 1), about 2√N.
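A sketch (mine, not from the slides) of the simple routing: dimension-order (X-then-Y) routing walks the row first, then the column, so the hop count is the Manhattan distance that gives the mesh its diameter.

#include <stdio.h>

/* print each hop from (x0,y0) to (x1,y1) and return the hop count */
static int xy_route(int x0, int y0, int x1, int y1) {
    int hops = 0;
    while (x0 != x1) {                 /* correct X first */
        x0 += (x1 > x0) ? 1 : -1;
        printf("-> (%d,%d)\n", x0, y0);
        hops++;
    }
    while (y0 != y1) {                 /* then correct Y */
        y0 += (y1 > y0) ? 1 : -1;
        printf("-> (%d,%d)\n", x0, y0);
        hops++;
    }
    return hops;
}

int main(void) {
    /* corner to corner of a 4x4 mesh: 6 hops = 2(√16 - 1) */
    printf("%d hops\n", xy_route(0, 0, 3, 3));
    return 0;
}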

Torus

Similar to a mesh with connected end nodes.

Many similar routes are available.

The diameter is about √N.

    Parallel Architectures

  • 7/27/2019 Parallel Arch

    11/11

    Parallel Architectures

    11

Hypercube

Each node has log2 N connections.

Routing can be done by changing one bit of the address at a time.

The diameter is log2 N.
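A sketch (mine, not from the slides) of that routing rule: XOR the current and destination node numbers and flip one differing address bit per hop, which takes at most log2 N hops.

#include <stdio.h>

static void hypercube_route(unsigned src, unsigned dst) {
    unsigned node = src;
    printf("%u", node);
    while (node != dst) {
        unsigned diff = node ^ dst;    /* address bits still to fix */
        unsigned bit = diff & -diff;   /* lowest differing bit      */
        node ^= bit;                   /* one hop across that dimension */
        printf(" -> %u", node);
    }
    printf("\n");
}

int main(void) {
    /* in an 8-node cube: 000 -> 001 -> 101 */
    hypercube_route(0u, 5u);
    return 0;
}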

NUMA systems

In NonUniform Memory Access systems, each node has local memory but can also access the memory of other nodes.

Most memory accesses are to local memory.

The memory of other nodes can be accessed, but the access is slower than access to local memory.

Cache coherence algorithms are used when shared memory is accessed.

More scalable than SMP.

[Figure: NUMA system.]
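As an illustration only (the slides name no API), here is a sketch using Linux's libnuma, assumed to be installed (link with -lnuma): the array is placed on one node's local memory, so threads running on that node get the fast, local accesses.

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    size_t size = 1024 * 1024 * sizeof(double);

    /* allocate the array on node 0's local memory */
    double *a = numa_alloc_onnode(size, 0);
    if (a == NULL)
        return 1;

    /* a thread running on node 0 gets fast local accesses;
       threads on other nodes can still read it, just more slowly */
    for (size_t i = 0; i < size / sizeof(double); i++)
        a[i] = (double)i;

    numa_free(a, size);
    return 0;
}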