Page 1: SE363 Computer Architecture MIMD Parallel Processors John Morris Iolanthe II racing in Waitemata Harbour.

SE363

Computer Architecture

MIMD Parallel Processors

John Morris

Iolanthe II racing in Waitemata Harbour

Page 2:

MIMD Systems

• Recipe
  • Buy a few high-performance commercial PEs
    • DEC Alpha
    • MIPS R10000
    • UltraSPARC
    • Pentium?
  • Put them together with some memory and peripherals on a common bus
  • Instant parallel processor!
• How to program it?

Page 3:

Programming Model

• Problem not unique to MIMD
  • Even sequential machines need one
    • von Neumann (stored program) model
• Parallel - splitting the workload
  • Data
    • Distribute data to PEs
  • Instructions
    • Distribute tasks to PEs
  • Synchronization
    • Having divided the data & tasks, how do we synchronize tasks?

Page 4:

Programming Model

• Shared Memory Model
  • Flavour of the year
  • Generally thought to be simplest to manage
  • All PEs see a common (virtual) address space
  • PEs communicate by writing into the common address space

Page 5:

Data Distribution

• Trivial
  • All the data sits in the common address space
  • Any PE can access it!
• Uniform Memory Access (UMA) systems
  • All PEs access all data with the same access time (t_acc)
• Non-UMA (NUMA) systems
  • Memory is physically distributed
  • Some PEs are “closer” to some addresses
  • More later!

Page 6:

Synchronisation

• Read static shared data
  • No problem!
• Update problem
  • PE0 writes x
  • PE1 reads x
  • How to ensure that PE1 reads the last value written by PE0?
• Semaphores
  • Lock resources (memory areas or ...) while being updated by one PE

Page 7:

Synchronisation

• Semaphore
  • Data structure in memory
    • Count of waiters
      • -1 = resource free
      • >= 0 = resource in use
    • Pointer to list of waiters
  • Two operations
    • Wait
      • Proceed immediately if resource free (waiter count = -1)
    • Notify
      • Advise semaphore that you have finished with resource
      • Decrement waiter count
      • First waiter will be given control

Page 8:

Semaphores - Implementation

• Scenario
  • Semaphore free (-1)
  • PE0: wait ..
    • Resource free, so PE0 uses it (sets 0)
  • PE1: wait ..
    • Reads count (0)
    • Starts to increment it ..
  • PE0: notify ..
    • Gets bus and writes -1
  • PE1: (finishing wait)
    • Adds 1 to 0, writes 1 to count, adds PE1 TCB to list
• Stalemate!
  • Who issues notify to free the resource?
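The scenario above can be replayed step by step. The following is a minimal sketch, not a real semaphore: the `Semaphore` class and the variable names are illustrative assumptions, and the interleaving is written out by hand so that the outcome is deterministic.

```python
# Deterministic replay of the slide's interleaving (illustrative names only).
FREE = -1


class Semaphore:
    def __init__(self):
        self.count = FREE   # -1 = resource free, >= 0 = resource in use
        self.waiters = []   # stands in for the list of waiting TCBs


sem = Semaphore()

# PE0: wait -> resource is free, so PE0 takes it
sem.count = sem.count + 1        # -1 -> 0
owner = "PE0"

# PE1: wait -> begins a non-atomic read-modify-write
pe1_read = sem.count             # reads 0

# PE0: notify -> gets the bus between PE1's read and PE1's write
sem.count = sem.count - 1        # 0 -> -1, resource released
owner = None

# PE1: finishes its wait using the stale value it read earlier
sem.count = pe1_read + 1         # writes 1
sem.waiters.append("PE1")        # PE1 queues itself

# Nobody owns the resource, yet PE1 is queued waiting for a notify
print(sem.count, sem.waiters, owner)
```

At the end the count says "in use with one waiter", but no PE holds the resource, so no notify will ever arrive.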

Page 9:

Atomic Operations

• Problem
  • PE0 wrote a new value (-1) after PE1 had read the counter
  • PE1 increments the value it read (0) and writes it back
• Solution
  • PE1’s read and update must be atomic
    • No other PE may gain access to the counter while PE1 is updating
  • Usually an architecture will provide
    • Test and set instruction
      • Read a memory location, test it; if it’s 0, write a new value, else do nothing
      • Atomic or indivisible: no other PE can access the value until the operation is complete

Page 10:

Atomic Operations

• Test & Set
  • Read a memory location, test it; if it’s 0, write a new value, else do nothing
  • Can be used to guard a resource
    • When the location contains 0, access to the resource is allowed
    • Non-zero value means the resource is locked
  • Semaphore:
    • Simple semaphore (no wait list)
      • Implement directly
      • Waiter “backs off” and tries again (rather than being queued)
    • Complex semaphore (with wait list)
      • Guards the wait counter
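A simple semaphore (no wait list) built on test & set might look like the sketch below. Python has no test-and-set instruction, so the `Word` class emulates the indivisible hardware operation with an internal `threading.Lock`; that emulation, and all the names used, are assumptions made for illustration.

```python
import threading
import time


class Word:
    """One shared memory word; the inner lock emulates hardware atomicity."""

    def __init__(self):
        self.value = 0
        self._hw = threading.Lock()   # stands in for the indivisible operation

    def test_and_set(self, new_value=1):
        """Return the old value; write new_value only if the old value was 0."""
        with self._hw:
            old = self.value
            if old == 0:
                self.value = new_value
            return old

    def clear(self):
        with self._hw:
            self.value = 0


def acquire(word):
    # Simple semaphore: no wait list, the waiter backs off and tries again.
    while word.test_and_set() != 0:
        time.sleep(0)   # yield to other threads before retrying


def release(word):
    word.clear()


guard = Word()
shared = {"counter": 0}


def worker(n):
    for _ in range(n):
        acquire(guard)
        shared["counter"] += 1   # guarded read-modify-write
        release(guard)


threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["counter"])   # 4 threads x 500 increments
```

Because every increment happens while the guard word is non-zero, no update is lost and the final count is 2000.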

Page 11:

Atomic Operations

• Processor must provide an atomic operation for
  • Multi-tasking or multi-threading on a single PE
    • Multiple processes
    • Interrupts occur at arbitrary points in time
      • including timer interrupts signalling end of time-slice
    • Any process can be interrupted in the middle of a read-modify-write sequence
  • Shared memory multi-processors
    • One PE can lose control of the bus after the read of a read-modify-write
• Cache?
  • Later!

Page 12:

Atomic Operations

• Variations
  • Provide equivalent capability
  • Sometimes appear in strange guises!
    • Read-modify-write bus transactions
      • Memory location is read, modified and written back as a single, indivisible operation
    • Test and exchange
      • Check register’s value; if 0, exchange with memory
    • Reservation Register (PowerPC)
      • lwarx - load word and reserve indexed
      • stwcx - store word conditional indexed
      • Reservation register stores address of reserved word
      • Reservation and use can be separated by a sequence of instructions
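The reservation idea can be sketched in software. The toy `Memory` class below is an assumption for illustration only: it keeps one reservation per PE and cancels a reservation whenever the reserved word is stored to, which is a simplification of what real hardware does.

```python
class Memory:
    """Toy memory with one reservation register per PE (a simplification)."""

    def __init__(self):
        self.words = {}
        self.reservation = {}   # PE id -> reserved address

    def load_reserve(self, pe, addr):
        """Like lwarx: read the word and remember its address."""
        self.reservation[pe] = addr
        return self.words.get(addr, 0)

    def store(self, addr, value):
        """Ordinary store: cancels every reservation on this address."""
        self.words[addr] = value
        for pe in [p for p, a in self.reservation.items() if a == addr]:
            del self.reservation[pe]

    def store_conditional(self, pe, addr, value):
        """Like stwcx: store only if this PE's reservation is still intact."""
        if self.reservation.get(pe) != addr:
            return False
        self.store(addr, value)
        return True


def atomic_increment(mem, pe, addr):
    """Retry load-reserve / store-conditional until the store succeeds."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_reserve(pe, addr)
        if mem.store_conditional(pe, addr, old + 1):
            return attempts


mem = Memory()
mem.store(0x100, -1)

# PE1 reserves the counter, then PE0 writes before PE1 can store.
stale = mem.load_reserve("PE1", 0x100)
mem.store(0x100, 5)                                          # PE0's write
first_try = mem.store_conditional("PE1", 0x100, stale + 1)   # reservation lost

attempts = atomic_increment(mem, "PE1", 0x100)               # retry succeeds
print(first_try, mem.words[0x100], attempts)
```

The failed conditional store is what prevents the stale-value update seen in the semaphore scenario: PE1 must reread the word before it can write.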

Page 13:

Barriers

• In a shared memory environment
  • PEs must know when another PE has produced a result
• Simplest case: barrier for all PEs
  • Must be inserted by programmer
• Potentially expensive
  • All PEs stall and waste time in the barrier
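As a small illustration, Python's standard `threading.Barrier` can play the role of a barrier for all PEs. Threads stand in for PEs here, and the two-phase computation is an invented example.

```python
import threading

N_PES = 4
barrier = threading.Barrier(N_PES)
partial = [0] * N_PES   # phase 1 results, one slot per PE
totals = [0] * N_PES    # phase 2 results


def pe(rank):
    partial[rank] = (rank + 1) * 10   # phase 1: produce a result
    barrier.wait()                    # stall until every PE has produced
    totals[rank] = sum(partial)       # phase 2: all results are now visible


threads = [threading.Thread(target=pe, args=(r,)) for r in range(N_PES)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals)
```

Without the `barrier.wait()` call, a fast PE could sum `partial` before the slower PEs had written their slots.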

Page 14:

Cache?

• What happens to cached locations?

Page 15:

Multiple Caches

• Coherence
  • PEA reads location x from memory
    • Copy in cache A
  • PEB reads location x from memory
    • Copy in cache B
  • PEA adds 1

Page 16:

Multiple Caches - Inconsistent states

• Coherence
  • PEA reads location x from memory
    • Copy in cache A
  • PEB reads location x from memory
    • Copy in cache B
  • PEA adds 1
    • A’s copy now 201
  • PEB reads location x
    • reads 200 from cache B

Page 17:

Multiple Caches - Inconsistent states

• Coherence
  • PEA reads location x from memory
    • Copy in cache A
  • PEB reads location x from memory
    • Copy in cache B
  • PEA adds 1
    • A’s copy now 201
  • PEB reads location x
    • reads 200 from cache B
  • Caches and memory are now inconsistent or not coherent

Page 18:

Cache - Maintaining Coherence

• Invalidate on write
  • PEA reads location x from memory
    • Copy in cache A
  • PEB reads location x from memory
    • Copy in cache B
  • PEA adds 1
    • A’s copy now 201
    • Issues invalidate x
  • Cache B marks x invalid
• Invalidate is an address transaction only

Page 19:

Cache - Maintaining Coherence

• Reading the new value
  • PEB reads location x
    • Main memory is wrong also
  • PEA snoops read
    • Realises it has valid copy
  • PEA issues retry

Page 20:

Cache - Maintaining Coherence

• Reading the new value
  • PEB reads location x
    • Main memory is wrong also
  • PEA snoops read
    • Realises it has valid copy
  • PEA issues retry
  • PEA writes x back
    • Memory now correct
  • PEB reads location x again
    • Reads latest version
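The invalidate-on-write and snoop sequence can be traced with a toy model. The `Bus` and `Cache` classes below are assumptions for illustration: each line is a single word, and the retry plus write-back is folded into one `snoop_read` step instead of being modelled as separate bus transactions.

```python
class Bus:
    def __init__(self, memory):
        self.memory = memory   # addr -> value
        self.caches = []


class Cache:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.lines = {}        # addr -> [value, dirty flag]
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            # Read miss: other caches snoop; a modified copy is written back
            for other in self.bus.caches:
                if other is not self:
                    other.snoop_read(addr)
            self.lines[addr] = [self.bus.memory[addr], False]
        return self.lines[addr][0]

    def write(self, addr, value):
        self.read(addr)        # make sure the line is present
        # Invalidate: an address-only transaction seen by the other caches
        for other in self.bus.caches:
            if other is not self:
                other.lines.pop(addr, None)
        self.lines[addr] = [value, True]

    def snoop_read(self, addr):
        line = self.lines.get(addr)
        if line and line[1]:
            self.bus.memory[addr] = line[0]   # write the modified copy back
            line[1] = False


X = 0x40
memory = {X: 200}
bus = Bus(memory)
cache_a, cache_b = Cache("A", bus), Cache("B", bus)

cache_a.read(X)                          # copy in cache A
cache_b.read(X)                          # copy in cache B
cache_a.write(X, cache_a.read(X) + 1)    # A's copy now 201, B invalidated

memory_before = memory[X]                # main memory is still wrong
b_invalid = X not in cache_b.lines
b_value = cache_b.read(X)                # A snoops, writes back, B rereads

print(memory_before, b_invalid, b_value, memory[X])
```

After the final read, cache B and main memory both hold 201, matching the last steps of the slide.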

Page 21:

Coherent Cache - Snooping

• SIU “snoops” bus for transactions
  • Addresses compared with local cache
  • Matches
    • Initiate retries
      • Local copy is modified
      • Local copy is written to bus
    • Invalidate local copies
      • Another PE is writing
    • Mark local copies shared
      • Second PE is reading same value

Page 22:

Coherent Cache - MESI protocol

• Cache line has 4 states
  • Invalid
  • Modified
    • Only valid copy
    • Memory copy is invalid
  • Exclusive
    • Only cached copy
    • Memory copy is valid
  • Shared
    • Multiple cached copies
    • Memory copy is valid

Page 23:

MESI State Diagram

• Note the number of bus transactions needed!

Legend:
  WH   Write Hit
  WM   Write Miss
  RH   Read Hit
  RMS  Read Miss Shared
  RME  Read Miss Exclusive
  SHW  Snoop Hit Write
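The diagram itself did not survive in this transcript, so the table below is a sketch of the textbook MESI transitions rather than a copy of the slide. Event names follow the legend; `SHR` (snoop hit on a read) is an added label that is not in the legend, and the set of transitions counted as needing a bus transaction is likewise an assumption.

```python
# (state, event) -> next state for one cache line.
TRANSITIONS = {
    ("I", "RMS"): "S", ("I", "RME"): "E", ("I", "WM"): "M",
    ("E", "RH"): "E", ("E", "WH"): "M", ("E", "SHR"): "S", ("E", "SHW"): "I",
    ("S", "RH"): "S", ("S", "WH"): "M", ("S", "SHR"): "S", ("S", "SHW"): "I",
    ("M", "RH"): "M", ("M", "WH"): "M", ("M", "SHR"): "S", ("M", "SHW"): "I",
}

# Transitions assumed to cost this cache a bus transaction:
# misses, the invalidate on a shared write hit, and push-outs of modified data.
NEEDS_BUS = {
    ("I", "RMS"), ("I", "RME"), ("I", "WM"),
    ("S", "WH"),
    ("M", "SHR"), ("M", "SHW"),
}


def run(state, events):
    """Apply events in order; return the final state and bus transaction count."""
    bus_transactions = 0
    for event in events:
        if (state, event) in NEEDS_BUS:
            bus_transactions += 1
        state = TRANSITIONS[(state, event)]
    return state, bus_transactions


# Read miss (no other copy), write, another PE reads, write again, another PE writes
final_state, bus_count = run("I", ["RME", "WH", "SHR", "WH", "SHW"])
print(final_state, bus_count)
```

In this trace only the write hit in the Exclusive state is free; the other four events each put a transaction on the bus.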

Page 24:

Coherent Cache - The Cost

• Cache coherency transactions
  • Additional transactions needed
    • Shared
      • Write Hit
        • Other caches must be notified
    • Modified
      • Other PE read
        • Push-out needed
      • Other PE write
        • Push-out needed - writing one word of n-word line
    • Invalid - modified in other cache
      • Read or write
        • Wait for push-out

Page 25:

Clusters

• A bus which is too long becomes slow!
  • eg PCI is limited to 10 TTL loads
• Lots of processors?
  • On the same bus
    • Bus speed must be limited
    • Low communication rate
    • Better to use a single PE!
• Clusters
  • ~8 processors on a bus

Page 26:

Clusters

• 8 cache coherent (CC) processors on a bus
• Interconnect network
• ~100? clusters

Page 27:

Clusters

• Network Interface Unit
  • Detects requests for “remote” memory

Page 28:

Clusters

• Memory Request Message
  • Message despatched to remote cluster’s NIU

Page 29:

Clusters - Shared Memory

• Non Uniform Memory Access
  • Access time to memory depends on location!
• Diagram callouts
  • From PEs in this cluster, this memory is much closer than this one!

Page 30:

Clusters - Shared Memory

• Non Uniform Memory Access
  • Access time to memory depends on location!
• Worse!
  • NIU needs to maintain cache coherence across the entire machine

Page 31:

Clusters - Maintaining Cache Coherence

• NIU (or equivalent) maintains directory
  • Directory Entries
    • All lines from local memory cached elsewhere
  • NIU software (firmware)
    • Checks memory requests against directory
    • Updates directory
    • Sends invalidate messages to other clusters
    • Fetches modified (dirty) lines from other clusters
• Remote memory access cost
  • 100s of cycles!

Directory (Cluster 2)

Address  Status  Clusters
4340     S       1, 3, 8
5260     E       9
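A directory like the one shown for cluster 2 could be handled roughly as follows. This is a sketch under stated assumptions: the message tuples, the function names `remote_read` and `local_write`, and the rule that an E (exclusive) holder must be asked for its copy before the line is shared are all illustrative, not taken from any particular machine.

```python
# Directory held by cluster 2: lines of its local memory cached elsewhere.
directory = {
    4340: {"status": "S", "clusters": {1, 3, 8}},
    5260: {"status": "E", "clusters": {9}},
}


def remote_read(directory, addr, cluster):
    """Another cluster reads a local line; return the messages to send."""
    messages = []
    entry = directory.get(addr)
    if entry is None:
        # First remote copy: that cluster holds it exclusively
        directory[addr] = {"status": "E", "clusters": {cluster}}
        return messages
    if entry["status"] == "E":
        # The exclusive holder may have a modified (dirty) copy: fetch it
        holder = next(iter(entry["clusters"]))
        messages.append(("fetch", holder, addr))
    entry["status"] = "S"
    entry["clusters"].add(cluster)
    return messages


def local_write(directory, addr):
    """A PE in the home cluster writes the line: invalidate all remote copies."""
    entry = directory.pop(addr, None)
    if entry is None:
        return []
    return [("invalidate", c, addr) for c in sorted(entry["clusters"])]


read_msgs = remote_read(directory, 5260, 4)   # cluster 4 reads line 5260
write_msgs = local_write(directory, 4340)     # local write to line 4340

print(read_msgs)
print(write_msgs)
```

Each message in these lists stands for a network transaction costing on the order of the hundreds of cycles quoted above.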

Page 32:

Clusters - “Off the shelf”

• Commercial clusters
  • Provide page migration
    • Make copy of a remote page on the local PE
    • Programmer remains responsible for coherence
  • Don’t provide hardware support for cache coherence (across network)
    • Fully CC machines may never be available!
• Software Systems
  • ....

Page 33:

Shared Memory Systems

• Software Systems eg TreadMarks
  • Provide shared memory on page basis
  • Software
    • detects references to remote pages
    • moves copy to local memory
  • Reduces shared memory overhead
  • Provides some of the shared memory model convenience
    • Without swamping interconnection network with messages
  • Message overhead is too high for a single word!
    • Word basis is too expensive!!

Page 34:

Shared Memory Systems - Granularity

• Granularity
  • Word basis is too expensive!!
  • Sharing data at low granularity
    • Fine grain sharing
    • Access / sharing for individual words
  • Overheads too high
    • Number of messages
    • Message overhead is high for one word
  • Compare
    • Burst access to memory
    • Don’t fetch a single word
      • Overhead (bus protocol) is too high
    • Amortize cost of access over multiple words
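The amortisation argument is simple arithmetic. In the sketch below the fixed overhead of 100 cycles per message and the 1 cycle per word transferred are assumed figures chosen only to show the shape of the trade-off.

```python
def cost_per_word(overhead, per_word, n_words):
    """Average cost of one word when a fixed overhead is shared by n_words."""
    return (overhead + per_word * n_words) / n_words


OVERHEAD = 100   # assumed fixed cost of one message or bus transaction (cycles)
PER_WORD = 1     # assumed cost of transferring each word (cycles)

single_word = cost_per_word(OVERHEAD, PER_WORD, 1)
whole_page = cost_per_word(OVERHEAD, PER_WORD, 1024)   # e.g. a 1024-word page

print(single_word, whole_page)
```

With these numbers a word fetched on its own costs 101 cycles, while a word fetched as part of a 1024-word page costs just under 1.1 cycles.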

Page 35:

Shared Memory Systems - Granularity

• Coarse Grain Systems
  • Transferring data from cluster to cluster
    • Overhead
      • Messages
      • Updating directory
    • Amortise the overhead over a whole page
      • Lower relative overhead
  • Applies to thread size also
    • Split program into small threads of control
      • Parallel Overhead
        • cost of setting up & starting each thread
        • cost of synchronising at the end of a set of threads
      • Can be more efficient to run a single sequential thread!

Page 36:

Coarse Grain Systems

• So far ...
  • Most experiments suggest that fine grain systems are impractical
  • Larger, coarser grain
    • Blocks of data
    • Threads of computation
  • are needed to reduce overall computation time by using multiple processors
• Too fine grain parallel systems
  • can run slower than a single processor!

Page 37:

Parallel Overhead

• Ideal
  • Time = 1/n
• Add Overhead
  • Time > optimal
  • No point in using more than 4 PEs!!
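The shape of the overhead curve can be reproduced with a simple model. The slide does not give its overhead formula, so the linear term below (a fixed cost per PE, set to 1/16 of the sequential time) is an assumption chosen so that the minimum falls at 4 PEs.

```python
def exec_time(n, overhead_per_pe=1 / 16):
    """Ideal time 1/n plus an assumed overhead that grows linearly with n."""
    return 1.0 / n + overhead_per_pe * n


times = {n: exec_time(n) for n in range(1, 13)}
best = min(times, key=times.get)

for n in (1, 2, 4, 8, 12):
    print(n, round(times[n], 4))
print("best number of PEs:", best)
```

Under this model the time at 8 PEs (0.625) is already worse than at 4 PEs (0.5), so adding processors beyond the minimum only adds overhead.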

[Figure: execution time (0 to 1.2) against number of PEs (0 to 12); two curves: Ideal, and Ideal + parallel overhead]

Page 39:

Parallel Overhead

• Shared memory systems: best results if you
  • Share on large block basis, eg page
  • Split program into coarse grain (long running) threads
  • Give away some parallelism to achieve any parallel speedup!
• Coarse grain
  • Data
  • Computation

There’s parallelism at the instruction level too! The instruction issue unit in a sequential processor is trying to exploit it!

Page 40:

Clusters - Improving multiple PE performance

• Bandwidth to memory
  • Cache reduces dependency on the memory-CPU interface
    • 95% cache hits, so only 5% of memory accesses cross the interface
  • but add
    • a few PEs and
    • a few CC transactions
  • and even if the interface was coping before, it won’t in a multiprocessor system!
• A major bottleneck!

Page 41:

Clusters - Improving multiple PE performance

• Bus protocols add to access time
  • Request / Grant / Release phases needed
• “Point-to-point” is faster!
  • Cross-bar switch interface to memory
  • No PE contends with any other for the common bus
• Cross-bar?
  • Name taken from old telephone exchanges!

Page 42:

Clusters - Memory Bandwidth

• Modern Clusters
  • Use “Point-to-point” X-bar interfaces to memory to get bandwidth!
• Cache coherence?
  • Now really hard!!
  • How does each cache snoop all transactions?

Page 43:

Programming Model

• Distributed Memory
  • Message passing
  • Alternative to shared memory
  • Each PE has own address space
  • PEs communicate with messages
  • Messages provide synchronisation
    • PE can block or wait for a message

Page 44:

Programming Model - Distributed Memory

• Distributed Memory Systems
  • Hardware is simple!
  • Network can be as simple as Ethernet
  • Networks of Workstations model
    • Commodity (cheap!) PEs
    • Commodity Network
      • Standard
        • Ethernet
        • ATM
      • Proprietary
        • Myrinet
        • Achilles (UWA!)

Page 45:

Programming Model - Distributed Memory

• Distributed Memory Systems
  • Software is considered harder
  • Programmer responsible for
    • Distributing data to individual PEs
    • Explicit thread control
      • Starting, stopping & synchronising
  • At least two commonly available systems
    • Parallel Virtual Machine (PVM)
    • Message Passing Interface (MPI)
  • Built on two operations
    • Send: data, destPE, block | don’t block
    • Receive: data, srcPE, block | don’t block
    • Blocking ensures synchronisation
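The two operations can be imitated with standard-library queues. This is not PVM or MPI: the `send` and `receive` helpers, the per-pair mailboxes, and the use of threads as PEs are assumptions made to show how a blocking receive provides synchronisation.

```python
import queue
import threading

N_PES = 3
# One mailbox per (source PE, destination PE) pair
mailboxes = {(s, d): queue.Queue() for s in range(N_PES) for d in range(N_PES)}


def send(data, src_pe, dest_pe):
    """Non-blocking send: drop the message in the destination's mailbox."""
    mailboxes[(src_pe, dest_pe)].put(data)


def receive(src_pe, me, block=True):
    """Blocking receive by default; block=False raises queue.Empty if no message."""
    return mailboxes[(src_pe, me)].get(block=block)


results = {}


def worker(rank):
    chunk = receive(0, rank)      # blocks until PE0 has distributed the data
    send(sum(chunk), rank, 0)     # return the partial result to PE0


def master():
    data = list(range(10))
    send(data[:5], 0, 1)          # explicit data distribution
    send(data[5:], 0, 2)
    # Blocking receives double as the synchronisation point
    results["total"] = receive(1, 0) + receive(2, 0)


threads = [threading.Thread(target=master)]
threads += [threading.Thread(target=worker, args=(r,)) for r in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results["total"])
```

No lock or barrier appears anywhere: each PE simply cannot proceed until the message it needs has arrived.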

Page 46:

Programming Model - Distributed Memory

• Distributed Memory Systems
  • Performance generally better (versus shared memory)
  • Shared memory has hidden overheads
    • Grain size poorly chosen
      • eg data doesn’t fit into pages
    • Unnecessary coherence transactions
      • Updating a shared region (each page) before end of computation
      • MP system waits and updates page when computation is complete

Page 47:

Programming Model - Distributed Memory

• Distributed Memory Systems
  • Performance generally better (versus shared memory)
  • False sharing
    • Severely degrades performance
    • May not be apparent on superficial analysis

[Figure: one memory page; PEa accesses one part of it and PEb another, so the whole page ping-pongs between PEa and PEb]
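False sharing can be made visible by counting page moves. The sketch below assumes a 4096-byte page and a page-based shared memory system in which a page lives with whichever PE touched it last; both are illustrative assumptions.

```python
PAGE_SIZE = 4096   # assumed page size in bytes


def page_transfers(accesses, page_size=PAGE_SIZE):
    """Count how often a page must move because a different PE touches it."""
    owner = {}       # page number -> PE currently holding the page
    transfers = 0
    for pe, addr in accesses:
        page = addr // page_size
        if page in owner and owner[page] != pe:
            transfers += 1
        owner[page] = pe
    return transfers


# PEa and PEb update different variables that happen to share one page
same_page = [("PEa", 0), ("PEb", 2048)] * 100
# The same access pattern with PEb's data moved to the next page
separate_pages = [("PEa", 0), ("PEb", 4096)] * 100

print(page_transfers(same_page), page_transfers(separate_pages))
```

The two PEs never touch the same data, yet in the first layout nearly every access moves the page; moving one variable to its own page removes the traffic entirely.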

Page 48:

Distributed Memory - Summary

• Simpler (almost trivial) hardware
• Software
  • More programmer effort
  • Explicit data distribution
  • Explicit synchronisation
• Performance generally better
  • Programmer knows more about the problem
  • Communicates only when necessary
  • Communication grain size can be optimum
  • Lower overheads

Page 49:

Data Flow

• Conventional programming models are control driven
  • Instruction sequence is precisely specified
  • Sequence specifies control
    • which instruction the CPU will execute next
  • Execution rule:
    • Execute an instruction when its predecessor has completed

      s1: r = a*b;
      s2: s = c*d;
      s3: y = r + s;

    • s2 executes when s1 is complete
    • s3 executes when s2 is complete

Page 50:

Data Flow

• Consider the calculation
  • y = a*b + c*d
• Represent it by a graph
  • Nodes represent computations
  • Data flows along arcs
• Execution rule:
  • Execute an instruction when its data is available
  • Data driven rule

[Figure: dataflow graph for y = a*b + c*d; two multiply nodes (inputs a, b and d, c) feed an add node that produces y]

Page 51:

Data Flow

• Dataflow firing rule
  • An instruction fires (executes) when its data is available
  • Exposes all possible parallelism
    • Either multiplication can fire as soon as data arrives
    • Addition must wait
  • Data dependence analysis!
  • Instruction issue units:
    • Fire (issue) each instruction when its operands (registers) have been written

[Figure: dataflow graph for y = a*b + c*d; two multiply nodes (inputs a, b and d, c) feed an add node that produces y]
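The firing rule can be tried out on the graph for y = a*b + c*d. The tiny interpreter below is a sketch: the node names `r` and `s` are borrowed from the earlier sequential version, and delivering input tokens one at a time in a chosen order is an assumption used to show that execution order follows data arrival rather than program order.

```python
import operator

# node -> (operation, names of its inputs); graph for y = a*b + c*d
GRAPH = {
    "r": (operator.mul, ("a", "b")),
    "s": (operator.mul, ("c", "d")),
    "y": (operator.add, ("r", "s")),
}


def run_dataflow(graph, arrivals):
    """Deliver (name, value) tokens in order; fire every node whose data is ready."""
    values, fired = {}, []
    for name, value in arrivals:
        values[name] = value
        progress = True
        while progress:   # keep firing until no further node is enabled
            progress = False
            for node, (op, inputs) in graph.items():
                if node not in values and all(i in values for i in inputs):
                    values[node] = op(*(values[i] for i in inputs))
                    fired.append(node)
                    progress = True
    return values, fired


# c and d arrive first, so the second multiplication fires before the first
values, fired = run_dataflow(GRAPH, [("c", 4), ("d", 5), ("a", 2), ("b", 3)])
print(values["y"], fired)
```

Here `s` fires before `r` because its operands arrive first, and the addition fires only once both products exist.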