ECE200 – Computer Organization
Chapter 9 – Multiprocessors
What we’ll cover today
Multiprocessor motivation
Multiprocessor organizations
Shared memory multiprocessors
Cache coherence
Synchronization
Multiprocessor motivation, part 1
Many scientific applications take too long to run on a single-processor machine
Modeling of weather patterns, astrophysics, chemical reactions, ocean currents, etc.
Many of these are parallel applications that largely consist of loops operating on independent data
Such applications can make efficient use of a multiprocessor machine, with each loop iteration running on a different processor and operating on independent data
Multiprocessor motivation, part 2
Many multi-user environments require more compute power than is available from a single-processor machine
Airline reservation system, department store chain inventory system, file server for a large department, web server for a major corporation, etc.
These workloads largely consist of parallel transactions that operate on independent data
Such applications can make efficient use of a multiprocessor machine, with each transaction running on a different processor and operating on independent data
Multiprocessor organizations
Shared memory multiprocessors
All processors share the same memory address space
Single copy of the OS (although some parts may be parallel)
Relatively easy to program and to port sequential code to
Difficult to scale to large numbers of processors
Uniform memory access (UMA) machine block diagram
Multiprocessor organizations
Distributed memory multiprocessors
Processors have their own memory address space
Message passing used to access another processor’s memory
Multiple copies of the OS
Usually commodity hardware and network (e.g., Ethernet)
More difficult to program
Easier to scale hardware and more inherently fault resilient
Multiprocessor variants
Non-uniform memory access (NUMA) shared memory multiprocessors
All memory can be addressed by all processors, but access to a processor’s own local memory is faster than access to another processor’s remote memory
Looks like a distributed machine, but the interconnection network is usually custom-designed switches and/or buses
Multiprocessor variants
Distributed shared memory (DSM) multiprocessors
Commodity hardware of a distributed memory multiprocessor, but all processors have the illusion of shared memory
Operating system handles accesses to remote memory “transparently” on behalf of the application
Relieves the application developer of the burden of memory management across the network
Multiprocessor variants
Shared memory machines connected together over a network (operating as a distributed memory or DSM machine)
[Diagram: several shared memory machines, each with its own network controller, connected by a network]
Shared memory multiprocessors
Major design issues
Cache coherence: ensuring that stores to cached data are seen by other processors
Synchronization: the coordination among processors accessing shared data
Memory consistency: the definition of when a processor must observe a write from another processor
Cache coherence problem
Two writeback caches becoming incoherent
[Diagram: CPU 0 and CPU 1, each with its own cache, both connected to main memory holding block A]
(1) CPU 0 reads block A: CPU 0’s cache and main memory each hold a copy of A
(2) CPU 1 reads block A: both caches and main memory hold copies of A
(3) CPU 0 writes block A: CPU 0’s cache holds the new value, while CPU 1’s cache and main memory hold old, out-of-date copies of block A
Cache coherence protocols
Ensures that writes to cached blocks are observable by all processors
Assigns a state field to all cached blocks
Defines actions for performing reads and writes to blocks in each state that ensure cache coherence
Actions are much more complicated than described here in a real machine with a split-transaction bus
MESI cache coherence protocol
Commonly used (or variant thereof) in shared memory multiprocessors
Idea is to ensure that, when a cache wants to write to a cache block, other remote caches invalidate their copies first
Each cache block is in one of four states (2 bits stored with each cache block)
Invalid: contents are not valid
Shared: other processor caches may have the same copy; main memory has the same copy
Exclusive: no other processor cache has a copy; main memory has the same copy
Modified: no other processor cache has a copy; main memory has an old copy
MESI cache coherence protocol
Actions on a load that results in a cache hit
Local cache actions
Read block
Remote cache actions
None
Actions on a load that results in a cache miss
Local cache actions
Request block from bus
If not in a remote cache, set state to Exclusive
If also in a remote cache, set state to Shared
Remote cache actions
Look up cache tags to see if the block is present
If so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Shared
MESI cache coherence protocol
Actions on a store that results in a cache hit
Local cache actions
Check state of block
If Shared, send an Invalidation bus command to all remote caches
Write the block and change the state to Modified
Remote cache actions
Upon receipt of an Invalidation command on the bus, look up cache tags to see if the block is present
If so, change the state of the block to Invalid
Actions on a store that results in a cache miss
Local cache actions
Simultaneously request block from bus and send an Invalidation command
After block received, write the block and set the state to Modified
Remote cache actions
Look up cache tags to see if the block is present
If so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Invalid
Cache coherence problem revisited
[Diagram: CPU 0 and CPU 1, each with its own cache, both connected to main memory holding block A]
(1) CPU 0 reads block A: CPU 0’s copy enters state Exclusive
(2) CPU 1 reads block A: CPU 0’s copy changes from Exclusive to Shared, and CPU 1’s copy enters state Shared
(3) CPU 0’s cache sends an Invalidate command for block A: CPU 1’s copy changes from Shared to Invalid
(4) CPU 0 writes block A: CPU 0’s copy changes from Shared to Modified; CPU 1’s copy remains Invalid
Synchronization
For parallel programs to share data, we must make sure that accesses to a given memory location are ordered
Example: a database of available inventory at a department store, simultaneously accessed from different store computers; only one computer must “win the race” to reserve a particular item
Solution
Architecture defines a special atomic swap instruction in which a memory location is tested for 0 and, if so, is set to 1
Software associates a lock variable with each piece of data that needs to be ordered (e.g., a particular class of merchandise) and uses the atomic swap instruction to try to set it
Software acquires the lock before modifying the associated data (e.g., reserving the merchandise)
Software releases the lock by setting it to 0 when done
Synchronization flowchart
[Flowchart: try the atomic swap on the lock; if the lock was already set, retry (“spinning”); once the swap succeeds, enter the critical section, then release the lock]
Synchronization and coherence example
Questions?