Transcript of "Parallel Computing" by Erik Robbins (27 slides)

Page 1

Parallel Computing

Erik Robbins

Page 2

Limits on single-processor performance

Over time, computers have become better and faster, but there are constraints to further improvement.

- Physical barriers
  - Heat and electromagnetic interference limit chip transistor density.
  - Processor speeds are constrained by the speed of light.
- Economic barriers
  - Cost will eventually increase beyond the price anybody will be willing to pay.

Page 3

Parallelism

Improvement of processor performance by distributing the computational load among several processors.

- The processing elements can be diverse:
  - A single computer with multiple processors
  - Several networked computers

Page 4

Drawbacks to Parallelism

- Adds cost.
- Imperfect speed-up.
  - Given n processors, perfect speed-up would imply an n-fold increase in power.
  - A small portion of a program that cannot be parallelized will limit the overall speed-up.
  - "The bearing of a child takes nine months, no matter how many women are assigned."

Page 5

Amdahl's Law

This relationship is given by the equation:

S = 1 / ((1 - P) + P / N)

- S is the speed-up of the program (as a factor of its original sequential runtime).
- P is the fraction of the program that is parallelizable.
- N is the number of processors.
- As N grows without bound, the speed-up approaches the upper limit S = 1 / (1 - P): the serial fraction alone bounds how much parallelism can help.

Web Applet – http://www.cs.iastate.edu/~prabhu/Tutorial/CACHE/amdahl.html
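A minimal Python sketch of this calculation (the function name and the 90%-parallelizable example are illustrative assumptions, not from the slides):

```python
def amdahl_speedup(p, n):
    """Speed-up predicted by Amdahl's Law for parallelizable fraction p
    running on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# Example: a program that is 90% parallelizable.
for n in (2, 4, 8, 16, 1_000_000):
    print(f"{n:>9} processors: {amdahl_speedup(0.9, n):6.2f}x speed-up")

# Even with an enormous processor count, the speed-up never exceeds
# 1 / (1 - 0.9) = 10x, because the 10% serial portion still runs serially.
```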

Page 6

Amdahl's Law

Page 7

History of Parallel Computing – Examples

- 1954 – IBM 704
  - Gene Amdahl was a principal architect.
  - Used fully automatic floating-point arithmetic commands.
- 1962 – Burroughs Corporation D825
  - Four-processor computer.
- 1967 – Amdahl and Daniel Slotnick publish a debate about the feasibility of parallel computing.
  - Amdahl's Law coined.
- 1969 – Honeywell Multics system
  - Capable of running up to eight processors in parallel.
- 1970s – Cray supercomputers (SIMD architecture)
- 1984 – Synapse N+1
  - First bus-connected multiprocessor with snooping caches.

Page 8

History of Parallel Computing – Overview of Evolution

- 1950s – Interest in parallel computing began.
- 1960s & 70s – Advancements surfaced in the form of supercomputers.
- Mid-1980s – Massively parallel processors (MPPs) came to dominate the top end of computing.
- Late 1980s – Clusters (a type of parallel computer built from large numbers of computers connected by a network) competed with and eventually displaced MPPs.
- Today – Parallel computing has become mainstream, based on multi-core processors in home computers. Scaling of Moore's Law predicts a transition from a few cores to many.

Page 9

Multiprocessor Architectures

- Instruction-Level Parallelism (ILP)
  - Superscalar and VLIW
- SIMD architectures (single instruction stream, multiple data streams)
  - Vector processors
- MIMD architectures (multiple instruction streams, multiple data streams)
  - Interconnection networks
  - Shared memory multiprocessors
  - Distributed computing
- Alternative parallel processing approaches
  - Dataflow computing
  - Neural networks (SIMD)
  - Systolic arrays (SIMD)
  - Quantum computing

Page 10

Superscalar

- A design methodology that allows multiple instructions to be executed simultaneously in each clock cycle.
  - Analogous to adding another lane to a highway; the "additional lanes" are called execution units.
- Instruction fetch unit
  - A critical component.
  - Retrieves multiple instructions simultaneously from memory and passes them to the...
- Decoding unit
  - Determines whether the instructions have any type of dependency (a simple check is sketched below).
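As a rough illustration of that dependency check, here is a small Python sketch; the instruction representation and register names are assumptions made for the example, not a description of real decoder hardware.

```python
# Represent an instruction as (destination register, source registers).
def can_issue_together(first, second):
    """True if the second instruction does not read the first one's result,
    so both could be dispatched to separate execution units in one cycle
    (a read-after-write dependency check)."""
    dest_first, _ = first
    _, srcs_second = second
    return dest_first not in srcs_second

add = ("r3", ("r1", "r2"))            # r3 = r1 + r2
mul = ("r5", ("r3", "r4"))            # r5 = r3 * r4 (needs r3)
sub = ("r6", ("r1", "r4"))            # r6 = r1 - r4 (independent)

print(can_issue_together(add, mul))   # False: mul must wait for add
print(can_issue_together(add, sub))   # True: both can issue this cycle
```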

Page 11

VLIW

- Superscalar processors rely on both hardware and the compiler.
- VLIW processors rely entirely on the compiler.
  - They pack independent instructions into one long instruction that tells the execution units what to do.
  - The compiler cannot have an overall picture of the run-time code, so it is compelled to be conservative in its scheduling.
  - The VLIW compiler also arbitrates all dependencies (see the bundling sketch below).
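A toy Python sketch of the compiler-side idea: pack independent instructions into fixed-width bundles, starting a new bundle conservatively whenever a dependency (or a full bundle) is found. The bundle width and instruction format are illustrative assumptions.

```python
def pack_bundles(instructions, width=2):
    """Greedily pack (dest, sources) instructions into VLIW bundles."""
    bundles, current = [], []
    for dest, srcs in instructions:
        depends = any(d in srcs or d == dest for d, _ in current)
        if depends or len(current) == width:
            bundles.append(current)
            current = []
        current.append((dest, srcs))
    if current:
        bundles.append(current)
    return bundles

program = [
    ("r3", ("r1", "r2")),   # r3 = r1 + r2
    ("r6", ("r1", "r4")),   # r6 = r1 - r4 (independent of r3)
    ("r5", ("r3", "r4")),   # r5 = r3 * r4 (depends on r3)
]
for i, bundle in enumerate(pack_bundles(program)):
    print(f"bundle {i}: {bundle}")
# bundle 0 holds the two independent instructions; bundle 1 holds the dependent one.
```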

Page 12

Vector Processors

- Often referred to as supercomputers (the Cray series is the most famous).
- Based on vector arithmetic.
  - A vector is a fixed-length, one-dimensional array of values, or an ordered series of scalar quantities.
  - Operations include addition, subtraction, and multiplication.
- Each instruction specifies a set of operations to be carried out over an entire vector.
- Vector registers – specialized registers that can hold several vector elements at one time.
- Vector instructions are efficient for two reasons:
  - The machine fetches fewer instructions.
  - The processor knows it will have a continuous source of data, so it can pre-fetch pairs of values.
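To make the contrast concrete, here is a small sketch using NumPy arrays as a stand-in for vector registers; NumPy and the element count are illustrative assumptions, not part of the slides.

```python
import numpy as np

a = np.arange(8, dtype=np.float64)         # stand-in for vector register A
b = np.arange(8, dtype=np.float64) * 10.0  # stand-in for vector register B

# Scalar style: one add per element, so one instruction fetched per element.
scalar_sum = [a[i] + b[i] for i in range(len(a))]

# Vector style: a single "vector add" applied across the whole vector.
vector_sum = a + b

print(scalar_sum)
print(vector_sum)
```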

Page 13

MIMD Architectures

- Communication is essential for synchronized processing and data sharing.
- The manner of passing messages determines the overall design.
- Two aspects:
  - Shared memory – one large memory accessed identically by all processors.
  - Interconnection network – each processor has its own memory, but processors are allowed to access each other's memories via the network.
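A minimal shared-memory sketch in Python, with threads and a lock standing in for processors that access one memory identically; this is purely illustrative and not from the slides.

```python
import threading

counter = 0                      # the shared memory every worker sees
lock = threading.Lock()          # synchronization keeps the updates consistent

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:               # one "processor" updates the value at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                   # 40000: every update was applied exactly once
```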

Page 14

Interconnection Networks

- Categorized according to topology, routing strategy, and switching technique.
- Networks can be either static or dynamic, and either blocking or non-blocking.
  - Dynamic – allows the path between two entities (two processors, or a processor and a memory) to change between communications. Static is the opposite.
  - Blocking – does not allow new connections in the presence of other simultaneous connections.

Page 15

Network Topologies

- The way in which the components are interconnected.
- A major determining factor in the overhead of message passing.
- Efficiency is limited by:
  - Bandwidth – the information-carrying capacity of the network.
  - Message latency – the time required for the first bit of a message to reach its destination.
  - Transport latency – the time a message spends in the network.
  - Overhead – message-processing activities in the sender and receiver.

Page 16

Static Topologies

- Completely connected – all components are connected to all other components.
  - Expensive to build and difficult to manage.
- Star – has a central hub through which all messages must pass.
  - Excellent connectivity, but the hub can be a bottleneck.
- Linear array or ring – each entity can communicate directly with its two neighbors.
  - Other communications have to go through multiple entities.
- Mesh – links each entity to four or six neighbors.
- Tree – arranges entities in tree structures.
  - Potential for bottlenecks at the root.
- Hypercube – a multidimensional extension of mesh networks in which each dimension has two processors.
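To make the trade-offs concrete, here is a short Python sketch comparing link count and diameter (worst-case hop distance) for a few of the topologies above; the formulas are standard results, and the 16-node example is an arbitrary assumption.

```python
def topology_stats(n):
    """Return {topology: (links, diameter)} for n nodes.
    For the hypercube, n is assumed to be a power of two."""
    d = n.bit_length() - 1                          # hypercube dimension
    return {
        "completely connected": (n * (n - 1) // 2, 1),
        "star":                 (n - 1, 2),
        "ring":                 (n, n // 2),
        "hypercube":            (n * d // 2, d),
    }

for name, (links, diameter) in topology_stats(16).items():
    print(f"{name:22s} links = {links:3d}   diameter = {diameter}")
# The completely connected network minimizes hops but needs the most links;
# the hypercube keeps both the link count and the diameter modest.
```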

Page 17

Static Topologies

Page 18

Dynamic Topology

- Dynamic networks use either a bus or a switch to alter routes through a network.
- Bus-based networks are the simplest and are most efficient when the number of entities is moderate.
  - A bottleneck can result as the number of entities grows large.
  - Parallel buses can alleviate bottlenecks, but at considerable cost.

Page 19

Switches

- Crossbar switches
  - Each crosspoint switch is either open or closed.
  - A crossbar network is a non-blocking network.
  - With only one switch at each crosspoint, n entities require n^2 switches. In reality, many switches may be required at each crosspoint.
  - Practical only in high-speed multiprocessor vector computers.

Page 20

Switches

- 2x2 switches
  - Capable of routing their inputs to different destinations.
  - Two inputs and two outputs.
  - Four states:
    - Through (inputs feed directly to the corresponding outputs)
    - Cross (upper input directed to lower output and vice versa)
    - Upper broadcast (upper input broadcast to both outputs)
    - Lower broadcast (lower input directed to both outputs)
  - The Through and Cross states are the ones relevant to interconnection networks.
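A tiny Python sketch of the four settings; the state names follow the slide, while the function and signal names are assumptions made for the example.

```python
from enum import Enum

class SwitchState(Enum):
    THROUGH = "through"
    CROSS = "cross"
    UPPER_BROADCAST = "upper_broadcast"
    LOWER_BROADCAST = "lower_broadcast"

def route(state, upper_in, lower_in):
    """Return (upper_out, lower_out) for a 2x2 switch in the given state."""
    if state is SwitchState.THROUGH:
        return upper_in, lower_in
    if state is SwitchState.CROSS:
        return lower_in, upper_in
    if state is SwitchState.UPPER_BROADCAST:
        return upper_in, upper_in
    return lower_in, lower_in            # LOWER_BROADCAST

print(route(SwitchState.THROUGH, "A", "B"))   # ('A', 'B')
print(route(SwitchState.CROSS, "A", "B"))     # ('B', 'A')
```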

Page 21

2x2 Switches

Page 22

Shared Memory Multiprocessors

- Tightly coupled systems that use the same memory.
- Global shared memory – a single memory shared by multiple processors.
- Distributed shared memory – each processor has local memory that is also shared with the other processors.
- Global shared memory with separate caches at the processors.

Page 23

UMA Shared Memory

- Uniform Memory Access
  - All memory accesses take the same amount of time.
  - One pool of shared memory, and all processors have equal access.
- Scalability of UMA machines is limited. As the number of processors increases:
  - Switched networks quickly become very expensive.
  - Bus-based systems saturate when the bandwidth becomes insufficient.
  - Multistage networks run into wiring constraints and significant latency.

Page 24

NUMA Shared Memory

- Nonuniform Memory Access
  - Provides each processor with its own piece of memory.
  - Processors see this memory as a single contiguous addressable entity.
  - Nearby memory takes less time to read than memory that is farther away, so memory access time is inconsistent.
- Prone to cache coherence problems.
  - Each processor maintains a private cache.
  - Modified data needs to be updated in all caches.
  - Special hardware units known as snoopy cache controllers handle this.
  - Write-through with update – updates stale values in other caches.
  - Write-through with invalidation – removes stale values from other caches.
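A simplified Python sketch of the two write-through policies; the dictionary-based cache model is a deliberate simplification for illustration, not a description of real snoopy-cache hardware.

```python
def write(caches, writer, addr, value, policy="invalidate"):
    """Write-through: the writing processor updates its own cache (and, in a
    real system, main memory); snooping controllers then either update or
    invalidate the copies held by every other cache."""
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache:
            continue
        if policy == "update":
            cache[addr] = value          # write-through with update
        else:
            del cache[addr]              # write-through with invalidation

caches = [{"x": 1}, {"x": 1}, {}]        # three private caches; two hold x
write(caches, writer=0, addr="x", value=7, policy="invalidate")
print(caches)                            # [{'x': 7}, {}, {}] - the stale copy is gone
```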

Page 25

Distributed Computing

- Means different things to different people.
- In a sense, all multiprocessor systems are distributed systems.
- The term usually refers to a very loosely coupled multicomputer system.
  - Such systems depend on a network for communication among processors.

Page 26

Grid Computing

- An example of distributed computing.
- Uses the resources of many computers connected by a network (e.g., the Internet) to solve computational problems that are too large for any single supercomputer.
- Global computing
  - A specialized form of grid computing that uses the computing power of volunteers whose computers work on a problem while the system is otherwise idle.
  - SETI@Home screen saver
    - A six-year run accumulated two million years of CPU time and 50 TB of data.

Page 27

Questions?