CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.


Transcript of CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Page 1: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

CS 420 - Design of Algorithms: Parallel Computer Architecture and Software Models

Page 2: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Parallel Computing – it’s about performance

Greater performance is the reason for parallel computing. Many types of scientific and engineering programs are too large and too complex for traditional uniprocessors. Such large problems are common in ocean modeling, weather modeling, astrophysics, solid state physics, power systems, CFD…

Page 3: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

FLOPS – a measure of performance

FLOPS – Floating Point Operations per Second… a measure of how much computation can be done in a certain amount of time
MegaFLOPS – MFLOPS – 10^6 FLOPS
GigaFLOPS – GFLOPS – 10^9 FLOPS
TeraFLOPS – TFLOPS – 10^12 FLOPS
PetaFLOPS – PFLOPS – 10^15 FLOPS
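These units feed the usual back-of-the-envelope estimate of a machine’s peak rating: peak FLOPS ≈ cores × clock rate × floating point operations per cycle. A minimal sketch in C; the core count, clock, and FLOPs-per-cycle figures below are invented for illustration, not taken from the slides:

```c
#include <stdio.h>

/* Back-of-the-envelope peak-FLOPS estimate:
 * peak = cores x clock x FLOPs per cycle.
 * All three inputs are hypothetical example values. */
int main(void) {
    double cores = 2.0;            /* hypothetical dual-core CPU  */
    double clock_hz = 3.0e9;       /* 3 GHz                       */
    double flops_per_cycle = 2.0;  /* e.g. one multiply + one add */

    double peak = cores * clock_hz * flops_per_cycle;
    printf("peak: %.1f GFLOPS\n", peak / 1e9);  /* prints 12.0 GFLOPS */
    return 0;
}
```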

Page 4: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

How fast…
Cray 1 – ~150 MFLOPS
Pentium 4 – 3–6 GFLOPS
IBM’s BlueGene – 360+ TFLOPS
PSC’s Big Ben – 10 TFLOPS
Humans – it depends:
  as calculators – 0.001 MFLOPS
  as information processors – 10 PFLOPS

Page 5: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

FLOPS vs. MIPS
FLOPS is only concerned with floating point calculations
Other performance issues:
  memory latency
  cache performance
  I/O capacity
  interconnect

Page 6: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

See www.Top500.org – biannual performance reports and rankings of the fastest computers in the world

Page 7: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Performance

Speedup(n processors) = time(1 processor) / time(n processors)

** Culler, Singh and Gupta, Parallel Computer Architecture: A Hardware/Software Approach
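To make the definition concrete, a small sketch with invented timings (the 120 s serial time and 20 s on 8 processors are hypothetical numbers, not measurements from the course):

```c
#include <stdio.h>

/* Speedup(n) = time(1 processor) / time(n processors).
 * The timings below are made up for illustration. */
int main(void) {
    double t1 = 120.0;   /* seconds on 1 processor      */
    double tn = 20.0;    /* seconds on n = 8 processors */
    int    n  = 8;

    double speedup    = t1 / tn;      /* 6.0            */
    double efficiency = speedup / n;  /* 0.75, i.e. 75% */
    printf("speedup = %.2f, efficiency = %.0f%%\n",
           speedup, efficiency * 100.0);
    return 0;
}
```

Efficiency (speedup divided by processor count) is the usual companion metric: it shows how far a run falls short of the ideal speedup of n.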

Page 9: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

… a model of the Indian Ocean –
73,000,000 square kilometers of surface
One data point per 100 meters → 7,300,000,000 surface points
Need to model the ocean at depth – say every 10 meters down to 200 meters → 20 depth data points
Every 10 minutes for 4 hours → 24 time steps

Page 10: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

So –
73 × 10^6 (sq km of surface) × 10^2 (points per sq km) × 20 (depth points per column) × 24 (time steps)
= 3,504,000,000,000 data points in the model grid
Suppose calculations of 100 instructions per grid point:
350,400,000,000,000 instructions in the model

Page 11: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Then –
Imagine that you have a computer that can run 1 billion (10^9) instructions per second
3.504 × 10^14 / 10^9 = 350,400 seconds, or about 97 hours

Page 12: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

But –
On a 10 teraflops computer:
3.504 × 10^14 / 10^13 ≈ 35 seconds
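The arithmetic of the last three slides, collected into one short program (the inputs are exactly the slides’ assumptions):

```c
#include <stdio.h>

/* The Indian Ocean model arithmetic from the preceding slides. */
int main(void) {
    double surface_km2  = 73.0e6;  /* ocean surface area                    */
    double pts_per_km2  = 100.0;   /* one point per 100 m = 10x10 per sq km */
    double depth_points = 20.0;    /* every 10 m down to 200 m              */
    double time_steps   = 24.0;    /* every 10 minutes for 4 hours          */

    double grid_points  = surface_km2 * pts_per_km2 * depth_points * time_steps;
    double instructions = grid_points * 100.0;  /* 100 instructions/point   */

    printf("grid points:  %.4g\n", grid_points);   /* 3.504e12 */
    printf("instructions: %.4g\n", instructions);  /* 3.504e14 */
    printf("at 10^9 instr/s: %.0f s (~%.0f hours)\n",
           instructions / 1e9, instructions / (1e9 * 3600.0));
    printf("at 10 TFLOPS:    %.1f s\n", instructions / 1e13);
    return 0;
}
```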

Page 13: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Gaining performance: Pipelining
More instructions → faster
More instructions in execution at the same time in a single processor
Not usually an attractive strategy these days – why?

Page 14: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Instruction Level Parallelism (ILP)

Based on the fact that many instructions do not depend on the instructions that come before them… The processor has extra hardware to execute several instructions at the same time… multiple adders…

Page 15: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Pipelining and ILP are not the solution to our problem – why?
Only incremental improvements in performance
Already been done
We need orders-of-magnitude improvements in performance

Page 16: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Gaining Performance: Vector Processors
Scientific and engineering computations are often vector and matrix operations – e.g. graphic transformations, such as shifting an object x to the right
Redundant arithmetic hardware and vector registers operate on an entire vector in one step (SIMD)

Page 17: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Gaining Performance: Vector Processors
Declining popularity for a while – hardware expensive
Popularity returning – applications in science, engineering, cryptography, media/graphics
Earth Simulator… your computer?

Page 18: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Parallel Computer Architecture

Shared Memory Architectures
Distributed Memory Architectures

Page 19: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Shared Memory Systems
Multiple processors connected to / sharing the same pool of memory (SMP)
Every processor has, potentially, access to and control of every memory location

Page 20: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Shared Memory Computers

[Figure: several processors connected to a single shared memory]

Page 21: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Shared Memory Computers

[Figure: processors connected to multiple shared memory banks]

Page 22: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Shared Memory Computer

[Figure: processors connected to multiple memory banks through a switch]

Page 23: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Shared Memory Computers
SGI Origin2000 at NCSA (Balder)
256 250 MHz R10000 processors
128 GB memory

Page 24: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Shared Memory Computers

Rachel at PSC
64 1.15 GHz EV7 processors
256 GB of shared memory

Page 25: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Distributed Memory Systems

Multiple processors, each with its own memory
Interconnected to share/exchange data and processing
The modern architectural approach to supercomputers
Supercomputers and clusters are similar
**Hybrid distributed/shared memory

Page 26: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Clusters – distributed memory

[Figure: cluster – processor/memory pairs joined by an interconnect]

Page 27: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Cluster – Distributed Memory with SMP

[Figure: SMP nodes, each with two processors (Proc1, Proc2) sharing a memory, joined by an interconnect]

Page 28: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Distributed Memory Supercomputer

BlueGene/L – DOE/IBM
0.7 GHz PowerPC 440
131,072 processors (previously 32,768 processors)
367 teraflops (was 70 TFlops)

Page 29: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Distributed Memory Supercomputer

Thunder at LLNL
Number 19 (was Number 5)
20 teraflops
1.4 GHz Itanium processors
4,096 processors

Page 30: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Earth Simulator – Japan
Built by NEC
Number 14 (was Number 1)
40 TFlops
640 nodes, each node = 8 vector processors
640×640 full crossbar

Page 31: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Grid Computing Systems
What is a Grid? It means different things to different people.
Distributed processors:
  around campus
  around the state
  around the world

Page 32: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Grid Computing Systems
Widely distributed
Loosely connected (e.g. via the Internet)
No central management

Page 33: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Grid Computing Systems
Connected clusters and other dedicated scientific computers

[Figure: clusters linked through the I2/Abilene backbone]

Page 34: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Grid Computer Systems

[Figure: a control/scheduler node distributing work across the Internet – harvested idle cycles]

Page 36: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Flynn’s Taxonomy
Single Instruction/Single Data - SISD

Multiple Instruction/Single Data - MISD

Single Instruction/Multiple Data - SIMD

Multiple Instruction/Multiple Data - MIMD

*Single Program/Multiple Data - SPMD

Page 37: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

SISD – Single Instruction Single Data

Single instruction stream – “single instruction execution per clock cycle”
Single data stream – one piece of data per clock cycle
Deterministic
Traditional CPUs, most single-CPU PCs

Load x to A
Load y to B
Add B to A
Store A
Load x to A
…

Page 38: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Single Instruction Multiple Data

One instruction stream
Multiple data streams (partitions)
A given instruction operates on multiple data elements
Lockstep
Deterministic
Processor arrays, vector processors
CM-2, Cray-C90

PE-1: Load A(1); Load B(1); C(1)=A(1)*B(1); Store C(1)
PE-2: Load A(2); Load B(2); C(2)=A(2)*B(2); Store C(2)
PE-n: Load A(3); Load B(3); C(3)=A(3)*B(3); Store C(3)
…
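In scalar C the SIMD pattern is a loop whose body applies one operation to every element; vector hardware (or an auto-vectorizing compiler) executes many elements of such a loop per step rather than one at a time. A minimal sketch:

```c
#include <stdio.h>

#define N 8

/* One instruction stream (the loop body), many data elements:
 * the element-wise multiply of the slide, written as scalar C. */
int main(void) {
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[N];

    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i];   /* same operation on every element */

    for (int i = 0; i < N; i++)
        printf("c[%d] = %.0f\n", i, c[i]);
    return 0;
}
```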

Page 39: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Multiple Instruction Single Data

Multiple instruction streams operate on a single data stream
Several instructions operate on the same data element – concurrently
A bit strange – CMU
Multi-pass filters
Encryption – code cracking

PE-1: Load A(1); Load B(1); C(1)=A(1)*4; Store C(1)
PE-2: Load A(1); Load B(2); C(2)=A(1)*4; Store C(2)
PE-n: Load A(1); Load B(3); C(3)=A(1)*4; Store C(3)

Page 40: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Multiple Instruction Multiple Data

Multiple instruction streams
Multiple data streams
Each processor has its own instructions and its own data
Most supercomputers, clusters, grids

PE-1: Load A(1); Load B(1); C(1)=A(1)*4; Store C(1)
PE-2: Load G; A=SQRT(G); C=A*Pi; Store C
PE-n: Load B; Call func1(B,C); Call func2(C,G); Store G

Page 41: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Single Program Multiple Data

Single code image/executable
Each processor has its own data
Instruction execution under program control
DMC, SMP

PE-1: if PE=1 then…; Load A; Load B; C=A*B; Store C
PE-2: if PE=2 then…; Load A; Load B; C=A*B; Store C
PE-n: if PE=n then…; Load A; Load B; C=A*B; Store C
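In practice SPMD is how MPI programs are written: one executable, with behavior branching on the process rank. A minimal sketch (assumes an MPI implementation such as MPICH or OpenMPI is installed; compile with mpicc and launch with mpirun):

```c
#include <stdio.h>
#include <mpi.h>

/* SPMD: every process runs this same executable; the rank test
 * plays the role of the slide's "if PE=1 then..." branches. */
int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which PE am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many PEs?  */

    if (rank == 0)
        printf("PE %d of %d: coordinating\n", rank, size);
    else
        printf("PE %d of %d: computing my own portion of the data\n",
               rank, size);

    MPI_Finalize();
    return 0;
}
```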

Page 42: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Multiple Program Multiple Data

MPMD is like SPMD…
…except each processor runs a separate, independent executable
How to implement interprocess communication?
  Sockets
  MPI-2 – more later

SPMD: ProgA | ProgA | ProgA | ProgA
MPMD: ProgA | ProgB | ProgC | ProgD

Page 43: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

UMA and NUMA
UMA – Uniform Memory Access
  All processors have equal access to memory
  Usually found in SMPs
  Identical processors
  Difficult to implement as the number of processors increases
  Good processor-to-memory bandwidth
  Cache coherency (CC) – important; can be implemented in hardware

Page 44: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

UMA and NUMA
NUMA – Non-Uniform Memory Access
  Access to memory differs by processor: the local processor gets good access, nonlocal processors not-so-good access
  Usually multiple computers or multiple SMPs
  Memory access across the interconnect is slow
  Cache coherency (CC) – can be done; usually not a problem

Page 45: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Let’s revisit speedup…
We can (theoretically) achieve speedup by using more processors…
…but a number of factors may limit speedup:
  interprocessor communications
  interprocess synchronization
  load balance
  parallelizability of algorithms

Page 46: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Amdahl’s Law
According to Amdahl’s Law…
  Speedup = 1 / (S + (1 - S)/N)
where S is the purely sequential part of the program and N is the number of processors

Page 47: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Amdahl’s Law
What does it mean?
  Part of a program is parallelizable
  Part of the program must remain sequential (S)
Amdahl’s law says:
  Speedup is constrained by the portion of the program that must remain sequential, relative to the part that is parallelized.
Note: if S is very small – an “embarrassingly parallel” problem (sometimes, anyway!)
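The formula is easy to tabulate; this sketch evaluates it for a few values of S and N, and shows the 1/S ceiling that no processor count can beat:

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (S + (1-S)/N). */
static double amdahl(double s, double n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double S[] = {0.5, 0.1, 0.01};  /* sequential fractions */
    int    N[] = {2, 16, 1024};     /* processor counts     */

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++)
            printf("S=%.2f  N=%4d   -> speedup %7.2f\n",
                   S[i], N[j], amdahl(S[i], N[j]));
        printf("S=%.2f  N->inf  -> speedup %7.2f  (the 1/S ceiling)\n\n",
               S[i], 1.0 / S[i]);
    }
    return 0;
}
```

Even with S = 0.10 and 1,024 processors the speedup stays below 10: the sequential tenth of the program dominates.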

Page 48: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Software models for parallel computing

Sockets and other P2P models
Threads
Shared Memory
Message Passing
Data Parallel

Page 49: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Sockets and others
TCP sockets – establish TCP links among processes; send messages through the sockets
RPC, CORBA, DCOM
Web services, SOAP…

Page 50: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Threads
A single executable runs…
…at specific points in execution it launches new streams of execution – threads…
…threads can be launched on other PEs
…when the threads close, control returns to the main program
…fork and join
POSIX, Microsoft
OpenMP is implemented with threads

[Figure: fork/join – the master thread t0 forks threads t1–t3; at the join, control returns to t0]
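A minimal fork/join in C with POSIX threads (a sketch of the pattern, not course-provided code; compile with -pthread):

```c
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 3

/* Work done by each forked thread. */
static void *work(void *arg) {
    long id = (long)arg;
    printf("thread t%ld running\n", id);
    return NULL;
}

/* main is t0: it forks t1..t3, then joins them. */
int main(void) {
    pthread_t t[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)   /* fork */
        pthread_create(&t[i], NULL, work, (void *)(i + 1));
    for (int i = 0; i < NTHREADS; i++)    /* join */
        pthread_join(t[i], NULL);

    printf("control returns to the main program (t0)\n");
    return 0;
}
```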

Page 51: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Shared Memory
Processes share a common memory space
Data sharing via the common memory space
A protocol is needed to “play nice” with memory
OpenMP

[Figure: several processors sharing a single memory]
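A minimal OpenMP sketch of this model (compile with -fopenmp or the compiler’s equivalent): the array sits in the common memory space, and “playing nice” here means each thread writes only its own elements, so no locking is needed.

```c
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    double a[N];   /* shared by all threads */

    #pragma omp parallel for   /* fork a team; split the iterations */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;        /* disjoint writes: no data race */

    printf("a[%d] = %.1f (team of up to %d threads)\n",
           N - 1, a[N - 1], omp_get_max_threads());
    return 0;
}
```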

Page 52: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Distributed Memory - Message Passing

Data messages are passed from PE to PE
Message passing is explicit… under program control
Parallelization is designed by the programmer…
…and implemented by the programmer

[Figure: distributed memory – processor/memory pairs joined by an interconnect]
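The explicit style looks like this in MPI (a minimal sketch of a send/receive pair; run with at least two processes):

```c
#include <stdio.h>
#include <mpi.h>

/* Explicit message passing: rank 0 sends one double to rank 1. */
int main(int argc, char **argv) {
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* dest 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        /* src 0, tag 0  */
        printf("rank 1 received %.2f\n", x);
    }

    MPI_Finalize();
    return 0;
}
```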

Page 53: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Message Passing
Message passing is usually implemented as a library – functions and subroutine calls
Most common – MPI – the Message Passing Interface
Standards:
  MPI-1
  MPI-2
Implementations:
  MPICH
  OpenMPI
  MPICH-GM (Myrinet)
  MPICH-G2 (grid-enabled successor to MPICH-G)

Page 54: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Message Passing
Hybrid DM/SMP
How does it look from a message-passing perspective?
How is MPI implemented?

[Figure: hybrid – SMP nodes, each with two processors (Proc1, Proc2) sharing a memory, joined by an interconnect]

Page 55: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Data Parallel
Processes work concurrently on pieces of a single data structure
SMP – each process works on a portion of the structure in common memory
DMS – the data structure is partitioned, distributed, computed (and collected)

from http://www.llnl.gov/computing/tutorials/parallel_comp/#Flynn
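On distributed memory the partition/distribute/compute/collect cycle maps onto MPI’s collective operations. A minimal sketch with MPI_Scatter and MPI_Gather (assumes the array length divides evenly among the processes):

```c
#include <stdio.h>
#include <mpi.h>

#define N 8

/* Data parallel over distributed memory: partition, distribute,
 * compute, collect. */
int main(int argc, char **argv) {
    int rank, size;
    double data[N], part[N], result[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int chunk = N / size;                  /* assumes N % size == 0 */

    if (rank == 0)                         /* root holds the full structure */
        for (int i = 0; i < N; i++) data[i] = i;

    MPI_Scatter(data, chunk, MPI_DOUBLE,   /* partition + distribute */
                part, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)        /* each PE computes its piece */
        part[i] *= 2.0;

    MPI_Gather(part, chunk, MPI_DOUBLE,    /* collect */
               result, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result[%d] = %.1f\n", N - 1, result[N - 1]);
    MPI_Finalize();
    return 0;
}
```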

Page 56: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Data Parallel
Can be done with calls to libraries or with compiler directives
…can be automatic (sort of)
High Performance Fortran (HPF)
Fortran 95

Page 57: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.

Comments on Automatic Parallelization

Some compilers can automatically parallelize portions of code (HPF)
Usually loops are the target – see the sketch after this list
Essentially a serial algorithm with portions pushed out to other processors
Problems:
  not a parallel algorithm
  not under programmer control (at least partly)
  might be wrong
  might result in slowdown
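A sketch of the contrast (illustrative C, not compiler output): the first loop has independent iterations that an auto-parallelizer can safely push out to other processors; the second has a loop-carried dependence, so parallelizing it naively would change the answer.

```c
/* Independent iterations: safe to auto-parallelize. */
void scale(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];        /* no iteration reads another's result */
}

/* Loop-carried dependence: must run serially, or be rewritten as a
 * genuinely parallel algorithm (e.g. a parallel prefix sum). */
void prefix_sum(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];   /* iteration i needs iteration i-1 */
}
```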

Page 59: CS 420 - Design of Algorithms Parallel Computer Architecture and Software Models.