CUDA Lecture 3 Parallel Architectures and Performance Analysis
Transcript of CUDA Lecture 3 Parallel Architectures and Performance Analysis
![Page 1: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/1.jpg)
Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
CUDA Lecture 3: Parallel Architectures and Performance Analysis
![Page 2: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/2.jpg)
The conventional von Neumann architecture consists of a processor executing a program stored in a (main) memory:
Each main memory location is identified by its address. Addresses start at zero and extend to 2^n − 1 when there are n bits (binary digits) in the address.
Parallel Architectures and Performance Analysis – Slide 2
Topic 1: Parallel Architectures
![Page 3: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/3.jpg)
Parallel computer: multiple-processor system supporting parallel programming.
Three principal types of architecture:
- Vector computers, in particular processor arrays
- Shared memory multiprocessors: specially designed and manufactured systems
- Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations
Parallel Architectures and Performance Analysis – Slide 3
Parallel Computers
![Page 4: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/4.jpg)
Vector computer: instruction set includes operations on vectors as well as scalars
Two ways to implement vector computers:
- Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements
Parallel Architectures and Performance Analysis – Slide 4
Type 1: Vector Computers
![Page 5: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/5.jpg)
- Historically, high cost of a control unit
- Scientific applications have data parallelism
Parallel Architectures and Performance Analysis – Slide 5
Why Processor Arrays?
![Page 6: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/6.jpg)
- Front-end computer (standard uniprocessor): program and data manipulated sequentially
- Processor array (individual processor/memory pairs): data manipulated in parallel
- Performance depends on: speed of processing elements, utilization of processing elements, size of data structure
Parallel Architectures and Performance Analysis – Slide 6
Data/Instruction Storage
![Page 7: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/7.jpg)
Each VLSI chip has 16 processing elements
Parallel Architectures and Performance Analysis – Slide 7
2-D Processor Interconnection Network
![Page 8: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/8.jpg)
- Not all problems are data parallel
- Speed drops for conditionally executed code
- Do not adapt to multiple users well
- Do not scale down well to “starter” systems
- Rely on custom VLSI for processors
- Expense of control units has dropped
Parallel Architectures and Performance Analysis – Slide 8
Processor Array Shortcomings
![Page 9: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/9.jpg)
Natural way to extend the single-processor model: have multiple processors connected to multiple memory modules such that each processor can access any memory module.
So-called shared memory configuration:
Parallel Architectures and Performance Analysis – Slide 9
Type 2: Shared Memory Multiprocessor Systems
![Page 10: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/10.jpg)
Parallel Architectures and Performance Analysis – Slide 10
Ex: Quad Pentium Shared Memory Multiprocessor
![Page 11: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/11.jpg)
Any memory location is accessible by any of the processors.
A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.
Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.).
Parallel Architectures and Performance Analysis – Slide 11
Shared Memory Multiprocessor Systems
![Page 12: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/12.jpg)
Alternately known as a tightly coupled architecture: no local memory is associated with processors.
Avoids three problems of processor arrays:
- Can be built from commodity CPUs
- Naturally supports multiple users
- Maintains efficiency in conditional code
Parallel Architectures and Performance Analysis – Slide 12
Shared Memory Multiprocessor Systems (cont.)
![Page 13: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/13.jpg)
Several alternatives for programming shared memory multiprocessors:
- Using threads (pthreads, Java, …), in which the programmer decomposes the program into individual parallel sequences, each being a thread, and each able to access variables declared outside the threads (a minimal sketch appears below).
- Using a sequential programming language with user-level libraries to declare and access shared variables.
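A minimal pthreads sketch of the thread-based alternative (the variable names and the trivial work are illustrative, not from the lecture); each thread is one parallel sequence, and a mutex-protected critical section controls access to the shared variable:

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Variable declared outside the threads: visible to every thread. */
static long shared_sum = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread is one parallel sequence decomposed by the programmer. */
static void *worker(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&lock);     /* critical section protects shared data */
    shared_sum += id;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)                      /* compile with, e.g., cc -pthread */
{
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);
    return 0;
}
```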
Parallel Architectures and Performance Analysis – Slide 13
Shared Memory Multiprocessor Systems (cont.)
![Page 14: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/14.jpg)
Several alternatives for programming shared memory multiprocessors:
- Using a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Ex: OpenMP – the industry standard (a minimal sketch follows):
  - An API for shared-memory systems
  - Supports higher-performance parallel programming of symmetrical multiprocessors
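A minimal OpenMP sketch of the compiler-directive alternative (the array, its size, and the reduction are illustrative); a single pragma declares the loop parallel and the runtime divides iterations among the threads of the shared-memory machine:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)                      /* compile with, e.g., cc -fopenmp */
{
    static double a[N], b[N];
    double sum = 0.0;

    /* Compiler directive: specify the parallelism and the reduction variable. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
        sum += a[i];
    }

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```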
Parallel Architectures and Performance Analysis – Slide 14
Shared Memory Multiprocessor Systems (cont.)
![Page 15: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/15.jpg)
Several alternatives for programming shared memory multiprocessors:
- Using a parallel programming language with syntax for parallelism, in which the compiler creates the appropriate executable code for each processor.
- Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code.
Neither of these is now common.
Parallel Architectures and Performance Analysis – Slide 15
Shared Memory Multiprocessor Systems (cont.)
![Page 16: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/16.jpg)
Type 1: Centralized Multiprocessor
- Straightforward extension of the uniprocessor: add CPUs to the bus
- All processors share the same primary memory
- Memory access time is the same for all CPUs
- An example of a uniform memory access (UMA) multiprocessor
- Symmetrical multiprocessor (SMP)
Parallel Architectures and Performance Analysis – Slide 16
Fundamental Types of Shared Memory Multiprocessor
![Page 17: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/17.jpg)
Parallel Architectures and Performance Analysis – Slide 17
Centralized Multiprocessor
![Page 18: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/18.jpg)
- Private data: items used only by a single processor
- Shared data: values used by multiple processors
- In a centralized multiprocessor, processors communicate via shared data values
- Problems associated with shared data:
  - Cache coherence
    - Replicating data across multiple caches reduces contention
    - How to ensure different processors have the same value for the same address?
  - Synchronization
    - Mutual exclusion
    - Barriers
Parallel Architectures and Performance Analysis – Slide 18
Private and Shared Data
![Page 19: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/19.jpg)
Making the main memory of a cluster of computers look as though it is a single memory with a single address space (via hidden message passing).
Then can use shared memory programming techniques.
Parallel Architectures and Performance Analysis – Slide 19
Distributed Shared Memory
![Page 20: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/20.jpg)
Type 2: Distributed Multiprocessor
- Distribute primary memory among processors
- Increases aggregate memory bandwidth and lowers average memory access time
- Allows a greater number of processors
- Also called a non-uniform memory access (NUMA) multiprocessor
Parallel Architectures and Performance Analysis – Slide 20
Fundamental Types of Shared Memory Multiprocessor
![Page 21: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/21.jpg)
Parallel Architectures and Performance Analysis – Slide 21
Distributed Multiprocessor
![Page 22: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/22.jpg)
- Some NUMA multiprocessors do not support cache coherence in hardware:
  - Only instructions and private data are kept in cache
  - Large memory access time variance
- Implementations are more difficult:
  - No shared memory bus to “snoop”
  - Directory-based protocol needed
Parallel Architectures and Performance Analysis – Slide 22
Cache Coherence
![Page 23: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/23.jpg)
- Distributed directory contains information about cacheable memory blocks
- One directory entry for each cache block
- Each entry has:
  - Sharing status
    - Uncached: block not in any processor’s cache
    - Shared: cached by one or more processors; read only
    - Exclusive: cached by exactly one processor, which has written the block, so the copy in memory is obsolete
  - Which processors have copies
Parallel Architectures and Performance Analysis – Slide 23
Directory-Based Protocol
![Page 24: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/24.jpg)
Complete computers connected through an interconnection network
Parallel Architectures and Performance Analysis – Slide 24
Type 3: Message-Passing Multicomputers
![Page 25: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/25.jpg)
- Distributed memory multiple-CPU computer
- The same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters
Parallel Architectures and Performance Analysis – Slide 25
Multicomputers
![Page 26: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/26.jpg)
Alternate name for message-passing multicomputer systems.
Each processor has its own memory accessible only to that processor.
A message passing interconnection network provides point-to-point connections among processors.
Memory access varies between processors.
Parallel Architectures and Performance Analysis – Slide 26
Loosely Coupled Architectures
![Page 27: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/27.jpg)
Parallel Architectures and Performance Analysis – Slide 27
Asymmetrical Multicomputer
![Page 28: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/28.jpg)
Advantages:
- Back-end processors dedicated to parallel computations: easier to understand, model, and tune performance
- Only a simple back-end operating system needed: easy for a vendor to create
Disadvantages:
- Front-end computer is a single point of failure
- Single front-end computer limits scalability of the system
- Primitive operating system in back-end processors makes debugging difficult
- Every application requires development of both front-end and back-end programs
Parallel Architectures and Performance Analysis – Slide 28
Asymmetrical Multicomputer
![Page 29: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/29.jpg)
Parallel Architectures and Performance Analysis – Slide 29
Symmetrical Multicomputer
![Page 30: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/30.jpg)
Advantages:
- Alleviates the performance bottleneck caused by a single front-end computer
- Better support for debugging
- Every processor executes the same program
Disadvantages:
- More difficult to maintain the illusion of a single “parallel computer”
- No simple way to balance program development workload among processors
- More difficult to achieve high performance when multiple processes run on each processor
Parallel Architectures and Performance Analysis – Slide 30
Symmetrical Multicomputer
![Page 31: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/31.jpg)
Parallel Architectures and Performance Analysis – Slide 31
ParPar Cluster: A Mixed Model
![Page 32: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/32.jpg)
Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams.
Also important are number of processors, number of programs which can be executed, and the memory structure.
Parallel Architectures and Performance Analysis – Slide 32
Alternate System: Flynn’s Taxonomy
![Page 33: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/33.jpg)
Single instruction stream, single data stream (SISD) computer
In a single processor computer, a single stream of instructions is generated from the program. The instructions operate upon a single stream of data items.
The single CPU executes one instruction at a time and fetches or stores one item of data at a time.
Parallel Architectures and Performance Analysis – Slide 33
Flynn’s Taxonomy: SISD
![Page 34: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/34.jpg)
Parallel Architectures and Performance Analysis – Slide 34
Flynn’s Taxonomy: SISD (cont.)
[Diagram: a control unit issues control signals to an arithmetic processor; memory supplies the instruction stream and the data stream, and results flow back to memory.]
![Page 35: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/35.jpg)
Single instruction stream, multiple data stream (SIMD) computer
A specially designed computer in which a single instruction stream comes from a single program, but multiple data streams exist.
- The instructions from the program are broadcast to more than one processor.
- Each processor executes the same instruction in synchronism, but using different data.
- Developed because there are a number of important applications that mostly operate upon arrays of data.
Parallel Architectures and Performance Analysis – Slide 35
Flynn’s Taxonomy: SIMD
![Page 36: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/36.jpg)
Parallel Architectures and Performance Analysis – Slide 36
Flynn’s Taxonomy: SIMD (cont.)
[Diagram: a single control unit broadcasts the control signal to processing elements PE 1, PE 2, …, PE n, each operating on its own data stream.]
![Page 37: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/37.jpg)
Processing distributed over a large amount of hardware.
Operates concurrently on many different data elements.
Performs the same computation on all data elements.
Processors operate synchronously.
Examples: pipelined vector processors (e.g. Cray-1) and processor arrays (e.g. Connection Machine).
Parallel Architectures and Performance Analysis – Slide 37
SIMD Architectures
![Page 38: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/38.jpg)
Parallel Architectures and Performance Analysis – Slide 38
SISD vs. SIMD Execution
[Diagram comparing execution of conditional code on statements X1–X4: on an SISD machine, a single processor tests a = 0? and follows only the selected branch; on an SIMD machine, all PEs step through both branches, with the PEs satisfying a = 0 active (others idle) for one branch and the PEs satisfying a ≠ 0 active (others idle) for the other.]
![Page 39: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/39.jpg)
Multiple instruction stream, single data stream (MISD) computer
MISD machines may execute several different programs on the same data item.
There are two categories:
- Distinct processing units perform distinct instructions on the same data. Currently there is no such machine.
- Pipelined architectures, where data flows through a series of processing elements.
Parallel Architectures and Performance Analysis – Slide 39
Flynn’s Taxonomy: MISD
![Page 40: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/40.jpg)
Parallel Architectures and Performance Analysis – Slide 40
Flynn’s Taxonomy: MISD (cont.)
[Diagram: control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, all operating on a single data stream.]
![Page 41: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/41.jpg)
A pipeline processor works according to the principle of pipelining:
- A process can be broken down into several stages (segments).
- While one stage is executing, another stage is being loaded, and the input of one stage is the output of the previous stage.
The processor carries out many different computations concurrently.
Example: systolic array
Parallel Architectures and Performance Analysis – Slide 41
MISD Architectures
![Page 42: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/42.jpg)
Parallel Architectures and Performance Analysis – Slide 42
MISD Architectures (cont.)
Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage.
Pipelined execution of the same two processes: T = 5t.
[Diagram: stages S1–S4 of the two processes, executed back-to-back in the serial case and overlapped in the pipelined case.]
![Page 43: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/43.jpg)
Multiple instruction stream, multiple data stream (MIMD) computer
- General-purpose multiprocessor system.
- Multiple processors, each with a separate (different) program operating on its own data.
- One instruction stream is generated from each program for each processor.
- Each instruction operates upon different data.
- Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification.
Parallel Architectures and Performance Analysis – Slide 43
Flynn’s Taxonomy: MIMD
![Page 44: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/44.jpg)
Parallel Architectures and Performance Analysis – Slide 44
Flynn’s Taxonomy: MIMD (cont.)
[Diagram: control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, each operating on its own data stream.]
![Page 45: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/45.jpg)
- Processing distributed over a number of processors operating independently and concurrently.
- Resources (memory) shared among processors.
- Each processor runs its own program.
MIMD systems execute operations in a parallel asynchronous fashion.
Parallel Architectures and Performance Analysis – Slide 45
MIMD Architectures
![Page 46: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/46.jpg)
Differ with regard to:
- Interconnection networks
- Memory addressing techniques
- Synchronization
- Control structures
A high throughput can be achieved if the processing can be broken into parallel streams keeping all the processors active concurrently.
Parallel Architectures and Performance Analysis – Slide 46
MIMD Architectures (cont.)
![Page 47: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/47.jpg)
Multiple Program Multiple Data (MPMD) Structure
Within the MIMD classification, which we are concerned with, each processor will have its own program to execute.
Parallel Architectures and Performance Analysis – Slide 47
Two MIMD Structures: MPMD
![Page 48: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/48.jpg)
Single Program Multiple Data (SPMD) Structure
A single source program is written, and each processor executes its personal copy of this program, although independently and not in synchronism.
The source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer.
Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware (see the sketch below).
Parallel Architectures and Performance Analysis – Slide 48
Two MIMD Structures: SPMD
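One common way to realize SPMD (an illustration, not something prescribed by the lecture) is an MPI program in C: every process runs the same source program, and branching on the process's own identity (its rank) decides which parts it executes:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* identity of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Parts of the program executed only by a particular process... */
        printf("Process 0 of %d: doing the coordinator's work\n", size);
    } else {
        /* ...while the others, running the same source, do something else. */
        printf("Process %d of %d: doing a worker's share\n", rank, size);
    }

    MPI_Finalize();
    return 0;
}
```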
![Page 49: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/49.jpg)
SIMD needs less hardware (only one control unit). In MIMD each processor has its own control unit.
SIMD needs less memory than MIMD (SIMD needs only one copy of the instructions). In MIMD the program and operating system need to be stored at each processor.
SIMD has implicit synchronization of PEs. In contrast, explicit synchronization may be required in MIMD.
Parallel Architectures and Performance Analysis – Slide 49
SIMD vs. MIMD
![Page 50: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/50.jpg)
MIMD allows different operations to be performed on different processing elements simultaneously (functional parallelism). SIMD is limited to data parallelism.
For MIMD it is possible to use a general-purpose microprocessor as the processing unit, which may be cheaper and more powerful.
Parallel Architectures and Performance Analysis – Slide 50
SIMD vs. MIMD (cont.)
![Page 51: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/51.jpg)
The time to execute a sequence of instructions in which the execution time is data dependent is less for MIMD than for SIMD.
- MIMD allows each instruction to execute independently; in SIMD each processing element must wait until all the others have finished the execution of one instruction.
- Thus

$$T_{\text{MIMD}} = \max\{t_1 + t_2 + \dots + t_n\} \qquad T_{\text{SIMD}} = \max\{t_1\} + \max\{t_2\} + \dots + \max\{t_n\} \qquad T_{\text{MIMD}} \le T_{\text{SIMD}}$$

where each maximum is taken over the processing elements.
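A small C sketch (with made-up per-instruction times) that makes the two formulas concrete: MIMD finishes when the slowest PE's own total is done, while SIMD pays the slowest PE's time at every instruction:

```c
#include <stdio.h>

#define PES 4   /* processing elements */
#define INS 3   /* instructions in the sequence */

int main(void)
{
    /* t[pe][i]: data-dependent time PE pe needs for instruction i (made-up values). */
    double t[PES][INS] = { {1, 2, 1}, {3, 1, 2}, {2, 2, 2}, {1, 1, 4} };

    double t_mimd = 0.0, t_simd = 0.0;

    /* MIMD: each PE runs independently; finish time is the slowest PE's total. */
    for (int pe = 0; pe < PES; pe++) {
        double total = 0.0;
        for (int i = 0; i < INS; i++) total += t[pe][i];
        if (total > t_mimd) t_mimd = total;
    }

    /* SIMD: every PE waits after each instruction; each step costs the slowest PE. */
    for (int i = 0; i < INS; i++) {
        double slowest = 0.0;
        for (int pe = 0; pe < PES; pe++)
            if (t[pe][i] > slowest) slowest = t[pe][i];
        t_simd += slowest;
    }

    printf("T(MIMD) = %.1f, T(SIMD) = %.1f\n", t_mimd, t_simd);  /* T(MIMD) <= T(SIMD) */
    return 0;
}
```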
Parallel Architectures and Performance Analysis – Slide 51
SIMD vs. MIMD (cont.)
![Page 52: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/52.jpg)
In MIMD each processing element can independently follow either path when executing an if-then-else statement; this requires two phases on SIMD.
MIMD can operate in SIMD mode.
Parallel Architectures and Performance Analysis – Slide 52
SIMD vs. MIMD (cont.)
![Page 53: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/53.jpg)
Architectures:
- Vector computers
- Shared memory multiprocessors: tightly coupled
  - Centralized/symmetrical multiprocessor (SMP): UMA
  - Distributed multiprocessor: NUMA
- Distributed memory/message-passing multicomputers: loosely coupled
  - Asymmetrical vs. symmetrical
Flynn’s Taxonomy:
- SISD, SIMD, MISD, MIMD (MPMD, SPMD)
Parallel Architectures and Performance Analysis – Slide 53
Topic 1 Summary
![Page 54: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/54.jpg)
A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input.
The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements.
Parallel Architectures and Performance Analysis – Slide 54
Topic 2: Performance Measures and Analysis
![Page 55: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/55.jpg)
The degree of parallelism is a measure of the number of operations that an algorithm can perform in parallel for a problem of size W, and it is independent of the parallel architecture.
If P(W) is the degree of parallelism of a parallel algorithm, then for a problem of size W no more than P(W) processors can be employed effectively.
Want to be able to do two things: predict performance of parallel programs, and understand barriers to higher performance.
Parallel Architectures and Performance Analysis – Slide 55
Performance Measures and Analysis (cont.)
![Page 56: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/56.jpg)
- General speedup formula
- Amdahl’s Law: decide if a program merits parallelization
- Gustafson-Barsis’ Law: evaluate performance of a parallel program
Parallel Architectures and Performance Analysis – Slide 56
Topic 2 Outline
![Page 57: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/57.jpg)
The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel.
The speedup factor of a parallel computation utilizing p processors is defined as the following ratio:
In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time.
Parallel Architectures and Performance Analysis – Slide 57
Speedup Factor
$$S(p) = \frac{\text{Execution time using one processor}}{\text{Execution time using a multiprocessor}} = \frac{T_s}{T_p}$$
![Page 58: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/58.jpg)
Speedup factor can also be cast in terms of computational steps:
Maximum speedup is (usually) p with p processors (linear speedup).
Parallel Architectures and Performance Analysis – Slide 58
Speedup Factor (cont.)
$$S(p) = \frac{\text{Number of computational steps using one processor}}{\text{Number of computational steps using } p \text{ processors}}$$
![Page 59: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/59.jpg)
It is assumed that the processors used in the parallel computation are identical to the one used by the sequential algorithm.
S(p) gives the increase in speed obtained by using a multiprocessor.
The underlying algorithm for the parallel implementation might be (and usually is) different.
Parallel Architectures and Performance Analysis – Slide 59
Speedup Factor (cont.)
![Page 60: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/60.jpg)
The sequential algorithm has to be the best algorithm known for a particular computation problem.
This means that it is fair to judge the performance of a parallel computation with respect to the fastest sequential algorithm for solving the same problem on a single-processor architecture.
Several issues such as synchronization and communication are involved in the parallel computation.
Parallel Architectures and Performance Analysis – Slide 60
Speedup Factor (cont.)
![Page 61: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/61.jpg)
Given a problem of size n on p processors, let
- σ(n) = time spent on inherently sequential computations
- φ(n) = time spent on potentially parallel computations
- κ(n, p) = time spent on communication operations
Then:
Parallel Architectures and Performance Analysis – Slide 61
Execution Time Components
$$S(p) = \frac{T_s}{T_p} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}$$
![Page 62: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/62.jpg)
Parallel Architectures and Performance Analysis – Slide 62
Speedup Plot
[Plot: speedup vs. number of processors, broken into computation time and communication time components; the speedup curve “elbows out” as the number of processors grows.]
![Page 63: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/63.jpg)
The efficiency of a parallel computation is defined as a ratio between the speedup factor and the number of processing elements in a parallel system:
Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.
Parallel Architectures and Performance Analysis – Slide 63
Efficiency
$$E(p) = \frac{\text{Execution time using one processor}}{p \times \text{Execution time using } p \text{ processors}} = \frac{T_s}{p\,T_p} = \frac{S(p)}{p}$$
![Page 64: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/64.jpg)
In an ideal parallel system the speedup factor is equal to p and the efficiency is equal to one.
In practice ideal behavior is not achieved, since processors cannot devote 100 percent of their time to the computation.
Every parallel program has overhead factors such as creating processes, process synchronization and communication.
In practice efficiency is between zero and one, depending on the degree of effectiveness with which processing elements are utilized.
Parallel Architectures and Performance Analysis – Slide 64
Efficiency (cont.)
![Page 65: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/65.jpg)
Since E = S(p)/p, by what we did earlier we obtain the bound shown below.
- Since all terms are positive, E > 0.
- Furthermore, since the denominator is larger than the numerator, E < 1.
Parallel Architectures and Performance Analysis – Slide 65
Analysis of Efficiency
$$E \le \frac{\sigma(n) + \varphi(n)}{p\,\sigma(n) + \varphi(n) + p\,\kappa(n,p)}$$
![Page 66: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/66.jpg)
Consider the problem of adding n numbers on a p-processor system.
Initial brute-force approach: all tasks send their values to one processor, which adds them all up.
Parallel Architectures and Performance Analysis – Slide 66
Example: Reduction
![Page 67: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/67.jpg)
Parallel algorithm: find the global sum by using a binomial tree.
Parallel Architectures and Performance Analysis – Slide 67
Example: Reduction (cont.)
![Page 68: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/68.jpg)
Assume it takes one unit of time for two directly connected processors to add two numbers and one unit of time to communicate with each other.
Adding n/p numbers locally on each processor takes n/p − 1 units of time.
The p partial sums may be added in log p steps, each consisting of one addition and one communication.
Parallel Architectures and Performance Analysis – Slide 68
Example: Reduction (cont.)
![Page 69: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/69.jpg)
The total parallel computation time is Tp = n/p − 1 + 2 log p.
For large values of p and n this can be approximated by Tp = n/p + 2 log p.
The serial computation time can be approximated by Ts = n.
Parallel Architectures and Performance Analysis – Slide 69
Example: Reduction (cont.)
![Page 70: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/70.jpg)
The expression for speedup is
The expression for efficiency is
Speedup and efficiency can be calculated for any p and n.
Parallel Architectures and Performance Analysis – Slide 70
Example: Reduction (cont.)
Speedup:
$$S(p) = \frac{T_s}{T_p} = \frac{n}{\dfrac{n}{p} + 2\log p} = \frac{np}{n + 2p\log p}$$
Efficiency:
$$E = \frac{S(p)}{p} = \frac{n}{n + 2p\log p}$$
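A short C sketch (illustrative) that evaluates these expressions for the simplified model Ts = n, Tp = n/p + 2 log₂ p; values computed this way follow the trend of the table on the next slide, though that table need not have been produced from exactly this simplified model:

```c
#include <math.h>     /* link with -lm for log2 */
#include <stdio.h>

/* Speedup of the reduction under the simplified model T_s = n, T_p = n/p + 2*log2(p). */
static double speedup(double n, double p)
{
    double tp = n / p + 2.0 * log2(p);
    return n / tp;
}

int main(void)
{
    const double sizes[] = {64, 192, 320, 512};
    const double procs[] = {1, 2, 4, 8, 16, 32};

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 6; j++) {
            double s = speedup(sizes[i], procs[j]);
            double e = s / procs[j];          /* efficiency E = S(p)/p */
            printf("n=%3.0f p=%2.0f  S=%6.2f  E=%5.3f\n", sizes[i], procs[j], s, e);
        }
    }
    return 0;
}
```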
![Page 71: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/71.jpg)
Computational efficiency as a function of n and p.
Parallel Architectures and Performance Analysis – Slide 71
Example: Reduction (cont.)
| n \ p | 1 | 2 | 4 | 8 | 16 | 32 |
|-------|------|------|------|------|------|------|
| 64  | 1.00 | .980 | .930 | .815 | .623 | .399 |
| 192 | 1.00 | .990 | .975 | .930 | .832 | .665 |
| 320 | 1.00 | .995 | .985 | .956 | .892 | .768 |
| 512 | 1.00 | .995 | .990 | .972 | .930 | .841 |
![Page 72: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/72.jpg)
Parallel Architectures and Performance Analysis – Slide 72
Example: Reduction (cont.)
[Plot: speedup vs. number of processors for n = 64, 192, 320, and 512.]
![Page 73: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/73.jpg)
Parallel Architectures and Performance Analysis – Slide 73
Maximum Speedup: Amdahl’s Law
![Page 74: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/74.jpg)
As before, we have the bound shown below; the second inequality holds since the communication time is non-negative.
Let f represent the inherently sequential portion of the computation, defined below.
Parallel Architectures and Performance Analysis – Slide 74
Amdahl’s Law (cont.)
$$S(p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}$$

$$f = \frac{\sigma(n)}{\sigma(n) + \varphi(n)}$$
![Page 75: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/75.jpg)
Then
In short, the maximum speedup factor is given by
where f is the fraction of the computation that cannot be divided into concurrent tasks.
Parallel Architectures and Performance Analysis – Slide 75
Amdahl’s Law (cont.)
$$S(p) \le \frac{T_s}{f\,T_s + (1-f)\,T_s/p} = \frac{1}{f + (1-f)/p} = \frac{p}{1 + (p-1)f}$$
![Page 76: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/76.jpg)
Limitations:
- Ignores communication time
- Overestimates the speedup achievable
Amdahl Effect:
- Typically κ(n, p) has lower complexity than φ(n)/p
- So as n increases, φ(n)/p dominates κ(n, p)
- Thus as n increases, speedup increases
Parallel Architectures and Performance Analysis – Slide 76
Amdahl’s Law (cont.)
$$S(p) \le \frac{p}{1 + (p-1)f}$$
![Page 77: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/77.jpg)
Even with an infinite number of processors, the maximum speedup is limited to 1/f.
Ex: With only 5% of a computation being serial, maximum speedup is 20, irrespective of number of processors.
Parallel Architectures and Performance Analysis – Slide 77
Speedup against number of processors
![Page 78: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/78.jpg)
So Amdahl’s Law:
- Treats problem size as a constant
- Shows how execution time decreases as the number of processors increases
However, we often use faster computers to solve larger problem instances, so let’s treat time as a constant and allow the problem size to increase with the number of processors.
Parallel Architectures and Performance Analysis – Slide 78
Amdahl’s Law: The Last Word
![Page 79: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/79.jpg)
As before
Let s represent the fraction of time spent in parallel computation performing inherently sequential operations; then
Parallel Architectures and Performance Analysis – Slide 79
Gustafson-Barsis’ Law
$$S(p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}$$

$$s = \frac{\sigma(n)}{\sigma(n) + \varphi(n)/p}$$
![Page 80: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/80.jpg)
Then
Parallel Architectures and Performance Analysis – Slide 80
Gustafson-Barsis’ Law (cont.)
$$S(p) \le s + p\,(1 - s) = p + (1 - p)\,s$$
![Page 81: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/81.jpg)
Begin with parallel execution time instead of sequential time
Estimate sequential execution time to solve same problem
Problem size is an increasing function of p.
Predicts scaled speedup.
Parallel Architectures and Performance Analysis – Slide 81
Gustafson-Barsis’ Law (cont.)

$$S(p) \le p + (1 - p)\,s$$
![Page 82: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/82.jpg)
An application running on 10 processors spends 3% of its time in serial code.
According to Amdahl’s Law the maximum speedup is
However the scaled speedup is
Parallel Architectures and Performance Analysis – Slide 82
Example
By Amdahl’s Law:
$$S(p) \le \frac{1}{0.03 + \dfrac{1 - 0.03}{10}} \approx 7.87$$
By Gustafson-Barsis’ Law, the scaled speedup is:
$$S(p) = 10 + (1 - 10)(0.03) = 9.73$$
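A short C sketch (illustrative) that reproduces both numbers and can be reused for other serial fractions and processor counts:

```c
#include <stdio.h>

/* Amdahl's Law: f is the serial fraction of the (fixed-size) computation. */
static double amdahl(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

/* Gustafson-Barsis' Law: s is the serial fraction of the parallel execution time. */
static double gustafson(double s, double p)
{
    return p + (1.0 - p) * s;
}

int main(void)
{
    double p = 10.0, serial = 0.03;
    printf("Amdahl maximum speedup:   %.2f\n", amdahl(serial, p));    /* ~7.87 */
    printf("Gustafson scaled speedup: %.2f\n", gustafson(serial, p)); /*  9.73 */
    return 0;
}
```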
![Page 83: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/83.jpg)
Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time
Both overestimate speedup or scaled speedup achievable
Gene Amdahl John L. Gustafson
Parallel Architectures and Performance Analysis – Slide 83
Limitations
![Page 84: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/84.jpg)
- Performance terms: speedup, efficiency
- Model of speedup: serial, parallel and communication components
- What prevents linear speedup?
  - Serial and communication operations
  - Process start-up
  - Imbalanced workloads
  - Architectural limitations
- Analyzing parallel performance: Amdahl’s Law, Gustafson-Barsis’ Law
Parallel Architectures and Performance Analysis – Slide 84
Topic 2 Summary
![Page 85: CUDA Lecture 3 Parallel Architectures and Performance Analysis](https://reader035.fdocuments.us/reader035/viewer/2022081517/5681637c550346895dd459d7/html5/thumbnails/85.jpg)
Based on original material from:
- The University of Akron: Tim O’Neil, Kathy Liszka
- Hiram College: Irena Lomonosov
- The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
- Oregon State University: Michael Quinn
Revision history: last updated 7/28/2011.
Parallel Architectures and Performance Analysis – Slide 85
End Credits