Parallel Computer Architecture - Rochester Institute of...
Transcript of Parallel Computer Architecture - Rochester Institute of...
EECC756 - ShaabanEECC756 - Shaaban#1 lec # 1 Spring 2002 3-12-2002
Parallel Computer ArchitectureParallel Computer Architecture• A parallel computer is a collection of processing elements
that cooperate to solve large problems fast
• Broad issues involved:– Resource Allocation:
• Number of processing elements (PEs).
• Computing power of each element.
• Amount of physical memory used.
– Data access, Communication and Synchronization• How the elements cooperate and communicate.
• How data is transmitted between processors.
• Abstractions and primitives for cooperation.
– Performance and Scalability• Performance enhancement of parallelism: Speedup.
• Scalabilty of performance to larger systems/problems.
EECC756 - ShaabanEECC756 - Shaaban#2 lec # 1 Spring 2002 3-12-2002
The Need And Feasibility of The Need And Feasibility of Parallel ComputingParallel Computing• Application demands: More computing cycles:
– Scientific computing: CFD, Biology, Chemistry, Physics, ...– General-purpose computing: Video, Graphics, CAD, Databases,
Transaction Processing, Gaming…– Mainstream multithreaded programs, are similar to parallel programs
• Technology Trends– Number of transistors on chip growing rapidly– Clock rates expected to go up but only slowly
• Architecture Trends– Instruction-level parallelism is valuable but limited
– Coarser-level parallelism, as in MPs, the most viable approach
• Economics:– Today’s microprocessors have multiprocessor support eliminating the
need for designing expensive custom PEs– Lower parallel system cost.– Multiprocessor systems to offer a cost-effective replacement of
uniprocessor systems in mainstream computing.
EECC756 - ShaabanEECC756 - Shaaban#3 lec # 1 Spring 2002 3-12-2002
Scientific Computing DemandScientific Computing Demand
EECC756 - ShaabanEECC756 - Shaaban#4 lec # 1 Spring 2002 3-12-2002
Scientific Supercomputing TrendsScientific Supercomputing Trends• Proving ground and driver for innovative architecture
and advanced techniques:
– Market is much smaller relative to commercial segment
– Dominated by vector machines starting in 70s
– Meanwhile, microprocessors have made huge gains infloating-point performance
• High clock rates.
• Pipelined floating point units.
• Instruction-level parallelism.
• Effective use of caches.
• Large-scale multiprocessors replace vector supercomputers– Well under way already
EECC756 - ShaabanEECC756 - Shaaban#5 lec # 1 Spring 2002 3-12-2002
Raw Uniprocessor Performance:Raw Uniprocessor Performance:LINPACKLINPACK
LIN
PA
CK
(M
FLO
PS
)
s
s
s
s
s
s
u
u
u
uu
u
uu
u
u u
1
10
100
1,000
10,000
1975 1980 1985 1990 1995 2000
sCRAYn = 100n CRAYn = 1,000
uMicro n = 100l Micro n = 1,000
CRAY 1s
Xmp/14se
Xmp/416Ymp
C90
T94
DEC 8200
IBM Power2/990MIPS R4400
HP9000/735DEC Alpha
DEC Alpha AXPHP 9000/750
IBM RS6000/540
MIPS M/2000
MIPS M/120
Sun 4/260
n
nn
n
n
n
l
l
l
l l
l ll
ll
l
EECC756 - ShaabanEECC756 - Shaaban#6 lec # 1 Spring 2002 3-12-2002
Raw Parallel Performance:Raw Parallel Performance:LINPACKLINPACK
LIN
PA
CK
(GFL
OP
S)
nCRAY peaklMPP peak
Xmp /416(4)
Ymp/832(8) nCUBE/2(1024)iPSC/860
CM-2CM-200
Delta
Paragon XP/S
C90(16)
CM-5
ASCI Red
T932(32)
T3D
Paragon XP/S MP (1024)
Paragon XP/S MP (6768)
n
n
n
n
l
l
nl
l
l
ll
l
ll
0.1
1
10
100
1,000
10,000
1985 1987 1989 1991 1993 1995 1996
EECC756 - ShaabanEECC756 - Shaaban#7 lec # 1 Spring 2002 3-12-2002
General Technology TrendsGeneral Technology Trends• Microprocessor performance increases 50% - 100% per year• Transistor count doubles every 3 years• DRAM size quadruples every 3 years
0
20
40
60
80
100
120
140
160
180
1987 1988 1989 1990 1991 1992
Integer FP
Sun 4
260
MIPS
M/120
IBM
RS6000
540MIPS
M2000
HP 9000
750
DEC
alpha
EECC756 - ShaabanEECC756 - Shaaban#8 lec # 1 Spring 2002 3-12-2002
Clock Frequency Growth RateClock Frequency Growth Rate
uuuuuuu
uuu
uu
uuu
u
uuuu
uu
uu
u
u
uuu
uu
u
u
u
uu
uuu
u
uu
u
uuuu
u
uuu
uu
uu
uu
uuuu
uuu
uuu
uuuuu
u
uuu
uu
0.1
1
10
100
1,000
19701975
19801985
19901995
20002005
Clo
ck r
ate
(MH
z)
i4004i8008
i8080
i8086 i80286i80386
Pentium100
R10000
• Currently increasing 30% per year
EECC756 - ShaabanEECC756 - Shaaban#9 lec # 1 Spring 2002 3-12-2002
Transistor Count Growth RateTransistor Count Growth Rate
Tran
sist
ors
uuuuuuuuu
u
uu
uuuuu
u
uu
u
uu
uuuu
uu
u
u
u u
uu
u
u
uuu
uuuuu
uuuuuuuuu
u
uuuu
uuu
uuuuu
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
19701975
19801985
19901995
20002005
i4004i8008
i8080
i8086
i80286i80386
R2000
Pentium R10000
R3000
• 100 million transistors on chip by early 2000’s A.D.• Transistor count grows much faster than clock rate
- Currently 40% per year
EECC756 - ShaabanEECC756 - Shaaban#10 lec # 1 Spring 2002 3-12-2002
System Attributes to PerformanceSystem Attributes to Performance• Performance benchmarking is program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
– Cycles per instruction (CPI)
– Total CPU time = T = C x ττ = C / f = Ic x CPI x ττ = Ic x (p + m x k) x τ τ
Ic = Instruction count ττ = CPU cycle time
p = Instruction decode cycles
m = Memory cycles k = Ratio between memory/processor cycles
C = Total program clock cycles f = clock rate
– MIPS Rate = Ic / (T x 106) = f / (CPI x 106) = f x Ic /(C x 106)
– Throughput Rate: Wp = f /(Ic x CPI) = (MIPS) x 106 /Ic
• Performance factors: (Ic, p, m, k, ττ) are influenced by: instruction-setarchitecture, compiler design, CPU implementation and control, cacheand memory hierarchy.
EECC756 - ShaabanEECC756 - Shaaban#11 lec # 1 Spring 2002 3-12-2002
CPU Performance TrendsCPU Performance TrendsP
erfo
rman
ce
0.1
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
The microprocessor is currently the most naturalbuilding block for multiprocessor systems interms of cost and performance.
EECC756 - ShaabanEECC756 - Shaaban#12 lec # 1 Spring 2002 3-12-2002
Tran
sist
ors
uuuuuuu
uu
u
uu
u
u
u uu
u
u
u
u
uu
uuu u
uu
u
u
u u
u
u
u
u
u
uu
u u
uuu
uuu u
uu uu u
u
uuu u
uuu
u
u uuu
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level (?)
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Parallelism in Microprocessor VLSI GenerationsParallelism in Microprocessor VLSI Generations
EECC756 - ShaabanEECC756 - Shaaban#13 lec # 1 Spring 2002 3-12-2002
The Goal of Parallel ComputingThe Goal of Parallel Computing• Goal of applications in using parallel machines: Speedup
Speedup (p processors) =
• For a fixed problem size (input data set),performance = 1/time
Speedup fixed problem (p processors) =
Performance (p processors)
Performance (1 processor)
Time (1 processor)
Time (p processors)
EECC756 - ShaabanEECC756 - Shaaban#14 lec # 1 Spring 2002 3-12-2002
Elements of Modern ComputersElements of Modern Computers
HardwareHardwareArchitectureArchitecture
Operating SystemOperating System
Applications SoftwareApplications Software
ComputingComputing Problems Problems
AlgorithmsAlgorithmsand Dataand DataStructuresStructures
High-levelHigh-levelLanguagesLanguages
Performance Performance Evaluation Evaluation
MappingMapping
ProgrammingProgramming
BindingBinding(Compile, (Compile, Load)Load)
EECC756 - ShaabanEECC756 - Shaaban#15 lec # 1 Spring 2002 3-12-2002
Elements of Modern ComputersElements of Modern Computers1 Computing Problems:
– Numerical Computing: Science and technology numericalproblems demand intensive integer and floating pointcomputations.
– Logical Reasoning: Artificial intelligence (AI) demand logicinferences and symbolic manipulations and large space searches.
2 Algorithms and Data Structures– Special algorithms and data structures are needed to specify the
computations and communication present in computingproblems.
– Most numerical algorithms are deterministic using regular datastructures.
– Symbolic processing may use heuristics or non-deterministicsearches.
– Parallel algorithm development requires interdisciplinaryinteraction.
EECC756 - ShaabanEECC756 - Shaaban#16 lec # 1 Spring 2002 3-12-2002
Elements of Modern ComputersElements of Modern Computers3 Hardware Resources
– Processors, memory, and peripheral devices form thehardware core of a computer system.
– Processor instruction set, processor connectivity, memoryorganization, influence the system architecture.
4 Operating Systems– Manages the allocation of resources to running processes.
– Mapping to match algorithmic structures with hardwarearchitecture and vice versa: processor scheduling, memorymapping, interprocessor communication.
– Parallelism exploitation at: algorithm design, programwriting, compilation, and run time.
EECC756 - ShaabanEECC756 - Shaaban#17 lec # 1 Spring 2002 3-12-2002
Elements of Modern ComputersElements of Modern Computers5 System Software Support
– Needed for the development of efficient programs in high-level languages (HLLs.)
– Assemblers, loaders.
– Portable parallel programming languages
– User interfaces and tools.
6 Compiler Support– Preprocessor compiler: Sequential compiler and low-level
library of the target parallel computer.
– Precompiler: Some program flow analysis, dependencechecking, limited optimizations for parallelism detection.
– Parallelizing compiler: Can automatically detectparallelism in source code and transform sequential codeinto parallel constructs.
EECC756 - ShaabanEECC756 - Shaaban#18 lec # 1 Spring 2002 3-12-2002
Approaches to Parallel ProgrammingApproaches to Parallel Programming
Source code written in Source code written inconcurrent dialects of C, C++concurrent dialects of C, C++ FORTRAN, LISP FORTRAN, LISP ..
ProgrammerProgrammer
ConcurrencyConcurrencypreserving compilerpreserving compiler
ConcurrentConcurrentobject codeobject code
Execution by Execution byruntime systemruntime system
Source code written in Source code written insequential languages C, C++sequential languages C, C++ FORTRAN, LISP FORTRAN, LISP ..
ProgrammerProgrammer
ParallelizingParallelizing compiler compiler
Parallel Parallelobject codeobject code
Execution by Execution byruntime systemruntime system
(a) Implicit (a) Implicit Parallelism Parallelism
(b) Explicit (b) Explicit Parallelism Parallelism
EECC756 - ShaabanEECC756 - Shaaban
Evolution of ComputerEvolution of ComputerArchitectureArchitecture
Scalar
Sequential Lookahead
I/E Overlap FunctionalParallelism
MultipleFunc. Units Pipeline
Implicit Vector
Explicit Vector
MIMDSIMD
MultiprocessorMulticomputer
Register-to -Register
Memory-to -Memory
Processor Array
Associative Processor
Massively Parallel Processors (MPPs)
I/E: Instruction Fetch and Execute
SIMD: Single Instruction stream over Multiple Data streams
MIMD: Multiple Instruction streams over Multiple Data streams
EECC756 - ShaabanEECC756 - Shaaban#20 lec # 1 Spring 2002 3-12-2002
Parallel Architectures HistoryParallel Architectures History
Application Software
System Software SIMD
Message Passing
Shared MemoryDataflow
SystolicArrays Architecture
Historically, parallel architectures tied to programming models
• Divergent architectures, with no predictable pattern of growth.
EECC756 - ShaabanEECC756 - Shaaban#21 lec # 1 Spring 2002 3-12-2002
Programming ModelsProgramming Models• Programming methodology used in coding applications
• Specifies communication and synchronization
• Examples:
– Multiprogramming:
No communication or synchronization at program level
– Shared memory address space:
– Message passing:
Explicit point to point communication
– Data parallel: More regimented, global actions on data
• Implemented with shared address space or message passing
EECC756 - ShaabanEECC756 - Shaaban#22 lec # 1 Spring 2002 3-12-2002
Flynn’s 1972 Classification ofFlynn’s 1972 Classification ofComputer ArchitectureComputer Architecture
• Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines.
• Single Instruction stream over Multiple Data streams(SIMD): Vector computers, array of synchronized processing elements.
• Multiple Instruction streams and a Single Data stream(MISD): Systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data streams(MIMD): Parallel computers:
• Shared memory multiprocessors.
• Multicomputers: Unshared distributed memory,message-passing used instead.
EECC756 - ShaabanEECC756 - Shaaban#23 lec # 1 Spring 2002 3-12-2002
Flynn’s Classification of Computer Architecture
Fig. 1.3 page 12 in
Advanced Computer Architecture: Parallelism,Scalability, Programmability, Hwang, 1993.
EECC756 - ShaabanEECC756 - Shaaban#24 lec # 1 Spring 2002 3-12-2002
Current Trends In Current Trends In Parallel ArchitecturesParallel Architectures
• The extension of “computer architecture” to supportcommunication and cooperation:
– OLD: Instruction Set Architecture
– NEW: Communication Architecture
• Defines:– Critical abstractions, boundaries, and primitives
(interfaces)
– Organizational structures that implement interfaces(hardware or software)
• Compilers, libraries and OS are important bridges today
EECC756 - ShaabanEECC756 - Shaaban#25 lec # 1 Spring 2002 3-12-2002
Modern Parallel ArchitectureModern Parallel ArchitectureLayered FrameworkLayered Framework
CAD
Multiprogramming Sharedaddress
Messagepassing
Dataparallel
Database Scientific modeling Parallel applications
Programming models
Communication abstractionUser/system boundary
Compilationor library
Operating systems support
Communication hardware
Physical communication medium
Hardware/software boundary
EECC756 - ShaabanEECC756 - Shaaban#26 lec # 1 Spring 2002 3-12-2002
Shared Address Space Parallel ArchitecturesShared Address Space Parallel Architectures
• Any processor can directly reference any memory location
– Communication occurs implicitly as result of loads and stores
• Convenient:– Location transparency
– Similar programming model to time-sharing in uniprocessors• Except processes run on different processors
• Good throughput on multiprogrammed workloads
• Naturally provided on a wide range of platforms– Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model– Ambiguous: Memory may be physically distributed among
processors
EECC756 - ShaabanEECC756 - Shaaban#27 lec # 1 Spring 2002 3-12-2002
Shared Address Space (SAS) Model• Process: virtual address space plus one or more threads of control
• Portions of address spaces of processes are shared
• Writes to shared address visible to other threads (in other processes too)
• Natural extension of the uniprocessor model:• Conventional memory operations used for communication
• Special atomic operations needed for synchronization
• OS uses shared memory to coordinate processes
St or e
P1
P2
Pn
P0
Load
P0 pr i vat e
P1 pr i vat e
P2 pr i vat e
Pn pr i vat e
Virtual address spaces for acollection of processes communicatingvia shared addresses
Machine physical address space
Shared portionof address space
Private portionof address space
Common physicaladdresses
EECC756 - ShaabanEECC756 - Shaaban#28 lec # 1 Spring 2002 3-12-2002
Models of Shared-Memory MultiprocessorsModels of Shared-Memory Multiprocessors• The Uniform Memory Access (UMA) Model:
– The physical memory is shared by all processors.
– All processors have equal access to all memory addresses.
• Distributed memory or Nonuniform Memory Access(NUMA) Model:
– Shared memory is physically distributed locally amongprocessors.
• The Cache-Only Memory Architecture (COMA) Model:
– A special case of a NUMA machine where all distributedmain memory is converted to caches.
– No memory hierarchy at each processor.
EECC756 - ShaabanEECC756 - Shaaban#29 lec # 1 Spring 2002 3-12-2002
Models of Shared-Memory MultiprocessorsModels of Shared-Memory Multiprocessors
I/O ctrlMem Mem Mem
Interconnect
Mem I/O ctrl
Processor Processor
Interconnect
I/Odevices
M ° ° °M M
Network
P
$
P
$
P
$
° ° °
Network
D
P
C
D
P
C
D
P
C
Distributed memory or Nonuniform Memory Access (NUMA) Model
Uniform Memory Access (UMA) ModelInterconnect: Bus, Crossbar, Multistage networkP: ProcessorM: MemoryC: CacheD: Cache directory
Cache-Only Memory Architecture (COMA)
EECC756 - ShaabanEECC756 - Shaaban#30 lec # 1 Spring 2002 3-12-2002
Uniform Memory Access Example:Uniform Memory Access Example:Intel Pentium Pro QuadIntel Pentium Pro Quad
P-Pr o bus (64-bit data, 36-bit addr ess, 66 MHz)
CPU
Bus interface
MIU
P-Pr omodule
P-Pr omodule
P-Promodule256-KB
L2 $Interruptcontroller
PCIbridge
PCIbridge
Memorycontroller
1-, 2-, or 4-wayinterleaved
DRAM
PC
I bus
PC
I busPCI
I/Ocards
• All coherence and multiprocessingglue in processor module
• Highly integrated, targeted at highvolume
• Low latency and bandwidth
EECC756 - ShaabanEECC756 - Shaaban#31 lec # 1 Spring 2002 3-12-2002
Uniform Memory Access Example:Uniform Memory Access Example:SUN EnterpriseSUN Enterprise
– 16 cards of either type: processors + memory, or I/O
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus
Gigaplane bus (256 data, 41 addr ess, 83 MHz)
SB
US
SB
US
SB
US
2 Fi
berC
hann
el
100b
T, S
CS
I
Bus interface
CPU/memcardsP
$2
$
P
$2
$
Mem ctrl
Bus interface/switch
I/O car ds
EECC756 - ShaabanEECC756 - Shaaban#32 lec # 1 Spring 2002 3-12-2002
Distributed Shared-MemoryDistributed Shared-MemoryMultiprocessor System Example:Multiprocessor System Example:
Cray T3ECray T3E
Switch
P
$
XY
Z
Exter nal I/O
Memctrl
and NI
Mem
• Scale up to 1024 processors, 480MB/s links
• Memory controller generates communication requests for nonlocalreferences
• No hardware mechanism for coherence (SGI Origin etc. provide this)
EECC756 - ShaabanEECC756 - Shaaban#33 lec # 1 Spring 2002 3-12-2002
Message-Passing MulticomputersMessage-Passing Multicomputers• Comprised of multiple autonomous computers (nodes).
• Each node consists of a processor, local memory, attachedstorage and I/O peripherals.
• Programming model more removed from basic hardwareoperations
• Local memory is only accessible by local processors.
• A message passing network provides point-to-point staticconnections among the nodes.
• Inter-node communication is carried out by message passingthrough the static connection network
• Process communication achieved using a message-passingprogramming environment.
EECC756 - ShaabanEECC756 - Shaaban#34 lec # 1 Spring 2002 3-12-2002
Message-Passing AbstractionMessage-Passing Abstraction
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory to memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
• Many overheads: copying, buffer management, protection
Process P Process Q
Address Y
Address X
Send X, Q, t
Receive Y, P, tMatch
Local pr ocessaddr ess space
Local pr ocessaddr ess space
EECC756 - ShaabanEECC756 - Shaaban#35 lec # 1 Spring 2002 3-12-2002
Message-Passing Example: IBM SP-2Message-Passing Example: IBM SP-2
Memory bus
MicroChannel bus
I/O
i860 NI
DMA
DR
AM
IBM SP-2 node
L2 $
Power 2CPU
Memorycontroller
4-wayinterleaved
DRAM
General interconnectionnetwork formed from8-port switches
NIC• Made out of essentially
complete RS6000workstations
• Network interfaceintegrated in I/O bus(bandwidth limited by I/Obus)
EECC756 - ShaabanEECC756 - Shaaban#36 lec # 1 Spring 2002 3-12-2002
Message-Passing Example:Message-Passing Example:Intel ParagonIntel Paragon
Memory bus (64-bit, 50 MHz)
i860
L1 $
NI
DMA
i860
L1 $
Driver
Memctrl
4-wayinterleaved
DRAM
IntelParagonnode
8 bits,175 MHz,bidirectional2D grid network
with processing nodeattached to every switch
Sandia’ s Intel Paragon XP/S-based Supercomputer
EECC756 - ShaabanEECC756 - Shaaban#37 lec # 1 Spring 2002 3-12-2002
Message-Passing Programming ToolsMessage-Passing Programming Tools• Message-passing programming libraries include:
– Message Passing Interface (MPI):• Provides a standard for writing concurrent message-passing
programs.
• MPI implementations include parallel libraries used byexisting programming languages.
– Parallel Virtual Machine (PVM):• Enables a collection of heterogeneous computers to used as a
coherent and flexible concurrent computational resource.
• PVM support software executes on each machine in a user-configurable pool, and provides a computationalenvironment of concurrent applications.
• User programs written for example in C, Fortran or Java areprovided access to PVM through the use of calls to PVMlibrary routines.
EECC756 - ShaabanEECC756 - Shaaban#38 lec # 1 Spring 2002 3-12-2002
Data Parallel Systems Data Parallel Systems SIMD in Flynn taxonomySIMD in Flynn taxonomy• Programming model
– Operations performed in parallel on eachelement of data structure
– Logically single thread of control, performssequential or parallel steps
– Conceptually, a processor is associated witheach data element
• Architectural model– Array of many simple, cheap processors each
with little memory
• Processors don’t sequence throughinstructions
– Attached to a control processor that issuesinstructions
– Specialized and general communication, cheapglobal synchronization
• Example machines:– Thinking Machines CM-1, CM-2 (and CM-5)
– Maspar MP-1 and MP-2,
PE PE PE° ° °
PE PE PE° ° °
PE PE PE° ° °
° ° ° ° ° ° ° ° °
Controlprocessor
EECC756 - ShaabanEECC756 - Shaaban#39 lec # 1 Spring 2002 3-12-2002
DataflowDataflow Architectures Architectures• Represent computation as a graph of essential dependences
– Logical processor at each node, activated by availability of operands
– Message (tokens) carrying tag of next instruction sent to next processor
– Tag compared with others in matching store; match fires execution
1 b
a
+ − ×
×
×
c e
d
f
Dataflow graph
f = a × d
Network
Token store
Waiting Matching
Instruction fetch
Execute
Token queue
Form token
Network
Network
Program store
a = (b +1) × (b − c) d = c × e
EECC756 - ShaabanEECC756 - Shaaban#40 lec # 1 Spring 2002 3-12-2002
Systolic ArchitecturesSystolic Architectures
M
PE
M
PE PE PE
• Replace single processor with an array of regular processing elements
• Orchestrate data flow for high throughput with less memory access
• Different from pipelining
– Nonlinear array structure, multidirection data flow, each PEmay have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular pattern