Parallel Processing I
Advanced Operating Systems
Oleg Goldshmidt
Lecture 5
Side Note: Simultaneous Multithreading
Modern CPU’s front end: instruction decoding and scheduling
Modern CPU’s back end (“core”): pipelined execution
Inefficiency in pipelines: “pipeline bubbles”, execution slots where the CPU could not schedule any useful work (for whatever reason)
Inefficiency in the front end: a CPU can schedule several instructions per clock cycle for execution in the core by different functional units; usually it schedules fewer
on SMP more instructions are executed in each clockcycle, but more slots are wasted as well
Side Note: Superthreading
a.k.a. time-slice multithreading
interleave instructions from different threads
each pipeline stage can contain instructions from one thread only
scheduling logic switches between threads
helps alleviate memory latency: if one thread requests data from main memory that is not in the cache, it stalls for several cycles; another thread can proceed with execution, keeping the pipelines full
does not help with instruction-level parallelism: if on a given cycle not enough instructions can be parallelized, there will still be waste
Side Note: Hyperthreading
removes the “one thread per time slot” restriction
allows maximal scheduling flexibility to reduce “bubbles”
Intel Pentium 4 Xeon: 2 threads per CPU
not very complicated: adds about 5% to the die area for Xeon
from the OS perspective: 2 logical processors, equivalent to 2-way SMP
installing an OS on a Xeon means installing an SMP kernel; hyperthreading can be disabled (why?)
both logical CPUs share the same cache
Side Note: SMT and Caching
no cache coherence problems that plague SMPs
but higher potential for cache conflicts
each thread can monopolize the caches — no cooperation
potential for cache thrashing: can be bad for memory-intensive workloads; remember, hyperthreading can be disabled
This (And Next) Lecture
extends the previous discussion of shared-memory multiprocessors
name of the game: performance via parallelism
A parallel computer is “a collection of processing elements that communicate and cooperate to solve large problems fast” [Almasi and Gottlieb, 1989]
extending the concepts of computer architecture with communications architecture
analyze performance, cost
focus on implications for OS, applications
Parallel Multiprogramming
shared memory programming: bulletin board — posting information at known, shared locations in memory; orchestration by taking note of who is doing what
message passing programming: sending messages between individual processes; orchestration by well-defined events of sending and receiving information
data parallel programming: processes perform operations on subsets of data; orchestration by exchanging information at various steps in execution; communication by shared memory or messages
Communications Architecture
communication abstraction: the user-level communication primitives of the system, provided by
the programming language and environment, compilers and libraries, the OS, and the hardware
these define operations, data types, formats; a combination of hardware and software
Shared Memory vs. Message Passing
at the highest logical level message passing is very similar to NUMA
primary difference: communication is integrated at the I/O level rather than into the memory system
complete computers as building blocks: CPU, memory, and I/O
clusters — collections of fairly typical standalone systems
multicomputers — tighter packaging, faster networks,much tighter integration of CPUs and network
interprocessor communication rather than I/O
The Role of the OS
shared memory: much of the distance between the programming model and the hardware is covered by compilers and libraries; the basic primitives are provided by hardware (LOAD and STORE)
with some help from the OS, e.g., page faults; much of the infrastructure is provided by hardware (bus, cache); the OS is very important for scheduling and synchronization
message passing: more OS (and library) support needed — network I/O!
Message Passing
send: specifies a local buffer to send and the receiving process (usually on a remote CPU)
recv: specifies a sending process and a local buffer to place the received data in
combined send and recv
cause the transfer to occur; accomplish pairwise synchronization; facilitate a memory-to-memory copy
each side specifies its local address
different semantics: synchronous, blocking, non-blocking (later)
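As a concrete illustration of these primitives, a minimal sketch in C using MPI (the library itself is only covered later in this lecture); the ranks, buffer size, and tag are illustrative, not part of the slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send: names the local buffer and the receiving process (rank 1) */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* recv: names the sending process (rank 0) and a local buffer for the data */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}

The matched pair both moves the data memory-to-memory and acts as the pairwise synchronization event described above.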
Message Passing Technologies
point-to-point FIFO links: meshes or hypercubes; connect to nearest neighbours only (algorithms had to be written for the specific topology); synchronous transfers
DMA: allows non-blocking sends
store-and-forward routing: latencies increase with the number of hops
pipelined interconnects (IBM SP-2, Intel Paragon): latencies dominated by the CPU-to-NIC path; uniform communication costs
Convergence: Software
send and recv on shared memory machines: send: store the data or a pointer to the data in a buffer; recv: read the data; raise a flag when a “message” arrives
global memory space on message passing machines: e.g., a logical read: send a message to another process, get the data with “return mail”; the message exchange is hidden by compiler-generated code for access to a shared variable
shared VM on message passing machines: on access to a remote page a page fault occurs; messages are exchanged behind the scenes; the missing page is transferred and mapped
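A minimal sketch of the shared-memory emulation of send/recv described above; the mailbox variable, the flag, the value, and the use of C11 atomics with pthreads are illustrative assumptions:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int mailbox;            /* shared "bulletin board" location holding the message */
static atomic_int flag = 0;    /* raised when a "message" arrives                      */

static void *sender(void *arg)
{
    mailbox = 42;                                            /* "send": store the data   */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* ...then raise the flag   */
    return NULL;
}

static void *receiver(void *arg)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                                    /* "recv": wait for the flag */
    printf("received %d\n", mailbox);                        /* ...then read the data     */
    return NULL;
}

int main(void)
{
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}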
Convergence: Hardware
integrate network controllers with cache and/or memory controllers
shared memory machines: detect cache misses; issue network transactions to access remote memory
message passing machines: provide networked DMA, memory-to-memory between machines; provide user-level message passing on top of DMA transfers; attempts at built-in distributed shared memory
message-passing between SMP nodes
Data Parallel Processing
processor arrays, a.k.a. SIMD machines, a.k.a. data parallel architectures
operations are performed in parallel on elements of a large regular data set
e.g., elements or chunks of arrays, matrices, etc.
Flynn’s taxonomy: SISD — conventional sequential machines; MIMD — conventional parallel machines; SIMD — instruction sequencing in a control processor, data manipulation in data processors
motivation in early architectures: fetching instructions costs as much as performing them
eclipsed by vector processors
Vector Processors
functional processing units that can operate on vectors of data (from one memory) in pipelines
no need to map application data to interconnect
CDC Star-100: operations on contiguous vectors; a lot of time was spent making them contiguous
CRAY-1: vector registers, LOAD and STORE
Connection Machine: 32 1-bit PEs on a chip, connections to neighbours, consolidated sequencing
killed by fast processors with integrated FPUs and caches: no cost advantage to consolidated sequencing
SPMD — many CPUs execute copies of the same program
Architectures and Programming
programming models and environments are largely abstracted from parallel machine architectures
languages: Fortran 90, Fortran 95, HPF: shared-address and data parallel programming; work on a wide variety of shared memory and message passing architectures; details are hidden by compilers; compilation differs radically, depending on the communication and synchronization primitives
libraries: PVM, MPI: work just about everywhere (well...); implementations differ
Performance Metrics
latency (l): how much time is needed to complete an operation
bandwidth (b): how many operations are performed in unit time
cost (c): the impact the operations have on the overall execution time of the program
in a sequential world c = n × l (where n is the number of operations performed)
parallelism is more difficult
Performance Metrics Example
a component can perform an operation in l = 100 ns
“sequential bandwidth”: b = 1/l = 10 Mops
assume a 10-stage pipeline: the peak bandwidth is 100 Mops
if the application starts an operation only every 200 ns, the delivered bandwidth is 1/(200 ns) = 5 Mops
pipelining is still beneficial because of bursts
cost of n operations: if the program immediately depends on each result, c = n × l
if we average w ns of independent work before depending on the result, the cost per operation drops to l − w, so c = n × (l − w)
Data Transfer Time
central to parallel computing
T(n) = T0 + n/B
n is the amount of data (e.g., in bytes)
B is the transfer rate (bandwidth) in bytes per second
T0 is the start-up cost
useful for different kinds of data transfers
message passing: T0 is the time for the first bit to reach the destination
memory access: T0 is the access time
bus transactions: T0 is the arbitration and control protocol
pipelines: T0 is the time to fill the pipeline
Data Transfer Time: Discussion
bandwidth depends on the transfer size
B is the asymptotic bandwidth: B = lim_{n→∞} n/T(n)
how quickly do we reach B if we increase n? it depends on the start-up cost
the transfer size at which half of the peak bandwidth is reached (the half-power point): n_{1/2} = T0 × B
bandwidth and latency are not the same!
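A small sketch that evaluates the model T(n) = T0 + n/B; the start-up cost and asymptotic bandwidth used here are assumed example values, not figures from the lecture:

#include <stdio.h>

int main(void)
{
    const double T0 = 10e-6;          /* start-up cost: 10 us (assumed)           */
    const double B  = 100e6;          /* asymptotic bandwidth: 100 MB/s (assumed) */

    /* half-power point: the transfer size at which n/T(n) reaches B/2 */
    printf("half-power point n_1/2 = T0 * B = %.0f bytes\n", T0 * B);

    for (double n = 100.0; n <= 1e7; n *= 10.0) {
        double T    = T0 + n / B;     /* transfer time T(n)           */
        double beff = n / T;          /* delivered bandwidth n / T(n) */
        printf("n = %9.0f B   T = %9.2f us   B_eff = %6.1f MB/s\n",
               n, T * 1e6, beff / 1e6);
    }
    return 0;
}

The printed table shows the delivered bandwidth n/T(n) approaching B only once the transfer size grows well past the half-power point.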
Refining The Model
when can the next transfer be initiated?
can useful work be done during transfer?
T(n) = o + x + D
where o is the overhead, x is the occupancy, and D is the network delay
Data Transfer Overhead
overhead — time it takes the CPU to initiate the transfer
may be constant or linear in n, the transfer size
constant if the CPU only has to tell the communication assist to start and gives it a pointer to the data: for DMA-capable or memory-integrated assists
linear if the CPU has to copy the data to the assist: for programmed I/O
may be more difficult if DMA locks the memory bus
overhead is the time the CPU is busy
the rest (x + D) is the network latency
Network Latency
occupancy — the time it takes the data to pass through the slowest link (n/B_link, if B_link is the bandwidth of that link)
often the communication assist is the bottleneck
the occupancy determines how often transfers can be initiated: the next transfer will wait until the bottleneck is free
if there is buffering between the CPU and the bottleneck, transfers can be initiated faster, but only until the buffer is full
delay — the time it takes a bit to travel across the network, including the communication assist delay
the CPU does not care about the details, designers do
the split into o, x, and D is important for pipelined transfers
Multiple Transfers and Contention
individual network components are described by thedelay and occupancy
the network delay for a single transfer is the sum of the individual delays along the path
the network occupancy is the maximal occupancy along the path
if two transfers attempt to use the same resource simultaneously, one must wait
contention increases communication time
contention looks to CPUs like increased occupancy
Performance of a Cache Miss
overhead: the cache controller determines there has been a miss and starts the transfer
occupancy: the cache block size divided by the bus bandwidth
assume the bus is the bottleneck
delay: the time needed to arbitrate and access the bus and the time to deliver the block into the cache
assuming the bus is free
contention: time waiting to get access to the busy bus
Communication Cost
C = f × (T − overlap)
where T is the communication time, and
f: the frequency of communication — the number of communication operations per unit of work in the program
hardware may affect f: it may limit the transfer size; it may automatically migrate or replicate data
overlap: the portion of the communication that overlaps with other useful work
includes computation or other communications; possible because communication is offloaded
Amdahl’s Law
concurrency is limited by the underlying problem and by the decomposition into parallel tasks
some portions of the program may not have as much concurrency as the number of CPUs
e.g., final aggregation of results, input and output
assume a fraction s of the work is serial and p = 1 − s is fully parallelized: the execution time on n processors is T(n) = p/n + s (normalizing T(1) = 1)
if s = 0.01 there is no reason to increase n much beyond p/s ≈ 100
Amdahl’s Law: Example
problem: compute the average of the values on an n × n grid
decomposition: compute the n² values in parallel, then each processor adds its results to the global sum, then divide by n²
the summation must be serialized; the runtime on p CPUs is T(p) = n²/p + n² + 1
speedup for very large p:
lim_{p→∞} T(1)/T(p) = lim_{p→∞} 2n² / (n²/p + n² + 1) = 2n² / (n² + 1) ≈ 2
The Role of Decomposition
deserialize a part of the serial summation:
make each of the p CPUs compute a private sum
add the p private sums together
the runtime on p CPUs becomes T(p) = n²/p + n²/p + p + 1
parallelization speedup:
T(1)/T(p) = 2n² / (2n²/p + p + 1) = 2n²p / (2n² + p² + p) ≈ p for p ≪ n
Load Balancing
do all the CPUs perform the same amount of work?
refining Amdahl’s law: speedup depends on the maximal work done by any CPU
speedup(p) = T(1) / max_{i=1,...,p} T_i
T_i should be interpreted liberally: it includes communications and other overheads
CPUs should be working at the same time (scheduling!); extreme case: all do the same work, but one at a time — no speedup
minimize the time spent at synchronization points, including serialization, of course
Reducing Communications
communications are included in T_i
the higher the concurrency, the more communications may be needed
important metric: the communication-to-computation ratio; if a GB needs to be transferred overall, the impact on a program that runs 1 second will be greater than the impact on a program that runs 1 hour
system designers examine the communication needs of an application to determine the latency and bandwidth requirements and to justify the costs
Reducing Parallelization Overhead
parallelization incurs overhead: partitioning of tasks and data; redundant computations on different nodes; creating processes; distributing code and data across the system; executing synchronization; packing and unpacking the data for communications; load balancing
non-existent in sequential execution
pays off in parallel systems if trade-offs are right
Scalability
what is scalability? we can always add memory and disks, upgrade CPUs and network cards; the performance increase has hard limits; scalable systems avoid inherent design limits on increasing the resources of the system; in practice, there are still economic limits
for large systems, what do we gain by incrementing resources?
no design scales perfectly: where is the saturation point?
Scalability of Parallel Machines
how does throughput increase when we add processors?
how does latency of operations increase?
how does the cost increase?
can we scale without redesigning?
scaling up vs. scaling out: scaling up means building bigger machines; scaling out means building larger distributed systems
scaling down: are small configurations with the same design as the large ones cost-effective?
Example: Bus-Based SMPs
the bandwidth of the bus does not increase when more processors are added
the clock period of the bus depends on how fast a value can be driven onto the bus and sampled by every module on the bus
it increases with the number of connected modules and with the wire length; latency increases with the number of processors
signal quality on the bus decreases with length and the number of connected modules — a hard design limit
the bus, the packaging, and the power supply must correspond to the maximal possible configuration
Bandwidth Scaling
scalable networks provide a large number of independent communication paths between nodes
point-to-point latency should not increase with the number of nodes
cost per node should not increase; brute-force solutions exist, e.g., a full global crossbar
e.g., the Earth Simulator: 640 SMP nodes with a full crossbar interconnect
a crossbar needs on the order of n² links — not scalable after some point
NB: in NUMA local memory has constant latencyregardless of number of CPUs
important to consider communication patterns: broadcasts are expensive
Latency Scaling
time to transfer n bytes between 2 nodes:
T(n) = o + n/B + D
where o is the CPU overhead, n/B is the channel time (“occupancy”), and D is the (routing) delay
o = const or o ∝ n, depending on DMA
D = const or D ∝ n, depending on switching vs. store-and-forward
high-performance networks use switching, not store-and-forward
switches have a fixed degree, so latency increases as the number of hops increases
latency increases with contention
Example of Latency Scaling
for p nodes the network distance is log₂ p
assume o = 1 µs, B = 64 MB/s, and a per-hop delay of 200 ns; how do 128-byte transfers scale from 64 to 1024 nodes?
cut-through: T_64(128) = 1 µs + 128 B / (64 MB/s) + log₂(64) × 0.2 µs = 1 + 2 + 1.2 = 4.2 µs
cut-through: T_1024(128) = 1 µs + 2 µs + log₂(1024) × 0.2 µs = 5 µs
store-and-forward: T_64(128) = 1 µs + log₂(64) × (2 µs + 0.2 µs) = 14.2 µs
store-and-forward: T_1024(128) = 1 µs + log₂(1024) × (2 µs + 0.2 µs) = 23 µs
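A small sketch that reproduces this comparison; the overhead, bandwidth, per-hop delay, and transfer size are the example values assumed above:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double o = 1e-6;            /* CPU overhead: 1 us            */
    const double B = 64e6;            /* link bandwidth: 64 MB/s       */
    const double delta = 200e-9;      /* per-hop routing delay: 200 ns */
    const double n = 128.0;           /* transfer size in bytes        */
    const int nodes[] = { 64, 1024 };

    for (int i = 0; i < 2; i++) {
        double hops = log2((double)nodes[i]);        /* network distance      */
        double t_ct = o + n / B + hops * delta;      /* cut-through switching */
        double t_sf = o + hops * (n / B + delta);    /* store-and-forward     */
        printf("%4d nodes: cut-through %4.1f us, store-and-forward %4.1f us\n",
               nodes[i], t_ct * 1e6, t_sf * 1e6);
    }
    return 0;
}

With store-and-forward the channel time is paid on every hop, so latency grows much faster with the node count than with cut-through switching.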
Cost Scaling
fixed infrastructure cost plus the incremental cost of adding processors and memory:
Cost(p, m) = FixedCost + IncrementalCost(p, m)
the fixed cost for bus-based machines: cabinet, power supply, bus
small configurations are at a disadvantage compared to a uniprocessor
expansion is possible at a cost lower than that of a standalone processor
Cost Scaling of Bus-Based Machines
commercial bus-based machines are typically sold at a fraction of the maximal configuration
enough to amortize the fixed cost, while leaving room for expansion
vendors of large SMPs offer smaller models
designs had better be modular: more power supplies and cabinets as the size increases
Cost Scaling of Networks
assume the number of switches (and links) grows as p log₂ p
assume that for 64 nodes the cost is balanced between processors, memory, and network
how will the cost be distributed for 1024 nodes (with the same memory per CPU)?
the number of switches per CPU will increase by a factor of 10/6
cost per processor: 0.33 (CPU) + 0.33 (memory) + (10/6) × 0.33 (network) ≈ 1.22
the network fraction of the cost increases from 33% to 45%
there may be other factors, such as the increased length of wires
Cost-Effectiveness
definition of efficiency: E(p) = T(1) / (p × T(p))
a parallel machine is effective only if all CPUs are utilized
this neglects that much of the cost is in memory and network, not just CPUs
the system is cost-effective if its cost-performance ratio, Cost(p) × T(p), decreases (or does not grow?) with p
cost-effectiveness: the speedup is higher than the costup, T(1)/T(p) > Cost(p)/Cost(1)
Physical Scaling
today’s largest machines occupy huge multistory buildings
need physical access to components for maintenance
need cooling (and pay huge electricity bills)
dozens of staff
hundreds of kilometers of cables
there are physical limits on wire length (due to power requirements and signal degradation)
optical fiber has higher fixed cost
need denser packaging
but need space for all the cards, connections, etc.
also, loose coupling facilitates upgrades
Message Passing
send/receive pair: one-way transfer of data from the source area specified by the sending user process (“sender”) to a destination area specified by the receiving user process (“receiver”)
also provides a synchronization event between the 2 processes
synchronous send and recv
send returns only after the recv has been performed
recv returns when all the data has been received
easy deadlock: what if all processes send and stall?
alternation: even processes send before recv, odd ones recv before send
Asynchronous Message Passing
blocking send and recv
send returns when the message has been transferred to the system
recv is similar to the synchronous case, but does not send an ACK
does not guarantee delivery; such a guarantee requires additional handshaking
non-blocking send and recv
send returns immediately
recv returns after posting the intent to receive
need to probe the actual state of the transfer using special primitives; the probes may be blocking or non-blocking
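A sketch of non-blocking transfers with MPI (MPI_Isend/MPI_Irecv, with MPI_Test as the probe); the message size and the placeholder "useful work" are illustrative:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, done = 0;
    double buf[1024];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 1024; i++)
            buf[i] = i;
        /* returns immediately; the transfer proceeds in the background */
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        /* returns after posting the intent to receive */
        MPI_Irecv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    if (rank <= 1) {
        while (!done) {
            /* ... do useful work that overlaps with the transfer ... */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* probe the transfer state */
        }
    }

    MPI_Finalize();
    return 0;
}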
Message Passing Interface (MPI)
distinguishes the notion of when send or recv returns from when a message transfer completes
a flexible combination of buffered and synchronous modes
synchronous send: recv executed, delivery guaranteed, the source buffer can be reused
ready-mode send: throws an error if recv has not executed by the time the message arrives at the destination
need a special message to indicate readiness
buffered send: the source buffer can be reused independently of whether a recv has been issued; the data may be buffered elsewhere in the system
can be blocking or non-blocking
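A sketch of these three send modes using MPI's actual mode variants; the destination rank, tags, and the helper function are illustrative, and rank 1 is assumed to post the matching receives (in particular before the ready-mode send):

#include <mpi.h>
#include <stdlib.h>

/* hypothetical helper: rank 0 calls this while rank 1 posts the matching receives */
void send_modes_example(double *data, int n)
{
    /* synchronous send: returns only after the matching recv has started,
       so delivery is guaranteed and the source buffer can be reused */
    MPI_Ssend(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

    /* buffered send: the data is copied into a user-attached buffer and the
       call returns, whether or not a recv has been issued */
    int size;
    MPI_Pack_size(n, MPI_DOUBLE, MPI_COMM_WORLD, &size);
    size += MPI_BSEND_OVERHEAD;
    void *scratch = malloc(size);
    MPI_Buffer_attach(scratch, size);
    MPI_Bsend(data, n, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    MPI_Buffer_detach(&scratch, &size);
    free(scratch);

    /* ready-mode send: erroneous unless the matching recv is already posted */
    MPI_Rsend(data, n, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD);
}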
Synchronous MPI Transfers
sender: sends ready-to-send and waits
receiver: checks whether a matching recv has been posted; if yes, sends ready-to-recv; else waits until the recv is executed, then sends ready-to-recv
sender: initiates transfer upon arrival of ready-to-recv
can be receiver-initiated: the receiver specifies the sender; the match table is then only on the sender side; only ready-to-recv and the transfer are needed
Buffered MPI Transfers
optimistic single-phase protocol
the sender transfers the data in a single transaction, wrapped in an envelope containing the information used in matching
the destination examines the envelope to find a match: if there is a matching recv, deliver; else store in a temporary buffer
match-table lookup is expensive; what if data continues to flow?
always store in temporary buffer? need to copy todeliver...
assumes infinite storage at destination
Buffered MPI Transfers II
sender sends a ready-to-send with the envelope
the receiver sends a ready-to-recv when there is a match or if there is enough buffer space
for short messages the handshake may dominate the communication cost
credit system for short transfers: some buffer space is pre-allocated (and tracked) for short transfers; short messages can be sent without the handshake; the completion acknowledgment is used to communicate the credit state
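A sketch of sender-side credit accounting for short messages; this is an assumed illustration of the idea, not the protocol of any particular MPI implementation:

#include <stdbool.h>
#include <stdio.h>

#define CREDITS_PER_PEER 8            /* pre-allocated short-message buffers at the receiver */

struct peer {
    int credits;                      /* receive buffers still known to be free */
};

static void peer_init(struct peer *p) { p->credits = CREDITS_PER_PEER; }

/* try to send a short message eagerly, i.e. without the ready-to-send handshake */
static bool try_eager_send(struct peer *p)
{
    if (p->credits == 0)
        return false;                 /* out of credits: fall back to the handshake protocol */
    p->credits--;                     /* one pre-allocated receive buffer is now in use */
    /* ... transfer the short message in a single transaction ... */
    return true;
}

/* a completion acknowledgment from the receiver returns freed credits */
static void on_ack(struct peer *p, int freed) { p->credits += freed; }

int main(void)
{
    struct peer p;
    int sent = 0;
    peer_init(&p);
    for (int i = 0; i < 12; i++)      /* 12 attempts against 8 credits */
        if (try_eager_send(&p))
            sent++;
    on_ack(&p, 4);                    /* ack reports 4 buffers drained at the receiver */
    printf("sent eagerly: %d, credits left: %d\n", sent, p.credits);
    return 0;
}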
Active Messages
essentially a remote procedure call
the message contains data, a handler to be invoked on the receiver, and the handler’s arguments
the receiver invokes the handler and issues a response that contains a handler to be invoked on the sender
message notification through interrupts or signals
null message (polling): invokes a handler to service a full network
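A sketch of the active-message idea; the message layout, handler signature, and dispatch function are assumed for illustration (a real implementation ships a handler index or an address valid in the common SPMD image, not an arbitrary pointer):

#include <stdio.h>

#define AM_MAX_ARGS 4

typedef void (*am_handler_t)(int src, const long *args);

struct active_msg {
    am_handler_t handler;             /* function to invoke at the receiver */
    long args[AM_MAX_ARGS];           /* handler arguments carried inline   */
    int  src;                         /* sender, so a reply can be issued   */
};

/* example handler: runs on the receiving node when the message arrives */
static void remote_add(int src, const long *args)
{
    printf("request from %d: %ld + %ld = %ld\n",
           src, args[0], args[1], args[0] + args[1]);
    /* a real reply would be another active message carrying a handler
       to be invoked back at the sender */
}

/* what the network layer does on message arrival (interrupt, signal, or poll) */
static void am_dispatch(const struct active_msg *m)
{
    m->handler(m->src, m->args);
}

int main(void)
{
    struct active_msg m = { remote_add, { 2, 3, 0, 0 }, 0 };
    am_dispatch(&m);                  /* simulate arrival on the receiver */
    return 0;
}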
Common Challenges
each processor has only a limited knowledge of the state of the system
a large number of network transactions can be in progress simultaneously
latencies can be large
temptation to use optimistic protocols, large transfers
programming models may require multiple network transactions per operation
destinations may be flooded
similar to bus-based designs, but without the constraints: p is large, the number of outstanding transactions is large, and there is no global ordering
Back Pressure
suppose there is a lot of contention for a particular receiver
we can control the flow by informing the senders; this requires a special protocol
destination may refuse to accept traffic
back pressure will propagate to the sources (in a tree)
sources will feel the back pressure and slow down
reliable networks are deadlock-free as long as messages are removed at destinations
other messages will get stuck, too — the overall latency will grow
alternatives: NACKs or timeouts
Fetch Deadlocks
a processor sends a request, but the destination refuses to accept it
to keep the network deadlock-free, the source must keep accepting transactions even if it cannot initiate them
but what if it gets a request and cannot send a response?
provide separate logical networks for requests and responses
responses can be completed without adding to the load of the request network
Fetch Deadlocks II
alternative: provide input buffers by limiting the number of outstanding transactions and setting aside buffers for responses
with a limit of k outstanding transactions per processor, a node may need buffers for up to k × (P − 1) requests (one set from every other processor)
this limits the scalability
the destination can also NACK on a full input buffer
while stalled initiating a request, a node still accepts responses and NACKs requests
it is OK to assume there is space for the NACK at the source, because sources reserve space for responses