Parallel Processing I
Advanced Operating Systems
Oleg Goldshmidt
Lecture 5
Side Note: Simultaneous Multithreading
Modern CPU’s front end: instruction decoding and scheduling
Modern CPU’s back end (“core”): pipelined execution
Inefficiency in pipelines: “pipeline bubbles”, execution slots where the CPU could not schedule any useful work (for whatever reason)
Inefficiency in the front end: a CPU can schedule several instructions per clock cycle for execution in the core by different functional units; usually it schedules fewer
on SMP more instructions are executed in each clockcycle, but more slots are wasted as well
Side Note: Superthreading
a.k.a. time-slice multithreading
interleave instructions from different threads
each pipeline stage can contain instructions from one thread only
scheduling logic switches between threads
helps alleviate memory latency: if one thread requests data from main memory that is not in the cache, it stalls for several cycles; another thread can proceed with execution, keeping the pipelines full
does not help with instruction-level parallelism: if on a given cycle not enough instructions can be parallelized, there will still be waste
Side Note: Hyperthreading
removes the “one thread per time slot” restriction
allows maximal scheduling flexibility to reduce “bubbles”
Intel Pentium 4 Xeon: 2 threads per CPU
not very complicated: adds about 5% to the die area for Xeon
from the OS perspective: 2 logical processors, equivalent to 2-way SMP
installing an OS on a Xeon means installing an SMP kernel; hyperthreading can be disabled (why?)
both logical CPUs share the same cache
Side Note: SMT and Caching
no cache coherence problems that plague SMPs
but higher potential for cache conflicts
each thread can monopolize the caches — no cooperation
potential for cache thrashing: can be bad for memory-intensive workloads; remember, hyperthreading can be disabled
This (And Next) Lecture
extends the previous discussion of shared-memory multiprocessors
name of the game: performance via parallelism
A parallel computer is “a collection of processing elements that communicate and cooperate to solve large problems fast” [Almasi and Gottlieb, 1989]
extending the concepts of computer architecture with communications architecture
analyze performance, cost
focus on implications for OS, applications
Parallel Multiprogramming
shared memory programming: bulletin board — posting information at known, shared locations in memory; orchestration by taking note of who is doing what
message passing programming: sending messages between individual processes; orchestration by well-defined events of sending and receiving information
data parallel programming: processes perform operations on subsets of data; orchestration by exchanging information at various steps in execution; communication by shared memory or messages
Communications Architecture
communication abstraction: the user-level communication primitives of the system, provided by
the programming language and environment, compilers and libraries, the OS, and the hardware
these define operations, data types, formats; a combination of hardware and software
Shared Memory vs. Message Passing
at the highest logical level message passing is very similar to NUMA
primary difference: communication is integrated at the I/O level rather than into the memory system
complete computers as building blocks: CPU, memory, and I/O
clusters — collections of fairly typical standalone systems
multicomputers — tighter packaging, faster networks,much tighter integration of CPUs and network
interprocessor communication rather than I/O
The Role of the OS
shared memory: much of the distance between the programming model and the hardware is covered by compilers and libraries; the basic primitives are provided by hardware (LOAD and STORE)
with some help from the OS, e.g., page faults; much of the infrastructure is provided by hardware (bus, cache); the OS is very important for scheduling and synchronization
message passing: more OS (and library) support needed — network I/O!
Message Passing
send: specifies a local buffer to send and the receiving process (usually on a remote CPU)
recv: specifies a sending process and a local buffer to place the received data in
combined send and recv
cause the transfer to occur; accomplish pairwise synchronization; facilitate a memory-to-memory copy
each side specifies its local address
different semantics: synchronous, blocking, non-blocking (later)
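As a concrete illustration of these primitives, a minimal sketch in C using MPI (the library itself is only covered later in this lecture); the ranks, buffer size, and tag are illustrative, not part of the slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send: names the local buffer and the receiving process (rank 1) */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* recv: names the sending process (rank 0) and a local buffer for the data */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}

The matched pair both moves the data memory-to-memory and acts as the pairwise synchronization event described above.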
Message Passing Technologies
point-to-point FIFO links: meshes or hypercubes; connect to nearest neighbours only (algorithms had to be written for the specific topology); synchronous transfers
DMA: allows non-blocking sends
store-and-forward routing: latencies increase with the number of hops
pipelined interconnects (IBM SP-2, Intel Paragon): latencies dominated by the CPU-to-NIC path; uniform communication costs
Convergence: Software
send and recv on shared memory machines: send: store the data or a pointer to the data in a buffer; recv: read the data; raise a flag when a “message” arrives
global memory space on message passing machines: e.g., a logical read: send a message to another process, get the data with “return mail”; the message exchange is hidden by compiler-generated code for access to a shared variable
shared VM on message passing machines: on access to a remote page a page fault occurs; messages are exchanged behind the scenes; the missing page is transferred and mapped
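A minimal sketch of the shared-memory emulation of send/recv described above; the mailbox variable, the flag, the value, and the use of C11 atomics with pthreads are illustrative assumptions:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int mailbox;            /* shared "bulletin board" location holding the message */
static atomic_int flag = 0;    /* raised when a "message" arrives                      */

static void *sender(void *arg)
{
    mailbox = 42;                                            /* "send": store the data   */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* ...then raise the flag   */
    return NULL;
}

static void *receiver(void *arg)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                                    /* "recv": wait for the flag */
    printf("received %d\n", mailbox);                        /* ...then read the data     */
    return NULL;
}

int main(void)
{
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}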
Convergence: Hardware
integrate network controllers with cache and/or memory controllers
shared memory machines: detect cache misses; issue network transactions to access remote memory
message passing machines: provide networked DMA, memory-to-memory between machines; provide user-level message passing on top of DMA transfers; attempts at built-in distributed shared memory
message-passing between SMP nodes
Data Parallel Processing
processor arrays, a.k.a. SIMD machines, a.k.a. data parallel architectures
operations are performed in parallel on elements of a large regular data set
e.g., elements or chunks of arrays, matrices, etc.
Flynn’s taxonomy: SISD — conventional sequential machines; MIMD — conventional parallel machines; SIMD — instruction sequencing in a control processor, data manipulation in data processors
motivation in early architectures: fetching instructions costs as much as performing them
eclipsed by vector processors
Vector Processors
functional processing units that can operate on vectors of data (from one memory) in pipelines
no need to map application data to interconnect
CDC Star-100: operations on contiguous vectors; a lot of time was spent making them contiguous
CRAY-1: vector registers, LOAD and STORE
Connection Machine: 32 1-bit PEs on a chip, connections to neighbours, consolidated sequencing
killed by fast processors with integrated FPUs and caches: no cost advantage to consolidated sequencing
SPMD — many CPUs execute copies of the same program
Architectures and Programming
programming models and environments are largely abstracted from parallel machine architectures
languages: Fortran 90, Fortran 95, HPF: shared-address and data parallel programming; work on a wide variety of shared memory and message passing architectures; details are hidden by compilers; compilation differs radically, depending on the communication and synchronization primitives
libraries: PVM, MPI: work just about everywhere (well...); implementations differ
Performance Metrics
latency (l): how much time is needed to complete an operation
bandwidth (b): how many operations are performed in unit time
cost (c): the impact the operations have on the overall execution time of the program
in a sequential world c = n × l (where n is the number of operations performed)
parallelism is more difficult
Performance Metrics Example
a component can perform an operation in l = 100 ns
“sequential bandwidth”: b = 1/l = 10 Mops
assume a 10-stage pipeline: the peak bandwidth is 100 Mops
if the application starts an operation only every 200 ns, the delivered bandwidth is 1/(200 ns) = 5 Mops
pipelining is still beneficial because of bursts
cost of n operations: if the program immediately depends on each result, c = n × l
if we average w ns of independent work before depending on the result, the cost per operation drops to l − w, so c = n × (l − w)
Data Transfer Time
central to parallel computing
T(n) = T0 + n/B
n is the amount of data (e.g., in bytes)
B is the transfer rate (bandwidth) in bytes per second
T0 is the start-up cost
useful for different kinds of data transfers
message passing: T0 is the time for the first bit to reach the destination
memory access: T0 is the access time
bus transactions: T0 is the arbitration and control protocol
pipelines: T0 is the time to fill the pipeline
Data Transfer Time: Discussion
bandwidth depends on the transfer size
B is the asymptotic bandwidth: B = lim_{n→∞} n/T(n)
how quickly do we reach B if we increase n? it depends on the start-up cost
the transfer size at which half of the peak bandwidth is reached (the half-power point): n_{1/2} = T0 × B
bandwidth and latency are not the same!
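A small sketch that evaluates the model T(n) = T0 + n/B; the start-up cost and asymptotic bandwidth used here are assumed example values, not figures from the lecture:

#include <stdio.h>

int main(void)
{
    const double T0 = 10e-6;          /* start-up cost: 10 us (assumed)           */
    const double B  = 100e6;          /* asymptotic bandwidth: 100 MB/s (assumed) */

    /* half-power point: the transfer size at which n/T(n) reaches B/2 */
    printf("half-power point n_1/2 = T0 * B = %.0f bytes\n", T0 * B);

    for (double n = 100.0; n <= 1e7; n *= 10.0) {
        double T    = T0 + n / B;     /* transfer time T(n)           */
        double beff = n / T;          /* delivered bandwidth n / T(n) */
        printf("n = %9.0f B   T = %9.2f us   B_eff = %6.1f MB/s\n",
               n, T * 1e6, beff / 1e6);
    }
    return 0;
}

The printed table shows the delivered bandwidth n/T(n) approaching B only once the transfer size grows well past the half-power point.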
Refining The Model
when can the next transfer be initiated?
can useful work be done during transfer?
T(n) = o + x + D
where o is the overhead, x is the occupancy, and D is the network delay
Data Transfer Overhead
overhead — time it takes the CPU to initiate the transfer
may be constant or linear in n, the transfer size
constant if the CPU only has to tell the communication assist to start and gives it a pointer to the data: for DMA-capable or memory-integrated assists
linear if the CPU has to copy the data to the assist: for programmed I/O
may be more difficult if DMA locks the memory bus
overhead is the time the CPU is busy
the rest (x + D) is the network latency
Network Latency
occupancy — the time it takes the data to pass through the slowest link (n/B_link, if B_link is the bandwidth of that link)
often the communication assist is the bottleneck
the occupancy determines how often transfers can be initiated: the next transfer will wait until the bottleneck is free
if there is buffering between the CPU and the bottleneck, transfers can be initiated faster, but only until the buffer is full
delay — the time it takes a bit to travel across the network, including the communication assist delay
the CPU does not care about the details, designers do
the split into o, x, and D is important for pipelined transfers
Multiple Transfers and Contention
individual network components are described by thedelay and occupancy
the network delay for a single transfer is the sum of the individual delays along the path
the network occupancy is the maximal occupancy along the path
if two transfers attempt to use the same resource simultaneously, one must wait
contention increases communication time
contention looks to CPUs like increased occupancy
Performance of a Cache Miss
overhead: the cache controller determines there has been a miss and starts the transfer
occupancy: the cache block size divided by the bus bandwidth
assume the bus is the bottleneck
delay: the time needed to arbitrate and access the bus and the time to deliver the block into the cache
assuming the bus is free
contention: time waiting to get access to the busy bus
Communication Cost
C = f × (T − overlap)
where T is the communication time, and
f: the frequency of communication — the number of communication operations per unit of work in the program
hardware may affect f: it may limit the transfer size; it may automatically migrate or replicate data
overlap: the portion of the communication that overlaps with other useful work
includes computation or other communications; possible because communication is offloaded
Amdahl’s Law
concurrency is limited by the underlying problem and by the decomposition into parallel tasks
some portions of the program may not have as much concurrency as the number of CPUs
e.g., final aggregation of results, input and output
assume a fraction s of the work is serial and p = 1 − s is fully parallelized: the execution time on n processors is T(n) = p/n + s (normalizing T(1) = 1)
if s = 0.01 there is no reason to increase n much beyond p/s ≈ 100
Amdahl’s Law: Example
problem: compute the average of the values on an n × n grid
decomposition: compute the n² values in parallel, then each processor adds its results to the global sum, then divide by n²
the summation must be serialized; the runtime on p CPUs is T(p) = n²/p + n² + 1
speedup for very large p:
lim_{p→∞} T(1)/T(p) = lim_{p→∞} 2n² / (n²/p + n² + 1) = 2n² / (n² + 1) ≈ 2
The Role of Decomposition
deserialize a part of the serial summation:
make each of the p CPUs compute a private sum
add the p private sums together
the runtime on p CPUs becomes T(p) = n²/p + n²/p + p + 1
parallelization speedup:
T(1)/T(p) = 2n² / (2n²/p + p + 1) = 2n²p / (2n² + p² + p) ≈ p for p ≪ n
Load Balancing
do all the CPUs perform the same amount of work?
refining Amdahl’s law: speedup depends on the maximal work done by any CPU
speedup(p) = T(1) / max_{i=1,...,p} T_i
T_i should be interpreted liberally: it includes communications and other overheads
CPUs should be working at the same time (scheduling!); extreme case: all do the same work, but one at a time — no speedup
minimize the time spent at synchronization points, including serialization, of course
Reducing Communications
communications are included in T_i
the higher the concurrency, the more communications may be needed
important metric: the communication-to-computation ratio; if a GB needs to be transferred overall, the impact on a program that runs 1 second will be greater than the impact on a program that runs 1 hour
system designers examine the communication needs of an application to determine the latency and bandwidth requirements and to justify the costs
Reducing Parallelization Overhead
parallelization incurs overhead: partitioning of tasks and data; redundant computations on different nodes; creating processes; distributing code and data across the system; executing synchronization; packing and unpacking the data for communications; load balancing
non-existent in sequential execution
pays off in parallel systems if trade-offs are right
Scalability
what is scalability? we can always add memory and disks, upgrade CPUs and network cards; the performance increase has hard limits; scalable systems avoid inherent design limits on increasing the resources of the system; in practice, there are still economic limits
for large systems, what do we gain by incrementing resources?
no design scales perfectly: where is the saturation point?
Scalability of Parallel Machines
how does throughput increase when we add processors?
how does latency of operations increase?
how does the cost increase?
can we scale without redesigning?
scaling up vs. scaling out: scaling up means building bigger machines; scaling out means building larger distributed systems
scaling down: are small configurations with the same design as the large ones cost-effective?
Example: Bus-Based SMPs
the bandwidth of the bus does not increase when more processors are added
the clock period of the bus depends on how fast a value can be driven onto the bus and sampled by every module on the bus
it increases with the number of connected modules and with the wire length; latency increases with the number of processors
signal quality on the bus decreases with length and the number of connected modules — a hard design limit
the bus, the packaging, and the power supply must correspond to the maximal possible configuration
Bandwidth Scaling
scalable networks provide a large number of independent communication paths between nodes
point-to-point latency should not increase with the number of nodes
cost per node should not increase; brute-force solutions exist, e.g., a full global crossbar
e.g., the Earth Simulator: 640 SMP nodes with a full crossbar interconnect
a crossbar needs on the order of n² links — not scalable after some point
NB: in NUMA local memory has constant latencyregardless of number of CPUs
important to consider communication patterns: broadcasts are expensive
Latency Scaling
time to transfer n bytes between 2 nodes:
T(n) = o + n/B + D
where o is the CPU overhead, n/B is the channel time (“occupancy”), and D is the (routing) delay
o = const or o ∝ n, depending on DMA
D = const or D ∝ n, depending on switching vs. store-and-forward
high-performance networks use switching, not store-and-forward
switches have a fixed degree, so latency increases as the number of hops increases
latency increases with contention
Example of Latency Scaling
for p nodes the network distance is log₂ p
assume o = 1 µs, B = 64 MB/s, and a per-hop delay of 200 ns; how do 128-byte transfers scale from 64 to 1024 nodes?
cut-through: T_64(128) = 1 µs + 128 B / (64 MB/s) + log₂(64) × 0.2 µs = 1 + 2 + 1.2 = 4.2 µs
cut-through: T_1024(128) = 1 µs + 2 µs + log₂(1024) × 0.2 µs = 5 µs
store-and-forward: T_64(128) = 1 µs + log₂(64) × (2 µs + 0.2 µs) = 14.2 µs
store-and-forward: T_1024(128) = 1 µs + log₂(1024) × (2 µs + 0.2 µs) = 23 µs
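A small sketch that reproduces this comparison; the overhead, bandwidth, per-hop delay, and transfer size are the example values assumed above:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double o = 1e-6;            /* CPU overhead: 1 us            */
    const double B = 64e6;            /* link bandwidth: 64 MB/s       */
    const double delta = 200e-9;      /* per-hop routing delay: 200 ns */
    const double n = 128.0;           /* transfer size in bytes        */
    const int nodes[] = { 64, 1024 };

    for (int i = 0; i < 2; i++) {
        double hops = log2((double)nodes[i]);        /* network distance      */
        double t_ct = o + n / B + hops * delta;      /* cut-through switching */
        double t_sf = o + hops * (n / B + delta);    /* store-and-forward     */
        printf("%4d nodes: cut-through %4.1f us, store-and-forward %4.1f us\n",
               nodes[i], t_ct * 1e6, t_sf * 1e6);
    }
    return 0;
}

With store-and-forward the channel time is paid on every hop, so latency grows much faster with the node count than with cut-through switching.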
Cost Scaling
fixed infrastructure cost plus the incremental cost of adding processors and memory:
Cost(p, m) = FixedCost + IncrementalCost(p, m)
the fixed cost for bus-based machines: cabinet, power supply, bus
small configurations are at a disadvantage compared to a uniprocessor
expansion is possible at a cost lower than that of a standalone processor
Cost Scaling of Bus-Based Machines
commercial bus-based machines are typically sold at a fraction of the maximal configuration
enough to amortize the fixed cost, while leaving room for expansion
vendors of large SMPs offer smaller models
designs had better be modular: more power supplies and cabinets as the size increases
Cost Scaling of Networks
assume the number of switches (and links) grows as p log₂ p
assume that for 64 nodes the cost is balanced between processors, memory, and network
how will the cost be distributed for 1024 nodes (with the same memory per CPU)?
the number of switches per CPU will increase by a factor of 10/6
cost per processor: 0.33 (CPU) + 0.33 (memory) + (10/6) × 0.33 (network) ≈ 1.22
the network fraction of the cost increases from 33% to 45%
there may be other factors, such as the increased length of wires
Cost-Effectiveness
definition of efficiency: E(p) = T(1) / (p × T(p))
a parallel machine is effective only if all CPUs are utilized
this neglects that much of the cost is in memory and network, not just CPUs
the system is cost-effective if its cost-performance ratio, Cost(p) × T(p), decreases (or does not grow?) with p
cost-effectiveness: the speedup is higher than the costup, T(1)/T(p) > Cost(p)/Cost(1)
Physical Scaling
today’s largest machines occupy huge multistory buildings
need physical access to components for maintenance
need cooling (and pay huge electricity bills)
dozens of staff
hundreds of kilometers of cables
there are physical limits on wire length (due to power requirements and signal degradation)
optical fiber has higher fixed cost
need denser packaging
but need space for all the cards, connections, etc.
also, loose coupling facilitates upgrades
Message Passing
send/receive pair: one-way transfer of data from the source area specified by the sending user process (“sender”) to a destination area specified by the receiving user process (“receiver”)
also provides a synchronization event between the 2 processes
synchronous send and recv
send returns only after the recv has been performed
recv returns when all the data has been received
easy deadlock: what if all processes send and stall?
alternation: even processes send before recv, odd ones recv before send
Asynchronous Message Passing
blocking send and recv
send returns when the message has been transferred to the system
recv is similar to the synchronous case, but does not send an ACK
does not guarantee delivery; such a guarantee requires additional handshaking
non-blocking send and recv
send returns immediately
recv returns after posting the intent to receive
need to probe the actual state of the transfer using special primitives; the probes may be blocking or non-blocking
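A sketch of non-blocking transfers with MPI (MPI_Isend/MPI_Irecv, with MPI_Test as the probe); the message size and the placeholder "useful work" are illustrative:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, done = 0;
    double buf[1024];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 1024; i++)
            buf[i] = i;
        /* returns immediately; the transfer proceeds in the background */
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        /* returns after posting the intent to receive */
        MPI_Irecv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    if (rank <= 1) {
        while (!done) {
            /* ... do useful work that overlaps with the transfer ... */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* probe the transfer state */
        }
    }

    MPI_Finalize();
    return 0;
}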
Message Passing Interface (MPI)
distinguishes the notion of when send or recv returns from when a message transfer completes
a flexible combination of buffered and synchronous modes
synchronous send: recv executed, delivery guaranteed, the source buffer can be reused
ready-mode send: throws an error if recv has not executed by the time the message arrives at the destination
need a special message to indicate readiness
buffered send: the source buffer can be reused independently of whether a recv has been issued; the data may be buffered elsewhere in the system
can be blocking or non-blocking
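A sketch of these three send modes using MPI's actual mode variants; the destination rank, tags, and the helper function are illustrative, and rank 1 is assumed to post the matching receives (in particular before the ready-mode send):

#include <mpi.h>
#include <stdlib.h>

/* hypothetical helper: rank 0 calls this while rank 1 posts the matching receives */
void send_modes_example(double *data, int n)
{
    /* synchronous send: returns only after the matching recv has started,
       so delivery is guaranteed and the source buffer can be reused */
    MPI_Ssend(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

    /* buffered send: the data is copied into a user-attached buffer and the
       call returns, whether or not a recv has been issued */
    int size;
    MPI_Pack_size(n, MPI_DOUBLE, MPI_COMM_WORLD, &size);
    size += MPI_BSEND_OVERHEAD;
    void *scratch = malloc(size);
    MPI_Buffer_attach(scratch, size);
    MPI_Bsend(data, n, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    MPI_Buffer_detach(&scratch, &size);
    free(scratch);

    /* ready-mode send: erroneous unless the matching recv is already posted */
    MPI_Rsend(data, n, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD);
}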
Synchronous MPI Transfers
sender: sends ready-to-send and waits
receiver: checks whether a matching recv has been posted; if yes, sends ready-to-recv; else waits until the recv is executed, then sends ready-to-recv
sender: initiates transfer upon arrival of ready-to-recv
can be receiver-initiated: the receiver specifies the sender; the match table is then only on the sender side; only ready-to-recv and the transfer are needed
Buffered MPI Transfers
optimistic single-phase protocol
the sender transfers the data in a single transaction, wrapped in an envelope containing the information used in matching
the destination examines the envelope to find a match: if there is a matching recv, deliver; else store in a temporary buffer
match-table lookup is expensive; what if data continues to flow?
always store in temporary buffer? need to copy todeliver...
assumes infinite storage at destination
Buffered MPI Transfers II
sender sends a ready-to-send with the envelope
the receiver sends a ready-to-recv when there is a match or if there is enough buffer space
for short messages the handshake may dominate the communication cost
credit system for short transfers: some buffer space is pre-allocated (and tracked) for short transfers; short messages can be sent without the handshake; the completion acknowledgment is used to communicate the credit state
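A sketch of sender-side credit accounting for short messages; this is an assumed illustration of the idea, not the protocol of any particular MPI implementation:

#include <stdbool.h>
#include <stdio.h>

#define CREDITS_PER_PEER 8            /* pre-allocated short-message buffers at the receiver */

struct peer {
    int credits;                      /* receive buffers still known to be free */
};

static void peer_init(struct peer *p) { p->credits = CREDITS_PER_PEER; }

/* try to send a short message eagerly, i.e. without the ready-to-send handshake */
static bool try_eager_send(struct peer *p)
{
    if (p->credits == 0)
        return false;                 /* out of credits: fall back to the handshake protocol */
    p->credits--;                     /* one pre-allocated receive buffer is now in use */
    /* ... transfer the short message in a single transaction ... */
    return true;
}

/* a completion acknowledgment from the receiver returns freed credits */
static void on_ack(struct peer *p, int freed) { p->credits += freed; }

int main(void)
{
    struct peer p;
    int sent = 0;
    peer_init(&p);
    for (int i = 0; i < 12; i++)      /* 12 attempts against 8 credits */
        if (try_eager_send(&p))
            sent++;
    on_ack(&p, 4);                    /* ack reports 4 buffers drained at the receiver */
    printf("sent eagerly: %d, credits left: %d\n", sent, p.credits);
    return 0;
}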
Active Messages
essentially a remote procedure call
the message contains data, a handler to be invoked on the receiver, and the handler’s arguments
the receiver invokes the handler and issues a response that contains a handler to be invoked on the sender
message notification through interrupts or signals
null message (polling): invokes a handler to service a full network
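A sketch of the active-message idea; the message layout, handler signature, and dispatch function are assumed for illustration (a real implementation ships a handler index or an address valid in the common SPMD image, not an arbitrary pointer):

#include <stdio.h>

#define AM_MAX_ARGS 4

typedef void (*am_handler_t)(int src, const long *args);

struct active_msg {
    am_handler_t handler;             /* function to invoke at the receiver */
    long args[AM_MAX_ARGS];           /* handler arguments carried inline   */
    int  src;                         /* sender, so a reply can be issued   */
};

/* example handler: runs on the receiving node when the message arrives */
static void remote_add(int src, const long *args)
{
    printf("request from %d: %ld + %ld = %ld\n",
           src, args[0], args[1], args[0] + args[1]);
    /* a real reply would be another active message carrying a handler
       to be invoked back at the sender */
}

/* what the network layer does on message arrival (interrupt, signal, or poll) */
static void am_dispatch(const struct active_msg *m)
{
    m->handler(m->src, m->args);
}

int main(void)
{
    struct active_msg m = { remote_add, { 2, 3, 0, 0 }, 0 };
    am_dispatch(&m);                  /* simulate arrival on the receiver */
    return 0;
}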
Common Challenges
each processor has only a limited knowledge of the state of the system
a large number of network transactions can be in progress simultaneously
latencies can be large
temptation to use optimistic protocols, large transfers
programming models may require multiple network transactions per operation
destinations may be flooded
similar to bus-based designs, but without the constraints: p is large, the number of outstanding transactions is large, and there is no global ordering
Back Pressure
suppose there is a lot of contention for a particular receiver
we can control the flow by informing the senders; this requires a special protocol
destination may refuse to accept traffic
back pressure will propagate to the sources (in a tree)
sources will feel the back pressure and slow down
reliable networks are deadlock-free as long as messages are removed at destinations
other messages will get stuck, too — the overall latency will grow
alternatives: NACKs or timeouts
Fetch Deadlocks
a processor sends a request, but the destination refuses to accept it
to keep the network deadlock-free, the source must keep accepting transactions even if it cannot initiate them
but what if it gets a request and cannot send a response?
provide separate logical networks for requests and responses
responses can be completed without adding to the load of the request network
Fetch Deadlocks II
alternative: provide input buffers by limiting the number of outstanding transactions and setting aside buffers for responses
with a limit of k outstanding transactions per processor, a node may need buffers for up to k × (P − 1) requests (one set from every other processor)
this limits the scalability
the destination can also NACK on a full input buffer
while stalled initiating a request, a node still accepts responses and NACKs requests
it is OK to assume there is space for the NACK at the source, because sources reserve space for responses