Multiprocessors and Thread-Level Parallelism


Page 1: Multiprocessors and Thread-Level Parallelism


Multiprocessors and Thread-Level Parallelism

Vincent Berk, November 12th, 2008

Reading for Friday: Sections 4.1-4.3

Reading for Monday: Sections 4.4-4.9

Page 2: Multiprocessors and Thread-Level Parallelism


Parallel Computers

• Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
(Almasi and Gottlieb, Highly Parallel Computing, 1989)

• Questions about parallel computers:
– How large a collection?
– How powerful are the processing elements?
– How do they cooperate and communicate?
– How is data transmitted?
– What type of interconnection?
– What are the HW and SW primitives for the programmer?
– Does it translate into performance?

Page 3: Multiprocessors and Thread-Level Parallelism


Parallel Processors “Religion”

• The dream of computer architects since 1960: replicate processors to add performance, rather than design a faster processor
• Led to innovative organizations tied to particular programming models, since “uniprocessors can’t keep going”
– e.g., uniprocessors must stop getting faster due to the speed-of-light limit: predicted in 1972, …, 1989, …, 2008
– Borders on religious fervor: you must believe!
– Fervor damped somewhat when 1990s companies went out of business: Thinking Machines, Kendall Square, …
• The argument is now the “pull” of the opportunity of scalable performance, not the “push” of a uniprocessor performance plateau
• Recent advancement: Sun Niagara “T1 CoolThreads”: 8 simple, in-order execution cores that switch threads on stalls

Page 4: Multiprocessors and Thread-Level Parallelism


Opportunities: Scientific Computing

• Nearly unlimited demand (Grand Challenge problems):

  App                     Perf (GFLOPS)   Memory (GB)
  48-hour weather         0.1             0.1
  72-hour weather         3               1
  Pharmaceutical design   100             10
  Global Change, Genome   1000            1000

(Figure 1-2, page 25, of Culler, Singh, Gupta [CSG97])

• Successes in some real industries:
– Petroleum: reservoir modeling
– Automotive: crash simulation, drag analysis, engines
– Aeronautics: airflow analysis, engines, structural mechanics
– Pharmaceuticals: molecular modeling
– Entertainment: full-length movies (“Finding Nemo”)

Page 5: Multiprocessors and Thread-Level Parallelism


Flynn’s Taxonomy

• SISD (Single Instruction, Single Data)
– Uniprocessors
• MISD (Multiple Instruction, Single Data)
– ???
• SIMD (Single Instruction, Multiple Data)
– Examples: Illiac-IV, CM-2
– Simple programming model
– Low overhead
– Flexibility
– All custom integrated circuits
• MIMD (Multiple Instruction, Multiple Data)
– Examples: Sun Enterprise 10000, Cray T3D, SGI Origin
– Flexible
– Can use off-the-shelf microprocessors

Page 6: Multiprocessors and Thread-Level Parallelism


[Figure: Flynn’s classification of computer systems. (a) SISD computer: a single control unit sends one instruction stream to one processor unit, which exchanges a single data stream with a memory module. (b) SIMD computer: one control unit broadcasts a single instruction stream to several processor units, each exchanging its own data stream (streams 1-4) with its own memory module. (c) MIMD computer: several control units, each issuing its own instruction stream (streams 1-4) to its own processor unit, each with its own data stream and memory module.]

Page 7: Multiprocessors and Thread-Level Parallelism


Parallel Framework

• Programming Model:
– Multiprogramming: lots of jobs, no communication
– Shared address space: communicate via memory
– Message passing: send and receive messages
– Data parallel: several agents operate on several data sets simultaneously, then exchange information globally and simultaneously (shared or message passing)

• Communication Abstraction:
– Shared address space: e.g., load, store, atomic swap
– Message passing: e.g., send, receive library calls
– Debate over this topic (ease of programming, scaling)

Page 8: Multiprocessors and Thread-Level Parallelism


Diagrams of Shared and Distributed Memory Architectures

[Figure: a Shared Memory System has processors P connected through a network to shared memory modules M; a Distributed Memory system pairs each processor P with its own local memory M and connects the processor-memory nodes through a network.]

Page 9: Multiprocessors and Thread-Level Parallelism


Shared Address/Memory Multiprocessor Model

• Communicate via load and store
– Oldest and most popular model
• Based on timesharing: processes on multiple processors vs. sharing a single processor
• Process: a virtual address space and ≥ 1 thread of control
– Multiple processes can overlap (share), but ALL threads share a process address space
• Writes to the shared address space by one thread are visible to reads by the other threads
– Usual model: shared code, private stack, some shared heap, some private heap (shared page table!)
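As a concrete illustration, here is a minimal pthreads sketch (our example; the slides do not name an API) of one thread's store becoming visible to another thread's load through the shared address space:

```c
#include <pthread.h>
#include <stdio.h>

/* Shared address space: both threads name the same location. */
static int shared_value = 0;

static void *writer(void *arg) {
    shared_value = 42;             /* plain store into shared memory */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);         /* synchronize before reading */
    printf("reader sees %d\n", shared_value);  /* plain load */
    return 0;
}
```

The join supplies the synchronization; without it, the reader could still observe the old value.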

Page 10: Multiprocessors and Thread-Level Parallelism


Example: Small-Scale MP Designs

• Memory: centralized with uniform memory access (“UMA”) and bus interconnect, I/O

• Examples: Sun Enterprise, SGI Challenge, Intel SystemPro

[Figure: four processors, each with one or more levels of cache, share a single bus to main memory and the I/O system.]

Page 11: Multiprocessors and Thread-Level Parallelism


SMP Interconnect

• Processors connected to memory AND to I/O

• Bus-based: all memory locations have equal access time, so SMP = “Symmetric MP”

– Bus bandwidth is shared, and thus limited, as we add processors and I/O

• Crossbar: expensive to expand

• Multistage network: less expensive to expand than a crossbar, while providing more BW

• “Dance Hall” designs: all processors on the left, all memories on the right

Page 12: Multiprocessors and Thread-Level Parallelism


Large-Scale MP Designs

• Memory: distributed, with nonuniform memory access (“NUMA”) and a scalable interconnect (distributed memory)

[Figure: many nodes, each containing a processor + cache with local memory and I/O, joined by a low-latency, high-reliability interconnection network. Typical access costs: 1 cycle to the local cache, 40 cycles to local memory, 100 cycles to remote memory across the network.]

Page 13: Multiprocessors and Thread-Level Parallelism


Shared Address Model Summary

• Each processor can name every physical location in the machine

• Each process can name all data it shares with other processes

• Data transfer via load and store

• Data size: byte, word, ... or cache blocks

• Uses virtual memory to map virtual addresses to local or remote physical addresses

• Memory hierarchy model applies: now communication moves data to local processor’s cache (as load moves data from memory to cache)

– Latency, BW, scalability of communication?

Page 14: Multiprocessors and Thread-Level Parallelism


Message Passing Model

• Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations
– Essentially NUMA, but integrated at the I/O devices rather than the memory system
• Send specifies a local buffer + the receiving process on the remote computer
• Receive specifies the sending process on the remote computer + a local buffer to place the data
– Usually send includes a process tag, and receive has a rule on tags: match one, match any
– Synchronization: when the send completes, when the buffer is free, when the request is accepted, receive waits for send
• Send + receive => memory-memory copy, where each side supplies a local address AND pairwise synchronization is performed!
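For concreteness, a minimal sketch of this pairwise send/receive in MPI (our choice of library; the slides describe the model abstractly). Send names the local buffer, the destination, and a tag; receive names the source, a tag rule, and a local buffer:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send: local buffer + receiving process (rank 1) + tag. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive: sending process (rank 0) + tag rule + local buffer.
           Together, send + receive perform a memory-to-memory copy
           plus pairwise synchronization. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```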

Page 15: Multiprocessors and Thread-Level Parallelism


Logical View of a Message Passing System

[Figure: logical view of a message passing system: processes on Processor 1 and Processor 2 exchange messages across a communications network; each message is delivered into the receiver’s buffer or queue (Port 1, Port 2).]

Page 16: Multiprocessors and Thread-Level Parallelism


Message Passing Model

• Send + receive => memory-memory copy, with synchronization through the OS, even on 1 processor
• History of message passing:
– Network topology mattered because a node could only send to its immediate neighbors
– Typically synchronous, blocking send & receive
– Later, DMA with non-blocking sends; on the receive side, DMA into a buffer until the processor does a receive, and then the data is transferred to local memory
– Later, SW libraries to allow arbitrary communication
• Example: IBM SP-2, RS6000 workstations in racks
– Network Interface Card has an Intel 960
– 8 × 8 crossbar switch as the communication building block
– 40 MB/sec per link

Page 17: Multiprocessors and Thread-Level Parallelism


Message Passing Model

• Special type of message passing: the pipe
• A pipe is a point-to-point connection
• Data is pushed in one direction by the first process and received, in order, by the second process
• Data can stay in the pipe indefinitely
• Asynchronous, simplex communication
• Advantages:
– Simple implementation
– Works with send/receive calls
– Good way of implementing producer-consumer systems
– No round-trip latency, since communication is one-way
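A minimal POSIX sketch of a producer pushing data one way through a pipe to a consumer (the fork/pipe structure is our illustration, not from the slides):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd[2];                      /* fd[0]: read end, fd[1]: write end */
    char buf[32];

    if (pipe(fd) != 0) return 1;

    if (fork() == 0) {              /* child: producer pushes data one way */
        close(fd[0]);
        const char *msg = "hello";
        write(fd[1], msg, strlen(msg) + 1);
        close(fd[1]);
        return 0;
    }

    close(fd[1]);                   /* parent: consumer receives in order */
    read(fd[0], buf, sizeof buf);
    printf("consumer got: %s\n", buf);
    close(fd[0]);
    return 0;
}
```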

Page 18: Multiprocessors and Thread-Level Parallelism


Communication Models

• Shared memory
– Processors communicate through a shared address space
– Easy on small-scale machines
– Advantages:
• Model of choice for uniprocessors and small-scale MPs
• Ease of programming
• Lower latency
• Easier to use hardware-controlled caching
• Message passing
– Processors have private memories and communicate via messages
– Advantages:
• Less hardware, easier to design
• Focuses attention on costly non-local operations
• Either SW model can be supported on either HW base

Page 19: Multiprocessors and Thread-Level Parallelism


How Do We Develop a Parallel Program?

• Convert a sequential program to a parallel program
– Partition the program into tasks and combine tasks into processes
– Coordinate data accesses, communication, and synchronization
– Map the processes to processors (see the sketch after this list)

• Create a parallel program from scratch
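As a sketch of the partition and coordinate steps, here is one way (our example, in plain pthreads) to split a loop's iteration space into per-thread tasks and combine the results:

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double a[N];
static double partial[NTHREADS];

/* Partition: each thread gets a contiguous slice of the iterations. */
static void *sum_slice(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++) s += a[i];
    partial[id] = s;                 /* coordinate: one writer per slot */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++) a[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)       /* map tasks to threads */
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);
    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {     /* join = synchronization */
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %f\n", total);
    return 0;
}
```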

Page 20: Multiprocessors and Thread-Level Parallelism


Review: Parallel Framework

• Layers:
– Programming Model:
• Multiprogramming: lots of jobs, no communication
• Shared address space: communicate via memory
• Message passing: send and receive messages
• Data parallel: several agents operate on several data sets simultaneously, then exchange information globally and simultaneously (shared or message passing)
– Communication Abstraction:
• Shared address space: e.g., load, store, atomic swap
• Message passing: e.g., send, receive library calls
• Debate over this topic (ease of programming, scaling)

The layer stack, top to bottom: Programming Model, Communication Abstraction, Interconnection SW/OS, Interconnection HW.

Page 21: Multiprocessors and Thread-Level Parallelism


Fundamental Issues

• Fundamental issues in parallel machines

1) Naming

2) Synchronization

3) Latency and Bandwidth

Page 22: Multiprocessors and Thread-Level Parallelism


Fundamental Issue #1: Naming

• Naming:
– what data is shared
– how it is addressed
– what operations can access the data
– how processes refer to each other

• Choice of naming affects the code produced by a compiler: with a load the compiler just remembers an address, whereas with message passing it must keep track of a processor number and a local virtual address

• Choice of naming affects the replication of data: via load into a cache memory hierarchy, or via SW replication and consistency

Page 23: Multiprocessors and Thread-Level Parallelism


Fundamental Issue #1: Naming

• Global physical address space: any processor can generate, address and access it in a single operation

• Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program

• Segmented shared address space: locations are named < process number, address > uniformly for all processes of the parallel program (see the sketch below)

• Independent local physical addresses
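A toy sketch (ours, not from the slides) of what a segmented shared-address name looks like as a data structure:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical: under a segmented shared address space, every shared
   location is named uniformly as <process number, address>. */
typedef struct {
    int       process;  /* which process's segment holds the data */
    uintptr_t address;  /* virtual address within that process */
} global_name;

int main(void) {
    int x = 5;
    global_name n = { 0, (uintptr_t)&x };   /* name x as <0, &x> */
    printf("<process %d, address 0x%lx>\n", n.process,
           (unsigned long)n.address);
    return 0;
}
```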

Page 24: Multiprocessors and Thread-Level Parallelism


Fundamental Issue #2: Synchronization

• To cooperate, processes must coordinate

• Message passing is implicit coordination with transmission or arrival of data

• Shared address system needs additional operations to explicitly coordinate: e.g., write a flag, awaken a thread, interrupt a processor
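For example, a minimal pthreads sketch of the “write a flag, awaken a thread” idiom (names and structure are ours):

```c
#include <pthread.h>
#include <stdio.h>

static int ready = 0;               /* the flag written by one thread */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    pthread_mutex_lock(&m);
    ready = 1;                      /* write a flag ... */
    pthread_cond_signal(&c);        /* ... and awaken the waiting thread */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_mutex_lock(&m);
    while (!ready)                  /* explicit coordination: wait on flag */
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
    puts("consumer proceeds");
    return pthread_join(t, NULL);
}
```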

Page 25: Multiprocessors and Thread-Level Parallelism


Fundamental Issue #3: Latency and Bandwidth

• Bandwidth
– Need high bandwidth in communication
– Cannot scale without limit, but should stay close
– Match the limits in network, memory, and processor
– The overhead to communicate is a problem in many machines

• Latency
– Affects performance, since the processor may have to wait
– Affects ease of programming, since it requires more thought to overlap communication and computation

• Latency hiding
– How can a mechanism help hide latency?
– Examples: overlap message send with computation, prefetch data, switch to other tasks (see the sketch below)
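A minimal sketch of the first of those examples, overlapping a message send with computation using non-blocking MPI (our choice of library; compute_on_local_data is a hypothetical placeholder):

```c
#include <mpi.h>

/* Hypothetical local work done while the message is in flight. */
static void compute_on_local_data(void) { /* ... */ }

int main(int argc, char **argv) {
    int rank, payload = 7;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Start the send, but do not wait for it... */
        MPI_Isend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        compute_on_local_data();            /* ...hide latency with work... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* ...then synchronize. */
    } else if (rank == 1) {
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```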

Page 26: Multiprocessors and Thread-Level Parallelism

ENGS 116 Lecture 15 26

Small-Scale — Shared Memory

• Caches serve to:
– Increase bandwidth versus bus/memory
– Reduce latency of access
– Valuable for both private data and shared data

• What about cache coherence?

[Figure: the same bus-based design as page 10: four processors, each with one or more levels of cache, sharing a bus to main memory and the I/O system.]

Page 27: Multiprocessors and Thread-Level Parallelism


The Problem: Cache Coherency

[Figure: a CPU with a write-back cache holding copies A′ and B′ of memory locations A and B, plus an I/O device. (a) Cache and memory coherent: A′ = A (100) and B′ = B (200). (b) After the CPU stores 550 into A′, cache and memory are incoherent: A′ ≠ A, so an I/O output of A gives the stale value 100. (c) After an I/O input writes 400 into B, cache and memory are incoherent: B′ ≠ B, with B′ stale in the cache.]

Page 28: Multiprocessors and Thread-Level Parallelism


What Does Coherency Mean?

• Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
• Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
• Two rules to ensure this:
– “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order
• The latest write will be seen
• Otherwise a reader could see writes in an illogical order (an older value after a newer one)
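As an illustration of the “older value after a newer one” hazard, a minimal C11 sketch (ours, not from the slides) in which write serialization plus acquire/release ordering guarantee the reader cannot see the flag's new value but the data's old one:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int data = 0;
static atomic_int flag = 0;

/* Writer: the store to 'data' is ordered before the store to 'flag'. */
static void *writer(void *arg) {
    data = 42;
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

/* Reader: once it observes flag == 1, it cannot then read the older
   value data == 0; that would be seeing writes in illogical order. */
static void *reader(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                    /* spin until write is seen */
    printf("reader sees data = %d\n", data); /* guaranteed 42 */
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```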

Page 29: Multiprocessors and Thread-Level Parallelism


The cache-coherence problem for a single memory location (X), read and written by two processors (A and B). This example assumes a write-through cache.

  Time   Event                   Cache contents   Cache contents   Memory contents
                                 for CPU A        for CPU B        for location X
  0                                                                1
  1      CPU A reads X           1                                 1
  2      CPU B reads X           1                1                1
  3      CPU A stores 0 into X   0                1                0

Page 30: Multiprocessors and Thread-Level Parallelism


Potential HW Coherency Solutions

• Snooping solution (snoopy bus):
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at the processors
– Works well with a bus (natural broadcast medium)
– Dominates for small-scale machines (most of the market)
• Directory-based schemes:
– Keep track of what is being shared in one centralized place
– Distributed memory => distributed directory for scalability (avoids bottlenecks)
– Send point-to-point requests to processors via the network
– Scale better than snooping
– Actually existed BEFORE snooping-based schemes

Page 31: Multiprocessors and Thread-Level Parallelism


Basic Snoopy Protocols

• Write-invalidate protocol:
– Multiple readers, single writer
– Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
– Read miss:
• Write-through: memory is always up-to-date
• Write-back: snoop in the caches to find the most recent copy
• Write-broadcast protocol (typically write-through):
– Write to shared data: broadcast on the bus; processors snoop and update any copies
– Read miss: memory is always up-to-date
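A toy simulation (ours; it uses write-through for simplicity, and real protocols track more states, e.g. MESI) of the write-invalidate bookkeeping for a single block in two caches:

```c
#include <stdio.h>

#define NCACHES 2

typedef enum { INVALID, VALID } line_state;

static line_state state[NCACHES];   /* per-cache state for one block */
static int value[NCACHES];          /* cached copy of the block */
static int memory = 0;              /* backing memory for the block */

/* A write by one CPU invalidates every other cached copy. */
static void cpu_write(int cpu, int v) {
    for (int c = 0; c < NCACHES; c++)
        if (c != cpu) state[c] = INVALID;   /* bus invalidate */
    state[cpu] = VALID;
    value[cpu] = v;
    memory = v;                     /* write-through for simplicity */
}

/* A read miss refetches from memory; a hit uses the cached copy. */
static int cpu_read(int cpu) {
    if (state[cpu] == INVALID) {    /* miss: fetch from memory */
        value[cpu] = memory;
        state[cpu] = VALID;
    }
    return value[cpu];
}

int main(void) {
    printf("A reads %d\n", cpu_read(0));
    printf("B reads %d\n", cpu_read(1));
    cpu_write(0, 1);                     /* invalidates B's copy */
    printf("B reads %d\n", cpu_read(1)); /* miss, sees the new value */
    return 0;
}
```

This mirrors the event sequence in the table on the next page.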

Page 32: Multiprocessors and Thread-Level Parallelism


An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches.

  Processor activity      Bus activity          Contents of      Contents of      Contents of memory
                                                CPU A's cache    CPU B's cache    location X
                                                                                  0
  CPU A reads X           Cache miss for X      0                                 0
  CPU B reads X           Cache miss for X      0                0                0
  CPU A writes a 1 to X   Invalidation for X    1                                 0
  CPU B reads X           Cache miss for X      1                1                1

Page 33: Multiprocessors and Thread-Level Parallelism


An example of a write update or broadcast protocol working on a snooping bus for a single cache block (X) with write-back caches.

  Processor activity      Bus activity          Contents of      Contents of      Contents of memory
                                                CPU A's cache    CPU B's cache    location X
                                                                                  0
  CPU A reads X           Cache miss for X      0                                 0
  CPU B reads X           Cache miss for X      0                0                0
  CPU A writes a 1 to X   Write broadcast of X  1                1                1
  CPU B reads X                                 1                1                1

Page 34: Multiprocessors and Thread-Level Parallelism


Basic Snoopy Protocols

• Write-invalidate versus broadcast:
– Invalidate requires one transaction per write-run (writing to one location multiple times by the same CPU)
– Invalidate exploits spatial locality: one transaction per block
– Broadcast has lower latency between write and read