Cs668 Lec7 Interconnection


Transcript of Cs668 Lec7 Interconnection

Page 1: Cs668 Lec7 Interconnection

Lecture 10 Outline

Material from Chapter 2:
  Interconnection networks
  Processor arrays
  Multiprocessors
  Multicomputers
  Flynn's taxonomy

Page 2: Cs668 Lec7 Interconnection

Interconnection Networks

Uses of interconnection networks:
  Connect processors to shared memory
  Connect processors to each other

Interconnection media types:
  Shared medium
  Switched medium

Page 3: Cs668 Lec7 Interconnection

Shared versus Switched Media

Page 4: Cs668 Lec7 Interconnection

Shared Medium

Allows only one message at a time
Messages are broadcast
Each processor "listens" to every message
Arbitration is decentralized
Collisions require resending of messages
Ethernet is an example

Page 5: Cs668 Lec7 Interconnection

Switched Medium

Supports point-to-point messages between pairs of processors

Each processor has its own path to the switch

Advantages over shared media:

Allows multiple messages to be sent simultaneously

Allows scaling of network to accommodate increase in processors

Page 6: Cs668 Lec7 Interconnection

Switch Network Topologies

View a switched network as a graph:
  Vertices = processors or switches
  Edges = communication paths

Two kinds of topologies:
  Direct
  Indirect

Page 7: Cs668 Lec7 Interconnection

Direct Topology

Ratio of switch nodes to processor nodes is 1:1

Every switch node is connected to:
  1 processor node
  At least 1 other switch node

Page 8: Cs668 Lec7 Interconnection

Indirect Topology

Ratio of switch nodes to processor nodes is greater than 1:1

Some switches simply connect other switches

Page 9: Cs668 Lec7 Interconnection

Evaluating Switch Topologies

Diameter
Bisection width
Number of edges per node (degree)
Constant edge length? (yes/no)
Layout area

Page 10: Cs668 Lec7 Interconnection

2-D Mesh Network

Direct topology
Switches arranged into a 2-D lattice
Communication allowed only between neighboring switches
Variants allow wraparound connections between switches on the edge of the mesh

Page 11: Cs668 Lec7 Interconnection

2-D Meshes

Page 12: Cs668 Lec7 Interconnection

Evaluating 2-D Meshes

Diameter: Θ(n^(1/2))

Bisection width: Θ(n^(1/2))

Number of edges per switch: 4

Constant edge length? Yes

Page 13: Cs668 Lec7 Interconnection

Binary Tree Network

Indirect topology
n = 2^d processor nodes, n - 1 switches

Page 14: Cs668 Lec7 Interconnection

Evaluating Binary Tree Network

Diameter: 2 log n

Bisection width: 1

Edges / node: 3

Constant edge length? No

Page 15: Cs668 Lec7 Interconnection

Hypertree Network

Indirect topology
Shares the low diameter of a binary tree
Greatly improves bisection width
From the "front" looks like a k-ary tree of height d
From the "side" looks like an upside-down binary tree of height d

Page 16: Cs668 Lec7 Interconnection

Hypertree Network

Page 17: Cs668 Lec7 Interconnection

Evaluating 4-ary Hypertree

Diameter: log n

Bisection width: n / 2

Edges / node: 6

Constant edge length? No

Page 18: Cs668 Lec7 Interconnection

Butterfly Network

Indirect topology
n = 2^d processor nodes connected by n(log n + 1) switching nodes

Figure: a butterfly network for n = 8 processor nodes (columns 0 through 7), with switching nodes labeled (rank, column) for ranks 0 through 3.
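As a quick check on the switch count (a sketch of mine, not from the slides): with d = 3 there are n = 8 processor nodes and n(log n + 1) = 8 x 4 = 32 switching nodes, one rank of 8 switches for each of the ranks 0 through 3 shown in the figure.

def butterfly_switches(d):
    n = 2 ** d                  # number of processor nodes
    return n * (d + 1)          # n switches in each of the d + 1 ranks

print(butterfly_switches(3))    # 32 switching nodes for 8 processors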

Page 19: Cs668 Lec7 Interconnection

Butterfly Network Routing

Page 20: Cs668 Lec7 Interconnection

Evaluating Butterfly Network

Diameter: log n

Bisection width: n / 2

Edges per node: 4

Constant edge length? No

Page 21: Cs668 Lec7 Interconnection

Hypercube

Direct topology
2 x 2 x … x 2 mesh
Number of nodes a power of 2
Node addresses 0, 1, …, 2^k - 1
Node i connected to k nodes whose addresses differ from i in exactly one bit position
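A minimal Python sketch (mine, not from the slides) that lists the neighbors of node i in a k-dimensional hypercube by flipping each of the k bits of its address:

def hypercube_neighbors(i, k):
    # Each neighbor's address differs from i in exactly one bit position.
    return [i ^ (1 << b) for b in range(k)]

# In a 4-dimensional hypercube (16 nodes), node 0000 is connected to
# 0001, 0010, 0100, and 1000.
print([format(j, "04b") for j in hypercube_neighbors(0, 4)])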

Page 22: Cs668 Lec7 Interconnection

Hypercube Addressing

Figure: a 4-dimensional hypercube whose 16 nodes are labeled with the 4-bit addresses 0000 through 1111; each edge joins two addresses that differ in exactly one bit.

Page 23: Cs668 Lec7 Interconnection

Hypercubes Illustrated

Page 24: Cs668 Lec7 Interconnection

Evaluating Hypercube Network

Diameter: log n

Bisection width: n / 2

Edges per node: log n

Constant edge length? No

Page 25: Cs668 Lec7 Interconnection

Shuffle-exchange

Direct topology
Number of nodes a power of 2
Nodes have addresses 0, 1, …, 2^k - 1
Two outgoing links from node i:
  Shuffle link to node LeftCycle(i)
  Exchange link to node xor(i, 1)
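A minimal Python sketch of the two links (mine, not from the slides): the shuffle link rotates the k-bit address one position to the left, and the exchange link flips the lowest bit.

def shuffle_link(i, k):
    # LeftCycle: left cyclic rotation of node i's k-bit address.
    return ((i << 1) | (i >> (k - 1))) & ((1 << k) - 1)

def exchange_link(i):
    # xor(i, 1): flip the least significant address bit.
    return i ^ 1

# For k = 3 (8 nodes): node 5 = 101 shuffles to 011 = 3 and exchanges to 100 = 4.
print(shuffle_link(5, 3), exchange_link(5))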

Page 26: Cs668 Lec7 Interconnection

Shuffle-exchange Illustrated

Figure: a shuffle-exchange network on 8 nodes numbered 0 through 7.

Page 27: Cs668 Lec7 Interconnection

Shuffle-exchange Addressing

Figure: a shuffle-exchange network whose nodes are labeled with their 4-bit binary addresses, 0000 through 1111.

Page 28: Cs668 Lec7 Interconnection

Evaluating Shuffle-exchange

Diameter: 2 log n - 1

Bisection width: n / log n

Edges per node: 2

Constant edge length? No

Page 29: Cs668 Lec7 Interconnection

Comparing Networks

All have logarithmic diameter except the 2-D mesh

Hypertree, butterfly, and hypercube have bisection width n / 2

All have a constant number of edges per node except the hypercube

Only the 2-D mesh keeps edge lengths constant as the network size increases
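A small Python sketch (mine, not from the slides) that tabulates the metrics above as a function of n, using the formulas from the preceding evaluation slides; the 2-D mesh figures assume a square sqrt(n) x sqrt(n) mesh without wraparound, and the shuffle-exchange bisection width is the approximate n / log n.

import math

def network_metrics(n):
    d = int(math.log2(n))                    # d = log2(n)
    s = math.isqrt(n)                        # sqrt(n) for a square mesh
    return [
        # (topology, diameter, bisection width, edges per node)
        ("2-D mesh",         2 * (s - 1), s,      4),
        ("binary tree",      2 * d,       1,      3),
        ("4-ary hypertree",  d,           n // 2, 6),
        ("butterfly",        d,           n // 2, 4),
        ("hypercube",        d,           n // 2, d),
        ("shuffle-exchange", 2 * d - 1,   n // d, 2),
    ]

for name, diam, bw, deg in network_metrics(64):
    print(f"{name:17} diameter={diam:3} bisection={bw:3} edges/node={deg}")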

Page 30: Cs668 Lec7 Interconnection

Vector Computers

Vector computer: instruction set includes operations on vectors as well as scalars

Two ways to implement vector computers:
  Pipelined vector processor: streams data through pipelined arithmetic units
  Processor array: many identical, synchronized arithmetic processing elements

Page 31: Cs668 Lec7 Interconnection

Why Processor Arrays?

Historically, high cost of a control unit
Scientific applications have data parallelism

Page 32: Cs668 Lec7 Interconnection

Processor Array

Page 33: Cs668 Lec7 Interconnection

Data/instruction Storage

Front-end computer:
  Program
  Data manipulated sequentially

Processor array:
  Data manipulated in parallel

Page 34: Cs668 Lec7 Interconnection

Processor Array Performance

Performance: work done per unit time
Performance of a processor array depends on:
  Speed of processing elements
  Utilization of processing elements

Page 35: Cs668 Lec7 Interconnection

Performance Example 1

1024 processors
Each adds a pair of integers in 1 μsec
What is the performance when adding two 1024-element vectors (one element per processor)?

Performance = 1024 operations / 1 μsec = 1.024 × 10^9 ops/sec

Page 36: Cs668 Lec7 Interconnection

Performance Example 2

512 processors
Each adds two integers in 1 μsec
What is the performance when adding two vectors of length 600?

Performance = 600 operations / 2 μsec = 3 × 10^8 ops/sec
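A short Python sketch of the reasoning behind both examples (mine, not from the slides): with p processing elements each doing one addition per microsecond, an n-element vector addition needs ceil(n / p) passes.

import math

def array_performance(p, n, usec_per_op=1.0):
    """Ops/sec of a p-element processor array adding two n-element vectors."""
    passes = math.ceil(n / p)              # rounds of additions needed
    elapsed = passes * usec_per_op * 1e-6  # elapsed time in seconds
    return n / elapsed                     # operations per second

print(array_performance(1024, 1024))       # 1.024e9 ops/sec (Example 1)
print(array_performance(512, 600))         # 3.0e8 ops/sec   (Example 2)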

Page 37: Cs668 Lec7 Interconnection

2-D Processor Interconnection Network

Each VLSI chip has 16 processing elements

Page 38: Cs668 Lec7 Interconnection

if (COND) then A else B

Page 39: Cs668 Lec7 Interconnection

if (COND) then A else B

Page 40: Cs668 Lec7 Interconnection

if (COND) then A else B
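These three slides step through how a processor array executes a branch: every processing element evaluates COND, then A runs only on the elements where COND is true while the rest idle, and finally B runs only on the elements where COND is false. A rough NumPy sketch of that masked execution (my illustration; the arrays and operations are made up):

import numpy as np

x = np.array([3, -1, 4, -5])         # one value per processing element
cond = x > 0                         # every PE evaluates COND in lockstep

result = np.empty_like(x)
result[cond] = x[cond] * 2           # phase 1: only PEs with COND true execute A
result[~cond] = -x[~cond]            # phase 2: only PEs with COND false execute B
print(result)                        # [6 1 8 5]

During each phase the masked-off elements sit idle, which is why the next slide lists conditionally executed code as a drag on processor array speed.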

Page 41: Cs668 Lec7 Interconnection

Processor Array Shortcomings

Not all problems are data-parallel
Speed drops for conditionally executed code
Don't adapt to multiple users well
Do not scale down well to "starter" systems
Rely on custom VLSI for processors
Expense of control units has dropped

Page 42: Cs668 Lec7 Interconnection

Multiprocessors

Multiprocessor: multiple-CPU computer with a shared memory

Same address on two different CPUs refers to the same memory location

Avoid three problems of processor arrays:
  Can be built from commodity CPUs
  Naturally support multiple users
  Maintain efficiency in conditional code

Page 43: Cs668 Lec7 Interconnection

Centralized Multiprocessor

Straightforward extension of the uniprocessor
Add CPUs to the bus
All processors share the same primary memory
Memory access time is the same for all CPUs

Uniform memory access (UMA) multiprocessor
Symmetrical multiprocessor (SMP)

Page 44: Cs668 Lec7 Interconnection

Centralized Multiprocessor

Page 45: Cs668 Lec7 Interconnection

Private and Shared Data

Private data: items used only by a single processor

Shared data: values used by multiple processors

In a multiprocessor, processors communicate via shared data values

Page 46: Cs668 Lec7 Interconnection

Problems Associated with Shared Data

Cache coherence:
  Replicating data across multiple caches reduces contention
  How to ensure different processors have the same value for the same address?

Synchronization:
  Mutual exclusion
  Barrier
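The two synchronization mechanisms named above can be sketched with Python's standard threading module (my illustration, not from the slides): a lock enforces mutual exclusion on shared data, and a barrier holds every thread until all of them reach the same point.

import threading

N = 4
counter = 0                               # shared data
lock = threading.Lock()                   # mutual exclusion
barrier = threading.Barrier(N)            # all N threads must arrive before any continues

def worker():
    global counter
    with lock:                            # one thread at a time updates the shared value
        counter += 1
    barrier.wait()                        # wait for every thread to finish its update

threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                            # always N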

Page 47: Cs668 Lec7 Interconnection

Cache-coherence Problem

Two CPUs, A and B, each with a private cache, share a memory in which X = 7; neither cache holds a copy of X.

Page 48: Cs668 Lec7 Interconnection

Cache-coherence Problem

CPU A reads X; its cache now holds the value 7 (memory still has X = 7).

Page 49: Cs668 Lec7 Interconnection

Cache-coherence Problem

CPU B also reads X; both caches now hold 7.

Page 50: Cs668 Lec7 Interconnection

Cache-coherence Problem

CPU B writes 2 to X; memory is updated to 2, but CPU A's cache still holds the stale value 7, so the two caches disagree.

Page 51: Cs668 Lec7 Interconnection

Write Invalidate Protocol

CPU A and CPU B both hold X = 7 in their caches and memory has X = 7; each cache has a cache control monitor watching the shared bus.

Page 52: Cs668 Lec7 Interconnection

Write Invalidate Protocol

CPU A broadcasts its intent to write X on the bus.

Page 53: Cs668 Lec7 Interconnection

Write Invalidate Protocol

CPU B's cache controller sees the broadcast and invalidates its copy of X; only CPU A still holds 7.

Page 54: Cs668 Lec7 Interconnection

Write Invalidate Protocol

CPU A then writes X = 2; since every other cached copy was invalidated, no processor can read a stale value.
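A toy Python sketch of the write-invalidate idea (my simplification, not the slides' protocol specification): before a CPU writes a cached block it broadcasts its intent on the bus, and every other cache drops its copy of that block, so no stale value can be read afterwards.

class SnoopyCache:
    def __init__(self, bus):
        self.data = {}                       # address -> cached value
        self.bus = bus
        bus.append(self)                     # every cache listens on the shared bus

    def read(self, addr, memory):
        if addr not in self.data:            # read miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]

    def write(self, addr, value, memory):
        for cache in self.bus:               # broadcast "intent to write addr"
            if cache is not self:
                cache.data.pop(addr, None)   # other caches invalidate their copies
        self.data[addr] = value
        memory[addr] = value                 # write-through kept for simplicity

bus, memory = [], {"X": 7}
a, b = SnoopyCache(bus), SnoopyCache(bus)
a.read("X", memory); b.read("X", memory)     # both caches hold X = 7
a.write("X", 2, memory)                      # B's copy is invalidated
print(b.read("X", memory))                   # B misses and re-reads 2, not stale 7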

Page 55: Cs668 Lec7 Interconnection

Distributed Multiprocessor

Distribute primary memory among processors

Increase aggregate memory bandwidth and lower average memory access time

Allow greater number of processors
Also called a non-uniform memory access (NUMA) multiprocessor

Page 56: Cs668 Lec7 Interconnection

Distributed Multiprocessor

Page 57: Cs668 Lec7 Interconnection

Cache Coherence

Some NUMA multiprocessors do not support it in hardware:
  Only instructions and private data are cached
  Large memory access time variance

Implementation is more difficult:
  No shared memory bus to "snoop"
  Directory-based protocol needed

Page 58: Cs668 Lec7 Interconnection

Directory-based Protocol

Distributed directory contains information about cacheable memory blocks

One directory entry for each cache block
Each entry has:
  Sharing status
  Which processors have copies

Page 59: Cs668 Lec7 Interconnection

Sharing Status

Uncached:
  Block not in any processor's cache
Shared:
  Cached by one or more processors
  Read only
Exclusive:
  Cached by exactly one processor
  Processor has written the block
  Copy in memory is obsolete
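A minimal Python sketch (mine, not from the slides) of a directory entry with the sharing status and bit vector just described, together with the two miss types that the walkthrough on the following slides exercises:

class DirectoryEntry:
    def __init__(self, num_cpus):
        self.status = "U"                        # U = uncached, S = shared, E = exclusive
        self.sharers = [0] * num_cpus            # bit vector: which CPUs hold a copy

    def read_miss(self, cpu):
        self.status = "S"                        # reader gets a shared copy
        self.sharers[cpu] = 1

    def write_miss(self, cpu):
        self.sharers = [0] * len(self.sharers)   # other copies get invalidated
        self.sharers[cpu] = 1
        self.status = "E"                        # writer becomes the exclusive owner

entry = DirectoryEntry(3)
entry.read_miss(0)                               # CPU 0 reads X:  S, 1 0 0
entry.read_miss(2)                               # CPU 2 reads X:  S, 1 0 1
entry.write_miss(0)                              # CPU 0 writes X: E, 1 0 0
print(entry.status, entry.sharers)

A read miss on a block that is Exclusive (the "CPU 1 Reads X" case below) would additionally make the owner write the block back and switch to Shared; that step is omitted here for brevity.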

Page 60: Cs668 Lec7 Interconnection

Directory-based Protocol

Figure: three nodes (CPU 0, CPU 1, CPU 2), each with its own cache, local memory, and directory, joined by an interconnection network.

Page 61: Cs668 Lec7 Interconnection

Directory-based Protocol

Initial state: memory holds X = 7, no cache has a copy, and the directory entry for X is U (uncached) with bit vector 0 0 0.

Page 62: Cs668 Lec7 Interconnection

CPU 0 Reads X

CPU 0 has a read miss on X and sends a read-miss message to the directory; the entry is still X U 0 0 0.

Page 63: Cs668 Lec7 Interconnection

CPU 0 Reads X

The directory entry for X becomes S (shared) with bit vector 1 0 0.

Page 64: Cs668 Lec7 Interconnection

CPU 0 Reads X

The block containing X = 7 is delivered to CPU 0's cache; the directory entry remains X S 1 0 0.

Page 65: Cs668 Lec7 Interconnection

CPU 2 Reads X

CPU 2 has a read miss on X and sends a read-miss message to the directory.

Page 66: Cs668 Lec7 Interconnection

CPU 2 Reads X

The directory entry for X becomes S with bit vector 1 0 1.

Page 67: Cs668 Lec7 Interconnection

CPU 2 Reads X

The block with X = 7 is delivered to CPU 2's cache; CPU 0 and CPU 2 now both hold shared copies.

Page 68: Cs668 Lec7 Interconnection

CPU 0 Writes 6 to X

CPU 0 sends a write-miss message to the directory (entry still X S 1 0 1).

Page 69: Cs668 Lec7 Interconnection

CPU 0 Writes 6 to X

The directory sends an invalidate message to CPU 2, which drops its copy of X.

Page 70: Cs668 Lec7 Interconnection

CPU 0 Writes 6 to X

CPU 0 writes 6 into its cached copy; the directory entry becomes E (exclusive) with bit vector 1 0 0, while memory still holds the old value 7.

Page 71: Cs668 Lec7 Interconnection

CPU 1 Reads X

CPU 1 sends a read-miss message to the directory; X is exclusive to CPU 0 (X E 1 0 0) and memory still has the stale value 7.

Page 72: Cs668 Lec7 Interconnection

CPU 1 Reads X

The directory asks CPU 0 to switch its copy of X to the shared state and supply the current value.

Page 73: Cs668 Lec7 Interconnection

CPU 1 Reads X

CPU 0 supplies the up-to-date value, and memory is updated to X = 6.

Page 74: Cs668 Lec7 Interconnection

CPU 1 Reads X

CPU 1 receives the block; the directory entry becomes S with bit vector 1 1 0, and CPU 0 and CPU 1 both cache X = 6.

Page 75: Cs668 Lec7 Interconnection

CPU 2 Writes 5 to X

CPU 2 sends a write-miss message to the directory (entry X S 1 1 0).

Page 76: Cs668 Lec7 Interconnection

CPU 2 Writes 5 to X

The directory sends invalidate messages to CPU 0 and CPU 1, which drop their copies.

Page 77: Cs668 Lec7 Interconnection

CPU 2 Writes 5 to X

CPU 2 writes 5 into its cache; the directory entry becomes E with bit vector 0 0 1, while memory still holds 6.

Page 78: Cs668 Lec7 Interconnection

CPU 0 Writes 4 to X

CPU 0 sends a write-miss message to the directory; X is currently exclusive to CPU 2.

Page 79: Cs668 Lec7 Interconnection

CPU 0 Writes 4 to X

The directory tells CPU 2 to give up its exclusive copy of X (take away).

Page 80: Cs668 Lec7 Interconnection

CPU 0 Writes 4 to X

CPU 2 writes the current value 5 back, updating memory, and relinquishes its copy.

Page 81: Cs668 Lec7 Interconnection

CPU 0 Writes 4 to X

The directory entry for X becomes E with bit vector 1 0 0: CPU 0 is now the owner.

Page 82: Cs668 Lec7 Interconnection

CPU 0 Writes 4 to X

The block (still holding the value 5) is delivered to CPU 0's cache.

Page 83: Cs668 Lec7 Interconnection

CPU 0 Writes 4 to X

CPU 0 writes 4 into its cached copy; memory still holds the old value 5.

Page 84: Cs668 Lec7 Interconnection

CPU 0 Writes Back X Block

CPU 0 writes the block back to memory (data write back), so memory now holds X = 4.

Page 85: Cs668 Lec7 Interconnection

CPU 0 Writes Back X Block

After the write back, memory holds X = 4 and the directory entry returns to U (uncached) with bit vector 0 0 0.

Page 86: Cs668 Lec7 Interconnection

Multicomputer

Distributed-memory multiple-CPU computer
Same address on different processors refers to different physical memory locations
Processors interact through message passing
Commercial multicomputers
Commodity clusters
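A minimal message-passing sketch (my illustration, not from the slides) using mpi4py, a Python MPI binding that I am assuming is available: each process owns its memory privately, and a value moves between processes only through an explicit send and receive.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    x = 7                            # exists only in process 0's local memory
    comm.send(x, dest=1, tag=0)      # explicit message carries the value
elif rank == 1:
    x = comm.recv(source=0, tag=0)   # process 1 receives its own private copy
    print("process 1 received", x)

# Launched with something like: mpiexec -n 2 python example.py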

Page 87: Cs668 Lec7 Interconnection

Asymmetrical Multicomputer

Page 88: Cs668 Lec7 Interconnection

Asymmetrical MC Advantages

Back-end processors dedicated to parallel computations
Easier to understand, model, and tune performance
Only a simple back-end operating system needed
Easy for a vendor to create

Page 89: Cs668 Lec7 Interconnection

Asymmetrical MC Disadvantages

Front-end computer is a single point of failure
Single front-end computer limits scalability of the system
Primitive operating system in back-end processors makes debugging difficult
Every application requires development of both a front-end and a back-end program

Page 90: Cs668 Lec7 Interconnection

Symmetrical Multicomputer

Page 91: Cs668 Lec7 Interconnection

Symmetrical MC Advantages

Alleviate performance bottleneck caused by single front-end computer

Better support for debugging
Every processor executes the same program

Page 92: Cs668 Lec7 Interconnection

Symmetrical MC Disadvantages

More difficult to maintain illusion of single “parallel computer”

No simple way to balance program development workload among processors

More difficult to achieve high performance when multiple processes run on each processor

Page 93: Cs668 Lec7 Interconnection

ParPar Cluster, A Mixed Model

Page 94: Cs668 Lec7 Interconnection

Commodity Cluster

Co-located computers
Dedicated to running parallel jobs
No keyboards or displays
Identical operating system
Identical local disk images
Administered as an entity

Page 95: Cs668 Lec7 Interconnection

Network of Workstations

Dispersed computers
First priority: person at keyboard
Parallel jobs run in background
Different operating systems
Different local images
Checkpointing and restarting important

Page 96: Cs668 Lec7 Interconnection

Flynn’s Taxonomy

Instruction stream
Data stream
Single vs. multiple
Four combinations:

SISD SIMD MISD MIMD

Page 97: Cs668 Lec7 Interconnection

SISD

Single Instruction, Single Data
Single-CPU systems
Note: co-processors don't count
  Functional
  I/O
Example: PCs

Page 98: Cs668 Lec7 Interconnection

SIMD

Single Instruction, Multiple Data
Two architectures fit this category:
  Pipelined vector processor (e.g., Cray-1)
  Processor array (e.g., Connection Machine)

Page 99: Cs668 Lec7 Interconnection

MISD

Multiple Instruction, Single Data

Example: systolic array

Page 100: Cs668 Lec7 Interconnection

MIMD

Multiple Instruction, Multiple Data
Multiple-CPU computers:
  Multiprocessors
  Multicomputers

Page 101: Cs668 Lec7 Interconnection

Summary

Commercial parallel computers appeared in the 1980s

Multiple-CPU computers now dominate
Small-scale: centralized multiprocessors
Large-scale: distributed-memory architectures (multiprocessors or multicomputers)