

EECS 570: Fall 2003 -- rev3 1

EECS 570

• Notes on Chapter 1 – Introduction
  – What is Parallel Architecture?
  – Evolution and convergence of Parallel Architectures
  – Fundamental Design Issues

• Acknowledgements
  – Slides are derived from work by Steve Reinhardt (Michigan), Mark Hill (Wisconsin), Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), and J. P. Singh (Princeton). Many Thanks!


EECS 570: Fall 2003 -- rev3 2

What is Parallel Architecture?

• parallel – OED
• parallel /ˈpærəlɛl/, a. and sb.
• 2. d. Computers. Involving the concurrent or simultaneous performance of certain operations; functioning in this way.
  – 1948 Math. Tables & Other Aids to Computation III. 149 The use of plugboard facilities and punched cards permits parallel operation (as distinguished from sequence operation), with further gain in efficiency.
  – 1963 W. H. Ware Digital Computer Technol. & Design II. xi. 3 Parallel arithmetic tends to be faster than serial arithmetic because it performs operations in all columns at once, rather than in one column at a time.
  – 1974 P. H. Enslow Multiprocessors & Parallel Processing i. 1 This book focuses on.. the integration of multiple functional units into a multiprocessing or parallel processing system.


EECS 570: Fall 2003 -- rev3 3

Spectrum of Parallelism

• Key differences
  – granularity of operations
  – frequency/overhead of communication
  – degree of parallelism
  – source of parallelism
    • data vs. control
    • parts of a larger task vs. independent tasks
    • source of decomposition (hardware, compiler, programmer, OS …)

[Figure: spectrum from serial, pipelining, superscalar, and VLIW through multithreading and multiprocessing to distributed systems, with EECS 370, 470, 570, and 591 covering successive regions.]


EECS 570: Fall 2003 -- rev3 4

Course Focus: Multithreading & Multiprocessing

• High end: many applications where even a fast CPU isn’t enough
  – Scientific computing: aerodynamics, crash simulation, drug design, weather prediction, materials, …
  – General-purpose computing: graphics, databases, web servers, data mining, financial modeling, …

• Low end:
  – cheap microprocessors make small MPs affordable

• Future:
  – Chip multiprocessors (almost here) – CMPs
  – Multithreading supported on the Pentium 4
    • see http://www.intel.com/homepage/land/hyperthreading_more.htm


EECS 570: Fall 2003 -- rev3 5

Motivation

• N processors in a computer can provide:
  – Higher Throughput via many jobs in parallel
    • individual jobs no faster
  – Cost-Effectiveness may improve: users share a central resource
  – Lower Latency
    • from shrink-wrapped software (e.g., Photoshop™)
    • from parallelizing your application (but this is hard)
    • from reduced queuing delays

• Need something faster than today’s microprocessor?
  – Wait for tomorrow’s microprocessor
  – Use many microprocessors in parallel


EECS 570: Fall 2003 -- rev3 6

Historical Perspective

• The end of uniprocessor performance growth has frequently been predicted, due to fundamental limits
  – spurred work in parallel processing – cf. Thornton’s arguments for ILP in the CDC 6600: http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/cdc.html

• No common parallel programming model
  – unlike the von Neumann model
  – many models: data parallel, shared memory, message passing, dataflow, systolic, graph reduction, declarative logic
  – no pre-existing software to target
  – no common building blocks – high-performance micros are changing this
  – result: lots of one-of-a-kind architectures with no software base
  – architecture defines the programming model


EECS 570: Fall 2003 -- rev3 7

What’s different today?

• insurmountable handicap to build on anything other than the commodity microprocessor
  – Amdahl’s law

• favorable performance per $
  – general-purpose processors enjoy volume production

• small-scale bus-based shared memory is well understood
  – P6 (Pentium Pro/II/III) supports 4-way “glueless” MP
  – supported by most common OSs (e.g., NT, Solaris, Linux)

Key: a microprocessor is now the fastest uniprocessor you can build


EECS 570: Fall 2003 -- rev3 8

Technology Trends

The natural building block for multiprocessors – the microprocessor – is now also about the fastest uniprocessor you can build


EECS 570: Fall 2003 -- rev3 9

What's different today? (cont'd)

• Meanwhile, programming models have converged to a few:
  – shared memory (better: shared address space)
  – message passing
  – data parallel (compiler maps to one of the above)
  – data flow (more as concept than model)

• Result: a parallel system is microprocessors + memory + interconnection network

• Still many design issues to consider


EECS 570: Fall 2003 -- rev3 10

Parallel Architecture Today

• Key: abstractions & interfaces for communication and cooperation
  – Communication Architecture
  – equivalent to the Instruction Set Architecture for uniprocessors

• Must consider
  – Usability (programmability) & performance
  – Feasibility/complexity of implementation (hw or sw)

• Compilers, libraries, and the OS are important bridges today


EECS 570: Fall 2003 -- rev3 11

Modern Layered Framework

[Figure omitted in transcript: layered framework – applications and programming models above a communication abstraction (the user/system boundary), supported by compilers/libraries and the OS, over communication hardware.]


EECS 570: Fall 2003 -- rev3 12

Survey of Programming Models

• Shared Address Space• Message Passing• Data Parallel• Others:

– Dataflow– Systolic Arrays (see text)

Examine programming model, motivation, intended applications, and contributions to convergence


EECS 570: Fall 2003 -- rev3 13

Simple Example

int i;
double a, x[N], y[N], z[N], sum;

/* input a, x[], y[] */

sum = 0;
for (i = 0; i < N; ++i) {
    z[i] = a * x[i] + y[i];
    sum += z[i];
}


EECS 570: Fall 2003 -- rev3 14

Dataflow Graph

[Figure: dataflow graph of the example – for each i, a node multiplies A by X[i], an add node sums in Y[i], and the resulting Z[i] values feed further + nodes that accumulate the total.]

2 + (N-1) cycles to execute on N processors – what assumptions?
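One way to justify that count (our accounting, one possible answer to the slide’s question – it assumes one single-cycle operation per processor per cycle, free operand delivery, and a serial accumulation chain for sum):

$T = \underbrace{1}_{\text{all } A \cdot X[i]} + \underbrace{1}_{\text{all } +\,Y[i]} + \underbrace{N-1}_{\text{serial sum chain}} = 2 + (N-1)\ \text{cycles}$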


EECS 570: Fall 2003 -- rev3 15

Shared Address Space Architectures

Any processor can directly reference any memory location
– Communication occurs implicitly as a result of loads and stores
– Need additional synchronization operations

Convenient:
– Location transparency
– Similar programming model to time-sharing on uniprocessors
  • Except processes run on different processors
  • Good throughput on multiprogrammed workloads

Within one process (“lightweight” threads): all memory shared
Among processes: portions of address spaces shared (mmap, shmat; see the sketch below)
– In either case, variables may be logically global or per-thread
– Popularly known as shared memory machines or model
– Ambiguous: memory may be physically distributed among processors
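A minimal, hedged illustration (ours, not from the original notes) of the process-level sharing via mmap mentioned above – the program and its names are an assumption, but the system calls are standard POSIX:

/* Sketch: share one page between parent and child via mmap.
   MAP_SHARED | MAP_ANONYMOUS makes both processes reference the same
   physical memory; error handling omitted for brevity. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    double *shared = mmap(NULL, sizeof(double), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *shared = 0.0;
    if (fork() == 0) {      /* child: communication is just a store */
        *shared = 42.0;
        _exit(0);
    }
    wait(NULL);             /* crude synchronization: wait for the child */
    printf("parent sees %g\n", *shared);    /* prints 42 */
    return 0;
}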


EECS 570: Fall 2003 -- rev3 16

Small-scale Implementation

Natural extension of uniprocessor: already have processor, memory modules, and I/O controllers on an interconnect of some sort
– typically a bus
– may be a crossbar (mainframes)
– occasionally a multistage network (vector machines, ??)

Just add processors!

[Figure: processors, memory modules, and I/O controllers (with attached I/O devices) all connected to a shared interconnect.]


EECS 570: Fall 2003 -- rev3 17

Simple Example: SAS version

/* per-thread */ int i, my_start, my_end, me;
/* global */     double a, x[N], y[N], z[N], sum;

/* my_start, my_end based on N, # nodes */
for (i = my_start; i < my_end; ++i)
    z[i] = a * x[i] + y[i];

BARRIER;
if (me == 0) {
    sum = 0;
    for (i = 0; i < N; ++i)
        sum += z[i];
}
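BARRIER, me, and the loop bounds are left abstract above. A hedged sketch of one way to realize them with POSIX threads (our code; assumes pthread barrier support and that P divides N):

/* SAS version fleshed out with POSIX threads. */
#include <pthread.h>
#include <stdio.h>

#define N 1024
#define P 4

double a = 2.0, x[N], y[N], z[N], sum;   /* globals are shared */
pthread_barrier_t bar;                   /* the slide's BARRIER */

void *work(void *arg) {
    int me = (int)(long)arg;
    int my_start = me * (N / P), my_end = my_start + N / P;
    for (int i = my_start; i < my_end; ++i)
        z[i] = a * x[i] + y[i];
    pthread_barrier_wait(&bar);          /* wait until all z[i] are written */
    if (me == 0) {                       /* thread 0 does the serial sum */
        sum = 0;
        for (int i = 0; i < N; ++i)
            sum += z[i];
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    pthread_barrier_init(&bar, NULL, P);
    for (long p = 0; p < P; ++p)
        pthread_create(&t[p], NULL, work, (void *)p);
    for (int p = 0; p < P; ++p)
        pthread_join(t[p], NULL);
    printf("sum = %g\n", sum);
    return 0;
}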


EECS 570: Fall 2003 -- rev3 18

Message Passing Architectures

Complete computer as building block
– ld/st access only the private address space (local memory)
– Communication via explicit I/O operations (send/receive)
  • Specify local memory buffers
– Synchronization implicit in messages

Programming interface often more removed from basic hardware
– Library and/or OS intervention

Biggest machines are of this sort
– IBM SP/2
– DoE ASCI program
– Clusters (Beowulf etc.)


EECS 570: Fall 2003 -- rev3 19

Simple Example: MP version

int i, me;
double a, x[N/P], y[N/P], z[N/P], sum, tmp;

sum = 0;
for (i = 0; i < N/P; ++i) {
    z[i] = a * x[i] + y[i];
    sum += z[i];
}

if (me != 0)
    send(sum, 0);
else
    for (i = 1; i < P; ++i) {   /* node 0 collects the other P-1 partial sums */
        recv(tmp, i);
        sum += tmp;
    }
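send and recv above are abstract primitives; a hedged MPI rendering of the same program (ours – it assumes a standard MPI installation and that each rank’s x and y arrive initialized):

/* MP version in MPI: each rank owns N/P elements. */
#include <mpi.h>
#include <stdio.h>

#define NP 256   /* elements per rank, i.e., N/P */

int main(int argc, char **argv) {
    int me, P, i;
    double a = 2.0, x[NP] = {0}, y[NP] = {0}, z[NP], sum = 0.0, tmp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    for (i = 0; i < NP; ++i) {      /* purely local computation */
        z[i] = a * x[i] + y[i];
        sum += z[i];
    }

    if (me != 0) {                  /* the slide's send(sum, 0) */
        MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {                        /* the slide's recv(tmp, i) loop */
        for (i = 1; i < P; ++i) {
            MPI_Recv(&tmp, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += tmp;
        }
        printf("sum = %g\n", sum);
    }
    MPI_Finalize();
    return 0;
}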


EECS 570: Fall 2003 -- rev3 20

Convergence: Scaling Up SAS

• Problem is interconnect: cost (crossbar) or bandwidth (bus)
• Distributed memory or non-uniform memory access (NUMA)
  – “Communication assist” turns non-local accesses into simple message transactions (e.g., read-request, read-response)
  – issues: cache coherence, remote memory latency
  – MP HW specialized for read/write requests

[Figure: two organizations – “Dance hall”: processor/cache nodes on one side of the interconnect, all memory modules on the other; “Distributed memory”: a memory module local to each processor/cache node, with the nodes joined by the interconnect.]


EECS 570: Fall 2003 -- rev3 21

Separation of Architecture from Model

At the lowest level, SM sends messages
– HW is specialized to expedite read/write messages

What programming model/abstraction is supported at user level?

Can I have shared-memory abstraction on message passing HW?

Can I have message passing abstraction on shared memory HW?

Recent research machines integrate both– Alewife, Tempest/Typhoon, FLASH


EECS 570: Fall 2003 -- rev3 22

Data Parallel Systems

Programming model
– Operations performed in parallel on each element of a data structure
– Logically single thread of control, performs sequential or parallel steps
– Synchronization implicit in sequencing
– Conceptually, a processor associated with each data element

Architectural model
– Array of many simple, cheap processing elements (PEs, really just datapaths) with no instruction memory and little data memory each
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synch

[Figure: control processor driving a grid of PEs.]


EECS 570: Fall 2003 -- rev3 23

Simple Example: DP version

double a, x[N], y[N], z[N], sum;

z = a * x + y;
sum = reduce(+, z);

Language supports array assignment, global operations

Other examples: Document searching, image processing, ...

Some recent (within last decade+) machines:
– Thinking Machines CM-1, CM-2 (and CM-5)
– MasPar MP-1 and MP-2
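As a hedged illustration of how such a high-level program maps onto a shared-memory machine today (our sketch in OpenMP, which postdates these machines; the pragma is the compiler’s cue to parallelize and reduce):

/* DP version mapped to SAS: array op plus reduction via OpenMP. */
#include <stdio.h>

#define N 1024

static double a = 2.0, x[N], y[N], z[N];

int main(void) {
    double sum = 0.0;
    /* z = a * x + y;  sum = reduce(+, z); */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        z[i] = a * x[i] + y[i];
        sum += z[i];
    }
    printf("sum = %g\n", sum);
    return 0;
}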


EECS 570: Fall 2003 -- rev3 24

DP Convergence with SAS/MP

Popular when the cost savings of a centralized sequencer were high
– 60s, when a CPU was a cabinet
– Replaced by vectors in the mid-70s
  • More flexible w.r.t. memory layout and easier to manage
• Revived in mid-80s when a datapath just fit on a chip (w/o control)
• No longer true with modern microprocessors

Other reasons for demise
• DP applications are simple, regular
  – relatively easy targets for compilers
  – easy to partition across a relatively small # of microprocessors
• MIMD machines effective for these apps plus many others

Contributions to convergence
• utility of fast global synchronization, reductions, etc.
• high-level model that can compile to either platform


EECS 570: Fall 2003 -- rev3 25

Dataflow Architectures

Represent computation as a graph of essential dependences
• Logical processor at each node, activated by availability of operands
• Messages (tokens) carrying the tag of the next instruction are sent to the next processor
• Tag compared with others in the matching store; a match fires execution
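A toy sketch of the tag-matching idea (entirely our illustration – a direct-mapped “matching store” instead of the real associative one, with two-operand instructions that just add):

/* Tagged-token matching: a token is (tag, operand slot, value).
   A token waits in the matching store until its partner arrives;
   the match fires the instruction named by the tag. */
#include <stdio.h>

#define TAGS 16

struct slot { int present[2]; double val[2]; };
static struct slot store[TAGS];          /* stand-in for the matching store */

void token(int tag, int which, double v) {
    struct slot *s = &store[tag];
    s->present[which] = 1;
    s->val[which] = v;
    if (s->present[0] && s->present[1]) {            /* match => fire */
        printf("tag %d fires: %g + %g = %g\n",
               tag, s->val[0], s->val[1], s->val[0] + s->val[1]);
        s->present[0] = s->present[1] = 0;  /* result would become new tokens */
    }
}

int main(void) {
    token(3, 0, 1.5);   /* first operand waits in the store */
    token(3, 1, 2.5);   /* partner arrives: instruction at tag 3 fires */
    return 0;
}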


EECS 570: Fall 2003 -- rev3 26

Basic Dataflow Architecture

[Figure omitted in transcript: the dataflow processor loop – token queue, matching store, instruction fetch/execute, form-token, and network.]


EECS 570: Fall 2003 -- rev3 27

DF Evolution and Convergence

Key characteristics
– high parallelism: no artificial limitations
– dynamic scheduling: fully exploit multiple execution units

Problems
– Operations have locality; nice to group them to reduce communication
– No conventional notion of memory – how do you declare an array?
– Complexity of matching store: large associative search!
– Too much parallelism!!! mgmt overhead > benefit

Converged to use conventional processors and memory
– Group related ops to exploit registers, cache – fine-grain threading
– Results communicated between threads via messages

Lasting contributions:
– Stresses tightly integrated communication & execution (e.g., create thread to handle message)
– Remains a useful concept for ILP hardware, compilers


EECS 570: Fall 2003 -- rev3 28

Programming Model Design Issues

• Naming: How are communicated data and/or the partner node referenced?

• Operations: What operations are allowed on named data?

• Ordering: How can producers and consumers of data coordinate their activities?

• Performance
  – Latency: How long does it take to communicate in a protected fashion?
  – Bandwidth: How much data can be communicated per second? How many operations per second?


EECS 570: Fall 2003 -- rev3 29

Issue: Naming

Single Global Linear Address Space (shared memory)
Single Global Segmented Name Space (global objects / data parallel)
– uniform address space
– uniform accessibility (load/store)

Multiple Local Address/Name Spaces (message passing)
– two-level address space (node + memory address)
– non-uniform accessibility (use messages if node != me)

Naming strategy affects
– Programmer/Software
– Performance
– Design Complexity


EECS 570: Fall 2003 -- rev3 30

Issue: Operations

• SAS
  – ld/st, arithmetic on any item (in the source language)
  – additional ops for synchronization (locks, etc.), usually on memory locations

• Message passing
  – ld/st, arithmetic, etc. only on local items
  – send/recv on (local memory range, remote node ID) tuples

• Data parallel
  – arithmetic etc.
  – global operations (sum, max, min, etc.)


EECS 570: Fall 2003 -- rev3 31

Ordering

• Uniprocessor
  – program order of instructions (note: specifies effect, not reality)

• SAS
  – uniprocessor order within a thread
  – implicit memory ordering among threads very subtle
  – need explicit synchronization operations

• Message passing
  – uniprocessor order within a node; can't recv before send

• Data parallel
  – program order of operations (just like uni)
  – all parallelism is within individual operations
  – implicit global barrier after every step


EECS 570: Fall 2003 -- rev3 32

Issue: Order/Synchronization

Coordination mainly takes three forms:
– mutual exclusion (e.g., spin-locks)
– event notification
  • point-to-point (e.g., producer-consumer)
  • global (e.g., end-of-phase indication, to all or a subset of processes)
– global operations (e.g., sum)

Issues:
– synchronization name space (entire address space or a portion)
– granularity (per byte, per word, … => overhead)
– low latency, low serialization (hot spots)
– variety of approaches (simplest one sketched after this list)
  • test&set, compare&swap, load-locked/store-conditional
  • full/empty bits and traps
  • queue-based locks, fetch&op with combining
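A hedged sketch of the simplest primitive above – a test&set spin-lock – written with C11 atomics (ours; real systems add backoff precisely because of the hot-spot concern noted above):

/* Minimal test&set spin-lock using C11 atomics. */
#include <stdatomic.h>

typedef atomic_flag spinlock_t;
static spinlock_t l = ATOMIC_FLAG_INIT;      /* example lock instance */

void lock(spinlock_t *s) {
    /* test&set: atomically set the flag; spin while it was already set */
    while (atomic_flag_test_and_set_explicit(s, memory_order_acquire))
        ;                                    /* busy-wait, no backoff here */
}

void unlock(spinlock_t *s) {
    atomic_flag_clear_explicit(s, memory_order_release);
}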


EECS 570: Fall 2003 -- rev3 33

Communication Performance

Performance characteristics determine the usage of operations at a layer
– Programmers and compilers must choose strategies leading to performance

Fundamentally, three characteristics:
– Latency: time from send to receive
– Bandwidth: max transmission rate (bytes/sec)
– Cost: impact on execution time of the program

If the processor does one thing at a time: bandwidth ∝ 1/latency
– But actually more complex in modern systems


EECS 570: Fall 2003 -- rev3 34

Communication Cost Model

• Communication time for one n-byte message (sketched in code below):
  Comm Time = latency + n / bandwidth

• Latency has two parts:
  – overhead: the time the CPU is busy (protection checks, formatting the header, copying data, etc.)
  – the rest of the latency can be lumped as network delay

• Bandwidth is determined by the communication bottleneck
  – occupancy of a component is the amount of time that component spends dedicated to one message
  – in steady state, can't do better than 1/(max occupancy)
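The model is trivial to state in code (a sketch; the parameter names are ours):

/* Comm Time = latency + n / bandwidth */
double comm_time(double latency_s, double bandwidth_Bps, double n_bytes) {
    return latency_s + n_bytes / bandwidth_Bps;   /* seconds per message */
}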


EECS 570: Fall 2003 -- rev3 35

Cost Model (cont'd)

Overall execution-time impact depends on:
– amount of communication
– amount of comm. time hidden by other useful work (overlap)

comm cost = frequency * (comm time - overlap)    (see the sketch below)

Note that:
– overlap is limited by overhead
– overlap is another form of parallelism
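Continuing the previous sketch (names ours):

/* comm cost = frequency * (comm time - overlap); the CPU-busy overhead
   portion of comm time can never be overlapped with useful work. */
double comm_cost(double frequency, double comm_time, double overlap) {
    return frequency * (comm_time - overlap);
}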


EECS 570: Fall 2003 -- rev3 36

Replication

• Very important technique for reducing communication frequency

• Depends on the naming model
• Uniprocessor: caches do it automatically & transparently
• SAS: uniform naming allows transparent replication
  – Caches again
  – OS can do it at page level (useful with distributed memory)
  – Many copies for the same name: coherence problem

• Message Passing:
  – Send/receive replicates, giving data a new name – not transparent
  – Software at some level must manage it (programmer, library, compiler)


EECS 570: Fall 2003 -- rev3 37

Summary of Design Issues

Functional and performance issues apply at all layers

Functional: naming, operations, and ordering

Performance: organization, latency, bandwidth, overhead, occupancy

Replication and communication are deeply related
– Management depends on the naming model

Goal of architects: design against the frequency and type of operations that occur at the communication abstraction, constrained by tradeoffs from above or below
– Hardware/software tradeoffs