Transcript of Parallel Computer Organization and Design (EDA282) lecture slides.

Parallel Computer Organization and Design, EDA282

Slide 1


Slide 2

Why Study Parallel Computers?

Almost ALL computers are now parallel

Understanding hardware is important for producing good software (converse also true!)

It’s fun!


Logistics

EL43, 1:15-3:00 T/Th (often 1:15 F, too)

Expected participation
- Attend lectures, participate in discussion
- Complete labs (including a satisfactory writeup); dates/times TBD
- Read papers
- Complete quizzes
- Write (short) survey article (in teams)
- Finish (short) take-home exam

Canvas course-management system
- https://canvas.instructure.com/courses/777378
- Link: http://www.cse.chalmers.se/~mckee/eda282

Slide 3


Personnel

Prof. Sally McKee
- Office hours: arrange meetings via email
- Available for discussions after class
- [email protected]

Jacob Lidman
- [email protected]

Slide 4


Course Materials

“Parallel Computer Organization and Design” by Dubois, Annavaram, and Stenström (available at Cremona)

Research and survey papers (linked to web page)

Slide 5


Course Structure/Contents

Intro today

Programming models
- Data parallelism
- Shared address spaces
- Message passing
- Hybrid

Design principles/tradeoffs (this is the bulk of the material)
- Small-scale systems
- Scalable systems
- Interconnects

Slide 6


For Each Big Topic, We’ll Discuss . . .

History
- How concepts originated in old machines
- How they show up in current machines

Basics required in any parallel machine
- Memory coherence
- Communication
- Synchronization


How Did We Get Here?

Transistor count doubling every ~2 years

Transistor feature sizes shrinking

Costs changing

Clock speeds hitting limits

Parallelism per processor increasing

Looking at trends is important when designing new systems!

Slide 8


Costs of Parallel Machines

Things to keep in mind when designing a machine . . .

- What does it cost to design the mechanism?
- What does it cost to verify?
- What does it cost to manufacture?
- What does it cost to test?
- What does it cost to program it?
- What does it cost to deploy (turn on)?
- What does it cost to keep it running? (power costs, maintenance)
- What does it cost to use it?
- What does it cost to dispose of it at the end of its lifetime?

(how long is a "lifetime"?)

Slide 9


Slide 10

Interesting Questions (i.e., course content)

What do we mean by parallel?
- Task parallelism (SPMD, MPMD)
- Data parallelism (SIMD)
- Thread parallelism (Hyperthreading, SMT)

How do the processors coordinate their work?
- Shared memory/message passing
- Interconnection network (at least one!)
- Synchronization primitives
- Many combinations/variations

What’s the best way to put these pieces together?
- What do you want to run?
- How fast do you have to run it?
- How much can you spend?
- How much energy can you use?


Moore’s Law: Transistor Counts

Slide 11


Feature Sizes

Slide 12


Costs: Apple

Slide 13


Costs of Consumer Electronics Today

Slide 14


History

Pascal adding machine, 1642

Leibniz adder/multiplier, ~1670

Babbage analytical engine, 1837 (punch cards, memory, printer!)

Hollerith punchcards, 1890 (used for US census data)

Aiken digital computer, 1940s (Harvard)

Von Neumann stored-program computer, 1945

Eckert/Mauchly ENIAC GP computer, 1946

Slide 15


Evolution of Electronic Computers

Vacuum tubes replaced by transistors, late 1950s
- Smaller, faster, more versatile logic elements
- Lower power
- Longer lifetime

Integrated circuits, late 1960s
- Many transistors fabricated on a silicon substrate
- Wires plated in place
- Lower price
- Smaller size
- Lower failure rate

LSI/VLSI/microprocessors, 1970s
- 1000s of interconnected transistors etched into silicon
- Could check 8 switches at once → 8-bit “byte”

Slide 16


History of Supercomputers

IBM 7030 Stretch, 1961
- 2K sq. ft.
- Fastest computer in the world at the time
- Slower than expected!
- Cost initially $13M, dropped to $8.5M
- Instruction pipelining, prefetching/decoding, memory interleaving

CDC 6600, 1964
- Size ~= 4 filing cabinets
- Cost $8M ($60M today)
- 40 MHz, 3 MFLOPS at peak
- Freon cooled
- CPU = 10 FUs, multiple PCBs
- 60-bit words/regs

Slide 17


History of Supercomputers (2)

Cray 1, 1976
- 64-bit words
- 80 MHz
- 136 MFLOPS!
- Speed-critical parts placed inside the curve (shorter wires)
- 1662 PCBs w/ 144 ICs
- 80 sold in 10 years
- $5-8M ($25M now)

Slide 18


History of Supercomputers (3)

Cray X-MP, 1982
- Up to 4 CPUs in 1 chassis
- Up to 16M 64-bit words (128 MB, all SRAM!)
- Up to 32 1.2GB disks
- 105 MHz
- Up to 800 MFLOPS (200/CPU)
- Double the memory bandwidth of the Cray 1

Cray 2, 1985
- Again, ICs packed on logic boards
- Again, horseshoe shape
- Boards packed tightly, submerged in Fluorinert to cool
  (see http://archive.computerhistory.org/resources/text/Cray/Cray.Cray2.1985.102646185.pdf)
- Up to 8 CPUs, 1.9 GFLOPS
- Mainstream software/Unix System V OS

Slide 19


History of Supercomputers (4)

Intel Paragon, 1989
- i860-based
- 32- or 64-bit
- Up to 4K CPUs
- 2D MIMD topology
- Poor memory bandwidth utilization

ASCI Red, 1996
- First to use off-the-shelf CPUs (Pentium Pros, Xeons)
- 6K CPUs
- Broke the 1 TFLOPS barrier
- Cost $46M ($67M now)
- Upgrade had 9298 Xeons for 3.1 TFLOPS
- Over 1 MW of power!

Slide 20


History of Supercomputers (5)

Hitachi SR2201, 1996
- H-shaped chassis
- 2048 CPUs
- 600 GFLOPS peak

Other similar machines (many Japanese)
- 100s of CPUs
- 2D or 3D networks (e.g., Cray torus)
- MIMD

Seymour Cray leaves Cray Research
- Cray Computer Corp (CCC)
  - Cray 3: first gallium arsenide chips
  - Cray 4 failed → bankruptcy
- SRC Computers (see http://www.srccomp.com/about/aboutus.asp)

Slide 21


Biggest Machine Today

Sequoia: IBM BlueGene/Q machine at the U.S. Dept. of Energy Lawrence Livermore National Lab

Slide 22


Types of Parallelism

Instruction-Level Parallelism (ILP)
- Superscalar issue
- Out-of-order execution
- Very Long Instruction Word (VLIW)

Thread-Level Parallelism (TLP)
- Loop-level
- Multithreading
  - Explicit
  - Speculative
  - Simultaneous/Hyperthreading

Task-Level Parallelism

Program-Level Parallelism

Data-Level Parallelism

Slide 23


Parallelism in Sequential Programs

Programming model: C (sequential)

Architecture: superscalar
- ILP
- Communication through registers
- Synchronization through pipeline interlocks

Slide 24

for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
    d[i] := C*a[i];

Iteration:  0     1     ...  N-1
Loop 1:     a[1]  a[2]  ...  a[0]
Loop 2:     a[0]  a[1]  ...  a[N-1]

data dependencies (loop-1 writes feed loop-2 reads)
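To make the dependence concrete, here is a small C sketch of the same two loops (my own illustration; N and the constant C are assumed values). The comments mark where loop-1 writes feed loop-2 reads, which is what limits how far a superscalar core or compiler can overlap the two loops.

#include <stdio.h>

#define N 8          /* problem size (assumed for illustration)   */
#define C_VAL 3.0    /* the scalar constant C from the pseudocode */

int main(void) {
    double a[N] = {0}, b[N], c[N], d[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    /* Loop 1: writes a[(i+1) mod N]. Its iterations are independent of
       one another, but every element it writes is later read by loop 2. */
    for (int i = 0; i < N; i++)
        a[(i + 1) % N] = b[i] + c[i];

    /* Loop 2: reads a[i]. Each read depends on a write from loop 1, so
       the hardware may only overlap the loops as far as these cross-loop
       dependences allow.                                                 */
    for (int i = 0; i < N; i++)
        d[i] = C_VAL * a[i];

    for (int i = 0; i < N; i++)
        printf("d[%d] = %.1f\n", i, d[i]);
    return 0;
}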


Parallel Programming Models

Extend semantics to express
- Units of parallelism
  - Instructions
  - Threads
  - Programs
- Communication and coordination between units via
  - Registers
  - Memory
  - I/O

Slide 25


Model vs. Architecture

- Communication abstraction supports the model
- Communication architecture (ISA + comm/sync) implements part of the model
- Hw/sw boundary defines which parts of the comm architecture are implemented in hardware and which in software

Slide 26

[Figure: layers of abstraction.
Parallel applications: CAD, databases, scientific modeling
Programming models: multiprogramming, shared address, message passing, data parallel
Communication abstraction (user/system boundary)
Compiler or library
Operating system support
Communication hardware (hardware/software boundary)
Physical communication medium]


Shared Address Space Model

TLP

Communication/coordination among threads via shared global address space

Slide 27

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        d[j] := C*a[j];

Communication abstraction supported by HW/SW interface

[Figure: P processors sharing a common memory]
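A minimal shared-address-space sketch in C with OpenMP (my own illustration, not the slides' code; N and the constant C are assumed values, and schedule(static) stands in for the i0[i]..in[i] chunking). Threads communicate through the shared arrays, and the implicit barrier at the end of the first parallel loop plays the role of the explicit barrier above. Compile with a flag such as -fopenmp.

#include <stdio.h>

#define N 16
#define C_VAL 3.0

double a[N], b[N], c[N], d[N];   /* shared: all threads see the same arrays */

int main(void) {
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    /* Loop 1: iterations divided statically among the threads. */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        a[(j + 1) % N] = b[j] + c[j];
    /* Implicit barrier here corresponds to "barrier;" in the pseudocode. */

    /* Loop 2: reads the shared array a written (possibly) by other threads. */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        d[j] = C_VAL * a[j];

    printf("d[1] = %.1f\n", d[1]);
    return 0;
}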


Message Passing Model

Process-level parallelism (separate addr spaces)

Communication/coordination via messages

Slide 28

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        if j = in[i] then
            send(a[index], (j+1) mod P, a[j]);
    end_for
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        if j = i0[i] then
            recv(tmp, (P+j-1) mod P, a[j]);
        d[j] := C * tmp;
    end_for
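For comparison, a minimal MPI sketch in C of the same computation (my own illustration rather than the slides' code; it assumes N is divisible by the number of ranks). Each rank owns a contiguous chunk, computes its part of loop 1, exchanges the single boundary element with its neighbors, and then runs loop 2 on purely local data. Compile with mpicc and run with mpirun.

#include <stdio.h>
#include <mpi.h>

#define N 16
#define C_VAL 3.0

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;              /* assumes N divisible by nprocs */
    int lo = rank * chunk, hi = lo + chunk - 1;

    double a[N] = {0}, b[N], c[N], d[N];
    for (int j = 0; j < N; j++) { b[j] = j; c[j] = 2 * j; }

    /* Loop 1: this rank writes a[lo+1 .. (hi+1) mod N]. The last element it
       produces belongs to the next rank; the first element of this rank's
       own range, a[lo], is produced by the previous rank.                  */
    for (int j = lo; j <= hi; j++)
        a[(j + 1) % N] = b[j] + c[j];

    int next = (rank + 1) % nprocs, prev = (rank + nprocs - 1) % nprocs;
    double boundary;
    MPI_Sendrecv(&a[(hi + 1) % N], 1, MPI_DOUBLE, next, 0,
                 &boundary,        1, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    a[lo] = boundary;

    /* Loop 2: purely local once the boundary element has arrived. */
    for (int j = lo; j <= hi; j++)
        d[j] = C_VAL * a[j];

    printf("rank %d: d[%d] = %.1f\n", rank, lo, d[lo]);
    MPI_Finalize();
    return 0;
}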


Data Parallelism (SIMD)

Programming model
- Operations done in parallel on multiple data elements
- Single thread of control

Architectural model
- Array of simple, cheap processors w/ little memory
- Attached to control proc that issues instructions
- Specialized + general comm, cheap sync

Slide 29

[Figure: 2D array of PEs driven by a control processor]

parallel (i: 0 -> N-1)
    a[(i+1) mod N] := b[i] + c[i];
parallel (i: 0 -> N-1)
    d[i] := C * a[i];
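As a present-day analogue, here is a hedged C sketch of the second loop, d[i] := C * a[i], written with x86 AVX intrinsics (assumptions: an AVX-capable CPU, a flag such as -mavx, and N a multiple of 4). This is SIMD within a single core rather than an array of PEs, but it shows one thread of control issuing a single operation over several data elements at a time.

#include <stdio.h>
#include <immintrin.h>

#define N 16                /* assumed multiple of 4 for simplicity */

int main(void) {
    double a[N], d[N];
    for (int i = 0; i < N; i++) a[i] = i;

    const __m256d cvec = _mm256_set1_pd(3.0);      /* broadcast the constant C */
    for (int i = 0; i < N; i += 4) {
        __m256d av = _mm256_loadu_pd(&a[i]);       /* load 4 doubles           */
        _mm256_storeu_pd(&d[i], _mm256_mul_pd(cvec, av));  /* 4 multiplies at once */
    }

    printf("d[5] = %.1f\n", d[5]);
    return 0;
}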


Coarser-Grain Data Parallelism

Single-Program Multiple-Data

More broadly applicable than SIMD

Slide 30


Creating a Parallel Program

Identify work that can be done in parallel
- Computation
- Data access
- I/O

Partition work/data among entities
- Processes
- Threads

Manage data access, comm, sync

Speedup(P) = Performance(P) / Performance(1) = Time(1) / Time(P)
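A quick worked instance with assumed numbers: if the sequential version takes 120 s and the same problem finishes in 20 s on 8 processors, then Speedup(8) = Time(1)/Time(8) = 120/20 = 6, i.e., a parallel efficiency of 6/8 = 75%.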

Slide 31


Steps

Decomposition

Assignment

Orchestration

Mapping

Can be done by
- Programmer
- Compiler
- Runtime
- Hardware (speculatively)
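A compact pthreads sketch in C that labels the four steps in comments (an illustrative example, not from the slides; N, P, and the constant C are assumed values; compile with -pthread).

#include <stdio.h>
#include <pthread.h>

#define N 16
#define P 4                  /* number of worker threads (assumed) */
#define C_VAL 3.0

double a[N], d[N];           /* shared data */

/* Decomposition: the N iterations are the tasks.
   Assignment: thread t gets the contiguous block [t*N/P, (t+1)*N/P). */
static void *worker(void *arg) {
    long t = (long)arg;
    for (long i = t * N / P; i < (t + 1) * N / P; i++)
        d[i] = C_VAL * a[i];  /* Orchestration: data shared, no sync needed here */
    return NULL;
}

int main(void) {
    pthread_t tid[P];
    for (int i = 0; i < N; i++) a[i] = i;

    /* Mapping: which core runs which thread is left to the OS in this sketch. */
    for (long t = 0; t < P; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < P; t++)
        pthread_join(tid[t], NULL);   /* Orchestration: join acts as a barrier */

    printf("d[5] = %.1f\n", d[5]);
    return 0;
}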

Slide 32


Parallelization

[Figure: parallelization steps. Sequential computation → (decomposition) → tasks → (assignment) → processes P0-P3 → (orchestration) → parallel program → (mapping) → processors. The early steps are architecture independent; the later steps are architecture dependent.]

Slide 33


Concepts

Task
- Arbitrary piece of work from the computation
- Sequentially executed
- Could be fine- or coarse-grained

Process (or thread)
- What gets executed by a core
- Abstract entity that performs tasks assigned to it
- Processes comm & sync to perform tasks

Processor (core)
- Physical engine on which processes run
- Virtualized machine view for programmer

Slide 34


Decomposition

Purpose: Break up computation into tasks to be divided among processes
- Tasks may become available dynamically
- Number of available tasks may vary with time
- i.e., identify concurrency and decide level at which to exploit it

Goal: keep processes busy, but keep management reasonable
- Number of tasks creates an upper bound on speedup
- Too many tasks require too much coordination

Slide 35


Assignment

Specify mechanism to divide work among processes
- Strive for balance
- Reduce communication, management

Structured approach recommended
- Inspect code
- Apply well-known heuristics

Programmer focuses on decomposition/assignment first
- Largely independent of architecture/programming model
- Choice of primitives (cost/complexity) affects decisions

Architects assume program(mer) does decent job

Slide 36


Orchestration

Purpose
- Name data, structure comm/sync
- Organize data structures, schedule tasks (temporally)

Goals
- Reduce costs of comm/sync from the processor's POV
- Improve data locality
- Reduce overhead of managing parallelism

Choices depend heavily on the comm abstraction and efficiency of primitives

Architects must provide appropriate, efficient primitives

Slide 37


Mapping

Two aspects
- Which processes to run on the same processor
- Which process runs on which processor

One extreme: space-sharing
- Partition machine s.t. only 1 app at a time runs in a subset
- Pin processes to cores, or let the OS balance workloads (see the pinning sketch after this list)

Another extreme
- Complete resource management controlled by the OS
- Use performance techniques for dynamic balancing

Real world is between the two
- User specifies desires in some aspects
- System may ignore
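For the pinning option mentioned above, a minimal Linux sketch in C (assumptions: glibc with the GNU affinity extension, compiled with -pthread); it pins the calling thread to one core, which is one way to implement "pin processes to cores."

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to CPU `cpu`; returns 0 on success. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    int err = pin_to_cpu(0);          /* core 0 chosen arbitrarily */
    if (err != 0)
        fprintf(stderr, "could not pin thread (error %d)\n", err);
    else
        printf("main thread pinned to core 0\n");
    return 0;
}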

Slide 38


High-Level Goals

High performance

Low resource usage

Low development effort

Low power consumption

Implications for algorithm designers and architects
- Algorithm designers: high performance, low resource needs
- Architects: high performance, low cost, reduced programming effort

Slide 39


Costs of Parallel Machines

Things to keep in mind when designing a machine . . .

- What does it cost to design the mechanism?
- What does it cost to verify?
- What does it cost to manufacture?
- What does it cost to test?
- What does it cost to program it?
- What does it cost to deploy (turn on)?
- What does it cost to keep it running? (power costs, maintenance)
- What does it cost to use it?
- What does it cost to dispose of it at the end of its lifetime?

(how long is a "lifetime"?)

Slide 40