
Parallel Computer Organization and Design (EDA282)

Slide 1

Slide 2

Why Study Parallel Computers?

Almost ALL computers are now parallel

Understanding hardware is important for producing good software (converse also true!)

It’s fun!

Logistics

EL43, 1:15-3:00 T/Th (often 1:15 F, too)

Expected participation
- Attend lectures, participate in discussion
- Complete labs, including a satisfactory writeup (dates/times TBD)
- Read papers
- Complete quizzes
- Write (short) survey article (in teams)
- Finish (short) take-home exam

Canvas course-management system
- https://canvas.instructure.com/courses/777378
- Link: http://www.cse.chalmers.se/~mckee/eda282

Slide 3

Personnel

Prof. Sally McKee
- Office hours: arrange meetings via email
- Available for discussions after class
- mckee@chalmers.se

Jacob Lidman
- lidman@chalmers.se

Slide 4

Course Materials

“Parallel Computer Organization and Design” by Dubois, Annavaram, Stenström (at Cremona)

Research and survey papers (linked to web page)

Slide 5

Course Structure/Contents

Intro today

Programming models
- Data parallelism
- Shared address spaces
- Message passing
- Hybrid

Design principles/tradeoffs (this is the bulk of the material)
- Small-scale systems
- Scalable systems
- Interconnects

Slide 6

For Each Big Topic, We’ll Discuss . . .

History
- How concepts originated in old machines
- How they show up in current machines

Basics required in any parallel machine
- Memory coherence
- Communication
- Synchronization

How Did We Get Here?

Transistor count doubling every ~2 years

Transistor feature sizes shrinking

Costs changing

Clock speeds hitting limits

Parallelism per processor increasing

Looking at trends is important when designing new systems!

Slide 8

Costs of Parallel Machines

Things to keep in mind when designing a machine . . .

- What does it cost to design the mechanism?
- What does it cost to verify?
- What does it cost to manufacture?
- What does it cost to test?
- What does it cost to program it?
- What does it cost to deploy (turn on)?
- What does it cost to keep it running? (power costs, maintenance)
- What does it cost to use it?
- What does it cost to dispose of it at the end of its lifetime?

(how long is a "lifetime"?)

Slide 9

Slide 10

Interesting Questions (i.e., course content)

What do we mean by parallel?
- Task parallelism (SPMD, MPMD)
- Data parallelism (SIMD)
- Thread parallelism (Hyperthreading, SMT)

How do the processors coordinate their work?
- Shared memory/message passing
- Interconnection network (at least one!)
- Synchronization primitives
- Many combinations/variations

What’s the best way to put these pieces together?
- What do you want to run?
- How fast do you have to run it?
- How much can you spend?
- How much energy can you use?

Moore’s Law: Transistor Counts

Slide 11

Feature Sizes

Slide 12

Costs: Apple

Slide 13

Costs of Consumer Electronics Today

Slide 14

History

Pascal adding machine, 1642

Leibniz adder/multiplier, ~1670

Babbage analytical engine, 1837 (punch cards, memory, printer!)

Hollerith punchcards, 1890 (used for US census data)

Aiken digital computer, 1940s (Harvard)

Von Neumann stored-program computer, 1945

Eckert/Mauchly ENIAC GP computer, 1946

Slide 15

Evolution of Electronic Computers

Vacuum tubes replaced by transistors, late 1950s
- Smaller, faster, more versatile logic elements
- Lower power
- Longer lifetime

Integrated Circuits, late 1960s
- Many transistors fabricated on silicon substrate
- Wires plated in place
- Lower price
- Smaller size
- Lower failure rate

LSI/VLSI/microprocessors, 1970s
- 1000s of interconnected transistors etched into silicon
- Could check 8 switches at once → 8-bit “byte”

Slide 16

History of Supercomputers

IBM 7030 Stretch, 1961
- 2K sq. ft.
- Fastest computer in the world at the time
- Slower than expected!
- Cost initially $13M, dropped to $8.5M
- Instruction pipelining, prefetching/decoding, memory interleaving

CDC 6600, 1964
- Size ~= 4 filing cabinets
- Cost $8M ($60M today)
- 40 MHz, 3 MFLOPS at peak
- Freon cooled
- CPU == 10 FUs, multiple PCBs
- 60-bit words/regs

Slide 17

History of Supercomputers (2)

Cray 1, 1976
- 64-bit words
- 80 MHz
- 136 MFLOPS!
- Speed-critical parts placed inside
- 1662 PCBs w/ 144 ICs
- 80 sold in 10 years
- $5-8M ($25M now)

Slide 18

History of Supercomputers (3)

Cray XMP, 1982
- Up to 4 CPUs in 1 chassis
- Up to 16M 64-bit words (128 MB, all SRAM!)
- Up to 32 1.2GB disks
- 105 MHz
- Up to 800 MFLOPS (200/CPU)
- Double memory bandwidth wrt Cray 1

Cray 2, 1985
- Again ICs packed on logic boards
- Again, horseshoe shape
- Boards packed tightly, submersed in Fluorinert to cool (see http://archive.computerhistory.org/resources/text/Cray/Cray.Cray2.1985.102646185.pdf)
- Up to 8 CPUs, 1.9 GFLOPS
- Mainstream software/Unix System V OS

Slide 19

History of Supercomputers (4)

Intel Paragon, 1989
- i860-based
- 32- or 64-bit
- Up to 4K CPUs
- 2D MIMD topology
- Poor memory bandwidth utilization

ASCI Red, 1996
- First to use off-the-shelf CPUs (Pentium Pros, Xeons)
- 6K CPUs
- Broke 1 TFLOP barrier
- Cost $46M ($67M now)
- Upgrade had 9298 Xeons for 3.1 TFLOPS
- Over 1 MW power!

Slide 20

History of Supercomputers (5)

Hitachi SR2201, 1996
- H-shaped chassis
- 2048 CPUs
- 600 GFLOPS peak

Other similar machines (many Japanese)
- 100s of CPUs
- 2D or 3D networks (e.g., Cray torus)
- MIMD

Seymour Cray leaves Cray Research
- Cray Computer Corp (CCC)
  - Cray 3: first gallium arsenide chips
  - Cray 4 failed → bankruptcy
- SRC Computers (see http://www.srccomp.com/about/aboutus.asp)

Slide 21

Biggest Machine Today

Sequoia, an IBM BlueGene/Q machine at the U.S. Dept. of Energy Lawrence Livermore National Lab

Slide 22

Types of Parallelism

Instruction-Level Parallelism (ILP)
- Superscalar issue
- Out-of-order execution
- Very Long Instruction Word (VLIW)

Thread-Level Parallelism (TLP)
- Loop-level
- Multithreading
  - Explicit
  - Speculative
  - Simultaneous/Hyperthreading

Task-Level Parallelism

Program-Level Parallelism

Data-Level Parallelism

Slide 23

Parallelism in Sequential Programs

Programming model: C (sequential)

Architecture: superscalar
- ILP
- Communication through registers
- Synchronization through pipeline interlocks

Slide 24

for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
    d[i] := C*a[i];

Iteration:   0      1      ...   N-1
Loop 1:      a[1]   a[2]   ...   a[0]
Loop 2:      a[0]   a[1]   ...   a[N-1]

Data dependencies: iterations within each loop are independent, but every element a[i] read by loop 2 is written by some iteration of loop 1, so loop 2 cannot start until loop 1 has finished.
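For concreteness, here is a minimal C sketch of the two sequential loops (the function name and the parameter cst, standing in for the constant C, are illustrative, not prescribed by the slide):

    #include <stddef.h>

    /* Sequential version of the running example. Loop 1 writes a[(i+1) mod N],
     * loop 2 reads a[i]; iterations within each loop are independent, but
     * loop 2 depends on loop 1. */
    void sequential(size_t N, double *a, const double *b,
                    const double *c, double *d, double cst)
    {
        for (size_t i = 0; i < N; i++)      /* loop 1 */
            a[(i + 1) % N] = b[i] + c[i];
        for (size_t i = 0; i < N; i++)      /* loop 2 */
            d[i] = cst * a[i];
    }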

Parallel Programming Models

Extend semantics to express
- Units of parallelism
  - Instructions
  - Threads
  - Programs
- Communication and coordination between units via
  - Registers
  - Memory
  - I/O

Slide 25

Model vs. Architecture

- Communication abstraction supports the model
- Communication architecture (ISA + comm/sync) implements part of the model
- The HW/SW boundary defines which parts of the comm architecture are implemented in hardware and which in software

Slide 26

[Figure: layers of a parallel system. Parallel applications (CAD, databases, scientific modeling) are written to programming models (multiprogramming, shared address, message passing, data parallel); these sit on the communication abstraction, which marks the user/system boundary. Below it, compiler or library and operating system support implement the abstraction on the communication hardware and physical communication medium, separated by the hardware/software boundary.]

Shared Address Space Model

TLP

Communication/coordination among threads via shared global address space

Slide 27

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        d[j] := C*a[j];

[Figure: processors P connected to a shared memory; the communication abstraction is supported by the HW/SW interface.]
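A minimal sketch of how this shared-address-space pseudocode might be written with POSIX threads (the thread count P, problem size N, block partitioning, and array names are illustrative assumptions, not part of the slide; compile with -pthread). Threads communicate simply by reading and writing the shared arrays, and the barrier is the synchronization between the two loops:

    #include <pthread.h>
    #include <stddef.h>

    #define N 1024
    #define P 4

    static double a[N], b[N], c[N], d[N];
    static const double cst = 3.0;
    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        size_t id = (size_t)arg;
        size_t lo = id * N / P, hi = (id + 1) * N / P;   /* block assignment: i0[i]..in[i] */

        for (size_t j = lo; j < hi; j++)                 /* loop 1 on this thread's block */
            a[(j + 1) % N] = b[j] + c[j];

        pthread_barrier_wait(&barrier);                  /* all writes to a[] complete here */

        for (size_t j = lo; j < hi; j++)                 /* loop 2 reads shared a[] directly */
            d[j] = cst * a[j];
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        pthread_barrier_init(&barrier, NULL, P);
        for (size_t i = 0; i < P; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (size_t i = 0; i < P; i++)
            pthread_join(&t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }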

Message Passing Model

Process-level parallelism (separate addr spaces)

Communication/coordination via messages

Slide 28

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        if j = in[i] then send(a[index], (j+1) mod P, a[j]);
    end_for
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        if j = i0[i] then recv(tmp, (P+j-1) mod P, a[j]);
        d[j] := C * tmp;
    end_for
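A minimal MPI sketch in C of the same computation (the rank count, block distribution, and names such as boundary are illustrative assumptions): each rank keeps only its own block of the arrays, so the one element that loop 1 writes into a neighbour's block must be sent explicitly.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int N = 1024;
        const double cst = 3.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int B  = N / nprocs;               /* assume N divisible by nprocs */
        int lo = rank * B, hi = lo + B;    /* this rank owns global indices [lo, hi) */

        /* Local blocks only; no shared memory. b and c would be initialized here. */
        double *a = calloc(B, sizeof *a), *b = calloc(B, sizeof *b);
        double *c = calloc(B, sizeof *c), *d = calloc(B, sizeof *d);

        /* Loop 1: a[(j+1) mod N] = b[j] + c[j]. All writes stay local except
         * the last one, which belongs to the next rank's block. */
        double boundary = 0.0;
        for (int j = lo; j < hi; j++) {
            double v = b[j - lo] + c[j - lo];
            if (j == hi - 1)
                boundary = v;              /* goes to rank (rank+1) mod nprocs */
            else
                a[j + 1 - lo] = v;
        }

        /* Exchange the single boundary element with the neighbours. */
        double recvd;
        MPI_Sendrecv(&boundary, 1, MPI_DOUBLE, (rank + 1) % nprocs, 0,
                     &recvd,    1, MPI_DOUBLE, (rank - 1 + nprocs) % nprocs, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        a[0] = recvd;                      /* first element of this rank's block */

        /* Loop 2: purely local. */
        for (int j = 0; j < B; j++)
            d[j] = cst * a[j];

        free(a); free(b); free(c); free(d);
        MPI_Finalize();
        return 0;
    }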

Data Parallelism (SIMD)

Programming model
- Operations done in parallel on multiple data elements
- Single thread of control

Architectural model
- Array of simple, cheap processors w/ little memory
- Attached to control proc that issues instructions
- Specialized + general comm, cheap sync

Slide 29

[Figure: a 2D array of processing elements (PEs) driven by a single control processor.]

parallel (i: 0->N-1)
    a[(i+1) mod N] := b[i] + c[i];
parallel (i: 0->N-1)
    d[i] := C * a[i];
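The PE-array organization above is classic SIMD; on today's processors the same idea survives as short-vector SIMD instructions. Here is a minimal sketch of the second loop (d[i] = C*a[i]) using x86 AVX intrinsics, assuming N is a multiple of 4 (the function and parameter names are illustrative):

    #include <immintrin.h>

    /* One instruction operates on 4 doubles at a time; a true SIMD array
     * would instead spread the data across many PEs, but the single-
     * instruction-multiple-data idea is the same. */
    void scale(double *d, const double *a, double cst, int N)
    {
        __m256d vc = _mm256_set1_pd(cst);                    /* broadcast the constant */
        for (int i = 0; i < N; i += 4) {
            __m256d va = _mm256_loadu_pd(&a[i]);             /* load 4 elements of a */
            _mm256_storeu_pd(&d[i], _mm256_mul_pd(va, vc));  /* 4 multiplies at once */
        }
    }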

Coarser-Grain Data Parallelism

Single-Program Multiple-Data

More broadly applicable than SIMD

Slide 30

Creating a Parallel Program

Identify work that can be done in parallel
- Computation
- Data access
- I/O

Partition work/data among entities
- Processes
- Threads

Manage data access, comm, sync

Speedup(P) = Performance(P)/Performance(1) = Time(1)/Time(P)
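For example, if the sequential program takes 100 s and the 4-processor version takes 30 s, then Speedup(4) = 100/30 ≈ 3.3, somewhat short of the ideal speedup of 4.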

Slide 31

Steps

Decomposition

Assignment

Orchestration

Mapping

Can be done by
- Programmer
- Compiler
- Runtime
- Hardware (speculatively)

Slide 32

[Figure: the parallelization steps, split into an architecture-independent part and an architecture-dependent part.]

Slide 33

[Figure: decomposition turns the sequential computation into tasks, assignment groups tasks into processes P0-P3, orchestration yields the parallel program, and mapping places processes onto processors.]

Concepts

Task
- Arbitrary piece of work from computation
- Sequentially executed
- Could be fine- or coarse-grained

Process (or thread)
- What gets executed by a core
- Abstract entity that performs tasks assigned to it
- Processes comm & sync to perform tasks

Processor (core)
- Physical engine on which processes run
- Virtualized machine view for programmer

Slide 34

Decomposition

Purpose: break up computation into tasks to be divided among processes
- Tasks may become available dynamically
- Number of available tasks may vary with time
- i.e., identify concurrency and decide level at which to exploit it

Goal: keep processes busy, but keep management reasonable
- Number of tasks creates upper bound on speedup
- Too many tasks require too much coordination
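For example, a decomposition that yields only 8 independent tasks caps speedup at 8, no matter how many processors are available; conversely, millions of tiny tasks can drown the machine in scheduling and communication overhead.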

Slide 35

Assignment

Specify mechanism to divide work among processes
- Strive for balance
- Reduce communication, management

Structured approach recommended
- Inspect code
- Apply well-known heuristics

Programmer focuses on decomp/assign first
- Largely independent of architecture/programming model
- Choice of primitives (cost/complexity) affects decisions

Architects assume program(mer) does a decent job
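As a concrete (illustrative) instance of static assignment, the per-process bounds i0[i] and in[i] used in the earlier pseudocode could be computed with a simple block partitioning like the C sketch below; cyclic or dynamic (work-queue) assignment are common alternatives when iteration costs are uneven.

    /* Block assignment: inclusive bounds i0 and in for process i of P,
     * spreading any remainder over the first N mod P processes.
     * Illustrative only; the slides do not prescribe this scheme. */
    void block_bounds(int N, int P, int i, int *i0, int *in)
    {
        int base = N / P, rem = N % P;
        *i0 = i * base + (i < rem ? i : rem);
        *in = *i0 + base + (i < rem ? 1 : 0) - 1;
    }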

Slide 36

Orchestration

Purpose
- Name data, structure comm/sync
- Organize data structures, schedule tasks (temporally)

Goals
- Reduce costs of comm/sync from processor POV
- Improve data locality
- Reduce overhead of managing parallelism

Choices depend heavily on comm abstraction, efficiency of primitives

Architects must provide appropriate, efficient primitives

Slide 37

Mapping

Two aspects
- Which processes to run on the same processor
- Which process runs on which processor

One extreme: space sharing
- Partition machine s.t. only 1 app at a time in a subset
- Pin processes to cores (or let OS balance workloads)

Another extreme
- Complete resource management controlled by the OS
- Use performance techniques for dynamic balancing

Real world is between the two
- User specifies desires in some aspects
- System may ignore

Slide 38

High-Level Goals

- High performance
- Low resource usage
- Low development effort
- Low power consumption

Implications for algorithm designers and architects
- Algorithm designers: high performance, low resource needs
- Architects: high performance, low cost, reduced programming effort

Slide 39


Slide 40