Instructor ◦ Dan Stevenson Office: P 136 [email protected] Course Web Site: ◦ stevende/cs491/...

CS 491: Parallel & Distributed Computing

Instructor
◦ Dan Stevenson
◦ Office: P 136
◦ [email protected]

Course Web Site:
◦ http://www.cs.uwec.edu/~stevende/cs491/

Welcome to CS 491

http://www.cs.uwec.edu/~ernstdj/courses/cs491/


Regarding HELP with course materials and assignments
◦ Come to office hours – Phillips 136
   • TIME TBD (check website) OR by appointment (just e-mail or call my office)
◦ Send me an e-mail: [email protected]

Getting Help: When you have questions

Required:
◦ Michael J. Quinn, Parallel Programming in C with OpenMP and MPI

Suggested:
◦ Web tutorials during the semester

Textbooks

Overall Course Grading

Final Grade:
◦ Exams (3): 40%
◦ Project: 20%
◦ Assignments (approx. every two weeks): 40%

CS 491: Overview

Slide content for this course collected from various sources, including:
• Dr. Henry Neeman, University of Oklahoma
• Dr. Libby Shoop, Macalester College
• Dr. Charlie Peck, Earlham College
• Tom Murphy, Contra Costa College
• Dr. Robert Panoff, Shodor Foundation
• Others, who will be credited…

CS 491 – Parallel and Distributed Computing

What does it mean to you?
◦ Coordinating Threads
◦ Supercomputing
◦ Multi-core Processors
◦ Beowulf Clusters
◦ Cloud Computing
◦ Grid Computing
◦ Client-Server
◦ Scientific Computing

All contexts for “splitting up work” in an explicit way

“Parallel & Distributed Computing”

In this course, we will take mostly from the context of “Supercomputing”
◦ This is the field with the longest record of parallel computing expertise.
◦ It also has a long record of being a source for “trickle-down” technology.

CS 491

Supercomputing is the biggest, fastest computing right this minute.

Likewise, a supercomputer is one of the biggest, fastest computers right this minute.
◦ The definition of supercomputing is, therefore, constantly changing.

A Rule of Thumb: A supercomputer is typically at least 100 times as powerful as a PC.

Jargon: Supercomputing is also known as High Performance Computing (HPC) or High End Computing (HEC) or Cyberinfrastructure (CI).

What is Supercomputing?

Fastest Supercomputer vs. Moore

[Figure: Fastest Supercomputer in the World. Speed in GFLOPs (log scale, 1 to 10,000,000) vs. Year (1992 to 2007); series: Fastest, Moore. GFLOPs: billions of calculations per second.]

Over recent years, supercomputers have benefitted directly from microprocessor performance gains, and have also gotten better at coordinating their efforts.

Jaguar – Oak Ridge National Laboratory (TN)
◦ 224,162 processor cores – 1.76 PetaFLOPS

Recent Champion

2008 IBM Roadrunner: 1.1 Petaflops
2009 Cray Jaguar: 1.76 Petaflops
2010 Tianhe-1A (China): 2.6 Petaflops
2011 Fujitsu K (Japan): 10.5 Petaflops
◦ 88,128 8-core processors -> 705,024 cores
◦ Needs power equivalent to 10,000 homes

Linpack numbers
◦ Core i7 – 2.3 Gflops
◦ Galaxy Nexus – 97 Mflops

Current Champ

Why should we care?

What useful thing actually takes a long time to run anymore? (especially long enough to warrant investing 7/8/9 figures on a supercomputer)

Important: It’s usually not about getting something done faster, but about getting a harder thing done in the same amount of time
◦ This is often referred to as capability computing

Hold the Phone

Simulation of physical phenomena, such as
◦ Weather forecasting
◦ Galaxy formation
◦ Oil reservoir management

Data mining: finding needles of information in a haystack of data, such as:
◦ Gene sequencing
◦ Signal processing
◦ Detecting storms that might produce tornados (want forecasting, not retrocasting…)

Visualization: turning a vast sea of data into pictures that a scientist can understand
◦ Oak Ridge National Lab has a 512-core cluster devoted entirely to visualization runs

What Is HPC Used For?

[Figure: Tornadic Storm]

What is Supercomputing About?

[Figure: Size and Speed, with a laptop shown for comparison]

Size: Many problems that are interesting™ can’t fit on a PC – usually because they need more than a few GB of RAM, or more than a few 100 GB of disk.

Speed: Many problems that are interesting™ would take a very very long time to run on a PC: months or even years. But a problem that would take a month on a PC might take only a few hours on a supercomputer.

What is Supercomputing About?

Parallelism: doing multiple things at the same time
◦ Finding and coordinating this can be challenging

The tyranny of the storage hierarchy
◦ The hardware you’re running on matters
◦ Moving data around is often more expensive than actually computing something

Supercomputing Issues

Parallel Computing Hardware

The term parallel processing is usually reserved for the situation in which a single task is executed on multiple processors
◦ Discounts the idea of simply running separate tasks on separate processors – a common thing to do to get high throughput, but not really parallel processing

Key questions in hardware design:

1. How do parallel processors share data and communicate?
◦ shared memory vs distributed memory

2. How are the processors connected?
◦ single bus vs network

The number of processors is determined by a combination of #1 and #2

Parallel Processing

Shared Memory Systems
◦ All processors share one memory address space and can access it
◦ Information sharing is often implicit

Distributed Memory Systems (AKA “Message Passing Systems”)
◦ Each processor has its own memory space
◦ All data sharing is done via programming primitives to pass messages
   • i.e. “Send data value to processor 3”
◦ Information sharing is always explicit

How is Data Shared?

Processors communicate via messages that they send to each other: send and receive

This form is required for multiprocessors that have separate private memories for each processor

◦ Cray T3E
◦ “Beowulf Cluster”
◦ SETI@HOME

Note: shared memory multiprocessors can also have separate memories – they just aren’t “private” to each processor

Message Passing
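The send/receive pattern can be sketched in a few lines. This is only an illustration under stated assumptions: Python threads stand in for processors, a `queue.Queue` stands in for the interconnect, and the names `worker` and `run_message_passing` are hypothetical. A real cluster code would use a message passing library such as MPI.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Simulated processor: it keeps private data and only talks via messages."""
    private_value = inbox.get()                         # receive: blocks until a message arrives
    outbox.put((rank, private_value * private_value))   # send the result back


def run_message_passing(values):
    """Send one value to each simulated processor and collect the squared results."""
    results = queue.Queue()
    inboxes = [queue.Queue() for _ in values]
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(len(values))
    ]
    for thread in threads:
        thread.start()
    for rank, value in enumerate(values):
        inboxes[rank].put(value)                        # "send data value to processor <rank>"
    for thread in threads:
        thread.join()
    received = dict(results.get() for _ in values)      # replies may arrive in any order
    return [received[rank] for rank in range(len(values))]


print(run_message_passing([1, 2, 3, 4]))
```

Every exchange of data is an explicit `put` (send) or `get` (receive); no worker reads another worker's variables directly.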

Processors all operate independently, but operate out of the same logical memory.

Data structures can be read by any of the processors

To properly maintain ordering in our programs, synchronization primitives are needed! (locks/semaphores)

Shared Memory Systems
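A minimal sketch of why synchronization primitives are needed, assuming Python threads as the "processors" and a shared dictionary as the common memory. The function name `parallel_count` is made up for this example.

```python
import threading


def parallel_count(num_threads, increments_per_thread):
    """Have several threads increment one shared counter, guarded by a lock."""
    counter = {"value": 0}      # shared data structure, visible to every thread
    lock = threading.Lock()     # synchronization primitive

    def work():
        for _ in range(increments_per_thread):
            with lock:          # only one thread at a time may update the counter
                counter["value"] += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]


print(parallel_count(4, 1000))
```

The read-modify-write on the counter is not atomic, so without the lock two threads could read the same old value and one update could be lost.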

[Figure: three processors, each with its own cache, connected by a single bus to memory and I/O]

Connecting Multiprocessors

Connect several processors via a single shared bus
◦ bus bandwidth limits the number of processors
◦ local cache lowers bus traffic
◦ single memory module attached to the bus

Limited to very small systems!

Intel processors support this mode by default

Single Bus Multiprocessor


The Cache Coherence Problem

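[Figure: processors P1, P2, and P3, each with a cache ($), share a bus with memory and I/O devices. Memory and two of the caches hold u:5; one processor then writes u = 7, and the later reads of u by the other processors are marked "u = ?". The five events are numbered 1 to 5.]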

Two most common variations:
◦ “snoopy” schemes
   • rely on broadcast to observe all coherence traffic
   • well suited for buses and small-scale systems
   • example: SGI Challenge or Intel x86
◦ directory schemes
   • use centralized information to avoid broadcast
   • scale well to large numbers of processors
   • example: SGI Origin/Altix

Cache Coherence Solutions

Basic Idea:
◦ all coherence-related activity is broadcast to all processors
   • e.g., on a global bus
◦ each processor monitors (aka “snoops”) these actions and reacts to any which are relevant to the current contents of its cache
◦ examples:
   • if another processor wishes to write to a line, you may need to “invalidate” (i.e. discard) the copy in your own cache
   • if another processor wishes to read a line for which you have a dirty copy, you may need to supply it

Most common approach in commercial shared-memory multiprocessors.

Protocol is a distributed algorithm: cooperating state machines
◦ Set of states, state transition diagram, actions

Snoopy Cache Coherence Schemes
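The write-invalidate idea can be sketched as a toy simulation. This is an assumption-laden illustration, not a real protocol: the classes `Bus` and `SnoopyCache` are hypothetical, writes go straight through to memory (so the "dirty copy" case never arises), and there are no per-line states.

```python
class Bus:
    """Shared bus: every write is broadcast to all attached caches."""

    def __init__(self, memory):
        self.memory = memory        # dict: address -> value
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, writer, address):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)


class SnoopyCache:
    """Toy cache that invalidates its copy when it snoops another cache's write."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}             # address -> cached value
        bus.attach(self)

    def read(self, address):
        if address not in self.lines:                  # miss: fetch from memory
            self.lines[address] = self.bus.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.bus.broadcast_write(self, address)        # tell everyone else first
        self.lines[address] = value
        self.bus.memory[address] = value               # write-through (simplifying assumption)

    def snoop_write(self, address):
        self.lines.pop(address, None)                  # invalidate (discard) our copy


bus = Bus({"u": 5})
p1, p2, p3 = SnoopyCache(bus), SnoopyCache(bus), SnoopyCache(bus)
p1.read("u")
p3.read("u")
p3.write("u", 7)
print(p1.read("u"), p2.read("u"))
```

Without the `snoop_write` invalidation, P1 would keep returning its stale cached copy of `u` after P3's write.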

In the single bus case, the bus is used for every main memory access

In the network connected model, the network is used only for inter-process communication

There are multiple “memories” BUT that doesn’t mean that there are separate memory spaces

Network Connected Multiprocessors

[Figure: three processors, each with its own cache and its own memory, connected by a network]

Network-based machines do not want to use a snooping coherence protocol!
◦ Means that every memory transaction would need to be sent everywhere!

Directory-based systems use a global “Directory” to arbitrate who owns data
◦ Point-to-point communication with the directory instead of bus broadcasts
◦ The directory keeps a list of what caches have the data in question
   • When a write to that data occurs, all of the affected caches can be notified directly

Directory Coherence
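A rough sketch of the bookkeeping a directory does, assuming a single centralized directory and caches identified by small integers. The `Directory` class and its method names are invented for illustration; real directory protocols track more state per line.

```python
class Directory:
    """Toy directory: remembers which caches hold a copy of each address."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.sharers = {}           # address -> set of cache ids holding a copy
        self.messages_sent = 0      # point-to-point invalidations sent so far

    def read(self, cache_id, address):
        self.sharers.setdefault(address, set()).add(cache_id)
        return self.memory[address]

    def write(self, cache_id, address, value):
        # Notify only the caches that actually hold the line, not everyone.
        to_notify = self.sharers.get(address, set()) - {cache_id}
        self.messages_sent += len(to_notify)
        self.sharers[address] = {cache_id}
        self.memory[address] = value
        return sorted(to_notify)


directory = Directory({"u": 5})
directory.read(0, "u")
directory.read(2, "u")
print(directory.write(2, "u", 7))   # ids of the caches that must invalidate
```

With many processors, the number of messages depends on how many caches share the line rather than on the size of the machine, which is why this approach scales better than broadcast.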

Each node (processor) contains its own local memory

Each node is connected to the network via a switch

Messages hop along the ring from node to node until they reach the proper destination

Network Topologies: Ring
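To make the hop counting concrete, here is a small sketch. It assumes nodes are numbered 0 to n-1 around the ring; whether messages may travel in both directions is not stated on the slide, so the hypothetical `ring_hops` function takes that as a parameter.

```python
def ring_hops(src, dst, n, bidirectional=True):
    """Number of node-to-node hops from src to dst on a ring of n nodes."""
    forward = (dst - src) % n            # hops when travelling one way round
    if not bidirectional:
        return forward
    return min(forward, n - forward)     # otherwise take the shorter direction


print(ring_hops(0, 6, 8))                        # bidirectional ring
print(ring_hops(0, 6, 8, bidirectional=False))   # one-way ring
```

On a bidirectional ring the worst case is about n/2 hops, so latency grows linearly with the number of nodes.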

2D grid, or mesh, of nodes

Each “inside” node has 4 neighbors
◦ “outside” nodes have fewer: 3 along an edge, only 2 at a corner

If all nodes have four neighbors, then this is a 2D torus

Network Topologies: 2D Mesh
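The difference between a mesh and a torus can be sketched by listing a node's neighbors. The function name `neighbors` and the (row, col) numbering are assumptions made for this example.

```python
def neighbors(row, col, rows, cols, torus=False):
    """Return the (row, col) neighbors of a node in a 2D mesh or 2D torus."""
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if torus:
            result.append((r % rows, c % cols))     # wrap around the edges
        elif 0 <= r < rows and 0 <= c < cols:
            result.append((r, c))                   # mesh: drop off-grid links
    return result


print(len(neighbors(1, 1, 4, 4)))               # inside node
print(len(neighbors(0, 0, 4, 4)))               # corner node
print(len(neighbors(0, 0, 4, 4, torus=True)))   # same corner on a torus
```

In a mesh, corner nodes have 2 neighbors and other edge nodes have 3; the wraparound links of a torus give every node 4.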

Also called an n-cube
◦ For n=2: 2D cube (4 nodes, a square)
◦ For n=3: 3D cube (8 nodes)
◦ For n=4: 4D cube (16 nodes)

In an n-cube, all nodes have n neighbors

Network Topologies: Hypercube

[Figure: a 3-cube and a 4-cube]
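One common way to see the n-neighbor property is to label the 2^n nodes with n-bit numbers, so that neighbors differ in exactly one bit. The sketch below assumes that labeling; the function names are hypothetical.

```python
def hypercube_neighbors(node, n):
    """Neighbors of a node in an n-cube: flip each of the n address bits once."""
    return [node ^ (1 << bit) for bit in range(n)]


def hypercube_hops(src, dst):
    """Minimum hops between two nodes: the number of address bits that differ."""
    return bin(src ^ dst).count("1")


print(hypercube_neighbors(0, 3))    # neighbors of node 000 in a 3-cube
print(hypercube_hops(0b000, 0b111)) # opposite corners of a 3-cube
```

With this labeling the longest route in an n-cube is n hops, which grows only logarithmically with the number of nodes.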

Every node can communicate directly with every other node in only one pass: a fully connected network
◦ n nodes require n^2 switches

Therefore, extremely expensive to implement!

Network Topologies: Full Crossbar

[Figure: eight processors, P0 through P7]

Fully connected, but requires passes thru multiple switch boxes

Less hardware required than crossbar, but contention can occur

Network Topologies: Butterfly Network
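A quick comparison of hardware cost, as a sketch. It assumes the butterfly is built from 2-input, 2-output switch boxes arranged in log2(n) stages of n/2 boxes each, which is the usual textbook construction but is not spelled out on the slide; the function names are made up here.

```python
def crossbar_switches(n):
    """A full crossbar needs one switch per (source, destination) pair."""
    return n * n


def butterfly_switches(n):
    """Assumed construction: log2(n) stages, each with n/2 two-by-two switch boxes."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1     # log2(n) for a power of two
    return (n // 2) * stages


for nodes in (8, 64, 1024):
    print(nodes, crossbar_switches(nodes), butterfly_switches(nodes))
```

The price of the smaller switch count is that each message crosses several boxes, and two messages may want the same box at once (contention).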

[Figure: Omega network switch box with ports A, B, C, and D; endpoints numbered 0 to 7]

A simple model for categorizing computers, with 4 categories:

1. SISD – Single Instruction Single Data
◦ the standard uniprocessor model

2. SIMD – Single Instruction Multiple Data
◦ Full systems that are “true” SIMD are no longer in use
◦ Many of the concepts exist in vector processing and, to some extent, graphics cards

3. MISD – Multiple Instruction Single Data
◦ doesn’t really make sense

4. MIMD – Multiple Instruction Multiple Data
◦ the most common model in use

Flynn’s Taxonomy of Computer Systems (1966)

A single instruction is applied to multiple data elements in parallel – same operation on all elements at the same time

Most well known examples are:
◦ Thinking Machines CM-1 and CM-2
◦ MasPar MP-1 and MP-2
◦ others

All are out of existence now

SIMD requires massive data parallelism

Usually have LOTS of very very simple processors (e.g. 8-bit CPUs)

“True” SIMD

Closely related to SIMD
◦ Cray J90, Cray T90, Cray SV1, NEC SX-6
◦ Starting to “merge” with MIMD systems
   • Cray X1E and upcoming systems (“Cascade”)

Use a single instruction to operate on an entire vector of data
◦ Difference from “True” SIMD is that data in a vector processor is not operated on in true parallel, but rather in a pipeline
◦ Uses “vector registers” to feed a pipeline for the vector operation
◦ Generally have memory systems optimized for “streaming” of large amounts of consecutive or strided data (because of this, didn’t typically have caches until late 90s)

Vector Processors

Multiple instructions are applied to multiple data

The multiple instructions can come from the same program, or from different programs

◦ Generally “parallel processing” implies the first

Most modern multiprocessors are of this form

◦ IBM Blue Gene, Cray T3D/T3E/XT3/4/5, SGI Origin/Altix

◦ Clusters

MIMD



Parallel Computing Hardware

“Supercomputer Edition”

A parallel computer built out of commodity hardware components
◦ PCs or server racks
◦ Commodity network (like ethernet)
◦ Often running a free-software OS like Linux with a low-level software library to facilitate multiprocessing

Use software to send messages between machines
◦ Standard is to use MPI (message passing interface)

The Most Common Supercomputer: Clustering

“… [W]hat a ship is … It's not just a keel and hull and a deck and sails. That's what a ship needs. But what a ship is ... is freedom.”

– Captain Jack Sparrow, “Pirates of the Caribbean”

What is a Cluster?

A cluster needs a collection of small computers, called nodes, hooked together by an interconnection network

It also needs software that allows the nodes to communicate over the interconnect.

But what a cluster is … is all of these components working together as if they’re one big computer

(a supercomputer)

What a Cluster is ….

nodes
◦ PCs
◦ Server rack nodes

interconnection network
◦ Ethernet (“GigE”)
◦ Myrinet (“10GigE”)
◦ Infiniband (low latency)
◦ The Internet (not really – typically called “Grid”)

software
◦ OS
   • Generally Linux: Redhat / CentOS / SuSE
   • Windows HPC Server
◦ Libraries (MPICH, PBLAS, MKL, NAG)
◦ Tools (Torque/Maui, Ganglia, GridEngine)

What a Cluster is ….

An Actual (Production) Cluster

[Figure: a cluster with its Interconnect and Nodes labeled]

Other Actual Clusters…

At the high end, many supercomputers are made with custom parts
◦ Custom backplane/network
◦ Custom/Reconfigurable processors
◦ Extreme custom cooling
◦ Custom memory system

Examples:
◦ IBM Blue Gene
◦ Cray XT4/5/6
◦ SGI Altix

What a Cluster is NOT…

Moore’s Law

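In 1965, Gordon Moore was an engineer at Fairchild Semiconductor.

He noticed that the number of transistors that could be squeezed onto a chip was doubling at a steady rate: about every year in his 1965 paper, revised to about every two years in 1975, and popularly quoted as about every 18 months.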

It turns out that computer speed was roughly proportional to the number of transistors per unit area.

Moore wrote a paper about this concept, which became known as “Moore’s Law.”

Moore’s Law
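As a quick worked example of what steady doubling implies, the sketch below computes the growth factor over a span of years. The 18-month default is the commonly quoted doubling period, and the function name `growth_factor` is made up for this illustration.

```python
def growth_factor(years, doubling_months=18):
    """Factor by which a quantity grows if it doubles every doubling_months."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


print(growth_factor(1.5))   # one doubling period
print(growth_factor(15))    # ten doubling periods
```

Fifteen years at an 18-month doubling period is ten doublings, a factor of about a thousand, which is why the supercomputer speed plots use a logarithmic axis.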

Fastest Supercomputer vs. Moore

[Figure: Fastest Supercomputer in the World. Speed in GFLOPs (log scale, 1 to 10,000,000) vs. Year (1992 to 2007); series: Fastest, Moore. GFLOPs: billions of calculations per second.]

Moore’s Law in Practice

[Figure: log(Speed) vs. Year, built up over four slides with trend lines for CPU, Network Bandwidth, RAM, and 1/Network Latency]

Patterson: “In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 or 1.4”
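To see how quickly that gap compounds, here is a small sketch that applies Patterson's rule of thumb over several bandwidth doublings. The function name and the choice of ten doublings are illustrative assumptions.

```python
def latency_improvement(bandwidth_doublings, per_doubling=1.4):
    """Upper bound on latency improvement after some number of bandwidth doublings."""
    return per_doubling ** bandwidth_doublings


doublings = 10
print(2 ** doublings)                                # bandwidth improvement
print(round(latency_improvement(doublings), 1))      # optimistic latency bound (1.4x per doubling)
print(round(latency_improvement(doublings, 1.2), 1)) # pessimistic latency bound (1.2x per doubling)
```

After ten doublings, bandwidth is up by a factor of 1024 while latency has improved by a factor of roughly 6 to 29 at best, so latency increasingly dominates.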

The Tyranny of the Storage Hierarchy

Henry’s Laptop
◦ Intel Core Duo T2400 1.83 GHz w/2 MB L2 Cache (“Yonah”)
◦ 2 GB (2048 MB) 667 MHz DDR2 SDRAM
◦ 100 GB 7200 RPM SATA Hard Drive
◦ DVD+RW/CD-RW Drive (8x)
◦ 1 Gbps Ethernet Adapter
◦ 56 Kbps Phone Modem

Dell Latitude D620[4]

The Storage Hierarchy
◦ Registers
◦ Cache memory
◦ Main memory (RAM)
◦ Hard disk
◦ Removable media (CD, DVD etc)
◦ Internet

Top of the list: fast, expensive, few. Bottom of the list: slow, cheap, a lot.

We want to have lots of memory for our processor:
◦ LC2K needs 2^16 words of memory (~256 KB)
◦ MIPS needs 2^32 bytes of memory (~4 GB)
◦ x86-64 needs 2^64 bytes of memory (~16 exabytes)

What are our choices?
◦ SRAM, DRAM, Magnetic Disk, paper?

Memory Hierarchy

On-chip memory
◦ Fabricated in the same technology as the processor

About 2-10 ns access (depending on size)
◦ Decoders are big
◦ Arrays are big

It will cost LOTS of money
◦ SRAM costs $10 per megabyte
   • $2.50 for LC2K
   • $40,960 for MIPS
   • $175 trillion for x86-64

Option 1: build it out of fast SRAM

About 50 ns access
◦ Why build a fast processor that stalls for dozens of cycles on each memory load?

Still costs lots of money for new machines
◦ DRAM costs $0.10 per megabyte
   • About $0.03 for LC2K
   • $400 for MIPS
   • $2 trillion for x86-64

Option 2: build it out of DRAM
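The cost figures for Options 1 and 2 can be checked with a few lines of arithmetic. The sketch assumes 4-byte LC2K words (which matches the ~256 KB figure) and 1 MB = 2^20 bytes; the names below are made up for the example.

```python
MB = 2 ** 20    # bytes per megabyte (assumed binary megabytes)

ADDRESS_SPACE_BYTES = {
    "LC2K": 2 ** 16 * 4,    # 2^16 words, assuming 4-byte words
    "MIPS": 2 ** 32,
    "x86-64": 2 ** 64,
}

PRICE_PER_MB = {"SRAM": 10.00, "DRAM": 0.10}    # dollars, as quoted on the slides


def cost_dollars(machine, technology):
    """Cost of filling a machine's whole address space with one memory technology."""
    return ADDRESS_SPACE_BYTES[machine] / MB * PRICE_PER_MB[technology]


for technology in PRICE_PER_MB:
    for machine in ADDRESS_SPACE_BYTES:
        print(f"{technology:4} {machine:7} ${cost_dollars(machine, technology):,.2f}")
```

The x86-64 rows come out in the trillions of dollars for either technology, which is the motivation for mixing a little fast memory with a lot of cheap storage.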

Use a small array of SRAM
◦ Big enough to hold whatever you use most often
◦ Small means fast!
◦ Small means cheap!

Use a larger amount of DRAM
◦ And hope that you rarely have to use it

Use a really big amount of Disk storage
◦ Disks are getting cheaper at a faster rate than we fill them up with data (for most people)

Don’t try to buy 2^64 bytes of anything
◦ It would take decades to format it anyway!

Option 6: Use a little of everything (wisely)

Use a small array of SRAM
◦ For the CACHE (hopefully for most accesses)

Use a bigger amount of DRAM
◦ For the Main memory

Use a really big amount of Disk storage
◦ For the Virtual memory (i.e. everything else)

Option 6: The Memory Hierarchy

Famous Picture of Food Memory Hierarchy

[Figure: hierarchy diagram showing CPU, Cache, Main Memory, and Disk Storage, with axes labeled Cost, Latency, and Access Freq.]

Hungry! Must eat!
◦ Option 1: go to refrigerator
   • Found -> eat!
   • Latency = 1 minute
◦ Option 2: go to store
   • Found -> purchase, take home, eat!
   • Latency = 20-30 minutes
◦ Option 3: grow food!
   • Plant, wait … wait … wait … , harvest, eat!
   • Latency = ~250,000 minutes (~6 months)

Crazy fact:
◦ ratio of growing food : going to the store = 10,000
◦ ratio of disk access : DRAM access = 200,000

A Favorite Cache Analogy
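The two ratios in the analogy can be reproduced with simple division. In this sketch the store trip uses 25 minutes (the midpoint of 20-30), the DRAM access time is the 50 ns quoted for Option 2, and the 10 ms disk access time is an assumption chosen for illustration.

```python
def latency_ratio(slow, fast):
    """How many times longer the slow option takes than the fast one."""
    return slow / fast


GROW_FOOD_MINUTES = 250_000
STORE_MINUTES = 25              # midpoint of the 20-30 minute range
DISK_ACCESS_NS = 10_000_000     # assumed: about 10 ms per disk access
DRAM_ACCESS_NS = 50             # DRAM access time quoted earlier

print(latency_ratio(GROW_FOOD_MINUTES, STORE_MINUTES))
print(latency_ratio(DISK_ACCESS_NS, DRAM_ACCESS_NS))
```

Under these assumptions, going to disk instead of DRAM is about twenty times worse, relatively speaking, than growing your own food instead of going to the store.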

The Architectural view of memory is:
◦ What the machine language sees
◦ Memory is just a big array of storage

Breaking up the memory system into different pieces – cache, main memory (made up of DRAM) and Disk storage – is not architectural.
◦ The machine language doesn’t know about it
◦ The processor may not know about it
◦ A new implementation may not break it up into the same pieces (or break it up at all).

The Hierarchy Will Not Be Televised…

Supercomputing Perspective: RAM is Slow

[Figure: CPU at 351 GB/sec[6] connected to main memory at 3.4 GB/sec[7]; the link between them is labeled Bottleneck]

The speed of data transfer between Main Memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.

Why Have Cache?

[Figure: CPU, cache at 14.2 GB/sec (4x RAM)[7], main memory at 3.4 GB/sec[7]]

Cache is much closer to the speed of the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second!
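The size of the bottleneck follows directly from the three quoted rates. The sketch below only divides those numbers; the constant and function names are invented for illustration.

```python
CPU_GB_PER_SEC = 351.0      # rate at which the CPU can process data, as quoted
RAM_GB_PER_SEC = 3.4        # main memory transfer rate, as quoted
CACHE_GB_PER_SEC = 14.2     # cache transfer rate, as quoted


def speed_ratio(fast, slow):
    """How many times faster the first rate is than the second."""
    return fast / slow


def seconds_to_move(gigabytes, gb_per_sec):
    """Time to transfer a given amount of data at a given rate."""
    return gigabytes / gb_per_sec


print(round(speed_ratio(CPU_GB_PER_SEC, RAM_GB_PER_SEC)))        # CPU vs RAM
print(round(speed_ratio(CACHE_GB_PER_SEC, RAM_GB_PER_SEC), 1))   # cache vs RAM
print(round(seconds_to_move(1.0, RAM_GB_PER_SEC), 3))            # 1 GB from RAM
```

On these figures the CPU could consume data roughly a hundred times faster than main memory can deliver it, while the cache narrows that gap by about a factor of four.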

Cache & RAM Latency: Intel T2400 (1.83 GHz)

[Figure: Memory Latency in clock cycles (0 to 60, lower is better) vs. Array Size in bytes (1,024 to 8,143,744), with three plateaus labeled 3 cycles, 14 cycles, and 47 cycles]

Cache & RAM Latencies

Many scientific codes use a lot more data than can fit in cache all at once.

Therefore, you need to ensure a high cache hit rate even though you’ve got much more data than cache.

So, how can you improve your cache hit rate?

Improving Your Cache Hit Rate
