CS 491: Parallel & Distributed Computing
Instructor: Dan Stevenson
Office: P 136, [email protected]
Course Web Site: http://www.cs.uwec.edu/~stevende/cs491/
Welcome to CS 491
Regarding HELP with course materials and assignments
◦ Come to office hours – Phillips 136
  TIME TBD (check website) OR by appointment (just e-mail or call my office)
◦ Send me an e-mail: [email protected]
Getting Help: When you have questions
Required:
◦ Michael J. Quinn, Parallel Programming in C with OpenMP and MPI
Suggested:
◦ Web tutorials during the semester
Textbooks
Overall Course Grading
Final Grade:
Exams (3): 40%
Project: 20%
Assignments (approx. every two weeks): 40%
CS 491: Overview
Slide content for this course collected from various sources, including:
• Dr. Henry Neeman, University of Oklahoma
• Dr. Libby Shoop, Macalester College
• Dr. Charlie Peck, Earlham College
• Tom Murphy, Contra Costa College
• Dr. Robert Panoff, Shodor Foundation
• Others, who will be credited…
CS 491 – Parallel and Distributed Computing
What does it mean to you?
◦ Coordinating Threads
◦ Supercomputing
◦ Multi-core Processors
◦ Beowulf Clusters
◦ Cloud Computing
◦ Grid Computing
◦ Client-Server
◦ Scientific Computing
All contexts for “splitting up work” in an explicit way
“Parallel & Distributed Computing”
In this course, we will draw mostly on the context of “Supercomputing”
◦ This is the field with the longest record of parallel computing expertise.
◦ It also has a long record of being a source for “trickle-down” technology.
Supercomputing is the biggest, fastest computing right this minute.
Likewise, a supercomputer is one of the biggest, fastest computers right this minute.
◦ The definition of supercomputing is, therefore, constantly changing.
A Rule of Thumb: A supercomputer is typically at least 100 times as powerful as a PC.
Jargon: Supercomputing is also known as High Performance Computing (HPC) or High End Computing (HEC) or Cyberinfrastructure (CI).
What is Supercomputing?
Fastest Supercomputer vs. Moore
[Figure: Fastest Supercomputer in the World. Speed in GFLOPs (log scale, 1 to 10,000,000) vs. Year (1992 to 2007); series: Fastest, Moore.]
GFLOPs: billions of calculations per second
Over recent years, supercomputers have benefitted directly from microprocessor performance gains, and have also gotten better at coordinating their efforts.
Jaguar – Oak Ridge National Laboratory (TN)
◦ 224,162 processor cores – 1.76 PetaFLOP/second
Recent Champion
2008 IBM Roadrunner: 1.1 Petaflops
2009 Cray Jaguar: 1.6
2010 Tianhe-1A (China): 2.6
2011 Fujitsu K (Japan): 10.5
◦ 88,128 8-core processors -> 705,024 cores
◦ Needs power equivalent to 10,000 homes
Linpack numbers
◦ Core i7 – 2.3 Gflops
◦ Galaxy Nexus – 97 Mflops
Current Champ
Why should we care?
What useful thing actually takes a long time to run anymore? (especially long enough to warrant investing 7/8/9 figures on a supercomputer)
Important: It’s usually not about getting something done faster, but about getting a harder thing done in the same amount of time
◦ This is often referred to as capability computing
Hold the Phone
Simulation of physical phenomena, such as
◦ Weather forecasting
◦ Galaxy formation
◦ Oil reservoir management
Data mining: finding needles of information in a haystack of data, such as:
◦ Gene sequencing
◦ Signal processing
◦ Detecting storms that might produce tornados (want forecasting, not retrocasting…)
Visualization: turning a vast sea of data into pictures that a scientist can understand
◦ Oak Ridge National Lab has a 512-core cluster devoted entirely to visualization runs
What Is HPC Used For?
[Image: Tornadic Storm]
What is Supercomputing About?
Size Speed
(Laptop)
Size: Many problems that are interesting™ can’t fit on a PC – usually because they need more than a few GB of RAM, or more than a few 100 GB of disk.
Speed: Many problems that are interesting™ would take a very very long time to run on a PC: months or even years. But a problem that would take a month on a PC might take only a few hours on a supercomputer.
What is Supercomputing About?
Parallelism: doing multiple things at the same time
◦ finding and coordinating this can be challenging
The tyranny of the storage hierarchy
◦ The hardware you’re running on matters
◦ Moving data around is often more expensive than actually computing something
Supercomputing Issues
Parallel Computing Hardware
The term parallel processing is usually reserved for the situation in which a single task is executed on multiple processors
◦ Discounts the idea of simply running separate tasks on separate processors – a common thing to do to get high throughput, but not really parallel processing
Key questions in hardware design:
1. How do parallel processors share data and communicate?
◦ shared memory vs distributed memory
2. How are the processors connected?
◦ single bus vs network
The number of processors is determined by a combination of #1 and #2
Parallel Processing
Shared Memory Systems
◦ All processors share one memory address space and can access it
◦ Information sharing is often implicit
Distributed Memory Systems (AKA “Message Passing Systems”)
◦ Each processor has its own memory space
◦ All data sharing is done via programming primitives to pass messages, e.g. “Send data value to processor 3”
◦ Information sharing is always explicit
How is Data Shared?
Processors communicate via messages that they send to each other: send and receive
This form is required for multiprocessors that have separate private memories for each processor
◦ Cray T3E
◦ “Beowulf Cluster”
◦ SETI@HOME
Note: shared memory multiprocessors can also have separate memories – they just aren’t “private” to each processor
Message Passing
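The send/receive style described above can be sketched in a few lines. This is only an illustration, assuming Python threads stand in for separate nodes and `queue.Queue` objects stand in for the interconnect; a real cluster would use a message-passing library such as MPI. The names `worker` and `distributed_sum` are made up for this sketch.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """A 'node' with private data: it only sees what arrives in its inbox."""
    local_data = inbox.get()          # receive: blocks until a message arrives
    partial = sum(local_data)         # compute on private data only
    outbox.put((rank, partial))       # send the partial result back explicitly


def distributed_sum(values, num_workers=3):
    """Split `values` across workers and combine their partial sums."""
    results = queue.Queue()
    inboxes = [queue.Queue() for _ in range(num_workers)]
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    for rank in range(num_workers):
        # "Send data to processor <rank>": every transfer is an explicit message.
        inboxes[rank].put(values[rank::num_workers])
    for t in threads:
        t.join()
    return sum(results.get()[1] for _ in range(num_workers))
```

Calling `distributed_sum(list(range(100)))` gives the same total as a plain `sum`, but no worker ever touches another worker's data: all sharing goes through a message.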
Processors all operate independently, but operate out of the same logical memory.
Data structures can be read by any of the processors
To properly maintain ordering in our programs, synchronization primitives are needed! (locks/semaphores)
Shared Memory Systems
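A minimal sketch of why synchronization primitives are needed, assuming Python threads as the "processors" and a `threading.Lock` as the primitive. The function name `count_with_lock` is hypothetical. All threads update one shared counter, and the lock makes each read-modify-write step happen one thread at a time.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Have several threads increment one shared counter under a lock."""
    counter = {"value": 0}            # data structure visible to every thread
    lock = threading.Lock()

    def add():
        for _ in range(increments):
            with lock:                # only one thread may update at a time
                counter["value"] += 1

    threads = [threading.Thread(target=add) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With the lock in place the result is always `num_threads * increments`. Without it, two threads can read the same old value and one update can be lost.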
[Figure: three processors, each with its own cache, attached to a single bus shared with Memory and I/O]
Connecting Multiprocessors
Connect several processors via a single shared bus
◦ bus bandwidth limits the number of processors
◦ local cache lowers bus traffic
◦ single memory module attached to the bus
Limited to very small systems!
Intel processors support this mode by default
Single Bus Multiprocessor
The Cache Coherence Problem
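[Figure: processors P1, P2, and P3, each with a cache ($), connected to Memory and I/O devices; events are numbered 1 to 5. Labels in the figure: u:5 (three copies), u = 7 (a write), and u = ? (two later reads).]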
Two most common variations:
◦ “snoopy” schemes
  rely on broadcast to observe all coherence traffic
  well suited for buses and small-scale systems
  example: SGI Challenge or Intel x86
◦ directory schemes
  use centralized information to avoid broadcast
  scale well to large numbers of processors
  example: SGI Origin/Altix
Cache Coherence Solutions
Basic Idea:
◦ all coherence-related activity is broadcast to all processors
  e.g., on a global bus
◦ each processor monitors (aka “snoops”) these actions and reacts to any which are relevant to the current contents of its cache
◦ examples:
  if another processor wishes to write to a line, you may need to “invalidate” (i.e. discard) the copy in your own cache
  if another processor wishes to read a line for which you have a dirty copy, you may need to supply it
Most common approach in commercial shared-memory multiprocessors.
Protocol is a distributed algorithm: cooperating state machines
◦ Set of states, state transition diagram, actions
Snoopy Cache Coherence Schemes
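The invalidate-on-write idea can be sketched as a toy simulation. This is a deliberately simplified model, not a real protocol: it assumes write-through caches (so there are never dirty copies to supply), one value per address, and a bus that reaches every cache. The class names `SnoopyBus` and `Cache` and the `demo` function are invented for the sketch.

```python
class SnoopyBus:
    """A shared bus: every write is broadcast to all attached caches."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, writer, addr):
        # Every other cache "snoops" the write and reacts if it holds the line.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)


class Cache:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.lines = {}               # addr -> cached value
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:    # miss: fetch from main memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_write(self, addr)   # others discard stale copies
        self.lines[addr] = value
        self.bus.memory[addr] = value          # write-through (simplification)

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)


def demo():
    """Two caches read u, one writes u = 7, then others read again."""
    bus = SnoopyBus({"u": 5})
    p1, p2, p3 = (Cache(name, bus) for name in ("P1", "P2", "P3"))
    before = (p1.read("u"), p3.read("u"))
    p3.write("u", 7)
    after = (p1.read("u"), p2.read("u"))
    return before, after
```

Without the `snoop_invalidate` step, P1 would keep returning its stale copy of `u` after P3's write, which is exactly the cache coherence problem.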
In the single bus case, the bus is used for every main memory access
In the network connected model, the network is used only for inter-process communication
There are multiple “memories” BUT that doesn’t mean that there are separate memory spaces
Network Connected Multiprocessors
[Figure: three processors, each with its own cache and its own memory, connected by a network]
Network-based machines do not want to use a snooping coherence protocol!
◦ Means that every memory transaction would need to be sent everywhere!
Directory-based systems use a global “Directory” to arbitrate who owns data
◦ Point-to-point communication with the directory instead of bus broadcasts
◦ The directory keeps a list of what caches have the data in question
  When a write to that data occurs, all of the affected caches can be notified directly
Directory Coherence
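A rough sketch of the bookkeeping a directory does, assuming one sharer list per address and ignoring the states a real directory protocol tracks. The `Directory` class and its method names are hypothetical.

```python
class Directory:
    """Tracks which caches hold a copy of each address."""

    def __init__(self):
        self.sharers = {}             # addr -> set of cache names

    def record_read(self, addr, cache_name):
        """A cache fetched `addr`; remember that it now holds a copy."""
        self.sharers.setdefault(addr, set()).add(cache_name)

    def record_write(self, addr, writer):
        """Return the caches to notify; the writer becomes the only holder."""
        to_notify = sorted(self.sharers.get(addr, set()) - {writer})
        self.sharers[addr] = {writer}
        return to_notify
```

If P1 and P3 have read `u` and P3 then writes it, `record_write("u", "P3")` returns `["P1"]`: one point-to-point notification, however many other processors are in the machine. A snooping bus would have sent the write to all of them.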
Each node (processor) contains its own local memory
Each node is connected to the network via a switch
Messages hop along the ring from node to node until they reach the proper destination
Network Topologies: Ring
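The cost of hopping around a ring can be computed directly. This sketch assumes nodes are numbered 0 to n-1 around the ring; whether links carry traffic in one direction or both is not stated above, so it is a parameter here. The function name `ring_hops` is made up.

```python
def ring_hops(src, dst, n, bidirectional=True):
    """Number of hops for a message from node src to node dst on an n-node ring."""
    forward = (dst - src) % n         # hops going "clockwise"
    if not bidirectional:
        return forward
    return min(forward, n - forward)  # take the shorter way around
```

On an 8-node bidirectional ring the worst case is 4 hops, so the longest path grows linearly with the number of nodes.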
2D grid, or mesh, of nodes
Each “inside” node has 4 neighbors
◦ “outside” nodes have fewer: 3 along an edge, only 2 at a corner
If all nodes have four neighbors, then this is a 2D torus
Network Topologies: 2D Mesh
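A small sketch of neighbor lookup in a mesh versus a torus, assuming nodes are addressed by (row, column) and the grid is at least 3 by 3 so wrap-around links do not collide. The function name `mesh_neighbors` is hypothetical.

```python
def mesh_neighbors(row, col, rows, cols, torus=False):
    """Return the (row, col) neighbors of a node in a 2D mesh or torus."""
    neighbors = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if torus:
            # Wrap-around links connect opposite edges of the grid.
            neighbors.append((r % rows, c % cols))
        elif 0 <= r < rows and 0 <= c < cols:
            neighbors.append((r, c))
    return neighbors
```

In a 4 by 4 mesh, an interior node gets 4 neighbors, a corner gets 2, and a non-corner edge node gets 3. With `torus=True` every node gets 4.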
Also called an n-cube
For n=2: 2D cube (4 nodes, a square)
For n=3: 3D cube (8 nodes)
For n=4: 4D cube (16 nodes)
In an n-cube, all nodes have n neighbors
Network Topologies: Hypercube
[Figure: a 3-cube and a 4-cube]
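One common way to see why every node in an n-cube has n neighbors is to label the 2^n nodes with n-bit numbers, so that two nodes are linked exactly when their labels differ in one bit. That labeling is an assumption of this sketch (it is the usual convention, but it is not spelled out above), and the function names are made up.

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in an n-cube: flip each of the n bits in turn."""
    return [node ^ (1 << bit) for bit in range(n)]


def hypercube_hops(src, dst):
    """Minimum hops between two nodes: the number of bits that differ."""
    return bin(src ^ dst).count("1")
```

Under this labeling, the farthest pair of nodes in an n-cube is only n hops apart, even though the cube has 2^n nodes.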
Every node can communicate directly with every other node in only one pass -> fully connected network
n nodes -> n^2 switches
Therefore, extremely expensive to implement!
Network Topologies: Full Crossbar
[Figure: eight processors, P0 through P7]
Fully connected, but requires passes through multiple switch boxes
Less hardware required than crossbar, but contention can occur
Network Topologies: Butterfly Network
[Figure: omega network switch box with ports labeled A, B, C, D, and a network connecting nodes 0 through 7]
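The hardware trade-off between the crossbar and the butterfly can be put in numbers. This sketch assumes the butterfly is built from 2-input, 2-output switch boxes arranged in log2(n) stages of n/2 boxes each, which is the standard omega-network layout; the function names are hypothetical.

```python
def crossbar_switches(n):
    """A full crossbar needs one switch per (source, destination) pair."""
    return n * n


def butterfly_switch_boxes(n):
    """Switch boxes in a butterfly/omega network of 2x2 boxes; n must be a power of 2."""
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1       # log2(n) stages
    return (n // 2) * stages          # n/2 boxes per stage
```

For 8 nodes that is 64 crossbar switches against 12 switch boxes, and the gap widens quickly: for 1024 nodes it is 1,048,576 against 5,120. The price is that two messages may need the same box at the same time, which is the contention mentioned above.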
A simple model for categorizing computers, with 4 categories:
1. SISD – Single Instruction Single Data
◦ the standard uniprocessor model
2. SIMD – Single Instruction Multiple Data
◦ Full systems that are “true” SIMD are no longer in use
◦ Many of the concepts exist in vector processing and, to some extent, graphics cards
3. MISD – Multiple Instruction Single Data
◦ doesn’t really make sense
4. MIMD – Multiple Instruction Multiple Data
◦ the most common model in use
Flynn’s Taxonomy of Computer Systems (1966)
A single instruction is applied to multiple data elements in parallel – same operation on all elements at the same time
Most well known examples are:
◦ Thinking Machines CM-1 and CM-2
◦ MasPar MP-1 and MP-2
◦ others
All are out of existence now
SIMD requires massive data parallelism
Usually have LOTS of very very simple processors (e.g. 8-bit CPUs)
“True” SIMD
Closely related to SIMD
◦ Cray J90, Cray T90, Cray SV1, NEC SX-6
◦ Starting to “merge” with MIMD systems
  Cray X1E and upcoming systems (“Cascade”)
Use a single instruction to operate on an entire vector of data
◦ Difference from “True” SIMD is that data in a vector processor is not operated on in true parallel, but rather in a pipeline
◦ Uses “vector registers” to feed a pipeline for the vector operation
◦ Generally have memory systems optimized for “streaming” of large amounts of consecutive or strided data
  (Because of this, didn’t typically have caches until late 90s)
Vector Processors
Multiple instructions are applied to multiple data
The multiple instructions can come from the same program, or from different programs
◦ Generally “parallel processing” implies the first
Most modern multiprocessors are of this form
◦ IBM Blue Gene, Cray T3D/T3E/XT3/4/5, SGI Origin/Altix
◦ Clusters
MIMD
Parallel Computing Hardware
“Supercomputer Edition”
A parallel computer built out of commodity hardware components
◦ PCs or server racks
◦ Commodity network (like Ethernet)
◦ Often running a free-software OS like Linux with a low-level software library to facilitate multiprocessing
Use software to send messages between machines
◦ Standard is to use MPI (Message Passing Interface)
The Most Common Supercomputer: Clustering
“… [W]hat a ship is … It's not just a keel and hull and a deck and sails. That's what a ship needs. But what a ship is ... is freedom.”
– Captain Jack Sparrow, “Pirates of the Caribbean”
What is a Cluster?
A cluster needs a collection of small computers, called nodes, hooked together by an interconnection network
It also needs software that allows the nodes to communicate over the interconnect.
But what a cluster is … is all of these components working together as if they’re one big computer
(a supercomputer)
What a Cluster is ….
nodes
◦ PCs
◦ Server rack nodes
interconnection network
◦ Ethernet (“GigE”)
◦ Myrinet (“10GigE”)
◦ Infiniband (low latency)
◦ The Internet (not really – typically called “Grid”)
software
◦ OS
  Generally Linux: Redhat / CentOS / SuSE
  Windows HPC Server
◦ Libraries (MPICH, PBLAS, MKL, NAG)
◦ Tools (Torque/Maui, Ganglia, GridEngine)
What a Cluster is ….
An Actual (Production) Cluster
[Image labels: Interconnect, Nodes]
Other Actual Clusters…
At the high end, many supercomputers are made with custom parts
◦ Custom backplane/network
◦ Custom/Reconfigurable processors
◦ Extreme Custom cooling
◦ Custom memory system
Examples:
◦ IBM Blue Gene
◦ Cray XT4/5/6
◦ SGI Altix
What a Cluster is NOT…
Moore’s Law
In 1965, Gordon Moore was an engineer at Fairchild Semiconductor.
He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months.
It turns out that computer speed was roughly proportional to the number of transistors per unit area.
Moore wrote a paper about this concept, which became known as “Moore’s Law.”
Moore’s Law
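A quick way to get a feel for the 18-month doubling figure quoted above is to compute the growth it implies over a span of years. This is just the arithmetic of repeated doubling, taking the 18-month period from the slide as given; the function name `moore_factor` is made up.

```python
def moore_factor(years, doubling_period_years=1.5):
    """Growth factor after `years` if the quantity doubles every period."""
    return 2 ** (years / doubling_period_years)
```

Fifteen years is ten doubling periods, so `moore_factor(15)` is about a thousandfold increase.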
Fastest Supercomputer vs. Moore
[Figure: Fastest Supercomputer in the World. Speed in GFLOPs (log scale, 1 to 10,000,000) vs. Year (1992 to 2007); series: Fastest, Moore.]
GFLOPs: billions of calculations per second
Moore’s Law in Practice
[Figure, built up over four slides: log(Speed) vs. Year, with curves added one at a time for CPU, Network Bandwidth, RAM, and 1/Network Latency.]
Patterson: “In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 or 1.4”
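To see how fast the gap in Patterson's rule of thumb opens up, one can compound it over several bandwidth doublings. The sketch below assumes the quoted factors (1.2 to 1.4 per doubling) hold at every step, which is an idealization; the function name `after_doublings` is hypothetical.

```python
def after_doublings(k, latency_gain_per_doubling=1.4):
    """Return (bandwidth gain, latency gain) after k bandwidth doublings."""
    bandwidth_gain = 2 ** k
    latency_gain = latency_gain_per_doubling ** k
    return bandwidth_gain, latency_gain
```

After ten doublings bandwidth is up by a factor of 1024, while latency has improved by only about 6x (at 1.2 per doubling) to about 29x (at 1.4 per doubling).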
The Tyranny of the Storage Hierarchy
Henry’s Laptop
Pentium 4 Core Duo T2400 1.83 GHz w/2 MB L2 Cache (“Yonah”)
2 GB (2048 MB) 667 MHz DDR2 SDRAM
100 GB 7200 RPM SATA Hard Drive
DVD+RW/CD-RW Drive (8x)
1 Gbps Ethernet Adapter
56 Kbps Phone Modem
Dell Latitude D620[4]
The Storage Hierarchy
Registers
Cache memory
Main memory (RAM)
Hard disk
Removable media (CD, DVD etc)
Internet
Top of the list: fast, expensive, few
Bottom of the list: slow, cheap, a lot
We want to have lots of memory for our processor:
◦ LC2K needs 2^16 words of memory ( ~ 256 KB)
◦ MIPS needs 2^32 bytes of memory ( ~ 4 GB )
◦ x86-64 needs 2^64 bytes of memory ( ~ 16 exabytes )
What are our choices?
◦ SRAM, DRAM, Magnetic Disk, paper?
Memory Hierarchy
On-chip memory
◦ Fabricated in the same technology as the processor
About 2-10 ns access (depending on size)
◦ Decoders are big
◦ Arrays are big
It will cost LOTS of money
◦ SRAM costs $10 per megabyte
  $2.50 for LC2K
  $40,960 for MIPS
  $175 trillion for x86-64
Option 1: build it out of fast SRAM
About 50 ns access
◦ Why build a fast processor that stalls for dozens of cycles on each memory load?
Still costs lots of money for new machines
◦ DRAM costs $0.10 per megabyte
  < $0.03 for LC2K
  $400 for MIPS
  $2 trillion for x86-64
Option 2: build it out of DRAM
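The price tags for Options 1 and 2 follow from multiplying each address space by the quoted cost per megabyte. The sketch below uses the per-megabyte prices from the slides and assumes 4-byte words for LC2K (which is what makes 2^16 words come to 256 KB); the names `memory_cost` and `ADDRESS_SPACE_BYTES` are made up for illustration.

```python
MEGABYTE = 2 ** 20

# Full address space of each machine, in bytes.
ADDRESS_SPACE_BYTES = {
    "LC2K": 2 ** 16 * 4,    # 2^16 words, assuming 4-byte words (256 KB)
    "MIPS": 2 ** 32,        # 4 GB
    "x86-64": 2 ** 64,      # 16 exabytes
}


def memory_cost(num_bytes, dollars_per_megabyte):
    """Cost in dollars of filling `num_bytes` at a given price per megabyte."""
    return num_bytes / MEGABYTE * dollars_per_megabyte
```

For example, `memory_cost(ADDRESS_SPACE_BYTES["MIPS"], 10)` is the SRAM cost of a full MIPS address space, and passing `0.10` instead gives the DRAM cost. Either way, the x86-64 figure comes out in the trillions of dollars.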
Use a small array of SRAM
◦ Big enough to hold whatever you use most often
◦ Small means fast!
◦ Small means cheap!
Use a larger amount of DRAM
◦ And hope that you rarely have to use it
Use a really big amount of Disk storage
◦ Disks are getting cheaper at a faster rate than we fill them up with data (for most people)
Don’t try to buy 2^64 bytes of anything
◦ It would take decades to format it anyway!
Option 6: Use a little of everything (wisely)
Use a small array of SRAM
◦ For the CACHE (hopefully for most accesses)
Use a bigger amount of DRAM
◦ For the Main memory
Use a really big amount of Disk storage
◦ For the Virtual memory (i.e. everything else)
Option 6: The Memory Hierarchy
Famous Picture of Food Memory Hierarchy
[Figure: memory hierarchy diagram showing CPU, Cache, Main Memory, and Disk Storage, annotated with Cost, Latency, and Access Freq.]
Hungry! must eat!
◦ Option 1: go to refrigerator
  Found -> eat!
  Latency = 1 minute
◦ Option 2: go to store
  Found -> purchase, take home, eat!
  Latency = 20-30 minutes
◦ Option 3: grow food!
  Plant, wait … wait … wait … , harvest, eat!
  Latency = ~250,000 minutes (~ 6 months)
Crazy fact:
ratio of growing food : going to the store = 10,000
ratio of disk access : DRAM access = 200,000
A Favorite Cache Analogy
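The two ratios in the analogy can be checked with simple division. The food numbers come from the analogy itself (using 25 minutes as the midpoint of the store trip). For the memory side, the 50 ns DRAM access time is the figure quoted earlier, while the 10 ms disk access time is an assumption chosen for this sketch, not a number from the slides.

```python
def latency_ratio(slow, fast):
    """How many times slower the slow option is (same units for both)."""
    return slow / fast


# Food analogy, in minutes: growing food vs. a trip to the store.
food_ratio = latency_ratio(250_000, 25)

# Memory, in nanoseconds: assumed 10 ms disk access vs. 50 ns DRAM access.
disk_ratio = latency_ratio(10_000_000, 50)
```

Under these assumptions, going to disk instead of DRAM is a worse detour than growing your own food instead of going to the store.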
The Architectural view of memory is:
◦ What the machine language sees
◦ Memory is just a big array of storage
Breaking up the memory system into different pieces – cache, main memory (made up of DRAM) and Disk storage – is not architectural.
◦ The machine language doesn’t know about it
◦ The processor may not know about it
◦ A new implementation may not break it up into the same pieces (or break it up at all).
The Hierarchy Will Not Be Televised…
Supercomputing Perspective: RAM is Slow
CPU: 351 GB/sec[6]
RAM: 3.4 GB/sec[7]
Bottleneck
The speed of data transfer between Main Memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.
Why Have Cache?
Cache is much closer to the speed of the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second!
RAM: 3.4 GB/sec[7]
Cache: 14.2 GB/sec (4x RAM)[7]
Cache & RAM Latency: Intel T2400 (1.83 GHz)
[Figure: Memory Latency in clock cycles (0 to 60) vs. Array Size in bytes (1,024 to 8,143,744). Three plateaus are labeled: 3 cycles, 14 cycles, 47 cycles.]
Cache & RAM Latencies
[Figure annotation: “Better” (lower latency)]
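The three measured latencies can be combined into an average cost per memory access once hit rates are known. This sketch assumes the 3, 14, and 47 cycle plateaus correspond to L1 cache, L2 cache, and RAM respectively, and that each figure is the total latency of an access served from that level. That mapping is an inference from the chart, and the function name `average_latency` is hypothetical.

```python
def average_latency(l1_hit_rate, l2_hit_rate, l1=3, l2=14, ram=47):
    """Expected cycles per access.

    l1_hit_rate: fraction of all accesses served by the first-level cache.
    l2_hit_rate: fraction of the remaining accesses served by the second level.
    """
    miss_l1 = 1 - l1_hit_rate
    beyond_l1 = l2_hit_rate * l2 + (1 - l2_hit_rate) * ram
    return l1_hit_rate * l1 + miss_l1 * beyond_l1
```

With a 90% first-level hit rate and an 80% second-level hit rate, the average is under 5 cycles. If nothing hits in cache, every access costs the full 47 cycles, which is why the hit rate matters so much for large scientific codes.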
Many scientific codes use a lot more data than can fit in cache all at once.
Therefore, you need to ensure a high cache hit rate even though you’ve got much more data than cache.
So, how can you improve your cache hit rate?
Improving Your Cache Hit Rate