CSC 364/664 Parallel Computation Fall 2003 Burg/Miller/Torgersen Chapter 1: Parallel Computers.
Chapter 1: Parallel Computers
Concurrency vs. True Parallelism
Concurrency is used in systems where more than one user is using a resource (e.g., the CPU or database information) at the same time.
In true parallelism, multiple processors are working simultaneously on one application problem
Flynn’s Taxonomy – Classification by Control Mechanism
A classification of parallel systems from a “flow of control” perspective
SISD – single instruction, single data
SIMD – single instruction, multiple data
MISD – multiple instructions, single data
MIMD – multiple instructions, multiple data
SISD
Single instruction, single data
Sequential programming with one processor, just like you’ve always done
SIMD
Single instruction, multiple data
One control unit issuing the same instruction to multiple CPUs that operate simultaneously on their own portions of data
Lock-step, synchronized
Vector and matrix computation lend themselves to an SIMD implementation
Examples of SIMD computers: Illiac IV, MPP, DAP, CM-2, and MasPar MP-2
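The SIMD idea — one instruction applied in lock-step across many data elements — can be sketched in plain Python. This is only a conceptual illustration (the function name and sequential loop are ours); a real SIMD machine would execute the single add instruction on all elements simultaneously.

```python
# Conceptual SIMD sketch: one "instruction" (here, addition) is applied
# in lock-step to every element of the data. On a real SIMD machine,
# each processing element would hold one element and all would execute
# the same instruction at the same time.
def simd_add(a, b):
    # Every "processor" executes the same instruction on its own element.
    return [x + y for x, y in zip(a, b)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
print(result)  # [11, 22, 33, 44]
```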
MIMD
Multiple instructions, multiple data
Each processor “doing its own thing”
Processors synchronize either through passing messages or writing values to shared memory addresses
Subcategories:
SPMD – single program, multiple data (MPI on a Linux cluster)
MPMD – multiple program, multiple data (PVM)
Examples of MIMD computers: BBN Butterfly, Intel iPSC 1 and 2, IBM SP, SP2
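The SPMD pattern can be sketched without any MPI machinery: every "process" runs the same program text but branches on its rank to work on its own slice of the data. The sketch below simulates the ranks sequentially in plain Python; the function and variable names are illustrative, not MPI calls, and with real MPI each rank would be a separate process and the final sum a reduce operation.

```python
# Conceptual SPMD sketch: the same program runs at every rank, and
# each rank block-partitions the data to compute its own partial result.
# (Ranks are simulated sequentially here; MPI would run them in parallel.)
def spmd_partial_sum(rank, nprocs, data):
    # Block-partition the data among nprocs ranks.
    chunk = (len(data) + nprocs - 1) // nprocs
    my_slice = data[rank * chunk : (rank + 1) * chunk]
    return sum(my_slice)

data = list(range(100))
partials = [spmd_partial_sum(r, 4, data) for r in range(4)]
total = sum(partials)  # with MPI this combining step would be a reduce
print(total)  # 4950
```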
MISD
Multiple instruction, single data
Doesn’t really exist, unless you consider pipelining an MISD configuration
Comparison of SIMD and MIMD
It takes a specially designed computer to do SIMD computing, since one control unit controls multiple processors.
SIMD requires only one copy of a program. MIMD systems have a copy of the program and operating system at each processor.
SIMD computers quickly become obsolete. MIMD systems can be pieced together from the most up-to-date components available.
Classification by Communication Mechanism
Shared-address-space systems
“Multiprocessors”, each processor with its own control unit
Virtual memory makes all memory addresses look like they come from one consistent space, but they don’t necessarily
Processors communicate with reads and writes
Message-passing systems
“Multicomputers”
Separate processors and separate memory addresses
Processors communicate with message passing
Shared Memory Address Space
Interprocess communication is done in the memory interface through reads and writes.
Virtual memory addresses map to real addresses.
Different processors may have memory locally attached to them.
Access could be needed to a processor’s own memory, or to the memory attached to a different processor.
Different instances of memory access could take different amounts of time. Collisions are possible.
UMA (i.e., shared memory) vs. NUMA (i.e., distributed shared memory)
Message Passing System
Interprocess communication is done at the program level using sends and receives.
Reads and writes refer only to a processor’s local memory.
Data can be packed into long messages before being sent, to compensate for latency.
Global scheduling of messages can help avoid message collisions.
Basic Architecture Terms – Clock Speed and Bandwidth
Clock speed of a processor – max # of times per sec. that a device can say something new
Bandwidth of a transmission medium (e.g., telephone line, cable line) is defined as the maximum rate at which the medium can change a signal. Bandwidth is measured in cycles per second or Hertz. Bandwidth is determined by the physical properties of the transmission medium, including the material of which it is composed.
Basic Architecture Terms – Clock Speed and Bandwidth
Data rate is a measure of the amount of data that can be sent across a transmission medium per unit time. Data rate is determined by two things (1) the bandwidth, and (2) the potential number of different things that can be conveyed each time the signal changes (which, in the case of a bus, is based on the number of parallel data lines).
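The data-rate relationship above reduces to a single multiplication: signal changes per second times bits conveyed per change. A minimal sketch (the function name and the example figures are ours, chosen to match a 33 MHz, 32-bit bus):

```python
# Data rate = bandwidth (signal changes per second)
#           * bits conveyed per signal change
# For a bus, bits per change is the number of parallel data lines.
def data_rate_bits_per_sec(bandwidth_hz, bits_per_change):
    return bandwidth_hz * bits_per_change

# A 33 MHz clock with 32 parallel data lines:
rate = data_rate_bits_per_sec(33_000_000, 32)
print(rate / 8 / 1_000_000)  # 132.0 (MB/sec)
```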
Basic Architecture Terms -- Bus
A bus is a communication medium to which all processors are connected.
Only one communication at a time is allowed on the bus.
Only one step from any source to any destination.
Bus data rate (sometimes loosely called “bandwidth”) is defined as clock speed times the number of bits transmitted at each clock pulse.
A bus is low-cost, but you can’t have very many processors attached to it.
Bus on a Motherboard
The bus transports data among the CPU, memory, and other components.
It consists of electrical circuits called traces and adapters or expansion cards.
There’s a main motherboard bus, and then buses for the CPU, memory, SCSI connections, and USB.
Types of Buses
Original IBM PC bus – 8-bit parallel, 4.77 MHz clock speed
IBM AT, 1982, introduced the ISA bus (Industry Standard Architecture), 16 bit parallel, with expansion slots, still compatible with 8-bit, 8 MHz clock speed
IBM PS/2, MCA (Microchannel Architecture) bus, 32 bit parallel, but not backwardly compatible; 10 MHz clock speed; didn’t catch on
Types of Buses
Compaq and other IBM rivals introduced EISA (Extended Industry Standard Architecture) bus in 1988, 32-bit parallel, 8.2 MHz clock speed; didn’t catch on
VL-Bus (VESA Local Bus), 32-bit parallel, close to the clock speed of the CPU, tied directly to the CPU
The trend moved to specialized buses with higher clock speeds, closer to the CPU’s clock speed, and separate from the system bus – e.g. PCI (Peripheral Component Interconnect)
PCI bus
The PCI bus can exist side-by-side with the ISA bus and system bus; in this sense it’s a “local” bus
Originally 33 MHz, 32 bits
PCI-X is 133 MHz, 64 bits, for a 1 GB/sec data transfer rate
Supports Plug and Play
See http://computer.howstuffworks.com/pci.htm
Ethernet Bus-Based Network
All nodes branch off a common line.
Each device has an Ethernet address, also known as a MAC address.
All computers receive all data transmissions (in packets). They look to see if the packet is addressed to them, and read it only if it is.
When a computer wants to transmit data, it waits until the line is free.
The CSMA/CD protocol (carrier-sense multiple access with collision detection) is used.
Basic Architecture Terms -- Ethernet
Ethernet is actually an OSI layer 2 communication protocol. It does not dictate the type of connectivity – could be copper, fiber, or wireless.
Today’s Ethernet is full-duplex, i.e., it has separate lines for send and receive.
IEEE Standard 802.3
Ethernet comes in 10, 100, and 1000 Mb/sec (1 Gb/sec) speeds.
See http://computer.howstuffworks.com/ethernet.htm
Basic Architecture Terms -- Hub
Hubs connect computers in a network.
They operate using a broadcast model. When n computers are connected to a hub, hubs simply pass through all network traffic to each of the n computers.
Basic Architecture Terms -- Switch
Unlike hubs, switches can look at data packets as they are received, determine the source and destination device, and forward the packet appropriately.
By delivering messages only to the device that the packet was intended for, switches conserve network bandwidth.
See http://howstuffworks.com/lan-switch.htm
Basic Architecture Terms -- Myrinet
A packet communication and switching technology, faster than Ethernet.
Myrinet offers a full-duplex 2+2 Gb/sec data rate and low latency. It is used in Linux clusters.
Only 16 of the nodes of WFU’s clusters are connected with Myrinet. The rest are connected with Ethernet, for cost reasons.
Classification by Interconnection Network
Static network
Bus-based network can be static (if no switches are involved)
Direct links between computers
Examples include completely connected, line/ring, mesh, tree (regular and fat), and hypercube
Dynamic network
Uses switches
Connections change according to whether a switch is open or closed
Could be arranged in stages (multistage), e.g., the Omega network
Hypercube
A d-dimensional hypercube has 2^d nodes.
Each node has a d-bit address.
Neighboring nodes differ in one bit.
Needs a routing algorithm. We’ll try one in class.
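One standard hypercube routing algorithm is dimension-order (e-cube) routing: at each step, flip the lowest-order bit in which the current node's address differs from the destination's. The sketch below is our illustration of that idea, not necessarily the algorithm covered in class:

```python
# Dimension-order (e-cube) routing on a d-dimensional hypercube:
# repeatedly move to the neighbor that fixes the lowest-order bit
# in which the current node differs from the destination.
def hypercube_route(src, dst, d):
    path = [src]
    node = src
    for bit in range(d):
        if (node ^ dst) & (1 << bit):    # addresses differ in this bit
            node ^= (1 << bit)           # step to the neighbor across it
            path.append(node)
    return path

# Route from node 000 to node 101 in a 3-cube:
print(hypercube_route(0b000, 0b101, 3))  # [0, 1, 5]
```

The number of hops equals the Hamming distance between the two addresses, which is at most d.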
Multistage Networks
See notes on Omega network from class.
Properties of Network Communication
Diameter of a network – min # of links between the 2 farthest nodes
Bisection width of a network – # of links that must be cut to divide the network into 2 equal parts
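For a d-dimensional hypercube these two properties have closed forms: the diameter is d (one bit flip per hop) and the bisection width is 2^(d-1) (cut all links along one dimension). A small sketch with a brute-force check of the diameter (function names are ours):

```python
# Closed-form network properties of a d-dimensional hypercube.
def hypercube_diameter(d):
    return d                 # farthest nodes differ in all d bits

def hypercube_bisection_width(d):
    return 2 ** (d - 1)      # links crossing one dimension's cut

# Brute-force check for d = 3: shortest-path length between two nodes
# is the Hamming distance of their addresses, so the diameter is the
# maximum Hamming distance over all pairs.
d = 3
max_dist = max(bin(a ^ b).count("1") for a in range(2**d) for b in range(2**d))
print(max_dist, hypercube_diameter(d), hypercube_bisection_width(d))  # 3 3 4
```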
Properties of Network Communication
Message latency – time taken to prepare the message to be sent (software overhead)
Network latency – time taken for a message to pass through a network
Communication latency – total time taken to send a message, including message and network latency
Deadlock – occurs when packets cannot be forwarded because they are waiting for each other in a circular way
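The latency terms above combine in a commonly used linear cost model: total communication time is the startup (software) overhead plus the transfer time through the network. The function name and example figures below are illustrative, not measurements of any particular system:

```python
# Simple linear model of communication latency:
#   total time = startup overhead (message latency)
#              + message size / bandwidth (network latency)
def communication_latency(startup_sec, message_bits, bandwidth_bits_per_sec):
    return startup_sec + message_bits / bandwidth_bits_per_sec

# 50 microseconds of startup overhead, a 1 Mb message, a 100 Mb/sec link:
t = communication_latency(50e-6, 1_000_000, 100_000_000)
print(t)
```

The model makes the point behind packing data into long messages: the startup cost is paid once per message, so fewer, larger messages amortize it better.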
Memory Hierarchy
Global memory
Local memory
Cache
Faster, but more expensive
Cache coherence must be maintained
Communication Methods
Circuit switching
Packet switching
Wormhole routing
Properties of a Parallel Program
Granularity
Speedup
Overhead
Efficiency
Cost
Scalability
Gustafson’s law
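Several of these measures have simple defining formulas, sketched below: speedup is serial time over parallel time, efficiency is speedup per processor, and Gustafson's law gives the scaled speedup S = p + (1 - p)·s, where p is the number of processors and s is the serial fraction of the parallel execution time. The function names and example numbers are ours:

```python
# Basic performance measures for a parallel program.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    # Fraction of the ideal p-fold speedup actually achieved.
    return speedup(t_serial, t_parallel) / p

def gustafson_scaled_speedup(p, serial_fraction):
    # Gustafson's law: S = p + (1 - p) * s, with s the serial fraction
    # of the (scaled) parallel execution time.
    return p + (1 - p) * serial_fraction

print(speedup(100.0, 12.5))               # 8.0
print(efficiency(100.0, 12.5, 16))        # 0.5
print(gustafson_scaled_speedup(16, 0.1))  # 14.5
```

Unlike Amdahl's law, Gustafson's law assumes the problem size grows with p, which is why the scaled speedup stays close to p for small serial fractions.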