Why Parallel Computing? Annual performance improvements drop from 50% per year to 20% per year...
Why Parallel Computing?
• Annual performance improvements drop from 50% per year to 20% per year
• Manufacturers focusing on multi-core systems, not single-core
• Why?
  – Smaller transistors = faster processors
  – Faster processors = increased power
  – Increased power = increased heat
  – Increased heat = unreliable processors
Parallel Programming Examples
• Embarrassingly Parallel Applications
  – Google searches employ > 1,000,000 processors
• Applications with unacceptable sequential run times
• Grand Challenge Problems – 1991 High Performance Computing Act (Public Law 102-194)
“Fundamental Grand Challenge science and engineering problems with broad economic and/or scientific impact and whose solution can be advanced by applying high performance computing techniques and resources.”
• Promotes terascale level computation over high bandwidth wide area computational grids
Grand Challenge Problems
• Require more processing power than is available on single processor systems
• Solutions would significantly benefit society
• There are no complete, commercially available solutions for these
• There is significant potential for progress with today's technology
• Examples: climate modeling, gene discovery, energy research, semantic web-based applications
Grand Challenge Problem Examples
• Associations
– Computer Research Association (CRA)
– National Science Foundation (NSF)
Partial Grand Challenge Problem List
1. Predict contaminant seepage
2. Predict airborne contaminant effects
3. Gene sequence discovery
4. Short term weather forecasts
5. Predict long term global warming
6. Predict earthquakes and volcanoes
7. Predict hurricanes and tornados
8. Automate natural language understanding
9. Computer vision
10. Nanotechnology
11. Computerized reasoning
12. Protein mechanisms
13. Predict asteroid collisions
14. Sub-atomic particle interactions
15. Model biomechanical processes
16. Manufacture new materials
17. Fundamental nature of matter
18. Transportation patterns
19. Computational fluid dynamics
Global Weather Forecasting Example
• Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) – about 5 × 10^8 cells.
• Suppose each calculation requires 200 floating point operations. In one time step, 10^11 floating point operations are necessary.
• To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 Gflops (10^9 floating point operations/s) takes 10^6 seconds – over 10 days.
• To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops (3.4 × 10^12 floating point operations/sec).
Modeling Motion of Astronomical Bodies
• Astronomical bodies attracted to each other by gravity; the force on each determines movement
• Required calculations: O(n^2), or at best O(n lg n)
• At each time step, calculate new positions
• A galaxy might have 10^11 stars
• If each calculation requires 1 ms, one iteration requires 10^9 years using the N^2 algorithm and almost a year using an efficient N lg N algorithm
Types of Parallelism
• Fine grain
  – Vector Processors
    • Matrix operations in single instructions
  – High performance optimizing compilers
    • Reorder instructions
    • Loop partitioning
    • Low level synchronizations
• Coarse grain
  – Threads, critical sections, mutual exclusion
  – Message passing
Von Neumann Bottleneck
CPU speed exceeds memory access time

Modifications (transparent to software)
• Device controllers
• Cache: fast memory to hold blocks of recently used consecutive memory locations
• Pipelines to break up an instruction into pieces and execute the pieces in parallel
• Multiple issue replicates functional units to enable executing instructions in parallel
• Fine grained multithreading after each instruction
Parallel Systems
• Shared Memory Systems
  – All cores see the same memory
  – Coordination: critical sections and mutual exclusion
• Distributed Memory Systems
  – Beowulf clusters; clusters of workstations
  – Each core has access to local memory
  – Coordination: message passing
• Hybrid (distributed memory presented as shared)
  – Uniform Memory Access (UMA)
  – Cache Only Memory Access (COMA)
  – Non-uniform Memory Access (NUMA)
• Computational Grids
  – Heterogeneous and geographically separated
Hardware Configurations
• Flynn Categories
  – SISD (single core)
  – MIMD (*our focus*)
  – SIMD (vector processors)
  – MISD
• Within MIMD
  – SPMD (*our focus*)
  – MPMD
• Multiprocessor Systems
  – Threads, critical sections
• Multi-computer Systems
  – Message passing
  – Sockets, RMI
[Figure: Shared Memory Multiprocessor – processors P0–P3 connected to a single shared Memory; Distributed Memory Multi-computer – processors P0–P3 each with a local memory M0–M3]
Hardware Configurations
• Shared Memory Systems
  – Do not scale to high numbers of processors
  – Considerations
    • Enforcing critical sections through mutual exclusion
    • Forks and joins of threads
• Distributed Memory Systems
  – Topology: the graph that defines the network
    • More connections mean higher cost
    • Latency: time to establish connections
    • Bandwidth: width of the pipe
  – Considerations
    • Appropriate message passing framework
    • Redesigning algorithms into forms that are less natural
Shared Memory Problems
• Memory contention
  – Single bus: inexpensive, but sequential access
  – Crossbar switches: parallel access, but expensive
• Cache Coherence
  – Cache doesn't match memory
    • Write through writes all changes immediately
    • Write back writes a dirty line when it is expelled
  – Processor caches require broadcast of changes and complex coherence algorithms
Distributed Memory
• Possible to use commodity systems
• Relatively inexpensive interconnects
• Requires message passing, which programmers tend to find difficult
• Must deal with network, topology, and security issues
• Hybrid systems are physically distributed but are presented to programmers as shared, at some performance cost
Network Terminology
• Latency – Time to send a "null", zero length message
• Bandwidth – Maximum transmission rate (bits/sec)
• Total edges – Total number of network connections
• Degree – Maximum connections per network node
• Connectivity – Minimum connections to disconnect
• Bisection width – Number of connections that must be cut to divide the network into two equal parts
• Diameter – Maximum hops connecting two nodes
• Dilation – Number of extra hops needed to map one topology to another
Web-based Networks
• Generally use the TCP/IP protocol
• The number of hops between nodes is not constant
• Communication incurs high latencies
• Nodes scattered over large geographical distances
• High bandwidths possible after a connection is established
• The slowest link along the path limits speed
• Resources are highly heterogeneous
• Security becomes a major concern; proxies often used
• Encryption algorithms can require significant overhead
• Subject to local policies at each node
• Example: www.globus.org
Routing Techniques
• Packet Switching
  – Message packets routed separately; assembled at the sink
• Deadlock free
  – Guarantees sufficient resources to complete transmission
• Store and Forward
  – Messages stored at a node before transmission continues
• Cut Through
  – Entire path of transmission established before transmission
• Wormhole Routing
  – Flits (a couple of bits) held at each node; the "worm" of flits moves when the next node becomes available
Myrinet
• Proprietary technology (http://www.myri.com)
• Point-to-point, full-duplex switch based technology
• Custom chip settings for parallel topologies
• Lightweight transparent cut-through routing protocol
• Thousands of processors without TCP/IP limitations
• Can embed TCP/IP messages to maximize flexibility
• Gateway for wide area heterogeneous networks
Type               Bandwidth    Latency
Ethernet           10 Mbit/s    10–15 ms
Gigabit Ethernet   1 Gbit/s     150 µs
Myrinet            10 Gbit/s    2 µs
[Figure: Myrinet multistage topology – rectangles are processors, circles are switches]
Parallel Techniques
• Peer-to-peer
  – Independent systems coordinate to run a single application
  – Mechanisms: threads, message passing
• Client-server
  – Server responds to many clients running many applications
  – Mechanisms: Remote Method Invocation, sockets

The focus of this class is peer-to-peer applications.
Popular Network Topologies
• Fully Connected and Star
• Line and Ring
• Tree and Fat Tree
• Mesh and Torus
• Hypercube
• Hybrids: Pyramid
• Multi-stage: Myrinet
Fully Connected and Star
Degree? Connectivity? Total edges? Bisection width? Diameter?
Line and Ring
Degree? Connectivity? Total edges? Bisection width? Diameter?
Tree and Fat Tree
Degree? Connectivity? Total edges?
Edges connecting a node at level k to level k−1 are twice the number of edges connecting a node from level k−2 to k−1
Bisection width? Diameter?
Mesh and Torus
Degree? Connectivity? Total edges? Bisection width? Diameter?
Hypercube
Degree? Connectivity? Total edges?
• A hypercube of degree zero is a single node
• A hypercube of degree d is two hypercubes of degree d−1, with edges connecting the corresponding nodes
Bisection width? Diameter?
Pyramid
Degree? Connectivity? Total edges? Bisection width? Diameter?
A hybrid network combining a mesh and a tree
Multistage Interconnection Network
Example: Omega network

[Figure: eight inputs (000–111) connected to eight outputs (000–111) through stages of 2×2 switch elements with straight-through or crossover connections]
Distributed Shared Memory
Making the main memory of a group of interconnected computers look like a single memory with a single address space. Shared memory programming techniques can then be used.

[Figure: processors in separate computers exchange messages over an interconnection network to present a single shared memory]
Parallel Performance Metrics
• Complexity (Big Oh notation)
  – f(n) = O(g(n)) if there exist constants z, c > 0 such that f(n) ≤ c·g(n) when n > z
• Speed up: s(p) = t1/tp
• Cost: C(p) = tp*p
• Efficiency: E(p) = t1/C(p)
• Scalability– Imprecise term to measure impact of adding processors– We might say an application scales to 256 processors– Can refer to hardware, software, or both
Parallel Run Time
• Sequential execution time: t1
  – t1 = computation time of the best sequential algorithm
• Communication overhead: tcomm = m(tstartup + n*tdata)
  – tstartup = latency (time to send a message with no data)
  – tdata = time to send one data element
  – n = number of data elements
  – m = number of messages
• Computation overhead: tcomp = f(n, p)
• Parallel execution time: tp = tcomp + tcomm
  – tp reflects the worst case execution time over all processors
Estimating Scalability
Parallel Visualization Tools
[Figure: space-time diagram for Processes 1–3, with each time line segmented into computing, waiting, and message-passing system routine phases, and arrows showing messages between processes]
Observe using a space-time diagram (or process-time diagram)
Superlinear Speed-up (s(p) > p)

Reasons for superlinear speed-up:
1. Non-optimal sequential algorithm
   a. Solution: compare to an optimal sequential algorithm
   b. Parallel versions are often different from sequential versions
2. Specialized hardware on certain processors
   a. A processor has fast graphics but computes slowly
   b. NASA superlinear distributed grid application
3. Average case doesn't match a single run
   a. Consider a search application
   b. What are the speed-up possibilities?
Speed-up Potential
• Amdahl's "pessimistic" law
  – The fraction of sequential processing (f) is fixed
  – S(p) = t1 / (f*t1 + (1-f)*t1/p) → 1/f as p → ∞
• Gustafson's "optimistic" law
  – Greater data implies the parallel portion grows
  – Assumes more capability leads to more data
  – S(p) = f + (1-f)*p
• For each law
  – What is the best speed-up if f = 0.25?
  – What is the speed-up for 16 processors?
• Which assumption is valid?
Challenges
• Running multiple instances of a sequential program won't make effective use of parallel resources
• Without programmer optimizations, additional processors will not improve overall system performance

Solutions
• Connect networks of systems together in a peer-to-peer manner
• Rewrite and parallelize existing programs
  – Algorithms for parallel systems are drastically different from those that execute sequentially
• Translation programs that automatically parallelize serial programs
  – This is very difficult to do; success has been limited
• Operating system thread/process allocation
  – Some benefit, but not a general solution
Compute and Merge Results
• Serial Algorithm:

  result = 0;
  for (int i = 0; i < N; i++) { result += merge(compute(i)); }

• Parallel Algorithm (first try):
  – Each processor, P, performs N/P computations
  – IF P > 0 THEN send partial results to the master (P = 0)
  – ELSE receive and merge partial results

Is this the best we can do?
How many compute calls? How many merge calls?
How is work distributed among the processors?
Multiple Cores Forming a Global Sum

[Figure: tree-structured global sum across multiple cores. Copyright © 2010, Elsevier Inc. All rights reserved]

• How many merges must the master do?
• Suppose 1024 processors. Then how many merges would the master do?
• Note the difference from the first approach
Parallel Algorithms
• Problem: Three helpers must mow, weed-eat, and pull weeds on a large field
• Task Level
  – Each helper performs one of the tasks over the entire field
• Data Level
  – Each helper does all three tasks over one third of the field
Case Study
• Millions of doubles
• Thousands of bins
• We want to create a histogram of the number of values present in each bin
• How would we program this sequentially?
• What parallel algorithm would we use?
  – Using a shared memory system
  – Using a distributed memory system
In This Class We
• Investigate converting sequential programs to make use of parallel facilities
• Devise algorithms that are parallel in nature
• Use C with industry standard extensions
  – Message-Passing Interface (MPI) via mpich
  – POSIX Threads (Pthreads)
  – OpenMP
Parallel Program Development
• Cautions
  – Parallel programming is harder than sequential programming
  – Some algorithms don't lend themselves to running in parallel
• Advised Steps of Development
  – Step 1: Program and test as much as possible sequentially
  – Step 2: Code the parallel version
  – Step 3: Run in parallel; one processor with a few threads
  – Step 4: Add more threads as confidence grows
  – Step 5: Run in parallel with a small number of processors
  – Step 6: Add more processes as confidence grows
• Tools
  – There are parallel debuggers that can help
  – Insert assertion error checks within the code
  – Instrument the code (add print statements)
  – Timing: time(), gettimeofday(), clock(), MPI_Wtime()