Introduction to High Performance Computing:
Parallel Computing, Distributed Computing, Grid Computing and More
Dr. Jay Boisseau
Director, Texas Advanced Computing Center
December 3, 2001
The University of Texas at Austin
Texas Advanced Computing Center
Introduction to High Performance Computing
Outline
• Preface
• What is High Performance Computing?
• Parallel Computing
• Distributed Computing, Grid Computing, and More
• Future Trends in HPC
Introduction to High Performance Computing
Purpose
• Purpose of this workshop:
– to educate researchers about the value and impact of high performance computing (HPC) techniques and technologies in conducting computational science and engineering
• Purpose of this presentation:
– to educate researchers about the techniques and tools of parallel computing, and to show them the possibilities presented by distributed computing and Grid computing
Introduction to High Performance Computing
Goals
• Goals of this presentation are to help you:
1. understand the ‘big picture’ of high performance computing
2. develop a comprehensive understanding of parallel computing
3. begin to understand how Grid and distributed computing will further enhance computational science capabilities
Introduction to High Performance Computing
Content and Context
• This material is an introduction and an overview
– It is not a comprehensive treatment of HPC, so further reading (much more!) is recommended.
• This presentation is followed by additional speakers with detailed presentations on specific HPC and science topics
• Together, these presentations will help prepare you to use HPC in your scientific discipline.
Introduction to High Performance Computing
Background - me
• Director of the Texas Advanced Computing Center (TACC) at the University of Texas
• Formerly at San Diego Supercomputer Center (SDSC), Arctic Region Supercomputing Center
• 10+ years in HPC
• Have known Luis for 4 years - plan to develop a strong relationship between TACC and CeCalCULA
Introduction to High Performance Computing
Background – TACC
• Mission:
– to enhance the academic research capabilities of the University of Texas and its affiliates through the application of advanced computing resources and expertise
• TACC activities include:
– Resources
– Support
– Development
– Applied research
Introduction to High Performance Computing
TACC Activities
• TACC resources and support include:
– HPC systems
– Scientific visualization resources
– Data storage/archival systems
• TACC research and development areas:
– HPC
– Scientific Visualization
– Grid Computing
Introduction to High Performance Computing
Current HPC Systems
[Diagram: TACC's current HPC systems, connected by FDDI and HiPPI networks through an Ascend router:
– CRAY SV1 ("golden"): 16 CPUs, 16 GB memory
– CRAY T3E ("aurora"): 256+ processors, 128 MB/proc, 500 GB disk
– IBM SP ("azure"): 64+ processors, 256 MB/proc, 300 GB disk
– Archive: 640 GB]
Introduction to High Performance Computing
New HPC Systems
• Four IBM p690 HPC servers
– 16 Power4 processors per server
• 1.3 GHz: 5.2 Gflops per proc, 83.2 Gflops per server
– 16 GB shared memory
• >200 GB/s memory bandwidth!
– 144 GB disk
• 1 TB disk to partition across servers
• Will configure as a single system (1/3 Tflop) with a single GPFS file system (1 TB) in 2Q02
Introduction to High Performance Computing
New HPC Systems
• IA64 Cluster
– 20 2-way nodes
• Itanium (800 MHz) processors
• 2 GB memory/node
• 72 GB disk/node
– Myrinet 2000 switch
– 180 GB shared disk
• IA32 Cluster
– 32 2-way nodes
• Pentium III (1 GHz) processors
• 1 GB memory/node
• 18.2 GB disk/node
– Myrinet 2000 switch
• 750 GB IBM GPFS parallel file system for both clusters
Introduction to High Performance Computing
World-Class Vislab
• SGI Onyx2
– 24 CPUs, 6 Infinite Reality 2 graphics pipelines
– 24 GB memory, 750 GB disk
• Front and rear projection systems
– 3x1 cylindrically-symmetric Power Wall
– 5x2 large-screen, 16:9 panel Power Wall
• Matrix switch between systems, projectors, rooms
Introduction to High Performance Computing
More Information
• URL: www.tacc.utexas.edu
• E-mail addresses:
– General information: [email protected]
– Technical assistance: [email protected]
• Telephone numbers:
– Main office: (512) 475-9411
– Facsimile: (512) 475-9445
– Operations room: (512) 475-9410
Introduction to High Performance Computing
Outline
• Preface
• What is High Performance Computing?
• Parallel Computing
• Distributed Computing, Grid Computing, and More
• Future Trends in HPC
Introduction to High Performance Computing
‘Supercomputing’
• First HPC systems were vector-based systems (e.g. Cray)
– named ‘supercomputers’ because they were an order of magnitude more powerful than commercial systems
• Now, ‘supercomputer’ has little meaning
– large systems are now just scaled-up versions of smaller systems
• However, ‘high performance computing’ has many meanings
Introduction to High Performance Computing
HPC Defined
• High performance computing:
– can mean a high flop count
• per processor
• totaled over many processors working on the same problem
• totaled over many processors working on related problems
– can mean faster turnaround time
• more powerful system
• scheduled to first available system(s)
• using multiple systems simultaneously
Introduction to High Performance Computing
My Definitions
• HPC: any computational technique that solves a large problem faster than is possible using single, commodity systems
– Custom-designed, high-performance processors (e.g. Cray, NEC)
– Parallel computing
– Distributed computing
– Grid computing
Introduction to High Performance Computing
My Definitions
• Parallel computing: single systems with many processors working on the same problem
• Distributed computing: many systems loosely coupled by a scheduler to work on related problems
• Grid Computing: many systems tightly coupled by software and networks to work together on single problems or on related problems
Introduction to High Performance Computing
Importance of HPC
• HPC has had tremendous impact on all areas of computational science and engineering in academia, government, and industry.
• Many problems have been solved with HPC techniques that were impossible to solve with individual workstations or personal computers.
Introduction to High Performance Computing
Outline
• Preface
• What is High Performance Computing?
• Parallel Computing
• Distributed Computing, Grid Computing, and More
• Future Trends in HPC
Introduction to High Performance Computing
What is a Parallel Computer?
• Parallel computing: the use of multiple computers or processors working together on a common task
• Parallel computer: a computer that contains multiple processors:
– each processor works on its section of the problem
– processors are allowed to exchange information with other processors
Introduction to High Performance Computing
Parallel vs. Serial Computers
• Two big advantages of parallel computers:
1. total performance
2. total memory
• Parallel computers enable us to solve problems that:
– benefit from, or require, fast solution
– require large amounts of memory
– example that requires both: weather forecasting
Introduction to High Performance Computing
Parallel vs. Serial Computers
• Some benefits of parallel computing include:
– more data points
• bigger domains
• better spatial resolution
• more particles
– more time steps
• longer runs
• better temporal resolution
– faster execution
• faster time to solution
• more solutions in the same time
• larger simulations in real time
Introduction to High Performance Computing
Serial Processor Performance
[Chart: single-processor performance vs. time, showing the Moore's Law trend and an uncertain future leveling off]
Although Moore’s Law ‘predicts’ that single processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached
Introduction to High Performance Computing
Types of Parallel Computers
• The simplest and most useful way to classify modern parallel computers is by their memory model:
– shared memory
– distributed memory
Introduction to High Performance Computing
[Diagram: a shared memory system (several processors P connected to one memory pool over a bus) and a distributed memory system (processor-memory pairs M/P connected by a network)]
Shared memory - single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)
Shared vs. Distributed Memory
Introduction to High Performance Computing
[Diagram: several processors sharing a single memory over one bus]
Uniform memory access (UMA): Each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs (Sun E10000)
[Diagram: two bus-based processor/memory groups joined by a network]
Non-uniform memory access (NUMA): Time for memory access depends on location of data. Local access is faster than non-local access. Easier to scale than SMPs (SGI Origin)
Shared Memory: UMA vs. NUMA
Introduction to High Performance Computing
Distributed Memory: MPPs vs. Clusters
• Processor-memory nodes are connected by some type of interconnect network
– Massively Parallel Processor (MPP): tightly integrated, single system image
– Cluster: individual computers connected by s/w
[Diagram: many CPU-memory nodes connected by an interconnect network]
Introduction to High Performance Computing
Processors, Memory, & Networks
• Both shared and distributed memory systems have:
1. processors: now generally commodity RISC processors
2. memory: now generally commodity DRAM
3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
• We will now begin to describe these pieces in detail, starting with definitions of terms.
Introduction to High Performance Computing
Processor-Related Terms
Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on the design of the processor. Measured in nanoseconds (~1-5 for the fastest processors). Inverse of the clock frequency: for example, a 500 MHz processor has a clock period of 2 ns.
Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
Register: a small, extremely fast location for storing data or instructions in the processor.
Introduction to High Performance Computing
Processor-Related Terms
Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
Pipeline : technique enabling multiple instructions to be overlapped in execution.
Superscalar: multiple instructions are possible per clock period.
Flops: floating point operations per second.
Introduction to High Performance Computing
Processor-Related Terms
Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so processor can execute more instructions more rapidly.
Translation-Lookaside Buffer (TLB): keeps addresses of pages (block of memory) in main memory that have recently been accessed (a cache for memory addresses)
Introduction to High Performance Computing
Memory-Related Terms
SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.
DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but hold more bits and are much less expensive (10x cheaper).
Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.
Introduction to High Performance Computing
Interconnect-Related Terms
• Latency:
– Networks: How long does it take to start sending a "message"? Measured in microseconds.
– Processors: How long does it take to output the results of some operation, such as a pipelined floating point add or divide?
• Bandwidth: What data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec.
Introduction to High Performance Computing
Interconnect-Related Terms
Topology: the manner in which the nodes are connected.
– The best choice would be a fully connected network (every processor to every other), but this is infeasible for cost and scaling reasons.
– Instead, processors are arranged in some variation of a grid, torus, or hypercube.
[Diagrams: 3-d hypercube, 2-d mesh, 2-d torus]
Introduction to High Performance Computing
Processor-Memory Problem
• Processors issue instructions roughly every nanosecond.
• DRAM can be accessed roughly every 100 nanoseconds (!).
• DRAM cannot keep processors busy! And the gap is growing:
– processors getting faster by 60% per year
– DRAM getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough)
Introduction to High Performance Computing
Processor-Memory Performance Gap
[Chart: processor vs. DRAM performance, 1980-2000. Processor ("Moore's Law") performance grows ~60%/yr while DRAM grows ~7%/yr, so the processor-memory performance gap grows ~50%/yr. From D. Patterson, CS252, Spring 1998 ©UCB]
Introduction to High Performance Computing
Processor-Memory Performance Gap
• The problem becomes worse when remote (distributed or NUMA) memory is needed
– network latency is roughly 1000-10000 nanoseconds (roughly 1-10 microseconds)
– networks are getting faster, but not fast enough
• Therefore, cache is used in all processors
– almost as fast as processors (same circuitry)
– sits between processors and local memory
– expensive, can only use small amounts
– must design system to load cache effectively
Introduction to High Performance Computing
[Diagram: CPU, cache, and main memory]
Processor-Cache-Memory
• Cache is much smaller than main memory and hence there is mapping of data from main memory to cache.
Introduction to High Performance Computing
[Diagram: the memory hierarchy from CPU through cache, local memory, and remote memory; going down the hierarchy, speed decreases, size increases, and cost per bit decreases]
Memory Hierarchy
Introduction to High Performance Computing
Cache-Related Terms
• ICACHE: Instruction cache
• DCACHE (L1): Data cache closest to registers
• SCACHE (L2): Secondary data cache
– Data from SCACHE has to go through DCACHE to registers
– SCACHE is larger than DCACHE
– Not all processors have SCACHE
Introduction to High Performance Computing
Cache Benefits
• The data cache was designed with two key concepts in mind
– Spatial Locality
• When an element is referenced, its neighbors will be referenced also
• Cache lines are fetched together
• Work on consecutive data elements in the same cache line
– Temporal Locality
• When an element is referenced, it might be referenced again soon
• Arrange code so that data in cache is reused often
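The sketch below is not from the original slides; the arrays a and b and the loop bounds are made up for illustration. It shows how these two ideas translate into loop ordering in Fortran, which stores arrays column-major: the inner loop runs over the first index, so consecutive iterations touch consecutive elements of the same cache line.

      program locality
      integer n, i, j
      parameter (n = 1000)
      real a(n,n), b(n,n)
c     Fill b with some data
      do j = 1, n
         do i = 1, n
            b(i,j) = 1.0
         enddo
      enddo
c     Column-major traversal: the inner loop runs over the first array
c     index, so successive iterations hit consecutive memory locations
c     in the same cache line (spatial locality). Reusing a value soon
c     after it is loaded exploits temporal locality.
      do j = 1, n
         do i = 1, n
            a(i,j) = 2.0 * b(i,j)
         enddo
      enddo
      end

Swapping the two loops (inner loop over j) would stride through memory n elements at a time and waste most of each cache line.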
Introduction to High Performance Computing
[Diagram: direct-mapped cache - each main memory block maps to one cache location]
Direct-Mapped Cache
• Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is direct mapping from any block address in memory to a single location in the cache.
Introduction to High Performance Computing
[Diagram: fully associative cache - a main memory block can map to any cache location]
Fully Associative Cache
• Fully Associative Cache : A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be
associated with any entry in the cache.
Introduction to High Performance Computing
[Diagram: 2-way set-associative cache - a main memory block can map to either of two cache locations]
Set Associative Cache
• Set associative cache: The middle range of designs between direct-mapped cache and fully associative cache is called set-associative cache. In an N-way set-associative cache, a block from main memory can go into N (N > 1) locations in the cache.
Introduction to High Performance Computing
Cache-Related Terms
Least Recently Used (LRU): Cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block.
Random Replace: Cache replacement strategy for set associative caches. A cache block is randomly replaced.
Introduction to High Performance Computing
Example: CRAY T3E Cache
• The CRAY T3E processors can execute
– 2 floating point ops (1 add, 1 multiply) and
– 2 integer/memory ops (includes 2 loads or 1 store)
• To help keep the processors busy
– on-chip 8 KB direct-mapped data cache
– on-chip 8 KB direct-mapped instruction cache
– on-chip 96 KB 3-way set-associative secondary data cache with random replacement
Introduction to High Performance Computing
Putting the Pieces Together
• Recall:
– Shared memory architectures:
• Uniform Memory Access (UMA): Symmetric Multi-Processors (SMP). Ex: Sun E10000
• Non-Uniform Memory Access (NUMA): Most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA) systems. Ex: SGI Origin 2000
– Distributed memory architectures:
• Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP
• Clusters: commodity nodes connected by interconnect. Example: Beowulf clusters.
Introduction to High Performance Computing
Symmetric Multiprocessors (SMPs)
• SMPs connect processors to global shared memory using one of:
– bus
– crossbar
• Provides a simple programming model, but has problems:
– buses can become saturated
– crossbar size must increase with # processors
• Problem grows with number of processors, limiting maximum size of SMPs
Introduction to High Performance Computing
Shared Memory Programming
• Programming models are easier since message passing is not necessary. Techniques:
– autoparallelization via compiler options
– loop-level parallelism via compiler directives
– OpenMP
– pthreads
• More on programming models later.
Introduction to High Performance Computing
Massively Parallel Processors
• Each processor has its own memory:
– memory is not shared globally
– adds another layer to the memory hierarchy (remote memory)
• Processor/memory nodes are connected by an interconnect network
– many possible topologies
– processors must pass data via messages
– communication overhead must be minimized
Introduction to High Performance Computing
Communications Networks
• Custom
– Many vendors have custom interconnects that provide high performance for their MPP systems
– The CRAY T3E interconnect is the fastest for MPPs: lowest latency, highest bandwidth
• Commodity
– Used in some MPPs and all clusters
– Myrinet, Gigabit Ethernet, Fast Ethernet, etc.
Introduction to High Performance Computing
Types of Interconnects
• Fully connected
– not feasible
• Array and torus
– Intel Paragon (2D array), CRAY T3E (3D torus)
• Crossbar
– IBM SP (8 nodes)
• Hypercube and fat tree
– SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree)
• Combinations of some of the above
– IBM SP (crossbar & fully connected up to 80 nodes)
– IBM SP (fat tree for > 80 nodes)
Introduction to High Performance Computing
Clusters
• Similar to MPPs
– Commodity processors and memory
• Processor performance must be maximized
– Memory hierarchy includes remote memory
– No shared memory; message passing
• Communication overhead must be minimized
• Different from MPPs
– All commodity, including interconnect and OS
– Multiple independent systems: more robust
– Separate I/O systems
Introduction to High Performance Computing
Cluster Pros and Cons
• Pros
– Inexpensive
– Fastest processors first
– Potential for true parallel I/O
– High availability
• Cons:
– Less mature software (programming and system)
– More difficult to manage (changing slowly)
– Lower performance interconnects: not as scalable to large numbers (but have almost caught up!)
Introduction to High Performance Computing
Distributed Memory Programming
• Message passing is most efficient
– MPI
– MPI-2
– Active/one-sided messages
• Vendor libraries: SHMEM (T3E), LAPI (SP)
• Coming in MPI-2
• Shared memory models can be implemented in software, but are not as efficient.
• More on programming models in the next section.
Introduction to High Performance Computing
“Distributed Shared Memory”
• More generally called cc-NUMA (cache coherent NUMA)
• Consists of m SMPs with n processors in a global address space:
– Each processor has some local memory (SMP)
– All processors can access all memory: extra "directory" hardware on each SMP tracks values stored in all SMPs
– Hardware guarantees cache coherency
– Access to memory on other SMPs is slower (NUMA)
Introduction to High Performance Computing
“Distributed Shared Memory”
• Easier to build because of slower access to remote memory (no expensive bus/crossbar)
• Similar cache problems
• Code writers should be aware of data distribution
• Load balance: Minimize access of “far” memory
Introduction to High Performance Computing
DSM Rationale and Realities
• Rationale: combine the ease of SMP programming with the scalability of MPP programming, at roughly the cost of an MPP
• Reality: NUMA introduces additional layers into the memory hierarchy relative to SMPs, so scalability is limited if the system is programmed as an SMP
• Reality: Performance and high scalability require programming to the architecture.
Introduction to High Performance Computing
Clustered SMPs
• Simpler than DSMs:
– composed of nodes connected by a network, like an MPP or cluster
– each node is an SMP
– processors on one SMP do not share memory on other SMPs (no directory hardware in SMP nodes)
– communication between SMP nodes is by message passing
– Ex: IBM Power3-based SP systems
Introduction to High Performance Computing
Clustered SMP Diagram
[Diagram: two bus-based SMP nodes, each with several processors and its own memory, connected by a network]
Introduction to High Performance Computing
Reasons for Clustered SMPs
• Natural extension of SMPs and clusters
– SMPs offer great performance up to their crossbar/bus limit
– Connecting nodes is how memory and performance are increased beyond SMP levels
– Can scale to larger numbers of processors with a less scalable interconnect
– Maximum performance:
• Optimize at SMP level - no communication overhead
• Optimize at MPP level - fewer messages necessary for the same number of processors
Introduction to High Performance Computing
Clustered SMP Drawbacks
• Clustering SMPs has drawbacks
– No shared memory access over the entire system, unlike DSMs
– Has other disadvantages of DSMs
• Extra layer in the memory hierarchy
• Performance requires more effort from the programmer than SMPs or MPPs
• However, clustered SMPs provide a means for obtaining very high performance and scalability
Introduction to High Performance Computing
Clustered SMP: NPACI “Blue Horizon”
• IBM SP system:
– Power3 processors: good peak performance (~1.5 Gflops)
– better sustained performance (highly superscalar and pipelined) than many other processors
– SMP nodes have 8 Power3 processors
– System has 144 SMP nodes (1,152 processors total)
Introduction to High Performance Computing
Programming Clustered SMPs
• NSF: Most users use only MPI, even for intra-node messages
• DoE: Most applications are being developed with MPI (between nodes) and OpenMP (intra-node)
• MPI+OpenMP programming is more complex, but might yield maximum performance
• Active messages and pthreads would theoretically give maximum performance
Introduction to High Performance Computing
[Diagram: data parallelism vs. task parallelism]
Types of Parallelism
• Data parallelism: each processor performs the same task on different sets or sub-regions of data
• Task parallelism: each processor performs a different task
• Most parallel applications fall somewhere on the continuum between these two extremes.
Introduction to High Performance Computing
Data vs. Task Parallelism
• Example of data parallelism:
– In a bottling plant, we see several ‘processors’, or bottle cappers, applying bottle caps concurrently on rows of bottles.
• Example of task parallelism:
– In a restaurant kitchen, we see several chefs, or ‘processors’, working simultaneously on different parts of different meals.
– A good restaurant kitchen also demonstrates load balancing and synchronization; more on those topics later.
Introduction to High Performance Computing
Example: Master-Worker Parallelism
• A common form of parallelism used in developing applications years ago (especially in PVM) was Master-Worker parallelism:
– a single processor is responsible for distributing data and collecting results (task parallelism)
– all other processors perform the same task on their portion of the data (data parallelism)
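A minimal sketch of this pattern in Fortran 77 with MPI follows. It is illustrative only: the chunk size, message tags, and the trivial "work" of summing a chunk are made-up placeholders, not part of the original slides.

      program masterworker
      include 'mpif.h'
      integer ierr, myid, nprocs, status(MPI_STATUS_SIZE)
      integer i, n
      parameter (n = 100)
      real chunk(n), result
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      if (myid .eq. 0) then
c        Master: distribute a chunk of data to each worker ...
         do i = 1, n
            chunk(i) = real(i)
         enddo
         do i = 1, nprocs - 1
            call MPI_SEND(chunk, n, MPI_REAL, i, 1,
     &                    MPI_COMM_WORLD, ierr)
         enddo
c        ... then collect one result from each worker
         do i = 1, nprocs - 1
            call MPI_RECV(result, 1, MPI_REAL, i, 2,
     &                    MPI_COMM_WORLD, status, ierr)
         enddo
      else
c        Worker: receive a chunk, compute a partial result, send it back
         call MPI_RECV(chunk, n, MPI_REAL, 0, 1,
     &                 MPI_COMM_WORLD, status, ierr)
         result = 0.0
         do i = 1, n
            result = result + chunk(i)
         enddo
         call MPI_SEND(result, 1, MPI_REAL, 0, 2,
     &                 MPI_COMM_WORLD, ierr)
      endif
      call MPI_FINALIZE(ierr)
      end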
Introduction to High Performance Computing
Parallel Programming Models
• The primary programming models in current use are
– Data parallelism - operations are performed in parallel on collections of data structures. A generalization of array operations.
– Message passing - processes possess local memory and communicate with other processes by sending and receiving messages.
– Shared memory - each processor has access to a single shared pool of memory.
Introduction to High Performance Computing
Parallel Programming Models
• Most parallelization efforts fall under the following categories:
– Codes can be parallelized using message-passing libraries such as MPI.
– Codes can be parallelized using compiler directives such as OpenMP.
– Codes can be written in new parallel languages.
Introduction to High Performance Computing
Programming Models and Architectures
• Natural mappings
– data parallel: CM-2 (SIMD machine)
– message passing: IBM SP (MPP)
– shared memory: SGI Origin, Sun E10000
• Implemented mappings
– HPF (a data parallel language) and MPI (a message passing library) have been implemented on nearly all parallel machines
– OpenMP (a set of directives, etc. for shared memory programming) has been implemented on most shared memory systems.
Introduction to High Performance Computing
SPMD
• All current machines are MIMD systems (Multiple Instruction, Multiple Data) and are capable of either data parallelism or task parallelism.
• The primary paradigm for programming parallel machines is the SPMD paradigm: Single Program, Multiple Data
– each processor runs a copy of the same source code
– enables data parallelism (through data decomposition) and task parallelism (through intrinsic functions that return the processor ID)
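As a hedged illustration (not from the original slides), the skeleton below shows the SPMD idea: every processor runs the same program, uses its MPI rank to pick its block of loop iterations (data parallelism), and could equally branch on the rank to perform a different task (task parallelism). The array size and the work done in the loop are arbitrary.

      program spmd
      include 'mpif.h'
      integer ierr, myid, nprocs
      integer i, n, istart, iend
      parameter (n = 1000)
      real x(n)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     Data decomposition: each processor computes its own block of i values
      istart = myid * (n / nprocs) + 1
      iend   = (myid + 1) * (n / nprocs)
      if (myid .eq. nprocs - 1) iend = n
      do i = istart, iend
         x(i) = real(i) * real(i)
      enddo
c     Task parallelism is also possible: branch on the processor ID
      if (myid .eq. 0) then
         write(*,*) 'processor 0 could do a different task here'
      endif
      call MPI_FINALIZE(ierr)
      end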
Introduction to High Performance Computing
OpenMP - Shared Memory Standard
• OpenMP is a new standard for shared memory programming: SMPs and cc-NUMAs.
– OpenMP provides a standard set of directives, run-time library routines, and environment variables for parallelizing code under a shared memory model.
– Very similar to Cray PVP autotasking directives, but with much more functionality. (Cray now supports OpenMP.)
– See http://www.openmp.org for more information
Introduction to High Performance Computing
Fortran 77:

      program add_arrays
      parameter (n=1000)
      real x(n), y(n), z(n)
      read(10) x, y, z
      do i = 1, n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

Fortran 77 + OpenMP:

      program add_arrays
      parameter (n=1000)
      real x(n), y(n), z(n)
      read(10) x, y, z
!$OMP PARALLEL DO
      do i = 1, n
         x(i) = y(i) + z(i)
      enddo
      ...
      end

The !$OMP PARALLEL DO directive specifies that the loop is executed in parallel. Each processor executes a subset of the loop iterations.
OpenMP Example
Introduction to High Performance Computing
MPI - Message Passing Standard
• MPI has emerged as the standard for message passing in both C and Fortran programs. No longer need to know MPL, PVM, TCGMSG, etc.
• MPI is both large and small:
– MPI is large, since it contains 125 functions which give the programmer fine control over communications
– MPI is small, since message passing programs can be written using a core set of just six functions (MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV)
Introduction to High Performance Computing
PE 0 calls MPI_SEND to pass the real variable x to PE 1. PE 1 calls MPI_RECV to receive it into the real variable y.

      if (myid .eq. 0) then
         call MPI_SEND(x, 1, MPI_REAL, 1, 100, MPI_COMM_WORLD, ierr)
      endif

      if (myid .eq. 1) then
         call MPI_RECV(y, 1, MPI_REAL, 0, 100, MPI_COMM_WORLD,
     &                 status, ierr)
      endif
MPI Examples - Send and Receive
MPI messages are two-way: they require a send and a matching receive:
Introduction to High Performance Computing
MPI Example - Global Operations
MPI also has global operations to broadcast and reduce (collect) information.

PE 5 broadcasts the single (1) integer value n to all other processors:

      call MPI_BCAST(n, 1, MPI_INTEGER, 5, MPI_COMM_WORLD, ierr)

PE 6 collects the single (1) integer value n from all other processors and puts the sum (MPI_SUM) into allsum:

      call MPI_REDUCE(n, allsum, 1, MPI_INTEGER, MPI_SUM, 6,
     &                MPI_COMM_WORLD, ierr)
Introduction to High Performance Computing
MPI Implementations
• MPI is typically implemented on top of the highest performance native message passing library for every distributed memory machine.
• MPI is a natural model for distributed memory machines (MPPs, clusters)
• MPI offers higher performance on DSMs beyond the size of an individual SMP
• MPI is useful between SMPs that are clustered
• MPI can be implemented on shared memory machines
Introduction to High Performance Computing
Extensions to MPI: MPI-2
• A standard for MPI-2 has been developed which extends the functionality of MPI. New features include:
– One-sided communications - eliminates the need to post matching sends and receives. Similar in functionality to the SHMEM PUT and GET on the CRAY T3E (most systems have an analogous library)
– Support for parallel I/O
– Extended collective operations
– No full implementation yet - it is difficult for vendors
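A rough sketch of what one-sided communication looks like is given below, assuming an MPI-2 implementation is available; the window size, variable names, and target rank are hypothetical and are not from the original slides. PE 0 puts a value directly into PE 1's exposed memory window with no matching receive posted on PE 1.

      program onesided
      include 'mpif.h'
      integer ierr, myid, nprocs, win
      integer (kind=MPI_ADDRESS_KIND) winsize, disp
      real buf, x
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
c     Every PE exposes one real (4 bytes) as a window for remote access
      winsize = 4
      call MPI_WIN_CREATE(buf, winsize, 4, MPI_INFO_NULL,
     &                    MPI_COMM_WORLD, win, ierr)
      call MPI_WIN_FENCE(0, win, ierr)
      if (myid .eq. 0 .and. nprocs .gt. 1) then
c        PE 0 writes x directly into PE 1's window: no receive needed
         x = 1.0
         disp = 0
         call MPI_PUT(x, 1, MPI_REAL, 1, disp, 1, MPI_REAL, win, ierr)
      endif
      call MPI_WIN_FENCE(0, win, ierr)
      call MPI_WIN_FREE(win, ierr)
      call MPI_FINALIZE(ierr)
      end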
Introduction to High Performance Computing
MPI vs. OpenMP
• There is no single best approach to writing a parallel code. Each has pros and cons:– MPI - powerful, general, and universally available
message passing library which provides very fine control over communications, but forces the programmer to operate at a relatively low level of abstraction.
– OpenMP - conceptually simple approach for creating parallel codes on a shared memory machines, but not applicable to distributed memory platforms.
Introduction to High Performance Computing
MPI vs. OpenMP
• MPI is the most general (problem types) and the most portable (platforms), although it is not as efficient for SMPs
• The architecture and the problem type often make the decision for you.
Introduction to High Performance Computing
Parallel Libraries
• Finally, there are parallel mathematics libraries that enable users to write (serial) codes, then call parallel solver routines:
– ScaLAPACK is for solving dense linear systems of equations, eigenvalue problems, and least squares problems. Also see PLAPACK.
– PETSc is for solving linear and non-linear partial differential equations (includes various iterative solvers for sparse matrices).
– Many others: check NETLIB for a complete survey: http://www.netlib.org
Introduction to High Performance Computing
Hurdles in Parallel Computing
There are some hurdles in parallel computing:
– Scalar performance: Fast parallel codes require efficient use of the underlying scalar hardware
– Parallel algorithms: Not all scalar algorithms parallelize well; may need to rethink the problem
• Communications: Need to minimize the time spent doing communications
• Load balancing: All processors should do roughly the same amount of work
– Amdahl’s Law: Fundamental limit on parallel computing
Introduction to High Performance Computing
Scalar Performance
• Underlying every good parallel code is a good scalar code.
• If a code scales to 256 processors but only gets 1% of peak performance, it is still a bad parallel code.
– Good news: Everything that you know about serial computing will be useful in parallel computing!
– Bad news: It is difficult to get good performance out of the processors and memory used in parallel machines. Need to use cache effectively.
Introduction to High Performance Computing
[Chart: log-log plot of run time vs. number of processors (1 to 100) for a serial code and a parallel code]
In this case, the parallel code achieves perfect scaling, but does not match the performance of the serial code until 32 processors are used
Serial Performance
Introduction to High Performance Computing
[Diagram: a simplified memory hierarchy - the CPU and its cache are small and fast, main memory is big and slow]
The data cache was designed with two key concepts in mind:
Spatial locality - the cache is loaded one entire line (4-32 words) at a time, to take advantage of the fact that if a location in memory is required, nearby locations will probably also be required.
Temporal locality - once a word is loaded into cache it remains there until the cache line is needed to hold another word of data.
Use Cache Effectively
Introduction to High Performance Computing
Non-Cache Issues
• There are other issues to consider to achieve good serial performance:
– Force reductions, e.g., replacement of divisions with multiplications-by-inverse
– Evaluate and replace common sub-expressions
– Push loops inside subroutines to minimize subroutine call overhead
– Force function inlining (compiler option)
– Perform interprocedural analysis to eliminate redundant operations (compiler option)
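As a small illustration of the first item (the arrays and the constant dz below are invented for this example), a divide inside a loop can be replaced by a multiply with a precomputed reciprocal, which most processors execute far faster:

      program reduce_div
      integer n, i
      parameter (n = 1000)
      real x(n), y(n), dz, rdz
      dz = 3.0
      do i = 1, n
         y(i) = real(i)
      enddo
c     Original form: one divide per iteration
      do i = 1, n
         x(i) = y(i) / dz
      enddo
c     Strength-reduced form: one divide total, one multiply per iteration
      rdz = 1.0 / dz
      do i = 1, n
         x(i) = y(i) * rdz
      enddo
      end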
Introduction to High Performance Computing
Parallel Algorithms
• The algorithm must be naturally parallel!
– Certain serial algorithms do not parallelize well. Developing a new parallel algorithm to replace a serial algorithm can be one of the most difficult tasks in parallel computing.
– Keep in mind that your parallel algorithm may involve additional work or a higher floating point operation count.
Introduction to High Performance Computing
Parallel Algorithms
– Keep in mind that the algorithm should
• need the minimum amount of communication (Monte Carlo algorithms are excellent examples)
• balance the load among the processors equally
– Fortunately, a lot of research has been done in parallel algorithms, particularly in the area of linear algebra. Don’t reinvent the wheel; take full advantage of the work done by others:
• use parallel libraries supplied by the vendor whenever possible!
• use ScaLAPACK, PETSc, etc. when applicable
Introduction to High Performance Computing
[Diagram: busy/idle timelines for PE 0 and PE 1, with synchronization points marked]

The figures show the timeline for a parallel code run on two processors. In both cases, the total amount of work done is the same, but in the second case the work is distributed more evenly between the two processors, resulting in a shorter time to solution.
Load Balancing
Introduction to High Performance Computing
Communications
• Two key parameters of the communications network are
– Latency: the time required to initiate a message. This is the critical parameter for fine-grained codes, which require frequent interprocessor communication. Can be thought of as the time required to send a message of zero length.
– Bandwidth: the steady-state rate at which data can be sent over the network. This is the critical parameter for coarse-grained codes, which require infrequent communication of large amounts of data.
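A simple way to combine the two parameters (an approximate model, not stated on the original slide but consistent with it) is to estimate the time to send a message of size S as

t(message) ≈ latency + S / bandwidth

so latency dominates when many small messages are sent, while bandwidth dominates when a few large messages are sent.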
Introduction to High Performance Computing
Latency and Bandwidth Example
• Bucket brigade: the old style of fighting fires in which the townspeople formed a line from the well to the fire and passed buckets of water down the line
– latency - the delay until the first bucket arrives at the fire
– bandwidth - the rate at which buckets arrive at the fire
Introduction to High Performance Computing
Sequential: t = t(comp) + t(comm)
Overlapped: t = t(comp) + t(comm) - t(comp ∩ comm), i.e. the overlapped portion of the communication is hidden behind computation
More on Communications
• Time spent performing communications is considered overhead. Try to minimize the impact of communications:
– minimize the effect of latency by combining large numbers of small messages into small numbers of large messages
– communications and computation do not have to be done sequentially; we can often overlap communication and computation
Introduction to High Performance Computing
• Many short calls: dial, “Hi mom”, hang up; dial, “How are things?”, hang up; dial, “in the U.S.?”, hang up; dial... At this point many mothers would not pick up the next call.
• One long call: dial, “Hi mom. How are things in the U.S.? Yak, yak...”, hang up.
By transmitting a single large message, I only have to pay the price for the dialing latency once. I transmit more information in less time.
The following examples of “phoning home” illustrate the value of combining many small messages into a single larger one.
Combining Small Messages into Larger Ones
Introduction to High Performance Computing
In the following example, a stencil operation is performed on a 10 x 10 array that has been distributed over two processors (PE 0 and PE 1). Assume periodic boundary conditions.

Stencil operation: y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)

Boundary elements require data from the neighboring processor; interior elements do not.

• Initiate communications
• Perform computations on interior elements
• Wait until communications are finished
• Perform computations on boundary elements
Overlapping Communications and Computations
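A hedged sketch of this recipe using non-blocking MPI calls is shown below. The buffer names, message sizes, neighbor choice, and tags are placeholders, and the interior and boundary computations are only indicated by comments.

      program overlap
      include 'mpif.h'
      integer ierr, myid, nprocs, neighbor, i
      integer reqs(2), stats(MPI_STATUS_SIZE, 2)
      integer n
      parameter (n = 10)
      real sendbuf(n), recvbuf(n)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      neighbor = mod(myid + 1, nprocs)
      do i = 1, n
         sendbuf(i) = real(myid)
      enddo
c     1. Initiate (non-blocking) communications for the boundary data
      call MPI_IRECV(recvbuf, n, MPI_REAL, MPI_ANY_SOURCE, 10,
     &               MPI_COMM_WORLD, reqs(1), ierr)
      call MPI_ISEND(sendbuf, n, MPI_REAL, neighbor, 10,
     &               MPI_COMM_WORLD, reqs(2), ierr)
c     2. Perform computations on interior elements here
c        (no data from the neighbor is needed yet)
c     3. Wait until communications are finished
      call MPI_WAITALL(2, reqs, stats, ierr)
c     4. Perform computations on boundary elements, using recvbuf
      call MPI_FINALIZE(ierr)
      end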
Introduction to High Performance Computing
Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

tN = (fp/N + fs) t1      Effect of multiple processors on run time
S = 1/(fs + fp/N)        Effect of multiple processors on speedup

where:
fs = serial fraction of code
fp = parallel fraction of code = 1 - fs
N = number of processors
Amdahl’s Law
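A worked example (not on the original slide) makes the limit concrete: for a code that is 99% parallel (fs = 0.01, fp = 0.99) run on N = 100 processors,

S = 1 / (0.01 + 0.99/100) = 1 / 0.0199 ≈ 50

so a 1% serial fraction already cuts the ideal 100x speedup roughly in half.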
Introduction to High Performance Computing
[Chart: speedup vs. number of processors (up to 250) for fp = 1.000, 0.999, 0.990, and 0.900]
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors
Illustration of Amdahl’s Law
Introduction to High Performance Computing
Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.
[Chart: speedup vs. number of processors for fp = 0.99, comparing the Amdahl's Law prediction with reality (lower, because of communications)]
Amdahl’s Law Vs. Reality
Introduction to High Performance Computing
More on Amdahl’s Law
• Amdahl’s Law can be generalized to any two processes with different speeds
• Ex.: Apply it to f(processor) and f(memory):
– The growing processor-memory performance gap will undermine our efforts at achieving maximum possible speedup!
Introduction to High Performance Computing
Generalized Amdahl’s Law
• Amdahl’s Law can be further generalized to handle an arbitrary number of processes of various speeds. (The fractions representing the individual processes must still sum to 1.)
• This is a weighted Harmonic mean. Application performance is limited by performance of the slowest component as much as it is determined by the fastest.
Ravg = 1 / [ Σ (i = 1 to N) fi / Ri ]
Introduction to High Performance Computing
Gustafson’s Law
• Thus, Amdahl’s Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.
• There is a way around this: increase the problem size
– bigger problems mean bigger grids or more particles: bigger arrays
– the number of serial operations generally remains constant, while the number of parallel operations increases: the parallel fraction increases
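The formula is not stated on the original slide, but Gustafson's Law is commonly written as a scaled speedup, with fs and fp measured on the parallel run of the enlarged problem:

S(N) = fs + N*fp = N - fs*(N - 1)

so if the serial work stays roughly constant while the parallel work grows with the problem size, the achievable speedup grows nearly linearly with N.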
Introduction to High Performance Computing
The 1st Question to Ask Yourself Before You Parallelize Your Code
• Is it worth my time?
– Do the CPU requirements justify parallelization?
– Do I need a parallel machine in order to get enough aggregate memory?
– Will the code be used just once, or will it be a major production code?
• Your time is valuable, and it can be very time consuming to write, debug, and test a parallel code. The more time you spend writing a parallel code, the less time you have to spend doing your research.
Introduction to High Performance Computing
The 2nd Question to Ask Yourself Before You Parallelize Your Code
• How should I decompose my problem?
– Do the computations consist of a large number of small, independent problems - trajectories, parameter space studies, etc.? You may want to consider a scheme in which each processor runs the calculation for a different set of data.
– Does each computation have large memory or CPU requirements? You will probably have to break up a single problem across multiple processors.
Introduction to High Performance Computing
Distributing the Data
• The decision on how to distribute the data should consider these issues:
– Load balancing: often implies an equal distribution of data, but more generally means an equal distribution of work
– Communications: want to minimize the impact of communications, taking into account both the size and the number of messages
– Physics: the choice of distribution will depend on the processes that are being modeled in each direction
Introduction to High Performance Computing
[Diagrams: two possible ways of distributing a 2-d grid over processors]
– First distribution: a good distribution if the physics of the problem is the same in both directions. Minimizes the amount of data that must be communicated between processors.
– Second distribution: if expensive global operations need to be carried out in the x-direction (ex. FFTs), this is probably a better choice.
A Data Distribution Example
Introduction to High Performance Computing
Imagine that we are doing a simulation in which more work is required for the grid points covering the shaded object.

Neither data distribution from the previous example will result in good load balancing.

May need to consider an irregular grid or a different data structure.
A More Difficult Example
Introduction to High Performance Computing
Choosing a Resource
• The following factors should be taken into account when choosing a resource:
– What is the granularity of my code?
– Are there any special hardware features that I need or can take advantage of?
– How many processors will the code be run on?
– What are my memory requirements?
• By carefully considering these points, you can make the right choice of computational platform.
Introduction to High Performance Computing
Granularity is a measure of the amount of work done by each processor between synchronization events.
[Diagram: busy/idle timelines for PE 0 and PE 1 in a low-granularity application (frequent synchronization) and a high-granularity application (infrequent synchronization)]
Generally, latency is the critical parameter for low-granularity codes, while processor performance is the key factor for high-granularity applications.
Choosing a Resource: Granularity
Introduction to High Performance Computing
Choosing a Resource: Special Hardware Features
• Various HPC platforms have different hardware features that your code may be able to take advantage of. Examples include:
– Hardware support for divide and square root operations (IBM SP)
– Parallel I/O file system (IBM SP)
– Data streams (CRAY T3E)
– Control over cache alignment (CRAY T3E)
– E-registers for bypassing the cache hierarchy (CRAY T3E)
Introduction to High Performance Computing
Importance of Parallel Computing
• High performance computing has become almost synonymous with parallel computing.
• Parallel computing is necessary to solve big problems (high resolution, lots of timesteps, etc.) in science and engineering.
• Developing and maintaining efficient, scalable parallel applications is difficult. However, the payoff can be tremendous.
Introduction to High Performance Computing
Importance of Parallel Computing
• Before jumping in, think about
– whether or not your code truly needs to be parallelized
– how to decompose your problem
• Then choose a programming model based on your problem and your available architecture.
• Take advantage of the resources that are available - compilers, libraries, debuggers, performance analyzers, etc. - to help you write efficient parallel code.
Introduction to High Performance Computing
Useful References
• Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach.
• Patterson, D.A. and Hennessy, J.L., Computer Organization and Design: The Hardware/Software Interface.
• K. Dowd, High Performance Computing.
• D. Kuck, High Performance Computing. Oxford U. Press (New York) 1996.
• D. Culler and J. P. Singh, Parallel Computer Architecture.
Introduction to High Performance Computing
Outline
• Preface
• What is High Performance Computing?
• Parallel Computing
• Distributed Computing, Grid Computing, and More
• Future Trends in HPC
Introduction to High Performance Computing
Distributed Computing
• The concept has been used for two decades
• Basic idea: run a scheduler across systems to run processes on the least-used systems first
– Maximize utilization
– Minimize turnaround time
• Have to load executables and input files onto the selected resource
– Shared file system
– File transfers upon resource selection
Introduction to High Performance Computing
Examples of Distributed Computing
• Workstation farms, Condor flocks, etc.
– Generally share a file system
• SETI@home, Entropia, etc.
– Only one source code; a central server copies the correct binary code and input data to each system
• Napster, Gnutella: file/data sharing
• NetSolve
– Runs numerical kernels on any of multiple independent systems, much like a Grid solution
Introduction to High Performance Computing
SETI@home: Global Distributed Computing
• Running on 500,000 PCs, ~1000 CPU Years per Day
– 485,821 CPU Years so far
• Sophisticated Data & Signal Processing Analysis
• Distributes Datasets from Arecibo Radio Telescope
Introduction to High Performance Computing
Distributed vs. Parallel Computing
• Different
– Distributed computing executes independent (but possibly related) applications on different systems; jobs do not communicate with each other
– Parallel computing executes a single application across processors, distributing the work and/or data but allowing communication between processes
• Non-exclusive: can distribute parallel applications to parallel computing systems
Introduction to High Performance Computing
Grid Computing
• Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, and existing trust relationships.
• Resources (HPC systems, visualization systems & displays, storage systems, sensors, instruments, people) are integrated via ‘middleware’ to facilitate use of all resources.
Introduction to High Performance Computing
Why Grids?
• Resources have different functions, but multiple classes of resources are necessary for most interesting problems.
• Power of any single resource is small compared to aggregations of resources
• Network connectivity is increasing rapidly in bandwidth and availability
• Large problems require teamwork and computation
Introduction to High Performance Computing
Network Bandwidth Growth
• Network vs. computer performance
– Computer speed doubles every 18 months
– Network speed doubles every 9 months
– Difference = an order of magnitude per 5 years
• 1986 to 2000
– Computers: x 500
– Networks: x 340,000
• 2001 to 2010
– Computers: x 60
– Networks: x 4000
[Chart: Moore’s Law vs. storage improvements vs. optical networking improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner Perkins Caufield & Byers.]
Introduction to High Performance Computing
Grid Possibilities
• A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
• 1,000 physicists worldwide pool resources for petaflop analyses of petabytes of data
• Civil engineers collaborate to design, execute, & analyze shake table experiments
• Climate scientists visualize, annotate, & analyze terabyte simulation datasets
• An emergency response team couples real time data, weather model, population data
Introduction to High Performance Computing
Some Grid Usage Models
• Distributed computing: job scheduling on Grid resources with secure, automated data transfer
• Workflow: synchronized scheduling and automated data transfer from one system to next in pipeline (e.g. HPC system to visualization lab to storage system)
• Coupled codes, with pieces running on different systems simultaneously
• Meta-applications: parallel apps spanning multiple systems
Introduction to High Performance Computing
Grid Usage Models
• Some models are similar to models already being used, but are much simpler due to:
– single sign-on
– automatic process scheduling
– automated data transfers
• But Grids can encompass new resources like sensors and instruments, so new usage models will arise
Introduction to High Performance Computing
Selected Major Grid Projects
Name URL & Sponsors FocusAccess Grid www.mcs.anl.gov/FL/
accessgrid; DOE, NSFCreate & deploy group collaboration systems using commodity technologies
BlueGrid IBM Grid testbed linking IBM laboratories
DISCOM www.cs.sandia.gov/discomDOE Defense Programs
Create operational Grid providing access to resources at three U.S. DOE weapons laboratories
DOE Science Grid
sciencegrid.org
DOE Office of Science
Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities
Earth System Grid (ESG)
earthsystemgrid.orgDOE Office of Science
Delivery and analysis of large climate model datasets for the climate research community
European Union (EU) DataGrid
eu-datagrid.org
European Union
Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics
g
g
g
g
g
g
Introduction to High Performance Computing
Selected Major Grid Projects
Name (URL; sponsor): Focus
• EuroGrid, Grid Interoperability (GRIP) (eurogrid.org; European Union): Create technologies for remote access to supercomputer resources & simulation codes; in GRIP, integrate with Globus
• Fusion Collaboratory (fusiongrid.org; DOE Office of Science): Create a national computational collaboratory for fusion research
• Globus Project (globus.org; DARPA, DOE, NSF, NASA, Microsoft): Research on Grid technologies; development and support of Globus Toolkit; application and deployment
• GridLab (gridlab.org; European Union): Grid technologies and applications
• GridPP (gridpp.ac.uk; U.K. eScience): Create & apply an operational grid within the U.K. for particle physics research
• Grid Research Integration Dev. & Support Center (grids-center.org; NSF): Integration, deployment, support of the NSF Middleware Infrastructure for research & education
Introduction to High Performance Computing
Selected Major Grid Projects
Name (URL; sponsor): Focus
• Grid Application Dev. Software (hipersoft.rice.edu/grads; NSF): Research into program development technologies for Grid applications
• Grid Physics Network (griphyn.org; NSF): Technology R&D for data analysis in physics experiments: ATLAS, CMS, LIGO, SDSS
• Information Power Grid (ipg.nasa.gov; NASA): Create and apply a production Grid for aerosciences and other NASA missions
• International Virtual Data Grid Laboratory (ivdgl.org; NSF): Create international Data Grid to enable large-scale experimentation on Grid technologies & applications
• Network for Earthquake Eng. Simulation Grid (neesgrid.org; NSF): Create and apply a production Grid for earthquake engineering
• Particle Physics Data Grid (ppdg.net; DOE Science): Create and apply production Grids for data analysis in high energy and nuclear physics experiments
Introduction to High Performance Computing
Selected Major Grid Projects
Name (URL; sponsor): Focus
• TeraGrid (teragrid.org; NSF): U.S. science infrastructure linking four major resource sites at 40 Gb/s
• UK Grid Support Center (grid-support.ac.uk; U.K. eScience): Support center for Grid projects within the U.K.
• Unicore (BMBFT): Technologies for remote access to supercomputers
There are also many technology R&D projects: e.g., Globus, Condor, NetSolve, Ninf, NWS, etc.
Introduction to High Performance Computing
Example Application Projects
• Earth Systems Grid: environment (US DOE)
• EU DataGrid: physics, environment, etc. (EU)
• EuroGrid: various (EU)
• Fusion Collaboratory (US DOE)
• GridLab: astrophysics, etc. (EU)
• Grid Physics Network (US NSF)
• MetaNEOS: numerical optimization (US NSF)
• NEESgrid: civil engineering (US NSF)
• Particle Physics Data Grid (US DOE)
Introduction to High Performance Computing
Some Grid Requirements – Systems/Deployment Perspective
• Identity & authentication
• Authorization & policy
• Resource discovery
• Resource characterization
• Resource allocation
• (Co-)reservation, workflow
• Distributed algorithms
• Remote data access
• High-speed data transfer
• Performance guarantees
• Monitoring
• Adaptation
• Intrusion detection
• Resource management
• Accounting & payment
• Fault management
• System evolution
• Etc.
Introduction to High Performance Computing
Some Grid Requirements –User Perspective
• Single allocation (or none needed)
• Single sign-on: authentication to any Grid resource authenticates for all others
• Single compute space: one scheduler for all Grid resources
• Single data space: can address files and data from any Grid resource
• Single development environment: Grid tools and libraries that work on all grid resources
Introduction to High Performance Computing
The Systems Challenges: Resource Sharing Mechanisms That…
• Address security and policy concerns of resource owners and users
• Are flexible enough to deal with many resource types and sharing modalities
• Scale to large number of resources, many participants, many program components
• Operate efficiently when dealing with large amounts of data & computation
Introduction to High Performance Computing
The Security Problem
• The resources being used may be extremely valuable & the problems being solved extremely sensitive
• Resources are often located in distinct administrative domains
– Each resource may have its own policies & procedures
• The set of resources used by a single computation may be large, dynamic, and/or unpredictable
– Not just client/server
• It must be broadly available & applicable
– Standard, well-tested, well-understood protocols
– Integration with a wide variety of tools
Introduction to High Performance Computing
The Resource Management Problem
• Enabling secure, controlled remote access to computational resources and management of remote computation
– Authentication and authorization
– Resource discovery & characterization
– Reservation and allocation
– Computation monitoring and control
Introduction to High Performance Computing
Grid Systems Technologies
• Systems and security problems are addressed by new protocols & services. E.g., Globus:
– Grid Security Infrastructure (GSI) for security
– Globus Metadata Directory Service (MDS) for discovery
– Globus Resource Allocation Manager (GRAM) protocol as a basic building block
• Resource brokering & co-allocation services
– GridFTP for data movement
Introduction to High Performance Computing
The Programming Problem
• How does a user develop robust, secure, long-lived applications for dynamic, heterogeneous Grids?
• Presumably we need:
– Abstractions and models to add to the speed/robustness/etc. of development
– Tools to ease application development and diagnose common problems
– Code/tool sharing to allow reuse of code components developed by others
Introduction to High Performance Computing
Grid Programming Technologies
• “Grid applications” are incredibly diverse (data, collaboration, computing, sensors, …)
– Seems unlikely there is one solution
• Most applications have been written “from scratch,” with or without Grid services
• Application-specific libraries have been shown to provide significant benefits
• No new language, programming model, etc. has yet emerged that transforms things
– But certainly still quite possible
Introduction to High Performance Computing
Examples of GridProgramming Technologies
• MPICH-G2: Grid-enabled message passing
• CoG Kits, GridPort: Portal construction, based on N-tier architectures
• GDMP, Data Grid Tools, SRB: replica management, collection management
• Condor-G: simple workflow management
• Legion: object models for Grid computing
• Cactus: Grid-aware numerical solver framework– Note tremendous variety, application focus
Introduction to High Performance Computing
MPICH-G2: A Grid-Enabled MPI
• A complete implementation of the Message Passing Interface (MPI) for heterogeneous, wide area environments– Based on the Argonne MPICH implementation of MPI
(Gropp and Lusk)
• Globus services for authentication, resource allocation, executable staging, output, etc.
• Programs run in wide area without change!
• See also: MetaMPI, PACX, STAMPI, MAGPIE
www.globus.org/mpi
Introduction to High Performance Computing
Grid Events
• Global Grid Forum: working meeting
– Meets 3 times/year, alternates U.S.-Europe, with the July meeting as the major event
• HPDC: major academic conference
– HPDC-11 in Scotland with GGF-8, July 2002
• Other meetings include
– IPDPS, CCGrid, EuroGlobus, Globus Retreats
www.gridforum.org, www.hpdc.org
Introduction to High Performance Computing
Useful References
• Book (Morgan Kaufmann)
– www.mkp.com/grids
• Perspective on Grids
– “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, IJSA, 2001
– www.globus.org/research/papers/anatomy.pdf
• All URLs in this section of the presentation, especially:
– www.gridforum.org, www.grids-center.org, www.globus.org
Introduction to High Performance Computing
Outline
• Preface
• What is High Performance Computing?
• Parallel Computing
• Distributed Computing, Grid Computing, and More
• Future Trends in HPC
Introduction to High Performance Computing
Value of Understanding Future Trends
• Monitoring and understanding future trends in HPC is important for:
– users: applications should be written to be efficient on current and future architectures
– developers: tools should be written to be efficient on current and future architectures
– computing centers: system purchases are expensive and should have upgrade paths
Introduction to High Performance Computing
The Next Decade
• 1980s and 1990s:
– academic and government requirements strongly influenced parallel computing architectures
– academic influence was greatest in developing parallel computing software (for science & eng.)
– commercial influence grew steadily in the late 1990s
• In the next decade:
– commercialization will become dominant in determining the architecture of systems
– academic/research innovations will continue to drive the development of HPC software
Introduction to High Performance Computing
Commercialization
• Computing technologies (including HPC) are now propelled by profits, not sustained by subsidies
– Web servers, databases, transaction processing, and especially multimedia applications drive the need for computational performance.
– Most HPC systems are ‘scaled up’ commercial systems, with little additional hardware and software beyond the commercial versions.
– It’s not engineering, it’s economics.
Introduction to High Performance Computing
Processors and Nodes
• Easy predictions:
– microprocessor performance continues to increase at ~60% per year (Moore’s Law) for 5+ years
– total migration to 64-bit microprocessors
– use of even more cache, deeper memory hierarchy
– increased emphasis on SMPs
• Tougher predictions:
– resurgence of vectors in microprocessors? Maybe
– dawn of multithreading in microprocessors? Yes
Introduction to High Performance Computing
Building Fat Nodes: SMPs
• More processors are faster, of course
– SMPs are simplest form of parallel systems
– efficient if not limited by memory bus contention: small numbers of processors
• Commercial market for high performance servers at low cost drives need for SMPs
• HPC market for highest performance, ease of programming drives development of SMPs
Introduction to High Performance Computing
Building Fat Nodes: SMPs
• Trends are to:
– build bigger SMPs
– attempt to share memory across SMPs (cc-NUMA); see the OpenMP sketch below
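
To illustrate why shared-memory (SMP and cc-NUMA) nodes are attractive to program, here is a minimal OpenMP sketch in C, assuming an OpenMP-capable compiler. The threads all see the same arrays, so the loop is parallelized by a single directive and no data movement between processors has to be written by hand.

/* Sketch: shared-memory parallelism on a fat SMP node using OpenMP.
   All threads share one address space; there is no message passing. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    #pragma omp parallel for          /* iterations divided among the node's processors */
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f using up to %d threads\n", a[N-1], omp_get_max_threads());
    return 0;
}

The same computation on a distributed-memory cluster would require explicit message passing; on an SMP (or a cc-NUMA machine presenting a single address space) the directive alone is enough.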
Introduction to High Performance Computing
Resurgence of Vectors
• Vectors keep functional units busy
– vector registers are very fast
– vectors are more efficient for loops of any stride (see the loop sketch below)
– vectors are great for many science & eng. apps
• Possible resurgence of vectors
– SGI/Cray has built the SV1ex and is building the SV2
– NEC continues building (CMOS) parallel-vector, Cray-like systems
– Microprocessors (Pentium 4, G4) have added vector-like functionality for multimedia purposes
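
For concreteness, the sketch below shows the kind of unit-stride loop that vector hardware, and the multimedia/SIMD extensions in commodity processors, executes efficiently. It is a generic SAXPY-style example, not code from any particular system.

/* Sketch: a vectorizable loop. Each iteration is independent (no
   loop-carried dependence), so vector registers and pipelined
   functional units can be kept busy. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}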
Introduction to High Performance Computing
Dawn of Multithreading?
• Memory speed will always be a bottleneck
• Must overlap computation with memory accesses: tolerate latency
– requires an immense amount of parallelism
– requires processors with multiple streams and compilers that can define multiple threads (an illustrative threaded sketch follows below)
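
The sketch below illustrates the idea in software terms only: the computation is expressed as many independent threads, so that while one thread waits on memory another can make progress. Hardware multithreading, as in the Tera MTA, performs this switching inside the processor on a per-cycle basis rather than through the operating system, so treat this as an analogy rather than MTA code.

/* Illustrative sketch: a memory-bound reduction split across many
   POSIX threads. With enough independent streams of work, stalls on
   memory in one thread can be overlapped with progress in others. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define N (1 << 22)

static double data[N];
static double partial[NTHREADS];

static void *sum_chunk(void *arg)
{
    long t = (long)arg;
    long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];                 /* memory-bound inner loop */
    partial[t] = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    double total = 0.0;

    for (long i = 0; i < N; i++) data[i] = 1.0;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_chunk, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }

    printf("total = %f\n", total);
    return 0;
}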
Introduction to High Performance Computing
Multithreading Diagram
Introduction to High Performance Computing
Multithreading
• Tera MTA was the first multithreaded HPC system
– scientific success, production failure
– MTA-2 will be delivered in a few months
• Multithreading will be implemented (in a more limited fashion) in commercial processors.
Introduction to High Performance Computing
Networks
• Commercial network bandwidth and latency are approaching the performance of custom interconnects.
• Dramatic performance increases likely
– “the network is the computer” (Sun slogan)
– more companies, more competition
– no severe physical, economic limits
• Implications of faster networks
– more clusters
– collaborative, visual supercomputing
– Grid computing
Introduction to High Performance Computing
Commodity Clusters
• Clusters provide some real advantages:
– computing power: leverage workstations and PCs
– high availability: replace one at a time
– inexpensive: leverage existing competitive market
– simple path to installing a parallel computing system
• Major disadvantages were robustness of hardware and software, but both have improved
• NCSA has huge clusters in production based on Pentium III and Itanium.
Introduction to High Performance Computing
Clustering SMPs
• Inevitable (already here!):
– leverages SMP nodes effectively for the same reasons clusters leverage individual processors
– Commercial markets drive need for SMPs
• Combine advantages of SMPs, clusters
– more powerful nodes through multiprocessing
– more powerful nodes -> more powerful cluster
– Interconnect scalability requirements reduced for same number of processors
Introduction to High Performance Computing
Continued Linux Growth in HPC
• Linux popularity growing due to price and availability of source code
• Major players now supporting Linux, esp. IBM
• Head start on Intel Itanium
Introduction to High Performance Computing
Programming Tools
• However, programming tools will continue to lag behind hardware and OS capabilities:
– Researchers will continue to drive the need for the most powerful tools to create the most efficient applications on the largest systems
– Such technologies will look more like MPI than the Web… maybe worse due to multi-tiered clusters of SMPs (MPI + OpenMP; Active Messages + threads?); a hybrid MPI + OpenMP sketch follows below
– Academia will continue to play a large role in HPC software development.
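
A minimal sketch of the hybrid MPI + OpenMP style mentioned above, assuming one MPI process per SMP node with OpenMP threads inside each node; the numerical kernel is an arbitrary placeholder. The two levels map onto the two levels of the machine: MPI moves data between nodes over the interconnect, while OpenMP uses the shared memory within a node.

/* Sketch: hybrid MPI + OpenMP. Each MPI process (say one per SMP node)
   computes its share of a sum with all of its node's processors, then
   the per-node results are combined across the cluster. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, size;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP threads share the work assigned to this node. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < N; i += size)
        local += 1.0 / (1.0 + (double)i);

    /* MPI combines the per-node partial sums. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}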
Introduction to High Performance Computing
Grid Computing
• Parallelism will continue to grow in the form of
– SMPs
– clusters
– clusters of SMPs (and maybe DSMs)
• Grids provide the next level
– connects multiple computers into virtual systems
– Already here:
• IBM, other vendors supporting Globus
• SC2001 dominated by Grid technologies
• Many major government awards (>$100M in past year)
Introduction to High Performance Computing
Emergence of Grids
• But Grids enable much more than apps running on multiple computers (which can be achieved with MPI alone)
– virtual operating system: provides global workspace/address space via a single login
– automatically manages files, data, accounts, and security issues
– connects other resources (archival data facilities, instruments, devices) and people (collaborative environments)
Introduction to High Performance Computing
Grids Are Inevitable
• Inevitable (at least in HPC):
– leverages computational power of all available systems
– manages resources as a single system: easier for users
– provides most flexible resource selection and management, load sharing
– researchers’ desire to solve bigger problems will always outpace performance increases of single systems; just as multiple processors are needed, ‘multiple multiprocessors’ will be deemed necessary
Introduction to High Performance Computing
Grid-Enabled Software
• Commercial applications on single parallel systems and Grids will require that:
– underlying architectures must be invisible: no parallel computing expertise required
– usage must be simple
– development must not be too difficult
• Developments in ease-of-use will benefit scientists as users (not as developers)
• Web-based interfaces: transparent supercomputing (MPIRE, Meta-MEME, etc.).
Introduction to High Performance Computing
Grid-Enabled Collaborative and Visual Supercomputing
• Commercial world demands:
– multimedia applications
– real-time data processing
– online transaction processing
– rapid prototyping and simulation in engineering, chemistry, and biology
– interactive, remote collaboration
– 3D graphics, animation, and virtual reality visualization
Introduction to High Performance Computing
Grid-enabled Collaborative, Visual Supercomputing
• Academic world will leverage the resulting Grids linking computing and visualization systems via high-speed networks:
– collaborative post-processing of data is already here
– simulations will be visualized in 3D, virtual worlds in real time
– such simulations can then be ‘steered’
– multiple scientists can participate in these visual simulations
– the ‘time to insight’ (SGI slogan) will be reduced
Introduction to High Performance Computing
Web-based Grid Computing
• Web currently used mostly for content delivery
• Web servers on HPC systems can execute applications
• Web servers on Grids can launch applications, move/store/retrieve data, display visualizations, etc.
• NPACI HotPage already enables single sign-on to NPACI Grid Resources
Introduction to High Performance Computing
Summary of Expectations
• HPC systems will grow in performance but probably change little in design (5-10 years):
– HPC systems will be larger versions of smaller commercial systems, mostly large SMPs and clusters of inexpensive nodes
– Some processors will exploit vectors, as well as more/larger caches.
– The best HPC systems will have been designed ‘top-down’ instead of ‘bottom-up’, but all will have been designed to make the ‘bottom’ profitable.
– Multithreading is the only likely near-term major architectural change.
Introduction to High Performance Computing
Summary of Expectations
• Using HPC systems will change much more:
– Grid computing will become widespread in HPC and in commercial computing
– Visual supercomputing and collaborative simulation will be commonplace.
– WWW interfaces to HPC resources will make transparent supercomputing commonplace.
• But programming the most powerful resources most effectively will remain difficult.
Introduction to High Performance Computing
Caution
• Change is difficult to predict (and I am an astrophysicist, not an astrologer):
– Accuracy of linear extrapolation predictions degrades over long times (like weather forecasts)
– Entirely new ideas can change everything:
• WWW is an excellent example; Grid computing is probably the next
• Eventually, something truly different will replace CMOS technology (nanotechnology? molecular computing? DNA computing?)
Introduction to High Performance Computing
Final Prediction
“The thing about change is that things will be different afterwards.”
Alan McMahon (Cornell University)