Transcript of Introduction to Supercomputers, Architectures and High Performance Computing (Amit Majumdar, SDSC)
Introduction to Supercomputers, Architectures and High Performance Computing
Amit Majumdar, Scientific Computing Applications Group, SDSC
Many others: Tim Kaiser, Dmitry Pekurovsky, Mahidhar Tatineni, Ross Walker
Topics
1. Intro to Parallel Computing
2. Parallel Machines
3. Programming Parallel Computers
4. Supercomputer Centers and Rankings
5. SDSC Parallel Machines
6. Allocations on NSF Supercomputers
7. One Application Example – Turbulence
First Topic – Intro to Parallel Computing
• What is parallel computing
• Why do parallel computing
• Real life scenario
• Types of parallelism
• Limits of parallel computing
• When do you do parallel computing
What is Parallel Computing?
• Consider your favorite computational application
  – One processor can give me results in N hours
  – Why not use N processors – and get the results in just one hour?
The concept is simple:
Parallelism = applying multiple processors to a single problem
Parallel computing is computing by committee
• Parallel computing: the use of multiple computers or processors working together on a common task.
  – Each processor works on its section of the problem
  – Processors are allowed to exchange information with other processors
[Figure: "Grid of Problem to be solved" – a 2-D (x, y) grid divided into four areas; CPU #1, #2, #3, and #4 each work on one area of the problem and exchange data across the shared boundaries.]
Why Do Parallel Computing?
• Limits of single CPU computing
  – Available memory
  – Performance/speed
• Parallel computing allows us to:
  – Solve problems that don't fit in a single CPU's memory space
  – Solve problems that can't be solved in a reasonable time
• We can run:
  – Larger problems
  – Faster
  – More cases
  – Simulations at finer resolution
  – More realistic models of physical phenomena
Parallel Computing – Real Life Scenario
• Stacking or reshelving of a set of library books
• Assume books are organized into shelves, and shelves are grouped into bays
• A single worker can only work at a certain rate
• We can speed the job up by employing multiple workers
• What is the best strategy?
  o The simple way is to divide the total books equally among workers. Each worker stacks the books one at a time. Workers must walk all over the library.
  o An alternate way is to assign a fixed, disjoint set of bays to each worker. Each worker is assigned an equal number of books arbitrarily. Workers stack books in their own bays, or pass each book to the worker responsible for the bay it belongs to.
Parallel Computing – Real Life Scenario
• Parallel processing allows us to accomplish a task faster by dividing the work into a set of subtasks assigned to multiple workers.
• Assigning a set of books to workers is task partitioning. Passing books to each other is an example of communication between subtasks.
• Some problems may be completely serial, e.g. digging a post hole; these are poorly suited to parallel processing.
• Not all problems are equally amenable to parallel processing.
Weather Forecasting
The atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile – about 500 x 10^6 cells. The calculations for each cell are repeated many times to model the passage of time.

About 200 floating point operations per cell per time step, i.e. about 10^11 floating point operations per time step.

A 10-day forecast with 10-minute resolution => ~1.5 x 10^14 flop.

On a machine with 100 Mflops (Mflop/sec) sustained performance this would take: 1.5 x 10^14 flop / 100 x 10^6 flops = ~17 days.

On a 1.7 Tflops sustained performance machine it would take: 1.5 x 10^14 flop / 1.7 x 10^12 flops = ~2 minutes.
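The arithmetic in this example can be reproduced with a few lines (the constants are the slide's own figures; the variable names are chosen here):

```python
# Back-of-the-envelope check of the weather-forecast estimate above.
cells = 500e6                    # ~500 x 10^6 cells of 1 mi x 1 mi x 1 mi
flops_per_cell = 200             # floating point operations per cell per step
flops_per_step = cells * flops_per_cell        # ~10^11 flop per time step

steps = 10 * 24 * 6              # 10-day forecast at 10-minute resolution
total_flop = flops_per_step * steps            # ~1.5 x 10^14 flop

days_at_100mflops = total_flop / 100e6 / 86400    # sustained 100 Mflops
minutes_at_1_7tflops = total_flop / 1.7e12 / 60   # sustained 1.7 Tflops

print(round(days_at_100mflops, 1))     # 16.7 (the slide rounds to ~17 days)
print(round(minutes_at_1_7tflops, 1))  # 1.4 (the slide rounds to ~2 minutes)
```

The three-orders-of-magnitude gap in machine speed turns an unusable 17-day run into an interactive one.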
Other Examples
• Vehicle design and dynamics
• Analysis of protein structures
• Human genome work
• Quantum chromodynamics
• Astrophysics
• Earthquake wave propagation
• Molecular dynamics
• Climate, ocean modeling
• CFD
• Imaging and rendering
• Petroleum exploration
• Nuclear reactor, weapon design
• Database query
• Ozone layer monitoring
• Natural language understanding
• Study of chemical phenomena
• And many other scientific and industrial simulations
Types of Parallelism: Two Extremes
• Data parallel
  – Each processor performs the same task on different data
  – Example: grid problems
• Task parallel
  – Each processor performs a different task
  – Example: signal processing
• Most applications fall somewhere on the continuum between these two extremes
Typical Data Parallel Program
• Example: integrate a 2-D propagation problem

Starting partial differential equation:

$$\frac{\partial f}{\partial t} = D\,\frac{\partial^2 f}{\partial x^2} + B\,\frac{\partial^2 f}{\partial y^2}$$

Finite difference approximation (superscript n is the time step; subscripts i, j index the x-y grid):

$$\frac{f_{i,j}^{n+1} - f_{i,j}^{n}}{\Delta t} = D\,\frac{f_{i+1,j}^{n} - 2 f_{i,j}^{n} + f_{i-1,j}^{n}}{\Delta x^{2}} + B\,\frac{f_{i,j+1}^{n} - 2 f_{i,j}^{n} + f_{i,j-1}^{n}}{\Delta y^{2}}$$

[Figure: the x-y grid is partitioned among eight processors, PE #0 through PE #7, arranged in two rows of four.]
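A minimal serial 1-D analogue of this update (the function `diffuse_step` and its parameters are illustrative, not from the slides) shows why domain decomposition works: each point needs only its immediate neighbors, so the grid can be split among processors that exchange one boundary value per side.

```python
# 1-D analogue of the finite-difference update above:
# f_i^{n+1} = f_i^n + dt * D * (f_{i+1}^n - 2 f_i^n + f_{i-1}^n) / dx^2.
# Because each interior point depends only on its two neighbors, each
# processor could own a slice and exchange just the edge values.
def diffuse_step(f, D=1.0, dt=0.25, dx=1.0):
    new = f[:]                    # endpoints stay fixed (Dirichlet boundaries)
    for i in range(1, len(f) - 1):
        new[i] = f[i] + dt * D * (f[i+1] - 2*f[i] + f[i-1]) / dx**2
    return new

f = [0.0, 0.0, 1.0, 0.0, 0.0]     # a single spike in the middle
print(diffuse_step(f))            # [0.0, 0.25, 0.5, 0.25, 0.0]
```

Note how the spike spreads to its neighbors in one step while the interior total is conserved.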
Basics of Data Parallel Programming
One code will run on 2 CPUs. The program has an array of data to be operated on by the 2 CPUs, so the array is split into two parts.

program:
  ...
  if CPU=a then
    low_limit=1
    upper_limit=50
  elseif CPU=b then
    low_limit=51
    upper_limit=100
  end if
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program

What CPU A executes:

program:
  ...
  low_limit=1
  upper_limit=50
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program

What CPU B executes:

program:
  ...
  low_limit=51
  upper_limit=100
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program
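The rank-dependent limits above generalize to any number of CPUs; here is a small Python sketch (the function `limits` and its arguments are illustrative names chosen here):

```python
# Data-parallel index split: each "CPU" (rank) owns its own slice of a
# 100-element array and applies the same work to it.
def limits(rank, nranks=2, n=100):
    """Return the (1-based, inclusive) index range owned by this rank."""
    chunk = n // nranks
    low = rank * chunk + 1
    upper = n if rank == nranks - 1 else (rank + 1) * chunk
    return low, upper

print(limits(0))   # (1, 50)   <- "CPU A" in the slide
print(limits(1))   # (51, 100) <- "CPU B" in the slide
```

The last rank absorbs any remainder, a common convention when n is not divisible by the number of processors.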
Typical Task Parallel Application
• Example: signal processing
• Use one processor for each task
• Can use more processors if one is overloaded

DATA -> Normalize Task -> FFT Task -> Multiply Task -> Inverse FFT Task
Basics of Task Parallel Programming
• One code will run on 2 CPUs
• Program has 2 tasks (a and b) to be done by 2 CPUs

program.f:
  ...
  initialize
  ...
  if CPU=a then
    do task a
  elseif CPU=b then
    do task b
  end if
  ...
end program

What CPU A executes:

program.f:
  ...
  initialize
  ...
  do task a
  ...
end program

What CPU B executes:

program.f:
  ...
  initialize
  ...
  do task b
  ...
end program
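A rough Python analogue of the task-parallel split (the task bodies here are invented stand-ins for two pipeline stages, not the slide's FFT code):

```python
# Task-parallel sketch: the same program runs on two workers, but each
# performs a *different* task on the shared input data.
from concurrent.futures import ThreadPoolExecutor

def task_a(data):                 # stand-in for a "normalize" stage
    peak = max(abs(x) for x in data)
    return [x / peak for x in data]

def task_b(data):                 # stand-in for a running-sum stage
    total, out = 0.0, []
    for x in data:
        total += x
        out.append(total)
    return out

data = [1.0, -2.0, 4.0, 3.0]
with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(task_a, data)    # "CPU A" does task a
    fb = pool.submit(task_b, data)    # "CPU B" does task b
print(fa.result())    # [0.25, -0.5, 1.0, 0.75]
print(fb.result())    # [1.0, -1.0, 3.0, 6.0]
```

The structure mirrors the pseudocode: one program, with the branch on CPU identity replaced by submitting different tasks to different workers.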
How Your Problem Affects Parallelism
• The nature of your problem constrains how successful parallelization can be
• Consider your problem in terms of:
  – When data is used, and how
  – How much computation is involved, and when
• Importance of problem architectures:
  – Perfectly parallel
  – Fully synchronous
Perfect Parallelism
• Scenario: seismic imaging problem
  – The same application is run on data from many distinct physical sites
  – Concurrency comes from having multiple data sets processed at once
  – Could be done on independent machines (if data can be made available)
• This is the simplest style of problem
• Key characteristic: calculations for each data set are independent
  – Could divide/replicate data into files and run as independent serial jobs (also called "job-level parallelism")

[Figure: a seismic imaging application turns Site A, B, C, and D data into Site A, B, C, and D images, each independently.]
Fully Synchronous Parallelism
• Scenario: atmospheric dynamics problem
  – Data models an atmospheric layer; highly interdependent in horizontal layers
  – The same operation is applied in parallel to multiple data
  – Concurrency comes from handling large amounts of data at once
• Key characteristic: each operation is performed on all (or most) data
  – Operations/decisions depend on results of previous operations
• Potential problems
  – Serial bottlenecks force other processors to "wait"

[Figure: an atmospheric modeling application transforms initial atmospheric partitions into resulting partitions in lockstep.]
Limits of Parallel Computing
• Theoretical upper limits
  – Amdahl's Law
• Practical limits
  – Load balancing
  – Non-computational sections (I/O, system ops, etc.)
• Other considerations
  – Time to rewrite code
Theoretical Upper Limits to Performance
• All parallel programs contain:
  – Serial sections
  – Parallel sections
• Serial sections – where work is duplicated or no useful work is done (waiting for others) – limit parallel effectiveness
  • A lot of serial computation gives bad speedup
  • No serial work "allows" perfect speedup
• Speedup is the ratio of the time required to run a code on one processor to the time required to run the same code on multiple (N) processors – Amdahl's Law states this formally
Amdahl's Law
• Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
  – Effect of multiple processors on run time:

$$t_n = \left( f_p / N + f_s \right) t_1$$

  – Effect of multiple processors on speedup (S = t_1 / t_n):

$$S = \frac{1}{f_s + f_p / N}$$

  – Where:
    • f_s = serial fraction of code
    • f_p = parallel fraction of code (f_s + f_p = 1)
    • N = number of processors
    • t_n = time to run on N processors
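As a quick sanity check, the speedup formula can be evaluated directly (the function name `amdahl_speedup` is chosen here):

```python
# Amdahl's Law: S = 1 / (f_s + f_p / N), with f_s + f_p = 1.
def amdahl_speedup(fp, n):
    fs = 1.0 - fp                 # serial fraction
    return 1.0 / (fs + fp / n)

print(amdahl_speedup(1.0, 250))     # 250.0 -> perfect speedup, no serial work
print(amdahl_speedup(0.99, 250))    # ~71.6 -> just 1% serial work hurts badly
print(amdahl_speedup(0.99, 10**9))  # approaches 1/f_s = 100, no matter how many processors
```

The last line makes the limit explicit: as N grows without bound, S can never exceed 1/f_s.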
It takes only a small fraction of serial content in a code to degrade the parallel performance.

[Figure: "Illustration of Amdahl's Law" – speedup vs. number of processors (up to 250) for fp = 1.000, 0.999, 0.990, and 0.900.]
Amdahl's Law vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.

[Figure: speedup vs. number of processors (up to 250) for fp = 0.99 – the Amdahl's Law curve vs. the measured "Reality" curve, which tops out well below it.]
Practical Limits: Amdahl’s Law vs. Reality
• In reality, performance falls short of even the Amdahl's Law limit because of many things:
  – Communications
  – I/O
  – Load balancing (waiting)
  – Scheduling (shared processors or memory)
When do you do parallel computing
• Writing an effective parallel application is difficult
  – Communication can limit parallel efficiency
  – Serial time can dominate
  – Load balance is important
• Is it worth your time to rewrite your application?
  – Do the CPU requirements justify parallelization?
  – Will the code be used just once?
Parallelism Carries a Price Tag
• Parallel programming
  – Involves a learning curve
  – Is effort-intensive
• Parallel computing environments can be complex
  – Don't respond to many serial debugging and tuning techniques

Will the investment of your time be worth it?
Test the “Preconditions for Parallelism”
• According to experienced parallel programmers:
  – no green: Don't even consider it
  – one or more red: Parallelism may cost you more than you gain
  – all green: You need the power of parallelism (but there are no guarantees)

| Precondition | Frequency of Use | Execution Time | Resolution Needs |
|---|---|---|---|
| positive | thousands of times between changes | days or weeks | must significantly increase resolution or complexity |
| possible | dozens of times between changes | 4-8 hours | want to increase to some extent |
| negative | only a few times between changes | minutes | current resolution/complexity already more than needed |
Second Topic – Parallel Machines
• Simplistic architecture
• Types of parallel machines
• Network topology
• Parallel computing terminology
Simplistic Architecture
[Figure: many processors, each paired with its own memory (CPU + MEM), connected by an interconnect network.]
Processor Related Terms
• RISC: Reduced Instruction Set Computer
• PIPELINE: technique where multiple instructions are overlapped in execution
• SUPERSCALAR: multiple instructions per clock period
Network Interconnect Related Terms
• LATENCY: how long does it take to start sending a "message"? Units are generally microseconds nowadays.
• BANDWIDTH: what data rate can be sustained once the message is started? Units are bytes/sec, Mbytes/sec, Gbytes/sec, etc.
• TOPOLOGY: what is the actual "shape" of the interconnect? Are the nodes connected by a 2-D mesh? A ring? Something more elaborate?
Memory/Cache Related Terms
[Figure: CPU – Cache – MAIN MEMORY hierarchy.]

CACHE: cache is the level of the memory hierarchy between the CPU and main memory. Cache is much smaller than main memory, and hence there is mapping of data from main memory to cache.
Memory/Cache Related Terms
• ICACHE: instruction cache
• DCACHE (L1): data cache closest to registers
• SCACHE (L2): secondary data cache
  – Data from SCACHE has to go through DCACHE to registers
  – SCACHE is larger than DCACHE
  – L3 cache
• TLB: translation-lookaside buffer; keeps addresses of pages (blocks of memory) in main memory that have recently been accessed
Memory/Cache Related Terms (cont.)
[Figure: memory hierarchy from the CPU down through L1 cache, L2/L3 cache, DRAM, and the file system; speed increases toward the CPU, while size increases and cost ($/bit) decreases away from it.]
Memory/Cache Related Terms (cont.)
• The data cache was designed with two key concepts in mind
  – Spatial locality
    • When an element is referenced, its neighbors will be referenced too
    • Cache lines are fetched together
    • Work on consecutive data elements in the same cache line
  – Temporal locality
    • When an element is referenced, it might be referenced again soon
    • Arrange code so that data in cache is reused as often as possible
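Spatial locality can be made concrete with a toy model (the 8-element cache line and the no-eviction assumption are simplifications chosen for illustration):

```python
# Toy model of spatial locality: count how many distinct cache lines a
# traversal touches, assuming (hypothetically) 8 elements per line and
# no eviction. Sequential access uses all 8 elements of every fetched
# line; stride-8 access uses only 1 element per fetched line.
def line_fetches(indices, line_size=8):
    fetched = set()
    for i in indices:
        fetched.add(i // line_size)   # which cache line holds element i
    return len(fetched)

n = 64
print(line_fetches(range(n)))         # 8 lines for 64 sequential accesses
print(line_fetches(range(0, n, 8)))   # 8 lines for only 8 strided accesses
```

Both traversals fetch the same 8 lines, but the sequential one does 64 useful accesses for that cost while the strided one does only 8: one element per line, wasting 7/8 of every fetch.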
Types of Parallel Machines
• Flynn's taxonomy has been commonly used to classify parallel computers into one of four basic types:
  – Single instruction, single data (SISD): single scalar processor
  – Single instruction, multiple data (SIMD): Thinking Machines CM-2
  – Multiple instruction, single data (MISD): various special purpose machines
  – Multiple instruction, multiple data (MIMD): nearly all parallel machines
• Since the MIMD model "won", a much more useful way to classify modern parallel computers is by their memory model
  – Shared memory
  – Distributed memory
Shared and Distributed Memory

Shared memory – single address space. All processors have access to a pool of shared memory. (Examples: CRAY T90, SGI Altix.) Methods of memory access:
  – Bus
  – Crossbar

[Figure: several processors (P) on a common BUS to a single Memory.]

Distributed memory – each processor has its own local memory. Must do message passing to exchange data between processors. (Examples: CRAY T3E, XT; IBM Power; Sun and other vendor-made machines.)

[Figure: processor-memory pairs (P + M) connected by a Network.]
Styles of Shared Memory: UMA and NUMA

Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors (SMPs).

[Figure: four processors sharing one Memory over a BUS.]

Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Examples: HP-Convex Exemplar, SGI Altix.)

[Figure: two bus-based processor/memory groups joined by a secondary bus.]
UMA - Memory Access Problems
• Conventional wisdom is that these systems do not scale well
  – Bus-based systems can become saturated
  – Fast large crossbars are expensive
• Cache coherence problem
  – Copies of a variable can be present in multiple caches
  – A write by one processor may not become visible to others
  – They'll keep accessing the stale value in their caches
  – Need to take actions to ensure visibility, i.e. cache coherence
Machines
• T90, C90, YMP, XMP, SV1, SV2
• SGI Origin (sort of)
• HP-Exemplar (sort of)
• Various Suns
• Various Wintel boxes
• Most desktop Macintoshes
• Not new:
  – BBN GP 1000 Butterfly
  – VAX 780
Programming methodologies
• Standard Fortran or C, and let the compiler do it for you
• Directives can give hints to the compiler (OpenMP)
• Libraries
• Thread-like methods
  – Explicitly start multiple tasks
  – Each is given its own section of memory
  – Use shared variables for communication
• Message passing can also be used, but is not common
Distributed shared memory (NUMA)
• Consists of N processors and a global address space
  – All processors can see all memory
  – Each processor has some amount of local memory
  – Access to the memory of other processors is slower
• NonUniform Memory Access

[Figure: two bus-based processor/memory groups joined by a secondary bus.]
Memory
• Easier to build because of slower access to remote memory
• Similar cache problems
• Code writers should be aware of data distribution
  – Load balance
  – Minimize access of "far" memory
Programming methodologies
• Same as shared memory
• Standard Fortran or C, and let the compiler do it for you
• Directives can give hints to the compiler (OpenMP)
• Libraries
• Thread-like methods
  – Explicitly start multiple tasks
  – Each is given its own section of memory
  – Use shared variables for communication
• Message passing can also be used
Machines
• SGI Origin, Altix
• HP-Exemplar
Distributed Memory
• Each of N processors has its own memory
• Memory is not shared
• Communication occurs using messages

[Figure: processor-memory pairs (P + M) connected by a Network.]
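A toy sketch of the message-passing style (threads with a queue stand in for the separate processes and the interconnect; a real distributed-memory code would use MPI, and all names here are chosen for illustration):

```python
# Message-passing sketch for a distributed-memory machine: two "processors"
# each sum half of the data using only their local copy, then rank 1 sends
# its partial sum to rank 0 as a message.
import threading, queue

data = list(range(1, 101))            # global problem: sum 1..100
inbox = queue.Queue()                 # stands in for the interconnect

def worker(rank, chunk, result):
    partial = sum(chunk)              # compute on "local memory" only
    if rank == 1:
        inbox.put(partial)            # "send" the partial sum to rank 0
    else:
        partial += inbox.get()        # "receive" the other partial sum
        result.append(partial)

result = []
t0 = threading.Thread(target=worker, args=(0, data[:50], result))
t1 = threading.Thread(target=worker, args=(1, data[50:], result))
t0.start(); t1.start(); t0.join(); t1.join()
print(result[0])                      # 5050
```

The point is the pattern, not the mechanism: no worker ever reads the other's data directly; everything crosses the "network" as an explicit send/receive pair.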
Programming methodology
• Mostly message passing using MPI
• Data distribution languages
  – Simulate a global name space
  – Examples:
    • High Performance Fortran
    • Split-C
    • Co-array Fortran
Hybrid machines
• SMP nodes ("clumps") with an interconnect between clumps
• Machines
  – Cray XT3/4
  – IBM Power4/Power5
  – Sun, other vendor machines
• Programming
  – SMP methods on clumps, or message passing
  – Message passing between all processors

[Figure: two bus-based SMP nodes (processors sharing a memory) joined by an Interconnect.]
Currently Multi-socket Multi-core
Network Topology
• Custom
  – Many manufacturers offer custom interconnects (Myrinet, Quadrics, Colony, Federation, Cray, SGI)
• Off the shelf
  – Infiniband
  – Ethernet
  – ATM
  – HIPPI
  – Fibre Channel
  – FDDI
Types of interconnects
• Fully connected
• N-dimensional array and ring or torus
  – Paragon
  – Cray XT3/4
• Crossbar
  – IBM SP (8 nodes)
• Hypercube
  – nCUBE
• Trees, CLOS
  – Meiko CS-2, TACC Ranger (Sun machine)
• Combinations of some of the above
  – IBM SP (crossbar and fully connected for up to 80 nodes)
  – IBM SP (fat tree for > 80 nodes)
Wrapping produces torus
Parallel Computing Terminology
• Bandwidth - number of bits that can be transmitted in unit time, given as bits/sec, bytes/sec, Mbytes/sec.
• Network latency - time to make a message transfer through the network.
• Message latency or startup time - time to send a zero-length message, i.e., the software and hardware overhead of sending a message without any actual data transmission time.
• Communication time - total time to send a message, including software overhead and interface delays.
• Bisection width of a network - number of links (or sometimes wires) that must be cut to divide the network into two equal parts. Can provide a lower bound on the communication required by a parallel algorithm.
Communication Time Modeling
• Tcomm = Nmsg * Tmsg
  – Nmsg = # of non-overlapping messages
  – Tmsg = time for one point-to-point communication
• Tmsg = ts + tw * L
  – L = length of message (e.g., in words)
  – ts = latency = startup time (size independent)
  – tw = asymptotic time per word (1/BW)
Performance and Scalability Terms
• Efficiency: Measure of the fraction of time for which a processor is usefully employed. Defined as the ratio of speedup to the number of processors: E = S/N
• Amdahl's law: discussed before
• Scalability: An algorithm is scalable if the level of parallelism increases at least linearly with the problem size. An architecture is scalable if it continues to yield the same performance per processor, albeit on a larger problem size, as the # of processors increases. Algorithm and architecture scalability are important since they allow a user to solve larger problems in the same amount of time by buying a parallel computer with more processors.
Performance and Scalability Terms
• Superlinear speedup: In practice a speedup greater than N (on N processors) is called superlinear speedup.
• This is observed due to:
  – A non-optimal sequential algorithm
  – The sequential problem may not fit in one processor's main memory and requires slow secondary storage, whereas on multiple processors the problem fits in the combined main memory of the N processors
Sources of Parallel Overhead
• Interprocessor communication: Time to transfer data between processors is usually the most significant source of parallel processing overhead.
• Load imbalance: In some parallel applications it is impossible to distribute the subtask workload equally to each processor. So at some point all but one processor might be done and waiting for one processor to complete.
• Extra computation: Sometimes the best sequential algorithm is not easily parallelizable and one is forced to use a parallel algorithm based on a poorer but easily parallelizable sequential algorithm. Sometimes repetitive work is done on each of the N processors instead of send/recv, which leads to extra computation.
CPU Performance Comparison

             # of floating point operations in a program
  TFLOPS = ------------------------------------------------
             execution time in seconds * 10^12

• TFLOPS (Trillions of Floating Point Operations per Second) is dependent on the machine and on the program (the same program running on different computers would execute a different # of instructions but the same # of FP operations)
• TFLOPS is also not a consistent and useful measure of performance because
  – The set of FP operations is not consistent across machines, e.g. some have divide instructions, some don't
  – A TFLOPS rating for a single program cannot be generalized to establish a single performance metric for a computer
• Just when we thought we understood TFLOPS, the petaflop is almost here
CPU Performance Comparison
• Execution time is the principal measure of performance
• Unlike execution time, it is tempting to characterize a machine with a single MIPS or MFLOPS rating without naming the program, specifying I/O, or describing the versions of OS and compilers
Capability Computing
• Full power of a machine is used for a given scientific problem, utilizing CPUs, memory, interconnect, and I/O performance
• Enables the solution of problems that cannot otherwise be solved in a reasonable period of time - figure of merit: time to solution
• E.g., moving from a two-dimensional to a three-dimensional simulation, using finer grids, or using more realistic models

Capacity Computing
• Modest problems are tackled, often simultaneously, on a machine, each with less demanding requirements
• Smaller or cheaper systems are used for capacity computing, where smaller problems are solved
• Parametric studies or exploring design alternatives
• The main figure of merit is sustained performance per unit cost
• Today's capability computing can be tomorrow's capacity computing
Strong Scaling
• For a fixed problem size, how does the time to solution vary with the number of processors
• Run a fixed-size problem and plot the speedup
• When scaling of parallel codes is discussed, it is normally strong scaling that is being referred to

Weak Scaling
• How the time to solution varies with processor count for a fixed problem size per processor
• Interesting for O(N) algorithms, where perfect weak scaling is a constant time to solution, independent of processor count
• Deviations from this indicate that either
  – The algorithm is not truly O(N), or
  – The overhead due to parallelism is increasing, or both
Third Topic – Programming Parallel Computers
• Programming single-processor systems is (relatively) easy due to:
  – single thread of execution
  – single address space
• Programming shared memory systems can benefit from the single address space
• Programming distributed memory systems can be difficult due to multiple address spaces and need to access remote data
Single Program, Multiple Data (SPMD)
• SPMD: dominant programming model for shared and distributed memory machines.
  – One source code is written
  – Code can have conditional execution based on which processor is executing the copy
  – All copies of code are started simultaneously and communicate and synch with each other periodically
SPMD Programming Model
[Diagram: one source.c compiled into identical copies running on Processor 0 through Processor 3]
Shared Memory vs. Distributed Memory
• Tools can be developed to make any system appear to look like a different kind of system
  – distributed memory systems can be programmed as if they have shared memory, and vice versa
  – such tools do not produce the most efficient code, but might enable portability
• HOWEVER, the most natural way to program any machine is to use tools & languages that express the algorithm explicitly for the architecture.
Shared Memory Programming: OpenMP
• Shared memory systems (SMPs, cc-NUMAs) have a single address space:
  – applications can be developed in which loop iterations (with no dependencies) are executed by different processors
  – shared memory codes are mostly data parallel, ‘SIMD’ kinds of codes
  – OpenMP is a good standard for shared memory programming (compiler directives)
  – Vendors offer native compiler directives
Accessing Shared Variables
• If multiple processors want to write to a shared variable at the same time there may be conflicts. With a shared variable X in memory, processes 1 and 2 each:
  1) read X
  2) compute X+1 (in proc 1 and in proc 2)
  3) write X
  Both read the same original X, so one of the two updates is lost.
• Programmer, language, and/or architecture must provide ways of resolving conflicts
OpenMP Example #1: Parallel Loop

!$OMP PARALLEL DO
do i=1,128
   b(i) = a(i) + c(i)
end do
!$OMP END PARALLEL DO

• The first directive specifies that the loop immediately following should be executed in parallel. The second directive specifies the end of the parallel section (optional).
• For codes that spend the majority of their time executing the content of simple loops, the PARALLEL DO directive can result in significant parallel performance.
OpenMP Example #2: Private Variables

!$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(I,TEMP)
do I=1,N
   TEMP = A(I)/B(I)
   C(I) = TEMP + SQRT(TEMP)
end do
!$OMP END PARALLEL DO

• In this loop, each processor needs its own private copy of the variable TEMP. If TEMP were shared, the result would be unpredictable since multiple processors would be writing to the same memory location.
Distributed Memory Programming: MPI
• Distributed memory systems have separate address spaces for each processor
  – Local memory is accessed faster than remote memory
  – Data must be manually decomposed
  – MPI is the standard for distributed memory programming (library of subprogram calls)
  – Older message passing libraries include PVM and P4; all vendors have native libraries such as SHMEM (T3E) and LAPI (IBM)
MPI Example #1
• Every MPI program needs these:

#include <mpi.h>   /* the mpi include file */

/* Initialize MPI */
ierr = MPI_Init(&argc, &argv);

/* How many total PEs are there */
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

/* What node am I (what is my rank?) */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
...
ierr = MPI_Finalize();
MPI Example #2

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    /* print out my rank and this run's PE size */
    printf("Hello from %d\n", myid);
    printf("Numprocs is %d\n", numprocs);
    MPI_Finalize();
}
Fourth Topic – Supercomputer Centers and Rankings
• DOE National Labs - LANL, LLNL, Sandia, etc.
• DOE Office of Science Labs - ORNL, NERSC, BNL, etc.
• DOD, NASA supercomputer centers
• NSF supercomputer centers for academic users
  – San Diego Supercomputer Center (UCSD)
  – National Center for Supercomputing Applications (UIUC)
  – Pittsburgh Supercomputing Center
  – Texas Advanced Computing Center
  – Indiana
  – Purdue
  – ANL-Chicago
  – ORNL
  – LSU
  – NCAR
TeraGrid: Integrating NSF Cyberinfrastructure

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, LSU, and the National Center for Atmospheric Research.

[Map: TeraGrid sites - SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, LSU - and partner sites including Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo, UNC-RENCI, and Wisconsin]
Measure of Supercomputers
• Top 500 list (HPL code performance)
  – Is one of the measures, but not the measure
  – Japan's Earth Simulator (NEC) was on top for 3 years: 40 TFLOPS peak; 35 TFLOPS on HPL (87% of peak)
• In Nov 2005 the LLNL IBM BlueGene reached the top spot: ~65,000 nodes, 280 TFLOPS on HPL, 367 TFLOPS peak (currently 596 TFLOPS peak and 478 TFLOPS on HPL - 80% of peak)
  – First 100 TFLOPS (2005) and 200 TFLOPS (2006) sustained on a real application
• June 2008: RoadRunner at LANL achieves 1 PFLOPS on HPL (1.375 PFLOPS peak; 73% of peak)
• New HPCC benchmarks
• Many others - NAS, NERSC, NSF, DOD TI06, etc.
• Ultimate measure is the usefulness of a center for you - enabling better or new science through simulations on balanced machines
Other Benchmarks
• HPCC – High Performance Computing Challenge benchmarks – no rankings
• NSF benchmarks – HPCC, SPIO, and applications: WRF, OOCORE, GAMESS, MILC, PARATEC, HOMME (these are changing; new ones are being considered)
• DoD HPCMP – TI0X benchmarks
[Chart axes - quantities the HPCC benchmarks measure:]
• Floating point performance
• Processor-to-memory bandwidth
• Inter-processor communication of small messages
• Inter-processor communication of large messages
• Latency and bandwidth for simultaneous communication patterns
• Total communication capacity of the network
Kiviat diagrams
Fifth Topic – SDSC Parallel Machines
SDSC: Data-intensive Computing for the TeraGrid - 40 TF compute, 2.5 PB disk, 25 PB archive
• DataStar: IBM Power4+, 15.6 TFlops
• TeraGrid Linux Cluster: IBM/Intel IA-64, 4.4 TFlops
• BlueGene Data: IBM PowerPC, 17.1 TFlops
• OnDemand Cluster: Dell/Intel, 2.4 TFlops
• Archival systems: 25 PB capacity (~5 PB used)
• Storage Area Network disk: 2500 TB, Sun F15K disk server
DataStar is a powerful compute resource well-suited to “extreme I/O” applications
• Peak speed 15.6 TFlops
• IBM Power4+ processors (2528 total)
• Hybrid of 2 node types, all on a single switch
  – 272 8-way p655 nodes:
    • 176 with 1.5 GHz procs, 16 GB/node (2 GB/proc)
    • 96 with 1.7 GHz procs, 32 GB/node (4 GB/proc)
  – 11 32-way p690 nodes: 1.7 GHz, 64-256 GB/node (2-8 GB/proc)
• Federation switch: ~6 µs latency, ~1.4 GB/sec point-to-point bandwidth
• At 283 nodes, ours is one of the largest IBM Federation switches
• All nodes are direct-attached to high-performance SAN disk: 3.8 GB/sec write, 2.0 GB/sec read to GPFS
• GPFS now has 115 TB capacity
• ~700 TB of gpfs-wan across NCSA, ANL
• Will be retired in October 2008 for national users
• Due to consistent high demand, in FY05 we added 96 1.7 GHz/32 GB p655 nodes & increased GPFS storage from 60 -> 125 TB
  - Enables 2048-processor capability jobs
  - ~50% more throughput capacity
  - More GPFS capacity and bandwidth
SDSC’s three-rack BlueGene/L system
BG/L System Overview: Novel, massively parallel system from IBM
• Full system installed at LLNL from 4Q04 to 3Q05; addition in 2007
  – 106,496 nodes (212,992 cores)
  – Each node is two low-power PowerPC processors + memory
  – Compact footprint with very high processor density
  – Slow processors & modest memory per processor
  – Very high peak speed of 596 Tflop/s
  – #1 in the Top500 until June 2008 - Linpack speed of 478 Tflop/s
  – Two applications have run at over 100 (2005) and 200+ (2006) Tflop/s
• Many BG/L systems in the US and outside
• Now there are BG/P machines - ranked #3, #6, and #9 on the Top500
• Need to select apps carefully
  – Must scale (at least weakly) to many processors (because they're slow)
  – Must fit in limited memory
[BlueGene packaging hierarchy:]
• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
SDSC was the first academic institution with an IBM Blue Gene system
• SDSC's rack has the maximum ratio of I/O to compute nodes at 1:8 (LLNL's is 1:64). Each of the 128 I/O nodes in a rack has a 1 Gbps Ethernet connection => 16 GBps/rack potential.
• SDSC procured a 1-rack system 12/04, used initially for code evaluation and benchmarking; production 10/05. Now SDSC has 3 racks. (The LLNL system initially had 64 racks.)
BG/L System Overview: SDSC's 3-rack system
• 3,072 compute nodes & 384 I/O nodes (each with 2 processors)
  – Most I/O-rich configuration possible (8:1 compute:I/O node ratio)
  – Identical hardware in each node type, with different networks wired
  – Compute nodes connected to: torus, tree, global interrupt, & JTAG
  – I/O nodes connected to: tree, global interrupt, Gigabit Ethernet, & JTAG
  – IBM network: 4 µs latency, 0.16 GB/sec point-to-point bandwidth
  – I/O rates of 3.4 GB/s for writes and 2.7 GB/s for reads achieved on GPFS-WAN
• Two half racks (also confusingly called midplanes)
  – Connected via link chips
• Front-end nodes (2 B80s, each with 4 Power3 processors; 1 Power5 node)
• Service node (Power 275 with 2 Power4+ processors)
• Two parallel file systems using GPFS
  – Shared /gpfs-wan serviced by 58 NSD nodes (each with 2 IA-64s)
  – Local /bggpfs serviced by 12 NSD nodes (each with 2 IA-64s)
BG System Overview: Processor Chip (1)
[Block diagram: two 440 CPUs (one acting as I/O processor), each with a “Double FPU” and 32k/32k L1 caches with snoop; a multiported shared SRAM buffer; a shared L3 directory for EDRAM with ECC, fronting 4 MB of EDRAM (L3 cache or memory); a DDR controller with ECC driving 144-bit-wide 512 MB DDR; plus network interfaces for torus (6 out and 6 in, each at 1.4 Gbit/s per link), tree (3 out and 3 in, each at 2.8 Gbit/s per link), global interrupt, Gbit Ethernet, and JTAG access]
BG System Overview: Processor Chip (2) (= system-on-a-chip)
• Two 700-MHz PowerPC 440 processors
  – Each with two floating-point units
  – Each with 32-kB L1 data caches that are not coherent
  – 4 flops/proc-clock peak (= 2.8 Gflop/s-proc)
  – 2 8-B loads or stores per proc-clock peak in L1 (= 11.2 GB/s-proc)
• Shared 2-kB L2 cache (or prefetch buffer)
• Shared 4-MB L3 cache
• Five network controllers (though not all wired to each node)
  – 3-D torus (for point-to-point MPI operations: 175 MB/s nom x 6 links x 2 ways)
  – Tree (for most collective MPI operations: 350 MB/s nom x 3 links x 2 ways)
  – Global interrupt (for MPI_Barrier: low latency)
  – Gigabit Ethernet (for I/O)
  – JTAG (for machine control)
• Memory controller for 512 MB of off-chip, shared memory
Sixth Topic – Allocations on NSF Supercomputer Centers
• http://www.teragrid.org/userinfo/getting_started.php?level=new_to_teragrid
• Development Allocation Committee (DAC) awards up to 30,000 CPU-hours and/or some TBs of disk (these amounts are going up)
• Larger allocations awarded through merit-review of proposals by panel of computational scientists
• UC Academic Associates
  – Special program for UC campuses
  – www.sdsc.edu/user_services/aap
Medium and large allocations
• Requests of 10,001-500,000 SUs reviewed quarterly (MRAC)
• Requests of more than 500,000 SUs reviewed twice per year (LRAC)
• Requests can span all NSF-supported resource providers
• Multi-year requests and awards are possible
New: Storage Allocations
• SDSC is now making disk storage and database resources available via the merit-review process
  – SDSC Collections Disk Space
    • >200 TB of network-accessible disk for data collections
  – TeraGrid GPFS-WAN
    • Many 100s of TB of parallel file system attached to TG computers
    • Portion available for long-term storage allocations
  – SDSC Database
    • Dedicated disk/hardware for high-performance databases
    • Oracle, DB2, MySQL
And all this will cost you…
…absolutely nothing.
$0
plus the time to write your proposal
Seventh Topic – One Application Example – Turbulence
Turbulence using Direct Numerical Simulation (DNS)
“Large”
Evolution of Computers and DNS
Can use N² procs for N³ grid (2D decomposition); can use N procs for N³ grid (1D decomposition)
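The scaling limits above follow from simple counting: a 1D (slab) decomposition hands each processor whole planes of the grid, while a 2D (pencil) decomposition splits two of the three dimensions. A minimal sketch of that arithmetic (this code is illustrative, not from the slides; the function names are my own):

```python
# Maximum processor counts usable by slab (1D) vs pencil (2D)
# decompositions of an N^3 grid.

def max_procs_1d(n: int) -> int:
    """Slab decomposition: at best one x-y plane per processor."""
    return n

def max_procs_2d(n: int) -> int:
    """Pencil decomposition: at best one pencil (grid line) per processor."""
    return n * n

if __name__ == "__main__":
    n = 4096
    print(f"N = {n}: 1D limit = {max_procs_1d(n)}, "
          f"2D limit = {max_procs_2d(n)}")
```

For a 4096³ grid this is the difference between a ceiling of 4,096 processors and roughly 16.8 million, which is why the 2D decomposition matters at large scale.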
![Page 107: Introduction to Supercomputers, Architectures and High Performance Computing Amit Majumdar Scientific Computing Applications Group SDSC Many others: Tim.](https://reader038.fdocuments.us/reader038/viewer/2022102713/56649eb25503460f94bb9aa5/html5/thumbnails/107.jpg)
1D Decomposition
![Page 108: Introduction to Supercomputers, Architectures and High Performance Computing Amit Majumdar Scientific Computing Applications Group SDSC Many others: Tim.](https://reader038.fdocuments.us/reader038/viewer/2022102713/56649eb25503460f94bb9aa5/html5/thumbnails/108.jpg)
2D decomposition
![Page 109: Introduction to Supercomputers, Architectures and High Performance Computing Amit Majumdar Scientific Computing Applications Group SDSC Many others: Tim.](https://reader038.fdocuments.us/reader038/viewer/2022102713/56649eb25503460f94bb9aa5/html5/thumbnails/109.jpg)
2D Decomposition cont’d
![Page 110: Introduction to Supercomputers, Architectures and High Performance Computing Amit Majumdar Scientific Computing Applications Group SDSC Many others: Tim.](https://reader038.fdocuments.us/reader038/viewer/2022102713/56649eb25503460f94bb9aa5/html5/thumbnails/110.jpg)
Communication

Global communication: traditionally, a serious challenge for scaling applications to large node counts.

• 1D decomposition: 1 all-to-all exchange involving P processors
• 2D decomposition: 2 all-to-all exchanges within p1 groups of p2 processors each (p1 × p2 = P)
• Which is better? Most of the time 1D wins. But again: it can't be scaled beyond P = N.

Crucial parameter is bisection bandwidth.
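The trade-off in the bullets above can be made concrete by counting per-processor messages: one all-to-all over all P ranks versus two all-to-alls confined to subgroups of sizes p1 and p2. A small sketch under those assumptions (function names are mine, not from the slides):

```python
# Per-rank message counts for the transpose communication in a 3D FFT.
# 1D decomposition: a single all-to-all among all P processors.
# 2D decomposition: two all-to-alls, each within a subgroup
# (groups of size p2 and of size p1, with p1 * p2 = P).

def messages_1d(P: int) -> int:
    """One all-to-all: each rank exchanges with P - 1 others."""
    return P - 1

def messages_2d(p1: int, p2: int) -> int:
    """Two smaller all-to-alls: (p1 - 1) + (p2 - 1) partners per rank."""
    return (p1 - 1) + (p2 - 1)

if __name__ == "__main__":
    P = 4096
    print("1D:", messages_1d(P))        # 4095 messages per rank
    print("2D:", messages_2d(64, 64))   # 126 messages per rank
```

The 2D scheme sends far fewer, larger messages per rank, but moves the same total data; which wins in practice depends on latency versus the network's bisection bandwidth, as the slide notes.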
![Page 111: Introduction to Supercomputers, Architectures and High Performance Computing Amit Majumdar Scientific Computing Applications Group SDSC Many others: Tim.](https://reader038.fdocuments.us/reader038/viewer/2022102713/56649eb25503460f94bb9aa5/html5/thumbnails/111.jpg)
Performance: towards 4096³

111
![Page 112: Introduction to Supercomputers, Architectures and High Performance Computing Amit Majumdar Scientific Computing Applications Group SDSC Many others: Tim.](https://reader038.fdocuments.us/reader038/viewer/2022102713/56649eb25503460f94bb9aa5/html5/thumbnails/112.jpg)
112
![Page 113: Introduction to Supercomputers, Architectures and High Performance Computing Amit Majumdar Scientific Computing Applications Group SDSC Many others: Tim.](https://reader038.fdocuments.us/reader038/viewer/2022102713/56649eb25503460f94bb9aa5/html5/thumbnails/113.jpg)
113
![Page 114: Introduction to Supercomputers, Architectures and High Performance Computing Amit Majumdar Scientific Computing Applications Group SDSC Many others: Tim.](https://reader038.fdocuments.us/reader038/viewer/2022102713/56649eb25503460f94bb9aa5/html5/thumbnails/114.jpg)
114