MACI - University of Alberta - April 2001

High-Performance Computing

José Nelson Amaral
Department of Computing Science
University of Alberta
amaral@cs.ualberta.ca
Why High Performance Computing?
- Many important problems cannot yet be solved, even with the fastest machines available.
- Faster computers enable the formulation of more interesting questions.
- When a problem is solved, researchers find bigger problems to tackle!
Grand Challenges
- weather forecasting
- economic modeling
- computer-aided design
- drug design
- exploring the origins of the universe
- searching for extra-terrestrial life
- computer vision
Grand Challenges
To simulate the folding of a 300-amino-acid protein in water:
- number of atoms: ~32,000
- folding time: 1 millisecond
- number of FLOPs: 3 × 10^22
- machine speed: 1 PetaFLOP/s
- simulation time: 1 year
(Source: IBM Blue Gene Project)

IBM’s answer: the Blue Gene Project, with US$100M of funding to build a 1 PetaFLOP/s computer.
Ken Dill and Kit Lau’s protein folding model.
Charles L Brooks III, Scripps Research Institute
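The simulation-time figure on this slide is a direct division of the work by the machine speed:

$$
\frac{3 \times 10^{22}\ \text{FLOPs}}{10^{15}\ \text{FLOP/s}} = 3 \times 10^{7}\ \text{s} \approx 1\ \text{year}
$$

(3 × 10^7 seconds is roughly 347 days.)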
Grand Challenges
In 1996 the GeneCrunch project demonstrated that a cluster of SGI Challenge machines (64 processors) delivers near-linear speedup for multiple sequence alignment.
Commercial Applications
In October 2000, SGI and ESI (France) revealed a crash simulator to be used in the future BMW Series 5.

Sustained performance: 12 GFLOPS
Processors: 96 × 400 MHz MIPS
Machine: SGI Origin 3000 series
Powerful Computers
Increased computing power enables:
- increasing problem dimensions
- adding more particles to a system
- increasing the accuracy of the result
- improving experiment turnaround time
Speed and Storage
Solution?
Instead of using a single processor…
- use multiple processors
- combine their efforts to solve a problem
- benefit from the aggregate of their processing speed, memory, cache, and disk storage
This Talk
- Motivation
- Parallel Machine Organizations
- Cluster Computing
- Programming Models
- Cache Coherence and Memory Consistency
- The Top 500: Who is the fastest?
- Processor Architecture: What is new?
- The Role of Compilers
- Speedup and Scalability
- Final Remarks
This Talk
- Motivation
- Parallel Machine Organizations
- Cluster Computing
- Programming Models
- Cache Coherence and Memory Consistency
- The Top 500: Who is the fastest?
- Processor Architecture: What is new?
- The Role of Compilers
- Speedup and Scalability
- Final Remarks
(Timeline figure with milestones at 1980, 1988, 1990, 1994, 1998, and 2000.)
Distributed Memory Machine Architecture
(Diagram: multiple nodes, each a processor with caches, local memory, and I/O, connected by an interconnection network.)

Non-Uniform Memory Access (NUMA): accessing local memory is faster than accessing remote memory.
Centralized Shared Memory Multiprocessor
(Diagram: several processors, each with private caches, sharing a single main memory and I/O system through an interconnection network.)
Centralized Shared Memory Multiprocessor
(Diagram: processors with private caches on one side of an interconnection network; memory banks and I/O controllers on the other.)

Uniform Memory Access (UMA): the “dance hall” approach.
Distributed Shared Memory (Clusters of SMPs)
(Diagram: SMP nodes, each with several processors and caches on a node interconnection network plus local memory and I/O, joined by a cluster interconnection network.)
Typically: Shared Address Space with Non-Uniform Memory Access (NUMA)
This Talk
- Motivation
- Parallel Machine Organizations
- Cluster Computing
- Programming Models
- Cache Coherence and Memory Consistency
- The Top 500: Who is the fastest?
- Processor Architecture: What is new?
- The Role of Compilers
- Speedup and Scalability
- Final Remarks
What’s Next?
What is Next in High-Performance Computing?(Gordon Bell and Jim Gray, Comm. of ACM, Feb 2002)
Thesis:
1. Clusters are becoming ubiquitous, and even traditional data centers are migrating to clusters.
2. Grid communities are beginning to provide significant advantages for addressing parallel problems and sharing vast numbers of files.
“Dark Side of Clusters: Clusters perform poorly on applications that require large shared memory.”
Beowulf
Project started at NASA in 1993 with the goal of:
“Implementing a 1 GFLOPS workstation costing less than US$50,000 using commercial off-the-shelf (COTS) hardware and software.”

In 1994 a US$40,000 cluster with 16 Intel 486s reached the goal.

In 1997 a Beowulf cluster won the Gordon Bell performance/price prize.

In June 2001, 28 Beowulfs were in the Top500 fastest computers in the world.
“The Dark Side of Clusters”
What is Next in High-Performance Computing?(Gordon Bell and Jim Gray, Comm. of ACM, Feb 2002)
“Clusters perform poorly on applications that require large shared memory.”
PAP = Peak Advertised Performance
RAP = Real Application Performance

Shared-memory computers deliver RAP of 30-50% of the PAP, while clusters deliver 5-15% of the PAP.
Non-Shared Address Space
Clusters require an explicit message-passing programming model:
- MPI is the most widely used parallel programming model today.
- PVM is used in some engineering departments.
Large and Expensive Clusters
ASCI White (July 2000):
- 8,192 PowerPC processors
- 6 TB of memory
- 160 TB of disk space
- 12.3 teraops (peak)
- 28 tractor trailers to transport

Supplier: IBM
Client: US Department of Energy
Main application: simulated testing of the nuclear weapons stockpile
This Talk
- Motivation
- Parallel Machine Organizations
- Cluster Computing
- Programming Models
- Cache Coherence and Memory Consistency
- The Top 500: Who is the fastest?
- Processor Architecture: What is new?
- The Role of Compilers
- Speedup and Scalability
- Final Remarks
Programming Model Requirements
What data can be named by the threads?
What operations can be performed on the named data?
What ordering exists among these operations?
Programming Model Requirements
Naming:
- global physical address space
- independent local physical address spaces

Ordering:
- mutual exclusion
- events
- communication vs. synchronization
Parallel Framework
Layers: programming model
- Multiprogramming: lots of jobs, no communication
- Shared address space: communicate via memory
- Message passing: send and receive messages
Message Passing Model
- Communicate through explicit I/O operations; essentially NUMA, but integrated at the I/O devices rather than at the memory system.
- Send specifies a local buffer + the receiving process on a remote computer.
- Receive specifies the sending process on a remote computer + a local buffer to place the data.
- Synchronization: when the send completes, when the buffer is free, when the request is accepted, when the receive waits for the send.
- Send + receive => a memory-to-memory copy, where each side supplies a local address, AND a pair-wise synchronization!
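The send/receive pairing described above can be sketched with Python threads and queues (an illustration added here, not MPI; the mailbox scheme and all names are invented for the example):

```python
import threading, queue

# Each "process" owns a mailbox; send names the receiving process,
# receive reads from the receiver's own mailbox (its local buffer).
mailboxes = {0: queue.Queue(), 1: queue.Queue()}

def send(dst, data):
    mailboxes[dst].put(data)          # copy data toward the receiver

def receive(mailbox_id):
    return mailboxes[mailbox_id].get()  # blocks until the send arrives

result = []

def worker():                          # plays the "remote computer"
    msg = receive(1)                   # receive: local buffer for the data
    result.append(msg * 2)

t = threading.Thread(target=worker)
t.start()
send(1, 21)                            # send: data + receiving process
t.join()
# result == [42]: send+receive performed a memory-to-memory copy
# and a pair-wise synchronization (get() blocked until put()).
```

The blocking `get()` is what makes the communication double as synchronization, exactly the property the slide attributes to send/receive pairs.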
Shared Address Model Summary
Each processor can name every physical location in the machine
Each process can name all data it shares with other processes
- Data transfer via load and store.
- Data sizes: byte, word, ..., or cache blocks.
- Uses virtual memory to map virtual addresses to local or remote physical addresses.
- The memory hierarchy model applies: communication moves data into the local processor cache (just as a load moves data from memory to cache).
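The contrast with message passing, plain loads and stores carrying the data, with synchronization supplied separately, can be sketched with Python threads (an added illustration; `shared`, `ready`, and the thread bodies are invented names):

```python
import threading

# Shared address space: both threads name the same location "x".
shared = {"x": 0}
ready = threading.Event()   # synchronization is separate from the data
seen = []

def producer():
    shared["x"] = 42        # a plain "store" into the shared space
    ready.set()             # signal that the store has happened

def consumer():
    ready.wait()            # without this, the load could race the store
    seen.append(shared["x"])  # a plain "load" observes the store

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t2.start(); t1.start()
t1.join(); t2.join()
# seen == [42]
```

Note that the data transfer itself is implicit (just memory accesses); only the ordering needs explicit machinery, the reverse of the message-passing model above.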
Shared Address/Memory Multiprocessor Model
- Communicate via load and store; the oldest and most popular model.
- Process: a virtual address space and ~1 thread of control.
- Multiple processes can overlap (share), but all threads share the process address space.
- Writes to the shared address space by one thread are visible to reads by other threads.
Advantages of the shared-memory communication model

- Compatibility with SMP hardware.
- Ease of programming for complex or dynamic communication patterns.
- Uses the familiar SMP model: attention is needed only on performance-critical accesses.
- Lower communication overhead: better use of bandwidth for small items; memory mapping implements protection in hardware.
- Hardware-controlled caching reduces remote communication by caching all data, both shared and private.
Advantages of the message-passing communication model

- The hardware can be simpler.
- Communication is explicit, so it is simpler to understand and focuses attention on the costly aspect of parallel computation.
- Synchronization is associated with messages, which reduces the potential for errors introduced by incorrect synchronization.
- Easier to implement sender-initiated communication models, which may have some performance advantages.
Programming Models

(Diagram: the data-parallel model comprises SIMD (Single Instruction, Multiple Data) and SPMD (Single Program, Multiple Data); the task-parallel model comprises MPMD (Multiple Programs, Multiple Data). SIMD maps onto SIMD architectures; SPMD and MPMD map onto MIMD architectures.)
OpenMP (1)
- OpenMP gives programmers a “simple” and portable interface for developing shared-memory parallel programs.
- OpenMP supports C/C++ and Fortran on “all” architectures, including Unix platforms and Windows NT platforms.
- It may become the industry standard.
OpenMP (2) - C
#pragma omp parallel for shared(A) private(i)
for (i = 1; i <= 100; i++) {
    ... + A[i];     /* use the old A[i] */
    A[i] = ...;     /* compute the new A[i] */
}
OpenMP (3) - Fortran
c$omp parallel do schedule(static)
c$omp& shared(omega,error,uold,u)
c$omp& private(i,j,resid)
c$omp& reduction(+:error)
      do j = 2,m-1
        do i = 2,n-1
          resid = calcerror(uold,i,j)
          u(i,j) = uold(i,j) - omega * resid
          error = error + resid*resid
        end do
      end do
c$omp end parallel do
Vector Processing (1)
- Cray, NEC computers
- multiple functional units, each with multiple stages
- replication and pipelined parallelism at the instruction level
Vector Processing (2)
for (i = 0; i <= N; i++)
    A[i] = B[i] * C[i];

(Animation: successive iterations of the multiply flow through three pipelined multiply units, Mult1, Mult2, and Mult3.)
Multi-threading
OS-level multi-threading: P-threads
Programming Language-level multi-threading: Java
Fine Grain Multi-threading: Threaded-C, Cilk, TAM
Hardware Supported Multi-threading: Tera
Instruction Level Multi-threading: Simultaneous Multi-threading (Compaq-Intel)
Other Issues: Debugging
Debugging parallel programs can be frustrating
- non-deterministic execution
- probe effect
- difficult to “stop” a parallel program
- multiple core files
- difficult to visualize parallel activity
- tools are barely adequate
Other Issues: Performance Tuning

Use available performance-tuning tools (perfex, SpeedShop on SGI) to learn where the program spends its time.
Re-tune code for performance when hardware changes.
Other Issues: Fault Tolerance
Consider a job that has been running on 40 processors for a week when a power outage loses all the work. Long-running jobs must be able to save the program’s state and later restart from that state. This is called checkpointing.
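The save-and-restart cycle can be sketched in a few lines of Python (an added illustration; the state dictionary, file layout, and checkpoint interval are all invented for the example):

```python
import os, pickle, tempfile

def run(steps, ckpt_path, every=10):
    """Toy checkpointed loop: resume from ckpt_path if it exists,
    otherwise start fresh; save the state every `every` iterations."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)        # restart from the saved state
    else:
        state = {"i": 0, "total": 0}
    while state["i"] < steps:
        state["total"] += state["i"]      # stands in for the real computation
        state["i"] += 1
        if state["i"] % every == 0:
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)     # checkpoint to stable storage
    return state["total"]

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
partial = run(25, path)    # pretend this run was then interrupted...
resumed = run(100, path)   # ...a later run resumes from the last checkpoint
```

The second call does not redo iterations already covered by the newest checkpoint, which is precisely what saves the week of work after a power outage.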
This Talk
- Motivation
- Parallel Machine Organizations
- Cluster Computing
- Programming Models
- Cache Coherence and Memory Consistency
- The Top 500: Who is the fastest?
- Processor Architecture: What is new?
- The Role of Compilers
- Speedup and Scalability
- Final Remarks
What Does Coherency Mean?
Informally:
- “Any read must return the most recent write.”
- Too strict and too difficult to implement.

Better:
- “Any write must eventually be seen by a read.”
- All writes are seen in proper order (“serialization”).

Two rules to ensure this:
- “If P writes x and P1 reads x, P’s write will be seen by P1 if the read and write are sufficiently far apart.”
- Writes to a single location are serialized: they are seen in the same order.
Potential HW Coherency Solutions
Snooping solution (snoopy bus):
- Send all requests for data to all processors.
- Each processor snoops to see if it has a copy.
- Requires broadcast; works well with a bus (a natural broadcast medium).
- Preferred scheme for small-scale machines.

Directory-based schemes:
- Keep track of what is being shared in one centralized place.
- Distributed memory => distributed directory.
- Sends point-to-point requests.
- Scales better than snooping.
Basic Snoopy Protocols
Write-invalidate protocol:
- Multiple readers, single writer.
- Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies.
- Read miss:
  - write-through: memory is always up-to-date
  - write-back: snoop in the caches to find the most recent copy

Write-broadcast protocol:
- Write to shared data: broadcast on the bus; processors snoop and update any copies.
- Read miss: memory is always up-to-date.

Write serialization: the bus serializes requests! The bus is the single point of arbitration.
Basic Snoopy Protocols
Write invalidate versus write broadcast:
- Invalidate requires one transaction per write-run.
- Invalidate exploits spatial locality: one transaction per block.
- Broadcast has lower latency between write and read.
Snoopy Cache-Invalidation Protocol (Example)

(Step 1: a processor issues “read x”; it misses in its cache, and the value x = 0 is supplied by main memory over the interconnection network.)
(Step 2: that processor’s cache now holds x = 0 in the shared state.)
(Step 3: a second processor issues “read x”, also misses, and likewise receives x = 0.)
(Step 4: both caches now hold x = 0 in the shared state.)
(Step 5: one processor writes x; an invalidate is broadcast on the bus, and the other cached copies of x are invalidated.)
(Step 6: the writing processor now holds x = 1 in the exclusive state.)
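The invalidation example above can be condensed into a toy simulator (a sketch added here, not the slides’ protocol in full: it tracks a single location x, and on a read miss it snoops an exclusive copy and writes it back, as in the write-back case described earlier):

```python
class SnoopyBus:
    """Toy write-invalidate protocol for one memory location x."""
    def __init__(self, n_procs, mem_value=0):
        self.memory = mem_value
        # Per-processor cache line for x: None (invalid) or (value, state).
        self.caches = [None] * n_procs

    def read(self, p):
        if self.caches[p] is None:                  # read miss
            for q, line in enumerate(self.caches):  # snoop for an
                if line is not None and line[1] == "exclusive":
                    self.memory = line[0]           # exclusive copy:
                    self.caches[q] = (line[0], "shared")  # write back
            self.caches[p] = (self.memory, "shared")
        return self.caches[p][0]

    def write(self, p, value):
        for q in range(len(self.caches)):           # broadcast invalidate
            if q != p:
                self.caches[q] = None
        self.caches[p] = (value, "exclusive")

bus = SnoopyBus(4)
bus.read(1)       # step 1-2: read miss, cache x = 0 shared
bus.read(2)       # step 3-4: second reader, also shared
bus.write(1, 1)   # step 5-6: write invalidates the other copy
```

After the write, processor 1 holds (1, "exclusive") and processor 2’s copy is invalid; a subsequent `bus.read(2)` snoops the exclusive copy and returns 1, not the stale memory value.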
Programmer’s Abstraction for a Sequential Consistency Model

(Diagram: processors P1 through Pn access a single memory through a switch; the switch is randomly set after each memory reference. See Culler, Singh, and Gupta, p. 287.)
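The random-switch abstraction can be simulated directly (a sketch added for illustration; the operation tuples and function name are invented): memory serves one operation at a time, chosen at random among the processors, while each processor’s own operations stay in program order.

```python
import random

def sequentially_consistent_run(programs, seed=0):
    """Interleave per-processor programs through a single 'switch'."""
    rng = random.Random(seed)
    pcs = [0] * len(programs)          # per-processor program counters
    memory, history = {}, []
    while any(pc < len(prog) for pc, prog in zip(pcs, programs)):
        # The switch randomly picks a processor that still has work.
        p = rng.choice([i for i, prog in enumerate(programs)
                        if pcs[i] < len(prog)])
        op = programs[p][pcs[p]]
        pcs[p] += 1
        if op[0] == "write":           # ("write", addr, value)
            memory[op[1]] = op[2]
            history.append((p, op))
        else:                          # ("read", addr) -> record the value
            history.append((p, (*op, memory.get(op[1], 0))))
    return history

# P0 writes x then y; P1 reads y then x. Under this model, if P1
# sees y == 1 it must also see x == 1, because the switch imposes
# one total order that respects each program order.
h = sequentially_consistent_run(
    [[("write", "x", 1), ("write", "y", 1)],
     [("read", "y"), ("read", "x")]])
```

Running it over many seeds never produces the outcome y == 1, x == 0, which is exactly the guarantee sequential consistency makes and weaker models do not.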
This Talk
- Motivation
- Parallel Machine Organizations
- Cluster Computing
- Programming Models
- Cache Coherence and Memory Consistency
- The Top 500: Who is the fastest?
- Processor Architecture: What is new?
- The Role of Compilers
- Speedup and Scalability
- Final Remarks
Top 500 (November 10, 2001)
| Rank | Manuf. | Computer | Rmax | Site | Year | Proc. | Rpeak | Rmax/Rpeak |
|------|--------|----------|------|------|------|-------|-------|------------|
| 1 | IBM | ASCI White, Power3, 375 MHz | 7226 | Lawrence Livermore | 2000 | 8192 | 12288 | 0.59 |
| 2 | Compaq | AlphaServer SC ES45/1 GHz | 4059 | Pittsburgh Superc. Center | 2001 | 6048 | 6048 | 0.67 |
| 3 | IBM | SP Power3, 375 MHz 16-way | 3052 | NERSC/LBNL | 2001 | 3328 | 4992 | 0.61 |
| 4 | Intel | ASCI Red | 2379 | Sandia Nat. Lab. | 1999 | 9632 | 3207 | 0.74 |
| 5 | IBM | ASCI Blue-Pacific, SP 604e | 2144 | Lawrence Livermore | 1999 | 5808 | 3868 | 0.55 |
| 6 | Compaq | AlphaServer SC ES45/1 GHz | 2096 | Los Alamos Nat. Lab. | 2001 | 1536 | 3072 | 0.68 |
| … | | | | | | | | |
| 7 | Hitachi | SR8000/MPP | 1709.1 | Univ. of Tokyo | 2001 | 1152 | 2074 | 0.82 |
| 12 | NEC | SX-5/128M8 3.2ns | 1192 | Osaka University | 2001 | 128 | 1280 | 0.93 |

Rmax = maximal LINPACK performance achieved [GFLOPS]
Rpeak = theoretical peak performance [GFLOPS]

Canada appears in ranks 123, 144, 183, 255, 266, 280, 308, 311, 315, 414, 419.
Top500 Statistics
[Four charts summarizing the Top 500 list:
Computer Family – HP SPP, IBM SP, SGI Origin, T3E/T3D, NOW, Sun UltraHPC, Others
Type of Organization – Industry, Research, Academic, Classified, Vendor, Government
Machine Organization – MPP, Constellations, SMP, Clusters
Country – USA, Germany, Japan, UK, France, Korea, Italy, Canada, Others]
Intel Architecture 64
Itanium, the first one, is out…
… but we are still waiting for McKinley…
Will we get Yamhill (Intel’s Plan B) instead?
Alpha is gone...
So is Compaq! EV8 is scrapped.
Compaq designers split between AMD and Intel.
Intel converts to the Simultaneous Multithreading religion, but renames it: Hyperthreading!!
IBM’s POWER4 is the 2001/2002 winner
Best floating-point and integer performance available in the market.
Highest memory bandwidth in the industry.
Well-integrated cache coherence mechanism.
Implicit × Explicit Instruction-Level Parallelism

EPIC: Explicitly Parallel Instruction Computing.

Superscalar: instruction-level parallelism discovered implicitly, by the compiler or by the hardware.
Instruction Level Parallelism
Most ILP is implicit, i.e., instructions that can be executed in parallel are automatically discovered by the hardware at runtime.

Intel launched Itanium, the first IA-64 processor, which exploits explicit parallelism at the instruction level: in this processor the compiler encodes parallel operations in the assembly code.

But applications are still written in standard sequential programming languages.
Instruction Level Parallelism
For example, in the IA-64 an instruction group, identified by the compiler, is a set of instructions that have no read-after-write (RAW) or write-after-write (WAW) register dependencies (they can execute in parallel).

Consecutive instruction groups are separated by stops (represented by a double semicolon in the assembly code).

ld8 r1=[r5]      // First group
sub r6=r8, r9    // First group
add r3=r1,r4 ;;  // First group
st8 [r6]=r12     // Second group
IA-64 Innovations
if-conversion: execute both sides of a branch.

if (r1 == 0)
    r2 = r3 + r4
else
    r7 = r6 - r5

cmp.eq p1, p2 = r1, 0 ;;  // set predicate registers: p1 = (r1 == 0), p2 = !p1
(p1) add r2 = r3, r4
(p2) sub r7 = r6, r5

data speculation: load a value before knowing if the address is correct.

control speculation: execute a computation before knowing if it has to be executed.

rotating registers: support for software pipelining.
Below × Above the Line

for(n=0 ; …)
 for(f=0 ; …)
  for(t=0 ; …)
   for(x=0 ; …)
    for(y=0 ; …)
     for(z=0 ; …) {
       ……
     }

[Figure: a line splits the loop nest between Application-Level Parallelism and Automatic Parallelism.]
Some Common Loop Optimizations
Unswitching
Loop Peeling
Loop Alignment
Index Set Splitting
Scalar Expansion
Loop Fusion
Loop Fission
Loop Reversal
Loop Interchange
Unswitching
Remove loop independent conditionals from a loop.
Before Unswitching:

for i=1 to N do
  for j=2 to N do
    if T[i] > 0 then
      A[i,j] = A[i,j-1]*T[i] + B[i]
    else
      A[i,j] = 0.0
    endif
  endfor
endfor

After Unswitching:

for i=1 to N do
  if T[i] > 0 then
    for j=2 to N do
      A[i,j] = A[i,j-1]*T[i] + B[i]
    endfor
  else
    for j=2 to N do
      A[i,j] = 0.0
    endfor
  endif
endfor
Loop Peeling
Remove the first (last) iteration of the loop into separate code.
Before Peeling:

for i=1 to N do
  A[i] = (X+Y)*B[i]
endfor

After Peeling:

if N >= 1 then
  A[1] = (X+Y)*B[1]
  for i=2 to N do
    A[i] = (X+Y)*B[i]
  endfor
endif
Index Set Splitting
Divides the index set into two portions.
Before Set Splitting:

for i=1 to 100 do
  A[i] = B[i] + C[i]
  if i > 10 then
    D[i] = A[i] + A[i-10]
  endif
endfor

After Set Splitting:

for i=1 to 10 do
  A[i] = B[i] + C[i]
endfor
for i=11 to 100 do
  A[i] = B[i] + C[i]
  D[i] = A[i] + A[i-10]
endfor
Scalar Expansion
Breaks anti-dependence relations by expanding, or promoting a scalar into an array.
Before Scalar Expansion:

for i=1 to N do
  T = A[i] + B[i]
  C[i] = T + 1/T
endfor

After Scalar Expansion:

if N >= 1 then
  allocate Tx(1:N)
  for i=1 to N do
    Tx[i] = A[i] + B[i]
    C[i] = Tx[i] + 1/Tx[i]
  endfor
  T = Tx[N]
endif
Loop Fusion
Takes two adjacent loops and generates a single loop.
Before Loop Fusion:

(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to N do
(5)   C[i] = A[i] / 2
(6) endfor
(7) for i=1 to N do
(8)   D[i] = 1 / C[i+1]
(9) endfor

After Loop Fusion:

(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(5)   C[i] = A[i] / 2
(6) endfor
(7) for i=1 to N do
(8)   D[i] = 1 / C[i+1]
(9) endfor
Loop Fusion (Another Example)
(1) for i=1 to 99 do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to 98 do
(5)   C[i] = A[i+1] * 2
(6) endfor

After peeling the first iteration:

(2) A[1] = B[1] + 1
(1) for i=2 to 99 do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to 98 do
(5)   C[i] = A[i+1] * 2
(6) endfor

After fusion:

(1) i = 1
(2) A[i] = B[i] + 1
    for ib=0 to 97 do
(1)   i = ib+2
(2)   A[i] = B[i] + 1
(4)   i = ib+1
(5)   C[i] = A[i+1] * 2
(6) endfor
Loop Fission
Breaks a loop into two or more smaller loops.
Original Loop:

(1) for i=1 to N do
(2)   A[i] = A[i] + B[i-1]
(3)   B[i] = C[i-1]*X + Z
(4)   C[i] = 1/B[i]
(5)   D[i] = sqrt(C[i])
(6) endfor

After Loop Fission:

(1) for ib=0 to N-1 do
(3)   B[ib+1] = C[ib]*X + Z
(4)   C[ib+1] = 1/B[ib+1]
(6) endfor
(1) for ib=0 to N-1 do
(2)   A[ib+1] = A[ib+1] + B[ib]
(6) endfor
(1) for ib=0 to N-1 do
(5)   D[ib+1] = sqrt(C[ib+1])
(6) endfor
(1) i = N+1
Loop Reversal
Run a loop backward. All dependence directions are reversed.

It is only legal for loops that have no loop-carried dependences.
Can be used to allow fusion:

(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(4) endfor
(5) for i=1 to N do
(6)   D[i] = 1/C[i+1]
(7) endfor

After reversal:

(1) for i=N downto 1 do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(4) endfor
(5) for i=N downto 1 do
(6)   D[i] = 1/C[i+1]
(7) endfor

After fusion:

(1) for i=N downto 1 do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(6)   D[i] = 1/C[i+1]
(7) endfor
Loop Interchanging
Before Interchange:

(1) for j=2 to M do
(2)   for i=1 to N do
(3)     A[i,j] = A[i,j-1] + B[i,j]
(4)   endfor
(5) endfor

After Interchange:

(1) for i=1 to N do
(2)   for j=2 to M do
(3)     A[i,j] = A[i,j-1] + B[i,j]
(4)   endfor
(5) endfor
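A C sketch of this interchange (the array sizes are invented for illustration). Both loop orders compute identical values because the dependence A[i][j] ← A[i][j-1] is carried only by the j loop, and j still runs in increasing order after the interchange; in C's row-major layout the interchanged version additionally walks each row contiguously, which is the usual motivation:

```c
#include <assert.h>

#define ROWS 4
#define COLS 5

/* Before interchange: j is the outer loop (column-by-column order). */
void ji_order(double A[ROWS][COLS], double B[ROWS][COLS]) {
    for (int j = 1; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            A[i][j] = A[i][j-1] + B[i][j];
}

/* After interchange: i is the outer loop.  Legal because the dependence
   A[i][j] <- A[i][j-1] is carried only by j, which still increases;
   this order also touches memory contiguously in row-major C. */
void ij_order(double A[ROWS][COLS], double B[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 1; j < COLS; j++)
            A[i][j] = A[i][j-1] + B[i][j];
}
```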
Speedup
goal is to use N processors to make a program run N times faster
speedup is the factor by which the program’s speed improves
Speedup(p processors) = Performance(p processors) / Performance(1 processor)

Speedup(p processors) = Time(1 processor) / Time(p processors)
Absolute × Relative Speedup

Careful: the execution time depends on what the program does!

A parallel program spends time in:
Work
Synchronization
Communication
Extra work

A program implemented for a parallel machine is likely to do extra work (compared to a sequential program) even when running on a single-processor machine!
Absolute × Relative Speedup
When talking about execution time, ask what algorithm is implemented!
Relative Speedup(p processors) = Time(Par. Alg., 1 processor) / Time(Par. Alg., p processors)

Absolute Speedup(p processors) = Time(Seq. Alg., 1 processor) / Time(Par. Alg., p processors)
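The two definitions translate directly into code; a minimal sketch (the timing values used in the usage note are invented for illustration):

```c
#include <assert.h>

/* Relative speedup: same parallel algorithm, 1 versus p processors. */
double relative_speedup(double time_par_1, double time_par_p) {
    return time_par_1 / time_par_p;
}

/* Absolute speedup: best sequential algorithm on 1 processor versus
   the parallel algorithm on p processors. */
double absolute_speedup(double time_seq_1, double time_par_p) {
    return time_seq_1 / time_par_p;
}
```

With invented times — parallel algorithm taking 120 s on one processor and 10 s on p processors, best sequential algorithm taking 100 s — the relative speedup is 12 but the absolute speedup is only 10, illustrating why the two must not be confused.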
Speedup
Which is Better?
programs A & B solve the same problem using different algorithms
both are run on a 100-processor computer
program A gets a 90-fold speedup
program B gets a 10-fold speedup
Which one would you prefer to use?
It Depends!
all that matters is overall execution time
what if A runs sequentially 1,000 times slower than B?
always use the best sequential time (over all algorithms) for computing speedups!
and the best compiler!
Superlinear Speedups
sometimes N processors can achieve a speedup > N
usually the result of improving an inferior sequential algorithm
can legitimately occur because of cache and memory effects
Amdahl’s Law (1)
Speedup = (σseq + σpar) / (σseq + σpar/N) = 1 / (σseq + σpar/N)

σseq : sequential portion of a program
σpar : parallel portion of a program (σseq + σpar = 1)
N : number of processors

The maximum speedup that can be obtained is:

MaxSpeedup = 1.0 / σseq
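Amdahl's law is easy to evaluate numerically; a minimal sketch (the 5% sequential fraction in the usage note is just an example value): even with 100 processors, a program that is 5% sequential speeds up by less than 17×, and no number of processors can beat 1/σseq = 20×.

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law, with sigma_seq + sigma_par = 1:
   Speedup = 1 / (sigma_seq + (1 - sigma_seq)/N) */
double amdahl_speedup(double sigma_seq, double n_proc) {
    return 1.0 / (sigma_seq + (1.0 - sigma_seq) / n_proc);
}
```

For σseq = 0.05 and N = 100 this gives 1 / (0.05 + 0.95/100) ≈ 16.81, and the value approaches the ceiling 1/0.05 = 20 as N grows.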
Amdahl’s Law (2)
[Graph: speedup versus number of processors N, one curve per sequential fraction (% seq time).]
Scalability
desirable property of algorithm is scalability, regardless of speedup
problem of size P using N processors takes time T
problem is scalable if problem of size 2P on 2N processors still takes time T
Choose Right Algorithm
understand strengths and weaknesses of hardware being used
choose an algorithm to exploit strengths and avoid weaknesses
Example: there are many parallel sorting algorithms, each valid for different hardware/application properties
The Reality Is ...
software – more issues to be addressed beyond sequential case
hardware – need to understand machine architecture
Parallel programming is hard!
The Reward Is ...
high performance – the motivation in the first place!
Do You Need Parallelism?
Consider the trade-offs:
+ Potentially faster execution turnaround
- Longer software development time
- Obtaining machine access
- Cost = f(development) + g(execution time)
Do the benefits outweigh the costs?
Resistance to Parallelism
software inertia – cost of converting existing software
hardware inertia – waiting for faster machines
lack of education
limited access to resources
Starting Out...
parallel program design is black magic
software tools are primitive
experience is an asset

All is not lost!
many problems are amenable to simple parallel solutions
Starting Out...
sequential world is simple: one architectural model
in parallel world, need to choose the algorithm to suit the architecture
parallel algorithm may only perform well on one class of machine
Granularity (1)
Granularity is the relation between the amount of computation and the amount of communication.
It is a measure of how much work gets done before processes have to communicate.
Granularity (2)
Problem: Shopping for 100 items of grocery.
Scenario 1: (small granularity)

You are told an item to buy, go to the store, purchase the item, then return home to find out the next item to buy.

Scenario 2: (large granularity)

You are told all 100 items, and then you make a single trip to purchase everything.
Granularity (3)
As in the real world, communication and synchronization take time.
Architectures

Match problem granularity to parallel architecture:
fine-grained – vector/array processors
medium-grained – shared memory multiprocessor
coarse/large-grained – network of workstations
Program Design
1) Identify hardware platforms available
2) Identify parallelism in the application
3) Choose right type of algorithmic parallelism to match the architecture
4) Implement algorithm, being wary of performance and correctness issues
Vector Processing (3)
this class of machines is very effective at striding through arrays
some parallelism can be automatically detected by compiler
there is “right” way and “wrong” way to code loops to allow parallelism
Distributed Memory (1)
Loosely coupled processors
IBM SP-2, networks of computers
PCs connected by a fast network
Distributed Memory (2)
communication between processes by sending messages
Overhead of parallelism includes:
cost of preparing a message
cost of sending a message
cost of waiting for a message
Communication
distributed memory(loosely coupled)
Explicitly send messages between processes
Synchronization
Message Passing (1)

Process 1                  Process 2
compute                    compute
Send( P2, info );          compute
compute                    Receive( P1, info );
idle                       compute
idle                       Send( P1, reply );
Receive( P2, reply );

The matched Send/Receive pairs are where the two processes synchronize and communicate.
Message Passing (2)
Two popular message passing libraries:
PVM (Parallel Virtual Machine)
MPI (Message Passing Interface)
MPI will likely be the industry standard
Both are easy to use, but verbose
Master/Slave (1)
many distributed memory programs are structured as one master and N slaves
master generates work to do
when idle, a slave asks for a piece of work, gets it, does the task, and reports the result
Master/Slave (2)

Master:
worklist = Make( work )
while( worklist != empty )
{
    Receive( slave, result );
    to_do = Head( worklist );
    Send( slave, to_do );
    Process( result );
}

Slave:
data = NEED_WORK;
while( true )
{
    Send( master, data );
    /* wait */
    Receive( master, work );
    data = Process( work );
}

[Diagram: one master M serving several slaves S.]
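The same work-distribution pattern can be sketched in a single C process by collapsing each Send/Receive pair into a direct function call — this only illustrates the structure, not real message passing (a real program would use MPI or PVM), and the worklist contents and the squaring "task" are invented for illustration:

```c
#include <assert.h>

#define NWORK 10

/* Stand-in for the slave's task (hypothetical work: squaring). */
static int slave_process(int work) {
    return work * work;
}

/* Master: hand out items from the worklist one at a time and collect
   each result, exactly as in the message-passing loop above but with
   the Send/Receive pairs replaced by a function call. */
static int run_master(int results[NWORK]) {
    int done = 0;
    for (int to_do = 0; to_do < NWORK; to_do++) { /* worklist = 0..NWORK-1 */
        results[done++] = slave_process(to_do);   /* slave asked, worked, replied */
    }
    return done;
}
```

In the real pattern the master's loop blocks in Receive until some slave reports in, which is what lets an idle slave pull the next piece of work and balances the load automatically.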
Pitfall: Deadlock
scenario by which no further progress can be made, since each process is waiting on a cyclic dependency
Pitfall: Deadlock
A real-world problem!
Pitfall: Load Balancing
need to assign a roughly equal portion of work to each processor
load imbalances can result in poor speedups
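The cost of imbalance can be quantified with an idealized model (an assumption of this sketch, not from the slides): parallel time is the most-loaded processor's work, so speedup = total work / maximum per-processor load, ignoring communication. The load numbers below are illustrative.

```python
def speedup(loads):
    # Idealized model: the run finishes when the most-loaded
    # processor finishes, so parallel time = max(loads).
    return sum(loads) / max(loads)

balanced   = [25, 25, 25, 25]   # 100 units of work, evenly split
imbalanced = [70, 10, 10, 10]   # same total work, one overloaded processor

print(speedup(balanced))        # 4.0: ideal speedup on 4 processors
print(speedup(imbalanced))      # ~1.43: three processors sit mostly idle
```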
Shared Memory (1)
Tightly coupled multiprocessors
    SGI Origin 2400, SUN E10000
Classified by memory access times:
    same (SMP - symmetric multiprocessor)
    different (NUMA - non-uniform memory access)
(diagram: four processors P connected to one shared Memory)
Shared Memory (2)
processes communicate by reading/writing shared variables (instead of sending and receiving messages)
Overhead of parallelism includes:
    contention for shared resources
    protecting the integrity of shared resources
Communication
shared memory (tightly coupled)

(diagram: processes read from and write to shared data)
Synchronization

May have to prevent simultaneous access!

Process 1:                  Process 2:
    B = BankBalance;            B = BankBalance;
    B = B + 100;                B = B + 150;
    BankBalance = B;            BankBalance = B;
    Print Statement;            Print Statement;

What is the value of BankBalance? Depending on how the two processes interleave, the balance may grow by 100, by 150, or by the intended 250.
Pitfall: Shared Data Access
Need to restrict access to data!
Avoid race conditions!

Process 1:                  Process 2:
    Lock( access );             Lock( access );
    B = BankBalance;            B = BankBalance;
    B = B + 100;                B = B + 150;
    BankBalance = B;            BankBalance = B;
    Unlock( access );           Unlock( access );
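The Lock/Unlock pseudocode above maps directly onto Python's threading module; this sketch reproduces the bank-balance example with two threads (the deposit helper and starting balance of 0 are assumptions for illustration).

```python
import threading

balance = 0
access = threading.Lock()

def deposit(amount):
    global balance
    with access:            # Lock( access ) ... Unlock( access )
        b = balance         # B = BankBalance
        b = b + amount      # B = B + amount
        balance = b         # BankBalance = B

t1 = threading.Thread(target=deposit, args=(100,))
t2 = threading.Thread(target=deposit, args=(150,))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)  # 250, regardless of how the threads are scheduled
```

With the lock, the read-modify-write sequence is atomic, so one deposit can no longer overwrite the other.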
Multi-threading
Simultaneous Multi-threading
http://www.eet.com/story/0EG19991008S0014 by Rick Merrit
Top 500
http://www.top500.org
Conclusions
some problems require extensive computational resources
parallelism allows you to decrease experiment turnaround time
the tools are adequate but still have a long way to go
“simple” parallelism gets some performance but maximum performance requires effort.
Performance commensurate with effort!
Reminders
understand your computational needs
understand the hardware and software resources available to you
match parallelism to the architecture
maximize utilization, don’t waste cycles!
granularity, granularity, granularity
develop, test, and debug with small data sets before trying large ones
be wary of the many pitfalls
We Want You!
For help with parallel programming, contact
Get parallel! Become part of MACI!