CS420/CSE 402/ECE 492 Introduction to Parallel Programming for Scientists...
Transcript of CS420/CSE 402/ECE 492 Introduction to Parallel Programming for Scientists...
CS420/CSE 402/ECE 492
Introduction to Parallel Programming for Scientists and Engineers
Spring 2006
© 2006 David A. Padua
Additional Foils 0.i: Course organization
Instructor: David Padua, 4227 SC, [email protected], 3-4223
Office Hours: By appointment
T.A.: Predrag Tosic, XXXX Siebel Center, [email protected]
Office Hours:
Textbook
Lectures
Some lecture foils will be required reading. These will be posted at:
http://www-courses.cs.uiuc.edu/~cs420/
Grading:
6-9 Machine Problems (MPs)/Homeworks 50%
Midterm (Friday Mar 3) 25%
Final (Comprehensive) 25%
Graduate students registered for 1 unit (4 credits) must complete additional work (associated with each MP/Homework).
Additional Foils 0.ii: Topics
• Machine models.
• Parallel programming models.
• Language extensions to express parallelism:
OpenMP (Fortran) and MPI (Fortran or C).
If time allows: High-Performance Fortran, Linda, pC++, SplitC, UPC (Unified Parallel C), CAF (Co-array Fortran), HTA (Hierarchically Tiled Arrays).
• Issues in algorithm design
Parallelism
Load balancing
Communication
Locality
• Algorithms.
Linear algebra algorithms such as matrix multiplication and equation solvers.
Symbolic algorithms such as sorting.
N-body algorithms.
Random number generators.
Asynchronous algorithms.
• Program analysis and transformation
Dependence analysis
Race conditions
Deadlock detection
• Parallel program development and maintenance
Modularity
Performance analysis and tuning
Debugging
Additional Foils Chapter 1: Introduction
Parallelism
• The idea is simple: improve performance by performing two or more operations at the same time.
• It has been an important computer design strategy since the very first computers.
• It takes many complementary forms within conventional systems such as uniprocessor PCs and UNIX workstations:
At the circuit level: Adders and multipliers do not handle one digit at a time but operate on several digits at the same time. This design strategy was used even by Charles Babbage in his mechanical computer design of the 19th century.
At the processor-design level: The execution of instructions and floating-point operations is usually pipelined. Several instructions can execute simultaneously.
At the system level: Computation and I/O can proceed simultaneously. This is why multiprogramming increases throughput.
• However, the design strategy of interest to us is to attain parallelism by using several processors or even several complete computers.
• Future PCs will be built with multicore chips. (Reading assignment: http://www.intel.com/business/bss/products/server/resource_center/multi-core.htm?ppc_cid=ggl|multicore_resrc_ctr|k46E|s )
• Multicore chips are made possible by Moore’s Law, named after a 1964 observation by Gordon E. Moore of Intel. It holds that “the number of elements in advanced integrated circuits doubles every year.”
• Another important reason for the development of parallel systems of the multicomputer variety is availability.
“Having a computer shrivel up into an expensive doorstop can be a whole lot less traumatic if it’s not unique, but rather one of a herd. The herd should be able to accommodate spares, which can potentially be used to keep the work going; or if one chooses to configure sparelessly, the work that was done by the dear departed sibling can, potentially, be redistributed among the others.” In Search of Clusters. G. Pfister. Prentice Hall.
Applications
• Traditionally, highly parallel computers have been used for numerical simulations of complex systems such as the weather, mechanical devices, electronic circuits, manufacturing processes, chemical reactions, etc.
• “In part because of HPCC technologies, simulation has become recognized as the third paradigm of science, the first two being experimentation and theory. In some cases it is the only approach available for further advancing knowledge -- experiments may not be possible due to size (very big or very small), speed (very fast or very slow), distance (very far away), dangers to health and safety (toxic or explosive), or the economics of conducting the experiments. In simulations, mathematical models of physical phenomena are translated into computer software that specifies how calculations are performed using input data that may include both experimental data and estimated values of unknown parameters in the mathematical models. By repeatedly running the software using different data and different parameter values, an understanding of the phenomenon of interest emerges. The realism of these simulations and the speed with which they are produced affect the accuracy of this understanding and its usefulness in predicting change.”
From an old document entitled “High Performance Computing and Communications: Foundation for America's Information Future”.
• Perhaps the most important government program in parallel computing today is the Advanced Simulation and Computing Program (ASC).
(Reading assignment:
http://www.llnl.gov/asc/overview/overview.html
http://www.llnl.gov/asci/platforms/bluegenel/images/BGLbrocure.pdf
).
Its main objective is to accurately simulate nuclear weapons in order to verify safety, reliability, and performance of the US nuclear stockpile. Several highly-parallel computers (1000s of processors) from Intel, IBM, and SGI are now being used to develop these simulations.
• Commercial applications are also important today. Examples include: transaction processing systems, web servers, data mining, etc. These applications will probably become the main driving force behind parallel computing in the future.
• In this course, we will focus on numerical simulations due to their importance for scientists and engineers.
• As mentioned above, computer simulation is considered today as a third mode of scientific research. It complements experimentation and theoretical analysis.
• Furthermore, simulation is an important engineering tool that provides fast feedback on the quality and feasibility of new designs.
Additional Foils Chapter 2: Machine models
2.1 The Von Neumann computational model
Discussion taken from Almasi and Gottlieb: Highly Parallel Computing. Benjamin Cummings, 1988.
• Designed by John Von Neumann about fifty years ago.
• All widely used “conventional” machines follow this model. It is represented next:
[Figure: a MEMORY that holds instructions and data, connected to a PROCESSOR consisting of a CONTROL unit (with instruction counter) and an ARITHMETIC UNIT (with logic and registers).]
• The machine’s essential features are:
1. A processor that performs instructions such as “add the contents of these two registers and put the result in that register”.
2. A memory that stores both the instructions and data of a program in cells having unique addresses.
3. A control scheme that fetches one instruction after another from the memory for execution by the processor, and shuttles data one word at a time between memory and processor.
For an instruction to be executed, there are several steps that must be performed. For example:
1. Instruction Fetch and decode (IF). Bring the instruction from memory into the control unit and identify the type of instruction.
2. Read data (RD). Read data from memory.
3. Execution (EX). Execute operation.
4. Write Back (WB). Write the results back.
• Notice that machines today usually are programmed in a high-level language containing statements such as
A = B + C
However, these statements are translated by a compiler into the machine instructions just mentioned. For example, the previous assignment statement would be translated into a sequence of the form:
LD 1,B (load B from memory into processor register 1)
LD 2,C (load C from memory into register 2)
ADD 3,1,2 (add registers 1 and 2 and put the result into register 3)
ST 3,A (store register 3’s contents into variable A’s address in memory)
• It is said that the compiler creates a “virtual machine” with its own language and computational model.
• Virtual machines represented by conventional languages, such as Fortran 77 and C, also follow the Von Neumann model.
2.2 Multicomputers
• The easiest way to get parallelism given a collection of conventional computers is to connect them:
• Each machine can proceed independently and communicate with the others via the interconnection network.
[Figure: several Von Neumann machines, each with its own memory, control unit, and arithmetic unit, connected by an Interconnect.]
• There are two main classes of multicomputers: clusters and distributed-memory multiprocessors. They are quite similar, but the latter is considered a single computer and is sold as such.
Furthermore, a cluster consists of a collection of interconnected whole computers (including I/O) used as a single, unified computing resource.
Not all nodes of a distributed-memory multiprocessor (such as IBM’s SP-2) need have complete I/O resources.
• An example of a cluster is a web server.
[Figure: requests arrive from the net at a dispatcher/router, which forwards each request to one of several server machines.]
• Another example was a workstation cluster at Fermilab, which consisted of about 400 Silicon Graphics and IBM workstations. The system was used to analyze accelerator events. Analyzing any one of those events has nothing to do with analyzing any of the others. Each machine runs a sequential program that analyzes one event at a time. By using several machines it is possible to analyze many events simultaneously.
2.3 Shared-memory multiprocessors
• The simplest form of a shared-memory multiprocessor is the symmetric multiprocessor (SMP). By symmetric we mean that each of the processors has exactly the same abilities. Therefore any processor can do anything: they all have equal access to every location in memory; they all can control every I/O device equally well, etc. In effect, from the point of view of each processor the rest of the machine looks the same, hence the term symmetric.
• Caches are an important component of SMPs. These will be discussed later.
[Figure: several processors, each with its own control unit and arithmetic unit, connected through an Interconnect to a single shared MEMORY and to I/O devices (LAN, disks).]
2.4 Other forms of parallelism
• As discussed above, there are other forms of parallelism that are widely used today. These usually coexist with the coarse-grain parallelism of multicomputers and multiprocessors.
• Pipelining of the control unit and/or arithmetic unit.
• Multiple functional units.
• Most microprocessors today take advantage of this type of parallelism.
[Figure: a single processor, attached to a memory holding instructions and data, whose control unit and arithmetic unit are pipelined.]
• VLIW (Very Long Instruction Word) processors are an important class of multifunctional processors. The idea is that each instruction may involve several operations that are performed simultaneously. This parallelism is usually exploited by the compiler and not accessible to the high-level language programmer. However, the programmer can control this type of parallelism in assembly language.
[Figure: a multifunction (VLIW) processor in which a single instruction word has fields controlling LD/ST, FADD, FMUL, IALU, and BRANCH units, all connected to a shared register file and memory.]
• Array processors. Multiple arithmetic units.
• Illiac IV is the earliest example of this type of machine. Each arithmetic unit (processing unit) of the Illiac IV was connected to four others to form a two-dimensional array (torus).
[Figure: a single control unit with memory driving several arithmetic units, each with its own logic, registers, and memory.]
2.5 Flynn’s taxonomy
• Michael Flynn published a paper in 1972 in which he picked two characteristics of computers and tried all four possible combinations. Two stuck in everybody’s mind, and the others didn’t:
• SISD: Single Instruction, Single Data. Conventional Von Neumann computers.
• MIMD: Multiple Instruction, Multiple Data. Multicomputers and multiprocessors.
• SIMD: Single Instruction, Multiple Data. Array processors.
• MISD: Multiple Instruction, Single Data. Not used and perhaps not meaningful.