Parallel Processing
-
PARALLEL PROCESSING: FUNDAMENTALS
Khushdeep Singh, Department of Computer Science and Engineering, IIT Kanpur
Tutor: Prof. Dr. U. Rüde, Florian Schornbaum
-
OUTLINE
Overview
1. What is Parallel Processing?
2. Why use Parallel Processing?
Flynn's Classical Taxonomy
Parallel Computer Memory Architectures
1. Shared Memory
2. Distributed Memory
3. Hybrid Distributed-Shared Memory
Parallel Programming Models
Designing Parallel Programs
Amdahl's Law
Embarrassingly parallel
Summary
-
What is Parallel Processing?
Simultaneous use of multiple resources to solve a computational problem:
The problem is broken into discrete parts that can be solved concurrently
Instructions from each part execute simultaneously on different CPUs
-
Why use Parallel Processing?
Save time
Solve larger problems:
Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer
Use of non-local resources:
Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce
E.g.: SETI@home, with over 1.3 million users and 3.2 million computers in nearly every country in the world
-
Why use Parallel Processing?
Limits to serial computing:
Transmission speeds: limits on how fast data can move through hardware
Limits to miniaturization
Heating issues: power consumption is proportional to frequency
Economic limitations: it is increasingly expensive to make a single processor faster
Current computer architectures increasingly rely on hardware-level parallelism to improve performance:
Multiple execution units
Pipelined instructions
Multi-core processors
-
Why use Parallel Processing?
Parallelism and Moore's law:
Moore's law: the number of transistors on a chip doubles roughly every two years, so chip performance effectively doubles as well
Parallel computation is necessary to take full advantage of the gains allowed by Moore's law
-
Flynn's Classical Taxonomy
Classification of parallel computers: Flynn's Classical Taxonomy
Single Instruction, Single Data (SISD):
A serial (non-parallel) computer
Single Instruction: only one instruction stream is acted on by the CPU during any one clock cycle
Single Data: only one data stream is used as input during any one clock cycle
Single Instruction, Multiple Data (SIMD):
Single Instruction: all processing units execute the same instruction at any given clock cycle
Multiple Data: each processing unit can operate on a different data element
Best suited for problems characterized by a high degree of regularity, such as image processing. E.g.: GPUs
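The SIMD style can be sketched with a simple data-parallel loop (an illustrative example, not from the slides; the function name is ours). Every iteration applies the same operation to a different data element, which is exactly the pattern compilers map onto vector units and GPUs:

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]: one instruction stream, many data elements.
   The iterations are independent, so the loop can be vectorized (SIMD). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```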
-
Flynn's Classical Taxonomy
Multiple Instruction, Single Data (MISD):
Multiple Instruction: each processing unit operates on the data independently via separate instruction streams
Single Data: a single data stream is fed into multiple processing units
Few actual examples exist.
Multiple Instruction, Multiple Data (MIMD):
Multiple Instruction: every processor may be executing a different instruction stream
Multiple Data: every processor may be working with a different data stream
E.g.: networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
-
Parallel Architectures
Shared Memory:
All processors can access all memory as a global address space
Changes in a memory location made by one processor are visible to all other processors
Shared memory machines can be divided into two main classes based on memory access times:
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
-
Parallel Architectures
Uniform Memory Access (UMA):
Commonly represented by Symmetric Multiprocessor (SMP) machines
Identical processors
Equal access times to memory
-
Parallel Architectures
Non-Uniform Memory Access (NUMA):
Made by physically linking two or more SMPs
One SMP can directly access memory of another
Not all processors have equal access time to all memories
Memory access across link is slower
-
Parallel Architectures
Distributed Memory:
Processors have their own local memory
Changes in a processor's local memory have no effect on the memory of other processors
Communication between processors requires message passing
Explicit programming required
-
Parallel Architectures
Shared vs. Distributed Memory:

Shared Memory
  Advantages:
    - Data sharing between tasks is fast
    - User-friendly programming perspective to memory
  Disadvantages:
    - Expense with increase in the number of processors
    - Programmer responsible for synchronization
    - Lack of scalability

Distributed Memory
  Advantages:
    - Memory is scalable with the number of processors
    - No overhead in cache coherency
    - Cost effectiveness due to networking
  Disadvantages:
    - Explicit programming required
    - Message passing involves overhead
-
Parallel Architectures
Hybrid Distributed-Shared Memory:
Shared memory component: a cache-coherent SMP machine
Distributed memory component: networking of multiple SMP machines
-
Parallel Programming Models
An abstraction above hardware and memory architectures
Models are NOT specific to a particular type of memory architecture
Shared Memory Model:
Tasks share a common address space
Mechanisms such as locks/semaphores are used for synchronization
Advantage: simplified program development
Threads can be used:
Each thread has local data, but also shares the entire resources of the main program
Threads communicate with each other through global memory
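A minimal sketch of the shared memory model with threads (illustrative only, not from the slides; the names are ours): two POSIX threads update a counter that lives in global memory, with a mutex as the synchronization mechanism mentioned above.

```c
#include <pthread.h>

static long counter;                                     /* shared global data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* the lock serializes access: no lost updates */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Runs two threads that communicate through the shared counter. */
long run_counter(void)
{
    counter = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Compile with -pthread. Without the mutex, the two increment sequences could interleave and updates would be lost.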
-
Parallel Programming Models
Implementation of the shared memory model:
OpenMP:
Directive based
The master thread forks a specified number of slave threads, and the task is divided among them
After execution of the parallel task, the threads join back
-
Parallel Programming Models
OpenMP: Core Elements
-
Parallel Programming Models
OpenMP: Example Program

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if (th_id == 0) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
-
Parallel Programming Models
Message Passing Model:
Tasks use their own local memory
Tasks exchange data by sending and receiving messages
The user explicitly distributes the data
-
Parallel Programming Models
Implementation of the message passing model:
Message Passing Interface (MPI):
PORTABILITY: architecture- and hardware-independent code
Provides well-defined and safe data transfer
Supports heterogeneous environments (e.g. clusters)
Most MPI implementations consist of a specific set of routines (i.e., an API) directly callable from C, C++, and Fortran
-
Parallel Programming Models
Message Passing Interface (MPI): Concepts
Communicator and rank: connect groups of processes in the MPI session
Point-to-point basics: communication between two specific processes, e.g. the MPI_Send and MPI_Recv calls
Collective basics: communication among all processes in a process group, e.g. the MPI_Bcast and MPI_Reduce calls
Derived data types:
Specify the type of data which is sent between processes
Predefined MPI data types such as MPI_INT, MPI_CHAR, MPI_DOUBLE
-
Parallel Programming Models
Message Passing Interface (MPI): Example Program

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
    char idstr[32];
    char buff[BUFSIZE];
    int numprocs, myid, i;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        /* Rank 0 sends a greeting to every other rank ... */
        for (i = 1; i < numprocs; i++) {
            sprintf(buff, "Hello %d! ", i);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        /* ... then collects and prints the replies. */
        for (i = 1; i < numprocs; i++) {
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%s\n", buff);
        }
    } else {
        /* Every other rank receives the greeting, appends its id, and replies. */
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        sprintf(idstr, "Processor %d ", myid);
        strncat(buff, idstr, BUFSIZE - strlen(buff) - 1);
        strncat(buff, "reporting for duty", BUFSIZE - strlen(buff) - 1);
        MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
-
Designing Parallel Programs
Automatic and Manual Parallelization:
Manual parallelization: time-consuming, complex, and error-prone
Automatic parallelization: done by a parallelizing compiler or pre-processor, in two different ways:
Fully automatic:
The compiler analyzes the source code and identifies opportunities for parallelism
Programmer directed:
Using "compiler directives" or flags, the programmer explicitly tells the compiler how to parallelize the code
E.g.: OpenMP
-
Designing Parallel Programs
Partitioning:
Breaking the problem into discrete "chunks" of work that can be distributed to multiple tasks
Two basic ways to partition:
Domain decomposition: the data associated with the problem is decomposed
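Domain decomposition is often implemented as a block distribution (an illustrative helper, not from the slides; the name block_range is ours): each task receives a contiguous [lo, hi) slice of the data, as even as possible.

```c
/* Splits n items as evenly as possible among `ntasks` tasks:
   the first (n % ntasks) tasks each get one extra item.
   Writes task `task`'s half-open index range into *lo and *hi. */
void block_range(long n, int ntasks, int task, long *lo, long *hi)
{
    long base = n / ntasks;   /* minimum items per task */
    long rem  = n % ntasks;   /* leftover items, spread over the first tasks */
    *lo = task * base + (task < rem ? task : rem);
    *hi = *lo + base + (task < rem ? 1 : 0);
}
```

For example, 10 items over 3 tasks gives the ranges [0,4), [4,7), and [7,10).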
-
Designing Parallel Programs
Partitioning:
Two basic ways to partition:
Functional decomposition: the focus is on the computation to be performed rather than on the data manipulated by the computation
-
Designing Parallel Programs
Load Balancing:
The practice of distributing work among tasks so that all tasks are kept busy all of the time
Two types:
Static load balancing: a fixed amount of work is assigned to each processing site a priori
Dynamic load balancing, of two kinds:
Task-oriented: when one processing site finishes its task, it is assigned another task
Data-oriented: when a processing site finishes its task before other sites, the site with the most work gives the idle site some of its data to process
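Task-oriented dynamic load balancing can be sketched as a shared task queue (an illustrative example, not from the slides; grab_task is our name): any site that becomes idle simply asks for the next unassigned task, so fast sites automatically end up doing more tasks than slow ones.

```c
#include <pthread.h>

static long next_task;                      /* index of the next unassigned task */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

/* Hands out the next task index, or -1 once all `ntasks` are taken.
   A site that finishes early just calls this again. */
long grab_task(long ntasks)
{
    pthread_mutex_lock(&qlock);             /* protect the shared counter */
    long t = (next_task < ntasks) ? next_task++ : -1;
    pthread_mutex_unlock(&qlock);
    return t;
}
```

Compile with -pthread; next_task is zero-initialized, so tasks are handed out starting from 0.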
-
Designing Parallel Programs
Granularity:
Qualitative measure of the ratio of computation to communication
Fine-grain parallelism: relatively small amounts of computation between communication events
Facilitates load balancing
High communication overhead
Coarse-grain parallelism: significant work done between communications
The most efficient granularity depends on the algorithm and the hardware environment used
-
Amdahl's Law
Gives the expected speedup of a parallelized implementation of an algorithm relative to the serial algorithm:

    Speedup = 1 / ((1 - P) + P / N)

P: portion of the program that can be made parallel
N: number of processors
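Amdahl's law is straightforward to evaluate (a small helper of our own, not from the slides):

```c
/* Amdahl's law: speedup = 1 / ((1 - p) + p / n),
   where p is the parallel fraction and n the processor count. */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with P = 0.95 eight processors give a speedup of about 5.9, and no number of processors can push the speedup past 1 / (1 - P) = 20.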
-
Embarrassingly parallel
Embarrassingly parallel problem: little or no effort is required to separate the problem into a number of parallel tasks
No dependency (or communication) between the parallel tasks
Examples:
Distributed relational database queries using distributed set processing
Rendering of computer graphics
Event simulation and reconstruction in particle physics
Brute-force searches in cryptography
Ensemble calculations of numerical weather prediction
Tree-growth step of the random forest machine learning technique
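An embarrassingly parallel loop can be sketched in a few lines (illustrative only; the function name is ours): the iterations share nothing, so a single OpenMP directive distributes them across threads, and the code still runs correctly serially if OpenMP is disabled.

```c
/* Each element is processed independently: no dependency or
   communication between iterations, the hallmark of an
   embarrassingly parallel problem. */
void square_all(long n, double *a)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = a[i] * a[i];
}
```

Compile with -fopenmp for the parallel version; without it, the pragma is simply ignored and the loop runs serially with the same result.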
-
Applications of parallel processing
-
Summary
Parallel processing: simultaneous use of multiple resources to solve a computational problem
Need for parallel processing: limits to serial computing and Moore's law
Flynn's Classical Taxonomy: SISD, SIMD, MISD, MIMD
Parallel architectures: shared memory, distributed memory, and hybrid
Parallel programming models: OpenMP, MPI
Designing parallel programs: automatic parallelization, partitioning, load balancing, and granularity
Embarrassingly parallel problems: very easy to solve by parallel processing
-
References
Introduction to Parallel Computing: https://computing.llnl.gov/tutorials/parallel_comp/#Hybrid
http://en.wikipedia.org
Introduction to Scientific High Performance Computing: Reinhold Bader (LRZ), Georg Hager (RRZE), Heinz Bast (Intel)
Elementary Parallel Programming With Examples: Reinhold Bader (LRZ), Georg Hager (RRZE)
Programming Shared Memory Systems with OpenMP: Reinhold Bader (LRZ), Georg Hager (RRZE)
THANK YOU!