Parallel Programming On the IUCAA Clusters
Sunu Engineer
IUCAA Clusters
The Cluster: cluster of Intel machines running Linux
Hercules: cluster of HP ES45 quad-processor nodes
Reference: http://www.iucaa.ernet.in/
The Cluster
Four single-processor nodes with a 100 Mbps Ethernet interconnect
1.4 GHz Intel Pentium 4, 512 MB RAM
Linux 2.4 kernel (Red Hat 7.2 distribution)
MPI: LAM 6.5.9
PVM: 3.4.3
Hercules
Four quad-processor nodes with a Memory Channel interconnect
1.25 GHz Alpha 21264D RISC processor, 4 GB RAM
Tru64 5.1A with TruCluster software
MPI: native MPI, LAM 7.0
PVM: 3.4.3
Expected Computational Performance
Intel cluster: processor 512/590 (SPECint/SPECfp); system ~2 GFLOPS
ES45 cluster: processor ~679/960 (SPECint/SPECfp); system ~30 GFLOPS
Benchmarks used: SPECint, SPECfp, HPL
Parallel Programs
Move towards large-scale distributed programs
Larger class of problems at higher resolution
Enhanced levels of detail to be explored …
![Page 7: Parallel Programming On the IUCAA Clusters Sunu Engineer.](https://reader030.fdocuments.us/reader030/viewer/2022032703/56649d1b5503460f949f126d/html5/thumbnails/7.jpg)
The Starting Point
Model → single-processor program → multiprocessor program
Model → multiprocessor program
Decomposition of a Single Processor Program
Temporal: initialization, control, termination
Spatial: functional, modular, object-based
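A common first step in spatial decomposition is splitting an index range across workers. A minimal C sketch, assuming a 1-D array of n elements and a hypothetical helper block_range() that is not part of the original slides, giving each worker a contiguous block whose sizes differ by at most one:

```c
#include <stddef.h>

/* Hypothetical helper: compute the half-open range [*lo, *hi) of an index
   space of n elements owned by worker `rank` (0 <= rank < nworkers).
   The first n % nworkers workers receive one extra element. */
void block_range(size_t n, size_t nworkers, size_t rank,
                 size_t *lo, size_t *hi)
{
    size_t base = n / nworkers;                   /* minimum block size */
    size_t extra = n % nworkers;                  /* leftover elements */
    size_t before = rank < extra ? rank : extra;  /* extras given to lower ranks */

    *lo = rank * base + before;
    *hi = *lo + base + (rank < extra ? 1 : 0);
}
```

For n = 10 and 4 workers the blocks are [0,3), [3,6), [6,8) and [8,10).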
Multiprocessor Programs
Spatial delocalization: dissolving the boundary
A single spatial coordinate is no longer valid
A single time coordinate is no longer valid
Temporal multiplicity: multiple streams running at different rates with respect to an external clock
In comparison
Multiple points of initialization
Distributed control
Multiple points and times of termination
Distribution of the activity in space and time
Breaking up a problem
Yet another way
And another
Amdahl’s Law
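Only the heading of this slide survived. For reference, the standard statement of Amdahl's law: if a fraction p of the single-processor run time parallelizes perfectly over n processors and the rest stays serial, the speedup is S(n) = 1 / ((1 - p) + p/n), which can never exceed 1/(1 - p). A small C helper (the name amdahl_speedup is hypothetical):

```c
/* Amdahl's law: speedup on n processors when a fraction p (0 <= p <= 1)
   of the serial run time parallelizes perfectly and the rest stays serial. */
double amdahl_speedup(double p, int n)
{
    double serial = 1.0 - p;   /* part that cannot be sped up */
    double parallel = p / n;   /* part shared among n processors */
    return 1.0 / (serial + parallel);
}
```

For example, with p = 0.9 on 16 processors (four quad-processor Hercules nodes) the speedup is at most 6.4, and no number of processors can push it past 10.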
Degrees of refinement
Fine parallelism: instruction level, program statement level, loop level
Coarse parallelism: process level, task level, region level
Patterns and Frameworks
Patterns: documented solutions to recurring design problems
Frameworks: software and hardware structures that implement the supporting infrastructure
Processes and Threads
From heavyweight multitasking (processes) to lightweight multitasking (threads) on a single processor
From isolated memory spaces to a shared memory space
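To make the memory-space contrast concrete, a small POSIX C sketch (function names are hypothetical, and fork() is a swapped-in illustration rather than something taken from the slides): a child process's write to a global variable is invisible to its parent, while a thread's write is visible to the thread that created it.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A global variable that both experiments try to modify. */
static int shared_value = 0;

static void *set_in_thread(void *arg)
{
    (void)arg;
    shared_value = 42;              /* same address space as the creator */
    return NULL;
}

/* Process: the child works on its own copy of memory,
   so the parent still sees 0 after the child writes 42. */
int value_seen_after_child_process(void)
{
    shared_value = 0;
    pid_t pid = fork();
    if (pid == 0) {                 /* child */
        shared_value = 42;
        _exit(0);
    }
    if (pid > 0)                    /* parent waits for the child */
        waitpid(pid, NULL, 0);
    return shared_value;
}

/* Thread: threads share one address space, so the caller sees 42. */
int value_seen_after_thread(void)
{
    pthread_t tid;
    shared_value = 0;
    if (pthread_create(&tid, NULL, set_in_thread, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);        /* join also orders the memory accesses */
    return shared_value;
}
```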
Posix Threads in Brief
int pthread_create(pthread_t *id, const pthread_attr_t *attributes, void *(*thread_function)(void *), void *arguments)
Other calls: pthread_exit, pthread_join, pthread_self, pthread_mutex_init, pthread_mutex_lock/pthread_mutex_unlock
Link with -lpthread
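A minimal sketch of these calls working together: several threads increment one shared counter, serialized by a mutex, and the creating thread joins them all. The names count_in_threads and add_ones are hypothetical; the sketch uses PTHREAD_MUTEX_INITIALIZER, the static equivalent of calling pthread_mutex_init().

```c
#include <pthread.h>

#define MAX_THREADS 64

/* Shared counter protected by a statically initialized mutex. */
static long total = 0;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_ones(void *arg)
{
    long iterations = *(const long *)arg;
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&total_lock);    /* one thread at a time */
        total++;
        pthread_mutex_unlock(&total_lock);
    }
    return NULL;                            /* same as pthread_exit(NULL) */
}

/* Start up to MAX_THREADS threads, wait for all of them, return the total. */
long count_in_threads(int nthreads, long iterations)
{
    pthread_t ids[MAX_THREADS];
    int started = 0;

    total = 0;
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    while (started < nthreads &&
           pthread_create(&ids[started], NULL, add_ones, &iterations) == 0)
        started++;
    for (int t = 0; t < started; t++)
        pthread_join(ids[t], NULL);
    return total;
}
```

Build with something like `cc count.c -lpthread`; `count_in_threads(4, 10000)` should return 40000. Without the lock the increments could race and the total could come out lower.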
Multiprocessing architectures
Symmetric multiprocessing
Shared memory: a unified space with different temporal streams
OpenMP standard
OpenMP Programming
A set of compiler directives to express shared-memory parallelism
A small library of functions
Environment variables
Standard language bindings defined for Fortran, C and C++
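OpenMP example

    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
    #pragma omp parallel
        {
            printf("Hello World from %d\n", omp_get_thread_num());
        }
        return 0;
    }

The same program in Fortran (omp_get_thread_num is declared integer, since implicit typing would otherwise make it real):

    C     An OpenMP program
          program openmp
          integer omp_get_thread_num
    !$OMP PARALLEL
          print *, 'Hello world from', omp_get_thread_num()
    !$OMP END PARALLEL
          stop
          end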
OpenMP directives
Parallel and work sharing:
OMP parallel [clauses]
OMP do [clauses]
OMP sections [clauses]
OMP section
OMP single
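A hedged C sketch of `parallel` plus `sections`: two independent passes over the same array are placed in separate sections, so they may run on different threads. The function name is hypothetical. Compile with an OpenMP flag such as GCC's `-fopenmp`; without it the pragmas are ignored and the code runs serially with the same result.

```c
#include <stddef.h>

/* Sum and maximum of x[0..n-1], computed in two independent sections.
   s and m are shared, but each is written by exactly one section. */
void sum_and_max(const double *x, size_t n, double *sum, double *max)
{
    double s = 0.0;
    double m = n > 0 ? x[0] : 0.0;

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            for (size_t i = 0; i < n; i++)
                s += x[i];

            #pragma omp section
            for (size_t i = 1; i < n; i++)
                if (x[i] > m)
                    m = x[i];
        }   /* implicit barrier at the end of sections */
    }

    *sum = s;
    *max = m;
}
```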
Combined work sharing: OMP parallel do, OMP parallel sections
Synchronization: OMP master, OMP critical, OMP barrier, OMP atomic, OMP flush, OMP ordered
Data environment: OMP threadprivate
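A short C sketch of a synchronization directive: many threads increment a shared histogram, and `#pragma omp atomic` makes each single increment indivisible. `#pragma omp critical` would also be correct here but serializes a whole block rather than one memory update. The function name and the use of unsigned input values are assumptions for illustration.

```c
/* Count v[0..n-1] into nbins buckets by value modulo nbins. */
void histogram(const unsigned *v, int n, long *bins, int nbins)
{
    for (int b = 0; b < nbins; b++)
        bins[b] = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        unsigned b = v[i] % (unsigned)nbins;
        #pragma omp atomic
        bins[b]++;                  /* indivisible update of shared data */
    }
}
```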
OpenMP Directive clauses
shared(list), private(list) (threadprivate for global data)
firstprivate(list), lastprivate(list)
default(private|shared|none) in Fortran; default(shared|none) in C/C++
reduction(operator|intrinsic : list)
copyin(list)
if (expr)
schedule(type[,chunk])
ordered, nowait
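A hedged C sketch of several clauses on one loop (the function name is hypothetical): `default(none)` forces every variable used in the region to be scoped explicitly, `firstprivate(offset)` starts each thread's private copy at the outer value, and `lastprivate(last)` copies out the value from the sequentially final iteration.

```c
/* Transform a[i] in place; return the value written by iteration n - 1. */
int scale_and_last(int *a, int n, int offset)
{
    int last = -1;
    int i;

    #pragma omp parallel for default(none) shared(a, n) \
            firstprivate(offset) lastprivate(last)
    for (i = 0; i < n; i++) {       /* i is private automatically */
        a[i] = a[i] * 2 + offset;
        last = a[i];
    }
    return last;
}
```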
OpenMP library functions
omp_get_num_threads(), omp_set_num_threads()
omp_get_max_threads()
omp_get_thread_num()
omp_get_num_procs()
omp_in_parallel()
omp_get_dynamic(), omp_set_dynamic(), omp_get_nested(), omp_set_nested()
omp_init_lock(), omp_destroy_lock(), omp_test_lock()
omp_set_lock(), omp_unset_lock()
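A small C sketch using two of these library calls, guarded by the standard `_OPENMP` macro so it still compiles without OpenMP support (in which case it reports one thread). The function name is hypothetical. Note that omp_set_num_threads() also affects later parallel regions started by the calling thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Request `requested` (> 0) threads and report how many actually ran. */
int threads_in_region(int requested)
{
    int count = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);
    #pragma omp parallel
    {
        #pragma omp single
        count = omp_get_num_threads();  /* one thread records the team size */
    }
#else
    (void)requested;                    /* serial build: always one thread */
#endif
    return count;
}
```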
OpenMP environment variables
OMP_SCHEDULE, OMP_NUM_THREADS, OMP_DYNAMIC, OMP_NESTED
OpenMP Reduction and Atomic Operators
Reduction (C/C++): +, -, *, &, |, ^, &&, ||
Atomic (C/C++): ++, -- and x op= expr with op one of +, *, -, /, &, ^, |, <<, >>
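A minimal C sketch of a `+` reduction (the function name is hypothetical): each thread accumulates into its own private copy of sum, initialized to 0, and the copies are combined when the loop ends, so no atomic or critical section is needed.

```c
/* Dot product of x and y, each of length n. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];     /* updates this thread's private sum */
    return sum;
}
```

With floating-point data the parallel result can differ from the serial one in the last bits, because the partial sums are added in a different order.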
Simple loops
    do I=1,N
       z(I) = a * x(I) + y
    end do

    !$OMP parallel do
    do I=1,N
       z(I) = a * x(I) + y
    end do
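The same loop in C, as a sketch (the function name is hypothetical; y is kept as a scalar, as in the Fortran). `parallel for` is the C spelling of `parallel do`, and the loop index is private automatically.

```c
/* z[i] = a * x[i] + y for i = 0 .. n-1, iterations shared among threads. */
void scale_shift(double *z, const double *x, double a, double y, int n)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        z[i] = a * x[i] + y;    /* each iteration touches only z[i] and x[i] */
}
```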
Data Scoping
The loop index is private by default (other variables are shared by default)
Declare variables as shared, private or reduction
Private variables
    !$OMP parallel do private(a,b,c)
    do I=1,m
       do j=1,n
          b=f(I)
          c=k(j)
          call abc(a,b,c)
       end do
    end do

C equivalent of the directive:

    #pragma omp parallel for private(a,b,c)
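A runnable C rendering of the loop above, in which f, k and abc are hypothetical stand-ins (the slides do not define them) and a reduction on total is added only so the result can be checked. One C-specific detail: the directive privatizes only the outer index, so the inner index j must either be declared inside the loop, as here, or be listed in private().

```c
/* Hypothetical stand-ins for f, k and abc from the slide. */
static double f(int i) { return (double)i; }
static double k(int j) { return (double)(2 * j); }
static void abc(double *a, double b, double c) { *a = b + c; }

double accumulate(int m, int n)
{
    double total = 0.0;
    double a, b, c;             /* scratch values: one copy per thread */
    int i;

    #pragma omp parallel for private(a, b, c) reduction(+:total)
    for (i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {  /* j is private: declared in the loop */
            b = f(i);
            c = k(j);
            abc(&a, b, c);
            total += a;
        }
    }
    return total;
}
```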
Dependencies
Data dependencies (lexical/dynamic extent)
Flow dependencies
Classifying and removing the dependencies
Non-removable dependencies
Examples:

    Do I=2,N
       a(I) = a(I) + a(I-1)
    End do

    Do I=2,N,2
       a(I) = a(I) + a(I-1)
    End do
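The two Fortran examples translated into C with 0-based indexing, as a sketch (function names are hypothetical). In the first loop each iteration reads the value the previous iteration just wrote, a flow dependence, so it must stay serial. In the stride-2 loop every iteration writes an odd index and reads only an even index that no iteration writes, so the iterations are independent and can be shared.

```c
/* Serial: a[i] needs the already-updated a[i-1]. */
void prefix_serial(double *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}

/* Parallel: writes go to odd indices, reads come from even indices. */
void pairwise_parallel(double *a, int n)
{
    int i;

    #pragma omp parallel for
    for (i = 1; i < n; i += 2)
        a[i] = a[i] + a[i - 1];
}
```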
Making sure everyone has enough work
Parallel overhead: thread creation and synchronization vs. the work done in the loop

    !$OMP parallel do schedule(dynamic,3)

Schedule types: static, dynamic, guided, runtime
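A C sketch of why the schedule matters (the function name and workload are hypothetical): iteration i does i units of work, so an even static split would leave the thread holding the last iterations with the most work. `schedule(dynamic, 3)` hands out chunks of three iterations to whichever thread becomes free.

```c
/* Triangular workload: total work is 0 + 1 + ... + (n-1). */
long triangular_work(int n)
{
    long total = 0;
    int i;

    #pragma omp parallel for schedule(dynamic, 3) reduction(+:total)
    for (i = 0; i < n; i++) {
        long local = 0;
        for (int j = 0; j < i; j++)     /* cost grows with i */
            local += 1;
        total += local;
    }
    return total;
}
```

With `schedule(runtime)` instead, the choice is deferred to the OMP_SCHEDULE environment variable, for example `OMP_SCHEDULE="guided,4"`.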
Parallel regions – from fine to coarse parallelism
!$OMP parallel
threadprivate and copyin
Work-sharing constructs: do, sections, section, single
Synchronization: critical, atomic, barrier, ordered, master
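A hedged C sketch of coarse-grained parallelism (function and variable names are hypothetical): one parallel region contains two work-shared loops, so the threads are created once rather than twice. `scale` is threadprivate, and `copyin(scale)` copies the master thread's value into every thread's copy on entry to the region.

```c
/* A file-scope variable with one copy per thread. */
static int scale = 1;
#pragma omp threadprivate(scale)

/* Phase 1 scales x; phase 2 reads x from positions other threads wrote. */
void two_phase(double *x, double *y, int n, int factor)
{
    scale = factor;                     /* master thread's copy */

    #pragma omp parallel copyin(scale)
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            x[i] *= scale;
        /* implicit barrier: all of phase 1 is done before phase 2 starts */

        #pragma omp for
        for (int i = 0; i < n; i++)
            y[i] = x[i] + x[n - 1 - i];
    }
}
```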
To distributed memory systems
MPI, PVM, BSP …
Some Parallel Libraries
Existing parallel libraries and toolkits include:
PUL, the Parallel Utilities Library from EPCC
The Multicomputer Toolbox from Tony Skjellum and colleagues at LLNL and MSU
The Portable, Extensible Toolkit for Scientific Computation (PETSc) from ANL
ScaLAPACK from ORNL and UTK
ESSL and PESSL on AIX
PBLAS, PLAPACK, ARPACK