Tutorial: High Performance Computing



Page 1: Tutorial: High Performance Computing


Igal G. Rasin

Department of Chemical Engineering, Israel Institute of Technology

27 Nisan 5769 (21.04.2009)


Page 2: Tutorial: High Performance Computing

Motivation

What is High Performance Computing?

What is “Regular” Computing?

How do you choose a method for your “regular” problem?

Is your algorithm optimal for serial computing?


Page 3: Tutorial: High Performance Computing

Memory Gap

[Figure: relative performance, 1980-2005, log scale from 1 to 1000. CPU performance doubles roughly every 2 years, while memory performance doubles only about every 6 years, so the gap between them keeps widening.]


Page 4: Tutorial: High Performance Computing

Processor structure

RAM frequency is lower than the processor frequency

Hierarchical structure

The L1 cache usually works at the same frequency as the processor

[Diagram: memory hierarchy — Processor, L1 cache, L2 cache, RAM.]

Processor name    Frequency  Memory transfer rate  L1 size  L1 transfer rate
Intel Core 2 Duo  1.6 GHz    6.4 GB/s              2x32 KB  96 GB/s


Page 5: Tutorial: High Performance Computing

Memory. Performance

Example

[Figure: iterations per second (0 to 1.2e9) vs. array size 2^n (n = 8-26), for serial and random access order.]

L1 size: 2x32 KB (an array of 2^12 doubles x 8 B = 32 KB); L2 size: 3072 KB (2^18 doubles x 8 B = 2048 KB)
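The shape of this curve can be reproduced with a small benchmark; a minimal sketch (the permutation constant and the timing choices are illustrative assumptions, not the code behind this plot):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Sum an array of 2^n doubles in serial order and in a scattered
    // (pseudo-random) order, and report iterations per second for each.
    int main() {
        for (int n = 8; n <= 26; ++n) {
            size_t size = size_t(1) << n;
            std::vector<double> a(size, 1.0);
            std::vector<size_t> order(size);
            for (int pass = 0; pass < 2; ++pass) {
                if (pass == 0)
                    std::iota(order.begin(), order.end(), 0);      // serial order
                else
                    for (size_t k = 0; k < size; ++k)              // scattered order:
                        order[k] = (k * 2654435761u) & (size - 1); // an odd multiplier
                                                                   // permutes [0, 2^n)
                auto t0 = std::chrono::steady_clock::now();
                double sum = 0.0;
                for (size_t k = 0; k < size; ++k) sum += a[order[k]];
                double dt = std::chrono::duration<double>(
                                std::chrono::steady_clock::now() - t0).count();
                std::printf("n=%2d %-6s %.3g iterations/s (sum=%g)\n",
                            n, pass ? "random" : "serial", size / dt, sum);
            }
        }
        return 0;
    }

Reading through the index array adds memory traffic of its own, so treat the numbers as qualitative: the point is the step-like drop once the array no longer fits in L1 and then L2.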


Page 6: Tutorial: High Performance Computing

Multi-Core / Multi-Processor architectures.

[Diagram: two processors, each with two cores (core1, core2); every core has its own L1 and L2 cache, and each processor is attached to its own memory.]

Memory access rate depends on memory placement

Both processors work independently with their own memories
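Because each processor reaches its own memory fastest, where data pages land matters. On Linux, a page is typically placed on the NUMA node of the thread that first touches it; a minimal sketch exploiting that (the first-touch policy is an assumption about the operating system, not something the slide states):

    #include <cstdlib>
    #include <omp.h>

    // Allocate without touching the pages, then initialize them in parallel
    // with the same static schedule the compute loops will use, so each page
    // is first touched (and hence placed) near the core that will work on it.
    double *allocNumaLocal(size_t n) {
        double *a = static_cast<double *>(std::malloc(n * sizeof(double)));
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; ++i)
            a[i] = 0.0;
        return a;
    }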


Page 7: Tutorial: High Performance Computing

Tendencies

More parallelization:

More cores (IBM, Sony, Tilera)

Vectorization (Nvidia, ATI)


Page 8: Tutorial: High Performance Computing

Cluster vs. Shared memory

Cluster: separate machines, with a network on all levels

[Diagram: processors (P), each with its own memory (M), connected by a network.]

Small data / many calculations

Shared memory machine on all levels

[Diagram: several processors (P) attached to a single shared memory.]

Large data / few calculations



Page 10: Tutorial: High Performance Computing

Molecular dynamics

Molecular dynamics of atoms with the Lennard-Jones potential

V_{ij} = 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right]

\vec{F}_{ij} = 4\varepsilon\,\vec{r}\left(\frac{12\,\sigma^{12}}{r^{14}} - \frac{6\,\sigma^{6}}{r^{8}}\right), \qquad \vec{F}_i = \sum_j \vec{F}_{ij}

(here \vec{r} = \vec{r}_i - \vec{r}_j points from atom j to atom i)

Program

    // All-pairs force computation: every pair (i, j) is visited once and
    // Newton's third law (F_ji = -F_ij) halves the work; O(N^2) overall.
    void computeF(Vec *p, Vec *f, int n) {
        initF(f, n);                        // zero the force array
        for (int i(0); i < n; ++i)
            for (int j(i + 1); j < n; ++j) {
                Vec ff(force(p[i], p[j]));  // force on i due to j
                f[i] += ff;
                f[j] -= ff;                 // Newton's third law
            }
    }
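The slide does not show Vec, initF, or force; a minimal sketch of what they might look like, assuming reduced units sigma = eps = 1:

    struct Vec {
        double x, y, z;
        Vec &operator+=(const Vec &o) { x += o.x; y += o.y; z += o.z; return *this; }
        Vec &operator-=(const Vec &o) { x -= o.x; y -= o.y; z -= o.z; return *this; }
    };

    void initF(Vec *f, int n) {
        for (int i = 0; i < n; ++i) f[i] = Vec{0.0, 0.0, 0.0};
    }

    // Lennard-Jones force on a due to b: F = 4 r (12/r^14 - 6/r^8)
    // in units sigma = eps = 1, with r pointing from b to a.
    Vec force(const Vec &a, const Vec &b) {
        Vec r{a.x - b.x, a.y - b.y, a.z - b.z};
        double r2  = r.x * r.x + r.y * r.y + r.z * r.z;
        double ir2 = 1.0 / r2;
        double ir6 = ir2 * ir2 * ir2;                      // 1/r^6
        double c   = 4.0 * (12.0 * ir6 - 6.0) * ir6 * ir2; // 4(12/r^14 - 6/r^8)
        return Vec{c * r.x, c * r.y, c * r.z};
    }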

Page 11: Tutorial: High Performance Computing

Parallelization. OpenMP

OpenMP is a tool for parallelizing programs on a shared memory machine

Program

    void computeF(Vec *p, Vec *f, int n) {
        initF(f, n);
        int i;
        // Dynamic schedule in chunks of 10: the rows shrink as i grows,
        // so a static split over i would be badly load-balanced.
        #pragma omp parallel for schedule(dynamic, 10)
        for (i = 0; i < n; ++i)
            for (int j(i + 1); j < n; ++j) {
                Vec ff(force(p[i], p[j]));
                f[i] += ff;
                f[j] -= ff;  // NOTE: two threads can update the same f[j];
                             // see the race-free variant below
            }
    }

Nanco, 8000 particles:
    1 core: 1 update per second
    2 processors x 2 cores: 4 updates per second
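As the NOTE above says, the f[j] -= ff update can race: threads working on different rows i may write the same f[j]. One hedged way to keep the Newton's-third-law savings, reusing the Vec sketch from the previous page, is a per-thread force buffer merged at the end (a sketch, not the tutorial's code):

    #include <vector>

    void computeFSafe(Vec *p, Vec *f, int n) {
        initF(f, n);
        #pragma omp parallel
        {
            std::vector<Vec> local(n, Vec{0.0, 0.0, 0.0}); // thread-private forces
            #pragma omp for schedule(dynamic, 10)
            for (int i = 0; i < n; ++i)
                for (int j = i + 1; j < n; ++j) {
                    Vec ff = force(p[i], p[j]);
                    local[i] += ff;
                    local[j] -= ff;            // safe: the buffer is private
                }
            #pragma omp critical               // merge buffers one thread at a time
            for (int i = 0; i < n; ++i) f[i] += local[i];
        }
    }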


Page 12: Tutorial: High Performance Computing

Cluster. MPI

MPI (Message Passing Interface): a library for data exchange between different nodes

[Diagram: Node 0, Node 1, Node 2, Node 3, Node 4, all connected by a network.]

Point-to-point: communication between single nodes

Collective: data exchange between a node and a group

One-sided: remote direct memory access to a process
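A minimal point-to-point sketch (the buffer value and tag are illustrative):

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double x = 3.14;
        if (rank == 0)        // rank 0 sends one double to rank 1
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1) { // rank 1 blocks until it arrives
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %g\n", x);
        }
        MPI_Finalize();
        return 0;
    }

Compile and launch with the MPI wrappers, e.g. mpicxx and mpirun -np 2.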


Page 13: Tutorial: High Performance Computing

Cluster. Parallelization

Step 1. Initialization

    int main() {
        MPI_Init(NULL, NULL);  // null argc/argv is allowed since MPI-2
        ...
        MPI_Finalize();
        return 0;
    }

Step 2. Particle exchange

    void particlesExchange(Vec *p, int n) {
        int nn;
        MPI_Comm_size(MPI_COMM_WORLD, &nn);
        // In-place all-gather: each rank contributes its own block of
        // n/nn particles (3*n/nn doubles); afterwards every rank holds
        // all n positions.
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DOUBLE,
                      p, 3 * n / nn, MPI_DOUBLE, MPI_COMM_WORLD);
    }


Page 14: Tutorial: High Performance Computing

Cluster. Parallelization

Step 3. Force calculation

    void computeF(Vec *p, Vec *f, int n) {
        initF(f, n);
        int i, n1, n2, id, nn;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &nn);
        n1 = n / nn * id;        // first row of pairs handled by this rank
        n2 = n / nn * (id + 1);  // one past the last row
        #pragma omp parallel for schedule(dynamic, 10)
        for (i = n1; i < n2; ++i)
            for (int j(i + 1); j < n; ++j) {
                Vec ff(force(p[i], p[j]));
                f[i] += ff;      // forces are partial sums here;
                f[j] -= ff;      // they are combined in step 4
            }
    }


Page 15: Tutorial: High Performance Computing

Cluster. Parallelization

Step 4. Force exchange

    void forcesExchange(Vec *p, int n)  // p here is the force array
    {
        // Element-wise sum of the partial force arrays over all ranks;
        // with MPI_IN_PLACE every rank receives the total in p.
        MPI_Allreduce(MPI_IN_PLACE, p, 3 * n, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }
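Put together, one time step of the hybrid MPI/OpenMP program might look like this sketch (the integrate routine is a hypothetical placeholder; the slides only show the exchange and force code):

    // One MD time step (sketch):
    void step(Vec *p, Vec *v, Vec *f, int n, double dt) {
        particlesExchange(p, n);   // every rank gets all current positions
        computeF(p, f, n);         // each rank computes its block of pairs
        forcesExchange(f, n);      // sum the partial forces across ranks
        integrate(p, v, f, n, dt); // hypothetical: advance positions/velocities
    }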


Page 16: Tutorial: High Performance Computing

Barnes-Hut simulation

Cut-off radius r_c = 2.5σ: beyond this distance the Lennard-Jones force is negligible, so each particle only needs to interact with nearby neighbors.

J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4), December 1986.
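A common way to exploit the cut-off (and the subdomain structure the next slides parallelize over) is to bin particles into cells of side at least r_c, so each particle only interacts with its own and adjacent cells; a minimal sketch, assuming coordinates in a cubic box [0, L)^3 and the Vec sketch from before:

    #include <vector>

    // Bin particle indices into cubic cells. A particle then only tests
    // partners in its own cell and the 26 neighboring ones, reducing the
    // all-pairs O(N^2) loop to roughly O(N) work.
    std::vector<std::vector<int>> buildCells(const Vec *p, int n,
                                             double L, double rc) {
        int m = (int)(L / rc);                  // cells per side, side L/m >= rc
        std::vector<std::vector<int>> cells(m * m * m);
        for (int i = 0; i < n; ++i) {
            int cx = (int)(p[i].x * m / L);     // 0 <= cx < m for x in [0, L)
            int cy = (int)(p[i].y * m / L);
            int cz = (int)(p[i].z * m / L);
            cells[(cx * m + cy) * m + cz].push_back(i);
        }
        return cells;
    }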



Page 18: Tutorial: High Performance Computing

OpenMP Parallelization

Parallelization within one subdomain

Parallelization over subdomains (see the sketch below)
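A sketch of the second option, reusing buildCells from above and a hypothetical per-cell kernel computeCellForces:

    // Parallelization over subdomains: each OpenMP thread takes whole cells,
    // so most force updates stay local to one thread.
    void computeAllCells(const std::vector<std::vector<int>> &cells,
                         Vec *p, Vec *f) {
        #pragma omp parallel for schedule(dynamic)
        for (long c = 0; c < (long)cells.size(); ++c)
            computeCellForces(cells, c, p, f); // hypothetical: pairs within
                                               // cell c and with its neighbors
    }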


Page 19: Tutorial: High Performance Computing

MPI Parallelization

Only border cells require data exchange
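A hedged sketch of such a border (halo) exchange for a 1D ring of ranks (packing the border-cell particles into the send buffers is assumed to happen elsewhere):

    #include <mpi.h>

    // Swap border-cell data with the left and right neighbor ranks.
    // MPI_Sendrecv pairs each send with a receive, avoiding deadlock.
    void exchangeBorders(double *sendL, double *recvR,
                         double *sendR, double *recvL,
                         int count, int rank, int nranks) {
        int left  = (rank - 1 + nranks) % nranks;  // periodic neighbors
        int right = (rank + 1) % nranks;
        MPI_Sendrecv(sendL, count, MPI_DOUBLE, left,  0,
                     recvR, count, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sendR, count, MPI_DOUBLE, right, 1,
                     recvL, count, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }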


Page 20: Tutorial: High Performance Computing

Conclusions

Small data / many calculations:
- no adaptations needed for modern processors
- extremely easy and efficient to parallelize

Large data / few calculations:
- the serial program requires data decomposition in order to fit the cache
- with decomposed data, the serial program is extremely easy and efficient to parallelize
