Transcript of Exercise: OpenMP Programming (gmichi/asocd/exercises/ex_03.pdf) · ETH Zürich, Integrated Systems Laboratory
Exercise: OpenMP Programming
Multicore programming with OpenMP
19.04.2016
A. Marongiu – [email protected]
D. Palossi – [email protected]
Odroid Board
Board Specs
• Exynos5 Octa: Cortex™-A15 1.4 GHz quad-core and Cortex™-A7 quad-core CPUs
• PowerVR SGX544MP3 GPU (OpenGL ES 2.0, OpenGL ES 1.1 and OpenCL 1.1 EP)
• 2 GB LPDDR3 RAM PoP
• 1x USB 3.0 Host, 1x USB 3.0 OTG, 4x USB 2.0 Host
• HDMI 1.4a output, Type-D connector
• eMMC 4.5 flash storage
Login, compile and run
• Login to Odroid board:
ssh -X stud_usr@<BOARD-IP>
• Exercise files are already on the board under*:
/home/stud_usr/asocd16/SRC
• Compile and run with
make clean all run
• Take a look at test.c. The different exercises are #ifdef-ed
• To compile and execute the desired exercise:
  make clean all run -e MYOPTS="-DEX1 -DEX2 …"
* If you need a fresh copy, it is available at: /home/soc_master/3_openMP/asocd16.zip
Ex 1 – Parallelism creation: Hello World!
#pragma omp parallel num_threads(??)
{
    printf("Hello world, I'm thread ??\n");
}
• Use the parallel directive to create multiple threads
• Each thread executes the code enclosed within the scope of the directive
• Use runtime library functions to determine the thread ID
• All SPMD parallelization is based on this approach
Ex 2 – Loop partitioning: Static scheduling
[Figure: schedule(static) with NITERS = 16, NTHREADS = 4: T0 runs iterations 0–3, T1 runs 4–7, T2 runs 8–11, T3 runs 12–15]

#pragma omp parallel for \
        num_threads(NTHREADS) schedule(static)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */
With schedule(static), M iterations are statically assigned to each thread, where M = NITERS / NTHREADS.
• Small overhead: loop indices are computed from the thread ID
• Optimal scheduling if the workload is balanced
Ex 2 – Loop partitioning: Static scheduling

Exercise 2a
Use the following parameters:
• NITERS = 1024
• NTHREADS = {1, 2, 4, 8, 16}
• W = 1000000

#pragma omp parallel for \
        num_threads(NTHREADS) schedule(static)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

• Collect the execution time for the various configurations in an Excel sheet
• Comment on the results
Ex 2 – Loop partitioning: Dynamic scheduling
[Figure: schedule(dynamic, M) with M = 4: each of the 4 threads grabs one chunk of 4 consecutive iterations; the dequeue OVERHEAD is paid once per CHUNK]

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

• Iterations are dynamically assigned to threads in groups of M = NITERS/NTHREADS
• Same parallelization as schedule(static)
• Coarse granularity: overhead only at the beginning and end of the loop
Ex 2 – Loop partitioning: Dynamic scheduling
#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

[Figure: schedule(dynamic, 1): threads grab iterations one at a time, paying the dequeue OVERHEAD before every CHUNK (size = 1 iteration)]

• Iterations are dynamically assigned to threads, one iteration per chunk
• Finest granularity: overhead at every iteration
Ex 2 – Loop partitioning: Dynamic scheduling
Exercise 2b
Use the following parameters:
• NITERS = 1024
• NTHREADS = {1, 2, 4, 8, 16}
• M = {NITERS/NTHREADS, 1}
• W = 1000000

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

• Add the execution time for the various configurations to the Excel sheet
• Comment on the results
Ex 2 – Loop partitioning: Dynamic scheduling
Exercise 2c
Use the following parameters:
• NITERS = 102400
• NTHREADS = {1, 2, 4, 8, 16}
• M = {NITERS/NTHREADS, 1}
• W = 10

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

• Draw the same plot as for exercises 2a and 2b
• What happens for a smaller workload and a higher iteration count?
Ex 3 – Unbalanced Loop Partitioning
[Figure: with chunk size NITERS/NTHREADS, each thread gets a contiguous block of iterations of growing duration; the later threads finish much later, and everyone waits at the final SYNCH POINT]

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, NITERS/NTHREADS)
for (uint i = 0; i < NITERS; i++) {
    work((i >> 2) * W);
} /* (implicit) SYNCH POINT */

• Iterations have different durations
• Using coarse-grained chunks (NITERS/NTHREADS) creates unbalanced work among the threads
• Due to the barrier at the end of the parallel region, all threads have to wait for the slowest one
Ex 3 – Unbalanced Loop Partitioning
[Figure: with chunk size 1, iterations of different duration are spread evenly over the threads, yielding a SPEEDUP over the coarse-grained partitioning]

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, 1)
for (uint i = 0; i < NITERS; i++) {
    work((i >> 2) * W);
} /* (implicit) SYNCH POINT */

• Iterations have different durations
• Using fine-grained chunks (the finest is 1) creates balanced work among the threads
• The overhead of dynamic scheduling is amortized by the effect of work balancing on the overall loop duration
Ex 3 – Unbalanced Loop Partitioning
Exercise 3a, 3b

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    /* UNBALANCED LOOP CODE */
} /* (implicit) SYNCH POINT */

Use the following parameters:
• NITERS = 128
• NTHREADS = 4
• M = {32, 16, 8, 4, 1}
• W = 1000000

• Create a new Excel sheet plotting the results (execution time) for static and dynamic schedules (with chunk size M)
• Comment on the results
Ex 4 – Chunking Overhead
Study the impact of the chunk size for varying sizes of the workload W.

Use the following parameters:
• NITERS = 1024*256
• NTHREADS = 4
• M = {32, 16, 8, 4, 1}
• W = {1, 10, 100, 1000}

• Create a new Excel sheet plotting the results (execution time) for static and dynamic schedules (with chunk size M)
• Comment on the results

HINT: Plot the execution time NORMALIZED to the fastest scheduling option for a given value of W
Ex 5 – Task parallelism with sections
a) Distribute the workload among 4 threads using SPMD parallelization:
   I.  Get the thread ID
   II. Use if/else or switch/case to differentiate the workload
b) Implement the same workload partitioning with the sections directive

void sections()
{
    work(1000000);
    printf("%hu: Done with first elaboration!\n", …);
    work(2000000);
    printf("%hu: Done with second elaboration!\n", …);
    work(3000000);
    printf("%hu: Done with third elaboration!\n", …);
    work(4000000);
    printf("%hu: Done with fourth elaboration!\n", …);
}
Ex 6 – Task parallelism with task
a) Distribute the workload among 4 threads using the task directive:
   a) Same program as before
   b) But we had to manually unroll the loop to use sections
• Compare performance and ease of coding with sections

void tasks()
{
    unsigned int i;
    for (i = 0; i < 4; i++) {
        work((i + 1) * 1000000);
        printf("%hu: Done with elaboration\n", …);
    }
}
void tasks()
{
    unsigned int i;
    for (i = 0; i < 1024; i++) {
        work(1000000);
        printf("%hu: Done with elaboration\n", …);
    }
}
Ex 6 – Task parallelism with task
b) Parallelize the loop with the task directive
   • Use the single directive to force a single processor to create the tasks
c) Parallelize the loop with the single directive
   • Use the nowait clause to allow for parallel execution
• Compare performance and ease of coding of the two solutions
• Modify the EX6 exercise code as indicated on this slide
Projects: come to the software side
If you are interested in a course/semester/master project focused on OpenMP development for:
• Many-core embedded devices
• Parallel ultra-low-power architectures
• High-performance parallel machines
do not hesitate to contact us ➔ [email protected]