ETH Zürich, Integrated Systems Laboratory

Exercise: OpenMP Programming

Multicore programming with OpenMP

19.04.2016
A. Marongiu – [email protected]

D. Palossi – [email protected]


Odroid Board

Board specs:
• Exynos5 Octa: Cortex™-A15 1.4 GHz quad-core and Cortex™-A7 quad-core CPUs
• PowerVR SGX544MP3 GPU (OpenGL ES 2.0, OpenGL ES 1.1 and OpenCL 1.1 EP)
• 2 GByte LPDDR3 RAM PoP
• USB 3.0 Host x 1, USB 3.0 OTG x 1, USB 2.0 Host x 4
• HDMI 1.4a output, Type-D connector
• eMMC 4.5 flash storage


Login, compile and run

• Log in to the Odroid board:
  ssh -X stud_usr@<BOARD-IP>
• The exercise files are already on the board under*:
  /home/stud_usr/asocd16/SRC
• Compile and run with:
  make clean all run
• Take a look at test.c. The different exercises are #ifdef-ed.
• To compile and execute the desired exercise:
  make clean all run -e MYOPTS="-DEX1 -DEX2 …"

* If you need a fresh copy, one is available at /home/soc_master/3_openMP/asocd16.zip


Ex 1 – Parallelism creation: Hello World!

#pragma omp parallel num_threads (?)
    printf("Hello world, I'm thread ??");

• Use the parallel directive to create multiple threads. Each thread executes the code enclosed within the scope of the directive.
• Use runtime library functions to determine the thread ID. All SPMD parallelization is based on this approach.


Ex 2 – Loop partitioning: Static scheduling

[Figure: iterations 0–15 laid out per thread T0–T3; with static scheduling each thread executes one contiguous block of four iterations.]

#pragma omp parallel for \
        num_threads (NTHREADS) schedule (static)
for (uint i = 0; i < NITERS; i++) {
    work (W);
} /* (implicit) SYNCH POINT */

With schedule(static), M iterations are statically assigned to each thread, where M = NITERS / NTHREADS.

• Small overhead: loop indexes are computed according to the thread ID.
• Optimal scheduling if the workload is balanced.


Ex 2 – Loop partitioning: Static scheduling

Exercise 2a

#pragma omp parallel for \
        num_threads (NTHREADS) schedule (static)
for (uint i = 0; i < NITERS; i++) {
    work (W);
} /* (implicit) SYNCH POINT */

Use the following parameters:
NITERS = 1024
NTHREADS = {1, 2, 4, 8, 16}
W = 1000000

• Collect the execution times for the various configurations in an Excel sheet.
• Comment on the results.


Ex 2 – Loop partitioning: Dynamic scheduling

[Figure: iterations 0–15 assigned to threads T0–T3 in chunks of M iterations; scheduling overhead occurs only at the beginning and end of the loop.]

#pragma omp parallel for \
        num_threads (NTHREADS) \
        schedule (dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work (W);
} /* (implicit) SYNCH POINT */

Iterations are dynamically assigned to threads in groups of M = NITERS / NTHREADS.

• Same partitioning as schedule(static).
• Coarse granularity: overhead only at the beginning and end of the loop.


Ex 2 – Loop partitioning: Dynamic scheduling

[Figure: iterations 0–15 assigned to threads T0–T3 one at a time; each chunk holds a single iteration, with scheduling overhead before every chunk.]

#pragma omp parallel for \
        num_threads (NTHREADS) \
        schedule (dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work (W);
} /* (implicit) SYNCH POINT */

With M = 1, iterations are dynamically assigned to threads one at a time.

• Finest granularity: overhead at every iteration.


Ex 2 – Loop partitioning: Dynamic scheduling

Exercise 2b

#pragma omp parallel for \
        num_threads (NTHREADS) \
        schedule (dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work (W);
} /* (implicit) SYNCH POINT */

Use the following parameters:
NITERS = 1024
NTHREADS = {1, 2, 4, 8, 16}
M = {NITERS/NTHREADS, 1}
W = 1000000

• Add the execution times for the various configurations to the Excel sheet.
• Comment on the results.


Ex 2 – Loop partitioning: Dynamic scheduling

Exercise 2c

#pragma omp parallel for \
        num_threads (NTHREADS) \
        schedule (dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work (W);
} /* (implicit) SYNCH POINT */

Use the following parameters:
NITERS = 102400
NTHREADS = {1, 2, 4, 8, 16}
M = {NITERS/NTHREADS, 1}
W = 10

• Draw the same plot as in exercises 2a and 2b.
• What happens for a smaller workload and a higher iteration count?


Ex 3 – Unbalanced Loop Partitioning

[Figure: iterations 0–15 with increasing durations, assigned to threads T0–T3 in coarse chunks; T3 finishes long after the others, and all threads wait at the synch point.]

#pragma omp parallel for \
        num_threads (NTHREADS) \
        schedule (dynamic, NITERS/NTHREADS)
for (uint i = 0; i < NITERS; i++) {
    work ((i >> 2) * W);
} /* (implicit) SYNCH POINT */

• Iterations have different durations. Using coarse-grained chunks (NITERS/NTHREADS) creates unbalanced work among the threads.
• Due to the barrier at the end of the parallel region, all threads have to wait for the slowest one.


Ex 3 – Unbalanced Loop Partitioning

[Figure: the same iterations assigned one at a time; the work is evenly spread across T0–T3, yielding a speedup over the coarse-chunk schedule.]

#pragma omp parallel for \
        num_threads (NTHREADS) \
        schedule (dynamic, 1)
for (uint i = 0; i < NITERS; i++) {
    work ((i >> 2) * W);
} /* (implicit) SYNCH POINT */

• Iterations have different durations. Using fine-grained chunks (the finest is 1) balances the work among the threads.
• The overhead of dynamic scheduling is amortized by the effect of work-balancing on the overall loop duration.


Ex 3 – Unbalanced Loop Partitioning

Exercise 3a, 3b

#pragma omp parallel for \
        num_threads (NTHREADS) \
        schedule (dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    /* BALANCED LOOP CODE */
} /* (implicit) SYNCH POINT */

Use the following parameters:
NITERS = 128
NTHREADS = 4
M = {32, 16, 8, 4, 1}
W = 1000000

• Create a new Excel sheet plotting the execution times for static and dynamic schedules (with chunk size M).
• Comment on the results.


Ex 4 – Chunking Overhead

Study the impact of the chunk size for varying sizes of the workload W.

Use the following parameters:
NITERS = 1024*256
NTHREADS = 4
M = {32, 16, 8, 4, 1}
W = {1, 10, 100, 1000}

• Create a new Excel sheet plotting the execution times for static and dynamic schedules (with chunk size M).
• Comment on the results.

HINT: Plot the execution time NORMALIZED to the fastest scheduling option for a given value of W.


Ex 5 – Task parallelism with sections

a) Distribute the workload among 4 threads using SPMD parallelization:
   I.  Get the thread ID.
   II. Use if/else or switch/case to differentiate the workload.
b) Implement the same workload partitioning with the sections directive.

void sections() {
    work(1000000);
    printf("%hu: Done with first elaboration!\n", …);
    work(2000000);
    printf("%hu: Done with second elaboration!\n", …);
    work(3000000);
    printf("%hu: Done with third elaboration!\n", …);
    work(4000000);
    printf("%hu: Done with fourth elaboration!\n", …);
}


Ex 6 – Task parallelism with task

a) Distribute the workload among 4 threads using the task directive:
   • Same program as before,
   • but we had to manually unroll the loop to use sections.

Compare performance and ease of coding with sections.

void tasks() {
    unsigned int i;
    for (i = 0; i < 4; i++) {
        work((i + 1) * 1000000);
        printf("%hu: Done with elaboration\n", …);
    }
}


Ex 6 – Task parallelism with task

void tasks() {
    unsigned int i;
    for (i = 0; i < 1024; i++) {
        work(1000000);
        printf("%hu: Done with elaboration\n", …);
    }
}

b) Parallelize the loop with the task directive:
   • Use the single directive to force a single processor to create the tasks.
c) Parallelize the loop with the single directive:
   • Use the nowait clause to allow for parallel execution.

Compare performance and ease of coding of the two solutions.

Modify the EX6 exercise code as indicated on this slide.


Projects: come to the software side

If you are interested in a course/semester/master project focused on OpenMP development for:
• Many-core embedded devices
• Parallel ultra-low-power architectures
• High-performance parallel machines

do not hesitate to contact us:
➔ [email protected]
➔ [email protected]