Transcript of Exercise: OpenMP Programming (gmichi/asocd/exercises/ex_03.pdf) · ETH Zürich, Integrated Systems Laboratory
Exercise: OpenMP Programming
Multicore programming with OpenMP
19.04.2016
A. Marongiu – [email protected]
D. Palossi – [email protected]
Odroid Board
Board Specs
• Exynos5 Octa: Cortex™-A15 1.4 GHz quad-core and Cortex™-A7 quad-core CPUs
• PowerVR SGX544MP3 GPU (OpenGL ES 2.0, OpenGL ES 1.1 and OpenCL 1.1 EP)
• 2 GB LPDDR3 RAM PoP
• 1x USB 3.0 Host, 1x USB 3.0 OTG, 4x USB 2.0 Host
• HDMI 1.4a output, Type-D connector
• eMMC 4.5 flash storage
Login, compile and run
• Login to Odroid board:
ssh -X stud_usr@<BOARD-IP>
• Exercise files are already on the board under*:
/home/stud_usr/asocd16/SRC
• Compile and run with
make clean all run
• Take a look at test.c. The different exercises are #ifdef-ed
• To compile and execute the desired exercise:
  make clean all run -e MYOPTS="-DEX1 -DEX2 …"
* If you need a fresh copy, it is available at: /home/soc_master/3_openMP/asocd16.zip
Ex 1 – Parallelism creation: Hello World!
#pragma omp parallel num_threads(??)
{
    printf("Hello world, I'm thread ??\n");
}
• Use the parallel directive to create multiple threads
• Each thread executes the code enclosed within the scope of the directive
• Use runtime library functions to determine the thread ID
• All SPMD parallelization is based on this approach
Ex 2 – Loop partitioning: Static scheduling
[Figure: schedule(static) with NITERS = 16, NTHREADS = 4: T0 runs iterations 0–3, T1 runs 4–7, T2 runs 8–11, T3 runs 12–15]

#pragma omp parallel for \
        num_threads(NTHREADS) schedule(static)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */
With schedule(static), M iterations are statically assigned to each thread, where M = NITERS / NTHREADS.
• Small overhead: loop indices are computed from the thread ID
• Optimal scheduling if the workload is balanced
Ex 2 – Loop partitioning: Static scheduling

Exercise 2a
Use the following parameters:
• NITERS = 1024
• NTHREADS = {1, 2, 4, 8, 16}
• W = 1000000

#pragma omp parallel for \
        num_threads(NTHREADS) schedule(static)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

• Collect the execution time for the various configurations in an Excel sheet
• Comment on the results
Ex 2 – Loop partitioning: Dynamic scheduling
[Figure: schedule(dynamic, M) with M = 4: each of the 4 threads grabs one chunk of 4 consecutive iterations; the dequeue OVERHEAD is paid once per CHUNK]

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

• Iterations are dynamically assigned to threads in groups of M = NITERS/NTHREADS
• Same parallelization as schedule(static)
• Coarse granularity: overhead only at the beginning and end of the loop
Ex 2 – Loop partitioning: Dynamic scheduling
#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

[Figure: schedule(dynamic, 1): threads grab iterations one at a time, paying the dequeue OVERHEAD before every CHUNK (size = 1 iteration)]

• Iterations are dynamically assigned to threads, one iteration per chunk
• Finest granularity: overhead at every iteration
Ex 2 – Loop partitioning: Dynamic scheduling
Exercise 2b
Use the following parameters:
• NITERS = 1024
• NTHREADS = {1, 2, 4, 8, 16}
• M = {NITERS/NTHREADS, 1}
• W = 1000000

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

• Add the execution time for the various configurations to the Excel sheet
• Comment on the results
Ex 2 – Loop partitioning: Dynamic scheduling
Exercise 2c
Use the following parameters:
• NITERS = 102400
• NTHREADS = {1, 2, 4, 8, 16}
• M = {NITERS/NTHREADS, 1}
• W = 10

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    work(W);
} /* (implicit) SYNCH POINT */

• Draw the same plot as for exercises 2a and 2b
• What happens for a smaller workload and a higher iteration count?
Ex 3 – Unbalanced Loop Partitioning
[Figure: with chunk size NITERS/NTHREADS, each thread gets a contiguous block of iterations of growing duration; the later threads finish much later, and everyone waits at the final SYNCH POINT]

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, NITERS/NTHREADS)
for (uint i = 0; i < NITERS; i++) {
    work((i >> 2) * W);
} /* (implicit) SYNCH POINT */

• Iterations have different durations
• Using coarse-grained chunks (NITERS/NTHREADS) creates unbalanced work among the threads
• Due to the barrier at the end of the parallel region, all threads have to wait for the slowest one
Ex 3 – Unbalanced Loop Partitioning
[Figure: with chunk size 1, iterations of different duration are spread evenly over the threads, yielding a SPEEDUP over the coarse-grained partitioning]

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, 1)
for (uint i = 0; i < NITERS; i++) {
    work((i >> 2) * W);
} /* (implicit) SYNCH POINT */

• Iterations have different durations
• Using fine-grained chunks (the finest is 1) creates balanced work among the threads
• The overhead of dynamic scheduling is amortized by the effect of work balancing on the overall loop duration
Ex 3 – Unbalanced Loop Partitioning
Exercise 3a, 3b

#pragma omp parallel for \
        num_threads(NTHREADS) \
        schedule(dynamic, M)
for (uint i = 0; i < NITERS; i++) {
    /* UNBALANCED LOOP CODE */
} /* (implicit) SYNCH POINT */

Use the following parameters:
• NITERS = 128
• NTHREADS = 4
• M = {32, 16, 8, 4, 1}
• W = 1000000

• Create a new Excel sheet plotting the results (execution time) for static and dynamic schedules (with chunk size M)
• Comment on the results
Ex 4 – Chunking Overhead
Study the impact of the chunk size for varying sizes of the workload W.

Use the following parameters:
• NITERS = 1024*256
• NTHREADS = 4
• M = {32, 16, 8, 4, 1}
• W = {1, 10, 100, 1000}

• Create a new Excel sheet plotting the results (execution time) for static and dynamic schedules (with chunk size M)
• Comment on the results

HINT: Plot the execution time NORMALIZED to the fastest scheduling option for a given value of W
Ex 5 – Task parallelism with sections
a) Distribute the workload among 4 threads using SPMD parallelization:
   I.  Get the thread ID
   II. Use if/else or switch/case to differentiate the workload
b) Implement the same workload partitioning with the sections directive

void sections()
{
    work(1000000);
    printf("%hu: Done with first elaboration!\n", …);
    work(2000000);
    printf("%hu: Done with second elaboration!\n", …);
    work(3000000);
    printf("%hu: Done with third elaboration!\n", …);
    work(4000000);
    printf("%hu: Done with fourth elaboration!\n", …);
}
Ex 6 – Task parallelism with task
a) Distribute the workload among 4 threads using the task directive:
   a) Same program as before
   b) But we had to manually unroll the loop to use sections
• Compare performance and ease of coding with sections

void tasks()
{
    unsigned int i;
    for (i = 0; i < 4; i++) {
        work((i + 1) * 1000000);
        printf("%hu: Done with elaboration\n", …);
    }
}
void tasks()
{
    unsigned int i;
    for (i = 0; i < 1024; i++) {
        work(1000000);
        printf("%hu: Done with elaboration\n", …);
    }
}
Ex 6 – Task parallelism with task
b) Parallelize the loop with the task directive
   • Use the single directive to force a single processor to create the tasks
c) Parallelize the loop with the single directive
   • Use the nowait clause to allow for parallel execution
• Compare performance and ease of coding of the two solutions
• Modify the EX6 exercise code as indicated on this slide
Projects: come to the software side
If you are interested in a course/semester/master project focused on OpenMP development for:
• Many-core embedded devices
• Parallel ultra-low-power architectures
• High-performance parallel machines
do not hesitate to contact us ➔ [email protected]