
For additional information, contact:

Azamat Mametjanov, Argonne National Laboratory

(630) 252-1074, azamat@anl.gov

climatemodeling.science.energy.gov/acme

Lightweight threading and vectorization with OpenMP in ACME

Azamat Mametjanov, Robert Jacob, Mark Taylor

Next-generation DOE machines Cori II (October 2016) and Aurora (2018) are expected to provide 3-5x more cores per node with KNL and KNH Xeon Phi many-core chips. Applications are required to increase their fine-grained threading and vectorization potential to fully occupy ~60 cores, each with 8-double-wide vector units. While the new on-package high-bandwidth memory is expected to provide a 5x speedup to existing executables, significant source-code refactoring and fine-grained parallelization is needed to achieve better efficiency. Pure MPI parallelization can lead to network congestion, degraded performance, and out-of-memory errors at high numbers of tasks per node. Our goal is to leverage the shared-memory parallelism of OpenMP to efficiently utilize the available hardware threads and vector units.

Primary methods toward this goal (see the sketch below):
1. Parallelize sequential loops
2. Collapse nested loops
3. Vectorize innermost loops
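As a rough illustration of how these three methods combine (a minimal sketch with hypothetical arrays a, b and bounds n, m, nlev; not ACME source), the corresponding OpenMP directives look like this:

    subroutine update_sketch(a, b, n, m, nlev)
      implicit none
      integer, intent(in)    :: n, m, nlev
      real(8), intent(in)    :: b(n, nlev, m)
      real(8), intent(inout) :: a(n, nlev, m)
      integer :: i, j, k

      !$omp parallel do collapse(2)      ! 1. parallelize and 2. collapse the two outer loops
      do j = 1, m
         do k = 1, nlev
            !$omp simd                   ! 3. vectorize the innermost, stride-1 loop
            do i = 1, n
               a(i, k, j) = a(i, k, j) + b(i, k, j)
            end do
         end do
      end do
      !$omp end parallel do
    end subroutine update_sketch

Vectorization pays off when the innermost loop accesses contiguous (stride-1) data, which is why the leading dimension is reordered into the inner loop in the ACME changes shown in the Approach section.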

Complicating factors are threading and SIMD overheads, bit-for-bit correctness, and portability of speedups and correctness across the supported compilers: Intel, IBM, PGI, and GNU.
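For example (an illustrative sketch, not ACME code), a threaded reduction changes the order of floating-point additions relative to the serial loop, so the result can differ in the last bits and fail a bit-for-bit comparison even though the two sums are algebraically identical:

    program bfb_sketch
      implicit none
      integer, parameter :: n = 100000
      real(8) :: x(n), s_serial, s_threaded
      integer :: i

      call random_number(x)

      s_serial = 0.0d0
      do i = 1, n
         s_serial = s_serial + x(i)          ! fixed left-to-right summation order
      end do

      s_threaded = 0.0d0
      !$omp parallel do reduction(+:s_threaded)
      do i = 1, n
         s_threaded = s_threaded + x(i)      ! per-thread partial sums, combined in a
      end do                                 ! runtime-dependent order
      !$omp end parallel do

      ! A nonzero difference here is enough to fail a bit-for-bit (cprnc) check.
      print *, s_serial - s_threaded
    end program bfb_sketch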

Objective

Profile to pick compute-bound regions:
• CrayPAT, HPCToolkit, Intel VTune/Advisor/Inspector, …
• No "hot" regions: many "warm" ones

OpenMP updates (in diff format):

    call t_startf('omp_par_do_region')                                       ! add a timer
    !$omp parallel do num_threads(hthreads), default(shared), private(ie)    ! parallelize
    do ie = nets, nete
       call omp_set_num_threads(vthreads)                                    ! nested threading
       !$omp parallel do private(q, k) collapse(2)                           ! collapse nested loops
       do q = 1, qsize                                                       ! reorder leading dim
          do k = 1, nlev                                                     ! into inner loop
             !$omp simd                                                      ! vectorize
             gradQ(:, :, 1) = Vstar(:, :, 1, k) * elem(ie)%state%Qdp(:, :, k, q, ie)
          end do
       end do
    end do
    call t_stopf('omp_par_do_region')

Benchmark: based on env_mach_pes.xml, improve timing_stats
• Also, inspect compiler-generated listings and optimization reports

Test: (PET, PEM, PMT, ERP tests) on (F-, I-, C-, G-, B-cases) without cprnc diffs

Port: re-bench and re-test on other compilers

Approach

Enabled nested threading in ACME (a minimal sketch of enabling nested regions follows below):
• Nested threading shows speedups: e.g. a 1.71x speedup in atm/lnd coupled F-cases.
• Beyond scaling limits, nested threads occupy "unused" cores: e.g. ne120 on an 8K Mira partition.
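A minimal sketch of what enabling nested regions involves (the hthreads/vthreads values below are illustrative, not ACME's runtime configuration): nesting is disabled by default in most OpenMP runtimes, so it must be switched on before an hthreads x vthreads decomposition can occupy a node's hardware threads.

    program nested_sketch
      use omp_lib
      implicit none
      integer :: hthreads, vthreads

      hthreads = 8                           ! outer (horizontal) threads, e.g. over elements
      vthreads = 8                           ! inner (vertical) threads, e.g. over tracers/levels

      call omp_set_nested(.true.)            ! allow nested parallel regions
      call omp_set_max_active_levels(2)      ! outer + inner level

      !$omp parallel num_threads(hthreads)
        !$omp parallel num_threads(vthreads)
          !$omp single
          print *, 'outer thread', omp_get_ancestor_thread_num(1), &
                   'runs an inner team of', omp_get_num_threads(), 'threads'
          !$omp end single
        !$omp end parallel
      !$omp end parallel
    end program nested_sketch

On a KNL-class node with 64 cores and 4 hardware threads per core, the hthreads x vthreads product can be scaled up to 256 to cover all hardware threads, as in the KNL results below.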

Improving OpenMP run-time support:
• BOLT (BOLT is OpenMP over Lightweight Threads) provides a 1.6x speedup to nested threading.
• Intel is using the transport_se mini-app to improve its libraries and compiler support.

Early access to KNL nodes:
• 64 cores and 256 hardware threads with 16 GB of high-bandwidth memory per node.
• Nested threads are able to use all 256 hardware threads while fitting into the 16 GB of HBM.

Impact

[Figure: "Node subscription and nested threading", compset FC5AV1C-01 at ne120_ne120 resolution on Mira. SYPD vs. Nodes x Ranks x Horizontal threads (x Vertical threads); annotated speedups of 1.71x and 1.22x from nested threading.]

[Figure: "Hybrid MPI+OpenMP optimization and scaling", compset FC5AV1C-01 at ne120_ne120 resolution on Mira. SYPD vs. Nodes x Ranks x Threads for 1350, 2700, and 5400 nodes (64, 32, and 16 elements per node).]