Basics Compilers


Transcript of Basics Compilers

  • 7/29/2019 Basics Compilers

    1/41

    Compilers

  • Slide 2/41

    PGI
    Recommended first compile/run: -fastsse -tp barcelona-64
    Get diagnostics: -Minfo -Mneginfo
    Inlining: -Mipa=fast,inline
    Recognize OpenMP directives: -mp=nonuma
    Automatic parallelization: -Mconcur

    PathScale
    Recommended first compile/run: ftn -O3 -OPT:Ofast -march=barcelona
    Get diagnostics: -LNO:simd_verbose=ON
    Inlining: -ipa
    Recognize OpenMP directives: -mp
    Automatic parallelization: -apo

  • Slide 3/41

    PGI Basic Compiler Usage

    A compiler driver interprets options and invokes preprocessors, compilers, assembler, linker, etc. Options precedence: if options conflict, the last option on the command line takes precedence.

    Use -Minfo to see a listing of optimizations and transformations performed by the compiler.

    Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help.

    Use man pages for more details on options, e.g. man pgf90. Use -v to see under the hood.
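The driver usage above can be sketched as command lines (illustrative; myprog.f90 is a placeholder file name):

```sh
# List the optimizations and transformations the compiler performed
pgf90 -fastsse -Minfo myprog.f90

# List all options, or details for one option group
pgf90 -help
pgf90 -Mvect -help

# Full documentation, and a look under the hood of the driver
man pgf90
pgf90 -v -fastsse myprog.f90
```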

  • Slide 4/41

    Flags to support language dialects

    Fortran: pgf77, pgf90, pgf95, pghpf tools. Suffixes .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF. -Mextend, -Mfixed, -Mfreeform. Type size: -i2, -i4, -i8, -r4, -r8, etc. -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.

    C/C++: pgcc, pgCC (aka pgcpp). Suffixes .c, .C, .cc, .cpp, .i. -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt. -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs

  • Slide 5/41

    Specifying the target architecture

    Use the -tp switch (you don't need it for Dual Core).

    -tp k8-64, -tp p7-64, or -tp core2-64 for 64-bit code. -tp amd64e for AMD Opteron rev E or later. -tp x64 for a unified binary. -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code. -tp barcelona-64
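As a sketch, targeting the quad-core Barcelona node described on slide 2 (myprog.f90 is a placeholder file name):

```sh
# 64-bit code tuned for AMD Barcelona
pgf90 -fastsse -tp barcelona-64 myprog.f90

# Unified binary covering multiple x86-64 targets
pgf90 -fastsse -tp x64 myprog.f90
```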

  • Slide 6/41

    Flags for debugging aids

    -g generates symbolic debug information used by a debugger. -gopt generates debug information in the presence of optimization. -Mbounds adds array bounds checking. -v gives verbose output, useful for debugging system or build problems. -Mlist will generate a listing. -Minfo provides feedback on optimizations made by the compiler. -S or -Mkeepasm to see the exact assembly generated.
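A sketch of how these debugging aids might be combined (file names are placeholders):

```sh
# Symbolic debug info; -gopt keeps debug info without disabling optimization
pgf90 -g myprog.f90
pgf90 -gopt -fastsse myprog.f90

# Array bounds checking, and keeping the generated assembly for inspection
pgf90 -Mbounds myprog.f90
pgf90 -fastsse -Mkeepasm myprog.f90
```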

  • Slide 7/41

    Basic optimization switches

    Traditional optimization is controlled through -O[n], where n is 0 to 4. The -fast switch combines a common set into one simple switch, and is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre.

    For -Munroll, c specifies completely unrolling loops with this loop count or less. -Munroll=n:m says unroll other loops m times. -Mlre is loop-carried redundancy elimination.
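The equivalence stated above can be written out as two command lines that request the same optimizations (myprog.f90 is a placeholder file name):

```sh
pgf90 -fast myprog.f90
pgf90 -O2 -Munroll=c:1 -Mnoframe -Mlre myprog.f90
```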

  • Slide 8/41

    Basic optimization switches, cont.

    The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization. -fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz.

    -Mcache_align aligns top-level arrays and objects on cache-line boundaries.

    -Mflushz flushes SSE denormal numbers to zero.

  • Slide 9/41

    Node-level tuning

    Vectorization: packed SSE instructions maximize performance. Interprocedural Analysis (IPA): use it! (motivating examples follow). Function inlining: especially important for C and C++. Parallelization: for Cray multi-core processors. Miscellaneous optimizations: hit or miss, but worth a try.

  • Slide 10/41

    What can Interprocedural Analysis and Optimization with -Mipa do for you?

    Interprocedural constant propagation. Pointer disambiguation. Alignment detection and alignment propagation. Global variable mod/ref detection. F90 shape propagation. Function inlining. IPA optimization of libraries, including inlining.

  • Slide 11/41

    Effect of IPA on the WUPWISE Benchmark

    PGF95 Compiler Options          Execution Time in Seconds
    -fastsse                        156.49
    -fastsse -Mipa=fast             121.65
    -fastsse -Mipa=fast,inline       91.72

    -Mipa=fast => constant propagation => the compiler sees the complex matrices are all 4x3 => completely unrolls loops

    -Mipa=fast,inline => small matrix multiplies are all inlined

  • Slide 12/41

    Using Interprocedural Analysis

    Must be used at both compile time and link time. Non-disruptive to the development process (edit/build/run). Speed-ups of 5% - 10% are common.

    -Mipa=safe:<name> - safe to optimize functions which call or are called from unknown function/library <name>

    -Mipa=libopt - perform IPA optimizations on libraries. -Mipa=libinline - perform IPA inlining from libraries.
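Because -Mipa must appear at both compile time and link time, a multi-file build might look like this sketch (file names are placeholders):

```sh
# IPA flags on every compile step...
pgf90 -fastsse -Mipa=fast,inline -c a.f90
pgf90 -fastsse -Mipa=fast,inline -c b.f90

# ...and again on the link step, where the interprocedural work happens
pgf90 -fastsse -Mipa=fast,inline a.o b.o -o myprog
```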

  • Slide 13/41

    Explicit Function Inlining

    -Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]

    [lib:]<inlib>   Inline extracted functions from <inlib>
    [name:]<func>   Inline function <func>
    except:<func>   Do not inline function <func>
    size:<n>        Inline only functions smaller than <n> statements (approximate)
    levels:<n>      Inline <n> levels of functions

    For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!
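A sketch combining two of the sub-options above (the size threshold of 50 and the file name are illustrative choices, not from the slide):

```sh
# Inline functions smaller than ~50 statements, up to 10 call levels deep
pgcpp -fast -Minline=size:50,levels:10 myprog.cpp
```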

  • Slide 14/41

    Other C++ recommendations

    Encapsulation, data hiding: small functions, inline! Exception handling: use --no_exceptions until 7.0. Overloaded operators, overloaded functions: okay. Pointer chasing: -Msafeptr, the restrict qualifier, 32 bits? Templates, generic programming: now okay. Inheritance, polymorphism, virtual functions: runtime lookup or check, no inlining, potential performance penalties.

  • Slide 15/41

    SMP Parallelization

    -Mconcur for auto-parallelization on multi-core. The compiler strives for parallel outer loops and vector SSE inner loops. -Mconcur=innermost forces a vector/parallel innermost loop. -Mconcur=cncall enables parallelization of loops with calls.

    -mp to enable the OpenMP 2.5 parallel programming model. See the PGI User's Guide or the OpenMP 2.5 standard. OpenMP programs should be compiled with -mp=nonuma.

    -Mconcur and -mp can be used together!
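A sketch of combining the two models as the slide suggests (file name is a placeholder; the thread count is an example):

```sh
# OpenMP without the NUMA support library, plus auto-parallelization
pgf90 -fastsse -mp=nonuma -Mconcur myprog.f90

# Thread count is chosen at run time
export OMP_NUM_THREADS=4
```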


  • Slide 25/41

    Introduction to the Cray compiler

    Examples: GTC, Overflow, PARQUET

    Cray Inc. Confidential Slide 25

  • Slide 26/41

    Cray has a long tradition of high performance compilers: vectorization, parallelization, code transformation, and more.

    Began an internal investigation leveraging an open source compiler called LLVM.

    Initial results and progress were better than expected. Decided to move forward with a Cray X86 compiler. 7.0 was released in December 2008; 7.1 will be released Q2 2009.

  • Slide 27/41

    [Diagram: Cray Inc. Compiler Technology. Fortran source and C/C++ source enter a Fortran front end and a C & C++ front end (the C & C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support). These feed Interprocedural Analysis, then Optimization and Parallelization, then either the X86 code generator (X86 code generation from open source LLVM, with additional Cray-developed optimizations and interface support) or the Cray X2 code generator, producing an object file.]

  • Slide 28/41

    Make sure it is available: module avail PrgEnv-cray

    To access the Cray compiler: module load PrgEnv-cray

    To target the Barcelona chip: module load xtpe-quadcore

    Once you have loaded the module, cc and ftn are the Cray compilers.

    Recommend just using the default options. Use -rm (Fortran) and -hlist=m (C) to find out what happened. man crayftn
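The setup steps above, as a sketch of a session on a Cray XT login node (file names are placeholders):

```sh
# Check that CCE is installed, switch to it, and target Barcelona
module avail PrgEnv-cray
module load PrgEnv-cray
module load xtpe-quadcore

# ftn and cc now invoke the Cray compilers; ask for loopmark listings
ftn -rm myprog.f90
cc -hlist=m myprog.c

# Documentation
man crayftn
```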

  • Slide 29/41

    Excellent vectorization: vectorizes more loops than other compilers.

    OpenMP 2.0 standard, with nesting. PGAS: functional UPC and CAF available today.

    Excellent cache optimizations: automatic blocking, automatic management of what stays in cache, prefetching, interchange, fusion, and much more.

  • Slide 30/41

    C++ support. Automatic parallelization: a modernized version of the Cray X1 streaming capability that interacts with OMP directives. OpenMP 3.0. Optimized PGAS: will require the Gemini network to really go fast. Improved vectorization. Improved cache optimizations.

  • Slide 31/41

    Plasma fusion simulation: a 3D particle-in-cell (PIC) code in toroidal geometry, developed by Prof. Zhihong Lin (now at UC Irvine).

    The code has several different characteristics: stride-1 copies; strided memory operations; computationally intensive sections; gather/scatter; sorting and packing.

    The main routine is known as the pusher.

  • Slide 32/41

    The main pusher kernel consists of 2 main loop nests. The first loop nest contains groups of 4 statements which include significant indirect addressing:

    e1=e1+wp0*wt00*(wz0*gradphi(1,0,ij)+wz1*gradphi(1,1,ij))
    e2=e2+wp0*wt00*(wz0*gradphi(2,0,ij)+wz1*gradphi(2,1,ij))
    e3=e3+wp0*wt00*(wz0*gradphi(3,0,ij)+wz1*gradphi(3,1,ij))
    e4=e4+wp0*wt00*(wz0*phit(0,ij)+wz1*phit(1,ij))

    Turn the 4 statements into 1 vector short loop:

    ev(1:4)=ev(1:4)+wp0*wt00*(wz0*tempphi(1:4,0,ij)+wz1*tempphi(1:4,1,ij))

    The second loop is large and computationally intensive, but contains strided loads and a computed gather. CCE automatically vectorizes the loop.

  • Slide 33/41

    [Chart: GTC Pusher performance, 3200 MPI ranks and 4 OMP threads. Y-axis: billion particles pushed/sec (5.0 to 40.0). Series: CCE vs. previous best.]

  • Slide 34/41

    [Chart: GTC performance, 3200 MPI ranks and 4 OMP threads. Y-axis: billion particles pushed/sec (2.0 to 16.0). Series: CCE vs. previous best.]

  • Slide 35/41

    Overflow is a NASA-developed Navier-Stokes flow solver for unstructured grids.

    Subroutines consist of two or three simply-nested loops. Inner loops tend to be highly vectorized and have 20-50 Fortran statements.

    MPI is used for parallel processing. The solver automatically splits grid blocks for load balancing. Scaling is limited due to load balancing at > 1024.

    The code is threaded at a high level via OpenMP.

  • Slide 36/41

    [Chart: Overflow scaling. X-axis: number of cores (256 to 8192); y-axis: time in seconds (256 to 4096). Series: Previous MPI, CCE MPI, CCE OMP 2 threads, CCE OMP 4 threads.]

  • Slide 37/41

    A materials science code. Scales to 1000s of MPI ranks before it runs out of parallelism. Want to use shared memory parallelism across the entire node. The main kernel consists of 4 independent zgemms. Want to use multi-level OMP to scale across the node.

  • Slide 38/41

    !$omp parallel do
    do i=1,4
       call complex_matmul()
    enddo

    subroutine complex_matmul()

    !$omp parallel do private(j,jend,jsize) ! num_threads(p2)
    do j=1,n,nb
       jend = min(n, j+nb-1)
       jsize = jend - j + 1
       call zgemm( transA,transB, m,jsize,k, &
                   alpha,A,ldA,B(j,1),ldB, beta,C(1,j),ldC)
    enddo

  • Slide 39/41

    [Chart: ZGEMM 1000x1000. Y-axis: GFlops (0 to 80). X-axis: parallel method and number of threads at each level: Serial ZGEMM; High-level OMP ZGEMM 4x1; Nested OMP ZGEMM 3x3; Nested OMP ZGEMM 4x2; Nested OMP ZGEMM 2x4; Low-level OMP ZGEMM 1x8.]

  • Slide 40/41

    [Chart: ZGEMM 100x100. Y-axis: GFlops (0 to 35). X-axis: parallel method and number of threads at each level: Serial ZGEMM; High-level OMP ZGEMM 4x1; Nested OMP ZGEMM 3x3; Nested OMP ZGEMM 4x2; Low-level ZGEMM 1x8.]

  • Slide 41/41

    The Cray Compiling Environment is a new, different, and interesting compiler with several unique capabilities.

    Several codes are already taking advantage of CCE. Development is ongoing. Consider trying CCE if you think you could take advantage of its capabilities.