Basics Compilers
-
Upload
robotu-de-serviciu -
Category
Documents
-
view
240 -
download
0
Transcript of Basics Compilers
-
7/29/2019 Basics Compilers
1/41
Compilers
-
7/17/09 2
PGI vs. PathScale

PGI
  Recommended first compile/run: -fastsse -tp barcelona-64
  Get diagnostics: -Minfo -Mneginfo
  Inlining: -Mipa=fast,inline
  Recognize OpenMP directives: -mp=nonuma
  Automatic parallelization: -Mconcur

PathScale
  Recommended first compile/run: ftn -O3 -OPT:Ofast -march=barcelona
  Get diagnostics: -LNO:simd_verbose=ON
  Inlining: -ipa
  Recognize OpenMP directives: -mp
  Automatic parallelization: -apo
-
PGI Basic Compiler Usage
A compiler driver interprets options and invokes preprocessors, compilers, assembler, linker, etc.
Options precedence: if options conflict, the last option on the command line takes precedence.
Use -Minfo to see a listing of optimizations and transformations performed by the compiler.
Use -help to list all options or see details on how to use a given option, e.g. pgf90 -Mvect -help
Use man pages for more details on options, e.g. man pgf90
Use -v to see under the hood
-
Flags to support language dialects
Fortran: pgf77, pgf90, pgf95, pghpf tools
  Suffixes: .f, .F, .for, .fpp, .f90, .F90, .f95, .F95, .hpf, .HPF
  -Mextend, -Mfixed, -Mfreeform
  Type size: -i2, -i4, -i8, -r4, -r8, etc.
  -Mcray, -Mbyteswapio, -Mupcase, -Mnomain, -Mrecursive, etc.
C/C++: pgcc, pgCC (aka pgcpp)
  Suffixes: .c, .C, .cc, .cpp, .i
  -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
  -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs
-
Specifying the target architecture
Use the -tp switch (not needed for Dual Core):
  -tp k8-64 or -tp p7-64 or -tp core2-64 for 64-bit code
  -tp amd64e for AMD Opteron rev E or later
  -tp x64 for a unified binary
  -tp k8-32, k7, p7, piv, piii, p6, p5, px for 32-bit code
  -tp barcelona-64
-
Flags for debugging aids
  -g generates symbolic debug information used by a debugger
  -gopt generates debug information in the presence of optimization
  -Mbounds adds array bounds checking
  -v gives verbose output, useful for debugging system or build problems
  -Mlist will generate a listing
  -Minfo provides feedback on optimizations made by the compiler
  -S or -Mkeepasm to see the exact assembly generated
-
Basic optimization switches
Traditional optimization is controlled through -O[n], where n is 0 to 4.
The -fast switch combines a common set into one simple switch; it is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre
  For -Munroll, c specifies completely unrolling loops with this loop count or less
  -Munroll=n: says unroll other loops m times
  -Mlre is loop-carried redundancy elimination
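As a sketch of what complete unrolling does (function names here are illustrative, not PGI output): a loop whose trip count is known at compile time, as targeted by -Munroll=c:<n>, is replaced by straight-line code.

```c
#include <assert.h>

/* Illustration of complete loop unrolling: what the compiler can do
 * to a fixed-trip-count loop when -Munroll=c:4 (or higher) applies. */
int sum4_loop(const int *a) {
    int s = 0;
    for (int i = 0; i < 4; i++)   /* trip count known at compile time */
        s += a[i];
    return s;
}

int sum4_unrolled(const int *a) {
    /* the unrolled form: no branch, no index arithmetic per element */
    return a[0] + a[1] + a[2] + a[3];
}
```

Both forms compute the same result; the unrolled one simply trades code size for the removal of loop overhead.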
-
Basic optimization switches, cont.
The -fastsse switch is commonly used; it extends -fast to SSE hardware and vectorization.
-fastsse is equal to -O2 -Munroll=c:1 -Mnoframe -Mlre (-fast) plus -Mvect=sse, -Mscalarsse, -Mcache_align, -Mflushz
  -Mcache_align aligns top-level arrays and objects on cache-line boundaries
  -Mflushz flushes SSE denormal numbers to zero
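For context on -Mflushz, a minimal sketch (assuming the default x86 environment, where flush-to-zero is off) of the kind of value it affects: a subnormal (denormal) double, which hardware often handles at a large speed penalty.

```c
#include <assert.h>
#include <math.h>

/* Sketch: 1e-310 is below DBL_MIN (~2.2e-308), so it is stored as a
 * subnormal.  With flush-to-zero enabled (as -Mflushz arranges for
 * SSE code), such results become 0.0 instead. */
int produces_subnormal(void) {
    volatile double tiny = 1e-310;   /* subnormal double */
    return fpclassify(tiny) == FP_SUBNORMAL;
}
```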
-
Node-level tuning
  Vectorization: packed SSE instructions maximize performance
  Interprocedural Analysis (IPA): use it! (motivating examples)
  Function inlining: especially important for C and C++
  Parallelization: for Cray multi-core processors
  Miscellaneous optimizations: hit or miss, but worth a try
-
What can Interprocedural Analysis and Optimization with -Mipa do for You?
  Interprocedural constant propagation
  Pointer disambiguation
  Alignment detection, alignment propagation
  Global variable mod/ref detection
  F90 shape propagation
  Function inlining
  IPA optimization of libraries, including inlining
-
Effect of IPA on the WUPWISE Benchmark

PGF95 Compiler Options         Execution Time in Seconds
-fastsse                       156.49
-fastsse -Mipa=fast            121.65
-fastsse -Mipa=fast,inline      91.72

-Mipa=fast => constant propagation => compiler sees complex matrices are all 4x3 => completely unrolls loops
-Mipa=fast,inline => small matrix multiplies are all inlined
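The WUPWISE effect can be sketched in C (hypothetical names, not the benchmark's actual code): once IPA propagates a constant dimension from the caller into a routine, the loop trip count becomes known and the compiler can unroll it completely.

```c
#include <assert.h>

/* Hypothetical sketch of IPA constant propagation: after IPA the
 * compiler knows every caller passes n == 12, so the loop in scale()
 * has a known trip count and can be fully unrolled. */
static void scale(double *m, int n, double f) {
    for (int i = 0; i < n; i++)
        m[i] *= f;
}

double sum_scaled(void) {
    double m[12];                 /* a 4x3 complex matrix flattened */
    for (int i = 0; i < 12; i++) m[i] = 1.0;
    scale(m, 12, 2.0);            /* constant n visible across the call */
    double s = 0.0;
    for (int i = 0; i < 12; i++) s += m[i];
    return s;
}
```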
-
Using Interprocedural Analysis
  Must be used at both compile time and link time
  Non-disruptive to the development process: edit/build/run
  Speed-ups of 5% - 10% are common
  -Mipa=safe:<name> - safe to optimize functions which call or are called from an unknown function/library name
  -Mipa=libopt - perform IPA optimizations on libraries
  -Mipa=libinline - perform IPA inlining from libraries
-
Explicit Function Inlining
-Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]
  [lib:]<inlib>   Inline extracted functions from inlib
  [name:]<func>   Inline function func
  except:<func>   Do not inline function func
  size:<n>        Inline only functions smaller than n statements (approximate)
  levels:<n>      Inline n levels of functions
For C++ codes, PGI recommends IPA-based inlining or -Minline=levels:10!
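A small sketch of the kind of function these options target (names are illustrative): a tiny accessor that a size- or level-based inlining rule would pull into its caller, removing per-element call overhead.

```c
#include <assert.h>

/* Sketch of an inlining candidate: small enough that a rule like
 * -Minline=size:10 (illustrative) would inline it at every call site. */
static inline int clamp(int x, int lo, int hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

int saturate_sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += clamp(a[i], 0, 100);   /* call overhead disappears after inlining */
    return s;
}
```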
-
Other C++ recommendations
  Encapsulation, data hiding - small functions, inline!
  Exception handling - use --no_exceptions until 7.0
  Overloaded operators, overloaded functions - okay
  Pointer chasing - -Msafeptr, restrict qualifier, 32 bits?
  Templates, generic programming - now okay
  Inheritance, polymorphism, virtual functions - runtime lookup or check, no inlining, potential performance penalties
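On the pointer-chasing point, a minimal C99 sketch of what restrict (and, by a stronger global assertion, -Msafeptr) buys the compiler: with the no-overlap guarantee it can vectorize the loop without runtime alias checks.

```c
#include <assert.h>

/* With restrict, the compiler may assume dst and src never overlap,
 * so the loop can be vectorized without alias checks.  -Msafeptr
 * makes a similar (whole-file) assertion without source changes. */
void axpy(int n, double a,
          double * restrict dst, const double * restrict src) {
    for (int i = 0; i < n; i++)
        dst[i] += a * src[i];
}
```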
-
SMP Parallelization
  -Mconcur for auto-parallelization on multi-core
    Compiler strives for parallel outer loops, vector SSE inner loops
    -Mconcur=innermost forces a vector/parallel innermost loop
    -Mconcur=cncall enables parallelization of loops with calls
  -mp to enable the OpenMP 2.5 parallel programming model
    See the PGI User's Guide or the OpenMP 2.5 standard
    OpenMP programs compiled w/out -mp=nonuma
  -Mconcur and -mp can be used together!
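A minimal OpenMP 2.5-style loop of the kind -mp enables (the function is illustrative). Compiled without an OpenMP flag the pragma is simply ignored and the code runs serially with identical results.

```c
#include <assert.h>

/* Classic OpenMP reduction loop: each thread accumulates a private
 * partial sum, and the reduction clause combines them at the end.
 * Without an OpenMP compile flag the pragma is ignored (serial run). */
double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    int i;
#pragma omp parallel for reduction(+:s)
    for (i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```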
-
Introduction to the Cray compiler
Examples: GTC, Overflow, PARQUET
Cray Inc. Confidential Slide 25
-
Cray has a long tradition of high performance compilers: vectorization, parallelization, code transformation, and more
Began an internal investigation leveraging an open source compiler called LLVM
Initial results and progress better than expected; decided to move forward with the Cray X86 compiler
7.0 released in December 2008; 7.1 will be released Q2 2009
-
[Block diagram: Cray Inc. Compiler Technology. Fortran source and C/C++ source enter the Fortran and C & C++ front ends, flow through Interprocedural Analysis and Optimization and Parallelization, then through the X86 or Cray X2 code generator to produce an object file.]
The C and C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support.
X86 code generation comes from open source LLVM, with additional Cray-developed optimizations and interface support.
-
Make sure it is available: module avail PrgEnv-cray
To access the Cray compiler: module load PrgEnv-cray
To target the Barcelona chip: module load xtpe-quadcore
Once you have loaded the module, cc and ftn are the Cray compilers
Recommend just using default options
Use -rm (Fortran) and -hlist=m (C) to find out what happened
man crayftn
-
Excellent vectorization: vectorizes more loops than other compilers
OpenMP 2.0 standard, with nesting
PGAS: functional UPC and CAF available today
Excellent cache optimizations
  Automatic blocking
  Automatic management of what stays in cache
Prefetching, interchange, fusion, and much more
-
C++ support
Automatic parallelization: modernized version of the Cray X1 streaming capability; interacts with OMP directives
OpenMP 3.0
Optimized PGAS: will require the Gemini network to really go fast
Improved vectorization
Improved cache optimizations
-
Plasma fusion simulation: 3D particle-in-cell code (PIC) in toroidal geometry
Developed by Prof. Zhihong Lin (now at UC Irvine)
Code has several different characteristics:
  Stride-1 copies
  Strided memory operations
  Computationally intensive
  Gather/scatter
  Sorting and packing
Main routine is known as the pusher
-
Main pusher kernel consists of 2 main loop nests
First loop nest contains groups of 4 statements which include significant indirect addressing:

e1=e1+wp0*wt00*(wz0*gradphi(1,0,ij)+wz1*gradphi(1,1,ij))
e2=e2+wp0*wt00*(wz0*gradphi(2,0,ij)+wz1*gradphi(2,1,ij))
e3=e3+wp0*wt00*(wz0*gradphi(3,0,ij)+wz1*gradphi(3,1,ij))
e4=e4+wp0*wt00*(wz0*phit(0,ij)+wz1*phit(1,ij))

Turn the 4 statements into 1 vector shortloop:

ev(1:4)=ev(1:4)+wp0*wt00*(wz0*tempphi(1:4,0,ij)+wz1*tempphi(1:4,1,ij))

Second loop is large and computationally intensive, but contains strided loads and a computed gather; CCE automatically vectorizes the loop
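The shortloop rewrite above can be rendered in C as well (hypothetical names, simplified to plain arrays): four scalar update statements become one short, vectorizable loop over packed temporaries.

```c
#include <assert.h>

/* Four scalar updates, as in the original GTC kernel (simplified). */
void update_scalar(double e[4], double wp0, double wt00,
                   double wz0, double wz1,
                   const double g0[4], const double g1[4]) {
    e[0] += wp0 * wt00 * (wz0 * g0[0] + wz1 * g1[0]);
    e[1] += wp0 * wt00 * (wz0 * g0[1] + wz1 * g1[1]);
    e[2] += wp0 * wt00 * (wz0 * g0[2] + wz1 * g1[2]);
    e[3] += wp0 * wt00 * (wz0 * g0[3] + wz1 * g1[3]);
}

/* The same work as one short vector loop the compiler can map to
 * packed SSE instructions. */
void update_shortloop(double e[4], double wp0, double wt00,
                      double wz0, double wz1,
                      const double g0[4], const double g1[4]) {
    for (int i = 0; i < 4; i++)
        e[i] += wp0 * wt00 * (wz0 * g0[i] + wz1 * g1[i]);
}
```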
-
[Bar chart: GTC pusher performance, 3200 MPI ranks and 4 OMP threads; y-axis: billion particles pushed/sec (5.0-40.0); CCE vs. previous best.]
-
[Bar chart: GTC performance, 3200 MPI ranks and 4 OMP threads; y-axis: billion particles pushed/sec (2.0-16.0); CCE vs. previous best.]
-
Overflow is a NASA-developed Navier-Stokes flow solver for unstructured grids
Subroutines consist of two or three simply-nested loops
Inner loops tend to be highly vectorized and have 20-50 Fortran statements
MPI is used for parallel processing
Solver automatically splits grid blocks for load balancing
Scaling is limited due to load balancing at > 1024
Code is threaded at a high level via OpenMP
-
[Line chart: Overflow scaling; x-axis: number of cores (256-8192); y-axis: time in seconds (256-4096); series: Previous MPI, CCE MPI, CCE OMP 2 threads, CCE OMP 4 threads.]
-
Materials science code
Scales to 1000s of MPI ranks before it runs out of parallelism
Want to use shared memory parallelism across the entire node
Main kernel consists of 4 independent zgemms
Want to use multi-level OMP to scale across the node
-
!$omp parallel do
do i=1,4
   call complex_matmul()
enddo

subroutine complex_matmul()
!$omp parallel do private(j,jend,jsize) num_threads(p2)
do j=1,n,nb
   jend  = min(n, j+nb-1)
   jsize = jend - j + 1
   call zgemm( transA, transB, m, jsize, k, &
               alpha, A, ldA, B(j,1), ldB, beta, C(1,j), ldC )
enddo
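The same multi-level strategy can be sketched in C with OpenMP (all names and sizes hypothetical; a simple scaling loop stands in for zgemm): an outer parallel loop over the 4 independent multiplies, each opening an inner parallel loop over column blocks. Without OpenMP flags the pragmas are ignored and the results are unchanged.

```c
#include <assert.h>

#define N  8   /* columns per stand-in "matrix" */
#define NB 4   /* column block size */

/* Stand-in for a zgemm on columns j0 .. j0+jsize-1 of one matrix. */
static void block_scale(double *c, int j0, int jsize) {
    for (int j = j0; j < j0 + jsize; j++)
        c[j] *= 2.0;
}

/* Nested OpenMP: outer loop over 4 independent kernels, inner loop
 * over column blocks within each kernel. */
void scale_all(double *c) {
#pragma omp parallel for
    for (int m = 0; m < 4; m++) {
        double *mat = c + m * N;
#pragma omp parallel for
        for (int j = 0; j < N; j += NB)
            block_scale(mat, j, NB);
    }
}
```

Note that, as with the Fortran original, how many threads serve each level is a tuning choice (the num_threads clause); the charts that follow compare several such splits.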
-
[Bar chart: ZGEMM 1000x1000; y-axis: GFlops (0-80); x-axis: parallel method and number of threads at each level: Serial ZGEMM, High Level OMP ZGEMM 4x1, Nested OMP ZGEMM 3x3, Nested OMP ZGEMM 4x2, Nested OMP ZGEMM 2x4, Low level OMP ZGEMM 1x8.]
-
[Bar chart: ZGEMM 100x100; y-axis: GFlops (0-35); x-axis: parallel method and number of threads at each level: Serial ZGEMM, High Level OMP ZGEMM 4x1, Nested OMP ZGEMM 3x3, Nested OMP ZGEMM 4x2, Low Level ZGEMM 1x8.]
-
The Cray Compiling Environment is a new, different, and interesting compiler with several unique capabilities
Several codes are already taking advantage of CCE
Development is ongoing
Consider trying CCE if you think you could take advantage of its capabilities