
Intel® C++ Compiler 14.0 within Intel System Studio


Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

What you will learn from this slide deck

• Intel® C++ Compiler 14.0 technical training for System & Application code running Linux*, Android* & Tizen™

• In-depth explanation of compiler specifics for each development environment mentioned above

• Please see subsequent slide decks for in-depth technical training on other components



Compatibility to Standards

The Intel C++ Compiler conforms to the following language standards:

• ANSI/ISO standard for C language compilation (ISO/IEC 9899:1990)

• ANSI/ISO standard for the C++ language (ISO/IEC 14882:1998)


Common Optimization Switches (Linux*)

Disable optimization                               -O0
Optimize for speed (no code size increase)         -O1
Optimize for speed (default)                       -O2
High-level loop optimization                       -O3
Create symbols for debugging                       -g
Multi-file interprocedural optimization            -ipo
Profile-guided optimization (multi-step build)     -prof-gen, -prof-use
Optimize for speed across the entire program       -fast (same as: -ipo -O3 -no-prec-div -static -xHost)

**Warning: the definition of -fast changes over time.


Optimizations for latest generation Intel® Atom™ Processor


• Processor specific light-weight out-of-order instruction scheduler optimization

• Processor specific cache management and memory preload optimizations

• Loop optimizations and vectorizer taking advantage of SSE4.2 vector instructions.

Processor-specific Compiler Optimizations

Linux* switches:

-xSSE2      -xCORE-AVX2
-xSSE3      -xCORE-AVX-I
-xSSSE3     -xATOM_SSSE3
-xSSE4.1    -xATOM_SSE4.2
-xSSE4.2
-xAVX
-xHost

These switches imply an Intel CPU ID check: a runtime message is issued if you try to run the program on an unsupported processor.


Additional new compiler features

• New vectorization report level, specified with -vec-report7
  • Adds information about the quality of the generated vector code and does not output the text of messages
  • Report data is processed with a script to produce a summary report and a text file that intersperses vectorization messages and user code
  • Analysis script available at http://software.intel.com/en-us/articles/vecanalysis-python-script-for-annotating-intelr-compiler-vectorization-report
  • Requires Python 2.6.5 or newer

• -vec-report6 now gives alignment information





















The Seven Steps of Optimization




1. Build with optimization disabled
2. Use general optimizations
3. Use processor-specific options
4. Add interprocedural optimizations
5. Use profile-guided optimization
6. Tune automatic vectorization
7. Multithread your application




General optimization options

• -O1

• Optimize code size, auto vectorization is turned off

• -O2

• Inlining

• vectorization

• -O3

• Loop optimization

• data pre-fetching

• -ansi-alias / -restrict / -no-prec-div














ICC Atom Specific optimization

• Optimization switch -xSSSE3_ATOM for Saltwell

– Intel Supplemental Streaming SIMD Extensions 3 (SSSE3)

– In Order

– Use of LEA for stack operations

– Instruction reordering

– Support for movbe instruction (-minstruction=movbe)

– Only use it for system development

• Optimization switch -xatom_sse4.2 for Silvermont

– Intel® Streaming SIMD Extensions 4.2 (SSE4.2)

– Out of order

GCC note: use -mtune=atom or -mtune=slm, not -march=atom or -march=slm



SIMD Instruction Enhancements

SSE      70 instr     Single-precision vectors; streaming operations
SSE2     144 instr    Double-precision vectors; 8/16/32/64/128-bit vector integer
SSE3     13 instr     Complex data
SSSE3    32 instr     Decode
SSE4.1   47 instr     Video; graphics building blocks; advanced vector instr
SSE4.2   8 instr      String/XML processing; POP-Count; CRC
AES-NI   7 instr      Encryption and decryption; key generation
AVX      ~100 new instr; ~300 legacy SSE instr updated; 256-bit vector; 3- and 4-operand instructions

Intel® Atom™ (Saltwell): up to SSSE3
Intel® Atom™ (Silvermont): up to SSE4.2


SIMD Instruction Enhancements (2)

addss: Scalar Single-FP Add (single-precision FP data, scalar execution mode)

    x4      x3      x2      x1
  + y4      y3      y2      y1
  = x4      x3      x2      x1+y1

addps: Packed Single-FP Add (single-precision FP data, packed execution mode)

    x4      x3      x2      x1
  + y4      y3      y2      y1
  = x4+y4   x3+y3   x2+y2   x1+y1
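The difference between the scalar and packed forms can be tried out with the SSE intrinsics `_mm_add_ss` and `_mm_add_ps`, which typically map to `addss` and `addps`. The following is a minimal sketch, not taken from the slides: the function names `add_scalar_ss` and `add_packed_ps` are hypothetical, and the plain-C fallback for compilers that do not define `__SSE__` is an assumption added only so the sketch builds everywhere. Array element 0 corresponds to x1/y1 in the diagram.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: _mm_add_ss, _mm_add_ps */
#endif

/* Scalar mode (addss): only the lowest element is added;
 * the upper three elements are copied unchanged from x. */
void add_scalar_ss(const float x[4], const float y[4], float out[4])
{
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ss(vx, vy));
#else
    /* Portable fallback (assumption): same result without SSE. */
    out[0] = x[0] + y[0];
    for (size_t i = 1; i < 4; i++)
        out[i] = x[i];
#endif
}

/* Packed mode (addps): all four elements are added at once. */
void add_packed_ps(const float x[4], const float y[4], float out[4])
{
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));
#else
    /* Portable fallback (assumption): element-wise add. */
    for (size_t i = 0; i < 4; i++)
        out[i] = x[i] + y[i];
#endif
}
```

The unaligned load/store intrinsics are used so the sketch makes no alignment assumptions about the caller's arrays.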



Approaches to introduce vectorization

From most programmer control (top) to greatest ease of use (bottom):

• Assembler code (addps)
• Vector intrinsic (_mm_add_ps())
• User-mandated vectorization (SIMD directive)
• Compiler: auto-vectorization hints (#pragma ivdep, …)
• Cilk Plus Array Notation
• Compiler: fully automatic vectorization














Interprocedural Optimizations
Extends optimizations across file boundaries

Without IPO: file1.c, file2.c, file3.c and file4.c are each compiled and optimized separately.

With IPO: file1.c, file2.c, file3.c and file4.c are compiled and optimized together.

-ip     Only between modules of one source file
-ipo    Modules of multiple files/whole application



Interprocedural Optimizations (IPO) Usage: Two-Step Process

Pass 1, compiling (produces mock object files):

  icc -c -ipo main.c
  icc -c -ipo func1.c
  icc -c -ipo func2.c

Pass 2, linking (produces the executable):

  icc -ipo main.o func1.o func2.o



What you should know about IPO

• O2 and O3 activate “almost” file-local IPO (-ip)

• IPO extends compilation time and memory usage

• IPO works for libraries too

• In-lining of functions is most important feature of IPO but there is much more














Profile-Guided Optimizations (PGO)

Static analysis is limited:

• How often is x > y

• What is the size of count

• Which code is touched how often

Enhancements with PGO:

• More accurate branch prediction

• Better decision of functions to inline (help IPO)

• Basic block movement to improve instruction cache behavior

if (x > y) do_this(); else do_that();

for (i = 0; i < count; ++i)
    do_work();



PGO Usage: Three-Step Process

Step 1: Compile + link to add instrumentation
  icc -prof-gen prog.c
  Result: instrumented executable prog.exe

Step 2: Execute the instrumented program
  prog.exe (on a typical dataset)
  Result: dynamic profile 12345678.dyn

Step 3: Compile + link using feedback
  icc -prof-use prog.c
  Merged .dyn files: pgopti.dpi
  Result: optimized executable prog.exe














How do I know if a loop was vectorized?

• -vec-report[n]

  > icc -vec-report MultArray.c
  MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.

• Always vectorize if safe:

  #pragma vector always [assert]

• Always vectorize:

  #pragma simd



GAP – Guided Automatic Parallelization

• Use compiler infrastructure to help developer

– Vectorization, parallelization and data transformations

– Extend diagnostic message for failed vectorization and parallelization by specific hints to fix problem

• Exploit multi-year experience brought into the compiler development

– Performance tuning knowledge based on dealing with numerous applications, benchmarks and compute kernels

• Does not influence code generation



GAP – How it Works Compiler Switches for GAP [1]

Activate GAP and optionally define guidance level

• -guide[=level]

• -guide-vec[=level]

• -guide-par[=level]

• -guide-data-trans[=level]

• Optional argument level=1,2,3,4 controls the extent of analysis: '4' is the most advanced / most detailed and is the default



Vectorization Example [1]

void f(int n, float *x, float *y, float *z, float *d1, float *d2) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i] - (d1[i]*d2[i]);
}

GAP Message:

g.c(6): remark #30536: (LOOP) Add -no-alias-args option for better type-based disambiguation analysis by the compiler, if appropriate (the option will apply for the entire compilation). This will improve optimizations such as vectorization for the loop at line 6. [VERIFY] Make sure that the semantics of this option is obeyed for the entire compilation. [ALTERNATIVE] Another way to get the same effect is to add the "restrict" keyword to each pointer-typed formal parameter of the routine "f". This allows optimizations such as vectorization to be applied to the loop at line 6. [VERIFY] Make sure that semantics of the "restrict" pointer qualifier is satisfied: in the routine, all data accessed through the pointer must not be accessed through any other
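The [ALTERNATIVE] in the GAP message can be sketched as follows: the same loop with the C99 `restrict` qualifier on every pointer parameter. The name `f_restrict` is hypothetical and only used to keep it apart from the original `f`. The qualifier is a promise made by the caller, so it is only valid if the five arrays really never overlap. With the Intel compiler this presumably needs C99 mode or the -restrict switch listed earlier.

```c
/* Same computation as f(), but every pointer parameter is
 * restrict-qualified: the caller promises that x, y, z, d1 and d2
 * refer to non-overlapping memory, so the compiler does not have to
 * assume that a store to z[i] may change a later x[], y[], d1[] or
 * d2[] load. */
void f_restrict(int n, float *restrict x, float *restrict y,
                float *restrict z, float *restrict d1,
                float *restrict d2)
{
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i] - (d1[i] * d2[i]);
}
```

Unlike -no-alias-args, which applies to the entire compilation, the qualifier only affects this one routine, which makes the [VERIFY] step easier to carry out.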



Vectorization Example [2]

void mul(NetEnv* ne, Vector* rslt,
         Vector* den, Vector* flux1,
         Vector* flux2, Vector* num)
{
    float *r, *d, *n, *s1, *s2;
    int i;

    r=rslt->data; d=den->data;
    n=num->data; s1=flux1->data;
    s2=flux2->data;

    for (i = 0; i < ne->len; ++i)
        r[i] = s1[i]*s2[i] +
               n[i]*d[i];
}

GAP Messages (simplified):

1. "Use a local variable to host the upper-bound of loop at line 29 (variable: ne->len) if the upper-bound does not change during execution of the loop"

2. "Use "#pragma ivdep" to help vectorize the loop at line 29, if these arrays in the loop do not have cross-iteration dependencies: r, s1, s2, n, d"
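Applying both hints to `mul` might look like the sketch below. The `NetEnv` and `Vector` definitions are hypothetical stand-ins, since the slide does not show them, and the name `mul_tuned` is likewise made up. The `#pragma ivdep` is guarded so that compilers other than the Intel compiler do not see an unknown pragma. As the message says, the pragma is only correct if r, s1, s2, n and d have no cross-iteration dependencies.

```c
/* Hypothetical minimal versions of the slide's types. */
typedef struct { int len; } NetEnv;
typedef struct { float *data; } Vector;

void mul_tuned(NetEnv *ne, Vector *rslt, Vector *den, Vector *flux1,
               Vector *flux2, Vector *num)
{
    float *r  = rslt->data;
    float *d  = den->data;
    float *n  = num->data;
    float *s1 = flux1->data;
    float *s2 = flux2->data;
    const int len = ne->len;   /* hint 1: loop upper bound in a local */
    int i;

    /* hint 2: tell the compiler to ignore assumed dependencies */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (i = 0; i < len; ++i)
        r[i] = s1[i] * s2[i] + n[i] * d[i];
}
```

Hoisting `ne->len` matters because a store through `r` could, as far as the compiler can tell, modify `ne->len` itself; the local copy removes that doubt.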



Data Transformation Example

struct S3 {
    int a;
    int b;              // hot
    double c[100];
    struct S2 *s2_ptr;
    int d; int e;
    struct S1 *s1_ptr;
    char *c_p;
    int f;              // hot
};

peel.c(22): remark #30756: (DTRANS) Splitting the structure 'S3' into two parts will improve data locality and is highly recommended. Frequently accessed fields are 'b, f'; performance may improve by putting these fields into one structure and the remaining fields into another structure. Alternatively, performance may also improve by reordering the fields of the structure. Suggested field order:'b, f, s2_ptr, s1_ptr, a, c, d, e, c_p'. [VERIFY] The suggestion is based on the field references in current compilation …

for (ii = 0; ii < N; ii++){
    sp->b = ii;
    sp->f = ii + 1;
    sp++;
}
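The split suggested by the remark could look roughly like this. It is a sketch under assumptions: the names `S3_hot`, `S3_cold` and `init_hot` are hypothetical, and `struct S1` / `struct S2` are left as opaque forward declarations because the slide does not define them.

```c
struct S1;   /* opaque here; not defined on the slide */
struct S2;

/* Frequently accessed ("hot") fields packed together, so one cache
 * line holds many b/f pairs. */
struct S3_hot {
    int b;
    int f;
};

/* Remaining ("cold") fields, kept in a parallel array. */
struct S3_cold {
    int a;
    double c[100];
    struct S2 *s2_ptr;
    int d;
    int e;
    struct S1 *s1_ptr;
    char *c_p;
};

/* The loop from the slide, now touching only the hot array. */
void init_hot(struct S3_hot *sp, int n)
{
    for (int ii = 0; ii < n; ii++) {
        sp->b = ii;
        sp->f = ii + 1;
        sp++;
    }
}
```

In the original layout each loop iteration strides over a full `struct S3`, most of which is the 800-byte `c` array; after the split the loop walks a dense array of two-int records.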














Intel® Cilk™ Plus - Overview

Task parallelism

  Simple keywords: set of keywords for expressing task parallelism:
    cilk_spawn
    cilk_sync
    cilk_for

  Reducers (hyper-objects): reliable access to nonlocal variables without races
    cilk::reducer_opadd<int> sum(3);

Data parallelism

  Array Notation: provides data parallelism for sections of arrays or whole arrays
    mask[:] = a[:] < b[:] ? -1 : 1;

  Elemental Functions: define actions that can be applied to whole or parts of arrays or scalars

Execution parameters: runtime system APIs, environment variables, pragmas
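For readers new to the array notation, the statement `mask[:] = a[:] < b[:] ? -1 : 1;` computes the same values as the plain C loop below. This is only a sketch of the meaning, with a hypothetical function name `make_mask` and an explicit length parameter. The array-notation form additionally tells the compiler that the element operations may run in any order, which the serial loop does not express.

```c
#include <stddef.h>

/* Element-wise equivalent of: mask[:] = a[:] < b[:] ? -1 : 1;
 * for arrays of n elements. */
void make_mask(size_t n, const int *a, const int *b, int *mask)
{
    for (size_t i = 0; i < n; i++)
        mask[i] = (a[i] < b[i]) ? -1 : 1;
}
```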



Intel® Cilk™ Plus - Overview

• Intel® Cilk™ Plus (language extension to C/C++)
  • Easier task and data parallelism: 3 simple keywords: cilk_for, cilk_spawn, cilk_sync
  • Intel® Cilk™ Plus Array Notation: save time with powerful vectorization

• Minimize software re-work for new hardware


Compiler Reports – Optimization Report

Compiler switch: -opt-report-phase[=phase] (Linux*)

phase can be:
  ipo_inl   Interprocedural Optimization Inlining Report
  ilo       Intermediate Language Scalar Optimization
  hpo       High Performance Optimization
  hlo       High-level Optimization
  all       All optimizations (not recommended, output too verbose)

Control the level of detail in the report: -opt-report[0|1|2|3] (Linux*)

• If you do not specify the level (i.e. /Qopt-report, -opt-report), level 2 is used.

Save report output to file: -opt-report-file=[file] (Linux*)

Vectorization subset report: /Qvec-report2, -vec-report2


Optimization Report Example

icc -O3 -opt-report-phase=hlo -opt-report-phase=hpo

…

LOOP INTERCHANGE in loops at line: 7 8 9

Loopnest permutation ( 1 2 3 ) --> ( 2 3 1 )

…

Loop at line 8 blocked by 128



…

Loop at line 10 unrolled and jammed by 4

Loop at line 8 unrolled and jammed by 4

…

…(10)… loop was not vectorized: not inner loop.

…(8)… loop was not vectorized: not inner loop.

…(9)… PERMUTED LOOP WAS VECTORIZED

…



High-Level Optimizer (HLO)

Compiler switches: -O2, -O3 (Linux*)

Loop level optimizations

• loop unrolling, cache blocking, prefetching

More aggressive dependency analysis

• Determines whether or not it's safe to reorder or parallelize statements

Scalar replacement

• Goal is to reduce memory references by replacing them with register references


Interprocedural Optimizations (IPO)
Multi-pass Optimization

• Interprocedural optimization performs a static, topological analysis of your application
• -ip: enables interprocedural optimizations for the current source file compilation
• -ipo: enables interprocedural optimizations across files
  • Can inline functions in separate files
  • Especially many small utility functions benefit from IPO

Enabled optimizations:
• Procedure inlining (reduced function call overhead)
• Interprocedural dead code elimination, constant propagation and procedure reordering
• Enhances optimization when used in combination with other compiler features

Linux* switches: -ip, -ipo


Interprocedural Optimizations (IPO) Usage: Two-Step Process

Pass 1, compiling (produces mock object files):

  Linux*     icc -c -ipo main.c func1.c func2.c
  Windows*   icl -c /Qipo main.c func1.c func2.c

Pass 2, linking (produces the executable):

  Linux*     icc -ipo main.o func1.o func2.o
  Windows*   icl /Qipo main.obj func1.obj func2.obj


Interprocedural Optimizations
Extends optimizations across file boundaries

Without IPO: file1.c, file2.c, file3.c and file4.c are each compiled and optimized separately.

With IPO: file1.c, file2.c, file3.c and file4.c are compiled and optimized together.

/Qip, -ip      Only between modules of one source file
/Qipo, -ipo    Modules of multiple files/whole application


Auto-Vectorization
SIMD: Single Instruction Multiple Data

• Scalar mode: one instruction produces one result

      a[i] + b[i]  ->  a[i]+b[i]

• SIMD processing: with SSE or AVX instructions, one instruction can produce multiple results

      a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i]
    + b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i]
    = c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i]

for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];


Vectorization is Achieved through SIMD Instructions & Hardware

Intel® SSE
  Vector size: 128 bit (bits 0 to 127)
  Data types: 8, 16, 32, 64-bit integers; 32 and 64-bit floats
  VL: 2, 4, 8, 16
  Sample: Xi, Yi are 32-bit int / float; X4..X1 op Y4..Y1 gives X4opY4 X3opY3 X2opY2 X1opY1

Intel® AVX
  Vector size: 256 bit (bits 0 to 255)
  Data types: 32 and 64-bit floats
  VL: 4, 8, 16
  Sample: Xi, Yi are 32-bit int or float; X8..X1 op Y8..Y1 gives X8opY8 ... X1opY1
  First introduced in 2011


Comparison of Ways Applications can Take Advantage of Vectorization

Approach                                             Effort Required   Code Maintainability   Performance Potential   Scale Forward
Assembly/Intrinsics                                  Most              Least                  Best                    No
Existing libraries such as Intel® IPP, Intel® MKL    Least             Most                   Best                    Yes
Intel Compiler Auto-Vectorization                    Least             Most                   Good                    Yes
High-level Constructs                                Moderate          Most                   Best                    Yes


Compiling for Intel® AVX and SSSE3 using Intel® C++ Compiler

Compile with -xavx (/Qxavx on Windows*)

• Main speedups are for floating point
  – Integer 256-bit arithmetic instructions coming for AVX2
  – Best if 32-byte aligned

-axavx (/Qaxavx) gives both SSE and AVX code paths

• Use -x (/Qx) switches to modify the default SSE code path
  – e.g. -axavx -xssse3_atom targets Intel Core i7 and Intel Atom™ processors simultaneously (/Qaxavx /Qxssse3_atom on Windows*)

software.intel.com/en-us/articles/how-to-compile-for-intel-avx/
software.intel.com/en-us/articles/atom-optimized-compiler/


Compiler Based Vectorization Extension Specification

SIMD Extension   Feature

sse2             Intel® Streaming SIMD Extensions 2 (Intel® SSE2) as available in initial Pentium® 4 or compatible non-Intel processors
sse3             Intel® Streaming SIMD Extensions 3 (Intel® SSE3) as available in Pentium® 4 or compatible non-Intel processors
ssse3            Supplemental Streaming SIMD Extensions 3 (SSSE3) as available in Intel® Core™2 Duo processors
sse4.1           Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture
sse4.2           Intel® SSE4.2 Accelerated String and Text Processing instructions supported first by Intel® Core™ i7 processors
ssse3_atom       Like ssse3 but also generates the MOVBE instruction that is available for the Intel® Atom™ processor and Intel® Centrino® Atom™ Processor Technology
avx              Intel® Advanced Vector Extensions (Intel® AVX) as available in 2nd generation Intel® Core™ processor family
core-avx-i       Intel® Advanced Vector Extensions (Intel® AVX) including instructions offered by the 3rd generation Intel® Core processor
core-avx2        Intel® Advanced Vector Extensions 2 (Intel® AVX2) as provided by a future Intel processor


Compiler Reports – Vectorization Report

Compiler switch: -vec-report<n> (Linux)

Sets the diagnostic level dumped to stdout:
  n=0: No diagnostic information
  n=1: (Default) Loops successfully vectorized
  n=2: Loops not vectorized, and the reason why not
  n=3: Adds dependency information
  n=4: Reports only non-vectorized loops
  n=5: Reports only non-vectorized loops and adds dependency info


Automatic Vectorization by Compiler

Intel Compiler will auto vectorize the source code for you if it can

Pros:

• Minimal effort required

• Maintainable – source code is not changed

• Portable across Intel SIMD architectures

• Optimal performance is possible in best cases

• Scales forward!

Cons:

• Compiler is conservative; will not generate unsafe code

=> Advanced optimization techniques help to improve Data Level Parallelization using Vectorization



Pointer Checker (C/C++)

• Out-of-bounds memory checking at runtime
  – Checks before any memory access through a pointer that the pointer address is inside the object pointed to.
  – Checks for accesses through pointers that have been freed.

• Enable the pointer checker via compiler switch: -check-pointers=[none|write|rw]

• Enable checking for dangling pointer references: -check-pointers-dangling=[none|heap|stack|all]

• Enable checking of bounds for arrays without dimensions: -[no]check-pointers-undimensioned

• Intrinsics allow the user to get the lower/upper bounds associated with a pointer and to create / destroy bounds for a pointer.
  – void * __chkp_lower_bound(void **)
  – void * __chkp_upper_bound(void **)
  – void * __chkp_kill_bounds(void *p)
  – void * __chkp_make_bounds(void *p, size_t size)


Inlining Functions

When the compiler inlines a function call, the function's code gets inserted into the caller's instruction stream.

Benefits:

Reducing the overhead of calling a function

• writing the registers and parameters to/from the stack

• restoring the registers when the function returns

Improving performance because the optimizer can procedurally integrate the called function and can do better optimizations
  – sub-expression elimination
  – copy propagation

Drawbacks:

Overuse of inlining can actually make programs slower. Depending on a function's size, inlining it can cause the code size to increase, resulting in more cache misses and more pressure on the instruction cache

The speed benefits of inline functions tend to diminish as the function grows in size. At some point the overhead of the function call becomes small compared to the execution of the function body, and the benefit is lost.



Compiler Floating Point Model

The floating-point options let you control the optimization of floating-point instructions. These options can be used to tune performance, level of accuracy, or result consistency.

Accuracy: produce results that are "close" to the correct value
  – Measured in relative error, possibly ulps (units in the last place)

Reproducibility: produce consistent results
  – From one run to the next
  – From one set of build options to another
  – From one compiler to another
  – From one platform to another

Performance: produce the most efficient code possible
  – Default, primary goal of Intel® Compilers

These objectives usually conflict! Wise use of compiler options lets you control the tradeoffs.


Compiler Floating-Point Model

The Floating-Point Compiler Switch

–fp-model keyword (Linux*)

Lets you choose the FP semantics at a coarse granularity and specify the compiler rules for

– Value safety

– FP expression evaluation

– FPU environment access

– Precise FP exceptions

– FP contractions

– Abrupt underflow (flush to zero)

– Denormals are set to zero

– May improve performance, especially if the hardware doesn't support denormals


Floating-Point Keywords

Controls consistency of floating point results by restricting certain optimizations. Values for keywords are

– fast[=1|2]; default is fast=1

– Allows „value-unsafe“ optimizations (=default)

– Allows aggressive optimizations at a slight cost in accuracy or consistency.

– Some additional approximations allowed with fast=2

– precise

– Enables only value-safe optimizations on floating point code.

– source

– Implies precise and enables intermediates to be computed in source precision.

– Source is the recommended form for the majority of situations on processors supporting Intel® 64 and IA-32 platforms when SSE are enabled with /QxSSE2 or higher.
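Why "value-unsafe" optimizations can change results is easiest to see with reassociation: floating-point addition is not associative, so a compiler that is allowed to regroup a sum may produce a different value. The sketch below is an illustration, not Intel sample code, and the function names `sum_left` and `sum_right` are hypothetical. With value-safe evaluation the two groupings give different answers for the inputs 1e20f, -1e20f and 1.0f, because 1.0f is far below the rounding granularity of 1e20f. Under a fast model the compiler is free to treat both functions as the same expression.

```c
/* (a + b) + c: with a = 1e20f, b = -1e20f the large values cancel
 * first, so c survives. */
float sum_left(float a, float b, float c)
{
    return (a + b) + c;
}

/* a + (b + c): c is absorbed into b by rounding before the large
 * values cancel. */
float sum_right(float a, float b, float c)
{
    return a + (b + c);
}
```

Comparing the two results for the same inputs under different -fp-model settings is a quick way to see whether reassociation is in effect.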



Floating-Point Keywords (2)

– double

– Implies precise and enables intermediates to be computed in double or extended precision.

– Not available in Intel® Fortran Compilers

– extended

– Rounds intermediate results to 64-bit (extended) precision

– Enables value safe optimization

– except

– Enables floating point exception semantics

– strict

– Strictest mode of operation; enables both the precise and except options and disables contractions (i.e., precise + except + disable fma)


The -fp-model <key> Switch

Key                Value Safety   Expression Evaluation   FPU Environ. Access   Precise FP Exceptions   FP contract
precise            Safe           Varies                  No                    No                      Yes
source             Safe           Source                  No                    No                      Yes
double             Safe           Double                  No                    No                      Yes
extended           Safe           Extended                No                    No                      Yes
strict             Safe           Varies                  Yes                   Yes                     No
fast=1 (default)   Unsafe         Unknown                 No                    No                      Yes
fast=2             Very Unsafe    Unknown                 No                    No                      Yes
except             */**           *                       *                     Yes                     *
except-            *              *                       *                     No                      *

* These modes are unaffected. -fp-model except[-] only affects the precise FP exceptions mode.

** It is illegal to specify -fp-model except in an unsafe value safety mode.


New Parallelism Method: Intel® Cilk™ Plus

An extension to C and C++ for expressing fine-grained task parallelism

• Shared-memory multiprocessing (like OpenMP)

Very simple syntax of 3 keywords only: _Cilk_spawn and _Cilk_sync, _Cilk_for

• #include <cilk/cilk.h> in order to get cilk_spawn, cilk_sync, and cilk_for

Every Cilk program preserves the serial semantics

Cilk provides performance guarantees because it is based on a theoretically efficient work-stealing scheduler

Preventing races using reducer hyperobjects

Array Notations to provide data parallelism for sections of arrays or whole arrays

Elemental Functions to enable data parallelism of whole functions or operations

#pragma simd to express vector parallelism using SIMD hardware registers
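The serial-semantics property can be illustrated with the classic recursive Fibonacci sketch below. It is an illustration under assumptions: it relies on the predefined macro `__cilk`, which Cilk Plus compilers are expected to define, to decide whether `<cilk/cilk.h>` is available. When it is not, the two keywords are defined away, and what remains is an ordinary serial C function that returns the same result.

```c
#if defined(__cilk)
#include <cilk/cilk.h>   /* provides cilk_spawn and cilk_sync */
#else
/* Assumption for non-Cilk compilers: removing the keywords leaves
 * the serial program. */
#define cilk_spawn
#define cilk_sync
#endif

long fib(int n)
{
    if (n < 2)
        return n;
    long x = cilk_spawn fib(n - 1);  /* may run in parallel with the next line */
    long y = fib(n - 2);
    cilk_sync;                       /* wait for the spawned call before using x */
    return x + y;
}
```

Because the spawned call only may run in parallel, the work-stealing scheduler is free to execute the program serially on one worker, which is why the result never depends on the number of workers.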



Key Files Supplied with Compiler

Linux*

Intel compiler

• icc: C/C++ compiler

• compilervars.(c)sh: source scripts to set up the complete compiler/debugger/libraries environment

Linker driver

• xild: Invokes ld

Intel include files, libraries



Additional new compiler features

• -mtune=<ARCH> option on Linux*/OS X* to specify CPU targeting without generating instructions exclusive to that CPU

• "no_false_share" attribute to avoid false sharing in data structures

• DWARF4 support on Linux*/OS X*


Wind River* Application Cross-Build from Windows* Host

1. Set environment variables:
   • WRL_TOOLCHAIN
   • WRL_SYSROOT

   Example: Wind River* Linux* 4.3 64-bit target

     set WRL_TOOLCHAIN=<some_path>\wrl43\wrlinux-4\layers\wrll-toolchain-4.4a-341\i586\toolchain\x86-win32
     set WRL_SYSROOT=<some_path>\wrl43\intel64\export\sysroot\common_pc_64-glibc_std\sysroot

   Example: Wind River* Linux 5.0.x 64-bit target

     set WRL_TOOLCHAIN=<some_path>\wrl50\wrlinux-5\layers\wr-toolchain\4.6-60-win32
     set WRL_SYSROOT=<some_path>\wrl50\intel64\export\sysroot\intel-xeon-core_glibc_std\bitbake_build\tmp\sysroots\intel-xeon-core

2. Build application:
   • C source:
       icc.exe -platform=wrl50 my_source_file.c
   • C++ source:
       icpc.exe -platform=wrl50 my_source_file.cpp


Summary

Intel® C++ Compiler 14.0 for applications running on Embedded OS Linux*

• High level optimizations

• Auto-vectorization/-parallelization to parallelize serial code

• Sophisticated programming methods for multithreading

• Runs on GNU environments or integrates into Eclipse (Linux*)

More information on Intel’s software offerings and services at http://software.intel.com



INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice



2/19/2014

Intel® Compiler 14.0 for Android*


Content

• Introduction

• The Seven Steps of Optimization

• Android* Integration

• ARM* Neon vs. Intel SSE



Introducing the Intel® Compiler for Android*

• Based on Intel® C/C++ Compiler XE 14.0 for Linux*

• High performance C/C++ compiler

• Atom optimization

• Vectorization for loops - SIMD

• Interprocedural Optimization (IPO)

• Profile-Guided Optimizations (PGO)



What’s new

• Support for Silvermont architecture

• Optimization switch -xatom_sse4.2

• Support for Android NDK r9.

• Intel® Cilk™ Plus runtime support is enabled for Android as a technology preview.

• Features from C++11 (-std=c++0x)

• 64-bit long double type support (for compatibility with new NDKs)



When to use

• ICC can only be used for native source code

• You will get better speedup if

• The app is CPU bound (check with Intel GPA)

• The hot functions are not written in assembler

• Code can be vectorized

– Usually true for multimedia apps & games

• Code consists of a lot of small helper functions (IPO)

• You want to multithread your application (use Intel® Cilk™ Plus)

• You want to explicitly optimize for the latest CPU generation


Android* Integration



Differences to Intel® C/C++ Compiler XE 14.0 for Linux*

- Cross-compiler: Linux* host, Android* target

- Some features are removed

- OpenMP

- Android* NDK or AOSP environment is required



Integration into the build environment

Three different options to compile Android* apps

1. Standalone tool chain

– Useful for own / 3rd party build systems

– Manual compile / link / package of application

2. Using ndk-build script

– Controlled by Android.mk and Application.mk files

– Automatically compiles/links applications and stores them in the right folders for use from the Android* SDK

3. As part of the AOSP

– Automatically integrates into platform build



Option 1: Standalone tool chain

• Establish the compiler environment

source <icc-install-dir>/bin/compilervars.sh

Intel C++ compiler can be used directly in this environment

Recommended



Option 2: Using ndk-build script

• Execute the ndk-build script in the project folder

ndk-build V=1 -B NDK_TOOLCHAIN=x86-icc APP_ABI=x86



Option 3: As part of the AOSP

• Two modes available:

• ICC as default compiler

• Compile only specified modules with ICC

• Ability to force compilation for particular modules with ICC or GCC independent of the default compiler

• Recommended is to start with compiling specified modules with ICC



Option 3: As part of the AOSP (a)

• Preparations

1. Request a patch set for your particular version of the source tree from your Intel representative

2. Apply the patch to your source tree

3. Check whether ICC is already included in your source tree:

   ls prebuilts/PRIVATE/icc/linux-x86/x86/x86-android-linux-13.0

   If the directory is empty or missing, make a symlink from /opt/intel/CCAndroid13.0.0.006/


Option 3: As part of the AOSP (b)

• Compile specified modules with ICC

1. Edit the ICC configuration file:

   nano build/core/icc_config.mk

2. Specify the modules you want to compile with ICC:

   ICC_MODULES := libv8 libskia libskiagpu

3. Build Android as usual:

   source build/envsetup.sh
   lunch
   make


Option 3: As part of the AOSP (c)

• ICC as default compiler

1. Edit the ICC configuration file:

   nano build/core/icc_config.mk

2. Specify the modules you want to compile with GCC:

   GCC_MODULES := …

3. Change the default compiler to ICC:

   DEFAULT_COMPILER:=icc

4. Build Android as usual:

   source build/envsetup.sh
   lunch
   make


Option 3: As part of the AOSP (d)

• Checking the result:

• Check if your module was built with ICC. You should see several lines of output for this command:

   readelf -s out/target/product/redhookbay/system/lib/libskia.so |grep intel

• Check if the Intel libraries are copied to the device:

   adb shell
   root@android:/ # ls /system/lib/libsvml.so


Intel Libraries

• ICC comes with four optimized libraries

• The final binary requires access to these libraries

• Options

1. Include them into OS image

2. Link statically

3. Copy them to the application directory

Library Description

libintlc.so Intel support libraries

libimf.so Intel math library

libsvml.so Short vector math library

libirng.so Random number generator



Option 1: Include into OS Image

• Libraries in the /system/lib folder are loaded automatically

• Remount the filesystem read/write:

   adb shell mount -o remount,rw /system

• Push the libraries to the target:

   cd /opt/intel/CCAndroid13.0.0.005/lib
   adb push libintlc.so /system/lib
   adb push libimf.so /system/lib
   adb push libsvml.so /system/lib
   adb push libirng.so /system/lib

• Applications will automatically load the needed libraries

Recommended



Option 2: Link statically

• Best choice for single binary

• Default option

• If the libraries should not be linked statically, use the option -shared-intel

Recommended



Option 3: Copy the libraries to the application directory

• Part of the Android* SDK/NDK functionality

• Add to the Android.mk file in the jni folder (repeat the block for libimf, libsvml and libirng):

   include $(CLEAR_VARS)
   LOCAL_MODULE := libintlc
   LOCAL_SRC_FILES := libintlc.so
   include $(PREBUILT_SHARED_LIBRARY)

• Local libraries are not loaded automatically; load them manually from Java:

   System.loadLibrary("intlc");
   System.loadLibrary("imf");
   System.loadLibrary("svml");
   System.loadLibrary("irng");
   System.loadLibrary("hello-jni");



Compatibility GCC / ICC

[Figure: file1.c is compiled with GCC into file1.o and file2.c with ICC into file2.o; both object files are linked by GCC or ICC into one executable]



Using GAP on Android*

• Same option set as on Linux

• Recommended option set:

-guide -diag-disable 30761

• Using GAP with standalone tool chain is recommended

• Using GAP with ndk-build

• GAP generates no code, so the linking phase will fail or use outdated object files


Using PGO on Android*

• Generated data files need a storage location

• By default they are stored in the application directory, which is usually write-protected on Android*

• Specify a different storage location in the Android.mk file:

LOCAL_CFLAGS := -prof-gen -prof-dir /sdcard

• The application needs write permission on the sdcard. Add to AndroidManifest.xml:

<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />



Using PGO on Android* (2)

• Data files are only generated when the application exits

• Applications usually do not exit on Android*

• Option 1: Call exit from Java:

System.exit(0);

• Option 2: Explicitly dump PGO data from native code:

#include <pgouser.h>
_PGOPTI_Prof_Dump_All();

• Option 3: Use environment variables to make regular dumps:

export INTEL_PROF_DUMP_INTERVAL=5000
export INTEL_PROF_DUMP_CUMULATIVE=1
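The explicit dump from native code could be wrapped so that the same source builds with and without instrumentation. This is a minimal sketch, not the compiler's prescribed usage: the macro USE_INTEL_PGO and the helper pgo_checkpoint() are hypothetical names for this example, and the macro would have to be defined by hand (for instance next to -prof-gen in LOCAL_CFLAGS).

```c
/* USE_INTEL_PGO is a hypothetical macro for this sketch. It is assumed to be
   defined manually for instrumented (-prof-gen) builds only. */
#ifdef USE_INTEL_PGO
#include <pgouser.h>   /* declares _PGOPTI_Prof_Dump_All() */
#endif

/* Hypothetical helper: request a profile dump at a convenient point,
   e.g. when the activity is paused.
   Returns 1 if a dump was requested, 0 in a non-instrumented build. */
int pgo_checkpoint(void)
{
#ifdef USE_INTEL_PGO
    _PGOPTI_Prof_Dump_All();
    return 1;
#else
    return 0;
#endif
}
```

In a non-instrumented build the helper compiles to a no-op, so the call sites do not need their own #ifdef.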



Using Intel® Cilk™ Plus

• Change your STL to GNU shared and add exception support to your Application.mk file:

APP_STL := gnustl_shared
APP_GNUSTL_FORCE_CPP_FEATURES := exceptions rtti

• Include the Cilk runtime library in the app by adding to the Android.mk file:

include $(CLEAR_VARS)
LOCAL_MODULE := cilkrts.so
LOCAL_SRC_FILES := ../path/to/CCAndroid/lib/cilkrts.so
include $(PREBUILT_SHARED_LIBRARY)

• Load the libraries from your Java code:

System.loadLibrary("gnustl_shared");
System.loadLibrary("cilkrts");


Using Intel® Cilk™ Plus (2)

• Add the Cilk runtime to your linker options (Android.mk):

LOCAL_LDLIBS += -lcilkrts

• In your C/C++ file include the Cilk header:

#include <cilk/cilk.h>

• And start using Cilk in your C/C++ code:

int fib(int n) {
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}
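Because Cilk support on Android is described as experimental, it can help to keep the same source buildable without Cilk. The sketch below assumes that the macro __cilk is predefined when Cilk Plus is enabled; if it is not, the keywords expand to nothing and fib runs serially with the same result.

```c
/* Assumption: __cilk is predefined by the compiler when Cilk Plus is enabled. */
#ifdef __cilk
#include <cilk/cilk.h>
#else
/* Serial fallback: the Cilk keywords expand to nothing */
#define cilk_spawn
#define cilk_sync
#endif

int fib(int n)
{
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n - 1);  /* may run in parallel with the next call */
    int y = fib(n - 2);
    cilk_sync;                      /* wait for the spawned call before using x */
    return x + y;
}
```

The serial build is also a convenient reference when checking the results of the parallel one.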

ARM* Neon vs. Intel SSE



Comparison

ARM v5                ARM v7a               x86
32-bit                32-bit                32-bit
little-endian         little-endian         little-endian
Soft FP               Hardware FP           Hardware FP
64-bit vars aligned   64-bit vars aligned   64-bit vars packed
None                  NEON                  SSE                  (This will require porting…)

Normally not a problem…



Memory alignment

struct TestStruct
{
    int mVar1;
    long long mVar2;
    int mVar3;
};

[Figure: memory layout of TestStruct on ARM and on x86]

Force memory alignment:

-malign-double
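The layout difference can be made visible with offsetof. The sketch below assumes a 4-byte int and an 8-byte long long; the struct TestStructAligned and the helper functions are hypothetical names for this example, and the aligned attribute is the GCC-style syntax, used here as a per-member alternative to the global -malign-double switch.

```c
#include <stddef.h>   /* offsetof, size_t */

struct TestStruct {
    int mVar1;
    long long mVar2;   /* offset 4 on 32-bit x86 by default, 8 on ARM */
    int mVar3;
};

/* Hypothetical variant: the 64-bit member is aligned explicitly, so the
   layout is the same on both architectures. */
struct TestStructAligned {
    int mVar1;
    long long mVar2 __attribute__((aligned(8)));
    int mVar3;
};

/* Offset of mVar2 with the platform's default rules (4 or 8) */
size_t offset_default(void) { return offsetof(struct TestStruct, mVar2); }

/* Offset of mVar2 with forced alignment (8 under the stated assumptions) */
size_t offset_forced(void) { return offsetof(struct TestStructAligned, mVar2); }

/* Size with forced alignment: 4 + 4 padding + 8 + 4 + 4 padding = 24 */
size_t size_forced(void) { return sizeof(struct TestStructAligned); }
```

Comparing offset_default() on both targets shows whether data exchanged as raw structs between ARM and x86 builds needs attention.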



Porting SIMD instructions

Porting NEON instructions (ARM*) to SSE instructions (Intel)

– Fixed point arithmetic only on ARM*

– NEON native C libs can’t be reused on Intel® Atom™ based platforms

http://intel.ly/10JjuY4 - NEONvsSSE.h wraps NEON functions and intrinsics to SSE3
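As an illustration of what such a port looks like by hand, the sketch below adds four floats with SSE intrinsics and shows the corresponding NEON calls in a comment. The function name add4 is hypothetical, and the scalar fallback is only there so that the example also builds on targets without SSE.

```c
#if defined(__SSE__) || defined(__x86_64__)
#include <xmmintrin.h>   /* SSE intrinsics */

/* NEON original, for comparison (not compiled here):
 *     float32x4_t va = vld1q_f32(a);
 *     float32x4_t vb = vld1q_f32(b);
 *     vst1q_f32(out, vaddq_f32(va, vb));
 */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* lane-wise add, then store */
}
#else
/* Scalar fallback for targets without SSE */
void add4(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}
#endif
```

Simple arithmetic maps one to one like this; the porting effort usually lies in NEON operations that have no direct SSE counterpart.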


Conclusion

- Based on the high performance Intel® C/C++ Compiler XE 13.0 for Linux*, widely used by HPC customers for achieving better performance on IA

- Comes with a well established support infrastructure

- Variety of optimization options available

- Integration into various parts of the Android* environment

- Can be integrated in a standalone tool chain, the NDK and the AOSP




Optimization Notice






Backup



Intel® Cilk™ Plus Pragma/Directive

C/C++: #pragma simd [clause [,clause]…]

Without any clause, the directive enforces vectorization of the loop and ignores all dependencies, even proven ones!

Without the SIMD directive, vectorization likely fails because there are too many pointer references for a run-time overlap check (compiler heuristic). The compiler won’t create multiple versions here.

void addfl(float *a, float *b, float *c, float *d, float *e, int n)
{
#pragma simd
    for(int i = 0; i < n; i++)
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}
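Since the directive removes the compiler's own dependence checks, the no-overlap assumption becomes the caller's responsibility. One possible safeguard is a debug-time check before calling addfl. The helpers overlaps() and addfl_is_safe() below are hypothetical names for this sketch, and the check is deliberately conservative: it rejects any overlap between the output and an input, including harmless ones.

```c
#include <stdint.h>   /* uintptr_t */

/* Loop from the slide; the pragma is only emitted for the Intel compiler */
void addfl(float *a, float *b, float *c, float *d, float *e, int n)
{
#ifdef __INTEL_COMPILER
#pragma simd
#endif
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}

/* Hypothetical helper: 1 if the n-element ranges at p and q share memory */
int overlaps(const float *p, const float *q, int n)
{
    uintptr_t pb = (uintptr_t)p, qb = (uintptr_t)q;
    uintptr_t len = (uintptr_t)n * sizeof(float);
    return pb < qb + len && qb < pb + len;
}

/* Hypothetical guard: 1 if the output a shares no memory with any input */
int addfl_is_safe(const float *a, const float *b, const float *c,
                  const float *d, const float *e, int n)
{
    return !overlaps(a, b, n) && !overlaps(a, c, n) &&
           !overlaps(a, d, n) && !overlaps(a, e, n);
}
```

A debug build could assert addfl_is_safe(...) before each call and drop the check in release builds.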



#pragma simd Example for C/C++

void foo(float *restrict a, float *restrict b, int offmax, int n, int off[n])
{
    for(int k = 0; k < n - offmax; k++) a[k + off[k]] = a[k] * b[k];
}

The compiler cannot vectorize the loop, even though the arrays a and b won’t overlap (keyword restrict).

Multi-versioning won’t help either because of the complexity of the offsets (off[]).

Using #pragma ivdep doesn’t work either because the compiler regards accesses to off[] as inefficient here.

Solution: If, for example, all offsets are at least 4 elements, vectorization is still possible because the vector length can be controlled via #pragma simd:

void foo(float *restrict a, float *restrict b, int offmax, int n, int off[n])
{
#pragma simd vectorlength(4)
    for(int k = 0; k < n - offmax; k++) a[k + off[k]] = a[k] * b[k];
}
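Why a vector length of 4 is safe when every offset is at least 4 can be checked with a plain C emulation. The sketch below is not what the compiler generates; foo_chunked is a hypothetical function that imitates vector execution by performing all loads of a 4-element chunk before any of its stores, and foo_scalar is the in-order reference.

```c
#define VL 4   /* emulated vector length */

/* Reference: the loop from the slide, executed strictly in order */
void foo_scalar(float *a, const float *b, int offmax, int n, const int *off)
{
    for (int k = 0; k < n - offmax; k++)
        a[k + off[k]] = a[k] * b[k];
}

/* Emulation of vector execution: per chunk, all loads precede all stores */
void foo_chunked(float *a, const float *b, int offmax, int n, const int *off)
{
    int limit = n - offmax;
    for (int k0 = 0; k0 < limit; k0 += VL) {
        float tmp[VL];
        int lanes = (limit - k0 < VL) ? limit - k0 : VL;
        for (int j = 0; j < lanes; j++)        /* "vector" load and multiply */
            tmp[j] = a[k0 + j] * b[k0 + j];
        for (int j = 0; j < lanes; j++)        /* "vector" scatter store */
            a[k0 + j + off[k0 + j]] = tmp[j];
    }
}
```

With all offsets equal to 4, each chunk writes only to elements beyond the ones it reads, so both functions produce the same array. With an offset of 1, a chunk overwrites elements it has already loaded and the results diverge, which is the dependence that vectorlength(4) must not cross.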



Sample for movbe instruction

#include <stdio.h>
#include <stdlib.h>

int a;

/* Store x in a with its byte order reversed */
void foo (int x)
{
    a = ((x & 0xff) << 24) |
        ((x & 0xff00) << 8) |
        ((x & 0xff0000) >> 8) |
        ((x & 0xff000000) >> 24);
    return;
}

int main(int argc, char **argv)
{
    foo(atoi(argv[1]));
    printf("0x%8.8x\n", atoi(argv[1]));
    printf("0x%8.8x\n", a);
    return 0;
}

> icc -xSSSE3_ATOM -minstruction=movbe



Changes to the AOSP build system and sources

• Include Intel libraries to build environment

• Copy required libraries to target

• Necessary changes to source code, because ICC is more strict

• About 40 changes for the whole AOSP source tree

• Most of the changes solve existing problems

• Optional changes for better performance

• Loop restructuring

• Pragma SIMD





2/19/2014


Differences ICC Linux vs. ICC Android

• Android GCC compiler does not support native Linux thread local storage. Thread local storage is emulated as described in http://gcc.gnu.org/onlinedocs/gccint/Emulated-TLS.html#Emulated-TLS.

• Newer GCC uses DT_INIT_ARRAY/DT_FINI_ARRAY elements in the .dynamic section for global object initialization as described in http://www.sco.com/developers/gabi/latest/ch5.dynamic.html#init_fini. Previously, addresses of constructors and destructors of global objects were placed in the .ctors/.dtors sections respectively.

• Stack alignment is different: 16 bytes on Linux, 4 bytes on Android.

• Long double is 64-bit on Android and 80-bit on Linux.

• The driver has been changed to account for differences in system library names.

• The Android NDK or GNU tools from the Android OS workspace are required to run the compiler, and two environment variables must be set before invoking the compiler: ANDROID_SYSROOT and ANDROID_GNU_X86_TOOLCHAIN.

• Only IA-32 target is supported.

• OpenMP runtime support is missing.

• There is only experimental Cilk+ support in the Android compiler. C++ runtime support in Android is provided by a different set of libraries compared to Linux. The differences are in RTTI and exceptions. As a result, exceptions thrown from Cilk threads are lost.

• Windows host is supported only in the experimental compiler.

• Pointer Checker support is limited.
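Code that depends on extended precision can detect the long double difference through the standard <float.h> macros instead of assuming a width. The helper names below are hypothetical; this is only a sketch of such a check.

```c
#include <float.h>    /* LDBL_MANT_DIG, DBL_MANT_DIG */
#include <stddef.h>   /* size_t */

/* Mantissa bits of long double: 64 for the x87 80-bit format,
   equal to DBL_MANT_DIG (53) when long double is a 64-bit double */
int long_double_mantissa_bits(void) { return LDBL_MANT_DIG; }

/* Storage size in bytes; this can include padding, so the mantissa
   width is the more reliable indicator of the format */
size_t long_double_size(void) { return sizeof(long double); }

/* 1 if long double offers no more precision than double */
int long_double_is_plain_double(void) { return LDBL_MANT_DIG == DBL_MANT_DIG; }
```

A numerical routine ported from Linux could call long_double_is_plain_double() at start-up and fall back to a compensated algorithm when it returns 1.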
