Intel® C++ Compiler 14.0 within Intel System Studio
Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
What you will learn from this slide deck
• Intel® C++ Compiler 14.0 technical training for
System & Application code running Linux*, Android* & Tizen™
• In-depth explanation of compiler specifics for each development environment mentioned above
• Please see subsequent slide decks for in-depth technical training on other components
Compatibility to Standards
The Intel C++ Compiler conforms to the following language standards:
• ANSI/ISO standard for C language compilation (ISO/IEC 9899:1990)
• ANSI/ISO standard for the C++ language (ISO/IEC 14882:1998)
Common Optimization Switches
Linux*
Disable optimization                              -O0
Optimize for speed (no code size increase)        -O1
Optimize for speed (default)                      -O2
High-level loop optimization                      -O3
Create symbols for debugging                      -g
Multi-file inter-procedural optimization          -ipo
Profile guided optimization (multi-step build)    -prof-gen, -prof-use
Optimize for speed across the entire program      -fast (same as: -ipo -O3 -no-prec-div -static -xHost)
**Warning: the definition of -fast changes over time
Optimizations for latest generation Intel® Atom™ Processor
• Processor specific light-weight out-of-order instruction scheduler optimization
• Processor specific cache management and memory preload optimizations
• Loop optimizations and vectorizer taking advantage of SSE4.2 vector instructions.
Processor-specific Compiler Optimizations
Linux*: -xSSE2, -xSSE3, -xSSSE3, -xSSE4.1, -xSSE4.2, -xAVX, -xCORE-AVX-I, -xCORE-AVX2, -xATOM_SSSE3, -xATOM_SSE4.2, -xHost
• These options imply an Intel CPU ID check
• A runtime message is issued on an attempt to run on an unsupported processor
Additional new compiler features
• New vectorization report level
  – Adds information about the quality of the vector code generated and does not output the text of messages
  – Report data is processed with a script to produce a summary report and a text file that intersperses vectorization messages with user code
  – Analysis script available at http://software.intel.com/en-us/articles/vecanalysis-python-script-for-annotating-intelr-compiler-vectorization-report
  – Requires Python 2.6.5 or newer
  – Specified with -vec-report7
• -vec-report6 now gives alignment information
The Seven Steps of Optimization
1. Build with optimization disabled
2. Use general optimizations
3. Use processor specific options
4. Add Inter Procedural Optimizations
5. Use Profile Guided Optimization
6. Tune Automatic Vectorization
7. Multithread your application
General optimization options
• -O1
  – Optimize code size; auto-vectorization is turned off
• -O2
  – Inlining
  – Vectorization
• -O3
  – Loop optimization
  – Data pre-fetching
• -ansi-alias / -restrict / -no-prec-div
ICC Atom Specific optimization
• Optimization switch -xSSSE3_ATOM for Saltwell
  – Intel® Supplemental Streaming SIMD Extensions 3 (SSSE3)
  – In order
  – Use of LEA for stack operations
  – Instruction reordering
  – Support for movbe instruction (-minstruction=movbe)
  – Only use it for system development
• Optimization switch -xatom_sse4.2 for Silvermont
  – Intel® Streaming SIMD Extensions 4.2 (SSE4.2)
  – Out of order
GCC note: use -mtune=atom or -mtune=slm, not -march=atom / -march=slm
SIMD Instruction Enhancements
Extension   New instructions                                       Focus
SSE         70 instr                                               Single-precision vectors, streaming operations
SSE2        144 instr                                              Double-precision vectors, 8/16/32/64/128-bit vector integer
SSE3        13 instr                                               Complex data
SSSE3       32 instr                                               Decode
SSE4.1      47 instr                                               Video, graphics building blocks, advanced vector instr
SSE4.2      8 instr                                                String/XML processing, POP-Count, CRC
AES-NI      7 instr                                                Encryption and decryption, key generation
AVX         ~100 new instr., ~300 legacy SSE instr updated         256-bit vector, 3 and 4-operand instructions
Figure markers: Intel® Atom Saltwell, Intel® Atom Silvermont
SIMD Instruction Enhancements (2)
addss: Scalar Single-FP Add
• single precision FP data
• scalar execution mode
    x4     x3     x2     x1
    y4     y3     y2     y1
  = x4     x3     x2     x1+y1
addps: Packed Single-FP Add
• single precision FP data
• packed execution mode
    x4     x3     x2     x1
    y4     y3     y2     y1
  = x4+y4  x3+y3  x2+y2  x1+y1
Approaches to introduce vectorization
Ordered from most programmer control to greatest ease of use:
• Assembler code (addps)
• Vector intrinsic (_mm_add_ps())
• User Mandated Vectorization (SIMD Directive)
• Cilk Plus Array Notation
• Compiler: Auto vectorization hints (#pragma ivdep, …)
• Compiler: Fully automatic vectorization
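To make the two ends of this spectrum concrete, here is a minimal sketch of the same element-wise add written once as a plain loop (left to the compiler's automatic vectorization) and once with SSE intrinsics. The function names add_auto and add_intrinsics are illustrative and not part of the original deck; the intrinsics path is guarded by __SSE__ and falls back to scalar code where SSE is not available.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Fully automatic: a plain loop, the compiler decides whether to vectorize. */
void add_auto(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Programmer control: four floats per iteration via addps (_mm_add_ps). */
void add_intrinsics(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar remainder (or the whole loop when SSE is unavailable). */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; the intrinsics version ties the source to one instruction set, which is the maintainability cost of the extra control.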
Interprocedural Optimizations
Extends optimizations across file boundaries
Without IPO: file1.c, file2.c, file3.c and file4.c each go through a separate "Compile & Optimize" step.
With IPO: file1.c, file2.c, file3.c and file4.c go through one combined "Compile & Optimize" step.
-ip     Only between modules of one source file
-ipo    Modules of multiple files/whole application
Interprocedural Optimizations (IPO) Usage: Two-Step Process
Pass 1: Compiling (produces mock objects)
  icc -c -ipo main.c
  icc -c -ipo func1.c
  icc -c -ipo func2.c
Pass 2: Linking (produces the executable)
  icc -ipo main.o func1.o func2.o
What you should know about IPO
• -O2 and -O3 activate “almost” file-local IPO (-ip)
• IPO increases compilation time and memory usage
• IPO works for libraries too
• Inlining of functions is the most important feature of IPO, but there is much more
Profile-Guided Optimizations (PGO)
Static analysis is limited:
• How often is x > y
• What is the size of count
• Which code is touched how often
Enhancements with PGO:
• More accurate branch prediction
• Better decision of functions to inline (help IPO)
• Basic block movement to improve instruction cache behavior
if (x > y) do_this(); else do_that();
for (i = 0; i < count; ++i)
    do_work();
PGO Usage: Three Step Process
Step 1: Compile + link to add instrumentation
  icc -prof-gen prog.c
  Produces the instrumented executable: prog.exe
Step 2: Execute instrumented program
  prog.exe (on a typical dataset)
  Produces a dynamic profile: 12345678.dyn
Step 3: Compile + link using feedback
  icc -prof-use prog.c
  Uses the merged .dyn files (pgopti.dpi) and produces the optimized executable: prog.exe
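As a small illustration of code where profile feedback can matter, the sketch below has a branch whose direction depends entirely on the input data, which static analysis cannot know. The function name count_above and the output name in the build comment are assumptions made for this example; the -prof-gen and -prof-use switches are the ones listed in this deck.

```c
#include <stddef.h>

/*
 * Possible build sequence, following the three steps above
 * (the -o name is illustrative):
 *   icc -prof-gen prog.c -o prog    instrumented build
 *   ./prog                          run on a typical dataset, writes .dyn data
 *   icc -prof-use prog.c -o prog    rebuild using the collected profile
 */

/* Counts the elements greater than threshold. How often the branch is
 * taken depends only on the data, so a run-time profile is the only way
 * for the compiler to learn which side is the common case. */
long count_above(const int *v, size_t n, int threshold)
{
    long hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold)
            hits++;
    }
    return hits;
}
```

The profile is only as good as the training run, so the dataset used in step 2 should resemble production input.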
How do I know if a loop was vectorized?
• -vec-report[n]
  > icc -vec-report MultArray.c
  MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.
• Always vectorize if safe:
  #pragma vector always [assert]
• Always vectorize:
  #pragma simd
GAP – Guided Automatic Parallelization
• Use compiler infrastructure to help developer
– Vectorization, parallelization and data transformations
– Extend diagnostic message for failed vectorization and parallelization by specific hints to fix problem
• Exploit multi-year experience brought into the compiler development
– Performance tuning knowledge based on dealing with numerous applications, benchmarks and compute kernels
• Does not influence code generation
GAP – How it Works
Compiler Switches for GAP [1]
Activate GAP and optionally define guidance level
• -guide[=level]
• -guide-vec[=level]
• -guide-par[=level]
• -guide-data-trans[=level]
• Optional argument level=1,2,3,4 controls the extent of analysis: ‘4’ is the most advanced / most detailed and is the default
Vectorization Example [1]
void f(int n, float *x, float *y, float *z, float *d1, float *d2) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i] - (d1[i]*d2[i]);
}
GAP Message:
g.c(6): remark #30536: (LOOP) Add -no-alias-args option for better type-based disambiguation analysis by the compiler, if appropriate (the option will apply for the entire compilation). This will improve optimizations such as vectorization for the loop at line 6. [VERIFY] Make sure that the semantics of this option is obeyed for the entire compilation. [ALTERNATIVE] Another way to get the same effect is to add the "restrict" keyword to each pointer-typed formal parameter of the routine "f". This allows optimizations such as vectorization to be applied to the loop at line 6. [VERIFY] Make sure that semantics of the "restrict" pointer qualifier is satisfied: in the routine, all data accessed through the pointer must not be accessed through any other
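A minimal sketch of the [ALTERNATIVE] suggested by the message: the same routine with each pointer parameter qualified by the C99 "restrict" keyword. The name f_restrict and the added const qualifiers are choices made for this example. The caller must guarantee that the arrays do not overlap, exactly as the [VERIFY] note demands.

```c
/* Same computation as f(), but every pointer parameter is restrict-qualified.
 * This promises the compiler that z does not alias x, y, d1 or d2, which
 * removes the assumed dependence that can block vectorization.
 * Requires C99 (or an option that enables the restrict keyword). */
void f_restrict(int n,
                const float *restrict x, const float *restrict y,
                float *restrict z,
                const float *restrict d1, const float *restrict d2)
{
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i] - (d1[i] * d2[i]);
}
```

Passing overlapping arrays to f_restrict is undefined behavior, so the qualifier should only be added once the call sites have been checked.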
Vectorization Example [2]
void mul(NetEnv* ne, Vector* rslt,
         Vector* den, Vector* flux1,
         Vector* flux2, Vector* num)
{
    float *r, *d, *n, *s1, *s2;
    int i;
    r=rslt->data; d=den->data;
    n=num->data; s1=flux1->data;
    s2=flux2->data;
    for (i = 0; i < ne->len; ++i)
        r[i] = s1[i]*s2[i] +
               n[i]*d[i];
}
GAP Messages (simplified):
1. “Use a local variable to host the upper-bound of loop at line 29 (variable: ne->len) if the upper-bound does not change during execution of the loop”
2. “Use ‘#pragma ivdep’ to help vectorize the loop at line 29, if these arrays in the loop do not have cross-iteration dependencies: r, s1, s2, n, d”
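The two suggestions could be applied as in the sketch below. The NetEnv and Vector definitions are hypothetical stand-ins, reduced to the fields the routine uses, because the deck does not show the real types. The pragma is guarded so that other compilers simply skip it.

```c
/* Hypothetical minimal types; the original definitions are not shown. */
typedef struct { int len; } NetEnv;
typedef struct { float *data; } Vector;

void mul(NetEnv *ne, Vector *rslt, Vector *den, Vector *flux1,
         Vector *flux2, Vector *num)
{
    float *r = rslt->data, *d = den->data, *n = num->data;
    float *s1 = flux1->data, *s2 = flux2->data;
    const int len = ne->len;   /* suggestion 1: loop bound in a local */
    int i;

    /* Suggestion 2: assert that r, s1, s2, n, d carry no cross-iteration
     * dependencies. Only valid if the caller guarantees this. */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (i = 0; i < len; ++i)
        r[i] = s1[i] * s2[i] + n[i] * d[i];
}
```

Copying ne->len into a local matters because a store through r could, as far as the compiler knows, modify ne->len and change the trip count mid-loop.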
Data Transformation Example
struct S3 {
    int a;
    int b; // hot
    double c[100];
    struct S2 *s2_ptr;
    int d; int e;
    struct S1 *s1_ptr;
    char *c_p;
    int f; // hot
};
peel.c(22): remark #30756: (DTRANS) Splitting the structure 'S3' into two parts will improve data locality and is highly recommended. Frequently accessed fields are 'b, f'; performance may improve by putting these fields into one structure and the remaining fields into another structure. Alternatively, performance may also improve by reordering the fields of the structure. Suggested field order:'b, f, s2_ptr, s1_ptr, a, c, d, e, c_p'. [VERIFY] The suggestion is based on the field references in current compilation …
for (ii = 0; ii < N; ii++) {
    sp->b = ii;
    sp->f = ii + 1;
    sp++;
}
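One way to follow the splitting advice is sketched below: the two hot fields move into their own small structure and the remaining fields into a second one. The names S3_hot, S3_cold and fill_hot are invented for this example, and S1 and S2 are left as opaque types because the deck does not define them.

```c
struct S1;   /* opaque, definition not shown in the original */
struct S2;   /* opaque, definition not shown in the original */

/* Frequently accessed fields, packed together for data locality. */
struct S3_hot {
    int b;
    int f;
};

/* Remaining, rarely accessed fields. */
struct S3_cold {
    int a;
    double c[100];
    struct S2 *s2_ptr;
    int d;
    int e;
    struct S1 *s1_ptr;
    char *c_p;
};

/* The loop from the slide, now striding over the small hot records only. */
void fill_hot(struct S3_hot *sp, int n)
{
    for (int ii = 0; ii < n; ii++) {
        sp->b = ii;
        sp->f = ii + 1;
        sp++;
    }
}
```

With the split, each loop iteration advances by sizeof(struct S3_hot) instead of by the size of the full structure, so many more hot elements fit in each cache line.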
Intel® Cilk™ Plus - Overview
Task parallelism
• Simple Keywords: Set of keywords for expression of task parallelism:
    cilk_spawn
    cilk_sync
    cilk_for
• Reducers (Hyper-objects): Reliable access to nonlocal variables without races
    cilk::reducer_opadd<int> sum(3);
Data parallelism
• Array Notation: Provide data parallelism for sections of arrays or whole arrays
    mask[:] = a[:] < b[:] ? -1 : 1;
• Elemental Functions: Define actions that can be applied to whole or parts of arrays or scalars
Execution parameters
• Runtime system APIs, environment variables, pragmas
Intel® Cilk™ Plus - Overview
• Intel® Cilk™ Plus (Language Extension to C/C++)
  – Easier Task & Data Parallelism: 3 simple keywords: cilk_for, cilk_spawn, cilk_sync
  – Intel® Cilk™ Plus Array Notation: Save time with powerful vectorization
Minimize Software Re-Work for New Hardware
Compiler Reports – Optimization Report
Compiler switch: -opt-report-phase[=phase] (Linux*)
phase can be:
• ipo_inl – Interprocedural Optimization Inlining Report
• ilo – Intermediate Language Scalar Optimization
• hpo – High Performance Optimization
• hlo – High-level Optimization
• all – All optimizations (not recommended, output too verbose)
Control the level of detail in the report: -opt-report[0|1|2|3] (Linux*)
• If you do not specify the level (i.e. /Qopt-report, -opt-report), level 2 is used.
Save report output to file: -opt-report-file=[file] (Linux*)
Vectorization subset report: /Qvec-report2, -vec-report2
Optimization Report Example
icc -O3 -opt-report-phase=hlo -opt-report-phase=hpo
…
LOOP INTERCHANGE in loops at line: 7 8 9
Loopnest permutation ( 1 2 3 ) --> ( 2 3 1 )
…
Loop at line 8 blocked by 128
Loop at line 9 blocked by 128
Loop at line 10 blocked by 128
…
Loop at line 10 unrolled and jammed by 4
Loop at line 8 unrolled and jammed by 4
…
…(10)… loop was not vectorized: not inner loop.
…(8)… loop was not vectorized: not inner loop.
…(9)… PERMUTED LOOP WAS VECTORIZED
…
High-Level Optimizer (HLO)
Compiler switches: -O2, -O3 (Linux*)
Loop level optimizations
• loop unrolling, cache blocking, prefetching
More aggressive dependency analysis
• Determines whether or not it's safe to reorder or parallelize statements
Scalar replacement
• Goal is to reduce memory references by replacing them with register references
Interprocedural Optimizations (IPO)
Multi-pass Optimization
• Interprocedural optimization performs a static, topological analysis of your application
• -ip (Linux*): Enables inter-procedural optimizations for the current source file compilation
• -ipo (Linux*): Enables inter-procedural optimizations across files
  – Can inline functions in separate files
  – Especially many small utility functions benefit from IPO
Enabled optimizations:
• Procedure inlining (reduced function call overhead)
• Interprocedural dead code elimination, constant propagation and procedure reordering
• Enhances optimization when used in combination with other compiler features
Interprocedural Optimizations (IPO) Usage: Two-Step Process
Pass 1: Compiling (produces mock objects)
  Linux*     icc -c -ipo main.c func1.c func2.c
  Windows*   icl -c /Qipo main.c func1.c func2.c
Pass 2: Linking (produces the executable)
  Linux*     icc -ipo main.o func1.o func2.o
  Windows*   icl /Qipo main.obj func1.obj func2.obj
Interprocedural Optimizations
Extends optimizations across file boundaries
Without IPO: file1.c, file2.c, file3.c and file4.c each go through a separate "Compile & Optimize" step.
With IPO: file1.c, file2.c, file3.c and file4.c go through one combined "Compile & Optimize" step.
/Qip, -ip      Only between modules of one source file
/Qipo, -ipo    Modules of multiple files/whole application
Auto-Vectorization
SIMD – Single Instruction Multiple Data
• Scalar mode – one instruction produces one result
    a + b = a+b
    a[i] + b[i] = a[i]+b[i]
• SIMD processing – with SSE or AVX instructions, one instruction can produce multiple results
    a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i]
  + b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i]
  = c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i]
for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];
Vectorization is Achieved through SIMD Instructions & Hardware
Intel® SSE
• Vector size: 128 bit
• Data types: 8, 16, 32, 64 bit integers; 32 and 64 bit floats
• VL: 2, 4, 8, 16
• Sample: Xi, Yi are 32 bit int / float. Bits 0 to 127 hold X1..X4 and Y1..Y4; the result is X1opY1 .. X4opY4
Intel® AVX
• Vector size: 256 bit
• Data types: 32 and 64 bit floats
• VL: 4, 8, 16
• Sample: Xi, Yi are 32 bit int or float. Bits 0 to 127 hold elements 1 to 4 and bits 128 to 255 hold elements 5 to 8; the result is X1opY1 .. X8opY8
• First introduced in 2011
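The vector length (VL) values follow from dividing the register width by the element width. A small worked sketch, with the helper name vector_lanes chosen for this example:

```c
/* Number of elements that fit in one SIMD register:
 * register width in bits divided by element width in bits.
 * Returns 0 for an element width of 0 to avoid dividing by zero. */
unsigned vector_lanes(unsigned vector_bits, unsigned element_bits)
{
    if (element_bits == 0)
        return 0;
    return vector_bits / element_bits;
}
```

For example, a 128-bit SSE register holds 4 single-precision floats (128 / 32), and a 256-bit AVX register holds 8 single-precision or 4 double-precision floats.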
Comparison of Ways Applications can Take Advantage of Vectorization
Approach                                             Effort Required   Code Maintainability   Performance Potential   Scale Forward
Assembly/Intrinsics                                  Most              Least                  Best                    No
Existing libraries such as Intel® IPP, Intel® MKL    Least             Most                   Best                    Yes
Intel Compiler Auto-Vectorization                    Least             Most                   Good                    Yes
High-level Constructs                                Moderate          Most                   Best                    Yes
Compiling for Intel® AVX and SSSE3 using Intel® C++ Compiler
Compile with -xavx (/Qxavx on Windows*)
• Main speedups are for floating point
  – Integer 256 bit arithmetic instructions coming for AVX2
  – Best if 32 byte aligned
-axavx (/Qaxavx) gives both SSE and AVX code paths
• Use -x (/Qx) switches to modify the default SSE code path
  – e.g. -axavx -xssse3_atom targets Intel Core i7 and Intel Atom™ Processor simultaneously (/Qaxavx /Qxssse3_atom on Windows)
software.intel.com/en-us/articles/how-to-compile-for-intel-avx/
software.intel.com/en-us/articles/atom-optimized-compiler/
Compiler Based Vectorization Extension Specification
SIMD Extension: Feature
• sse2: Intel® Streaming SIMD Extensions 2 (Intel® SSE2) as available in initial Pentium® 4 or compatible non-Intel processors
• sse3: Intel® Streaming SIMD Extensions 3 (Intel® SSE3) as available in Pentium® 4 or compatible non-Intel processors
• ssse3: Supplemental Streaming SIMD Extensions 3 (SSSE3) as available in Intel® Core™2 Duo processors
• sse4.1: Intel® SSE4.1 as first introduced in Intel® 45nm Hi-K next generation Intel Core™ micro-architecture
• sse4.2: Intel® SSE4.2 Accelerated String and Text Processing instructions supported first by Intel® Core™ i7 processors
• ssse3_atom: Like ssse3 but also generates the MOVBE instruction that is available for the Intel® Atom™ processor and Intel® Centrino® Atom™ Processor Technology
• avx: Intel® Advanced Vector Extensions (Intel® AVX) as available in 2nd generation Intel® Core™ processor family
• core-avx-i: Intel® Advanced Vector Extensions (Intel® AVX) including instructions offered by the 3rd generation Intel® Core processor
• core-avx2: Intel® Advanced Vector Extensions 2 (Intel® AVX2) as provided by a future Intel processor
Compiler Reports – Vectorization Report
Compiler switch: -vec-report<n> (Linux)
Set diagnostic level dumped to stdout:
• n=0: No diagnostic information
• n=1: (Default) Loops successfully vectorized
• n=2: Loops not vectorized, and the reason why not
• n=3: Adds dependency information
• n=4: Reports only non-vectorized loops
• n=5: Reports only non-vectorized loops and adds dependency info
Automatic Vectorization by Compiler
Intel Compiler will auto vectorize the source code for you if it can
Pros:
• Minimal effort required
• Maintainable – source code is not changed
• Portable across Intel SIMD architectures
• Optimal performance is possible in best cases
• Scales forward!
Cons:
• Compiler is conservative; will not generate unsafe code
=> Advanced optimization techniques help to improve Data Level Parallelization using Vectorization
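To illustrate what "conservative" means in practice, the sketch below contrasts a loop with independent iterations against one with a genuine cross-iteration dependence. The function names saxpy and prefix_sum are chosen for this example. A vectorization report would typically flag the first loop as vectorized and the second as not vectorized because of the dependence, though the exact outcome depends on compiler version and options.

```c
/* Independent iterations: y[i] depends only on x[i] and y[i], so several
 * elements can be processed by one SIMD instruction. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* Loop-carried dependence: a[i] needs the value of a[i-1] computed in the
 * previous iteration, so the iterations cannot simply run side by side. */
void prefix_sum(int n, float *a)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```

Even for saxpy the compiler must rule out that x and y overlap, which is where hints such as restrict or #pragma ivdep come in.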
Pointer Checker (C/C++)
• Out-of-bounds memory checking at runtime
  – Checks before any memory access through a pointer that the pointer address is inside the object pointed to.
  – Checks for accesses through pointers that have been freed.
• Enable pointer checker via compiler switches: -check-pointers=[none|write|rw]
• Enable checking for dangling pointer references: -check-pointers-dangling=[none|heap|stack|all]
• Enable checking of bounds for arrays without dimensions: -[no]check-pointers-undimensioned
• Intrinsics allow the user to get the lower/upper bounds associated with a pointer and to create / destroy bounds for a pointer.
  – void * __chkp_lower_bound(void **)
  – void * __chkp_upper_bound(void **)
  – void * __chkp_kill_bounds(void *p)
  – void * __chkp_make_bounds(void *p, size_t size)
Inlining Functions
When the compiler inlines a function call, the function's code gets inserted into the caller's instruction stream.
Benefits:
• Reducing the overhead of calling a function
  – writing the registers and parameters to/from the stack
  – restoring the registers when the function returns
• Improving performance because the optimizer can procedurally integrate the called function and can do better optimizations
  – sub-expression elimination
  – copy propagation
Drawbacks:
• Overuse of inlining can actually make programs slower. Depending on a function's size, inlining it can cause the code size to increase, resulting in more cache misses and more pressure on the instruction cache.
• The speed benefits of inline functions tend to diminish as the function grows in size. At some point the overhead of the function call becomes small compared to the execution of the function body, and the benefit is lost.
Compiler Floating Point Model
The floating-point options let you control the optimization of floating-point instructions. These options can be used to tune performance, level of accuracy or result consistency.
Accuracy: Produce results that are “close” to the correct value
  – Measured in relative error, possibly ulps (units in the last place)
Reproducibility: Produce consistent results
  – From one run to the next
  – From one set of build options to another
  – From one compiler to another
  – From one platform to another
Performance: Produce the most efficient code possible
  – Default, primary goal of Intel® Compilers
These objectives usually conflict! Wise use of compiler options lets you control the tradeoffs.
Compiler Floating-Point Model
The Floating-Point Compiler Switch
-fp-model keyword (Linux*)
Lets you choose the FP semantics at a coarse granularity and specify the compiler rules for
– Value safety
– FP expression evaluation
– FPU environment access
– Precise FP exceptions
– FP contractions
– Abrupt underflow (flush to zero)
  – Denormals are set to zero
  – May improve performance, esp. if HW doesn't support denormals
Floating-Point Keywords
Controls consistency of floating point results by restricting certain optimizations. Values for keywords are:
– fast[=1|2]; default is fast=1
  – Allows "value-unsafe" optimizations (=default)
  – Allows aggressive optimizations at a slight cost in accuracy or consistency.
  – Some additional approximations allowed with fast=2
– precise
  – Enables only value-safe optimizations on floating point code.
– source
  – Implies precise and enables intermediates to be computed in source precision.
  – Source is the recommended form for the majority of situations on processors supporting Intel® 64 and IA-32 platforms when SSE is enabled with -xSSE2 (/QxSSE2 on Windows*) or higher.
Floating-Point Keywords (2)
– double
  – Implies precise and enables intermediates to be computed in double or extended precision.
  – Not available in Intel® Fortran Compilers
– extended
  – Rounds intermediate results to 64-bit (extended) precision
  – Enables value safe optimization
– except
  – Enables floating point exception semantics
– strict
  – Strictest mode of operation, enables both the precise and except options and disables contractions (i.e., precise + except + disable fma)
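Why reordering is "value-unsafe" can be seen with three numbers: floating-point addition is not associative, so an optimizer that regroups a sum can change the result. The sketch below evaluates the same sum in both groupings; the function names are chosen for this example. Compiled under a value-safe model each function keeps its written order.

```c
/* (a + b) + c, evaluated strictly as written. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

/* a + (b + c), the regrouping a value-unsafe optimization might choose. */
double sum_right_to_left(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first grouping cancels the large terms first and yields 1.0, while in the second the 1.0 is absorbed into -1e20 before the cancellation and the result is 0.0.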
The -fp-model <key> Switch
Key                Value Safety   Expression Evaluation   FPU Environ. Access   Precise FP Exceptions   FP contract
precise            Safe           Varies                  No                    No                      Yes
source             Safe           Source                  No                    No                      Yes
double             Safe           Double                  No                    No                      Yes
extended           Safe           Extended                No                    No                      Yes
strict             Safe           Varies                  Yes                   Yes                     No
fast=1 (default)   Unsafe         Unknown                 No                    No                      Yes
fast=2             Very Unsafe    Unknown                 No                    No                      Yes
except             */**           *                       *                     Yes                     *
except-            *              *                       *                     No                      *
* These modes are unaffected. -fp-model except[-] only affects the precise FP exceptions mode.
** It is illegal to specify -fp-model except in an unsafe value safety mode.
New Parallelism Method: Intel® Cilk™ Plus
An extension to C and C++ for expressing fine-grained task parallelism
• Shared-memory multiprocessing (like OpenMP)
Very simple syntax of 3 keywords only: _Cilk_spawn and _Cilk_sync, _Cilk_for
• #include <cilk/cilk.h> in order to get cilk_spawn, cilk_sync, and cilk_for
Every Cilk program preserves the serial semantics
Cilk provides performance guarantees since it is based on a theoretically efficient work-stealing scheduler
Preventing races using reducer hyperobjects
Array Notations to provide data parallelism for sections of arrays or whole arrays
Elemental Functions to enable data parallelism of whole functions or operations
#pragma simd to express vector parallelism using SIMD hardware registers
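A minimal sketch of the task-parallel keywords and of the serial-semantics property, using the classic recursive Fibonacci example (the function fib is illustrative, not from the deck). The fallback macros are an assumption of this example: when the compiler does not announce Cilk Plus support through the __cilk macro, the keywords are defined away and the program runs as its own serial elision.

```c
#if defined(__cilk)
#include <cilk/cilk.h>   /* provides cilk_spawn, cilk_sync, cilk_for */
#else
/* Serial elision: without Cilk Plus the keywords vanish and the
 * program keeps exactly the same meaning, executed serially. */
#define cilk_spawn
#define cilk_sync
#endif

long fib(int n)
{
    if (n < 2)
        return n;
    long x = cilk_spawn fib(n - 1);  /* may run in parallel with the caller */
    long y = fib(n - 2);             /* continues in the current strand */
    cilk_sync;                       /* wait for the spawned call to finish */
    return x + y;
}
```

Because x is only read after cilk_sync, the parallel and the serial versions produce the same value; a real application would use a coarser grain size than one spawn per call.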
Key Files Supplied with Compiler
Linux*
Intel compiler
• icc: C/C++ compiler
• compilervars.(c)sh: Source scripts to setup the
complete compiler/debugger/libraries environment
Linker driver
• xild: Invokes ld
Intel include files, libraries
Additional new compiler features
• -mtune=<ARCH> option on Linux*/OS X* to specify CPU targeting without generating instructions exclusive to that CPU
• “no_false_share” attribute to avoid false sharing in data structures
• DWARF4 support on Linux*/OS X*
Wind River* Application Cross-Build from Windows* Host
1. Set environment variables:
• WRL_TOOLCHAIN
• WRL_SYSROOT
Example: Wind River* Linux* 4.3 64-bit target
set WRL_TOOLCHAIN=<some_path>\wrl43\wrlinux-4\layers\wrll-toolchain-4.4a-341\i586\toolchain\x86-win32
set WRL_SYSROOT=<some_path>\wrl43\intel64\export\sysroot\common_pc_64-glibc_std\sysroot
Example: Wind River* Linux* 5.0.x 64-bit target
set WRL_TOOLCHAIN=<some_path>\wrl50\wrlinux-5\layers\wr-toolchain\4.6-60-win32
set WRL_SYSROOT=<some_path>\wrl50\intel64\export\sysroot\intel-xeon-core_glibc_std\bitbake_build\tmp\sysroots\intel-xeon-core
2. Build application:
• C source:
icc.exe -platform=wrl50 my_source_file.c
• C++ source:
icpc.exe -platform=wrl50 my_source_file.cpp
Summary
Intel® C++ Compiler 14.0 for applications running on Embedded OS Linux*
• High level optimizations
• Auto-vectorization/-parallelization to parallelize serial code
• Sophisticated programming methods for multithreading
• Runs on GNU environments or integrates into Eclipse (Linux*)
More information on Intel’s software offerings and services at http://software.intel.com
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
2/19/2014
Intel® Compiler 14.0 for Android*
Content
• Introduction
• The Seven Steps of Optimization
• Android* Integration
• ARM* Neon vs. Intel SSE
Introducing the Intel® Compiler for Android*
• Based on Intel® C/C++ Compiler XE 14.0 for Linux*
• High performance C/C++ compiler
• Atom optimization
• Vectorization for loops - SIMD
• Interprocedural Optimization (IPO)
• Profile-Guided Optimizations (PGO)
What’s new
• Support for Silvermont architecture
• Optimization switch -xatom_sse4.2
• Support for Android NDK r9
• Intel® Cilk™ Plus runtime support is enabled for Android as a technology preview.
• Features from C++11 (-std=c++0x)
• 64-bit long double type support (for compatibility with new NDKs)
When to use
• ICC can only be used for native source code
• You will get better speedup if
• The app is CPU bound (check with Intel GPA)
• The hot functions are not written in assembler
• Code can be vectorized
– Usually true for multimedia apps & games
• Code consists of many small helper functions (IPO)
• You want to multithread your application (use Intel® Cilk™ Plus)
• You want to explicitly optimize for the latest CPU generation
Android* Integration
Differences to Intel® C/C++ Compiler XE 14.0 for Linux*
- Cross-compiler: Linux* host, Android* target
- Some features are removed
- OpenMP
- Android* NDK or AOSP environment is required
Integration into the build environment
Three different options to compile Android* apps
1. Standalone tool chain
– Useful for own / 3rd party build systems
– Manual compile / link / package of application
2. Using ndk-build script
– Controlled by Android.mk and Application.mk files
– Automatically compiles and links applications and stores them in the right folders for use from the Android* SDK
3. As part of the AOSP
– Automatically integrates into platform build
Option 1: Standalone tool chain
• Establish the compiler environment
source <icc-install-dir>/bin/compilervars.sh
Intel C++ compiler can be used directly in this environment
Recommended
Option 2: Using ndk-build script
• Execute the ndk-build script in the project folder
ndk-build V=1 -B NDK_TOOLCHAIN=x86-icc APP_ABI=x86
Option 3: As part of the AOSP
• Two modes available:
• ICC as default compiler
• Compile only specified modules with ICC
• Ability to force compilation for particular modules with ICC or GCC independent of the default compiler
• Recommended: start by compiling only specified modules with ICC
Option 3: As part of the AOSP (a)
• Preparations
1. Request a patch set for your particular version of the source tree from your Intel representative
2. Apply the patch to your source tree
3. Check if ICC is already included in your source tree:
ls prebuilts/PRIVATE/icc/linux-x86/x86/x86-android-linux-13.0
If the directory is empty or missing, make a symlink from /opt/intel/CCAndroid13.0.0.006/
Option 3: As part of the AOSP (b)
• Compile specified modules with ICC
1. Edit the ICC configuration file
nano build/core/icc_config.mk
2. Specify modules you want to compile with ICC
ICC_MODULES := libv8 libskia libskiagpu
3. Build Android as usual
source build/envsetup.sh
lunch
make
Option 3: As part of the AOSP (c)
• ICC as default compiler
1. Edit the ICC configuration file
nano build/core/icc_config.mk
2. Specify modules you want to compile with GCC
GCC_MODULES := …
3. Change the default compiler to ICC
DEFAULT_COMPILER:=icc
4. Build Android as usual
source build/envsetup.sh
lunch
make
Option 3: As part of the AOSP (d)
• Checking the result:
• Check if your module was built with ICC. You should see several lines of output for this command:
readelf -s out/target/product/redhookbay/system/lib/libskia.so | grep intel
• Check if the Intel libraries are copied to the device:
adb shell
root@android:/ # ls /system/lib/libsvml.so
Intel Libraries
• ICC comes with four optimized libraries
• The final binary requires access to these libraries
• Options
1. Include them into OS image
2. Link statically
3. Copy them to the application directory
Library Description
libintlc.so Intel support libraries
libimf.so Intel math library
libsvml.so Short vector math library
libirng.so Random number generator
Option 1: Include into OS Image
• Libraries in the /system/lib folder are loaded automatically
• Remount the filesystem read/write
adb shell mount -o remount,rw /system
• Push the libraries to the target
cd /opt/intel/CCAndroid13.0.0.005/lib
adb push libintlc.so /system/lib
adb push libimf.so /system/lib
adb push libsvml.so /system/lib
adb push libirng.so /system/lib
Applications will automatically load the needed libraries
Recommended
Option 2: Link statically
• Best choice for single binary
• Default option
• If the libraries should not be linked statically, use the option -shared-intel
Recommended
Option 3: Copy the libraries to the application directory
• Part of the Android* SDK/NDK functionality
• Add to the Android.mk file in the jni folder:
include $(CLEAR_VARS)
LOCAL_MODULE := libintlc
LOCAL_SRC_FILES := libintlc.so
include $(PREBUILT_SHARED_LIBRARY)
(repeat this block for libimf, libsvml, and libirng)
• Local libraries are not loaded automatically; load them manually from Java:
System.loadLibrary("intlc");
System.loadLibrary("imf");
System.loadLibrary("svml");
System.loadLibrary("irng");
System.loadLibrary("hello-jni");
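Compatibility GCC / ICC
[Diagram: file1.c and file2.c are compiled separately, one with GCC and one with ICC, into file1.o and file2.o; the object files are then linked with GCC or ICC into a single executable.]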
Using GAP on Android*
• Same option set as on Linux
• Recommended option set:
-guide -diag-disable 30761
• Using GAP with the standalone tool chain is recommended
• Using GAP with ndk-build:
• GAP generates no code, so the linking phase will fail or use outdated object files
Using PGO on Android*
• Generated data files need a storage location
• By default they are stored in the application directory, which is usually write-protected on Android*
• Specify a different storage location in the Android.mk file:
LOCAL_CFLAGS := -prof-gen -prof-dir /sdcard
• The application needs write permission on the sdcard. Add to AndroidManifest.xml:
<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />
Using PGO on Android* (2)
• Data files are only generated when the application exits
• Applications usually do not exit on Android*
• Option 1: Call exit from Java:
System.exit(0);
• Option 2: Explicitly dump PGO data from native code:
#include <pgouser.h>
_PGOPTI_Prof_Dump_All();
• Option 3: Use environment variables to make regular dumps:
export INTEL_PROF_DUMP_INTERVAL=5000
export INTEL_PROF_DUMP_CUMULATIVE=1
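Option 2 could be wrapped in a small helper so that the same source also builds when profiling is switched off. The sketch below assumes that the compiler defines _PGO_INSTRUMENT for builds made with -prof-gen; verify that against the compiler documentation for your version. The function name dump_profile_checkpoint is hypothetical.

```c
/* Assumption: _PGO_INSTRUMENT is defined only in -prof-gen builds. */
#ifdef _PGO_INSTRUMENT
#include <pgouser.h>
#endif

/* Hypothetical helper: call it at points where the app has finished a
   representative piece of work (for example when an activity pauses).
   Returns 1 if a profile dump was requested, 0 in a normal build. */
int dump_profile_checkpoint(void)
{
#ifdef _PGO_INSTRUMENT
    _PGOPTI_Prof_Dump_All();   /* writes the profile data collected so far */
    return 1;
#else
    return 0;                  /* not instrumented: nothing to dump */
#endif
}
```

In a build without -prof-gen the helper compiles to a no-op, so the call sites do not need their own conditional compilation.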
Using Intel® Cilk™ Plus
• Change your STL to GNU shared and add exception support to your Application.mk file:
APP_STL := gnustl_shared
APP_GNUSTL_FORCE_CPP_FEATURES := exceptions rtti
• Include the Cilk runtime library in the app by adding to the Android.mk file:
include $(CLEAR_VARS)
LOCAL_MODULE := cilkrts.so
LOCAL_SRC_FILES := ../path/to/CCAndroid/lib/cilkrts.so
include $(PREBUILT_SHARED_LIBRARY)
• Load the libraries from your Java code:
System.loadLibrary("gnustl_shared");
System.loadLibrary("cilkrts");
Using Intel® Cilk™ Plus (2)
• Add the Cilk runtime to your linker options (Android.mk):
LOCAL_LDLIBS += -lcilkrts
• In your C/C++ file include the Cilk header:
#include <cilk/cilk.h>
• And start using Cilk in your C/C++ code:
int fib(int n) {
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}
ARM* Neon vs. Intel SSE
Comparison
                 ARM v5          ARM v7a         x86
Word size        32-bit          32-bit          32-bit
Byte order       little-endian   little-endian   little-endian
Floating point   Soft FP         Hardware FP     Hardware FP
64-bit vars      aligned         aligned         packed
SIMD             None            NEON            SSE
SIMD (None / NEON / SSE): This will require porting…
Normally not a problem…
Memory alignment
struct TestStruct
{
    int mVar1;
    long long mVar2;
    int mVar3;
};
ARM: 64-bit variables such as mVar2 are aligned
x86: 64-bit variables are packed
Force memory alignment on x86: -malign-double
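The -malign-double switch affects every structure in the translation unit. A narrower alternative, shown here as an assumption and not taken from the slides, is to request 8-byte alignment for the single 64-bit member with the GCC-style aligned attribute. The helper layout_is_arm_like is a hypothetical name used to check the resulting layout.

```c
#include <assert.h>
#include <stddef.h>

/* Same structure as on the slide, with the 64-bit member explicitly
   aligned to 8 bytes (GCC-style attribute, an assumption of this sketch). */
struct TestStruct {
    int mVar1;
    long long mVar2 __attribute__((aligned(8)));
    int mVar3;
};

/* Returns 1 when the layout matches the aligned (ARM-like) case:
   mVar2 at offset 8 and a total size of 24 bytes. */
int layout_is_arm_like(void)
{
    /* The expected numbers assume 4-byte int and 8-byte long long. */
    assert(sizeof(int) == 4 && sizeof(long long) == 8);
    return offsetof(struct TestStruct, mVar2) == 8 &&
           sizeof(struct TestStruct) == 24;
}
```

Without the attribute, a 32-bit x86 build that packs 64-bit variables would typically place mVar2 at offset 4, which is the mismatch the slide warns about when data is shared with ARM builds.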
Porting SIMD instructions
Porting NEON instructions (ARM*) to SSE instructions (Intel)
– Fixed point arithmetic only on ARM*
– NEON native C libs can’t be reused on Intel® Atom™ based platforms
– http://intel.ly/10JjuY4 - NEONvsSSE.h wraps NEON functions and intrinsics to SSE3
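As an illustration of what such a port looks like by hand, the sketch below implements one small operation (adding four floats) three ways: with NEON intrinsics, with the corresponding SSE intrinsics, and with a scalar fallback. The function name add4f is hypothetical, and the code does not use or reproduce NEONvsSSE.h.

```c
#include <assert.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>      /* NEON intrinsics */
#elif defined(__SSE__)
#include <xmmintrin.h>     /* SSE intrinsics */
#endif

/* Computes dst[i] = a[i] + b[i] for i = 0..3. */
void add4f(float *dst, const float *a, const float *b)
{
    assert(dst && a && b);
#if defined(__ARM_NEON)
    /* ARM: load, add, and store one 128-bit vector of four floats. */
    vst1q_f32(dst, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#elif defined(__SSE__)
    /* x86: the same three steps with unaligned SSE loads and stores. */
    _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    /* Portable fallback when neither instruction set is available. */
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

Keeping the scalar branch gives a reference result to compare against while the intrinsic branches are being ported.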
Conclusion
- Based on the high performance Intel® C/C++ Compiler XE 13.0 for Linux*, widely used by HPC customers for achieving better performance on IA
- Comes with a well established support infrastructure
- Variety of optimization options available
- Integration into various parts of the Android* environment
- Can be integrated in a standalone tool chain, the NDK and the AOSP
Backup
Intel® Cilk™ Plus Pragma/Directive
C/C++: #pragma simd [clause [,clause]…]
Without any clause, the directive enforces vectorization of the loop, ignoring all dependencies (even if they are proven!)
Without SIMD directive, vectorization likely fails since there are too many pointer references to do a run-time check for overlapping (compiler heuristic). The compiler won’t create multiple versions here.
void addfl(float *a, float *b, float *c, float *d, float *e, int n)
{
#pragma simd
for(int i = 0; i < n; i++)
a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}
#pragma simd Example for C/C++
void foo(float *restrict a, float *restrict b, int offmax, int n, int off[n])
{
    for(int k = 0; k < n - offmax; k++) a[k + off[k]] = a[k] * b[k];
}
The compiler cannot vectorize the loop, even though the arrays a and b won’t overlap (keyword restrict).
Also, multi-versioning won’t help because of the complexity of the offsets (off[]).
Using #pragma ivdep doesn’t work either, because the compiler regards accesses to off[] as inefficient here.
Solution: If, for example, offsets are at least 4 elements, vectorization is still possible, as the vector length can be controlled via #pragma simd:
void foo(float *restrict a, float *restrict b, int offmax, int n, int off[n])
{
#pragma simd vectorlength(4)
    for(int k = 0; k < n - offmax; k++) a[k + off[k]] = a[k] * b[k];
}
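Because #pragma simd makes the programmer responsible for correctness, the "offsets are at least 4 elements" condition is worth checking before the vectorized loop is trusted. The helper below is a conservative sketch under that reading of the condition; its name and signature are hypothetical.

```c
#include <assert.h>

/* Returns 1 when every offset is at least vl, so an element written in
   iteration k is never read by another iteration in the same vector of
   vl iterations. Returns 0 as soon as one offset is too small. */
int offsets_allow_vectorlength(const int *off, int count, int vl)
{
    assert(off != 0 && count >= 0 && vl > 0);
    for (int k = 0; k < count; k++)
        if (off[k] < vl)
            return 0;
    return 1;
}
```

A caller could run this check once (for example in a debug build) with vl set to 4 and fall back to the non-pragma version of foo when it returns 0.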
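Sample for movbe instruction
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi */

int a;

/* Reverse the byte order of x and store the result in a. */
void foo (int x)
{
    a = ((x & 0xff) << 24) |
        ((x & 0xff00) << 8) |
        ((x & 0xff0000) >> 8) |
        ((x & 0xff000000) >> 24);
    return;
}
int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;  /* expects one integer argument */
    foo(atoi(argv[1]));
    printf("0x%8.8x\n", atoi(argv[1]));
    printf("0x%8.8x\n", a);
    return 0;
}
> icc -xSSSE3_ATOM -minstruction=movbe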
Changes to the AOSP build system and sources
• Include Intel libraries to build environment
• Copy required libraries to target
• Necessary changes to source code, because ICC is more strict
• about 40 changes for the whole AOSP source tree
• Most of the changes solve existing problems
• Optional changes for better performance
• Loop restructuring
• Pragma SIMD
2/19/2014
Differences ICC Linux vs. ICC Android
• Android GCC compiler does not support native Linux thread local storage. Thread local storage is emulated as described in http://gcc.gnu.org/onlinedocs/gccint/Emulated-TLS.html#Emulated-TLS.
• Newer GCC uses DT_INIT_ARRAY/DT_FINI_ARRAY elements in .dynamic section for global object initialization as described in http://www.sco.com/developers/gabi/latest/ch5.dynamic.html#init_fini. Previously addresses of constructors and destructors of global objects were placed in .ctors/.dtors sections correspondingly.
• Stack alignment is different: 16 bytes on Linux, 4 bytes on Android.
• Long double is 64-bit on Android and 80-bit on Linux.
• The driver has been changed to account for differences in system library names.
• The Android NDK or GNU tools from the Android OS workspace are required to run the compiler, and 2 environment variables must be set before invoking the compiler: ANDROID_SYSROOT and ANDROID_GNU_X86_TOOLCHAIN.
• Only IA-32 target is supported.
• OpenMP runtime support is missing.
• There is only experimental Cilk+ support in the Android compiler. C++ runtime support in Android is provided by a different set of libraries compared to Linux. The differences are in RTTI and exceptions. As a result, exceptions thrown from Cilk threads are lost.
• Windows host is supported only in the experimental compiler.
• Pointer Checker support is limited.
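The long double difference in the list above can be probed from code with the standard <float.h> limits, which helps when numeric results differ between a Linux and an Android build. This is a minimal sketch; the function names are hypothetical.

```c
#include <assert.h>
#include <float.h>

/* Number of mantissa digits (base FLT_RADIX) in long double. With a
   binary radix this is 53 for a 64-bit format and 64 for the x87
   80-bit extended format. */
int long_double_mantissa_bits(void)
{
    assert(FLT_RADIX == 2);   /* the interpretation above assumes base 2 */
    return LDBL_MANT_DIG;
}

/* Returns 1 when long double carries no more precision than double,
   which is the case described for Android. */
int long_double_matches_double(void)
{
    return LDBL_MANT_DIG == DBL_MANT_DIG;
}
```

Code that relies on extended precision could call long_double_matches_double() at start-up and choose a different algorithm or tolerance when it returns 1.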