Performance Tuning
people.csail.mit.edu/.../rpi/portfolio/hpec/optimisation.pdf


Optimisation

Performance Tuning

Optimisation – p.1/22

Constant Elimination

do i=1,n
  a(i) = 2*b*c(i)
enddo

What is wrong with this loop?

Compilers can move simple instances of constant computations outside the loop. Others need to be done manually.
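The transformation the compiler applies here is loop-invariant code motion. A sketch in C, with illustrative function names, of the before and after:

```c
#include <stddef.h>

/* Naive: the loop-invariant product 2*b is recomputed every iteration. */
void scale_naive(double *a, const double *c, double b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b * c[i];
}

/* Hoisted: the constant is computed once, outside the loop. */
void scale_hoisted(double *a, const double *c, double b, size_t n) {
    const double two_b = 2.0 * b;    /* moved out of the loop */
    for (size_t i = 0; i < n; i++)
        a[i] = two_b * c[i];
}
```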

Optimisation – p.2/22

Merging Expensive Operations

E.g. division, modulo, sqrt, transcendental functions

compute V(r) = 1/r^12 - 1/r^6

r3inv = 1/(r*r*r)
r6inv = r3inv*r3inv
V = r6inv*r6inv - r6inv

not: V = 1/r ** 12 - 1/r ** 6
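The same rewrite in C, for illustration (the function names are made up for this sketch):

```c
/* Naive: two divisions per evaluation of V(r) = 1/r^12 - 1/r^6. */
double v_naive(double r) {
    double r3 = r * r * r;
    double r6 = r3 * r3;
    return 1.0 / (r6 * r6) - 1.0 / r6;
}

/* Merged: a single division; higher powers built by multiplication. */
double v_merged(double r) {
    double r3inv = 1.0 / (r * r * r);  /* the only division */
    double r6inv = r3inv * r3inv;      /* 1/r^6 */
    return r6inv * r6inv - r6inv;      /* 1/r^12 - 1/r^6 */
}
```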

Optimisation – p.3/22

Special case functions

Replace special case functions with faster algorithms, e.g.

x * x is faster than x**2 ≡ exp(2*log(x))

sqrt(x) is faster than x**.5 ≡ exp(.5*log(x))

iand(x,63) is faster than mod(x,64)

x & 63 is faster than x % 64

In C++, pow(double,int) may be more efficient than the standard pow(double,double). Fortran compilers should be able to recognise x**i as a special case.
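Two of these rewrites in C. Note the mask trick only holds when the divisor is a power of two and the value is non-negative:

```c
/* x % 64 == (x & 63) for non-negative x, because 64 is a power of two;
   the AND replaces an integer division. */
unsigned mod64_mask(unsigned x) { return x & 63u; }

/* x*x instead of pow(x, 2.0): one multiply instead of exp(2*log(x)). */
double square(double x) { return x * x; }
```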

Optimisation – p.4/22

Optimised Libraries

Don’t reinvent the wheel!

Well optimised libraries include BLAS, LAPACK

and FFTW.

Optimisation – p.5/22

The Cache

[Plot: CPU speed vs. memory speed, 1986-1998; the gap between them widens each year.]

DRAM is much cheaper than SRAM, but it is also much slower. Therefore place a small SRAM cache near the processor.

Optimisation – p.6/22

Cache. . .

[Diagram: CPU <-> Cache <-> Memory]

Vector CPUs usually use SRAM for all memory, and bank it to b...ery.

Optimisation – p.7/22

Memory-Cache mapping

The cache is partitioned up into chunks of size c called cache lines. In the simplest caching scheme, every memory location x is mapped to a specific cache line l, along the lines of:

l = (x mod s)/c

where s is the size of the cache. A status register records if the cache line has been written to (dirty) and so needs to be flushed back to main memory before that line can be reused for another part of memory.
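The mapping l = (x mod s)/c as a C function. The sizes here (a 2 MB direct-mapped cache with 64-byte lines) are assumptions for illustration:

```c
enum { LINE = 64, CACHE = 2 * 1024 * 1024 };  /* c and s: assumed sizes */

/* Direct-mapped cache line for a byte address x: l = (x mod s) / c. */
unsigned long cache_line(unsigned long x) {
    return (x % CACHE) / LINE;
}
```

Note that two addresses exactly s bytes apart land on the same line, which is what causes the conflicts discussed on the Array Padding slides.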

[Diagram: direct-mapped cache; each block of memory maps onto a single cache line.]

Optimisation – p.8/22



Memory Hierarchy

Registers

Cache

Memory

Disk

Tape

(cost per byte rises toward the top of the hierarchy; capacity rises toward the bottom)

Arrange data locally (e.g. use stride 1 if possible)

Avoid strides that are a multiple of the cache line size/page size (typically a power of two)
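In C, arrays are row-major, so the last index should vary fastest to get stride-1 access. A sketch of the two traversal orders (names and size are illustrative):

```c
enum { N = 256 };

/* Stride 1: a[i][j] and a[i][j+1] are adjacent in memory. */
double sum_stride1(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];           /* contiguous accesses */
    return s;
}

/* Stride N: each access jumps N*sizeof(double) bytes -- cache hostile,
   though the result is of course the same. */
double sum_strideN(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```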

Optimisation – p.9/22


Array Padding

integer, parameter :: n=1024*1024
real*8 a(n),b(n),c(n)
do i=1,n
  a(i)=b(i)+c(i)
enddo

These arrays are 8 MB in size. Napier has a 2 MB cache.

Each array overlaps in cache: a(i) has the same cache location as b(i) and c(i). The cache line will be flushed 3 times each iteration!

Optimisation – p.10/22

Array Padding...

integer, parameter :: n=1024*1024
real*8 a(n),space1(16),b(n)
real*8 space2(16),c(n)
common /foo/ a,space1,b,space2,c
do i=1,n
  a(i)=b(i)+c(i)
enddo

This ensures a(i) is on a different cache line to b(i) and c(i)

Similarly it may be sensible to add an extra row to a higher-dimensional array: real*8 a(129,128) rather than real*8 a(128,128)

Optimisation – p.11/22
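A C analogue of the padded common block can be sketched with a struct, whose layout is contiguous like the common block. The offsets of b and c relative to a become 8 MB + 128 B and 16 MB + 256 B, no longer exact multiples of the 2 MB cache size:

```c
#include <stddef.h>

enum { NA = 1024 * 1024, PAD = 16 };

/* Mirrors the Fortran common block: small pad arrays shift b and c so
   that a(i), b(i) and c(i) no longer share a line of a 2 MB
   direct-mapped cache (8 MB spacing divides evenly by 2 MB). */
struct padded {
    double a[NA];
    double pad1[PAD];   /* 128-byte shift before b */
    double b[NA];
    double pad2[PAD];   /* another 128-byte shift before c */
    double c[NA];
};
```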

Prefetching

The latency involved in a cache miss can be hidden by issuing a load instruction several instructions ahead of the data actually being needed.

The optimiser will usually take care of this for you

This can be simulated in your source code, but it's difficult to arrange this without interference from the optimiser.
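As one illustration, GCC and Clang expose a __builtin_prefetch hint for doing this by hand. The prefetch distance of 8 iterations here is a guess that would need tuning, and the hint does not change the result:

```c
/* axpy with manual prefetch hints a few iterations ahead.
   __builtin_prefetch is a GCC/Clang extension and only a hint:
   results are identical with or without it. */
void axpy_prefetch(double *z, const double *x, const double *y,
                   double a, int n) {
    for (int i = 0; i < n; i++) {
        if (i + 8 < n) {                   /* stay in bounds */
            __builtin_prefetch(&x[i + 8]);
            __builtin_prefetch(&y[i + 8]);
        }
        z[i] = a * x[i] + y[i];
    }
}
```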

Optimisation – p.12/22

Hyperthreading

Technology invented by Tera corporation, andbought by Intel.

Implemented in latest Pentium IV CPUs

When a thread stalls due to a cache miss, the CPU switches to another thread.

Compile program with -openmp or -parallel, and run on 2 threads per CPU

Optimisation – p.13/22

Hyperthreading example

Single precision Matmul compiled with ifc -O3 -tpp7 -unroll -openmp -vec -axW -xW

[Plot: MFlops vs. matrix size (0-12000) for Unvectorised, Vectorised, Hyperthreading, and Hyperthreading and Vectorisation runs.]

Optimisation – p.14/22

It's a bit more complicated...

Modern CPUs have multiple cache levels (L1, L2, etc.).

Addresses used at machine language level are virtual. Virtual addresses are mapped to physical addresses by the virtual memory manager. Mapped addresses are cached in the Translation Lookaside Buffer (TLB).

The effect of the TLB is like a large cache

Optimisation – p.15/22

Inlining

Subroutine & Function calls degrade performance

Call and return instructions add overheads

Pushing arguments onto the stack and setting up the stack frame add overheads

Breaks software pipelines

Inhibits parallelisation (ameliorated with PURE)

Small functions/subroutines should be inlined

Optimisation – p.16/22

Inlining...

Compilers usually do inlining at the highest optimisation level

C++ has inline keyword

Fortran has internal functions (possibly inlined)

C preprocessor macros can be used in simple cases

Worst case scenario — you can always manually inline code

Inlining trades speed for code size — unless coding for embedded applications, code size is rarely a problem.
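A minimal C sketch of the inline keyword (function names are illustrative):

```c
/* `inline` asks the compiler to substitute the body at the call site,
   removing the call/return and stack-frame overheads for a tiny helper. */
static inline double sq(double x) { return x * x; }

double sum_of_squares(const double *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += sq(v[i]);    /* no call overhead once inlined */
    return s;
}
```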

Optimisation – p.17/22

Loop unrolling

Loop overheads: 3 clock cycles per iteration

Increment index i=i+1

Test i<n

Branch if false then exit

Consider axpy: z(i)=a*x(i)+y(i)

3 load/stores, 1 fused add-multiply: Loop overheads dominate!

Optimisation – p.18/22

Unrolling (depth 4)

do i=1,n,4
  z(i)=a*x(i)+y(i)
  z(i+1)=a*x(i+1)+y(i+1)
  z(i+2)=a*x(i+2)+y(i+2)
  z(i+3)=a*x(i+3)+y(i+3)
enddo

12 load/stores, 4 fused add-multiplies, 3 cycles of loop overhead. Loop overhead no longer dominates!

But need 13 registers instead of 4. Unrolling too much leads to register spill.
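The depth-4 unrolling in C, with a scalar cleanup loop for n not divisible by 4 (the Fortran version above assumes it is):

```c
/* axpy unrolled by 4: one index increment, test and branch now cover
   four elements instead of one. */
void axpy_unrolled(double *z, const double *x, const double *y,
                   double a, int n) {
    int i = 0;
    for (; i + 3 < n; i += 4) {
        z[i]     = a * x[i]     + y[i];
        z[i + 1] = a * x[i + 1] + y[i + 1];
        z[i + 2] = a * x[i + 2] + y[i + 2];
        z[i + 3] = a * x[i + 3] + y[i + 3];
    }
    for (; i < n; i++)                /* remainder iterations */
        z[i] = a * x[i] + y[i];
}
```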

Unrolling typically performed at -O3.

Optimisation – p.19/22

Temporary Copies

Consider a 5 point stencil

Δx(i,j) = κ (x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1) − 4x(i,j))

delx=kappa*(eoshift(x,shift=-1,dim=1)+...-4*x)

This creates 4 temporary arrays to hold the shifted data. Lots of copying!

Optimisation – p.20/22

Temporary copies...

Instead:

forall (i=2:n-1, j=2:n-1)
  delx(i,j)=kappa*(x(i-1,j)+...-4*x(i,j))
end forall

A clever compiler may be able to optimise the eoshift code, but don't bet on it!

In C++, expression templates can help
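The same stencil written as a plain C loop nest: one pass over the grid and no shifted temporaries. The grid size and names are illustrative:

```c
enum { NG = 8 };   /* illustrative grid size */

/* 5-point stencil applied directly: no shifted copies of x are made. */
void stencil(double delx[NG][NG], double x[NG][NG], double kappa) {
    for (int i = 1; i < NG - 1; i++)
        for (int j = 1; j < NG - 1; j++)
            delx[i][j] = kappa * (x[i-1][j] + x[i+1][j]
                                + x[i][j-1] + x[i][j+1]
                                - 4.0 * x[i][j]);
}
```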

Optimisation – p.21/22

Conclusions

Advanced Programming doesn't always help performance

but usually helps code readability

10% of the code consumes 90% of the CPU time

"Premature optimisation is the root of all evil" (Donald Knuth)

Optimisation – p.22/22
