Performance of mathematical software

21
Performance of mathematical software Agner Fog Technical University of Denmark www.agner.org

description

Performance of mathematical software. Agner Fog Technical University of Denmark www.agner.org. Agenda. Intel and AMD microarchitecture Performance bottlenecks Parallelization C++ vector classes, example Instruction set dispatching Performance measuring Hands-on examples. - PowerPoint PPT Presentation

Transcript of Performance of mathematical software

Page 1: Performance of mathematical software

Performance of mathematical software

Agner FogTechnical University of Denmark

www.agner.org

Page 2: Performance of mathematical software

• Intel and AMD microarchitecture• Performance bottlenecks• Parallelization• C++ vector classes, example• Instruction set dispatching• Performance measuring• Hands-on examples

Agenda

Page 3: Performance of mathematical software

AMD Microarchitecture

• 2 threads per CPU unit

• 4 instructions per clock

• Out-of-order execution

Page 4: Performance of mathematical software

Intel Micro-architecture

Page 5: Performance of mathematical software

Typical bottlenecks• Installation of program• Program start, load framework and

libraries• System database• Network• File input / output• Graphics• RAM access, cache utilization• Algorithm• Dependency chains• CPU pipeline• CPU execution units

Sp

eed

Page 6: Performance of mathematical software

Efficient cache use

• Store data in contiguous blocks• Avoid advanced data containers

such as resizable data structures, linked lists, etc.

• Store local data inside the function they are used in

Page 7: Performance of mathematical software

Branch prediction

Loop

Branch

BA

Page 8: Performance of mathematical software

Out-Of-Order Execution

x = a / b;y = c * d;z = x + y;

Page 9: Performance of mathematical software

Dependency chains

x = a + b + c + d;

x = (a + b) + (c + d);

Page 10: Performance of mathematical software

Loop-carrieddependency chain

for (i = 0; i < n; i++) { sum += x[i]; }

for (i = 0; i < n; i += 2) { sum1 += x[i]; sum2 += x[i+1]; }sum = sum1 + sum2;

Page 11: Performance of mathematical software

Levels of parallelization

• Multiple threads running in different CPU units

• Out-of-order execution• SIMD instructions

Page 12: Performance of mathematical software

SIMD programming methods

• Assembly language• Intrinsic functions• Vector classes• Automatic vectorization by

compiler

Page 13: Performance of mathematical software
Page 14: Performance of mathematical software

Obstacles to automatic vectorization

• Pointer aliasing• Array alignment• Array size• Algebraic reduction• Branches• Table lookup

Page 15: Performance of mathematical software

Vector math libraries

Short vectors

• Intel SVML• AMD libm• Agner’s VCL

Long vectors

• Intel VML• Intel IPP• Yeppp

Page 16: Performance of mathematical software

ExampleExponential function

in vector classes

Page 17: Performance of mathematical software

Instruction set dispatching

SSE

SSE2

SSE3

Generic x86

SSE code

SSE2 code

AVX2 code

N

J

N

N

J

J

GenuineIntel

N

J

Future extensions

Page 18: Performance of mathematical software

Performance measurement

• Profiler

• Insert measuring instruments into code

• Performance monitor counters

Page 19: Performance of mathematical software

When timing results are unstable

• Stop other running programs• Warm up CPU with heavy work• Disable power saving features in

BIOS setup• Disable hyperthreading• Use “core clock cycles” counter

(Intel only)

Page 20: Performance of mathematical software

ExampleMeasuring latency and throughput of

machine instruction

Page 21: Performance of mathematical software

Hands-on example

Compare performance of different mathematical function libraries

http://cern.agner.org