Talk About Performance

Post on 22-Nov-2014


The talk I did during IT Weekend Rivne event 2 years ago.

Transcript of Talk About Performance

Talk About Performance

@YaroslavBunyak

Senior Software Engineer, SoftServe Inc.

What is Performance?

What is a Program?

data -> xform -> data

⬆ THIS

!!

How to Create a Program?

Simple

Write code. Use your favorite programming language: C, C++, Objective-C, Java, etc.

Compile. The compiler will transform your code into machine code.

Run on target hardware. Hardware is a black box <- Right?

Wrong!

Bad Programs

Sloppy: using the program is like trying to swim in jelly

Use memory inefficiently

Battery is dead already

Good Programs

Run fast

Use little memory

Save battery

I write them!

It was a joke :)

How to Create a Good Program?

What is a Program?

data -> xform -> data

code

hardware

Code Sample

int a = ...
int b = ...
// more code...

int c = a + b;

Q: How fast is this code?

A: Depends...

... on how fast the CPU adds two integers? NO

Any modern CPU can add integers very fast: ~1 cycle

... on whether `a` and `b` are ready for processing, i.e. loaded into CPU registers

Loading data from memory into a register: ~600 cycles

Q: What is the CPU doing in the meantime?

A: Nothing! It's waiting for data

You Ask

Can we do better?

Yes. And your hardware will help you

CPU

CPU Operation

Load & decode instruction(s)

Load data: memory -> registers

Execute instruction(s)

Store results: registers -> memory

(Not) Pipeline

IL = instruction load, ID = instruction decode, DL = data load, EX = execute, DS = data store

        pipeline stage
cycle   IL         ID         DL         EX         DS
1       instr. 1   -          -          -          -
2       -          instr. 1   -          -          -
3       -          -          instr. 1   -          -
4       -          -          -          instr. 1   -
5       -          -          -          -          instr. 1
6       instr. 2   -          -          -          -
7       -          instr. 2   -          -          -

Pipeline

        pipeline stage
cycle   IL         ID         DL         EX         DS
1       instr. 1   -          -          -          -
2       instr. 2   instr. 1   -          -          -
3       instr. 3   instr. 2   instr. 1   -          -
4       instr. 4   instr. 3   instr. 2   instr. 1   -
5       -          instr. 4   instr. 3   instr. 2   instr. 1
6       -          -          instr. 4   instr. 3   instr. 2
7       -          -          -          instr. 4   instr. 3
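The two tables above can be summarized with a little arithmetic. The sketch below assumes an idealized 5-stage pipeline with one cycle per stage and no stalls; the function names are hypothetical and only serve the illustration.

```cpp
#include <cstdint>

// Cycles needed when each instruction must pass through every stage
// before the next one may start (the "(Not) Pipeline" case).
constexpr std::uint64_t cycles_unpipelined(std::uint64_t instructions,
                                           std::uint64_t stages) {
    return instructions * stages;
}

// Cycles needed by an ideal pipeline without stalls: the first
// instruction takes `stages` cycles, each following instruction
// completes one cycle after the previous one.
constexpr std::uint64_t cycles_pipelined(std::uint64_t instructions,
                                         std::uint64_t stages) {
    return instructions == 0 ? 0 : stages + (instructions - 1);
}

// Four instructions on five stages, as in the tables above.
static_assert(cycles_unpipelined(4, 5) == 20, "4 instructions, no overlap");
static_assert(cycles_pipelined(4, 5) == 8, "4 instructions, full overlap");
```

Real CPUs deviate from this model (stalls, multiple issue, variable stage latency), so treat the numbers as an upper bound on the benefit for this simple case.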

Branch Prediction

if (day == Monday)        // 1
    dose = kDouble;       // 2
else
    dose = kStandard;     // 3

make_coffee(dose);        // 4

After instruction 1: what instruction to load & decode next? Two or three?

CPU will try to predict and start load & decode

If it was wrong: discard results, flush pipeline

Pipeline

        pipeline stage
cycle   IL         ID         DL         EX         DS
1       instr. 1   -          -          -          -
2       instr. 2   instr. 1   -          -          -
3       instr. 4   instr. 2   instr. 1   -          -
4       -          instr. 4   instr. 2   instr. 1   -          <- instr. 1 executed, prediction was correct
5       -          -          instr. 4   instr. 2   instr. 1
6       -          -          -          instr. 4   instr. 2
7       -          -          -          -          instr. 4

Pipeline

        pipeline stage
cycle   IL         ID         DL         EX         DS
1       instr. 1   -          -          -          -
2       instr. 2   instr. 1   -          -          -
3       instr. 4   instr. 2   instr. 1   -          -
4       -          instr. 4   instr. 2   instr. 1   -          <- instr. 1 executed, wrong prediction detected
5       instr. 3   -          -          -          instr. 1
6       instr. 4   instr. 3   -          -          -
7       -          instr. 4   instr. 3   -          -

Takeaways

Branches are bad for the pipeline

Avoid if possible

Help branch predictor to help you
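One way to "avoid if possible" is to turn a data-dependent branch into plain arithmetic. The following is a minimal sketch with hypothetical function names; whether it is actually faster depends on the compiler, the CPU and the data, so it should be measured rather than assumed.

```cpp
#include <cstddef>
#include <vector>

// Counts values >= threshold using a data-dependent branch.
// On random input the branch predictor will often guess wrong.
std::size_t count_branchy(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        if (v >= threshold) {
            ++count;
        }
    }
    return count;
}

// Same result without a branch in the loop body: the comparison
// yields 0 or 1, which is added to the counter directly.
std::size_t count_branchless(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        count += static_cast<std::size_t>(v >= threshold);
    }
    return count;
}
```

An optimizing compiler may already generate branch-free code for the first version. When a branch cannot be removed, making its outcome regular (for example by processing sorted data) is one way to help the predictor.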

Memory

Workflow

Program data is stored in memory

CPU requests data for processing

Typical cycle: load, process, store

Architecture

[Diagram: CPU <-> Memory Controller <-> Memory Banks]

Parameters

There are two main parameters of the memory subsystem:

latency

bandwidth

Latency

Shows how much time passes between a data request and its delivery

Very important concept (see further)

Bandwidth

Shows how much data can be accessed per second

Also important

History Lesson

                          VAX-11 (1980)   Modern Desktop   Improvement
Clock Speed, MHz          6               3000             +500x
Memory Size, MB           2               2000             +1000x
Memory Bandwidth, MB/s    13              7000             +540x
Memory Latency, ns        225             70               +3x
Memory Latency, cycles    1.4             210              -150x

Data from “Machine Architecture” talk by Herb Sutter
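The "cycles" row follows from the "ns" and "MHz" rows. A small sketch of the conversion (the function name is hypothetical): one MHz is 10^6 cycles per second and one ns is 10^-9 seconds, so their product has to be divided by 1000.

```cpp
// Converts a memory latency in nanoseconds into CPU cycles
// for a given clock speed in MHz.
constexpr double latency_in_cycles(double latency_ns, double clock_mhz) {
    return latency_ns * clock_mhz / 1000.0;
}

// VAX-11:          225 ns at    6 MHz -> about 1.4 cycles
// Modern desktop:   70 ns at 3000 MHz -> 210 cycles
```

Latency in nanoseconds improved roughly 3x, but measured in cycles, which is what the CPU actually waits for, it became about 150x worse.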

For the past 30+ years we have seen huge improvements in CPU processing power and data sizes ... but

Memory speeds couldn’t keep up with the progress

Takeaways

Latency is the king!

You can trade CPU time for memory, i.e. calculate more - load/store less
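"Calculate more, load/store less" can be as simple as not storing a value that is cheap to recompute. The sketch below uses hypothetical `Rect` types invented for this illustration; it shows the idea, not a measured optimization.

```cpp
#include <vector>

// Hypothetical layout A: stores a derived value next to its inputs.
struct RectCached {
    float width;
    float height;
    float area;  // redundant: always width * height
};

// Hypothetical layout B: stores only the inputs.
struct Rect {
    float width;
    float height;
};

// Layout B is smaller, so more elements fit into each cache line.
static_assert(sizeof(Rect) < sizeof(RectCached), "Rect should be smaller");

// Loads three floats per element but uses only one of them.
float total_area_cached(const std::vector<RectCached>& rects) {
    float total = 0.0f;
    for (const RectCached& r : rects) {
        total += r.area;
    }
    return total;
}

// Loads two floats per element and spends one multiplication instead.
float total_area_computed(const std::vector<Rect>& rects) {
    float total = 0.0f;
    for (const Rect& r : rects) {
        total += r.width * r.height;
    }
    return total;
}
```

A multiplication costs about a cycle, while a load that misses the cache costs hundreds, so the extra arithmetic is usually the cheaper side of the trade.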

Memory types

There are two main memory types:

Static RAM - fast, but very expensive

Dynamic RAM - slow, but cheaper

Which one to use?

Solution

Build a memory hierarchy which utilizes large amounts of cheap DRAM storage and small amounts of fast SRAM cache

Memory Hierarchy

Memory

L2 Cache

L1i/L1d

iPhone 4s:

32 KB L1i, 32 KB L1d

1 MB L2

512 MB DRAM

Access:

registers - 1 cycle

L1 - 5 cycles

L2 - 40 cycles

DRAM - 610 cycles
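To get a feeling for what these access times mean on average, here is a deliberately simplified model. It takes the cycle counts from the slide as the full cost of a load served by each level; the hit rates are assumptions supplied by the caller and the names are hypothetical.

```cpp
// Cycle counts from the slide above (iPhone 4s).
constexpr double kL1Cycles = 5.0;
constexpr double kL2Cycles = 40.0;
constexpr double kDramCycles = 610.0;

// Expected cost of one load in cycles.
//   l1_hit: fraction of loads served by L1 (0.0 .. 1.0)
//   l2_hit: fraction of L1 misses served by L2 (0.0 .. 1.0)
constexpr double expected_load_cycles(double l1_hit, double l2_hit) {
    return l1_hit * kL1Cycles
         + (1.0 - l1_hit) * l2_hit * kL2Cycles
         + (1.0 - l1_hit) * (1.0 - l2_hit) * kDramCycles;
}
```

With an assumed 90% hit rate at both levels the result is about 14.2 cycles per load, and 6.1 of those come from the 1% of loads that go all the way to DRAM.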

Cache Miss

If the data requested by the CPU is not in the cache, it has to be loaded from the main (slow) memory

Cache Line

The minimum amount of data that can be read from or written to main memory at once

Usually 64-128 bytes

What does it mean?

Consider you have an array of 16 floats and you want the first float for calculations

If it’s not in cache already, you will pay the “full price” to load the entire cache line

Access the remaining 15 floats “for free”
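The "for free" part only pays off if the code actually touches neighbouring elements next. A minimal sketch, assuming a matrix stored row by row in one contiguous array (function names are hypothetical): both functions compute the same sum, but only the first one walks memory in the order it is laid out.

```cpp
#include <cstddef>
#include <vector>

// Walks memory linearly: consecutive accesses fall into the same
// cache line until it is used up.
long long sum_row_by_row(const std::vector<int>& m,
                         std::size_t rows, std::size_t cols) {
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}

// Jumps `cols` elements between accesses: for wide rows every access
// may land in a different cache line.
long long sum_column_by_column(const std::vector<int>& m,
                               std::size_t rows, std::size_t cols) {
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}
```

For small matrices that fit into the cache the difference disappears; it becomes visible once the data is much larger than the cache.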

Prefetch

Modern CPUs and compilers are able to detect memory access patterns and preload data into caches speculatively

So, data will be ready when you need it

But your data access patterns must be very simple - linear is a good one

BTW, the C++ operator-> is sometimes referred to as the “cache miss” operator

Can you guess why?
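A sketch of why linear access is friendly to the prefetcher and pointer chasing is not. The `Node` type and function names are hypothetical; the two functions compute the same kind of sum over different data layouts.

```cpp
#include <vector>

// A hand-rolled singly linked list node.
struct Node {
    int value;
    const Node* next;
};

// Pointer chasing: the address of the next element is only known
// after the current node has been loaded, and nodes may be scattered
// across memory. Each `n->next` is a potential cache miss.
long long sum_linked(const Node* head) {
    long long sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        sum += n->value;
    }
    return sum;
}

// Linear access: the next address is always "current + 1 element",
// a pattern hardware prefetchers are typically built to recognize.
long long sum_contiguous(const std::vector<int>& values) {
    long long sum = 0;
    for (int v : values) {
        sum += v;
    }
    return sum;
}
```

This is one plausible reading of the "cache miss operator" joke: every `->` follows a pointer to memory that may not be in the cache yet.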

How to Create a Good Program?

Simple

Know your target hardware

Know your data

Use your brain

One More Thing...

Data-Oriented Design
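The slide only names the topic. As one common illustration of data-oriented thinking (not necessarily the example used in the talk), the sketch below contrasts an "array of structures" with a "structure of arrays". All type and function names are hypothetical.

```cpp
#include <vector>

// Array of structures: all fields of one particle sit together.
struct ParticleAoS {
    float x, y, z;
    float mass;
    int id;
};

// Structure of arrays: each field is stored in its own contiguous array.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> mass;
    std::vector<int> id;
};

// Needs only `mass`, but every loaded cache line also carries
// x, y, z and id, which are not used here.
float total_mass_aos(const std::vector<ParticleAoS>& particles) {
    float total = 0.0f;
    for (const ParticleAoS& p : particles) {
        total += p.mass;
    }
    return total;
}

// Reads one dense array: every byte of every loaded cache line is used.
float total_mass_soa(const ParticlesSoA& particles) {
    float total = 0.0f;
    for (float m : particles.mass) {
        total += m;
    }
    return total;
}
```

Which layout is better depends on how the data is accessed, which is the point of "know your data": the layout follows from the transformations applied to it.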

Thank You!

Questions?

References

Ulrich Drepper, “What Every Programmer Should Know About Memory”

Kris Kaspersky, “Program Optimization Techniques: Effective Memory Usage” (in Russian)

@mike_acton