How NOT to Measure Latency

59
© Copyright Azul Systems 2015 © Copyright Azul Systems 2015 @azulsystems How NOT to Measure Latency Matt Schuetze Product Management Director, Azul Systems 6/12/2015 1 QCon NY Brooklyn, New York

Transcript of How NOT to Measure Latency

Page 1: How NOT to Measure Latency

© Copyright Azul Systems 2015

© Copyright Azul Systems 2015

@azulsystems

How NOT to Measure Latency

Matt Schuetze

Product Management Director, Azul Systems

6/12/2015 1

QCon NY

Brooklyn, New York

Page 2: How NOT to Measure Latency

© Copyright Azul Systems 2015

© Copyright Azul Systems 2015

@azulsystems

Understanding Latency and Application Responsiveness

Matt Schuetze

Product Management Director, Azul Systems

6/12/2015 2

QCon NY

Brooklyn, New York

Page 3: How NOT to Measure Latency

© Copyright Azul Systems 2015

© Copyright Azul Systems 2015

@azulsystems

The Oh $@%T! talk.

Matt Schuetze

Product Management Director, Azul Systems

6/12/2015 3

QCon NY

Brooklyn, New York

Page 4: How NOT to Measure Latency

© Copyright Azul Systems 2015

About me: Matt Schuetze

Product Management Director

at Azul Systems

Translate Voice of Customer

into Zing and Zulu

requirements and work items

Sing the praises of Azul efforts

through product launches

Azul alternate on JCP exec

committee, co-lead Detroit

Java User Group

Stand on the shoulders of

giants and admit it

6/12/2015 4

Page 5: How NOT to Measure Latency

© Copyright Azul Systems 2015

Philosophy and motivation

What do we actually care about. And why?

6/12/2015 5

Page 6: How NOT to Measure Latency

© Copyright Azul Systems 2015

Latency Behavior

Latency: The time it took one operation to

happen

Each operation occurrence has its own

latency

What we care about is how latency

behaves

Behavior is a lot more than “the common

case was X”

6/12/2015 6

Page 7: How NOT to Measure Latency

© Copyright Azul Systems 2015 ©2013 Azul Systems, Inc.

95%’ile

The “We only want to show good things” chart

We like to look at charts

6/12/2015 7

Page 8: How NOT to Measure Latency

© Copyright Azul Systems 2015

What do you care about?

Do you :

Care about latency in your system?

Care about the worst case?

Care about the 99.99%’ile?

Only care about the fastest thing in the day?

Only care about the best 50%

Only need 90% of operations to meet

requirements?

6/12/2015 8

Page 9: How NOT to Measure Latency

© Copyright Azul Systems 2015 ©2013 Azul Systems, Inc.

We like to rant about latency

6/12/2015 15

Page 10: How NOT to Measure Latency

© Copyright Azul Systems 2015

99%‘ile is ~60 usec.

(but mean is ~210usec)

“outliers”, “averages” and other nonsense

We nicknamed these

spikes “hiccups” 6/12/2015 16

Page 11: How NOT to Measure Latency

© Copyright Azul Systems 2015

Dispelling standard deviation

6/12/2015 17

Page 12: How NOT to Measure Latency

© Copyright Azul Systems 2015

Mean = 0.06 msec

Std. Deviation (𝞂) = 0.21msec

99.999% = 38.66msec

In a normal distribution,

These are NOT normal distributions

~184 σ (!!!) away from the mean

the 99.999%’ile falls within 4.5 σ

Dispelling standard deviation

6/12/2015 18

Page 13: How NOT to Measure Latency

© Copyright Azul Systems 2015

Is the 99%’ile “rare”?

6/12/2015 19

Page 14: How NOT to Measure Latency

© Copyright Azul Systems 2015

What are the chances of a single web page

view experiencing the 99%’ile latency of:

- A single search engine node?

- A single Key/Value store node?

- A single Database node?

- A single CDN request?

Cumulative probability…

6/12/2015 20

Page 15: How NOT to Measure Latency

© Copyright Azul Systems 2015 6/12/2015 21

Page 16: How NOT to Measure Latency

© Copyright Azul Systems 2015

Which HTTP response time metric

is more “representative” of user

experience?

The 95%’ile or the 99.9%’ile

6/12/2015 22

Page 17: How NOT to Measure Latency

© Copyright Azul Systems 2015

Example: A typical user session involves 5 page loads,

averaging 40 resources per page.

- How many of our users will NOT experience something

worse than the 95%’ile?

Answer: ~0.003%

- How many of our users will experience at least one

response that is longer than the 99.9%’ile?

Answer: ~18%

Gauging user experience

6/12/2015 23

Page 18: How NOT to Measure Latency

© Copyright Azul Systems 2015

Classic look at response time behavior

Response time as a function of load

source: IBM CICS server documentation, “understanding response times”

Average?

Max?

Median?

90%?

99.9%

6/12/2015 24

Page 19: How NOT to Measure Latency

© Copyright Azul Systems 2015

Hiccups are strongly multi-modal

They don’t look anything like a normal distribution

A complete shift from one mode/behavior to another

Mode A: “good”.

Mode B: “Somewhat bad”

Mode C: “terrible”, ...

The real world is not a gentle, smooth curve

Mode transitions are “phase changes”

6/12/2015 25

Page 20: How NOT to Measure Latency

© Copyright Azul Systems 2015

Proven ways to deal with hiccups

Actually characterizing latency

Requirements

Response Time

Percentile plot

line

6/12/2015 26

Page 21: How NOT to Measure Latency

© Copyright Azul Systems 2015

Different throughputs, configurations, or other parameters on one graph

Comparing Behavior

6/12/2015 29

Page 22: How NOT to Measure Latency

© Copyright Azul Systems 2015

Shameless Bragging

6/12/2015 30

Page 23: How NOT to Measure Latency

© Copyright Azul Systems 2015

Comparing Behaviors - Actual Latency sensitive messaging distribution application: HotSpot vs. Zing

6/12/2015 31

Page 24: How NOT to Measure Latency

© Copyright Azul Systems 2015

Zing

A standards-compliant JVM for Linux/x86 servers

Eliminates Garbage Collection as a concern for

enterprise applications in Java, Scala, or any JVM

language

Very wide operating range: Used in both low latency and

large scale enterprise application spaces

Decouples scale metrics from response time concerns

Transaction rate, data set size, concurrent users, heap

size, allocation rate, mutation rate, etc.

Leverages elastic memory for resilient operation

6/12/2015 32

Page 25: How NOT to Measure Latency

© Copyright Azul Systems 2015

What is Zing good for?

If you have a server-based Java application

And you are running on Linux

And you use using more than ~300MB of

memory, up to as high as 1TB memory,

Then Zing will likely deliver superior behavior

metrics

6/12/2015 33

Page 26: How NOT to Measure Latency

© Copyright Azul Systems 2015

Where Zing shines

Low latency

Eliminate behavior blips down to the sub-millisecond-units level

Machine-to-machine “stuff”

Support higher *sustainable* throughput (one that meets SLAs)

Messaging, queues, market data feeds, fraud detection, analytics

Human response times

Eliminate user-annoying response time blips. Multi-second and

even fraction-of-a-second blips will be completely gone.

Support larger memory JVMs *if needed* (e.g. larger virtual user

counts, or larger cache, in-memory state, or consolidating multiple

instances)

“Large” data and in-memory analytics

Make batch stuff “business real time”. Gain super-efficiencies.

Cassandra, Spark, Solr, DataGrid, any large dataset in fast motion 6/12/2015 34

Page 27: How NOT to Measure Latency

© Copyright Azul Systems 2015

An accidental conspiracy...

The coordinated omission problem

6/12/2015 35

Page 28: How NOT to Measure Latency

© Copyright Azul Systems 2015

The coordinated omission problem

Common load testing example:

– each “client” issues requests at a certain rate

– measure/log response time for each request

So what’s wrong with that?

– works only if ALL responses fit within interval

– implicit “automatic back off” coordination

Begin audience participation exercise now…

6/12/2015 36

Page 29: How NOT to Measure Latency

© Copyright Azul Systems 2015

It is MUCH more common than you

may think...

Is coordinated omission rare?

6/12/2015 37

Page 30: How NOT to Measure Latency

© Copyright Azul Systems 2015

Before

Correction

After

Correcting

for

Omission

JMeter makes this mistake... And so do other tools!

6/12/2015 38

Page 31: How NOT to Measure Latency

© Copyright Azul Systems 2015

Before Correction

After Correction

Wrong

by 7x

Real World Coordinated Omission effects

6/12/2015 39

Page 32: How NOT to Measure Latency

© Copyright Azul Systems 2015

Uncorrected Data

Real World Coordinated Omission effects

6/12/2015 40

Page 33: How NOT to Measure Latency

© Copyright Azul Systems 2015

Uncorrected Data

Corrected for

Coordinated

Omission

Real World Coordinated Omission effects

6/12/2015 41

Page 34: How NOT to Measure Latency

© Copyright Azul Systems 2015

A ~2500x

difference in

reported

percentile levels

for the problem

that Zing

eliminates Zing

“other” JVM

Real World Coordinated Omission effects Why I care

6/12/2015 42

Page 35: How NOT to Measure Latency

© Copyright Azul Systems 2015

How “real” people react

6/12/2015 43

Page 36: How NOT to Measure Latency

© Copyright Azul Systems 2015

Suggestions

Whatever your measurement technique is, test it.

Run your measurement method against artificial systems

that create hypothetical pauses scenarios. See if your

reported results agree with how you would describe that

system behavior

Don’t waste time analyzing until you establish sanity

Don’t ever use or derive from standard deviation

Always measure Max time. Consider what it means...

Be suspicious.

Measure %‘iles. Lots of them.

6/12/2015 44

Page 37: How NOT to Measure Latency

© Copyright Azul Systems 2015

HdrHistogram

6/12/2015 45

Page 38: How NOT to Measure Latency

© Copyright Azul Systems 2015

Then you need both good dynamic range and good resolution

HdrHistogram

If you want to be able to produce charts like this...

6/12/2015 46

Page 39: How NOT to Measure Latency

© Copyright Azul Systems 2015 6/12/2015 48

Page 40: How NOT to Measure Latency

© Copyright Azul Systems 2015 6/12/2015 49

Page 41: How NOT to Measure Latency

© Copyright Azul Systems 2015

Shape of Constant latency

10K fixed line latency 6/12/2015 50

Page 42: How NOT to Measure Latency

© Copyright Azul Systems 2015

Shape of Gaussian latency

10K fixed line latency with added

Gaussian noise (std dev. = 5K) 6/12/2015 51

Page 43: How NOT to Measure Latency

© Copyright Azul Systems 2015

Shape of Random latency

10K fixed line latency with added

Gaussian (std dev. = 5K) vs. random (+5K) 6/12/2015 52

Page 44: How NOT to Measure Latency

© Copyright Azul Systems 2015

Shape of Stalling latency

10K fixed base, stall magnitude of 50K

stall likelihood = 0.00005 (interval = 100) 6/12/2015 53

Page 45: How NOT to Measure Latency

© Copyright Azul Systems 2015

Shape of Queuing latency

10K fixed base, occasional bursts of 500 msgs

handling time = 100, burst likelihood = 0.00005 6/12/2015 54

Page 46: How NOT to Measure Latency

© Copyright Azul Systems 2015

Shape of Multi Modal latency

10K mode0 70K mode1 (likelihood 0.01)

180K mode2 (likelihood 0.00001) 6/12/2015 55

Page 47: How NOT to Measure Latency

© Copyright Azul Systems 2015

And this what the real(?) world

sometimes looks like…

6/12/2015 56

Page 48: How NOT to Measure Latency

© Copyright Azul Systems 2015

Real world “deductive reasoning”

6/12/2015 57

Page 49: How NOT to Measure Latency

© Copyright Azul Systems 2015

http://www.jhiccup.org

6/12/2015 58

Page 50: How NOT to Measure Latency

© Copyright Azul Systems 2015

jHiccup

6/12/2015 59

Page 51: How NOT to Measure Latency

© Copyright Azul Systems 2015

Discontinuity in Java execution

6/12/2015 60

Page 52: How NOT to Measure Latency

© Copyright Azul Systems 2015

Examples

6/12/2015 63

Page 53: How NOT to Measure Latency

© Copyright Azul Systems 2015

Oracle HotSpot ParallelGC Oracle HotSpot G1

1GB live set in 8GB heap, same app, same HotSpot, different GC

6/12/2015 64

Page 54: How NOT to Measure Latency

© Copyright Azul Systems 2015

Oracle HotSpot CMS Zing Pauseless GC

1GB live set in 8GB heap, same app, different JVM/GC

6/12/2015 65

Page 55: How NOT to Measure Latency

© Copyright Azul Systems 2015

Oracle HotSpot CMS Zing Pauseless GC

1GB live set in 8GB heap, same app, different JVM/GC- drawn to scale

6/12/2015 66

Page 56: How NOT to Measure Latency

© Copyright Azul Systems 2015

Zing

Low latency trading application

Oracle HotSpot (pure NewGen) Zing Pauseless GC

6/12/2015 67

Page 57: How NOT to Measure Latency

© Copyright Azul Systems 2015

Oracle HotSpot (pure newgen) Zing Oracle HotSpot (pure newgen)

Low latency trading application

Oracle HotSpot (pure NewGen) Zing Pauseless GC

6/12/2015 68

Page 58: How NOT to Measure Latency

© Copyright Azul Systems 2015

Oracle HotSpot (pure newgen) Zing

Low latency trading application – drawn to scale

Oracle HotSpot (pure NewGen) Zing Pauseless GC

6/12/2015 69

Page 59: How NOT to Measure Latency

© Copyright Azul Systems 2015

Q & A www.azul.com

www.jhiccup.com

www.hdrhistogram.com

@schuetzematt

6/12/2015 70