
Lies, Damn Lies and Performance Metrics

Barry Cooks

Virtual Instruments

Goal for This Talk

Take away a sense of how to make the move
from: improving your mean time to innocence
to: improving your infrastructure performance

What We’ll Cover

A case of performance metrics gone bad

Some history

What performance monitoring needs

The lies

The damn lies

The performance metrics

How you can use them

Application is down … again.

Data Center Management - Actual

You see this???

Array tools say it’s okay…

Data Center Management - Actual

How can I “help”?

Data Center Management - Actual

Meanwhile, at the storage vendor …

Have you tried updating your drivers and firmware?

And the switch vendor …

Can you clear the counters and run another log collection?

Some history

IBM – A Point of Reference

Mainframes collected and correlated lots of data about the workload and infrastructure.

Closed vs. Open Systems

The move to open systems introduced:

Numerous competing vendors

Interconnected specialized devices

Inconsistency in monitoring methods and metrics

Correlating data from multiple vendors is a serious challenge

Vendors’ focus has been on core innovation; monitoring became a secondary priority

What does performance monitoring need?

What’s Required for Success

Understanding what data is relevant

A method to gather that data, ideally without impacting the systems under monitoring

End-to-end view of data

Historical data retention

Comparable data across the vendor ecosystem

Actionable insights from that data

The lies

Performance Monitoring Today

“Performance” metrics are often:
Not really performance metrics: utilization, error counters
Samples taken on a polling interval: every minute, hour, 6 hours?
Rollup averages over a window of time

At 16G, a single 2KB frame takes 1.25μs to transmit. That’s 48 million 2K reads per minute. A fifteen-minute average? That’s the population of Europe.
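A minimal sketch of that arithmetic, assuming roughly 1,600 MB/s of payload bandwidth for 16G Fibre Channel (a common rule-of-thumb figure), showing how many individual frame times a single rollup average spans:

# How many 2 KB frame times fit inside common polling windows?
# Assumption: 16GFC carries roughly 1,600 MB/s of payload.
LINK_BYTES_PER_SEC = 1_600_000_000
FRAME_BYTES = 2 * 1024

frame_time_us = FRAME_BYTES / LINK_BYTES_PER_SEC * 1e6
print(f"One 2 KB frame: {frame_time_us:.2f} us")            # ~1.28 us

for window_s in (60, 15 * 60):                               # 1-minute and 15-minute rollups
    frames = window_s / (frame_time_us / 1e6)
    print(f"{window_s:>4} s window averages over ~{frames:,.0f} frame times")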

The Outlier: Traditional Performance Management

[Chart: value axis from $0 to $1,000,000; one average shown at $67K, another at $295K]

The Hidden Issue

[Timelines from 0 to 60 seconds: response time (1ms to 5,000ms) and I/Os per second (10,000)]

10,000 I/Os @ 1ms for 20s, then 32 I/Os @ 5,000ms, then 10,000 I/Os @ 1ms for 35s

Total commands: 10,000 × 55s + 32 = 550,032
Total I/O time: 1ms × 10,000 I/Os × 55s + 32 I/Os × 5,000ms = 710,000ms
Average response time = 710,000ms / 550,032 = 1.29ms
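A minimal sketch of the same arithmetic, showing how 32 commands that each stalled for five seconds vanish into the average:

# Reproduce the slide's numbers: 55 seconds at 10,000 IOPS / 1ms,
# plus 32 commands that each took 5,000ms.
fast_ios = 10_000 * 55            # 550,000 commands at 1ms
slow_ios = 32                     # 32 commands at 5,000ms

total_commands = fast_ios + slow_ios
total_io_time_ms = fast_ios * 1 + slow_ios * 5_000

print(total_commands)                         # 550,032
print(total_io_time_ms)                       # 710,000
print(total_io_time_ms / total_commands)      # ~1.29ms; the 5-second stall is invisible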

A Question of Balance

Is the traffic between these ports on the same server balanced?

Port A mean traffic: 4.41Mb/s

Port B mean traffic: 4.40Mb/s
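The means alone cannot answer that. A minimal sketch, using made-up one-second samples (not the slide’s measurements), of two ports whose mean traffic matches almost exactly while their behavior does not:

# Hypothetical per-second traffic samples (Mb/s): Port A is steady,
# Port B alternates between near-idle and bursts.
port_a = [4.4, 4.5, 4.3, 4.4, 4.5, 4.4, 4.4, 4.3, 4.5, 4.4]
port_b = [0.1, 8.7, 0.2, 8.8, 0.1, 8.7, 0.2, 8.8, 0.1, 8.3]

def mean(xs):
    return sum(xs) / len(xs)

print(mean(port_a), mean(port_b))             # ~4.41 vs ~4.40: "balanced"
print(max(port_a), max(port_b))               # 4.5 vs 8.8: clearly not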

Workload Profiling

Vendor “Response Time” Metrics

Utilization = 100% × busy time in period / (idle + busy) time in period

Throughput = total number of visitors in period / period length in seconds

Average Busy Queue Length (ABQL) = sum of queue length upon arrival of each visitor / total number of visitors

Queue length = ABQL × utilization / 100%

Response time = queue length / throughput (Little’s Law)

Expanded:
Response Time = (sum of queue length upon arrival of each visitor / total number of visitors)
                × (100% × busy time in period / ((idle + busy) time in period)) / 100%
                / (number of visitors in period / length of period)
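A minimal sketch of that chain of formulas using hypothetical poll-interval counters (not vendor output):

# Hypothetical counters for one LUN over a 60-second poll interval.
period_s  = 60.0
busy_s    = 42.0          # time with at least one request outstanding
idle_s    = 18.0
visitors  = 120_000       # I/Os completed in the period
queue_sum = 360_000       # sum of queue depth seen on each I/O arrival

utilization = 100.0 * busy_s / (busy_s + idle_s)           # %
throughput  = visitors / period_s                           # IOPS
abql        = queue_sum / visitors                          # average busy queue length
queue_len   = abql * utilization / 100.0
response_ms = queue_len / throughput * 1000.0               # Little's Law

print(f"{utilization:.0f}% busy, {throughput:.0f} IOPS, response ~{response_ms:.2f} ms")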

Vendor “Response Time” Metrics

The Fine Text (Necessary Caveats):

For low LUN throughput (<32 IOPS), response time might be inaccurate.

Lazy writes skew the LUN busy counter.

Dual SP ownership of a disk can also impact response time. Each SP only knows about its own ABQL, throughput and utilization for the disk. At poll time, they exchange views: the utilization is max(SPA, SPB), ABQL is computed from the sum of the sums, and SP throughput is the sum of SPA and SPB throughput.

Be wary of confusing SP response time in Analyzer with the average response time of all LUNs on that SP.

A LUN is busy (not resting) as long as something is queued to it. An SP is busy (not resting) as long as it is not in the OS idle loop. While a disk is busy servicing a LUN request, the LUN is still busy, but the SP might be idle. The SP response time is therefore generally smaller than the average response time of all the LUNs on that SP.

Host response time is approximated by LUN response time.
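A minimal sketch of combining those dual-SP counters at poll time, as the caveats describe (hypothetical values; the "sum of the sums" reading of ABQL is an assumption):

# Each storage processor (SP) reports its own view of the same disk.
spa = {"util": 65.0, "iops": 1_500.0, "queue_sum": 90_000, "visitors": 45_000}
spb = {"util": 40.0, "iops":   500.0, "queue_sum": 20_000, "visitors": 15_000}

util       = max(spa["util"], spb["util"])                 # utilization = max(SPA, SPB)
throughput = spa["iops"] + spb["iops"]                     # sum of SPA and SPB throughput
abql       = (spa["queue_sum"] + spb["queue_sum"]) / (     # "sum of the sums"
              spa["visitors"] + spb["visitors"])

print(util, throughput, abql)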

Data Time Skew

R² at a one-minute delay is 0.91, while at zero delay it is 0.41.
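A minimal sketch, with synthetic per-minute series rather than the slide’s data, of how that kind of skew shows up when you check the correlation at different offsets:

# Two per-minute series where series b lags series a by one sample.
a = [10, 12, 35, 40, 18, 11, 15, 42, 38, 16, 12, 14]
b = [ 9, 11, 13, 34, 41, 19, 12, 14, 43, 37, 15, 13]

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx  = sum((xi - mx) ** 2 for xi in x)
    vy  = sum((yi - my) ** 2 for yi in y)
    return cov * cov / (vx * vy)

print(r_squared(a, b))             # low: at zero delay the series look unrelated
print(r_squared(a[:-1], b[1:]))    # high: shift b by one minute and they line up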

Gathering the Data

A challenge for external software-based monitoring: perturbing the system under investigation

Adding load

Changing behavior

Data Collection

Data Collection

[Diagram: collecting data from hosts (AIX, VMware, HPux, Solaris, HyperV), SAN switches (Brocade, Cisco), and storage arrays (EMC, HDS, IBM)]

Data Collection

[Diagram: another view of data collection across the same hosts (AIX, VMware, HPux, Solaris, HyperV), SAN switches (Brocade, Cisco), and storage arrays (EMC, HDS, IBM)]

The damn lies

Decisions Based on Thresholds

[Flowchart parodying how thresholds get set. Nodes include: Refer to documentation; All clear? (Not yet / I guess so.); Ask somebody; All clear? (Not yet / I guess so.); Input a value; Just the right number of alarms, on the first try? Yeah, right.; Pick a lower threshold; Yes? Go buy a lottery ticket, immediately.; Have something better to do?; Create an email filter, done!; Done yet? No. / Uhhh. Yeah. Finally.]

Where should alarm thresholds be placed?

Traditional Performance Management: Data Granularity Challenge

[Chart: the data plotted at one-minute granularity against a fixed threshold]

Traditional Performance Management: Data Granularity Challenge

[Chart: the same data plotted at one-second granularity against the same threshold]

Traditional Performance Management: Data Granularity Challenge

[Chart: the same data plotted at one-millisecond granularity against the same threshold]
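A minimal sketch, with synthetic data, of why the granularity matters: the same signal, rolled up into coarser averages, stops crossing the alarm threshold at all:

import random

random.seed(1)
THRESHOLD = 80.0

# One hour of per-second samples: a quiet baseline with short, severe spikes.
samples = [random.uniform(5, 15) for _ in range(3600)]
for spike_start in (300, 1500, 2700):
    for i in range(spike_start, spike_start + 5):    # three 5-second bursts near 100
        samples[i] = random.uniform(95, 100)

def rollup(xs, width):
    return [sum(xs[i:i + width]) / width for i in range(0, len(xs), width)]

for label, width in (("1-second", 1), ("1-minute", 60), ("15-minute", 900)):
    breaches = sum(1 for v in rollup(samples, width) if v > THRESHOLD)
    print(f"{label:>9}: {breaches} samples over threshold")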

Performance metrics

The Outlier, Revisited: Traditional Performance Management

[Chart: value axis from $0 to $1,000,000; one average shown at $67K, another at $295K]

What Does Average Response Time Mean?

Q: When you hear your average response time is 20 ms, what is the first thing that pops into your mind?

A. My response distribution must look like this: [distribution sketch]
B. My response distribution must look like this: [distribution sketch]
C. My response distribution must look like this: [distribution sketch]
D. My response distribution must look like this: [distribution sketch]
E. I don’t know what my response distribution looks like, because taking an average of all the response times is not a helpful thing to do.
F. When’s lunch?

What Are “Histograms”?

A histogram is a graphical representation of the distribution of data.

Scalar quantization, typically denoted as y = Q(x), is the process of using a quantization function Q() to map a scalar (one-dimensional) input value x to a scalar output value y.

Histogram Bins

Timing Bins (each bin excludes its lower bound and includes its upper bound):

Reads (ms): >0–0.05, 0.05–0.2, 0.2–0.5, 0.5–1, 1–2, 2–4, 4–6, 6–8, 8–10, 10–15, 15–20, 20–30, 30–50, 50–75, 75–100, 100–150, 150–250, 250–500, 500–1000, 1000–4500, >4500

Writes (ms): >0–0.05, 0.05–0.1, 0.1–0.2, 0.2–0.3, 0.3–0.5, 0.5–0.7, 0.7–1, 1–1.5, 1.5–2, 2–3, 3–4, 4–6, 6–10, 10–20, 20–30, 30–50, 50–75, 75–100, 100–150, 150–250, 250–1000, 1000–4500, >4500

Size Bins:

Read & Write (KiB): >0–0.5, 0.5–1, 1–2, 2–3, 3–4, 4–8, 8–12, 12–16, 16–24, 24–32, 32–48, 48–60, 60–64, 64–96, 96–128, 128–192, 192–256, 256–512, 512–1024, >1024

The bins were selected on three criteria:

1. Sampling from live datacenter systems
2. Common SLA language
   a. Common service level agreement language uses 10, 15, 20, 30, 50ms boundaries
3. Expected disk seek/access latencies
   a. Cache hit range 0 – 0.5ms
   b. EFD / SSD range 0.5 – 2ms
   c. 15k FC/SAS range 2 – 6ms
   d. 10k FC/SAS range 6 – 10ms
   e. SATA/NL-SAS range 10 – 15ms
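A minimal sketch of quantizing response times into the read timing bins above, rather than averaging them (the latencies here are hypothetical):

from bisect import bisect_left
from collections import Counter

# Upper edges (ms) of the read timing bins; anything beyond the last edge
# falls into the final "> 4500ms" bin.
EDGES = [0.05, 0.2, 0.5, 1, 2, 4, 6, 8, 10, 15, 20, 30, 50,
         75, 100, 150, 250, 500, 1000, 4500]

def bin_label(latency_ms):
    i = bisect_left(EDGES, latency_ms)
    if i == len(EDGES):
        return "> 4500ms"
    lower = 0 if i == 0 else EDGES[i - 1]
    return f"> {lower} <= {EDGES[i]}ms"

# Hypothetical read latencies (ms): mostly cache hits, one slow outlier.
reads = [0.3, 0.4, 0.2, 0.5, 0.3, 0.4, 5.5, 0.3, 0.2, 120.0]
print(Counter(bin_label(r) for r in reads))
print(sum(reads) / len(reads))    # the lone average hides the 120ms outlier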

Write Cache Misses

[Latency histogram with two clusters, labeled Cache Hits and Cache Misses]

Impacts of Auto-Tiering

[Latency histogram with clusters labeled Cache Hits, SSD, FC, and SATA]

Auto-tiering left unattended

IO Size Skew

An average I/O size of 80 KiB does not do a very good job of describing the distribution.

Histogram Capabilities

Answers, not data

How to Analyze HBA Queue Depth

High-Quality Raw Data

Approach #1: Threshold Trigger

$ if (queue_size > 128)
      throw_red_flag

Approach #2: Average Metric

Average Queue Depth = 15

How to Analyze HBA Queue Depth

Approach #3: Combining Multiple Metrics With Machine Learning Analytics

[Two plots of Response Time (ms) vs. Queue Size, each with 50th, 75th, and 95th percentile curves: one labeled “Execution throttle set too high!”, the other “Execution throttle set properly.”]

Both these scenarios would trigger red flags in Approach #2
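A minimal stand-in sketch (synthetic observations and plain percentiles, not the product’s machine-learning analytics) of looking at the response-time distribution per queue size instead of a single average or threshold:

import random

random.seed(2)

# Synthetic (queue_size, response_time_ms) observations for one HBA port:
# response time grows with queue depth, with occasional long stalls.
obs = []
for _ in range(5000):
    q = random.randint(1, 64)
    rt = 0.2 + 0.05 * q + random.expovariate(2.0)
    if random.random() < 0.01:
        rt += random.uniform(20, 50)
    obs.append((q, rt))

def percentile(values, p):
    values = sorted(values)
    return values[int(p / 100 * (len(values) - 1))]

# Bucket by queue size and report 50th / 75th / 95th percentile response times.
for lo, hi in ((1, 16), (17, 32), (33, 64)):
    rts = [rt for q, rt in obs if lo <= q <= hi]
    print(f"queue {lo:>2}-{hi:<2}:",
          [round(percentile(rts, p), 2) for p in (50, 75, 95)])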

Repositioning VMs in a Cluster

High-Quality Raw Data: VM#1 CPU usage, MEM usage, NET usage, Disk usage

Approach #1: Threshold Trigger

$ if (vm_cpu_usage > 85%)
      move_vm_process

Approach #2: Average Metrics

Repositioning VMs in a Cluster

Approach #3: Predict Future Usage and Reorganize to Fix Bottlenecks BEFORE They Happen

Reorganize VMs such that the busy times of one VM correspond with the free times of the rest of the server (including both dynamic CPU and memory utilization).

[Charts of Server CPU Utilization % over time for VMs #12, #35, #46 and #16, #25, #17, contrasting a bottlenecked server today with a predicted future placement that yields steady usage]
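A minimal sketch of that idea, using synthetic usage traces and a simple correlation score rather than a real predictor: prefer co-locating VMs whose busy periods do not overlap:

# Hourly CPU-utilization traces (%) for three hypothetical VMs over one day.
vm_day   = [70 if 9 <= h < 18 else 10 for h in range(24)]    # busy during business hours
vm_night = [10 if 9 <= h < 18 else 65 for h in range(24)]    # busy overnight
vm_flat  = [40] * 24                                          # steady load

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) or 1.0
    vy = sum((b - my) ** 2 for b in y) or 1.0
    return cov / (vx * vy) ** 0.5

pairs = {"day+night": (vm_day, vm_night),
         "day+flat":  (vm_day, vm_flat),
         "day+day":   (vm_day, vm_day)}

for name, (a, b) in pairs.items():
    peak = max(x + y for x, y in zip(a, b))
    print(f"{name:10} corr={correlation(a, b):+.2f} combined peak={peak}%")
# The negatively correlated pair packs onto one server with the lowest combined peak.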

Where We Landed

Using high-quality, low-impact data, we can drive better decision-making across the infrastructure.

Analytics will enable a change in the way answers are derived from the data.
