Simple practices in performance monitoring and evaluation

Post on 22-Jan-2018


Simple Practices in Performance Monitoring and Evaluation

Schubert Zhang 2016.3.24

SLA

Service Level Agreements

https://en.wikipedia.org/wiki/Service-level_agreement

SLAs commonly include segments to address: a definition of services, performance measurement, problem management, customer duties, warranties, disaster recovery, termination of agreement.

• APIIM SLA

• Performance

• Performance-oriented SLA

Metrics: SLA, Performance SLA

Performance Metrics

e.g.1: API

• (99%)

e.g.2: Call Center

• Abandonment Rate: Percentage of calls abandoned while waiting to be answered.

• ASA (Average Speed to Answer): Average time it takes for a call to be answered by the service desk.

• TSF (Time Service Factor): Percentage of calls answered within a definite timeframe, e.g., 80% in 20 seconds.

• FCR (First-Call Resolution): Percentage of incoming calls that can be resolved without the use of a callback or without having the caller call back the helpdesk to finish resolving the case.

• TAT (Turn-Around Time): Time taken to complete a certain task.
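The call-center metrics above can be computed from a simple call log. The following is a minimal sketch, not a standard implementation: the `Call` record, its field names, and the choice of denominators (all incoming calls for TSF and FCR, answered calls for ASA) are assumptions made for illustration. TAT is left out because it depends on how a "task" is defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    """Hypothetical call record; field names are assumptions for this sketch."""
    wait_seconds: float                # time spent waiting in the queue
    answered: bool                     # False: the caller abandoned while waiting
    resolved_first_call: bool = False  # True: resolved without any callback


def call_center_metrics(calls: List[Call], tsf_threshold_seconds: float = 20.0) -> dict:
    """Compute abandonment rate, ASA, TSF and FCR from a list of calls."""
    total = len(calls)
    answered = [c for c in calls if c.answered]
    if total == 0 or not answered:
        raise ValueError("need at least one answered call")
    within_threshold = sum(1 for c in answered if c.wait_seconds <= tsf_threshold_seconds)
    first_call = sum(1 for c in answered if c.resolved_first_call)
    return {
        "abandonment_rate": (total - len(answered)) / total,
        "asa_seconds": sum(c.wait_seconds for c in answered) / len(answered),
        "tsf": within_threshold / total,  # assumption: denominator is all incoming calls
        "fcr": first_call / total,        # assumption: denominator is all incoming calls
    }


sample_calls = [
    Call(wait_seconds=5, answered=True, resolved_first_call=True),
    Call(wait_seconds=10, answered=True, resolved_first_call=False),
    Call(wait_seconds=30, answered=True, resolved_first_call=True),
    Call(wait_seconds=45, answered=False),
]
metrics = call_center_metrics(sample_calls)
```

In a real SLA the denominators and the TSF threshold are part of the agreement itself, so they should be taken from the contract rather than from this sketch.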

Metrics

Performance Metrics

Benchmarking

The quality of a service must be measured, evaluated, … benchmarked, and we must have a set of approaches for benchmarking.

Metrics to be monitored

Throughput

QPS, TPS, CPS

in seconds, in minutes, in hours …

Concurrency

Latency

Response Time, Round-Trip Time (RTT) …

Average, Median, Min., Max., Percentile …

Quantile / Percentile

Refer to the Google Sawzall paper.
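To make the quantile/percentile idea concrete, here is a small sketch using the nearest-rank definition on a sorted list. It is an exact, sort-based calculation chosen for clarity; it is not the approximate aggregation that a large-scale system would use, and the function name and sample data are assumptions for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds, with one slow request.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]

median_ms = percentile(latencies_ms, 50)
p90_ms = percentile(latencies_ms, 90)
p99_ms = percentile(latencies_ms, 99)
mean_ms = sum(latencies_ms) / len(latencies_ms)
```

With these samples the median is 13 ms and the 90th percentile is 16 ms, while the single 250 ms request pulls the mean up to 37 ms and only shows up directly at the 99th percentile.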

A Summary of these Concepts

[Diagram: Clients (Client-1, Client-2, Client-3 … Client-N) send requests to a pool of Work Threads on the Server, annotated with Throughput, Latency, and Concurrency.]

A Real-World Example

Example-1: The Amazon Dynamo paper

Average vs. 99.9% quantile

Example-2: Evaluation Report for a NoSQL DB

Cassandra

Benchmark for Write API

[Charts: Benchmark for Writes: cluster overview, throughput, latency]

• Each node runs 6 clients (threads), 54 clients in total.

• Each client generates random CDRs for 50 million users/phone numbers, and puts them into DaStor one by one.

  – Key space: 50 million

  – Size of a CDR: Thrift-compacted encoding, ~200 bytes

• Throughput: average ~80K ops/s; per node: average ~9K ops/s

• Latency: average ~0.5 ms

• Bottleneck: network (and memory)

Benchmark for Read API

• Each node runs 8 clients (threads), 72 clients in total.

• Each client randomly picks a user-id/phone-number out of the 50-million space and gets its 20 most recent CDRs (one page) from DaStor.

• All clients read CDRs of the same day/bucket.

[Chart: percentage of read ops (0% to 25%) by latency, in 100 ms buckets (1 to 61)]

• Throughput: average ~140 ops/s; per node: average ~16 ops/s

• Latency: average ~500 ms, 97% < 2 s (SLA)

• Bottleneck: disk IO (random seek) (CPU load is very low)

Chart markers: average, 97% quantile
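One way to sanity-check benchmark figures like these is Little's Law, which says that average concurrency equals throughput multiplied by average latency. The slides do not mention this check; it is added here as a sketch, and the function name is an assumption. The inputs are the rounded averages reported for the write and read benchmarks.

```python
def in_flight_requests(throughput_ops_per_s: float, avg_latency_s: float) -> float:
    """Little's Law: average number of requests in flight
    = throughput (ops/s) x average latency (s)."""
    return throughput_ops_per_s * avg_latency_s


# Write benchmark: ~80K ops/s at ~0.5 ms average latency, driven by 54 clients.
write_in_flight = in_flight_requests(80_000, 0.0005)

# Read benchmark: ~140 ops/s at ~500 ms average latency, driven by 72 clients.
read_in_flight = in_flight_requests(140, 0.5)
```

The results, about 40 for writes and 70 for reads, are in the same range as the 54 and 72 client threads, which suggests the reported throughput and latency are consistent with the stated concurrency. The match is only approximate because the published figures are rounded.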

Total & Delta

Total: Delta:

Generate the metrics and monitor them

• On the server side

• Add an operation count and the time cost for every client call

• For every monitor interval, pull the current throughput and latency and push them to the monitoring tool (Ganglia/Zabbix) or the console.

• Throughput = sum of count / time interval

• Latency = average (sum of latency / sum of count), max, min, quantile …
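The server-side bookkeeping described above can be sketched as a small thread-safe accumulator. This is an illustrative sketch under assumptions, not the code from the talk: the class name `OpStats`, its methods, and the injectable clock are all hypothetical. Each call to `snapshot()` reports the delta since the previous snapshot, which is what a monitoring tool would poll once per interval. Quantiles are omitted to keep the example short.

```python
import threading
import time


class OpStats:
    """Accumulates operation count and time cost for one API (hypothetical helper).
    snapshot() returns per-interval metrics and resets the counters."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock            # injectable so the example is deterministic
        self._lock = threading.Lock()
        self._last_snapshot = clock()
        self._reset()

    def _reset(self) -> None:
        self._count = 0
        self._total_latency = 0.0
        self._min = float("inf")
        self._max = 0.0

    def record(self, latency_seconds: float) -> None:
        """Call once per client call with the measured time cost."""
        with self._lock:
            self._count += 1
            self._total_latency += latency_seconds
            self._min = min(self._min, latency_seconds)
            self._max = max(self._max, latency_seconds)

    def snapshot(self) -> dict:
        """Throughput = count / interval; latency = total / count, plus min and max."""
        with self._lock:
            now = self._clock()
            interval = now - self._last_snapshot
            count = self._count
            report = {
                "throughput": count / interval if interval > 0 else 0.0,
                "avg_latency": self._total_latency / count if count else 0.0,
                "min_latency": self._min if count else 0.0,
                "max_latency": self._max,
            }
            self._last_snapshot = now
            self._reset()
            return report


# Usage with a fake clock: four calls recorded during a 2-second interval.
fake_now = [0.0]
stats = OpStats(clock=lambda: fake_now[0])
for latency in (0.010, 0.020, 0.030, 0.040):
    stats.record(latency)
fake_now[0] = 2.0
report = stats.snapshot()

# Nothing recorded in the next 1-second interval.
fake_now[0] = 3.0
idle_report = stats.snapshot()
```

Resetting on each snapshot keeps the reported values per-interval; a variant that never resets would expose running totals instead and leave the delta calculation to the monitoring tool.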

Code in Gitlab and Gerrit

Code for Spring Project

• Java

• JMX (Java Management Extensions, a simple example at https://github.com/schubertzhang/jsketch)

• javaagent (java -javaagent:jarpath[=options]; the JVM calls the agent's premain method before main)

• jmxetric (uses JMX and a javaagent to publish metrics to Ganglia, https://github.com/schubertzhang/jmxetric)

• Ganglia

• Zabbix

• …

Ganglia, Zabbix, etc.

Performance Benchmark Programming

Demo: Test and evaluate the throughput and latency of http://www.fangdd.com
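The demo program itself is not included in the transcript. The sketch below shows what a minimal benchmark driver of this kind could look like: several client threads call an operation in a loop while the driver measures throughput and latency. All names here (`run_benchmark`, `fake_request`) are assumptions, and the operation is a stand-in that only sleeps; to test a real site, it would be replaced by an actual HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_benchmark(operation: Callable[[], None], clients: int, calls_per_client: int) -> dict:
    """Run `operation` from `clients` concurrent threads and report
    the total operation count, throughput (ops/s) and latency (s)."""
    if clients < 1 or calls_per_client < 1:
        raise ValueError("clients and calls_per_client must be at least 1")

    def client_loop() -> List[float]:
        latencies = []
        for _ in range(calls_per_client):
            start = time.perf_counter()
            operation()
            latencies.append(time.perf_counter() - start)
        return latencies

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        futures = [pool.submit(client_loop) for _ in range(clients)]
        all_latencies = [lat for future in futures for lat in future.result()]
    elapsed = time.perf_counter() - started

    return {
        "ops": len(all_latencies),
        "throughput": len(all_latencies) / elapsed,
        "avg_latency": sum(all_latencies) / len(all_latencies),
        "max_latency": max(all_latencies),
    }


def fake_request() -> None:
    """Stand-in for a real request (assumption); sleeps for about 2 ms."""
    time.sleep(0.002)


result = run_benchmark(fake_request, clients=4, calls_per_client=5)
```

Varying `clients` while watching throughput and latency is the usual way to find the point where added concurrency stops increasing throughput and only increases latency.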

Demo Time …

demo screenshots


Average 95%

The long tail …
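A short numeric illustration of why the long tail matters: with a skewed latency distribution, the average can sit far from what most requests experience, while quantiles expose the tail directly. The data below is invented for illustration, and the nearest-rank helper is an assumption, not the method used to produce the chart.

```python
import math
import statistics
from typing import Sequence


def nearest_rank(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile on a sorted copy of the samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(p * len(ordered) / 100) - 1]


# Invented sample: 90 fast requests, 8 slower ones, 2 very slow ones (milliseconds).
latencies_ms = [10] * 90 + [50] * 8 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p95_ms = nearest_rank(latencies_ms, 95)
p99_ms = nearest_rank(latencies_ms, 99)
```

Here the median is 10 ms and the 95% quantile is 50 ms, yet the mean is 53 ms because two requests took 2000 ms. The 99% quantile, at 2000 ms, is the only figure that shows how bad the tail is.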

Statistical Monitoring for Outliers

Usually for troubleshooting

Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.
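The transcript does not say which statistical rule was used to spot outliers. As one simple possibility, the sketch below flags samples that exceed a multiple of the median, which is less sensitive to the outliers themselves than a rule based on the mean. The function name, the factor of 5, and the sample data are all assumptions for illustration.

```python
import statistics
from typing import List, Sequence, Tuple


def find_outliers(samples: Sequence[float], factor: float = 5.0) -> List[Tuple[int, float]]:
    """Return (index, value) pairs whose value exceeds factor x the median."""
    if not samples:
        return []
    threshold = factor * statistics.median(samples)
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


# Invented response times in milliseconds, with two slow requests.
response_times_ms = [20, 22, 19, 21, 400, 23, 20, 18, 950, 21]
outliers = find_outliers(response_times_ms)
```

Keeping the index alongside the value makes it possible to look up the matching request or log entry when troubleshooting.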

The magic matrix:

• Redis, Memcache

• Just add at a point, very low-cost

• Very

• Logs ELK

Heavy Logs & ELK

It’s another topic!

Thank You!