Simple Practices in Performance Monitoring and Evaluation
Schubert Zhang 2016.3.24
SLA
Service Level Agreements
https://en.wikipedia.org/wiki/Service-level_agreement
SLAs commonly include segments to address: a definition of services, performance measurement, problem management, customer duties, warranties, disaster recovery, and termination of agreement.
• APIIM SLA
• Performance
• Performance-oriented SLA
Metrics SLA Performance SLA
Performance Metrics
e.g.1: API
• (99%)
e.g.2: Call Center
• Abandonment Rate: Percentage of calls abandoned while waiting to be answered.
• ASA (Average Speed to Answer): Average time it takes for a call to be answered by the service desk.
• TSF (Time Service Factor): Percentage of calls answered within a definite timeframe, e.g., 80% in 20 seconds.
• FCR (First-Call Resolution): Percentage of incoming calls that can be resolved without the use of a callback or without having the caller call back the helpdesk to finish resolving the case.
• TAT (Turn-Around Time): Time taken to complete a certain task.
Metrics
Performance Metrics
Benchmarking
The quality of a service must be measured, evaluated, … and benchmarked,
and we must have a set of approaches for benchmarking.
Metrics to be monitored
Throughput
QPS, TPS, CPS
in seconds, in minutes, in hours …
Concurrency
Latency
Response Time, Round-Trip Time (RTT), …
Average, Median, Min., Max., Percentile, …
Quantile / Percentile
Refer to the Google Sawzall paper.
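To make the latency statistics above concrete, here is a minimal sketch of a percentile calculation. It uses the nearest-rank definition, which is only one of several common definitions and is an assumption here, not something the slides specify. The function name `percentile` and the sample values are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

summary = {
    "avg": sum(latencies_ms) / len(latencies_ms),
    "median": percentile(latencies_ms, 50),
    "p90": percentile(latencies_ms, 90),
    "min": min(latencies_ms),
    "max": max(latencies_ms),
}
print(summary)
```

With these samples the two outliers pull the average far above the median, which is why a percentile is usually reported next to the average.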
A Summary of these Concepts
[Diagram: Clients (Client-1, Client-2, Client-3, … Client-N) call Work Threads on the Server; the diagram is annotated with Throughput, Latency, and Concurrency.]
A Real-World Example
Example 1: the Amazon Dynamo paper
[Figure: average and 99.9% quantile]
Example 2: Evaluation Report for a NoSQL DB
Cassandra
Benchmark for Write API
• Each node runs 6 clients (threads), 54 clients in total.
• Each client generates random CDRs for 50 million users/phone-numbers, and puts them into DaStor one by one.
– Key space: 50 million
– Size of a CDR: Thrift-compacted encoding, ~200 bytes
• Throughput: average ~80K ops/s; per-node: average ~9K ops/s
• Latency: average ~0.5 ms
• Bottleneck: network (and memory)
[Charts: Benchmark for Writes, cluster overview; panels: Throughput, Latency.]
Benchmark for Read API
• Each node runs 8 clients (threads), 72 clients in total.
• Each client randomly picks a user-id/phone-number out of the 50-million space and gets its most recent 20 CDRs (one page) from DaStor.
• All clients read CDRs of the same day/bucket.
• Throughput: average ~140 ops/s; per-node: average ~16 ops/s
• Latency: average ~500 ms, 97% < 2 s (SLA)
• Bottleneck: disk IO (random seek); CPU load is very low
[Chart: percentage of read ops (0% to 25%) by latency in 100 ms buckets (1 to 61), with the average and the 97% quantile marked.]
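The slides do not state it, but the three metrics are tied together by a standard relation, Little's law: concurrency = throughput × latency. The sketch below applies it to the client counts and average latencies reported above, as a rough sanity check. It assumes a closed loop in which every client issues its next request as soon as the previous one returns; the function name `expected_throughput` is hypothetical.

```python
def expected_throughput(clients, avg_latency_s):
    """Little's law for a closed loop with no think time:
    throughput = concurrency / average latency."""
    if avg_latency_s <= 0:
        raise ValueError("latency must be positive")
    return clients / avg_latency_s


# Figures taken from the read and write benchmarks in the report.
read_ops = expected_throughput(clients=72, avg_latency_s=0.5)
write_ops = expected_throughput(clients=54, avg_latency_s=0.0005)

print(f"read:  ~{read_ops:.0f} ops/s expected, ~140 ops/s reported")
print(f"write: ~{write_ops:.0f} ops/s expected, ~80K ops/s reported")
```

The read estimate lands close to the reported figure. The write estimate is only an upper bound under the stated assumption; the lower reported value would be consistent with clients spending time outside the call itself, though the report does not say so.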
Total & Delta
Total: the cumulative value since start. Delta: the change within the current monitor interval.
Generate the metrics and monitor them
• On the server side
• Add an operation count and the time cost for every client call
• For every monitor interval, pull the current Throughput and Latency and push them to the monitoring tool (Ganglia/Zabbix) or the console.
• Throughput = sum of count / time interval
• Latency = average (sum of latency / sum of count), max, min, quantile …
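The steps above can be sketched as a small in-process recorder. This is a minimal illustration under assumptions, not the code used in the talk: the class name `OpMetrics`, its methods, and the injectable `clock` parameter are all hypothetical, and pushing the snapshot to Ganglia or Zabbix is left out.

```python
import threading
import time


class OpMetrics:
    """Accumulates an operation count and time cost per call;
    snapshot() reports the interval since the previous snapshot."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self.total_count = 0  # total: operations since start
        self._interval_start = clock()
        self._reset_interval()

    def _reset_interval(self):
        self._count = 0  # delta: operations in the current interval
        self._latency_sum = 0.0
        self._latency_max = 0.0
        self._latency_min = float("inf")

    def record(self, latency_s):
        """Call once per client call with its measured time cost."""
        with self._lock:
            self.total_count += 1
            self._count += 1
            self._latency_sum += latency_s
            self._latency_max = max(self._latency_max, latency_s)
            self._latency_min = min(self._latency_min, latency_s)

    def snapshot(self):
        """Call once per monitor interval; returns metrics and resets the delta."""
        with self._lock:
            now = self._clock()
            elapsed = now - self._interval_start
            count = self._count
            result = {
                "total_count": self.total_count,
                "delta_count": count,
                # Throughput = sum of count / time interval
                "throughput": count / elapsed if elapsed > 0 else 0.0,
                # Latency = sum of latency / sum of count
                "latency_avg": self._latency_sum / count if count else None,
                "latency_max": self._latency_max if count else None,
                "latency_min": self._latency_min if count else None,
            }
            self._interval_start = now
            self._reset_interval()
            return result


# Usage with a fake clock so the numbers are reproducible.
fake_now = [0.0]
metrics = OpMetrics(clock=lambda: fake_now[0])
for latency in (0.010, 0.020, 0.030, 0.100):
    metrics.record(latency)
fake_now[0] = 2.0  # first monitor interval lasted 2 seconds
first = metrics.snapshot()
metrics.record(0.050)
fake_now[0] = 3.0  # second monitor interval lasted 1 second
second = metrics.snapshot()
print(first)
print(second)
```

Keeping only a count, a sum, a max, and a min makes each `record` call cheap. Quantiles need more state (for example a histogram of latency buckets), which this sketch leaves out.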
Code in Gitlab and Gerrit
Code for Spring Project
• Java
• JMX (Java Management Extensions, a simple example at https://github.com/schubertzhang/jsketch)
• javaagent (java -javaagent:jarpath[=options]; the options are passed to the agent's premain method)
• jmxetric (uses JMX and javaagent to publish metrics to Ganglia, https://github.com/schubertzhang/jmxetric)
• Ganglia
• Zabbix
• …
Ganglia Zabbix etc.
Performance Benchmark Programming
Demo: test and evaluate the Throughput and Latency of http://www.fangdd.com
Demo Time …
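The demo code itself is not part of this transcript. As a rough idea of what such a benchmark program can look like, here is a minimal sketch that runs several concurrent clients against a target and reports throughput and latency. Everything in it is an assumption for illustration: the names `run_client`, `benchmark`, and `fake_request` are hypothetical, and `fake_request` is a local stand-in that sleeps instead of sending a real HTTP request.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def run_client(target, calls):
    """One simulated client: call target() sequentially and time each call."""
    latencies = []
    for _ in range(calls):
        start = time.perf_counter()
        target()
        latencies.append(time.perf_counter() - start)
    return latencies


def benchmark(target, clients, calls_per_client):
    """Run `clients` concurrent clients (both arguments must be >= 1)
    and summarize throughput and latency over all calls."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        futures = [pool.submit(run_client, target, calls_per_client)
                   for _ in range(clients)]
        latencies = sorted(t for f in futures for t in f.result())
    elapsed = time.perf_counter() - started
    ops = len(latencies)
    p95_rank = math.ceil(95 * ops / 100)  # nearest-rank, 1-based
    return {
        "ops": ops,
        "throughput": ops / elapsed,  # operations per second
        "latency_avg": sum(latencies) / ops,
        "latency_p95": latencies[p95_rank - 1],
        "latency_max": latencies[-1],
    }


def fake_request():
    """Hypothetical stand-in for one request to the system under test."""
    time.sleep(0.002)


report = benchmark(fake_request, clients=4, calls_per_client=25)
print(report)
```

To measure a real service, `fake_request` would be replaced by a function that performs one request. Threads are a reasonable fit here because such a client spends most of its time waiting on I/O.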
demo screenshots
[Chart from the demo, with the average and the 95% quantile marked.]
The long tail …
Statistical Monitoring for Outliers
Usually for troubleshooting
Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.
The magic matrix:
• Redis, Memcache
• Just add at a point, very low-cost
• Very
• Logs, ELK
Heavy Logs & ELK
It’s another topic!
Thank You!