Nikita Salnikov-Tarnovski @iNikem · 2019-09-11
Deceived by monitoring
Nikita Salnikov-Tarnovski
@iNikem
Me
• Nikita Salnikov-Tarnovski, @iNikem
• Java developer for 16 years
• 7 years mainly solving performance problems
• Master Developer at Plumbr
What is monitoring
“monitoring and management of performance and availability of software applications [with the goal] to detect and diagnose complex application performance problems to maintain an expected level of service”.
Wikipedia
Huh, WAT?
• Observe the state of the system
• Understand whether it is “good” or “bad”
• If “bad”, make it “good”
• Make it “better” in the future
Easy Metrics
• CPU usage is 90%
• Free disk space is 34GB
• There are 2M active users on the site
• Average response time for application X is 1s
• During the last 24h we had 578 errors in our logs
• 7 servers died in the last 4 hours
Problems
• Lack of context
• Misaligned goals
Goals of the application
• The goal is not to use X% of CPU
• And not to keep the disk mostly empty
• And not even to be fast
Real goal
• Satisfy the customer’s needs
• Meet business goals
Real metrics
• You have to observe the application from the point of view of your users
• Can they achieve their goal?
The simplest useful monitoring
• Observe real users’ interactions with your application
• Note failed interactions
• Record response times
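The three bullets above can be sketched in a few lines. This is only an illustration: `InteractionLog` and `observe` are hypothetical names, not part of any tool mentioned in the talk, and a real system would hook into the request-handling layer rather than wrap lambdas.

```python
import time
from dataclasses import dataclass, field


@dataclass
class InteractionLog:
    """Collects one record per real user interaction (hypothetical helper)."""
    durations_ms: list = field(default_factory=list)
    failures: int = 0

    def observe(self, action):
        """Run `action`, record its duration, and count it if it fails."""
        start = time.perf_counter()
        try:
            return action()
        except Exception:
            self.failures += 1
            raise
        finally:
            # The duration is recorded for successful and failed interactions alike.
            self.durations_ms.append((time.perf_counter() - start) * 1000.0)


log = InteractionLog()
log.observe(lambda: "ok")        # a successful interaction
try:
    log.observe(lambda: 1 / 0)   # a failed interaction
except ZeroDivisionError:
    pass
```

After these two calls the log holds two durations and one failure, which is already enough to answer "can users achieve their goal, and how long does it take?".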
The biggest fallacy
“Average response time is a useful metric”
Anscombe's quartet
CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9838454
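Anscombe's quartet shows datasets with identical summary statistics and very different shapes. The same effect is easy to reproduce with response times. The numbers below are made up for illustration and do not come from the talk.

```python
from statistics import mean

# Two hypothetical sets of 100 response times, in milliseconds.
steady = [100] * 100            # every request takes 100 ms
spiky = [50] * 99 + [5050]      # 99 fast requests and one 5-second outlier

# Both averages are exactly 100 ms, yet one user of the second
# system waited more than 5 seconds.
average_steady = mean(steady)
average_spiky = mean(spiky)
worst_steady = max(steady)
worst_spiky = max(spiky)
```

The average alone cannot tell these two systems apart; the maximum can.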
Percentiles
Most page loads will experience the 99%’ile server response
Gil Tene, How NOT to measure latency
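A rough way to see why: a single page load usually triggers many server requests, and each one has a 1% chance of being slower than the 99th percentile. The sketch below assumes the requests are independent, which is a simplification, and the request counts are illustrative rather than taken from the slides.

```python
def p_at_least_one_slower(percentile, requests):
    """Probability that at least one of `requests` independent responses
    is slower than the given percentile (expressed as 0-100)."""
    return 1.0 - (percentile / 100.0) ** requests


# With 69 requests per page load, more than half of all page loads
# include at least one response slower than the 99th percentile.
p_68 = p_at_least_one_slower(99, 68)
p_69 = p_at_least_one_slower(99, 69)
```

Under this assumption, any page that needs about 70 or more requests will hit the 99th percentile more often than not.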
Percentiles
Q: How many of your users will experience at least one response that is longer than the 99.99%’ile?
A: 18%
Gil Tene, How NOT to measure latency
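The slide does not say how many requests per user stand behind the 18%. Assuming independent responses, the figure can be inverted to find out. The function name is hypothetical and the resulting request count is derived here, not quoted from the talk.

```python
import math


def requests_for_share(percentile, share):
    """Number of independent requests per user at which `share` of users
    see at least one response slower than the given percentile (0-100)."""
    return math.log(1.0 - share) / math.log(percentile / 100.0)


# Solves 1 - 0.9999**n = 0.18 for n: roughly 2,000 requests per user.
n = requests_for_share(99.99, 0.18)
```

So a metric that sounds extreme, "one request in ten thousand", touches almost a fifth of the users once each of them has made a couple of thousand requests.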
Percentiles
• Always record your maximum value
• Forget about median/average
• Follow your 99%’ile or higher
• Plot them on a logarithmic scale
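A minimal sketch of these recommendations, using the nearest-rank definition of a percentile (one of several common definitions, chosen here for simplicity). The latency values are invented, and production systems typically use a histogram structure rather than sorting raw samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 1000 hypothetical response times in milliseconds.
latencies_ms = [20] * 985 + [400] * 14 + [9000]

p50 = percentile(latencies_ms, 50)   # 20: the median looks perfect
p99 = percentile(latencies_ms, 99)   # 400: the tail is 20x slower
worst = max(latencies_ms)            # 9000: one user waited 9 seconds
```

The three values span almost three orders of magnitude, which is why a logarithmic axis is the practical way to plot them together.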
Dichotomy of metrics
• Are users happy with your application? - direct metric
• Great for alerts and health assessment
• CPU/disk usage/errors in logs - indirect metrics
• Great for debugging and alert prevention
That was about fixing
• What about improving?
Planning performance
• Compete with actual business features
• Know when to stop
This or that?
• You have to explain to your manager why performance/resilience is important
• Use your user happiness metric as a proxy
Not all requests are equal
• Group requests by the service they consume and the user who initiated them
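One possible shape for such grouping is sketched below. The record layout, service names, and user names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical request records: (service, user, duration_ms, failed).
requests = [
    ("checkout", "alice", 1200, False),
    ("checkout", "bob", 300, True),
    ("search", "alice", 80, False),
    ("search", "alice", 95, False),
]

# One statistics bucket per (service, user) pair.
groups = defaultdict(lambda: {"count": 0, "failed": 0, "max_ms": 0})
for service, user, duration_ms, failed in requests:
    stats = groups[(service, user)]
    stats["count"] += 1
    stats["failed"] += int(failed)
    stats["max_ms"] = max(stats["max_ms"], duration_ms)
```

With the data keyed this way, a slow checkout for a paying customer can be weighed differently from a slow search for an anonymous visitor.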
Suits and beards
• Let business people decide which services and which users are more important
• Then you don’t need to prove the importance of any performance fix any more :)
Suits and beards
• And you have a perfect priority for improvements
• That actually makes sense to your manager!
When you talk to a suit
• “How many operations can fail?”
• “Are you stupid? Of course 0!”
• “How much time can the system be down?”
• “Are you kidding me? No downtime!”
• “How fast must operations be?”
• “What kind of question is this? As fast as possible!”
Now you have a price tag
• “This error happens twice a week for 1 user. Should I spend 2 days fixing it?”
• “Can we have 15 minutes of downtime every Sunday at 3 AM when we have 0 users?”
• “Should I spend 100K to move 99.99% latency from 800ms to 500ms?”
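Putting numbers on such a question makes it easier to discuss. As a small worked example, the Sunday maintenance window can be translated into an availability figure. The function name is hypothetical and the calculation assumes the window is the only downtime.

```python
MINUTES_PER_WEEK = 7 * 24 * 60   # 10080


def availability(downtime_minutes_per_week):
    """Fraction of the week during which the system is up."""
    return 1.0 - downtime_minutes_per_week / MINUTES_PER_WEEK


weekly = availability(15)         # about 0.9985, i.e. roughly 99.85%
yearly_downtime_hours = 15 * 52 / 60   # 13 hours per year
```

Framed like this, the choice is between "no downtime" and roughly 99.85% availability with every outage scheduled for a moment when nobody is using the system.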
Conclusion
• Technical metrics are so indirect they are almost harmful
• User “happiness” is the common ground between engineers and managers
Solving performance problems is hard. We don’t think it needs to be.
@JavaPlumbr / @iNikem
http://plumbr.eu