Production Profiling: What, Why and How · Logs of GC events Often manual, aids system...

46
Production Profiling: What, Why and How Richard Warburton (@richardwarburto) Sadiq Jaffer (@sadiqj) https://www.opsian.com

Transcript of Production Profiling: What, Why and How · Logs of GC events Often manual, aids system...

Page 1: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Production Profiling: What, Why and How

Richard Warburton (@richardwarburto)Sadiq Jaffer (@sadiqj)

https://www.opsian.com

Page 2: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Why Performance MattersDevelopment isn’t Production

Profiling vs MonitoringProduction Profiling

Conclusion

Page 3: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Customer Experience

Page 4: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Amazon: 100ms of latency costs 1% of sales

Google: 500ms seconds in search page generation time drops traffic by 20%

Responsive Applications make more Money

Page 5: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Stop Costly Downtime

Page 6: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Reduce Costs

Page 7: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Why Performance MattersDevelopment isn’t Production

Profiling vs MonitoringProduction Profiling

Conclusion

Page 8: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Development isn’t Production

Performance testing in development can be easier

May not have access to production

Tooling often desktop-based

Not representative of production

Page 9: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Unrepresentative Hardware

vs

Page 10: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Unrepresentative Software

Page 11: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Unrepresentative Workloads

vs

Page 12: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

The JVM may have very different behaviour in production

Hotspot does adaptive optimisation

Production may optimise differently

Page 13: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat
Page 14: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Why Performance MattersDevelopment isn’t Production

Profiling vs MonitoringProduction Profiling

Conclusion

Page 15: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Ambient/Passive/System Metrics

Preconfigured numerical measure about the

system

CPU Time Usage / Page-load Times

Cheap and sometimes effective

Page 16: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

LoggingRecords arbitrary events emitted by the system being monitored

log4j/slf4j/logback

Logs of GC events

Often manual, aids system understanding, expensive

Page 17: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Coarse Grained Instrumentation

Measures time within some instrumented section of the code

Time spent inside the controller layer of your web-app or performing SQL queries

More detailed and actionable though expensive

Page 18: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Production ProfilingWhat methods use up CPU time?

What lines of code allocate the most objects?

Where are your CPU Cache misses coming from?

Automatic, can be cheap but often isn’t

Page 19: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Where Instrumentation can be blind in the Real World

Problem: Every 5 seconds an HTTP endpoint would be really slow.

Instrumentation: on the servlet request, didn’t even show the pause!

Cause: Tomcat expired its resources cache every 5 seconds, on load one resource

scanned the entire classpath

Page 20: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat
Page 21: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Surely a better way?

Not just Metrics - Actionable Insights

Diagnostics aren’t Diagnosis

What about Profiling?

Page 22: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Why Performance MattersDevelopment isn’t Production

Profiling vs MonitoringProduction Profiling

Conclusion

Page 23: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

How to use Production Profilers

1) Extract relevant time period and apps/machines

2) Choose a type of profile: CPU Time/Wallclock Time/Memory

3) View results to tell you what the dominant consumer of a resource is

4) Fix biggest bottleneck

5) Deploy / Iterate

Page 24: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

CPU Time vs Wallclock Time

Page 25: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Profiling Hotspots

Page 26: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Profiling Treeviews

Page 27: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Profiling Flamegraphs

Page 28: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Instrumenting Profilers

Add instructions to collect timings (Eg: JVisualVM Profiler)

Inaccurate - modifies the behaviour of the program

High Overhead - > 2x slower

Page 29: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Sampling/Statistical Profilers

WebServerThread.run()

Controller.doSomething() Controller.next()

Repo.readPerson()

new Person()

View.printHtml() ??? ???

Page 30: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Safepoint Bias after Inlining

WebServerThread.run()

Controller.doSomething() Controller.next()

Repo.readPerson()

new Person()

View.printHtml() ???

Page 31: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Time to Safepoint

-XX:+PrintSafepointStatistics

Thre

ads

Safepoint poll

VM O

pera

tion

Page 32: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Advanced Statistical Profiling in Java

OS Signals to interrupt threads on resource consumption threshold

JVM’s signal handler-safe AsyncGetCallTrace to walk the stack

Page 33: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

People are put off by practical as much as technical issues

Page 34: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Barriers to Ad-Hoc Production Profiling

Generally requires access to production

Process involves manual work - hard to automate

Low-overhead open source profilers unsupported

Page 35: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

What if we profiled all the time?

Page 36: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Historical Data

Allows for post-hoc incident analysis

Enables correlation with other data/metrics

Performance regression analysis

Page 37: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Putting Samples in Context

Application version

Environment parameters (machine type, CPU, location, etc.)

Ad-hoc profiling we can’t do this

Page 38: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Opsian - Continuous Profiling

Opsian Aggregation

service

Web ReportsJVM Agents

Page 39: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Summary

We can profile in production with low overhead

To overcome practical issues we can profile production all the time

Profiling all the time opens up new capabilities

Page 40: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Why Performance MattersDevelopment isn’t Production

Profiling vs MonitoringProduction Profiling

Conclusion

Page 41: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Performance Matters

Development isn’t Production

Metrics can be unactionable

Instrumentation has high overhead

Continuous Profiling provides insight

Page 42: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

We need an attitude shift on profiling + monitoring

Page 43: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

ContinuousProactive not Reactive

Systematic not Ad Hoc

Page 44: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Please do Production Profiling.All the time.

Page 45: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

Any Questions?https://www.opsian.com/

Page 46: Production Profiling: What, Why and How · Logs of GC events Often manual, aids system understanding, expensive. ... on the servlet request, didn’t even show the pause! Cause: Tomcat

The End