CPU Profiling Limitations
● Finds CPU bound bottlenecks
● Many problems not CPU Bound
○ Networking
○ Database or External Service
○ I/O
○ Garbage Collection
○ Insufficient Parallelism
○ Blocking & Queuing Effects
Different Execution Profilers
● Instrumenting
○ Adds timing code to application
● Sampling
○ Collects thread dumps periodically
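To make "adds timing code to application" concrete, here is a minimal hand-written sketch of instrumentation. The class `TimingProbe` and its methods are hypothetical names chosen for illustration; real instrumenting profilers usually inject equivalent code automatically rather than asking you to wrap calls yourself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical hand-rolled instrumentation: call counts and elapsed nanoseconds per label. */
final class TimingProbe {
    private static final Map<String, LongAdder> TOTAL_NANOS = new ConcurrentHashMap<>();
    private static final Map<String, LongAdder> CALLS = new ConcurrentHashMap<>();

    /** Runs body, recording how long it took under the given label. */
    static <T> T timed(String label, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            TOTAL_NANOS.computeIfAbsent(label, k -> new LongAdder()).add(elapsed);
            CALLS.computeIfAbsent(label, k -> new LongAdder()).increment();
        }
    }

    static long calls(String label) {
        LongAdder a = CALLS.get(label);
        return a == null ? 0 : a.sum();
    }

    static long totalNanos(String label) {
        LongAdder a = TOTAL_NANOS.get(label);
        return a == null ? 0 : a.sum();
    }

    public static void main(String[] args) {
        int sum = timed("sum", () -> {
            int s = 0;
            for (int i = 0; i < 1000; i++) s += i;
            return s;
        });
        System.out.println("sum=" + sum + " calls=" + calls("sum") + " nanos=" + totalNanos("sum"));
    }
}
```

Note that the timing code itself runs on every call, which is one reason instrumentation can distort the cost of small, frequently called methods.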
Sampling Profilers
WebServerThread.run()
Controller.doSomething() Controller.next()
Repo.readPerson()
new Person()
View.printHtml()
Periodicity Bias
● Bias from sampling at a fixed interval
● Periodic operations with the same frequency as the samples
● Timed operations
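A small deterministic simulation may help show why a fixed sampling interval is a problem. It assumes a made-up periodic task that is busy for 1 ms out of every 10 ms (so the true busy fraction is 10%); the class name and all numbers are illustrative assumptions, not measurements.

```java
/** Deterministic simulation of periodicity bias; all numbers are illustrative assumptions. */
final class PeriodicityBias {
    /** True if the simulated periodic task is running at time t (microseconds). */
    static boolean busy(long t, long periodUs, long busyUs) {
        return t % periodUs < busyUs;
    }

    /** Fraction of samples that land on the task when sampling every intervalUs from offsetUs. */
    static double observedBusyFraction(long periodUs, long busyUs,
                                       long intervalUs, long offsetUs, int samples) {
        int hits = 0;
        for (int i = 0; i < samples; i++) {
            long t = offsetUs + i * intervalUs;
            if (busy(t, periodUs, busyUs)) hits++;
        }
        return (double) hits / samples;
    }

    public static void main(String[] args) {
        long period = 10_000, busyFor = 1_000; // task runs 1 ms in every 10 ms
        // Sampler in lock-step with the task: the answer depends entirely on the phase.
        System.out.println(observedBusyFraction(period, busyFor, 10_000, 0, 10_000));
        System.out.println(observedBusyFraction(period, busyFor, 10_000, 5_000, 10_000));
        // Sampler with an interval that does not line up with the task's period.
        System.out.println(observedBusyFraction(period, busyFor, 9_973, 0, 10_000));
    }
}
```

With the sampler in lock-step the task looks either always busy or never busy, depending only on the phase offset. An interval that does not share the task's period lands much closer to the true fraction.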
Stack Trace Sampling
● JVMTI interface: GetCallTrace
○ Trigger a global safepoint (not on Zing)
○ Collect stack trace
● Large impact on application
● Samples only at safepoints
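A rough Java-level analogue of this style of profiler can be sketched with `Thread.getAllStackTraces()`. This is not the JVMTI call the slide names; it is a stand-in that is assumed to share the same property on HotSpot, namely that the traces are gathered at a safepoint. The class `NaiveSampler` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical naive sampler: periodically dumps all threads and counts top frames. */
final class NaiveSampler {
    /** Takes `samples` thread dumps, `intervalMs` apart, counting the top frame of each thread. */
    static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> topFrameCounts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                // Skip the sampler's own thread and threads with no Java frames.
                if (e.getKey() == Thread.currentThread() || stack.length == 0) continue;
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                topFrameCounts.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return topFrameCounts;
    }

    public static void main(String[] args) {
        Thread busy = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) x++;
        }, "busy");
        busy.setDaemon(true);
        busy.start();
        System.out.println(sample(20, 5));
        busy.interrupt();
    }
}
```

Every sample here stops and walks all threads, including idle ones, which is where the "large impact on application" comes from.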
Example
private static void outer()
{
    for (int i = 0; i < OUTER; i++)
    {
        hotMethod(i);
    }
}
// https://github.com/RichardWarburton/profiling-samples
Example (2)
private static void hotMethod(final int i)
{
    for (int k = 0; k < N; k++)
    {
        final int[] array = SafePointBias.array;
        final int index = i % SIZE;
        for (int j = index; j < SIZE; j++)
        {
            array[index] += array[j];
        }
    }
}
What's a safepoint?
● Java threads poll global flag
○ At ‘uncounted’ loops back edge
○ At method exit/enter
● A safepoint poll can be delayed by:
○ Large methods
○ Long running ‘counted’ loops
○ BONUS: Page faults/thread suspension
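The ‘counted’ versus ‘uncounted’ distinction is easier to see side by side. The sketch below assumes typical HotSpot C2 behaviour: a loop with an `int` induction variable and a loop-invariant bound is usually treated as counted, while a `long` induction variable usually makes it uncounted. The exact treatment varies with JVM version and flags, so take the comments as an assumption rather than a guarantee. `LoopShapes` is a hypothetical class name.

```java
/** Two loops that compute the same sum but are typically compiled differently (assumed C2 behaviour). */
final class LoopShapes {
    /** int induction variable, invariant bound: the shape usually treated as 'counted'. */
    static long countedSum(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i]; // a safepoint poll may be omitted from this loop's back edge
        }
        return sum;
    }

    /** long induction variable: usually treated as 'uncounted', keeping a poll on the back edge. */
    static long uncountedSum(int[] data) {
        long sum = 0;
        for (long i = 0; i < data.length; i++) {
            sum += data[(int) i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);
        System.out.println(countedSum(data) + " " + uncountedSum(data));
    }
}
```

A safepoint-based profiler can only observe a thread once it reaches a poll, so time spent inside the first shape of loop tends to be attributed to wherever the next poll happens to be.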
Safepoint Bias
WebServerThread.run()
Controller.doSomething() Controller.next()
Repo.readPerson()
new Person()
View.printHtml() ???
Let sleeping dogs lie?
● ‘GetCallTrace’ profilers will sample ALL threads
● Even sleeping threads...
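The effect is easy to reproduce with a Java-level thread dump, used here as a stand-in for the JVMTI call: a sleeping thread still shows up with a full stack trace. The sketch also shows one possible mitigation, filtering on `Thread.State.RUNNABLE`. `RunnableOnly` and its methods are hypothetical names, and `RUNNABLE` is only an approximation of "on CPU", since it also covers threads blocked in native I/O.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper showing that thread dumps include sleeping threads. */
final class RunnableOnly {
    /** Names of threads in the dump whose Java state is RUNNABLE. */
    static List<String> runnableThreadNames(Map<Thread, StackTraceElement[]> dump) {
        List<String> names = new ArrayList<>();
        for (Thread t : dump.keySet()) {
            if (t.getState() == Thread.State.RUNNABLE) names.add(t.getName());
        }
        return names;
    }

    /** Starts a daemon thread that sleeps, returning once it is actually asleep. */
    static Thread startSleeper(String name) {
        Thread t = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        }, name);
        t.setDaemon(true);
        t.start();
        while (t.getState() != Thread.State.TIMED_WAITING) Thread.yield();
        return t;
    }

    public static void main(String[] args) {
        Thread sleeper = startSleeper("sleeper");
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        System.out.println("dump contains sleeper: " + dump.containsKey(sleeper));
        System.out.println("filtered contains sleeper: " + runnableThreadNames(dump).contains("sleeper"));
        sleeper.interrupt();
    }
}
```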
AsyncGetCallTrace
● Used by Oracle Solaris Studio
● Adapted to open source prototype by Google’s Jeremy Manson
● Unsupported, Undocumented … Underestimated
SIGPROF - Interrupt Handlers
● OS Managed timing based interrupt
● Interrupts the thread and directly calls an event handler
● Used by profilers we’ll be talking about
Design
Log File
Processor Thread Graphical UI
Console UI
Signal Handler
Signal Handler
OS Timer Thread
AsyncGetCallTrace under the hood
● A Java thread is ‘possessed’
● You have the PC/FP/SP
● What is the call trace?
○ jmethodId - Java Method Identifier
○ bci - Byte Code Index -> used to find line number
Where Am I?
● Given a PC what is the current method?
● Is this a Java method?
○ Each method ‘lives’ in a range of addresses
● If not, what do we do?
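Because each method ‘lives’ in a range of addresses, the lookup from PC to method can be pictured as a range search. The sketch below is a toy model in Java, not how the JVM stores its code cache; `CodeRangeIndex`, the addresses, and the method names are all invented for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of PC-to-method lookup over non-overlapping address ranges (illustrative only). */
final class CodeRangeIndex {
    private static final class Range {
        final long endExclusive;
        final String method;
        Range(long endExclusive, String method) {
            this.endExclusive = endExclusive;
            this.method = method;
        }
    }

    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    /** Registers a method occupying [start, endExclusive). */
    void register(long start, long endExclusive, String method) {
        byStart.put(start, new Range(endExclusive, method));
    }

    /** Returns the method containing pc, or null when pc is outside every known range. */
    String methodAt(long pc) {
        Map.Entry<Long, Range> e = byStart.floorEntry(pc); // nearest range starting at or below pc
        if (e == null || pc >= e.getValue().endExclusive) return null;
        return e.getValue().method;
    }

    public static void main(String[] args) {
        CodeRangeIndex index = new CodeRangeIndex();
        index.register(0x1000, 0x1100, "Repo.readPerson");
        index.register(0x2000, 0x2040, "View.printHtml");
        System.out.println(index.methodAt(0x1010)); // inside the first range
        System.out.println(index.methodAt(0x1800)); // in a gap: not a known Java method
    }
}
```

A `null` result corresponds to the "If not, what do we do?" case: the PC is in native or runtime code, and the profiler has to fall back to another way of finding the nearest Java frame.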
Java Method? Which line?
● Given a PC, what is the current line?
○ Not all instructions map directly to a source line
● Given super-scalar CPUs what does PC mean?
● What are the limits of PC accuracy?
“> I think Andi mentioned this to me last year --
> that instruction profiling was no longer reliable.
It never was.”
http://permalink.gmane.org/gmane.linux.kernel.perf.user/1948
Exchange between Brendan Gregg and Andi Kleen
Skid
● PC indicated will be >= PC at sample time
● Known limitation of instruction profiling
● Leads to harder ‘blame analysis’
Limits of line number accuracy:
Line number (derived from BCI) is the closest attributable BCI to the PC (-XX:+DebugNonSafepoint)
The PC itself is within some skid distance from the actual sampled instruction
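The BCI-to-line step can be sketched as a lookup in a line number table, where each entry gives the first BCI of a source line and a BCI belongs to the closest entry at or below it. The table contents and the class name `BciLineTable` below are invented for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy line number table: maps the starting BCI of each source line to its line number. */
final class BciLineTable {
    private final TreeMap<Integer, Integer> lineByStartBci = new TreeMap<>();

    /** Adds an entry saying that `line` starts at bytecode index `startBci`. */
    BciLineTable entry(int startBci, int line) {
        lineByStartBci.put(startBci, line);
        return this;
    }

    /** Line of the closest entry whose start BCI is <= bci, or -1 if there is none. */
    int lineFor(int bci) {
        Map.Entry<Integer, Integer> e = lineByStartBci.floorEntry(bci);
        return e == null ? -1 : e.getValue();
    }

    public static void main(String[] args) {
        BciLineTable table = new BciLineTable().entry(0, 10).entry(5, 11).entry(12, 13);
        for (int bci : new int[] {0, 4, 5, 11, 12}) {
            System.out.println("bci " + bci + " -> line " + table.lineFor(bci));
        }
    }
}
```

This lookup is exact for a given BCI. The inaccuracy described above comes from the earlier steps: the PC has skid, and the BCI attributed to that PC is only the closest one the debug info records.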
The Stack
● Divided into frames
○ frame { sender*, stack*, pc }
● A single linked list:
root(null, s0, pc1) <- call1 (root, s1, pc2) <- call2(call1, s2, pc2)
● Convert to: (jmethodId,lineno)
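The frame list on this slide, frame { sender*, stack*, pc } linked from the innermost call back to the root, can be modelled in a few lines. This is a sketch of the walk only; turning each PC into a (jmethodId, lineno) pair is a separate lookup and is left out. `FrameWalk` and the addresses are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of walking a stack as a singly linked list of frames. */
final class FrameWalk {
    /** Mirrors the slide's frame { sender*, stack*, pc }. */
    static final class Frame {
        final Frame sender; // caller's frame, null for the root
        final long sp;      // stack pointer for this frame
        final long pc;      // program counter within this frame's method

        Frame(Frame sender, long sp, long pc) {
            this.sender = sender;
            this.sp = sp;
            this.pc = pc;
        }
    }

    /** Walks from the top frame towards the root, collecting PCs innermost first. */
    static List<Long> pcs(Frame top, int maxDepth) {
        List<Long> out = new ArrayList<>();
        for (Frame f = top; f != null && out.size() < maxDepth; f = f.sender) {
            out.add(f.pc);
        }
        return out;
    }

    public static void main(String[] args) {
        Frame root = new Frame(null, 0x300, 0x10);
        Frame call1 = new Frame(root, 0x200, 0x20);
        Frame call2 = new Frame(call1, 0x100, 0x30);
        System.out.println(pcs(call2, 64));
    }
}
```

The depth limit matters in a real profiler too: the walk runs inside a signal handler, so it needs a fixed upper bound on work and on buffer size.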
A typical stack
● JVM Thread runner infra:
○ JavaThread::run to JavaCalls::call_helper
● Interleaved Java frames:
○ Interpreted
○ Compiled
○ Java to Native and back
● Top frame may be Java or Native
Native frames
● Ignored, but need to navigate through
● Use a dedicated FP register to find sender
● But only if compiled to do so…
● Use a last remembered Java frame instead
See: http://duartes.org/gustavo/blog/post/journey-to-the-stack/
Java Compiled Frames
● C1/C2 produce native code
● No FP register: use set frame size
● Challenge: methods can move (GC)
● Challenge: methods can get recompiled
Java Interpreter frames
● Separately managed by the runtime
● Make an effort to look like normal frames
● Challenge: may be interrupted half-way through construction...
Virtual Frames
● C1/C2 inline code (intrinsics/other methods)
● No data on stack
● Must use JVM debug info
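One way to picture virtual frames: a single physical frame on the stack expands, via debug info keyed by PC, into a chain of (method, bci) scopes for the inlined calls. The sketch below is a toy model only. `InlineDebugInfo`, the PC value, and the method names are invented, and the real JVM debug info format is considerably more involved.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of debug info that expands one physical PC into several virtual frames. */
final class InlineDebugInfo {
    /** One inlined scope: a method and the bytecode index within it. */
    static final class Scope {
        final String method;
        final int bci;

        Scope(String method, int bci) {
            this.method = method;
            this.bci = bci;
        }

        @Override
        public String toString() {
            return method + "@" + bci;
        }
    }

    private final Map<Long, List<Scope>> scopesByPc = new HashMap<>();

    /** Records the inlining chain at pc, innermost scope first. */
    void record(long pc, Scope... innermostFirst) {
        scopesByPc.put(pc, Arrays.asList(innermostFirst));
    }

    /** Virtual frames for pc, innermost first; empty when no debug info was recorded. */
    List<Scope> virtualFramesAt(long pc) {
        return scopesByPc.getOrDefault(pc, Collections.emptyList());
    }

    public static void main(String[] args) {
        InlineDebugInfo info = new InlineDebugInfo();
        // Hypothetical: new Person() inlined into Repo.readPerson, inlined into Controller.doSomething.
        info.record(0x4000L,
                new Scope("Person.<init>", 4),
                new Scope("Repo.readPerson", 21),
                new Scope("Controller.doSomething", 9));
        System.out.println(info.virtualFramesAt(0x4000L));
    }
}
```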
AsyncGetCallTrace Limitations
● Only profiles running threads
● Accuracy of line info limited by reality
● Only reports Java frames/threads
● Must look up debug info during call
Compilers: Friend or Fiend?
void safe_reset(void *start, size_t size) {
    char *base = reinterpret_cast<char *>(start);
    char *end = base + size;
    for (char *p = base; p < end; p++) {
        *p = 0;
    }
}
Compilers: Friend or Fiend?
safe_reset(void*, unsigned long):
    lea rdx, [rdi+rsi]
    cmp rdi, rdx
    jae .L3
    sub rdx, rdi
    xor esi, esi
    jmp memset
.L3:
    rep ret
Concurrency Bug
● Even simple concurrency bugs are hard to spot
● Unspotted race condition in the ring buffer
● Spotted thanks to open source & Rajiv Signal
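The slides do not show the buggy code, so the following is only a Java analogue of the kind of structure involved: a single-producer, single-consumer ring buffer such as a signal handler might use to hand samples to a processor thread. `SpscRing` is a hypothetical class. The point it illustrates is the ordering that is easy to get wrong: the slot must be written before the index that publishes it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal single-producer/single-consumer ring buffer of long values (illustrative sketch). */
final class SpscRing {
    static final long EMPTY = Long.MIN_VALUE; // returned by poll() when nothing is available

    private final long[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next index to read, advanced by consumer
    private final AtomicLong tail = new AtomicLong(); // next index to write, advanced by producer

    SpscRing(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        slots = new long[capacity];
        mask = capacity - 1;
    }

    /** Producer thread only. Returns false when the buffer is full. */
    boolean offer(long value) {
        long t = tail.get();
        if (t - head.get() == slots.length) return false;
        slots[(int) (t & mask)] = value; // write the slot first...
        tail.set(t + 1);                 // ...then publish it with a volatile store
        return true;
    }

    /** Consumer thread only. Returns EMPTY when there is nothing to read. */
    long poll() {
        long h = head.get();
        if (h == tail.get()) return EMPTY; // volatile read pairs with the producer's store
        long value = slots[(int) (h & mask)];
        head.set(h + 1);                   // release the slot only after reading it
        return value;
    }

    public static void main(String[] args) {
        SpscRing ring = new SpscRing(4);
        ring.offer(7);
        ring.offer(8);
        System.out.println(ring.poll() + " " + ring.poll());
    }
}
```

Swapping the two lines in `offer` would let the consumer see the new tail before the slot is written. That ordering mistake can pass single-threaded tests and only show up under load.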
Native Profiling Tools
● Profile native methods
● Profile at the instruction level
● Profile hardware counters
Perf
● A Linux profiling tool
● Can be made to work with Java
● JMH integration
● Ongoing integration efforts
What did we cover?
● Biases in Profilers
● More accurate sampling
● Alternative Profiling Approaches
Q & A
@nitsanw
psy-lob-saw.blogspot.co.uk
@richardwarburto
insightfullogic.com
java8training.com
www.pluralsight.com/author/richard-warburton