Profiling JVM Applications in Production...improve JVM application performance on Linux...

74
Profiling JVM Applications in Production Sasha Goldshtein CTO, Sela Group @goldshtn github.com/goldshtn

Transcript of Profiling JVM Applications in Production...improve JVM application performance on Linux...

Page 1: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

ProfilingJVMApplicationsinProduction

SashaGoldshteinCTO,Sela Group

@goldshtngithub.com/goldshtn

Page 2: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

https://s.sashag.net/srecon0318

Page 3: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

WorkshopIntroduction

• Mission:Applymodern,low-overhead,production-readytoolstomonitorandimproveJVMapplicationperformanceonLinux• Objectives:qIdentifyingoverloadedresourcesqProfilingforCPUbottlenecksqVisualizingandexploringstacktracesusingflamegraphsqRecordingsystemevents(I/O,network,GC,etc.)qProfilingforheapallocations

Page 4: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

CourseIntroduction

• Targetaudience:Applicationdevelopers,systemadministrators,productionengineers• Prerequisites:UnderstandingofJVMfundamentals,experiencewithLinuxsystemadministration,familiaritywithOSconcepts• Labenvironment:EC2,deliveredthroughthebrowserduringthecoursedates• Coursehands-onlabs:https://github.com/goldshtn/linux-tracing-workshop

Page 5: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

CoursePlan

• JVMandLinuxperformanceinformationsources• CPUsampling• Flamegraphsandsymbols• Lab:Profilingwithperfandasync-profiler• eBPF• BCCtools• Lab:Tracingfileopens• GCtracingandallocationprofiling• Lab:Allocationprofiling

Page 6: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

TheLabEnvironment

• Followthelinkprovidedbytheinstructor• SignuporloginwithGoogle• Entertheclassroomtoken• Clickthebeaker-in-a-cloudicontogetyourownlabinstance• Waitfortheterminaltoinitialize

Page 7: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

JVMandLinuxPerformanceSources

Page 8: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Kernel

PerformanceInformationSources

JVM

Syscall interface

Devicedrivers

CPU

Javaapplications

Systemlibraries

PMUeventsOtherdevices

BlockI/O Ethernet Scheduler Mem

Filesystem TCP/IP

Otherapplications

GC JIT

Classloader

USDT(dtrace)probes

mbeansJMX

JVMTIagentsServiceabilityAPI

+PrintCompilation

+PrintGC &other

JavaFlightRecorder

Tracepoints

kprobes

uprobes

Tracepoints

attachinterface(jcmd)

hsperf (jstat)

”softwareevents”

Page 9: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

USEChecklistforLinuxSystemshttp://www.brendangregg.com/USEmethod/use-linux.html

CPU

Core Core

LLC

RAM

MemorycontrollerGFX I/O

controller SSD

E1000

FSB

PCIe

U:perf

U:mpstat -P 0S:vmstat 1

E:perf

U:sar -n DEV 1S:ifconfigE:ifconfig

U:free -mS:sar -BE:dmesg

U:iostat 1S:iostat -xz 1E:…/ioerr_cntU:perf

Page 10: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

USEChecklistForJVMApplications

JavaThread

Heap

allocate

JNI Nativelibraries

Kernel

JavaThread

Thread Thread

allocate

GC JIT

syscalls

U:jmap,jhat,jstat

U:top,jstackE:JVMTI

U:JVMTI,USDT

U:+PrintCompilation,

jstat,USDT

U:+PrintGC,jstat,NMT,

USDT

Page 11: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

⚠MindTheOverhead

• Anyobservationcanchangethestateofthesystem,butsomeobservationsareworsethanothers• Performancetoolshaveoverhead• Checkthedocs• Tryonatestsystemfirst• Measuredegradationintroducedbythetool

OVERHEADThis traces variouskernelpagecachefunctionsandmaintainsin-kernelcounts,whichareasynchronouslycopiedtouser-space.Whiletherateofoperationscanbeveryhigh(>1G/sec)wecanhaveupto34%overhead,thisisstillarelativelyefficientwaytotracetheseevents,andsotheoverheadisexpectedtobesmallfornormalworkloads. Measureinatestenvironment.

—man cachestat (fromBCC)

Page 12: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

CPUSampling

Page 13: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Samplingvs.Tracing

• Sampling worksbygettingasnapshotoracallstackeveryNoccurrencesofaninterestingevent• Formostevents,implementedinthePMUusingoverflowcountersandinterrupts

• Tracing worksbygettingamessageoracallstackateveryoccurrenceofaninterestingevent

CPUtimepid 121 pid 121 pid 408 pid 188

systemtimepid 121 pid 408

CPUsample

diskwrite

Page 14: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

JVMStackSampling

• TraditionalCPUprofilerssampleallthreadstacksperiodically(e.g.100timespersecond)• TypicallyusetheJVMTIGetAllStackTraces API• jstack,JVisualVM,YourKit,JProfiler,andalotofothers

Thread1running blocked runningGC

Thread2running blockedGC

Thread1blockedGC

sample samplesample

Page 15: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Safepoint Bias

• Samplesarecapturedonlyatsafepoints• ResearchEvaluatingTheAccuracyofJavaProfilers byMytkowicz,Diwan,Hauswirth,Sweeneyshowswildvarietyofresultsbetweenprofilersduetosafepoint bias• Additionally,capturingafullstacktraceforallthreadsisquiteexpensive(thinkSpring)

Page 16: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

perf

• perf isaLinuxmulti-toolforperformanceinvestigations• Capableofbothtracingandsampling• Developedinthekerneltree,mustmatchrunningkernel’sversion

• Debian-based: apt install linux-tools-common• RedHat-based: yum install perf

Page 17: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

RecordingCPUStacksWithperf

• TofindaCPUbottleneck,recordstacksattimedintervals:# system-wideperf record -ag -F 97# specific processperf record -p 188 -g -F 97# specific workloadperf record -g -F 97 -- ./myapp

Legend-a allCPUs-p specificprocess-- runworkloadandcaptureit-g capturecallstacks-F frequencyofsamples(Hz)-c #ofeventsineachsample

Page 18: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

ASingleStack

# perf scriptparprimes 13393 248974.821897: 10309278 cpu-clock:

92b is_prime+0xffffffffff800035 (/…/parprimes)96c primes_loop+0xffffffffff800021 (/…/parprimes)9d4 primes_thread+0xffffffffff800020 (/…/parprimes)

75ca start_thread+0xffff011d4ae720ca (/…/libpthread-2.23.so)…# perf script | wc –l7214

Page 19: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

StackReport

# perf report --stdio# Children Self Command Shared Object Symbol# ........ ........ ............ .................. .......................................#

72.02% 71.53% parprimes parprimes [.] is_prime| --71.53%--start_thread

primes_threadprimes_loopis_prime

...truncated

27.86% 0.00% dd [kernel.kallsyms] [k] vfs_read|---vfs_read

| --27.80%--__vfs_read

...truncated

Page 20: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

FlameGraphsandMissingSymbols

Page 21: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Symbols

• perf needssymbolstodisplayfunctionnames(beyondmodulesandaddresses)• Forcompiledlanguages(C,Go,…)theseareoftenembeddedinthebinary• Orinstalledasseparatedebuginfo(usually/usr/lib/debug)

$ objdump -tT /usr/bin/bash | grep readline0000000000306bf8 g DO .bss 0000000000000004 Base rl_readline_state00000000000a46c0 g DF .text 00000000000001d4 Base readline_internal_char00000000000a3cc0 g DF .text 0000000000000126 Base readline_internal_setup0000000000078b80 g DF .text 0000000000000044 Base posix_readline_initialize00000000000a4de0 g DF .text 0000000000000081 Base readline00000000003062d0 g DO .bss 0000000000000004 Base bash_readline_initialized…

Page 22: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

ReportWithoutSymbols

# perf report --stdio# Children Self Command Shared Object Symbol# ........ ........ ....... ................. .......................#

100.00% 0.00% hello hello [.] 0xffffffffffc0051d|---0x51d

||--54.91%--0x4f7||--27.97%--0x4eb||--8.73%--0x4e3|--7.97%--0x4ff

Page 23: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

JavaAppReport

# perf report --stdio# Children Self Command Shared Object Symbol# ........ ........ ....... .................. ......................#

100.00% 0.00% java perf-2318.map [.] 0x00007f82b50004e7|---0x7f82b50004e7

||--8.15%--0x7f82b510d63e||--7.97%--0x7f82b510d6ca||--7.07%--0x7f82b510d6c2||--6.88%--0x7f82b510d686||--6.16%--0x7f82b510d68e

Page 24: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

perf-PID.map Files

• Whensymbolsaremissinginthebinary,perfwilllookforafilenamed/tmp/perf-PID.map bydefault

$ cat /tmp/perf-1882.map7f2cd1108880 1e8 Ljava/lang/System;::arraycopy7f2cd1108c00 200 Ljava/lang/String;::hashCode7f2cd1109120 2e0 Ljava/lang/String;::indexOf7f2cd1109740 1c0 Ljava/lang/String;::charAt…7f2cd110ce80 120 LHello;::doStuff7f2cd110d280 140 LHello;::fidget7f2cd110d5c0 120 LHello;::fidget7f2cd110d8c0 120 LHello;::fidget…

Page 25: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

GeneratingMapFiles

• ForinterpretedorJIT-compiledlanguages,mapfilesneedtobegeneratedatruntime• Java:perf-map-agentcreate-java-perf-map.sh $(pidof java)• ThisisaJVMTIagentthatattachesondemandtotheJavaprocess• Additionaloptionsincludedottedclass,unfoldall,sourcepos• Consider-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints formoreaccurateinlineinfo

• Otherruntimes:• Node: node --perf-basic-prof-only-functions app.js• Mono: mono --jitmap ...• .NETCore: export COMPlus_PerfMapEnabled=1

Page 26: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

FixedReport;StillBroken

# perf report --stdio# Children Self Command Shared Object Symbol# ........ ........ ....... .................. ......................#

100.00% 0.00% java perf-3828.map [.] call_stub|---call_stub

LHello;::fidget…

Page 27: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

WalkingStacks

• Tosuccessfullywalkstacks,perf requires* FPOtobedisabled• ThisisanoptimizationthatusesEBP/RBPasageneral-purposeregisterratherthanaframepointer

• C/C++: -fno-omit-frame-pointer• Java: -XX:+PreserveFramePointer sinceJava8u60

*Whendebuginformationispresent,perfcanuselibunwind andfigureoutFPO-enabledstacks,butnotfordynamiclanguages

Page 28: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

FixedReport

# perf report --stdio# Children Self Command Shared Object Symbol# ........ ........ ....... .................. ......................#

100.00% 99.65% java perf-4005.map [.] LHello;::fidget|--99.65%--start_thread

JavaMainjni_CallStaticVoidMethodjni_invoke_staticJavaCalls::call_helpercall_stubLHello;::mainLHello;::doStuffLHello;::identifyWidgetLHello;::fidget

Page 29: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Real-WorldStackReports

# perf report --stdio | wc -l14823

Page 30: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

FlameGraphs

• Avisualizationmethod(adjacencygraph),veryusefulforstacktraces,inventedbyBrendanGregg• http://www.brendangregg.com/flamegraphs.html

• Turns1000sofstacktracepagesintoasingleinteractivegraph• Examplescenarios:• IdentifyCPUhotspotsonthesystem/application• Showstacksthatperformheavydiskaccesses• Findthreadsthatblockforalongtimeandthestackwheretheydoit

Page 31: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

ReadingaFlameGraph

• Eachrectangleisafunction• Y-axis:stackdepth• X-axis:sortedstacks(nottime)

• Widerframesaremorecommon• Supportszoom,find• Filterwithgrep😎

Page 32: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

GeneratingaFlameGraph

$ git clone https://github.com/BrendanGregg/FlameGraph$ sudo perf record -F 97 -g -p `pidof java` -- sleep 10$ sudo perf script |

FlameGraph/stackcollapse-perf.pl |FlameGraph/flamegraph.pl > flame.svg

Page 33: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

NotJustForMethods

• Forjustapackage-levelunderstandingofwhereyourtimegoes,usepkgsplit-perf.pl andgenerateapackage-levelflamegraph:

Fromhttp://www.brendangregg.com/blog/2017-06-30/package-flame-graph.html

Page 34: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Lab:CPUInvestigationWithperf AndFlameGraphs 💻

Page 35: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Problemswithperf

• OnlyJava8u60andlaterissupported(todisableFPO)• DisablingFPOhasasmallperformanceimpact(upto10%inpathologicalcases)• Symbolresolutionrequiresanadditionalagent• Interpreterframescan’tberesolved(shownas“Interpreter”)• Recompiledmethodscanbemisreported(appearmorethanonceintheperfmap)• Stackdepthisusuallylimitedto127(again,thinkSpring)• CanbeconfiguredsinceLinux4.8using/proc/sys/kernel/perf_event_max_stack

Page 36: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

async-profiler

Page 37: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

JVMTIAgents

• AJVMTI(JVMToolInterface)agentcanbeloadedwith-agentpathorattachedthroughtheJVMattachinterface• Examplesoffunctionality:• Tracethreadstartandstopevents• Countmonitorcontentionsandwaittimes• Aggregateclassloadandunloadinformation• Fulleventreference:http://docs.oracle.com/javase/8/docs/platform/jvmti/jvmti.html

Page 38: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

AsyncGetCallTrace

• InternalAPIintroducedtosupportlightweightprofilinginOracleDeveloperStudio• Producesasinglethread’sstackwithoutwaitingforasafepoint• Designedtobecalledfromasignalhandler• UsedbyHonestProfiler(byRichardWarburtonandcontributors):https://github.com/jvm-profiling-tools/honest-profiler

Page 39: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

async-profiler

• OpensourceprofilerbyAndreiPangin andcontributors:https://github.com/jvm-profiling-tools/async-profiler

kernel

perf_events

user

CPUPMU

Cyclicperfbuffer

JVMthread

JVMthread

Nativethread

perffd

libasyncProfiler.soinotify

signalAsyncGetCallTrace

samplestacksamplestack

Page 40: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Profilers,Compared

perf• Java≧8u60todisableFPO• DisablingFPOhasaperfpenalty• Needamapfile• Interpreterframesarenotsupported• System-wideprofilingispossible• Canprofilecontainersfromthehost(orfromasidecar)

async-profiler• WorksonolderJavaversions• FPOcanstayon• Nomapfileisrequired• Interpreterframesaresupported• Intheory,nativeandJavastacksdon’talwayssync• Profilingrunsin-process(so,in-container)

Page 41: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Lab:ProfilingWithasync-profiler 💻

Page 42: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

eBPF

Page 43: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

What’sWrongWithperf?

• perf reliesonpushingalotofdata touserspace,throughfiles,foranalysis• Downloadingafileat∼1Gb/sproduces∼89Knetif_receive_skb events/s(19MB/sincludingstacks)

kernel

e1000 netif_receive_skb

perf_events perf.data

user

perf | awk | …

monitoraverage packet size: 189 bytes

Page 44: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BPF:1990

• Invented byMcCanne andJacobsonatBerkeley,1990-1992:instructionset,representation,implementationofpacketfilters

$ tcpdump -d 'ip and dst 186.173.190.239'(000) ldh [12](001) jeq #0x800 jt 2 jf 5(002) ld [30](003) jeq #0xbaadbeef jt 4 jf 5(004) ret #262144(005) ret #0

Page 45: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BPF:Today

• Supportsawidespectrumofusages• HasaJITformaximumefficiency

kernel

BPFruntimeprobes

sockets

syscalls

user

controlprogram

BPFprogram

BPFprogram

BPFmap

controlprogramverifier&JIT

BPFcompiler

Page 46: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BPFTracingkernel

BPFruntimekprobes

tracepoints

user

controlprogramBPFprogram

map

applicationUSDT

uprobesperfoutput

① installsBPFprogramandattachestoevents

②eventsinvoketheBPFprogram

perf_events

③BPFprogramupdatesamaporpushesaneweventtoabuffersharedwithuser-space

④user-spaceprogramisinvokedwithdatafromthesharedbuffer

⑤user-spaceprogramreadsstatisticsfromthemapandclearsitifnecessary

controlprogram

Page 47: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BPFTracingFeaturesinTheLinuxKernel

Version Feature Scenarios4.1 kprobes/uprobes attach Dynamic tracingwithBPFbecomespossible4.1 bpf_trace_printk BPFprogramscanprintoutputtoftrace pipe4.3 perf_events output Efficienttracingoflargeamountsofdatafor

analysisinuser-space4.6 Stack traces Efficientaggregationof callstacksforprofiling

ortracing4.7 Tracepoints support APIstabilityfortracingprograms4.9 perf_events attach Low-overhead profilingandPMUsampling

16.04

24

16.10

25

Page 48: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

TheOldWayAndTheNewWaykernel

VFS k{,ret}probe:vfs_read

perf_events perf.data

user

perf | awk | …

monitorLATμs # distribution0 - 1 … |@@@@ |1 – 2 … |@ |2 - 4 … |@@@@@@@@ |

kernel

VFS k{,ret}probe:vfs_read

BPFprogram BPFmap

user

controlprogram

monitorLATμs # distribution0 - 1 … |@@@@ |1 – 2 … |@ |2 - 4 … |@@@@@@@@ |

Page 49: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BCCPerformanceChecklist

Page 50: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

TheBCCBPFFront-End

• https://github.com/iovisor/bcc• BPFCompilerCollection(BCC)isaBPFfrontendlibraryandamassivecollectionofperformancetools• ContributorsfromFacebook,PLUMgrid,Netflix,Sela

• HelpsbuildBPF-basedtoolsinhigh-levellanguages• Python,Lua,C++

kernel

user

BCCtool BCCtool …

BCCcompilerfrontend

Clang+LLVM

BCCloaderlibrary

BPFruntime eventsources

Page 51: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

JVM

Syscall interface

BlockI/O EthernetScheduler Mem

Devicedrivers

Filesystem TCP/IP

CPUApplications

Systemlibraries

profilellcstat

hardirqssoftirqsttysnoop

runqlatcpudist

offcputimeoffwaketimecpuunclaimed

memleakoomkill

slabratetoptcptoptcplife

tcpconnecttcpaccept

biotopbiolatencybiosnoopbitesize

filetopfilelifefileslowervfscountvfsstatcachestatcachetopmountsnoop*fsslower*fsdistdcstatdcsnoopmdflush

execsnoopopensnoopkillsnoopstatsnoopsyncsnoopsetuidsnoop

mysqld_qslowerbashreadlinedbslowerdbstat

mysqlsniff

memleaksslsniff

gethostlatencydeadlock_detector

ustatugc

ucalls

uthreadsuobjnewuflow

argdisttrace

funccountfunclatencystackcount

Page 52: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BCCLinuxPerformanceChecklist

1. execsnoop2. opensnoop3. ext4slower

(orbtrfs*,xfs*,zfs*)4. biolatency5. biosnoop6. cachestat7. tcpconnect

8. tcpaccept9. tcptop10.gethostlatency11.cpudist12.runqlat13.profile

Page 53: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

SomeBCCTools

# ext4slower 1Tracing ext4 operations slower than 1 msTIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME06:49:17 bash 3616 R 128 0 7.75 cksum06:49:17 cksum 3616 R 39552 0 1.34 [06:49:17 cksum 3616 R 96 0 5.36 2to3-2.706:49:17 cksum 3616 R 96 0 14.94 2to3-3.4^C# execsnoopPCOMM PID RET ARGSbash 15887 0 /usr/bin/man lspreconv 15894 0 /usr/bin/preconv -e UTF-8man 15896 0 /usr/bin/tblman 15897 0 /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8^C

Page 54: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

SomeBCCTools

# runqlat -p `pidof java` 10 1Tracing run queue latency... Hit Ctrl-C to end.

usecs : count distribution0 -> 1 : 11 |* |2 -> 3 : 7 | |4 -> 7 : 133 |****************** |8 -> 15 : 288 |****************************************|16 -> 31 : 205 |**************************** |32 -> 63 : 38 |***** |64 -> 127 : 11 |* |128 -> 255 : 5 | |256 -> 511 : 3 | |512 -> 1023 : 1 | |1024 -> 2047 : 3 | |2048 -> 4095 : 0 | |4096 -> 8191 : 3 | |

Page 55: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BCC’sprofile Tool

# profile 10 -F 97 -K # kernel stacks only…

ffffffffa4818691 __lock_text_startffffffffa45b0341 ata_scsi_queuecmdffffffffa458813d scsi_dispatch_cmdffffffffa458b021 scsi_request_fnffffffffa43be643 __blk_run_queueffffffffa43c3bc1 blk_queue_bioffffffffa43c1cf2 generic_make_requestffffffffa43c1e4d submit_bioffffffffa43b825d submit_bio_waitffffffffa43c5c65 blkdev_issue_flushffffffffa4309b4d ext4_sync_fsffffffffa428b260 sync_fs_one_sbffffffffa425a553 iterate_supersffffffffa428b374 sys_syncffffffffa4003c17 do_syscall_64ffffffffa4818bab return_from_SYSCALL_64- stress (3303)

14

Page 56: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BCC’sprofile Toolkernel

PMUcpu-clocks

perf_events perf.data

userperf script | fold

| flamegraph

monitor

kernel

BPFprogramBPFmap

userprofile –f | flamegraph

monitor

BPFstacks

PMUcpu-clocks

Page 57: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Lab:SnoopingFileOpens 💻

Page 58: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

General-PurposeBCCTools

Page 59: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

TracingSourcesForBCCTools

kernel

BPFprogram

user

application

USDThotspot:class_loaded

application

uprobesmysqld:…mysql_parse…

kprobestcp_sendmsg

tracepointssched:sched_switch

perf_eventscpu-clocks

Page 60: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

USDTProbesin(Some)High-LevelLanguages

OpenJDKhotspot:gc_begin

hotspot:thread_starthotspot:method_entry

OOTB

notsupported

buildflag

OracleJDKNode.js

node:http_server_requestnode:http_client_request

node:gc_begin

Pythonpython:function_entrypython:function_return

python:gc_start

Rubyruby:method_entryruby:object_createruby:load_entry

libc/libpthreadlibc:memory_malloc_retrylibpthread:pthread_startlibpthread:mutex_acquired

MySQLmysql:query_start

mysql:connection_startmysql:query_parse_start

PHPphp:request_startupphp:function_entry

php:error

Page 61: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

USDTProbesandUprobesintheJVM

• OpenJDK Hotspothasalargenumberofstatic(USDT)probesinvarioussubsystems;displaywithtplist orreadelf:$ tplist -p $(pidof java) | grep 'hotspot.*gc'.../libjvm.so hotspot:mem__pool__gc__begin.../libjvm.so hotspot:mem__pool__gc__end.../libjvm.so hotspot:gc__begin.../libjvm.so hotspot:gc__end

• AllJVMnativemethodscanbeusedwithdynamicprobes;discoverwithobjdump ornm:$ nm -C $(find /usr/lib/debug -name libjvm.so.debug)

| grep 'card.*table'0000000000854751 t PSScavenge::card_table()00000000016dd778 b PSScavenge::_card_table...

Page 62: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BCCtrace

• trace isamulti-purposeloggingtool;thinkofitasadynamiclogatarbitrarylocationsinthesystem(canalsoprintcallstacks)

# trace 'SyS_write (arg3 > 100000) "large write: %d bytes", arg3'PID TID COMM FUNC -9353 9353 dd SyS_write large write: 1048576 bytes9353 9353 dd SyS_write large write: 1048576 bytes9353 9353 dd SyS_write large write: 1048576 bytes^C# trace 'r:/usr/bin/bash:readline "%s", retval'TIME PID COMM FUNC -02:02:26 3711 bash readline ls –la02:02:36 3711 bash readline wc -l src.c^C

Page 63: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BCCfunccount/stackcount

• funccount countsthenumberofinvocationsofaparticularmethod,whilestackcount alsoaggregatesthecallstacks

# LIBJVM=$(find /usr/lib -name libjvm.so)# funccount -p $(pidof java) "$LIBJVM:*do_collection*"Tracing 5 functions for ".../libjvm.so:*do_collection*"... Hit Ctrl-C to end.^CFUNC COUNT_ZN16GenCollectedHeap13do_collectionEbbmbi 848Detaching...

Page 64: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Lab:TracingDatabaseAccesses 💻

Page 65: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

HeapAllocationProfiling

Page 66: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

ApproachesforAllocationProfiling

• AllocationprofilingcanhelpreduceGCpressureandpausetimes• Tracingeachobjectallocationisextremelyexpensive,though• Use-XX:+ExtendedDTraceProbes andsamplehotspot:object__alloc probes (expectasignificantoverhead)• TraceHotspotallocationtracingcallbacksdesignedforJFR• send_allocation_in_new_tlab_event:whenanewTLABisallocatedforathreadbecausetheoldonewasexhausted• send_allocation_outside_tlab_event:whenanobjectisallocatedoutsideaTLAB(e.g.becauseit’stoobig,orbecausetheTLABisexhausted)

Page 67: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

async-profiler

• Whenusedwiththeheapmode,instrumentstheJFRTLABallocationeventsandreportsobjectsallocatedandstacksamples• RequiresJDKdebuginfo tobeinstalled(tofindtherelevantsymbols)

$ ./profiler.sh -d 10 -e alloc -o summary,flat `pidof java`HEAP profiling started...696470120 (75.33%) [C226075184 (24.45%) [B

425600 (0.05%) [Ljava/util/HashMap$Node;193592 (0.02%) com/sun/org/apache/xerces/internal/dom/ElementImpl185536 (0.02%) com/sun/org/apache/xml/internal/serializer/NamespaceMappings$MappingRecord

162176 (0.02%) java/util/Stack

Page 68: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

BCCToolsWithExtendedProbes

# funccount -p `pidof java` u:$LIBJVM:object__allocTracing 1 functions for "u:.../libjvm.so:object__alloc"... Hit Ctrl-C to end.FUNC COUNTobject__alloc 4000987Detaching...

# argdist -p `pidof java` -C "u:$LIBJVM:object__alloc():char*:arg2"605018 arg2 = java/lang/String609801 arg2 = java/util/HashMap$Nod908716 arg2 = com/sun/org/apache/xml/internal/serializer/NamespaceMappings$MappingRecord

908778 arg2 = java/util/Stack909348 arg2 = [Ljava/lang/Object;910097 arg2 = [C

Page 69: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

grav

• CollectionofperformancevisualizationtoolsbyMarkPriceandAmirLanger:https://github.com/epickrram/grav• IncludesaPythonwrapperontopofobject__alloc probeswithsamplingsupport,flamegraphgeneration,andfilteringspecifictypes

$ sudo python src/heap/heap_profile.py -p `pidof java` -d 10 > alloc.stacks$ FlameGraph/flamegraph.pl < alloc.stacks > alloc.svg

Page 70: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Lab:ExcessiveGCAndAllocationProfiling 💻

Page 71: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

CourseWrap-Up

Page 72: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

ObjectivesReview

• Mission:Applymodern,low-overhead,production-readytoolstomonitorandimproveJVMapplicationperformanceonLinux• Objectives:üIdentifyingoverloadedresourcesüProfilingforCPUbottlenecksüVisualizingandexploringstacktracesusingflamegraphsüRecordingsystemevents(I/O,network,GC,etc.)üProfilingforheapallocations

Page 73: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

References• JVMobservabilitytools

• http://openjdk.java.net/groups/hotspot/docs/Serviceability.html

• http://docs.oracle.com/javase/8/docs/platform/jvmti/jvmti.html

• http://cr.openjdk.java.net/~minqi/6830717/raw_files/new/agent/doc/index.html

• https://docs.oracle.com/javase/8/docs/technotes/guides/management/jconsole.html

• perf andflamegraphs• https://perf.wiki.kernel.org/index.php/Main_Page

• http://www.brendangregg.com/flamegraphs.html

• AGCTprofilers• https://github.com/jvm-profiling-tools/async-profiler

• https://github.com/jvm-profiling-tools/honest-profiler

• BCCandBPF• https://github.com/iovisor/bcc/blob/master/docs/tutorial.md

• http://www.brendangregg.com/ebpf.html• http://blogs.microsoft.co.il/sasha/2016/03/31/probing-the-jvm-with-bpfbcc/

• http://blogs.microsoft.co.il/sasha/2016/03/30/usdt-probe-support-in-bpfbcc/

• ContainersandJVM• https://blog.csanchez.org/2017/05/31/running-a-jvm-in-a-container-without-getting-killed/

• http://www.brendangregg.com/blog/2017-05-15/container-performance-analysis-dockercon-2017.html

• http://batey.info/docker-jvm-flamegraphs.html

Page 74: Profiling JVM Applications in Production...improve JVM application performance on Linux •Objectives: qIdentifying overloaded resources qProfiling for CPU bottlenecks qVisualizing

Questions?

SashaGoldshteinCTO,Sela Group

@goldshtngithub.com/goldshtn