Hardware Latencies: How to flush them out
(A use case)
Steven Rostedt, Red Hat
Here’s a story, of a lovely lady...
● No, this isn't the Brady Bunch
– Nor is it about a lovely lady
– But it probably could have been a Brady Bunch episode.
Here’s a story, of an upset customer
● Who was seeing lots of latencies on their own
● The machine wasn't verified yet
● Real time requires not just a kernel
– Requires the entire spectrum
● Application
● Kernel
● Hardware
Verification of Hardware
● rteval
– A tool by Red Hat to stress the machine
– Measures jitter (using cyclictest)
● Was a large machine
– 40 CPUs
– For such a box, we expect no more than
● 200us jitter
– We'd like less, but we are lenient with large HW
Latencies
● Seeing 500us latencies!!!!
– May not sound big to you
– But it's huge for PREEMPT_RT
● Took a while to hit that
● Was it HW? SW?
– We control the app (rteval)
– Of course I blamed the HW ;-)
– Of course the HW vendor blamed SW
The Enemy
● 500 microsecond latency
The Weapons
● Function tracing
● Latency tracers
● HW Lat detector
● Event tracing
● trace_printk()
Function Tracing
● echo function > current_tracer
● echo function_graph > current_tracer
● trace-cmd is nicer
– trace-cmd start -p function_graph
– trace-cmd stop
– trace-cmd extract
– trace-cmd report
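Those echo commands are just file writes into the tracing directory. For illustration, a minimal C sketch doing the same thing (the path assumes debugfs mounted at /sys/kernel/debug; newer kernels also expose /sys/kernel/tracing):

        #include <stdio.h>

        int main(void)
        {
                /* Equivalent of: echo function_graph > current_tracer */
                FILE *f = fopen("/sys/kernel/debug/tracing/current_tracer", "w");

                if (!f) {
                        perror("current_tracer");
                        return 1;
                }
                fputs("function_graph", f);
                fclose(f);
                return 0;
        }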
rteval
● hackbench
● kernel builds
● cyclictest
● rteval --duration=100h
rteval
● Breaking it up
– rteval --onlyload --duration=100h
● Does not run cyclictest
– Run cyclictest separately
cyclictest
● cyclictest --numa -p95 -d0 -i100 -qm
– numa implies -a -t -n
● a - bind a task to each CPU
● t - thread per CPU
● n - use nanosleep() not signals
– p95 - set priority to 95
– d0 - all threads run same interval
– i100 - sleep for 100 us
– q - quiet - don't show status during test
– m - mlockall memory
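To make those options concrete, here is a minimal sketch of the measurement cyclictest makes. This is not the real tool (no priorities, CPU pinning, mlockall(), or histograms here): sleep to an absolute deadline and record how late the wakeup arrived.

        #include <stdio.h>
        #include <stdint.h>
        #include <time.h>

        static int64_t ts_ns(const struct timespec *t)
        {
                return (int64_t)t->tv_sec * 1000000000LL + t->tv_nsec;
        }

        int main(void)
        {
                struct timespec next, now;
                int64_t max_lat = 0;

                clock_gettime(CLOCK_MONOTONIC, &next);
                for (int i = 0; i < 10000; i++) {
                        /* -i100: wake up every 100 microseconds */
                        next.tv_nsec += 100 * 1000;
                        while (next.tv_nsec >= 1000000000) {
                                next.tv_nsec -= 1000000000;
                                next.tv_sec++;
                        }
                        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                        clock_gettime(CLOCK_MONOTONIC, &now);

                        /* latency = how far past the requested wakeup we woke */
                        int64_t lat = ts_ns(&now) - ts_ns(&next);
                        if (lat > max_lat)
                                max_lat = lat;
                }
                printf("max wakeup latency: %lld us\n", (long long)(max_lat / 1000));
                return 0;
        }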
cyclictest
● cyclictest --numa -p95 -d0 -i100 -qm -b 200
– b 200 - break after 200 us latency
– implies running function tracer
– Stops tracer on latency too
● Function tracing adds a lot of overhead!
● cyclictest --numa -p95 -d0 -i100 -qm -b 1000
– increase breakpoint by a lot!
cyclictest
● Function tracing adds too much overhead
● cyclictest --numa -p95 -d0 -i100 -qm -b 200 -E
– E - uses event tracing instead of function
– Better for chasing latencies (lower overhead)
– Not as much info
cyclictest
● Limit function tracing with trace-cmd
– trace-cmd start -p function -n '*lock*'
– trace-cmd start -p function -l '*sched*'
● cyclictest --numa -p95 -d0 -i100 -qm -b 300
Latency Tracers
● wakeup-rt
– Ignore the wakeup tracer
● preemptirqsoff
– Just ignore:
● irqsoff
● preemptoff
Wakeup-rt
● trace-cmd start -p wakeup-rt
● Records the time of the highest-priority RT task
– From wake up to schedule
● Problems
– defaults to running the function tracer
– trace-cmd start -p wakeup-rt -d
● disables function tracing
– Not much info without functions
– trace-cmd start -p wakeup-rt -d -e all
● enables all events
Wakeup-rt
● Didn't help :-(
– Not enough info with events
– Function tracing caused latencies
● Hard to determine if a latency was real or a Heisenbug
preemptirqsoff
● trace-cmd start -p preemptirqsoff -e all -d
● Showed us issues with the scheduler
● Pointed to load balancing
– but that was a symptom not the cause
Modified cyclictest
● Changed to use function graph instead of function
● trace-cmd -p function -l load_balance
<idle>-0 [000] 60085.036305: function: load_balance
<idle>-0 [001] 60085.036305: function: load_balance
<idle>-0 [000] 60085.036306: function: load_balance
Modified cyclictest
● trace-cmd -p function_graph -l load_balance
– Much more useful
<idle>-0 [002] 60305.482591: 0.795 us | load_balance();
<idle>-0 [003] 60305.482591: 1.035 us | load_balance();
<idle>-0 [002] 60305.482593: 0.978 us | load_balance();
<idle>-0 [003] 60305.482593: 0.456 us | load_balance();
Latency without Load Balance?
● Hit a latency, and load balance wasn't called?
● PREEMPT_RT converts spinlocks to mutexes
– except for raw_spin_locks!
● trace-cmd start -p function_graph \
-l '*raw_spin_lock*'
<idle>-0 24dN.10 111214.800190: funcgraph_entry: ! 235.991 us | _raw_spin_lock_irqsave();
Graph vs Function Tracing
● graph gives you the time spent in each function
● function tracing can give you a backtrace
– trace-cmd -p function -l 'raw_spin*' --func-stack
trace-cmd-8725 [002] 148276.692827: function: _raw_spin_lock_irq
trace-cmd-8725 [002] 148276.692830: kernel_stack: <stack trace>
=> __schedule (ffffffff8146d08f)
=> schedule (ffffffff8146dd09)
=> do_nanosleep (ffffffff8146c7ec)
=> hrtimer_nanosleep (ffffffff8106eecb)
=> sys_nanosleep (ffffffff8106f00e)
=> system_call_fastpath (ffffffff81476692)
What to do?
● Keep function graph
● Add events
● All events added their own latencies
– Limit the events to trace
● trace-cmd start -p function_graph -l '*raw_spin_lock*' -e sched -e timer -e irq
Long story short
● Found the latency
● rq lock contention in pull_rt_tasks
● 30 or more CPUs tried to take the same lock
● Between cache line bouncing and locking the bus, this caused a large HW latency
– but you can still blame SW
● Fixed by doing IPIs instead (see the sketch after the diagrams below)
Pull RT Tasks
[Diagram series, CPUs 0 through 40: every CPU runs a cyclictest thread at prio 90. A watchdog thread at prio 99 preempts one of them, and later an irq thread at prio 50 wakes up while most of the CPUs have gone idle.]
Pull RT Tasks: The Finding Nemo Seagull Effect
[Diagram: every idle CPU dives for the one waiting irq thread at once, "Mine! Mine! Mine!", each grabbing for the same run queue lock.]
Pull RT Tasks: IPI to push task
[Diagram: instead of pulling, the idle CPUs send IPIs to the CPU that owns the waiting irq thread, and that CPU pushes the task to a single idle CPU itself.]
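To make the fix concrete, here is a compilable toy sketch of the before and after. All names are stand-ins invented for illustration, not the actual kernel symbols; the real change lives in the RT scheduler's pull/push paths and uses IPI-triggered work on the overloaded CPU.

        #include <stdio.h>

        struct rq { int cpu; };                     /* stand-in run queue */

        static void lock_rq(struct rq *rq)   { printf("lock rq %d\n", rq->cpu); }
        static void unlock_rq(struct rq *rq) { printf("unlock rq %d\n", rq->cpu); }
        static void send_ipi(int cpu)        { printf("IPI to cpu %d\n", cpu); }

        /* Before: every CPU that goes idle pulls the waiting RT task
         * itself. With 30+ CPUs going idle at once, all of them take
         * src_rq's lock, bouncing its cache line across the machine. */
        static void pull_rt_task_locked(struct rq *this_rq, struct rq *src_rq)
        {
                (void)this_rq;
                lock_rq(src_rq);            /* 30+ CPUs contend right here */
                /* ... move the highest waiting RT task to this_rq ... */
                unlock_rq(src_rq);
        }

        /* After: the idle CPU just asks the overloaded CPU, via IPI, to
         * push the task. Only the overloaded CPU ever takes its own lock. */
        static void pull_rt_task_ipi(struct rq *this_rq, struct rq *src_rq)
        {
                (void)this_rq;
                send_ipi(src_rq->cpu);
        }

        int main(void)
        {
                struct rq idle = { .cpu = 0 }, busy = { .cpu = 2 };

                pull_rt_task_locked(&idle, &busy);
                pull_rt_task_ipi(&idle, &busy);
                return 0;
        }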
The End?
● Looked like we found our bug!
● Started the verification process
● Told everyone things would be verified shortly
Nope!
● Passed a 12 hour run
● Failed a 24 hour run
● Let's start again!
HW Lat Detector
● Hardware latency detector
● Runs periodic stop machine
– Define a period and run
– run must not equal period
● otherwise the system will lock up
● Spins looking for latency
HW Lat Issue
        while (now - start < period) {
                tmp = timestamp();
                now = timestamp();
                diff = now - tmp;
                if (diff > thresh)
                        record();
        }
HW Lat Issue
[Pie chart labeled 20% / 80%: only part of each loop iteration falls between the two timestamp reads, so a latency that lands in the remaining slice is never seen.]
HW Lat Issue
        last = 0;
        while (now - start < period) {
                tmp = timestamp();
                now = timestamp();
                if (last) {
                        diff = tmp - last;
                        if (diff > outer_thresh)
                                record_outer();
                }
                last = now;
                diff = now - tmp;
                if (diff > thresh)
                        record();
        }
HW Lat Detector: Stop Machine
● Needs to run periodically
– Will lock up the system otherwise
– Has a chance to miss the latency again!
● Changed to a thread
– Thread takes up one of the CPUs
– Still needs to yield
● Locks up the machine otherwise
– But the yield window is much smaller than the periodic gap
● More likely to measure the latency
HW Lat Detector: Worked!
● But not good enough
● Vendor did not trust this code ???
● Had to use their code
– Did somewhat the same thing
– In userspace
– Could easily miss latencies
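For comparison, here is a hypothetical userspace sketch of the same inner/outer sampling loop. Everything here is illustrative, and because it runs with interrupts enabled, any gap it reports may be ordinary preemption rather than hardware, which is exactly why a userspace checker can easily miss or misattribute latencies.

        #include <stdio.h>
        #include <stdint.h>
        #include <time.h>

        static int64_t now_ns(void)
        {
                struct timespec t;

                clock_gettime(CLOCK_MONOTONIC, &t);
                return (int64_t)t.tv_sec * 1000000000LL + t.tv_nsec;
        }

        int main(void)
        {
                const int64_t period = 500000000LL;  /* spin for 0.5 s      */
                const int64_t thresh = 10000LL;      /* report gaps > 10 us */
                int64_t start = now_ns(), last = 0;
                int64_t tmp = 0, now = 0;

                do {
                        tmp = now_ns();
                        now = now_ns();
                        /* outer check: gap between this timestamp pair
                         * and the previous iteration's */
                        if (last && tmp - last > thresh)
                                printf("outer gap: %lld ns\n", (long long)(tmp - last));
                        /* inner check: gap between the two reads themselves */
                        if (now - tmp > thresh)
                                printf("inner gap: %lld ns\n", (long long)(now - tmp));
                        last = now;
                } while (now - start < period);

                return 0;
        }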
trace-cmd
● trace-cmd start -p function_graph -l 'raw_spin*' -e all
● Modified cyclictest to use function graph
– Still limited to raw_spin* locks
– Won't disable the events started
● cyclictest will still stop the trace on latency
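The "print:" line at the end of the trace below is cyclictest's own marker. Roughly, an application can freeze the trace at the interesting moment by writing a note to trace_marker and then clearing tracing_on. A minimal sketch, assuming tracefs under /sys/kernel/debug/tracing:

        #include <stdio.h>

        /* Freeze the trace the way cyclictest -b does: leave a note in
         * the buffer via trace_marker, then turn tracing off so the
         * interesting data is not overwritten. */
        static void stop_trace(long lat_us, long thresh_us)
        {
                FILE *f;

                f = fopen("/sys/kernel/debug/tracing/trace_marker", "w");
                if (f) {
                        fprintf(f, "hit latency threshold (%ld > %ld)\n",
                                lat_us, thresh_us);
                        fclose(f);
                }
                f = fopen("/sys/kernel/debug/tracing/tracing_on", "w");
                if (f) {
                        fputs("0", f);
                        fclose(f);
                }
        }

        int main(void)
        {
                stop_trace(247, 200);
                return 0;
        }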
trace-cmd
ksoftirqd/33-216 [033] 55597.719935: timer_cancel: timer=0xffff88403f0ce520
ksoftirqd/33-216 [033] 55597.719935: timer_expire_entry: timer=0xffff88403f0ce520 function=delayed_work
ksoftirqd/33-216 [033] 55597.719935: funcgraph_entry: 0.069 us | _raw_spin_lock_irqsave();
ksoftirqd/33-216 [033] 55597.719936: funcgraph_entry: 0.047 us | _raw_spin_lock();
ksoftirqd/33-216 [033] 55597.719936: sched_stat_sleep: comm=kworker/33:1 pid=1222 delay=132870067
ksoftirqd/33-216 [033] 55597.719937: sched_wakeup: kworker/33:1:1222 [120] success=1 CPU:033
ksoftirqd/33-216 [033] 55597.719937: timer_expire_exit: timer=0xffff88403f0ce520
ksoftirqd/33-216 [033] 55597.719942: timer_cancel: timer=0xffff88403f0ce620
ksoftirqd/33-216 [033] 55597.719942: timer_expire_entry: timer=0xffff88403f0ce620 function=delayed_work
ksoftirqd/33-216 [033] 55597.719943: funcgraph_entry: 0.045 us | _raw_spin_lock_irqsave();
ksoftirqd/33-216 [033] 55597.719943: timer_expire_exit: timer=0xffff88403f0ce620
cyclictest-6110 [007] 55597.719955: funcgraph_entry: 0.194 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719956: funcgraph_entry: 0.175 us | _raw_spin_lock_irqsave();
cyclictest-6110 [007] 55597.719957: funcgraph_entry: 0.175 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719957: funcgraph_entry: 2.436 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719957: funcgraph_entry: 0.203 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719958: sched_wakeup: cyclictest:6113 [4] success=1 CPU:010
cyclictest-6110 [007] 55597.719959: funcgraph_entry: 0.048 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719960: funcgraph_entry: 0.170 us | _raw_spin_lock_irq();
cyclictest-6113 [010] 55597.719961: funcgraph_entry: 0.045 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719962: funcgraph_entry: 0.043 us | _raw_spin_lock_irq();
cyclictest-6110 [007] 55597.719963: print: ffffffff810e5776 hit latency threshold (247 > 200)
trace-cmd
● trace-cmd report
– Lots of information
– Detailed information
– Great to analyze
● TOO MUCH INFO!
– Cannot understand it all
– Hard to see the big picture
● KernelShark
kernelshark
[Series of KernelShark GUI screenshots stepping through the captured trace.]
Demo
Questions?
Yeah Right? Like we have time.