Hardware Latencies: How to flush them out
(A use case)
Steven Rostedt, Red Hat
Here’s a story, of a lovely lady...
● No, this isn't the Brady Bunch
– Nor is it about a lovely lady
– But it probably could have been a Brady Bunch episode.
Here’s a story, of an upset customer
● Who was seeing lots of latencies on their own
● The machine wasn't verified yet
● Real time requires not just a kernel
– Requires the entire spectrum
● Application
● Kernel
● Hardware
Verification of Hardware
● rteval
– A tool by Red Hat to stress the machine
– Measures jitter (using cyclictest)
● Was a large machine
– 40 CPUs
– For such a box, we expect no more than
● 200us jitter
– We'd like less, but we are lenient with large HW
Latencies
● Seeing 500us latencies!!!!
– May not sound big to you
– But it's huge for PREEMPT_RT
● Took a while to hit that
● Was it HW? SW?
– We control the app (rteval)
– Of course I blamed the HW ;-)
– Of course the HW vendor blamed SW
The Enemy
● 500 microsecond latency
The Weapons
● Function tracing
● Latency tracers
● HW Lat detector
● Event tracing
● trace_printk()
Function Tracing
● echo function > current_tracer
● echo function_graph > current_tracer
● trace-cmd is nicer
– trace-cmd start -p function_graph
– trace-cmd stop
– trace-cmd extract
– trace-cmd report
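Those echo commands are just file writes into the tracing directory. For illustration, a minimal C sketch doing the same thing (the path assumes debugfs mounted at /sys/kernel/debug; newer kernels also expose /sys/kernel/tracing):

        #include <stdio.h>

        int main(void)
        {
                /* Equivalent of: echo function_graph > current_tracer */
                FILE *f = fopen("/sys/kernel/debug/tracing/current_tracer", "w");

                if (!f) {
                        perror("current_tracer");
                        return 1;
                }
                fputs("function_graph", f);
                fclose(f);
                return 0;
        }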
rteval
● hackbench
● kernel builds
● cyclictest
● rteval --duration=100h
rteval
● Breaking it up
– rteval --onlyload --duration=100h
● Does not run cyclictest
– Run cyclictest separately
cyclictest
● cyclictest --numa -p95 -d0 -i100 -qm
– numa implies -a -t -n
● a - bind a task to each CPU
● t - thread per CPU
● n - use nanosleep() not signals
– p95 - set priority to 95
– d0 - all threads run same interval
– i100 - sleep for 100 us
– q - quiet - don't show status during test
– m - mlockall memory
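To make those options concrete, here is a minimal sketch of the measurement cyclictest makes. This is not the real tool (no priorities, CPU pinning, mlockall(), or histograms here): sleep to an absolute deadline and record how late the wakeup arrived.

        #include <stdio.h>
        #include <stdint.h>
        #include <time.h>

        static int64_t ts_ns(const struct timespec *t)
        {
                return (int64_t)t->tv_sec * 1000000000LL + t->tv_nsec;
        }

        int main(void)
        {
                struct timespec next, now;
                int64_t max_lat = 0;

                clock_gettime(CLOCK_MONOTONIC, &next);
                for (int i = 0; i < 10000; i++) {
                        /* -i100: wake up every 100 microseconds */
                        next.tv_nsec += 100 * 1000;
                        while (next.tv_nsec >= 1000000000) {
                                next.tv_nsec -= 1000000000;
                                next.tv_sec++;
                        }
                        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                        clock_gettime(CLOCK_MONOTONIC, &now);

                        /* latency = how far past the requested wakeup we woke */
                        int64_t lat = ts_ns(&now) - ts_ns(&next);
                        if (lat > max_lat)
                                max_lat = lat;
                }
                printf("max wakeup latency: %lld us\n", (long long)(max_lat / 1000));
                return 0;
        }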
cyclictest
● cyclictest --numa -p95 -d0 -i100 -qm -b 200
– b 200 - break after 200 us latency
– implies running function tracer
– Stops tracer on latency too
● Function tracing adds a lot of overhead!
● cyclictest --numa -p95 -d0 -i100 -qm -b 1000
– increase breakpoint by a lot!
cyclictest
● Function tracing adds too much overhead
● cyclictest --numa -p95 -d0 -i100 -qm -b 200 -E
– E - uses event tracing instead of function
– Better for chasing latencies (lower overhead)
– Not as much info
cyclictest
● Limit function tracing with trace-cmd
– trace-cmd start -p function -n '*lock*'
– trace-cmd start -p function -l '*sched*'
● cyclictest --numa -p95 -d0 -i100 -qm -b 300
Latency Tracers
● wakeup-rt
– Ignore the wakeup tracer
● preemptirqsoff
– Just ignore:
● irqsoff
● preemptoff
Wakeup-rt
● trace-cmd start -p wakeup-rt
● Records the time of the highest-priority RT task
– From wake up to schedule
● Problems
– defaults to running the function tracer
– trace-cmd start -p wakeup-rt -d
● disables function tracing
– Not much info without functions
– trace-cmd start -p wakeup-rt -d -e all
● enables all events
Wakeup-rt
● Didn't help :-(
– Not enough info with events
– Function tracing caused latencies
● Hard to determine if a latency was real or a Heisenbug
preemptirqsoff
● trace-cmd start -p preemptirqsoff -e all -d
● Showed us issues with the scheduler
● Pointed to load balancing
– but that was a symptom not the cause
Modified cyclictest
● Changed to use function graph instead of function
● trace-cmd -p function -l load_balance
<idle>-0 [000] 60085.036305: function: load_balance
<idle>-0 [001] 60085.036305: function: load_balance
<idle>-0 [000] 60085.036306: function: load_balance
Modified cyclictest
● trace-cmd -p function_graph -l load_balance
– Much more useful
<idle>-0 [002] 60305.482591: 0.795 us | load_balance();
<idle>-0 [003] 60305.482591: 1.035 us | load_balance();
<idle>-0 [002] 60305.482593: 0.978 us | load_balance();
<idle>-0 [003] 60305.482593: 0.456 us | load_balance();
Latency without Load Balance?
● Hit a latency, and load balance wasn't called?
● PREEMPT_RT converts spinlocks to mutexes
– except for raw_spin_locks!
● trace-cmd start -p function_graph \
-l '*raw_spin_lock*'
<idle>-0 24dN.10 111214.800190: funcgraph_entry: ! 235.991 us | _raw_spin_lock_irqsave();
Graph vs Function Tracing
● graph gives you the time spent in each function
● function tracing can give you a backtrace
– trace-cmd -p function -l 'raw_spin*' --func-stack
trace-cmd-8725 [002] 148276.692827: function: _raw_spin_lock_irq
trace-cmd-8725 [002] 148276.692830: kernel_stack: <stack trace>
=> __schedule (ffffffff8146d08f)
=> schedule (ffffffff8146dd09)
=> do_nanosleep (ffffffff8146c7ec)
=> hrtimer_nanosleep (ffffffff8106eecb)
=> sys_nanosleep (ffffffff8106f00e)
=> system_call_fastpath (ffffffff81476692)
What to do?
● Keep function graph
● Add events
● All events added their own latencies
– Limit the events to trace
● trace-cmd start -p function_graph -l '*raw_spin_lock*' -e sched -e timer -e irq
Long story short
● Found the latency
● rq lock contention in pull_rt_tasks
● 30 or more CPUs tried to take the same lock
● Between cache line bouncing and locking the bus, this caused a large HW latency
– but you can still blame SW
● Fixed by doing IPIs instead (see the sketch after the diagrams below)
Pull RT Tasks
[Diagram series, CPUs 0 through 40: every CPU runs a cyclictest thread at prio 90. A watchdog thread at prio 99 preempts one of them, and later an irq thread at prio 50 wakes up while most of the CPUs have gone idle.]
Pull RT Tasks: The Finding Nemo Seagull Effect
[Diagram: every idle CPU dives for the one waiting irq thread at once, "Mine! Mine! Mine!", each grabbing for the same run queue lock.]
Pull RT Tasks: IPI to push task
[Diagram: instead of pulling, the idle CPUs send IPIs to the CPU that owns the waiting irq thread, and that CPU pushes the task to a single idle CPU itself.]
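To make the fix concrete, here is a compilable toy sketch of the before and after. All names are stand-ins invented for illustration, not the actual kernel symbols; the real change lives in the RT scheduler's pull/push paths and uses IPI-triggered work on the overloaded CPU.

        #include <stdio.h>

        struct rq { int cpu; };                     /* stand-in run queue */

        static void lock_rq(struct rq *rq)   { printf("lock rq %d\n", rq->cpu); }
        static void unlock_rq(struct rq *rq) { printf("unlock rq %d\n", rq->cpu); }
        static void send_ipi(int cpu)        { printf("IPI to cpu %d\n", cpu); }

        /* Before: every CPU that goes idle pulls the waiting RT task
         * itself. With 30+ CPUs going idle at once, all of them take
         * src_rq's lock, bouncing its cache line across the machine. */
        static void pull_rt_task_locked(struct rq *this_rq, struct rq *src_rq)
        {
                (void)this_rq;
                lock_rq(src_rq);            /* 30+ CPUs contend right here */
                /* ... move the highest waiting RT task to this_rq ... */
                unlock_rq(src_rq);
        }

        /* After: the idle CPU just asks the overloaded CPU, via IPI, to
         * push the task. Only the overloaded CPU ever takes its own lock. */
        static void pull_rt_task_ipi(struct rq *this_rq, struct rq *src_rq)
        {
                (void)this_rq;
                send_ipi(src_rq->cpu);
        }

        int main(void)
        {
                struct rq idle = { .cpu = 0 }, busy = { .cpu = 2 };

                pull_rt_task_locked(&idle, &busy);
                pull_rt_task_ipi(&idle, &busy);
                return 0;
        }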
The End?
● Looked like we found our bug!
● Started the verification process
● Told everyone things would be verified shortly
Nope!
● Passed a 12 hour run
● Failed a 24 hour run
● Let's start again!
HW Lat Detector
● Hardware latency detector
● Runs periodic stop machine
– Define a period and run
– run must not equal period
● otherwise the system will lock up
● Spins looking for latency
HW Lat Issue
        while (now - start < period) {
                tmp = timestamp();
                now = timestamp();
                diff = now - tmp;
                if (diff > thresh)
                        record();
        }
HW Lat Issue
[Pie chart labeled 20% / 80%: only part of each loop iteration falls between the two timestamp reads, so a latency that lands in the remaining slice is never seen.]
HW Lat Issue
        last = 0;
        while (now - start < period) {
                tmp = timestamp();
                now = timestamp();
                if (last) {
                        diff = tmp - last;
                        if (diff > outer_thresh)
                                record_outer();
                }
                last = now;
                diff = now - tmp;
                if (diff > thresh)
                        record();
        }
HW Lat Detector: Stop Machine
● Needs to run periodically
– Will lock up the system otherwise
– Has a chance to miss the latency again!
● Changed to a thread
– Thread takes up one of the CPUs
– Still needs to yield
● Locks up the machine otherwise
– But the yield window is much smaller than the periodic gap
● More likely to measure the latency
HW Lat Detector: Worked!
● But not good enough
● Vendor did not trust this code ???
● Had to use their code
– Did somewhat the same thing
– In userspace
– Could easily miss latencies
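For comparison, here is a hypothetical userspace sketch of the same inner/outer sampling loop. Everything here is illustrative, and because it runs with interrupts enabled, any gap it reports may be ordinary preemption rather than hardware, which is exactly why a userspace checker can easily miss or misattribute latencies.

        #include <stdio.h>
        #include <stdint.h>
        #include <time.h>

        static int64_t now_ns(void)
        {
                struct timespec t;

                clock_gettime(CLOCK_MONOTONIC, &t);
                return (int64_t)t.tv_sec * 1000000000LL + t.tv_nsec;
        }

        int main(void)
        {
                const int64_t period = 500000000LL;  /* spin for 0.5 s      */
                const int64_t thresh = 10000LL;      /* report gaps > 10 us */
                int64_t start = now_ns(), last = 0;
                int64_t tmp = 0, now = 0;

                do {
                        tmp = now_ns();
                        now = now_ns();
                        /* outer check: gap between this timestamp pair
                         * and the previous iteration's */
                        if (last && tmp - last > thresh)
                                printf("outer gap: %lld ns\n", (long long)(tmp - last));
                        /* inner check: gap between the two reads themselves */
                        if (now - tmp > thresh)
                                printf("inner gap: %lld ns\n", (long long)(now - tmp));
                        last = now;
                } while (now - start < period);

                return 0;
        }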
trace-cmd
● trace-cmd start -p function_graph -l 'raw_spin*' -e all
● Modified cyclictest to use function graph
– Still limited to raw_spin* locks
– Won't disable the events started
● cyclictest will still stop the trace on latency
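The "print:" line at the end of the trace below is cyclictest's own marker. Roughly, an application can freeze the trace at the interesting moment by writing a note to trace_marker and then clearing tracing_on. A minimal sketch, assuming tracefs under /sys/kernel/debug/tracing:

        #include <stdio.h>

        /* Freeze the trace the way cyclictest -b does: leave a note in
         * the buffer via trace_marker, then turn tracing off so the
         * interesting data is not overwritten. */
        static void stop_trace(long lat_us, long thresh_us)
        {
                FILE *f;

                f = fopen("/sys/kernel/debug/tracing/trace_marker", "w");
                if (f) {
                        fprintf(f, "hit latency threshold (%ld > %ld)\n",
                                lat_us, thresh_us);
                        fclose(f);
                }
                f = fopen("/sys/kernel/debug/tracing/tracing_on", "w");
                if (f) {
                        fputs("0", f);
                        fclose(f);
                }
        }

        int main(void)
        {
                stop_trace(247, 200);
                return 0;
        }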
trace-cmd
ksoftirqd/33-216 [033] 55597.719935: timer_cancel: timer=0xffff88403f0ce520
ksoftirqd/33-216 [033] 55597.719935: timer_expire_entry: timer=0xffff88403f0ce520 function=delayed_work
ksoftirqd/33-216 [033] 55597.719935: funcgraph_entry: 0.069 us | _raw_spin_lock_irqsave();
ksoftirqd/33-216 [033] 55597.719936: funcgraph_entry: 0.047 us | _raw_spin_lock();
ksoftirqd/33-216 [033] 55597.719936: sched_stat_sleep: comm=kworker/33:1 pid=1222 delay=132870067
ksoftirqd/33-216 [033] 55597.719937: sched_wakeup: kworker/33:1:1222 [120] success=1 CPU:033
ksoftirqd/33-216 [033] 55597.719937: timer_expire_exit: timer=0xffff88403f0ce520
ksoftirqd/33-216 [033] 55597.719942: timer_cancel: timer=0xffff88403f0ce620
ksoftirqd/33-216 [033] 55597.719942: timer_expire_entry: timer=0xffff88403f0ce620 function=delayed_work
ksoftirqd/33-216 [033] 55597.719943: funcgraph_entry: 0.045 us | _raw_spin_lock_irqsave();
ksoftirqd/33-216 [033] 55597.719943: timer_expire_exit: timer=0xffff88403f0ce620
cyclictest-6110 [007] 55597.719955: funcgraph_entry: 0.194 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719956: funcgraph_entry: 0.175 us | _raw_spin_lock_irqsave();
cyclictest-6110 [007] 55597.719957: funcgraph_entry: 0.175 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719957: funcgraph_entry: 2.436 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719957: funcgraph_entry: 0.203 us | _raw_spin_lock();
cyclictest-6110 [007] 55597.719958: sched_wakeup: cyclictest:6113 [4] success=1 CPU:010
cyclictest-6110 [007] 55597.719959: funcgraph_entry: 0.048 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719960: funcgraph_entry: 0.170 us | _raw_spin_lock_irq();
cyclictest-6113 [010] 55597.719961: funcgraph_entry: 0.045 us | _raw_spin_lock_irqsave();
cyclictest-6113 [010] 55597.719962: funcgraph_entry: 0.043 us | _raw_spin_lock_irq();
cyclictest-6110 [007] 55597.719963: print: ffffffff810e5776 hit latency threshold (247 > 200)
trace-cmd
● trace-cmd report
– Lots of information
– Detailed information
– Great to analyze
● TOO MUCH INFO!
– Cannot understand it all
– Hard to see the big picture
● KernelShark
kernelshark
[Series of KernelShark GUI screenshots stepping through the captured trace.]
Demo
Questions?
Yeah Right? Like we have time.