Finding concurrency problems in core ruby libraries

Louis Dunne Principal Software Engineer @ Workday Finding Concurrency Problems in Core Ruby Libraries


Page 1: Finding concurrency problems in core ruby libraries

Louis Dunne, Principal Software Engineer @ Workday

Finding Concurrency Problems in Core Ruby Libraries

Page 2: Finding concurrency problems in core ruby libraries

Structure of the Talk

• Signals & Stacktraces

• Reproduce / Examine / Experiment

• Root Cause Analysis

• Results From MRI, JRuby, RBX

• Lessons from Other Languages

Page 3: Finding concurrency problems in core ruby libraries

Signals & Stacktraces

Signal Handlers in 2.0

• Can't use a mutex

• Hard to share state safely

Page 4: Finding concurrency problems in core ruby libraries

Signals & Stacktraces

reader, writer = IO.pipe

# writer.puts won't block
%w(INT USR2).each do |sig|
  Signal.trap(sig) { writer.puts(sig) }
end

Thread.new { signal_thread(reader) }

Page 5: Finding concurrency problems in core ruby libraries

Signals & Stacktraces

# signal_thread...
sig = reader.gets.chomp # This will block

...

Thread.list.each do |thd|
  puts(thd.backtrace.join("\n"))
end
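The two fragments on these slides fit together roughly as follows. This is a minimal sketch, not the speaker's actual code: trapping only USR2, the dump_backtraces helper and the dumps queue are assumptions made so that the example is self-contained.

```ruby
require 'stringio'

# Self-pipe trick: the trap handler only writes to a pipe and a normal
# thread does the real work, so no mutex is needed inside the handler.
reader, writer = IO.pipe

Signal.trap('USR2') { writer.puts('USR2') }

# Hypothetical helper: write the backtrace of every live thread to io.
def dump_backtraces(io)
  Thread.list.each do |thd|
    io.puts("== Thread #{thd.object_id} (#{thd.status})")
    # backtrace can be nil if the thread has just died
    io.puts((thd.backtrace || []).join("\n"))
  end
end

# Assumed for illustration: dumps are collected in a queue instead of
# being printed, so the caller decides what to do with them.
dumps = Queue.new

signal_thread = Thread.new do
  # reader.gets blocks until a trap handler writes a line
  while (sig = reader.gets)
    out = StringIO.new
    out.puts("Received SIG#{sig.chomp}")
    dump_backtraces(out)
    dumps << out.string
  end
end
```

With something like this running, sending the process a USR2 signal (for example with kill -USR2 and the process ID) should produce one dump per signal.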

Page 6: Finding concurrency problems in core ruby libraries

Reproduce, Examine, Experiment


Page 7: Finding concurrency problems in core ruby libraries

Reproduce It

• A lot of effort but essential

• The easier you can reproduce it, the easier you can debug it

Page 8: Finding concurrency problems in core ruby libraries

Examine The Code

• Multi-threaded principles

• Anything obvious?

• You still need to experiment and prove your case

Page 9: Finding concurrency problems in core ruby libraries

Experiment

• Start running experiments

• See if your expectations match reality

• Keep a written log

Page 10: Finding concurrency problems in core ruby libraries

Reproduce


Page 11: Finding concurrency problems in core ruby libraries

Reproduce

• Start simple...

• Start 100 clients doing a list operation

Page 12: Finding concurrency problems in core ruby libraries

Reproduce

while true; do
  date
  echo "Starting 100 requests..."
  for i in {1..100}; do
    <rest-client-list-operation> &
  done
  wait
done

Page 13: Finding concurrency problems in core ruby libraries

Reproduce

• On my laptop → No lockup

• On a real server → No lockup

• Need to try both

Page 14: Finding concurrency problems in core ruby libraries

Reproduce

• More concurrency

• A dependency verification thread

• Run this every second

• Test again for 30 minutes → No lockup

Page 15: Finding concurrency problems in core ruby libraries

Reproduce

• We deploy to an OpenStack cluster

• What if we do nothing and return early?

• Run the test again for 30 minutes → No lockup

Page 16: Finding concurrency problems in core ruby libraries

Reproduce

Timeout.timeout(job.timeout_seconds) do
  run_job(job)
end

• Set job.timeout_seconds to 1

→ Deadlock!

Page 17: Finding concurrency problems in core ruby libraries

Reproduce

.../monitor.rb:185:in `lock'

.../monitor.rb:185:in `mon_enter'

.../monitor.rb:210:in `mon_synchronize'

.../logger.rb:559:in `write'

Page 18: Finding concurrency problems in core ruby libraries

Examine


Page 19: Finding concurrency problems in core ruby libraries

Examine

def write(message)
  begin
    @mutex.synchronize do
      # write-the-log-line
    end
  rescue Exception
    # log-a-warning
  end
end

Anything wrong with this?
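One thing worth checking here is what rescue Exception does to an exception that another thread raises into the writer. The sketch below does not use Logger or Timeout at all: StandInInterrupt is a made-up exception class standing in for whatever a timeout would raise, and the sleep stands in for writing the log line.

```ruby
# Hypothetical exception class. It deliberately inherits from Exception,
# not StandardError, so a bare `rescue` would not catch it.
class StandInInterrupt < Exception; end

inside  = Queue.new
rescued = Queue.new

worker = Thread.new do
  begin
    inside << true        # tell the main thread we are inside the begin block
    sleep                 # stands in for "write-the-log-line"
  rescue Exception => e
    rescued << e.class    # stands in for "log-a-warning"
  end
  :carried_on             # the caller never learns it was interrupted
end

inside.pop                                          # wait for the worker
worker.raise(StandInInterrupt, 'execution expired') # asynchronous interrupt
result = worker.value
```

In this sketch the interrupt is swallowed inside the method and the worker simply carries on, which is the reason a blanket rescue Exception around a lock deserves a second look.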

Page 20: Finding concurrency problems in core ruby libraries

Examine

• mon_synchronize

• mon_enter

• mon_exit

• mon_check_owner

Page 21: Finding concurrency problems in core ruby libraries

Examine

def mon_synchronize
  mon_enter
  begin
    yield
  ensure
    mon_exit
  end
end

Looks OK

Page 22: Finding concurrency problems in core ruby libraries

Examine

def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
  end
  @mon_count += 1
end

@mon_mutex = Mutex.new
@mon_owner = nil
@mon_count = 0

Grrrrr...

Page 23: Finding concurrency problems in core ruby libraries

Rant

• I’ve debugged locks involving reentrant mutexes more times than I can remember

• If you ever feel like using a reentrant mutex, please I beg you, don’t do it

• There’s almost always a way to structure your code so that you can use a regular mutex

Page 24: Finding concurrency problems in core ruby libraries

Examine

def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
  end
  @mon_count += 1
end

Anything wrong with this?

Page 25: Finding concurrency problems in core ruby libraries

Examine

def mon_exit
  mon_check_owner   # if @mon_owner != Thread.current raise...
  @mon_count -= 1
  if @mon_count == 0
    @mon_owner = nil
    @mon_mutex.unlock
  end
end

Looks OK

Page 26: Finding concurrency problems in core ruby libraries

Examine

• Take a look at the first line in mon_enter: if @mon_owner != Thread.current

• @mon_owner is modified by multiple threads

• It is read by other threads without being locked

• Read access needs a mutex too

Page 27: Finding concurrency problems in core ruby libraries

Aside: Double Checked Locking

• Many people have gotten this wrong

• Doug Schmidt & Co, ACE C++

• Pattern-Oriented Software Architecture (Volume 2, April 2001)

• Popularised a pattern that was completely broken: Double Checked Locking
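For readers who have not met the pattern, here is a sketch of its shape in Ruby next to the plain always-lock version. The LazyValue class and its method names are invented for this illustration; nothing here is taken from ACE or from the book, and whether the double-checked variant actually misbehaves depends on the platform and the language's memory model.

```ruby
class LazyValue
  def initialize(&builder)
    @builder = builder
    @mutex   = Mutex.new
    @value   = nil
    @built   = false
  end

  # Double-checked locking, shown only for contrast: the first check
  # reads shared state without holding the lock.
  def get_double_checked
    unless @built                  # unsynchronised read
      @mutex.synchronize do
        unless @built              # second check, under the lock
          @value = @builder.call
          @built = true
        end
      end
    end
    @value
  end

  # The boring alternative: every read takes the lock.
  def get
    @mutex.synchronize do
      unless @built
        @value = @builder.call
        @built = true
      end
      @value
    end
  end
end

calls   = Queue.new
lazy    = LazyValue.new { calls << :built; 42 }
results = 10.times.map { Thread.new { lazy.get } }.map(&:value)
```

The always-lock version costs one uncontended lock per read, which is usually a small price for not having to reason about memory visibility.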

Page 28: Finding concurrency problems in core ruby libraries

Aside: Takeaway

• A variable shared between multiple threads...

• ...Modified by one or more threads

• You need to use a mutex around the modification (of course)

• But you also need to use a mutex around any READ access to that variable

GIL?
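As a sketch of this takeaway, the class below takes the same mutex for the read as for the write. SharedCounter is a hypothetical name used only for this example.

```ruby
class SharedCounter
  def initialize
    @mutex = Mutex.new
    @value = 0
  end

  # Write access under the mutex (the obvious part).
  def increment
    @mutex.synchronize { @value += 1 }
  end

  # Read access under the same mutex (the part that gets forgotten).
  def value
    @mutex.synchronize { @value }
  end
end

counter = SharedCounter.new
threads = 8.times.map do
  Thread.new { 1000.times { counter.increment } }
end
threads.each(&:join)
total = counter.value
```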

Page 29: Finding concurrency problems in core ruby libraries

Aside: Takeaway

This is because of…

• Instruction pipelining

• Multiple levels of chip caches

• Out of order memory references

• The memory model of the platform

• The memory model of the language

Page 30: Finding concurrency problems in core ruby libraries

Examine

def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
  end
  @mon_count += 1
end

def mon_exit
  mon_check_owner
  @mon_count -= 1
  if @mon_count == 0
    @mon_owner = nil
    @mon_mutex.unlock
  end
end

Page 31: Finding concurrency problems in core ruby libraries

Examine

So that's two concerning things so far:

1. Logger's rescue of Exception

2. Read access to @mon_owner outside of any mutex

Page 32: Finding concurrency problems in core ruby libraries

Experiment


Page 33: Finding concurrency problems in core ruby libraries

Experiment 1

The Change:

• Print (puts) every access to @mon_owner and @mon_count (& the Thread ID)

The Result:

• Deadlock

• I saw @mon_count changing from 0 to 2

Page 34: Finding concurrency problems in core ruby libraries

Experiment 2

The Change:

• Keep track of @mon_count and @mon_owner in a list in memory (& the Thread ID)

• Puts the list when we dump the stacktraces

The Result:

• Deadlock

• @mon_count changing from 0 to 2 (same)

Page 35: Finding concurrency problems in core ruby libraries

Experiment 3

The Change:

• @mon_owner and @mon_count don’t really need to be shared among threads

• Use thread local variables instead

The Result:

• Deadlock

• @mon_count jumps from 0 to 2 occasionally

Page 36: Finding concurrency problems in core ruby libraries

Experiment 4

The Change:

• When a thread acquires the monitor mutex @mon_count should always be zero

• So check to see if it’s ever non-zero

Page 37: Finding concurrency problems in core ruby libraries

Experiment 4

def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
    if @mon_count != 0
      puts '=========XXXXXXXXXX======='
    end
  end
  @mon_count += 1
end

Page 38: Finding concurrency problems in core ruby libraries

Experiment 4

The Result:

• Test again → No Deadlock → No log line

• OK that's really odd…

• But you can't rely on a negative, so then I removed those lines and ran again

• Now it locks

Page 39: Finding concurrency problems in core ruby libraries

Experiment 4

The Result:

• Add back the lines → Doesn't lock

• Remove the lines → Deadlocks quickly

• Hmm, ok that's definitely odd, feels like a memory visibility issue

Page 40: Finding concurrency problems in core ruby libraries

Experiment 5

The Change:

• Download and build a debug version of MRI

• Looking in thread.c I found rb_threadptr_unlock_all_locking_mutexes() with the following warning commented out:

/* rb_warn("mutex #<%p> remains to be locked by terminated thread", mutexes); */

Page 41: Finding concurrency problems in core ruby libraries

Experiment 5

The Result:

• Deadlocks

• Saw threads exiting with that warning about locked mutexes

Page 42: Finding concurrency problems in core ruby libraries

Experiment 6

The Change:

• Examining mon_enter and mon_exit we can see that when the lock is taken @mon_count should always be zero

• But we saw @mon_count jumping from 0 to 2 so let’s try putting in @mon_count = 0 explicitly

Page 43: Finding concurrency problems in core ruby libraries

Experiment 6

def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
    @mon_count = 0
  end
  @mon_count += 1
end

Page 44: Finding concurrency problems in core ruby libraries

Experiment 6

The Result:

• Doesn't lock, left it running for hours

• Take out the @mon_count = 0 and it locks

• But remember checking if @mon_count != 0 had the same effect

Page 45: Finding concurrency problems in core ruby libraries

Experiment 6

• So it seems that adding @mon_count = 0 "fixes" the problem

• I still want to understand the cause

• I’d like a reproducible test case that doesn’t rely on our service

Page 46: Finding concurrency problems in core ruby libraries

Experiment 7

• With new threads coming and going, some exiting normally, some timing out, all emitting log messages

• What about if we try to log heavily within a timeout block and time it out in a bunch of threads

Page 47: Finding concurrency problems in core ruby libraries

Experiment 7

def run
  count = 1
  begin
    Timeout.timeout(1) do
      loop do
        @logger.error("#{Thread.current}: Loop #{count}")
        count += 1
      end
    end
  rescue Exception
    @logger.error("#{Thread.current}: Exception #{count}")
  end
end
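The slide shows only the run method. A harness along the following lines could drive it from several threads. The LogHammer class name, the thread count of 10, logging to an IO opened on File::NULL and the bounded join are all assumptions for this sketch, not details from the talk. How many threads finish depends on the Ruby, logger and timeout versions in use, so no particular outcome is claimed here.

```ruby
require 'logger'
require 'timeout'

class LogHammer
  def initialize(logger)
    @logger = logger
  end

  # Same shape as the run method on the slide.
  def run
    count = 1
    begin
      Timeout.timeout(1) do
        loop do
          @logger.error("#{Thread.current}: Loop #{count}")
          count += 1
        end
      end
    rescue Exception
      @logger.error("#{Thread.current}: Exception #{count}")
    end
  end
end

THREAD_COUNT = 10

# Assumption: an IO object is passed so that Logger really writes
# (and locks) on every call, while the output itself is discarded.
hammer  = LogHammer.new(Logger.new(File.open(File::NULL, 'w')))
threads = THREAD_COUNT.times.map { Thread.new { hammer.run } }

# Bounded join so the harness itself cannot hang if the threads lock up.
deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 5
threads.each do |t|
  remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
  t.join([remaining, 0].max)
end

stuck = threads.count(&:alive?)
puts "#{stuck} of #{THREAD_COUNT} threads had not finished after 5 seconds"
```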

Page 48: Finding concurrency problems in core ruby libraries

Experiment 7

• So with this code I get:

... `join': No live threads left. Deadlock? (fatal)

• Happens every time after a few seconds

Page 49: Finding concurrency problems in core ruby libraries

Experiment 7

• Since it says all threads are dead, what happens if there is another thread just sitting there doing nothing?

• Add

Thread.new { loop { sleep 1 } }

Page 50: Finding concurrency problems in core ruby libraries

Experiment 7

• Run the code again → Deadlock

• All threads stuck in the same location as before:

.../monitor.rb:185:in `lock'

.../monitor.rb:185:in `mon_enter'

.../monitor.rb:210:in `mon_synchronize'

.../logger.rb:559:in `write'

Page 51: Finding concurrency problems in core ruby libraries

Experiment 7

• So now I have a simple test case that reproduces the issue every time

• I can also confirm that adding @mon_count = 0 into mon_enter "fixes" the problem

Page 52: Finding concurrency problems in core ruby libraries

Examine Again

• At some point during all of this I showed this to a colleague who suggested I look for recent changes in this code within the Ruby repo

• We checked the Ruby git repo...

Page 53: Finding concurrency problems in core ruby libraries

Examine Again

commit 7be5169804ee0cfe1991903fa10c31f8bd6525bd
Author: shugo <shugo@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>
Date: Mon May 18 04:56:22 2015 +0000

* lib/monitor.rb (mon_try_enter, mon_enter): should reset @mon_count just in case the previous owner thread dies without mon_exit. [fix GH-874] Patch by @chrisberkhout

Page 54: Finding concurrency problems in core ruby libraries

Root Cause Analysis


Page 55: Finding concurrency problems in core ruby libraries

Root Cause Analysis

• With a little more thought I realised what the root cause of this problem is…

• It’s the Timeout module and how it corrupts state in the monitor object

Page 56: Finding concurrency problems in core ruby libraries

Root Cause Analysis

Timeout.timeout(seconds) do
  logger.write
end

class Logger
  def write
    @mon.synchronize do
      # write-log
    end
  end
end

class Monitor
  def synchronize
    mon_enter
    begin
      yield
    ensure
      mon_exit
    end
  end
end

Page 57: Finding concurrency problems in core ruby libraries

Root Cause Analysis

Thread 1 (T1)

• Timeout.timeout(seconds)

• Start a new thread T2

• logger.write

• mon.synchronize
  – write-the-log

• Kill T2

Thread 2 (T2)

• Keeps a reference to T1

• sleep(seconds)

• Raise a Timeout exception against T1

Page 58: Finding concurrency problems in core ruby libraries

Root Cause Analysis

class Monitor
  def synchronize
    mon_enter
    # <- What about right here?
    begin
      yield
    ensure
      mon_exit
    end
  end
end

• What about right here?

• mon_enter is invoked

• mon_exit is not

Page 59: Finding concurrency problems in core ruby libraries

Root Cause Analysis

def mon_enter
  if @mon_owner != Thread.current
    @mon_mutex.lock
    @mon_owner = Thread.current
  end
  @mon_count += 1
end

Page 60: Finding concurrency problems in core ruby libraries

Root Cause Analysis

def mon_exit
  mon_check_owner
  @mon_count -= 1
  if @mon_count == 0
    @mon_owner = nil
    @mon_mutex.unlock
  end
end
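The sequence can be replayed deterministically without Timeout. The sketch below copies mon_enter and mon_exit from the slides into a hypothetical LegacyMonitor class, lets one thread die after mon_enter without ever calling mon_exit, and then has a second thread use the monitor normally. It assumes MRI's behaviour, seen in Experiment 5, of releasing a mutex that is still held by a terminated thread. The class name, the accessor methods and the queues are inventions for this example.

```ruby
class LegacyMonitor
  attr_reader :mon_count

  def initialize
    @mon_mutex = Mutex.new
    @mon_owner = nil
    @mon_count = 0
  end

  def mon_enter
    if @mon_owner != Thread.current
      @mon_mutex.lock
      @mon_owner = Thread.current
    end
    @mon_count += 1
  end

  def mon_exit
    raise ThreadError, 'current thread not owner' if @mon_owner != Thread.current
    @mon_count -= 1
    if @mon_count == 0
      @mon_owner = nil
      @mon_mutex.unlock
    end
  end

  def mutex_locked?
    @mon_mutex.locked?
  end
end

mon = LegacyMonitor.new

# T1 enters the monitor and dies before mon_exit, as if an exception had
# arrived after mon_enter returned but before the ensure could protect it.
Thread.new { mon.mon_enter }.join
# The VM has released T1's mutex, but @mon_count is still 1.

report = Queue.new
finish = Queue.new

# T2 is a perfectly well-behaved user of the monitor.
t2 = Thread.new do
  mon.mon_enter              # takes the mutex, count goes 1 -> 2
  report << mon.mon_count
  mon.mon_exit               # count goes 2 -> 1, so the mutex is NOT unlocked
  report << mon.mon_count
  finish.pop                 # stay alive, still holding the mutex
end

count_inside = report.pop
count_after  = report.pop
still_locked = mon.mutex_locked?  # any other thread calling mon_enter would now block

finish << true
t2.join
```

After T2 has left the monitor the count is still non-zero and the underlying mutex is still held, so T2 can keep re-entering while every other thread waits in mon_enter.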

Page 61: Finding concurrency problems in core ruby libraries

Finally! It all makes sense

Page 62: Finding concurrency problems in core ruby libraries

Demo Time!

Page 63: Finding concurrency problems in core ruby libraries

Show Me The Code

https://github.com/lad/ruby_concurrency

• thread_deadlock.rb

• thread_starve.rb

Page 64: Finding concurrency problems in core ruby libraries

Results From Different Ruby VMs

             MRI 1.8.7   MRI 1.9.3                     MRI 2.1.5   MRI (HEAD)
Deadlock     Yes         Yes                           Yes         No (*)
Starvation   Yes         Can’t say. Always deadlocks   Yes         Yes

(*) Mid Jan, 2016 (2.3+)

Page 65: Finding concurrency problems in core ruby libraries

Results From Different Ruby VMs

             JRuby 1.6.8   JRuby 1.7.11   JRuby 1.7.19   JRuby 9.0.0.0.pre1
Deadlock     Yes           No (*)         No (*)         No (*)
Starvation   Yes           No (*)         No (*)         No (*)

(*) No deadlock or starvation, though only because the Timeout exception is not raised

Page 66: Finding concurrency problems in core ruby libraries

Results From Different Ruby VMs

             RBX 2.4.1    RBX 2.5.2
Deadlock     VM crashes   Yes, though mostly thread starvation
Starvation   VM crashes   Yes

Page 67: Finding concurrency problems in core ruby libraries

My Assertion

It is fundamentally unsafe to interrupt a running thread in the general case

Side Effects

Page 68: Finding concurrency problems in core ruby libraries

Lessons From Other Languages

Page 69: Finding concurrency problems in core ruby libraries

Thread Cancellation in C

• The C pthread API offers additional features

• Has the concept of thread cancellation:

• Enable / Disable thread cancellation requests

• User defined, per thread cleanup handlers
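Ruby has a rough counterpart to the enable/disable part of this API in Thread.handle_interrupt (available since Ruby 2.0), which defers asynchronous exceptions for the duration of a block. The sketch below is an illustration only: the Cancel class and the queues are hypothetical, and Thread#raise stands in for whatever would deliver the cancellation.

```ruby
# Hypothetical exception class used as the "cancellation request".
class Cancel < StandardError; end

events  = Queue.new
entered = Queue.new
proceed = Queue.new

worker = Thread.new do
  begin
    # Cancellation is disabled for the duration of this block.
    Thread.handle_interrupt(Cancel => :never) do
      entered << true
      proceed.pop                      # Cancel is raised while we wait here
      events << :critical_section_finished
    end
    # The deferred Cancel is delivered as the block is left,
    # so this line is not expected to run.
    events << :not_reached
  rescue Cancel
    events << :cancelled               # per-thread cleanup would go here
  end
end

entered.pop            # the worker is now inside the protected block
worker.raise(Cancel)   # request cancellation
proceed << true        # let the critical section finish
worker.join

order = []
order << events.pop until events.empty?
```

The cancellation request is not lost; it is simply held back until the worker is outside the section that must not be torn in half.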

Page 70: Finding concurrency problems in core ruby libraries

Thread Cancellation in Java

Why is Thread.stop deprecated?

• Because it is inherently unsafe

• Stopping a thread causes it to unlock all the monitors that it has locked

• If any of the objects previously protected by these monitors were in an inconsistent state, other threads may now view these objects in an inconsistent state

http://docs.oracle.com/javase/1.5.0/docs/guide/misc/threadPrimitiveDeprecation.html

Page 71: Finding concurrency problems in core ruby libraries

Java Thread Cancellation

“As of JDK8, Thread.stop is really gone. It is the first deprecated method to have actually been de-implemented. It now just throws UnsupportedOperationException”

Doug Lea, Java Concurrency Guru

http://cs.oswego.edu/pipermail/concurrency-interest/2013-December/012028.html

Page 72: Finding concurrency problems in core ruby libraries

What Does The Community Say?

• Ruby rant: Timeout::Error, Jan 2008 (http://goo.gl/PLxR76)

• Ruby's Thread#raise, Thread#kill, timeout.rb, and net/protocol.rb libraries are broken, February 2008 (http://goo.gl/DI8GMX)

• Ruby timeouts are dangerous, March 2013 (https://goo.gl/3EoTM6)

• Ruby’s Most Dangerous API, May 2015 (http://goo.gl/2RkFbn)

• Why Ruby’s Timeout is dangerous (and Thread.raise is terrifying), Nov 2015 (http://goo.gl/xLvuWG)

• Reliable Ruby timeouts for M.R.I. 1.8: https://github.com/ph7/system-timer

• Fixing Ruby's standard library Timeout: https://github.com/jjb/sane_timeout

• A safer alternative to Ruby's Timeout that uses unix processes instead of threads: https://github.com/david-mccullars/safe_timeout

• Better timeout management for ruby: https://github.com/ryanking/deadline

Page 73: Finding concurrency problems in core ruby libraries

Threading Quick Links

John Ousterhout: http://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf

Ed Lee: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf

• Best practices, smart people, code locked up after running successfully with minimal changes for four years

Page 74: Finding concurrency problems in core ruby libraries

Takeaways

• Try to avoid writing multi-threaded code

• Try to avoid reentrant mutexes

• Always use a mutex for read access to shared state

• Don’t use the Timeout module. It’s a broken concept
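One alternative, sketched here as an assumption rather than something the talk prescribes, is to have the work itself check a deadline at points where it is safe to stop, so that nothing is ever interrupted in the middle of a critical section. with_deadline, monotonic_now and DeadlineExceeded are hypothetical names.

```ruby
class DeadlineExceeded < StandardError; end

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Yields a callable that raises DeadlineExceeded, in the calling thread,
# once `seconds` have elapsed. Nothing is raised from another thread.
def with_deadline(seconds)
  deadline = monotonic_now + seconds
  check = lambda do
    raise DeadlineExceeded, 'deadline exceeded' if monotonic_now > deadline
  end
  yield check
end

lock      = Mutex.new
completed = 0
timed_out = false

begin
  with_deadline(0.2) do |check|
    loop do
      check.call              # the only point at which the work can stop
      lock.synchronize do     # never interrupted while the lock is held
        sleep 0.01            # stands in for one unit of real work
        completed += 1
      end
    end
  end
rescue DeadlineExceeded
  timed_out = true
end
```

The cost is that the work has to cooperate: a single long blocking call is not cut short by this approach and would need its own timeout argument instead.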

Page 75: Finding concurrency problems in core ruby libraries