Lies, Damned Lies, and Substrings


Page 1: Lies, Damned Lies, and Substrings

Lies, Damned Lies, and Substrings

HASEEB QURESHI

SOFTWARE ENGINEER @

Page 2: Lies, Damned Lies, and Substrings

Let me tell you a story about a time Ruby lied to me.

Page 3: Lies, Damned Lies, and Substrings

A coworker and I were arguing about an algorithm.

Him

Me

Page 4: Lies, Damned Lies, and Substrings

It started with a classic problem:

How to generate all of the substrings of a string?

Page 5: Lies, Damned Lies, and Substrings

Hello

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

Hello

Hello (i = 0, j = 3)

Page 6: Lies, Damned Lies, and Substrings

Hello

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

Hello

Hello (i = 1, j = 4)

Each substring is defined by a unique start and end index.
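As a quick illustration of that claim, here is a minimal sketch that enumerates the index pairs on their own, without building any substrings. The helper name `index_pairs` is made up for this example.

```ruby
# Enumerate every (start, end) index pair with start <= end.
# `index_pairs` is a hypothetical helper, not part of the talk's code.
def index_pairs(length)
  (0...length).flat_map do |i|
    (i...length).map { |j| [i, j] }
  end
end

pairs = index_pairs('Hello'.length)
puts pairs.length            # n * (n + 1) / 2 pairs, so 15 for n = 5
puts pairs.include?([0, 3])  # the pair behind "Hell"
puts pairs.include?([1, 4])  # the pair behind "ello"
```

Each pair maps to exactly one substring, which is why counting pairs is the same as counting substrings.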

Page 7: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

There are quadratically many pairs of indices, so the body of the inner loop runs O(n^2) times.

Page 8: Lies, Damned Lies, and Substrings

Me: This algorithm is O(n^2).

Page 9: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

But what about what’s inside the loop?

Page 10: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

How long does it actually take to build a substring?

Page 11: Lies, Damned Lies, and Substrings

(We’re going to assume fixed-width [ASCII/UTF-32] strings for simplicity.)

(Also, Ruby treats strings less than 24 characters differently, but we can ignore that for large n.)

Page 12: Lies, Damned Lies, and Substrings

Memory

str                  H     e     l     l     o
                     8fe0  8fe1  8fe2  8fe3  8fe4  8fe5

str2 = str[1..3]     e     l     l
                     52a0  52a1  52a2  52a3  52a4  52a5

Page 13: Lies, Damned Lies, and Substrings

Obviously, copying each substring takes linear time.

That is, linear in the length of the average substring.

Page 14: Lies, Damned Lies, and Substrings

… Which is how long?

O(1)? O(log n)? O(n)?

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

Hello

Page 15: Lies, Damned Lies, and Substrings

require_relative 'substrings'

def average_substring_ratio(original_string_length)
  str = 'a' * original_string_length
  substring_lengths = substrings(str).map(&:length)
  average_substring_length = substring_lengths.reduce(:+)
                                              .fdiv(substring_lengths.count)

  average_substring_length / original_string_length
end

(1..150).step(5).each do |count|
  puts "#{count}: #{average_substring_ratio(count)}"
end

Page 16: Lies, Damned Lies, and Substrings

1: 1.0
6: 0.4444444444444444
11: 0.3939393939393939
16: 0.375
21: 0.3650793650793651
26: 0.358974358974359
31: 0.3548387096774194
36: 0.35185185185185186
41: 0.34959349593495936
46: 0.34782608695652173
51: 0.34640522875817
56: 0.34523809523809523
61: 0.3442622950819672
66: 0.3434343434343434
71: 0.3427230046948357
76: 0.34210526315789475
81: 0.34156378600823045
86: 0.34108527131782945
91: 0.34065934065934067
96: 0.34027777777777773
101: 0.33993399339933994
106: 0.33962264150943394
111: 0.3393393393393393
116: 0.339080459770115
121: 0.33884297520661155
126: 0.3386243386243386
131: 0.3384223918575064
136: 0.3382352941176471
141: 0.3380614657210402
146: 0.33789954337899547

(You can also prove this mathematically.)

lim n→∞ (average substring length) / n = ⅓
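One way the proof might go, sketched here as a reconstruction rather than the speaker's own derivation: a string of length n has n − k + 1 substrings of length k, so both the count and the total length have closed forms.

```latex
\begin{align*}
\text{number of substrings} &= \sum_{k=1}^{n} (n - k + 1) = \frac{n(n+1)}{2} \\
\text{total length} &= \sum_{k=1}^{n} k\,(n - k + 1) = \frac{n(n+1)(n+2)}{6} \\
\text{average length} &= \frac{n(n+1)(n+2)/6}{n(n+1)/2} = \frac{n+2}{3} \\
\lim_{n \to \infty} \frac{\text{average length}}{n} &= \lim_{n \to \infty} \frac{n+2}{3n} = \frac{1}{3}
\end{align*}
```

As a sanity check against the measured ratios, n = 6 gives 8/18 ≈ 0.444 and n = 16 gives 18/48 = 0.375.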

Page 17: Lies, Damned Lies, and Substrings

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

Hello

So the average substring grows linearly with the original string.

Page 18: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

Thus, this copy is O(n).

Page 19: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

Colleague: So this whole thing takes O(n^3) time.

Page 20: Lies, Damned Lies, and Substrings

Not so fast. (or slow.)

Page 21: Lies, Damned Lies, and Substrings

Enter COW (copy-on-write)

Page 22: Lies, Damned Lies, and Substrings

Copy-on-write is a kind of structural sharing.

Page 23: Lies, Damned Lies, and Substrings

Memory

str                  H     e     l     l     o
                     8fe0  8fe1  8fe2  8fe3  8fe4  8fe5

str2 = str[1..3]     str_ptr: 8fe1
                     length: 3
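To make the picture concrete, here is a toy model of structural sharing written in plain Ruby. It is only a sketch: the class `SharedSlice` and its methods are invented for illustration and are not how MRI's C-level string structs actually work.

```ruby
# Toy model only: names and structure are hypothetical, not MRI internals.
class SharedSlice
  def initialize(buffer, offset = 0, length = buffer.length)
    @buffer = buffer   # backing storage, possibly shared with other slices
    @offset = offset
    @length = length
  end

  # O(1): the new slice records an offset and length into the same buffer.
  def slice(start, len)
    SharedSlice.new(@buffer, @offset + start, len)
  end

  # O(length): a write first copies the visible characters into a private
  # buffer. For simplicity this toy copies on every write; a real
  # implementation would track whether the buffer is still shared.
  def []=(index, char)
    @buffer = @buffer[@offset, @length]
    @offset = 0
    @buffer[index] = char
  end

  def shares_buffer_with?(other)
    @buffer.equal?(other.buffer)
  end

  def to_s
    @buffer[@offset, @length]
  end

  protected

  attr_reader :buffer
end

str  = SharedSlice.new('Hello')
str2 = str.slice(1, 3)
puts str2.to_s                      # ell
puts str.shares_buffer_with?(str2)  # true

str[1] = '&'
puts str.to_s                       # H&llo
puts str2.to_s                      # ell
puts str.shares_buffer_with?(str2)  # false
```

Slicing costs the same no matter how long the string is, and the linear cost is deferred until something writes.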

Page 24: Lies, Damned Lies, and Substrings

Here’s the proof.

Page 25: Lies, Damned Lies, and Substrings

require_relative 'display_string' # credit to Pat Shaughnessy

debug = Debug.new

str = ('a'..'z').to_a.join
str2 = str.dup

debug.display_string(str)
# DEBUG: RString = 0x7f98fb05b090
# DEBUG: ptr = 0x7f98fc0aa970 -> "abcdefghijklmnopqrstuvwxyz"
# DEBUG: len = 26

debug.display_string(str2)
# DEBUG: RString = 0x7f98fb05afa0
# DEBUG: ptr = 0x7f98fc0aa970 -> "abcdefghijklmnopqrstuvwxyz"
# DEBUG: len = 26

Pointer to same string in memory!

Page 26: Lies, Damned Lies, and Substrings

require_relative 'display_string' # credit to Pat Shaughnessy

debug = Debug.new

str = ('a'..'z').to_a.join
str2 = str[1..-1]

debug.display_string(str)
# DEBUG: RString = 0x7f98fb05b090
# DEBUG: ptr = 0x7f98fc0aa970 -> "abcdefghijklmnopqrstuvwxyz"
# DEBUG: len = 26

debug.display_string(str2)
# DEBUG: RString = 0x7f98fb05afa0
# DEBUG: ptr = 0x7f98fc0aa971 -> "bcdefghijklmnopqrstuvwxyz"
# DEBUG: len = 25

Still the same string, but now offset by 1.

Page 27: Lies, Damned Lies, and Substrings

What happens if either string gets mutated?

Page 28: Lies, Damned Lies, and Substrings

require_relative 'display_string' # credit to Pat Shaughnessy

debug = Debug.new

str = ('a'..'z').to_a.join
str2 = str[1..-1]
str[1] = '&'

debug.display_string(str)
# DEBUG: RString = 0x7fa2a304fbf8
# DEBUG: ptr = 0x7fa2a2f1f170 -> "a&cdefghijklmnopqrstuvwxyz"
# DEBUG: len = 26

debug.display_string(str2)
# DEBUG: RString = 0x7fa2a304fae0
# DEBUG: ptr = 0x7fa2a2f50b11 -> "bcdefghijklmnopqrstuvwxyz"
# DEBUG: len = 25

The write forced a copy to a new string in memory.

Page 29: Lies, Damned Lies, and Substrings

Memory

str                  H     e     l     l     o
                     8fe0  8fe1  8fe2  8fe3  8fe4  8fe5

str2 = str[1..3]     str_ptr: 8fe1
                     length: 3

callbacks: [str2]

Page 30: Lies, Damned Lies, and Substrings

str[1] = '&'

Page 31: Lies, Damned Lies, and Substrings

Memory

str                  H     e     l     l     o
                     8fe0  8fe1  8fe2  8fe3  8fe4  8fe5

callbacks: [str2]

str2 = str[1..3]     e     l     l
                     52a0  52a1  52a2  52a3  52a4  52a5

Page 32: Lies, Damned Lies, and Substrings

Memory

str                  H     &     l     l     o
                     8fe0  8fe1  8fe2  8fe3  8fe4  8fe5

str2 = str[1..3]     e     l     l
                     52a0  52a1  52a2  52a3  52a4  52a5

Page 33: Lies, Damned Lies, and Substrings

So…

Page 34: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

This is a shallow copy, which is actually O(1).

Page 35: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

And this whole thing takes O(n^2) time.

Page 36: Lies, Damned Lies, and Substrings
Page 37: Lies, Damned Lies, and Substrings

Case closed.

Page 38: Lies, Damned Lies, and Substrings

require_relative 'substrings'
require 'benchmark'

str = 'abcdefgh' * 128
str2 = str * 2

benchmarks = Benchmark.bmbm do |bm|
  bm.report(str.length) do
    substrings(str)
  end

  bm.report(str2.length) do
    substrings(str2)
  end
end

# Interpolate the ratio: adding a Float to a String raises a TypeError.
puts "Growth: #{benchmarks[1].real / benchmarks[0].real}"

Page 39: Lies, Damned Lies, and Substrings

Rehearsal ----------------------------------------
1024   0.290000   0.070000   0.360000 (  0.357953)
2048   2.360000   0.500000   2.860000 (  2.876344)
------------------------------- total: 3.220000sec

           user     system      total        real
1024   0.270000   0.070000   0.340000 (  0.338351)
2048   2.200000   0.400000   2.600000 (  2.601713)

Growth: 7.689380300623611

Page 40: Lies, Damned Lies, and Substrings
Page 41: Lies, Damned Lies, and Substrings

When the input doubles, the time grows by a factor of 8.

This algorithm is not quadratic.

(0.338351 s at n = 1024 vs. 2.601713 s at n = 2048)
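The reasoning behind "a factor of 8" can be made explicit: if the running time grows like n^k, doubling n multiplies the time by 2^k, so k is roughly the base-2 logarithm of the growth ratio. A small sketch, using only the two timings reported in the benchmark (the variable names are made up here):

```ruby
# Real-time measurements copied from the benchmark output, in seconds.
time_at_n  = 0.338351   # n = 1024
time_at_2n = 2.601713   # n = 2048

growth   = time_at_2n / time_at_n
exponent = Math.log2(growth)   # k in "time grows like n^k"

puts format('growth: %.2f, estimated exponent: %.2f', growth, exponent)
```

A quadratic algorithm would land near 2; this estimate lands just under 3. Two data points are a rough guide rather than a proof, but they are hard to square with O(n^2).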

Page 42: Lies, Damned Lies, and Substrings

wat

Page 43: Lies, Damned Lies, and Substrings

require 'benchmark'
NUM_TIMES = 100_000

str = 'abcde' * 2 ** 10
str2 = str * 2

Benchmark.bmbm do |bm|
  bm.report(str.length) do
    NUM_TIMES.times { str[1..-1] }
  end

  bm.report(str2.length) do
    NUM_TIMES.times { str2[1..-1] }
  end
end

Page 44: Lies, Damned Lies, and Substrings

Rehearsal -----------------------------------------
...
---------------------------------------------------

            user     system      total        real
5120    0.020000   0.000000   0.020000 (  0.021144)
10240   0.020000   0.000000   0.020000 (  0.020291)

Page 45: Lies, Damned Lies, and Substrings

That sure looks like copy-on-write optimization…

Page 46: Lies, Damned Lies, and Substrings

require 'benchmark'
NUM_TIMES = 100_000

str = 'abcde' * 2 ** 10
str2 = str * 2

Benchmark.bmbm do |bm|
  bm.report(str.length) do
    NUM_TIMES.times { str[1..-2] }
  end

  bm.report(str2.length) do
    NUM_TIMES.times { str2[1..-2] }
  end
end

Page 47: Lies, Damned Lies, and Substrings

Rehearsal -----------------------------------------
...
---------------------------------------------------

            user     system      total        real
5120    0.110000   0.060000   0.170000 (  0.171367)
10240   0.200000   0.140000   0.340000 (  0.347153)

Page 48: Lies, Damned Lies, and Substrings
Page 49: Lies, Damned Lies, and Substrings

So it turns out:

Only substrings that include the last character are copy-on-write.

Page 50: Lies, Damned Lies, and Substrings

And, of course, the vast majority of substrings don’t include the last character.

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

Hello
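To put a number on "vast majority", here is a brute-force sketch that counts what fraction of index pairs end at the last character. The helper `last_char_fraction` is hypothetical and exists only for this illustration.

```ruby
# Fraction of (i, j) substrings whose end index is the last character.
# `last_char_fraction` is a made-up helper for this example.
def last_char_fraction(n)
  pairs = (0...n).flat_map { |i| (i...n).map { |j| [i, j] } }
  pairs.count { |_, j| j == n - 1 }.fdiv(pairs.length)
end

[5, 50, 500].each do |n|
  puts "#{n}: #{last_char_fraction(n)}"
end
```

Exactly n of the n(n+1)/2 substrings end at the final character (one per start index), so the fraction is 2/(n+1) and shrinks toward zero as the string grows.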

Page 51: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

So this on average is linear.

Page 52: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

And this whole thing is O(n^3).

Page 53: Lies, Damned Lies, and Substrings

It was all a lie.

Page 54: Lies, Damned Lies, and Substrings

WHY HAVE YOU BETRAYED ME

RUBY

Page 55: Lies, Damned Lies, and Substrings
Page 56: Lies, Damned Lies, and Substrings

Naturally…

Page 57: Lies, Damned Lies, and Substrings

Hmm.

Page 58: Lies, Damned Lies, and Substrings
Page 59: Lies, Damned Lies, and Substrings

¯\_(ツ)_/¯

Page 60: Lies, Damned Lies, and Substrings

ᕕ( ᐛ )ᕗ

Let’s…

… recompile Ruby…?

Page 61: Lies, Damned Lies, and Substrings

maml004775hquresh:ruby haseeb_qureshi$ make install
        CC = clang
        LD = ld
        LDSHARED = clang -dynamic -bundle
        CFLAGS = -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Werror=implicit-int -Werror=pointer-arith -Werror=write-strings -Werror=declaration-after-statement -Werror=shorten-64-to-32 -Werror=implicit-function-declaration -Werror=division-by-zero -Werror=deprecated-declarations -Werror=extra-tokens -pipe
        XCFLAGS = -D_FORTIFY_SOURCE=2 -fstack-protector -fno-strict-overflow -fvisibility=hidden -DRUBY_EXPORT -fPIE
        CPPFLAGS = -D_XOPEN_SOURCE -D_DARWIN_C_SOURCE -D_DARWIN_UNLIMITED_SELECT -D_REENTRANT -I. -I.ext/include/x86_64-darwin15 -I./include -I. -I./enc/unicode/9.0.0
        DLDFLAGS = -Wl,-undefined,dynamic_lookup -Wl,-multiply_defined,suppress -fstack-protector -Wl,-u,_objc_msgSend -Wl,-pie -framework CoreFoundation
        SOLIBS =
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.4.0
Thread model: posix

Page 62: Lies, Damned Lies, and Substrings

I now have a custom version of Ruby in my /usr/local/bin

ml004775hquresh:bin haseeb_qureshi$ ls -l
...
-rwxr-xr-x 1 haseeb_qureshi admin 3.1M Oct 23 00:37 ruby
...

ml004775hquresh:bin haseeb_qureshi$ ./ruby -v
ruby 2.4.0dev (2016-10-23 trunk 56478) [x86_64-darwin15]

Page 63: Lies, Damned Lies, and Substrings

Let’s run this benchmark again…

require 'benchmark'
NUM_TIMES = 100_000

str = 'abcde' * 2 ** 10
str2 = str * 2

Benchmark.bmbm do |bm|
  bm.report(str.length) do
    NUM_TIMES.times { str[1..-2] }
  end

  bm.report(str2.length) do
    NUM_TIMES.times { str2[1..-2] }
  end
end

Page 64: Lies, Damned Lies, and Substrings

ml004775hquresh:bin haseeb_qureshi$ ./ruby ~/Projects/substrings/benchmark3.rb
Rehearsal -----------------------------------------
...
---------------------------------------------------

            user     system      total        real
5120    0.020000   0.000000   0.020000 (  0.020432)
10240   0.020000   0.000000   0.020000 (  0.020300)

Boom.

Page 65: Lies, Damned Lies, and Substrings

Ruby is now doing copy-on-write optimization on all strings!

Page 66: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

And this bad boy, finally, takes O(n^2) time.

Page 67: Lies, Damned Lies, and Substrings

(Applause break)

Page 68: Lies, Damned Lies, and Substrings

But you have to wonder…

why was that the default behavior?

Page 69: Lies, Damned Lies, and Substrings

str                  H     e     l     l     o     \0
                     8fe0  8fe1  8fe2  8fe3  8fe4  8fe5
                                                   (null terminator)

In C, strings should end with a null terminator, or null byte.

This is how C knows it’s reached the end of a string.

Page 70: Lies, Damned Lies, and Substrings

str                  H     e     l     l     o     \0
                     8fe0  8fe1  8fe2  8fe3  8fe4  8fe5
                                                   (null terminator)

str2 = str[1..3]

If you passed a substring which did not include a NUL into a library written in C, it might keep reading bytes until it found the NUL.
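The hazard can be simulated in plain Ruby. This is only a sketch of the idea: `read_c_string` is a made-up stand-in for a C routine that scans for a NUL byte, and the `memory` string stands in for the raw buffer behind `str`.

```ruby
# Simulates a C routine that reads from `start` until it finds a NUL byte.
# `read_c_string` is hypothetical; real C code would walk a char pointer.
def read_c_string(memory, start)
  finish = memory.index("\0", start)
  memory[start...finish]
end

memory = "Hello\0"              # the buffer behind str, with its terminator

puts read_c_string(memory, 0)   # Hello
puts read_c_string(memory, 1)   # ello, not the intended "ell"
```

A shared `str[1..3]` would point at offset 1 with length 3, but code that ignores the length and trusts the terminator reads one character too many. A substring that runs to the end of the original is safe to share, because the original's NUL terminates it as well.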

Page 71: Lies, Damned Lies, and Substrings

Essentially, copying any substring that doesn’t run to the end of the original ensures that C extensions treat all Ruby strings correctly.

Page 72: Lies, Damned Lies, and Substrings

So that’s it.

We’re finally done.

We have an O(n^2) algorithm for substrings.

Page 73: Lies, Damned Lies, and Substrings

Except one thing…

Page 74: Lies, Damned Lies, and Substrings

Remember where we started?

We need to generate all the substrings.

Did we actually… generate them?

Page 75: Lies, Damned Lies, and Substrings

def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

puts substrings("Hello")

It takes linear time to print a substring, so printing all the substrings will still take O(n^3) time.
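One way to see the cubic cost of printing is to count the characters that would have to be written. The sketch below repeats the talk's `substrings` method so it runs on its own; `total_characters` is a helper invented for this example.

```ruby
def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each do |j|
      subs << str[i..j]
    end
  end
end

# Total number of characters across every substring of a length-n string.
# `total_characters` is a hypothetical helper for this illustration.
def total_characters(n)
  substrings('a' * n).sum(&:length)
end

[10, 20, 40].each do |n|
  closed_form = n * (n + 1) * (n + 2) / 6
  puts "#{n}: #{total_characters(n)} characters (closed form: #{closed_form})"
end
```

The total is n(n+1)(n+2)/6, which grows like n^3/6, so any step that touches every character of every substring is cubic no matter how cheaply the substrings were created.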

Page 76: Lies, Damned Lies, and Substrings

So in what sense is this O(n^2)?

Page 77: Lies, Damned Lies, and Substrings

If you think about it, the whole idea of copy-on-write is laziness.

What we’ve created are lazy strings.

Page 78: Lies, Damned Lies, and Substrings

Instead of making these:

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

Hello

We made these:

str[0..0], str[1..1], str[2..2], str[3..3], str[4..4]

str[0..1], str[1..2], str[2..3], str[3..4]

str[0..2], str[1..3], str[2..4]

str[0..3], str[1..4]

str[0..-1]

Page 79: Lies, Damned Lies, and Substrings

All we’ve really done is build each pair of indices.

Page 80: Lies, Damned Lies, and Substrings

The Ruby array that substrings(str) returns does not actually contain the substrings.

It’s just a clever, lazy way to express them.
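The same laziness can be written out explicitly. The sketch below is an illustrative alternative rather than what the patched interpreter does: it uses Ruby's `Enumerator::Lazy` so that no substring is sliced until something asks for it. The method name `lazy_substrings` is made up.

```ruby
# Describes all substrings up front, but slices each one only on demand.
# `lazy_substrings` is a hypothetical name for this illustration.
def lazy_substrings(str)
  (0...str.length).lazy.flat_map do |i|
    (i...str.length).lazy.map { |j| str[i..j] }
  end
end

subs = lazy_substrings('Hello')
puts subs.first(3).inspect   # only three substrings are ever built
puts subs.to_a.length        # forcing the whole thing builds all 15
```

Building the description is cheap; the cost shows up only when the elements are consumed, which is the same trade the shared strings make.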

Page 81: Lies, Damned Lies, and Substrings

It’s lies all the way down.

Page 82: Lies, Damned Lies, and Substrings

Thanks for listening.

You can follow me at @hosseeb

Special thanks to Ned Ruggeri, David Runger, and Pat Shaughnessy.

You can find the code on Github: Haseeb-Qureshi