Lies, Damned Lies, and Substrings

HASEEB QURESHI

SOFTWARE ENGINEER @

Let me tell you a story about a time Ruby lied to me.

A coworker and I were arguing about an algorithm.

It started with a classic problem:

How to generate all of the substrings of a string?

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

Hello

Hello: i = 0, j = 3 (selects "Hell")

Hello: i = 1, j = 4 (selects "ello")

Each substring is defined by a unique start and end index.

    def substrings(str)
      (0...str.length).each_with_object([]) do |i, subs|
        (i...str.length).each do |j|
          subs << str[i..j]
        end
      end
    end

There are quadratically many pairs of indices, therefore the inner loop runs O(n^2) many times.
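To make "quadratically many" concrete, here is a minimal sketch that counts the (i, j) pairs the two loops visit and compares the count with n(n+1)/2. The helper name `index_pair_count` is made up for this illustration and is not part of the talk's code.

```ruby
# Hypothetical helper: counts the (i, j) pairs visited by the nested loops.
def index_pair_count(n)
  count = 0
  (0...n).each do |i|
    (i...n).each { |_j| count += 1 }
  end
  count
end

[5, 10, 100].each do |n|
  puts "n=#{n}: #{index_pair_count(n)} pairs, n(n+1)/2 = #{n * (n + 1) / 2}"
end
```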

Me: This algorithm is O(n^2).

But what about what’s inside the loop?

    subs << str[i..j]

How long does it actually take to build a substring?

(We’re going to assume fixed-width [ASCII/UTF-32] strings for simplicity.)

(Also, Ruby treats strings less than 24 characters differently, but we can ignore that for large n.)

Memory

    address:  8fe0  8fe1  8fe2  8fe3  8fe4  8fe5
    str:      H     e     l     l     o

    str2 = str[1..3]

    address:  52a0  52a1  52a2  52a3  52a4  52a5
    str2:     e     l     l

Obviously, copying each substring takes linear time.

That is, linear in the length of the average substring.

… Which is how long? O(1)? O(log n)? O(n)?

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

    require_relative 'substrings'

    def average_substring_ratio(original_string_length)
      str = 'a' * original_string_length
      substring_lengths = substrings(str).map(&:length)
      average_substring_length = substring_lengths.reduce(:+)
                                                  .fdiv(substring_lengths.count)

      average_substring_length / original_string_length
    end

    (1..150).step(5).each do |count|
      puts "#{count}: #{average_substring_ratio(count)}"
    end

    1: 1.0
    6: 0.4444444444444444
    11: 0.3939393939393939
    16: 0.375
    21: 0.3650793650793651
    26: 0.358974358974359
    31: 0.3548387096774194
    36: 0.35185185185185186
    41: 0.34959349593495936
    46: 0.34782608695652173
    51: 0.34640522875817
    56: 0.34523809523809523
    61: 0.3442622950819672
    66: 0.3434343434343434
    71: 0.3427230046948357
    76: 0.34210526315789475
    81: 0.34156378600823045
    86: 0.34108527131782945
    91: 0.34065934065934067
    96: 0.34027777777777773
    101: 0.33993399339933994
    106: 0.33962264150943394
    111: 0.3393393393393393
    116: 0.339080459770115
    121: 0.33884297520661155
    126: 0.3386243386243386
    131: 0.3384223918575064
    136: 0.3382352941176471
    141: 0.3380614657210402
    146: 0.33789954337899547

(You can also prove this mathematically.)

As n → ∞, the average substring length approaches ⅓ n.
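A sketch of that proof, using only the fact that a string of length n has n − k + 1 substrings of length k:

```latex
\[
\text{average length}
  = \frac{\sum_{k=1}^{n} k\,(n-k+1)}{\sum_{k=1}^{n} (n-k+1)}
  = \frac{n(n+1)(n+2)/6}{n(n+1)/2}
  = \frac{n+2}{3},
\qquad
\lim_{n\to\infty} \frac{1}{n}\cdot\frac{n+2}{3} = \frac{1}{3}.
\]
```

For n = 16 this gives a ratio of 18/48 = 0.375, which matches the measured value for 16 in the output.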

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello

So the average substring grows linearly with the original string.

Thus, this copy is O(n):

    subs << str[i..j]

So this whole thing takes O(n^3) time.

Colleague: Not so fast. (Or slow.)

Enter COW (copy-on-write)

Copy-on-write is a kind of structural sharing.

Memory

str2 = str[1..3]

str_ptr: 8fe1, length: 3
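The idea can be sketched in plain Ruby. The `SharedSlice` class below is purely hypothetical (it is not how MRI implements its strings); it only illustrates the pointer-plus-length picture: slicing is O(1) because it shares the buffer, and the first write pays for a private copy.

```ruby
# Hypothetical illustration only; this is not MRI's RString implementation.
class SharedSlice
  def initialize(buffer, offset, length)
    @buffer = buffer # possibly shared with other slices
    @offset = offset
    @length = length
    @owned = false   # true once this slice holds a private buffer
  end

  # O(1): a slice is just a new (offset, length) pair over the same buffer.
  def slice(start, len)
    @owned = false # the buffer is shared again, so our next write must copy
    SharedSlice.new(@buffer, @offset + start, len)
  end

  # O(length) on the first write: copy our characters, then mutate the copy.
  def write(index, char)
    unless @owned
      @buffer = @buffer[@offset, @length]
      @offset = 0
      @owned = true
    end
    @buffer[@offset + index] = char
    self
  end

  def shares_buffer_with?(other)
    @buffer.equal?(other.buffer)
  end

  def to_s
    @buffer[@offset, @length]
  end

  protected

  attr_reader :buffer
end

str = SharedSlice.new('Hello'.freeze, 0, 5)
str2 = str.slice(1, 3)
puts str2.to_s                      # the shared view of characters 1..3
puts str2.shares_buffer_with?(str)  # still one buffer in memory

str.write(1, '&')                   # the write forces a copy
puts str.to_s
puts str2.to_s                      # the slice still sees the old characters
```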

Here’s the proof.

    require_relative 'display_string' # credit to Pat Shaughnessy

    debug = Debug.new

    str = ('a'..'z').to_a.join
    str2 = str.dup

    debug.display_string(str)
    # DEBUG: RString = 0x7f98fb05b090
    # DEBUG: ptr = 0x7f98fc0aa970 -> "abcdefghijklmnopqrstuvwxyz"
    # DEBUG: len = 26

    debug.display_string(str2)
    # DEBUG: RString = 0x7f98fb05afa0
    # DEBUG: ptr = 0x7f98fc0aa970 -> "abcdefghijklmnopqrstuvwxyz"
    # DEBUG: len = 26

Pointer to same string in memory!

    debug = Debug.new

    str = ('a'..'z').to_a.join
    str2 = str[1..-1]

    debug.display_string(str)
    # DEBUG: RString = 0x7f98fb05b090
    # DEBUG: ptr = 0x7f98fc0aa970 -> "abcdefghijklmnopqrstuvwxyz"
    # DEBUG: len = 26

    debug.display_string(str2)
    # DEBUG: RString = 0x7f98fb05afa0
    # DEBUG: ptr = 0x7f98fc0aa971 -> "bcdefghijklmnopqrstuvwxyz"
    # DEBUG: len = 25

Still the same string, but now offset by 1.

What happens if either string gets mutated?

    debug = Debug.new

    str = ('a'..'z').to_a.join
    str2 = str[1..-1]
    str[1] = '&'

    debug.display_string(str)
    # DEBUG: RString = 0x7fa2a304fbf8
    # DEBUG: ptr = 0x7fa2a2f1f170 -> "a&cdefghijklmnopqrstuvwxyz"
    # DEBUG: len = 26

    debug.display_string(str2)
    # DEBUG: RString = 0x7fa2a304fae0
    # DEBUG: ptr = 0x7fa2a2f50b11 -> "bcdefghijklmnopqrstuvwxyz"
    # DEBUG: len = 25

The write forced a copy to a new string in memory.

Memory

    str2 = str[1..3]
    str_ptr: 8fe1, length: 3
    callbacks: [str2]

str[1] = '&'

Memory

    address:  8fe0  8fe1  8fe2  8fe3  8fe4  8fe5
    str:      H     &     l     l     o
    callbacks: [str2]

    str2 = str[1..3]

    address:  52a0  52a1  52a2  52a3  52a4  52a5
    str2:     e     l     l

This is a shallow copy, which is actually O(1).

And this whole thing takes O(n^2) time.

Case closed.

    require_relative 'substrings'
    require 'benchmark'

    str = 'abcdefgh' * 128
    str2 = str * 2

    benchmarks = Benchmark.bmbm do |bm|
      bm.report(str.length) do
        substrings(str)
      end

      bm.report(str2.length) do
        substrings(str2)
      end
    end

    # Ratio of the two wall-clock times.
    puts "Growth: #{benchmarks[1].real / benchmarks[0].real}"

    Rehearsal ----------------------------------------
    1024   0.290000   0.070000   0.360000 (  0.357953)
    2048   2.360000   0.500000   2.860000 (  2.876344)
    ------------------------------- total: 3.220000sec

               user     system      total        real
    1024   0.270000   0.070000   0.340000 (  0.338351)
    2048   2.200000   0.400000   2.600000 (  2.601713)

    Growth: 7.689380300623611

When the input doubles, the time grows by a factor of nearly 8, which is 2^3. A quadratic algorithm would grow by a factor of 4.

This algorithm is not quadratic.
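One way to read the exponent off a doubling experiment, sketched under the assumption that the running time behaves like c · n^k: doubling n multiplies the time by 2^k, so k is the base-2 logarithm of the ratio. The two timings are copied from the benchmark output; the variable names are ours.

```ruby
# Wall-clock times copied from the benchmark output.
small = 0.338351 # seconds for 1024 characters
large = 2.601713 # seconds for 2048 characters

# If time ~ c * n^k, doubling n multiplies the time by 2^k, so k = log2(ratio).
exponent = Math.log2(large / small)
puts "Estimated exponent: #{exponent.round(2)}"
```

The estimate lands just under 3, which is what cubic growth looks like and not what quadratic growth looks like.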


    require 'benchmark'
    NUM_TIMES = 100_000

    str = 'abcde' * 2 ** 10
    str2 = str * 2

    Benchmark.bmbm do |bm|
      bm.report(str.length) do
        NUM_TIMES.times { str[1..-1] }
      end

      bm.report(str2.length) do
        NUM_TIMES.times { str2[1..-1] }
      end
    end

Rehearsal -----------------------------------------...---------------------------------------------------

That sure looks like copy-on-write optimization…

    str = 'abcde' * 2 ** 10
    str2 = str * 2

Rehearsal -----------------------------------------...---------------------------------------------------

Only substrings that include the last character are copy-on-write.
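A minimal sketch of how one might check this claim, assuming a stock Ruby build. It times a slice that reaches the last character (`str[1..-1]`) against one that stops short of it (`str[0..-2]`) and reports how each scales when the input doubles. The helper `time_slices` and the iteration count are made up for this example, and the numbers will vary with Ruby version, build options, and machine.

```ruby
require 'benchmark'

# Hypothetical helper: seconds spent taking the same slice `times` times.
def time_slices(str, range, times)
  Benchmark.realtime { times.times { str[range] } }
end

small = 'abcde' * 2**10
large = small * 2

{ 'tail   str[1..-1]' => 1..-1, 'middle str[0..-2]' => 0..-2 }.each do |label, range|
  ratio = time_slices(large, range, 10_000) / time_slices(small, range, 10_000)
  puts format('%s: doubling the input scales the time by %.2f', label, ratio)
end
```

If only tail slices are shared, the first ratio should stay near 1 while the second drifts toward 2.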

So it turns out:

the vast majority of substrings don’t include the last character.

H, e, l, l, o

He, el, ll, lo

Hel, ell, llo

Hell, ello
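To put a number on "the vast majority", here is a small sketch that counts how many index pairs end at the last character. The helper name `tail_fraction` is hypothetical. Exactly n of the n(n+1)/2 substrings qualify, a fraction of 2/(n+1).

```ruby
# Hypothetical helper: fraction of (i, j) pairs whose substring ends at the
# last index, returned as an exact Rational.
def tail_fraction(n)
  total = 0
  tail = 0
  (0...n).each do |i|
    (i...n).each do |j|
      total += 1
      tail += 1 if j == n - 1
    end
  end
  Rational(tail, total)
end

[5, 100, 1000].each do |n|
  puts "n=#{n}: #{tail_fraction(n)} of substrings include the last character"
end
```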

And, of course, this is linear on average:

    subs << str[i..j]

And this whole thing is O(n^3).

It was all a lie.

WHY HAVE YOU BETRAYED ME

Naturally…

¯\_(ツ)_/¯

ᕕ( ᐛ )ᕗ

Let’s…

… recompile Ruby…?

    maml004775hquresh:ruby haseeb_qureshi$ make install
    CC = clang
    LD = ld
    LDSHARED = clang -dynamic -bundle
    CFLAGS = -O3 -fno-fast-math -ggdb3 -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wno-long-long -Wno-missing-field-initializers -Wno-tautological-compare -Wno-parentheses-equality -Wno-constant-logical-operand -Wno-self-assign -Wunused-variable -Werror=implicit-int -Werror=pointer-arith -Werror=write-strings -Werror=declaration-after-statement -Werror=shorten-64-to-32 -Werror=implicit-function-declaration -Werror=division-by-zero -Werror=deprecated-declarations -Werror=extra-tokens -pipe
    XCFLAGS = -D_FORTIFY_SOURCE=2 -fstack-protector -fno-strict-overflow -fvisibility=hidden -DRUBY_EXPORT -fPIE
    CPPFLAGS = -D_XOPEN_SOURCE -D_DARWIN_C_SOURCE -D_DARWIN_UNLIMITED_SELECT -D_REENTRANT -I. -I.ext/include/x86_64-darwin15 -I./include -I. -I./enc/unicode/9.0.0
    DLDFLAGS = -Wl,-undefined,dynamic_lookup -Wl,-multiply_defined,suppress -fstack-protector -Wl,-u,_objc_msgSend -Wl,-pie -framework CoreFoundation
    SOLIBS =
    Apple LLVM version 7.3.0 (clang-703.0.31)
    Target: x86_64-apple-darwin15.4.0
    Thread model: posix

I now have a custom version of Ruby in my /usr/local/bin.

    ml004775hquresh:bin haseeb_qureshi$ ls -l
    ...
    -rwxr-xr-x 1 haseeb_qureshi admin 3.1M Oct 23 00:37 ruby
    ...

    ml004775hquresh:bin haseeb_qureshi$ ./ruby -v
    ruby 2.4.0dev (2016-10-23 trunk 56478) [x86_64-darwin15]

Let’s run this benchmark again…

    str = 'abcde' * 2 ** 10
    str2 = str * 2

    ml004775hquresh:bin haseeb_qureshi$ ./ruby ~/Projects/substrings/benchmark3.rb
    Rehearsal -----------------------------------------
    ...
    ---------------------------------------------------
                user     system      total        real
    5120    0.020000   0.000000   0.020000 (  0.020432)
    10240   0.020000   0.000000   0.020000 (  0.020300)

Ruby is now doing copy-on-write optimization on all strings!

    def substrings(str)
      (0...str.length).each_with_object([]) do |i, subs|
        (i...str.length).each do |j|
          subs << str[i..j]
        end
      end
    end

And this bad boy, finally, takes O(n^2) time.

(Applause break)

But you have to wonder…

why was that the default behavior?

str \0

In C, strings should end with a null-terminator or null byte.

This is how C knows it’s reached the end of a string.

Null terminator

str \0

If you passed a substring that did not end in a NUL to a library written in C, the library might keep reading bytes until it found one.
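The overread can be simulated in Ruby. This is only a sketch: `c_style_read` is a hypothetical stand-in for a C routine that is handed a start address and scans forward to the next NUL, and `memory` models the original string's bytes plus its terminator.

```ruby
# Hypothetical stand-in for a C routine: read from `start` up to the next NUL.
def c_style_read(memory, start)
  terminator = memory.index("\0", start)
  memory[start...terminator]
end

memory = "Hello\0"            # the original string plus its NUL
puts c_style_read(memory, 0)  # the whole string
puts c_style_read(memory, 1)  # meant to be str[1..3], but reads on to the NUL
```

A shared view of `str[1..3]` carries its length on the Ruby side, but a reader that only looks for the terminator runs past the intended end of the substring.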

Null terminator

str2 = str[1..3]

Essentially, the default behavior ensures that any C extension treats every Ruby string correctly.

So that’s it.

We’re finally done.

We have an O(n^2) algorithm for substrings.

Except one thing…

Remember where we started?

We need to generate all the substrings.

Did we actually… generate them?

    puts substrings("Hello")

It takes linear time to print a substring, so printing all the substrings will still take O(n^3) time.
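A quick way to see the cubic cost is to add up the characters that `puts` would have to write. The sketch below repeats the talk's `substrings` method so that it runs on its own, and compares the total against the closed form n(n+1)(n+2)/6.

```ruby
def substrings(str)
  (0...str.length).each_with_object([]) do |i, subs|
    (i...str.length).each { |j| subs << str[i..j] }
  end
end

# Total characters across all substrings, versus n(n+1)(n+2)/6.
[5, 10, 50].each do |n|
  total = substrings('a' * n).sum(&:length)
  puts "n=#{n}: #{total} characters, closed form #{n * (n + 1) * (n + 2) / 6}"
end
```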

So in what sense is this O(n^2)?

If you think about it, the whole idea of copy-on-write is laziness.

What we’ve created are lazy strings.

Instead of making these:

    H, e, l, l, o
    He, el, ll, lo
    Hel, ell, llo
    Hell, ello

We made these:

    str[0..-1]
    str[0..3], str[1..4]
    str[0..2], str[1..3], str[2..4]
    str[0..1], str[1..2], str[2..3], str[3..4]
    str[0..0], str[1..1], str[2..2], str[3..3], str[4..4]

All we’ve really done is build each pair of indices.

The Ruby array that substrings(str) returns does not actually contain the substrings.

It’s just a clever, lazy way to express them.
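The same laziness can be written out explicitly. In this sketch, `substring_ranges` (a hypothetical helper, not from the talk) does only the O(n^2) work of building index pairs, and characters are touched only when a substring is actually read.

```ruby
# Hypothetical helper: the O(n^2) part, just the index pairs, no characters.
def substring_ranges(str)
  (0...str.length).flat_map do |i|
    (i...str.length).map { |j| (i..j) }
  end
end

str = 'Hello'
ranges = substring_ranges(str)
puts ranges.length

# Each substring is materialized only when something asks for it.
puts ranges.lazy.map { |range| str[range] }.first(3).inspect
```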

It’s lies all the way down.

Thanks for listening.

You can follow me at @hosseeb

Special thanks to Ned Ruggeri, David Runger, and Pat Shaughnessy.

You can find the code on GitHub: Haseeb-Qureshi
