Thinking about performance€¦ · “Premature optimization is the root of all evil” “We...

166
Thinking about performance

Transcript of Thinking about performance€¦ · “Premature optimization is the root of all evil” “We...

Page 1: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Thinking about performance

Page 2: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Search: a case study

Page 3: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Perf: speed/power/etc.

Page 4: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Perf: why do we care?

Page 5: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

“Premature optimization is the root of all evil”

Page 6: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

“We should forget about small efficiencies, say about 97% of

the time”

Page 7: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Different designs:100x - 1000x perf difference

Page 8: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 9: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

“Coding feels like real work”

Page 10: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Whiteboard: 1h/iterationImplementation: 2yr/iteration

Page 11: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Scale(precursor to perf discussion)

Page 12: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10k; 10M; 10G(5kB per doc)

Page 13: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

What’s the actual problem?

Page 14: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

AND queries

Page 15: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10k; 10M; 10G(5kB per doc)

10k

Page 16: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

One person’s emailOne forum

10k

Page 17: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

5kB * 10k = 50MB

10k

Page 18: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

50MB is small!

10k

Page 19: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

$50 phone => 1GB RAM

10k

Page 20: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Naive algorithm

for loop over all documents {

for loop over terms in document {

// matching logic here.

}

}

10k

Page 21: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10k; 10M; 10G(5kB per doc)

10M

Page 22: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

~Wikipedia sized

10M

Page 23: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

5kB * 10M = 50GB

10M

Page 24: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

$2000 for 128GB server(Broadwell single socket Xeon-D)

10M

Page 25: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

25 GB/s memory bandwidth

10M

Page 26: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

50GB / 25 GB/s = 2s(½ query per sec (QPS))

10M

Page 27: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Is 2s latency ok?

10M

Page 28: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Is 1/2 QPS ok?

10M

Page 29: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Larger service

Latency == $$$

10M

Page 30: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Latency == $$$

http://assets.en.oreilly.com/1/event/29/Keynote%20Presentation%202.pdf

http://www.bizreport.com/2016/08/mobify-report-reveals-impact-of-mobile-website-speed.html

http://assets.en.oreilly.com/1/event/29/The%20User%20and%20Business%20Impact%20of%20Server%20Delays,%20Additional%20Bytes,%20and%20HTTP%20Chunking%20in%20Web%20Search%20Presentation.pptx

http://assets.en.oreilly.com/1/event/27/Varnish%20-%20A%20State%20of%20the%20Art%20High-Performance%20Reverse%20Proxy%20Presentation.pdf

10M

Page 31: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Google: 400ms extra latency

0.44% decrease in searches per user

10M

Page 32: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Google: 400ms extra latency

0.44% decrease in searches per user

0.76% after six weeks

10M

Page 33: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Google: 400ms extra latency

0.44% decrease in searches per user

0.76% after six weeks

0.21% decrease after delay removed

10M

Page 34: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Bing

10M

Page 35: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Mobify

100ms home load => 1.11% delta in conversions

10M

Page 36: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Mobify

100ms home load => 1.11% delta in conversions

100ms checkout page speed => 1.55% delta in

conversions

10M

Page 37: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10M

Page 38: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

To hit 500ms round trip...

10M

Page 39: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

...budget ~10ms for search

10M

Page 40: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Larger service

Latency == $$$

Need to handle more than ½ QPS

10M

Page 41: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Use an index?

Salton; The SMART Retrieval System (1971); work originally done in early 60s10M

Page 42: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

30 - 30,000 QPS(we’ll talk about figuring this out later)

http://www.anandtech.com/show/9185/intel-xeon-d-review-performance-per-watt-server-soc-champion/14

Haque et al.; Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services (ASPLOS, 2015)

10M

Page 43: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10k; 10M; 10G;(5kB per doc)

10B

Page 44: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

5kB * 10G = 50TB

10B

Page 45: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Horizontal scaling(use more machines)

10B

Page 46: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Easy to scale(different docuemnts on different machines)

10B

Page 47: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines

10B

Page 48: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Redmond-Dresden: 150ms

10B

Page 49: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines

1k machines * 10 clusters = 10k machines

10B

Page 50: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

“[With 1800 machines, in one year], it’s typical that 1,000 individual machine failures will occur; thousands of hard drive

failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail,

each time causing 40 to 80 machines to vanish from the network; 5 racks will “go wonky,” with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5

percent of the machines at any given moment over a 2-day span”

10B

Page 51: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines

1k machines * 10 clusters = 10k machines

10k machines * 3 redundancy = 30k machines

10B

Page 52: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines

1k machines * 10 clusters = 10k machines

10k machines * 3 redundancy = 30k machines

30k machines * $1k/yr/machine = $30M / yr

10B

Page 53: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

2x perf: $15m/yr

10B

Page 54: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

2% perf: $600k/yr

10B

Page 55: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines

1k machines * 10 clusters = 10k machines

10k machines * 3 redundancy = 30k machines

30k machines * $1k/yr/machine = $30M / yr

Machine time vs. dev time10B

Page 56: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Search Algorithms

Page 57: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

What’s the problem again?

Algorithms

Page 58: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Posting list

Algorithms: posting list

Page 59: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

See http://nlp.stanford.edu/IR-book/ for

implementation details

Page 60: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

HashMap[term] => list[docs]

Algorithms: posting list

Page 61: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Bloom filter

Algorithms: bloom filter

Page 62: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

BitFunnel

Algorithms: bloom filter

Page 63: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

What about an array?

Algorithms: bloom filter

Page 64: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 65: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

How many terms?

Algorithms: bloom filter

Page 66: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Algorithms: bloom filter

Page 67: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

One site has 37B primes

Algorithms: bloom filter

Page 68: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

GUIDs, timestamps, DNA, etc.

Algorithms: bloom filter

Page 69: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Why index that stuff?

Algorithms: bloom filter

Page 70: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

GTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATA

Algorithms: bloom filter

Page 71: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 72: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Most terms aren’t in most docs => use hashing

Algorithms: bloom filter

Page 73: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Bloom Filters

Algorithms: bloom filter

Page 74: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 75: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 76: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 77: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 78: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Probability of false positive?

Algorithms: bloom filter

Page 79: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

(assume 10% bit density)

1 location: .1 = 10% false positive rate

Algorithms: bloom filter

Page 80: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

(assume 10% bit density)

1 location: .1 = 10% false positive rate

2 locations: .1 * .1 = 1% false positive rate

Algorithms: bloom filter

Page 81: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

(assume 10% bit density)

1 location: .1 = 10% false positive rate

2 locations: .1 * .1 = 1% false positive rate

3 locations: .1 * .1 * .1 = 0.1% false positive rate

Algorithms: bloom filter

Page 82: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Linear costExponential benefit

Algorithms: bloom filter

Page 83: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Multiple DocumentsMultiple Bloom Filters

Algorithms: bloom filter

Page 84: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 85: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Do comparisons in parallel!

Algorithms: bloom filter

Page 86: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 87: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Algorithms: bloom filter

Page 88: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Algorithms: bloom filter

Page 89: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Algorithms: bloom filter

Page 90: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Algorithms: bloom filterAlgorithms: bloom filter

Page 91: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Algorithms: bloom filter

Page 92: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 93: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 94: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Algorithms: bloom filter

Page 95: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Algorithms: bloom filter

Page 96: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

How do we estimate perf?

Page 97: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Cost modelNumber of operations

Perf estimation

Page 98: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

512-bit “blocks”(pay for memory accesses)

Perf estimation

Page 99: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

How many memory accesses per block?

Perf estimation

Page 100: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

http://bitfunnel.org

Perf estimation

Page 101: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Perf estimation

Page 102: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 103: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Why do we have so many rows?

Term rewriting

Perf estimation

Page 104: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Term Rewriting

“Large yellow dog”

Perf estimation

Page 105: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Term Rewriting

“Large yellow dog” ||

“Golden Retriever”

Perf estimation

Page 106: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Term Rewriting

“Large yellow dog” ||

“Golden Retriever” ||

“Old Yeller” ||

Perf estimation

Page 107: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 108: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Expected performance?

Perf estimation

Page 109: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10 M docs / 512 bits per block = 20k “blocks”

Perf estimation

Page 110: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10 M docs / 512 bits per block = 20k “blocks”

20 k-blocks * 5 transfers per block = 100 kT

Perf estimation

Page 111: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10 M docs / 512 bits per block = 20k “blocks”

20 k-blocks * 5 transfers per block = 100 kT

25 GB/s / 512 bits per transfer = 390 MT/s

Perf estimation

Page 112: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

10 M docs / 512 bits per block = 20k “blocks”

20 k-blocks * 5 transfers per block = 100 kT

25 GB/s / 512 bits per transfer = 390 MT/s

390 MT/s / 100 kT = 3900 QPS (with rounding)

Perf estimation

Page 113: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Actual performance?

Perf estimation

Page 114: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Actual performance ~similar

Perf estimation

Page 115: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Small factors

Perf estimation

Page 116: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Large factors

Perf estimation

Page 117: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Ranking results

Perf estimation

Page 118: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Ingestion(faster than querying)

Perf estimation

Page 119: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Ingestion is just setting bits

Perf estimation

Page 120: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Hierarchical bloom filters

Perf estimation

Page 121: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Complicating issues?

Perf estimation

Page 122: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 123: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Conclusions?

Page 124: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

False conclusions

Search is simple

Page 125: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

False conclusions

Search is simple

Bloom filters are better than posting lists

Zobel et al., Inverted files versus signature files for text indexing; TODS 1998

Page 126: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

False conclusions

Search is simple

Bloom filters are better than posting lists

You can easily reason about all performance

Zobel et al., Inverted files versus signature files for text indexing; TODS 1998

Page 127: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Conclusions!

Page 128: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

You can reason about perf

Page 129: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

It’s often just arithmetic

Page 130: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Acknowledgements

Thanks to Leah Hanson, Mike Hopcroft, Julia Evans, Hari Angepat,

David Turner, Danielle Sucher, Ikhwan Lee, Tejas Sapre, Raul Jara,

Rich Ercolani, Bert Muthalaly, Harsha Nori, Jeshua Smith, Bill

Barnes, Gary Bernhardt, Marek Majkowski, Tom Crayford, Gina

Willard, Laura Lindzey, Larry Marbuger, Siddarth Anand, Eric

Lemmon, Tom Ballinger, and [Anonymous Reviewer] for feedback.

Page 131: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

bitfunnel.org/strangeloop

github.com/bitfunnel/bitfunnel

danluu.com

Page 132: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Unused slides(thar be dragons)

Page 133: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

SLIDE FOR HOMEWORK. TODO: USE DIFFERENT TEMPLATE

Page 134: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Why are posting lists standard?

Page 135: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Literature on alternatives

“Signatures files were proposed in [23] and shown to be inferior to inverted indexing in [24]. “

“ Inverted indexes have been benchmarked as the most generalisable, and well performing structure (Zobel et al., 1998). The experiments in this

thesis are therefore conducted solely on an inverted index system.”

“While this technique provides a relatively low computation overhead, studies by Zobel et al. [1998] have shown that inverted files significantly

outperform signature files. We will now focus the analysis on inverted files as it is generally considered to be the most efficient indexing method

for most IR systems.”

“The other two mechanisms are usually adopted in certain applications even if, recently, they have been mostly abandoned in favor of inverted

indexes because some extensive experimental results [194] have shown that: Inverted indexes offer better performance than signature files and

bitmaps, in terms of both size of index and speed of query handling [188]”

“Zobel et al. [16] compared inverted files and signature files with respect to query response time and space requirements. They found that the

inverted files evaluated queries in less time than the signature files and needed less space. Their results showed that the signature files were

much larger, more expensive to construct and update, their response time was unpredictable, they support ranked queries only with difficulty,

they did not scale well and they were slow“

Page 136: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Zobel et al., actual quotes

“Inverted file indexes with in-memory search structures require no more disk accesses to

answer a conjunctive query than do bitsliced signature files.”

“One of the difficulties in the comparison of inverted files and signature files is that many

variants of signature file techniques have been proposed, and it is possible that some

combination of parameters and variants will result in a better method.”

Page 137: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Citations are lossy

Page 138: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 139: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 140: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 141: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 142: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 143: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 144: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 145: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 146: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 147: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 148: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 149: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Search: why do we care?

Page 150: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

$20M/yr * 2% savings =

$400k/yr

Page 151: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

How things fit together

Page 152: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

TODO: add diagram

Page 153: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Posting list

Page 154: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 155: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 156: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:
Page 157: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

How many terms?

Page 158: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

TODO: pseudo-code

TODO: diagram about how bits drop out

TODO: search is a high dynamic range problem.

TODO: higher rank rows

TODO: sharding by document length

TODO: diagram of how things fit together. Could just be concentric circles

Page 159: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Posting lists are standard

Page 160: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Posting list optimizations

Skip list

Delta compression

etc.

Page 161: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Search

Page 162: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Perf: how to think about it?

Page 163: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Performance

Page 164: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Search is BIG

Page 165: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Parsing / TokenizationHarder than it sounds

Page 166: Thinking about performance€¦ · “Premature optimization is the root of all evil” “We should forget about small efficiencies, say about 97% of the time” Different designs:

Search is a big problem

Tokenization

Some languages mix alphabets, are partially left-to-right and right-to-left, etc.

Can’t drop non-alphanumeric characters (C# vs C++)

Multi-language queries

Ranking / Relevance

Distributed Systems

etc.