Introduction to Search Systems - ScaleConf Colombia 2017

Introduction to Search Systems Toria Gibbs Senior Software Engineer @ Etsy @scarletdrive

Transcript of Introduction to Search Systems - ScaleConf Colombia 2017

Introduction to Search Systems

Toria Gibbs

Senior Software Engineer @ Etsy

@scarletdrive


LEONOR. Macrame wall hanging

$145.00 USD, AncestralStore

Bread your Cat Costume for Cats

$12.00 USD, MissMaddyMakes

45M ITEMS FOR SALE

AS OF DECEMBER 31, 2016

Agenda

Why Build Search Systems?

Search Indexes

Open Source Tools

Interesting Challenges in Search


Why build search systems?

“Isn’t search a solved problem? We have Google!” - All my friends

Photo by Alissa, loveherbyalissa.etsy.com

Google                      Etsy

Very very large scope       Medium scope
No control over content     Some control over content
High intent                 Low intent
Optimize for Google users   Optimize for Etsy users

Why build search systems?

1. Customize the solution (your users, your data, your algorithms)

Database Example

id    description          price
001   red cat mittens      40.00
002   blue mittens         19.99
003   blue hat for cats    12.50
004   cat hat              25.00
005   red and blue hat     30.00

q=“cat”

SELECT * FROM items WHERE description LIKE '%cat%'


SUBSTRING SEARCH

n = items in database
m = length of string

O(n·m)
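The scan behind the LIKE query can be sketched in Python as follows. This is only an illustration of why the cost grows with both n and m: the dictionary `ITEMS` and the function name `substring_search` are made up for this example, and a real database engine does not execute LIKE this way line for line.

```python
from typing import Dict, List

# The five example items from the slides (id -> description).
ITEMS: Dict[str, str] = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}


def substring_search(items: Dict[str, str], query: str) -> List[str]:
    """Return the ids of all items whose description contains the query."""
    matches = []
    for item_id, description in items.items():  # visits all n items
        if query in description:  # scans up to m characters of each one
            matches.append(item_id)
    return matches


print(substring_search(ITEMS, "cat"))  # ['001', '003', '004']
```

Every query has to touch every row, so the work per query grows with the size of the catalog.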

Database Scalability (m=25)

n          n·m
10         250
100        2500
1000       25000
10000      250000
100000     2500000
1000000    25000000

Why build search systems?

1. Customize the solution (your users, your data, your algorithms)

2. Improve performance


✓ cat hat

✓ blue hat for cats

✓ vacation hat

? kitten hat

By Laura Solarte, floflyco.etsy.com

SELECT * FROM items WHERE description LIKE '%cat%'
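A small Python sketch of the quality problem on this slide: substring matching accepts "vacation hat" (the letters "cat" sit inside "vacation") and rejects "kitten hat". The helper name `like_match` is hypothetical and simply mimics LIKE '%cat%'.

```python
# The four item titles shown on the slide.
titles = ["cat hat", "blue hat for cats", "vacation hat", "kitten hat"]


def like_match(title: str, needle: str = "cat") -> bool:
    """Mimic LIKE '%needle%': true when needle occurs anywhere in the title."""
    return needle in title


results = {title: like_match(title) for title in titles}
for title, matched in results.items():
    print(f"{title!r}: {matched}")
# 'cat hat': True
# 'blue hat for cats': True
# 'vacation hat': True   (false positive)
# 'kitten hat': False    (arguably a missed result)
```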

Why build search systems?

1. Customize the solution (your users, your data, your algorithms)

2. Improve performance

3. Improve quality of results


Search Index

Inverted Index

red [001, 005]

blue [002, 003, 005]

cat [001, 003, 004]

hat [003, 004, 005]

mitten [001, 002]


001 red cat mittens

002 blue mittens

003 blue hat for cats

004 cat hat

005 red and blue hat
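A minimal sketch of how a query can be answered from this inverted index, assuming it is held in an ordinary Python dictionary. The names `INVERTED_INDEX`, `lookup`, and `search_all` are invented for the example; the postings are copied from the slide.

```python
from typing import Dict, List

# Term -> ids of the documents containing it, copied from the slide.
INVERTED_INDEX: Dict[str, List[str]] = {
    "red": ["001", "005"],
    "blue": ["002", "003", "005"],
    "cat": ["001", "003", "004"],
    "hat": ["003", "004", "005"],
    "mitten": ["001", "002"],
}


def lookup(term: str) -> List[str]:
    """Single-term query: one dictionary lookup, no scan over documents."""
    return INVERTED_INDEX.get(term, [])


def search_all(terms: List[str]) -> List[str]:
    """Multi-term query: documents that contain every term (AND)."""
    if not terms:
        return []
    result = set(lookup(terms[0]))
    for term in terms[1:]:
        result &= set(lookup(term))
    return sorted(result)


print(lookup("cat"))               # ['001', '003', '004']
print(search_all(["cat", "hat"]))  # ['003', '004']
```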

Terminology

● A document is a single searchable unit

001 red cat mittens 40.00

● A field is a defined value in a document

id description price

001 red cat mittens 40.00

● A term is a value extracted from the source in order to build the inverted index

● An inverted index is an internal data structure that maps terms of a field to document ids

red [001, 005]

blue [002, 003, 005]

cat [001, 003, 004]

hat [003, 004, 005]

mitten [001, 002]

● An index is a collection of documents

12.50 [003]

19.99 [002]

25.00 [004]

30.00 [005]

40.00 [001]

001 red cat mittens 40.00

002 blue mittens 19.99

... ... ...

red [001, 005]

blue [002, 003, 005]

cat [001, 003, 004]

hat [003, 004, 005]

mitten [001, 002]

001 red cat mittens

002 blue mittens

003 blue hat for cats

004 cat hat

005 red and blue hat

How did we do this?

string: “cat hat”

array: [“cat”, “hat”]

Tokenization
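One simple way to tokenize, sketched in Python: lowercase the string and keep runs of letters and digits. The regular expression is an assumption chosen for this example; production analyzers handle punctuation, Unicode, and language-specific rules with far more care.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split a string into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("cat hat"))  # ['cat', 'hat']
print(tokenize("Bread your Cat Costume for Cats"))
# ['bread', 'your', 'cat', 'costume', 'for', 'cats']
```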

By Meredith Langley, iheartneedlework.etsy.com

Stemming

By Paradise Crow, ParadiseCrow.etsy.com

“cats” → “cat”
“walking” → “walk”

“painting” → “paint” ?
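A deliberately naive suffix-stripping stemmer, sketched in Python to show both the benefit and the risk. The function `naive_stem` and its two rules are assumptions for illustration only; real systems typically rely on an established stemmer such as Porter's algorithm.

```python
def naive_stem(token: str) -> str:
    """Strip a trailing 'ing' or 's' if at least three letters remain."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


print(naive_stem("cats"))      # cat
print(naive_stem("walking"))   # walk
print(naive_stem("painting"))  # paint
```

The last line is the questionable case from the slide: a search for a "painting" (the artwork) now also matches items about "paint".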

By Dina Castellano, mamaslilsugarcrochet.etsy.com

Bonus: Synonyms

✓ [“cat”, “kitten”]

✓ [“color”, “colour”]

✓ [“Canada”, “Canadian”, “canuck”]

✗ [“Poland”, “Polish”]

=(

By Ludwinus van den Arend, circuszoo.etsy.com

Building an Inverted Index

● Stemming ✓ hat for cats

● Tokenization ✗ vacation

● Synonyms ✓ kitten hat
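The three steps can be combined into one small indexing pipeline. The Python sketch below rebuilds the inverted index from the earlier slides out of the five example descriptions. Several details are assumptions, not taken from the talk: the stop-word list (the slide's index simply omits "for" and "and"), the plural-stripping rule, and the `SYNONYMS` table.

```python
import re
from collections import defaultdict
from typing import Dict, List

STOPWORDS = {"for", "and"}    # assumption: words the slide's index leaves out
SYNONYMS = {"kitten": "cat"}  # hypothetical synonym table


def analyze(text: str) -> List[str]:
    """Tokenize, drop stop words, strip a plural 's', then map synonyms."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOPWORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]  # naive stemming: "cats" -> "cat"
        terms.append(SYNONYMS.get(token, token))
    return terms


def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each term to the sorted ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(analyze(text)):  # set(): one posting per document
            index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


DOCS = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}

index = build_index(DOCS)
print(index["cat"])     # ['001', '003', '004']
print(index["mitten"])  # ['001', '002']
```

Because queries go through the same `analyze` step, "kitten hat" becomes ["cat", "hat"], while "vacation hat" stays ["vacation", "hat"] and no longer matches "cat".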


INDEX TIME: O(n·m·p)

QUERY TIME: O(1)

n = items in database
m = length of string
p = preprocessing steps

By Lisa Van Riper, humbleelephant.etsy.com

title
1. “big data”
2. “small data”
3. “big data”
4. “small data”
5. “big data”
6. “small data”
7. “big data”
8. “small data”
9. “big data”
10. “small data”
11. “bigger data”
12. “biggest data”

data=[1,2,3,4,5,6,7,8,9,10,11,12]
big=[1,3,5,7,9,11,12]
small=[2,4,6,8,10]

title
1. “Carlos Vives is the greatest singer alive”
2. “Shakira is the best dancer in the world”
3. “Sophía Vergara is the most famous Colombian in the United States”

carlos=[1]
vives=[1]
is=[1,2,3]
the=[1,2,3]
great=[1]
singer=[1]
alive=[1]
shakira=[2]
best=[2]
dancer=[2]
in=[2,3]
world=[2]
sophia=[3]
vergara=[3]
most=[3]
famous=[3]
colombia=[3]
unite=[3]
states=[3]

Did we solve it?

✓ Customize the solution (your users, your data, your algorithms)

✓ Improve performance

✓ Improve quality of results

Agenda

Why Build Search Systems?

Search Indexes

Open Source Tools

Interesting Challenges in Search


Open Source Tools


● Inverted index
● Field data (uninverted index)
● Basic stemming, tokenizing, faceting

● Advanced stemming, tokenizing, faceting
● Plugins
● Caching, warming
● Replication
● Sharding, distribution
● ...and more!

Which one should I pick?

IT DOESN’T MATTER

Source: Side by Side with Elasticsearch and Solr, by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability

See also: http://solr-vs-elasticsearch.com/ by Kelvin Tan


It Doesn’t Matter

● Most projects work well with either

● Getting configuration right is more important

● Test with your own data and your own queries


<schema name="items" version="1.6">
  <types>
    <fieldType name="long" class="solr.TrieLongField"/>
    <fieldType name="int" class="solr.TrieField" type="integer"/>
    <fieldType name="tdate" class="solr.TrieDateField"/>
    <fieldType name="text" class="solr.TextField"/>
  </types>

  <fields>
    <field name="item_id" type="long" stored="true" required="true"/>
    <field name="description" type="text"/>
    <field name="quantity" type="int"/>
    <field name="price" type="long"/>
    <field name="update_date" type="tdate"/>
  </fields>

  <defaultSearchField>description</defaultSearchField>
  <uniqueKey>item_id</uniqueKey>
</schema>

"item" : {
  "properties" : {
    "item_id": { "type": "long", "store": true },
    "description": { "type": "string" },
    "quantity": { "type": "integer" },
    "price": { "type": "long" },
    "update_date": { "type": "date" }
  }
}

Which one should I pick?

Just pick one and get started :)


Interesting Challenges

Scalability

Relevance

Query Understanding

INTERESTING CHALLENGES


By Bekki, TresorsDesPyrenees.etsy.com

Data / Users


Replication


Replication

update


Sharding

Distribution


Scalability

Relevance

Query Understanding

INTERESTING CHALLENGES


TF·IDF


TF-IDF

TF(term) = (# times this term appears in doc) / (total # terms in doc)

IDF(term) = log_e(total number of docs / # docs which contain this term)

1. The orange cat is a very good cat
2. My cat ate an orange
3. Cats are the best and I will give every cat a special cat toy

1. TF(cat) = 2/8
2. TF(cat) = 1/5
3. TF(cat) = 3/14

IDF(cat) = log_e(3/3)

“cat” → [1, 3, 2]

TF-IDF

TF(term) = (# times this term appears in doc) / (total # terms in doc)

IDF(term) = log_e(total number of docs / # docs which contain this term)

1. The orange cat is a very good cat
2. My cat ate an orange
3. Cats are the best and I will give every cat a special cat toy cat cat cat cat cat

1. TF(cat) = 2/8
2. TF(cat) = 1/5
3. TF(cat) = 8/19

IDF(cat) = log_e(3/3)

“cat” → [3, 1, 2]
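The numbers on the two TF-IDF slides can be reproduced with a short Python sketch. The helpers `analyze`, `tf`, `idf`, and `rank_by_tf` are invented for this example, and the plural-stripping rule is an assumption made so that "Cats" counts as "cat". Since "cat" appears in all three documents, IDF(cat) = log_e(3/3) = 0, so the orderings shown on the slides follow from TF alone.

```python
import math
import re
from fractions import Fraction
from typing import Dict, List


def analyze(text: str) -> List[str]:
    """Lowercase, split into words, and strip a plural 's' (naive stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def tf(term: str, text: str) -> Fraction:
    """Occurrences of the term divided by the total number of terms in the doc."""
    terms = analyze(text)
    return Fraction(terms.count(term), len(terms))


def idf(term: str, docs: Dict[int, str]) -> float:
    """log_e(total docs / docs containing the term); assumes at least one match."""
    containing = sum(1 for text in docs.values() if term in analyze(text))
    return math.log(len(docs) / containing)


def rank_by_tf(term: str, docs: Dict[int, str]) -> List[int]:
    """Document ids ordered from highest to lowest term frequency."""
    return sorted(docs, key=lambda doc_id: tf(term, docs[doc_id]), reverse=True)


docs = {
    1: "The orange cat is a very good cat",
    2: "My cat ate an orange",
    3: "Cats are the best and I will give every cat a special cat toy",
}
# The same collection with document 3 padded with five extra "cat" tokens.
stuffed = dict(docs)
stuffed[3] = docs[3] + " cat cat cat cat cat"

print(tf("cat", docs[3]))          # 3/14
print(idf("cat", docs))            # 0.0
print(rank_by_tf("cat", docs))     # [1, 3, 2]
print(tf("cat", stuffed[3]))       # 8/19
print(rank_by_tf("cat", stuffed))  # [3, 1, 2]
```

Padding a listing with repeated keywords is enough to move it to the top, which is one reason term statistics alone make a weak ranking signal.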

TF·IDF


IDF·Q·R


Quality

By Lisa, airfriend.etsy.com

● User reviews
● Clicks
● Favorites
● Adds to shopping cart
● Purchases
● Dwell (time spent viewing the item)
● ...and more!

Recency

By Olya, foxberrystudio.etsy.com

● Ensure that each visit is new and fresh

● New items have a chance to be seen

Diversity


Scalability

Relevance

Query Understanding

INTERESTING CHALLENGES


Query Understanding

● Tokenization and stemming

● Language identification

● Spelling correction

● Query rewriting (scoping, expansion, relaxation)

For more information: http://queryunderstanding.com/ by Daniel Tunkelang

Query Scoping


q=“red mittens”

q=“pizza restaurants in Medellin”

q=“necklace under $20”

q=“mittens” & color=red

q=“pizza restaurant” & location=“Medellin”

q=“necklace” & price<20
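A rule-based sketch of query scoping in Python, covering only the three example queries. Everything here is hypothetical: the function `scope_query`, the `KNOWN_COLORS` set, the filter names, and the two regular expressions. It is not how Etsy implements scoping, and rules this simple misfire easily (for example, any trailing "in X" is treated as a location).

```python
import re
from typing import Dict, Tuple

KNOWN_COLORS = {"red", "blue"}  # hypothetical color vocabulary


def scope_query(query: str) -> Tuple[str, Dict[str, object]]:
    """Split a raw query into free text plus structured filters."""
    filters: Dict[str, object] = {}

    # "under $20" -> price filter
    price = re.search(r"\s*\bunder \$(\d+(?:\.\d+)?)", query)
    if price:
        filters["price_lt"] = float(price.group(1))
        query = query[: price.start()] + query[price.end():]

    # trailing "in <place>" -> location filter
    location = re.search(r"\s+in (\w+)$", query)
    if location:
        filters["location"] = location.group(1)
        query = query[: location.start()]

    # first known color word -> color filter
    words = []
    for word in query.split():
        if word.lower() in KNOWN_COLORS and "color" not in filters:
            filters["color"] = word.lower()
        else:
            words.append(word)
    return " ".join(words), filters


print(scope_query("red mittens"))
# ('mittens', {'color': 'red'})
print(scope_query("pizza restaurants in Medellin"))
# ('pizza restaurants', {'location': 'Medellin'})
print(scope_query("necklace under $20"))
# ('necklace', {'price_lt': 20.0})
```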

By Amanda Ellis, GreenChickens.etsy.com

How Etsy Uses Thermodynamics to Help You Search for “Geeky” by Fiona Condon
http://codeascraft.com/2015/08/31/how-etsy-uses-thermodynamics-to-help-you-search-for-geeky

Scalability

Relevance

Query Understanding

INTERESTING CHALLENGES

Agenda

Why Build Search Systems?

Search Indexes

Open Source Tools

Interesting Challenges in Search

Follow me on Twitter!

@scarletdrive

Thanks!


We Covered

● Stemming
● Tokenization
● Synonyms
● Replication, distribution, and sharding
● Ranking for relevance
● Query understanding

We Did Not Cover

● Faceting
● Field data
● Internationalization
● Spelling correction
● Autocomplete suggestions