Ruby meetup: Evolution of BoF search


Evolution of BoF search

BoF?

- www.businessoffashion.com
- London-based fashion media startup
- One of the biggest DLabs products

Infrastructure

WordPress, Symfony2, MySQL, MongoDB, Redis, Elasticsearch, Ruby, HHVM, Go, AWS...

You name it...

The fashion folks

- They are not techies
- Great attention to detail
- Not ready to make compromises

The Challenge

● Make search work in such a way that relevant results are found. It should be a bit fuzzy, but not too much.

● It should downrank results from certain categories (daily digest), but not at the expense of accuracy.

● It should downrank old articles, but not always.

● Items that are shared more should rank higher, but it is not clear how much higher.

The challenge - example search queries:

Tom Adeyoola: mentioned once in the body of an old article -> should be the 1st result.

'Tweets and tribes': title of an old article -> expected to be the 1st result.

Dolce Gabbana: 1st should be an article from 2011, 2nd should be an article from 2014 (News and analysis).

Elasticsearch toolset

Search types:

● Fuzzy
● Match
● Phrase

They can all be used together in a 'bool' query.
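A minimal sketch of how the three search types might be combined in one bool query, built as a plain Ruby hash. The helper name `combined_query`, the field name `title`, and the boost values are assumptions for illustration, not the code used in the talk.

```ruby
require 'json'

# Build a bool query whose `should` clauses mix fuzzy, match and phrase
# matching on one field. Field name and boosts are illustrative assumptions.
def combined_query(text, field: 'title')
  {
    query: {
      bool: {
        should: [
          # Fuzzy: a match query that tolerates small typos
          { match: { field => { query: text, fuzziness: 'AUTO', boost: 1 } } },
          # Match: any of the analyzed terms
          { match: { field => { query: text, boost: 8 } } },
          # Phrase: the terms in the same order
          { match_phrase: { field => { query: text, boost: 8 } } }
        ]
      }
    }
  }
end

puts JSON.pretty_generate(combined_query('tweets and tribes'))
```

Because every clause sits under `should`, a document that matches more clauses collects a higher score, which is what lets an exact phrase hit outrank a merely fuzzy one.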

Boosting options

● Boost on field (title, content, category…)

● Boost on search type

● Boost mode: how the function score is combined with the query score (multiply, replace, sum, avg, max, min)

Scoring Function

● Script score

"script_score" : {
  "script" : "_score * (category == 1 ? 0.2 : 1)"
}

● Decay functions (e.g. the score should decrease by a factor of 0.2 every 100 days)

● Score mode: how the results of several scoring functions are combined (multiply, sum, avg, first, max, min)
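The scoring options listed here can be sketched as one `function_score` query, again as a plain Ruby hash. The field names `category` and `published_at` and the helper name `scored_query` are assumptions. The values 0.2 and 100 days echo the examples in the bullets.

```ruby
require 'json'

# Wrap a base query in function_score. `category` and `published_at`
# are assumed field names, not taken from the talk.
def scored_query(base_query)
  {
    query: {
      function_score: {
        query: base_query,
        functions: [
          # Scale documents in category 1 down to 20% of their score
          { filter: { term: { category: 1 } }, weight: 0.2 },
          # Older documents score lower: 0.2 of the full value at 100 days
          { gauss: { published_at: { origin: 'now', scale: '100d', decay: 0.2 } } }
        ],
        score_mode: 'multiply', # how the function results combine with each other
        boost_mode: 'multiply'  # how that result combines with the query score
      }
    }
  }
end

puts JSON.pretty_generate(scored_query({ match: { title: 'dolce gabbana' } }))
```

A filtered `weight` function is used in place of a script here because it expresses the same category downboost without any script evaluation.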

Decay?

"DECAY_FUNCTION": {

"FIELD_NAME": {

"origin": "TODAY",

"scale": "10d",

"offset": "5d",

"decay" : 0.5

}

}

[Figure: score plotted against time, with ticks at 1y and 2y]

Decay function can be Gauss, linear or exponential
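To make the three shapes concrete, here is a small sketch that evaluates them locally, following the formulas in the Elasticsearch function_score documentation. The method name `decay_score` is hypothetical, and distances are plain numbers of days rather than dates.

```ruby
# Score multiplier for a document `distance` days away from the origin.
# distance, scale and offset share one unit (days here).
def decay_score(kind, distance, scale:, offset: 0.0, decay: 0.5)
  # Inside the offset nothing decays at all
  d = [0.0, distance.abs - offset].max
  case kind
  when :gauss
    sigma_sq = -(scale**2) / (2.0 * Math.log(decay))
    Math.exp(-(d**2) / (2.0 * sigma_sq))
  when :exp
    Math.exp(Math.log(decay) / scale * d)
  when :linear
    s = scale / (1.0 - decay)
    [(s - d) / s, 0.0].max
  else
    raise ArgumentError, "unknown decay kind: #{kind}"
  end
end

%i[gauss exp linear].each do |kind|
  puts "#{kind}: #{decay_score(kind, 15, scale: 10, offset: 5, decay: 0.5).round(3)}"
end
```

All three curves pass through the `decay` value at a distance of `offset + scale` from the origin. They differ in how fast they fall before and after that point, and only the linear one ever reaches exactly zero.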

How do the parameters look?

boost_mode: 'avg',

score_mode: 'sum',

split_query: false,

min_score: 0.5,

use_fuzzy: true,

use_query_string: true,

use_match: true,

use_pharse: true,

use_decay_on_time: true,

use_views_weights: true,

use_shares_weights: true,

use_downboost_for_categories: true,

time_decay_scale: '30d',

time_decay_decay: 0.001,

view_weight_divisor: 100000,

shares_weight_divisor: 100000,

boost_factor_fuzzy: 1,

boost_factor_match: 8,

boost_factor_phrase: 8,

boost_title: 8,

boost_summary: 4,

boost_keywords: 2,

boost_content: 8,

boost_slug: 1,

boost_author: 1,

downboost_cat_2687: 0.5,

downboost_cat_4: 0.25,

downboost_cat_77: 0.125

I ended up fixing cases by trial and error.

And often, when one case was fixed, another one broke.

Query: Tom Adeyoola
  Straight search:     1. Tom Adeyoola  2. Tom Ford  3. Tom Ford  ...
  Decay on + scoring:  1. Tom Ford  2. Tom Ford  ...  77. Tom Adeyoola :(

Query: Dolce Gabbana
  Straight search:     1. Some 'irrelevant' result  2. Other 'irrelevant' result
  Decay on + scoring:  1. 'Relevant' result  2. 'Relevant' result

But... I'm a programmer. Shouldn't the computer be doing these boring trial-and-error tasks for me?

Damn right! Let's just try all possible combinations of parameters. There are roughly 1.28E+36 of them. It could take years.

So what then?

Evolution FTW.

Darwin's theory of evolution was a concept of such stunning simplicity, but it gave rise, naturally, to all of the infinite and baffling complexity of life. The awe it inspired in me made the awe that people talk about in respect of religious experience seem, frankly, silly beside it. I'd take the awe of understanding over the awe of ignorance any day.

Douglas Adams

Evolutionary algorithm

Well, genetic algorithm.

1. Create a random population of search configurations.
2. Run the fitness function for each subject.
3. Choose the best subjects.
4. Create a new population from the best.

Genome:

boost_mode: 'avg',

score_mode: 'sum',

split_query: false,

min_score: 0.5,

use_fuzzy: true,

use_query_string: true,

use_match: true,

use_pharse: true,

use_decay_on_time: true,

use_views_weights: true,

use_shares_weights: true,

use_downboost_for_categories: true,

time_decay_scale: '30d',

time_decay_decay: 0.001,

view_weight_divisor: 100000,

shares_weight_divisor: 100000,

boost_factor_fuzzy: 1,

boost_factor_match: 8,

boost_factor_phrase: 8,

boost_title: 8,

boost_summary: 4,

boost_keywords: 2,

boost_content: 8,

boost_slug: 1,

boost_author: 1,

downboost_cat_2687: 0.5,

downboost_cat_4: 0.25,

downboost_cat_77: 0.125

(Vocabulary)

Genome: a set of rules that determines how each subject will behave -> set of search parameters

Population: A set of all subjects -> set of all search objects

Subject: a single member of the population -> a single search object, created with a genome:

ElstSearch.new(huge_settings_hash)

Fitness function: a function that determines how good a subject is.

Create random population

boost_mode: [sum, avg, max, min ...].sample,
score_mode: [sum, avg, max, min ...].sample,
split_query: [true, false].sample,
min_score: rand,
...
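A runnable sketch of the same idea: every gene gets a small generator, and a genome is one random draw per gene. Only a handful of the parameters are shown. The numeric ranges, the constant `GENES`, and the helper names are assumptions for illustration.

```ruby
# Each gene maps to a lambda that draws one random value.
# Only a few genes are shown; the ranges are illustrative assumptions.
GENES = {
  boost_mode:         ->(rng) { %w[multiply sum avg max min].sample(random: rng) },
  score_mode:         ->(rng) { %w[multiply sum avg max min].sample(random: rng) },
  split_query:        ->(rng) { [true, false].sample(random: rng) },
  min_score:          ->(rng) { rng.rand },          # float in [0, 1)
  boost_factor_match: ->(rng) { rng.rand(0..100) },  # integer in 0..100
  time_decay_scale:   ->(rng) { "#{rng.rand(1..3650)}d" }
}.freeze

# One random search configuration
def random_genome(rng = Random.new)
  GENES.transform_values { |draw| draw.call(rng) }
end

# The first generation: `size` unrelated random genomes
def random_population(size, rng = Random.new)
  Array.new(size) { random_genome(rng) }
end

puts random_genome(Random.new(42)).inspect
```

Passing a seeded `Random` makes a run reproducible, which helps when comparing how different settings of the algorithm behave.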

Run fitness function for each subject

The fitness function scores each subject. The score shows how suitable an individual subject is for further use.

def fitness_tweets_and_tribes
  rs = search('tweets and tribes')
  place = rs.string_place(:title, 'tweets and tribes')
  # Reward is the reciprocal of the rank: 1.0 for 1st place, 0.5 for 2nd, ...
  if place
    increase_score(1 / place.to_f)
  end
end

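def fitness_chris_morton
  rs = search('chris morton')
  place1 = rs.string_place(:title, 'One Cart to Rule Them All')
  place2 = rs.string_place(:title, 'Net Native')
  # Guard against a missing result (nil place) before dividing or comparing
  increase_score(1 / place1.to_f) if place1
  increase_score(1 / place2.to_f) if place2
  # Bonus when the two articles appear in the expected order
  increase_score(1) if place1 && place2 && place1 < place2
end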
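The fitness functions above reward a result with the reciprocal of its rank. The sketch below shows that scoring in isolation against a stubbed result list. The `ResultSet` class and this implementation of `string_place` are assumptions, since the real result object is not shown in the talk.

```ruby
# Stand-in for a search result list; the real `rs` object is not shown in the talk.
class ResultSet
  def initialize(hits)
    @hits = hits # array of hashes, best hit first
  end

  # 1-based position of the first hit whose field contains `text`
  # (case-insensitive), or nil when no hit matches.
  def string_place(field, text)
    index = @hits.index { |hit| hit[field].to_s.downcase.include?(text.downcase) }
    index && index + 1
  end
end

# Reciprocal-rank reward: 1.0 for first place, 0.5 for second, 0.0 if absent.
def rank_reward(place)
  place ? 1.0 / place : 0.0
end

rs = ResultSet.new([
  { title: 'Tom Ford Returns' },
  { title: 'Tweets and Tribes' }
])
puts rank_reward(rs.string_place(:title, 'tweets and tribes')) # found at place 2
```

The reciprocal makes the top positions matter most: moving from 2nd to 1st place is worth far more than moving from 77th to 76th.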

Choose the best subjects

Sort by score, take top 10.
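As a sketch, the selection step is a one-liner. Each subject is represented here as a hash with a `:score` key, which is an assumed representation, and `select_best` is a hypothetical helper name.

```ruby
# Keep the `count` highest-scoring subjects, best first.
# Each subject is assumed to be a hash with a :score key.
def select_best(population, count = 10)
  population.max_by(count) { |subject| subject[:score] }
end

population = (1..100).map { |i| { id: i, score: (i * 37) % 101 } }
puts select_best(population).map { |s| s[:score] }.inspect
```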

Create new population from the best

Mating two DNAs

dna1  = ['avg', 'min', true, true, false, 14 ...]
dna2  = ['max', 'max', false, true, true, 18 ...]
------- take each gene from a random parent -------
child = ['avg', 'max', true, true, false, 18 ...]

We create 100 children from the 10 best subjects of the current generation.
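A minimal sketch of this mating step, assuming each DNA is a plain array with one value per gene. The method names `mate` and `breed` are hypothetical, and so is the choice to pick parent pairs uniformly at random.

```ruby
# Uniform crossover: each gene is copied from one parent chosen at random.
def mate(dna1, dna2, rng = Random.new)
  dna1.zip(dna2).map { |a, b| rng.rand < 0.5 ? a : b }
end

# Breed `count` children from random pairs of the best subjects.
def breed(best, count = 100, rng = Random.new)
  Array.new(count) do
    mum, dad = best.sample(2, random: rng)
    mate(mum, dad, rng)
  end
end

dna1 = ['avg', 'min', true, true, false, 14]
dna2 = ['max', 'max', false, true, true, 18]
puts mate(dna1, dna2).inspect
```

With 10 parents there are 45 distinct pairs, so 100 children still contain repeated pairings, but the gene mix differs from child to child.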

Problem I: inbreeding

After a while, the top subjects all give the same score. The score doesn't increase even after thousands of generations.

Solution: mutations

dna1  = ['avg', 'min', true, true, false, 14 ...]
dna2  = ['max', 'max', false, true, true, 18 ...]
------- take each gene from a random parent + mutate -------
child = ['avg', 'avg', true, true, false, 4 ...]
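Mutation can be sketched as a second pass over the child: with a small probability, a gene is replaced by a fresh random value instead of an inherited one. `GENE_MAKERS`, the 5% default rate, and the value ranges are assumptions for illustration.

```ruby
# One generator per DNA position (hypothetical; mirrors the example genes).
GENE_MAKERS = [
  ->(rng) { %w[sum avg max min].sample(random: rng) },
  ->(rng) { %w[sum avg max min].sample(random: rng) },
  ->(rng) { [true, false].sample(random: rng) },
  ->(rng) { rng.rand(0..100) }
].freeze

# With probability `rate`, replace a gene by a freshly drawn random value.
def mutate(dna, gene_makers = GENE_MAKERS, rate = 0.05, rng = Random.new)
  dna.each_with_index.map do |gene, i|
    rng.rand < rate ? gene_makers[i].call(rng) : gene
  end
end

puts mutate(['avg', 'max', true, 18], GENE_MAKERS, 0.5).inspect
```

Mutation is what brings back values that none of the surviving parents carry, which is why it breaks the plateau caused by inbreeding.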

Problem II: the cost of a new best subject

[Figure: score plotted against generations]

Solution: increase mutation rate

● If too many generations have passed since the last best score, increase the mutation rate.
● If that does not help, push the last best genome into the breeding pairs.
● If that does not help, create a new random generation.
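The escalation above can be sketched as a small decision function keyed on how long the best score has been stuck. The thresholds (50, 100, 150 generations), the rate multiplier, and the name `stagnation_plan` are assumptions. The talk does not state the actual values.

```ruby
# Decide what to do based on how many generations have passed since the
# best score last improved. Thresholds and rates are illustrative assumptions.
def stagnation_plan(stale_generations, base_rate: 0.05)
  raised_rate = [base_rate * 4, 1.0].min
  if stale_generations < 50
    { action: :breed, mutation_rate: base_rate }            # business as usual
  elsif stale_generations < 100
    { action: :breed, mutation_rate: raised_rate }          # mutate more
  elsif stale_generations < 150
    { action: :reinject_best, mutation_rate: raised_rate }  # bring back the last best genome
  else
    { action: :restart, mutation_rate: base_rate }          # new random generation
  end
end

[10, 60, 120, 200].each { |n| puts "#{n}: #{stagnation_plan(n)}" }
```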

Result

boost_mode: 'sum',

score_mode: 'sum',

split_query: false,

min_score: 0.27697059014144987,

use_fuzzy: true,

use_query_string: false,

use_match: false,

use_pharse: true,

use_decay_on_time: true,

use_views_weights: true,

use_shares_weights: false,

use_downboost_for_categories: true,

time_decay_scale: '2491d',

time_decay_decay: 0.3306393289954982,

view_weight_divisor: 504537,

shares_weight_divisor: 703657,

boost_factor_fuzzy: 85,

boost_factor_match: 0,

boost_factor_phrase: 68,

boost_title: 86,

boost_summary: 30,

boost_keywords: 32,

boost_content: 75,

boost_slug: 45,

boost_author: 27,

downboost_cat_2687: 0.73255136431042,

downboost_cat_4: 0.1384377696262037,

downboost_cat_1: 0.7109314266874576,

Thank you