Evolving Search Relevancy: Presented by James Strassburg, Direct Supply
Agenda
• An Optimization Problem
• Genetic Algorithm Overview
• Modeling Solr Parameters
• Fitness Function
Search Algorithm Parameters
/select?q=foo&defType=dismax
&qf=name^20+desc^10
&pf=name^10&ps=3&mm=2
&bf="ord(popularity)^0.05"
and many more
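To make the tunable numbers explicit, here is a minimal sketch that assembles the request above from a dictionary of weights. The helper name `build_select_url` and the dictionary keys are hypothetical, chosen for illustration; the base URL is the local collection used later in the deck.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, weights):
    """Build a dismax /select URL from a dict of tunable numbers.

    The keys of `weights` are illustrative names, not Solr parameters.
    """
    params = {
        "q": query,
        "defType": "dismax",
        "qf": f"name^{weights['name_qf']} desc^{weights['desc_qf']}",
        "pf": f"name^{weights['name_pf']}",
        "ps": weights["ps"],
        "mm": weights["mm"],
        "bf": f"ord(popularity)^{weights['bf']}",
    }
    # urlencode percent-escapes "^" and turns spaces into "+".
    return f"{base_url}/select?{urlencode(params)}"

url = build_select_url(
    "http://localhost:8983/solr/restaurantsCollection",
    "foo",
    {"name_qf": 20, "desc_qf": 10, "name_pf": 10, "ps": 3, "mm": 2, "bf": 0.05},
)
print(url)
```

Every value in the dictionary is a number the optimizer is free to change, which is what turns relevancy tuning into a search over a parameter space.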
An Optimization Problem
So, how do we know we have the best set of numbers? Or even a good set? We have an optimization problem.
Sample Schema
<field name="name" type="text_en" indexed="true" stored="true" required="true" multiValued="false" omitNorms="true"/>
<field name="description" type="text_en" indexed="true" stored="true" multiValued="false" omitNorms="true"/>
Sample Data Set
[{
"name":"Red Lobster",
"description":"We deliver the freshest caught seafood every day."
},{
"name":"Joe's Crab Shack",
"description":"We serve delicious red crabs, rock crabs, large lobsters, and other delicious seafood. Our lobsters are our specialty."}]
http://localhost:8983/solr/restaurantsCollection/select?q=red+lobster&defType=dismax&qf=name+description&indent=true&fl=name+description
Genetic Algorithms
• A tool for solving optimization problems
• Based on ideas from genetics, evolution, and natural selection
• DEAP – Distributed Evolutionary Algorithms in Python
Genetic Algorithms
• Define candidate solution encoding
• Define a fitness function
• Generate random solutions
• Select candidates for reproduction
• Use crossover and mutation to create a new generation
• Repeat until some stopping criterion is met
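The steps listed on this slide can be sketched in plain Python. This is a toy illustration, not the DEAP-based code from the talk: the function name `evolve`, the tournament selection, the single cut-point crossover, and all default rates are assumptions. The stand-in fitness function simply counts 1 bits.

```python
import random

def evolve(fitness, n_bits, pop_size=20, generations=30,
           crossover_rate=0.7, mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over fixed-length bit lists (defaults are illustrative)."""
    rng = random.Random(seed)

    # Generate random solutions.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    # Repeat until a fixed generation budget is spent.
    for _ in range(generations):
        next_generation = []
        while len(next_generation) < pop_size:
            # Select candidates for reproduction: best of 3 random picks.
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)

            # Crossover: splice the parents at a random cut point.
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_bits)
                child = parent1[:cut] + parent2[cut:]
            else:
                child = list(parent1)

            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_generation.append(child)

        population = next_generation
        # Remember the fittest candidate seen in any generation.
        best = max(population + [best], key=fitness)

    return best

# Stand-in fitness: count the 1 bits. A real run would score search results.
best = evolve(fitness=sum, n_bits=8)
print(best, sum(best))
```

Because the random generator is seeded, repeated runs with the same arguments return the same candidate, which makes experiments easier to compare.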
Crossover and Mutation
Parent 1: [1,0,1,1,1,0,1,1]
Parent 2: [0,0,0,0,1,1,1,1]
Child: [1,0,0,1,1,0,1,0]
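The slide does not say which operators produced this child. One way to reproduce it, shown here as an assumption, is a uniform crossover that takes every bit from Parent 1 except position 2, followed by a mutation that flips the last bit. The mask and the mutated position are hand-picked to match the slide.

```python
def uniform_crossover(parent1, parent2, take_from_first):
    """Pick each bit from parent1 where the mask is True, else from parent2."""
    return [a if take else b
            for a, b, take in zip(parent1, parent2, take_from_first)]

def flip_bits(candidate, positions):
    """Mutation: invert the bits at the given positions."""
    return [bit ^ 1 if i in positions else bit
            for i, bit in enumerate(candidate)]

parent1 = [1, 0, 1, 1, 1, 0, 1, 1]
parent2 = [0, 0, 0, 0, 1, 1, 1, 1]

# Hand-picked mask (an assumption): only position 2 comes from parent2.
mask = [True, True, False, True, True, True, True, True]
crossed = uniform_crossover(parent1, parent2, mask)

# The last bit is 1 in both parents, so a 0 there can only come from mutation.
child = flip_bits(crossed, {7})
print(child)  # [1, 0, 0, 1, 1, 0, 1, 0]
```

In a real run the mask and the mutated positions would be drawn at random instead of fixed.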
Encoding Parameters
>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
Encoding Parameters
>>> import numpy
>>> single = numpy.float32(3.4)
>>> single
3.4000001
>>> half_single = numpy.float16(3.4)
>>> half_single
3.4004
Decimal / Fibonacci Encoding
• 0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
• 16 values encode into 4 bits
• Supports fast evolution
• Avoids relative maxima
Decimal / Fibonacci Encoding
0.0 => [0, 0, 0, 0]
0.2 => [0, 0, 0, 1]
0.4 => [0, 0, 1, 0]
…
1 => [0, 1, 0, 1]
2 => [0, 1, 1, 0]
…
144 => [1, 1, 1, 1]
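The mapping above amounts to writing each value's position in the list as a 4-bit binary number. A minimal sketch follows, with hypothetical helper names `encode` and `decode`. It assumes a 16-entry table that includes 0.6, which is inferred from the mappings 1 => [0, 1, 0, 1] and 144 => [1, 1, 1, 1].

```python
# 16 boost values: decimal steps up to 1, then Fibonacci numbers.
# Assumption: 0.6 sits at index 3 so that 1 lands on index 5 (binary 0101).
VALUES = [0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

def encode(value):
    """Return the 4-bit list for a value in the table."""
    index = VALUES.index(value)
    return [int(bit) for bit in format(index, "04b")]

def decode(bits):
    """Return the table value for a 4-bit list."""
    index = int("".join(str(bit) for bit in bits), 2)
    return VALUES[index]

print(encode(0.2))           # [0, 0, 0, 1]
print(encode(1))             # [0, 1, 0, 1]
print(decode([1, 1, 1, 1]))  # 144
```

Since every 4-bit pattern maps to a valid value, crossover and mutation can never produce an undecodable candidate.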
Candidate Solution Encoding
/select?q=foo&qf=name^0.4+desc^13
0.4 => [0, 0, 1, 0]
13 => [1, 0, 1, 0]
Candidate Solution: [0, 0, 1, 0, 1, 0, 1, 0]
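Going the other way, a candidate solution can be cut into 4-bit groups and turned back into a `qf` string. This sketch uses a hypothetical helper `candidate_to_qf` and assumes the 16-entry value table with 0.6 at index 3, which is what places 13 at binary 1010.

```python
# Assumed 16-entry table; 0.6 is inferred so that 13 lands on index 10 (1010).
VALUES = [0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

def candidate_to_qf(candidate, fields):
    """Decode 4 bits per field and build the qf parameter value."""
    boosts = []
    for i, field in enumerate(fields):
        bits = candidate[4 * i:4 * i + 4]
        index = int("".join(str(bit) for bit in bits), 2)
        boosts.append(f"{field}^{VALUES[index]}")
    return "+".join(boosts)

qf = candidate_to_qf([0, 0, 1, 0, 1, 0, 1, 0], ["name", "desc"])
print(qf)  # name^0.4+desc^13
```

Adding another tunable parameter only lengthens the bit list by four, so the genetic operators need no changes.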
Normalized Discounted Cumulative Gain
• Very relevant > relevant > not relevant
• Relevant results are more useful if they appear earlier
• The score should be comparable across queries (independent of the query)
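A minimal sketch of NDCG under these three properties. The grades (2 = very relevant, 1 = relevant, 0 = not relevant) are an assumed scale, and this uses the linear-gain form of DCG; a common variant uses 2^grade - 1 as the gain instead.

```python
import math

def dcg(grades):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(grade / math.log2(rank + 1)
               for rank, grade in enumerate(grades, start=1))

def ndcg(grades):
    """DCG divided by the DCG of the ideal ordering, so scores fall in [0, 1]."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Assumed grading scale: 2 = very relevant, 1 = relevant, 0 = not relevant.
print(ndcg([2, 1, 0]))  # 1.0, already in the ideal order
print(ndcg([0, 1, 2]))  # below 1.0, the best result is ranked last
```

Dividing by the ideal ordering is what makes scores from different queries comparable, so an average of NDCG over many queries can serve as a fitness function.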
Precision and Recall
Precision – Likelihood that a returned result was correct
Recall – Likelihood that a relevant result was returned
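Both measures reduce to set arithmetic over the returned and the relevant documents. In this sketch the helper name `precision_recall` is hypothetical, and treating only "Red Lobster" as relevant is an assumed judgment made for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Assumed judgment: only "Red Lobster" is relevant to the query.
precision, recall = precision_recall(
    returned=["Red Lobster", "Joe's Crab Shack"],
    relevant=["Red Lobster"],
)
print(precision, recall)  # 0.5 1.0
```

Neither measure looks at the order of the results, which is the gap that a rank-aware metric such as NDCG fills.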
Analytics in Schema
<field name="searchTermInteractions" type="lowercase" indexed="true" stored="true" multiValued="true"/>