Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon Hughes, Dice.com
October 11-14, 2016 • Boston, MA
Evolving The Optimal Relevancy Scoring Model at Dice.com
Simon Hughes, Chief Data Scientist, Dice.com
Who Am I?
• Chief Data Scientist at Dice.com and DHI, under Yuri Bykov
• Dice.com – leading US job board for IT professionals
• Twitter handle: https://twitter.com/hughes_meister
Key Projects
• Dice Skills pages - http://www.dice.com/skills
• New Dice Careers Mobile App
PhD
• PhD candidate at DePaul University, studying NLP and machine learning
• Thesis topic – detecting causality in scientific explanatory essays
Open Source GitHub Repositories
• Look under https://github.com/DiceTechJobs
• Set of Solr plugins: https://github.com/DiceTechJobs/SolrPlugins
• Tutorial for this talk: https://github.com/DiceTechJobs/RelevancyTuning
Overview
1. Approaches to Relevancy Tuning
2. Automated Relevancy Tuning – using Reinforcement Learning
3. Feedback Loops – Dangers of Closed Loop Learning Systems
Motivations for Talk
• Last year I talked about conceptual search and how that could be used to improve recall
• This year I want to focus on techniques to improve precision
• Novelty
Relevancy Tuning
Finding the Optimal Search Engine Configuration
• Most companies initially approach this as a very ad hoc, manual process:
  • Follow ‘best practices’ and make some initial educated guesses as to the best settings
  • Manually tune the parameters on a number of key user queries
• The search engine parameters should be tuned to reflect how your users search
• Relevancy is a hard concept to define, but it is whatever your users consider an optimal search experience, so it should be informed by their search behavior
What Solr Configuration Options Influence Relevancy?
Solr and Lucene provide many configuration options that impact search relevancy, including the following (see the sketch after this list):
• Which query parser – dismax, edismax, LuceneParser, etc.
• Field boosts – qf parameter
• Phrase boosts – pf, pf2, pf3 parameters
• Minimum should match – mm parameter
• Similarity class – default similarity, BM25, TF-IDF, custom, or one of many others
• Boost queries – boost, bf, bq, etc.
• Edismax tie parameter – recommended value ≈ 0.1
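To make these concrete, here is a minimal sketch of an edismax request that sets several of these parameters via the standard /select endpoint; the collection name, field names, and boost values are hypothetical, not Dice.com's production settings:

```python
# A minimal sketch of setting edismax relevancy parameters on a raw Solr
# query. The "jobs" collection, field names, and boosts are hypothetical.
import requests

params = {
    "defType": "edismax",                      # use the edismax query parser
    "q": "java developer",
    "qf": "title^8 skills^5 description^2",    # field boosts (hypothetical)
    "pf2": "title^6 description^3",            # boost adjacent word pairs
    "mm": "2<75%",                             # beyond 2 terms, 75% must match
    "tie": "0.1",                              # recommended tie value from the slide
    "bq": "employment_type:full_time^1.5",     # boost query for a use case
    "fl": "id,title,score",
    "rows": 10,
}

resp = requests.get("http://localhost:8983/solr/jobs/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc["score"], doc["title"])
```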
Some General Tips on Relevancy Tuning
Remove Noise Chars
• Ensure punctuation characters and plurality are removed from each field using the analysis chain
  • e.g. ‘q=developer’ should match ‘developer,’, ‘developer.’, ‘developer’s’ and ‘developers’
When Using Stemming / Synonyms – Use Copy Fields + Edismax
• Use copy fields to apply stemming and synonyms to existing fields
• Allows different boosts to be applied to stemmed and synonym matches
• Set the field boosts to be lower on the stemmed and synonym copy fields (see the sketch below)
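A minimal sketch of what such boosting might look like in the qf parameter; the exact-field and copy-field names and the boost values are hypothetical:

```python
# Exact fields get the highest boosts; the stemmed and synonym copy fields
# still match (improving recall), but contribute less to the score.
params = {
    "defType": "edismax",
    "q": "developers",
    "qf": "title^10 title_stemmed^3 title_synonyms^2 "
          "description^4 description_stemmed^1",
}
```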
Some General Tips on Relevancy Tuning
Use Boost Queries for Specific Query Use Cases
• Edismax bq parameter – allows boosting of matches to nested queries
• See chapter 7 of Relevant Search – good coverage of this strategy
Make Good Use of Phrase Query Boosts
• Use the pf, pf2 and pf3 parameters in edismax to give preference to multi-term matches (see the sketch below)
• pf2 and pf3 often give better performance than pf, which requires an exact match for all query terms
Caveat Emptor: monitor the impact of these changes on query performance (QTime) and index size
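A minimal sketch contrasting pf, pf2 and pf3 on one query; all field names, boosts, and the bq clause are hypothetical:

```python
# pf requires the whole query as one phrase; pf2/pf3 reward any adjacent
# pairs/triples of query terms, so they fire far more often.
params = {
    "defType": "edismax",
    "q": "senior java web developer",
    "qf": "title^5 description^2",
    "pf":  "title^10",  # only matches "senior java web developer" as one phrase
    "pf2": "title^6",   # matches pairs: "senior java", "java web", "web developer"
    "pf3": "title^4",   # matches triples: "senior java web", "java web developer"
    "ps": 1,            # phrase slop: allow one intervening term
    "bq": "posted_date:[NOW-30DAYS TO NOW]^2",  # boost recent documents
}
```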
The ‘Golden’ Test Collection
• To tune your search parameters, you can gather a dataset of relevancy judgements
• For a set of important queries, the dataset contains the top results returned, each annotated for relevancy
• This dataset can be collected using domain experts and a user interface designed for this task
• Commercial examples:
  • Quepid – developed by OpenSource Connections
  • Fusion UI Relevancy Workbench – part of the Fusion offering from Lucidworks
Search Log Capture
• An alternative to manually collecting relevancy judgements is to collect them directly from your users
• For each user search on the site, capture:
  • The user’s query and a timestamp
  • Any filters applied
  • Result impressions and clicks
• You can then turn this into a test collection by assuming that the results people click on are more relevant than those they don’t (see the sketch below)
• The time spent on the results page is also a great indication of how relevant a result was to the original search
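A minimal sketch of a search-log record capturing the fields above, and of deriving graded judgements from clicks and dwell time; the schema and the dwell-time threshold are hypothetical:

```python
# Map each impressed doc to a graded label: 0 = skipped, 1 = clicked,
# 2 = clicked with a long dwell time (hypothetical 30s threshold).
from dataclasses import dataclass, field

@dataclass
class SearchEvent:
    query: str
    timestamp: float
    filters: dict            # e.g. {"location": "Boston", "remote": True}
    impressions: list        # doc ids shown, in ranked order
    clicks: dict = field(default_factory=dict)  # doc id -> dwell time (secs)

def judgements(event: SearchEvent) -> dict:
    grades = {}
    for doc_id in event.impressions:
        dwell = event.clicks.get(doc_id)
        if dwell is None:
            grades[doc_id] = 0
        elif dwell >= 30.0:
            grades[doc_id] = 2
        else:
            grades[doc_id] = 1
    return grades
```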
Relevancy Tuning with a Test Collection
• Now that you have a test collection, you can use it to tune your search engine configuration
• Using the test collection, you can measure the relevancy of a set of searches with IR metrics such as the following (MAP and precision at k are sketched after this list):
  • MAP (Mean Average Precision)
  • Precision at k (precision computed over the top k documents retrieved)
  • NDCG (Normalized Discounted Cumulative Gain)
• Regression testing – this allows you to build a set of regression tests to ensure configuration changes both improve relevancy and don’t break certain queries
• Manually tuning search configurations is still a time consuming and inefficient process
• Is there a better way?
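A minimal sketch of MAP and precision@k over binary relevance labels; `ranked` is the engine's result order for one query, `relevant` the set of doc ids judged relevant:

```python
def precision_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    return sum(1 for d in top_k if d in relevant) / k

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i        # precision at each relevant hit
    return total / max(len(relevant), 1)

def mean_average_precision(runs):
    """`runs` is a list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```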
Automated Relevancy Tuning Approaches
1. Supervised Machine Learning?
  • No – you cannot optimize your search configuration this way without a computable gradient
2. Grid Search?
  • Perform a brute force search over the range of possible configuration parameters
  • Very slow and inefficient – it is not able to learn which ranges of settings work best
3. Black Box Optimization Algorithms?
  • Optimization algorithms exist that attempt to find the optimum value of an unknown function in as few iterations as possible
  • They perform a much smarter search of the parameter space than grid search
Black Box Optimization Algorithms
• Use an optimization algorithm to optimize a ‘black box’ function
• Black box function – provide the optimization algorithm with a function that takes a set of parameters as inputs and computes a score
• The black box algorithm will then try to choose parameter settings that optimize the score
• This can be thought of as a form of reinforcement learning
• These algorithms will intelligently search the space of possible search configurations to arrive at a solution
• Example algorithms include Bayesian Optimization, Simulated Annealing, and Genetic Algorithms (hence the talk title)
Example Black Box Function for Search Relevancy
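The original slide presented this function as a code screenshot that is not preserved in the transcript. Below is a minimal sketch of what such a function might look like, assuming a hypothetical `run_query` helper that issues an edismax search and returns ranked doc ids, plus the `mean_average_precision` function from the earlier sketch:

```python
# A reconstruction, not Dice.com's implementation: map a vector of field
# boosts to MAP over a test collection, negated so a minimizer can be used.
def make_objective(test_collection, solr_url="http://localhost:8983/solr/jobs/select"):
    def objective(params):
        title_boost, skills_boost, desc_boost = params
        qf = f"title^{title_boost} skills^{skills_boost} description^{desc_boost}"
        runs = []
        for query, relevant in test_collection:      # relevant: set of doc ids
            ranked = run_query(solr_url, query, qf)  # hypothetical helper
            runs.append((ranked, relevant))
        return -mean_average_precision(runs)         # minimize the negative
    return objective
```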
Making it Work
• There are some excellent, mature libraries for doing this sort of thing, e.g.:
  • DEAP – Distributed Evolutionary Algorithms in Python (hence the talk title)
  • Scikit-Optimize – general optimization library built by a team at CERN headed by Tim Head
• These libraries are very easy to use; however, getting them to optimize your search configuration is a little trickier (see the sketch below)
• They tend to work better when optimizing a small set of parameters at a time – 1 to 4 works well
• We achieved an improvement of 5% in MAP@5 for our MLT configuration, and will be A/B testing changes to search before EOY
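As an illustration, a minimal sketch of driving the objective above with Scikit-Optimize's `gp_minimize` (Bayesian optimization); the boost ranges and call budget are hypothetical:

```python
# `make_objective` and `test_collection` come from the earlier sketches.
from skopt import gp_minimize

objective = make_objective(test_collection)

result = gp_minimize(
    objective,
    dimensions=[(0.5, 20.0),   # title boost range
                (0.5, 20.0),   # skills boost range
                (0.5, 20.0)],  # description boost range
    n_calls=200,               # a few hundred evaluations, as advised below
    random_state=42,
)
print("best MAP:", -result.fun, "best boosts:", result.x)
```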
Making it Work
• To optimize a large set of search parameters, start with the most important ones and optimize those while keeping the rest fixed
• If you are using search logs to optimize the search configuration, use a large number of searches (at least a few thousand) to ensure you are performing a robust enough test
• For most search collections of a reasonable size, running these optimizations over your search collection will take time – set it up on a server, parallelize where possible, and leave it running overnight
• Typically you will want to allow the algorithm to try at least a few hundred variations of each parameter set to find a good range of settings
• Ideally – first optimize your search configuration against a set of relevancy judgements acquired from domain experts, deploy to production, and then use the search logs to further tune against your users’ search behavior
Use a Separate Testing Dataset to Validate Improvements
• As with any machine learning problem, it is essential to use one dataset to learn from, and a second, separate dataset to validate your results – this prevents ‘overfitting’
• Overfitting in this context means that the search parameters are over-tuned to your initial dataset, such that the search engine performs worse on new data than with the current configuration
• Once you have an optimal set of configuration parameters that you are happy with, these should be evaluated on a second set of relevancy judgements to ensure the same performance gains are seen there too (a minimal split is sketched below)
• This applies to both manual and automatic tuning of the search engine configuration: humans can overfit a dataset just as easily as an algorithm can
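A minimal sketch of holding out a validation set of judged queries before tuning; `test_collection` is the list of (query, relevant doc ids) pairs from the earlier sketches, and the 80/20 ratio is a common choice rather than a rule:

```python
import random

random.seed(42)
shuffled = test_collection[:]
random.shuffle(shuffled)

cut = int(0.8 * len(shuffled))
train_queries = shuffled[:cut]   # tune the configuration against these
valid_queries = shuffled[cut:]   # evaluate the final configuration here, once
```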
Some Other Things to Try
• Auto-tune other Solr parameters – phrase slop, mm settings, the similarity class used
• You can evolve a more optimal ranking function:
  • Either tweak the settings of the existing ranking functions (see the SweetSpotSimilarityFactory class)
  • Or use Genetic Programming to evolve a better ranking function for your dataset (see the sketch below)
• Genetic Programming is an evolutionary algorithm that can evolve programs and equations
• Some relevant papers exist, including a good introductory paper (though not very recent)
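A minimal sketch of how candidate ranking functions could be represented with DEAP's GP tools; the feature names (tf, idf, field length) are hypothetical stand-ins for whatever per-term and per-document statistics you expose, and a real run would add a fitness function (e.g. MAP over your test collection) and an evolution loop:

```python
import operator, random
from deap import gp

# Primitive set over hypothetical ranking features
pset = gp.PrimitiveSet("rank", 3)
pset.renameArguments(ARG0="tf", ARG1="idf", ARG2="field_len")
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.sub, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(max, 2)
pset.addEphemeralConstant("const", lambda: random.uniform(0.1, 2.0))

# Generate one random candidate ranking function and score a (tf, idf, len)
expr = gp.PrimitiveTree(gp.genHalfAndHalf(pset, min_=1, max_=3))
score_fn = gp.compile(expr, pset)
print(expr, "->", score_fn(2.0, 4.5, 120.0))
```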
Things to Consider
• Building a Machine Learned Ranking system is a premature optimization if you haven’t first optimized your search configuration
• Relevancy tuning and MLR both primarily optimize for precision over recall, due to the nature of the training data
• For techniques to improve recall, see conceptual / semantic search:
  • Simon Hughes – “Conceptual Search” (Revolution 2015)
  • Trey Grainger – “Enhancing Relevancy Through Personalization and Semantic Search” (Revolution 2013)
  • Doug Turnbull and John Berryman – chapter 11 of Relevant Search
Feedback Loops – Dangers of Closed Loop Learning Systems
Building a Machine Learning System
[Diagram: users interact with the system and produce data; machine learning turns that data into a model]
1. Users interact with the system to produce data
2. Machine learning algorithms turn that data into a model
What happens if the model’s predictions influence the users’ behavior?
Positive Feedback Loop
[Diagram: the same loop as before, but the model’s predictions now change user behavior, which feeds back into the data]
1. Users interact with the system to produce data
2. Machine learning algorithms turn that data into a model
3. The model changes user behavior, modifying its own future training data
Preventing Positive Feedback Loops
1. Isolate a subset of data from being influenced by the model, and use this data to train the system
  • E.g. leave a small proportion of user searches un-ranked by the MLR model
  • E.g. generate a subset of recommendations at random, or by using an unsupervised model
2. Use a reinforcement learning model instead (such as a multi-armed bandit) – the system will dynamically adapt to the users’ behavior, balancing exploring different hypotheses with exploiting what it has learned to produce accurate predictions (a minimal bandit sketch follows)
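A minimal sketch of an epsilon-greedy multi-armed bandit choosing between competing ranking configurations; the arm names, click-based reward, and 10% exploration rate are all hypothetical choices:

```python
import random

class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1):
        self.arms = arms                      # e.g. names of ranking configs
        self.epsilon = epsilon
        self.pulls = {a: 0 for a in arms}
        self.wins = {a: 0 for a in arms}

    def choose(self):
        if random.random() < self.epsilon:    # explore: random configuration
            return random.choice(self.arms)
        # exploit: configuration with the best observed click rate so far
        return max(self.arms, key=lambda a: self.wins[a] / max(self.pulls[a], 1))

    def update(self, arm, clicked):
        self.pulls[arm] += 1
        self.wins[arm] += int(clicked)

# Usage: pick a ranker per search, then record whether the user clicked
bandit = EpsilonGreedyBandit(["baseline", "tuned_boosts", "mlr_model"])
arm = bandit.choose()
bandit.update(arm, clicked=True)
```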
THE END
• Thank you for listening
• Any questions?