Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bloomberg LP


Transcript of Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bloomberg LP

Page 1

Learning To Rank For Solr

Michael Nilsson – Software Engineer
Diego Ceccarelli – Software Engineer
Joshua Pantony – Software Engineer

Bloomberg LP
Copyright 2015 Bloomberg L.P. All rights reserved.

Page 2

OUTLINE

● Search at Bloomberg
● Why do we need machine learning for search?
● Learning to Rank
● Solr Learning to Rank Plugin

Page 3

8 million searches PER DAY

1 million PER DAY

400 million stories in the index

Page 4

SOLR IN BLOOMBERG

● Search engine of choice at Bloomberg
─ Large community / well-distributed committers
─ Open-source Apache project
─ Used within many commercial products
─ Large feature set and rapid growth
● Committed to open source
─ Ability to contribute to the core engine
─ Ability to fix bugs ourselves
─ Contributions in almost every Solr release since 4.5.0

Page 5

PROBLEM SETUP

[Example search results, scored 30 and 1.0]

Page 6

PROBLEM SETUP

Score = 100 * scoreOnTitle + 10 * scoreOnDescription

[Example search results, scored 52.2 and 30.8]

Page 7

PROBLEM SETUP

Score = 100 * scoreOnTitle + 10 * scoreOnDescription

Page 8

PROBLEM SETUP

Score = 150 * scoreOnTitle + 3.14 * scoreOnDescription + 42 * clicks

Page 9

PROBLEM SETUP

Score = 99.9 * scoreOnTitle + 3.1114 * scoreOnDescription + 42.42 * clicks + 5 * timeElapsedFromLastUpdate

Page 10

PROBLEM SETUP

Score = 99.9 * scoreOnTitle + 3.1114 * scoreOnDescription + 42.42 * clicks + 5 * timeElapsedFromLastUpdate

query = solr
query = lucene
query = austin
query = bloomberg
query = ...

● It's hard to manually tweak the ranking
─ You must be an expert in the domain
─ ... or a magician
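To see why this breaks down, here is a minimal sketch of the kind of hand-tuned linear scorer the formula above describes (the weights are the ones from the slide; the feature values in the example are illustrative assumptions):

def handtuned_score(doc):
    # Every weight below was picked by hand; each new feature, field, or
    # query type means revisiting all of them.
    return (99.9 * doc["scoreOnTitle"]
            + 3.1114 * doc["scoreOnDescription"]
            + 42.42 * doc["clicks"]
            + 5 * doc["timeElapsedFromLastUpdate"])

print(handtuned_score({"scoreOnTitle": 0.8, "scoreOnDescription": 0.3,
                       "clicks": 2, "timeElapsedFromLastUpdate": 0.1}))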

Page 11

PROBLEM SETUP

It's easier with Machine Learning

● 2,000+ parameters (non-linear; the space is factorially larger than the linear form)
● 8,000+ queries that are regularly tuned
● Early on we spent many days hand-tuning...

Page 12

SEARCH PIPELINE (ONLINE)

[Diagram: sources (People, Commodities, News, Other Sources) feed the Index; a User Query retrieves the top-x candidates (x >> k), the ReRanking Model reorders them, and the top-k reranked results are returned.]

Page 13

TRAINING PIPELINE (OFFLINE)

[Diagram: sources (People, Commodities, News, Other Sources) feed the Index; Training Query-Document Pairs drive Feature Extraction; the extracted features feed the Learning Algorithm, which produces the Ranking Model and is evaluated against Metrics.]

Page 14

TRAINING PIPELINE (OFFLINE)

[Training pipeline diagram repeated from Page 13.]

Page 15

TRAINING DATA: IMPLICIT VS EXPLICIT

What is explicit data?
● A set of judges manually assesses the search results for a given query
─ Experts
─ Crowd
● Pros:
─ Data is very clean
● Cons:
─ Can be very expensive!

What is implicit data?
● Infer user preferences from user behavior
─ Aggregated result clicks
─ Query reformulation
─ Dwell time
● Pros:
─ A lot of data!
● Cons:
─ Extremely noisy
─ Privacy concerns

Page 16

TRAINING PIPELINE (OFFLINE)

[Training pipeline diagram repeated from Page 13.]

Page 17

FEATURES

● A feature is an individual measurable property
● Given a query and a collection, we can produce many features for each document in the collection
─ Does the query match the title?
─ Length of the document
─ Number of views
─ How old is it?
─ Can it be visualized on a mobile device?

Page 18

FEATURES

Extract "features": features are signals that give an indication of a result's importance.

Was the result a cofounder? 0

Page 19

FEATURES

Extract "features": features are signals that give an indication of a result's importance.

Query: AAPL US

Was the result a cofounder? 0
Does the document have an exec. position? 1

Page 20

FEATURES

Extract "features": features are signals that give an indication of a result's importance.

Was the result a cofounder? 0
Does the query match the document title? 0
Does the document have an exec. position? 1

Page 21

FEATURES

Extract "features": features are signals that give an indication of a result's importance.

Was the result a cofounder? 0
Does the query match the document title? 0
Does the document have an exec. position? 1
Popularity (%) 0.9

Page 22

FEATURES

Extract "features": features are signals that give an indication of a result's importance.

Was the result a cofounder? 0
Does the query match the document title? 1
Does the document have an exec. position? 0
Popularity (%) 0.6

Page 23

TRAINING PIPELINE (OFFLINE)

[Training pipeline diagram repeated from Page 13.]

Page 24

METRICS

How do we know if our model is doing better?

● Offline metrics
─ Precision / Recall / F1 score
─ nDCG (Normalized Discounted Cumulative Gain; a sketch follows below)
─ Other metrics (e.g., ERR, MAP, ...)
● Online metrics
─ Click-through rate → higher is better
─ Time to first click → lower is better
─ Interleaving¹

¹O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1), 2012.
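For reference, a minimal sketch of nDCG@k as used above; the graded relevance judgments in the example are illustrative, not data from the talk:

import math

def dcg(gains, k):
    # Discounted cumulative gain: relevance discounted by log2 of rank.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    # Normalize by the DCG of the ideal (best possible) ordering.
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2], k=6))  # ~0.96: close to the ideal ordering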

Page 25

TRAINING PIPELINE (OFFLINE)

[Training pipeline diagram repeated from Page 13.]

Page 26

LEARNING TO RANK

● Learn how to combine the features to optimize one or more metrics
● Many learning algorithms
─ RankSVM¹ (a pairwise sketch follows below)
─ LambdaMART²
─ ...

¹T. Joachims. Optimizing Search Engines Using Clickthrough Data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
²C.J.C. Burges. "From RankNet to LambdaRank to LambdaMART: An Overview". Microsoft Research Technical Report MSR-TR-2010-82, 2010.
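To make the pairwise idea behind RankSVM concrete, here is a simplified sketch: learn weights w so that, for each training pair under the same query, the preferred document scores above the other. This uses a plain SGD hinge-loss update rather than the actual RankSVM solver, and the feature vectors are illustrative:

import numpy as np

def pairwise_hinge_step(w, f_pos, f_neg, lr=0.1, margin=1.0):
    # One SGD step on the hinge loss max(0, margin - w . (f_pos - f_neg)).
    diff = f_pos - f_neg
    if margin - w.dot(diff) > 0:   # pair mis-ordered or inside the margin
        w = w + lr * diff          # nudge w toward ordering it correctly
    return w

# Feature vectors: [matchedTitle, isExec, isCofounder, popularity]
pairs = [(np.array([1.0, 0.0, 0.0, 0.6]),    # clicked (preferred) result
          np.array([0.0, 1.0, 0.0, 0.9]))]   # skipped result
w = np.zeros(4)
for _ in range(100):
    for f_pos, f_neg in pairs:
        w = pairwise_hinge_step(w, f_pos, f_neg)
print(w.dot(pairs[0][0]) > w.dot(pairs[0][1]))  # True: pair now ordered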

Page 27

SEARCH PIPELINE: STANDARD

[Diagram: sources (People, Commodities, News, Other Sources) feed the Index inside Solr; a User Query retrieves the top-k results.]

Page 28

SEARCH PIPELINE: STANDARD

[Diagram: the standard pipeline, plus an offline path where Training Data feeds a Learning Algorithm that produces a Ranking Model.]

Page 29

SEARCH PIPELINE: STANDARD

[Diagram: the Ranking Model now runs online, reranking the top-x retrieved results.]

Page 30

SEARCH PIPELINE: SOLR INTEGRATION

[Diagram: the same pipeline, with the online Ranking Model integrated into Solr.]

Page 31

SOLR RELEVANCY

● Pros
─ Simple and quick score computation
─ Phrase matching
─ Function query boosting on time, distance, popularity, etc.
─ Customized fields for stemming, synonyms, etc.
● Cons
─ Creating a well-tuned query takes a lot of manual time
─ Weights are brittle and may not remain valid as more documents or fields are added

Page 32

LTR PLUGIN: GOALS

● Don't tune the relevancy manually!
─ Use machine learning to power automatic relevancy tuning
● Significant relevancy improvements
● Comparable scores across collections
─ Including collections of different sizes
● Low latency
─ Reuse the vast search functionality already built into Solr
─ Less data transport
● Simple use of domain knowledge to rapidly create features
─ Features are scripted rather than coded

Page 33

STANDARD SOLR SEARCH REQUEST

[Diagram: a User Query against the Index returns the top-k results.]

Page 34

STANDARD SOLR SEARCH REQUEST

[Diagram: a Solr Query against a 10-million-document Index matches 10k documents, scores those 10k, and returns the top 10.]

Page 35

LTR SOLR SEARCH REQUEST

[Diagram: the Solr Query matches 10k of the 10 million indexed documents and scores them; the top 1,000 are retrieved, and the LTR Query applies the Ranking Model to rerank them, returning the top 10.]

Page 36

<!-- Query parser used to rerank top docs with a provided model -->
<queryParser name="ltr" class="org.apache.solr.ltr.ranking.LTRQParserPlugin" />

LTR PLUGIN: RERANKING

● LTRQuery extends Solr's RankQuery
─ Wraps the main query to fetch initial results
─ Returns a custom TopDocsCollector for the reranked, ordered results
● Solr rerank request parameter (a request sketch follows below):
rq={!ltr model=myModel1 reRankDocs=100 efi.user_query='james' efi.my_var=123}
─ !ltr – name registered in solrconfig.xml for the LTRQParserPlugin
─ model – name of the deployed model to use for reranking
─ reRankDocs – total number of documents to rerank
─ efi.* – custom parameters used to pass external feature information for your features to use
• Query intent
• Personalization
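As a usage sketch, the same rerank request issued over HTTP with Python's requests library; the host, collection, query, and field list are assumptions, not values from the talk:

import requests

params = {
    "q": "james",
    "fl": "id,name,score",
    # Rerank the top 100 results with the deployed model, passing external
    # feature information (efi.*) that features can reference.
    "rq": "{!ltr model=myModel1 reRankDocs=100 "
          "efi.user_query='james' efi.my_var=123}",
    "wt": "json",
}
resp = requests.get("http://yoursolrserver/solr/collection/select",
                    params=params)
print(resp.json()["response"]["docs"])  # documents reranked by the model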

Page 37

SEARCH PIPELINE (ONLINE)

[Diagram: the LTR request flow from Page 35, with Feature Extraction feeding the Ranking Model during reranking.]

Page 38

{
    "name": "Tim Cook",
    "primary_position": "ceo",
    "category": "person",
    ...
}

FEATURES

Extract "features": features are signals that give an indication of a result's importance.

Was the result a cofounder? 0
Does the query match the document title? 0
Does the document have an exec. position? 1
Popularity (%) 0.9

Page 39

LTR PLUGIN: FEATURES BEFORE

Page 40

LTR PLUGIN: FEATURES AFTER

[
    {
        "name": "isPersonAndExecutive",
        "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
        "params": {
            "fq": [
                "{!terms f=category}person",
                "{!terms f=primary_position}ceo,cto,cfo,president"
            ]
        }
    },
    ...
]

Page 41

LTR PLUGIN: FUNCTION QUERIES

[
    {
        "name": "documentRecency",
        "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
        "params": {
            "q": "{!func}recip(ms(NOW,publish_date),3.16e-11,1,1)"
        }
    },
    ...
]

1 for docs dated now, 1/2 for docs dated 1 year ago, 1/3 for docs dated 2 years ago, etc.
See http://wiki.apache.org/solr/FunctionQuery#Date_Boosting
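A quick sanity check of that decay, assuming Solr's recip(x, m, a, b) computes a / (m*x + b) with x = ms(NOW, publish_date):

MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # ~3.156e10 ms

def recip(x, m, a, b):
    return a / (m * x + b)

for years in (0, 1, 2):
    print(years, round(recip(years * MS_PER_YEAR, 3.16e-11, 1, 1), 3))
# 0 -> 1.0, 1 -> ~0.501, 2 -> ~0.334, matching the note above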

Page 42

LTR PLUGIN: FEATURE STORE

● FeatureStore is a Solr Managed Resource
─ REST API endpoint for performing CRUD operations on Solr objects
─ Stored and maintained in ZooKeeper
● Deploy
─ curl -XPUT 'http://yoursolrserver/solr/collection/config/fstore' --data-binary @./features.json -H 'Content-type:application/json'
● View
─ http://yoursolrserver/solr/collection/config/fstore

Page 43

LTR PLUGIN: FEATURES

● Simplifies feature engineering through a configuration file
● Utilizes the rich search functionality built into Solr
─ Phrase matching
─ Synonyms, stemming, etc.
● Inherit the Feature class for specialized features

Page 44

SEARCH PIPELINE (ONLINE)

[Diagram repeated from Page 37.]

Page 45

TRAINING PIPELINE (OFFLINE)

[Diagram: Training Queries run against the 10-million-document Index; 10k matches are scored, the top 1,000 are retrieved, Feature Extraction produces feature vectors, and the Learning Algorithm outputs the Ranking Model.]

Page 46

[Feature-extraction example repeated from Page 38.]

Page 47

<!-- Document transformer adding feature vectors with each retrieved document -->
<transformer name="fv" class="org.apache.solr.ltr.ranking.LTRFeatureTransformer" />

LTR PLUGIN: FEATURE EXTRACTION

● Feature extraction uses Solr's TransformerFactory
─ Returns a custom field with each document
● fl=*,[fv]

{
    "name": "Tim Cook",
    "primary_position": "ceo",
    "category": "person",
    ...
    "[fv]": "isCofounder:0.0,isPersonAndExecutive:1.0,matchTitle:0.0,popularity:0.9"
}
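A sketch of pulling those feature vectors over HTTP, for example to dump training data offline; the host, collection, and query are assumptions:

import requests

params = {
    "q": "tim cook",
    "fl": "*,[fv]",  # ask the transformer to append the feature vector
    "wt": "json",
}
resp = requests.get("http://yoursolrserver/solr/collection/select",
                    params=params)
for doc in resp.json()["response"]["docs"]:
    # e.g. "isCofounder:0.0,isPersonAndExecutive:1.0,..."
    print(doc["name"], doc["[fv]"])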

Page 48

LTR PLUGIN: MODEL

{
    "type": "org.apache.solr.ltr.ranking.LambdaMARTModel",
    "name": "mymodel1",
    "features": [
        { "name": "matchedTitle" },
        { "name": "isPersonAndExecutive" }
    ],
    "params": {
        "trees": [
            {
                "weight": 1,
                "tree": {
                    "feature": "matchedTitle",
                    "threshold": 0.5,
                    "left": { "value": -100 },
                    "right": {
                        "feature": "isPersonAndExecutive",
                        "threshold": 0.5,
                        "left": { "value": 50 },
                        "right": { "value": 75 }
                    }
                }
            }
        ]
    }
}
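To make the JSON concrete, here is a sketch of how such a model scores a document: descend each tree (left when the feature value is at or below the threshold, right otherwise) and sum the leaf values weighted by each tree's weight. This illustrates the structure above, not the plugin's actual traversal code:

def score_tree(node, features):
    while "value" not in node:  # descend until a leaf is reached
        f = features.get(node["feature"], 0.0)
        node = node["left"] if f <= node["threshold"] else node["right"]
    return node["value"]

def score(trees, features):
    return sum(t["weight"] * score_tree(t["tree"], features) for t in trees)

trees = [{"weight": 1, "tree": {
    "feature": "matchedTitle", "threshold": 0.5,
    "left": {"value": -100},
    "right": {"feature": "isPersonAndExecutive", "threshold": 0.5,
              "left": {"value": 50}, "right": {"value": 75}}}}]

print(score(trees, {"matchedTitle": 1.0, "isPersonAndExecutive": 1.0}))  # 75
print(score(trees, {"matchedTitle": 0.0, "isPersonAndExecutive": 1.0}))  # -100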

Page 49

LTR PLUGIN: MODEL

● ModelStore is also a Solr Managed Resource
● Deploy
─ curl -XPUT 'http://yoursolrserver/solr/collection/config/mstore' --data-binary @./model.json -H 'Content-type:application/json'
● View
─ http://yoursolrserver/solr/collection/config/mstore
● Inherit from the Model class for new scoring algorithms
─ score()
─ explain()

Page 50

LTR PLUGIN: EVALUATION

● Offline metrics
─ nDCG increased approximately 10% after reranking
● Online metrics
─ Clicks @ 1 up by approximately 10%

Page 51

BEFORE AND AFTER

Query: "unemployment"

[Side-by-side results: Solr Ranking vs. Machine Learned Reranking]

Page 52

LTR PLUGIN: EVALUATION

● Offline metrics
─ nDCG increased approximately 10% after reranking
● Online metrics
─ Clicks @ 1 up by approximately 10%
● Performance
─ About 30% faster than the previous external ranking system

Test setup: 10 million documents in the collection; 100k queries; 1k features; 1k documents reranked per query

Page 53

LTR PLUGIN: BENEFITS

● Simpler feature engineering, without compiling
● Access to rich internal Solr search functionality for feature building
● Search result relevancy improvements over regular Solr relevance
● Automatic relevancy tuning
● Comparable scores across collections
● Performance benefits vs. the external ranking system

Page 54

FUTURE WORK

● Continue work to open-source the plugin
● Support pipelining multiple reranking models
● Allow a simple ranking model to be used in the first pass

Page 55

QUESTIONS?