LEARNING TO RANK FOR SOLR
Michael Nilsson – Software Engineer
Diego Ceccarelli – Software Engineer
Joshua Pantony – Software Engineer
Bloomberg LP
Copyright 2015 Bloomberg L.P. All rights reserved.
OUTLINE ● Search at Bloomberg
● Why do we need machine learning for search?
● Learning to Rank
● Solr Learning to Rank Plugin
8 million searches PER DAY
1 million PER DAY
400 million stories in the index
SOLR IN BLOOMBERG ● Search engine of choice at Bloomberg
─ Large community / Well distributed committers
─ Open source Apache Project
─ Used within many commercial products
─ Large feature set and rapid growth
● Committed to open-source ─ Ability to contribute to core engine
─ Ability to fix bugs ourselves
─ Contributions in almost every Solr release since 4.5.0
PROBLEM SETUP
[Screenshot: two search results, scored 30 and 1.0]
Score = 100 * scoreOnTitle + 10 * scoreOnDescription
[Screenshot: the same results, now scored 52.2 and 30.8]
Score = 150 * scoreOnTitle + 3.14 * scoreOnDescription + 42 * clicks
Score = 99.9 * scoreOnTitle + 3.1114 * scoreOnDescription + 42.42 * clicks + 5 * timeElapsedFromLastUpdate
● It's hard to manually tweak the ranking
  ─ You must be an expert in the domain
  ─ … or a magician
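The hand-tuned formulas above are just weighted sums of per-document signals. A minimal sketch, using the slide's illustrative weights (not production values):

```python
# Hand-tuned linear scoring, as in the formulas above.
# The weights are the slide's illustrative values, not production numbers.
def score(score_on_title, score_on_description, clicks, time_since_update):
    return (99.9 * score_on_title
            + 3.1114 * score_on_description
            + 42.42 * clicks
            + 5 * time_since_update)
```

Every new signal adds another weight to balance against all the others, which is what makes manual tuning so brittle.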
query = solr
query = lucene
query = austin
query = bloomberg
query = …
It's easier with Machine Learning
● 2,000+ parameters (non-linear, factorially larger than the linear form)
● 8,000+ queries that are regularly tuned
● Early on, we spent many days hand-tuning…
SEARCH PIPELINE (ONLINE)
[Diagram: a user query goes to the index (people, commodities, news, other sources); top-x retrieval (x >> k) feeds a reranking model, which returns the top-k reranked results]
TRAINING PIPELINE (OFFLINE)
[Diagram: training query–document pairs are run against the index (people, commodities, news, other sources); feature extraction feeds a learning algorithm, which produces a ranking model that is evaluated with metrics]
TRAINING DATA: IMPLICIT VS EXPLICIT
What is explicit data?
● A set of judges manually assess the search results given a query
  ─ Experts
  ─ Crowd
● Pros: data is very clean
● Cons: can be very expensive!
What is implicit data?
● Infer user preferences from user behavior
  ─ Aggregated result clicks
  ─ Query reformulation
  ─ Dwell time
● Pros: a lot of data!
● Cons: extremely noisy; privacy concerns
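As a rough sketch of how implicit data might be turned into training labels, one could aggregate clicks into a per-document click-through rate, dropping query–document pairs with too few impressions to dampen the noise. The helper name and threshold below are illustrative, not Bloomberg's actual pipeline:

```python
from collections import Counter

def judgments_from_clicks(click_log, min_impressions=10):
    """click_log: iterable of (query, doc_id, clicked: bool) events.
    Returns {(query, doc_id): click-through rate} for pairs seen often
    enough; rarely-seen pairs are dropped as too noisy to label."""
    impressions, clicks = Counter(), Counter()
    for query, doc, clicked in click_log:
        impressions[(query, doc)] += 1
        if clicked:
            clicks[(query, doc)] += 1
    return {pair: clicks[pair] / n
            for pair, n in impressions.items() if n >= min_impressions}
```

Real systems must also correct for position bias (top results get clicked regardless of relevance), which is part of why implicit data is "extremely noisy".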
FEATURES
● A feature is an individual measurable property
● Given a query and a collection, we can produce many features for each document in the collection
  ─ Does the query match the title?
  ─ Length of the document
  ─ Number of views
  ─ How old is it?
  ─ Can it be visualized on a mobile device?
FEATURES
Extract "features" — features are signals that give an indication of a result's importance
Query: AAPL US
                                           Doc 1   Doc 2
Was the result a cofounder?                  0       0
Does the query match the document title?     0       1
Does the document have an exec. position?    1       0
Popularity (%)                              0.9     0.6
METRICS
How do we know if our model is doing better?
● Offline metrics
  ─ Precision / Recall / F1 score
  ─ nDCG (Normalized Discounted Cumulative Gain)
  ─ Other metrics (e.g., ERR, MAP, …)
● Online metrics
  ─ Click-through rate → higher is better
  ─ Time to first click → lower is better
  ─ Interleaving¹
¹O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1), 2012.
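nDCG, mentioned above, compares a ranking's position-discounted gain to that of the ideal ordering of the same documents. A compact reference implementation of the standard formulation:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the standard log2 position discount:
    sum of rel_i / log2(i + 2) over ranks i = 0, 1, 2, ..."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; putting a relevant document lower drops the score, with early positions weighted most heavily.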
LEARNING TO RANK
● Learn how to combine features to optimize one or more metrics
● Many learning algorithms
  ─ RankSVM¹
  ─ LambdaMART²
  ─ …
¹T. Joachims. Optimizing Search Engines Using Clickthrough Data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
²C.J.C. Burges. "From RankNet to LambdaRank to LambdaMART: An Overview". Microsoft Research Technical Report MSR-TR-2010-82, 2010.
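Pairwise methods like RankSVM reduce ranking to classification over document pairs: within each query, every (preferred, less-preferred) pair yields one training example whose feature vector is the difference of the two. A sketch of that standard transform (the data layout here is illustrative):

```python
def pairwise_transform(query_groups):
    """query_groups: list of per-query lists of (feature_vector, label).
    For each query, every pair where doc i has a higher label than doc j
    becomes a positive example with features f_i - f_j."""
    X, y = [], []
    for docs in query_groups:
        for (fi, li) in docs:
            for (fj, lj) in docs:
                if li > lj:  # doc i is preferred over doc j
                    X.append([a - b for a, b in zip(fi, fj)])
                    y.append(1)
    return X, y
```

A linear classifier trained on these differences learns a weight vector whose dot product with a document's features serves as its ranking score.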
SEARCH PIPELINE: STANDARD
[Diagram: a user query goes to Solr, which performs top-k retrieval from the index (people, commodities, news, other sources)]
[Diagram: offline, training data feeds a learning algorithm that produces a ranking model]
[Diagram: online, the ranking model reranks the top-x retrieved documents]
SEARCH PIPELINE: SOLR INTEGRATION
[Diagram: the same pipeline, with the ranking model running inside Solr]
SOLR RELEVANCY
● Pros
  ─ Simple and quick scoring computation
  ─ Phrase matching
  ─ Function query boosting on time, distance, popularity, etc.
  ─ Customized fields for stemming, synonyms, etc.
● Cons
  ─ Lots of manual effort to create a well-tuned query
  ─ Weights are brittle, and may break as more documents or fields are added
LTR PLUGIN: GOALS
● Don't tune the relevancy manually!
  ─ Use machine learning to power automatic relevancy tuning
● Significant relevancy improvements
● Allow comparable scores across collections
  ─ Collections of different sizes
● Maintain low latency
  ─ Re-use the vast Solr search functionality that is already built in
  ─ Less data transport
● Make it simple to use domain knowledge to rapidly create features
  ─ Features are no longer coded but rather scripted
STANDARD SOLR SEARCH REQUEST
[Diagram: a Solr query against a 10-million-document index finds 10k matches, scores all 10k, and retrieves the top 10]
LTR SOLR SEARCH REQUEST
[Diagram: the Solr query again finds and scores 10k matches, retrieves the top 1000, and an LTR query applies the ranking model to rerank them into the top 10]
<!-- Query parser used to rerank top docs with a provided model --> <queryParser name="ltr" class="org.apache.solr.ltr.ranking.LTRQParserPlugin" />
LTR PLUGIN: RERANKING
● LTRQuery extends Solr's RankQuery
  ─ Wraps the main query to fetch initial results
  ─ Returns a custom TopDocsCollector for reranked, ordered results
● Solr rerank request parameter:
  rq={!ltr model=myModel1 reRankDocs=100 efi.user_query='james' efi.my_var=123}
  ─ !ltr – name used in solrconfig.xml for the LTRQParserPlugin
  ─ model – name of the deployed model to use for reranking
  ─ reRankDocs – total number of documents to rerank
  ─ efi.* – custom parameters used to pass external feature information for your features to use
    • Query intent
    • Personalization
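Since rq uses Solr local-params syntax, the braces and bang must be URL-encoded when the request is assembled by hand. A small illustration (the server would be queried with this string; model and query values are the slide's placeholders):

```python
from urllib.parse import urlencode

# Hypothetical rerank request parameters, mirroring the rq example above.
params = {
    "q": "james",
    "rq": "{!ltr model=myModel1 reRankDocs=100 efi.user_query='james'}",
    "fl": "*,score",
}
# urlencode percent-escapes the local-params syntax ({ -> %7B, ! -> %21, ...)
query_string = urlencode(params)
```

Most Solr client libraries do this encoding automatically; it only matters when constructing the URL manually, e.g. for curl.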
SEARCH PIPELINE (ONLINE)
[Diagram: at query time, the top 1000 of 10k scored matches are retrieved; feature extraction runs on them, and the ranking model returns the top 10 reranked]
{ "name": "Tim Cook", "primary_position": "ceo", "category": "person", … }
LTR PLUGIN: FEATURES BEFORE
[Code for a hand-coded feature implementation (image not recovered)]
LTR PLUGIN: FEATURES AFTER
[
  {
    "name": "isPersonAndExecutive",
    "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
    "params": {
      "fq": [
        "{!terms f=category}person",
        "{!terms f=primary_position}ceo, cto, cfo, president"
      ]
    }
  },
  …
]
LTR PLUGIN: FUNCTION QUERIES
[
  {
    "name": "documentRecency",
    "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
    "params": {
      "q": "{!func}recip(ms(NOW,publish_date), 3.16e-11, 1, 1)"
    }
  },
  …
]
1 for docs dated now, 1/2 for docs dated 1 year ago, 1/3 for docs dated 2 years ago, etc.
See http://wiki.apache.org/solr/FunctionQuery#Date_Boosting
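Solr's recip(x, m, a, b) evaluates to a / (m*x + b). A quick check of the slide's claim, where m = 3.16e-11 is roughly the reciprocal of the milliseconds in a year, so the boost is 1 now, about 1/2 at one year, about 1/3 at two years:

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # ~3.156e10 ms, hence m ~ 3.16e-11

def recip(x, m, a, b):
    """Solr function query recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)

def recency_boost(age_ms):
    """Boost from the documentRecency feature above: recip over doc age."""
    return recip(age_ms, 3.16e-11, 1, 1)
```

Because the decay is hyperbolic rather than linear, fresh documents are separated sharply while old documents all get small, slowly shrinking boosts.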
LTR PLUGIN: FEATURE STORE
● FeatureStore is a Solr Managed Resource
  ─ REST API endpoint for performing CRUD operations on Solr objects
  ─ Stored and maintained in ZooKeeper
● Deploy
  ─ curl -XPUT 'http://yoursolrserver/solr/collection/config/fstore' --data-binary @./features.json -H 'Content-type:application/json'
● View
  ─ http://yoursolrserver/solr/collection/config/fstore
LTR PLUGIN: FEATURES
● Simplifies feature engineering through a configuration file
● Utilizes the rich search functionality built into Solr
  ─ Phrase matching
  ─ Synonyms, stemming, etc.
● Inherit the Feature class for specialized features
TRAINING PIPELINE (OFFLINE)
[Diagram: training queries are run against the 10-million-document index; the top 1000 of 10k scored matches go through feature extraction, and the learning algorithm produces the ranking model]
<!-- Document transformer adding feature vectors with each retrieved document --> <transformer name="fv" class= "org.apache.solr.ltr.ranking.LTRFeatureTransformer" />
LTR PLUGIN: FEATURE EXTRACTION
● Feature extraction uses Solr’s TransformerFactory ─ Returns a custom field with each document
● fl = *,[fv]
{
  "name": "Tim Cook",
  "primary_position": "ceo",
  "category": "person",
  …
  "[fv]": "isCofounder:0.0, isPersonAndExecutive:1.0, matchTitle:0.0, popularity:0.9"
}
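The [fv] field arrives as a comma-separated name:value string. A small helper to parse it back into numbers for a training pipeline (this helper is illustrative, not part of the plugin):

```python
def parse_feature_vector(fv):
    """Parse an '[fv]' string like 'isCofounder:0.0, matchTitle:1.0'
    into a {name: float} dict."""
    out = {}
    for pair in fv.split(","):
        # rpartition tolerates feature names that themselves contain ':'
        name, _, value = pair.strip().rpartition(":")
        out[name] = float(value)
    return out
```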
LTR PLUGIN: MODEL
{
  "type": "org.apache.solr.ltr.ranking.LambdaMARTModel",
  "name": "mymodel1",
  "features": [
    { "name": "matchedTitle" },
    { "name": "isPersonAndExecutive" }
  ],
  "params": {
    "trees": [
      {
        "weight": 1,
        "tree": {
          "feature": "matchedTitle",
          "threshold": 0.5,
          "left": { "value": -100 },
          "right": {
            "feature": "isPersonAndExecutive",
            "threshold": 0.5,
            "left": { "value": 50 },
            "right": { "value": 75 }
          }
        }
      }
    ]
  }
}
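To make the tree JSON concrete, here is a sketch of how such an ensemble scores a document, assuming a feature value at or below the threshold goes left and the final score is the weight-summed output of all trees (the plugin's own Java implementation is authoritative):

```python
def score_node(node, features):
    """Walk one regression tree: a node either holds a leaf 'value' or
    splits on feature <= threshold (left) vs. > threshold (right)."""
    if "value" in node:
        return node["value"]
    branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
    return score_node(node[branch], features)

def score_model(model, features):
    """Weighted sum of tree outputs, per the model JSON layout above."""
    return sum(t["weight"] * score_node(t["tree"], features)
               for t in model["params"]["trees"])
```

For the single tree above, a title match plus an executive position yields 75, a title match alone 50, and no title match -100.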
LTR PLUGIN: MODEL
● ModelStore is also a Solr Managed Resource
● Deploy
  ─ curl -XPUT 'http://yoursolrserver/solr/collection/config/mstore' --data-binary @./model.json -H 'Content-type:application/json'
● View
  ─ http://yoursolrserver/solr/collection/config/mstore
● Inherit from the Model class for new scoring algorithms
  ─ score()
  ─ explain()
BEFORE AND AFTER
Query: "unemployment"
[Screenshots: Solr ranking vs. machine-learned reranking]
LTR PLUGIN: EVALUATION
● Offline metrics
  ─ nDCG increased approximately 10% after reranking
● Online metrics
  ─ Clicks @ 1 up by approximately 10%
● Performance
  ─ About 30% faster than the previous external ranking system
  ─ Setup: 10 million documents in the collection, 100k queries, 1k features, 1k documents reranked per query
LTR PLUGIN: BENEFITS
● Simpler feature engineering, without compiling
● Access to rich internal Solr search functionality for feature building
● Search result relevancy improvements vs. regular Solr relevancy
● Automatic relevancy tuning
● Compatible scores across collections
● Performance benefits vs. an external ranking system
FUTURE WORK ● Continue work to open source the plugin
● Support pipelining multiple reranking models
● Allow a simple ranking model to be used in the first pass
QUESTIONS?