BM25 is so Yesterday
Modern Techniques for Better Search Relevance in Solr
Grant Ingersoll, CTO, Lucidworks
Lucene/Solr/Mahout Committer
Index Time:
if (doc.name.contains("Vikings")) {
  doc.boost = 100
}
OR
Query Time:
q:(MAIN QUERY) OR (name:Vikings)^Y
⢠TermFrequency:âHowwellatermdescribesadocument?â⢠Measure:howoftenatermoccursperdocument
⢠InverseDocumentFrequency:âHowimportantisatermoverall?â⢠Measure:howrarethetermisacrossalldocuments
TF*IDF
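As a rough illustration of the two measures, the sketch below scores a toy corpus with TF*IDF. It assumes the Lucene-style definitions that appear in this deck (tf as the square root of the raw count, idf as 1 + log(numDocs / (docFreq + 1))). The corpus, the whitespace tokenization, and the function names are hypothetical.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: square root of the term's raw count in the document."""
    return math.sqrt(Counter(doc_tokens)[term])

def idf(term, corpus):
    """Inverse document frequency: terms found in fewer documents score higher."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))

def tf_idf(query_tokens, doc_tokens, corpus):
    """Sum tf * idf over the query terms for one document."""
    return sum(tf(t, doc_tokens) * idf(t, corpus) for t in query_tokens)

# Hypothetical toy corpus, already tokenized on whitespace.
corpus = [
    "vikings win the game".split(),
    "the vikings sailed north".split(),
    "the game was long".split(),
    "north wind".split(),
]

if __name__ == "__main__":
    for doc in corpus:
        print(" ".join(doc), "->", round(tf_idf(["vikings", "game"], doc, corpus), 3))
```

A document that contains both query terms outranks one that contains only one of them, and a term that appears in most documents ("the") contributes less per occurrence than a rare one ("wind").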
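$$\mathrm{Score}(q,d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k+1)}{\mathrm{tf}(t \in d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

Where: t = term; d = document; q = query; i = index
$\mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$
$\mathrm{idf}(t) = 1 + \log\left(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\right)$
$|d| = \sum_{t \in d} 1$
$\mathrm{avgdl} = \left(\sum_{d \in i} |d|\right) / \left(\sum_{d \in i} 1\right)$
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.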
BM25 (aka Okapi)
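The BM25 formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's implementation: it follows the slide's definitions of tf, idf, |d| and avgdl, and the function name, toy corpus, and default values k=1.2, b=0.75 are assumptions drawn from the ranges quoted on the slide.

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k=1.2, b=0.75):
    """Score one document against a query using the slide's BM25 formula."""
    # avgdl: average document length (in tokens) across the index.
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for t in query_tokens:
        # tf and idf as defined on the slide.
        tf = math.sqrt(doc_tokens.count(t))
        doc_freq = sum(1 for d in corpus if t in d)
        idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))
        # Length normalization: b controls how much long documents are penalized.
        norm = k * (1.0 - b + b * len(doc_tokens) / avgdl)
        score += idf * (tf * (k + 1)) / (tf + norm)
    return score

# Hypothetical toy index: one short and one long document mention "vikings".
short_doc = ["vikings", "win"]
long_doc = ["vikings"] + ["filler"] * 9
other_doc = ["packers", "lose"]
corpus = [short_doc, long_doc, other_doc]

if __name__ == "__main__":
    print(bm25_score(["vikings"], short_doc, corpus))
    print(bm25_score(["vikings"], long_doc, corpus))
```

With b > 0 the short document outscores the long one for the same single match. With b = 0 document length is ignored and both score the same. However often a term repeats, its contribution stays below idf(t) · (k + 1), which is the saturation behavior k controls.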
⢠Captureandlogprettymucheverything⢠Searches,Timeonpage/1stclick,Whatwasnotchosen,etc.
⢠PrecisionâOfthoseshown,whatâsrelevant?⢠RecallâOfallthatâsrelevant,whatwasfound?⢠NDCGâAccountforposition
Measure, Measure, Measure
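The three metrics named on this slide can be computed from a ranked result list and a set of relevance judgments. The sketch below is one common formulation (NDCG with a log2 position discount and linear gains). The function names and sample data are hypothetical, and other gain or discount variants exist.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Of the top-k results shown, what fraction is relevant?"""
    shown = ranked_ids[:k]
    if not shown:
        return 0.0
    return sum(1 for d in shown if d in relevant_ids) / len(shown)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Of everything relevant, what fraction appears in the top k?"""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def dcg(gains):
    """Discounted cumulative gain: later positions count for less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, gains_by_id, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    actual = [gains_by_id.get(d, 0.0) for d in ranked_ids[:k]]
    ideal = sorted(gains_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0

if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]          # what the engine returned
    gains = {"a": 3.0, "c": 1.0, "e": 2.0}  # graded judgments; "e" was never shown
    print(precision_at_k(ranked, set(gains), 4))
    print(recall_at_k(ranked, set(gains), 4))
    print(ndcg_at_k(ranked, gains, 4))
```

Precision and recall treat the result list as a set, so swapping two results within the top k leaves them unchanged. NDCG is the one that changes when a relevant document moves up or down.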
Magic
Guessing
Core Information Theory (aka Lucene/Solr)
Search Aids (Facets, Did You Mean, Highlighting)
Machine Learning (Clicks, Recs, Personalization, User feedback)
Rules, Domain Specific Knowledge
fuhgeddaboudit
Content
- Core Solr capabilities: text matching, faceting, spell checking, highlighting
- Business Rules for content: landing pages, boost/block, promotions, etc.
Collaboration
- Leverage collective intelligence to predict what users will do based on historical, aggregated data
- Recommenders, Popularity, Search Paths
Context
- Who are you? Where are you? What have you done previously?
- User/Market Segmentation, Roles, Security, Personalization
Next Generation Relevance
But What About the Real World? Indexing Edition
NER, Topic Detection, Clustering
Word2Vec, etc.
Domain Rules: Synonyms, Regexes, Lexical Resources
Extraction
Load Into Spark
Build W2V, PageRank, Topic, Clustering Models
Offline
Content
Models
But What About the Real World? Query Edition
Query Intent: Strategic, Tactical, Semantic
iPad case
Head/Tail/Clickstream enhancement
User Factors: Segmentation, Location, History, Profile, Security
Parse
Domain Specific Rules
Transform Results
…
Cascading Rerankers: Learn To Rank (multi-model), Bias corrections
But What About the Real World? Signals Edition
Load Into Spark
Clickstream Models
Signals
Query Analysis Jobs
Recommenders/Personalization
iPad case
Query Edition
Raw
Models
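One way to picture the raw-signals-to-models step is a simple aggregation of click events into per-query document boosts. The deck runs this kind of job in Spark. The sketch below is a single-process Python stand-in, and the record shape (query, doc_id), the normalization, and the function name are all assumptions rather than anything the slides specify.

```python
from collections import Counter, defaultdict

def click_boosts(signals, top_n=3):
    """Aggregate (query, doc_id) click events into {query: [(doc_id, weight), ...]}.

    Weight is the share of that query's clicks that landed on the document.
    """
    counts = defaultdict(Counter)
    for query, doc_id in signals:
        # Light normalization so "iPad case" and "ipad case " count as one query.
        counts[query.strip().lower()][doc_id] += 1
    boosts = {}
    for query, per_doc in counts.items():
        total = sum(per_doc.values())
        boosts[query] = [(doc, n / total) for doc, n in per_doc.most_common(top_n)]
    return boosts

if __name__ == "__main__":
    # Hypothetical raw click log.
    raw = [
        ("iPad case", "sku1"),
        ("ipad case", "sku1"),
        ("ipad case ", "sku2"),
        ("tv", "sku9"),
    ]
    print(click_boosts(raw))
```

The resulting weights could feed query-time boosts for head queries. Tail queries with few clicks would need a different treatment, such as rewriting them toward a similar head query.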
(Exact/Original Match)^X (Sloppy Phrase)~M^Y
(AND Q)^Z (OR Q)^XX
(Expansions/Click/Head/Tail Boosts)^YY (Personalization Biases)^ZZ
({!ltr …})
Filters + Options: security, rules, hard preferences, categories
The Perfect(?!?) Query* YMMV!
Precision
Recall
Caveat Emptor!
*Note: there are a lot of variations on this. edismax handles most
Learn to Rank
X > Y > Z > XX
All weights can be learned
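The first four clauses of this query shape (exact match, sloppy phrase, AND, OR, with X > Y > Z > XX) can be sketched as a small string builder. This is an illustration only: the field name `name`, the default weights, and the slop value are hypothetical, and the expansion, personalization, and `{!ltr …}` clauses are left out because the slide does not spell them out.

```python
def layered_query(user_query, field="name", slop=3, weights=(10, 5, 2, 1)):
    """Build a Lucene-syntax query with precision clauses boosted above recall clauses.

    weights = (X, Y, Z, XX) for exact phrase, sloppy phrase, AND query, OR query.
    Note: special characters in user_query are NOT escaped in this sketch.
    """
    x, y, z, xx = weights
    if not (x > y > z > xx):
        raise ValueError("expected weights to satisfy X > Y > Z > XX")
    terms = user_query.split()
    if not terms:
        raise ValueError("empty query")
    phrase = " ".join(terms)
    and_q = " AND ".join(terms)
    or_q = " OR ".join(terms)
    clauses = [
        f'{field}:"{phrase}"^{x}',         # exact/original match
        f'{field}:"{phrase}"~{slop}^{y}',  # sloppy phrase
        f"{field}:({and_q})^{z}",          # all terms required
        f"{field}:({or_q})^{xx}",          # any term, for recall
    ]
    return " OR ".join(clauses)

if __name__ == "__main__":
    print(layered_query("iPad case"))
```

The weights are hard-coded here for readability. The slide's point is that they can be learned, so in practice they would come from experiments or a trained model rather than being hand-picked.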
⢠Donâttakemywordforit,experiment!⢠Goodprimer:
⢠http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics
⢠Rulesarefine,aslongasthearecontained,havealifespanandaremeasuredforeffectiveness
Experimentation, Not Editorialization
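A controlled experiment on a relevance change often comes down to comparing click-through rates between a control and a variant. The sketch below uses a pooled two-proportion z-test, which is one standard choice and not something the slides prescribe. The function name and the sample counts are hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for the difference in click-through rate B - A."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both arms share one CTR.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, nothing to distinguish
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

if __name__ == "__main__":
    # Hypothetical counts: control ranking vs. a new boost rule.
    z, p = two_proportion_z_test(clicks_a=100, n_a=1000, clicks_b=130, n_b=1000)
    print(f"z={z:.2f}, p={p:.3f}")
```

The same check applies to a hand-written rule: if its arm does not beat control over its intended lifespan, that is evidence for retiring it.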
⢠But Wait, Thereâs More!
Fusion Architecture
SECURITY BUILT-IN
Apache Solr: Shards
Apache Zookeeper: ZK 1, ZK N
Leader Election, Load Balancing, Shared Config Management
Apache Spark: Cluster Manager, Workers
REST API
Admin UI
Twigkit
LOGS, FILE, WEB, DATABASE, CLOUD
HDFS (Optional)
Core Services
Connectors
ETL and Query Pipelines
Recommenders/Signals/Rules
NLP
Machine Learning
Alerting and Messaging
Security
Scheduling
Key Features
Apache Solr: Shards
Apache Spark: Cluster Manager, Workers
• Solr:
  - Extensive Text Ranking Features
  - Similarity Models
  - Function Queries
  - Boost/Block
  - Pluggable Reranker
  - Learn to Rank contrib
  - Multi-tenant
• Spark
  - Spark ML (Random Forests, Regression, etc.)
  - Large scale, distributed compute
Demo Details
⢠Best Buy Kaggle Competition Data Set
- Product Catalog: ~1.3M
- Signals: 1 month of query, document logs
⢠Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib
⢠Twigkit UI (http://twigkit.com)
⢠http://lucidworks.com⢠http://lucene.apache.org/solr⢠http://spark.apache.org/⢠https://github.com/lucidworks/spark-solr⢠https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank
⢠BloombergtalkonLTRhttps://www.youtube.com/watch?v=M7BKwJoh96s
Resources