An Introduction to Basics of Search and Relevancy with Apache Solr
Introduction to basics of Search and Relevancy with Apache Solr
FEATURING: Mark Bennett, CTO
Lucid Imagination, Inc. 12/2/2009
Agenda
• Prerequisites: Browser Tricks
• Web “Command Line”
• The DisMax Parser
• Boosting Formula
• Explaining “Explain”
• Check Your Index!
• Q & A
• Resources / About NIE
Prerequisite: Some Browser Tricks
Browsers Matter – install them all!
Firefox:
• Default XML rendering (also some versions of IE)
• Lots of plugins
IE and Safari:
• Better “Explain” copy & paste (maintains line breaks)
• Better table copy and paste
Larger Firefox “Command Line”
Customize the Firefox URL box as a command line in 3 easy steps:
1. Toolbar: Right Click
2. Customize… Add New Toolbar
3. URL bar -> CLICK and DRAG
Turn off Solr HTTP Caching
• Change in solrconfig.xml
• Disable HTTP 304 (Not Modified) responses in the <httpCaching> section (for example, never304="true")
• Turn it back on before you deploy!
Understanding Solr’s “Web Command Line”
The “Web Command Line”
CLI CONCEPT                 SOLR EQUIVALENT
Command prompt              URL bar
-o or --foo bar             ? or & and =
(spaces)                    +
Some punctuation            %nn
Output                      XML or HTML
Command line “adapter”      Curl
• Script files can call URLs
• Not built into Windows – try cygwin
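The space and punctuation rows of the mapping above can be tried out with a small sketch. It uses Python's standard urllib.parse module; the query text is a made-up example, not something from the slides.

```python
from urllib.parse import quote_plus, unquote_plus

# Hypothetical query text containing spaces and one punctuation character.
raw_query = "solr & lucene"

# quote_plus turns each space into "+" and reserved punctuation into a %nn escape.
encoded = quote_plus(raw_query)
print(encoded)

# Decoding reverses the mapping, which is roughly what the server does on receipt.
decoded = unquote_plus(encoded)
print(decoded)
```

The same encoding applies whether the URL is typed into the browser's URL bar or passed to curl from a script.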
Solr “Command Line”
• Typical Base URL
• http://localhost:8983/solr/select?...
• Basic Input (not counting dismax)
• q = query, fq = filter query
• df = default field
• qt = query type (standard / dismax)
• Controlling Output (lots more!!!)
• debugQuery = true
• wt = “what type” (actually “writer type”)
• standard/XML, xslt (with tr=), javabin, json…
• fl = *,score (which fields)
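As a rough illustration of how these parameters combine into one request, the sketch below assembles a select URL. The helper name solr_url is hypothetical; the base URL and parameter names come from the slide, and the values are only examples.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"  # typical base URL from the slide

def solr_url(q, **options):
    """Assemble a select URL; keyword options are passed through as Solr parameters."""
    params = {"q": q}
    params.update(options)
    # urlencode joins name=value pairs with "&" and escapes the values.
    return BASE_URL + "?" + urlencode(params)

# Ask for all fields plus the score, JSON output, and debug information.
url = solr_url("solr", fl="*,score", wt="json", debugQuery="true")
print(url)
```

Note that "*" and "," in the fl value are sent as %nn escapes; Solr sees the decoded value.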
Example: search for “solr”
http://localhost:8983/solr/select?q=solr&debugQuery=true
With Firefox you get XML output you can expand and collapse
With MSIE* and Safari, not so much
* Some versions
Detailed Debug & Explain Output
http://localhost:8983/solr/select?q=solr&debugQuery=true
<str name="parsedquery">text:solr</str> …
<lst name="explain">
<str name="SOLR1000">
0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
  1.4142135 = tf(termFreq(text:solr)=2)
  3.6026897 = idf(docFreq=1, numDocs=26)
  0.125 = fieldNorm(field=text, doc=13)
</str>
</lst>
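The “product of” line in the explain output can be checked by hand. The sketch below multiplies the three factors shown above; the only thing it adds is the assumption that tf is the square root of the term frequency, which is consistent with tf=1.4142135 for termFreq=2.

```python
import math

term_freq = 2        # termFreq(text:solr)=2 from the explain output
idf = 3.6026897      # idf value copied from the explain output
field_norm = 0.125   # fieldNorm(field=text, doc=13)

# Assumption: tf is the square root of the raw term frequency.
tf = math.sqrt(term_freq)

# fieldWeight is reported as the product of tf, idf and fieldNorm.
field_weight = tf * idf * field_norm
print(tf, field_weight)
```

Small differences in the last digit are expected, since the explain output uses single-precision floats.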
A look at the DisMax query parser
Solr DisMax: Defined
• What is it?
• Dis-junction: the query text is matched across multiple fields
• Max-imum: the best-matching field sets the score
• How do you get it?
• Configured in:
• solrconfig.xml and schema.xml
• Called with:
• qt=dismax
• Adjusted with:
• mm, bf, qf, pf, qs, ps, tie
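A request that exercises some of these options might look like the sketch below. The parameter names are the ones listed on the slide; the field boosts loosely follow the parsed query shown later in this deck, and the mm value is purely hypothetical.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

dismax_params = {
    "q": "solr",
    "qt": "dismax",
    "qf": "text^0.5 name^1.2 sku^1.5",   # query fields with per-field boosts
    "pf": "text^0.2 name^1.5",           # phrase fields
    "mm": "1",                           # hypothetical minimum-match setting
    "tie": "0.01",                       # tie breaker
    "debugQuery": "true",
}

url = "http://localhost:8983/solr/select?" + urlencode(dismax_params)
print(url)

# Decode the query string again to confirm the values survive the encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["qf"][0])
```

Values passed on the URL typically override the defaults configured for the handler in solrconfig.xml, which makes the URL a convenient place to experiment.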
Solr DisMax: Pros and Cons
General Benefits
• Multiple Fields
• Multiple Relevancy Rules
• Great for Freshness / Popularity
Issues to be Aware of
• Tie-in between schema.xml & solrconfig.xml
• Trouble with some CJK (Chinese, Japanese, Korean)
• Limited wildcard / field / range support
• Difficult to customize and debug
• Trouble with shingles
• Understand mm!
About the “dis” and the “max”
Distributed across multiple fields
• Break up the query into words
• Each word becomes a clause against each field
• Like an OR but with extra credit
Takes the Maximum of each set
• Word 1 had highest score in Title
• Word 2 very dense in the doc body
• Adds in Tie breaker if in multiple fields
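The “max plus tie breaker” idea can be sketched in a few lines. The field names and scores below are invented for illustration, and the function name dismax_score is hypothetical; it is not Solr's implementation, only the arithmetic the slide describes.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the other field scores."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others

# Hypothetical per-field scores for a two-word query.
word1 = {"title": 0.9, "body": 0.2}   # word 1 scores highest in the title
word2 = {"title": 0.1, "body": 0.7}   # word 2 is dense in the body

# Each word contributes its best field, plus a little credit for the others.
total = sum(dismax_score(list(w.values()), tie=0.1) for w in (word1, word2))
print(total)
```

With tie=0 only the best field counts; with tie=1 the result is a plain sum over fields, like an ordinary OR.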
Coming soon: Extended DisMax
Improvements
• Flexible case Boolean ops: AND/and, OR/or
• Auto-escape punctuation & -> \&, etc.
• Improved Proximity Boosting (via word bigrams)
• Other changes in stop words, relevancy calc, URL arguments
How to get it
• Post 1.4 patch, planned for 1.5
• Details + Patch in JIRA: SOLR-1553
http://issues.apache.org/jira/browse/SOLR-1553
• TBD: change URL option qt=edismax (or qt=dismax)
Boosting Formulas
Boost Functions in Dismax
High Level Feature
• Numeric functions for scoring
• sum(), product(), sqrt(), log(), etc.
• Boost on recent dates, user popularity
Good Combination: Reverse-Ordinal & Reciprocal
• Position in index : ord(), reverse is: rord()
• Larger y for smaller x: recip()
How to get it
• URL parameter bf = “boost function”
• Configured in solrconfig.xml
• See http://wiki.apache.org/solr/FunctionQuery
“Freshness”: Boosting Recent Dates
Linear(x,m,c) = m x + c
recip(x,m,a,c) = a / (m x + c)

WIKI EXAMPLE: recip( rord(creationDate), 1, 1000, 1000 )
slope m = 1
numerator a = 1000
intercept c = 1000 (aka "b")

Date         Position ord()   N-Position rord()   Linear(x,m,c)   recip(x,m,a,c)
1/1/2000     1                120                 1120            0.89286
2/1/2000     2                119                 1119            0.89366
3/1/2000     3                118                 1118            0.89445
…            …                …                   …               …
1/1/2005     61               60                  1060            0.94340
…            …                …                   …               …
1/1/2009     109              12                  1012            0.98814
2/1/2009     110              11                  1011            0.98912
3/1/2009     111              10                  1010            0.99010
4/1/2009     112              9                   1009            0.99108
5/1/2009     113              8                   1008            0.99206
6/1/2009     114              7                   1007            0.99305
7/1/2009     115              6                   1006            0.99404
8/1/2009     116              5                   1005            0.99502
9/1/2009     117              4                   1004            0.99602
10/1/2009    118              3                   1003            0.99701
11/1/2009    119              2                   1002            0.99800
12/1/2009    120              1                   1001            0.99900

[Chart: vertical axis from 0.880 to 1.000]
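The last column of the table can be reproduced with a short sketch of the reciprocal function, using the wiki example's values (m=1, a=1000, c=1000). The Python function here only mirrors the formula on the slide; it is not Solr code.

```python
def recip(x, m, a, c):
    """Reciprocal function from the slide: a / (m*x + c)."""
    return a / (m * x + c)

# Reverse ordinals from the table: 120 is the oldest date, 1 the newest.
for rord in (120, 60, 12, 1):
    print(rord, round(recip(rord, 1, 1000, 1000), 5))
```

Because the newest document has the smallest reverse ordinal, it gets the largest value, which is what makes this combination useful as a freshness boost.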
Sifting through Solr’s “Explain” output
DisMax Example for “solr”
INPUT:
http://localhost:8983/solr/select?q=solr&debugQuery=true&qt=dismax
DEBUG OUTPUT: (1 OF 2)
<str name="parsedquery">
+DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 | manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01)
DisjunctionMaxQuery((manu_exact:solr^1.9 | features:solr^1.1 | text:solr^0.2 | manu:solr^1.4 | name:solr^1.5)~0.01)
FunctionQuery((top(ord(popularity)))^0.5)
FunctionQuery((1000.0/(1.0*float(top(rord(price)))+1000.0))^0.3)
</str>
DisMax explain output for a single word query
<lst name="explain"><str name="SOLR1000">
0.74609417 = (MATCH) sum of:
  0.4476144 = (MATCH) max plus 0.01 times others of:
    0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
      0.04119147 = queryWeight(text:solr^0.5), product of:
        0.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
      0.09885953 = queryWeight(name:solr^1.2), product of:
        1.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
    0.03710002 = (MATCH) weight(features:solr in 13), product of:
      0.08238294 = queryWeight(features:solr), product of:
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.44520026 = (MATCH) weight(sku:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(sku:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      3.6026897 = (MATCH) fieldWeight(sku:solr in 13), product of:
        1.0 = tf(termFreq(sku:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        1.0 = fieldNorm(field=sku, doc=13)
  0.22311316 = (MATCH) max plus 0.01 times others of:
    0.040810023 = (MATCH) weight(features:solr^1.1 in 13), product of:
      0.09062123 = queryWeight(features:solr^1.1), product of:
        1.1 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.01049347 = (MATCH) weight(text:solr^0.2 in 13), product of:
      0.016476588 = queryWeight(text:solr^0.2), product of:
        0.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.22260013 = (MATCH) weight(name:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(name:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
  0.06860119 = (MATCH) FunctionQuery(top(ord(popularity))), product of:
    6.0 = ord(popularity)=6
    0.5 = boost
    0.022867065 = queryNorm
  0.0067654043 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(price)))+1000.0)), product of:
    0.9861933 = 1000.0/(1.0*float(rord(price)=14)+1000.0)
    0.3 = boost
    0.022867065 = queryNorm
</str></lst>
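The top-level score in this explain output can be re-derived from its four parts. The sketch below copies the clause weights from the output above and applies “max plus 0.01 times others” to each DisMax group; the helper name max_plus_tie is hypothetical.

```python
TIE = 0.01  # from "max plus 0.01 times others" in the explain output

def max_plus_tie(scores, tie=TIE):
    """Highest clause weight plus `tie` times the sum of the remaining ones."""
    best = max(scores)
    return best + tie * (sum(scores) - best)

# Clause weights copied from the explain output for document SOLR1000.
first_group = [0.026233677, 0.17808011, 0.03710002, 0.44520026]  # text, name, features, sku
second_group = [0.040810023, 0.01049347, 0.22260013]             # features, text, name
popularity_boost = 0.06860119   # FunctionQuery on ord(popularity)
price_boost = 0.0067654043      # FunctionQuery on rord(price)

total = (max_plus_tie(first_group) + max_plus_tie(second_group)
         + popularity_boost + price_boost)
print(total)
```

The result agrees with the reported 0.74609417 to within rounding, and it shows that the sku match dominates the first group while the function queries contribute comparatively little.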
“Explain” example:
...
0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
  0.04119147 = queryWeight(text:solr^0.5), product of:
    0.5 = boost
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
    1.4142135 = tf(termFreq(text:solr)=2)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.125 = fieldNorm(field=text, doc=13)
0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
  0.09885953 = queryWeight(name:solr^1.2), product of:
    1.2 = boost
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
    1.0 = tf(termFreq(name:solr)=1)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.5 = fieldNorm(field=name, doc=13)
0.03710002 = (MATCH) weight(features:solr in 13), product of:
  0.08238294 = queryWeight(features:solr), product of:
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
    1.0 = tf(termFreq(features:solr)=1)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.125 = fieldNorm(field=features, doc=13)
...
Callouts: tf(termFreq(text:solr)=2), idf(docFreq=1, numDocs=26)
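Each clause weight in this excerpt is queryWeight times fieldWeight, and each of those is itself a product. A minimal sketch of that arithmetic follows, using the numbers shown above. The boost of 1.0 for the features clause is an assumption, since no boost line appears for it.

```python
IDF = 3.6026897           # idf(docFreq=1, numDocs=26), same for all three clauses
QUERY_NORM = 0.022867065  # queryNorm from the explain output

# (field, boost, tf, fieldNorm, weight reported by explain)
clauses = [
    ("text", 0.5, 1.4142135, 0.125, 0.026233677),
    ("name", 1.2, 1.0, 0.5, 0.17808011),
    ("features", 1.0, 1.0, 0.125, 0.03710002),  # boost 1.0 assumed (none shown)
]

def clause_weight(boost, tf, field_norm):
    """weight = queryWeight * fieldWeight, following the explain tree."""
    query_weight = boost * IDF * QUERY_NORM
    field_weight = tf * IDF * field_norm
    return query_weight * field_weight

for field, boost, tf, norm, reported in clauses:
    print(field, clause_weight(boost, tf, norm), reported)
```

Note that idf enters twice, once through queryWeight and once through fieldWeight, so rare terms are rewarded more strongly than a single idf factor would suggest.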
Solr’s XSLT “debugger”
http://localhost:8983/solr/select?
q=solr
&debugQuery=true
&wt=xslt
&tr=example.xsl
&fl=*,score
&qt=dismax
Another way to view Explain data
• Solr 1.4 has Solritas
• Various features, including toggle explain display
• “Some assembly required…”
http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/
Checking your Index and IDF
Checking what got Indexed
Bad Index = Bad Search
• Check Upper / lower case and Punctuation
• Bad Fields / Meta Data = Bad Facets, Filters, Sorting
Use built-in Schema Browser:
• Check each field
• Common words =
• IDF “Inverse Document Frequency”
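To see why common words matter, here is a small sketch of inverse document frequency. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the document counts in the loop are illustrative. With docFreq=1 and 27 documents this formula gives about 3.6027, the idf value seen in the explain output earlier. The explain text reports numDocs=26, so using 27 here (perhaps a count that includes a deleted document) is an assumption.

```python
import math

def idf(doc_freq, num_docs):
    """Assumed classic formula: rarer terms (low doc_freq) get a higher idf."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# A term in 1 document versus terms in 5 or 20 of 27 documents.
for doc_freq in (1, 5, 20):
    print(doc_freq, round(idf(doc_freq, 27), 4))
```

A word that appears in most documents gets an idf close to 1 and contributes little to ranking, which is why it pays to check the most common terms in each field.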
Check IDF w/ the Schema Browser
Start at the Admin Screen:
http://localhost:8983/solr/admin
Schema Browser
• select a field
• change # to see more
New Idea Engineering
About NIE
NIE Resources
Search Dev Newsgroup: www.SearchDev.org
Newsletter & Whitepapers: www.ideaeng.com/current
Blogs: EnterpriseSearchBlog.com
SearchComponentsOnline.com
Finish Line / Q & A
Review & Questions
Mark Bennett [email protected]
main 408-446-3460
cell 408-829-6513
Q & A
These slides and a recorded presentation are available at
bit.ly/SolrRelevancy