Solr: Beyond the Basics

Post on 17-May-2015


description

The Apache Solr search engine has become almost the default choice for adding superior search capabilities to a web application. In this talk we will go beyond the basics of Solr, and look at what it offers and how to set it up robustly and properly for production use. We will plan and implement a document model in Solr, and look at how to index different document types with Solr Cell or index data from the web with the Nutch crawler. We will cover options for tuning queries and performance, and examine how best to use more advanced features like faceting, spelling correction and 'more like this'. Solr offers a language-agnostic web service, so client examples will be in PHP and Python, but the bulk of the content will be applicable to anyone looking to work well with Solr.

Transcript of Solr: Beyond the Basics

Solr: BEYOND THE BASICS!

Script: Ian Barber (phpir.com)
Art: the internet!
Editor: twitter.com/ianbarber
Lettering: ian.barber@gmail.com
http://joind.in/2899

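[Title slide art, tf-idf: $\mathrm{tf}_{i,j} \times \mathrm{idf}_{i,j}$, with $\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$ and document frequency $|\{d : t_i \in d\}|$]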

PREVIOUSLY....

My site search was slow and the results were bad, but Solr saved me!

security comes first!

/etc/solr/solr.xml

Core Core

CONF CONF

/var/solr/data

/var/solr/lib

<solr sharedLib="/var/solr/lib" persistent="true">
  <cores adminPath="/admin/cores">
    <core default="true" instanceDir="main" name="main">
    </core>
  </cores>
</solr>

solr.xml

<schema name="example" version="1.2">
  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       Applications should change this to reflect the nature of the search collection.
       version="1.2" is Solr's version number for the schema syntax and semantics. It should
       not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default
            except for text fields.
    -->
  <types>
    <!-- field type definitions. The "name" attribute is
         just a label to be used by field definitions. The "class"
         attribute and any other attributes determine the real
         behavior of the fieldType.
         Class names starting with "solr" refer to java classes in the
         org.apache.solr.analysis package.
      -->
    <!-- The StrField type is not analyzed, but indexed/stored verbatim.
         - StrField and TextField support an optional compressThreshold which
         limits compression (if enabled in the derived fields) to values which
         exceed a certain size (in characters).
      -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
    <!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
    <fieldtype name="binary" class="solr.BinaryField"/>
    <!-- The optional sortMissingLast and sortMissingFirst attributes are

<config>
  <!-- Set this to 'false' if you want solr to continue working after it has
       encountered an severe configuration error. In a production
       environment, you may want solr to keep working even if one handler is
       mis-configured. You may also set this to false using by setting the
       system property:
         -Dsolr.abortOnConfigurationError=false
    -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
  <!-- lib directives can be used to instruct Solr to load an Jars
       identified and use them to resolve any "plugins" specified in your
       solrconfig.xml or schema.xml (ie: Analyzers, Request Handlers, etc...).
       All directories and paths are resolved relative the instanceDir.
       If a "./lib" directory exists in your instanceDir, all files found in
       it are included as if you had used the following syntax...
         <lib dir="./lib" />
    -->
  <!-- A dir option by itself adds any files found in the directory to the
       classpath, this is useful for including all jars in a directory.
    -->
  <!--lib dir="../../contrib/extraction/lib" /-->
  <!-- When a regex is specified in addition to a directory, only the files
       in that directory which completely match the regex (anchored on both ends)
       will be included.
    -->
  <!--lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
  <lib dir="../../dist/" regex="apache-solr-clustering-\d.*\.jar" /-->
  <!-- If a dir option (with or without a regex) is used and nothing is
       found

Solr’s secret plan!

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr rocks</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
    <lst>
      <str name="q">from solrconfig.xml</str>
    </lst>
  </arr>
</listener>

cache warming!

Query

Index Configuration

Request Handlers

search components

Content Type

section

search types

field types

fields

The CMS!

TITLE

LEAD PARA

DATE

BODY

permalink

Category

Tags

Author

Scientific analysis!

how do we turn our text into tokens?

Field Type, Storage, Tokenisation, Filters, and copy fields.

<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

schema.xml
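The analyzer above is a pipeline: one tokenizer, then each filter applied in the order listed. The following is a toy Python sketch of that idea, not Solr's implementation. The stop word set, the crude delimiter split and the function names are all assumptions made for illustration, and stemming is left out.

```python
import re

# Hypothetical stop word list for illustration; not Solr's stopwords.txt.
STOP_WORDS = {"a", "an", "the", "of", "to"}


def whitespace_tokenize(text):
    """Split on whitespace only, keeping punctuation attached to tokens."""
    return text.split()


def stop_filter(tokens):
    """Drop stop words. Case-sensitive, so 'The' survives if it runs before lowercasing."""
    return [t for t in tokens if t not in STOP_WORDS]


def word_delimiter_filter(tokens):
    """Very rough stand-in for a word delimiter filter: split on non-alphanumerics."""
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^0-9A-Za-z]+", token) if p)
    return parts


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def analyze(text):
    """Run the tokenizer, then every filter in the configured order."""
    tokens = whitespace_tokenize(text)
    for token_filter in (stop_filter, word_delimiter_filter, lowercase_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("the Wi-Fi guide!"))
print(analyze("The Wi-Fi guide!"))
```

The second call shows why filter order matters: with the stop filter ahead of the lowercase filter, a capitalised "The" is not removed in this sketch.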

ORIGINAL: O’Reilly’s wi-fi guide!

Whitespace: O’Reilly’s / wi-fi / guide!

keyword: O’Reilly’s wi-fi guide!

STANDARD: O / Reilly / S / wi / FI / GUIDE

stored: “My Phrase?” → doc 1

INDEXED: my → doc 1, phrase → doc 1

Ian barber

AN PRPR

<fieldtype name="phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldtype>

IAIN BARBOUR

AN PRPR

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" generateNumberParts="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

delimiters

O / Reilly / S / wi / FI / GUIDE / OReillys / wifi

precision versus recall

vs

<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />

stemming

O / Reilli / S / wi / FI / GUID / OReilli / wifi

Je ne parle pas anglais! (I don't speak English!)

TITLE

LEAD PARA BODY

<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0" />

<fieldType name="lowercase" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

schema.xml

tags

Date

author

category

permalink

<fields>
  <field name="permalink" type="lowercase" required="true" />
  <field name="category" type="lowercase" />
  <field name="tag" type="lowercase" multiValued="true" />
  <field name="title" type="text" required="true"/>
  <field name="body" type="text" required="true" />
  <field name="author" type="lowercase" stored="false" multiValued="true" />
  <field name="date" type="tdate" multiValued="true" />
  <field name="lead_para" type="text" />
  <field name="phonetic" type="phonetic" />
  <field name="text" type="text" stored="false" multiValued="true" />
</fields>

schema.xml

<!-- Copy Fields -->
<copyField source="permalink" dest="text" />
<copyField source="category" dest="text" />
<copyField source="title" dest="text" />
<copyField source="lead_para" dest="text" />
<copyField source="body" dest="text" />
<copyField source="author" dest="text" />
<copyField source="category" dest="phonetic" />
<copyField source="title" dest="phonetic" />
<copyField source="lead_para" dest="phonetic" />
<copyField source="body" dest="phonetic" />
<copyField source="author" dest="phonetic" />

<!-- ID -->
<uniqueKey>permalink</uniqueKey>
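To make the effect of the copy fields concrete, here is a small Python sketch that mimics what the rules above do to an incoming document before analysis. It is only a model of the behaviour, assuming the destination fields are not supplied by the client; `apply_copy_fields` and `COPY_RULES` are names invented for this example.

```python
# (source, dest) pairs taken from the copyField rules in schema.xml.
COPY_RULES = [
    ("permalink", "text"), ("category", "text"), ("title", "text"),
    ("lead_para", "text"), ("body", "text"), ("author", "text"),
    ("category", "phonetic"), ("title", "phonetic"),
    ("lead_para", "phonetic"), ("body", "phonetic"), ("author", "phonetic"),
]


def apply_copy_fields(doc, rules=COPY_RULES):
    """Return a copy of doc with each source value appended to its destination field.

    Assumption: the destination fields ('text', 'phonetic') are multi-valued
    and are not present in the incoming document.
    """
    indexed = dict(doc)
    for source, dest in rules:
        if source not in doc:
            continue
        value = doc[source]
        values = list(value) if isinstance(value, (list, tuple)) else [value]
        indexed[dest] = indexed.get(dest, []) + values
    return indexed


doc = {
    "permalink": "http://fooweb.com/strategy/DCPO",
    "title": "DPCO: A Framework For Synergy",
    "author": "Sean Alison",
    "tag": ["strategy-tag"],  # hypothetical value; 'tag' has no copyField rule
}
indexed = apply_copy_fields(doc)
print(indexed["text"])
print(indexed["phonetic"])
```

The catch-all `text` field ends up with every copied value, which is what lets a single field serve as a broad search target.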

from solr import *

s = SolrConnection('http://localhost:8080/solr/main')
doc = dict(
    permalink = "http://fooweb.com/strategy/DCPO",
    category = "strategy",
    title = "DPCO: A Framework For Synergy",
    body = "DPCO, or Dynamic Performance Class Organisation is a ISO90210 quality oriented management process [...]",
    author = "Sean Alison",
    date = "2011-03-01T00:00:00Z",
    source_site = "fooweb.com",
)
s.add(doc)
s.commit()

simpleadd.py

<add>
  <doc>
    <field name="body">DPCO, or Dynamic Performance Class [...]</field>
    <field name="category">strategy</field>
    <field name="permalink">http://fooweb.com/strategy/DCPO</field>
    <field name="source_site">fooweb.com</field>
    <field name="title">DPCO: A Framework For Synergy</field>
    <field name="date">2011-03-01T00:00:00Z</field>
    <field name="author">Sean Alison</field>
  </doc>
</add>
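The update message above is plain XML, so a client library is not strictly needed to produce it. As a minimal sketch using only the Python standard library, the helper below (a hypothetical `build_add_xml`, not part of any Solr client) turns a dict into an `<add>` message, emitting one `<field>` per value for multi-valued fields and escaping special characters.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialise a dict as a Solr <add><doc>...</doc></add> update message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # Multi-valued fields become repeated <field> elements.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


message = build_add_xml({
    "permalink": "http://fooweb.com/strategy/DCPO",
    "title": "DPCO: A Framework For Synergy",
    "tag": ["synergy", "R&D"],  # hypothetical tag values for illustration
})
print(message)
```

The resulting string would be POSTed to the core's update handler, followed by a commit.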

time for the gadgets!

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

solrconfig.xml

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/cms"
              user="root" password="password" />
  <document>
    <entity name="story"
            query="SELECT s.id, s.content, CONCAT (u.first_name, ' ', u.last_name) as author [...] s.status_id = 1"
            deltaImportQuery="SELECT s.id, s.content [...] AND s.id = ${dataimporter.delta.id}"
            deltaQuery="SELECT id FROM stories WHERE modified > ${dataimporter.last_index_time}"
            transformer="TemplateTransformer,HTMLStripTransformer">
      <field column="permalink" name="permalink" template="http://fooweb.com/${story.slug}" />
      <field column="publish_date" name="date" />
      <field column="content" name="body" stripHTML="true" />
      <field column="source_site" template="cms" />
      [...]
      <entity name="topic" query="SELECT [...] st.item_id=${story.id}">
        <field column="category" />
      </entity>
    </entity>
  </document>
</dataConfig>

data-config.xml

<response>
  <str name="command">full-import</str>
  <str name="status">busy</str>
  <str name="importResponse">A command is still running...</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:0:14.979</str>
    <str name="Total Requests made">5523</str>
    <str name="Total Rows Fetched">5522</str>
    <str name="Total Documents Processed">2760</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-03-02 15:48:00</str>
  </lst>
</response>

http://SOLR:8080/solr/main/dataimport
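The data import handler is driven by a `command` parameter on the URL above; `full-import`, `delta-import` and `status` are the usual values. The sketch below only builds those URLs and makes no request. The `dataimport_url` helper is an invented name, and the host is the placeholder from the slide.

```python
from urllib.parse import urlencode

# Placeholder host taken from the slide; replace with a real Solr host.
DATAIMPORT_BASE = "http://SOLR:8080/solr/main/dataimport"


def dataimport_url(command, **params):
    """Build a DataImportHandler URL for the given command and extra parameters."""
    query = {"command": command}
    query.update(params)
    return DATAIMPORT_BASE + "?" + urlencode(query)


full_url = dataimport_url("full-import")
# clean=false keeps the existing index contents during a delta import.
delta_url = dataimport_url("delta-import", clean="false")
status_url = dataimport_url("status")

for url in (full_url, delta_url, status_url):
    print(url)
```

Polling the status URL is how a response like the "busy" one above would be retrieved while an import runs.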

The SOLR CELL!

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

solrconfig.xml

<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

schema.xml

<dynamicField name="ignored_*" type="ignored" indexed="false" stored="false"/>

can it be...schema free?!

Dynamic Fields

$ curl -v "http://localhost:8080/solr/main/update/extract?literal.source_site=files&literal.permalink=http://fooweb.com/arch.pdf&commit=true&fmap.content=body&fmap.Author=author" --data-binary @arch.pdf -H 'Content-Type: application/pdf'
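The extract URL packs three kinds of parameter into the query string: `literal.*` values set directly on the document, `fmap.*` renames of extracted fields, and `commit`. As a hedged sketch, the Python helper below (a hypothetical `extract_url`) assembles the same request URL and percent-encodes the permalink, which is safer than pasting a raw URL into a query string.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def extract_url(core_url, literals, field_map, commit=True):
    """Build a Solr Cell (/update/extract) URL from literal values and field mappings."""
    params = {}
    for name, value in literals.items():
        params["literal." + name] = value
    for source, dest in field_map.items():
        params["fmap." + source] = dest
    if commit:
        params["commit"] = "true"
    return core_url + "/update/extract?" + urlencode(params)


url = extract_url(
    "http://localhost:8080/solr/main",
    literals={"source_site": "files", "permalink": "http://fooweb.com/arch.pdf"},
    field_map={"content": "body", "Author": "author"},
)
print(url)

# Round-trip the query string to confirm the values survive encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["literal.permalink"])
```

The file itself would still be sent as the request body with the matching Content-Type header, as in the curl call.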

A crawler!

Lucidimagination.com/blog/2009/03/09/nutch-solr

# skip some protocols
-^(https|telnet|file|ftp|mailto):
-[?*!@=]
# allow urls in defined domain
+^http://([a-z0-9\-A-Z]*\.)*fooweb.com/
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# deny anything else
-.

regex-urlfilter.txt
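The filter file is read top to bottom: each line is a `+` (accept) or `-` (reject) sign followed by a regular expression, and the first pattern found anywhere in the URL decides. The Python sketch below models that reading of the rules as listed; it is an approximation for checking a rule set by hand, not Nutch's own code, and `parse_rules` and `accepts` are invented names.

```python
import re

RULES_TEXT = r"""
# skip some protocols
-^(https|telnet|file|ftp|mailto):
-[?*!@=]
# allow urls in defined domain
+^http://([a-z0-9\-A-Z]*\.)*fooweb.com/
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# deny anything else
-.
"""


def parse_rules(text):
    """Turn filter lines into (accept, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """First rule whose pattern is found in the URL wins; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)
for candidate in (
    "http://subsite.fooweb.com/page",
    "https://fooweb.com/",
    "http://fooweb.com/search?q=solr",
    "http://example.com/",
):
    print(candidate, accepts(candidate, rules))
```

Under this first-match reading, rule order matters: any rule placed after the domain allow line is never consulted for URLs inside that domain.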

<mapping>
  <fields>
    <field dest="body" source="content" />
    <field dest="source_site" source="site" />
    <field dest="title" source="title" />
    <field dest="ignored_host" source="host" />
    <field dest="ignored_segment" source="segment" />
    <field dest="ignored_boost" source="boost" />
    <field dest="ignored_digest" source="digest" />
    <field dest="date" source="tstamp" />
    <field dest="permalink" source="url" />
  </fields>
  <uniqueKey>permalink</uniqueKey>
</mapping>

solrindex-mapping.xml

$ echo "http://subsite.fooweb.com" > urls/seed.txt
$ bin/nutch inject /var/nutch/crawldb urls

$ bin/nutch generate /var/nutch/crawldb /var/nutch/segments
$ export SEGMENT=/var/nutch/segments/`ls -tr /var/nutch/segments|tail -1`
$ bin/nutch fetch $SEGMENT -noParsing
$ bin/nutch parse $SEGMENT
$ bin/nutch updatedb /var/nutch/crawldb $SEGMENT -filter -normalize
$ bin/nutch invertlinks /var/nutch/linkdb -dir /var/nutch/segments

$ bin/nutch solrindex http://localhost:8080/solr/main /var/nutch/crawldb /var/nutch/linkdb/ /var/nutch/segments/*

solr goes to work!

he has dismax!

<requestHandler name="dismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">text^0.5 category^1.5 title^2 body^1 permalink^10.0 author^1.8 tag^1.3</str>
    <str name="pf">text^0.2 title^4 author^1.8 body^1</str>
    <str name="mm">3&lt;60%</str>
  </lst>
</requestHandler>

solrconfig.xml
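The `mm` (minimum should match) value `3<60%` reads as: with up to three optional clauses, all must match; with more than three, 60% of them must match, rounded down. The Python sketch below works through that arithmetic for this one form of the spec only. `min_should_match` is an illustrative helper, not a Solr API, and it ignores the other `mm` syntaxes (negative values, multiple conditions).

```python
def min_should_match(num_clauses, spec="3<60%"):
    """Number of clauses that must match for a single 'N<P%' style mm spec.

    Assumption: spec has exactly the form '<threshold><<percent>%' with a
    positive percentage, as in the handler configuration above.
    """
    threshold, _, rest = spec.partition("<")
    if num_clauses <= int(threshold):
        # At or below the threshold, every clause is required.
        return num_clauses
    percent = int(rest.rstrip("%"))
    # Above the threshold, take the percentage and round down.
    return (num_clauses * percent) // 100


for n in (1, 2, 3, 4, 5, 10):
    print(n, "clauses ->", min_should_match(n), "required")
```

Note the dip just above the threshold: a four-word query needs only two matching words under this spec, while a three-word query needs all three.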

from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)

response = s.query('idie manager')
for hit in response.results:
    print hit['title']
    print hit['body']

$ python simplequery.py
Overview of the IDIE manager
To help with those implementing IDIE [...]
IDIE: The 801g Of Talent Management
Inspiration-Direction-Influence [...]

<str name="bf">recip(ms(NOW,date),3.16e-11,1,1)</str>

FunctionQuery(1.0/(3.16E-11*float(ms(const(1299450070912),date(date)))+1.0)), product of:
  0.9974636 = 1.0/(3.16E-11*float(ms(const(1299450070912),date(date)=1299369600000))+1.0)
  1.0 = boost
  0.03730806 = queryNorm
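The boost function is `recip(x, m, a, b) = a / (m * x + b)`, with `x` being the document's age in milliseconds. Since 3.16e-11 is roughly one over the number of milliseconds in a year, a brand-new document scores close to 1 and a year-old document about 0.5. The short Python check below reproduces the figure from the explain output, using the two timestamps shown there.

```python
def recip(x, m, a, b):
    """Solr-style reciprocal function: a / (m * x + b)."""
    return a / (m * x + b)


# Timestamps (milliseconds since epoch) copied from the explain output.
NOW_MS = 1299450070912
DOC_DATE_MS = 1299369600000

age_ms = NOW_MS - DOC_DATE_MS
boost = recip(age_ms, 3.16e-11, 1, 1)
print(round(boost, 7))

# Approximate age of a one-year-old document, for comparison.
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000
print(round(recip(MS_PER_YEAR, 3.16e-11, 1, 1), 3))
```

Changing `m` stretches or compresses the decay; raising `a` and `b` together flattens it.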

going beyond just search results!

$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "badly drawn";
$p = array(
    'facet' => "true",
    'facet.field' => 'category',
    'facet.mincount' => 1,
);

$r = $solr->search($query, 0, 5, $p);
foreach ($r->facet_counts->facet_fields->category as $cat => $count) {
    echo $cat, " ", $count, PHP_EOL;
}

$query = "";
$p = array(
    'q.alt' => "*:*",
    "facet" => "true",
    "facet.date" => 'date',
    "facet.date.start" => "NOW/YEAR-6MONTHS",
    "facet.date.end" => "NOW/YEAR",
    "facet.date.gap" => "+1MONTH",
    "fq" => "category: Reviews",
);

$r = $solr->search($query, 0, 0, $p);
foreach ($r->facet_counts->facet_dates->date as $date => $count) {
    echo $date, " ", $count, PHP_EOL;
}

$query = "";
$p = array(
    'q.alt' => "*:*",
    'facet' => "true",
    'facet.mincount' => 1,
    "facet.query" => array("title:gig", "title:album"),
    "fq" => "category:Reviews",
);
$r = $solr->search($query, 0, 0, $p);
foreach ($r->facet_counts->facet_queries as $query => $count) {
    echo $query, " ", $count, PHP_EOL;
}
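The PHP client hands back facet counts as a value-to-count mapping. When reading Solr's JSON response directly with the default `json.nl=flat` setting, field facets instead arrive as one flat list alternating value and count. The Python sketch below pairs such a list up; the sample data is made up for illustration and `facet_pairs` is an invented helper name.

```python
def facet_pairs(flat):
    """Convert [value, count, value, count, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical facet_counts section of a JSON response; values are invented.
facet_counts = {
    "facet_fields": {
        "category": ["Reviews", 12, "News", 7, "Interviews", 3],
    }
}

for category, count in facet_pairs(facet_counts["facet_fields"]["category"]):
    print(category, count)
```

Keeping the result as a list of pairs preserves the order in which the facet values were returned.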

What Fields to facet?

what facets to show?

how to facet?

<requestHandler name="mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="defType">mlt</str>
    <str name="mlt">true</str>
    <str name="mlt.fl">body title</str>
    <str name="mlt.match.include">false</str>
  </lst>
</requestHandler>

solrconfig.xml

$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "Losing my backpacking virginity";
$p = array('qt' => "mlt");
$results = $solr->search($query, 0, 3, $p);
foreach ($results->response->docs as $doc) {
    echo $doc->title, PHP_EOL;
}

$ php mltquery.php
Backpacking across USA social media way
Safe solo travel on New York holidays
Cracking The Big Apple's Big 10

Thanks!


Bonus content!

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="buildOnCommit">true</str>
    <str name="spellcheckIndexDir">/var/lib/solr/spellchecker</str>
  </lst>
</searchComponent>

solrconfig.xml

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StandardFilterFactory" />
  </analyzer>
</fieldType>

schema.xml

    [...]
    <int name="ps">10</int>
    <int name="qs">5</int>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

Dismax handler

$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$p = array(
    'spellcheck' => 'true',
    'spellcheck.collate' => 'true'
);
$results = $solr->search("roose", 0, 5, $p);
echo "Did you mean " . $results->spellcheck->suggestions->collation, PHP_EOL;

$ php spellquery.php
Did you mean rose

include_once "Apache/Solr/Service.php";
$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "album review";
$p = array('sort' => 'title_sort desc');
$res = $solr->search($query, 0, 10, $p);
foreach ($res->response->docs as $doc) {
    echo $doc->title, PHP_EOL;
}

<field name="title_sort" type="lowercase" indexed="true" stored="false" />

<copyField source="title" dest="title_sort" />

$ php sortquery.php
Zola Jesus album review - Stridulum II
Zero 7 album review - Record
Zebra and Giraffe
Young Knives video interview part 2
Young Knives - Road to V winners on tour
You Me At Six @ Wembley Arena, London
You Me At Six - Hold Me Down
Yet again... Good Shoes @ ULU, London
Yelle: North American tour review
Yelle: interview with a French pop artiste

http://code.google.com/p/solr-php-client

<highlighting>
  <fragmenter name="regex" class="[..]highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">70</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
  </fragmenter>
  <formatter name="html" class="[...]highlight.HtmlFormatter" default="true">
    <lst name="defaults">
      <str name="hl.simple.pre"><![CDATA[<em>]]></str>
      <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
  </formatter>
</highlighting>

$so = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$q = "album review";
$r = $so->search($q, 0, 5, array('hl' => "true"));
foreach ($r->response->docs as $doc) {
    echo $r->highlighting->{$doc->permalink}->title[0], PHP_EOL;
}

$ php highlightquery.php
Fenech Soler <em>album</em> <em>review</em>
Weezer - Hurley <em>album</em> <em>review</em>
Feeder <em>album</em> <em>review</em> - Renegades

The masters of scaling are here!

Replication sharding caching

from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)
response = s.query('ISO90210')
if(response.results.numFound == '0'):
    print "No results found!"

$ python simplefail.py
No results found!

IS SOLR DEFEATED?

<lst name="debug">
  <str name="rawquerystring">"iso 90210"</str>
  <str name="querystring">"iso 90210"</str>
  <str name="parsedquery">+DisjunctionMaxQuery((body:"iso 90210")~0.01) DisjunctionMaxQuery((body:"iso 90210")~0.01)</str>

/solr/select/?q="iso 90210"&debugQuery=true

<lst name="debug">
  <str name="rawquerystring">iso 90210</str>
  <str name="querystring">iso 90210</str>
  <str name="parsedquery">+((DisjunctionMaxQuery((body:iso)~0.01) DisjunctionMaxQuery((body:90210)~0.01))~2) DisjunctionMaxQuery((body:"iso 90210")~0.01)</str>
  <str name="parsedquery_toString">+(((body:iso)~0.01 (body:90210)~0.01)~2) (body:"iso 90210")~0.01</str>

/solr/select/?q=iso 90210&debugQuery=true

0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause (body:"iso 90210")
    0.0 = weight(body:"iso 90210" in 0), product of:
      0.6953707 = queryWeight(body:"iso 90210"), product of:
        3.8325815 = idf(body: iso=1 90210=1)
        0.18143663 = queryNorm
      0.0 = fieldWeight(body:"iso 90210" in 0), product of:
        0.0 = tf(phraseFreq=0.0)
        3.8325815 = idf(body: iso=1 90210=1)
        0.15625 = fieldNorm(field=body, doc=0)

&explainother=90210

    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">text^0.5 category^1.5 title^2 body^1 permalink^10.0 author^1.8 tag^1.3</str>
    <str name="pf">text^0.2 title^4 author^1.8 body^1</str>
    <str name="mm">3&lt;60%</str>
    <int name="ps">10</int>
    <int name="qs">5</int>
  </lst>

solrconfig.xml

from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)
response = s.query('ISO90210')
if(response.results.numFound == '0'):
    print "No results found!"

$ python simplefail.py
DPCO: A Framework For Synergy
DPCO, or Dynamic Performance Class Organisation is a ISO90210 quality [...]