Solr: Beyond the Basics

Solr: BEYOND THE BASICS!

script: Ian Barber (phpir.com)
Art: the internet!
Editor: twitter.com/ianbarber
lettering: ian.barber@gmail.com
http://joind.in/2899

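[Title slide background: fragments of the tf-idf formula, $tf_{i,j} \times idf_{i,j}$, with $\sum_k n_{k,j}$ and $|\{d : t_i \in d\}|$ as denominators]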

PREVIOUSLY....

My site search was slow and the results were bad, but Solr saved me!

security comes first!

/etc/solr/solr.xml

Core Core

CONF CONF

/var/solr/data

/var/solr/lib

Solr.xml

[Slide image: excerpt of the example schema.xml, a wall of comments around the string, boolean and binary field type definitions]

<config>
  <!-- Set this to 'false' if you want solr to continue working after it has
       encountered an severe configuration error. In a production
       environment, you may want solr to keep working even if one handler is
       mis-configured. You may also set this to false using by setting the
       system property:
         -Dsolr.abortOnConfigurationError=false
  -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <!-- lib directives can be used to instruct Solr to load an Jars
       identified and use them to resolve any "plugins" specified in your
       solrconfig.xml or schema.xml (ie: Analyzers, Request Handlers, etc...).
       All directories and paths are resolved relative the instanceDir.
       If a "./lib" directory exists in your instanceDir, all files found in
       it are included as if you had used the following syntax...
  -->
  <!-- A dir option by itself adds any files found in the directory to the
       classpath, this is useful for including all jars in a directory.
  -->
  <!-- When a regex is specified in addition to a directory, only the files
       in that directory which completely match the regex (anchored on both ends)
       will be included.
  -->
  <!--lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
  <!-- If a dir option (with or without a regex) is used and nothing is

Solr’s secret plan!

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr rocks</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
    <lst>
      <str name="q">from solrconfig.xml</str>
    </lst>
  </arr>
</listener>

cache warming!

Index Configuration

Request Handlers

search components

Content Type

section

search types

field types

fields

The CMS!

LEAD PARA

DATE

permalink

category

permalink

Schema.xml

<copyField source="permalink" dest="text" />
<copyField source="category" dest="text" />
<copyField source="title" dest="text" />
<copyField source="lead_para" dest="text" />
<copyField source="body" dest="text" />
<copyField source="author" dest="text" />
<copyField source="category" dest="phonetic" />
<copyField source="title" dest="phonetic" />
<copyField source="lead_para" dest="phonetic" />
<copyField source="body" dest="phonetic" />
<copyField source="author" dest="phonetic" />

<uniqueKey>permalink</uniqueKey>

from solr import *

s = SolrConnection('http://localhost:8080/solr/main')
doc = dict(
    permalink="http://fooweb.com/strategy/DCPO",
    category="strategy",
    title="DPCO: A Framework For Synergy",
    body="DPCO, or Dynamic Performance Class Organisation is a ISO90210 quality oriented management process [...]",
    author="Sean Alison",
    date="2011-03-01T00:00:00Z",
    source_site="fooweb.com",
)
s.add(doc)
s.commit()

Simpleadd.py

<add>
  <doc>
    <field name="body">DPCO, or Dynamic Performance Class [...]</field>
    <field name="category">strategy</field>
    <field name="permalink">http://fooweb.com/strategy/DCPO</field>
    <field name="source_site">fooweb.com</field>
    <field name="title">DPCO: A Framework For Synergy</field>
    <field name="date">2011-03-01T00:00:00Z</field>
    <field name="author">Sean Alison</field>
  </doc>
</add>
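The XML above is the kind of update message a call like s.add(doc) ends up posting to Solr. As a rough sketch of that serialisation, using only Python's standard library: the helper name build_add_xml is made up for illustration and is not part of solrpy, and the body field is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialise one document (a dict) as a Solr <add> update message.

    Hypothetical helper for illustration; not part of solrpy.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    # Sort field names so the output is stable between runs.
    for name in sorted(doc):
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = doc[name]
    return ET.tostring(add, encoding="unicode")


doc = {
    "permalink": "http://fooweb.com/strategy/DCPO",
    "category": "strategy",
    "title": "DPCO: A Framework For Synergy",
    "author": "Sean Alison",
    "date": "2011-03-01T00:00:00Z",
    "source_site": "fooweb.com",
}
add_xml = build_add_xml(doc)
print(add_xml)
```

Every key in the dict becomes one field element, so the document only indexes cleanly if each key matches a field (or dynamic field) declared in the schema.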

time for the gadgets!

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

Solrconfig.xml

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/cms"
              user="root"
              password="password" />
  <document>
    <entity name="story"
            query="SELECT s.id, s.content, CONCAT(u.first_name, ' ', u.last_name) as author [...] s.status_id = 1"
            deltaImportQuery="SELECT s.id, s.content [...] AND s.id = ${dataimporter.delta.id}"
            deltaQuery="SELECT id FROM stories WHERE modified > ${dataimporter.last_index_time}"
            transformer="TemplateTransformer,HTMLStripTransformer">

Data-config.xml

<response>
  <str name="command">full-import</str>
  <str name="status">busy</str>
  <str name="importResponse">A command is still running...</str>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:0:14.979</str>
    <str name="Total Requests made">5523</str>
    <str name="Total Rows Fetched">5522</str>
    <str name="Total Documents Processed">2760</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-03-02 15:48:00</str>
  </lst>
</response>

http://SOLR:8080/solr/main/dataimport
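The /dataimport handler is driven by a command request parameter: full-import appears in the status response above, and delta-import and status are other commands the DataImportHandler accepts. A small sketch of building those URLs follows. The host is assumed to be the same localhost core used in the other examples, and dataimport_url is an illustrative name only.

```python
from urllib.parse import urlencode

# Assumption: same core as the other examples in this talk.
BASE = "http://localhost:8080/solr/main/dataimport"


def dataimport_url(command, **params):
    """Build a DataImportHandler request URL for the given command."""
    query = {"command": command}
    query.update(params)
    return BASE + "?" + urlencode(query)


full_url = dataimport_url("full-import")
delta_url = dataimport_url("delta-import", commit="true")
status_url = dataimport_url("status")
print(full_url)
print(delta_url)
print(status_url)
```

Polling the status URL while an import runs returns a response like the one shown above, with "busy" as the status until the import finishes.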

The SOLR CELL!

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

Solrconfig.xml

Schema.xml

<dynamicField name="ignored_*" type="ignored" indexed="false" stored="false"/>

can it be...schema free?!

Dynamic Fields

$ curl -v "http://localhost:8080/solr/main/update/extract?literal.source_site=files&literal.permalink=http://fooweb.com/arch.pdf&commit=true&fmap.content=body&fmap.Author=author" --data-binary @arch.pdf -H 'Content-Type:application/pdf'
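In the curl call, literal.* parameters supply field values directly, while fmap.* parameters rename extracted fields (here the extracted content goes into body and Author into author). The sketch below builds the same query string programmatically; extract_url is a hypothetical helper, and unlike the hand-written curl URL it percent-encodes the permalink value.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE = "http://localhost:8080/solr/main/update/extract"


def extract_url(literals, field_map, commit=True):
    """Build an /update/extract URL.

    literals:  field values supplied as-is (literal.<field>=value)
    field_map: extracted field name -> schema field name (fmap.<src>=dest)
    """
    params = [("literal." + k, v) for k, v in literals.items()]
    params += [("fmap." + k, v) for k, v in field_map.items()]
    if commit:
        params.append(("commit", "true"))
    return BASE + "?" + urlencode(params)


url = extract_url(
    {"source_site": "files", "permalink": "http://fooweb.com/arch.pdf"},
    {"content": "body", "Author": "author"},
)
print(url)
```

The file itself would still be sent as the request body, as the --data-binary option does in the curl version.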

A crawler!

Lucidimagination.com/blog/2009/03/09/nutch-solr

# skip some protocols
-^(https|telnet|file|ftp|mailto):
-[?*!@=]

# allow urls in defined domain
+^http://([a-z0-9\-A-Z]*\.)*fooweb.com/

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# deny anything else
-.

regex-urlfilter.txt
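To see how these rules play out, here is a minimal sketch that applies them to a few URLs. It assumes the usual reading of regex-urlfilter.txt: rules are tried top to bottom, the first pattern found anywhere in the URL decides, "+" accepts and "-" rejects. Python's re module stands in for Java's regex engine here, and the test URLs are invented.

```python
import re

# Rules copied from the slide, in the same order: (sign, pattern).
RULES = [
    ("-", r"^(https|telnet|file|ftp|mailto):"),
    ("-", r"[?*!@=]"),
    ("+", r"^http://([a-z0-9\-A-Z]*\.)*fooweb.com/"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("-", r"."),
]
COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped


for url in [
    "http://subsite.fooweb.com/strategy/DCPO",
    "https://fooweb.com/login",
    "http://fooweb.com/search?q=solr",
    "http://example.com/",
]:
    print(url, accepts(url))
```

Because the first match wins, rule order matters: anything placed after the "+" rule never sees URLs in the allowed domain.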

<mapping>
  <fields>
    <field dest="body" source="content" />
    <field dest="source_site" source="site" />
    <field dest="title" source="title" />
    <field dest="ignored_host" source="host" />
    <field dest="ignored_segment" source="segment" />
    <field dest="ignored_boost" source="boost" />
    <field dest="ignored_digest" source="digest" />
    <field dest="date" source="tstamp" />
    <field dest="permalink" source="url" />
  </fields>
  <uniqueKey>permalink</uniqueKey>
</mapping>

Solrindex-mapping.xml

$ echo "http://subsite.fooweb.com" > urls/seed.txt
$ bin/nutch inject /var/nutch/crawldb urls

$ bin/nutch generate /var/nutch/crawldb /var/nutch/segments
$ export SEGMENT=/var/nutch/segments/`ls -tr /var/nutch/segments|tail -1`
$ bin/nutch fetch $SEGMENT -noParsing
$ bin/nutch parse $SEGMENT
$ bin/nutch updatedb /var/nutch/crawldb $SEGMENT -filter -normalize
$ bin/nutch invertlinks /var/nutch/linkdb -dir /var/nutch/segments

$ bin/nutch solrindex http://localhost:8080/solr/main /var/nutch/crawldb /var/nutch/linkdb/ /var/nutch/segments/*

solr goes to work!

he has dismax!

<requestHandler name="dismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      text^0.5 category^1.5 title^2 body^1 permalink^10.0 author^1.8 tag^1.3
    </str>
    <str name="pf">
      text^0.2 title^4 author^1.8 body^1
    </str>
    <str name="mm">3&lt;60%</str>
  </lst>
</requestHandler>

Solrconfig.xml
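The mm (minimum should match) value in this handler, 3<60%, is a conditional: with three or fewer optional clauses all of them must match, and beyond that 60% of them, rounded down. The following is a simplified sketch of that rule, not Solr's actual implementation. It handles only a single condition with a positive percentage, and min_should_match is an illustrative name.

```python
def min_should_match(spec, clause_count):
    """Evaluate a single conditional mm spec such as '3<60%'.

    Simplified: handles one 'N<P%' condition with a positive percentage only.
    """
    bound, _, rest = spec.partition("<")
    if clause_count <= int(bound):
        return clause_count  # short queries: every clause is required
    percent = int(rest.rstrip("%"))
    return (clause_count * percent) // 100  # rounded down


for n in (2, 3, 4, 5, 10):
    print(n, min_should_match("3<60%", n))
```

Note the dip just past the threshold: a three-word query needs all three words, while a four-word query needs only two.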

from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)

response = s.query('idie manager')
for hit in response.results:
    print(hit['title'])
    print(hit['body'])

$ python simplequery.py
Overview of the IDIE manager
To help with those implementing IDIE [...]
IDIE: The 801g Of Talent Management
Inspiration-Direction-Influence [...]

<str name="bf">recip(ms(NOW,date),3.16e-11,1,1)</str>

FunctionQuery(1.0/(3.16E-11*float(ms(const(1299450070912),date(date)))+1.0)), product of:
  0.9974636 = 1.0/(3.16E-11*float(ms(const(1299450070912),date(date)=1299369600000))+1.0)
  1.0 = boost
  0.03730806 = queryNorm
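The boost function recip(x, m, a, b) evaluates to a / (m*x + b), which is the 1.0/(3.16E-11*...+1.0) expression in the explain output. The sketch below redoes that arithmetic with the two timestamps from the explain output; the variable names are illustrative.

```python
def recip(x, m, a, b):
    """Solr's recip function: a / (m * x + b)."""
    return a / (m * x + b)


NOW_MS = 1299450070912   # const(...) from the explain output
DOC_MS = 1299369600000   # the document's date value
age_ms = NOW_MS - DOC_MS

boost = recip(age_ms, 3.16e-11, 1, 1)
print(round(boost, 7))

# 3.16e-11 is roughly 1 / (milliseconds in a year), so a year-old
# document gets about half the boost of a brand new one.
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000
year_old = recip(MS_PER_YEAR, 3.16e-11, 1, 1)
print(round(year_old, 2))
```

A document less than a day old, as here, keeps almost the full boost, and the value decays smoothly as the document ages.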

going beyond just search results!

$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "badly drawn";
$p = array(
    'facet' => "true",
    'facet.field' => 'category',
    'facet.mincount' => 1,
);

$r = $solr->search($query, 0, 5, $p);
foreach ($r->facet_counts->facet_fields->category as $cat => $count) {
    echo $cat, " ", $count, PHP_EOL;
}

$query = "";
$p = array(
    'q.alt' => "*:*",
    "facet" => "true",
    "facet.date" => 'date',
    "facet.date.start" => "NOW/YEAR-6MONTHS",
    "facet.date.end" => "NOW/YEAR",
    "facet.date.gap" => "+1MONTH",
    "fq" => "category:Reviews",
);

$r = $solr->search($query, 0, 0, $p);
foreach ($r->facet_counts->facet_dates->date as $date => $count) {
    echo $date, " ", $count, PHP_EOL;
}

$query = "";
$p = array(
    'q.alt' => "*:*",
    'facet' => "true",
    'facet.mincount' => 1,
    "facet.query" => array("title:gig", "title:album"),
    "fq" => "category:Reviews",
);

$r = $solr->search($query, 0, 0, $p);
foreach ($r->facet_counts->facet_queries as $query => $count) {
    echo $query, " ", $count, PHP_EOL;
}

What Fields to facet?

what facets to show?

how to facet?

<requestHandler name="mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="defType">mlt</str>
    <str name="mlt">true</str>
    <str name="mlt.fl">body title</str>
    <str name="mlt.match.include">false</str>
  </lst>
</requestHandler>

Solrconfig.xml

$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "Losing my backpacking virginity";
$p = array('qt' => "mlt");
$results = $solr->search($query, 0, 3, $p);
foreach ($results->response->docs as $doc) {
    echo $doc->title, PHP_EOL;
}

$ php mltquery.php
Backpacking across USA social media way
Safe solo travel on New York holidays
Cracking The Big Apple's Big 10

Thanks!


http://wiki.apache.org/solr
http://nutch.apache.org/
http://lucidimagination.com/blog/
http://robotlibrarian.billdueber.com/
http://code.google.com/p/solr-php-client
http://pypi.python.org/pypi/solrpy
https://www.packtpub.com/solr-1-4-enterprise-search-server/book

http://github.com/ianbarber/SolrBTB-Talk

Some useful links!

Bonus content!

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="buildOnCommit">true</str>
    <str name="spellcheckIndexDir">/var/lib/solr/spellchecker</str>
  </lst>
</searchComponent>

Solrconfig.xml

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StandardFilterFactory" />
  </analyzer>
</fieldType>

Schema.xml

[...]
    <int name="ps">10</int>
    <int name="qs">5</int>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

Dismax handler

$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$p = array(
    'spellcheck' => 'true',
    'spellcheck.collate' => 'true',
);
$results = $solr->search("roose", 0, 5, $p);
echo "Did you mean " . $results->spellcheck->suggestions->collation, PHP_EOL;

$ php spellquery.php
Did you mean rose

include_once "Apache/Solr/Service.php";

$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "album review";
$p = array('sort' => 'title_sort desc');
$res = $solr->search($query, 0, 10, $p);
foreach ($res->response->docs as $doc) {
    echo $doc->title, PHP_EOL;
}

$ php sortquery.php
Zola Jesus album review - Stridulum II
Zero 7 album review - Record
Zebra and Giraffe
Young Knives video interview part 2
Young Knives - Road to V winners on tour
You Me At Six @ Wembley Arena, London
You Me At Six - Hold Me Down
Yet again... Good Shoes @ ULU, London
Yelle: North American tour review
Yelle: interview with a French pop artiste

http://code.google.com/p/solr-php-client

$so = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$q = "album review";
$r = $so->search($q, 0, 5, array('hl' => "true"));
foreach ($r->response->docs as $doc) {
    echo $r->highlighting->{$doc->permalink}->title[0], PHP_EOL;
}

$ php highlightquery.php
Fenech Soler album review
Weezer - Hurley album review
Feeder album review - Renegades

The masters of scaling are here!

Replication sharding caching

from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)
response = s.query('ISO90210')
if response.results.numFound == '0':
    print("No results found!")

$ python simplefail.py
No results found!

IS SOLR DEFEATED?

http://solrurl:8080/solr/main/admin/analysis.jsp

<lst name="debug">
  <str name="rawquerystring">"iso 90210"</str>
  <str name="querystring">"iso 90210"</str>
  <str name="parsedquery">+DisjunctionMaxQuery((body:"iso 90210")~0.01) DisjunctionMaxQuery((body:"iso 90210")~0.01)</str>

/solr/select/?q="iso 90210"&debugQuery=true

<lst name="debug">
  <str name="rawquerystring">iso 90210</str>
  <str name="querystring">iso 90210</str>
  <str name="parsedquery">+((DisjunctionMaxQuery((body:iso)~0.01) DisjunctionMaxQuery((body:90210)~0.01))~2) DisjunctionMaxQuery((body:"iso 90210")~0.01)</str>
  <str name="parsedquery_toString">+(((body:iso)~0.01 (body:90210)~0.01)~2) (body:"iso 90210")~0.01</str>

/solr/select/?q=iso 90210&debugQuery=true

0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause (body:"iso 90210")
    0.0 = weight(body:"iso 90210" in 0), product of:
      0.6953707 = queryWeight(body:"iso 90210"), product of:
        3.8325815 = idf(body: iso=1 90210=1)
        0.18143663 = queryNorm
      0.0 = fieldWeight(body:"iso 90210" in 0), product of:
        0.0 = tf(phraseFreq=0.0)
        3.8325815 = idf(body: iso=1 90210=1)
        0.15625 = fieldNorm(field=body, doc=0)

&explainOther=90210
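Each "product of" line in the explain output is plain multiplication of the values nested beneath it, so the zero score can be traced by hand. A short sketch that redoes the arithmetic with the numbers copied from the output above:

```python
# Values copied from the explain output for body:"iso 90210".
idf = 3.8325815
query_norm = 0.18143663
tf = 0.0            # phraseFreq=0.0: the phrase never occurs in the field
field_norm = 0.15625

query_weight = idf * query_norm
field_weight = tf * idf * field_norm
weight = query_weight * field_weight

print(round(query_weight, 7), field_weight, weight)
```

The query side is healthy; the whole score collapses because tf is zero, meaning the phrase "iso 90210" is never found in the indexed body tokens.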

    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      text^0.5 category^1.5 title^2 body^1 permalink^10.0 author^1.8 tag^1.3
    </str>
    <str name="pf">
      text^0.2 title^4 author^1.8 body^1
    </str>
    <str name="mm">3&lt;60%</str>
    <int name="ps">10</int>
    <int name="qs">5</int>
  </lst>

Solrconfig.xml

from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)
response = s.query('ISO90210')
if response.results.numFound == '0':
    print("No results found!")

$ python simplefail.py
DPCO: A Framework For Synergy
DPCO, or Dynamic Performance Class Organisation is a ISO90210 quality [...]
