Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Search Engine-Building with Lucene and Solr Part 2 Kai Chan SoCal Code Camp, November 2013

description

These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013. http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a

Transcript of Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Page 1: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Search Engine-Building with Lucene and Solr

Part 2

Kai Chan

SoCal Code Camp, November 2013

Page 2: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Overview

● indexing process
● searching process
● advanced features
● scaling/redundancy
● resources
● demo
● questions/answers

Page 3: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Indexing Process

● request handler
  ○ data are read to create documents
● update request processor chain
  ○ optional document-wide processing
  ○ fields can be added, changed, removed
  ○ analysis
  ○ creation of indexed and stored fields
● update handler
  ○ the index is updated

Page 4: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Update Request Processor Chain

● de-duplication
  ○ creates a signature (hash) for each document to be added
  ○ replaces (deletes) existing documents with the same signature
  ○ MD5Signature
    ■ exact hashing
  ○ Lookup3Signature
    ■ faster calculation and smaller hash than MD5
  ○ TextProfileSignature
    ■ fuzzy hashing, near-duplicate detection
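The de-duplication idea can be sketched in a few lines of Python. This is only an illustration of exact (MD5-style) signatures, not Solr's implementation: the in-memory `index` dict, the function names, and the choice of the `title` field are all assumptions made for the example.

```python
import hashlib

def md5_signature(doc, fields):
    """Exact hash over the chosen fields: any difference gives a different signature."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def add_with_dedup(index, doc, fields):
    """Add doc to index (a dict keyed by signature).

    An existing document with the same signature is replaced.
    Returns True if a replacement happened.
    """
    sig = md5_signature(doc, fields)
    replaced = sig in index
    index[sig] = dict(doc, signature=sig)
    return replaced

index = {}  # hypothetical stand-in for the real index
add_with_dedup(index, {"id": "1", "title": "Lucene in Action"}, ["title"])
add_with_dedup(index, {"id": "2", "title": "Lucene in Action"}, ["title"])  # replaces id 1
add_with_dedup(index, {"id": "3", "title": "Solr in Action"}, ["title"])
print(sorted(d["id"] for d in index.values()))
```

Because the hash is exact, a single changed character produces a new signature; near-duplicate detection needs a fuzzy signature instead.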

Page 5: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Update Request Processor Chain

● language detection
  ○ detects the language used in field(s)
  ○ adds a language field to the document
  ○ TikaLanguageIdentifierUpdateProcessorFactory
    ■ uses Apache Tika
  ○ LangDetectLanguageIdentifierUpdateProcessorFactory
    ■ uses language-detection library
  ○ external programs
    ■ e.g. Chromium Compact Language Detector

See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html>

Page 6: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Analysis

● analyzed
  ○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”)
  ○ manipulation of tokens
● not analyzed
  ○ the whole content treated as 1 unit for searching
● analyzed vs. not analyzed
  ○ are individual tokens meaningful on their own?
  ○ are individual tokens used in queries?

Page 7: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Example 1: book title

Lucene in Action, Second Edition: Covers Apache Lucene 3.0

● makes more sense to tokenize
● not tokenized (whole title as 1 unit): search for “Lucene”: no match

Example 2: ISBN

1-933-98817-7

● makes more sense to not tokenize
● tokenized (1 933 98817 7): search for “933”: match

Page 8: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Analysis

analyzed:
● text

not analyzed:
● number
● serial number
● GUID
● checksum

How about URL?

Page 9: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Analysis

● character filter(s)
  ○ character replacement
  ○ e.g. replacing accented characters with their base forms
    café → cafe
    jalapeño → jalapeno
● tokenizer
● token filter(s)

Page 10: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Analysis

● character filter(s)
● tokenizer
  ○ create tokens (“words”) from characters
  ○ sometimes straightforward
  ○ many unusual cases: e-mail address, URL, code, etc.
● token filter(s)

Page 11: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Analysis

● character filter(s)
● tokenizer
● token filter(s)
  ○ token replacement
    ■ change case, remove apostrophe
    ■ remove stop words (a, and, the, for)
    ■ split/join words (ice-cream, ice cream, icecream)
    ■ stemming (importing, imported → import)
    ■ synonym (nation → country)
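The three stages (character filter, tokenizer, token filters) can be mimicked with a toy pipeline. This is a rough sketch rather than Lucene's analyzers: the stop-word set and synonym map are taken from the examples above, and `naive_stem` is a deliberately crude stand-in for a real stemmer.

```python
import re
import unicodedata

STOP_WORDS = {"a", "and", "the", "for"}
SYNONYMS = {"nation": "country"}  # hypothetical one-entry synonym map

def char_filter(text):
    """Character filter: strip accent marks (café -> cafe)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def tokenize(text):
    """Tokenizer: keep runs of ASCII letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def naive_stem(token):
    """Toy stemmer: strips -ing / -ed only; real stemmers are far more careful."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def token_filters(tokens):
    """Token filters: lowercase, drop apostrophes and stop words, stem, map synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower().replace("'", "")
        if not tok or tok in STOP_WORDS:
            continue
        tok = naive_stem(tok)
        out.append(SYNONYMS.get(tok, tok))
    return out

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

print(analyze("Importing jalapeño for the café and the nation"))
```

The same chain would normally be applied both at index time and to the query, so that both sides end up with comparable tokens.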

Page 12: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Field value:
Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free Wi-Fi!

Tokens (text_en), as position:token:
1:let 2:sign 3:up 6:amaz 7:so 8:cal 9:code 10:camp 12:http 13:bit.li 14:ozizsu 15:free 16:wi 17:fi

Tokens (text_en_splitting), as position:token:
1:let 2:sign 3:up 6:amaz 7:so 8:cal 9:code 10:camp 12:http 13:bit 14:ly 15:o 16:zi 17:zsu 18:free 19:wi 20:fi
additional joined tokens: 8:socal 17:httpbitlyozizsu 20:wifi

Tokens (text_general), as position:token:
1:let's 2:sign 3:up 4:for 5:the 6:amazing 7:so 8:cal 9:code 10:camp 11:at 12:http 13:bit.ly 14:oZiZsu 15:free 16:wi 17:fi

Page 13: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Searching Process

● query parsing
● analysis
● scoring
● sorting
● loading of stored fields
● optional search components
  ○ faceting
  ○ term vector
  ○ More Like This
  ○ highlighting

Page 14: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Scoring

● for a given query, each document not filtered out gets a score (float)
● higher score: higher in the results
● scoring algorithms
  ○ default: TF-IDF
  ○ other: Okapi BM25, etc.
  ○ very customizable

See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”

Page 15: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Scoring - TF-IDF

● term frequency (TF)
  ○ how many times does this term appear in this document?
● inverse document frequency (IDF)
  ○ how many documents contain this term?
  ○ score proportional to the inverse of document frequency
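A small worked example may make the two factors concrete. The square-root and logarithm shapes below follow the commonly documented form of Lucene's classic TF-IDF similarity, but treat the exact formulas as an assumption; the three-document corpus is made up, and boosts, coord and norms are left out.

```python
import math

def tf(freq):
    """Term frequency factor: grows with the count, but slower than linearly."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger factor."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def tf_idf(term, doc_tokens, corpus):
    freq = doc_tokens.count(term)
    doc_freq = sum(1 for d in corpus if term in d)
    return tf(freq) * idf(doc_freq, len(corpus))

# Hypothetical corpus: each document is a list of already-analyzed tokens.
corpus = [
    ["lucene", "in", "action"],
    ["solr", "in", "action"],
    ["lucene", "lucene", "search"],
]

for term in ("lucene", "solr", "in"):
    print(term, [round(tf_idf(term, doc, corpus), 3) for doc in corpus])
```

In this corpus "solr" appears in one document and "in" appears in two, so a single occurrence of "solr" contributes more than a single occurrence of "in".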

Page 16: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Scoring - Other Factors

● coordination factor (coord)
  ○ documents that contain all or most query terms get higher scores
● normalizing factor (norm)
  ○ adjust for field length and query complexity
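As a rough sketch of these two factors, assuming the usual textbook forms (fraction of query terms matched for coord, and one over the square root of the field length for the length part of norm); the query-complexity part of norm is not modeled here, and the function names are illustrative.

```python
import math

def coord(query_terms, doc_tokens):
    """Fraction of distinct query terms that occur in the document."""
    unique = set(query_terms)
    if not unique:
        return 0.0
    matched = sum(1 for t in unique if t in doc_tokens)
    return matched / len(unique)

def length_norm(doc_tokens):
    """Shorter fields get a larger factor, so a match in a short field counts more."""
    return 1.0 / math.sqrt(len(doc_tokens))

print(coord(["lucene", "solr"], ["lucene", "in", "action"]))
print(length_norm(["lucene", "in", "action", "book"]))
```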

Page 17: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Scoring - Boost

● manual override: ask Lucene/Solr to give a higher score to some particular thing(s)

● index-time
  ○ per document
  ○ per field (of a particular document)
● search-time
  ○ per query

Page 18: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

More Like This

● finds documents similar in content (of one field) to those matched

● constructs a query based on the highest scoring terms in a document

● requires the field to:
  ○ have stored term vectors (recommended), or
  ○ be stored

Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
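The "construct a query from the highest scoring terms" step might look roughly like this. It is a simplified sketch, not the MoreLikeThis implementation: the weighting (raw count times an IDF-style factor), the tie-breaking, the `text` field name and the tiny corpus are all assumptions for illustration.

```python
import math
from collections import Counter

def top_terms(doc_tokens, corpus, max_terms=3):
    """Rank the document's terms by count * idf and keep the best few."""
    counts = Counter(doc_tokens)
    n = len(corpus)

    def weight(term):
        df = sum(1 for d in corpus if term in d)
        return counts[term] * (1.0 + math.log(n / (df + 1)))

    # Highest weight first; ties broken alphabetically for a stable result.
    ranked = sorted(counts, key=lambda t: (-weight(t), t))
    return ranked[:max_terms]

def more_like_this_query(field, doc_tokens, corpus, max_terms=3):
    """Build an OR query over the selected terms."""
    return " OR ".join(f"{field}:{t}" for t in top_terms(doc_tokens, corpus, max_terms))

corpus = [
    ["lucene", "search", "index", "lucene"],
    ["solr", "search", "server"],
    ["cooking", "recipes", "index"],
]
print(more_like_this_query("text", corpus[0], corpus, max_terms=2))
```

Having term vectors stored lets the term counts be read directly; otherwise the stored field value has to be re-analyzed to obtain them.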

Page 19: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Spell Checking

● typos in queries happen
● returns spell checking suggestions (if any) within the same result
● can also be used for auto-complete
  ○ treating a prefix as a spelling mistake
  ○ returning full words as suggestions

Page 20: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <int name="startOffset">6</int>
      <int name="endOffset">13</int>
      <arr name="suggestion">
        <str>business</str>
      </arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <int name="startOffset">14</int>
      <int name="endOffset">26</int>
      <arr name="suggestion">
        <str>communication</str>
      </arr>
    </lst>
  </lst>
</lst>

/select?q=text:"busness comunication"&spellcheck=true&wt=xml
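On the client side, the suggestions can be pulled out of a response fragment like the one above with a few lines of standard-library Python. This is a minimal sketch: the `parse_suggestions` helper is hypothetical, and it assumes the fragment is passed on its own (in a full response the spellcheck element is nested inside the response element).

```python
import xml.etree.ElementTree as ET

RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <int name="startOffset">6</int>
      <int name="endOffset">13</int>
      <arr name="suggestion"><str>business</str></arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <int name="startOffset">14</int>
      <int name="endOffset">26</int>
      <arr name="suggestion"><str>communication</str></arr>
    </lst>
  </lst>
</lst>
"""

def parse_suggestions(xml_text):
    """Map each misspelled word to its list of suggested replacements."""
    root = ET.fromstring(xml_text.strip())
    suggestions = root.find("lst[@name='suggestions']")
    result = {}
    for entry in suggestions.findall("lst"):
        words = [s.text for s in entry.findall("arr[@name='suggestion']/str")]
        result[entry.get("name")] = words
    return result

print(parse_suggestions(RESPONSE))
```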

Page 21: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Query Elevation

● a.k.a. “sponsored search”
● make sure certain documents appear at the top of the results for a certain query

Page 22: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Credit: Google Web Search <http://www.google.com/>

Page 23: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Query Elevation

● configure the elevator search component in solrconfig.xml

● in elevate.xml, specify the queries and the list of documents (by id) to elevate or exclude

● enable query elevation: enableElevation=true

● (optional) override the sort parameter: forceElevation=true
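The effect on a result list can be pictured with a small simulation. This only models the visible behavior under stated assumptions (elevated documents are pinned to the top in the configured order even if they were not in the normal results, and excluded documents are dropped); it is not how the elevator component is implemented, and `apply_elevation` is a made-up helper.

```python
def apply_elevation(results, elevated_ids, excluded_ids=()):
    """Return document ids with elevated ones first and excluded ones removed.

    results      -- ids in their normal ranked order
    elevated_ids -- ids to pin to the top, in the configured order
    excluded_ids -- ids that must not appear for this query
    """
    excluded = set(excluded_ids)
    pinned = [doc_id for doc_id in elevated_ids if doc_id not in excluded]
    rest = [doc_id for doc_id in results
            if doc_id not in excluded and doc_id not in pinned]
    return pinned + rest

print(apply_elevation(["a", "b", "c", "d"], elevated_ids=["c", "z"], excluded_ids=["b"]))
```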

Page 24: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Function Query

● like formulas in Excel
● apply functions to field values for filtering and scoring

Page 25: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Function Query

● query: q={!func} cos(angle)

● query (range): q={!frange l=0.5 u=1} cos(angle)

● field: fl=angle,cos(angle)

● sort: sort=cos(angle) desc
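What the range filter and the sort do can be emulated over a handful of in-memory documents. This is a sketch of the semantics only, assuming both range bounds are inclusive; the documents and the `frange` helper are invented for the example.

```python
import math

# Hypothetical documents, each with a numeric "angle" field (radians).
docs = [
    {"id": "a", "angle": 0.0},
    {"id": "b", "angle": math.pi / 2},
    {"id": "c", "angle": 1.0},
    {"id": "d", "angle": 2.0},
]

def cos_angle(doc):
    """The function applied to the field value, as in cos(angle)."""
    return math.cos(doc["angle"])

def frange(docs, func, lower, upper):
    """Keep documents whose function value lies in [lower, upper]."""
    return [d for d in docs if lower <= func(d) <= upper]

# Like q={!frange l=0.5 u=1} cos(angle)
matched = [d["id"] for d in frange(docs, cos_angle, 0.5, 1)]

# Like sort=cos(angle) desc
ranked = [d["id"] for d in sorted(docs, key=cos_angle, reverse=True)]

print(matched, ranked)
```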

Page 26: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Spatial Search

● data: contains locations (longitudes, latitudes)
  ○ e.g. merchants with store locations
● search: filter and/or sort by location

Page 27: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Credit: Google Maps <http://maps.google.com/>

Page 28: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Spatial Search

● geofilt
  ○ circle centered at a given point
  ○ distance from a given point
  ○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
  ○ square (“bounding box”) centered at a given point
  ○ distance from a given point + corners
  ○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
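The difference between the two filters can be sketched with plain geometry. This is an approximation for illustration, not Solr's spatial code: distances use the haversine formula with a mean Earth radius, the box is computed with a simple degrees-per-kilometer conversion that ignores the poles and the date line, and the sample point is made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def in_geofilt(point, center, d_km):
    """Circle: the point is within d_km of the center."""
    return haversine_km(*point, *center) <= d_km

def in_bbox(point, center, d_km):
    """Square: the point is within d_km of the center along each axis."""
    lat, lon = point
    clat, clon = center
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(d_km / (EARTH_RADIUS_KM * math.cos(math.radians(clat))))
    return abs(lat - clat) <= dlat and abs(lon - clon) <= dlon

center = (45.15, -93.85)
near_corner = (45.19, -93.79)  # hypothetical store near a corner of the box

print(in_geofilt(near_corner, center, 5), in_bbox(near_corner, center, 5))
```

A point near a corner of the box is accepted by the box test but rejected by the circle test, which is why bbox can return documents slightly farther away than d.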

Page 29: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

[Figure: geofilt (circle) and bbox (square), each with a distance of 5 km, centered at (45.15, -93.85)]

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>

Page 30: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

[Figure: geofilt (circle) and bbox (square), each with a distance of 5 km, centered at (45.15, -93.85), with sample points marked “o” and “x”]

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>

Page 31: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Spatial Search

● geodist
  ○ returns the distance between the location given in a field and a certain coordinate
  ○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score:
    q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
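When such a request is sent from code, the parameters need URL encoding (braces, parentheses, comma, space). A minimal sketch using the standard library follows; the base URL is an assumption (host, port and handler path depend on the installation), and `geodist_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs

# Assumed endpoint; adjust host, port and core/handler path for a real setup.
BASE_URL = "http://localhost:8983/solr/select"

def geodist_url(sfield, lat, lon):
    """Build a request that scores and sorts documents by distance from (lat, lon)."""
    params = {
        "q": "{!func}geodist()",
        "sfield": sfield,
        "pt": f"{lat},{lon}",
        "sort": "score asc",
    }
    return BASE_URL + "?" + urlencode(params)

url = geodist_url("store", 45.15, -93.85)
print(url)
```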

Page 32: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Scaling/Redundancy - Problems

● collection too large for a single machine
● too many requests for a single machine
● a machine can go down

Page 33: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Scaling/Redundancy - Solutions

● collection too large for a single machine
  ○ distribution
    ■ spread the collection across multiple machines
● too many requests for a single machine
  ○ distribution
    ■ spread the requests across multiple machines
● a machine can go down
  ○ replication
    ■ copy data and configuration across multiple machines
    ■ make sure there is no single point of failure

Page 34: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

SolrCloud

● Solr instances
● ZooKeeper instances

Page 35: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

SolrCloud

● Solr instances
  ○ collection (logical index) divided into one or more partial collections (“shards”)
  ○ for each shard, one or more Solr instances keep copies of the data
    ■ one as leader - handles reads and writes
    ■ others as replicas - handle reads
● ZooKeeper instances

Page 36: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

SolrCloud

● Solr instances
● ZooKeeper instances
  ○ management of Solr instances
  ○ leader election
  ○ node discovery
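The two ideas (each document belongs to exactly one shard, and each shard always needs one live leader) can be pictured with a toy model. This is a conceptual sketch only: SolrCloud's actual document routing and ZooKeeper's election protocol work differently, and the hash choice, node names and helper functions are assumptions.

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Deterministically map a document id to one of num_shards shards."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def elect_leader(nodes, live_nodes):
    """Pick the first live node in the list as leader (stand-in for a real election)."""
    for node in nodes:
        if node in live_nodes:
            return node
    return None  # no live node: the shard is unavailable

shard_nodes = ["node1", "node2", "node3"]  # hypothetical instances holding one shard

print(shard_for("doc-42", 3))
print(elect_leader(shard_nodes, {"node1", "node2", "node3"}))
print(elect_leader(shard_nodes, {"node2", "node3"}))  # node1 went down
```

When the original leader comes back, it rejoins as a replica in this model; the shard stays available as long as at least one node holding it is live.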

Page 37: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

[Diagram: a collection (i.e. logical index) split into three shards (shard 1, shard 2, shard 3), each holding ⅓ of the collection; each shard has one leader and one or more replicas]

Page 38: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

[Diagram: a collection (i.e. logical index) split into three shards (shard 1, shard 2, shard 3), each holding ⅓ of the collection; each shard has one leader and one or more replicas, with one more replica than on the previous slide]

Page 39: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

[Diagram: a collection (i.e. logical index) split into three shards (shard 1, shard 2, shard 3), each holding ⅓ of the collection; in one shard the original leader is offline and another instance is now the leader]

Page 40: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

[Diagram: a collection (i.e. logical index) split into three shards (shard 1, shard 2, shard 3), each holding ⅓ of the collection; the instance that was offline is now a replica, and the instance that took over remains the leader]

Page 41: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Resources - Books

● Lucene in Action
  ○ written by 3 committers and PMC members
  ○ somewhat outdated (2010; covers Lucene 3.0)
  ○ http://www.manning.com/hatcher3/
● Solr in Action
  ○ early access; coming out later this year
  ○ http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
  ○ common problems and useful tips
  ○ http://www.packtpub.com/apache-solr-4-cookbook/book

Page 42: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Resources - Books

● Introduction to Information Retrieval
  ○ not specific to Lucene/Solr, but about IR concepts
  ○ free e-book
  ○ http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
  ○ indexing, compression and other topics
  ○ accompanied by MG4J - full-text search software
  ○ http://mg4j.di.unimi.it/

Page 43: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Resources - Web

● official websites
  ○ Lucene Core - http://lucene.apache.org/core/
  ○ Solr - http://lucene.apache.org/solr/
● mailing lists
● Wiki sites
  ○ Lucene Core - http://wiki.apache.org/lucene-java/
  ○ Solr - http://wiki.apache.org/solr/
● reference guides
  ○ API Documentation for Lucene and Solr
  ○ Apache Solr Reference Guide

Page 44: Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

Getting Started

● download Solr
  ○ requires Java 6 or newer to run
● Solr comes bundled/configured with Jetty
  ○ <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
  ○ <Solr directory>/example/exampledocs/post.jar
  ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
  ○ http://localhost:8983/solr/