Search Engine Building with Lucene and Solr (SoCal Code Camp San Diego 2014)
![Page 1: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/1.jpg)
Search Engine-Building with Lucene and Solr
Kai Chan
SoCal Code Camp, June 2014
http://bit.ly/sdcodecamp2014solr
![Page 2: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/2.jpg)
all data
matched data
data that a user actually sees
![Page 3: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/3.jpg)
Lucene
● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls methods in the Lucene API
● provides building blocks for custom features
![Page 4: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/4.jpg)
Solr
● full-text search platform
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene
![Page 5: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/5.jpg)
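[Diagram: Lucene and Solr deployments compared]
Lucene: a machine running a Java VM hosts your application, which contains the Lucene code (libraries) and works on the index.
Solr: a machine running a Java VM hosts a servlet container (e.g. Tomcat, Jetty), which runs Solr (Solr code plus Lucene code libraries) on the index; clients connect over HTTP.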
![Page 6: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/6.jpg)
How Data Are Organized
[Diagram: a collection contains documents; each document contains fields]
![Page 7: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/7.jpg)
field
content (e.g. "please read" or 30)
name (e.g. "title" or "price")
type
options
![Page 8: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/8.jpg)
[Diagram: a collection of three documents whose fields are drawn from subject, date, from, text and reply-to; subject and reply-to each appear in only two of the documents]
![Page 9: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/9.jpg)
[Diagram: a collection of three documents with different fields; field names shown: subject, date, from, text, title, SKU, price, description, first name, last name, phone, address]
![Page 10: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/10.jpg)
Solr Field Definition
● field
  o name (e.g. "subject")
  o type (e.g. "text_general")
  o options (e.g. indexed="true" stored="true")
● field type
  o text: "string", "text_general"
  o numeric: "int", "long", "float", "double"
● options
  o indexed: content can be searched
  o stored: content can be returned at search-time
  o multivalued: multiple values per field & document
![Page 11: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/11.jpg)
Solr Dynamic Field
● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued
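The naming convention above can be pictured as a lookup from name patterns to field properties. The following is only an illustrative sketch, not how Solr implements dynamic fields: the `DYNAMIC_FIELDS` table and `resolve_dynamic_field` function are hypothetical names, and only the `_i` and `_ss` suffixes come from the slide.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: glob pattern -> field properties.
# Only the "_i" and "_ss" suffixes come from the slide; the layout is ours.
DYNAMIC_FIELDS = [
    ("*_i", {"type": "int", "indexed": True, "stored": True, "multivalued": False}),
    ("*_ss", {"type": "string", "indexed": True, "stored": True, "multivalued": True}),
]


def resolve_dynamic_field(name):
    """Return the properties of the first pattern matching the name, or None."""
    for pattern, properties in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return properties
    return None


print(resolve_dynamic_field("amount_i"))
print(resolve_dynamic_field("tag_ss"))
print(resolve_dynamic_field("title"))  # no pattern matches
```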
![Page 12: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/12.jpg)
Solr Copy Field
● copy one or more fields into another field
● can be used to define a catch-all field
  o source: "title", "author", "content"
  o destination: "text"
  o searching the "text" field has the effect of searching all the other three fields
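The effect of the catch-all field can be sketched in a few lines. This is a simplified model of the idea, not Solr's code: `COPY_RULES` and `apply_copy_fields` are hypothetical names, the sample values are made up, and every field is treated as a list of values.

```python
# Hypothetical copy rules: (source field, destination field), as on the slide.
COPY_RULES = [("title", "text"), ("author", "text"), ("content", "text")]


def apply_copy_fields(document, rules=COPY_RULES):
    """Return a new document with each source's values appended to its destination."""
    result = {field: list(values) for field, values in document.items()}
    for source, destination in rules:
        if source in document:
            result.setdefault(destination, []).extend(document[source])
    return result


doc = {
    "title": ["sample title"],
    "author": ["sample author"],
    "content": ["sample body"],
}
indexed = apply_copy_fields(doc)
print(indexed["text"])
```

A search against "text" then sees the values of all three source fields, while the source fields themselves are left unchanged.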
![Page 13: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/13.jpg)
Indexing - UpdateRequestHandler
● upload (POST) content or file to http://host:port/solr/update
● formats: XML, JSON, CSV
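As a rough sketch of such an upload, the snippet below builds (but does not send) a JSON POST request using only the Python standard library. The host, port, sample document, and `commit=true` parameter are assumptions for illustration; check the update handler documentation of your Solr version for the exact JSON shape it accepts.

```python
import json
import urllib.request


def build_update_request(documents, base_url="http://localhost:8983/solr"):
    """Build a POST request that uploads documents as a JSON array."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update?commit=true",  # commit=true is an assumption
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request([{"id": "doc1", "title": "tablet"}])
print(request.get_method(), request.full_url)
# To actually send it (requires a running Solr instance):
# urllib.request.urlopen(request)
```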
![Page 14: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/14.jpg)
Indexing - DataImportHandler
● has its own config file (data-config.xml)
● import data from various sources
  o RDBMS (JDBC)
  o e-mail (IMAP)
  o XML data locally (file) or remotely (HTTP)
● transformers
  o extract data (RegEx, XPath)
  o manipulate data (strip HTML tags)
![Page 15: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/15.jpg)
Indexing - ExtractingRequestHandler
● allows indexing of different formats
  o e.g. PDF, MS Word, XML
● extract text and metadata
● maps extracted text to the "content" field
● maps metadata to different fields
![Page 16: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/16.jpg)
Searching - Basics
● send request to http://host:port/solr/select
● parameters
  o q - main query
  o fq - filter query
  o defType - query parser (e.g. lucene, edismax)
  o fl - fields to return
  o sort - sort criteria
  o wt - response writer (e.g. xml, json)
  o indent - set to true for pretty-printing
![Page 17: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/17.jpg)
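http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json
● search handler's URL: http://localhost:8983/solr/select
● main query: q=title:tablet
● fields to return: fl=title,price,inStock
● sort criteria: sort=price+asc
● response writer: wt=json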
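A request URL like this one can be assembled with the Python standard library, which also takes care of URL encoding. This is a minimal sketch: `build_select_url` is a hypothetical helper, the base URL is assumed to be a local Solr, and the sort value spells out the direction following the fieldName (asc|desc) syntax used for sorting.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(params, base_url="http://localhost:8983/solr/select"):
    """Append URL-encoded query parameters to the search handler's URL."""
    return base_url + "?" + urlencode(params)


url = build_select_url({
    "q": "title:tablet",           # main query
    "fl": "title,price,inStock",   # fields to return
    "sort": "price asc",           # sort criteria
    "wt": "json",                  # response writer
})
print(url)

# Decoding the query string gives back the original parameter values.
print(parse_qs(urlsplit(url).query))
```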
![Page 18: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/18.jpg)
Searching - Query Syntax
name:tablet
name:"galaxy tab"
name:tablet category:tablet
+name:tablet +category:tablet
![Page 19: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/19.jpg)
Searching - Query Syntax (cont.)
+name:tablet +(manu:apple manu:samsung)
+name:tablet -manu:apple
+name:tablet +range:[300 TO 500]
+name:tablet manu:apple^5
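The operators above compose mechanically, so query strings can be built from small helpers. The sketch below is illustrative only: all function names are hypothetical, and it does not escape the query parser's special characters beyond quoting phrases.

```python
def term(field, value):
    """field:value, quoting the value as a phrase if it contains whitespace."""
    if any(ch.isspace() for ch in value):
        value = '"' + value.replace('"', '\\"') + '"'
    return f"{field}:{value}"


def required(clause):
    """Mark a clause as required (+)."""
    return "+" + clause


def prohibited(clause):
    """Mark a clause as prohibited (-)."""
    return "-" + clause


def any_of(*clauses):
    """Group clauses in parentheses so that any of them may match."""
    return "(" + " ".join(clauses) + ")"


def boosted(clause, factor):
    """Apply a boost factor (^) to a clause."""
    return f"{clause}^{factor}"


query = " ".join([
    required(term("name", "tablet")),
    required(any_of(term("manu", "apple"), term("manu", "samsung"))),
])
print(query)
print(term("name", "galaxy tab"))
print(required(term("name", "tablet")) + " " + boosted(term("manu", "apple"), 5))
```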
![Page 20: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/20.jpg)
EDisMax Parser
● suitable for user-generated querieso does not complain about the syntaxo does not require field name in queryo searches across several fields
● configurable
![Page 21: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/21.jpg)
Sorting
● default: sorting by decreasing score
● custom sorting rules: use the sort parameter
  o syntax: fieldName (asc|desc)
  o e.g. sort by ascending price (i.e. lowest price first): price asc
  o e.g. sort by descending date (i.e. newest date first): date desc
![Page 22: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/22.jpg)
Sorting
● multiple fields and orders: separate by commas
  o e.g. sort by descending starRating and ascending price:
    starRating desc, price asc
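A small helper can assemble the value of the sort parameter from (field, order) pairs and reject anything other than asc or desc. `build_sort` is a hypothetical name used only for this sketch.

```python
def build_sort(criteria):
    """Join (field, order) pairs into a comma-separated sort specification."""
    parts = []
    for field, order in criteria:
        if order not in ("asc", "desc"):
            raise ValueError(f"order must be 'asc' or 'desc', got {order!r}")
        parts.append(f"{field} {order}")
    return ", ".join(parts)


print(build_sort([("starRating", "desc"), ("price", "asc")]))
```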
![Page 23: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/23.jpg)
Sorting
● cannot use multivalued fields
● overrides the default sorting behavior
![Page 24: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/24.jpg)
Faceted Search
● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
  o show possible values
  o let users narrow down their searches easily
![Page 25: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/25.jpg)
facet
facet values (5 of them)
![Page 26: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/26.jpg)
Faceted Search
● set facet parameter to true to enable faceting
● other parameters
  o facet.field - use the field's values as facets; returns <value, count> pairs
  o facet.query - use the given queries as facets; returns <query, count> pairs
  o facet.sort - set the ordering of the facets; can be "count" or "index"
  o facet.offset and facet.limit - used for pagination of facets
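The parameters above can be combined into one query string; repeating facet.field requests facets on several fields. This is a sketch only: `build_facet_params` is a hypothetical helper, and the field names "manu" and "category" are borrowed from the query syntax examples as assumed sample fields.

```python
from urllib.parse import parse_qs, urlencode


def build_facet_params(query, fields, limit=10, offset=0, sort="count"):
    """Build a query string that enables faceting on the given fields."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", field) for field in fields]  # one entry per field
    params += [
        ("facet.sort", sort),
        ("facet.offset", str(offset)),
        ("facet.limit", str(limit)),
    ]
    return urlencode(params)


facet_query = build_facet_params("name:tablet", ["manu", "category"], limit=5)
print(facet_query)
```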
![Page 27: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/27.jpg)
Spatial Search
● data: locations (longitudes, latitudes)
● search: filter and/or sort by location
![Page 28: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/28.jpg)
Filter by Location
● geofilt
  o circle centered at a given point
  o distance from a given point
  o fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
● bbox
  o square ("bounding box") centered at a given point
  o distance from a given point + corners
  o fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
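The difference between the two filters can be illustrated with plain geometry. The sketch below is not Solr's implementation: it uses the haversine formula with an assumed mean Earth radius for the circle test and a simple latitude/longitude box (ignoring poles and the date line) for the box test. A point near a corner of the box passes the box test while lying farther than d from the center.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_circle(point, center, d_km):
    """Circle test: keep points within d_km of the center."""
    return haversine_km(*point, *center) <= d_km


def box_half_sizes(center, d_km):
    """Half-height and half-width (in degrees) of the box enclosing the circle."""
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = dlat / math.cos(math.radians(center[0]))
    return dlat, dlon


def in_box(point, center, d_km):
    """Box test: keep points inside the box that encloses the circle."""
    dlat, dlon = box_half_sizes(center, d_km)
    return abs(point[0] - center[0]) <= dlat and abs(point[1] - center[1]) <= dlon


center = (45.15, -93.85)
dlat, dlon = box_half_sizes(center, 5)
near_corner = (center[0] + 0.9 * dlat, center[1] + 0.9 * dlon)
print("distance (km):", haversine_km(*near_corner, *center))
print("in circle:", in_circle(near_corner, center, 5))
print("in box:", in_box(near_corner, center, 5))
```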
![Page 29: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/29.jpg)
[Figure: geofilt region and bbox region side by side, each labeled 5 km and centered at (45.15, -93.85)]
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
![Page 30: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/30.jpg)
[Figure: the same geofilt and bbox regions, each labeled 5 km and centered at (45.15, -93.85), with sample points marked "o" and "x"]
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
![Page 31: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/31.jpg)
Sort by Location
● geodist
  o returns the distance between the location given in a field and a certain coordinate
  o e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score:
    q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
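The parameters of the geodist example can be URL-encoded programmatically, which avoids hand-escaping the braces and parentheses. `build_geodist_params` is a hypothetical helper; the parameter names and values are the ones from the example above.

```python
from urllib.parse import parse_qs, urlencode


def build_geodist_params(sfield, lat, lon):
    """Encode a function query that scores and sorts documents by distance."""
    return urlencode({
        "q": "{!func}geodist()",  # the distance becomes the score
        "sfield": sfield,         # field holding each document's location
        "pt": f"{lat},{lon}",     # reference point
        "sort": "score asc",      # nearest first
    })


geodist_query = build_geodist_params("store", 45.15, -93.85)
print(geodist_query)
```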
![Page 32: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/32.jpg)
Scaling/Redundancy
| problem | solution |
| --- | --- |
| collection too large for a single machine | distribution |
| too many requests for a single machine | distribution |
| a machine can go down | replication |
![Page 33: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/33.jpg)
SolrCloud
● Solr instances
  o collection (logical index) divided into one or more partial collections ("shards")
  o for each shard, one or more Solr instances keep copies of the data
    - one as leader: handles reads and writes
    - others as replicas: handle reads
● ZooKeeper instances
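To make the idea of dividing a collection into shards concrete, here is a deliberately simplified sketch that assigns each document to a shard by hashing its ID. This is not SolrCloud's actual routing algorithm; the CRC32 hash, the modulo step, and the names `shard_for` and `route` are assumptions chosen only for illustration.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard index for a document ID (simplified: CRC32 modulo shard count)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(doc_ids, num_shards):
    """Group document IDs by the shard they are assigned to."""
    shards = {index: [] for index in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


assignment = route([f"doc{i}" for i in range(9)], 3)
for index, ids in assignment.items():
    print(f"shard {index + 1}: {ids}")
```

Because the assignment depends only on the ID, the same document always lands in the same shard, so a later update to that document goes to the same place.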
![Page 34: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/34.jpg)
SolrCloud
● Solr instances
● ZooKeeper instances
  o management of Solr instances
  o leader election
  o node discovery
![Page 35: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/35.jpg)
[Diagram: a collection (i.e. logical index) split into three shards, each holding ⅓ of the collection; each shard has one leader and one or more replicas]
![Page 36: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/36.jpg)
[Diagram: the same collection (i.e. logical index) of three shards, each holding ⅓ of the collection, with one more replica shown]
![Page 37: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/37.jpg)
[Diagram: the same collection (i.e. logical index) of three shards, each holding ⅓ of the collection; in one shard a node is marked "(offline)" and another node is the leader]
![Page 38: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/38.jpg)
[Diagram: the same collection (i.e. logical index) of three shards, each holding ⅓ of the collection; the shard that had an offline node now shows a replica and a leader]
![Page 39: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/39.jpg)
Resources - Books
● Solr in Action
  o just released, up-to-date
  o http://www.manning.com/grainger/
● Apache Solr 4 Cookbook
  o common problems and useful tips
  o http://www.packtpub.com/apache-solr-4-cookbook/book
● Lucene in Action
  o written by 3 committers and PMC members
  o somewhat outdated (2010; covers Lucene 3.0)
  o http://www.manning.com/hatcher3/
![Page 40: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/40.jpg)
Resources - Books
● Introduction to Information Retrieval
  o not specific to Lucene/Solr, but about IR concepts
  o free e-book
  o http://nlp.stanford.edu/IR-book/
● Managing Gigabytes
  o indexing, compression and other topics
  o accompanied by MG4J - a full-text search software
  o http://mg4j.di.unimi.it/
![Page 41: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/41.jpg)
Resources - Web
● official website
  o http://lucene.apache.org/
  o Wiki
  o reference guide
  o mailing list
● StackOverflow
  o http://stackoverflow.com/
  o "Lucene" and "Solr" tags
![Page 42: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/42.jpg)
Getting Started
● download Solr
  o requires Java 7 or newer to run
● Solr comes bundled/configured with Jetty
  o <Solr directory>/example/start.jar
● "exampledocs" directory contains sample documents
  o <Solr directory>/example/exampledocs/post.jar
  o java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
● use the Solr admin interface
  o http://localhost:8983/solr/
![Page 43: Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)](https://reader036.fdocuments.us/reader036/viewer/2022062617/54c66b0e4a79594b538b482e/html5/thumbnails/43.jpg)
Thanks for Coming!
● Java Performance Tips @ 10:15, same room
● slides available
  o http://bit.ly/sdcodecamp2014solr
● please vote for my conference session
  o http://bit.ly/tvnews2014
● questions/feedback
  o [email protected]
● questions?