elasticsearch - advanced features in practice
jan-suchal -
ADVANCED FEATURES IN PRACTICE
@JSUCHAL
#RUBYSLAVA
elasticsearch
elasticwhat?
based on Apache Lucene
REST API
Data & API in JSON
Schema-free
Real time
Distributed
Advanced functionality
Quickstart
1. Download & extract from http://www.elasticsearch.org/download/
2. $ bin/elasticsearch -f
3. There is no step 3.
Quickstart - index
$ curl -XPOST 'http://localhost:9200/rubyslava/talks/1' -d '{
"title" : "elasticsearch - advanced features in practice",
"presenter" : "jsuchal",
"presented_at" : "2011-09-22T19:00:00",
"message" : "hopefully clear",
"tags" : ["elasticsearch", "rocks"]
}'
=> {"ok":true,"_index":"rubyslava","_type":"talks","_id":"1","_version":1}
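The same index call can be sketched from Ruby with the standard library. This is a minimal sketch, not part of the original slides: the helper name `build_index_request` is made up for illustration, and the request is only built, not sent, so it runs without a node. The `Content-Type` header is an addition that newer elasticsearch versions expect.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build (but do not send) the index request from the quickstart.
# Host, port, index, type and id mirror the curl example above.
def build_index_request(index, type, id, document)
  uri = URI("http://localhost:9200/#{index}/#{type}/#{id}")
  request = Net::HTTP::Post.new(uri.path)
  request['Content-Type'] = 'application/json' # assumption: not in the slides
  request.body = JSON.generate(document)
  [uri, request]
end

talk = {
  'title'        => 'elasticsearch - advanced features in practice',
  'presenter'    => 'jsuchal',
  'presented_at' => '2011-09-22T19:00:00',
  'message'      => 'hopefully clear',
  'tags'         => %w[elasticsearch rocks]
}

uri, request = build_index_request('rubyslava', 'talks', 1, talk)
puts "#{request.method} #{uri}"

# Sending needs a running node, so it is left commented out:
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
# puts response.body
```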
Quickstart - search
$ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?q=jsuchal&pretty'
=> {
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 1,
    "max_score" : 0.054244425,
    "hits" : [ {
      "_index" : "rubyslava",
      "_type" : "talks",
      "_id" : "1",
      "_score" : 0.054244425,
      "_source" : {
        "title" : "elasticsearch - advanced features in practice",
        "presenter" : "jsuchal",
        "presented_at" : "2011-09-22T19:00:00",
        "message" : "hopefully clear",
        "tags" : ["elasticsearch", "rocks"]
      }
    } ]
  }
}
Advanced features
Search
analyzer, stemming, ngrams, ascii folding & custom analyzers
boosting, fragment highlighting, fuzzy search
…
Facets
Percolate
Scroll
and more…
The Case Study
Find suspicious government contracts
using heuristics
IT contract where price > 1M euro
Supplier company age < 3 months
using crowdsourcing
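The two example heuristics could be written as query DSL bodies along these lines. This is only a sketch under assumptions: the slides do not show the index mapping, so the field names `category`, `price_eur` and `supplier_age_days` are hypothetical, and "3 months" is approximated as 90 days in a precomputed field.

```ruby
require 'json'

# Hypothetical field names; the real mapping is not shown in the slides.

# Heuristic: IT contract where price > 1M euro.
def expensive_it_contract_query(min_price_eur = 1_000_000)
  {
    'query' => {
      'bool' => {
        'must' => [
          { 'term'  => { 'category' => 'it' } },
          { 'range' => { 'price_eur' => { 'gt' => min_price_eur } } }
        ]
      }
    }
  }
end

# Heuristic: supplier company age < 3 months (approximated as 90 days,
# assuming the age is stored on the contract at index time).
def young_supplier_query(max_age_days = 90)
  {
    'query' => {
      'range' => { 'supplier_age_days' => { 'lt' => max_age_days } }
    }
  }
end

puts JSON.pretty_generate(expensive_it_contract_query)
puts JSON.pretty_generate(young_supplier_query)
```

Keeping each heuristic as a plain query body means the same hash can be used both for searching existing contracts and for registering a saved search.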
Data
Central government contract repositories
www.crz.gov.sk, zmluvy.egov.sk
~70K contracts in 8 months
100+ GB pdf/doc/scan
The Solution
Faceted search
Search
e.g. Find all contracts by Orange Slovakia
Analyze
e.g. Which department has most contracts with Orange Slovakia?
e.g. What is the contract price distribution for Orange Slovakia?
…
Define penalty heuristics
Facets
Types
term, range, histogram, statistical, geo distance
$ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?pretty' -d '{
  "query" : { "match_all" : { } },
  "facets" : {
    "tags_facet" : {
      "terms" : { "field" : "tags", "size" : 10 }
    }
  }
}'
Facets - results
{
  "took" : 2,
  …
  "hits" : { … },
  "facets" : {
    "tags_facet" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 2,
      "other" : 0,
      "terms" : [
        { "term" : "rocks", "count" : 1 },
        { "term" : "elasticsearch", "count" : 1 }
      ]
    }
  }
}
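On the client side, a terms facet like the one above is easy to turn into a term-to-count hash. A minimal sketch: the helper name `facet_counts` is made up for illustration, and the embedded JSON is the facet part of the response above with the elided `hits` left out.

```ruby
require 'json'

# Facet part of the response shown above (hits omitted).
response_json = <<~JSON
  {
    "took": 2,
    "facets": {
      "tags_facet": {
        "_type": "terms", "missing": 0, "total": 2, "other": 0,
        "terms": [
          { "term": "rocks", "count": 1 },
          { "term": "elasticsearch", "count": 1 }
        ]
      }
    }
  }
JSON

# Turn a terms facet into a { term => count } hash.
def facet_counts(response, facet_name)
  facet = response.fetch('facets').fetch(facet_name)
  facet.fetch('terms').each_with_object({}) do |entry, counts|
    counts[entry['term']] = entry['count']
  end
end

counts = facet_counts(JSON.parse(response_json), 'tags_facet')
counts.each { |term, count| puts "#{term}: #{count}" }
```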
Facets - advanced
Problem
Generate options for facets with some selected restrictions
Solution
global facet
facet_filter
{
  "facets" : {
    "<FACET NAME>" : {
      "<FACET TYPE>" : { ... },
      "global" : true,
      "facet_filter" : {
        "term" : { "supplier.untouched" : "Orange Slovakia, a.s." }
      }
    }
  }
}
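One way to generate such facet definitions is sketched below. It rests on assumptions not stated in the slides: each facet is filtered by every selected restriction except the one on its own field (so its other options stay selectable), several filters are combined with an `and` filter, and the `department` field and the helper name `facets_for` are hypothetical.

```ruby
require 'json'

# facet_fields: fields to offer as facets.
# selected:     { field => value } restrictions already chosen by the user.
# Every facet is global and filtered by the restrictions on the OTHER fields.
def facets_for(facet_fields, selected)
  facet_fields.each_with_object({}) do |field, facets|
    others  = selected.reject { |selected_field, _| selected_field == field }
    filters = others.map { |other_field, value| { 'term' => { other_field => value } } }

    facet = { 'terms' => { 'field' => field, 'size' => 10 }, 'global' => true }
    if filters.size == 1
      facet['facet_filter'] = filters.first
    elsif filters.size > 1
      facet['facet_filter'] = { 'and' => filters }
    end
    # With no other restrictions the facet stays unfiltered.

    facets["#{field}_facet"] = facet
  end
end

selected = { 'supplier.untouched' => 'Orange Slovakia, a.s.' }
facets = facets_for(['supplier.untouched', 'department'], selected)
puts JSON.pretty_generate({ 'facets' => facets })
```

Here the `department` facet is restricted to contracts of the selected supplier, while the supplier facet itself still lists all suppliers.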
Percolate
Problem
New contract/document added, which heuristics does it match?
Solution
1. Save heuristics/searches in percolator index
2. Percolate new documents
Percolate
$ curl -XPUT 'localhost:9200/_percolator/rubyslava/heuristic-1' -d '{
"query" : {
"term" : {
"tags" : "rubyslava"
}
}
}'
$ curl -XPOST 'http://localhost:9200/rubyslava/talks/_percolate' -d '{
"doc" : {
"tags" : ["rubyslava", "rocks", "too"]
}
}'
=> {"ok":true,"matches":["heuristic-1"]}
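The idea behind the two steps can be illustrated with a toy in-memory model. This is only a conceptual sketch, not how elasticsearch implements percolation: the class `ToyPercolator` is hypothetical and supports nothing but `term` queries. The second heuristic is added to show a non-match.

```ruby
# Toy in-memory model of percolation (search in reverse):
# queries are stored first, documents are matched against them later.
class ToyPercolator
  def initialize
    @queries = {}
  end

  # Step 1: save a heuristic (a query) under a name.
  def register(name, query)
    @queries[name] = query
  end

  # Step 2: return the names of all saved heuristics the document matches.
  def percolate(doc)
    @queries.select { |_, query| term_match?(query.fetch('term'), doc) }.keys
  end

  private

  # A term query matches when the field holds the value
  # (single value or array of values).
  def term_match?(term, doc)
    term.all? { |field, value| Array(doc[field]).include?(value) }
  end
end

percolator = ToyPercolator.new
percolator.register('heuristic-1', { 'term' => { 'tags' => 'rubyslava' } })
percolator.register('heuristic-2', { 'term' => { 'tags' => 'lucene' } })

matches = percolator.percolate({ 'tags' => %w[rubyslava rocks too] })
puts matches.inspect
```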
Scroll
Problem
New heuristic added and matches many (1K+) documents
Add heuristic to all matching documents
+ Large offsets are slow (the same paging problem known from RDBMS)
Solution
Use async background job
Scroll through results (a.k.a. cursor)
Scroll
$ curl -XGET 'http://localhost:9200/rubyslava/talks/_search?scroll=5m&pretty' -d '{
"query": {
"match_all" : {}
}
}'
=>
{
"_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTs5MzpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzk0OlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7OTU6UGJpc19yNlRSUUtHUUhIRl9rek00UTs5MjpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzkxOlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7MDs=",
"took" : 2,
…
"hits" : […],
}
$ curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m&scroll_id=cXVlcnlUaGVuRmV0Y2g…'
=> more results & repeat
Ruby Scroll API
Mimics find_each in ActiveRecord
def find_each(query, &block)
  scroll_id = nil
  processed = 0
  begin
    unless scroll_id
      result = initiate_scroll(query)
      scroll_id = result.scroll_id
    else
      result = scroll(scroll_id)
    end
    result.hits.each do |document|
      yield document
    end
    processed += result.hits.size
  end while processed < result.hits.total
end
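To try the loop without a cluster, `initiate_scroll` and `scroll` can be replaced by in-memory stand-ins. The sketch below is an assumption-laden variant, not the author's code: `FakeScrollClient`, its page size and its fixed scroll id are hypothetical, and the loop additionally stops on an empty page so that it cannot spin if the reported total is never reached.

```ruby
# Hypothetical in-memory stand-in for a scrolling search.
class FakeScrollClient
  PAGE_SIZE = 2
  Result = Struct.new(:scroll_id, :hits)
  # An array of documents that also knows the total number of matches.
  Hits = Class.new(Array) { attr_accessor :total }

  def initialize(documents)
    @documents = documents
    @offsets = {}
  end

  # First request: opens the "cursor" and returns the first page.
  def initiate_scroll(_query)
    @offsets['scroll-1'] = 0
    page('scroll-1')
  end

  # Follow-up requests: return the next page for the cursor.
  def scroll(scroll_id)
    page(scroll_id)
  end

  # Same shape as the find_each loop, plus an empty-page guard.
  def find_each(query)
    scroll_id = nil
    processed = 0
    loop do
      result = scroll_id ? scroll(scroll_id) : initiate_scroll(query)
      scroll_id = result.scroll_id
      break if result.hits.empty?

      result.hits.each { |document| yield document }
      processed += result.hits.size
      break if processed >= result.hits.total
    end
  end

  private

  def page(scroll_id)
    offset = @offsets[scroll_id]
    hits = Hits.new(@documents[offset, PAGE_SIZE] || [])
    hits.total = @documents.size
    @offsets[scroll_id] = offset + hits.size
    Result.new(scroll_id, hits)
  end
end

client = FakeScrollClient.new(%w[a b c d e])
seen = []
client.find_each({ 'match_all' => {} }) { |doc| seen << doc }
puts seen.inspect
```

Each document is yielded exactly once, page by page, which is what makes the approach suitable for a background job that tags all matching documents.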
Tutorials & Guides
http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained
http://www.elasticsearch.org/guide/