elasticsearch - advanced features in practice

ADVANCED FEATURES IN PRACTICE JSUCHAL #RUBYSLAVA elasticsearch

description

How we used faceted search, the percolator, and the scroll API to identify suspicious contracts published by the Slovak government.

Transcript of elasticsearch - advanced features in practice

Page 1: elasticsearch - advanced features in practice

ADVANCED FEATURES IN PRACTICE

@JSUCHAL

#RUBYSLAVA

elasticsearch

Page 2: elasticsearch - advanced features in practice

elasticwhat?

based on Apache Lucene

REST API

Data & API in JSON

Schema-free

Real time

Distributed

Advanced functionality

Page 3: elasticsearch - advanced features in practice

Quickstart

1. Download & extract from http://www.elasticsearch.org/download/

2. $ bin/elasticsearch -f

3. There is no step 3.

Page 4: elasticsearch - advanced features in practice

Quickstart - index

$ curl -XPOST 'http://localhost:9200/rubyslava/talks/1' -d '{
  "title" : "elasticsearch - advanced features in practice",
  "presenter" : "jsuchal",
  "presented_at" : "2011-09-22T19:00:00",
  "message" : "hopefully clear",
  "tags" : ["elasticsearch", "rocks"]
}'

=> {"ok":true,"_index":"rubyslava","_type":"talks","_id":"1","_version":1}
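The same indexing call can be prepared from Ruby with only the standard library. This is a minimal sketch, assuming a node on localhost:9200 as in the curl example. The helper name build_index_request is hypothetical and not part of any elasticsearch client.

```ruby
require "json"
require "net/http"

# Hypothetical helper: builds (but does not send) the HTTP request that
# indexes `document` under /<index>/<type>/<id>.
def build_index_request(index, type, id, document)
  request = Net::HTTP::Post.new("/#{index}/#{type}/#{id}")
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(document)
  request
end

talk = {
  "title"        => "elasticsearch - advanced features in practice",
  "presenter"    => "jsuchal",
  "presented_at" => "2011-09-22T19:00:00",
  "message"      => "hopefully clear",
  "tags"         => ["elasticsearch", "rocks"]
}

request = build_index_request("rubyslava", "talks", 1, talk)

# Sending needs a running node, so it is left commented out here:
# response = Net::HTTP.start("localhost", 9200) { |http| http.request(request) }
# puts response.body
```

Building the request separately from sending it keeps the sketch runnable without a server.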

Page 5: elasticsearch - advanced features in practice

Quickstart - search

$ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?q=jsuchal&pretty'

=> {
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 1,
    "max_score" : 0.054244425,
    "hits" : [ {
      "_index" : "rubyslava",
      "_type" : "talks",
      "_id" : "1",
      "_score" : 0.054244425,
      "_source" : {
        "title" : "elasticsearch - advanced features in practice",
        "presenter" : "jsuchal",
        "presented_at" : "2011-09-22T19:00:00",
        "message" : "hopefully clear",
        "tags" : ["elasticsearch", "rocks"]
      }
    } ]
  }
}

Page 6: elasticsearch - advanced features in practice

Advanced features

Search

analyzer, stemming, ngrams, ascii folding & custom analyzers

boosting, fragment highlighting, fuzzy search

Facets

Percolate

Scroll

and more…

Page 7: elasticsearch - advanced features in practice

The Case Study

Find suspicious government contracts

using heuristics

IT contract where price > 1M euro

Supplier company age < 3 months

using crowdsourcing

Data

Central government contract repositories

www.crz.gov.sk, zmluvy.egov.sk

~70K contracts in 8 months

100+ GB pdf/doc/scan

Page 8: elasticsearch - advanced features in practice

The Solution

Faceted search

Page 9: elasticsearch - advanced features in practice
Page 10: elasticsearch - advanced features in practice

The Solution

Faceted search

Search

e.g. Find all contracts by Orange Slovakia

Analyze

e.g. Which department has most contracts with Orange Slovakia?

e.g. What is the contract price distribution for Orange Slovakia?

Define penalty heuristics

Page 11: elasticsearch - advanced features in practice

Facets

Types

terms, range, histogram, statistical, geo distance

$ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?pretty' -d '{
  "query" : { "match_all" : { } },
  "facets" : {
    "tags_facet" : {
      "terms" : { "field" : "tags", "size" : 10 }
    }
  }
}'

Page 12: elasticsearch - advanced features in practice

Facets - results

{
  "took" : 2,
  …
  "hits" : { … },
  "facets" : {
    "tags_facet" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 2,
      "other" : 0,
      "terms" : [
        { "term" : "rocks", "count" : 1 },
        { "term" : "elasticsearch", "count" : 1 }
      ]
    }
  }
}
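On the client side, the facet part of such a response reduces to a simple term-to-count table. This is a minimal sketch that parses the facet section shown above. The helper name facet_counts is hypothetical.

```ruby
require "json"

# Facet section of the response shown above, trimmed to the relevant keys.
response_body = <<~BODY
  {
    "facets" : {
      "tags_facet" : {
        "_type" : "terms", "missing" : 0, "total" : 2, "other" : 0,
        "terms" : [
          { "term" : "rocks", "count" : 1 },
          { "term" : "elasticsearch", "count" : 1 }
        ]
      }
    }
  }
BODY

# Hypothetical helper: returns { term => count } for one terms facet.
def facet_counts(response, facet_name)
  facet = response.fetch("facets").fetch(facet_name)
  facet.fetch("terms").each_with_object({}) do |entry, counts|
    counts[entry["term"]] = entry["count"]
  end
end

counts = facet_counts(JSON.parse(response_body), "tags_facet")
counts.each { |term, count| puts "#{term}: #{count}" }
```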

Page 13: elasticsearch - advanced features in practice

Facets - advanced

Problem

Generate options for facets with some selected restrictions

Solution

global facet

facet_filter

{
  "facets" : {
    "<FACET NAME>" : {
      "<FACET TYPE>" : { ... },
      "global" : true,
      "facet_filter" : {
        "term" : { "supplier.untouched" : "Orange Slovakia, a.s." }
      }
    }
  }
}
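A request body of this shape can be assembled from Ruby as a plain hash. This is a sketch under assumptions: the helper name filtered_global_facet and the field name "department" are hypothetical, and only "supplier.untouched" and its value come from the slide.

```ruby
require "json"

# Hypothetical helper: a terms facet that ignores the main query ("global")
# but is still narrowed by its own facet_filter.
def filtered_global_facet(name, field, filter_field, filter_value, size = 10)
  {
    "facets" => {
      name => {
        "terms"        => { "field" => field, "size" => size },
        "global"       => true,
        "facet_filter" => { "term" => { filter_field => filter_value } }
      }
    }
  }
end

# "department" is an assumed field name; "supplier.untouched" is from the slide.
body = filtered_global_facet(
  "departments", "department", "supplier.untouched", "Orange Slovakia, a.s."
)
puts JSON.pretty_generate(body)
```

Generating one such facet per filterable field, each with the other fields' selections as its facet_filter, is one way to produce the option lists the problem statement asks for.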

Page 14: elasticsearch - advanced features in practice

Percolate

Problem

New contract/document added, which heuristics does it match?

Solution

1. Save heuristics/searches in percolator index

2. Percolate new documents
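A heuristic from the case study, such as "IT contract where price > 1M euro", could be saved as an ordinary query. This is a sketch only: the field names "category" and "price" and the value "it" are assumptions about the mapping, not taken from the slides.

```ruby
require "json"

# Hypothetical heuristic: "IT contract where price > 1M euro".
# The field names "category" and "price" are assumptions about the mapping.
def expensive_it_contract_query(min_price = 1_000_000)
  {
    "query" => {
      "bool" => {
        "must" => [
          { "term"  => { "category" => "it" } },
          { "range" => { "price" => { "gt" => min_price } } }
        ]
      }
    }
  }
end

puts JSON.pretty_generate(expensive_it_contract_query)
```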

Page 15: elasticsearch - advanced features in practice

Percolate

$ curl -XPUT 'localhost:9200/_percolator/rubyslava/heuristic-1' -d '{
  "query" : {
    "term" : {
      "tags" : "rubyslava"
    }
  }
}'

$ curl -XPOST 'http://localhost:9200/rubyslava/talks/_percolate' -d '{
  "doc" : {
    "tags" : ["rubyslava", "rocks", "too"]
  }
}'

=> {"ok":true,"matches":["heuristic-1"]}
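Percolation inverts the usual search: queries are stored, and a document is run against them. The toy in-memory model below illustrates the idea only. It handles single-field term queries and is not how elasticsearch implements the percolator. The class name ToyPercolator and the second heuristic are hypothetical.

```ruby
# Toy in-memory model of percolation, for illustration only. It supports
# just single-field term queries and is not how elasticsearch implements it.
class ToyPercolator
  def initialize
    @queries = {}
  end

  # Register a term query under a name, e.g. "heuristic-1".
  def register(name, field, value)
    @queries[name] = [field, value]
  end

  # Return the names of all registered queries the document matches.
  def percolate(document)
    @queries.select do |_name, (field, value)|
      Array(document[field]).include?(value)
    end.keys
  end
end

percolator = ToyPercolator.new
percolator.register("heuristic-1", "tags", "rubyslava")
percolator.register("heuristic-2", "tags", "python")

matches = percolator.percolate({ "tags" => ["rubyslava", "rocks", "too"] })
puts matches.inspect
```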

Page 16: elasticsearch - advanced features in practice

Scroll

Problem

New heuristic added and matches many (1K+) documents

Add heuristic to all matching documents

+ Offset performance problem known in RDBMS

Solution

Use async background job

Scroll through results (a.k.a. cursor)

Page 17: elasticsearch - advanced features in practice

Scroll

$ curl -XGET 'http://localhost:9200/rubyslava/talks/_search?scroll=5m&pretty' -d '{
  "query": {
    "match_all" : {}
  }
}'

=> {
  "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTs5MzpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzk0OlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7OTU6UGJpc19yNlRSUUtHUUhIRl9rek00UTs5MjpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzkxOlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7MDs=",
  "took" : 2,
  "hits" : […]
}

$ curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m&scroll_id=cXVlcnlUaGVuRmV0Y2g…'

=> more results & repeat

Page 18: elasticsearch - advanced features in practice

Ruby Scroll API

Mimics find_each in ActiveRecord

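def find_each(query)
  scroll_id = nil
  processed = 0
  loop do
    # The first pass opens the scroll; later passes fetch the next batch.
    result = scroll_id ? scroll(scroll_id) : initiate_scroll(query)
    # Always continue with the most recent scroll id.
    scroll_id = result.scroll_id
    result.hits.each { |document| yield document }
    processed += result.hits.size
    # Stop when everything was processed or a batch came back empty.
    break if result.hits.size.zero? || processed >= result.hits.total
  end
end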
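To try the loop without a running node, the scroll calls can be replaced by an in-memory stand-in. Everything in this sketch except the overall shape of find_each is an assumption: FakeScrollBackend, Hits and Result are hypothetical, and scroll ids are simply stringified offsets.

```ruby
# Hypothetical result objects exposing the calls find_each relies on.
class Hits
  attr_reader :total

  def initialize(documents, total)
    @documents = documents
    @total = total
  end

  def each(&block)
    @documents.each(&block)
  end

  def size
    @documents.size
  end
end

Result = Struct.new(:scroll_id, :hits)

# In-memory stand-in for a scrolling search backend (assumption: fixed-size
# batches, scroll ids are just stringified offsets).
class FakeScrollBackend
  def initialize(documents, batch_size)
    @documents = documents
    @batch_size = batch_size
  end

  def initiate_scroll(_query)
    batch_from(0)
  end

  def scroll(scroll_id)
    batch_from(Integer(scroll_id))
  end

  # Variant of the find_each loop that always uses the latest scroll id.
  def find_each(query)
    scroll_id = nil
    processed = 0
    loop do
      result = scroll_id ? scroll(scroll_id) : initiate_scroll(query)
      scroll_id = result.scroll_id
      result.hits.each { |document| yield document }
      processed += result.hits.size
      break if result.hits.size.zero? || processed >= result.hits.total
    end
  end

  private

  def batch_from(offset)
    batch = @documents[offset, @batch_size] || []
    Result.new((offset + batch.size).to_s, Hits.new(batch, @documents.size))
  end
end

backend = FakeScrollBackend.new((1..7).map { |i| { "id" => i } }, 3)
seen = []
backend.find_each({ "match_all" => {} }) { |document| seen << document["id"] }
puts seen.inspect
```

With seven documents and a batch size of three, the block is called once per document across three batches.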

Page 19: elasticsearch - advanced features in practice

Tutorials & Guides

http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch

http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained

http://www.elasticsearch.org/guide/