Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...

Post on 15-Apr-2017


ELASTICSEARCH, LOGSTASH, KIBANA

COOL SEARCH, ANALYTICS, DATA MINING AND MORE…

OLEKSIY PANCHENKO / LOHIKA / 2015

MY NAME IS…

Oleksiy Panchenko

Software engineer, Lohika

E-mail: oleksij@gmail.com

Twitter: oleskiyp

LinkedIn: https://ua.linkedin.com/in/opanchenko

AGENDA

• Introduction. What is it all about?
• Jump start Elastic. Demo time
• Architecture and deployment. Why is Elasticsearch elastic?
• Case studies. 4 real-life projects
• Query API in depth + Demo
• Elasticsearch ecosystem. ELK Stack + Demo
• Q & A

INTRODUCTION

WHAT IS IT ALL ABOUT?

HOW TO MAKE YOUR SITE SEARCHABLE?

http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png

• Google search
• Why not use plain vanilla SQL? RDBMS rocks! select * from books join authors on … where …
• Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C); Xapian
• Lucene family: Apache Lucene, Elasticsearch, Apache Solr, Amazon CloudSearch, …

WHO HAS EVER USED ELASTICSEARCH?

http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png

LUCENE AS A CORE

• Lucene = low-level Java library (JAR) which implements search functionality
• Can be used in both web and standalone applications (desktop, mobile)
• Lucene stores its index as a local binary file
• Implemented in Java, ports to other languages available
• Initial version: 1999
• Apache project since 2001
• Latest stable release: 5.2.1 (15 June 2015)

LUCENE AS A CORE

• Lucene was originally written in 1999 by Doug Cutting (creator of Hadoop and Nutch, currently Chief Architect at Cloudera) as part of an open-source web search engine (Nutch)

http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg

MORE ABOUT SEARCH ENGINES

Riak Search

TIME TO TALK ABOUT ELASTICSEARCH

https://www.elastic.co/products/elasticsearch

Near Real-Time Data (NRT)

Full-Text Search

Multilingual search, geolocation, fuzzy search, did-you-mean suggestions, autocomplete


High Availability

Multitenancy

Distributed, Horizontally Scalable


Document-Oriented

Schema-Free

Conflict Management

Optimistic Concurrency Control


Apache 2 Open Source License

Awesome documentation

Large community

Developer-Friendly, RESTful API

Client libraries available for many programming languages and frameworks.

ELASTICSEARCH USERS

https://www.elastic.co/use-cases

https://en.wikipedia.org/wiki/Elasticsearch#Users

ELASTICSEARCH – PAST & PRESENT

• 2004. Shay Banon (aka Kimchy) started working on Compass – Java Search Engine on top of Lucene
• 2010. Initial release of Elasticsearch
• Latest stable release: 1.7.1 (July 29, 2015)
• 500K downloads per month
• https://github.com/elastic/elasticsearch

http://opensource.hk/sites/default/files/u1/shay-banon.jpg

ELASTICSEARCH AS A COMPANY

• 2012. Elasticsearch BV; Funding: $104M in 3 rounds, 100+ employees
• https://www.elastic.co/
• Product portfolio:
– Elasticsearch, Logstash, Kibana (ELK stack)
– Watcher
– Shield
– Marvel
– es-hadoop
– Found

JUMP START ELASTIC

DEMO TIME

INSTALLATION & CONFIGURATION

• Prerequisites:
– JDK 6 or above (recommended: JDK 8)
– RAM: min. 2 GB (recommended: 16–64 GB for production)
– CPU: prefer more cores over a higher clock rate
– Disks: SSD recommended

• Homebrew, apt, yum: apt-get install elasticsearch

• Download (ZIP, TAR, DEB, RPM): https://www.elastic.co/downloads/elasticsearch

• Installation is absolutely straightforward and easy: https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html

LET’S TALK ABOUT TERMINOLOGY

Index ~ DB Schema

Type ~ DB Table

Document ~ Record, JSON object

Mapping ~ Schema definition in RDBMS

DEMO #1

http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg

http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg

ARCHITECTURE AND DEPLOYMENT

WHY IS ELASTICSEARCH ELASTIC?

Cluster: One or more nodes which share the same cluster name

Node: Running instance of Elasticsearch which belongs to a cluster

Shard: A portion of data – single Lucene instance. Default: 5 shards in an index

Primary Shard: Master copy of data

Replica Shard: Exact copy of a primary shard. Default: 1 replica

SINGLE-NODE CLUSTER

Shards: 0 1 2 3 4

Hash function*

{ "id": "123", "name": "john", … }

{ "id": "124", "name": "patricia", … }

{ "id": "125", "name": "scott", … }

* Also consider custom routing
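The diagram above can be read as a small routing rule: hash a routing value (by default the document ID) and take it modulo the number of primary shards. The sketch below illustrates the idea only. `zlib.crc32` is an assumed stand-in for the hash Elasticsearch uses internally, and `shard_for` is a hypothetical helper name.

```python
import zlib

def shard_for(routing_value: str, num_primary_shards: int = 5) -> int:
    """Pick a shard as hash(routing) mod number_of_primary_shards.

    zlib.crc32 is only a deterministic stand-in for the hash that
    Elasticsearch uses internally (an assumption for illustration).
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# The three sample documents from the slide
docs = [
    {"id": "123", "name": "john"},
    {"id": "124", "name": "patricia"},
    {"id": "125", "name": "scott"},
]

for doc in docs:
    print(doc["id"], "-> shard", shard_for(doc["id"]))
```

Because the shard number depends on the modulus, the same ID can land on a different shard once the shard count changes, which is one way to see why the number of primary shards is fixed at index creation.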

TWO-NODE CLUSTER

Node 1: 0 1 R2 3 R4

Node 2: R0 R1 2 R3 4

* Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)

BENEFITS OF SHARDING

• Take advantage of multi-core CPUs (one shard is a single Lucene index, and the shards on a node are searched in parallel inside that node's JVM)
• Horizontal scalability. Dynamic rebalancing
• Fault tolerance and cluster resilience
• NB! The number of primary shards cannot be changed on the fly; changing it requires a full reindex
• Max number of documents per shard: 2,147,483,519 – imposed by Lucene

CUSTOM ROUTING

• Social network. Users, events
• event_id: 17567654, 17567655, 17567656, …; user_id: 10300, 10301, …
• No Elasticsearch ID provided: the ID is auto-generated, so events are distributed evenly across the shards
• Obvious approach: Elasticsearch ID = event_id, so events are again distributed evenly across the shards
• Custom routing value = user_id: events which belong to the same user are stored in a single shard, so a per-user search touches one shard only (no overheads, better performance)
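To make the difference concrete, here is a minimal sketch that places a handful of made-up events on shards twice: once routed by event_id and once routed by user_id. The hash (`zlib.crc32`) is an assumed stand-in for Elasticsearch's own, and the event data and the helper names are hypothetical. In Elasticsearch itself the routing value is supplied through the `routing` request parameter when indexing and searching.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 5  # default number of primary shards mentioned in the slides

def shard_for(routing_value, num_shards=NUM_SHARDS):
    # crc32 is a deterministic stand-in for Elasticsearch's internal hash
    return zlib.crc32(str(routing_value).encode("utf-8")) % num_shards

# Hypothetical events as (event_id, user_id) pairs for two users
events = [(17567654 + i, 10300 + i % 2) for i in range(10)]

def shards_per_user(events, routing_key):
    """Map each user_id to the set of shards holding that user's events."""
    placement = defaultdict(set)
    for event_id, user_id in events:
        routing = event_id if routing_key == "event_id" else user_id
        placement[user_id].add(shard_for(routing))
    return dict(placement)

by_event = shards_per_user(events, "event_id")
by_user = shards_per_user(events, "user_id")
print("routing by event_id:", by_event)
print("routing by user_id: ", by_user)
```

With user_id routing every user maps to exactly one shard, so a search restricted to one user can skip the other shards. The trade-off is that very active users can make some shards larger than others.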

ELASTICSEARCH NODE TYPES

• Data node: node.data = true
• Master node: node.master = true
• Communication client: http.enabled = true
• TCP ports: 9200 (ext), 9300 (int)
• A node can play 2 or 3 roles at the same time
• Multicast discovery (true by default): discovery.zen.ping.multicast.enabled

DEPLOYMENT DIAGRAM

INDEXING A DOCUMENT

https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html

RETRIEVING A DOCUMENT

https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html

• In terms of retrieving documents, primary and replica shards are equivalent: data can be read from either primary or replica shard

DISTRIBUTED SEARCH

• Given a search query, retrieve the 10 most relevant results

https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
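The query phase can be sketched as scatter/gather: every shard computes its own top 10, and the coordinating node merges those short lists into the global top 10. The code below is an in-memory illustration under that assumption. The scores and document IDs are made up, and `shard_top_k` and `coordinate` are hypothetical names, not Elasticsearch APIs.

```python
import heapq

def shard_top_k(shard_docs, k):
    """Query phase on one shard: return its k best (score, doc_id) pairs."""
    return heapq.nlargest(k, shard_docs)

def coordinate(shards, k=10):
    """Coordinating node: merge per-shard results into a global top k."""
    candidates = []
    for shard_docs in shards:
        candidates.extend(shard_top_k(shard_docs, k))
    return heapq.nlargest(k, candidates)

# Hypothetical scores: 3 shards with 20 documents each
shards = [
    [(round(((i * 7 + s * 13) % 50) / 50, 2), f"s{s}-doc{i}") for i in range(20)]
    for s in range(3)
]

top10 = coordinate(shards, k=10)
for score, doc_id in top10:
    print(doc_id, score)
```

Each shard only ships k lightweight (score, ID) pairs, so the coordinating node sorts at most shards × k candidates instead of every matching document. The full documents are fetched afterwards for the winners only.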

CASE STUDIES

4 REAL-LIFE PROJECTS

http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru

GENERAL INFO

• 4 projects, ~2 years
• RDBMS (MySQL, PostgreSQL) as a primary data storage
• Both on-premise Elasticsearch installation (AWS, MS Azure) and SaaS (Bonsai @ Heroku)
• 1 or 2 instances in a cluster
• Data volume: gigabytes; millions of documents
• Back-end: Java, Ruby

#1. SOCIAL INFLUENCER MARKETING PLATFORM

http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg

• Document types: Blog Posts, Bloggers (Influencers)
• Elasticsearch usage:
– search and rank Influencers by category, keywords, tags, location, audience, influence
– search blog posts by keywords etc.
• Amount of data:
– Influencers: hundreds of thousands
– Blog Posts: millions
• ES cluster size: 2 instances
• Technology stack: Java, MySQL, DynamoDB, AWS
• Considered alternatives: Sphinx, Apache Solr

#2. JOB SITE

http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg

• Document types: Job Postings, Jobseekers
• Find relevant jobs
– Simple one-click search
– Advanced search (title, keywords, industry, location/distance, salary, requirements)
• Elasticsearch as a Recommendation Engine. Recommend jobs based on: previously applied/viewed jobs, location, distance, schedule etc.
• 2 types of recommendations:
– Side banner (You also might be interested in…)
– E-mail subscriptions every 2 weeks
• Find appropriate candidates by location, requirements (experience, education, languages), salary expectations

• No fixed document structure (jobs from different providers)
• Full-text search
• Fuzzy search
• Geolocation (distance)
• Weighted search: boosted search clauses
• Dynamic scripting (MVEL until v1.4.0, then Groovy)

SEARCH QUERIES

SOME MORE FACTS

• Amount of data:
– Job postings: ~1M
– Applicants: ~20K
• Cluster size: 2 ‘medium’ EC2 instances
• Technology stack:
– Ruby on Rails
– Elasticsearch, PostgreSQL, Redis
– Heroku + add-ons, AWS (S3, EC2)
– Lots of 3rd party APIs and integrations

IMPLEMENTATION (RUBY)

• A Model is ActiveRecord (Ruby on Rails ORM)
• ActiveRecord can persist itself to the database
• ActiveRecord::Callbacks:
– after_commit on [:create, :update] { index_document }
– after_commit on [:destroy] { delete_document }
– after_create …
– after_save …
– after_destroy …
• Rake tasks to drop/recreate index, reindex documents
• Zero-downtime reindexing using aliases
• Ruby/Rails client: https://github.com/elastic/elasticsearch-rails
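Zero-downtime reindexing with aliases usually means: build a new index next to the old one, then repoint the alias the application searches in a single atomic request. The sketch below only builds the JSON body for `POST /_aliases`. It is written in Python rather than the project's Ruby, the index and alias names are hypothetical, and sending the request is left out.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that repoints `alias` in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names: the application always queries the "jobs" alias
body = alias_swap_actions("jobs", "jobs_v1", "jobs_v2")
print(json.dumps(body, indent=2))
```

Since both actions are applied together, searches against the alias never see a moment without a backing index, and the old index can be dropped once the swap has been verified.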

LESSONS LEARNED

• On-premise deployment (EC2) vs. SaaS (Bonsai @ Heroku)
• Dynamic scripting
• PostgreSQL as a backup search engine sucks

#3. CAR TRADING

http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png

PARSING ADS

Price: $3900

1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG

WAT???

• Fuzzy search (Levenshtein distance algorithm) used to parse ads and classify cars
• Elasticsearch index contains dictionary (Year, Make, Model, Trim)
• Used in conjunction with other approaches: regular expressions, dictionaries of synonyms (VW → Volkswagen, Chevy → Chevrolet), normalization (e.g. LX-370 → LX370)
• Algorithm approach:
– Parse Year (1996)
– Search most relevant Make (VW, volkswagon → Volkswagen)
– Search most relevant Model (Passat) for Make = Volkswagen, Year = 1996
– Search most relevant Trim (TDi 4dr Sedan)
• Parsing quality: 90%

https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
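The Make-matching step can be illustrated without a cluster. The sketch below computes plain Levenshtein distance in Python and picks the closest dictionary entry. It stands in for the Elasticsearch fuzzy query the project used, and the `MAKES` and `SYNONYMS` dictionaries, the `closest_make` helper, and the distance threshold are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

# Hypothetical dictionaries; the project kept these in an Elasticsearch index
SYNONYMS = {"vw": "volkswagen", "chevy": "chevrolet"}
MAKES = ["Volkswagen", "Chevrolet", "Volvo", "Toyota"]

def closest_make(token, max_distance=2):
    """Return the dictionary Make closest to token, or None if too far."""
    word = SYNONYMS.get(token.lower(), token.lower())
    best = min(MAKES, key=lambda make: levenshtein(word, make.lower()))
    return best if levenshtein(word, best.lower()) <= max_distance else None

for token in ["VW", "volkswagon", "PASSAT"]:
    print(token, "->", closest_make(token))
```

The threshold keeps unrelated tokens such as the model name from being forced onto a Make; those tokens are then tried against the Model dictionary in the next step.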

#4. [NDA]

http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg

SOME UNCOVERED INFO

• Check documents against duplicate content
• Shingle analysis (commonly used by copywriters and SEO experts)
– I have a dream that one day this nation will rise up and live…
– Normalization: I have a dream that one day this nation will rise up and live…
– Splitting a text into shingles (n-grams), n = 3..10:
  have dream that
  dream that this
  that this nation
  this nation will
  …
– Replacement: latin ‘c’ / cyrillic ‘c’
• Custom or standard ES implementation of Shingle analysis

https://en.wikipedia.org/wiki/W-shingling
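A rough idea of how shingles expose duplicated content: split both texts into overlapping word n-grams and compare the two sets. The sketch below uses n = 3 and Jaccard similarity. Normalization is reduced to lowercasing (no stop-word removal or look-alike character replacement), and the second sentence is a made-up near-duplicate.

```python
import re

def shingles(text, n=3):
    """Set of overlapping word n-grams (w-shingles) of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Similarity of two shingle sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original = "I have a dream that one day this nation will rise up"
suspect = "I have a dream that one day this country will rise up"

similarity = jaccard(shingles(original), shingles(suspect))
print(sorted(shingles(original))[:3])
print(f"similarity: {similarity:.2f}")
```

A single replaced word changes only the few shingles that contain it, so lightly edited copies still score high, while unrelated texts share almost no shingles.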

QUERY API IN DEPTH

+ DEMO

FILTERS VS. QUERIES

As a general rule, filters should be used:
• for binary yes/no searches
• for queries on exact values

Filters are usually faster than queries because they skip relevance scoring

Filters are usually great candidates for caching

27 Filters available (Elasticsearch 1.7.1)

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html

QUERIES VS. FILTERS

As a general rule, queries should be used instead of filters:
• for full text search
• where the result depends on a relevance score

Common approach: Filter as many records as possible, then query them.

38 Queries available (Elasticsearch v 1.7.1)

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
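The "filter first, then query" approach maps onto the `filtered` query of the 1.x Query DSL covered in these slides: exact-value conditions go into the filter part, and the full-text part is scored only for documents that pass. The sketch below just builds such a request body. The field names and values are hypothetical, and later Elasticsearch versions changed this syntax.

```python
import json

def filtered_search(text, field, filters):
    """Build a 1.x-style search body: term filters narrow, match query scores."""
    return {
        "query": {
            "filtered": {
                # Scored part: full-text relevance
                "query": {"match": {field: text}},
                # Unscored, cacheable part: exact yes/no conditions
                "filter": {
                    "bool": {
                        "must": [{"term": {k: v}} for k, v in filters.items()]
                    }
                },
            }
        }
    }

# Hypothetical job-search example
body = filtered_search("java developer", "title",
                       {"industry": "it", "country": "ua"})
print(json.dumps(body, indent=2))
```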

DEMO #2

http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg

SOME THEORY BEHIND RELEVANCE SCORING

full AND text AND search AND (elasticsearch OR lucene)

• Term Frequency: How often does the term appear in the document?

• Inverse Document Frequency: How often does the term appear in all documents in the collection?

• Field-length norm: How long is the field?

• TF, FLN etc. are calculated and stored at index time

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
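The three factors above have simple closed forms in the classic Lucene similarity described in the scoring-theory guide linked above: tf = √frequency, idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / √numTerms. The sketch below multiplies them into a per-term weight. This is a simplification (the real scoring function also applies query normalization, coordination and boosts), and the sample numbers are made up.

```python
import math

def tf(freq):
    """Term frequency: more occurrences in the field, higher weight."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field-length norm: a match in a short field weighs more."""
    return 1 / math.sqrt(num_terms)

def term_weight(freq, num_docs, doc_freq, num_terms):
    # Simplified combination of the three factors for one term in one field
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)

# Hypothetical collection of 1000 documents, 10-term title field
rare = term_weight(freq=1, num_docs=1000, doc_freq=1, num_terms=10)
common = term_weight(freq=1, num_docs=1000, doc_freq=500, num_terms=10)
print(f"rare term: {rare:.3f}, common term: {common:.3f}")
```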

MORE COOL FEATURES

• Indexing attachments: MS Office, ePub, PDF (Apache Tika)
• Autocomplete suggestion:

• Did-you-mean suggestion:

• Highlight results:

SEARCH IMAGES

https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/

https://github.com/kzwang/elasticsearch-image

http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg

ELASTICSEARCH ECOSYSTEM. ELK STACK

+ DEMO

CLIENTS

http://blog.euranova.eu/wp-content/uploads/2014/04/programming-languages.png

• Java: 1 native client + 1 community supported
• Python: 1 official + 7 community supported
• Ruby: 1 official + 7 community supported
• JavaScript: 1 official + 4
• PHP: 1 official + 4
• C#, .NET: 1 official + 2
• Scala: 4
• Groovy (1), Haskell (1), Perl (1), Clojure (1), Go (3), R (2), Erlang (3), OCaml (2), Smalltalk (1), ColdFusion (1), C++ (1)
• Command Line (2)

https://www.elastic.co/guide/en/elasticsearch/client/community/current/clients.html

INTEGRATIONS

• Django
• Ruby on Rails
• Spring, Spring Data
• Node.js
• Symfony, Drupal, WordPress
• Grails
• Play! Framework

https://www.elastic.co/guide/en/elasticsearch/client/community/current/integrations.html

FRONT ENDS

http://php.archive.razorflow.com/assets/img/header_v1.png

ELASTICSEARCH-HEAD

http://mobz.github.io/elasticsearch-head/

ESCLIENT

https://github.com/rdpatil4/ESClient

AVAILABLE FRONT ENDS

https://www.elastic.co/guide/en/elasticsearch/client/community/current/front-ends.html

• elasticsearch-head: A web front end for an Elasticsearch cluster.
• browser: Web front-end over elasticsearch data.
• Inquisitor: Front-end to help debug/diagnose queries and analyzers
• Hammer: Web front-end for elasticsearch
• Calaca: Simple search client for Elasticsearch
• ESClient: Simple search, update, delete client for Elasticsearch

HEALTH AND PERFORMANCE

http://www.transcend-marketing.co.uk/wp-content/uploads/2014/09/health-check2.png

ELASTICSEARCH-HEAD

https://github.com/mobz/elasticsearch-head

BIGDESK

https://github.com/lukas-vlcek/bigdesk

WHATSON

https://github.com/xyu/elasticsearch-whatson

ELASTICOCEAN

https://itunes.apple.com/us/app/elasticocean/id955278030

HEALTH AND PERFORMANCE

https://www.elastic.co/guide/en/elasticsearch/client/community/current/health.html

• bigdesk: Live charts and statistics for elasticsearch cluster.
• Kopf: Live cluster health and shard allocation monitoring with administration toolset.
• paramedic: Live charts with cluster stats and indices/shards information.
• ElasticsearchHQ: Free cluster health monitoring tool
• SPM for Elasticsearch: Performance monitoring with live charts showing cluster and node stats, integrated alerts, email reports, etc.
• check-es: Nagios/Shinken plugins for checking on elasticsearch
• check_elasticsearch: An Elasticsearch availability and performance monitoring plugin for Nagios.
• opsview-elasticsearch: Opsview plugin written in Perl for monitoring Elasticsearch
• SegmentSpy: Plugin to watch Lucene segment merges across your cluster
• es2graphite: Send cluster and indices stats and status to Graphite for monitoring and graphing.
• Scout: Provides plugins for monitoring Elasticsearch nodes, clusters, and indices.
• ElasticOcean: Elasticsearch & DigitalOcean iOS real-time monitoring tool to keep an eye on DigitalOcean Droplets, Elasticsearch instances, or both, on the go.

10 ES METRICS TO WATCH

http://radar.oreilly.com/2015/04/10-elasticsearch-metrics-to-watch.html

1. Cluster health: nodes and shards
2. Node performance: CPU
3. Node performance: memory usage
4. Node performance: disk I/O
5. Java: heap usage and garbage collection
6. Java: JVM pool size
7. Search performance: request latency and request rate
8. Search performance: filter cache
9. Search performance: field data cache
10. Indexing performance: refresh times and merge times

RIVERS (DEPRECATED IN 1.5.0)

http://acuate.typepad.com/.a/6a0120a5e84a91970c01539381efff970b-pi

• JDBC River Plugin, CSV River Plugin
• MongoDB, CouchDB, Solr, Redis, Neo4j, DynamoDB, RethinkDB, Hazelcast, …
• JMS, RabbitMQ, ActiveMQ, Amazon SQS, Kafka, …
• Twitter, Wikipedia, Git, GitHub, Subversion, RSS, …
• FileSystem, Dropbox, Google Drive, Amazon S3, …
• IMAP/POP3, Web, LDAP

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#river

OTHER PLUGINS

https://d2wucpkmh57zie.cloudfront.net/wp-content/uploads/2015/04/plugins-together.jpg

• Internationalization, normalization, analysis, language support (Chinese, Japanese, Khmer, Thai etc.), transliteration etc.
• Discovery plugins: Amazon AWS, MS Azure, Google GCE, ZooKeeper
• Transport plugins: allow using the Elasticsearch REST API over Servlet, ZeroMQ, Jetty, Redis, Memcached
• Scripting in Elasticsearch queries: Groovy, JavaScript, Python, Clojure, SQL (!)
• Front-ends (CRUD operations) & data visualization
• Snapshot/Restore Repository: HDFS, AWS S3, GridFS
• Misc: Attachments handling (uses Apache Tika), image support, tracking changes, Mock Solr, NewRelic integration, …

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html

ELASTICSEARCH PRODUCT PORTFOLIO

http://blog.archisnapper.com/wp-content/uploads/architecture-portfolio.jpg

FOUND ($)

• Elasticsearch as a service
• Starts from $45/mo (1 GB RAM, 8 GB SSD, 1 data center)
• No deployment and maintenance overhead

https://www.elastic.co/products/found

SHIELD ($)

• Authentication
• Authorization: RBAC
• Encrypted communication, IP filtering
• Audit logging
• Other approaches:
– Jetty instead of embedded server
– Nginx as a front-end

https://www.elastic.co/products/shield

MARVEL ($)

• Elasticsearch cluster health check, monitoring, performance
• Real-time and historical analysis
• Customizable dashboards

https://www.elastic.co/products/marvel

WATCHER

• Alerts about anomalies in data
• Proactive monitoring of ES cluster (in conjunction with Marvel)
• Many notification channels: e-mail, SMS, webhooks
• Retrospective analysis
• High availability

https://www.elastic.co/products/watcher

ELK

https://pbs.twimg.com/media/CCAkRqVXIAA9cDE.png

LOGSTASH + ELASTIC + KIBANA

LOGSTASH ADVANCED

LOGSTASH

• Variety of inputs and outputs (165 plugins)
• 120 predefined patterns + custom log formats
• Flexible DSL to parse/normalize/enrich logs
• Implemented in Ruby, running on JRuby

https://www.elastic.co/products/logstash

SOME LOGSTASH INPUTS

https://www.elastic.co/guide/en/logstash/current/input-plugins.html

• file
• stdin
• syslog
• eventlog
• jdbc
• varnishlog
• websocket
• log4j
• jmx
• s3
• sqs
• rss
• redis
• rabbitmq
• zeromq
• kafka
• twitter
• elasticsearch
• github
• lumberjack

SOME LOGSTASH OUTPUTS

https://www.elastic.co/guide/en/logstash/current/output-plugins.html

• file
• stdout
• csv
• exec
• elasticsearch
• email
• nagios
• syslog
• redis
• loggly
• jira
• hipchat
• irc
• graphite
• http
• s3
• sqs
• sns
• rabbitmq
• zeromq

KIBANA

• Variety of charts: bar charts, line and scatter plots, histograms, pie charts, maps
• Flexible and customizable UI, responsive design
• Slice and dice data to get necessary details
• Seamless integration with Elasticsearch
• Simple data export

https://www.elastic.co/products/kibana

DEMO #3

http://25.media.tumblr.com/tumblr_mbduvkuspZ1qe6vsbo1_400.jpg

ELASTICSEARCH DRAWBACKS

• No transaction support. Elasticsearch is not a database.
• No joins, constraints and other RDBMS features
• Durability and consistency issues, data loss:
– https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
– https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

PERFORMANCE?

http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/

http://solr-vs-elasticsearch.com/

• Apache Solr can be faster than ES in search-only scenarios, while Elasticsearch usually outperforms Solr when doing writes and reads concurrently
• Sphinx is faster at indexing (up to 15 MB/s per core)
• Performance issues can usually be fixed by horizontal scaling

SUMMARY

• ES is not a silver bullet, but it is a really powerful tool
• Elasticsearch is not an RDBMS and is not supposed to act as a database. Choose your tools properly. Leverage the synergy of DB + ES
• Elasticsearch is dead simple at the start but can get complicated as you go
• Kick off easily, then hire a good DevOps engineer for best results
• The ecosystem around Elasticsearch is just amazing
• Give it a try – it can bring a lot of value to your product and your CV ;)

http://www.aperfectworld.org/clipart/gestures/rockhard11.png

QUESTIONS?

http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png

THANK YOU!

http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg

USEFUL LINKS

• Elasticsearch: https://www.elastic.co/products/elasticsearch
• Logstash: https://www.elastic.co/products/logstash
• Kibana: https://www.elastic.co/products/kibana
• Scripts for the demos: https://github.com/opanchenko/morning-at-lohika-ELK