Elasticsearch for Data Engineers

ElasticsearchCrash Course for Data Engineers

Duy Do (@duydo)

About● A Father, A Husband, A Software Engineer● Founder of Vietnamese Elasticsearch Community● Author of Vietnamese Elasticsearch Analysis Plugin● Technical Consultant at Sentifi AG● Co-Founder at Krom● Follow me @duydo

https://www.facebook.com/groups/elasticsearchvn/

https://github.com/duydo/elasticsearch-analysis-vietnamese

https://twitter.com/duydo

https://twitter.com/duydo

Elasticsearch is Everywhere

What is Elasticsearch?

Elasticsearch is a distributed search and analytics engine, designed for horizontal scalability with easy management.

Basic Terms

● Cluster is a collection of nodes.● Node is a single server, part of a

cluster.● Index is a collection of shards ~

database.● Shard is a collection of

documents.● Type is a category/partition of an

index ~ table in database.● Document is a Json object ~

record in database.

Distributed & Scalable

Shards & Replicas

One node, One shard

Node 1

employees

P0

PUT /employees{

“settings”: {“number_of_shards”: 1,“number_of_replicas”: 0

}}

Two nodes, One shard

Node 1

employees

P0

PUT /employees{


}}

Node 2

One node, Two shards

Node 1

employees

P0

PUT /employees{


}}

P1

Two Nodes, Two Shards

Node 1

employees

P0

PUT /employees{


}}

Node 2

employees

P1P1

Two nodes, Two shards, One replica of each shard

Node 1

employees

P0

PUT /employees{


}}

R1

Node 2

employees

P1 R0

Index Management

Create IndexPUT /employees{

“settings”: {...},“mappings”: {

“type_one”: {...},“type_two”: {...}

},“aliases”: {

“alias_one”: {...},“alias_two”: {...}

}}

Index SettingsPUT /employees/_settings{

“number_of_replicas”: 1}

Index MappingsPUT /employees/_mappings{

“employee”: {“properties”: {

“name”: {“type”: “string”},“gender”: {“type”: “string”, “index”: “not_analyzed”},“email”: {“type”: “string”, “index”: “not_analyzed”},“dob”: {“type”: “date”},“country”: {“type”: “string”, “index”: “not_analyzed”},“salary”: {“type”: “double”},

}}

}

Delete IndexDELETE /employees

Put Data In, Get Data Out

Index a Document with ID

PUT /employees/employee/1{

“name”: “Duy Do”,“email”: “[email protected]”,“dob”: “1984-06-20”,“country”: “VN”“gender”: “male”,“salary”: 100.0

}

Index a Document without ID

POST /employees/employee/{

“name”: “Duy Do”,“email”: “[email protected]”,“dob”: “1984-06-20”,“country”: “VN”“gender”: “male”,“salary”: 100.0

}

Retrieve a Document

GET /employees/employee/1

Update a Document

POST /employees/employee/1/_update{

“doc”:{“salary”: 500.0

}}

Delete a Document

DELETE /employees/employee/1

Searching

Structured SearchDate, Times, Numbers, Text

● Finding Exact Values● Finding Multiple Exact Values● Ranges● Working with Null Values● Combining Filters

Finding Exact ValuesGET /employees/employee/_search{

“query”: {“term”: {

“country”: “VN”}

}}

SQL: SELECT * FROM employee WHERE country = ‘VN’;

Finding Multiple Exact ValuesGET /employees/employee/_search{

“query”: {“terms”: {

“country”: [“VN”, “US”]}

}}

SQL: SELECT * FROM employee WHERE country = ‘VN’ OR country = ‘US’;

RangesGET /employees/employee/_search{

“query”: {“range”: {

“dob”: {“gt”: “1984-01-01”, “lt”: “2000-01-01”} }

}}

SQL: SELECT * FROM employee WHERE dob BETWEENS ‘1984-01-01’ AND ‘2000-01-01’;

Working with Null valuesGET /employees/employee/_search{

“query”: {“filtered”: {

“filter”: {“exists”: {“field”: “email”}

} }

}}

SELECT * FROM employee WHERE email IS NOT NULL;

Working with Null ValuesGET /employees/employee/_search{


“filter”: {“missing”: {“field”: “email”}

} }

}}

SELECT * FROM employee WHERE email IS NULL;

Combining FiltersGET /employees/employee/_search{


“filter”: {“bool”: {

“must”:[{“exists”: {“field”: “email”}}],“must_not”:[{“term”: {“gender”: “female”}}],“should”:[{“terms”: {“country”: [“VN”, “US”]}}]

}}

}}

}

Combining FiltersSQL:

SELECT * FROM employeeWHERE email IS NOT NULL

AND gender != ‘female’AND (country = ‘VN’ OR country = ‘US’);

More Queries

● Prefix● Wildcard● Regex● Fuzzy● Type● Ids● ...

Full-Text SearchRelevance, Analysis

● Match Query● Combining Queries● Boosting Query Clauses

Match Query - Single WordGET /employees/employee/_search{

“query”: {“match”: {

“name”: {“query”: “Duy”

} }

}}

Match Query - Multi WordsGET /employees/employee/_search{

“query”: {“match”: {

“name”: {“query”: “Duy Do”,“operator”: “and”

} }

}}

Combining QueriesGET /employees/employee/_search{

“query”: {“bool”: {

“must”:[{“match”: {“name”: “Do”}}],“must_not”:[{“term”: {“gender”: “female”}}],“should”:[{“terms”: {“country”: [“VN”, “US”]}}]

}

}}

Boosting Query ClausesGET /employees/employee/_search{

“query”: {“bool”: {

“must”:[{“term”: {“gender”: “female”}}], # default boost 1 “should”:[

{“term”: {“country”: {“query”:“VN”, “boost”:3}}} # the most important

{“term”: {“country”: {“query”:“US”, “boost”:2}}} # important than #1 but not as important as #2],

}

}}

More Queries● Multi Match● Common Terms● Query Strings● ...

Analytics

AggregationsAnalyze & Summarize

● How many needles in the haystack?

● What is the average length of the needles?

● What is the median length of the needles, broken down by manufacturer?

● How many needles are added to the haystacks each month?

● What are the most popular needle manufacturers?

● ...

Buckets & Metrics

SELECT COUNT(country) # a metricFROM employeeGROUP BY country # a bucket

GET /employees/employee/_search{

“aggs”: {“by_country”: {

“terms”: {“field”: “country”}}

}}

Bucket is a collection of documents that meet certain criteria.

Metric is simple mathematical operations such as: min, max, mean, sum and avg.

CombinationBuckets & Metrics

● Partitions employees by country (bucket)

● Then partitions each country bucket by gender (bucket)

● Finally calculate the average salary for each gender bucket (metric)

Combination QueryGET /employees/employee/_search{

“aggs”: {“by_country”: { “terms”: {“field”: “country”},

“aggs”: {“by_gender”: { “terms”: {“field”: “gender”},

“aggs”: {“avg_salary”: {“avg”: “field”: “salary”}

}}

}}

}}

More Aggregations● Histogram● Date Histogram● Date Range● Filter/Filters● Missing● Geo Distance● Nested● ...

Best Practices

Indexing

● Use bulk indexing APIs.● Tune your bulk size 5-10MB.● Partitions your time series data

by time period (monthly, weekly, daily).

● Use aliases for your indices.● Turn off refresh, replicas while

indexing. Turn on once it’s done● Multiple shards for parallel

indexing.● Multiple replicas for parallel

reading.

Mapping● Disable _all field● Keep _source field, do not store

any field. ● Use not_analyzed if possible

Query

● Use filters instead of queries if possible.

● Consider orders and scope of your filters.

● Do not use string query.● Do not load too many results

with single query, use scroll API instead.

Kibana for Discovery, Visualization

Sense for Query

Marvel for Monitoring

Elasticsearch for Data Engineers

Technology

Transcript of Elasticsearch for Data Engineers