Elasticsearch for Data Engineers
-
Upload
duy-do -
Category
Technology
-
view
129 -
download
1
Transcript of Elasticsearch for Data Engineers
About● A Father, A Husband, A Software Engineer● Founder of Vietnamese Elasticsearch Community● Author of Vietnamese Elasticsearch Analysis Plugin● Technical Consultant at Sentifi AG● Co-Founder at Krom● Follow me @duydo
Elasticsearch is a distributed search and analytics engine, designed for horizontal scalability with easy management.
Basic Terms
● Cluster is a collection of nodes.● Node is a single server, part of a
cluster.● Index is a collection of shards ~
database.● Shard is a collection of
documents.● Type is a category/partition of an
index ~ table in database.● Document is a Json object ~
record in database.
One node, One shard
Node 1
employees
P0
PUT /employees{
“settings”: {“number_of_shards”: 1,“number_of_replicas”: 0
}}
Two nodes, One shard
Node 1
employees
P0
PUT /employees{
“settings”: {“number_of_shards”: 1,“number_of_replicas”: 0
}}
Node 2
One node, Two shards
Node 1
employees
P0
PUT /employees{
“settings”: {“number_of_shards”: 2,“number_of_replicas”: 0
}}
P1
Two Nodes, Two Shards
Node 1
employees
P0
PUT /employees{
“settings”: {“number_of_shards”: 2,“number_of_replicas”: 0
}}
Node 2
employees
P1P1
Two nodes, Two shards, One replica of each shard
Node 1
employees
P0
PUT /employees{
“settings”: {“number_of_shards”: 2,“number_of_replicas”: 1
}}
R1
Node 2
employees
P1 R0
Create IndexPUT /employees{
“settings”: {...},“mappings”: {
“type_one”: {...},“type_two”: {...}
},“aliases”: {
“alias_one”: {...},“alias_two”: {...}
}}
Index MappingsPUT /employees/_mappings{
“employee”: {“properties”: {
“name”: {“type”: “string”},“gender”: {“type”: “string”, “index”: “not_analyzed”},“email”: {“type”: “string”, “index”: “not_analyzed”},“dob”: {“type”: “date”},“country”: {“type”: “string”, “index”: “not_analyzed”},“salary”: {“type”: “double”},
}}
}
Index a Document with ID
PUT /employees/employee/1{
“name”: “Duy Do”,“email”: “[email protected]”,“dob”: “1984-06-20”,“country”: “VN”“gender”: “male”,“salary”: 100.0
}
Index a Document without ID
POST /employees/employee/{
“name”: “Duy Do”,“email”: “[email protected]”,“dob”: “1984-06-20”,“country”: “VN”“gender”: “male”,“salary”: 100.0
}
Structured SearchDate, Times, Numbers, Text
● Finding Exact Values● Finding Multiple Exact Values● Ranges● Working with Null Values● Combining Filters
Finding Exact ValuesGET /employees/employee/_search{
“query”: {“term”: {
“country”: “VN”}
}}
SQL: SELECT * FROM employee WHERE country = ‘VN’;
Finding Multiple Exact ValuesGET /employees/employee/_search{
“query”: {“terms”: {
“country”: [“VN”, “US”]}
}}
SQL: SELECT * FROM employee WHERE country = ‘VN’ OR country = ‘US’;
RangesGET /employees/employee/_search{
“query”: {“range”: {
“dob”: {“gt”: “1984-01-01”, “lt”: “2000-01-01”} }
}}
SQL: SELECT * FROM employee WHERE dob BETWEENS ‘1984-01-01’ AND ‘2000-01-01’;
Working with Null valuesGET /employees/employee/_search{
“query”: {“filtered”: {
“filter”: {“exists”: {“field”: “email”}
} }
}}
SELECT * FROM employee WHERE email IS NOT NULL;
Working with Null ValuesGET /employees/employee/_search{
“query”: {“filtered”: {
“filter”: {“missing”: {“field”: “email”}
} }
}}
SELECT * FROM employee WHERE email IS NULL;
Combining FiltersGET /employees/employee/_search{
“query”: {“filtered”: {
“filter”: {“bool”: {
“must”:[{“exists”: {“field”: “email”}}],“must_not”:[{“term”: {“gender”: “female”}}],“should”:[{“terms”: {“country”: [“VN”, “US”]}}]
}}
}}
}
Combining FiltersSQL:
SELECT * FROM employeeWHERE email IS NOT NULL
AND gender != ‘female’AND (country = ‘VN’ OR country = ‘US’);
Match Query - Single WordGET /employees/employee/_search{
“query”: {“match”: {
“name”: {“query”: “Duy”
} }
}}
Match Query - Multi WordsGET /employees/employee/_search{
“query”: {“match”: {
“name”: {“query”: “Duy Do”,“operator”: “and”
} }
}}
Combining QueriesGET /employees/employee/_search{
“query”: {“bool”: {
“must”:[{“match”: {“name”: “Do”}}],“must_not”:[{“term”: {“gender”: “female”}}],“should”:[{“terms”: {“country”: [“VN”, “US”]}}]
}
}}
Boosting Query ClausesGET /employees/employee/_search{
“query”: {“bool”: {
“must”:[{“term”: {“gender”: “female”}}], # default boost 1 “should”:[
{“term”: {“country”: {“query”:“VN”, “boost”:3}}} # the most important
{“term”: {“country”: {“query”:“US”, “boost”:2}}} # important than #1 but not as important as #2],
}
}}
AggregationsAnalyze & Summarize
● How many needles in the haystack?
● What is the average length of the needles?
● What is the median length of the needles, broken down by manufacturer?
● How many needles are added to the haystacks each month?
● What are the most popular needle manufacturers?
● ...
Buckets & Metrics
SELECT COUNT(country) # a metricFROM employeeGROUP BY country # a bucket
GET /employees/employee/_search{
“aggs”: {“by_country”: {
“terms”: {“field”: “country”}}
}}
CombinationBuckets & Metrics
● Partitions employees by country (bucket)
● Then partitions each country bucket by gender (bucket)
● Finally calculate the average salary for each gender bucket (metric)
Combination QueryGET /employees/employee/_search{
“aggs”: {“by_country”: { “terms”: {“field”: “country”},
“aggs”: {“by_gender”: { “terms”: {“field”: “gender”},
“aggs”: {“avg_salary”: {“avg”: “field”: “salary”}
}}
}}
}}
More Aggregations● Histogram● Date Histogram● Date Range● Filter/Filters● Missing● Geo Distance● Nested● ...
Indexing
● Use bulk indexing APIs.● Tune your bulk size 5-10MB.● Partitions your time series data
by time period (monthly, weekly, daily).
● Use aliases for your indices.● Turn off refresh, replicas while
indexing. Turn on once it’s done● Multiple shards for parallel
indexing.● Multiple replicas for parallel
reading.
Mapping● Disable _all field● Keep _source field, do not store
any field. ● Use not_analyzed if possible
Query
● Use filters instead of queries if possible.
● Consider orders and scope of your filters.
● Do not use string query.● Do not load too many results
with single query, use scroll API instead.