Text search with Elasticsearch on AWS

Łukasz Przybyłek, Tidio

What’s Elasticsearch?

● Search & analytics engine
● Fast
● Scalable
● Distributed
● Full-text search capabilities
● (Near) real-time indexing
● Document-oriented
● Schema-free

When do I need it?

● You need a faster search mechanism
● You need to search a large amount of data
● You need powerful full-text queries

How does it work?

Input document → Analyzer → Terms → Index

Inverted Index

Id  Content
1   The quick brown fox jumped over the lazy dog
2   Quick brown foxes leap over lazy dogs in summer

After analysis:

Term    Doc_1  Doc_2
brown   X      X
dog     X      X
fox     X      X
jump    X      X
lazy    X      X
over    X      X
quick   X      X
summer         X
the     X      X
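The idea behind the table can be sketched in a few lines of Python. This is only an illustration, not how Lucene stores its index: the `analyze`, `build_inverted_index` and `search` names are made up for this sketch, and the toy analyzer only lowercases and splits on non-letters. A real analyzer also stems and can apply synonyms, which is how "jumped" and "leap" can end up under the single term "jump" in the table.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on non-letters (no stemming, no synonyms)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map every term to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [set(index.get(term, [])) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
inverted_index = build_inverted_index(docs)
```

A lookup such as `search(inverted_index, "lazy brown")` only touches the posting lists of "lazy" and "brown" instead of scanning every document, which is what makes the structure fast for text search.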

Logical data structures

● Elasticsearch (cluster) contains indexes
● Index contains types
● Type contains documents
● Mappings are assigned to types
● Index aliases (optional) can point to indexes and modify queries (e.g. add a filter)
● There are no classic SQL-like relationships (!)

Logical data structures

Diagram: Cluster → Index → Type → Document (a cluster holds several indexes, an index holds several types)

Physical data structures

● Cluster contains nodes
● Index is stored in one or more shards (a single shard is a Lucene index instance)
● Single node contains shards of different indexes

How to deal with lack of joins?

● Denormalization
● Client-side joins
● Parent-child relationships

Elasticsearch in Tidio

● Tidio Chat: a business communication tool where business owners (operators) communicate with their customers (visitors)
● www.tidiochat.com
● ES is used instead of MariaDB to:
    ○ Fetch the last conversations in a project
    ○ Search by message content and visitor email in a project’s conversation history

Relations in Tidio Chat

Message: visitor_id, operator_id, content
Project: public_key
Visitor: project_public_key
Operator: project_public_key

Message document schema

● Project’s public key added to document
● Search by email performed in MariaDB
● Time mapped as date explicitly
● Client-side join with Visitor

Message: visitor_id, operator_id, project_public_key, content
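A client-side join with Visitor could look roughly like the sketch below. Everything here is hypothetical: `join_visitors` and `fetch_visitors` are illustrative names, the message hits are plain dictionaries rather than real Elasticsearch responses, and the visitor lookup stands in for whatever query the application runs against its relational store.

```python
def join_visitors(message_hits, fetch_visitors):
    """Attach the visitor record to every message hit.

    message_hits:   list of message documents (dicts with a "visitor_id" key)
    fetch_visitors: callable that takes a set of visitor ids and returns
                    a dict {visitor_id: visitor_record}
    """
    # Collect the ids first so the second store is queried once, not once per hit.
    visitor_ids = {hit["visitor_id"] for hit in message_hits}
    visitors = fetch_visitors(visitor_ids)
    return [dict(hit, visitor=visitors.get(hit["visitor_id"])) for hit in message_hits]


# Hypothetical stand-in for the visitor table.
VISITORS = {
    "v1": {"email": "anna@example.com"},
    "v2": {"email": "bob@example.com"},
}


def fetch_visitors_from_memory(visitor_ids):
    """Fake lookup that returns only the requested visitors."""
    return {vid: VISITORS[vid] for vid in visitor_ids if vid in VISITORS}


hits = [
    {"visitor_id": "v1", "operator_id": "o1", "project_public_key": "abc", "content": "Hello"},
    {"visitor_id": "v2", "operator_id": "o1", "project_public_key": "abc", "content": "Hi there"},
    {"visitor_id": "v1", "operator_id": "o1", "project_public_key": "abc", "content": "Thanks"},
]
joined = join_visitors(hits, fetch_visitors_from_memory)
```

Batching the ids keeps the cost at one extra lookup per result page, which is the usual price of joining on the client instead of in the search engine.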

Design decisions

● Questions
    a. What indexes should be created?
    b. What types should be created?
    c. How should shards be distributed among nodes and indexes?
● Things to consider
    a. Searching a smaller dataset usually means faster search results
    b. An index with a small number of shards does not scale efficiently to new nodes
    c. Types are used mainly to assign mappings; they are not separate “search entities”, so there is no direct performance boost from using many types
    d. An index doesn’t need to represent a domain entity

Ideas?

Index for each project, one type inside index

● 250k projects = 250k indexes
● Adding a new index is slow
● Large overhead associated with shard and index counts

Ideas?

One index and separate type for each project

● Large index
● Nodes scale up only to the number of shards in a particular index (default 5, no automatic index splitting)
● Every query would go through all shards and filter by project_public_key (a large amount of data to search in)

Ideas?

Group projects and create an index for each group

● Limited amount of data to search in
● Reasonable number of shards, which can still scale up to many nodes
● Possibility to add an alias for each project and search as if it were a separate index
● Projects may be grouped by language and use language-specific analyzers
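As a rough sketch of the alias idea, the snippet below builds a request body for the `_aliases` endpoint that adds a filtered alias for one project. The grouping rule (a CRC32 hash of the public key), the number of groups, and the `messages_group_N` / `project_KEY` naming are all assumptions made for illustration; the slides do not say how projects are assigned to groups beyond the option of grouping by language.

```python
import zlib

GROUP_COUNT = 16  # hypothetical number of group indexes


def group_index(project_public_key, group_count=GROUP_COUNT):
    """Pick the group index for a project (assumed rule: stable hash of the key)."""
    bucket = zlib.crc32(project_public_key.encode("utf-8")) % group_count
    return f"messages_group_{bucket}"


def alias_request_body(project_public_key):
    """Body for POST /_aliases: one filtered alias that behaves like a per-project index."""
    return {
        "actions": [
            {
                "add": {
                    "index": group_index(project_public_key),
                    "alias": f"project_{project_public_key}",
                    # The term filter assumes project_public_key is indexed as an exact value.
                    "filter": {"term": {"project_public_key": project_public_key}},
                }
            }
        ]
    }


body = alias_request_body("abc123")
```

Searching through the alias then only touches the shards of one group index, and the filter hides the other projects stored in the same index.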

Amazon Web Services Elasticsearch cluster

● Quick and easy to install
● Extremely limited configuration options
● Limited query options (scripts disabled)
● Can be used with standard AWS authentication
● There is no AWS SDK that supports ES, so users have to write code that signs requests manually
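Signing a request manually means implementing AWS Signature Version 4. The Python sketch below shows the general shape of that process using only the standard library. It is a simplified illustration under several assumptions: the `sign_request` helper is a made-up name, the path is expected to be already URI-encoded, query strings and temporary session tokens are not handled, and only the `host` and `x-amz-date` headers are signed. The service name `es` is the one used by Amazon Elasticsearch Service.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def _hmac_sha256(key, message):
    """HMAC-SHA256 of a text message with a bytes key, returned as raw bytes."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sign_request(method, host, path, body, access_key, secret_key, region,
                 service="es", now=None):
    """Return the headers that carry a Signature Version 4 signature."""
    now = now or datetime.now(timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # Step 1: canonical request (empty query string in this sketch).
    payload_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join(
        [method, path, "", canonical_headers, signed_headers, payload_hash]
    )

    # Step 2: string to sign.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Step 3: derive the signing key and compute the signature.
    key = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac_sha256(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

    return {
        "Host": host,
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }
```

The returned headers are added to the outgoing HTTP request. With the PHP client mentioned in these slides, the equivalent logic would live in a custom request handler.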

PHP clients for ES

● elasticsearch/elasticsearch
    ○ https://github.com/elastic/elasticsearch-php
    ○ Low-level ES client
    ○ One-to-one mapping with the REST API
    ○ Pluggable architecture (can use a custom request handler and send AWS-signed requests)
    ○ Handles the things you don’t want to know about, e.g. discovery of cluster nodes, load balancing, Keep-Alive connections
    ○ Accepts queries in JSON
● ruflin/elastica
    ○ https://github.com/ruflin/Elastica
    ○ High-level client
    ○ Classes representing indices/queries/terms, so you do not have to write JSON

Elasticsearch limitations

● Less capable than SQL
● There is no paging support for aggregations

AWS Elasticsearch limitations

● threadpool.bulk.queue_size=50
● No script support
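With a small bulk queue, a cluster can start rejecting bulk requests when too many are pending at once. One possible way to reduce that pressure, sketched below, is to send fewer and larger bulk requests one after another instead of many small ones in parallel. The `bulk_bodies` helper, the batch size and the index and type names are hypothetical; the body format (an action line followed by the document, newline-delimited) is the one the Bulk API expects.

```python
import json


def bulk_bodies(docs, index, doc_type, batch_size=500):
    """Yield newline-delimited Bulk API bodies with at most batch_size documents each.

    docs is an iterable of (doc_id, document) pairs.
    """
    lines = []
    for count, (doc_id, doc) in enumerate(docs, start=1):
        # Action line, then the document source on the following line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(doc))
        if count % batch_size == 0:
            yield "\n".join(lines) + "\n"  # the body must end with a newline
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"


# Hypothetical usage: five messages split into batches of two.
messages = [(i, {"content": f"message {i}"}) for i in range(5)]
bodies = list(bulk_bodies(messages, "messages_group_0", "message", batch_size=2))
```

Each yielded body would be sent as one `_bulk` request. Waiting for a response before sending the next body keeps the number of queued requests per client low.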

Indexing performance

● Check your mappings!
● Set fields that do not need full-text search as not analyzed
● Disable the _all field
● Tune your analyzer and index_options (advanced)
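To make these tips concrete, here is what a mapping for the Message type might look like, written as a Python dictionary that could be serialized to JSON and sent to the put-mapping endpoint. It is only a sketch: the type name `message` and the `time` field name are assumptions, and the syntax is the pre-5.0 one implied by these slides (`string` fields with `not_analyzed`, an `_all` field). Later Elasticsearch versions use `keyword`/`text` types and no longer have `_all`.

```python
import json

# Hypothetical mapping for the Message type (pre-5.0 mapping syntax).
message_mapping = {
    "message": {
        # Skip the catch-all field that copies every value into one extra analyzed field.
        "_all": {"enabled": False},
        "properties": {
            # Identifiers are only matched exactly, so they are not analyzed.
            "project_public_key": {"type": "string", "index": "not_analyzed"},
            "visitor_id": {"type": "string", "index": "not_analyzed"},
            "operator_id": {"type": "string", "index": "not_analyzed"},
            # The only field that needs full-text analysis.
            "content": {"type": "string"},
            # Mapped explicitly as a date instead of relying on detection.
            "time": {"type": "date"},
        },
    }
}

mapping_json = json.dumps(message_mapping)
```

Leaving identifier fields unanalyzed avoids tokenizing work at index time and keeps exact-match filters, such as the per-project filter, predictable.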

Search performance

● Unfair comparison
● Over 26 million documents
● Time of PHP requests in seconds

Query               MariaDB (8 CPU)   Elasticsearch (4 CPU)
Search by text      14.16 (σ=0.51)    0.80 (σ=0.20)
Last conversations  4.77 (σ=0.45)     0.87 (σ=0.23)

Any questions?

Thank you!
lucas@tidio.net
lprzybylek@gmail.com
