Text search with Elasticsearch on AWS

Post on 13-Jan-2017


Łukasz Przybyłek, Tidio

What’s Elasticsearch?

● Search & analytics engine
● Fast
● Scalable
● Distributed
● Full-text search capabilities
● (Near) real-time indexing
● Document oriented
● Schema free

When do I need it?

● You need a faster search mechanism
● You need to search a large amount of data
● You need powerful full-text queries

How does it work?

Input document → Analyzer → Terms → Index

Inverted Index

Id  Content
1   The quick brown fox jumped over the lazy dog
2   Quick brown foxes leap over lazy dogs in summer

After analysis (lowercasing, stemming, and treating "leap" as a synonym of "jump"):

Term    Doc_1  Doc_2
brown   X      X
dog     X      X
fox     X      X
in             X
jump    X      X
lazy    X      X
over    X      X
quick   X      X
summer         X
the     X
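
A minimal sketch of how such an inverted index could be built from the two documents. The stemming and synonym tables are toy assumptions chosen only for this example; a real Elasticsearch analyzer is configurable and far more capable.

```python
from collections import defaultdict

# Toy analysis rules (assumptions for illustration, not a real analyzer).
STEMS = {"jumped": "jump", "foxes": "fox", "dogs": "dog"}
SYNONYMS = {"leap": "jump"}


def analyze(text):
    """Lowercase, split on whitespace, then apply stemming and synonyms."""
    terms = []
    for token in text.lower().split():
        token = STEMS.get(token, token)
        terms.append(SYNONYMS.get(token, token))
    return terms


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
```

A search for a term then becomes a dictionary lookup, for example `index["jump"]`, instead of a scan over every document.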

Logical data structures

● Elasticsearch (cluster) contains indexes
● Index contains types
● Type contains documents
● Mappings are assigned to types
● Index aliases (optional) can point to indices and modify queries (e.g. add a filter)
● There are no classic SQL-like relationships (!)

Logical data structures

[Diagram: a cluster contains indexes; each index contains types; each type has a mapping and contains documents.]

Physical data structures

● Cluster contains nodes
● Index is stored in one or more shards (a single shard is a Lucene index instance)
● A single node contains shards of different indexes
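
Elasticsearch picks the shard for a document by hashing its routing value (the document id by default) modulo the number of primary shards. The sketch below illustrates the idea only: `zlib.crc32` is a stand-in hash and `shard_for` is a hypothetical helper, not Elasticsearch's actual implementation.

```python
import zlib


def shard_for(routing_key, number_of_primary_shards):
    """Pick a shard as hash(routing_key) modulo the primary shard count.

    zlib.crc32 is a stand-in; Elasticsearch uses its own hash function.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


# Hypothetical document id and a five-shard index.
shard = shard_for("message-42", 5)
```

Because the shard count is part of the formula, changing it would move existing documents to different shards, which is why the number of primary shards has to be chosen when the index is created.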

How to deal with lack of joins?

● Denormalization
● Client-side joins
● Parent-child relationships
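
A client-side join can be sketched as follows: fetch the hits from the search engine, fetch the related rows from another store, and merge them in application code. The field names and sample data are hypothetical.

```python
def join_messages_with_visitors(messages, visitors_by_id):
    """Attach visitor name and email to each message (client-side join)."""
    joined = []
    for message in messages:
        # A missing visitor yields None fields instead of an error.
        visitor = visitors_by_id.get(message["visitor_id"], {})
        joined.append({
            **message,
            "visitor_name": visitor.get("name"),
            "visitor_email": visitor.get("email"),
        })
    return joined


# Hypothetical search hits and a visitor lookup loaded from a database.
messages = [
    {"id": 1, "visitor_id": 10, "content": "Hello"},
    {"id": 2, "visitor_id": 99, "content": "Anyone there?"},
]
visitors_by_id = {10: {"name": "Ann", "email": "ann@example.com"}}
joined = join_messages_with_visitors(messages, visitors_by_id)
```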

Elasticsearch in Tidio

● Tidio Chat - business communication tool where business owners (operators) communicate with their customers (visitors)

● www.tidiochat.com
● ES used instead of MariaDB to:
  ○ Fetch the last conversations in a project
  ○ Search by message content and visitor email in a project’s conversation history

Relations in Tidio Chat

Message: id, visitor_id, operator_id, content, time
Project: public_key
Visitor: id, project_public_key, name, email
Operator: id, project_public_key

Message document schema

● Project’s public key added to document
● Search by email performed in MariaDB
● Time mapped as date explicitly
● Client-side join with Visitor

Message: id, visitor_id, operator_id, project_public_key, content, time
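
With the project key stored on every message, a text search can be restricted to one project in a single query. Below is a sketch of such a request body using the `bool` query with a `filter` clause (available from Elasticsearch 2.0); the helper name and the default page size are assumptions, and the `term` filter presumes that `project_public_key` is not analyzed.

```python
def build_message_search(project_public_key, text, size=20):
    """Build a search body: full-text match on content within one project."""
    return {
        "query": {
            "bool": {
                # Scored full-text match on the message content.
                "must": [{"match": {"content": text}}],
                # Unscored exact match that narrows results to one project.
                "filter": [{"term": {"project_public_key": project_public_key}}],
            }
        },
        # Newest messages first.
        "sort": [{"time": {"order": "desc"}}],
        "size": size,
    }


# Hypothetical project key and search phrase.
body = build_message_search("abc123", "refund")
```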

Design decisions

● Questions
  a. What indexes should be created?
  b. What types should be created?
  c. How should shards be distributed among nodes and indexes?
● Things to consider
  a. Searching a smaller dataset usually means faster search results
  b. An index with a small number of shards does not scale efficiently to new nodes
  c. Types are used mainly to assign mappings; they are not separate “search entities”, so there is no direct performance boost from using many types
  d. An index doesn’t need to represent a domain entity

Ideas?

Index for each project, one type inside index

● 250k projects = 250k indexes
● Adding a new index is slow
● Large overhead associated with the number of shards and indices

Ideas?

One index and separate type for each project

● Large index
● Nodes scale up only to the number of shards in a particular index (default 5, no automatic index splitting)
● Every query would go through all shards and filter by project_public_key (large amount of data to search in)

Ideas?

Group projects and create an index for each group

● Limited amount of data to search in
● Reasonable number of shards, which can still scale up to many nodes
● Possibility to add an alias for each project and search as if it were a separate index
● Projects may be grouped by language and use specific analyzers
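
A sketch of how the grouping and the per-project aliases could be wired together. Grouping by a hash of the project key, the group count, and the index and alias naming are all assumptions for illustration (the slides also suggest grouping by language). The returned dictionary follows the request body format of the `_aliases` API, including a filtered alias.

```python
import zlib

GROUP_COUNT = 16  # hypothetical number of project groups


def group_index_for(project_public_key):
    """Map a project to the index of its group (stand-in hash grouping)."""
    group = zlib.crc32(project_public_key.encode("utf-8")) % GROUP_COUNT
    return f"messages-group-{group:02d}"


def alias_action_for(project_public_key):
    """Build an _aliases request body that adds a filtered per-project alias."""
    return {
        "actions": [
            {
                "add": {
                    "index": group_index_for(project_public_key),
                    "alias": f"project-{project_public_key}",
                    # The alias filter hides other projects in the same group.
                    "filter": {"term": {"project_public_key": project_public_key}},
                }
            }
        ]
    }


# Hypothetical project key.
action = alias_action_for("abc123")
```

Searching the alias then behaves like searching a separate index, while the data lives in the shared group index.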

Amazon Web Services Elasticsearch cluster

● Quick and easy to install
● Extremely limited configuration options
● Limited query options (scripts disabled)
● Can be used with standard AWS authentication
● There is no AWS SDK that supports ES, so users have to write code that signs requests manually
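
Manual signing means implementing AWS Signature Version 4. The sketch below covers a simplified case: no query string, a path that needs no extra URI encoding, and only the `host` and `x-amz-date` headers signed. The endpoint, region, and credentials are placeholders, and `sign_request` is a hypothetical helper rather than part of any SDK.

```python
import datetime
import hashlib
import hmac


def _hmac(key, message):
    """HMAC-SHA256 of message under key, returned as raw bytes."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sign_request(method, host, path, body, access_key, secret_key,
                 region, now, service="es"):
    """Return headers carrying an AWS Signature Version 4 for one request."""
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    payload_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()

    # Step 1: canonical request (the empty string is the query string).
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join(
        [method, path, "", canonical_headers, signed_headers, payload_hash]
    )

    # Step 2: string to sign.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Step 3: derive the signing key and compute the signature.
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(
        key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    return {
        "Host": host,
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }


# Placeholder endpoint and credentials, fixed timestamp for reproducibility.
headers = sign_request(
    "POST", "search-example.us-east-1.es.amazonaws.com", "/messages/_search",
    '{"query": {"match_all": {}}}', "AKIDEXAMPLE", "not-a-real-secret",
    "us-east-1", datetime.datetime(2017, 1, 13, 12, 0, 0),
)
```

The returned headers would be attached to the outgoing HTTP request; with a pluggable client this logic can live in a custom request handler.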

PHP clients for ES

● elasticsearch/elasticsearch
  ○ https://github.com/elastic/elasticsearch-php
  ○ Low-level ES client
  ○ One-to-one mapping with the REST API
  ○ Pluggable architecture (can use a custom request handler and send AWS-signed requests)
  ○ Handles the things you don’t want to deal with, e.g. discovery of cluster nodes, load balancing, Keep-Alive connections
  ○ Accepts queries in JSON
● ruflin/elastica
  ○ https://github.com/ruflin/Elastica
  ○ High-level client
  ○ Classes representing indices/queries/terms, so you do not have to write JSON by hand

Elasticsearch limitations

● Less capable than SQL
● There is no paging support for aggregations

AWS Elasticsearch limitations

● threadpool.bulk.queue_size=50
● No script support
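
With a small bulk queue, many concurrent bulk requests are likely to be rejected. One possible way to cope, which is an assumption here and not something stated in the slides, is to send fewer, larger bulk requests one after another. The sketch below only builds the newline-delimited bodies that the bulk API expects; the batch size, index name, and type name are hypothetical.

```python
import json


def bulk_bodies(documents, index, doc_type, batch_size=500):
    """Yield bulk request bodies holding at most batch_size documents each."""
    for start in range(0, len(documents), batch_size):
        lines = []
        for doc in documents[start:start + batch_size]:
            # Each document is preceded by an action line naming its target.
            action = {"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}
            lines.append(json.dumps(action))
            lines.append(json.dumps(doc))
        # The bulk API requires a trailing newline after the last line.
        yield "\n".join(lines) + "\n"


# Hypothetical documents split into batches of two.
documents = [{"id": i, "content": f"message {i}"} for i in range(5)]
bodies = list(bulk_bodies(documents, "messages-group-00", "message", batch_size=2))
```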

Indexing performance

● Check your mappings!
● Set fields as not analyzed where full-text search is not needed
● Disable the _all field
● Tune your analyzer and index_options (advanced)
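
A sketch of index settings that apply these tips to the message document, written in the mapping syntax of Elasticsearch 1.x/2.x (later versions use different field types). The type name `message` and the choice of string ids are assumptions.

```python
def message_index_body():
    """Build a create-index body for messages (Elasticsearch 1.x/2.x syntax)."""
    exact = {"type": "string", "index": "not_analyzed"}  # stored as a single term
    return {
        "mappings": {
            "message": {
                # Skip the catch-all field that duplicates every value.
                "_all": {"enabled": False},
                "properties": {
                    "content": {"type": "string"},  # analyzed for full-text search
                    "project_public_key": dict(exact),
                    "visitor_id": dict(exact),  # assumed to be string ids
                    "operator_id": dict(exact),
                    "time": {"type": "date"},  # mapped as date explicitly
                },
            }
        }
    }


body = message_index_body()
```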

Search performance

● Unfair comparison
● Over 26 million documents
● Time of PHP requests in seconds

Query               MariaDB (8 CPU)   Elasticsearch (4 CPU)
Search by text      14.16 (σ=0.51)    0.80 (σ=0.20)
Last conversations  4.77 (σ=0.45)     0.87 (σ=0.23)

Any questions?

Thank you!

lucas@tidio.net
lprzybylek@gmail.com