Elasticsearch in production, New York Meetup at Twitter, October 2014

Elasticsearch easily lets you develop amazing things, and it has gone to great lengths to make Lucene's features readily available in a distributed setting. However, when it comes to running Elasticsearch in production, you still have a fairly complicated system on your hands: a system with high demands on network stability, a huge appetite for memory, and a system that assumes all users are trustworthy. This talk will cover some of the lessons we've learned from securing and herding hundreds of Elasticsearch clusters.

Transcript of "Elasticsearch in production", New York Meetup at Twitter, October 2014

Elasticsearch in production

Konrad Beiske konrad@found.no

@beiske

Who?

Senior software engineer at Found AS. Working with Elasticsearch for 2 years.

Herding hundreds of Elasticsearch clusters

Agenda

• Anti-patterns

• Memory / Resource Usage

• Distributed problems

• Security

• Client concerns

• Changing a cluster

found.no/foundation

Snapshot / Restore

Circuit breakers

Document values

Aggregations

Distributed percolation

Suggesters

Anti-Patterns

Arbitrary Keys

• “Schema Free”

• One field per value

• Ever-growing cluster state

acls:
  1234: READ
  42: WRITE
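To keep the mapping and cluster state bounded, the usual alternative is a fixed set of field names, with the arbitrary values stored as values rather than keys. A minimal sketch (field names are illustrative, not from the talk):

  {
    "acls": [
      { "id": "1234", "permission": "READ" },
      { "id": "42", "permission": "WRITE" }
    ]
  }

Every new ACL entry is just another value in two existing fields instead of a brand-new field in the mapping.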

Heavy Updating

• Update = Delete + Reindex

• Be careful with counters
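For illustration, a hedged sketch of why frequent counter updates are expensive (index, type and field names are placeholders). Even a partial update like this one makes Elasticsearch fetch the document, apply the change, delete the old version and reindex the whole document:

  curl -XPOST 'localhost:9200/pages/page/1/_update' -d '{
    "doc": { "view_count": 42 }
  }'

Doing this once per page view quickly turns into a reindexing storm; batching updates or keeping hot counters elsewhere is often the safer pattern.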

Slow queries

• WHERE foo ILIKE ‘%bar%’

• {"query_string": {"query": "foo:*bar*"}}
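If substring matching is genuinely needed, one common workaround (not covered in the talk, just a hedged sketch) is to pay the cost at index time with an nGram analyzer, so searches become ordinary term lookups instead of leading-wildcard scans:

  {
    "settings": {
      "analysis": {
        "filter": {
          "substring": { "type": "nGram", "min_gram": 3, "max_gram": 8 }
        },
        "analyzer": {
          "substring_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "substring"]
          }
        }
      }
    }
  }

A field indexed with this analyzer can then be queried with a plain match query.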

Arbitrary searches

query:
  filtered:
    filter:
      term:
        user_id: 42
    query: [user's query here]

Time Bomb

Memory

• Field caches

• Filter caches

• Page caches

• Aggregations

• Index building

Page Cache

• Keeping index pages in memory

• Can't have too much of it

• Outgrowing it: gradual slowdown

Heap Space

• Memory used by Elasticsearch process

• Field / Filter caches

• Aggregations
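Not from the talk itself, but a minimal sketch of how the heap is usually bounded (the value is a placeholder): give the process a fixed heap, commonly around half of physical RAM, and leave the rest to the page cache.

  # shell, before starting Elasticsearch 1.x
  export ES_HEAP_SIZE=8g
  bin/elasticsearch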

Time Bomb

OutOfMemoryError

Woah there

I ate all the memories

Your cluster may or may not work any more

OutOfMemory

• Growing too big

• Selecting too big a timespan in Kibana

• Document ingestion peak

Preventing OOMs

• Have enough memory :-)

• Understand your search’s memory profile

• Bulk / circuit breaker settings (see the sketch after this list)

• Monitoring

• Document values
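A hedged sketch of the kind of settings meant above; exact names vary between 1.x releases (the field data breaker was renamed in 1.4), so check the reference for your version:

  # elasticsearch.yml, values are placeholders
  indices.fielddata.cache.size: 40%       # cap the field data cache
  indices.fielddata.breaker.limit: 60%    # abort requests that would load too much field data
  threadpool.bulk.queue_size: 100         # bound queued bulk requests instead of buffering them on the heap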

Marvel ( /_stats )
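The same numbers can be pulled straight from the stats APIs; a hedged example (index name is a placeholder):

  # per-node heap, field data and filter cache usage
  curl 'localhost:9200/_nodes/stats/jvm,indices?pretty'

  # field data usage per field for one index
  curl 'localhost:9200/my_index/_stats/fielddata?fields=*&pretty'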

Document Values

"my_field": { "type": "string", "fielddata": { "format": "doc_values" } }

Sizing

• Test, don’t guess

• Start big, scale down

• Index, search, monitor

Glitch Meltdown

• Tie-breaker can be a cheap master-node

• Applies to data centers / availability zones too

Data-only nodes

Master-only nodes
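A hedged sketch of the node-role and quorum settings this refers to (elasticsearch.yml, 1.x; numbers assume three master-eligible nodes):

  # dedicated master-eligible node, e.g. a cheap tie-breaker
  node.master: true
  node.data: false

  # data-only node
  node.master: false
  node.data: true

  # require a majority of master-eligible nodes before electing a master
  discovery.zen.minimum_master_nodes: 2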

Jepsen

• Kyle Kingsbury’s series on distributed systems

• Distributed systems are hard

• aphyr.com

Security

• “Not my job!” – Elasticsearch

• That’s fine!

Dynamic Scripts


• Scoring

• Aggregations

• Updating

Dynamic Scripts

Runtime.getRuntime().exec(…)

Security


• Disable dynamic scripts (settings sketch after this list)

• Mind index patterns

• Even then, don’t accept arbitrary requests
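A hedged sketch of the corresponding 1.x settings (names differ in later versions):

  # elasticsearch.yml
  script.disable_dynamic: true              # reject scripts sent in request bodies
  action.destructive_requires_name: true    # refuse wildcard and _all index deletions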

Client Concerns

• Connection pools

• Idempotent requests (see the sketch below)

• Have sane syncing/indexing strategies
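A hedged example of what an idempotent indexing request can look like (index, type and ID are placeholders): address documents by their natural ID, so replaying the same request after a timeout leaves the cluster in the same state instead of creating duplicates.

  # safe to retry: same ID, same document, same end state
  curl -XPUT 'localhost:9200/orders/order/order-1234' -d '{
    "status": "paid",
    "amount": 42
  }'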


Cluster changes

• Make new nodes join existing cluster

• No rolling restarts

• Easy rollback if things go bad

v1.0.0 → v1.0.1 (slide sequence showing v1.0.1 nodes joining the existing v1.0.0 cluster)

Cluster changes

• Test first

• Mind recover_*-settings
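A hedged sketch of the recovery settings referred to above (elasticsearch.yml, 1.x; numbers assume a hypothetical five-node cluster):

  gateway.recover_after_nodes: 3    # don't start shard recovery until three nodes have joined
  gateway.expected_nodes: 5         # ...or immediately once all five are present
  gateway.recover_after_time: 5m    # otherwise wait this long first

This keeps the cluster from shuffling shards around while it is still only half assembled after a restart.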

Multi-Cluster Workflows

• Snapshot/Restore (sketch below)

• Operations across clusters

• Swap clusters!

• Works well with good syncing strategy
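A hedged sketch of the snapshot/restore calls involved (repository name, path and snapshot name are placeholders); both clusters need access to the same repository:

  # register a shared filesystem repository
  curl -XPUT 'localhost:9200/_snapshot/my_backup' -d '{
    "type": "fs",
    "settings": { "location": "/mnt/backups/my_backup" }
  }'

  # take a snapshot on the source cluster...
  curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'

  # ...and restore it on the target cluster
  curl -XPOST 'localhost:9200/_snapshot/my_backup/snapshot_1/_restore'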

Misc

• Same JVM version on all nodes

• ulimits

• Unicast discovery and an explicit cluster name

• SSD? Use the noop scheduler
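A hedged sketch of the corresponding knobs (host list, limit and device name are placeholders):

  # shell: raise the open-file limit before starting Elasticsearch
  ulimit -n 65535

  # elasticsearch.yml: explicit cluster name, unicast instead of multicast discovery
  cluster.name: my-production-cluster
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

  # Linux: prefer the noop I/O scheduler on SSDs
  echo noop > /sys/block/sda/queue/scheduler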

@foundsays

Learn More!

found.no/foundation

Follow @beiske