Scaling ELK Stack - DevOpsDays Singapore
ELK: Log Processing at Scale
#DevOpsDays 2015, Singapore (@DevOpsDaysSG)
Angad Singh
About me
DevOps at Viki, Inc. - a global video streaming site with subtitles.
Previously a Twitter SRE; National University of Singapore.
Twitter @angadsg, Github @angad
Elasticsearch - Log indexing and searching
Logstash - Log ingestion plumbing
Kibana - Frontend
Metrics vs Logging

Metrics
● Numeric timeseries data
● Actionable
● Counts, statistical aggregates (p90, p99 etc.)
● Scalable, cost-effective solutions already available

Logging
● Useful for debugging
● Catch-all
● Full text searching
● Computationally intensive, harder to scale
Alerting and Monitoring at Viki
Deeper level debugging with application logs
Success Rate Alert for service X
Logs
● Application logs - Stack traces, handled exceptions
● Access logs - Status codes, URI, HTTP method at all levels of the stack
● Client logs - Direct HTTP requests containing log events from client-side JavaScript or mobile applications (Android/iOS)
● Standardized log format to JSON - easy to add / remove fields
● Request tracing through various services using a Unique-ID set at the load balancer (see the sketch below)
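As a sketch, one such standardized JSON log event might look like the following; every field name and value here is illustrative, not Viki's actual schema:

    {
      "timestamp": "2015-10-16T08:30:00Z",
      "service":   "video-api",
      "level":     "error",
      "status":    500,
      "method":    "GET",
      "uri":       "/v4/videos/1234.json",
      "unique_id": "a1b2c3d4e5f6",
      "message":   "upstream timeout talking to subtitles service"
    }

Because the format is JSON end to end, adding a field is a producer-side change only; Logstash and Elasticsearch pick it up without any schema migration, and the unique_id field lets one request be traced across every service it touched.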
Logstash
● Log aggregator
● Log preprocessing (filtering etc.)
● 3-stage pipeline
● Input > Filter > Output
Elasticsearch
● Full text searching and indexing
● Built on top of Apache Lucene
● RESTful web interface
● Horizontally scalable
Kibana
● Frontend
● Visualizations, dashboards
● Supports geo visualizations
● Uses the ES REST API
Logstash pipeline

Input - any stream
● local file
● queue
● tcp, udp
● twitter
● etc.

Filter - mutation
● add / remove field
● parse as JSON
● ruby code
● parse geoip
● etc.

Output
● elasticsearch
● redis
● queue
● file
● pagerduty
● etc.
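A minimal logstash.conf sketch of that three-stage pipeline, assuming a lumberjack input on port 5043, the standardized JSON format from earlier, and a logstash 1.x-era elasticsearch output; the certificate paths, the client_ip field, and the host name are all assumptions:

    input {
      lumberjack {
        port            => 5043
        ssl_certificate => "/etc/pki/logstash.crt"
        ssl_key         => "/etc/pki/logstash.key"
      }
    }
    filter {
      json  { source => "message" }     # parse the standardized JSON log event
      geoip { source => "client_ip" }   # enrich with location for Kibana geo visualizations
    }
    output {
      elasticsearch { host => "elasticsearch" }
    }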
Logstash Forwarder
● Golang program that sits next to log files, speaking the lumberjack protocol
● Forwards logs from a file to a logstash server
● Removes the need for a buffer (such as redis, or a queue) for logs pending ingestion into logstash
● Runs as a Docker container with /var/log volume-mounted; configuration stored in Consul
● Application containers volume-mount /var/log to /var/log/docker/<container>/application.log (a config sketch follows below)
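A logstash-forwarder config sketch along those lines; the server address, CA path, and the fields block are assumptions:

    {
      "network": {
        "servers": [ "logstash.internal:5043" ],
        "ssl ca":  "/etc/pki/logstash-ca.crt",
        "timeout": 15
      },
      "files": [
        {
          "paths":  [ "/var/log/docker/*/application.log" ],
          "fields": { "type": "application" }
        }
      ]
    }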
Logstash pool with HAProxy
4 x logstash machines, 8 cores, 16GB RAM each.
7 x logstash processes per machine: 5 for application logs, 2 for HTTP client logs.
Fronted by HAProxy for both the lumberjack protocol and the HTTP protocol.
Easily scaled by adding more machines and spinning up more logstash processes.
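Since lumberjack is a TLS-wrapped TCP stream, plain TCP passthrough is enough on the load balancer. A minimal haproxy.cfg sketch, with all names, addresses, and ports assumed:

    frontend lumberjack_in
        bind *:5043
        mode tcp
        default_backend logstash_pool

    backend logstash_pool
        mode tcp
        balance leastconn
        server logstash1 10.0.0.1:5043 check
        server logstash2 10.0.0.2:5043 check

An analogous mode http frontend would cover the HTTP client-log endpoint.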
[Diagram: application service containers write to /var/log, volume-mounted to /var/log/docker/ on the host, where the logstash-forwarder container picks the files up]
Elasticsearch Hardware
12 cores, 64GB RAM, RAID 0 across 2 x 3TB 7200rpm disks.
20 nodes, 20 shards, 3 replicas (plus 1 primary).
~300GB per day x 4 copies (3 replicas + 1 primary) ≈ 3 months of data on 120TB.
Average 6k-8k logs per second, peak 25k logs per second.
https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
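Shard and replica counts like these are typically baked into an index template so every daily index inherits them; a sketch using the ES 1.x-era REST API, with the template name and index pattern assumed:

    curl -XPUT 'http://elasticsearch:9200/_template/logstash' -d '{
      "template": "logstash-*",
      "settings": {
        "number_of_shards":   20,
        "number_of_replicas": 3
      }
    }'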
Hardware Tuning
● Keep the heap under 30.5GB - the JVM can use compressed object pointers only below ~30.5GB of heap.
● Sweet spot: 64GB of RAM, with half left to the OS for Lucene file buffers.
● SSD or RAID 0 (or multiple data path directories, similar to RAID 0).
● If SSD, set the I/O scheduler to deadline instead of cfq.
● RAID 0: no need to worry about disks failing, as machines can easily be replaced thanks to the multiple copies of data.
● Disable swap.
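A host-level sketch of those settings, assuming a Debian-style elasticsearch package and a disk at /dev/sda:

    echo 'ES_HEAP_SIZE=30g' >> /etc/default/elasticsearch   # stay under 30.5GB for compressed oops
    echo deadline > /sys/block/sda/queue/scheduler          # SSD: deadline instead of cfq
    swapoff -a                                              # disable swap immediately...
    sed -i '/ swap / s/^/#/' /etc/fstab                     # ...and across reboots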
Elasticsearch Configuration
● 20 days of indexes kept open (based on available memory); the rest closed, opened on demand.
● Field data - cache used while sorting and aggregating data.
● Circuit breaker - cancels requests which would require large amounts of memory, preventing OOMs; hit http://elasticsearch:9200/_cache/clear if field data gets very close to the memory limit.
● Shards >= number of nodes.
● Lucene forceMerge - minor performance improvements for older indexes (https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.html).
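For reference, the corresponding REST calls; the hostname and the example index name are assumed:

    curl -XPOST 'http://elasticsearch:9200/_cache/clear'                 # drop caches when fielddata nears its limit
    curl 'http://elasticsearch:9200/_cat/fielddata?v'                    # check fielddata usage per node
    curl -XPOST 'http://elasticsearch:9200/logstash-2015.01.01/_close'   # close a cold index...
    curl -XPOST 'http://elasticsearch:9200/logstash-2015.01.01/_open'    # ...and reopen it on demand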
And also...
● Prevent split brain (and the data loss that comes with it): set the minimum number of master-eligible nodes to (n/2 + 1).
● Set a higher ulimit for the elasticsearch process.
● A daily cronjob deletes data older than 90 days, closes indices older than 20 days, and optimizes (forceMerge) indices older than 2 days.
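A sketch of those three settings. With 20 master-eligible nodes the quorum is (20/2) + 1 = 11; the cron lines assume curator 3.x-era syntax and daily logstash-%Y.%m.%d indices:

    # elasticsearch.yml
    discovery.zen.minimum_master_nodes: 11

    # /etc/security/limits.conf - raise the open-file limit for the ES user
    elasticsearch - nofile 65535

    # daily cron
    curator --host elasticsearch delete indices --older-than 90 --time-unit days --timestring '%Y.%m.%d'
    curator --host elasticsearch close indices --older-than 20 --time-unit days --timestring '%Y.%m.%d'
    curator --host elasticsearch optimize indices --older-than 2 --time-unit days --timestring '%Y.%m.%d'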
Monitoring
● Marvel - official monitoring plugin from Elasticsearch
● KOPF - index management plugin
● CAT APIs - REST APIs to view cluster information
● Curator - data management
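A few of the CAT APIs in question (hostname assumed; ?v adds column headers):

    curl 'http://elasticsearch:9200/_cat/health?v'    # cluster status at a glance
    curl 'http://elasticsearch:9200/_cat/indices?v'   # per-index doc counts, size, shard state
    curl 'http://elasticsearch:9200/_cat/nodes?v'     # heap, load, and role per node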