Scaling ELK Stack - DevOpsDays Singapore
ELK: Log Processing at Scale
#DevOpsDays 2015, Singapore (@DevOpsDaysSG)
Angad Singh
About me
DevOps at Viki, Inc. - a global video streaming site with subtitles.
Previously a Twitter SRE; National University of Singapore.
Twitter @angadsg, Github @angad
Elasticsearch - Log indexing and searching
Logstash - Log ingestion plumbing
Kibana - Frontend
Metrics vs Logging

Metrics
● Numeric timeseries data
● Actionable
● Counts, statistical aggregates (p90, p99 etc.)
● Scalable, cost-effective solutions already available

Logging
● Useful for debugging
● Catch-all
● Full text searching
● Computationally intensive, harder to scale
Alerting and Monitoring at Viki
Deeper level debugging with application logs
Success Rate Alert for service X
Logs
● Application logs - Stack traces, handled exceptions
● Access logs - Status codes, URI, HTTP method at all levels of the stack
● Client logs - Direct HTTP requests containing log events from client-side JavaScript or mobile applications (Android/iOS)
● Standardized log format to JSON - easy to add / remove fields
● Request tracing through various services using a Unique-ID set at the load balancer (see the sketch below)
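As a sketch, one such standardized JSON log event might look like the following; every field name and value here is illustrative, not Viki's actual schema:

    {
      "timestamp": "2015-10-16T08:30:00Z",
      "service":   "video-api",
      "level":     "error",
      "status":    500,
      "method":    "GET",
      "uri":       "/v4/videos/1234.json",
      "unique_id": "a1b2c3d4e5f6",
      "message":   "upstream timeout talking to subtitles service"
    }

Because the format is JSON end to end, adding a field is a producer-side change only; Logstash and Elasticsearch pick it up without any schema migration, and the unique_id field lets one request be traced across every service it touched.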
Logstash
● Log aggregator
● Log preprocessing (filtering etc.)
● 3-stage pipeline
● Input > Filter > Output
Elasticsearch
● Full text searching and indexing
● Built on top of Apache Lucene
● RESTful web interface
● Horizontally scalable
Kibana
● Frontend
● Visualizations, dashboards
● Supports geo visualizations
● Uses the ES REST API
Logstash pipeline

Input - any stream
● local file
● queue
● tcp, udp
● twitter
● etc.

Filter - mutation
● add / remove field
● parse as JSON
● ruby code
● parse geoip
● etc.

Output
● elasticsearch
● redis
● queue
● file
● pagerduty
● etc.
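A minimal logstash.conf sketch of that three-stage pipeline, assuming a lumberjack input on port 5043, the standardized JSON format from earlier, and a logstash 1.x-era elasticsearch output; the certificate paths, the client_ip field, and the host name are all assumptions:

    input {
      lumberjack {
        port            => 5043
        ssl_certificate => "/etc/pki/logstash.crt"
        ssl_key         => "/etc/pki/logstash.key"
      }
    }
    filter {
      json  { source => "message" }     # parse the standardized JSON log event
      geoip { source => "client_ip" }   # enrich with location for Kibana geo visualizations
    }
    output {
      elasticsearch { host => "elasticsearch" }
    }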
Logstash Forwarder
● Golang program that sits next to log files, speaking the lumberjack protocol
● Forwards logs from a file to a logstash server
● Removes the need for a buffer (such as redis, or a queue) for logs pending ingestion into logstash
● Runs as a Docker container with /var/log volume-mounted; configuration stored in Consul
● Application containers volume-mount /var/log to /var/log/docker/<container>/application.log (a config sketch follows below)
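A logstash-forwarder config sketch along those lines; the server address, CA path, and the fields block are assumptions:

    {
      "network": {
        "servers": [ "logstash.internal:5043" ],
        "ssl ca":  "/etc/pki/logstash-ca.crt",
        "timeout": 15
      },
      "files": [
        {
          "paths":  [ "/var/log/docker/*/application.log" ],
          "fields": { "type": "application" }
        }
      ]
    }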
Logstash pool with HAProxy
4 x logstash machines, 8 cores, 16GB RAM each.
7 x logstash processes per machine: 5 for application logs, 2 for HTTP client logs.
Fronted by HAProxy for both the lumberjack protocol and the HTTP protocol.
Easily scaled by adding more machines and spinning up more logstash processes.
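Since lumberjack is a TLS-wrapped TCP stream, plain TCP passthrough is enough on the load balancer. A minimal haproxy.cfg sketch, with all names, addresses, and ports assumed:

    frontend lumberjack_in
        bind *:5043
        mode tcp
        default_backend logstash_pool

    backend logstash_pool
        mode tcp
        balance leastconn
        server logstash1 10.0.0.1:5043 check
        server logstash2 10.0.0.2:5043 check

An analogous mode http frontend would cover the HTTP client-log endpoint.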
[Diagram: application service containers write to /var/log, volume-mounted to /var/log/docker/ on the host, where the logstash-forwarder container picks the files up]
Elasticsearch Hardware
12 cores, 64GB RAM, RAID 0 across 2 x 3TB 7200rpm disks.
20 nodes, 20 shards, 3 replicas (plus 1 primary).
~300GB per day x 4 copies (3 replicas + 1 primary) ≈ 3 months of data on 120TB.
Average 6k-8k logs per second, peak 25k logs per second.
https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
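Shard and replica counts like these are typically baked into an index template so every daily index inherits them; a sketch using the ES 1.x-era REST API, with the template name and index pattern assumed:

    curl -XPUT 'http://elasticsearch:9200/_template/logstash' -d '{
      "template": "logstash-*",
      "settings": {
        "number_of_shards":   20,
        "number_of_replicas": 3
      }
    }'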
Hardware Tuning
● Keep the heap under 30.5GB - the JVM can use compressed object pointers only below ~30.5GB of heap.
● Sweet spot: 64GB of RAM, with half left to the OS for Lucene file buffers.
● SSD or RAID 0 (or multiple data path directories, similar to RAID 0).
● If SSD, set the I/O scheduler to deadline instead of cfq.
● RAID 0: no need to worry about disks failing, as machines can easily be replaced thanks to the multiple copies of data.
● Disable swap.
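A host-level sketch of those settings, assuming a Debian-style elasticsearch package and a disk at /dev/sda:

    echo 'ES_HEAP_SIZE=30g' >> /etc/default/elasticsearch   # stay under 30.5GB for compressed oops
    echo deadline > /sys/block/sda/queue/scheduler          # SSD: deadline instead of cfq
    swapoff -a                                              # disable swap immediately...
    sed -i '/ swap / s/^/#/' /etc/fstab                     # ...and across reboots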
Elasticsearch Configuration
● 20 days of indexes kept open (based on available memory); the rest closed, opened on demand.
● Field data - cache used while sorting and aggregating data.
● Circuit breaker - cancels requests which would require large amounts of memory, preventing OOMs; hit http://elasticsearch:9200/_cache/clear if field data gets very close to the memory limit.
● Shards >= number of nodes.
● Lucene forceMerge - minor performance improvements for older indexes (https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.html).
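For reference, the corresponding REST calls; the hostname and the example index name are assumed:

    curl -XPOST 'http://elasticsearch:9200/_cache/clear'                 # drop caches when fielddata nears its limit
    curl 'http://elasticsearch:9200/_cat/fielddata?v'                    # check fielddata usage per node
    curl -XPOST 'http://elasticsearch:9200/logstash-2015.01.01/_close'   # close a cold index...
    curl -XPOST 'http://elasticsearch:9200/logstash-2015.01.01/_open'    # ...and reopen it on demand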
And also...
● Prevent split brain (and the data loss that comes with it): set the minimum number of master-eligible nodes to (n/2 + 1).
● Set a higher ulimit for the elasticsearch process.
● A daily cronjob deletes data older than 90 days, closes indices older than 20 days, and optimizes (forceMerge) indices older than 2 days.
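A sketch of those three settings. With 20 master-eligible nodes the quorum is (20/2) + 1 = 11; the cron lines assume curator 3.x-era syntax and daily logstash-%Y.%m.%d indices:

    # elasticsearch.yml
    discovery.zen.minimum_master_nodes: 11

    # /etc/security/limits.conf - raise the open-file limit for the ES user
    elasticsearch - nofile 65535

    # daily cron
    curator --host elasticsearch delete indices --older-than 90 --time-unit days --timestring '%Y.%m.%d'
    curator --host elasticsearch close indices --older-than 20 --time-unit days --timestring '%Y.%m.%d'
    curator --host elasticsearch optimize indices --older-than 2 --time-unit days --timestring '%Y.%m.%d'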
Monitoring
● Marvel - official monitoring plugin from Elasticsearch
● KOPF - index management plugin
● CAT APIs - REST APIs to view cluster information
● Curator - data management
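A few of the CAT APIs in question (hostname assumed; ?v adds column headers):

    curl 'http://elasticsearch:9200/_cat/health?v'    # cluster status at a glance
    curl 'http://elasticsearch:9200/_cat/indices?v'   # per-index doc counts, size, shard state
    curl 'http://elasticsearch:9200/_cat/nodes?v'     # heap, load, and role per node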