Efficient monitoring with open source tools

Osman Ungur, github.com/o

Who am I?

• software developer with a system-administration background of over 10 years

• mostly writes Java and PHP

• also works on infrastructure design, system automation, deployment and monitoring

• obsessed with clean, well-structured, maintainable and scalable architectures

• loves open source github.com/o

My career path

• In 2002, I started learning the fundamentals of Linux networking and security. After that, I spent years selling and managing dedicated servers and shared web hosting.

• After the Linux administration years, I dived into PHP in 2005 and learned the principles of object-oriented programming.

• In 2010, I started a company that used Java, the Spring Framework and a service-oriented architecture (SOA). I ported thousands of lines of PHP code to Java and gained experience with very high traffic. Gradually I embraced Java, NoSQL, REST and microservice architectures.

• Since August 2015, I have been working as a freelance consultant, trainer and developer. I'm an active contributor to and author of open-source projects.

Today

• Why do I need monitoring?

• Best practices

• Time-series databases

• Agents

• Dashboards

• Alerting

What is going on?

• What is your application doing right now?

• Will you be notified when a server fails?

How about fixing things?

• Fixing problems is difficult without logs and monitoring

• Sleep better with automation and monitoring

Customers and Boss

• Don't tolerate software errors

• Everyone hates "500 Server Error"

• Don't like slow websites

Loss of

• productivity

• money

• reputation

• time

• customer

• trust

What kind of problems? What to monitor?!?!

• Can the users hit my page?

• What is the 95th-percentile page load time?

• Has our revenue increased?

• Which exceptions occurred most often in the last hour?

• I didn't change the code, so why is something wonky?

• Which part of the system is inaccessible?

• Do I need to scale my servers up or down?

• Are my servers running over capacity?

• Is the (rdbms|mq|cache) healthy?

• What is the (mem|cpu|disk|io) usage of the servers?

• Is the (app server|web server|lb) up?

• How does current (bandwidth|network) usage compare with previous weeks?
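As an illustration of the page-load-time question above, here is a minimal sketch of a 95th-percentile calculation using the nearest-rank method. The function name `percentile` and the sample values are hypothetical; real monitoring stacks usually compute percentiles inside the time-series database or from histograms rather than from raw lists.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Rank is 1-based; multiply before dividing to limit float rounding.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical page load times in milliseconds.
load_times_ms = [120, 95, 310, 101, 99, 2500, 130, 110, 105, 98]
p95 = percentile(load_times_ms, 95)
p50 = percentile(load_times_ms, 50)
```

With these ten samples the median stays low while the 95th percentile lands on the single slow request, which is why percentiles tend to reveal problems that an average hides.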

If something fails?

You need to get it up and running ASAP

Our objective is to reduce

• time to detect

• time to repair
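The two objectives can be tracked as simple averages over past incidents. The sketch below assumes a hypothetical incident record with `started`, `detected` and `resolved` timestamps, and assumes that time to repair is measured from detection to resolution (some teams measure it from the start of the outage instead).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log; the field names are assumptions for illustration.
incidents = [
    {
        "started": datetime(2016, 4, 1, 10, 0),
        "detected": datetime(2016, 4, 1, 10, 12),
        "resolved": datetime(2016, 4, 1, 10, 42),
    },
    {
        "started": datetime(2016, 4, 3, 14, 0),
        "detected": datetime(2016, 4, 3, 14, 4),
        "resolved": datetime(2016, 4, 3, 14, 24),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average duration in minutes between two timestamps of each record."""
    return mean(
        (record[end_key] - record[start_key]).total_seconds() / 60
        for record in records
    )


mean_time_to_detect = mean_minutes(incidents, "started", "detected")
mean_time_to_repair = mean_minutes(incidents, "detected", "resolved")
```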

Time series databases

A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range). In some fields these time series are called profiles, curves, or traces. A time series of stock prices might be called a price curve. A time series of energy consumption might be called a load profile. A log of temperature values over time might be called a temperature trace.

Wikipedia

RRDTool

• Round robin database tool (File based)

• Successor of MRTG

• Used by Nagios, Munin, Cacti, pfSense, Ganglia

• Storing and graphing capability

• Outdated data model, only command line interface
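The "round robin" idea is a fixed number of slots in which the newest sample overwrites the oldest, so the storage never grows. The class below is a minimal in-memory sketch of that idea only; the name `RoundRobinArchive` is hypothetical, and it does not reproduce RRDTool's file format or its consolidation of data into several archives at different resolutions.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size store: once full, each new point evicts the oldest one."""

    def __init__(self, slots):
        # deque with maxlen drops items from the left when it is full.
        self._points = deque(maxlen=slots)

    def update(self, timestamp, value):
        self._points.append((timestamp, value))

    def fetch(self):
        """Return the retained points, oldest first."""
        return list(self._points)


archive = RoundRobinArchive(slots=3)
for second, cpu in enumerate([10, 20, 30, 40, 50]):
    archive.update(second, cpu)
retained = archive.fetch()
```

After five updates into three slots, only the three most recent points remain.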

Graphite

• Whisper database library (File based)

• Very popular, simple to operate

• Tons of tools that work with graphite

• Comes with dashboard, nice functions

• Outdated data model, doesn't scale
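Part of the reason so many tools work with Graphite is its plaintext protocol: one line per data point in the form `metric.path value timestamp`, sent to the Carbon daemon (TCP port 2003 by default). The sketch below formats and sends such a line; the function names, the metric path and the host are hypothetical, and the send function is defined but not called here.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point for Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(line, host="localhost", port=2003):
    """Send a formatted line to a Carbon listener (assumed host and port)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


line = graphite_line("servers.web01.cpu.user", 42.5, 1460000000)
```

The dotted path carries all the metadata (server, subsystem, metric), which is the hierarchical data model that label-based systems later replaced.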

InfluxDB

• Time Structured Merge Tree (TSM)

• Easy to operate, highly customisable

• Also supports events

• Good performance, InfluxQL

• Clustering removed from open source edition
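Data points are written to InfluxDB in its line protocol: a measurement name, optional comma-separated tags, one or more fields, and a timestamp (nanoseconds by default). The sketch below builds such a line. The function names and sample values are hypothetical, and as a simplifying assumption it supports only numeric and boolean fields and does no escaping, so names and tag values must not contain spaces, commas or equals signs.

```python
def format_field(value):
    """Render a field value: booleans as true/false, integers with an
    'i' suffix, floats as plain decimals."""
    if isinstance(value, bool):  # check bool first, it is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def influx_line(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol record; tags are sorted by key."""
    if not fields:
        raise ValueError("at least one field is required")
    tag_part = "".join(f",{key}={tags[key]}" for key in sorted(tags))
    field_part = ",".join(
        f"{key}={format_field(value)}" for key, value in fields.items()
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


record = influx_line(
    "cpu",
    {"region": "eu", "host": "web01"},
    {"usage_user": 42.5, "cores": 4},
    1460000000000000000,
)
```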

Prometheus

• Local file per time series

• Pull based metric collectors, PromQL

• Easy to operate, good data model

• Efficient storage, good performance

• Also supports alerting
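"Pull based" means the application exposes its current metric values over HTTP and Prometheus scrapes them on a schedule. The sketch below renders a counter in the Prometheus text exposition format by hand, to show what a scrape returns. The function names and the metric are hypothetical; in practice an official client library would normally produce this output.

```python
def escape_label(value):
    """Escape backslashes, double quotes and newlines in a label value."""
    return (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )


def render_counter(name, help_text, samples):
    """Render a counter as text exposition lines.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_part = ",".join(
            f'{key}="{escape_label(val)}"' for key, val in labels.items()
        )
        series = f"{name}{{{label_part}}}" if labels else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


exposition = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "get", "code": "200"}, 1027)],
)
```

Each distinct combination of label values is its own time series, which is what makes queries such as "requests by status code" possible in PromQL.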

OpenTSDB

• Hadoop (HBase) backed

• Scales very well, moderate performance

• JSON over HTTP

• One of the first databases to use metric labels in its data model

• Painful to operate
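"JSON over HTTP" refers to the OpenTSDB HTTP API, where data points are posted to `/api/put` as JSON objects with `metric`, `timestamp`, `value` and `tags` keys. The sketch below builds such a payload and shows how it might be posted. The function names, the metric and the URL are hypothetical (4242 is the default OpenTSDB port), and the send function is defined but not called here.

```python
import json
import urllib.request


def opentsdb_payload(metric, timestamp, value, tags):
    """Build the JSON body for one data point; OpenTSDB needs at least one tag."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    point = {
        "metric": metric,
        "timestamp": timestamp,
        "value": value,
        "tags": tags,
    }
    return json.dumps(point)


def put_datapoint(payload, url="http://localhost:4242/api/put"):
    """POST the payload to an OpenTSDB instance (assumed URL)."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = opentsdb_payload("sys.cpu.user", 1460000000, 42.5, {"host": "web01"})
```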

RiakTS

• Riak backed

• Very easy to operate

• Moderate performance

• Highly resilient

• Good data model, querying like SQL

DalmatinerDB

• Riak backed

• Very high performance

• Clustering and fault tolerance

• Works with ZFS, Postgres

• Limited client support

KairosDB

• Cassandra storage

• Fast writes

• Good data model

• Inefficient storage

• Slow to query

Blueflood

• Cassandra storage

• Good performance

• Highly scalable

• Outdated data model

• Metric processing system behind Rackspace Metrics

Others

• Druid

• Netflix Atlas

• Chronix Server

Complex monitoring suites

• Nagios

• Sensu

• Ganglia

Log based

• Graylog

• ELK

• Splunk

Agents

• Collectd

• Diamond

• Metrics

Alerting

• Riemann

• Seyren

• Icinga

Questions?

github.com/o