Seminar CNAF

Seminar CNAF Exploiting open source tools to realize a new monitoring infrastructure at CERN Pedro Andrade CERN IT/CF


Page 1: Seminar CNAF

Seminar CNAF

Exploiting open source tools to realize a new monitoring infrastructure at CERN

Pedro Andrade – CERN IT/CF

Page 2: Seminar CNAF

Overview

• Agile Infrastructure

• Monitoring Project

• Solutions and Technologies

• Producers

• Transport

• Archive

• Query and Analytics

• Real-time Analytics

• Notifications

3/17/2014 CNAF Seminar 2

Page 3: Seminar CNAF

Agile Infrastructure


Page 4: Seminar CNAF

Challenges

• New data centre in Budapest since 2013

• Additional capacity required in view of physics needs

• Local on-site maintenance for installations/repairs


Page 5: Seminar CNAF

Challenges

• Be ready to handle 15’000 servers

• Growing number of users of CERN’s facilities and higher computing requirements as data rates increase

• Staff numbers are fixed, no more people

• Materials budget decreasing, no more money

• Legacy tools are high maintenance and brittle

• Deploy new services within hours


Page 6: Seminar CNAF

Challenges

• “We Are Not Special”

• Move to commonly used open source tools

• Focus on strong communities and momentum

• Stop re-inventing tools ("not invented here" syndrome)

• Implement clouds at scale

• Aim for 90% infrastructure virtualised

• Ecosystem solutions rather than writing from scratch

• Request to delivery in a coffee break


Page 7: Seminar CNAF

Agile Infrastructure

• Activity started in 2012

• Remodel IT services

• Move to a more horizontal approach

• Layered model: IaaS, PaaS, SaaS

• Services, Configuration, Installation, Hardware

• Virtualisation is key

• Improve efficiency

• Operational, Resources


Page 8: Seminar CNAF

Agile Infrastructure

[Figure: Agile Infrastructure tool chain. Components shown: Bamboo; Koji, Mock; AIMS/PXE; Foreman; Yum repo; Pulp; Puppet-DB; mcollective, yum; JIRA; Lemon / Hadoop / Elastic Search / Kibana; git; OpenStack Nova; Hardware database; Puppet; Active Directory / LDAP]

Page 9: Seminar CNAF

Monitoring Project


Page 10: Seminar CNAF

Challenges

• Several independent monitoring activities in IT

• High level services are interdependent

• Understanding performance more important

• Move to a virtualized dynamic infrastructure

• Preserve our investment in monitoring

Shared architecture & tool-chain components


Page 11: Seminar CNAF

Objectives

• Deliver solutions for the shared architecture

• Work with all IT monitoring teams

• Deliver simple adoption: PaaS

• Better exploit IT resources

• While at the same time

• Mix and match open source solutions

• Exploit new tools from the Agile Infrastructure

• Retire old tools: Lemon DB, Lemon Web, LAS, etc.


Page 12: Seminar CNAF

Architecture


Page 13: Seminar CNAF

Process Improvements

• Establish Agile methodology

• Well defined sprints with clear targets

• Interactive evolution, continuous feedback

• Exploit Open Source tools

• Best fit, large adoption, active community

• Fast to adopt, accept limitations, easily replaced

• Look at DevOps

• Quality Assurance processes

• Continuous Integration processes


Page 14: Seminar CNAF

Technologies

• Many options available!


Page 15: Seminar CNAF

Technologies


Page 16: Seminar CNAF

Producers


Page 17: Seminar CNAF

Motivation

• Preserve sensors/probes knowledge

• Many years writing sensors for Lemon

• Integrate other data sources

• Most likely service specific monitoring data

Selected Technology: Lemon + Others


Page 18: Seminar CNAF

Lemon Producer

• Same old lemon agent

• Running in all data centre nodes

• Lemon agent extended with lemon forwarder

• Send notifications to ActiveMQ

• Send metrics to Flume

• Send syslog to Flume


Page 19: Seminar CNAF

Other Producers

• Must follow common monitoring specification

• Metric v3.0 and Notification v2.0

• Can use monitoring-data-model to create new metrics and notifications and validate them

• Messages can be sent

• To ActiveMQ using a stomp client

• To Flume gateway using a flume agent

• Planning to evaluate Collectd later this year
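As a rough illustration of the "ActiveMQ using a stomp client" path, the sketch below builds a single STOMP SEND frame by hand. The destination name and the metric fields are hypothetical: the real field set is defined by the Metric v3.0 specification, which is not reproduced in these slides. A real producer would normally use a STOMP client library and would also handle connection setup and header escaping.

```python
import json


def build_stomp_send_frame(destination: str, body: str) -> bytes:
    """Encode a STOMP SEND frame: command, headers, blank line, body, NUL byte."""
    payload = body.encode("utf-8")
    headers = [
        f"destination:{destination}",
        "content-type:application/json",
        f"content-length:{len(payload)}",
    ]
    head = "SEND\n" + "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + payload + b"\x00"


# Hypothetical metric document; the real fields come from Metric v3.0.
metric = {"hostname": "node01.example.org", "metric_id": 1234, "value": 0.42}

# Hypothetical destination name, chosen only for this example.
frame = build_stomp_send_frame("/topic/monitoring.metrics", json.dumps(metric))
print(frame)
```

The frame would be written to an already connected broker socket; the connection handshake is left out to keep the sketch short.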


Page 20: Seminar CNAF

Transport


Page 21: Seminar CNAF

Motivation

• Collect operations data

• Lemon metrics and syslog

• 3rd party applications and services

• Scalable transport layer

• Large data volume

• Easy integration with other technologies

Selected Technology: Flume


Page 22: Seminar CNAF

Flume

• Distributed service for collecting large data sets

• Robust and fault tolerant

• Horizontally scalable

• Many ready to be used input/output plugins

• Java based, Apache license

• Cloudera is the main contributor

• Using their releases

• Less frequent but more stable releases


Page 23: Seminar CNAF

Flume

• Flume event

• Payload + set of string headers

• Flume agent

• JVM process hosting “source to sink” flows
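The event and agent concepts can be pictured with a small toy model. This is not Flume's API (Flume is Java and is driven by a configuration file); the class and function names below are invented for illustration. It only shows the shape of an event (payload plus string headers) and of a flow in which a source feeds a channel that a sink drains.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Iterable, Optional


@dataclass
class Event:
    """A payload plus a set of string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


class MemoryChannel:
    """Bounded in-memory buffer sitting between a source and a sink."""

    def __init__(self, capacity: int) -> None:
        self._queue: Deque[Event] = deque()
        self._capacity = capacity

    def put(self, event: Event) -> bool:
        # Refuse the event when the channel is full.
        if len(self._queue) >= self._capacity:
            return False
        self._queue.append(event)
        return True

    def take(self) -> Optional[Event]:
        return self._queue.popleft() if self._queue else None


def run_flow(source: Iterable[Event], channel: MemoryChannel,
             sink: Callable[[Event], None]) -> int:
    """Push every source event into the channel, then let the sink drain it."""
    for event in source:
        channel.put(event)
    delivered = 0
    while (event := channel.take()) is not None:
        sink(event)
        delivered += 1
    return delivered


received = []
events = [
    Event(b"cpu_load 0.42", {"host": "node01", "type": "metric"}),
    Event(b"sshd: session opened", {"host": "node01", "type": "syslog"}),
]
count = run_flow(events, MemoryChannel(capacity=10), received.append)
print(count, [e.headers["type"] for e in received])
```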


Page 24: Seminar CNAF

Flume

• Many ready-to-be-used plugins

• Sources: Avro, JMS, Spool, Syslog, HTTP, etc.

• Interceptors: decorate events, filter events

• Channels: Memory, File, JDBC

• Sinks: Avro, Thrift, ElasticSearch, HDFS, File, etc.

• Custom sources/sinks can be implemented


Page 25: Seminar CNAF

Flume

• Routing is static

• On demand subscriptions are not possible

• Requires reconfiguration and restart

• No authN and authZ features

• But secure transport available

• Java process on client side

• Small memory footprint would be nicer


Page 26: Seminar CNAF

Deployment

• Running flume 1.3, latest is flume 1.4


Page 27: Seminar CNAF

Deployment

• 1st layer: Flume Data publisher

• Deployed in all data centre nodes

• 2nd layer: Flume Gateway

• 20 VMs aggregating events

• 3rd layer: Flume ElasticSearch

• 10 VMs inserting to ElasticSearch

• 3rd layer: Flume Hadoop HDFS

• 10 VMs inserting to Hadoop HDFS


Page 28: Seminar CNAF

Feedback

• Sizing flume layers needs some tuning

• Available sources/sinks saved a lot of time


Page 29: Seminar CNAF

Archive


Page 30: Seminar CNAF

Motivation

• Store operations raw data

• Long term archival required

• Allow future data replay to other tools

• Feed real-time engine

• Offline processing of collected data

• Security data? Syslog data?

Selected Technology: Hadoop/HDFS


Page 31: Seminar CNAF

Hadoop/HDFS

• Hadoop is a framework that allows the distributed processing of large data sets

• HDFS is a distributed filesystem designed to run on commodity hardware

• Suitable for applications with large data sets

• Designed for batch processing, not interactive use

• High throughput preferred to low latency access


Page 32: Seminar CNAF

Hadoop/HDFS

• Small files not welcome: block size is 64 MB or 128 MB

• Limit of tens of millions of files per cluster

• NameNode holds the file map in memory

• Transparent compression not available

• Raw text could take much less space

• Real-time data access is not possible
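The small-files limitation follows from the NameNode keeping an in-memory entry for every file and every block. The sketch below compares the number of such entries for the same amount of data stored as one large file or as many small ones. The figure of about 150 bytes per entry is a commonly quoted rule of thumb and is used here purely as an assumption, not as a measured value for this cluster.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value
MIB = 2 ** 20


def namenode_objects(file_sizes_bytes, block_size_bytes):
    """Count one entry per file plus one per block that the file occupies."""
    objects = 0
    for size in file_sizes_bytes:
        blocks = max(1, math.ceil(size / block_size_bytes))
        objects += 1 + blocks
    return objects


block_size = 128 * MIB

# The same 1 GiB of data, stored two different ways.
one_large_file = namenode_objects([1024 * MIB], block_size)
many_small_files = namenode_objects([1 * MIB] * 1024, block_size)

print(one_large_file, one_large_file * BYTES_PER_OBJECT)
print(many_small_files, many_small_files * BYTES_PER_OBJECT)
```

Under these assumptions the small-file layout needs a few hundred times more NameNode entries for the same data, which is why aggregating into large files pays off.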


Page 33: Seminar CNAF

Deployment

• Production cluster

• ~200 TB available in 5 data nodes

• 6.3 TB stored since mid July 2013

• Data organized by hostgroup (cluster)

• Daily jobs to aggregate data by month

• Large files preferred to many small files
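For a sense of scale, the stored volume can be turned into an average ingest rate. The sketch assumes that "mid July 2013" means 15 July, that the figure is current as of the seminar date (17 March 2014), and that terabytes are decimal; whether the 6.3 TB includes HDFS replication is not stated in the slides.

```python
from datetime import date

STORED_BYTES = 6.3e12        # 6.3 TB from the slide, decimal units assumed
START = date(2013, 7, 15)    # "mid July 2013", exact day assumed
END = date(2014, 3, 17)      # date of the seminar

days = (END - START).days
gb_per_day = STORED_BYTES / days / 1e9

print(f"{days} days, about {gb_per_day:.1f} GB archived per day on average")
```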


Page 34: Seminar CNAF

Query & Analytics


Page 35: Seminar CNAF

Motivation

• Real-time queries based on clear API

• Dynamic dashboards creation

• Rich user-friendly dashboards

• Horizontally scalable and easy to deploy

• Limited data retention policy

• Handle different data types in the same way

Selected Technology: ElasticSearch + Kibana


Page 36: Seminar CNAF

ElasticSearch

• Distributed RESTful search & analytics engine

• Real time data acquisition and indexing

• Automatically balanced shards and replicas

• Schema free, document oriented (JSON)

• No prior data declaration required

• Automatic data type discovery

• Distributed under Apache license


Page 37: Seminar CNAF

ElasticSearch

• Full text search

• Apache Lucene is used to provide full text search

• Not only text: integer/long, float/double, boolean, etc.

• RESTful JSON API


$ curl -XGET 'http://es-search:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "itmon-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2990,
  "active_shards" : 8970,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
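The same health check can be scripted. The sketch below is one possible way to do it with the Python standard library; the helper names are invented here, and the sample values are copied from the output above so the example runs without a cluster. It also derives how many copies of each shard are active, a number the raw output only implies.

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url: str) -> dict:
    """GET <base_url>/_cluster/health and decode the JSON reply."""
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as response:
        return json.load(response)


def copies_per_primary(health: dict) -> float:
    """Active shard copies (primary included) per primary shard."""
    return health["active_shards"] / health["active_primary_shards"]


def is_fully_allocated(health: dict) -> bool:
    """True when the cluster is green and no shard is waiting for a node."""
    return health["status"] == "green" and health["unassigned_shards"] == 0


# Values copied from the slide, so no live cluster is needed here.
sample = {
    "cluster_name": "itmon-es",
    "status": "green",
    "active_primary_shards": 2990,
    "active_shards": 8970,
    "unassigned_shards": 0,
}
print(copies_per_primary(sample), is_fully_allocated(sample))
```

Against a live cluster the call would be `fetch_cluster_health("http://es-search:9200")`, using the host name shown on the slide.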

Page 38: Seminar CNAF

Limitations ElasticSearch

• Requires a lot of RAM, mainly on data nodes

• IO intensive, careful deployment required

• Shards re-initialisation takes some time (~1h)

• Lots of shards and replicas per index, lots of indexes

• Not frequent operation, only after full cluster reboot

• Authentication not built-in (“bricolage”)

• Apache+Shibboleth on top of Jetty plugin


Page 39: Seminar CNAF

Kibana

• “Make sense of a mountain of logs”

• Designed to analyse logs

• Perfectly fits timestamped data (e.g. metrics)

• Profits from ElasticSearch search/analyse features

• No coding required

• Simply point & click to build your own dashboard

• Fully integrated and supported by ElasticSearch

• Started as separate project


Page 40: Seminar CNAF

Kibana

• Built with AngularJS

• JavaScript MVC for client-side rich application

• Developed and maintained by

• No backend: web server delivers static files

• JS directly queries ElasticSearch

• Easy to install and configure

• “git clone” OR “tar -xvzf” OR ElasticSearch plugin

• 1-line config file to point to the ElasticSearch cluster

• Save its own configuration in ElasticSearch


Page 41: Seminar CNAF

Our Deployment

• Production cluster

• Running ElasticSearch 0.90.7

• 2 master nodes (16GB RAM, 8 cores)

• 1 search node (16GB RAM, 8 cores)

• 8 data nodes (48GB RAM, 24 cores, 500GB SSD)

• Monitoring: ElasticHQ, BigDesk, and Head

• Indexes structure

• One index per day with 30 days TTL

• 10 shards per index, 3 replicas per shard
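A daily-index layout with a 30-day lifetime can be sketched as follows. The index prefix and the naming pattern are hypothetical, and the sketch assumes that expiry is handled by dropping whole indices older than the retention period; the slides do not say how the TTL is actually enforced.

```python
from datetime import date, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Name of the index holding one day of data, e.g. prefix-2014.03.17."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indices(prefix, existing_days, today, retention_days=30):
    """Names of daily indices older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    return [daily_index_name(prefix, d) for d in sorted(existing_days) if d < cutoff]


today = date(2014, 3, 17)
existing = [date(2014, 2, 14), date(2014, 2, 15), date(2014, 3, 16)]

print(daily_index_name("lemon", today))
print(expired_indices("lemon", existing, today))
```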


Page 42: Seminar CNAF

Our Deployment

• Based on ElasticSearch plugin

• Running v3.pre-4

• Deployed together with search node

• Profits from Jetty authentication

• Different endpoints for AuthN

• Public (read only)

• Private (read write)


Page 43: Seminar CNAF

Feedback

• Easy to deploy and manage

• Robust, fast, and rich API

• Easy query language (DSL)

• More features with aggregation framework

• Released with ElasticSearch v1.0
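To give a flavour of the query DSL, the sketch below builds a search body that combines a free-text query, a time-range filter and a terms aggregation. The field names are hypothetical. The `aggs` section needs ElasticSearch 1.0 or later, so it would not run on the 0.90.7 cluster described earlier, and the `filtered` query used here belongs to that generation of the DSL and was dropped in much later releases.

```python
import json


def build_search_body(query_text, start, end, terms_field):
    """Free-text query restricted to a time range, bucketed by one field."""
    return {
        "query": {
            "filtered": {
                "query": {"query_string": {"query": query_text}},
                "filter": {"range": {"@timestamp": {"gte": start, "lte": end}}},
            }
        },
        # One bucket per distinct value of terms_field (ElasticSearch >= 1.0).
        "aggs": {"per_term": {"terms": {"field": terms_field}}},
        # Only the aggregation result is wanted, not the matching documents.
        "size": 0,
    }


body = build_search_body(
    "error", "2014-03-16T00:00:00", "2014-03-17T00:00:00", "hostgroup"
)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a request to the `_search` endpoint of one or more indices.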


Page 44: Seminar CNAF

Feedback

• Easy to deploy and use

• Very cool user interface

• Fits many use cases: text (syslog), metrics (lemon)

• Many “panels” available: tables, charts, hits, etc.

• Very active community and growing

• A bit limited feature set

• Many developments ongoing


Page 45: Seminar CNAF

Notifications


Page 46: Seminar CNAF

Motivation

• Modular tools to manage notifications

• Notifications delivered to multiple endpoints

• Automatic SNOW tickets / Central dashboard / etc.

• More efficient handling of notifications

• Enable SMs to improve automation of their services

• Improve routing of SNOW tickets

• Avoid wasting time in multiple (fake) hops

• Make visible problems previously hidden from SMs

• Allow others to publish/consume notifications


Page 47: Seminar CNAF

GNI

• General Notifications Infrastructure

• Manage all data centre notifications

• Messaging consumers integrating with other tools

• Multiple notification types: HW, APP, OS, NC

• Notifications delivered as SNOW Incidents

• Incidents assigned to appropriate support unit

• Incidents masking per notification type

• Notifications stored in ElasticSearch

• Visible via a dedicated Kibana dashboard
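The routing and masking behaviour described above might look roughly like this. It is a sketch only: the support unit names and the incident fields are invented, and the real consumers create incidents through the SNOW interface, which is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

# Hypothetical mapping from notification type to support unit.
SUPPORT_UNITS: Dict[str, str] = {
    "HW": "hardware-repair",
    "OS": "os-support",
    "APP": "service-managers",
    "NC": "operators",
}
DEFAULT_UNIT = "operators"  # hypothetical fallback for unknown types


@dataclass(frozen=True)
class Notification:
    hostname: str
    kind: str       # one of HW, APP, OS, NC
    summary: str


def route(notification: Notification, masked_kinds: Set[str]) -> Optional[dict]:
    """Return an incident for the right support unit, or None when masked."""
    if notification.kind in masked_kinds:
        return None
    unit = SUPPORT_UNITS.get(notification.kind, DEFAULT_UNIT)
    return {
        "assignment_group": unit,
        "short_description": (
            f"[{notification.kind}] {notification.hostname}: {notification.summary}"
        ),
    }


disk_failure = Notification("node01", "HW", "disk failure")
print(route(disk_failure, masked_kinds=set()))
print(route(disk_failure, masked_kinds={"HW"}))
```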


Page 48: Seminar CNAF

Deployment

• 3 VMs for messaging clients + ES cluster

• Using other IT services: ActiveMQ, SNOW


Page 49: Seminar CNAF

Real-time Analytics


Page 50: Seminar CNAF

Motivation

• Real-time analytics engine

• Automatic generation of curated data

• Easy to use under different contexts

• First target is aggregation of notifications

• Online machine learning, ETL, etc.

• Adopt open source tool

• Good candidates: Spark, Storm, ?

• Easy integration with current tools
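The first target, aggregation of notifications, amounts to counting events per group inside fixed time windows. The plain-Python sketch below shows that logic on a small in-memory list; it is not Spark or Storm code, and the hostgroup values are made up. A streaming engine would apply the same grouping continuously as events arrive.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def aggregate_by_window(
    notifications: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[int, Counter]:
    """Count notifications per hostgroup inside fixed (tumbling) time windows.

    Each notification is a (unix_timestamp, hostgroup) pair; the result maps
    the start time of every window to a per-hostgroup counter.
    """
    windows: Dict[int, Counter] = {}
    for timestamp, hostgroup in notifications:
        window_start = timestamp - timestamp % window_seconds
        windows.setdefault(window_start, Counter())[hostgroup] += 1
    return windows


stream = [(0, "batch"), (30, "batch"), (61, "storage"), (119, "batch")]
result = aggregate_by_window(stream, window_seconds=60)
print(result)
```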


Page 51: Seminar CNAF

Summary


Page 52: Seminar CNAF

Summary


Before                           After
Many central services            More platform services
Notifications limited to lemon   Generic notifications producers
Inefficient ticket routing       Flexible ticket routing
Limited to lemon metrics         Open to any monitoring data
Complex data access              Easy data access
Central lemon dashboard          Dashboard instances per application
Limited offline analytics        Batch analytics in HDFS
No real-time analytics           New real-time analytics tools

Page 53: Seminar CNAF

Summary

• New shared monitoring architecture

• Being adopted by all IT monitoring activities

• Selected technologies look good

• Flume, ES, Kibana, HDFS

• Happy to get your feedback on these and others

• Don’t forget the cultural changes

• Agile methodology, DevOps, PaaS, etc.

• As important as the technology changes
