Seminar CNAF
Exploiting open source tools to realize a new monitoring infrastructure at CERN
Pedro Andrade – CERN IT/CF
Overview
• Agile Infrastructure
• Monitoring Project
• Solutions and Technologies
• Producers
• Transport
• Archive
• Query and Analytics
• Real-time Analytics
• Notifications
3/17/2014 CNAF Seminar 2
Agile Infrastructure
Challenges
• New data centre in Budapest since 2013
• Additional capacity required in view of physics needs
• Local on-site maintenance for installations/repairs
Challenges
• Be ready to handle 15’000 servers
• Growing number of users of CERN’s facilities and higher computing requirements as data rates increase
• Staff numbers are fixed, no more people
• Materials budget decreasing, no more money
• Legacy tools are high maintenance and brittle
• Deploy new services within hours
Challenges
• “We Are Not Special”
• Move to commonly used open source tools
• Focus on strong communities and momentum
• Stop re-inventing tools (“not invented here” syndrome)
• Implement clouds at scale
• Aim for 90% infrastructure virtualised
• Ecosystem solutions rather than writing from scratch
• Request to delivery in a coffee break
Agile Infrastructure
• Activity started in 2012
• Remodel IT services
• Move to a more horizontal approach
• Layered model: IaaS, PaaS, SaaS
• Services, Configuration, Installation, Hardware
• Virtualisation is key
• Improve efficiency
• Operational, Resources
Agile Infrastructure
[Figure: Agile Infrastructure tool chain. Components shown: Bamboo; Koji, Mock; AIMS/PXE; Foreman; Yum repo; Pulp; Puppet-DB; mcollective, yum; JIRA; Lemon / Hadoop / Elastic Search / Kibana; git; OpenStack Nova; Hardware database; Puppet; Active Directory / LDAP]
Monitoring Project
Challenges
• Several independent monitoring activities in IT
• High level services are interdependent
• Understanding performance more important
• Move to a virtualized dynamic infrastructure
• Preserve our investment in monitoring
Shared architecture & tool-chain components
Objectives
• Deliver solutions for the shared architecture
• Work with all IT monitoring teams
• Deliver simple adoption: PaaS
• Better exploit IT resources
• While at the same time
• Mix and match open source solutions
• Exploit new tools from the Agile Infrastructure
• Retire old tools: Lemon DB, Lemon Web, LAS, etc.
Architecture
Process Improvements
• Establish Agile methodology
• Well defined sprints with clear targets
• Iterative evolution, continuous feedback
• Exploit Open Source tools
• Best fit, large adoption, active community
• Fast to adopt, accept limitations, easily replaced
• Look at DevOps
• Quality Assurance processes
• Continuous Integration processes
Technologies
• Many options available !
Producers
Motivation
• Preserve sensors/probes knowledge
• Many years writing sensors for Lemon
• Integrate other data sources
• Most likely service specific monitoring data
Selected Technology: Lemon + Others
Lemon Producer
• Same old lemon agent
• Running in all data centre nodes
• Lemon agent extended with lemon forwarder
• Send notifications to ActiveMQ
• Send metrics to Flume
• Send syslog to Flume
Other Producers
• Must follow common monitoring specification
• Metric v3.0 and Notification v2.0
• Can use monitoring-data-model to create new metrics and notifications and validate them
• Messages can be sent
• To ActiveMQ using a stomp client
• To Flume gateway using a flume agent
• Planning to evaluate Collectd later this year
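As a rough illustration of the ActiveMQ path above, the sketch below builds a STOMP SEND frame by hand with the Python standard library only. The destination name and the metric fields (`/topic/monitoring.metrics`, `producer`, `metric_name`, and so on) are assumptions made for illustration; the actual Metric v3.0 schema is defined by the monitoring specification and is not reproduced here.

```python
import json


def build_stomp_send_frame(destination: str, payload: dict) -> bytes:
    """Encode a payload as a STOMP SEND frame: command, headers, blank line, body, NUL."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    header_lines = [
        "SEND",
        f"destination:{destination}",
        "content-type:application/json",
        f"content-length:{len(body)}",
    ]
    head = "\n".join(header_lines).encode("utf-8")
    return head + b"\n\n" + body + b"\x00"


# Hypothetical metric: field names are illustrative, not the Metric v3.0 schema.
metric = {
    "producer": "example",
    "host": "node001",
    "metric_name": "load_avg",
    "value": 0.42,
    "timestamp": 1395014400,
}
frame = build_stomp_send_frame("/topic/monitoring.metrics", metric)
```

In practice a STOMP client library would normally handle the connection, the CONNECT handshake, and the framing; the hand-built frame only shows what ends up on the wire.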
Transport
Motivation
• Collect operations data
• Lemon metrics and syslog
• 3rd party applications and services
• Scalable transport layer
• Large data volume
• Easy integration with other technologies
Selected Technology: Flume
Flume
• Distributed service for collecting large data sets
• Robust and fault tolerant
• Horizontally scalable
• Many ready to be used input/output plugins
• Java based, Apache license
• Cloudera is the main contributor
• Using their releases
• Less frequent but more stable releases
Flume
• Flume event
• Payload + set of string headers
• Flume agent
• JVM process hosting “source to sink” flows
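The event and agent concepts can be sketched in a few lines. This is a toy Python model, not Flume's Java API: the class and function names (`Event`, `MemoryChannel`, `run_flow`) are made up for illustration, and the channel is a plain bounded queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Toy analogue of a Flume event: opaque payload plus string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Bounded in-memory buffer sitting between a source and a sink."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event: Event) -> bool:
        if len(self._queue) >= self.capacity:
            return False  # channel full: the event is not accepted
        self._queue.append(event)
        return True

    def take(self):
        return self._queue.popleft() if self._queue else None


def run_flow(lines, channel, sink):
    """Source side wraps each line in an event; sink side drains the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8"), headers={"type": "syslog"}))
    while (event := channel.take()) is not None:
        sink.append(event)


delivered = []
run_flow(["disk full", "link down"], MemoryChannel(capacity=10), delivered)
```

The point of the channel is to decouple the two sides: the source can keep accepting data while the sink is slow, up to the channel capacity.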
Flume
• Many ready-to-be-used plugins
• Sources: Avro, JMS, Spool, Syslog, HTTP, etc.
• Interceptors: decorate events, filter events
• Channels: Memory, File, JDBC
• Sinks: Avro, Thrift, ElasticSearch, HDFS, File, etc.
• Custom sources/sinks can be implemented
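The two interceptor roles listed above (decorating and filtering) can be illustrated with a small chain of functions. This is a conceptual sketch in Python, not Flume's interceptor interface; events are plain dictionaries and the header name `datacentre` is a hypothetical example.

```python
def add_header(key, value):
    """Return an interceptor that decorates every event with a fixed header."""
    def intercept(event):
        headers = dict(event["headers"])  # copy so the input event is not mutated
        headers[key] = value
        return {"headers": headers, "body": event["body"]}
    return intercept


def drop_if_body_contains(text):
    """Return an interceptor that filters out events whose body contains text."""
    def intercept(event):
        return None if text in event["body"] else event
    return intercept


def apply_interceptors(events, interceptors):
    """Run each event through the chain in order; None means it was filtered."""
    kept = []
    for event in events:
        for interceptor in interceptors:
            event = interceptor(event)
            if event is None:
                break
        else:
            kept.append(event)
    return kept


events = [
    {"headers": {}, "body": "kernel: out of memory"},
    {"headers": {}, "body": "debug: heartbeat"},
]
chain = [drop_if_body_contains("debug"), add_header("datacentre", "example")]
kept = apply_interceptors(events, chain)
```

Placing the filter before the decorator avoids doing work on events that are dropped anyway.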
Flume
• Routing is static
• On demand subscriptions are not possible
• Requires reconfiguration and restart
• No authN and authZ features
• But secure transport available
• Java process on client side
• Small memory footprint would be nicer
Deployment
• Running flume 1.3, latest is flume 1.4
Deployment
• 1st layer: Flume Data publisher
• Deployed in all data centre nodes
• 2nd layer: Flume Gateway
• 20 VMs aggregating events
• 3rd layer: Flume ElasticSearch
• 10 VMs inserting to ElasticSearch
• 3rd layer: Flume Hadoop HDFS
• 10 VMs inserting to Hadoop HDFS
Feedback
• Sizing flume layers needs some tuning
• Available sources/sinks saved a lot of time
Archive
Motivation
• Store operations raw data
• Long term archival required
• Allow future data replay to other tools
• Feed real-time engine
• Offline processing of collected data
• Security data? Syslog data?
Selected Technology: Hadoop/HDFS
Hadoop/HDFS
• Hadoop is a framework that allows the
distributed processing of large data sets
• HDFS is a distributed filesystem designed to
run on commodity hardware
• Suitable for applications with large data sets
• Designed for batch processing, not interactive use
• High throughput preferred to low latency access
Hadoop/HDFS
• Small files not welcome: blocks of 64 MB or 128 MB
• Limit of tens of millions of files per cluster
• NameNode holds the file map in memory
• Transparent compression not available
• Raw text could take much less space
• Real-time data access is not possible
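The small-files concern can be made concrete with a back-of-the-envelope estimate. The sketch below counts the namespace objects (one per file plus one per block, directories ignored) that the NameNode has to track when the same volume is stored as many small files or as fewer large ones. The 128 MiB block size matches one of the sizes quoted above; the 150 bytes per object is a commonly cited rule of thumb, used here as an assumption rather than a measured value.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # 128 MiB block size
BYTES_PER_OBJECT = 150       # assumed rule-of-thumb NameNode memory per object


def namenode_objects(file_count: int, file_size: int) -> int:
    """One object per file plus one per block (directories ignored)."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


total = 1024**4  # 1 TiB of data in both cases
small = namenode_objects(total // 1024**2, 1024**2)  # stored as 1 MiB files
large = namenode_objects(total // 1024**3, 1024**3)  # stored as 1 GiB files

small_memory = small * BYTES_PER_OBJECT
large_memory = large * BYTES_PER_OBJECT
```

Under these assumptions the small-file layout needs 2,097,152 objects against 9,216 for the large-file layout, a difference of more than two orders of magnitude for identical data.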
Deployment
• Production cluster
• ~200 TB available in 5 data nodes
• 6.3 TB stored since mid July 2013
• Data organized by hostgroup (cluster)
• Daily jobs to aggregate data by month
• Large files preferred to many small files
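The planning step of such an aggregation job might look like the sketch below, which groups daily files by hostgroup and month so that each group can be merged into one large file. The path layout (`/monitoring/<hostgroup>/<YYYY-MM-DD>.log`) is a hypothetical convention chosen for the example; the slides do not describe the real directory structure or how the merge itself is run.

```python
from collections import defaultdict


def monthly_targets(daily_paths):
    """Group daily files by (hostgroup, year-month) for later merging."""
    groups = defaultdict(list)
    for path in daily_paths:
        # Assumed layout: .../<hostgroup>/<YYYY-MM-DD>.log
        hostgroup, filename = path.strip("/").split("/")[-2:]
        month = filename[:7]  # "YYYY-MM"
        groups[(hostgroup, month)].append(path)
    return {key: sorted(paths) for key, paths in groups.items()}


paths = [
    "/monitoring/batch/2013-07-16.log",
    "/monitoring/batch/2013-07-15.log",
    "/monitoring/batch/2013-08-01.log",
    "/monitoring/web/2013-07-15.log",
]
groups = monthly_targets(paths)
```

Each resulting group corresponds to one output file, which keeps the file count proportional to hostgroups times months rather than to hostgroups times days.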
Query & Analytics
Motivation
• Real-time queries based on clear API
• Dynamic dashboards creation
• Rich user-friendly dashboards
• Horizontally scalable and easy to deploy
• Limited data retention policy
• Handle different data types in the same way
Selected Technology: ElasticSearch + Kibana
ElasticSearch
• Distributed RESTful search & analytics engine
• Real time data acquisition and indexing
• Automatically balanced shards and replicas
• Schema free, document oriented (JSON)
• No prior data declaration required
• Automatic data type discovery
• Distributed under Apache license
ElasticSearch
• Full text search
• Apache Lucene is used to provide full text search
• Not only text: integer/long, float/double, boolean, etc.
• RESTful JSON API
$ curl -XGET 'http://es-search:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "itmon-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2990,
  "active_shards" : 8970,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
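A response like this is easy to consume programmatically. The sketch below parses the JSON shown above and derives a few summary values; the function name `summarize_health` and the choice of summary fields are illustrative, and the response is embedded as a string so that no cluster is needed to run it.

```python
import json

# The cluster health response from the example above.
HEALTH_JSON = """
{
  "cluster_name" : "itmon-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2990,
  "active_shards" : 8970,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
"""


def summarize_health(raw: str) -> dict:
    """Reduce a cluster health response to a few values worth alerting on."""
    health = json.loads(raw)
    pending = (
        health["relocating_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return {
        "healthy": health["status"] == "green" and not health["timed_out"],
        "copies_per_primary": health["active_shards"] / health["active_primary_shards"],
        "pending_shards": pending,
    }


summary = summarize_health(HEALTH_JSON)
```

For this response the ratio of active shards to primary shards is exactly 3, so each primary shard and its replicas add up to three shard copies in total.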
ElasticSearch Limitations
• Requires a lot of RAM, mainly on data nodes
• IO intensive, careful deployment required
• Shards re-initialisation takes some time (~1h)
• Lots of shards and replicas per index, lots of indexes
• Not frequent operation, only after full cluster reboot
• Authentication not built-in (“bricolage”)
• Apache+Shibboleth on top of Jetty plugin
Kibana
• “Make sense of a mountain of logs”
• Designed to analyse logs
• Perfectly fits timestamped data (e.g. metrics)
• Profits from ElasticSearch search/analyse features
• No coding required
• Simply point & click to build your own dashboard
• Fully integrated and supported by ElasticSearch
• Started as separate project
Kibana
• Built with AngularJS
• JavaScript MVC for client-side rich application
• Developed and maintained by
• No backend: web server delivers static files
• JS directly queries ElasticSearch
• Easy to install and configure
• “git clone” OR “tar -xvzf” OR ElasticSearch plugin
• 1-line config file to point to the ElasticSearch cluster
• Save its own configuration in ElasticSearch
Deployment
• Production cluster
• Running ElasticSearch 0.90.7
• 2 master nodes (16GB RAM, 8 cores)
• 1 search node (16GB RAM, 8 cores)
• 8 data nodes (48GB RAM, 24 cores, 500GB SSD)
• Monitoring: ElasticHQ, BigDesk, and Head
• Indexes structure
• One index per day with 30 days TTL
• 10 shards per index, 3 replicas per shard
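A one-index-per-day layout makes retention straightforward to reason about. The sketch below names daily indexes and works out which ones fall outside a 30-day window. The index prefix `monitoring-` and the date format are hypothetical, and the slides do not say how the 30-day TTL is actually enforced; dropping whole daily indexes is one common approach and is the assumption here.

```python
from datetime import date, datetime, timedelta

RETENTION_DAYS = 30            # from the deployment description above
INDEX_PREFIX = "monitoring-"   # hypothetical index name prefix


def index_name(day: date) -> str:
    """Name of the daily index for a given day, e.g. monitoring-2014.03.17."""
    return f"{INDEX_PREFIX}{day:%Y.%m.%d}"


def expired_indexes(existing, today: date):
    """Return index names whose day is older than the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    expired = []
    for name in existing:
        day = datetime.strptime(name[len(INDEX_PREFIX):], "%Y.%m.%d").date()
        if day < cutoff:
            expired.append(name)
    return expired


today = date(2014, 3, 17)
existing = [
    index_name(date(2014, 2, 14)),
    index_name(date(2014, 2, 15)),
    index_name(today),
]
to_delete = expired_indexes(existing, today)
```

Deleting a whole index is a cheap operation compared with removing individual documents, which is one reason time-based indexes are popular for log and metric data.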
Deployment
• Based on ElasticSearch plugin
• Running v3.pre-4
• Deployed together with search node
• Profits from Jetty authentication
• Different endpoints for AuthN
• Public (read only)
• Private (read write)
Feedback
• Easy to deploy and manage
• Robust, fast, and rich API
• Easy query language (DSL)
• More features with aggregation framework
• Released with ElasticSearch v1.0
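To give a flavour of the query DSL, the sketch below builds a search body as a Python dictionary: a `bool` query combining a `term` match with a `range` on a timestamp, optionally extended with a `terms` aggregation as introduced by the aggregation framework. The field names (`host`, `@timestamp`, `metric_name`) and the helper names are assumptions for illustration, not the fields used in the production cluster.

```python
import json


def build_search_body(host, since, until, size=50):
    """Bool query: exact match on a host field AND a timestamp range."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"term": {"host": host}},
                    {"range": {"@timestamp": {"gte": since, "lte": until}}},
                ]
            }
        },
    }


def with_terms_aggregation(body, name, field):
    """Return a copy of the body with a terms aggregation (ElasticSearch 1.0+)."""
    extended = dict(body)
    extended["aggs"] = {name: {"terms": {"field": field}}}
    return extended


body = build_search_body("node001", "2014-03-16T00:00:00Z", "2014-03-17T00:00:00Z")
payload = json.dumps(with_terms_aggregation(body, "per_metric", "metric_name"))
```

The resulting JSON string would be sent as the request body to an index's `_search` endpoint, in the same way as the curl example for cluster health.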
Feedback
• Easy to deploy and use
• Very cool user interface
• Fits many use cases: text (syslog), metrics (lemon)
• Many “panels” available: tables, charts, hits, etc.
• Very active community and growing
• A bit limited feature set
• Many developments ongoing
Notifications
Motivation
• Modular tools to manage notifications
• Notifications delivered to multiple endpoints
• Automatic SNOW tickets / Central dashboard / etc.
• More efficient handling of notifications
• Enable SMs to improve automation of their services
• Improve routing of SNOW tickets
• Avoid wasting time in multiple (fake) hops
• Make visible problems previously hidden from SMs
• Allow others to publish/consume notifications
GNI
• General Notifications Infrastructure
• Manage all data centre notifications
• Messaging consumers integrating with other tools
• Multiple notification types: HW, APP, OS, NC
• Notifications delivered as SNOW Incidents
• Incidents assigned to appropriate support unit
• Incidents masking per notification type
• Notifications stored in ElasticSearch
• Visible via a dedicated Kibana dashboard
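The routing and masking behaviour described above can be sketched as a small lookup. The four notification types come from the slide; the support unit names, the `Notification` fields, and the `route` function are hypothetical and stand in for whatever GNI actually uses.

```python
from dataclasses import dataclass

# Hypothetical mapping: real support unit names are not given in the slides.
SUPPORT_UNITS = {
    "HW": "hardware-support",
    "OS": "os-support",
    "APP": "application-support",
    "NC": "default-support",
}


@dataclass(frozen=True)
class Notification:
    host: str
    type: str      # one of HW, APP, OS, NC
    summary: str


def route(notification, masked_types):
    """Return the support unit for the incident, or None if the type is masked."""
    if notification.type in masked_types:
        return None
    return SUPPORT_UNITS[notification.type]


disk = Notification(host="node001", type="HW", summary="disk failure")
app = Notification(host="node001", type="APP", summary="service check failed")
masked = {"APP"}
assignments = [route(disk, masked), route(app, masked)]
```

Keeping the mapping and the mask as data rather than code means a service manager could adjust them without touching the consumer itself.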
Deployment
• 3 VMs for messaging clients + ES cluster
• Using other IT services: ActiveMQ, SNOW
Real-time Analytics
Motivation
• Real-time analytics engine
• Automatic generation of curated data
• Easy to use under different contexts
• First target is aggregation of notifications
• Online machine learning, ETL, etc.
• Adopt open source tool
• Good candidates: Spark, Storm, ?
• Easy integration with current tools
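The first target, aggregation of notifications, can be illustrated with a tumbling-window count. This is a plain Python stand-in for what a streaming engine such as Spark or Storm would do continuously; it is not code for either tool, and the input format (pairs of a timestamp in seconds and a notification type) is an assumption.

```python
from collections import Counter


def aggregate(notifications, window_seconds):
    """Count notifications per (window start, type) using tumbling windows."""
    counts = Counter()
    for timestamp, kind in notifications:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, kind)] += 1
    return counts


stream = [(0, "HW"), (30, "HW"), (59, "OS"), (60, "HW")]
per_minute = aggregate(stream, window_seconds=60)
```

A real-time engine would emit each window's counts as soon as the window closes instead of processing a finished list, but the grouping logic is the same.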
Summary
Before                          | After
Many central services           | More platform services
Notifications limited to lemon  | Generic notifications producers
Inefficient ticket routing      | Flexible ticket routing
Limited to lemon metrics        | Open to any monitoring data
Complex data access             | Easy data access
Central lemon dashboard         | Dashboard instances per application
Limited offline analytics       | Batch analytics in HDFS
No real-time analytics          | New real-time analytics tools
Summary
• New shared monitoring architecture
• Being adopted by all IT monitoring activities
• Selected technologies look good
• Flume, ES, Kibana, HDFS
• Happy to get your feedback on these and others
• Don’t forget the cultural changes
• Agile methodology, DevOps, PaaS, etc.
• As important as the technology changes
Thanks !
http://cern.ch/itmon