Agile Infrastructure Update Monitoring
Transcript of Agile Infrastructure Update Monitoring
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
Agile Infrastructure Update Monitoring
Pedro Andrade – IT/GT
6th July 2012
IT Technical Forum
Overview
• Introduction – Motivation, Challenge, Architecture
• Status Update – Producers, Messaging, Storage/Analysis, Visualization
• Milestones and Next Steps
• Summary and Conclusions
Introduction
• Motivation
– Several independent monitoring activities in IT
– Based on different tool-chains but sharing the same limitations
– High-level services are interdependent
– A combination of data and complex analysis is necessary
• Challenge
– Find a shared architecture and tool-chain components
– Adopt existing tools and avoid home-grown solutions
– Aggregate monitoring data in a large data store
– Correlate monitoring data and make it easy to access
Architecture
[Architecture diagram: publishers and sensors feed an aggregation layer (Lemon, Apollo); a storage/analysis feed goes to Hadoop, Oracle, and Splunk; an alarm feed drives an alarm portal and reports; custom feeds serve a portal and application-specific consumers]
Status Update
[Status diagram: the Lemon Producer and Castor Logs Producer publish to Apollo; the Castor Logs and Castor Hadoop consumers feed Hadoop and the Castor Cockpit; the Lemon Splunk consumer feeds Splunk; the Lemon SNOW consumer feeds SNOW; Security Netlog data goes to Hadoop; nodes are managed by Quattor + Puppet]
Producers
• Lemon Producer implemented and tested
– Lemon Agent + Lemon Forwarder
– Supports publication of notifications and metrics
– Retrieves local metadata via Puppet (notification targets)
– Tested on approximately 500 Quattor nodes
– Mocked publication of notifications from all Quattor nodes
• Castor Log Producer implemented and tested
– Publishes parsed Castor logs to the messaging broker
– Generic producer supporting different input sources
– Tested on 30 development nodes
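A log producer of this kind essentially parses raw log lines into structured records and wraps them as messages for the broker. A minimal sketch of that parse-and-wrap step follows; the log layout, field names, and message schema here are illustrative assumptions, not the actual Castor producer code:

```python
import json
import re

# Hypothetical log format: "<timestamp> <host> <daemon>: <message>".
# The real Castor log layout and message schema are not shown in the slides.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<host>\S+) (?P<daemon>\S+): (?P<message>.*)"
)

def parse_log_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def to_message(record, source="castor"):
    """Wrap a parsed record as a JSON message body for the broker."""
    return json.dumps({"source": source, "payload": record})

line = "2012-07-06T10:00:00 lxfsrk1234 stagerd: request completed"
msg = to_message(parse_log_line(line))
```

Keeping the parser separate from the publishing step is what makes the producer "generic": the same wrapper can serve different input sources.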
Messaging Broker
• Apollo broker deployed and tested
– A few initial problems found; the service is running smoothly now
– Our deployment scenario hit a bug (fast feedback)
• Currently exploited by two monitoring apps
– Lemon: 5 msg/sec, avg size of 3 KB (500 hosts)
– Castor Logs: 120 msg/sec, avg size of 11 KB (30 hosts)
• Producers and consumers using CERN msg tools
– mig-admin-utils-stompclt, messaging (python, perl)
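The CERN messaging tools speak STOMP to the Apollo broker. In practice the tools above handle the wire protocol, but as an illustration of what actually travels to the broker, here is how a STOMP SEND frame is laid out (the destination name is a made-up example):

```python
def stomp_send_frame(destination, body, headers=None):
    """Build a raw STOMP SEND frame: command line, header lines,
    a blank line, the body, and a NUL terminator."""
    lines = ["SEND", f"destination:{destination}"]
    for key, value in (headers or {}).items():
        lines.append(f"{key}:{value}")
    return "\n".join(lines) + "\n\n" + body + "\x00"

# Hypothetical topic name; real destinations are configured per application.
frame = stomp_send_frame(
    "/topic/monitoring.lemon",
    '{"metric": "load", "value": 0.4}',
)
```

Because STOMP is this simple and text-based, producers and consumers can be written in any language (python and perl in the list above).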
Storage and Analysis
• Small Hadoop cluster deployed and tested
– Being upgraded to the latest CDH v4
• Currently exploited by two monitoring apps
– Security Netlog
  • Data imported directly into Hadoop via FUSE
  • Goal is to run analysis on network activity
– Castor Logs
  • Development of a messaging consumer to Hadoop just started
• Testing Hadoop components: HBase and Flume
– Hive and Sqoop ??
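The kind of network-activity analysis targeted here maps naturally onto map-reduce. A sketch of the idea, in the style of a Hadoop Streaming job, is shown below; the netlog record format and field meanings are assumptions for illustration:

```python
from itertools import groupby

# Hypothetical netlog line format: "<timestamp> <src_host> <dst_host> <bytes>".
# In Hadoop Streaming, mapper() would run over stdin on each map task and
# reducer() over the key-sorted mapper output.

def mapper(lines):
    """Emit (src_host, bytes) pairs, one per input record."""
    for line in lines:
        fields = line.split()
        if len(fields) == 4:
            yield fields[1], int(fields[3])

def reducer(pairs):
    """Sum bytes per host; input must be sorted by key, as Hadoop guarantees."""
    for host, group in groupby(pairs, key=lambda kv: kv[0]):
        yield host, sum(n for _, n in group)

sample = [
    "t1 hostA hostB 100",
    "t1 hostA hostC 50",
    "t2 hostB hostA 10",
]
totals = dict(reducer(sorted(mapper(sample))))
```

The same split into a stateless map step and a per-key reduce step is what makes this workload scale across a cluster.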
Visualization and Alarms
• Splunk deployed and tested
– So far only exploited by Lemon
– Lemon notifications continuously added to Splunk using the Lemon Splunk consumer
– Lemon metrics manually imported: 1.5 years of data with 1335 metrics for all CC nodes (8.5 TB)
– Splunk playground available: https://lxfssm4508.cern.ch
• SNOW integration implemented and tested
– Lemon notifications delivered as SNOW tickets assigned to the correct target (FE) using the Lemon SNOW consumer
– Improving ticket routing
Milestones
v1 – Q1 2012
• AI nodes monitored with Lemon (dependency on Quattor)
• Deployment of Messaging Broker and Hadoop cluster
• Testing of other technologies (Splunk)
v2 – Q2 2012
• AI nodes monitored with Lemon
• Lemon data starts to be published via messaging
v3 – Q4 2012
• Several clients exploiting the messaging infrastructure
• Messaging consumers for alarms and notifications in production
• Initial data store/analysis for selected use cases
v4 – Q4 2013
• Monitoring data published to the messaging infrastructure
• Large-scale data store/analysis on Hadoop cluster
Next Steps
[Diagram: the status-slide data flow (Lemon and Castor Logs producers to Apollo; consumers to Hadoop, Splunk, and SNOW), extended with a generic Hadoop consumer, a SNOW correlation engine, a generic consumer, dashboards, other producers, and tests for production]
Summary
• All layers of the proposed monitoring architecture successfully tested with an initial set of tools
– Apollo, Hadoop, Splunk deployed and tested
– New components implemented and tested (messaging)
– Partial functional and scalability tests
• Several concrete results achieved with real data
– Castor logs data aggregation via Apollo
– Lemon notifications aggregation via Apollo
– Lemon notifications visualization in Splunk
– Security netlog data stored in Hadoop
• Notifications mechanism tested with Lemon data
Summary
• Base monitoring for AI nodes ongoing
– Eating our own dog food
• Several contacts established with other teams
– Different IT groups attending monitoring meetings
– BE and GS: discussion of similar projects
– LHCb and ATLAS online teams: sharing experiences
– Crucial for uptake by users
• Other monitoring applications are welcome to join
– More use cases, more data, (more) correlation
Conclusion
• Work progressing as planned
• Core components of the architecture in place
– Ready to be used and evaluated
• AI Monitoring needs YOU!
– Move (part of) your monitoring apps
Thank You !
Backup Slides
Introduction
• Motivation
– Several independent monitoring activities in IT
  • Similar overall approach, different tool-chains, similar limitations
– High-level services are interdependent
  • Combination of data from different groups necessary, but difficult
– Understanding performance became more important
  • Requires more combined data and complex analysis
– Move to a virtualized dynamic infrastructure
  • Comes with complex new requirements on monitoring
• Challenge
– Find a shared architecture and tool-chain components while preserving/improving our investment in monitoring
Monitoring in IT
• More than 30 monitoring applications
– Number of producers: ~40k
– Input data volume: ~280 GB per day
• Covering a wide range of different resources
– Hardware, OS, applications, files, jobs, etc.
• Application-specific monitoring solutions
– Using different technologies (including commercial tools)
– Sharing similar needs: aggregate metrics, get alarms, etc.
• Limited sharing of monitoring data
Architecture
• Data
– Aggregate monitoring data in a large data store
  • For storage and combined analysis tasks
– Make monitoring data easy to access by everyone
  • Not forgetting possible security constraints
– Select a simple and well supported data format
• Technology
– Follow a tool-chain approach
  • Each tool can be easily replaced by a better one
– Select well established solutions
  • Adopt existing tools and avoid home-grown solutions
– Allow a phased transition to the new architecture
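A "simple and well supported data format" could, for example, be JSON carried over the messaging layer. The sketch below shows one possible message layout; every field name here is an illustrative assumption, not the format actually chosen by the project:

```python
import json

# Hypothetical monitoring-message layout. JSON satisfies the requirements on
# the slide: simple, widely supported, readable by every consumer language.
notification = {
    "producer": "lemon",
    "host": "lxplus001.cern.ch",
    "timestamp": "2012-07-06T10:00:00Z",
    "metric": "exception.no_contact",
    "data": {"state": "active"},
}

encoded = json.dumps(notification)   # what would travel on the wire
decoded = json.loads(encoded)        # any consumer can parse it back
```

A shared format like this is what lets each tool in the chain be swapped out without touching the others.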
Work Summary
• Tested the tools initially selected
– And the data workflow between the tools
• Worked with concrete monitoring apps and data
– Lemon (CF), Castor Logs (DSS), Net Logger (DI)
• Defined and tested different monitoring paths
– Notifications vs. Analysis
• Supported two different environments
– Puppet nodes, to be ready for the new AI infrastructure
– Quattor nodes, to run large-scale tests
Producers and Sensors
• Monitoring data generated by all resources
– Monitoring metadata available at the node
– Published to messaging using common libraries
– May also be produced as a result of pre-aggregation or post-processing tasks
• Support and integrate closed monitoring solutions
– By injecting final results into the messaging layer or exporting relevant data at an intermediate stage
[Diagram: sensors and integrated products publish through the messaging layer to storage, analysis, and visualization]
Messaging Broker
• Monitoring data transported via messaging
– Provide a network of messaging brokers
– Support for multiple configurations
– The needs of each monitoring application must be clearly analyzed and defined
  • Total number of producers and consumers
  • Size of the monitoring messages
  • Rate of the monitoring messages
– Realistic testing environments are required to produce reliable performance numbers
• First tests with Apollo (ActiveMQ)
– Prior positive experience in IT and the experiments
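The needs analysis above can be made concrete with a back-of-the-envelope calculation from the rates on the status slide (Lemon: 5 msg/sec at ~3 KB; Castor Logs: 120 msg/sec at ~11 KB):

```python
# Rates and average message sizes as reported on the status slide.
apps = {
    "lemon": {"rate_per_s": 5, "avg_kb": 3},
    "castor_logs": {"rate_per_s": 120, "avg_kb": 11},
}

def daily_volume_gb(rate_per_s, avg_kb):
    """msg/sec * avg size * 86400 sec/day, reported in GB (1 GB = 1e6 KB)."""
    return rate_per_s * avg_kb * 86400 / 1e6

volumes = {name: daily_volume_gb(**cfg) for name, cfg in apps.items()}
# lemon ~ 1.3 GB/day, castor_logs ~ 114 GB/day
```

Even these two early applications together account for a noticeable fraction of the ~280 GB/day quoted for all of IT, which is why each new application's numbers must be defined up front.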
Storage and Analysis
• Monitoring data stored in a common location
– Eases the sharing of monitoring data and analysis tools
– Allows feeding already-processed data into the system
– NoSQL technologies are the most suitable solutions
  • Focus on column/tabular solutions
• First tests with Hadoop (Cloudera distribution)
– Prior positive experience in IT and the experiments
– The map-reduce paradigm is a good match for the use cases
– Has been used successfully at scale
– Several related modules available (Hive, HBase)
• For particular use cases a parallel relational database solution (Oracle) can be considered
Notifications and Dashboards
• Provide efficient delivery of notifications
– Notifications sent directly to the correct consumer targets
– Possible targets: operators, service managers, etc.
• Provide powerful dashboards and APIs
– Complex queries on cross-domain monitoring data
• First tests with Splunk
Links
• Monitoring WG TWiki (new location!)
– https://twiki.cern.ch/twiki/bin/view/MonitoringWG/
• Monitoring WG Report (ongoing)
– https://twiki.cern.ch/twiki/bin/view/MonitoringWG/MonitoringReport
• Agile Infrastructure TWiki
– https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/
• Agile Infrastructure JIRA
– https://agileinf.cern.ch/jira/browse/AI
Next Steps
[Detailed diagram: the Lemon Agent + Lemon Forwarder and the Castor Logs Producer (plus other producers) publish to Apollo; consumers feed the Hadoop cluster (Castor Cockpit, Security Net Logger), Splunk, and SNOW; planned additions include a SNOW/Lemon correlation engine, a generic consumer, and tests in production]