Agile Infrastructure Update Monitoring
Transcript of Agile Infrastructure Update Monitoring
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
Agile Infrastructure Update Monitoring
Pedro Andrade – IT/GT
6th July 2012
IT Technical Forum
Overview
• Introduction – Motivation, Challenge, Architecture
• Status Update – Producers, Messaging, Storage/Analysis, Visualization
• Milestones and Next Steps
• Summary and Conclusions
Introduction
• Motivation
– Several independent monitoring activities in IT
– Based on different tool-chains but sharing the same limitations
– High-level services are interdependent
– A combination of data and complex analysis is necessary
• Challenge
– Find a shared architecture and tool-chain components
– Adopt existing tools and avoid home-grown solutions
– Aggregate monitoring data in a large data store
– Correlate monitoring data and make it easy to access
Architecture
[Architecture diagram: publishers and sensors feed an aggregation layer (Lemon, Apollo); a storage/analysis feed goes to Hadoop, Oracle, and Splunk; an alarm feed drives an alarm portal and reports; custom feeds serve a portal and application-specific consumers]
Status Update
[Status diagram: the Lemon Producer and Castor Logs Producer publish to Apollo; the Castor Logs and Castor Hadoop consumers feed Hadoop and the Castor Cockpit; the Lemon Splunk consumer feeds Splunk; the Lemon SNOW consumer feeds SNOW; Security Netlog data goes to Hadoop; nodes are managed by Quattor + Puppet]
Producers
• Lemon Producer implemented and tested
– Lemon Agent + Lemon Forwarder
– Supports publication of notifications and metrics
– Retrieves local metadata via Puppet (notification targets)
– Tested on approximately 500 Quattor nodes
– Mocked publication of notifications from all Quattor nodes
• Castor Log Producer implemented and tested
– Publishes parsed Castor logs to the messaging broker
– Generic producer supporting different input sources
– Tested on 30 development nodes
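A log producer of this kind essentially parses raw log lines into structured records and wraps them as messages for the broker. A minimal sketch of that parse-and-wrap step follows; the log layout, field names, and message schema here are illustrative assumptions, not the actual Castor producer code:

```python
import json
import re

# Hypothetical log format: "<timestamp> <host> <daemon>: <message>".
# The real Castor log layout and message schema are not shown in the slides.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<host>\S+) (?P<daemon>\S+): (?P<message>.*)"
)

def parse_log_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def to_message(record, source="castor"):
    """Wrap a parsed record as a JSON message body for the broker."""
    return json.dumps({"source": source, "payload": record})

line = "2012-07-06T10:00:00 lxfsrk1234 stagerd: request completed"
msg = to_message(parse_log_line(line))
```

Keeping the parser separate from the publishing step is what makes the producer "generic": the same wrapper can serve different input sources.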
Messaging Broker
• Apollo broker deployed and tested
– A few initial problems found; the service is running smoothly now
– Our deployment scenario hit a bug (fast feedback)
• Currently exploited by two monitoring apps
– Lemon: 5 msg/sec, avg size of 3 KB (500 hosts)
– Castor Logs: 120 msg/sec, avg size of 11 KB (30 hosts)
• Producers and consumers using CERN msg tools
– mig-admin-utils-stompclt, messaging (python, perl)
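The CERN messaging tools speak STOMP to the Apollo broker. In practice the tools above handle the wire protocol, but as an illustration of what actually travels to the broker, here is how a STOMP SEND frame is laid out (the destination name is a made-up example):

```python
def stomp_send_frame(destination, body, headers=None):
    """Build a raw STOMP SEND frame: command line, header lines,
    a blank line, the body, and a NUL terminator."""
    lines = ["SEND", f"destination:{destination}"]
    for key, value in (headers or {}).items():
        lines.append(f"{key}:{value}")
    return "\n".join(lines) + "\n\n" + body + "\x00"

# Hypothetical topic name; real destinations are configured per application.
frame = stomp_send_frame(
    "/topic/monitoring.lemon",
    '{"metric": "load", "value": 0.4}',
)
```

Because STOMP is this simple and text-based, producers and consumers can be written in any language (python and perl in the list above).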
Storage and Analysis
• Small Hadoop cluster deployed and tested
– Being upgraded to the latest CDH v4
• Currently exploited by two monitoring apps
– Security Netlog
  • Data imported directly into Hadoop via FUSE
  • Goal is to run analysis on network activity
– Castor Logs
  • Development of a messaging consumer to Hadoop just started
• Testing Hadoop components: HBase and Flume
– Hive and Sqoop ??
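The kind of network-activity analysis targeted here maps naturally onto map-reduce. A sketch of the idea, in the style of a Hadoop Streaming job, is shown below; the netlog record format and field meanings are assumptions for illustration:

```python
from itertools import groupby

# Hypothetical netlog line format: "<timestamp> <src_host> <dst_host> <bytes>".
# In Hadoop Streaming, mapper() would run over stdin on each map task and
# reducer() over the key-sorted mapper output.

def mapper(lines):
    """Emit (src_host, bytes) pairs, one per input record."""
    for line in lines:
        fields = line.split()
        if len(fields) == 4:
            yield fields[1], int(fields[3])

def reducer(pairs):
    """Sum bytes per host; input must be sorted by key, as Hadoop guarantees."""
    for host, group in groupby(pairs, key=lambda kv: kv[0]):
        yield host, sum(n for _, n in group)

sample = [
    "t1 hostA hostB 100",
    "t1 hostA hostC 50",
    "t2 hostB hostA 10",
]
totals = dict(reducer(sorted(mapper(sample))))
```

The same split into a stateless map step and a per-key reduce step is what makes this workload scale across a cluster.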
Visualization and Alarms
• Splunk deployed and tested
– So far only exploited by Lemon
– Lemon notifications continuously added to Splunk using the Lemon Splunk consumer
– Lemon metrics manually imported: 1.5 years of data with 1335 metrics for all CC nodes (8.5 TB)
– Splunk playground available: https://lxfssm4508.cern.ch
• SNOW integration implemented and tested
– Lemon notifications delivered as SNOW tickets assigned to the correct target (FE) using the Lemon SNOW consumer
– Improving ticket routing
Milestones
v1 – Q1 2012
• AI nodes monitored with Lemon (dependency on Quattor)
• Deployment of Messaging Broker and Hadoop cluster
• Testing of other technologies (Splunk)
v2 – Q2 2012
• AI nodes monitored with Lemon
• Lemon data starts to be published via messaging
v3 – Q4 2012
• Several clients exploiting the messaging infrastructure
• Messaging consumers for alarms and notifications in production
• Initial data store/analysis for selected use cases
v4 – Q4 2013
• Monitoring data published to the messaging infrastructure
• Large-scale data store/analysis on Hadoop cluster
Next Steps
[Diagram: the status-slide data flow (Lemon and Castor Logs producers to Apollo; consumers to Hadoop, Splunk, and SNOW), extended with a generic Hadoop consumer, a SNOW correlation engine, a generic consumer, dashboards, other producers, and tests for production]
Summary
• All layers of the proposed monitoring architecture successfully tested with an initial set of tools
– Apollo, Hadoop, Splunk deployed and tested
– New components implemented and tested (messaging)
– Partial functional and scalability tests
• Several concrete results achieved with real data
– Castor logs data aggregation via Apollo
– Lemon notifications aggregation via Apollo
– Lemon notifications visualization in Splunk
– Security netlog data stored in Hadoop
• Notifications mechanism tested with Lemon data
Summary
• Base monitoring for AI nodes ongoing
– Eating our own dog food
• Several contacts established with other teams
– Different IT groups attending monitoring meetings
– BE and GS: discussion of similar projects
– LHCb and ATLAS online teams: sharing experiences
– Crucial for uptake by users
• Other monitoring applications are welcome to join
– More use cases, more data, (more) correlation
Conclusion
• Work progressing as planned
• Core components of the architecture in place
– Ready to be used and evaluated
• AI Monitoring needs YOU!
– Move (part of) your monitoring apps
Thank You !
Backup Slides
Introduction
• Motivation
– Several independent monitoring activities in IT
  • Similar overall approach, different tool-chains, similar limitations
– High-level services are interdependent
  • Combination of data from different groups necessary, but difficult
– Understanding performance became more important
  • Requires more combined data and complex analysis
– Move to a virtualized dynamic infrastructure
  • Comes with complex new requirements on monitoring
• Challenge
– Find a shared architecture and tool-chain components while preserving/improving our investment in monitoring
Monitoring in IT
• More than 30 monitoring applications
– Number of producers: ~40k
– Input data volume: ~280 GB per day
• Covering a wide range of different resources
– Hardware, OS, applications, files, jobs, etc.
• Application-specific monitoring solutions
– Using different technologies (including commercial tools)
– Sharing similar needs: aggregate metrics, get alarms, etc.
• Limited sharing of monitoring data
Architecture
• Data
– Aggregate monitoring data in a large data store
  • For storage and combined analysis tasks
– Make monitoring data easy to access by everyone
  • Not forgetting possible security constraints
– Select a simple and well supported data format
• Technology
– Follow a tool-chain approach
  • Each tool can be easily replaced by a better one
– Select well established solutions
  • Adopt existing tools and avoid home-grown solutions
– Allow a phased transition to the new architecture
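A "simple and well supported data format" could, for example, be JSON carried over the messaging layer. The sketch below shows one possible message layout; every field name here is an illustrative assumption, not the format actually chosen by the project:

```python
import json

# Hypothetical monitoring-message layout. JSON satisfies the requirements on
# the slide: simple, widely supported, readable by every consumer language.
notification = {
    "producer": "lemon",
    "host": "lxplus001.cern.ch",
    "timestamp": "2012-07-06T10:00:00Z",
    "metric": "exception.no_contact",
    "data": {"state": "active"},
}

encoded = json.dumps(notification)   # what would travel on the wire
decoded = json.loads(encoded)        # any consumer can parse it back
```

A shared format like this is what lets each tool in the chain be swapped out without touching the others.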
Work Summary
• Tested the tools initially selected
– And the data workflow between the tools
• Worked with concrete monitoring apps and data
– Lemon (CF), Castor Logs (DSS), Net Logger (DI)
• Defined and tested different monitoring paths
– Notifications vs. Analysis
• Supported two different environments
– Puppet nodes, to be ready for the new AI infrastructure
– Quattor nodes, to run large-scale tests
Producers and Sensors
• Monitoring data generated by all resources
– Monitoring metadata available at the node
– Published to messaging using common libraries
– May also be produced as a result of pre-aggregation or post-processing tasks
• Support and integrate closed monitoring solutions
– By injecting final results into the messaging layer or exporting relevant data at an intermediate stage
[Diagram: sensors and integrated products publish through the messaging layer to storage, analysis, and visualization]
Messaging Broker
• Monitoring data transported via messaging
– Provide a network of messaging brokers
– Support for multiple configurations
– The needs of each monitoring application must be clearly analyzed and defined
  • Total number of producers and consumers
  • Size of the monitoring messages
  • Rate of the monitoring messages
– Realistic testing environments are required to produce reliable performance numbers
• First tests with Apollo (ActiveMQ)
– Prior positive experience in IT and the experiments
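The needs analysis above can be made concrete with a back-of-the-envelope calculation from the rates on the status slide (Lemon: 5 msg/sec at ~3 KB; Castor Logs: 120 msg/sec at ~11 KB):

```python
# Rates and average message sizes as reported on the status slide.
apps = {
    "lemon": {"rate_per_s": 5, "avg_kb": 3},
    "castor_logs": {"rate_per_s": 120, "avg_kb": 11},
}

def daily_volume_gb(rate_per_s, avg_kb):
    """msg/sec * avg size * 86400 sec/day, reported in GB (1 GB = 1e6 KB)."""
    return rate_per_s * avg_kb * 86400 / 1e6

volumes = {name: daily_volume_gb(**cfg) for name, cfg in apps.items()}
# lemon ~ 1.3 GB/day, castor_logs ~ 114 GB/day
```

Even these two early applications together account for a noticeable fraction of the ~280 GB/day quoted for all of IT, which is why each new application's numbers must be defined up front.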
Storage and Analysis
• Monitoring data stored in a common location
– Eases the sharing of monitoring data and analysis tools
– Allows feeding already-processed data into the system
– NoSQL technologies are the most suitable solutions
  • Focus on column/tabular solutions
• First tests with Hadoop (Cloudera distribution)
– Prior positive experience in IT and the experiments
– The map-reduce paradigm is a good match for the use cases
– Has been used successfully at scale
– Several related modules available (Hive, HBase)
• For particular use cases a parallel relational database solution (Oracle) can be considered
Notifications and Dashboards
• Provide efficient delivery of notifications
– Notifications sent directly to the correct consumer targets
– Possible targets: operators, service managers, etc.
• Provide powerful dashboards and APIs
– Complex queries on cross-domain monitoring data
• First tests with Splunk
Links
• Monitoring WG TWiki (new location!)
– https://twiki.cern.ch/twiki/bin/view/MonitoringWG/
• Monitoring WG Report (ongoing)
– https://twiki.cern.ch/twiki/bin/view/MonitoringWG/MonitoringReport
• Agile Infrastructure TWiki
– https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/
• Agile Infrastructure JIRA
– https://agileinf.cern.ch/jira/browse/AI
Next Steps
[Detailed diagram: the Lemon Agent + Lemon Forwarder and the Castor Logs Producer (plus other producers) publish to Apollo; consumers feed the Hadoop cluster (Castor Cockpit, Security Net Logger), Splunk, and SNOW; planned additions include a SNOW/Lemon correlation engine, a generic consumer, and tests in production]