Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)
Computing Facilities
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
Ivan Fedorko, 22/11/2010
Overview
• Lemon overview
• Lemon agent and sensors
• How to write a new sensor
• Exception sensor
• LAS
Lemon components
[Diagram: Lemon architecture]
• Node monitoring: Sensors, Monitoring Agent, Local Cache
• Transport: TCP/UDP
• Measurement Repository: Repository Backend / Application Server, SQL, Oracle Database
• User interfaces:
  – Lemon CLI (command line tool to access data)
  – Lemon-host-check (command line tool for node exceptions)
  – Lemon-web (RRD tool / Python, Apache / PHP), viewed in a Web Browser over HTTP
Lemon agent and sensors
[Diagram: the Monitoring Agent connected to several Sensors; each Sensor contains Metric Classes, and each Metric Class has Class Instances]
Sensor: A process or script which is connected to the lemon-agent via a bi-directional pipe and collects information on behalf of the agent. Sensors implement metric classes.
Metric Class: The equivalent of a class in OOP (Object-Oriented Programming).
Metric Instance: An instance (an object) of a metric class which has its own configuration data.
Metric ID: A unique identifier associated with a particular metric instance of a particular metric class.
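The relationship between these four terms can be sketched as a small data model. This is an illustration only, not the Lemon API: the Python class names, the `config` field, and the metric ID 99001 are hypothetical. The metric class name system.partitionInfo and the ID 9104 are taken from the exception sensor example in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class MetricClass:
    """A kind of measurement a sensor can collect (like a class in OOP)."""
    name: str


@dataclass
class MetricInstance:
    """A configured instance of a metric class, identified by a unique metric ID."""
    metric_id: int
    metric_class: MetricClass
    config: dict = field(default_factory=dict)  # per-instance configuration data


@dataclass
class Sensor:
    """A process or script that implements one or more metric classes."""
    name: str
    classes: dict = field(default_factory=dict)    # class name -> MetricClass
    instances: dict = field(default_factory=dict)  # metric ID -> MetricInstance

    def add_class(self, name):
        self.classes[name] = MetricClass(name)
        return self.classes[name]

    def add_instance(self, metric_id, class_name, **config):
        if metric_id in self.instances:
            raise ValueError(f"metric ID {metric_id} is already in use")
        inst = MetricInstance(metric_id, self.classes[class_name], config)
        self.instances[metric_id] = inst
        return inst


linux = Sensor("linux")
linux.add_class("system.partitionInfo")
# Two instances of the same class, each with its own ID and configuration.
linux.add_instance(9104, "system.partitionInfo")
linux.add_instance(99001, "system.partitionInfo", mount="/tmp")  # hypothetical ID
```

The point of the sketch is that the metric ID, not the class name, is what identifies a measurement: one class can back several instances that differ only in configuration.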
Lemon agent and sensors
MSA (Monitoring Sensor Agent)
• forks sensors and communicates with them using a custom protocol over bi-directional pipes
• configures metric instances of the metric classes of a sensor and pulls metrics from them
  – to configure: ncm-ncd --configure fmonagent
  – configuration: /etc/lemon/agent/
  – log: /var/log/lemon-agent.log
• checks on the status of sensors
• caches data locally (e.g. /var/spool/lemon-agent/)
Supported Lemon Sensors: http://lemon.web.cern.ch/lemon/doc/sensors.shtml
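The idea of an agent that starts a sensor as a child process and talks to it over a pair of pipes can be sketched in Python. The real Lemon protocol is not described in this talk, so the one-line "SAMPLE <metric_id>" request and the "<metric_id> <value>" reply are invented for the demo, as is the stand-in sensor, which always answers 42.

```python
import subprocess
import sys

# Hypothetical stand-in for a sensor: reads one request per line on stdin and
# answers on stdout. The "SAMPLE <metric_id>" wording is invented for this demo.
SENSOR_CODE = r"""
import sys
for line in sys.stdin:
    cmd, metric_id = line.split()
    if cmd == "SAMPLE":
        print(f"{metric_id} 42", flush=True)
"""


def start_sensor():
    """Start the sensor as a child process connected through two pipes."""
    return subprocess.Popen(
        [sys.executable, "-c", SENSOR_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def request_sample(sensor, metric_id):
    """Ask the sensor for one metric and return its reply line."""
    sensor.stdin.write(f"SAMPLE {metric_id}\n")
    sensor.stdin.flush()
    return sensor.stdout.readline().strip()


sensor = start_sensor()
reply = request_sample(sensor, 9104)
sensor.stdin.close()   # end of input lets the sensor's read loop finish
sensor.wait()
sensor.stdout.close()
print(reply)
```

Because the agent owns both ends of the conversation, it can also notice a dead sensor (the child's exit status) without any extra channel.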
Lemon agent and sensors
Sensor configuration file: /etc/lemon/agent/sensors/linux_CDB.conf
Lemon agent and sensors
What we can store:
• StoreSample01:
  – agent: <metric_id> <timestamp>
  – user: <value>
• StoreSample02:
  – agent: <metric_id>
  – user: <timestamp> <node> <value>
• StoreSample03:
  – agent: (nothing)
  – user: <node> <metric_id> <timestamp> <value>
Example of linux sensor
Lemon API
Reporting on behalf
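The difference between the three calls is only who fills in which field. A minimal Python sketch of that split follows; the function signatures, the record layout, and the node names are hypothetical and do not reproduce the real Lemon API. It also assumes that StoreSample01 implies the local node, which the slide does not state.

```python
LOCAL_NODE = "node01.example.org"  # hypothetical name of the node the agent runs on


def _sample(node, metric_id, timestamp, value):
    """Common record shape used by all three calls in this sketch."""
    return {"node": node, "metric_id": metric_id,
            "timestamp": timestamp, "value": value}


def store_sample01(value, *, agent_metric_id, agent_timestamp):
    """User supplies only the value; the agent adds metric ID and timestamp."""
    # Assumption: the sample belongs to the local node.
    return _sample(LOCAL_NODE, agent_metric_id, agent_timestamp, value)


def store_sample02(timestamp, node, value, *, agent_metric_id):
    """User supplies timestamp, node and value; the agent adds the metric ID."""
    return _sample(node, agent_metric_id, timestamp, value)


def store_sample03(node, metric_id, timestamp, value):
    """User supplies every field; the agent adds nothing."""
    return _sample(node, metric_id, timestamp, value)


a = store_sample01("85", agent_metric_id=9104, agent_timestamp=1290384000)
b = store_sample02(1290384000, "node02.example.org", "85", agent_metric_id=9104)
c = store_sample03("node02.example.org", 9104, 1290384000, "85")
```

All three produce the same kind of record; the later forms simply move responsibility for more fields from the agent to the sensor author.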
Create new sensor
• Check the instructions and example code
  – http://lemon.web.cern.ch/lemon/doc/howto/sensor_tutorial.shtml
• Prepare your code and test it
  – Testing can be done locally
• Prepare templates
  – For the sensor: will be transformed into a DB table definition
  – For metrics: create table/metric, metadata check on server
• Ask for a metric ID at lemon.support
• Commit the templates with the ID to CDB and inform lemon.support
• If a new exception is introduced, should it raise an alarm?
Sensor template example
Somewhere in your node template
Sensor template
For configuration
For db backend
Exception sensor
• Objective:
  – To run a corrective action when the occupancy of the /tmp partition is greater than 80%.
• Involved metrics
  – With ID 9104 (system.partitionInfo)
  – Field 1 = mountname, field 5 = percentage occupancy
• Correlation
  Correlation ((9104:1 eq '/tmp') && (9104:5 > 80))
  Actuator /usr/local/sbin/clean-tmp-partition -o 75
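What this correlation tests can be written out by hand in Python. This is a sketch of the logic only, not of how the sensor is implemented. The helper names are hypothetical, and fields 2 to 4 of metric 9104 are not described in this talk, so they appear as placeholders.

```python
def get_field(sample, position):
    """Return one field of a multi-valued sample (positions start at 1)."""
    return sample[position - 1]


def tmp_over_limit(sample, limit=80):
    """True when the sample describes /tmp and its occupancy exceeds the limit.

    Mirrors ((9104:1 eq '/tmp') && (9104:5 > 80)) for one sample of metric 9104.
    """
    return get_field(sample, 1) == "/tmp" and float(get_field(sample, 5)) > limit


# Fields 2 to 4 are placeholders; only fields 1 and 5 are known from the slide.
full_tmp = ["/tmp", "?", "?", "?", "85"]
ok_tmp = ["/tmp", "?", "?", "?", "40"]
full_var = ["/var", "?", "?", "?", "95"]

for sample in (full_tmp, ok_tmp, full_var):
    print(get_field(sample, 1), tmp_over_limit(sample))
```

Only the first sample satisfies both halves of the condition, so only it would trigger the actuator.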
Exception sensor
– an officially supported Lemon sensor coded in C++
– developed in collaboration between CERN and BARC
– implements the Lemon alarm protocol
– has a correlation engine which allows it to evaluate one or more metrics to determine if a problem exists on a machine
– supports reporting on behalf of other monitored entities
– allows corrective actions (actuators) to run up to n times or within a given time window
– is the primary interface for inserting alarms into the Lemon framework
– provides one and only one metric class, “alarm.exception”
Full documentation at:
– http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml
Exception sensor
[Figure: Alarm evolution]
Exception metric
"/system/monitoring/exception/_30010" = nlist(
    "name", "tmp_full",
    "descr", "tmp utilization exceeds limit",
    "active", true,
    "latestonly", false,
    "importance", 2,
    "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))",
    "actuator", nlist(
        "execve", "/usr/local/sbin/clean-tmp-partition -o 75",
        "maxruns", 3,
        "timeout", 300,
        "window", 900,
        "active", true)
);
• name: name of the exception, to be used on the web and later for the operator's GUI
• descr: short description, also passed to the GUI
• active: if false, the ncm component will not include the exception in the configuration
• latestonly: if false, values are stored in the Lemon DB archive
• importance: what is the alarm's importance? Not used now!
  – 0 - informative: to be handled at convenience
  – 1 - low: 9/5 support, to be handled within working hours, e-mail outside working hours
  – 2 - high: 24/24 support, requires immediate action, PK or expert call
Correlation
• Basic format of a correlation is:
  [entity_name]:<metric_id>:<field_position> <operator> <reference_value> ...
• Where,
  – entity_name
    • An optional parameter, used for reporting on behalf of other entities
    • The name of the entity (wildcards ‘*’ are supported)
  – metric_id
    • The id of the metric to check
  – field_position
    • The field to use within the metric.
    • Allows the correlation to extract a single value from a multi-valued metric
  – operator
    • E.g. ==, !=, >, <, eq, ne, regex, !regex …
  – reference_value
    • A string or number used to compare the metric_id:field_position against
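To make the term format concrete, here is a small Python sketch that splits a single term into its parts and applies the operator to a value. It is not the sensor's correlation engine (which is written in C++). It handles only one term without parentheses and only the operators listed above. It also assumes that ==, !=, > and < compare numbers while eq and ne compare strings, which the slide does not spell out.

```python
import re
from typing import NamedTuple, Optional

TERM = re.compile(
    r"^(?:(?P<entity>[^\s:]+):)?"            # optional entity name, e.g. "*"
    r"(?P<metric_id>\d+):(?P<field>\d+)\s+"  # metric ID and field position
    r"(?P<op>==|!=|>|<|eq|ne|!regex|regex)\s+"
    r"(?P<ref>.+)$"                          # reference value
)


class Term(NamedTuple):
    entity: Optional[str]
    metric_id: int
    field: int
    op: str
    ref: str


def parse_term(text):
    """Split one correlation term into entity, metric ID, field, operator, reference."""
    m = TERM.match(text.strip())
    if m is None:
        raise ValueError(f"not a correlation term: {text!r}")
    ref = m["ref"].strip()
    if len(ref) >= 2 and ref[0] == ref[-1] == "'":
        ref = ref[1:-1]  # drop the surrounding single quotes
    return Term(m["entity"], int(m["metric_id"]), int(m["field"]), m["op"], ref)


# Assumed semantics: symbolic operators are numeric, word operators are textual.
OPS = {
    "==": lambda v, r: float(v) == float(r),
    "!=": lambda v, r: float(v) != float(r),
    ">": lambda v, r: float(v) > float(r),
    "<": lambda v, r: float(v) < float(r),
    "eq": lambda v, r: str(v) == r,
    "ne": lambda v, r: str(v) != r,
    "regex": lambda v, r: re.search(r, str(v)) is not None,
    "!regex": lambda v, r: re.search(r, str(v)) is None,
}


def term_holds(term, value):
    """Apply the term's operator to a field value taken from a sample."""
    return OPS[term.op](value, term.ref)


print(parse_term("9104:1 eq '/tmp'"))
print(parse_term("*:9501:5 != 200"))
```

A full evaluator would additionally need to handle parentheses and the && and || connectives, which this sketch leaves out.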
Correlation
1. Use the basic object for a correlation
   <metric_id>:<field_position>
2. Combine exceptions
   10004:1 > 600 && (10004:7 > 10 || (10004:8 > 150000 && 4109:3 eq 'i386') || (10004:8 > 600000 && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0)
3. You can collect information on behalf of other entities and define exceptions on their behalf
   [entity_name]:<metric_id>:<field_position>
   e.g. (*:9501:5 != 200) && (*:9501:5 != 301)
4. Join the metrics
   (9200:1 == 9208:1)
5. "correlation", "4109:2 ne 'symlink('/system/kernel/version')'"
Actuator
• Information:
  – Run as forked processes.
  – Are connected to the sensor via a pipe.
  – All information written to stdout or stderr by the actuator is caught and recorded in the agent's log file.
  – All actuator attempts are logged centrally and recorded locally in the agent's log file.
• Running shell style actuators:
  – The system call used to run the actuator doesn’t provide shell style conveniences.
  – To use shell style syntax like *, &&, | etc. you must define your actuator like this:
    Actuator /bin/sh -c \\"/bin/echo 'This is a demo message from $HOSTNAME' \\"
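The same distinction between a direct program call and a call through a shell can be reproduced in Python. This is an analogy rather than the sensor's own code, and it assumes a Unix-like system where /bin/echo and /bin/sh exist.

```python
import subprocess

# Without a shell the arguments are handed to the program verbatim, as an
# execve-style call does: "&&" is just another word for echo to print.
direct = subprocess.run(
    ["/bin/echo", "one", "&&", "/bin/echo", "two"],
    capture_output=True, text=True,
)

# Wrapping the command in "/bin/sh -c" lets the shell interpret "&&".
wrapped = subprocess.run(
    ["/bin/sh", "-c", "/bin/echo one && /bin/echo two"],
    capture_output=True, text=True,
)

print(repr(direct.stdout))   # 'one && /bin/echo two\n'
print(repr(wrapped.stdout))  # 'one\ntwo\n'
```

The first call runs one program and prints its arguments literally; the second runs two commands because the shell parsed the string.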
Actuator
"/system/monitoring/exception/_30010" = nlist(
    "name", "tmp_full",
    "descr", "tmp utilization exceeds limit",
    "active", true,
    "latestonly", false,
    "importance", 2,
    "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))",
    "actuator", nlist(
        "execve", "/usr/local/sbin/clean-tmp-partition -o 75",
        "maxruns", 3,
        "timeout", 300,
        "window", 900,
        "active", true)
);
• maxruns: the maximum number of times an actuator can run consecutively before a final alarm is generated
• timeout: the maximum number of seconds that an actuator is allowed to run before being terminated by the sensor
• window: the time window in which all maxruns of the actuator are executed
The actual values of the correlation objects are accessible to the actuator:
Actuator /bin/sh -c \\"/bin/echo '$act_value_01 $act_value_02' \\"
Actuator /bin/sh -c \\"/bin/echo 'Died lemonmrd daemon $act_value_01 ' | /bin/mail -s 'Lemon RRD Daemon problem' [email protected]\\"
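One way to read maxruns and window together is as a budget: at most maxruns actuator runs inside any window of the given length, after which the problem is escalated as a final alarm. The Python sketch below implements that reading with a sliding window. The exact counting rules of the real sensor are not given in this talk, so this interpretation and the class name `ActuatorBudget` are assumptions.

```python
class ActuatorBudget:
    """Allow at most `maxruns` actuator runs within `window` seconds (assumed rule)."""

    def __init__(self, maxruns, window):
        self.maxruns = maxruns
        self.window = window
        self.runs = []  # start times (in seconds) of recent runs

    def may_run(self, now):
        """Record and allow a run at time `now`, or refuse if the budget is spent."""
        # Forget runs that have fallen out of the time window.
        self.runs = [t for t in self.runs if now - t < self.window]
        if len(self.runs) >= self.maxruns:
            return False  # budget spent: this is where a final alarm would be raised
        self.runs.append(now)
        return True


budget = ActuatorBudget(maxruns=3, window=900)
decisions = [budget.may_run(t) for t in (0, 300, 600, 800, 1000)]
print(decisions)  # [True, True, True, False, True]
```

With the values from the template (3 runs, 900 seconds), the fourth attempt at t=800 is refused, and the attempt at t=1000 is allowed again because the run at t=0 has left the window.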
Exception config
"minoccurs", 5
Specifies how many times the exception should occur before the exception is raised.
"local", yes
Technically this value has no effect within the sensor but is an instruction to the lemon-agent not to transmit data for this exception to the remote application servers. As remote transmission does not occur, the outcome of the exception can never appear on LAS (Lemon Alarm System) and is only visible locally on the machine using lemon-host-check.
"silent", yes
An exception which is considered silent effectively sets the exception state to the value 2 for all transitions. The exception is disabled, no actuators will run and no alarms will be displayed on the LAS console.
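The effect of minoccurs can be illustrated with a small counter in Python. The sketch assumes that occurrences must be consecutive and that a single evaluation where the correlation is false resets the count; the slide does not say whether the real sensor counts this way. The class name `OccurrenceGate` is hypothetical.

```python
class OccurrenceGate:
    """Raise only after the condition held `minoccurs` times in a row (assumed rule)."""

    def __init__(self, minoccurs):
        self.minoccurs = minoccurs
        self.count = 0

    def update(self, condition_true):
        """Feed one evaluation result; return True when the exception should be raised."""
        self.count = self.count + 1 if condition_true else 0
        return self.count >= self.minoccurs


gate = OccurrenceGate(minoccurs=5)
raised = [gate.update(True) for _ in range(5)]
print(raised)              # [False, False, False, False, True]
print(gate.update(False))  # False, and the count starts again from zero
```

This kind of gate filters out short spikes: a condition that is true for only one or two samples never reaches the alarm stage.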
LAS
[Diagram: LAS architecture. Components shown: Exception Metrics, Lemon Oracle DB, LAS Business Logic (PL/SQL), Lemon-web, LAS GUI, Event Management System, CDB, SMS, Operator, Administrator]
LAS
Raw alarm
• represents a processed exception on one of the monitored entities (reported by Lemon). Not all exceptions become an alarm (only those with code 000, 005, 135), and only from hosts not in maintenance.
L-alarm
• can represent one or more alarms. It is an item visible on the operator's screen in the LAS GUI.
• every L-alarm must be acknowledged with a ticket created in ITCM
• states: active, inactive, acknowledge, inhibited
• alarms may be grouped into an L-alarm (by entities, exceptions, cluster rules…)
What is a no_contact alarm?
• The exception is not evaluated on the host! The alarm means that data are not arriving from the host to the DB.
• One of the heartbeat metrics (6335, 6336, 9500, 10005) is reported (usually) every 5 min
• A procedure in the DB checks that the last entry is not older than (usually) 10 min
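The no_contact check amounts to comparing the newest heartbeat entry of a host against an age limit. The real check is a procedure inside the database; the Python sketch below only restates the rule. The function name, its arguments, and the decision to treat a host with no heartbeat entries at all as no_contact are assumptions.

```python
from datetime import datetime, timedelta

HEARTBEAT_METRICS = (6335, 6336, 9500, 10005)


def no_contact(last_seen, now, max_age=timedelta(minutes=10)):
    """Return True when the newest heartbeat entry of a host is older than max_age.

    last_seen maps metric ID -> time of the last entry stored for that host.
    """
    stamps = [last_seen[m] for m in HEARTBEAT_METRICS if m in last_seen]
    if not stamps:
        return True  # assumption: no heartbeat entry at all counts as no contact
    return now - max(stamps) > max_age


now = datetime(2010, 11, 22, 12, 0)
healthy = {9500: now - timedelta(minutes=4)}
stale = {9500: now - timedelta(minutes=11), 6335: now - timedelta(minutes=30)}

print(no_contact(healthy, now))  # False
print(no_contact(stale, now))    # True
```

Setting the age limit to twice the reporting interval means one missed heartbeat is tolerated before the alarm appears.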
LAS
How to avoid an alarm
• disable (e.g. by lemon-host-check)
• make the exception local
• make the exception silent
• set the host to maintenance (actuator will run)
/etc/lemon/exceptions/state.conf
Backup
Backup slides from here on
Sensor parameters
Smoothing