LAS for System Administrators LAS overview Miroslav Siket, Dennis Waldron CERN-IT/FIO-FD.


LAS for System Administrators

LAS overview

Miroslav Siket, Dennis Waldron

http://cern.ch/lemon

CERN-IT/FIO-FD

09/10/2006 Lemon Tutorial

LAS building blocks

• Oracle DB server
  – running LAS logic and storing LAS data (PL/SQL)

• OraMon (application server)
  – inserting exceptions into the Oracle DB

• Web server
  – providing access to LAS data from the Oracle DB to the LAS GUI (business logic)

• Remote monitoring – ping, http

• SURE gateways for UIMON/AFS


LAS hardware

• Two independent instances
  – Primary
    • Oracle DB and OraMon – lemondb1
    • Web server – lemonweb02
  – Secondary
    • Oracle DB and OraMon – lemondb2
    • Web server – lemonweb01

• Remote monitoring machines
  – lxfsrk4104 (aliased as lemonmr & lemonr01)
  – lxservb01 (alias lemonr02)


Oracle DB server check

• Login to the machine (lemondb1, lemondb2):
  > source ~oracle/.oraprofile.LEMON*
  > tnsping LEMON_A (LEMON_C for lemondb2)

• Check the output of the previous command
  – Example: OK (0 ms)
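The check above could be wrapped in a small script. The following is a minimal sketch, assuming that a successful tnsping run prints a line starting with "OK (N ms)" or "OK (N msec)" and that anything else counts as a failure; the function names tns_output_ok and check_alias are made up for this illustration.

```shell
# Sketch only: tns_output_ok and check_alias are hypothetical helper names.
# Assumption: a healthy listener makes tnsping print a line that starts
# with "OK (<number> ms)" or "OK (<number> msec)".

tns_output_ok() {
    # $1: captured tnsping output (may span several lines)
    printf '%s\n' "$1" | grep -Eq '^OK \([0-9]+ ms(ec)?\)'
}

check_alias() {
    # $1: TNS alias, e.g. LEMON_A on lemondb1 or LEMON_C on lemondb2
    if tns_output_ok "$(tnsping "$1" 2>&1)"; then
        echo "$1: OK"
    else
        echo "$1: FAILED"
        return 1
    fi
}
```

After sourcing the Oracle profile as shown on the slide, "check_alias LEMON_A" would report the result for the primary instance.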


OraMon check

• Already checked by LAS GUI

• lemon-host-check

• ORAMON_WRONG procedure

• Log file: /var/log/OraMon.log


Apache web server check

• Already checked by LAS GUI

• lemon-host-check

• HTTPD_WRONG procedure

• Log file: /var/log/httpd/error_log


Remote monitoring check

• Runs as sensor (remote) on remote monitoring machines

• lemon-host-check

• Agent log file: /var/log/edg-fmon-agent.log


SURE gateways for UIMON/AFS

• Runs as a sensor (suregateway) on remote monitoring machines

• Agent process and log file

• ISSUE: AFS machines
  – Uses the lemon-sure-multiplexer process as a gateway
  – lxfsrk4104 only
  – Check the existence of the daemon and its log file: /var/log/lemon-sure-multiplexer.log
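The daemon and log file check could be scripted roughly as follows. This is a sketch under assumptions: it supposes the daemon shows up in the process table under its own executable name (not under an interpreter), and the helper names count_procs and log_present are invented for the example.

```shell
# Sketch only: count_procs and log_present are hypothetical helper names.

count_procs() {
    # Reads "ps -e -o args=" style lines on stdin and prints how many
    # of them have $1 as the basename of their first word.
    awk -v name="$1" '
        { n = split($1, p, "/"); if (p[n] == name) c++ }
        END { print c + 0 }
    '
}

log_present() {
    # Succeeds if the file $1 exists and is not empty.
    [ -s "$1" ]
}
```

On lxfsrk4104 this could be used as "ps -e -o args= | count_procs lemon-sure-multiplexer" followed by "log_present /var/log/lemon-sure-multiplexer.log". The same counting helper also fits the "one and only one agent process" rule that lemon-host-check enforces.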


lemon-cli

• Command line tool for extracting raw (un-interpreted) data from Lemon.

• Information can be extracted from the local cache (/var/spool/edg-fmon-agent) or from the remote server over SOAP (aliased as lemonmr, physical machine: lxfsrk4104)

• Limitations
  – the local cache is limited to seven days' worth of history (purged every day by the agent)
  – remote server queries are limited to 20,000 returned results
    • this limitation will be removed when the new Lemon API is deployed (end of Q4 2006, beginning of Q1 2007)

• The local cache contains much more information than is recorded at the server
  – Why? Smoothing!
    • Smoothing is a mechanism which allows the agent to be selective about the information it sends to the central servers

• If the information you want is less than 7 days old, use the local cache!

• Full documentation at: http://cern.ch/lemon/doc/components/lemon-cli.shtml
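To illustrate why the local cache can hold more samples than the server, here is a generic sketch of value-based smoothing: a sample is forwarded only when it differs from the last forwarded value by more than a tolerance. The slides do not describe the agent's actual smoothing rules, so the function smooth and its tolerance parameter are purely illustrative.

```shell
# Sketch only: a generic smoothing filter, not the Lemon agent's algorithm.
# Input: one numeric sample per line on stdin.
# $1: tolerance (hypothetical parameter); smaller changes are not forwarded.

smooth() {
    awk -v tol="$1" '
        NR == 1 || ($1 - last > tol) || (last - $1 > tol) {
            print          # forward this sample
            last = $1      # and remember it as the last forwarded value
        }
    '
}
```

With a tolerance of 2, the samples 10, 10, 11, 15, 15, 9 are reduced to 10, 15, 9: every sample would sit in the local cache, but only three would reach the server.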


lemon-cli (II) - Examples

• Resolving a metric id to a name
  – lemon-cli -m syslog
  – Displays all the metrics whose name contains 'syslog'

• Referencing time periods (--end, --start), e.g.
  – 1h = 1 hour
  – 2d3h36m44s = 2 days, 3 hours, 36 minutes and 44 seconds
  – Also supports log file timestamps, e.g. Thu 02 Nov 2006 10:45:00 (no guarantees!)

• If querying remotely, -n accepts the same node name expansion criteria as wassh!
  – e.g. lemon-cli -m 10005 -n lxb[0001-1000] --server

• All alarms can be seen on the machine using
  – lemon-cli --class "alarm.exception"
  – 1 005, 1 135 and 1 000 are alarms
  – lemon-host-check interprets all the codes for you!
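The time period and node range notations above can be made concrete with a short sketch. It is not taken from lemon-cli or wassh: duration_to_seconds and expand_nodes are hypothetical helpers, and the range expansion covers only the single zero-padded numeric range form shown in the example, while the real tools may accept more.

```shell
# Sketch only: hypothetical helpers, not part of lemon-cli or wassh.

duration_to_seconds() {
    # $1: duration such as 1h or 2d3h36m44s; prints the total in seconds.
    # Fails (non-zero status) on empty or malformed input.
    printf '%s\n' "$1" | awk '
        {
            s = $0; total = 0
            if (s == "") exit 1
            while (match(s, /^[0-9]+[dhms]/)) {
                num = substr(s, 1, RLENGTH - 1) + 0
                unit = substr(s, RLENGTH, 1)
                if (unit == "d") total += num * 86400
                else if (unit == "h") total += num * 3600
                else if (unit == "m") total += num * 60
                else total += num
                s = substr(s, RLENGTH + 1)
            }
            if (s != "") exit 1
            print total
        }'
}

expand_nodes() {
    # $1: node pattern such as lxb[0001-0003]; prints one node name per line.
    # Names without a [first-last] range are printed unchanged.
    printf '%s\n' "$1" | awk '
        {
            if (match($0, /\[[0-9]+-[0-9]+\]/)) {
                prefix = substr($0, 1, RSTART - 1)
                suffix = substr($0, RSTART + RLENGTH)
                split(substr($0, RSTART + 1, RLENGTH - 2), b, "-")
                # Keep the zero padding of the first bound.
                fmt = "%s%0" length(b[1]) "d%s\n"
                for (i = b[1] + 0; i <= b[2] + 0; i++)
                    printf fmt, prefix, i, suffix
            } else {
                print
            }
        }'
}
```

For instance, "duration_to_seconds 2d3h36m44s" adds 2 days, 3 hours, 36 minutes and 44 seconds into one number of seconds, and "expand_nodes 'lxb[0001-0003]'" lists lxb0001, lxb0002 and lxb0003.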


lemon-host-check (I)

• Aim: to provide a command line tool for viewing the status of all active alarms on a given machine by querying the edg-fmon-agent.

• Uses the information recorded in the agent's local cache (requires /var/ to be writeable!)

• Makes sure that the information reported to you is up to date (fresh!)

• Checks that all sensors are running, and that one and only one agent process is running.

• Must be logged in as root!

• Full documentation at: http://cern.ch/lemon/doc/components/lemon-host-check.shtml


lemon-host-check (II) - Examples

• Check for active alarms on the machine
  – lemon-host-check

• Disable the alarms "syslogd and klogd"
  – lemon-host-check --disable "30023,30032"

• Show me alarms even if they are disabled
  – lemon-host-check --force

• Disable all alarms for the next 1 hour 30 minutes and 23 seconds
  – lemon-host-check --disable-all --duration 1h30m23s "demo intervention"

• View a list of all disabled alarms
  – lemon-host-check --list

• Enable all alarms
  – lemon-host-check --enable-all
  – Some alarms are "hard" disabled! Making them visible again requires a CDB reconfiguration and an ncm-ncd --co fmonagent run.


lemon-host-check (III)

• Pre-alarms
  – Recent concept added to Lemon.
  – Aims at dealing with transient alarms.
  – Real use case:
    • high_load (30008) has pre-alarm capabilities! When high load is detected on the machine, a pre-alarm is raised (not visible on LAS). If the alarm exists for more than 10 minutes, it becomes a proper alarm. This allows high load spikes on machines/clusters such as lxplus to be ignored.
  – Not visible by default in lemon-host-check

• Caution:
  – If you have a high_load alarm and restart the agent, the alarm will disappear! If the root problem hasn't been corrected, the alarm will resurface 10 minutes later (a new ITCM ticket).
  – Don't restart the agent unless you absolutely need to (reconfiguration, errors in the edg-fmon-agent.log, …)
  – If you have to restart, use 'lemon-host-check --show-all' afterwards. Note: make sure to check the status of the alarm! You need to ignore the disabled ones, if any.
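The pre-alarm behaviour, including the effect of an agent restart, can be pictured with a small sketch. It assumes the only state involved is the time at which the condition was first seen; alarm_state is a hypothetical function and the 600-second default simply mirrors the 10 minutes quoted for high_load.

```shell
# Sketch only: alarm_state is a hypothetical illustration of the pre-alarm
# rule, not code from the Lemon agent.

alarm_state() {
    # $1: epoch seconds at which the condition was first detected
    # $2: current epoch seconds
    # $3: hold time in seconds (optional, defaults to 600 = 10 minutes)
    if [ $(( $2 - $1 )) -gt "${3:-600}" ]; then
        echo alarm
    else
        echo pre-alarm
    fi
}
```

Under this model, restarting the agent resets the "first detected" time to the restart time, which is why an existing high_load alarm drops back to a pre-alarm and only resurfaces once the hold time has passed again.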


lemon-host-check (IV)

• Common errors:

No monitoring agent process running / Too many monitoring agent processes running
  – service edg-fmon-agent restart
  – If that fails: [email protected]

Possible false exception
  – lemon-host-check has given up (after 60 seconds) trying to get information from the agent on the machine. If it failed to find out whether an alarm was present for a particular exception, it assumes the worst-case scenario: that an alarm does exist but may not be real (possibly false).
  – Why?
    • The agent may be too busy to answer lemon-host-check
    • Some sensors may have failed to retrieve the necessary information
  – Solution
    • Re-run lemon-host-check
    • If it still fails, check /var/log/edg-fmon-agent.log for any errors about sensors or missing metrics. If they exist, run spma_wrapper.sh on the machine to get the latest sensor code, if any, then ncm-ncd --co fmonagent to reconfigure the agent.
    • Try again
    • If it is still failing, contact the service manager and CC [email protected]


FAQ

Are monitored machines running only Linux (e.g. SLC3/4, RHEL 3/4)?
  – Linux (lemon agent, ping, http check)
  – Solaris (lemon agent, UIMON)
  – Windows (ping, http)

Is there any limitation that we should be aware of on the other OSs / platforms?
  – AFS machines have their own monitoring tools; no information available
  – UIMON monitored machines run the UIMON process and a multiplexer to send alarms to the suregateway sensor on the remote monitoring machines

We knew node polling in SURE; what is implemented in Lemon?
  – Remote sensor on remote monitoring machines

Is there any load balancing (DNS) and/or redundancy? Is the front-end/back-end part of the failover?
  – No, just two independent instances running in parallel.
  – In future (with RAC) there will be failover for OraMon and only one Oracle DB


FAQ (II)

What should we do in case of a piquet call about a failure on these server(s)?
  – Operators' LAS procedures do not have any piquet actions defined. All other failures are covered by the standard OS/hw procedures that they already have. There is nothing LAS-specific for them.

How to interpret the correlation rules? Could you explain the syntax found in the Remedy ticket?
  – Full documentation with examples at http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml
  – Example: (lxs5013:9104:1[/tmp] eq /tmp) && (lxs5013:9104:5[90] > 80)

LAS reduction rules and multi-host tickets: a direct mapping?
  – Several use cases:
    • e.g. 12 x spma_wrong on 12 nodes of cluster YYY
      – One LAS item if the number of machines reaches 51% of the active nodes in the cluster
      – Several LAS items if they appear in bursts and the alarm has already been reduced
      – Individual machine LAS items if below 51%
      – If new machines appear, there will be a new reduced LAS item for each set of them
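The 51% threshold can be written out as a small calculation. This sketch models only the basic threshold rule, not the burst handling or the treatment of newly appearing machines; reduction_mode is a hypothetical name.

```shell
# Sketch only: reduction_mode is a hypothetical illustration of the 51% rule.

reduction_mode() {
    # $1: number of nodes in the cluster raising the same alarm
    # $2: number of active nodes in the cluster (must be greater than 0)
    [ "$2" -gt 0 ] || return 1
    # Integer arithmetic: alarmed / active >= 51 / 100
    if [ $(( $1 * 100 )) -ge $(( $2 * 51 )) ]; then
        echo reduced       # one LAS item for the whole cluster
    else
        echo individual    # one LAS item per machine
    fi
}
```

With 12 alarmed nodes, a cluster of 20 active nodes yields one reduced item (60%), while a cluster of 100 active nodes yields individual items (12%).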

Is there a means to detect when a node started to be "alarmed" and when this stopped?
  – The /var/log/ncm/component-setodesiredstate.log* log files on the machine in question


FAQ (III)

What can be expected from them if no alarm can be displayed anymore at 3:00 AM and they get called by the Operator?
  – No piquet service is defined for LAS. If LAS does not work, operators have procedures for finding out the state of the LAS; check http://lemon.web.cern.ch/lemon/cern/las_procedures.shtml

QUESTIONS?