Building a Proactive Monitoring and Alerting System Using Native IBM Domino Tools

© 2013 Wellesley Information Services. All rights reserved.

Andy Pedisich, Technotics

Why Do This Session …

• Many admins want to take advantage of native Notes monitoring solutions; they just don’t have the bandwidth to explore them
  - “Free time” is very rare these days
• This jumpstart will show you:
  - How to collect stats
  - How to analyze stats
  - How to go behind the scenes
  - How to set up monitors and alerts
  - And how to capture just about any little event you are interested in
  - And finally, how to configure and work with DDM
• Let’s get started

What We’ll Cover …

• Looking at the big picture of server monitoring
• Understanding statistic generation
• Designing an efficient and sensible collection infrastructure
• Pulling useful information from statistical data
• Using cluster stats to keep clusters reliable
• Understanding the essentials of event monitoring
• Determining the best notification methods
• DDM: Understanding how it fits into your environment
• DDM: Crafting a perfect DDM data collection hierarchy
• DDM: Looking at DDM events and probes
• Wrap-up

Driving Your Domino Servers

• You can learn a lot about the importance of monitoring from driving your car
  - Your car tells you a lot about what’s going on
  - And you know the indicators are important because you pay attention to them
    - You fill the gas tank when it’s low
      - Unless you are Rob Axelrod (ask Rob)
    - And (usually) pay attention to the speedometer so you won’t get a ticket
  - Or maybe you’re the driver who thinks that red light on the dash is just for ambience while you’re driving at night
    - Uh oh

Domino Servers Are Obsessed with Statistics

• Domino servers are constantly spewing stats
  - Just like your car telling you how fast you’re going
    - Except with Domino there are literally several hundred statistics generated
    - Most of them are updated continuously
• Many administrators don’t know which ones are important
  - Or how to tell the good readings from the bad ones
  - Or what to do about them when they are bad

The Truth About Monitoring

• A good administrator shouldn’t have to look very hard
  - And you can be notified about most problems automatically
    - You can be proactive about fixing them
    - When you’re proactive, you put out fewer fires
      - Firefighting dilutes your effort
• But being notified requires that you monitor your environment for events and issues
  - And events depend on statistics
    - And statistics need to be collected
    - And too many sites don’t collect stats correctly
      - Some don’t collect them at all


Perpetual Statistics

• Domino servers constantly generate statistics
  - They track data at a surprising level of detail
  - On almost every aspect of server operations
    - Agent manager
    - Mail and calendaring
    - The server’s platform
    - SMTP and Notes mail
    - LDAP
    - HTTP
    - Network
    - And lots more, too

Server Statistics Are Organized Hierarchically

• Stats are gathered into major categories like these
  - And then each one has a multitude of subcategories

  ADMINP      Mem
  Agent       Monitor
  Calendar    NET
  Database    Platform
  Disk        POP3
  Domino      Replica
  EVENT       Server
  HTTP        SMTP
  LDAP        Stats
  Mail        Update

Subcategories of Statistics

• Here’s a snapshot from the Administrator client showing some of the statistical hierarchy
  - This gives you a snapshot of the stats on your server
    - Use Refresh to get another snapshot

Statistics Come in Basic Types

• The basic types of statistics are:
  - Static stats that never change once the server is started
  - Snapshot stats that reflect what’s going on right now
  - Cumulative stats that grow from the moment the server is started
• These stats are available to you for:
  - Your Domino servers
  - The platform your server is running on
  - Your network environment

Static Statistics

• Statistics that don’t change usually represent the operating environment of the server
  - Server.Version.Notes = Release 8.5.3FP3
  - Server.Version.OS = Windows NT 5.0
  - Server.CPU.Type = Intel Pentium
  - Disk.D.Size = 71,847,784,448
  - Mem.PhysicalRAM = 527,433,728

Amazing Detail, Yours Free!

• This includes OS platform, Domino version, RAM
• Lots of information about disks in use
  - Platform.LogicalDisk.TotalNumofDisks = 3
  - Platform.LogicalDisk.2.AssignedName = E
  - Disk.C.Size = 80,023,715,840
• And even Network Interface Card (NIC) information
  - Platform.Network.1.AdapterName = Intel[R] PRO_1000 MT Server Adapter
  - Platform.Network.2.AdapterName = Broadcom NetXtreme Gigabit Ethernet _2
  - Platform.Network.3.AdapterName = Broadcom NetXtreme Gigabit Ethernet

What Good Are These Static Stats?

• Think these static stats aren’t helpful?
  - Guess again
    - They are extremely valuable
• If you are collecting stats correctly from all your servers, you can take a pretty detailed server inventory
  - Without leaving your desk
    - From servers all around the world, just by looking at the data we’re going to collect in the Monitoring Results database
    - This database is also known by its filename: STATREP.NSF

Snapshot Statistics

• Snapshot stats show what’s happening at the moment you ask for them
  - They are changing all the time
    - Disk.E.Free = 18,679,414,784
    - Server.Users = 280
    - Mem.Free = 433,614,848
    - MAIL.Waiting = 250
  - The best part about this is that you get lots of Domino-related stats you wouldn’t get by looking at the operating system’s performance tools

Cumulative Stats

• Some stats are cumulative
  - They start counting from zero when you start the server
    - Server.Trans.Total = 31,915
    - SMTP.MessagesProcessed = 966
  - Other stats, like averages and maximums, are calculated from the cumulative ones
    - Server.Users.Peak.Time = 02/21/2006 07:50:33 MST
    - Platform.Memory.PagesPerSec.Peak = 1,364.1
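Because cumulative stats only ever grow, the useful number is usually the change between two collections. Here is a minimal Python sketch of that arithmetic, assuming you have already pulled two readings of a counter such as Server.Trans.Total out of Statrep. The function name and the sample readings are made up for illustration, and treating a drop in the counter as a restart is my assumption, not documented Domino behavior.

```python
def rate_per_minute(previous_total: float, current_total: float, minutes: float) -> float:
    """Return the average growth per minute of a cumulative counter."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    delta = current_total - previous_total
    if delta < 0:
        # Assumption: the counter went backwards because it restarted from zero
        # (server restart or a Set Stat reset), so count only what accrued since.
        delta = current_total
    return delta / minutes


# Hypothetical readings taken 60 minutes apart: 1,200 new transactions.
print(rate_per_minute(31_915, 33_115, 60))
```

The same function works for any counter that starts at zero when the server starts, as long as you know the time between the two readings.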

Resetting Statistics

• Some of these cumulative stats can be reset using the following console command:
  - Set Statistics statisticname
    - You can’t use wildcards (*) with this argument!
• Here’s an example of why you might want to reset a stat:
  - Set Stat Server.Trans.Total
    - Resets the Server.Trans.Total statistic to 0
  - You might want to reset this stat if:
    - You are starting to benchmark a new application
    - You are debugging an agent and want to see if it is more efficient after changes to its design

Platform Stats, Too

• Platform stats vary widely from OS to OS
• Getting platform stats from within Notes has great value
  - Track Domino server performance on an OS level even if your servers run on a variety of operating systems
    - For example, it’s very common to have a mix of AIX and Wintel servers
• In a few minutes, we’ll be discussing threshold tracking
  - You’ll be able to set notification thresholds universally from within Notes to track these platform stats

Getting to Platform Statistics

• Domino releases 6, 7, and 8 track platform stats automatically
  - In earlier versions, they had to be explicitly enabled, and many times they were disabled due to problems with servers crashing
    - These problems are gone
• To see all platform stats, enter this console command:
  - Show stat platform

A Word About Platform Stats on Partitioned Servers

• Domino collects platform stats that pertain to the whole system
  - Not to an individual partition
• The only statistics that are specific to a partition are those that reflect tasks, such as process statistics
  - One partition might run 10 tasks, while another partition runs 15 tasks

Issue

Confirming Stats with Other Tools

• Be careful when trying to confirm platform statistics using other performance monitoring tools
  - Because of the differences in sampling intervals, you cannot use the operating system’s native monitoring tools to confirm platform statistics
  - There will be discrepancies between platform statistics and those obtained …
    - Using Perfmon (for Windows 2000)
    - Or a system command, such as these UNIX commands: iostat, vmstat, netstat

See Server Statistics

• The quickest way to see all server stats is to enter this console command:
  - Show stat
• Any place you can get to a console, you can access stats that can tell you a lot about the current state of the server
• A SHOW STAT command gives you every statistic the Domino server has
  - Several hundred of them!
    - That’s really too many to deal with at once

Can I See That in a Smaller Size?

• Get a better view of the stats, showing just what you’re looking for, using the asterisk wildcard
• You can ask directly for the top level of the hierarchy
  - Show stat server
    - That shows all of the stat hierarchy under “server”

You Might Want Only Part of the Data

• Getting a select list of just the stats under the top level requires the use of wildcards in your console commands
  - If you only want the Server.Users hierarchy, use the global “*”
    - Show stat server.users.*

Pushing the Wildcards

• If you want a closer look, like just grabbing particular sub-levels of stats, get clever with the wildcard
  - For example, use the following command to find out about mail waiting
    - Show stat mail.wait*

  MAIL.Waiting = 1
  Mail.WaitingForDeliveryRetry = 1
  MAIL.WaitingForDIR = 0
  MAIL.WaitingForDNS = 0
  MAIL.WaitingRecipients = 1
  5 statistics found
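If you copy console output like this into a script, the "name = value" layout is easy to pick apart. The following is a small Python sketch, not a Domino API: the function name is my own, and it assumes every statistic arrives on its own line in the form shown on this slide.

```python
def parse_show_stat(text: str) -> dict[str, str]:
    """Parse 'Name = Value' lines from Show Stat console output into a dict."""
    stats: dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skips blank lines and the "5 statistics found" summary
        stats[name.strip()] = value.strip()
    return stats


# Output of "Show stat mail.wait*" as shown on the slide.
sample = """\
  MAIL.Waiting = 1
  Mail.WaitingForDeliveryRetry = 1
  MAIL.WaitingForDIR = 0
  MAIL.WaitingForDNS = 0
  MAIL.WaitingRecipients = 1
  5 statistics found
"""
stats = parse_show_stat(sample)
print(len(stats), stats["MAIL.Waiting"])
```

Values are kept as strings on purpose, because some stats are text (versions, adapter names) and numeric ones carry thousands separators.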

Take It to the Next Level

• Now that we know where the statistics are, it’s time to kick it up a notch
  - Let’s set up a collection architecture
• Some Notes shops do not collect server statistics at all!
  - How in the world can they:
    - Determine what is causing performance issues?
    - Plan for future growth?
    - Have a grip on whether their server platforms are configured correctly?
  - Do they just make the stuff up and go with it?

The Two Things Needed

• There are two things that are needed for statistics collection to happen:
  - The Events4 database must have a Server Statistic Collection document
  - The Collect task must be running on the server that is designated to collect the statistics

Details, Details, Details

• Events4, the Monitoring Configuration database, needs a Server Statistic Collection document for each server collecting stats
  - This database should replicate to every server in the domain
  - A server will know it is supposed to collect stats because of this document
    - But it won’t automatically load the Collect server task
    - We have to make sure that happens

Server Statistics Collection Docs

• Use a Server Statistic Collection doc to indicate the server that will collect stats
  - And the servers you want the stats collected from

Set the Statistics Collection Interval

• Use the collection report interval on the Options tab to set up how often statistics should be gathered
• Generally, collecting once an hour is sufficient
  - If you are upgrading or changing the environment, it’s better to collect every 30 minutes
  - Or even every 15 minutes, if you are trying to fix problems

A Single Document Looks Like Many in the View

• This single document, with a multi-value field containing all the servers, will look like it is multiple documents in the Events4 database
  - Make sure administrators know this, or they might delete everything by mistake
    - Guess how I know this?

Centralize Your Domain’s Statistic Collection

• Ideally, use just a few key servers to do the collection
  - You might even be able to get away with just one!
• Your network topology will have a profound effect on which servers you select
  - So will the load currently running on the servers
• Avoid collecting stats over long, slow links
  - Be careful of WAN routes that are already packed with other network traffic

Configure Key Collect Points

• If you have offices in London and Tokyo, then pick a collection server from each city
  - That server will collect stats from all servers in that region
• Collect stats in a database created from the Monitoring Results template
  - The databases don’t have to be called Statrep
    - Voilà! Centralized data at your fingertips

  City     Collecting Server   Monitoring Results Database
  London   LonAdmin1           LondonStatrep.nsf
  Tokyo    TokHub01            TokyoStatrep.nsf

Remember to Add the Collect Task

• The Collect server task must be running on the servers you selected as collectors
  - Use LOAD COLLECT from the console to get it started
  - Add the Collect task to the ServerTasks= line in the selected servers’ Notes.ini to make it permanent
    - Remove Collect from ServerTasks= on all other servers!
• Want the servers to start collecting stats immediately?
  - Use the following console command:
    - Tell Collector Collect
    - It will kick off a statistic collection of all the servers you specified
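A quick way to audit which servers still carry Collect in ServerTasks= is to scan a copy of each Notes.ini. This is only a Python sketch under assumptions: the helper names are mine, the two ini fragments are invented examples, and it expects the tasks to be comma-separated on a single ServerTasks= line.

```python
def server_tasks(notes_ini_text: str) -> list[str]:
    """Return the task names on the ServerTasks= line, or [] if it is absent."""
    for line in notes_ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "servertasks":
            return [task.strip() for task in value.split(",") if task.strip()]
    return []


def runs_collect(notes_ini_text: str) -> bool:
    """True when Collect appears among the server tasks (case-insensitive)."""
    return any(task.lower() == "collect" for task in server_tasks(notes_ini_text))


# Hypothetical Notes.ini fragments for a collector and a non-collector.
collector_ini = "ServerTasks=Update,Replica,Router,AMgr,AdminP,Collect\n"
other_ini = "ServerTasks=Update,Replica,Router,AMgr,AdminP\n"
print(runs_collect(collector_ini), runs_collect(other_ini))
```

Run it against every server’s Notes.ini and compare the result with the list of collectors you actually intended.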

The Collect Task Should Not Run on Every Server

• Stat collection can be set up so each server collects its own stats
  - And puts them into a local Statrep Monitoring Results database
  - This method has the following drawbacks:
    - You have to run the Collect task on every server
    - You must visit Statrep on each server to analyze statistics
      - This is a real pain in the neck
      - And it makes analysis harder
• Statistics have the most value when collected into a central location where they can be easily analyzed


Demonstration: Setting Up the Collect Task

Demo

Let’s Start by Looking at Disk Stats

• If I get a call about server performance, I check disk stats first
  - Bad disk utilization can seriously tank a server
• One stat to track is Percent Utilization
  - A very busy disk can mean a very busy server
    - But it might mean something else is wrong
    - Perhaps a controller is beginning to fail or the drive cache is set wrong
• Disk stat names depend on the platform, but they have PctUtil in them
  - It could be Logical Disk or Physical Disk
  - Like Platform.LogicalDisk.1.PctUtil.Avg
    - This should rarely hit 60% on Wintel boxes
    - On AIX and iSeries, it depends on the disk subsystem’s configuration
      - They often can run 90%+ without issues
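To make those rules of thumb concrete, here is a hedged Python sketch that flags logical disks whose average utilization is over a per-platform limit. The 60% figure for Wintel comes from this slide; using 90% as a hard limit for AIX and iSeries is my simplification, since the right value depends on the disk subsystem. The function names and the sample values are illustrative only.

```python
# Assumed alert limits (percent), loosely based on the rules of thumb above.
PCT_UTIL_LIMITS = {"wintel": 60.0, "aix": 90.0, "iseries": 90.0}


def to_number(value: str) -> float:
    """Convert a stat value such as '71,847,784,448' or '72.5' to a float."""
    return float(value.replace(",", ""))


def busy_disks(stats: dict[str, str], platform: str) -> list[str]:
    """Return the PctUtil.Avg stat names that exceed the platform's limit."""
    limit = PCT_UTIL_LIMITS[platform]
    return sorted(
        name
        for name, value in stats.items()
        if name.startswith("Platform.LogicalDisk.")
        and name.endswith(".PctUtil.Avg")
        and to_number(value) > limit
    )


# Made-up readings for one server.
sample_stats = {
    "Platform.LogicalDisk.1.PctUtil.Avg": "72.5",
    "Platform.LogicalDisk.2.PctUtil.Avg": "12",
    "Disk.C.Size": "80,023,715,840",
}
print(busy_disks(sample_stats, "wintel"))
```

The same readings pass on AIX under the assumed 90% limit, which is the point: one threshold does not fit every platform.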

Average Disk Queue Length

• This is a major statistic!
• Platform.LogicalDisk.1.AvgQueueLen, .Avg and .Peak
  - Queues of more than a couple of seconds mean your disks can’t really keep up with the action
  - You can hit high peaks occasionally without issues
    - But constant highs mean moving users or apps
• Balance these disk stats against CPU/Memory stats
  - Because memory = virtual disk
  - And constant thrashing of disks might mean you need more RAM
• Problem is, Statrep doesn’t have a view that shows these important statistics

There’s a Lot of Stuff That Isn’t There

• Before we get any further, it’s important to point out something that is hidden
  - Statistical data in the Monitoring Results database, STATREP.NSF
• Statrep has views that simply don’t have data that is as useful as it could be
  - It’s there, it’s just not in views
    - However, it’s important to know that every document in the database contains every statistic you see when you issue a SHOW STAT command at the console
    - It’s just a matter of showing it in a view

Take Home This View

• But now you have a version of Statrep with a view that does contain those important stats!
  - A specially crafted version of the Statrep template with a view like the one below is available
    - You can download it from my blog
  - You’ll probably have to modify the columns based on the disk configurations of your own systems

Processor Statistics

• Platform.Memory.RAM stats will disclose memory usage
  - Don’t just think you might need more memory: be certain by checking this out
  - On Wintel systems, this number should rarely hit 60%
    - But on iSeries and AIX, it can be much higher
    - On iSeries it can actually run quite nicely at 90%

CPU Stats Are There for Each Task

• Platform.Process.ActiveDomino.TotalCpuUtil
  - Gives you the big picture of how Domino is using processors
• There is a Platform.Process.$$$.PctCpuUtil stat for each task you run on your Domino servers, where $$$ stands for the task name
  - Platform.Process.Amgr.PctCpuUtil
  - Platform.Process.Router.PctCpuUtil
  - Platform.Process.Process.PctCpuUtil
  - … And so on

Using These Stats

• You might find that the Agent Manager is the biggest hog because of users’ personal agents!
  - You could move busy user agents to a different server
• These stats don’t show in the Lotus version of Statrep
  - But they are in the Technotics85Statrep.NTF version
    - You can download it from my blog: www.andypedisich.com

Why Wouldn’t the Failover Replica Be Up to Date?

• When the primary server is down, users are directed to a replica on a failover server
  - But sometimes that replica is not up to date
• Cluster replication keeps the primary server in sync with the failover server
  - It’s an event-driven process that occurs automatically when a change is made to a database
  - Changes to a database are pushed to the replica on the failover server
  - Deletion stubs are not replicated
    - That’s why you also need a scheduled replication doc between servers in a cluster
• It’s vital that these replicas are synchronized
  - But by default, clusters only have 1 cluster replicator task

Not Now … I’m Too Busy

• Occasionally, there is too much data changing to be replicated efficiently by a single cluster replicator
  - If cluster replicators are too busy, replication is queued until more resources are available
    - Your databases get out of sync and stale
• Adding a cluster replicator will help fix this problem
  - Use this parameter in the Notes.ini
    - CLUSTER_REPLICATORS=#
• But how do you tell if there’s a potential problem?
  - Adding too many cluster replicators will have a negative effect on server performance

Key Stats for Vital Information About Cluster Replication

Statistic | What It Tells You | Acceptable Values
Replica.Cluster.SecondsOnQueue | Total seconds that last DB replicated spent on work queue | < 15 sec: light load; > 30 sec: heavy
Replica.Cluster.SecondsOnQueue.Avg | Average seconds a DB spent on work queue | Use for trending
Replica.Cluster.SecondsOnQueue.Max | Maximum seconds a DB spent on work queue | Use for trending
Replica.Cluster.WorkQueueDepth | Current number of databases awaiting cluster replication | Usually zero
Replica.Cluster.WorkQueueDepth.Avg | Average work queue depth since the server started | Use for trending
Replica.Cluster.WorkQueueDepth.Max | Maximum work queue depth since the server started | Use for trending

What to Do About Stats Over the Limit

• Acceptable Replica.Cluster.SecondsOnQueue
  - Queue is checked every 15 seconds
  - Under light load, should be less than 15 seconds
    - Under heavy load, if the number is larger than 30, another cluster replicator should be added
• If the above statistic is low, and Replica.Cluster.WorkQueueDepth is constantly higher than 10
  - Perhaps your network bandwidth is too low
    - Consider setting up a private LAN for cluster replication traffic
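The decision rules on this slide can be written down as a tiny check. This Python sketch is one possible reading, not a Domino feature: I treat "low" as under 15 seconds and "constantly higher than 10" as every collected sample being above 10, and both of those interpretations, along with the function name, are my assumptions.

```python
def assess_cluster_replication(seconds_on_queue: float, depth_samples: list[float]) -> str:
    """Suggest an action from SecondsOnQueue and a series of WorkQueueDepth samples."""
    if seconds_on_queue > 30:
        return "add a cluster replicator"
    constantly_deep = bool(depth_samples) and all(d > 10 for d in depth_samples)
    if seconds_on_queue < 15 and constantly_deep:
        return "check network bandwidth"
    return "ok"


# Made-up readings: a slow queue, a deep but fast queue, and a healthy cluster.
print(assess_cluster_replication(45, [0, 2]))
print(assess_cluster_replication(5, [12, 14, 11]))
print(assess_cluster_replication(5, [0, 12, 0]))
```

Feed it the SecondsOnQueue value from the latest collection and the WorkQueueDepth values from the last several collections, so a single spike does not trigger a change.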

Stats That Have Meaning but Have Gone Missing

• There aren’t any views in the Lotus version of Statrep that let you see these important statistics
  - Matter of fact, the Clusters view is pretty worthless

Stats That Have Meaning but Have Gone Missing (cont.)

• But there is a view like that in the Technotics85Statrep.ntf
  - It’s just a download from my blog
  - It shows the key stats you need
    - To help track and adjust your clusters

My Column Additions to Statrep

• This slide explains the formulas I used in the view
  - The important thing is that I convert seconds to minutes
    - You are shown the major delays

Column Title | Formula | Formatting
Min on Q | Replica.Cluster.SecondsOnQueue / 60 | Fixed (One Decimal Place)
Min/Q Av | Replica.Cluster.SecondsOnQueue.Avg / 60 | Fixed (One Decimal Place)
Min/Q Mx | Replica.Cluster.SecondsOnQueue.Max / 60 | Fixed (One Decimal Place)
WkrDpth | Replica.Cluster.WorkQueueDepth | General
WD Av | Replica.Cluster.WorkQueueDepth.Avg | General
WD Mx | Replica.Cluster.WorkQueueDepth.Max | General


Demonstration: Looking at Technotics85Statrep.ntf

Demo

Event Monitoring Depends on Events4

• We want to monitor all of our servers and be notified when certain conditions occur
  - We will specify what we want to watch for and how to be notified
  - We will use the Events4 database to configure all of this
• The Events4 database must have the same replica ID on every server in your domain
  - I have found many, many cases where Events4 was not the same replica everywhere in the domain
  - That ruins the monitoring architecture
    - The monitoring configuration, and the config for alerts and notifications, can’t replicate to some servers

We Know What the Replica ID Should Be for Events4

• The replica IDs of system databases such as Events4 are derived from the replica ID of the address book

  Database      Replica ID
  NAMES.NSF     852564AC:004EBCCF
  CATALOG.NSF   852564AC:014EBCCF
  EVENTS4.NSF   852564AC:024EBCCF
  ADMIN4.NSF    852564AC:034EBCCF

  - Notice that the first two digits after the colon for the Events4.nsf replica are 02
  - Determine your address book’s replica ID, and you’ll know the replica ID of Events4
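The pattern in that table is simple enough to compute. Below is a Python sketch that follows the table exactly as shown on this slide: it swaps the first two hex digits after the colon for the database’s index. The function name and the index map are my own labels, and the sketch assumes your address book’s replica ID has the same XXXXXXXX:XXXXXXXX shape as the example.

```python
# Index of each system database, as read off the table on this slide.
SYSTEM_DB_INDEX = {"NAMES.NSF": 0, "CATALOG.NSF": 1, "EVENTS4.NSF": 2, "ADMIN4.NSF": 3}


def derived_replica_id(names_replica_id: str, database: str) -> str:
    """Derive a system database's expected replica ID from that of NAMES.NSF."""
    head, sep, tail = names_replica_id.partition(":")
    if not sep or len(tail) != 8:
        raise ValueError("expected a replica ID like 852564AC:004EBCCF")
    index = SYSTEM_DB_INDEX[database.upper()]
    # Replace the first two hex digits after the colon with the database index.
    return f"{head}:{index:02X}{tail[2:]}"


print(derived_replica_id("852564AC:004EBCCF", "events4.nsf"))
```

Compare the result with what you actually see in the database properties of Events4 on each server.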

Verify Events4

• You must verify that every server has the same replica of Events4
  - You can find this info in the catalog, if your catalog architecture is getting file info from all servers
  - Or you need to go to every server and open Events4
    - It’s vital that you validate Events4
• If it is not right on a server, then you must down the server, delete Events4, and restart your server
  - The correct Events4 will be re-created automatically
• Make sure that EVENTS4.NSF is the same replica ID throughout the domain by opening a copy from every server and putting it on your desktop
  - Here’s some code to help you do that

Add a Button to Your Toolbar

• Add this code to a button on your toolbar
  - This is courtesy of Thomas Bahn
    - He’s a smart guy, nice guy, and sometimes brings chocolates to his friends from Europe
    - www.assono.de/blog

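REM {Point at names.nsf on the current user's mail server};
_names := @Subset(@MailDbName; 1) : "names.nsf";

REM {Let the user pick servers from the Servers view; column 3 is returned for each selection};
_servers := @PickList([Custom]; _names; "Servers"; "Select servers"; "Select servers to add database from"; 3);

REM {Ask for the database file name, defaulting to log.nsf};
_db := @Prompt([OkCancelEdit]; "Enter database"; "Enter the file name and path of the database to add."; "log.nsf");

REM {Add the database from each selected server to the desktop};
@For( n := 1; n <= @Elements(_servers); n := n + 1; @Command([AddDatabase]; _servers[n] : _db) )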

Add a Database Icon from All Servers to the Desktop

• This code will prompt you to pick the servers that have the database you want on your desktop
  - Then it will prompt for the name of the database
    - And open it on all the servers you’ve selected
• Use it to make sure all the EVENTS4.NSF are the same replica in your domain
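Once you have written down the replica ID you see on each server, spotting the odd ones out is a one-liner. This Python sketch assumes you collected the IDs by hand (or from the catalog) into a dictionary; the server names are borrowed from the London/Tokyo example, the third server and the mismatched ID are invented, and the function name is my own.

```python
def mismatched_servers(replica_ids: dict[str, str], expected: str) -> list[str]:
    """Return the servers whose Events4 replica ID differs from the expected one."""
    return sorted(
        server
        for server, replica_id in replica_ids.items()
        if replica_id.strip().upper() != expected.strip().upper()
    )


# Hypothetical survey of Events4 replica IDs, one entry per server.
observed = {
    "LonAdmin1": "852564AC:024EBCCF",
    "TokHub01": "852564ac:024ebccf",
    "ExampleMail07": "C1257B3A:0041F2D0",  # made-up ID standing in for a stray Events4
}
print(mismatched_servers(observed, "852564AC:024EBCCF"))
```

Any server the function returns is a candidate for the down, delete, and restart procedure described earlier.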

Now, on to Event Monitoring

• Domino can monitor for just about any condition:
  - It can watch for a statistical threshold
    - Free disk space under a certain value
    - Mail.waiting over a certain value
  - It could be some non-statistical event in the log
    - An agent that doesn’t have enough time to run
      - And might be in a loop
    - A corruption problem with a database that is preventing replication
    - A user connecting with an unsupported version of Notes

What Happens Then?

• When a certain statistical condition or log entry occurs, Domino can do a bunch of different things
  - Capture and store the event in a database
  - Notify someone that the event happened
  - Log the event to a Tivoli console
    - And lots more that we will discuss in a few minutes
• For now, let’s focus on capturing the events

Event Monitoring

• Event monitors of all types are set in the Events4 database
• Two broad categories of events:
  - Event handlers
    - Specify the action that Domino takes when a specific event occurs
  - Event generators
    - Each type of event generator has a view that provides a list of all event generators, plus additional configuration information

Event Generators

• We’ll look at event generators first
  - They deal with specific Notes/Domino issues
• There are six types of event generators:
  - Database Event Generator
  - Domino Server Response Event Generator
  - Mail Routing Event Generator
  - Statistic Event Generator
  - Task Status Event Generator
  - TCP Server Event Generator
• Some are used more than others
  - We’ll stick to the more popular ones

Database Event Generator

• Use Database Event Generators to monitor:
  - Database activity
  - Free space
  - Frequency and success of database replication
  - ACLs
    - And get reports on ACL changes
      - Including those made by replication or an API program
• Monitor specific servers or every server in the domain

Here’s One That Everyone Should Use

• The ACL of Names.nsf should be monitored for changes in every Notes domain
  - Once properly set, the ACL of Names.nsf should rarely change!
    - All kinds of bells and whistles should go off when it does
    - Remember, we’ll talk notification in a moment
• Here’s how to set up the monitoring of the ACL
  - Select New Database Event Generator

Here’s One That Everyone Should Use (cont.)

• Select Names.nsf
  - You can choose either a single server, such as the administration server for the address book, OR
  - All servers in the domain
• I like to pick all servers in the domain
  - Admins won’t get away with anything!
  - But I do get a storm of messages when an ACL change occurs
    - Every server tells me about the change

Monitoring Replication

• Replication monitoring is somewhat useful
  - You can set a time interval in which you expect some replication to occur
  - Just remember that it will report no replication occurred even if there was nothing to replicate
    - This can be confusing since it might produce a report that looks like an error occurred even though nothing is wrong

Other Database Event Generators

• The unused space and user inactivity monitors might have value in very specific situations
  - You can run compact, but who wants compact running anytime?
    - And compact is generally run on a schedule anyway
  - You can be notified when a DB is not used
    - But activity logging is much better at this because it can deal with all databases on all servers

Server Response Generator

• Domino Server Response Event Generator
  - Checks connectivity/port status of server’s network
• One server checks others by sending a probe
  - It’s a good idea to try opening Names.nsf
    - If you can’t open Names.nsf, then something is wrong!
• Set interval for checking Names.nsf (default is 3 minutes)
• Set response time tolerance (default is 1,000 msecs, or one second)
  - These will both depend on your own environment

More About Probes

• The default response time is a bit on the harsh side
  - If left at one second, you’ll get lots of notifications
    - You should make it ten seconds or whatever the metrics in your Service Level Agreement (SLA) require
• Also, be careful which servers you choose to probe other servers
  - Try to pick probing servers that are in the same LAN as the probed servers
  - Otherwise, your probing will be testing network latency rather than the servers themselves

Mail Routing

• Mail Routing Event Generator
  - Sends a mail-trace message to a particular user’s mail server
  - Gathers statistics indicating the amount of time, in seconds, it takes to deliver the message
  - Great for troubleshooting
    - Generally not used day to day

Statistic Event Generators

• Statistic Event Generators monitor a specific Domino or platform statistic
  - They can let you know when a stat goes over a particular threshold
  - These stat event generators are extremely valuable
    - Smart administrators use them every day!

Default Settings for Stats Event Generator

• Many are set by default for all servers in the domain
  - Review these to see if they apply to your enterprise

Task Status

• Task Status Event Generator is another interesting troubleshooting tool
  - It monitors the status of the Domino server and add-in tasks

TCP Server Events

• The TCP Server Event Generator verifies the availability of Internet ports (TCP services) on servers
  - This also needs ISpy to work; put this in ServerTasks=
    - Load runjava ISpy
    - Case sensitive!
• A valuable concept for some servers
  - But not widely used

Checks the Ports for You

• It generates a statistic indicating the amount of time, in milliseconds, it takes to verify that the server is responding on the specified port
  - Each port you select has a tab where you can sometimes set special characteristics about the probe


Event Handlers — My Favorite!

• We have worked pretty hard to get to this point:
  - Understood how statistics are generated
  - Identified stats important to the stability and performance of servers
  - Set up a statistic collection infrastructure
• Now we have the moment of truth
  - The event handlers!


Event Handlers — My Friends

• An Event Handler defines the action that Domino takes when a specific event occurs
  - Choosing the right action is critical to your organization
    - Some serious events should cause a page to be sent to the person on call
    - Other events might merely cause an email to be sent
  - It all depends on what’s important to the business
• We’ll talk about how you are notified in a moment
  - First, let’s review the awesome power of the Event Handler

Event Handling Options

• Just like event generators, you can include all servers in the domain or just a few
  - This lets you target servers with “issues”

Getting Trigger Happy

• The notification trigger is where it’s at:
  - Any event matching a criteria
    - A wide-open trigger for any problem, statistical or something that just shows in the log
  - A built-in or add-in task event
    - Looks for an event generated by a Domino task
  - A custom event generator
    - An event generator that you created

Event Selection Criteria

• You can select:
  - A certain type of event
  - Different types of severities
  - Or track a particular message that is appearing in the log


Demonstration: Working with Event Generators

Demo

Notification Method Selection

• You can choose from a wide variety of notification methods
  - Some are better than others
• And you can easily enable, disable, or select a time span for notification
  - Perhaps only page certain numbers overnight but all admins during the day

Notification Options

Method | Result
Broadcast | Reports the event to all users logged onto the server or to a specified group of users
Log to Database | Logs the event to a database, typically STATREP.NSF
Mail | Mails the event to a person or to a mail-in database
Log to NT Event Viewer | Reports the event to the Windows NT Event Viewer
Pager | Uses the mail address of an alphanumeric pager
Relay | Relays the event to another server that is in the same Domino domain and that runs a common protocol
Run an Agent | Runs a specified agent based on the configured Event Handler. Use this method to resolve an issue without user intervention. You specify the agent name, the server and database containing the agent, and any parameter to pass to the agent.

Notification Options (cont.)

Method | Result
Run Program | Runs an add-in program or specified command to correct problems automatically
Send a Console Command to the Server | Sends a console command, or commands, to the server according to the Event Handler that was configured. You can specify the server console commands to run.
Sound | Sounds an alarm on the designated server when the event occurs
UNIXLog | Reports the event to the UNIX system log

These Two Are the Best Ones to Use

Method | Result | Comments
SNMP Trap | Sends the event as an SNMP trap. Select this method only if the specified server is running the Event Interceptor task and the Domino SNMP Agent. | This is truly an ideal notification method because it does not depend on Notes protocols actually working
Forward event to Tivoli Event Console | Allows the Tivoli Enterprise Console (TEC) to receive IBM Domino events and reformat them as TEC events. The reformatted TEC event is then sent to the TEC server that you specify in the Configuration Settings document. | Check with the Tivoli team to see if it’s possible to use this in your environment

Demonstration: Event Notification

Demo

Notification Methods Pros and Cons

• Any notification method that involves Notes mail has limitations
  - If the Notes mail system is down, you won’t get notified
    - You especially won’t be notified about the mail system being down or a router that has hung
• Do not use a configuration where the server is to let you know via email when mail is backed up
  - The message that is being sent to you will be placed in the queue
  - You won’t know about the problem until it is too late

Issue

91

Paging Dr. Howard, Dr. Fine, Dr. Howard …

• A paging notification is a good choice
  But not if you are paging through a third-party phone system like Verizon or AT&T
    They generally require an email to be sent (see previous slide)
    They have no Service Level Agreement – NONE!
• Sadly, due to budget and resource constraints, we generally see these two methods, mail and paging, used the most in production environments

Caution

92

The Best Notification Method … Also the Most Complicated

• The best notification methodology is to go outside of Notes protocols to SNMP or a similar external source
  SNMP is Simple Network Management Protocol
  There are SNMP agents that must be started on an OS level
    They are different for every major platform
    There are special considerations for partitioned servers
• It is a complicated solution, but once in place, it has extreme value
  And you never have to rely on an application layer solution again
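To make the "outside of Notes protocols" idea concrete, here is a minimal sketch of an external check. It is not SNMP and not a Domino feature: it is a stand-in that only tests whether a TCP port answers, run from a machine that does not depend on Notes mail. The function names are hypothetical; 1352 is the standard NRPC port and 25 the standard SMTP port.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(host, services, timeout=3.0):
    """Check each named port on host and return {service name: reachable?}."""
    return {name: port_is_open(host, port, timeout)
            for name, port in services.items()}


# Example call (hypothetical host name):
# check_services("domino01.example.com", {"NRPC": 1352, "SMTP": 25})
```

A check like this only tells you that something is listening, not that the router is healthy, so treat it as a complement to SNMP rather than a replacement.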

93

Log or Relay No Matter What Else You Do

• If an event is worth tracking, it should always be placed in a STATREP.NSF
  You can use the LOG option, which is used when you want to capture the event in each server’s Statrep
  You can use the RELAY option to send to a Statrep that is centrally located

94

Some Tricks of the Trade

• When problems occur, they are almost always in the server log
  That means you can catch them with an Event Handler
• A great way to do this is to look for specific text in the message
  That makes it very flexible
• Log the results into a separate database to make analysis and investigation easier
  And you can create multiple events to take multiple actions if necessary
    Let’s look at a couple of examples of this

95

A Good Example of Looking for Text

• When someone enables Full Access Administrator, a message shows up in the server log
  You’ll definitely want to audit this when it occurs
    If it’s in the log, that means you can grab it

96

A Good Example of Looking for Text (cont.)

• If you wanted to be notified every time someone turns on Full Access Administrator, you could look for the following string
  “full administrator access”
  Set up a notification to log to Statrep
    And another notification to mail it to you so you always know who is using this powerful privilege
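The idea of matching on message text can be sketched outside of Domino in a few lines. This is only an illustration of the matching logic, not how the Event Handler is implemented. The sample log lines are invented, and the case-insensitive comparison is an assumption made for the sketch.

```python
def find_matches(log_lines, needle):
    """Return the log lines that contain needle, ignoring case."""
    needle = needle.lower()
    return [line for line in log_lines if needle in line.lower()]


# Hypothetical console log lines, invented for illustration only.
sample_log = [
    "01/15/2013 09:12:44 Router: message delivered",
    "01/15/2013 09:13:02 Jane Admin/Acme was granted full administrator access",
    "01/15/2013 09:14:10 Database compaction finished",
]

hits = find_matches(sample_log, "full administrator access")
```

Each hit would then drive one or more actions, such as logging to Statrep and sending mail.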

97

Is Your SMTP Server Under Relay Attack?

• If you’re interested in the safety of your SMTP server, you might want to know when bad guys attempt to use it as an “open relay”
  When that happens, you’ll see something like this in the server log
  Remember, if it’s in the log, that means you can grab it

98

Just Log It – But in a Special Database

• In this case, you don’t really want to be notified right away
  You just want to know when it happened
  To make it easier to analyze, place the logged entries into a separate database

Don't Forget

99



100

Why DDM Is Awesome

• DDM is a single location where administrators can access issues that are affecting multiple servers and databases
  The DDM database is a central repository of all monitoring data
    Data collected by probes that you can configure
    Result messages from event generators that you configured

101

Do All Administrators Use DDM?

• Many administrators don’t use the potential of DDM as much as they should
  Already overwhelmed by the monitoring features of Domino
  Don’t understand how DDM fits into the architecture
• Some Administrators just have Probe-Aphobia
  But you don’t have to use probes to use DDM
    Probes are not a required part of DDM
    They are nice to have and fun to use, but DDM functions without them
• You can get started without probes
  Then add them into the configuration when you become more familiar with how DDM works

102

The Big Relationships in Monitoring

• EVENTS4.NSF – the Monitoring Configuration database is a key file in your monitoring infrastructure
  It also contains all of the specifics for your DDM monitoring configuration
    For DDM probes
    For the DDM collection hierarchy – which must be set by you

103

We Know What the Replica ID Should Be for EVENTS4

• The replica ID of system databases, such as EVENTS4 and DDM.NSF, is derived from the replica ID of your domain’s address book

Database | Replica ID
NAMES.NSF | 852564AC:004EBCCF
CATALOG.NSF | 852564AC:014EBCCF
EVENTS4.NSF | 852564AC:024EBCCF
ADMIN4.NSF | 852564AC:034EBCCF
DDM.NSF | 852564AC:0A4EBCCF

• Notice the first two characters after the colon: 02 for EVENTS4.NSF and 0A for DDM.NSF
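The pattern in the table above can be written down as a small helper. This is a sketch that assumes the rule is exactly what the table shows: only the first two characters after the colon change. The function name is hypothetical, and the prefixes are copied from the table.

```python
# Two-character prefixes copied from the replica ID table on this slide.
PREFIXES = {
    "NAMES.NSF": "00",
    "CATALOG.NSF": "01",
    "EVENTS4.NSF": "02",
    "ADMIN4.NSF": "03",
    "DDM.NSF": "0A",
}


def expected_replica_ids(names_replica_id):
    """Derive the expected system database replica IDs from the NAMES.NSF one."""
    left, sep, right = names_replica_id.upper().partition(":")
    if sep != ":" or len(left) != 8 or len(right) != 8:
        raise ValueError("expected a replica ID like 852564AC:004EBCCF")
    # Keep everything except the first two characters after the colon.
    return {db: f"{left}:{prefix}{right[2:]}" for db, prefix in PREFIXES.items()}
```

Comparing the output for your own address book against what is actually on each server is a quick way to spot a database that was created rather than replicated.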

104

Errors You Might See If DDM.NSF Is Not Right

• If there is a DDM.NSF on every server but they aren’t all the same replica ID, you’ll see the following error on the console every couple of minutes:
  Unable to replicate with server Server2: None of the selected databases have a replica on the server
  You’ll get the error even if there is no connection document
  You’ll get this error even if there is a connection document and you have a much longer replication interval scheduled
• To fix problems related to EVENTS4.NSF and DDM.NSF replica IDs, you must delete the bad DDM databases and restart the server
  DDM.NSF will be recreated automatically
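If you have collected the DDM.NSF replica ID from each server by whatever means you prefer, finding the odd ones out is a simple comparison. This is a sketch; the function name, server names, and the second replica ID are made up for illustration.

```python
def find_mismatched_servers(replica_ids_by_server, expected_id):
    """Return the servers whose replica ID differs from expected_id."""
    expected = expected_id.upper()
    return sorted(
        server
        for server, replica_id in replica_ids_by_server.items()
        if replica_id.upper() != expected
    )


# Hypothetical inventory: Server2 has a DDM.NSF that is not a replica.
inventory = {
    "Server1": "852564AC:0A4EBCCF",
    "Server2": "85257B11:0012ABCD",
    "Server3": "852564ac:0a4ebccf",
}
bad_servers = find_mismatched_servers(inventory, "852564AC:0A4EBCCF")
```

The servers returned are the ones where the bad DDM.NSF would need to be deleted before restarting.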

105



106

Configure DDM for Centralized Data Collection

• DDM.NSF has most value when it’s a centrally located repository
  It will contain all of the issues that come from all of the servers
• This does not happen on its own
  There is no collection hierarchy set up by default
    Each server collects its own DDM data in its own DDM.NSF
• If your DDM hierarchy looks like the one below, you need to set it up

107

Collection Hierarchy Is a Must

• Without a collection hierarchy, DDM probes run on a server and report events to the DDM.NSF that is on that server
  Then they remain only on that server’s replica of DDM.NSF
• You have to check the DDM database on each server to evaluate problems and discover potential issues
  This is time consuming and is contrary to the design
    It reduces time you could be spending solving problems
    And it’s a big pain!
      Which means you’ll never use it

108

Aggregate Data Centrally

• A DDM server collection hierarchy lets you aggregate the data onto a key server or servers
  This must be configured in the EVENTS4.NSF
• The simplest hierarchy is to configure one server to collect from all servers in the domain
  I totally recommend this to get you started

109

More Complex Scenarios Are Possible

• Perhaps as you become more familiar with DDM, you’ll want to roll up some data regionally
  So that regional administrators receive only information that is pertinent to the server they maintain

110

Rolling Up the Data

• DDM data rollup propagates the probe results up the DDM server collection hierarchy
  Data rollup is accomplished using Domino’s selective replication to transport the data
  The replication formulas are created automatically when you define your DDM server collection hierarchy
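One way to reason about a hierarchy before you build it is to model it as a simple mapping from each collecting server to the servers it collects from. The sketch below is a planning aid only, not anything Domino runs; the function and server names are hypothetical.

```python
def servers_rolled_up_to(collector, hierarchy):
    """Return every server whose DDM data eventually reaches collector."""
    seen = set()
    stack = list(hierarchy.get(collector, []))
    while stack:
        server = stack.pop()
        if server not in seen:
            seen.add(server)
            # A monitored server may itself collect from other servers.
            stack.extend(hierarchy.get(server, []))
    return seen


# Hypothetical two-level hierarchy with regional collectors.
hierarchy = {
    "Hub1": ["EastHub", "WestHub"],
    "EastHub": ["Mail1", "Mail2"],
    "WestHub": ["Mail3"],
}
```

With this layout, an administrator looking at EastHub sees only Mail1 and Mail2, while Hub1 sees everything.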

111

Hierarchy Collection Interval

• The DDM system sets up its own collection interval
  Collection replication occurs about every five minutes
    This interval cannot be modified
    It is not controlled through connection documents
• Every five minutes, each collection server uses pull replication to get updates from the DDM database on each monitored server

112



113

Address Issues by Severity Level

• Looking at issues by severity gives you the chance to deal with the most important issues first
  They are broken out by severity category
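If you export events for your own reporting, the same "worst first" ordering is easy to reproduce. This sketch assumes the usual Domino severity names (Fatal, Failure, Warning (High), Warning (Low), Normal) and a hypothetical event layout with a "severity" key.

```python
# Assumed severity names, most severe first.
SEVERITY_ORDER = ["Fatal", "Failure", "Warning (High)", "Warning (Low)", "Normal"]


def sort_by_severity(events):
    """Return events ordered from most to least severe; unknown severities last."""
    rank = {name: position for position, name in enumerate(SEVERITY_ORDER)}
    return sorted(events, key=lambda e: rank.get(e["severity"], len(SEVERITY_ORDER)))
```

Because the sort is stable, events of equal severity keep the order they arrived in.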

114

Another Helpful View

• Release 8 added a new view to DDM.NSF
  You can see issues by database name
    This lets you determine whether a problem is happening on just one server or on every copy of the database in the domain
    Very handy information when problem solving
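The question this view answers, "one server or every copy?", comes down to grouping events by database. A minimal sketch, assuming a hypothetical event layout with "database" and "server" keys:

```python
from collections import defaultdict


def servers_by_database(events):
    """Map each database name to the set of servers reporting issues for it."""
    grouped = defaultdict(set)
    for event in events:
        grouped[event["database"]].add(event["server"])
    return dict(grouped)
```

A database that shows up with a single server points to a local problem; one that shows up with every server points to the design or the data.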

115

Simplest Way to Use DDM

• I consider the By Date view to be very helpful
  Assign a junior Admin to check DDM events every day
    Have the Admin go through all the events and fix problems

116

Working with a DDM Event

• It’s a great monitoring tool because it smartly tells you
  The task that reported it
  The severity and type
    And it would be pretty good even if it just did that

117

Working with a DDM Event (cont.)

• It also suggests
  Probable cause, possible solution
  And very often offers a corrective action
    Some of which are automated routines

118

Each DDM Event Has Common Actions

• Such as
  Open the server document or server log
  View the server’s NOTES.INI
  Open a remote console
  Open the DB with the Designer client
    And other actions depending on the error that occurred

119

DDM Has Powerful Probes

• A probe is the investigative component of DDM
• Probes:
  Need configuration to be useful
  Are configurable by administrators
• A probe is an action configured to run against one or more servers, databases, and services
  A probe returns its status and results to the Domino Domain Monitoring Database – DDM.NSF

120

Analysis Probes

• Configure DDM using probe documents in EVENTS4.NSF
  Otherwise known as the Monitoring Configuration database
• You can create multiple probes for each feature area
  And you can individually configure each probe to run:
    Selective checks
    Against specific servers and/or databases
    At specific times

121

What’s in These Probe Documents?

• Probe type and probe subtype
  For example, Security is a probe type
    One of its probe subtypes is Best Practices
    This combination of probe type and probe subtype creates a Security probe

122

Extra Information About the Probe Is Provided

• These probe documents also contain a general description of the probe, its purpose, and its intended use

123

Configuration Specifics, Too

• Documents can also specify configurable probe targets
  The server(s) that will run the probe
  And in some cases, the servers, database, etc., that the probe runs against
  Where it’s applicable, there is also configurable scheduling information – but not for all probe types

There’s More Inside

• Probe documents can also hold configuration specifics
  What the probe monitors
  What it should report on
  Thresholds to watch for
    And what type of severity those thresholds represent
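To pull the last few slides together, here is a sketch of the kind of information a probe document carries, written as a plain data structure. The field names are hypothetical and are not the actual field names in EVENTS4.NSF; the threshold lookup is one simple way to map a reading to a severity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProbeDocument:
    probe_type: str                       # for example "Security"
    probe_subtype: str                    # for example "Best Practices"
    description: str = ""
    run_on_servers: list = field(default_factory=list)
    targets: list = field(default_factory=list)    # servers or databases probed
    schedule: Optional[str] = None        # None when the probe type has no schedule
    thresholds: dict = field(default_factory=dict)  # threshold value -> severity

    def severity_for(self, value):
        """Return the severity of the highest threshold that value meets, or None."""
        met = [limit for limit in self.thresholds if value >= limit]
        return self.thresholds[max(met)] if met else None
```

For instance, a probe with thresholds of 30 and 60 would report the lower severity for a reading of 45 and the higher one for 90.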

125

Plenty of Cool Probes

• R8 gives us 58 default DDM probes to work with
  R7 gives us 48 – still plenty to get us started
• You can get probing as soon as R7/R8 is up
• Just plug in your server info to get DDM started
  You can also create new probe documents
    Define and customize your own probes

126

Many Types of Probes

• There are ten major types of probes in R8, nine in R7
  These probes can run two different ways:
    On a schedule that you specify
    As an active monitor of things that happen in the domain
• Some probes can run either way
  On a schedule or as a monitor
    It depends on what you ask them to do
  Some can only run as a monitor or on a schedule

127

Establishing a Schedule

• Scheduled probes can be controlled with great granularity
  Set the probe to run:
    Daily, Weekly, Monthly
    Beyond that, specific schedule settings can vary from probe type to probe type

128

Don’t Worry About Getting Off Schedule

• If a Weekly/Monthly probe is missed, you can specify how you want the probe to be handled:
  Ignore it completely
  Run the missed probe on startup
  Run the missed probe at the next time range
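The three choices can be thought of as a small policy. This sketch only illustrates the decision; the enum, the function, and the "startup" and "time_range" labels are hypothetical and do not come from Domino.

```python
from enum import Enum


class MissedProbePolicy(Enum):
    IGNORE = "ignore it completely"
    RUN_ON_STARTUP = "run the missed probe on startup"
    RUN_AT_NEXT_TIME_RANGE = "run the missed probe at the next time range"


def should_run_missed_probe(policy, moment):
    """Decide whether a missed probe runs now; moment is 'startup' or 'time_range'."""
    if policy is MissedProbePolicy.RUN_ON_STARTUP:
        return moment == "startup"
    if policy is MissedProbePolicy.RUN_AT_NEXT_TIME_RANGE:
        return moment == "time_range"
    return False  # IGNORE: the missed run is simply dropped
```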

129

Zeroing in on Probes

• We’re going to focus on two probes that have a high value in almost every Domino domain:
  Application probes
  Security probes

130

Agents Are Tracked by Application Probes

• Application probes monitor agents in real time
• Agents behind schedule
  Detects when an agent starts after its scheduled time
• Long-running agents
• Agents ranked by CPU usage
  Evaluates the CPU usage for agents executed by Agent Manager or HTTP
    These have a relatively high overhead

131

Agents Are Tracked by Application Probes (cont.)

• Agents ranked by memory usage
  Evaluates the memory usage of agents executed by the Agent Manager or HTTP tasks
    Note that evaluation results for the same agent may differ depending on whether the agent runs in Agent Manager or HTTP
    Also, results from this probe can depend on HTTP settings
• Long-running agents
  Detects agents that run longer than a time you specify
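The two time-based checks, behind schedule and long-running, boil down to simple comparisons. The sketch below shows the logic only; it is not how the probes are implemented, and the function names and the tolerance parameter are assumptions.

```python
from datetime import datetime, timedelta


def behind_schedule(scheduled, started, tolerance=timedelta(0)):
    """True if the agent started later than its scheduled time plus tolerance."""
    return started - scheduled > tolerance


def long_running(started, now, limit):
    """True if the agent has been running longer than the limit you specify."""
    return now - started > limit
```

For example, an agent scheduled for 9:00 that starts at 9:07 is behind schedule with no tolerance, but not with a ten-minute tolerance.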

132

The Five Security Check Probes

• Security probes assess the overall security of servers and databases in your domain
  Best Practices
    Compares a set of baseline security configuration settings to the same settings in a domain
  Configuration
    Compares settings in a specific Server document to settings in a specified “good” Server doc
      This doc can be real or built by you as an example
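Comparing a server against a known-good baseline is essentially a dictionary diff. A minimal sketch of that idea, with hypothetical setting names rather than real Server document fields:

```python
def compare_to_baseline(baseline, actual):
    """Return {setting: (expected, found)} for every setting that differs."""
    return {
        setting: (expected, actual.get(setting))
        for setting, expected in baseline.items()
        if actual.get(setting) != expected
    }
```

Settings that are missing from the server being checked show up with a found value of None.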

133

The Five Security Check Probes (cont.)

• Database ACL
  Monitors the access control privileges that groups and individuals have in specified databases
    You designate the acceptable access levels on the Specifics tab
• Database Review
  Reviews the security properties for a specified database
    Generates a report on probe findings
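The Database ACL check can be pictured as comparing each entry against the highest level you consider acceptable. This is a sketch of that comparison only; the function name and the sample ACL are hypothetical, while the access level names are the standard Notes ACL levels.

```python
# Standard Notes ACL levels, lowest to highest.
ACL_LEVELS = ["No Access", "Depositor", "Reader", "Author",
              "Editor", "Designer", "Manager"]


def acl_violations(acl, max_allowed):
    """Return the ACL entries whose level is higher than max_allowed."""
    limit = ACL_LEVELS.index(max_allowed)
    return sorted(
        name for name, level in acl.items()
        if ACL_LEVELS.index(level) > limit
    )
```

An entry listed here is not necessarily wrong, but it is one you should be able to explain.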

134

The Five Security Check Probes (cont.)
• Security Review
  Generates a report on the security settings specified in the Specifics tab of the probe document
    You have the option of selecting the “Directory Profile Note” and the “Security Settings in the Server Configuration Document”
    And a review of all security settings in a Server doc
  This can really help to tighten your domain’s security

135

Using the Assign Button

• You can assign the event to a team member and add comments about the task using the “Assign” button
  Or you can simply assign the event to yourself

136

Changes Are Tracked

• All changes you make to the event are tracked in the Event Change History for easy reference
  Finally, there is an easy, built-in process for tracking problem resolution in your environment

137

Demo

138



139

Some of My Other Statistics Sessions to Consider

• It’s like an extension of this Jumpstart …
  Advanced server monitoring and alert notifications
    Friday morning at 9:45 am
• Don’t miss my Hands-On Lab
  Drilling Down into Domino Statistics
    Wednesday 4:00 to 6:00 pm
    Thursday 1:30 to 3:30 pm (right, it’s not am)
• Session goes into detail about pulling statistical data into spreadsheets for analysis with pivot tables and graphics
  And you have the opportunity to get your hands dirty actually making the graphics yourself
  Hope to see you there!

140

Where to Find More Information
• www-1.ibm.com/support/docview.wss?uid=swg27007060
  “Lotus Education on Demand: Domino Domain Monitoring (DDM)” (IBM, 2010).
• www.ibm.com/developerworks/lotus/library/stats-linux/
  Joe Malek, “Lotus Domino Platform Statistics on Linux” (developerWorks, 2004).
• www-1.ibm.com/support/docview.wss?uid=swg21139259
  “Configuring Multiple Cluster Replicators on a Domino Server” (IBM, 2011).
• www-1.ibm.com/support/docview.wss?uid=swg21099635
  “Which Domino Server Databases Have Replica IDs Related to the NAMES.NSF?” (IBM, 2012).
• www.andypedisich.com
  Download presentations and technotics85Statrep.ntf

141

7 Key Points to Take Home

• Run the Collect task on servers located centrally
  Don’t run it on every server

• Let cluster statistics be your guide in determining the number of cluster replicators

• A great technique for problem solving is to capture log entries using Events4 and put them into a special Statrep for easy examination

• Make sure you have the correct replica of Events4 deployed and that it’s the same replica ID on every server

142

7 Key Points to Take Home (cont.)

• Be careful using server probes over a WAN, or you’ll end up testing the network rather than the servers

• Start with a flat DDM data collection hierarchy and make it more complex only if your requirements call for it

• Have new administrators check DDM every day and have them assign problems they can’t fix to senior admins

143

Your Turn!

How to contact me: Andy Pedisich

[email protected]

www.technotics.com
