Triggers - Adaptive Computing · adaptivecomputing.com/wp-content/media/pdf/Triggers_MoeS.pdf ·...
Triggers
Improving Availability Through Event-driven Automation
Sean Moe
18 September 2009
Agenda
Problem: Productivity & availability losses inherent in large-scale computing
Solution: Generic metrics, native resource managers and trigger technologies
Advantages: Benefits of Moab triggers vs. generic event managers
Tutorial: How do I create, manage, and utilize triggers?
Please write your questions down
Trigger Demonstration
Event
“Fire” Clap
Action
Problem
Productivity Losses & Resource Availability
Middleware Failures
User Inefficiencies
Hardware Failures
Partitioning Failures
Intra-job Inefficiencies
Environmental Inefficiencies
Need for Automation
Convenient in small resource pools
Necessary in large systems:
  More failures
  More user complaints
  Less time for administrators
Solution
Triggers: Automating Responses
1) Detect an event
2) Perform diagnostics
3) Execute action(s)
[Figure: trigger flow with the labels Temp > 60, Workload Can Be Moved, Email Admin, Shutdown Node]
Moab Hierarchy
[Figure: Moab above two Resource Managers]
Generic Metrics
Arbitrary information associated with resources and workload
Widespread Usage
Decisions can be made and reports can be generated based on site-specific environmental factors
Power fluctuations
Machine room temperature
Machine room chiller health
Power failures
Network connectivity
Network card failures
Hardware failures
CPU temperatures
Network file server status
Hard drive failures
Enabling Generic Metrics
# Example "temp.txt"
# Temperature output from various nodes
node001 GMETRIC[temp]=113
node002 GMETRIC[temp]=83
node003 GMETRIC[temp]=107
node004 GMETRIC[temp]=85

# moab.cfg
RMCFG[native] TYPE=NATIVE
RMCFG[native] CLUSTERQUERYURL=file://$TOOLS/temp.txt
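A file like temp.txt has to be refreshed by something outside Moab. The following is a minimal Python sketch of such a collector, not part of the original slides. The function read_temperatures() is a hypothetical stand-in for a site-specific sensor query; only the line format "<node> GMETRIC[<metric>]=<value>" is taken from the example above.

```python
import os


def read_temperatures():
    """Hypothetical stand-in for a site-specific sensor query."""
    return {"node001": 113, "node002": 83, "node003": 107, "node004": 85}


def format_gmetric_lines(readings, metric="temp"):
    """Render one '<node> GMETRIC[<metric>]=<value>' line per node, sorted by node name."""
    return [f"{node} GMETRIC[{metric}]={value}"
            for node, value in sorted(readings.items())]


def write_metrics_file(path, readings, metric="temp"):
    """Write to a sibling file first, then rename it over the target,
    so a reader never sees a half-written metrics file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        handle.write("\n".join(format_gmetric_lines(readings, metric)) + "\n")
    os.replace(tmp_path, path)


# Example call (the path is an assumption; use the one named in CLUSTERQUERYURL):
# write_metrics_file("temp.txt", read_temperatures())
```

Running the example call from cron or a similar scheduler would keep the file current; how often Moab re-reads it depends on the scheduler configuration.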
Where Triggers Come In
[Figure: Moab and Triggers, two Resource Managers, a Native Resource Manager, and Other GMetrics]
Advantages
Benefits of Triggers
Actions are event-based and independent of normal job scheduling
Attached to various scheduler objects
Inherit variable namespace from parent object
Can export/import data to/from other objects
Basis for dynamic workflow control
Trigger Dependencies & Dynamic Workflow
Trigger variables allow for complex dependency graphs
Multiple execution paths
Can rely on external dependencies
Other policy restrictions are also enforced
[Figure: dependency graph of triggers numbered 1 through 9]
Triggers vs. Other Event Managers (cron, Nagios, ...)
Access to global resource information
Integrated control over workload
Access to high level resource and workload management facilities
Intelligent workload-aware responses
Tutorial
Trigger Attributes
AType: Action Type, what type of action to perform
EType: Event Type, what event triggers this action
Action: usually a script or a Moab-related command
AType=exec EType=start Action="report.pl"
Advanced Trigger Attributes
Scheduling: BlockTime, ExpireTime, FailOffset, Interval, MaxRetry, MultiFire, Offset, Period, RearmTime, Timeout
Administration: Description, Flags, Name, Threshold
Dependencies: Requires, Sets, Unsets
Trigger Example #1
When a job is placed on hold, run a script whose first parameter is the job ID
AType=exec
EType=hold
Action="$TOOLS/held_job.pl $OID"
# moab.cfg
JOBCFG[DEFAULT] TRIGGER=AType=exec,EType=hold,Action="$TOOLS/held_job.pl $OID"
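The slides do not show the script itself. The following is a minimal Python sketch of what a script in the role of held_job.pl might look like. The file name held_job.py and the message wording are assumptions; the only detail taken from the example is that the job ID arrives as the first parameter.

```python
import sys


def build_message(job_id):
    """Compose the notice for a held job (wording is illustrative)."""
    return f"Job {job_id} was placed on hold"


def main(argv):
    """Return 0 on success and 1 when the job ID parameter is missing."""
    # $OID in the trigger's Action is replaced with the object ID,
    # so the job ID is expected as argv[1].
    if len(argv) < 2:
        print("usage: held_job.py <job_id>", file=sys.stderr)
        return 1
    print(build_message(argv[1]))
    return 0


if __name__ == "__main__":
    # A production script would likely pass this status to sys.exit().
    main(sys.argv)
```

A real handler would replace the print call with whatever the site needs, such as sending mail or writing to a log.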
Trigger Example #2
Send an email when a node goes down
AType=exec
EType=fail
Action="$TOOLS/down.pl $OID"
MultiFire=TRUE
RearmTime=1:00
# moab.cfg
NODECFG[n01] TRIGGER=AType=exec,EType=fail,Action="$TOOLS/down.pl $OID",MultiFire=TRUE,RearmTime=1:00
Trigger Example #3
Create a 5-minute reservation after every job to account for the job epilogue
AType=reserve
EType=end
Action="5:00"
Description="Reservation for job epilogue"
# moab.cfg
JOBCFG[DEFAULT] TRIGGER=AType=reserve,EType=end,Action="5:00",Description="Reservation for job epilogue"
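The time values in these examples are written as colon-separated fields: 5:00 is described as five minutes, and a later example uses -24:00:00 for 24 hours before an event. The following Python sketch converts such values to seconds. It assumes the format is [[HH:]MM:]SS with an optional leading minus sign, which is inferred from the slides rather than taken from the Moab documentation.

```python
def parse_duration(spec):
    """Convert '[-][[HH:]MM:]SS' into a signed number of seconds.

    The field layout is an assumption inferred from the slide examples.
    """
    text = spec.strip()
    sign = -1 if text.startswith("-") else 1
    fields = text.lstrip("+-").split(":")
    if not 1 <= len(fields) <= 3:
        raise ValueError(f"unsupported time spec: {spec!r}")
    seconds = 0
    for field in fields:
        # Each step shifts the running total one unit up (x60) and adds the next field.
        seconds = seconds * 60 + int(field)
    return sign * seconds
```

Under this assumption, parse_duration("5:00") gives 300 and parse_duration("-24:00:00") gives -86400.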
Trigger Example #4
Execute an email script when user "bob" has too much backlog
AType=exec
EType=threshold
Action="$TOOLS/email.pl $OID"
Threshold=backlog>100
FailOffset=1:00
# moab.cfg
USERCFG[bob] TRIGGER=AType=exec,EType=threshold,Action="$TOOLS/email.pl $OID",Threshold=backlog>100,FailOffset=1:00
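FailOffset is described in the appendix as the time a threshold must hold before the trigger fires. The following Python sketch models that behaviour on a series of samples. It illustrates the described semantics only and is not Moab's implementation; the function name, the sample format, and the sampling interval are assumptions.

```python
def first_fire_time(samples, threshold, fail_offset):
    """Return the timestamp at which the trigger would fire, or None.

    samples: iterable of (timestamp_seconds, value) in time order.
    The value must stay above the threshold for at least fail_offset
    seconds; any dip back to or below the threshold restarts the wait.
    """
    exceeded_since = None
    for timestamp, value in samples:
        if value > threshold:
            if exceeded_since is None:
                exceeded_since = timestamp
            if timestamp - exceeded_since >= fail_offset:
                return timestamp
        else:
            exceeded_since = None
    return None


# Backlog sampled every 30 seconds; threshold 100, FailOffset of 60 seconds.
steady = [(0, 90), (30, 120), (60, 130), (90, 140)]
flapping = [(0, 120), (30, 80), (60, 120), (90, 120)]
```

With these samples, the steady series fires at t=90, while the flapping series never holds above the threshold long enough to fire.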
Trigger Example #5
Toggle the MAXPROC parameter for user "alice" based on total system usage
# moab.cfg
SCHEDCFG[moab] TRIGGER=AType=changeparam,EType=threshold,Action="USERCFG[alice] MAXPROC=5",Threshold=usage>90%,MultiFire=TRUE
SCHEDCFG[moab] TRIGGER=AType=changeparam,EType=threshold,Action="USERCFG[alice] MAXPROC=100",Threshold=usage<90%,MultiFire=TRUE
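The two triggers above act as a pair of rules. The following Python sketch expresses the same decision as a single function, to make the edge case visible. The function name is hypothetical; the limits 5 and 100 and the 90% boundary come from the example.

```python
def maxproc_for_usage(usage_percent, current_limit):
    """Return the MAXPROC limit the pair of threshold triggers would leave in place."""
    if usage_percent > 90:
        return 5            # first trigger: system busy, restrict alice
    if usage_percent < 90:
        return 100          # second trigger: system has room, relax the limit
    # Both thresholds are strict, so at exactly 90% neither trigger fires.
    return current_limit
```

Because both comparisons are strict, usage of exactly 90% leaves whichever limit was set last.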
Trigger Example #6
If a node's temperature goes above 60, reserve the node, notify the admin and shut it down
# moab.cfg
NODECFG[DEFAULT] TRIGGER=AType=internal,EType=threshold,Action=reserve,Sets=RESERVED,Threshold=GMetric[temp]>60
NODECFG[DEFAULT] TRIGGER=AType=exec,EType=start,Action="$TOOLS/node_email.pl $OID",Requires=RESERVED
NODECFG[DEFAULT] TRIGGER=AType=exec,EType=start,Action="$TOOLS/shutdown.pl $OID",Requires=RESERVED
Trigger Example #7
If scratch space fills up, reserve the node, run a cleanup script and send a message to the administrator based on the result
# moab.cfg
NODECFG[DEFAULT] TRIGGER=AType=internal,EType=threshold,Action=reserve,Sets=DRAINED,Threshold=GMetric[scratch]>1001
NODECFG[DEFAULT] TRIGGER=AType=exec,EType=start,Action="$TOOLS/cleanup.pl $OID",Requires=DRAINED,Sets=CLEANED.!FAILURE
NODECFG[DEFAULT] TRIGGER=AType=mail,EType=start,Action="$OID is cleaned",Requires=CLEANED
NODECFG[DEFAULT] TRIGGER=AType=mail,EType=start,Action="$OID is not cleaned",Requires=FAILURE
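The Sets and Requires attributes chain these four triggers into a small workflow with two possible endings. The following Python sketch simulates that chaining. It is a simplified model, not Moab's scheduler logic: the Trigger class, the trigger names, and the run_triggers function are all hypothetical, and the threshold event is assumed to have already occurred.

```python
from dataclasses import dataclass, field


@dataclass
class Trigger:
    name: str
    requires: set = field(default_factory=set)    # variables that must be set
    on_success: set = field(default_factory=set)  # Sets=VAR
    on_failure: set = field(default_factory=set)  # Sets=!VAR


def run_triggers(triggers, outcomes):
    """Fire each trigger once, as soon as all of its required variables exist.

    outcomes maps a trigger name to True (success) or False (failure);
    triggers that are not listed are assumed to succeed.
    Returns the names of the fired triggers in firing order.
    """
    variables = set()
    fired = []
    progress = True
    while progress:
        progress = False
        for trig in triggers:
            if trig.name in fired or not trig.requires <= variables:
                continue
            succeeded = outcomes.get(trig.name, True)
            variables |= trig.on_success if succeeded else trig.on_failure
            fired.append(trig.name)
            progress = True
    return fired


EXAMPLE_7 = [
    Trigger("reserve", on_success={"DRAINED"}),
    Trigger("cleanup", requires={"DRAINED"},
            on_success={"CLEANED"}, on_failure={"FAILURE"}),
    Trigger("mail_cleaned", requires={"CLEANED"}),
    Trigger("mail_not_cleaned", requires={"FAILURE"}),
]
```

Calling run_triggers(EXAMPLE_7, {"cleanup": False}) follows the failure path and ends with mail_not_cleaned, while an empty outcomes mapping follows the success path and ends with mail_cleaned.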
More Examples
# Run diagnostics when an RM fails (for 3 minutes)
RMCFG[native] TYPE=NATIVE FAILTIME=3:00
RMCFG[native] TRIGGER=AType=exec,EType=failure,Action="$TOOLS/diagnose_rm.pl $OID"
# Associate a job trigger with a class
CLASSCFG[batch] JOBTRIGGER=AType=exec,EType=preempt,Action="$TOOLS/preempt_notify.pl $OID $OWNER $HOSTNAME"
# Send an email 24 hours before a reservation ends to notify/remind the user
RSVCFG[apache_farm] TRIGGER=AType=exec,EType=end,Offset=-24:00:00,Action="$TOOLS/rsv_end_email.pl $OID $OWNER $TIME"
# Standing trigger
SCHEDCFG[moab] TRIGGER=AType=exec,EType=standing,Period=hour,Action="$TOOLS/createjobs_hour.pl"
What else can I do?
Email the owner of a particular reservation when the usage drops below a specific threshold to encourage efficient use of reserved resources
Launch an evaluation script 5 minutes before a job is scheduled to complete to gather more detailed statistics about how well the job ran
Guarantee a particular account a certain service level by emailing the admin, creating a reservation, and/or contacting a hosting utility for more resources if backlog exceeds an hour of waiting time
Other Ways to Create Triggers
# Attach a trigger to an object
mschedctl -c trigger <attr>=<val>[,<attr>=<val>...] -o <obj_type>:<obj_val>
# Dynamically add a trigger to a reservation
mrsvctl -c -T <attr>=<val>[,<attr>=<val>...]
# Submit a job with a trigger
msub <job_id> -l 'trig=<attr>=<val>[\&<attr>=<val>...] '
NOTE: For security reasons, only users having a QoS with the 'trigger' flag can submit jobs with attached triggers
Monitoring & Modifying Triggers
Monitoring:
mdiag -T [-v]
mdiag -T [-v] trigger.id
mdiag -T [-v] job.id
mdiag -T -V (shows a global view of all triggers associated with the current user)
Modifying:
mschedctl -m trigger:2 AType=exec,Offset=200
Where Should I Start?
Moab already handles many common cases:
  Job start failures
  Node failures
Start with simple tasks:
  Sending emails
  Executing small scripts & then sending emails
Start with the most common resource, workflow, and service failures
Summary
“I fell asleep. What did I miss?”
Losses in productivity and inefficiencies in resource availability are inherent in large computing environments
Built into Moab is an event-based trigger technology that allows for customized automated responses
As an extension of Moab, triggers have an integrated view of the resource pool – allowing for smarter resource-/workload-aware responses
Questions?
Additional Resources
Trigger documentation / how-to:
http://www.clusterresources.com/products/mwm/moabdocs/19.0triggers.shtml
Enabling Generic Metrics documentation:
http://www.clusterresources.com/products/mwm/docs/9.2accounting.shtml#gmetric
Appendices
Possible AType/EType Values
Action Types: cancel, changeparam, jobpreempt, mail, exec, query, internal
Event Types: cancel, checkpoint, create, end, epoch, fail, hold, migrate, modify, preempt, standing, start, threshold
Advanced Attributes
BlockTime = seconds Moab will suspend normal operation until trigger finishes executing
ExpireTime = time at which trigger should be terminated if it has not already been activated
FailOffset = seconds threshold must exist before the trigger fires
Interval = boolean that sets trigger to fire at regular intervals
MaxRetry = times to execute action before trigger 'gives up'
MultiFire = makes a trigger repeatable
Offset = how long after event to fire (or before for 'end' events)
RearmTime = how long to wait before rearming
Requires = variable dependency
Sets = variable on success, !variable on failure
UnSets = variable destroyed on trigger success
Timeout = how long trigger's process will run before being killed
'Special' Triggers
Mail Triggers
Requires MAILPROGRAM parameter in moab.cfg
Internal Triggers
Action="<OBJ_TYPE>:<OBJ_ID>:<ACTION>:<CONTEXT_DATA>"
Object types: job, node, reservation, standing reservation, scheduler, user
Actions: cancel, complete (system job), destroy (VPC), modify, reserve
# For example:
SRCFG[prov] TRIGGER=AType=internal,EType=start,Action="node:$HOSTLIST:modify:os=rhel4"
Trigger Variables
$ETYPE $OWNER
$GROUPHOSTLIST $TIME
$HOSTLIST $USER
$MASTERHOST $VPCID
$OID $VPCHOSTLIST
$OS
$OTYPE (rsv,job,node,sched)