Alarm Management as a winning strategy - Instrumentation
By Dave Wibberley, Managing Director: Adroit Technologies
Keywords: [Alarm Management, bad actor, benefit, best
practice, case study, champion, compliance, continuous,
decision making, disaster, economic, EEMUA 191, flooding,
guideline, ignore, incident, ISA, KPI, liability, loss, mass
acknowledgement, motivation, nuisance alarm, overload,
performance, quality, regulatory, risk, safety, strategy, trip]
The objective of an alarm system is to minimize or
prevent physical and economic loss through
operator intervention.
Abstract
In this paper Dave Wibberley looks at some of the motivators and best practices for alarm systems and then
discusses the conflict between the published KPIs, which are regarded as necessary to meet best practice,
and the practical means of achieving these metrics. Through two case studies he highlights some of the
difficulties plant operators may face, and the possible outcomes, when they embark on implementing an alarm
strategy.
1 Introduction
The ultimate objective of an alarm system is to prevent, or at least minimize, physical and economic loss
through operator intervention in response to the condition that was alarmed on any given process.
Alarm management has taken on new meaning since poor alarm handling was identified as one of the key
contributing factors in major process plant disasters such as BP Texas City, Bhopal and Three Mile Island.
Spurred on by the findings of enquiries into these disasters, some very interesting work has been done by the
large process companies, vendors and industry organisations like the Engineering Equipment and Materials
Users Association (EEMUA) and the International Society of Automation (ISA).
By first recognising the problem, then accepting the need for an alarm management strategy, and finally
applying technology to the process, companies can achieve multiple benefits. These include a more motivated
and focused staff, clearly identified efficiencies and improved profitability, while minimising the potential
liability of management.
2 Why change anything?
Strategic motivators can be classified as forced or voluntary. Forced motivators may be compliance (or
regulatory) or they may relate to risk mitigation. Voluntary motivators are those which are driven by
perceived business benefits.
The adoption of an alarm management strategy may be driven by both forced and voluntary motivators.
3 What is an alarm?
The generally accepted definition of an alarm is:
“An alarm is an event to which an operator must react, respond and acknowledge”
Therefore the purpose of an alarm system is to alert the operator(s) to a potential problem that, if not
addressed, will compromise production, process, quality or safety. To put an economic slant on this: the
alarm system is there to prevent, or at least minimize, physical and economic loss through operator
intervention.
4 Current problems
Studies have shown that there are several common problems experienced in alarm systems:
• There are too many nuisance alarms
• Alarms are ignored by operators
• Alarm viewers are underutilised
• Operators perform mass alarm acknowledgments
• The systems fail to provide real insight to support operator decision making
Alarm definition
An alarm is classified as an event to which an operator must react, respond and acknowledge (not simply
acknowledge and ignore) and no plant should have more than 6 such alarm occurrences an hour.
[EEMUA 191 guidelines]
These problems lead to undesirable consequences. Operators presented with too many alarms may overlook
an important indicator of an abnormal situation, or be so overwhelmed that they choose to unnecessarily trip
the unit as a safety measure instead of trying to interpret the information being conveyed.
Both of these scenarios can significantly impact the safety of plant personnel and the efficiency of plant
operations. At the same time risk is increased. There is a higher chance of a catastrophic failure
occurring, leading to massive liability issues, as happened in incidents like BP Texas City, Bhopal and Three
Mile Island. Less catastrophic but still significant are the risks of reduced production, quality issues and
poor staff motivation.
4.1 Nuisance alarms
Most SCADA installations are configured to create an over-abundance of alarms. One example is a large diamond
mine customer that had over 10 000 alarms configured once the project was delivered. This is information
overload for operators. According to the EEMUA 191 guidelines, an alarm is classified as an event to which
an operator must react, respond and acknowledge (not simply acknowledge and ignore) and no plant should
have more than 6 such alarm occurrences an hour.
4.2 Alarms are ignored
Many alarms are simply ignored by operators because so many of them are inconsequential, irritating or both.
Where the operator is overwhelmed by trivial information (s)he may miss or ignore genuine alarms.
4.3 Alarm viewers are underutilised
Most process faults are adequately displayed by graphical components on the mimic, and operators use these
as the starting point for corrective action. This makes a “noisy” alarm viewer redundant, when it should be
the most important global view of any process, indicating its current health.
4.4 Mass alarm acknowledgement
Alarms on alarm viewers tend to be acknowledged blindly.
4.5 Lack of real information to support decision making
A lot of academic and industry body work has been done around this subject and there are third-party software
applications making their way into the industrial arena, but they are often too complicated and expensive.
These applications focus on the events/alarms themselves without taking into account the dynamics of the
process, for example the time it takes for operators to simply acknowledge, clear and reset the alarm.
Systems, plants and processes are run by human beings. They are often under pressure, working the third
shift late into the night, and may lack motivation. The operators running the plant are the people who need to
ensure that the process runs optimally. We need to assist and motivate them by focusing their attention on the
areas that will firstly make their lives easier (by identifying problem areas) and secondly give them
information that motivates them (by showing how they are performing against targets, i.e. that they are on the
right track).
5 Current best practice and KPIs
The two main reference standards used currently are EEMUA 191 and ISA 18.2. These introduce standard
alarm system KPIs such as:
• Average alarm rate (incident count per hour)
• Burst rate
• Percent upset
• Priority distribution
• Standing alarms
• Most frequent alarms
• Intermittent incidents
• Intermittent counts
• Count by agent
• Count by operator
• Total active time by incident
• Total unacknowledged time by incident
• Hourly count grouped by hour of the day
• Hourly count by incident
• Daily count
• Average acknowledge time
• Average acknowledge time per type
6 An alarm strategy
Given that most sites suffer from too much alarm “noise”, the only way to solve this problem is through a
well planned and executed alarm strategy that includes technology and applications to support it. The alarm
strategy is an essential element of any enterprise’s continuous improvement program.
Figure 1. Continuous improvement cycle.
6.1 Strategy definition
A strategy is a long term plan of action designed to achieve a particular goal. Strategy is differentiated from
tactics or immediate actions with resources at hand by its nature of being extensively premeditated, and often
practically rehearsed. Strategies are used to make the problem easier to understand and solve.
6.2 Expected outcomes
In terms of operator activity, operators should:
• Be more “valuably” active
• Have better response times to alarms
• Manage and operate more proactively, as they are only responding to genuine issues
• Be more focused and prioritised
In terms of process, expectations would be for:
• Improved product quality as a consequence of fewer stoppages
• Better focus on the process
• Better information around the key process variables
• Identification of problematic process equipment
• Identification of problematic process areas
From a business perspective, expectations would include:
• Better understanding of operator performance
• Identification of problematic teams
• Superior decisions around asset management
6.3 Key success factors
Key factors for successful alarm strategy implementation include:
• Management buy-in, in terms of commitment to the strategy and provision of the necessary
resources
• Correct selection of technology that is standards based (uses an open database, is OPC compliant
to allow for diverse data sources, and follows EEMUA/ISA guidelines), supports a value driven
information and knowledge approach, and provides reporting capabilities, KPI measurement, alarms,
feedback mechanisms and configuration guidelines
• A culture of continuous improvement
7 The Adroit solution
Adroit has developed an alarm strategy implementation methodology and advises that it be followed.
Rather than doing a large amount of work upfront, phase in the application of the strategy as you achieve
milestones.
7.1 Alarm configuration
The Adroit alarm module uses a SQL backend. The following are some of the parameters that can be configured
in the module.
Raw data:
• Name of Tag/Agent
• Description of the Tag/Agent
• Value of the Tag/Agent
• Time of alarm occurrence
• Time alarm was acknowledged
• Time alarm was cleared
Figure 2. Alarm agent configuration.
Inferred data:
• Shift data
• Delay
• Conditional aspects
• Plant name
• Plant area
• Operator logged on at the time of alarm
• Whether or not the operators need to add “Reasons and/or Notes” per incident
• Values of other relevant process variables
• Help files
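As a rough illustration of how the raw and inferred data above fit together, the sketch below models one alarm incident as a record. The class name, field names and helper methods are hypothetical and chosen only for this example; they are not the Adroit schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlarmIncident:
    """One alarm occurrence. Field names are illustrative, not Adroit's schema."""
    # Raw data captured when the alarm fires
    tag: str
    description: str
    value: float
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None
    cleared_at: Optional[datetime] = None
    # Inferred data attached for later analysis
    shift: str = ""
    plant: str = ""
    area: str = ""
    operator: str = ""
    notes: str = ""

    def time_to_acknowledge(self) -> Optional[timedelta]:
        """Time between occurrence and acknowledgement, if acknowledged."""
        if self.acknowledged_at is None:
            return None
        return self.acknowledged_at - self.raised_at

    def active_time(self) -> Optional[timedelta]:
        """Time between occurrence and clearing, if cleared."""
        if self.cleared_at is None:
            return None
        return self.cleared_at - self.raised_at
```

Keeping the three timestamps on every record is what makes the acknowledge-time and active-time KPIs derivable later without extra logging.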
7.2 Alarm results and standard KPIs
As part of the solution the user has easy access to about fifteen different views, which provide an analysis of
the alarm data in tabular and graphical formats. Some of the standard queries are shown in Figure 3.
Figure 3. Standard alarm result queries.
Figure 4: Showing incidents count per hour.
The solution displays alarm results according to a number of pre-configured KPIs; however, users are able to
run their own queries against the data. Depending on its complexity, a query may have to be run from a
third-party database client application.
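To give a feel for what a user-defined query against the alarm data might look like, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table name alarm_incident and its columns are assumptions made for this example and do not reflect the actual Adroit database layout.

```python
import sqlite3


def incidents_per_hour(rows):
    """Count incidents per clock hour.

    rows: iterable of (tag, raised_at), raised_at as 'YYYY-MM-DD HH:MM:SS' text.
    The alarm_incident table is a hypothetical stand-in for the real alarm log.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE alarm_incident (tag TEXT, raised_at TEXT)")
        conn.executemany("INSERT INTO alarm_incident VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT strftime('%Y-%m-%d %H:00', raised_at) AS hour_bucket, COUNT(*) "
            "FROM alarm_incident GROUP BY hour_bucket ORDER BY hour_bucket"
        )
        return cur.fetchall()
    finally:
        conn.close()
```

The same GROUP BY idea carries over to other SQL databases, although the date-truncation function differs between products.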
7.2.1 Average alarm rate (incident count per hour)
This is the number of incidents that occur per hour. Standards vary for different industries but the rate
should typically be no more than 6 alarms/hour under normal conditions and 60 alarms/hour under abnormal
conditions. Average rate = [Total number of alarms]/[Total number of hours].
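The average rate formula can be sketched in a few lines. The function name and the choice to pass alarm timestamps plus an explicit reporting period are assumptions for this example.

```python
from datetime import datetime, timedelta


def average_alarm_rate(alarm_times, period_start, period_end):
    """Average rate = total number of alarms / total number of hours."""
    hours = (period_end - period_start).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("period_end must be after period_start")
    # Only count alarms that fall inside the reporting period
    count = sum(1 for t in alarm_times if period_start <= t < period_end)
    return count / hours
```

Passing the period explicitly, rather than inferring it from the first and last alarm, keeps quiet hours in the denominator.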
7.2.2 Burst Rate
This is based on the number of incidents that occur in a 10 minute window. Standards vary for the different
industries but the count should typically be between 1 (normal conditions) and 10 (abnormal conditions)
alarms/10 minutes. It is an important measure of the usability of an alarm system and the operator's
capability to deal with alarms. Burst rate = 6 * [Maximum alarm count in a 10 minute period], which expresses
the worst 10 minute window as an equivalent hourly rate.
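A minimal sketch of the burst rate calculation follows. It assumes a sliding 10 minute window ending at each alarm; the text does not say whether the windows are sliding or fixed clock intervals, so treat that choice, and the function name, as assumptions.

```python
from datetime import datetime, timedelta


def burst_rate(alarm_times, window=timedelta(minutes=10)):
    """Return (max alarms in any window, equivalent hourly rate).

    Uses a sliding window ending at each alarm (an assumption; fixed
    clock-aligned windows would give slightly different counts).
    """
    times = sorted(alarm_times)
    max_count = 0
    start = 0
    for end, t in enumerate(times):
        # Drop alarms that are a full window or more older than the current one
        while t - times[start] >= window:
            start += 1
        max_count = max(max_count, end - start + 1)
    windows_per_hour = timedelta(hours=1) / window  # 6.0 for a 10 minute window
    return max_count, max_count * windows_per_hour
```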
7.2.3 Percent Upset
This is the percentage of hours in which there are more than 30 alarms. It is a measurement of alarm
overload on operators. Percent upset = 100 * [Number of hours where alarms exceed 30/hour]/[Total
number of hours]. Overloaded > 50%, Reactive 25%-50%, Stable 5%-25%, Robust 1%-5%, Predictive < 1%.
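The percent upset formula and its performance bands could be sketched as below. The bands in the text share their boundary values, so assigning each boundary (for example exactly 25%) to the upper band is an assumption made here, as are the function names.

```python
def percent_upset(hourly_counts, threshold=30):
    """Percentage of hours in which the alarm count exceeds the threshold."""
    if not hourly_counts:
        raise ValueError("need at least one hour of data")
    upset_hours = sum(1 for c in hourly_counts if c > threshold)
    return 100.0 * upset_hours / len(hourly_counts)


def classify_upset(pct):
    """Map a percent upset value to a band (boundary handling is an assumption)."""
    if pct > 50:
        return "Overloaded"
    if pct >= 25:
        return "Reactive"
    if pct >= 5:
        return "Stable"
    if pct >= 1:
        return "Robust"
    return "Predictive"
```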
7.2.4 Priority Distribution
This is the priority distribution for alarms occurring over a period of time. Standards vary for the different
industries but typically the distribution should be 5% high priority alarms (>P3), 15% medium priority
alarms (=P3) and 80% low priority alarms (<P3).
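One way to compare a site against the 5/15/80 guideline is to compute the actual percentage split. The sketch below assumes each alarm has already been labelled "high", "medium" or "low"; those labels and the function name are illustrative.

```python
from collections import Counter

# Typical target split from the text, in percent
TARGET = {"high": 5.0, "medium": 15.0, "low": 80.0}


def priority_distribution(priorities):
    """Return the percentage of alarms at each priority level."""
    counts = Counter(priorities)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alarms supplied")
    return {level: 100.0 * counts.get(level, 0) / total for level in TARGET}
```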
7.2.5 Standing alarms
This is the number of current alarms at the end of an hour. Standards vary for the different industries but
the target is typically <=9.
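Counting standing alarms at a given instant can be sketched as follows, assuming each incident is available as a pair of occurrence and clear times (with None for alarms that have not cleared). The function name and input shape are assumptions.

```python
from datetime import datetime, timedelta


def standing_alarm_count(incidents, at):
    """Number of alarms raised at or before `at` and not yet cleared at `at`.

    incidents: iterable of (raised_at, cleared_at); cleared_at is None
    while the alarm is still active.
    """
    return sum(
        1
        for raised, cleared in incidents
        if raised <= at and (cleared is None or cleared > at)
    )
```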
7.2.6 Most Frequent alarms (count by type)
This is a view of the most frequent alarms over time. The top 20 of these alarms can account for almost 50%
of total alarm generation. If properly reviewed, you will find that most of these alarms should not be
classified as alarms at all in that they do not meet the proper criteria that define an alarm.
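To check how much of the total load the most frequent alarms account for on a given site, a sketch along these lines may help. It takes one tag name per incident; the function name is hypothetical.

```python
from collections import Counter


def top_offenders(tags, n=20):
    """Return the n most frequent alarm tags and their share of all alarms (%)."""
    counts = Counter(tags)
    total = sum(counts.values())
    top = counts.most_common(n)
    share = 100.0 * sum(c for _, c in top) / total if total else 0.0
    return top, share
```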
7.2.7 Intermittent Incidents
These are alarms that activate and deactivate within 10 seconds. If properly reviewed, you will find that most
of these alarms should not be classified as alarms at all in that they do not meet the proper criteria that define
an alarm.
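Intermittent incidents can be picked out by comparing occurrence and clear times. The sketch below treats "within 10 seconds" as inclusive, which is an assumption, and uses a hypothetical (tag, raised_at, cleared_at) tuple per incident.

```python
from datetime import datetime, timedelta


def intermittent_incidents(incidents, limit=timedelta(seconds=10)):
    """Return incidents that cleared within `limit` of being raised.

    incidents: iterable of (tag, raised_at, cleared_at); alarms that have
    not cleared (cleared_at is None) are never counted as intermittent.
    """
    return [
        (tag, raised, cleared)
        for tag, raised, cleared in incidents
        if cleared is not None and cleared - raised <= limit
    ]
```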
7.2.8 Count by Agent
This is the number of incidents on a per agent basis.
7.2.9 Count by Operator
This is the number of incidents by operator.
7.2.10 Total active time by incident
This is the total amount of time an individual incident has been active.
7.2.11 Total unacknowledged time by incident
This is the total amount of time an individual incident has remained unacknowledged.
7.2.12 Hourly Count grouped by hour of the day
This is an indication of what time of day incidents tend to happen more frequently.
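Grouping by hour of the day amounts to bucketing alarm timestamps on their hour component across all days in the period. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter
from datetime import datetime


def count_by_hour_of_day(alarm_times):
    """Return a 24-element list: alarms seen in each hour of the day (0-23)."""
    counts = Counter(t.hour for t in alarm_times)
    return [counts.get(hour, 0) for hour in range(24)]
```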
7.2.13 Hourly Count by Incident
This is the total number of individual incidents that occur per hour.
7.2.14 Daily Count
This is the total number of incidents occurring per day.
7.2.15 Average Acknowledge Time
This is the average time it takes for incidents to be acknowledged.
7.2.16 Average Acknowledge Time per Type
This is the average time it takes for individual incidents to be acknowledged categorised by alarm type.
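The per-type average can be sketched as below, assuming each incident is reduced to an alarm type and the number of seconds it took to acknowledge. Skipping alarms that were never acknowledged is a choice made for this example.

```python
from collections import defaultdict


def average_ack_seconds_by_type(incidents):
    """Average acknowledge time in seconds for each alarm type.

    incidents: iterable of (alarm_type, seconds_to_acknowledge); a value of
    None means the alarm was never acknowledged and is skipped (assumption).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for alarm_type, seconds in incidents:
        if seconds is None:
            continue
        totals[alarm_type] += seconds
        counts[alarm_type] += 1
    return {alarm_type: totals[alarm_type] / counts[alarm_type] for alarm_type in totals}
```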
8 Applying the technology - Two case studies
We took our solution and put it on to two fairly large sites:
• PPC’s cement plant at Riebeck West
• Tata Nickel in Botswana
Both sites run large single Adroit server (redundant) systems, have in excess of 15 000 I/O, run 24-hour
operations and are fairly busy plants.
We believe the lessons learned on these sites are the most valuable part of this work, and they are the ones
we wish to share with you today.
8.1 Reality check
These are some of our initial observations from both plants, and they point to problem areas in alarm strategy
implementation that are common to most plants.
• There has to be a site champion, with strong management buy-in and commitment.
• Alarms are over-engineered. This is probably a result of what is common in most projects: a lack of
specification and understanding of alarming. People engineering the solution adopt a conservative
“alarm everything” approach. This is similar to what happens when people first put in a historian and
want to “log everything as fast as you can and we will sort it out later.” The end result is far too
much data at far too high a resolution.
• The structured approach to alarm management requires months, maybe years, to take a site from an
initial position like this to world-class alarm management. The usual 80/20 rule applies, in that you
can get 80% of the way in 20% of the time, yielding massive improvements quickly. It then takes a
sustained approach to drive the system down to world-class positioning.
8.2 Initial implementation
We took the approach of switching the alarm system on in its rawest form in order to capture the “As is”
data and operation. So in both cases we spent the first day installing the Alarming System and its
databases.
We did not attempt any categorisation or complicated configuration. The idea behind this was that if the
system is “unstable” as defined by the KPIs, one should address the “flooding” issue first, before trying to
establish, for example, that a certain area of the plant is worse than another. That more focused analysis
comes in when you are looking at the last 20% of the system.
The results for the first 24 hours were quite astonishing, with PPC showing average alarm rates in the order
of 150 to 1100 per hour and Tata showing somewhat higher levels:
Figure 5: Initial alarm system performance – PPC case study.
Figure 6: Initial alarm system performance – TATA case study.
8.3 Tackling the “Bad actors”
Work was allocated to the PLC and SCADA engineers to address the top 10 “bad actors” and the standing
alarms, both of which weigh heavily on system performance. The system was then left to run for another
24 hours.
After this exercise the alarm system performance improved significantly, with the basic measurement of
incidents per hour dropping by 35%.
8.4 A practical methodology
When an alarm strategy is first implemented on an existing plant, the initial observations are quite
overwhelming, but what is critical is to focus on the key areas that are exposed. As already stated, you can’t
focus on operator performance when your system is essentially “unstable”.
The following steps, which come from our case study experiences, can act as a guideline and it is critical to
remember that this is a journey that may take years to accomplish.
Focus on the low hanging fruit:
• Identify the top 10 “bad actors”, both the top ten by count and the top ten by duration. You will find
that the two lists have problems in common.
• Identify and remove redundant alarms
• Identify and remove chattering alarms (short duration)
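The first step above, ranking by count and by duration and looking for the overlap, can be sketched as follows. The input shape (one tag and active time in seconds per incident) and the function name are assumptions for this example.

```python
from collections import Counter, defaultdict


def bad_actors(incidents, n=10):
    """Rank alarm tags by incident count and by total active time.

    incidents: iterable of (tag, active_seconds).
    Returns (top n by count, top n by duration, tags present in both lists).
    """
    counts = Counter()
    durations = defaultdict(float)
    for tag, active_seconds in incidents:
        counts[tag] += 1
        durations[tag] += active_seconds
    by_count = [tag for tag, _ in counts.most_common(n)]
    by_duration = sorted(durations, key=durations.get, reverse=True)[:n]
    common = [tag for tag in by_count if tag in by_duration]
    return by_count, by_duration, common
```

Tags that appear in both lists are usually the most rewarding to fix first, since they drive both the incident rate and the standing alarm count.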
As the balance swings towards a more controllable situation you can start looking at your own circumstances
and adjusting the targets accordingly. Once you have reached an acceptable level of alarm system
performance, start categorising alarms and adding in operator names, which allows you to focus on the
last 20%.
The continuing nature of such a program will then see you adding supporting documentation, such as Standard
Operating Procedures (SOPs), to support your operators.
“In the first 24 hours of running the
Adroit Alarm Management module, we
managed to decrease the average
incidents from around 500 incidents per
hour to 300 per hour.”
Danie Sadie from SAdkons, consulting for
a large cement producer in the Western
Cape
Regularly revisit your targets and continually review the performance of your alarms until you reach world-
class status.
8.5 Case study conclusions
This is a continuous program; there are no quick fixes or short cuts. There has to be complete buy-in from all
levels within the organisation. Allowances and budgets have to be allocated and made available. The rewards
will come.
You are not alone in this quest; 90% of all industrial control solutions are at the same point as you.
For more information contact Dave Wibberley, Adroit Technologies, +27 (0)11 658 8100,
[email protected], www.adroit.co.za
About the author
Dave Wibberley is the Managing Director of Adroit Technologies – the largest developer of SCADA
systems in Africa. He holds a BSc Mech. Eng. and a GDE in Industrial Engineering.