Analysis of EMS Outages
Venkat Tirupati, Senior Reliability Engineer
NERC Monitoring and Situational Awareness Conference
September 18, 2013
RELIABILITY | ACCOUNTABILITY 2
Agenda
• Introduction
• Analysis of Restorations
• Contributing & Root causes with examples
• Common themes with examples
• Q & A
Introduction
• Energy Management Systems (EMS) are extremely reliable
• EMS outages increase the risk to the reliability of the grid
• 81 Category 2b events (Oct 26, 2010 – Sep 3, 2013) reported
• 64 events – thoroughly analyzed and reviewed
• 54 entities reporting - 20 entities experiencing multiple outages
• Restoration time for partial outages: 18 to 411 min
• Restoration time for complete outages: 12 to 253 min
• Vendor agnostic failures – Software & Hardware Issues
• Several noticeable themes
Analysis of Restoration Times
[Figure: complete and partial outage restoration time per event, in minutes (y-axis 0–450, events 1–77), with lines for the mean complete, mean partial and mean outage restoration time. Inset: mean complete and mean partial outage restoration time by year, 2010–2013.]
• Mean Complete Outage Restoration Time: 56 Minutes
• Mean Partial Outage Restoration Time: 43 Minutes
• Mean Total Outage Restoration Time: 99 Minutes
Restoration Time
[Figure: histogram of restoration time in 10-minute intervals; y-axis: number of events (0–12).]
Outage Restoration Times by Date
[Figure: complete and partial outage restoration time in minutes (y-axis 0–450) for each event in date order, October 26, 2010 – September 3, 2013.]
Number of Reports - Quarterly
[Figure: number of reported outages per quarter (y-axis 0–12), 2010 Q4 through 2013 Q3, covering October 26, 2010 – September 3, 2013.]
Characteristics of the EMS outages
[Figure: number of events (y-axis 0–90) by outage characteristic.]
• Outages on weekdays: 69; outages on weekends: 12
• CIP activity led to outage: 10; non-CIP activity led to outage: 71
• Outage due to planned activity: 31; outage unforeseen: 50
Outage Time of the day (2 Hr Intervals)
[Figure: number of events (y-axis 0–10) per 2-hour interval, split into outages due to planned activity and unforeseen outages.]
Interval         Planned Activity   Unforeseen
0:00 - 2:00      1                  7
2:00 - 4:00      0                  5
4:00 - 6:00      1                  9
6:00 - 8:00      1                  1
8:00 - 10:00     5                  3
10:00 - 12:00    6                  3
12:00 - 14:00    2                  5
14:00 - 16:00    4                  4
16:00 - 18:00    4                  3
18:00 - 20:00    4                  2
20:00 - 22:00    3                  3
22:00 - 24:00    0                  5
Root Causes – By Category
A1 - Design/Engineering, 9, 16%
A2 - Equipment/Material, 14, 25%
A3 - Individual Human Performance, 1
A4 - Management/Organization, 17, 30%
A5 - Communication, 3
A6 - Training, 1
AZ - Information LTA, 11, 20%
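The percentages on this slide follow directly from the event counts, and the same arithmetic gives the shares the slide does not print for A3, A5 and A6. A minimal sketch of that calculation (the variable names are illustrative and not part of the presentation):

```python
# Root-cause counts as listed on the slide (cause category -> number of events).
root_cause_counts = {
    "A1 - Design/Engineering": 9,
    "A2 - Equipment/Material": 14,
    "A3 - Individual Human Performance": 1,
    "A4 - Management/Organization": 17,
    "A5 - Communication": 3,
    "A6 - Training": 1,
    "AZ - Information LTA": 11,
}

total = sum(root_cause_counts.values())

# Share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in root_cause_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {root_cause_counts[name]} ({pct}%)")
```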
Root Causes
[Figure: bar chart of event counts (y-axis 0–8) for the top root causes.]
Top Root Causes
• Software Failure (A2B6C07)
• Testing of Design/Installation LTA (A1B4C02)
• Information to determine cause LTA (AZ)
• Insufficient Job scoping (A4B3C08)
• Inadequate risk assessment of change (A4B5C04)
• Post modification testing LTA (A2B3C03)
• Design output scope LTA (A1B2C01)
• Vendor or contractor involved (AZB3C02)
Contributing Causes – By Category
A4 - Management/Organization, 64, 28%
A2 - Equipment/Material, 74, 32%
A1 - Design/Engineering, 41, 18%
A3 - Individual Human Performance, 22, 9%
A5 - Communication, 15, 6%
AX - Overall Configuration, 12, 5%
A7 - Other, 5, 2%
Contributing Causes
[Figure: bar chart of event counts (y-axis 0–25) per contributing cause code. Codes on the x-axis, in chart order: A2B6C07, A1B2C01, A4B5C03, A1B4C02, A2B6C01, A4B5C05, A4B5C04, A2B3C03, A2B3C02, A3B3C01, A4B1C08, A1B2C08, A3B2C01, A4B3C08, A4B5C13, A7B1C02, AX, AXB1, A1B2C05, A2B2C01, A2B3C01, A3B1C01, A3B2C05, A3B3C04, A4B2, A4B2C08, A4B3, A4B3C09, A4B5C01, A4B5C09, A5B2C08, A5B4C01, AXB2, A1, A1B1, A1B1C01, A1B1C03, A1B2C09, A1B3C01, A1B5C02, A1B3C02, A2B1C02, A2B7C01, A2B7C04, A2B6C05, A3B1C03, A3B1C04, A3B1C06, A3B2C02, A3B2C04, A4B1, A4B1C03, A4B1C04, A4B1C06, A4B1C05, A4B1C09, A4B2C07, A4B3C11, A4B4C05, A4B5, A4B5C02, A5, A5B1C01, A5B1C03, A5B1C05, A5B2, A5B3C01, A5B4C06, A6B3, A7B1.]
Top Contributing Causes
• Software Failure (A2B6C07)
• Design output scope LTA (A1B2C01)
• Inadequate vendor support of change (A4B5C03)
• Testing of Design/Installation LTA (A1B4C02)
• Defective or failed part (A2B6C01)
• System Interactions not considered (A4B5C05)
• Inadequate risk assessment of change (A4B5C04)
• Post Modification Testing LTA (A2B3C03)
• Inspection/Testing LTA (A2B3C02)
• Attention given to wrong issues (A3B3C01)
• Untimely corrective actions to known issue (A4B1C08)
Equipment/Material - Sub-Categories
[Figure: bar chart of event counts (y-axis 0–45) by sub-category.]
• A2B2 - Periodic/Corrective Maintenance LTA
• A2B3 - Inspection/Testing LTA
• A2B6 - Defective or Failed
• A2B7 - Equipment Interactions LTA
Management/Organization Sub-Categories
[Figure: bar chart of event counts (y-axis 0–40) by sub-category.]
• A4B5 - Change Management LTA
• A4B1 - Management Methods LTA
• A4B3 - Work Planning Organization LTA
• A4B2 - Resource Management LTA
• A4B4 - Supervisory Methods LTA
Top Root/Contributing Causes
• Software Failure (A2B6C07)
• Design output scope LTA (A1B2C01)
• Inadequate vendor support of change (A4B5C03)
• Testing of Design/Installation LTA (A1B4C02)
• Defective or failed part (A2B6C01)
• System Interactions not considered (A4B5C05)
• Inadequate risk assessment of change (A4B5C04)
• Insufficient Job scoping (A4B3C08)
• Post Modification Testing LTA (A2B3C03)
• Inspection/Testing LTA (A2B3C02)
• Attention given to wrong issues (A3B3C01)
• Untimely corrective actions to known issue (A4B1C08)
Software Failure (A2B6C07) - Examples
• A tuning parameter on an AGC display was changed to a value greater than the acceptable limit. The value was not validated, so it generated an invalid array index and corrupted the database.
• A process that synchronizes data between the primary and backup systems was aborting and continuously restarting the servers.
• The SCADA application did not check the maximum number of control commands allowed and generated invalid keys, which ultimately caused the application to abort.
• A software bug caused Fortran array out-of-bounds errors in the control application.
• A vendor-supplied batch file did not have a proper command in a system-wide script.
• A program meant to clean out log files was not deleting them, leading to disk space issues.
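The tuning-parameter example comes down to operator input being used as an array index without a range check. A minimal sketch of the missing validation, assuming a hypothetical lookup table `GAIN_TABLE` and hypothetical function names (none of these come from any EMS product):

```python
# Illustrative lookup table indexed by the operator-entered tuning parameter.
# The table and its values are assumptions made for this sketch.
GAIN_TABLE = [0.5, 1.0, 1.5, 2.0]


def validate_tuning_index(raw_value, table_size):
    """Return a safe integer index, or raise ValueError for bad operator input."""
    try:
        index = int(raw_value)
    except (TypeError, ValueError):
        raise ValueError(f"tuning parameter must be an integer, got {raw_value!r}")
    if not 0 <= index < table_size:
        raise ValueError(
            f"tuning parameter {index} outside allowed range 0..{table_size - 1}"
        )
    return index


def lookup_gain(raw_value):
    # Validate before indexing so a bad entry is rejected at the display
    # instead of reaching the database as an invalid array index.
    return GAIN_TABLE[validate_tuning_index(raw_value, len(GAIN_TABLE))]
```

With this check, an entry such as `7` is rejected with a clear message rather than being written through to the real-time database.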
Software Failure (A2B6C07) - Examples
• A coding error in the alarm process generated a corrupt alarm when the concatenated string of the RTU name and its associated point name was longer than 80 characters.
• The EMS vendor revealed to the entity that the ‘Delete’ operation used to remove previous database files before updating the configuration database had been intermittently unreliable at other installations.
• A process that purges data files created to support the outage management system had a bug and filled up the hard disk. This caused the entity to lose control functionality.
• A failover setting parameter issue led to the failover process failing.
• Synchronization settings between the PCC and BCC domain servers
• The failover program did not account for failure of certain critical applications.
• Unreleased semaphores clogged the system virtual memory and led to failed integrity checks between EMS servers.
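The 80-character alarm example is the kind of limit that can be enforced when a point is defined, before any alarm is generated. A hedged sketch, assuming the alarm name is the RTU name and point name joined by a separator (the separator, field layout and function names are assumptions; only the 80-character limit comes from the slide):

```python
MAX_ALARM_NAME_LEN = 80  # limit quoted on the slide


def build_alarm_name(rtu_name, point_name, max_len=MAX_ALARM_NAME_LEN):
    """Join RTU and point names, refusing names that would overflow the alarm field."""
    name = f"{rtu_name}.{point_name}"  # "." separator is an assumption
    if len(name) > max_len:
        raise ValueError(
            f"alarm name is {len(name)} characters, limit is {max_len}: {name!r}"
        )
    return name


def overlong_alarm_names(points, max_len=MAX_ALARM_NAME_LEN):
    """Return the (rtu, point) pairs whose combined name exceeds the limit."""
    return [(r, p) for r, p in points if len(f"{r}.{p}") > max_len]
```

Running `overlong_alarm_names` over a new RTU's point list during commissioning would flag the offending names before they reach the alarm process.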
Software Failure (A2B6C07) - Examples
• A program locked a file and caused exhaustion of system resources.
• Rapid Spanning Tree Protocol incompatibilities and memory leak issues with communication equipment software
• Bug in the router software regarding spanning tree protocol
• A router encountered a software bug that prevented it from refreshing its mapping between Layer 2 and Layer 3 addresses.
• Health check software had bugs.
• A messaging program had software bugs and was restarting critical programs continuously.
• EMS application start-up scripts had bugs.
• Windows clustering functionality problems
• An automated propagate script failed to replicate the changes due to incorrect host names.
• The start/stop script did not successfully abort the program.
• Display build process failing due to Java heap memory issues
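Several of these examples involve a supervisor that restarts a failing program over and over. One common mitigation, sketched here as an assumption rather than anything described in the presentation, is a restart budget: allow a limited number of restarts within a time window, then stop and raise an alarm. The class name and default values are hypothetical.

```python
import time
from collections import deque


class RestartBudget:
    """Allow at most `max_restarts` restarts within `window_s` seconds."""

    def __init__(self, max_restarts=3, window_s=300.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.clock = clock          # injectable so the logic can be tested
        self._stamps = deque()      # times of recent restarts

    def allow_restart(self):
        now = self.clock()
        # Forget restarts that have fallen out of the sliding window.
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False  # budget spent: stop restarting and alarm instead
        self._stamps.append(now)
        return True
```

A supervisor would call `allow_restart()` before each restart attempt and escalate to an operator when it returns `False`.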
Design output scope LTA (A1B2C01) - Examples
• The firewall manager tool did not prompt the user with a warning about the firewall name change. The vendor has been contacted and will implement the barrier in the tool.
• An automated script from the vendor replicates changes made on one server to all the other servers that need them. However, the script did not have the correct host names of the servers, so repeated SSH attempts to reach the servers failed, which led to the outage.
• The design of the NAT was not compatible with the protocols used on the front end processor servers and prevented a successful failover.
• Lack of redundancy in NIC cards led to network communication issues and an EMS outage.
• A network device configuration change was made, but the failover scenario was missed, leading to failure of failover on demand.
• Incorrectly defined routing paths for the SCADA network
• There was no networking capability, independent of the primary control facility network, to access the BCC directly.
Design output scope LTA (A1B2C01) - Examples
• Entry of a tuning parameter on a display, with a value greater than the allowed limit, resulted in an invalid array index and led to corruption of the database. The tool should have validated the input and not let the user enter invalid information.
• AGC paused due to bad telemetry inputs, and most telemetry inputs failed over to the alternate tone channels. However, some points did not have alternate signals, and it was not clear which inputs had failed. A procedure to identify the failed points is now in place, and a new display that shows AGC telemetry and status has been installed.
• There was no automated process to routinely ensure that file system disk space levels were well below the warning thresholds on EMS servers.
• Failover configuration settings were not set according to the latest vendor-established standard configuration. This caused the SCADA server to transition repeatedly from online to backup mode, interrupting communications from RTUs.
• Greater-than-normal utilization of study applications led to depletion of virtual memory and RAM. The system design did not consider the impact of study applications on performance.
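The disk space example describes a missing routine check. A minimal sketch of such a check using Python's standard `shutil.disk_usage`; the 70% threshold, the function names and the idea of running it on a schedule are assumptions for illustration, not details from the event reports:

```python
import shutil


def disk_usage_percent(path):
    """Percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def check_file_systems(paths, warn_percent=70.0, usage_fn=disk_usage_percent):
    """Return (path, percent) pairs at or above the warning threshold."""
    over = []
    for path in paths:
        pct = usage_fn(path)
        if pct >= warn_percent:
            over.append((path, pct))
    return over
```

Scheduled regularly, for example from cron, and wired to an alarm, a check like this surfaces a filling disk before a purge or log-cleanup failure turns into an outage.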
Inadequate Vendor Support of Change (A4B5C03) - Examples
• The vendor overlooked the setting for automatic site failovers and left it enabled. The entity wanted failover to be manual.
• There was no viable implementation of the entity’s current version at the EMS vendor site on which to test changes. Disabling of control records was crashing the control application.
• The SCADA vendor was contacted before the implementation of NAT on the SCADA network and provided no technical objections or recommendations. The design of the NAT was not compatible with the protocols used on the front end processor servers and prevented a successful failover.
• A vendor did not test a system-wide batch script before sending it to the entity. The entity ran it, which led to loss of RTU communications for the entire network.
• The vendor was asked for help with a domain controller that was unresponsive to system DNS requests, and with why the DNS service on one domain controller did not fail over to another. The vendor offered articles and suggestions but did not provide much help in diagnosing the event.
Inadequate Vendor Support of Change (A4B5C03) - Examples
• The vendor-supplied batch file did not come with good instructions.
• The vendor moved on to supporting a different package from the one provided to the entity.
• Vendor patches provided for the RTU communications problem did not fix the problem the entity was experiencing.
• Vendors could not determine the cause of the clustering problems.
• An application was developed as a joint initiative with the vendor, but neither the entity nor the vendor foresaw its loading on the production system. The software vendor’s recommended memory specifications did not account for this new application.
Testing of Design/Installation LTA (A1B4C02) - Examples
• A new port scanning tool was tested against a subset of systems and showed no impact. After the event, the tool was validated in another test environment similar to real time, and the “half open” issue and non-responsive EMS servers were reproduced.
• An entity did not adequately test the NAT implementation before implementing it on the production SCADA network. The design was not compatible with the protocols used on the front end processors and prevented a successful failover.
• When an automatic site failover occurred due to a network interruption, operators were not trained to log in to the SCADA backup server domain. This was not tested prior to the incident.
• The FAT did not include steps to test the control application when control records are manually disabled in SCADA.
• When new functions are tested, the scale of operations needs to be considered, not just the operation itself. A new “Group control” function was tested on 5 breakers and then applied to 300 breakers, which crashed the SCADA application. This could easily have been tested on a test system, even though controls are not truly sent to RTUs.
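The “Group control” example shows why a test plan should exercise a function at the scale it will see in production, not only at a convenient small size. A sketch of a size-parameterized test run on a test system; `group_control` is a hypothetical stand-in with an artificial capacity limit, not the real SCADA function:

```python
def group_control(breaker_ids, max_batch=100):
    """Hypothetical stand-in for a group control function with a hidden limit."""
    if len(breaker_ids) > max_batch:
        raise RuntimeError(f"command buffer overflow at {len(breaker_ids)} breakers")
    return len(breaker_ids)


def run_scale_tests(func, sizes):
    """Run `func` at each size and record pass/fail per size."""
    results = {}
    for n in sizes:
        try:
            func([f"BKR{i}" for i in range(n)])
            results[n] = "pass"
        except Exception as exc:  # record the failure instead of stopping the run
            results[n] = f"fail: {exc}"
    return results


# Test at the small size and at the intended production size from the slide.
results = run_scale_tests(group_control, sizes=[5, 300])
for size, outcome in results.items():
    print(size, outcome)
```

In this sketch the 5-breaker case passes while the 300-breaker case fails, which is the gap that a small-scale-only test leaves open.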
Testing of Design/Installation LTA (A1B4C02) - Examples
• A new firewall security patch was not tested before it was implemented on the system.
• The entity did not confirm with the vendor that a batch file meant to correct the timing alarms from substation equipment had been run in a test environment that simulates the system configuration before it was run on the live system.
• During the test of a new RTU on the SCADA FEP, the concatenated name length of the RTU and its associated points was not tested, and it generated a corrupt alarm.
• During testing, it was noticed that one of the variables was not being calculated. Staff found that one step had been missed. Instead of testing the recovery method on the development system, they ran the missing step directly in the production environment. The instruction was mistyped and accidentally initialized all global AGC parameters.
• Passwords were changed on the system, but the critical applications were not tested to ensure their normal functionality.
Defective or Failed Part (A2B6C01) - Examples
• Faulty NIC cards
• Auxiliary power regulator control in the regulator control panel was the failed component
• Fiber optic cable lost
• Failed cards in Digital Cross Connect
• Fault within a range of ML-1000 card revisions
• Octal T1 (Card 2) in the SPO DNX and card 2’s failure to fail over to card 1 properly
• UPS module failure led to power outage for aggregation switch
• Failure of fiber optic interface card
• Bypass switch on the MUX UPS
• UPS battery bank failed. Blown circuit boards and fuses due to depletion of temporary batteries.
System Interactions not considered (A4B5C05) - Examples
• Firewall changes and their interaction with a program that copies files between two servers were not considered.
• Firewall changes and their impact on authentication to backup servers during failover were not considered.
• A new port scanning tool was designed to look for open ports. Before it was run, its interaction with ports excluded by other scanning tools was not considered, leading to “half open” connections, excessive resource consumption and non-responsive EMS servers.
• A power outage for a redundant router and its impact on the EMS system were not considered in the design.
• The impact of an alarm program application crash on the rest of the EMS system was not considered.
System Interactions not considered (A4B5C05) - Examples
• The impact of an IOS upgrade on the processing of Access Control Lists (ACL) was not considered before the upgrade. This overloaded the CPUs of critical redundant routers, prevented network traffic from reaching its destination and left dispatchers with no consoles. Without information from the vendor, considering these system interactions is very difficult.
• Spurious data appears in the EMS runback calculation when the system restarts. A software application is not properly deleting the information on a restart.
• The effect of manually disabled SCADA records on the control application was not considered during the testing phases.
• The impact of powering down one critical router that stores and distributes the encryption scheme was not considered.
• The impact of the number of study application users on the performance of the EMS system was not considered when the EMS system was designed. Performance testing would have prevented this issue.
Risk Associated with Change not Identified (A4B5C04) - Examples
• A change to access the BCC from PCC consoles led to an unnecessary dependency between domain servers.
• Changes were tested on the primary servers but not on the backup site servers. Failover did not work.
• An ICCP build process was rebuilding other databases which were not needed.
• Asymmetrical routing: routing protocol paths were changed for SCADA circuits without considering the backup system.
• System performance was degrading, but additional data imports and the addition of non-critical data were still being performed on the system.
• A new “group control” feature was tested on 5 breakers and later applied to 300 breakers, which led to the SCADA application crashing.
• The data center power distribution unit (PDU) “B” tripped offline. This left the EMS servers with only the single feed from the “A” PDU, and the resulting voltage fluctuation was large enough to restart the EMS servers.
Risk Associated with Change not Identified (A4B5C04) - Examples
• A custom program that was supposed to be disabled had not been completely disabled. As a result, data files accumulated without the corresponding cleanup program running.
• The encryption solution for the SONET ring uses two centralized key routers. The system was running out of backup because of maintenance work on the UPS systems. When the redundant UPS was also taken out of service, the impact on the communications was not considered.
• A change to the base log-on configuration that impacted both PCC and BCC was not tested for dependency issues. A test plan was created to create interdependency between PCC & BCC control systems.
• A UPS that was the primary source of power for certain critical communication equipment was out of service for emergency maintenance. Power was temporarily being sourced from another building via power cables, since moving the equipment would have led to an outage. The entity took the chance that the building power would not go away, but because of a fault it did, and SCADA communications were lost.
Insufficient Job Scoping (A4B3C08) – Examples
• A requirement for failover was missing and, therefore, the system was unable to fail back to the PCC on demand. Job scoping should have identified this. A recent network device configuration change was made, but the failover scenario was missed.
• A new port scanning tool was tested against a subset of systems, and the test results showed no impact. But when the scan was run against the real-time network and servers, including EMS servers, denial of service resulted because “half open” connections consumed excessive computing resources. The tool was run with default options and was not configured and tuned to the entity’s network, unlike other tools, which were heavily tuned. The job scoping process was less than adequate to gauge special circumstances and conditions.
• Backup control center functionality is typically tested for transfer of EMS functionality and the routing of data, but with power still available to the data center systems at the primary control center. Job scoping of UPS system maintenance at the PCC did not consider that some of the devices could be powered off, leading to suspect data at the present operations center.
Insufficient Job Scoping (A4B3C08) – Examples
• The network configuration was changed by disabling auto-negotiation on Ethernet interfaces and changing from switch-based failover to router-based failover. However, the scoping did not consider that every EMS server needed its default gateway changed to reflect the router-based failover. When the link between the BCC and PCC failed, the EMS system failed due to excessive broadcasts from the FEP caused by an incorrectly defined default gateway.
• The FEP database was corrupted after the 30,000 FEP point count limit was exceeded: the FEP build process exited early, leaving several critical columns blank in the FEP database. No errors were noted in the offline builds, while a build of the same FEP database in the production environment showed extensive errors due to missing data in critical fields.
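The FEP database example suggests a post-build validation step that runs before a build is promoted to production. A hedged sketch: the 30,000 limit comes from the slide, while the row format (a list of dictionaries), the column names and the function name are assumptions for illustration:

```python
FEP_POINT_LIMIT = 30_000               # limit quoted on the slide
CRITICAL_COLUMNS = ("rtu", "address")  # hypothetical column names


def validate_fep_build(rows, limit=FEP_POINT_LIMIT, critical=CRITICAL_COLUMNS):
    """Return a list of problems found in a built FEP table (list of dicts)."""
    problems = []
    if len(rows) > limit:
        problems.append(f"point count {len(rows)} exceeds limit {limit}")
    for i, row in enumerate(rows):
        # A critical column counts as blank if it is missing, None or empty.
        blank = [c for c in critical if row.get(c) in (None, "")]
        if blank:
            problems.append(f"row {i}: blank critical column(s) {', '.join(blank)}")
    return problems
```

An empty result means the build passed both checks; anything else should block the load into the production FEP.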
Post Modification/Maintenance Testing LTA (A2B3C03) - Examples
• Authentication server failover was tested once, but not after all the firewall changes made over time.
• Implementation of RSTP was not tested after the network configuration was changed.
• Changes were tested on the primary servers but not on the backup site servers. Failover did not work.
• Network device configuration was changed and tested only on the primary system.
• Firewall software changes were not tested against existing critical program functionality.
• After IOS upgrades were installed on critical redundant switches, the impact of the changes on the network was not monitored until traffic was denied a few days later.
Post Modification/Maintenance Testing LTA (A2B3C03) - Examples
• A network device configuration was changed to prevent unauthorized access to the SCADA system. Testing was performed to verify the change: network and system logs were scanned for errors, and the system was monitored before, during and for a period after the change. However, failover testing was not performed, and a latent issue was left undiscovered.
• Had testing been performed to verify BCC connectivity to the substation circuits when their routing paths were updated as a result of a change, the event could have been avoided. The testing process for the change tested connectivity only to the primary control center EMS network and not to the BCC network. The asymmetrical routing situation would have been detected during such tests.
• During a DNS Flexible Single Master Operation (cluster move), the GUI requested permission to elevate privileges. This should not have occurred and revealed a configuration problem with the clustering. A permission setting for the clustered pair did not remain set permanently until one of the network staff elevated his privileges and made the change.
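A recurring pattern in these examples is a change verified at the primary control center only, with no failover or backup-site test. One way to make the gap visible is to compare the tests actually passed against a required matrix of sites and checks. This is a sketch under assumed names; the site labels follow the slides, while the check names and log format are hypothetical:

```python
REQUIRED_SITES = ("PCC", "BCC")                 # primary and backup control centers
REQUIRED_CHECKS = ("connectivity", "failover")  # hypothetical check names


def missing_test_coverage(test_log, sites=REQUIRED_SITES, checks=REQUIRED_CHECKS):
    """Return the (site, check) pairs that were required but never passed.

    `test_log` is an iterable of (site, check, passed) tuples.
    """
    passed = {(site, check) for site, check, ok in test_log if ok}
    return sorted(
        (site, check)
        for site in sites
        for check in checks
        if (site, check) not in passed
    )
```

A change would be closed out only when this function returns an empty list.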
Inspection/Testing LTA (A2B3C02) - Examples
• Inter-dependency of the domain servers was not tested properly.
• Regression testing was not conducted after password changes; FTP caused too many pings.
• Failover to the backup site was not actually conducted; only paper tests were performed.
• A joint initiative between the vendor and the entity led to the development of an application, but the loading on the application and its impact on the system were not tested prior to installation on the production system.
• A change to the base log-on configuration that impacted both PCC and BCC was not tested for dependency issues. A test plan was created to create interdependency between PCC & BCC control systems.
• Due to a distribution feeder switch lockout, power to the PCC was lost and all the telephone systems lost power. The backup phone battery was bad and failed within a very short time. The emergency backup phone system had not been tested thoroughly.
Attention Given to Wrong Issues (A3B3C01) - Examples
• The network admin had neither formal training nor on-the-job training. As a result, he was not familiar with the tool he was using to make the firewall description changes and was not aware of the risks involved in the changes. This is a knowledge-based individual human performance error.
• An EMS support engineer who was called in to assist with alarms realized that he could not log in to the EMS system, assumed that the problem was with EMS server #1 and initiated a failover to backup EMS server #2. Because of replication issues, the failover did not work, and the engineer did not realize that it had halted. The engineer then stopped the process manager on server #1. The system could not fail over to #2 and could not fail back to #1 because of the stopped process manager. The engineer later realized that it was a network issue.
Untimely Corrective Actions to Known Issue (A4B1C08) - Examples
• A week-old standing alarm made quick diagnosis impossible. The DNX was already in alarm before a card failed. The corrective action to replace a part and clear the alarm was untimely.
• An overloaded UPS at the PCC was being swapped with a UPS transported from the BCC. It turned out that they were not interchangeable, even though they passed the eye test and had the same specifications.
Common Themes
1. Software Failures
2. Software Configuration/Installation/Maintenance
3. Hardware Failures
4. Hardware Configuration/Installation/Maintenance
5. Failover Testing Weaknesses
6. Testing Inadequacies
Software Failures
• Application Software Bug/Defect
  - Base System – Alarms/Health Check/Syncing etc.
  - Front End Processing (CFE/FEP/DAC/FCS)
  - Supervisory Control Applications (SCADA)
  - Automatic Generation Control (AGC)
  - Inter Control Center Communication Protocol (ICCP)
  - User Interface (UI)
  - Relational Database Management Systems (RDBMS)
  - Build Process Scripts
  - Miscellaneous Scripts for clean up, start up, cron jobs etc.
• Communication Equipment Firmware/Software Bug/Defect
  - Remote Terminal Units
  - Switches
  - Modems
  - Routers
  - Firewalls
• Operating System Software Bug/Defect
  - Unix/Linux/Windows
Software Configuration/Installation/Maintenance (C/I/M)
• Improper sizing of parameters
• Improper user/application permission issues
• Incorrect application parameter settings
• Incorrect database settings/configuration
• Critical Infrastructure Protection (CIP) installation issues
• Improper installation of patches
• Improper propagation of changes to Spare/Backup servers.
• Improper patch management – Timing/Application etc.
• Improper maintenance of programs/patches
• Incorrect recovery procedures
• Missing documentation for programs/procedures etc.
• Improper security policy configuration changes
• External program configuration issues (Anti Virus, Service Oriented Architecture (SOA) services etc.)
• Improper configuration of security tools
Hardware Failures
• Application Servers/Nodes
  - NIC cards
  - Server hard drive control board
  - Aux Power regulator control
• Communication Equipment
  - Remote Terminal Unit (RTU)
  - Switches
  - Routers
  - Firewalls
  - Fiber Optic Cables
  - Time source
• Power Sources
  - Uninterruptible Power Supply (UPS)
  - External Generators
  - Power Cables
Hardware C/I/M
• Improper server redundancy set up
• Improper power sources redundancy
• Improper Local/Wide Area Network (LAN/WAN) configuration
• Improper routing of power paths
• Incorrect communication network settings
• Improper disk/memory sizing
• Improper settings on routers/switches etc.
• Improper server clustering
• Improper time source configuration
Failover Testing Weaknesses
• Improper settings preventing the failover
• Improper procedure to failover
• System setup issues preventing failover
• Improper patch management between primary/spare/backup servers
• Primary server issues reflected on spare/backup as well – No Isolation
• Improper failover configuration settings
• Improper network device configuration settings for failover
• Design requirements not considering failovers
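Patch management differences between primary, spare and backup servers can be detected by comparing their installed patch sets before a failover is ever needed. A minimal sketch, assuming the patch inventory is already available as a mapping from server name to a set of patch identifiers (how that inventory is collected is outside this sketch, and all names are hypothetical):

```python
def patch_drift(installed):
    """Map each server to the patches it lacks relative to all servers combined.

    `installed` maps server name -> set of patch identifiers.
    Servers with nothing missing are omitted from the result.
    """
    all_patches = set().union(*installed.values())
    drift = {}
    for server, patches in installed.items():
        missing = sorted(all_patches - patches)
        if missing:
            drift[server] = missing
    return drift
```

For example, if the spare server lacks a patch that the primary and backup both have, the result names the spare and the missing patch.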
Testing Inadequacies
• Inadequate
• Improper procedures to test
• Incomplete scope
• Not engaging all the parties involved
SW & HW Categories & Restoration Times
[Figure: mean outage restoration time in minutes (axis 0–160) and event count (axis 0–25) by category.]
Category                     Mean Outage Restoration Time (Mins)   Event Count
Hardware C/I/M               86                                    13
Hardware Failure - Com       131                                   7
Hardware Failure - Power     91                                    8
Hardware Failure - Server    66                                    2
Software Failure - App       100                                   20
Software Failure - Com       152                                   4
Software C/I/M               94                                    19
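The per-category means can be combined into an overall figure by weighting each mean by its event count. The sketch below assumes that the restoration times and event counts printed on this slide pair with the categories in the order the categories are listed; that pairing is read off the extracted chart labels and is not stated explicitly in the source.

```python
# (category, mean restoration time in minutes, event count), in slide order.
# The pairing of values to categories is an assumption (see the text above).
categories = [
    ("Hardware C/I/M", 86, 13),
    ("Hardware Failure - Com", 131, 7),
    ("Hardware Failure - Power", 91, 8),
    ("Hardware Failure - Server", 66, 2),
    ("Software Failure - App", 100, 20),
    ("Software Failure - Com", 152, 4),
    ("Software C/I/M", 94, 19),
]

total_events = sum(count for _, _, count in categories)

# Event-weighted mean: sum(mean_i * count_i) / sum(count_i).
weighted_mean = sum(mean * count for _, mean, count in categories) / total_events

print(f"{total_events} events, weighted mean {weighted_mean:.1f} min")
```

Under that assumption, the weighted mean comes out close to the 99-minute mean total outage restoration time reported earlier in the deck.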
Venkat Tirupati, Senior Reliability Engineer
404-446-2584 office | 404-801-5621 cell