Analysis of EMS Outages - nerc.com Tirupati, Senior Reliability Engineer NERC Monitoring and...

Analysis of EMS Outages Venkat Tirupati, Senior Reliability Engineer NERC Monitoring and Situational Awareness Conference September 18, 2013

Analysis of EMS Outages

Venkat Tirupati, Senior Reliability Engineer

NERC Monitoring and Situational Awareness Conference

September 18, 2013

RELIABILITY | ACCOUNTABILITY 2

Agenda

• Introduction

• Analysis of Restorations

• Contributing & Root causes with examples

• Common themes with examples

• Q & A

Introduction

• Energy Management Systems (EMS) are extremely reliable

• EMS outages increase the risk to the reliability of the grid

• 81 Category 2b events reported (Oct 26, 2010 – Sep 3, 2013)

• 64 events thoroughly analyzed and reviewed

• 54 entities reporting; 20 entities experienced multiple outages

• Restoration time for partial outages: 18 to 411 min

• Restoration time for complete outages: 12 to 253 min

• Vendor-agnostic failures: software and hardware issues

• Several noticeable themes

Analysis of Restoration Times

[Figure: restoration time in minutes (0 to 450) for events 1 to 77, showing complete and partial outage restoration times together with the mean complete, mean partial, and mean overall outage restoration times. Inset: mean complete and mean partial outage restoration time by year, 2010 to 2013.]

• Mean Complete Outage Restoration Time: 56 Minutes

• Mean Partial Outage Restoration Time: 43 Minutes

• Mean Total Outage Restoration Time: 99 Minutes

Restoration Time

[Figure: histogram of the number of events (0 to 12) by restoration time, in 10-minute intervals.]

Outage Restoration Times by Date

[Figure: complete and partial outage restoration times in minutes (0 to 450) for events 1 to 77, October 26, 2010 – September 3, 2013, with the years 2011, 2012, and 2013 marked.]

Number of Reports - Quarterly

[Figure: number of reported outages (0 to 12) per quarter, 2010 Q4 through 2013 Q3, covering October 26, 2010 – September 3, 2013.]

Characteristics of the EMS outages

[Figure: bar chart of the number of events (0 to 90) in three paired groupings.]

• Outages on weekdays: 69; outages on weekends: 12

• CIP activity led to outage: 10; non-CIP activity led to outage: 71

• Outage due to planned activity: 31; outage unforeseen: 50

Outage Time of the day (2 Hr Intervals)

[Figure: number of events (0 to 10) per two-hour interval, split into outages due to planned activity and unforeseen outages.]

Interval: Outage due to Planned Activity / Outage Unforeseen

• 0:00 - 2:00: 1 / 7

• 2:00 - 4:00: 0 / 5

• 4:00 - 6:00: 1 / 9

• 6:00 - 8:00: 1 / 1

• 8:00 - 10:00: 5 / 3

• 10:00 - 12:00: 6 / 3

• 12:00 - 14:00: 2 / 5

• 14:00 - 16:00: 4 / 4

• 16:00 - 18:00: 4 / 3

• 18:00 - 20:00: 4 / 2

• 20:00 - 22:00: 3 / 3

• 22:00 - 24:00: 0 / 5

Root Causes – By Category

• A1 - Design/Engineering: 9 (16%)

• A2 - Equipment/Material: 14 (25%)

• A3 - Individual Human Performance: 1

• A4 - Management/Organization: 17 (30%)

• A5 - Communication: 3

• A6 - Training: 1

• AZ - Information LTA: 11 (20%)
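The percentage labels on this chart can be cross-checked against the counts. The sketch below is illustrative only: it assumes the seven category counts on this slide are the complete set, and the names `root_cause_counts` and `shares` are not from the presentation.

```python
# Root-cause counts by category, copied from the slide.
root_cause_counts = {
    "A1 - Design/Engineering": 9,
    "A2 - Equipment/Material": 14,
    "A3 - Individual Human Performance": 1,
    "A4 - Management/Organization": 17,
    "A5 - Communication": 3,
    "A6 - Training": 1,
    "AZ - Information LTA": 11,
}


def shares(counts):
    """Return each category's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in shares(root_cause_counts).items():
        print(f"{name}: {root_cause_counts[name]} ({pct}%)")
```

Under that assumption the counts sum to 56, and the computed shares for A1, A2, A4, and AZ agree with the 16%, 25%, 30%, and 20% shown on the slide.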

Root Causes

[Figure: bar chart of the number of occurrences (0 to 8) of each of the top root causes.]

Top Root Causes

• Software Failure (A2B6C07)

• Testing of Design/Installation LTA (A1B4C02)

• Information to determine cause LTA (AZ)

• Insufficient Job scoping (A4B3C08)

• Inadequate risk assessment of change (A4B5C04)

• Post modification testing LTA (A2B3C03)

• Design output scope LTA (A1B2C01)

• Vendor or contractor involved (AZB3C02)

Contributing Causes – By Category

• A4 - Management/Organization: 64 (28%)

• A2 - Equipment/Material: 74 (32%)

• A1 - Design/Engineering: 41 (18%)

• A3 - Individual Human Performance: 22 (9%)

• A5 - Communication: 15 (6%)

• AX - Overall Configuration: 12 (5%)

• A7 - Other: 5 (2%)

Contributing Causes

[Figure: bar chart of the number of occurrences (0 to 25) of each contributing cause code.]

Top Contributing Causes

• Software Failure (A2B6C07)

• Design output scope LTA (A1B2C01)

• Inadequate vendor support of change (A4B5C03)

• Testing of Design/Installation LTA (A1B4C02)

• Defective or failed part (A2B6C01)

• System Interactions not considered (A4B5C05)

• Inadequate risk assessment of change (A4B5C04)

• Post Modification Testing LTA (A2B3C03)

• Inspection/Testing LTA (A2B3C02)

• Attention given to wrong issues (A3B3C01)

• Untimely corrective actions to known issue (A4B1C08)

Equipment/Material - Sub-Categories

[Figure: bar chart of the number of causes (0 to 45) in each Equipment/Material sub-category.]

• A2B2 - Periodic/Corrective Maintenance LTA

• A2B3 - Inspection/Testing LTA

• A2B6 - Defective or Failed

• A2B7 - Equipment Interactions LTA

Management/Organization Sub-Categories

[Figure: bar chart of the number of causes (0 to 40) in each Management/Organization sub-category.]

A4B5 - Change Management LTA

A4B1 - Management Methods LTA

A4B3 - Work Planning Organization LTA

A4B2 - Resource Management LTA

A4B4 - Supervisory Methods LTA

Top Root/Contributing Causes

• Software Failure (A2B6C07)

• Design output scope LTA (A1B2C01)

• Inadequate vendor support of change (A4B5C03)

• Testing of Design/Installation LTA (A1B4C02)

• Defective or failed part (A2B6C01)

• System Interactions not considered (A4B5C05)

• Inadequate risk assessment of change (A4B5C04)

• Insufficient Job scoping (A4B3C08)

• Post Modification Testing LTA (A2B3C03)

• Inspection/Testing LTA (A2B3C02)

• Attention given to wrong issues (A3B3C01)

• Untimely corrective actions to known issue (A4B1C08)

Software Failure (A2B6C07) - Examples

• A tuning parameter on an AGC display was changed to a value greater than the acceptable limit. The entry was not validated, so it generated an invalid array index and corrupted the database.

• A process that synchronizes data between the primary and backup systems was aborting and continuously restarting the servers.

• The SCADA application did not check for the maximum number of control commands allowed and generated invalid keys, which ultimately caused the application to abort.

• Fortran array out-of-bounds issues in a control application due to a software bug.

• A vendor-supplied batch file did not have a proper command in a system-wide script.

• A program meant to clean out log files was not deleting them, leading to disk space issues.
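Several of these examples share a pattern: a value entered or computed at run time was used as an array index without a range check. The following is a minimal sketch of that missing check, written in Python rather than in any EMS vendor's language. The function names `checked_index` and `set_tuning_value` and the table layout are hypothetical and are not taken from any of the reported systems.

```python
def checked_index(value, table_size):
    """Return value if it is a valid index into a table of table_size entries.

    Raises ValueError otherwise, so a bad entry is rejected before it can
    touch storage. Negative values are rejected too, because Python would
    otherwise silently index from the end of the list.
    """
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"expected an integer, got {value!r}")
    if not 0 <= value < table_size:
        raise ValueError(
            f"{value} is outside the allowed range 0..{table_size - 1}"
        )
    return value


def set_tuning_value(table, entered_index, new_value):
    """Store new_value only after the operator-entered index passes validation."""
    table[checked_index(entered_index, len(table))] = new_value
```

The point of the design is that the rejection happens at the entry point, so the stored data is never modified by an out-of-range request.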

Software Failure (A2B6C07) - Examples

• A coding error in the alarm process generated a corrupt alarm when the concatenated name of an RTU and its associated point was longer than 80 characters.

• The EMS vendor revealed to the entity that the ‘Delete’ operation used to remove previous database files before updating the configuration database had been intermittently unreliable at other installations.

• A process that purges data files created to support the outage management system had a bug and filled up the hard disk. This caused the entity to lose control functionality.

• A failover setting parameter issue led to the failover process failing.

• Synchronization settings between the primary control center (PCC) and backup control center (BCC) domain servers.

• The failover program did not account for failure of certain critical applications.

• Unreleased semaphores clogged the system virtual memory and led to failed integrity checks between EMS servers.
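The unreleased-semaphore example is the kind of leak that a scoped acquire and release guards against. The sketch below uses Python's `threading.BoundedSemaphore` purely as a stand-in for whatever operating-system semaphores the affected EMS used. The slot count of 2 and the function name `run_with_slot` are assumptions made for illustration.

```python
import threading

# Two slots is an arbitrary illustrative limit, not a value from the event.
slots = threading.BoundedSemaphore(2)


def run_with_slot(work):
    """Run work() while holding one slot.

    The with-statement releases the slot on every exit path, including
    when work() raises, so a failing task cannot leak the semaphore.
    """
    with slots:
        return work()
```

Because the release is tied to leaving the `with` block, an exception inside `work` still returns the slot to the pool.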

Software Failure (A2B6C07) - Examples

• A program locked a file and caused exhaustion of system resources

• Rapid Spanning Tree Protocol incompatibilities and memory leak issues with communication equipment software

• Bug in the router software regarding spanning tree protocol

• Router encountered a software bug that prevented it from refreshing its mapping between Layer 2 and Layer 3 addresses

• Health check software had bugs

• Messaging program had software bugs and was restarting critical programs continuously

• EMS application start-up scripts had bugs

• Windows clustering functionality problems

• Automated propagate script failed to replicate the changes due to incorrect host names.

• The start/stop script did not successfully abort the program.

• Display build process failing due to Java heap memory issues
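For the start/stop script that did not successfully abort its program, one common safeguard is to confirm that the process has actually exited and to escalate if it has not. This is a minimal sketch using Python's standard `subprocess` module. The function name `stop_process` and the five-second grace period are assumptions, not details from the reported event.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Ask proc to exit, escalate to a forced kill, and return its exit code."""
    proc.terminate()  # polite request first
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # the process ignored the request; force it
        proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Stand-in for a long-running application: a child that sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print("exit code:", stop_process(child))
```

The function only returns once the child has really exited, so the caller can rely on the result instead of assuming the stop request worked.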

Design output scope LTA (A1B2C01) - Examples

• The firewall manager tool did not prompt the user with a warning about the firewall name change. The vendor has been contacted and will implement this barrier in the tool.

• An automated script from the vendor replicates changes made on one server to all the other servers that need them. However, the script did not have the correct host names of the servers. As a result, its repeated SSH attempts to reach the servers failed, which led to the outage.

• The design of the NAT was not compatible with the protocols used on the front end processor servers and prevented a successful failover.

• Lack of redundancy in NIC cards led to network communication issues and an EMS outage.

• A network device configuration change was made, but the failover scenario was missed, leading to failure of failover on demand.

• Incorrectly defined routing paths for the SCADA network.

• There was no networking capability, independent of the primary control facility network, to access the BCC directly.
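The replication-script example suggests a simple pre-flight check: confirm that every target host name resolves before any connection is attempted. The sketch below is one hedged way to do that with Python's standard `socket` module. The function name `unresolved_hosts` and the host names in the usage lines are hypothetical.

```python
import socket


def unresolved_hosts(host_names, resolve=socket.gethostbyname):
    """Return the host names that fail to resolve, in input order.

    resolve is injectable so the check can be exercised without a network.
    """
    failed = []
    for name in host_names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical target list for a replication run.
    targets = ["ems-primary.example", "ems-backup.example"]
    missing = unresolved_hosts(targets)
    if missing:
        raise SystemExit(f"refusing to replicate; unresolved hosts: {missing}")
```

A script that stops here fails once and visibly, rather than retrying connections to names that can never resolve.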

Design output scope LTA (A1B2C01) - Examples

• Entry of a tuning parameter on a display, with a value greater than the allowed limit, resulted in an invalid array index and led to corruption of the database. The tool should have validated the entry and not let the user enter invalid information.

• AGC paused due to bad telemetry inputs, and most telemetry inputs failed over to the alternate tone channels. However, some points did not have alternate signals, and it was not clear which inputs had failed. A procedure to identify the failed points is now in place, and a new display that shows AGC telemetry and status has been installed.

• There was no automated process to routinely ensure that file system disk space levels were well below the warning thresholds on EMS servers.

• Failover configuration settings were not set according to the latest vendor-established standard configuration. This caused the SCADA server to transition repeatedly from online to backup mode, interrupting communications from the RTUs.

• Greater-than-normal utilization of study applications led to depletion of virtual memory and RAM. The system design did not consider the impact of study applications on performance.
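The missing disk space check mentioned above can be approximated with a few lines run on a schedule. This is a sketch only, using `shutil.disk_usage` from the Python standard library. The 80% warning threshold and the function names are assumptions, and a real deployment would feed the result into whatever alarming the entity already uses.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of the file system holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(paths, warn_percent=80.0, usage_fn=disk_usage_percent):
    """Return the paths whose file systems are at or above warn_percent.

    warn_percent is an illustrative default; usage_fn is injectable so the
    logic can be tested without filling a disk.
    """
    return [p for p in paths if usage_fn(p) >= warn_percent]


if __name__ == "__main__":
    for path in over_threshold(["."]):
        print(f"WARNING: {path} is at {disk_usage_percent(path):.1f}% capacity")
```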

Inadequate Vendor Support of Change (A4B5C03) - Examples

• The vendor overlooked the setting for automatic site failovers and left it enabled. The entity wanted failover to be manual.

• There was no viable implementation of the entity’s current version at the EMS vendor site on which to test changes. Disabling of control records was crashing the control application.

• The SCADA vendor was contacted before the implementation of NAT on the SCADA network and provided no technical objections or recommendations. The design of the NAT was not compatible with the protocols used on the front end processor servers and prevented a successful failover.

• A vendor did not test a system-wide batch script before sending it to the entity. The entity ran it, which led to loss of RTU communications for the entire network.

• The vendor was asked for help with a domain controller that was unresponsive to system DNS requests, and with why the DNS service on one domain controller did not fail over to another. The vendor offered articles and suggestions but did not provide much help in diagnosing the event.

Inadequate Vendor Support of Change (A4B5C03) - Examples

• The vendor-supplied batch file did not come with good instructions.

• The vendor had moved on to supporting a different package from the one provided to the entity.

• Vendor patches provided for the RTU communications problem did not fix the problem the entity was experiencing.

• Vendors could not figure out the cause of the clustering problems.

• An application was developed as a joint initiative with the vendor, but neither the entity nor the vendor foresaw the loading on the production system. The software vendor’s recommended memory specifications did not account for this new application.

Testing of Design/Installation LTA (A1B4C02) - Examples

• A new port scanning tool was tested against a subset of systems and showed no impact. After the event, the tool was validated in another test environment similar to real time, and the “half open” connection issue and non-responsive EMS servers were reproduced.

• An entity did not adequately test the NAT implementation before implementing it on the production SCADA network. The design was not compatible with the protocols used on the front end processors and prevented a successful failover.

• When an automatic site failover occurred due to a network interruption, operators were not trained to log in to the SCADA backup server domain. This was not tested prior to the incident.

• The factory acceptance test (FAT) did not have test steps to test the control application when control records are manually disabled in SCADA.

• When new functions are tested, the scale of operations needs to be considered, not just the operation itself. A new “Group control” function was tested on 5 breakers and then applied to 300 breakers, which crashed the SCADA application. This could easily have been tested on a test system, even though controls are not truly sent to RTUs there.
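The “Group control” example (tested on 5 breakers, applied to 300) points to exercising a new function at its intended production scale on a test system. One hedged way to plan such a test is to step the batch size up from the smoke-test size to the production size. The helper `plan_scale_tests` and its doubling schedule are illustrative assumptions, not a procedure from the presentation.

```python
def plan_scale_tests(production_count, smoke_count=5):
    """Return batch sizes to exercise on a test system.

    Sizes double from smoke_count and always end at production_count, so
    the full intended scale is covered before the function goes live.
    """
    if smoke_count < 1 or production_count < 1:
        raise ValueError("counts must be positive")
    sizes = []
    size = smoke_count
    while size < production_count:
        sizes.append(size)
        size *= 2
    sizes.append(production_count)
    return sizes


if __name__ == "__main__":
    # Breaker counts taken from the example: 5 in the test, 300 in production.
    print(plan_scale_tests(300))  # [5, 10, 20, 40, 80, 160, 300]
```

Stepping up gradually also shows roughly where the function starts to degrade, which is more useful than a single pass or fail at full scale.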

Testing of Design/Installation LTA (A1B4C02) - Examples

• A new firewall security patch was not tested before it was implemented on the system.

• The entity did not confirm with the vendor that a batch file, meant to correct the timing alarms coming out of substation equipment, had been run in a test environment that simulates the system configuration before it was run on the live system.

• During the test of a new RTU on the SCADA FEP, the concatenated name length of the RTU and its associated point was not tested, and it generated a corrupt alarm.

• During testing, it was noticed that one of the variables was not being calculated, and staff found that one step had been missed. Instead of testing the recovery method on the development system, staff ran the missing step directly in the production environment. The instruction was mistyped and accidentally initialized all global AGC parameters.

• Passwords were changed on the system, but the critical applications were not tested to ensure their normal functionality.

Defective or a Failed Part (A2B6C01) - Examples

• Faulty NIC Cards

• Auxiliary power regulator control in the Regulator control panel was the failed component

• Fiber optic cable lost

• Failed cards in Digital Cross Connect

• Fault within a range of ML-1000 card revisions

• Octal T1 (Card 2) in the SPO DNX and card 2’s failure to failover to card 1 properly.

• UPS module failure led to power outage for aggregation switch.

• Failure of fiber optic interface card

• Bypass switch on the MUX UPS

• UPS battery bank failed. Blown circuit boards and fuses due to depletion of temporary batteries.

System Interactions not considered (A4B5C05) - Examples

• Firewall changes and their interaction with a program that copies files between two servers were not considered.

• Firewall changes and their impact on authentication to backup servers when a failover happens were not considered.

• A new port scanning tool was designed to look for open ports. Before it was run, its interaction with ports that were excluded by other scanning tools was not considered, leading to “half open” connections, excessive resource consumption, and non-responsive EMS servers.

• A power outage for the redundant router and its impact on the EMS system were not considered in the design.

• The impact of an alarm program crash on the rest of the EMS system was not considered.

System Interactions not considered (A4B5C05) - Examples

• An IOS upgrade and its impact on the processing of Access Control Lists (ACL) were not considered before the upgrade. This overloaded the CPUs of critical redundant routers, prevented network traffic from reaching its destination, and left dispatchers with no consoles. Without information from the vendor, considering these system interactions is very difficult.

• Spurious data appeared in the EMS runback calculation when the system restarted, because a software application was not properly deleting the information on a restart.

• The impact of manually disabled SCADA records on the control application was not considered during the testing phases.

• The impact of powering down one critical router that stores and distributes the encryption scheme was not considered.

• The impact of the number of study application users on the performance of the EMS system was not considered when the EMS system was designed. Performance testing would have prevented this issue.

Risk Associated with Change not Identified (A4B5C04) - Examples

• A change to allow access to the BCC from PCC consoles led to an unnecessary dependency between domain servers.

• Changes were tested on the primary servers but not on the backup site servers. Failover did not work.

• An ICCP build process was rebuilding other databases that were not needed.

• Asymmetrical routing: routing protocol paths were changed for SCADA circuits without considering the backup system.

• System performance was degrading, but additional data imports and additions of non-critical data were still being performed on the system.

• A new “group control” feature was tested on 5 breakers and later applied to 300 breakers, which led to the SCADA application crashing.

• The data center power distribution unit (PDU) “B” tripped offline. This left the EMS servers with only the single feed from the “A” PDU and resulted in a voltage fluctuation large enough to cause the EMS servers to restart.

Risk Associated with Change not Identified (A4B5C04) - Examples

• A custom program that was supposed to have been disabled had not been completely disabled. As a result, data files accumulated without the corresponding cleanup program running.

• The encryption solution for the SONET ring uses two centralized key routers. The system was running from backup because of maintenance work on the UPS systems. When the redundant UPS was also taken out of service, the impact on communications was not considered.

• A change to the base log-on configuration that impacted both PCC and BCC was not tested for dependency issues. A test plan was created to create interdependency between PCC & BCC control systems.

• A UPS that was the primary source of power for certain critical communication equipment was out of service for emergency maintenance. Power was temporarily being sourced from another building via power cables, since moving the equipment would have led to an outage. The entity took the chance that the building power would not be lost, but a fault took it out and SCADA communications were lost.

Insufficient Job Scoping (A4B3C08) – Examples

• A requirement for failover was missing, so the system was unable to fail back to the PCC on demand. Job scoping should have identified this. A recent network device configuration change had been made, but the failover scenario was missed.

• A new port scanning tool was tested against a subset of systems, and the test results showed no impact. But when the scan was run against the real-time network and servers, including EMS servers, a denial of service resulted because “half open” connections consumed excessive computing resources. The tool was run with default options and was not configured and tuned to the entity’s network, unlike other tools, which were heavily tuned. The job scoping process was less than adequate to gauge special circumstances and conditions.

• Backup control center functionality is typically tested for transfer of EMS functionality and the routing of data, but with power still available to the data center systems at the primary control center. Job scoping of the UPS system maintenance at the PCC did not consider that some of the devices could be powered off, leading to suspect data at the present operations center.

Insufficient Job Scoping (A4B3C08) – Examples

• The network configuration was changed by disabling auto-negotiation on Ethernet interfaces and changing from switch-based failover to router-based failover. But the scoping did not consider that every EMS server needed its default gateway changed to reflect the router-based failover. When the link between the BCC and PCC failed, the EMS system failed due to excessive broadcasts from the FEP caused by an incorrectly defined default gateway.

• The FEP database was corrupted after the 30,000 FEP point count limit was exceeded: the FEP build process exited early, leaving several critical columns blank in the FEP database. No errors were noted in the offline builds, while a FEP build of the same FEP database in the production environment showed extensive errors due to missing data in critical fields.
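The FEP build example describes two conditions that can be checked mechanically before a build is promoted: the point count against the limit, and blank values in critical columns. The sketch below assumes the build output can be read as a list of row dictionaries. The 30,000 limit is the figure quoted in the example, while the column names and the function `validate_fep_rows` are hypothetical.

```python
FEP_POINT_LIMIT = 30_000  # limit quoted in the example above


def validate_fep_rows(rows, critical_columns, limit=FEP_POINT_LIMIT):
    """Return a list of problems found in a built FEP table.

    rows is a list of dicts, one per point; critical_columns names the
    fields that must not be blank. An empty result means no problem found.
    """
    problems = []
    if len(rows) > limit:
        problems.append(f"point count {len(rows)} exceeds limit {limit}")
    for i, row in enumerate(rows):
        blank = [c for c in critical_columns if row.get(c) in (None, "")]
        if blank:
            problems.append(f"row {i}: blank critical columns {blank}")
    return problems


if __name__ == "__main__":
    # Hypothetical two-row build output with one incomplete row.
    built = [
        {"point": "P1", "rtu": "RTU-7", "address": "12"},
        {"point": "P2", "rtu": "", "address": None},
    ]
    for problem in validate_fep_rows(built, ["rtu", "address"]):
        print(problem)
```

Running the same check against both the offline build and the production build would surface the kind of discrepancy described in the example.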

Post Modification/Maintenance Testing LTA (A2B3C03) - Examples

• Authentication server failover was tested once, but not after all the firewall changes made over time.

• Implementation of RSTP was not tested after changing the network configuration.

• Changes were tested on the primary servers but not on the backup site servers. Failover did not work.

• A network device configuration was changed and tested only on the primary system.

• Firewall software changes were not tested against existing critical program functionality.

• After IOS upgrades were installed on critical redundant switches, the impact of the changes on the network was not monitored until traffic was denied a few days later.

Post Modification/Maintenance Testing LTA (A2B3C03) - Examples

• A network device configuration was changed to prevent unauthorized access to the SCADA system, and testing was performed to verify the change. Network and system logs were scanned for errors, and the system was monitored before, during, and for a period after the configuration change. However, failover testing was not performed, and a latent issue was left undiscovered.

• If testing had been performed to verify BCC connectivity to the substation circuits when their routing paths were updated as a result of a change, the event could have been avoided. The testing process for the change tested only the connectivity to the primary control center EMS network and not the BCC network. The asymmetrical routing situation would have been detected during such tests.

• During a DNS Flexible Single Master Operation (cluster move), the GUI requested permission to elevate privileges. This should not have occurred and revealed a configuration problem with the clustering. A permission setting for the clustered pair did not remain set permanently until one of the network staff elevated his privileges and made the change.
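Many of the post-modification examples come down to a change being verified on the primary system only. One way to make that gap visible is to run the same named checks against every site and report each failing pair. This is a minimal sketch: the site labels, the check names, and the function `run_post_change_checks` are placeholders, and real checks would exercise failover, connectivity, and critical applications.

```python
def run_post_change_checks(checks, sites):
    """Run every check against every site.

    checks maps a check name to a callable taking a site and returning
    True on success. Returns the (site, check name) pairs that failed;
    a check that raises is counted as a failure.
    """
    failures = []
    for site in sites:
        for name, check in checks.items():
            try:
                ok = bool(check(site))
            except Exception:
                ok = False
            if not ok:
                failures.append((site, name))
    return failures


if __name__ == "__main__":
    # Placeholder check: pretend connectivity was only set up for the PCC.
    checks = {"substation connectivity": lambda site: site == "PCC"}
    print(run_post_change_checks(checks, ["PCC", "BCC"]))
```

Listing the sites explicitly means a backup site cannot be skipped silently, because each one either passes or appears in the failure list.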

• Inter-dependency of the domain servers was not tested properly

• Regression testing not conducted after password changes - FTP caused too many pings

• Failover to the backup site was not conducted; only paper tests were performed

• A joint initiative between the vendor and the entity led to the development of an application, but the loading on the application and its impact on the system were not tested prior to installation on the production system.

• A change to the base log-on configuration that impacted both PCC and BCC was not tested for dependency issues. A test plan was subsequently created to cover the interdependency between the PCC & BCC control systems.

• Due to a distribution feeder switch lockout, power to the PCC was lost and all the telephone systems lost power. The backup phone battery was bad and failed within a very short time. The emergency backup phone system had not been tested thoroughly.

Inspection Testing LTA (A2B3C02) - Examples

• The network admin had neither formal training nor on-the-job training. As a result, he was not familiar with the tool he was using to make the firewall description changes and was not aware of the risks involved in the changes. This is a knowledge-based individual human performance error.

• An EMS support engineer who was called in to assist with alarms realized that he could not log in to the EMS system, assumed that the problem was with EMS server #1, and initiated a failover to backup EMS server #2. Because of replication issues, the failover did not work. The engineer did not realize that the failover had halted and stopped the process manager on server #1. The failover to #2 did not complete, and the system did not fail back to #1 because the process manager had been stopped. The engineer later realized that it was a network issue.

Attention to Wrong Issues (A3B3C01) - Examples
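
The second example suggests a simple guard: confirm the likely cause and the health of the standby before initiating a manual failover. The sketch below is one way to express that as a pre-failover gate. The three status fields and the function name are hypothetical; how each status is actually obtained depends on the EMS platform and is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreFailoverStatus:
    """Hypothetical health inputs gathered before a manual failover."""
    network_reachable: bool        # can the workstation reach both servers?
    replication_healthy: bool      # is the standby in sync with the active server?
    process_manager_running: bool  # is the process manager up on the active server?


def failover_blockers(status: PreFailoverStatus) -> list[str]:
    """List the conditions that should be resolved before failing over."""
    blockers = []
    if not status.network_reachable:
        blockers.append("network")
    if not status.replication_healthy:
        blockers.append("replication")
    if not status.process_manager_running:
        blockers.append("process_manager")
    return blockers


if __name__ == "__main__":
    # Conditions loosely modeled on the example: a network problem combined
    # with unhealthy replication to the standby server.
    status = PreFailoverStatus(
        network_reachable=False,
        replication_healthy=False,
        process_manager_running=True,
    )
    blockers = failover_blockers(status)
    if blockers:
        print("Do not fail over yet; resolve:", ", ".join(blockers))
    else:
        print("Preconditions met; failover may proceed.")
```

An empty list means none of the checked preconditions failed; it is not a guarantee that the failover will succeed.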

• A week-old standing alarm made quick diagnosis impossible. The DNX was already in alarm before a card failed. The corrective action to replace a part and clear the alarm was untimely.

• An overloaded UPS at the PCC was being swapped with a UPS transported from the BCC. The two units turned out not to be interchangeable, even though they passed a visual check and had the same specifications.

Untimely Corrective actions to known problem (A4B1C08) - Examples

Common themes

1. Software Failures

2. Software Configuration/Installation/Maintenance

3. Hardware Failures

4. Hardware Configuration/Installation/Maintenance

5. Failover Testing Weaknesses

6. Testing Inadequacies

Common Themes

• Application Software Bug/Defect: Base System (Alarms/Health Check/Syncing etc.); Front End Processing (CFE/FEP/DAC/FCS); Supervisory Control Applications (SCADA); Automatic Generation Control (AGC); Inter Control Center Communication Protocol (ICCP); User Interface (UI); Relational Database Management Systems (RDBMS); Build Process Scripts; Miscellaneous Scripts for clean up, start up, cron jobs etc.

• Communication Equipment Firmware/Software Bug/Defect: Remote Terminal Units; Switches; Modems; Routers; Firewalls

• Operating System Software Bug/Defect: Unix/Linux/Windows

Software Failures

Software C/I/M

• Improper sizing of parameters

• Improper user/application permission issues

• Incorrect application parameter settings

• Incorrect database settings/configuration

• Critical Infrastructure Protection (CIP) installation issues

• Improper installation of patches

• Improper propagation of changes to Spare/Backup servers.

• Improper patch management – Timing/Application etc.

• Improper maintenance of programs/patches

• Incorrect recovery procedures

• Missing documentation for programs/procedures etc.

• Improper security policy configuration changes

• External program configuration issues (Anti Virus, Service Oriented Architecture (SOA) services etc.)

• Improper configuration of security tools

• Application Servers/Nodes: NIC cards; Server hard drive control board; Aux Power regulator control

• Communication Equipment: Remote Terminal Unit (RTU); Switches; Routers; Firewalls; Fiber Optic Cables; Time source

• Power Sources: Uninterruptible Power Supply (UPS); External Generators; Power Cables

Hardware Failures

• Improper server redundancy set up

• Improper power sources redundancy

• Improper Local/Wide Area Network (LAN/WAN) configuration

• Improper routing of power paths

• Incorrect communication network settings

• Improper disk/memory sizing

• Improper settings on routers/switches etc.

• Improper server clustering

• Improper time source configuration

Hardware C/I/M

• Improper settings preventing the failover

• Improper procedure to failover

• System setup issues preventing failover

• Improper patch management between primary/spare/backup servers

• Primary server issues reflected on spare/backup as well – No Isolation

• Improper failover configuration settings

• Improper network device configuration settings for failover

• Design requirements not considering failovers

Failover Testing Weaknesses
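
One of the weaknesses listed above, improper patch management between primary, spare, and backup servers, can be checked mechanically when a patch inventory is available. The sketch below compares each server's installed patches against a reference server. The inventory format, the server names, and the patch identifiers are hypothetical; collecting the inventory from real servers is outside its scope.

```python
def patch_drift(
    inventory: dict[str, set[str]], reference: str = "primary"
) -> dict[str, dict[str, set[str]]]:
    """Report patches missing from, or extra on, each non-reference server.

    `inventory` maps a server name to the set of patch identifiers installed
    on it. Servers that match the reference exactly are left out of the report.
    """
    reference_patches = inventory[reference]
    report = {}
    for server, patches in inventory.items():
        if server == reference:
            continue
        missing = reference_patches - patches
        extra = patches - reference_patches
        if missing or extra:
            report[server] = {"missing": missing, "extra": extra}
    return report


if __name__ == "__main__":
    # Hypothetical inventory: the backup server never received patch P2.
    inventory = {
        "primary": {"P1", "P2"},
        "spare": {"P1", "P2"},
        "backup": {"P1"},
    }
    for server, drift in patch_drift(inventory).items():
        print(server, "missing:", sorted(drift["missing"]), "extra:", sorted(drift["extra"]))
```

Running a comparison like this before a planned failover test would surface a primary/backup mismatch while there is still time to correct it.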

• Inadequate

• Improper procedures to test

• Incomplete scope

• Not engaging all the parties involved

Testing Inadequacies

SW & HW Categories & Restoration Times

[Figure: Mean Outage Restoration Time (Mins) and Event Count by category. Axes: Restoration Time in Minutes (0 to 160) and Event Count (0 to 25). Categories: Hardware C/I/M; Hardware Failure - Com; Hardware Failure - Power; Hardware Failure - Server; Software Failure - App; Software Failure - Com; Software C/I/M. Data labels as extracted: restoration time 86, 131, 91, 66, 100, 152, 94; event count 13, 7, 8, 2, 20, 4, 19.]
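
A summary of this kind, event count and mean restoration time per category, can be derived from per-event records by grouping on category. The sketch below shows the aggregation only. The event records in the usage example are hypothetical placeholders, not the data behind the chart.

```python
from collections import defaultdict
from statistics import mean


def summarize(events):
    """Group (category, restoration_minutes) pairs by category.

    Returns a dict mapping each category to a tuple of
    (event count, mean restoration time in minutes).
    """
    minutes_by_category = defaultdict(list)
    for category, minutes in events:
        minutes_by_category[category].append(minutes)
    return {
        category: (len(values), mean(values))
        for category, values in minutes_by_category.items()
    }


if __name__ == "__main__":
    # Hypothetical events for illustration only.
    events = [
        ("Software C/I/M", 60),
        ("Software C/I/M", 120),
        ("Hardware Failure - Power", 90),
    ]
    for category, (count, mean_minutes) in summarize(events).items():
        print(f"{category}: {count} events, mean {mean_minutes} min")
```

Because the mean is sensitive to a single long outage, a category with few events may be better read alongside its event count, as the chart does.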

Venkat Tirupati Senior Reliability Engineer 404-446-2584 office | 404-801-5621 cell [email protected]