Root Cause for Data Analysis

8/12/2019 Root Cause for Data Analysis

http://slidepdf.com/reader/full/root-cause-for-data-analysis 1/8

Root Cause Analysis

What is Root Cause Analysis?

Root Cause Analysis (or RCA) is a process by which the root cause of a data quality issue is diagnosed remotely after a UDR or UDI is completed. Properly diagnosing the root causes of cases can assist the Field Ops team in repairing the issues on future dispatches, and it helps us understand which issues are causing us the most harm so we can try to solve them through long-term means.

Beginning Root Cause Analysis: Hardware Used in Site Enablement

RCA requires a basic understanding of how sites collect data and how parts commonly malfunction. Below is a list of the various parts of a metering setup, with a short description of each. These descriptions are by no means exhaustive and do not cover the specifics of every enablement, but they cover the setup of a site in the majority of circumstances.

There is a Metering Device, installed either by us or by the utility, that collects data about electrical usage directly from the site’s electrical input. When a DQ issue is with the meter itself, it is usually evidenced by incorrect readings on the DataStream. Drops, spikes, and zeros are usually caused by a malfunctioning meter. There are two devices used on the majority of enablements.

Veris Meter: Veris meters are EnerNOC metering devices, meaning that we oversee their installation and are responsible for their repair or replacement. Veris meters are attached to the mains entering a site and measure electricity as it passes ‘through’ them. The most common DQ issues caused by Veris meters are continuous under-reading of actual electricity usage (usually by about 2%), spikes, drops, and zeros.

Utility Pulse Meters: Utility Pulse Meters are used by the utility to collect their own data. They are connected to the rest of our equipment through a pulse board, which is set up by the utility to split the pulses they read from ours. Utility pulses are more reliable than Veris meters, but still have many similar issues (i.e. spikes, drops, and zeros). A DQ issue unique to Utility Pulse Meters is a Pulse Multiplier Issue. A Pulse Multiplier Issue is created when the Pulse Multiplier (the number that converts a pulse into a value in kWh) is incorrect.
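Since the Pulse Multiplier converts pulses into kWh, a rough sketch of that conversion can show how a wrong multiplier affects the readings. The function name, the pulse counts, and both multiplier values below are hypothetical and chosen only for illustration; they are not taken from any real site configuration.

```python
def pulses_to_kwh(pulse_counts, pulse_multiplier):
    """Convert per-interval pulse counts to kWh using a pulse multiplier."""
    return [count * pulse_multiplier for count in pulse_counts]


# Hypothetical pulse counts for four consecutive intervals.
pulses = [100, 120, 110, 90]

correct = pulses_to_kwh(pulses, 0.5)  # assumed correct multiplier
wrong = pulses_to_kwh(pulses, 5.0)    # assumed misconfigured multiplier

# With a wrong multiplier, every interval is off by the same factor.
ratios = [w / c for w, c in zip(wrong, correct)]
print(ratios)
```

In this simplified model the error is a constant scale factor across all intervals, which looks different from the isolated spikes, drops, and zeros described above.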

The metering device sends this data to a Site Server. A site server takes in this raw data and attempts to send it, over the internet, to our database. The site server is one of the largest sources of issues, but most site server issues are evidenced by data not reaching our database. Estimations and missing readings can be issues with the site server. In specific circumstances, the site server can also cause several miscellaneous issues, such as spike drops or negative spikes. There are five major site servers that we use.

Nexus: The Nexus was the first kind of site server used. Only a few are still in use, but there are many problems with the Nexus. Nexus site servers do not store data; they only forward it to the database. Because of this, Nexus servers are the ones most likely to have communication issues.

ILON E3 (also called ILON 100): The ILON E3 is the second kind of site server used by EnerNOC. Due to the inaccuracy of the clock on ILON E3s, they are known for having massive spike drops, which usually happen early on Sunday morning (due to the configuration of our platform). Unlike the Nexus, the ILON E3 stores historical data, but it does not store historical data concerning connectivity.

ILON E4: ILON E4s are the first site servers to use chat protocols to communicate. Unlike the previous two site servers, ILON E4s automatically check for timestamp issues, so smaller spike drops are expected. ILON E4s are also easier to communicate with remotely. PowerChat can be used to communicate with an E4 to recollect data, restart the device, and remotely gather information. E4s do have one unique issue, though: the “Negative Spike”. They are the only devices that cause it.

S1 and S2 Servers: S1 and S2 servers are both site servers designed by EnerNOC. S2s are slightly more reliable, but both function similarly to the E4s (using a program called ToeChat instead of the E4’s PowerChat). Unlike E4s, though, S1 and S2 servers continuously check their timestamp, so there should not be spike drops on these servers.

We usually try to connect the metering device to the site server directly, to reduce DQ errors and the amount of equipment needed at a site, but sometimes this is not possible. In that case, the metering device connects with the site server using a wireless transmitter. Wireless transmitters usually fail because of an interrupted signal, although they can cause bigger problems if they break or their batteries run out.

Wireless Transmitter is one of the Root Cause categories.

A Wireless Transmitter is usually one of two devices: 

Spinwave: A Spinwave sends a count of pulses (the signal from the metering device) to the site server. When a Spinwave does not connect, information is not sent until it reconnects. This leads to drop spikes: when the Spinwave eventually connects, it communicates to the site server the amount of electricity, in pulses, read by the meter since it last connected. The result is an interval with fewer readings than it should have had, followed by an interval with more readings.
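The drop-then-spike pattern can be sketched with a very simplified model of a Spinwave. This is only an illustration under stated assumptions: the function name, the pulse values, and the idea that a disconnected interval is recorded as zero are all hypothetical, not a description of the actual device firmware.

```python
def spinwave_readings(pulses_per_interval, connected):
    """Simulate the per-interval pulse readings seen by the site server.

    Assumption: while disconnected, nothing is delivered (recorded as 0),
    and on reconnection the whole backlog arrives in a single interval.
    """
    readings = []
    backlog = 0
    for pulses, is_connected in zip(pulses_per_interval, connected):
        backlog += pulses  # the meter keeps counting regardless of the link
        if is_connected:
            readings.append(backlog)  # backlog is delivered at once
            backlog = 0
        else:
            readings.append(0)  # nothing reaches the site server
    return readings


actual = [10, 10, 10, 10, 10]
connected = [True, False, False, True, True]
print(spinwave_readings(actual, connected))  # [10, 0, 0, 30, 10]
```

The total number of pulses is preserved (50 in both cases), but the low intervals are followed by one inflated interval.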

Modhopper: A Modhopper simply sends the pulses directly to the site server. Because of this, when the signal is not properly received, no data is collected. This is seen as gaps by the site server, and can lead to estimations after the data has been processed by the database.

General Description of Database

The final part of any metering setup is how the site server communicates with our database. The site server sends this information over the internet, and the complexity of that process leads to many problems, although they are all evidenced by estimated or non-existent readings. The connection is usually made in one of three ways.

LAN: LAN stands for Local Area Network. Whenever possible, a site server will be configured to connect with our database via the client’s internet connection. In this case, it is directly connected to the internet through the customer network, and problems with the customer network can cause the server to lose communications.

VPN: VPN stands for Virtual Private Network. If a client requests a VPN, the server is physically connected in the same way as with a LAN, but the information is more secure. For the purpose of Root Cause Analysis, there is little to no difference between a VPN and a LAN.

Wireless: If necessary, the site server will connect with our database using a wireless signal (over a cell phone network). In this case, a DQ issue can be caused by a bad cell signal, a malfunctioning modem device (the part that actually communicates with the cell towers), or a general issue with the cellular carrier’s network.

Completing RCA: Diagnosing a Root Cause

After assigning the case to our name and understanding which parts commonly malfunction during operation, we have to identify which malfunction is occurring. There are some typical examples that show malfunction issues at sites. Below is the information regarding the RCA process.

The Flow Chart

In a large portion of cases, the root cause of a case can be diagnosed from only one or two pieces of evidence. For these cases, the RCA flow chart can be used to diagnose the root cause. The flow chart can be found on the O drive here: O:\NetworkOperations\Data_Quality\Jake\Root Cause Flow Chart.pdf

An additional resource is the DQ Symptoms + Root Cause 2.0 spreadsheet, which provides a few examples of DQ issues. The document can be found here: O:\NetworkOperations\Data_Quality\Jake\DQ Symptoms + Root Cause 2.0.xlsx

However, in two cases (Zeros and Missing Data) there is no clear diagnosis. The text below details how to go about finding the root cause in these specific circumstances.

Invalid Zeros 

Invalid Zeros are usually caused by an improperly functioning meter, but they can be caused by a variety of different problems. When looking at a case for invalid zeros, you should first check the cases page. To navigate to the cases page, first go to the site page, which can be reached by clicking on the site name on the original case. Once on the site page (which will be your main resource for most of the Root Cause Analysis process), scroll down about ¾ of the page to find the Cases field. Click on the “Go to list (#)” link to get to the cases page.

Once you are on the Cases page, sort the cases by “Last Modified Date/Time”. This is done by clicking on that text near the top of the table. In the case of invalid zeros, look for a “Zeros – site name” case from the time period of the DQ error. Click on each of the Zeros cases from that time period (from the actual error to about 6 months after it), and open them in new windows.

Now that you have found the relevant zeros cases, see if they can help you with the root cause. The simplest way this can happen is if the zeros case itself has a root cause and root cause category. If so, that root cause is the root cause of your case. (Note that the root cause “No DQ Issue/Resolved on its own” is not used on DQ. If this is the root cause on the zeros case, do not use it.) If there is no root cause, read over the case resolution and comments and ask yourself some questions about the case:

Was any hardware replaced?

Was any device rebooted remotely?

Was there an issue with the power source at the site?

Zeros are most often caused by our metering devices, but they can be caused by a fault in any of the hardware at a site (whether it be the site server, the wireless transmitter, or an error with the customer’s power). As a rule, if you can identify any hardware that was replaced as a result of the case, or any hardware that caused the issue according to the comments, then that hardware is the root cause. It is possible that no root cause can be found from the zeros cases; if so, there is a good chance the issue is with the meter.

Case reason shows zeros for this site.
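The decision rule for invalid zeros can be summarized as a short sketch. The dictionary keys (`root_cause`, `hardware_at_fault`) and the fallback label are hypothetical names invented for this example; they do not correspond to actual case fields, and the order in which multiple cases are checked is an assumption.

```python
# Root cause value that the guide says should not be used on DQ cases.
UNUSED_ROOT_CAUSE = "No DQ Issue/Resolved on its own"


def diagnose_zeros(zeros_cases):
    """Pick a root cause from the related zeros cases (hypothetical fields)."""
    # 1. Prefer a usable root cause already recorded on a zeros case.
    for case in zeros_cases:
        root_cause = case.get("root_cause")
        if root_cause and root_cause != UNUSED_ROOT_CAUSE:
            return root_cause
    # 2. Otherwise use hardware that was replaced or blamed in the comments.
    for case in zeros_cases:
        hardware = case.get("hardware_at_fault")
        if hardware:
            return hardware
    # 3. If nothing is found, the meter is the most likely culprit.
    return "Meter (likely)"


example_cases = [
    {"root_cause": UNUSED_ROOT_CAUSE},
    {"root_cause": None, "hardware_at_fault": "Site Server"},
]
print(diagnose_zeros(example_cases))  # Site Server
```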

Description of Gaps, Estimations, and Missing Data

This might be a bit confusing, but Gaps, Excessive Estimations, and Missing Data are all very similar DQ problems. They are all caused by information not getting back to our database; sometimes the VEE program then makes estimations to fill the gaps and missing data.

When assessing one of these issues, first check whether the issue is caused by a Modhopper. When estimations are caused by a Modhopper, they are usually short and relatively frequent. These kinds of estimations are caused by an interruption in the signal between the Modhopper transmitting the pulse data and the Modhopper receiving it. If estimations only last for about five to twenty minutes, they might be caused by a Modhopper. If the estimations are short, the way to confirm that the Modhopper is the issue is to check whether the site has a Modhopper.

This is where it starts to get a little tricky, since Modhoppers are not always properly listed on ECRM. There are two places where a Modhopper might be listed. First, the Modhopper can be listed in the “EnerNOC Site Servers” field on ECRM, which is found on the site page directly above the meter field. It will usually list whether a Modhopper is on the site in the “Communications” or “Additional Notes” sections. If the Modhopper is not listed there, you can check whether the site survey indicated that a Modhopper was needed. To do this, look at the “Installation Notes” in the “Site Survey and Design” field, found on the site page. If evidence of a Modhopper is found and the estimations are short, then the root cause is “Wireless Transmitter/Interrupted Signal”.

Site Survey notes help us to know whether a Modhopper is installed at the site.
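The Modhopper check combines two pieces of evidence: short estimations and proof that a Modhopper is on site. A minimal sketch follows, assuming a 20-minute cutoff for "short" (taken loosely from the "about five to twenty minutes" guideline); the function name and the cutoff constant are hypothetical.

```python
# Assumed cutoff for a "short" estimation, in minutes.
SHORT_ESTIMATION_MAX_MINUTES = 20


def modhopper_root_cause(estimation_minutes, modhopper_evidence):
    """Return the root cause if the Modhopper pattern matches, else None.

    estimation_minutes: durations of the estimated periods, in minutes.
    modhopper_evidence: True if ECRM or the site survey shows a Modhopper.
    """
    if not estimation_minutes:
        return None  # nothing to classify
    all_short = all(
        minutes <= SHORT_ESTIMATION_MAX_MINUTES for minutes in estimation_minutes
    )
    if all_short and modhopper_evidence:
        return "Wireless Transmitter/Interrupted Signal"
    return None  # keep investigating with other evidence


print(modhopper_root_cause([5, 10, 15], True))
print(modhopper_root_cause([5, 180], True))
```

A result of `None` here only means that this particular rule did not apply, not that the Modhopper has been ruled out.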

ZEUS PAGE

If we still have not been able to identify the root cause, which may or may not be a “communication” problem at the site, there is one more thing we can try: the Zeus tool.

To check for a communication problem, go to the ZEUS page (http://encorpops07:8080/zeusweb/ZEUS.jsp) and filter on the correct site. Once you find the correct site, click on the “Site Server” link for that site. If the server is an E4, S1, or S2, this will bring up the server page.

Once you find the server page, scroll to the bottom and review the table labeled “Device Public IP”. When a server loses communications with our database, its IP address is listed as null. To confirm whether communications were lost, see if the IP address was null during the issue. If it was null, the root cause category is “communications”.

Null shows that there is a “communication” issue at the site.
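The null-IP check amounts to asking whether any entry in the “Device Public IP” table was null during the issue window. The sketch below assumes the table has been copied by hand into a list of (timestamp, IP) pairs, with `None` standing in for null; the function name, the data layout, and the sample values are hypothetical, and nothing here calls the Zeus tool itself.

```python
from datetime import datetime


def lost_communications(ip_history, issue_start, issue_end):
    """Return True if any IP entry was null (None) during the issue window."""
    return any(
        ip is None and issue_start <= seen_at <= issue_end
        for seen_at, ip in ip_history
    )


# Hypothetical rows copied from the "Device Public IP" table.
history = [
    (datetime(2019, 8, 1), "203.0.113.10"),
    (datetime(2019, 8, 2), None),
    (datetime(2019, 8, 3), "203.0.113.10"),
]

issue_start = datetime(2019, 8, 1, 12)
issue_end = datetime(2019, 8, 2, 12)
if lost_communications(history, issue_start, issue_end):
    print("Root cause category: communications")
```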