Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in...

13
Scalable Architecture for Anomaly Detection and Visualization in Power Generating Assets Paras Jain, Chirag Tailor, Sam Ford, Liexiao (Richard) Ding, Michael Phillips, Fang (Cherry) Liu, Nagi Gabraeel, Polo Chau 1/13

Transcript of Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in...

Page 1: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

Scalable Architecture for Anomaly Detection and Visualization in Power Generating Assets

Paras Jain, Chirag Tailor, Sam Ford, Liexiao (Richard) Ding, Michael Phillips, Fang (Cherry) Liu, Nagi Gabraeel, Polo Chau

1/13  

Page 2: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

Background § Each unit instrumented with 1000s

of sensors to signal incipient faults

§ Difficult for humans to monitor

§ Algorithms attempt to predict asset failure by detecting anomalies

Power generating assets such as jet engines and gas turbines

2/13  

Page 3: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

Key Challenges

v  Storage and Ingestion. Huge volume of data from many machines in real-time.

v  Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime and maintenance.

v  Visualization. Lack of an integrated visualization platform to understand and analyze flagged anomalies.

3/13  

Page 4: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

System Overview

1

2

3

4/13  

Page 5: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

1 - Scalable Data Ingestion & Storage Architecture

32  Node  

100  Units  x  1000  Sensors  

@  1HZ  

Goal  Inges<on  Rate:  100,000  sensor  readings  per  second  

5/13  

Page 6: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

1 – Simulated Training Dataset

*P.  Ratliff,  P.  GarbeF,  and  W.  Fischer,  “The  new  siemens  gas  turbine  sgt5-­‐8000h  for  more  customer  benefit,”    

v  100 units with 1000 sensors producing readings at 1Hz. o  Similar number of units owned by a regional energy provider. o  On the order with 3000 sensors in Siemens SGT5-8000H gas turbine*.

v Anomalous behavior modeled in dataset: o  Pure random noise. o  Pure random noise plus gradual degradation. o  Pure random noise plus sharp shift.

6/13  

Page 7: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

1 - Scalable Data Ingestion & Storage Results

Linear Scale Up Exceeds 100k samples/sec goal

100,000  

400,000  

300,000  

200,000  

0   10   20   30  

10  nodes  173k  readings/sec  

15  nodes  233k  readings/sec  

20  nodes  257k  readings/sec  

25  nodes  325k  readings/sec  

30  nodes  399k  readings/sec  

Throughput  (readings/sec)  

#  of  Nodes  

7/13  

Page 8: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

1 - Scalable Data Ingestion & Storage Results

20   40   60  

Inges;on  Dura;on  (seconds)  

Sensor  Readings  Ingested    (millions)  

Constant & Stable Ingestion

8  

4  

80   100  

16  

12  

20  

0  

8/13  

Page 9: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

1 - Interesting Findings

1.  Salting. HBase keys generated by OpenTSDB must be salted since continuous value timestamps all map to the same HBase node.

2.  Backpressure. HBase does not provide backpressure to OpenTSDB.

9/13  

Page 10: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

We use the False Discovery Rate (FDR) algorithm. 1.  First introduced by Benjamini and Hochberg in 1995 and used in

multiple inference clinical trials*.

2.  Suppresses false alarms: Performs a test on an increasing ratio of the original significance level for each sensor’s z-score.

3.  Scalable: our implementation using Spark processes over 939,000 sensor samples per second

2 - Flagging Anomalies with Low False Alarm Rates

*Y.  Benjamini  and  Y.  Hochberg,  “Controlling  the  false  discovery  rate:  a  prac<cal  and  powerful  approach  to  mul<ple  tes<ng,”    

10/13  

Page 11: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

3 - Anomaly Visualization

1 2

3

4

11  11/13  

Page 12: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

Ongoing Work

§  Scaling  up  inges<on  and  analysis  throughput  with  addi<onal  nodes.    

§  Migrate  anomaly  detec<on  algorithm  to  Spark  Streaming  for  online  evalua<on.  

§  Evaluate  our  system  with  domain  users  and  industry  partners  like  General  Electric  (GE).  

12/13  

Page 13: Scalable Architecture for Anomaly Detection and ...Huge volume of data from many machines in real-time. " Anomaly Detection. Prevalence of false alarms leads to unnecessary downtime

Scalable Architecture for Anomaly Detection and Visualization in Power Generating Assets

Paras Jain, Chirag Tailor, Sam Ford, Liexiao (Richard) Ding, Michael Phillips, Fang (Cherry) Liu, Nagi Gabraeel, Polo Chau

13