Planes, Trains, and Automobiles: A Data Scientist’s Guide to Modeling Engine Degradation

Post on 13-Feb-2017

© 2016 Pivotal Software, Inc. All rights reserved.

Planes, Trains, and Automobiles

A Data Scientist’s Guide to Modeling Engine Degradation

April Song @aprilsongg Sarah Aerni @itweetsarah

•  Gene Sequencing: cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
•  Smart Grids: reading smart meters every 15 minutes is 3000x more data intensive
•  Stock Market
•  Social Media: Facebook uploads 250 million photos each day
•  Oil Exploration: oil rigs generate 25,000 data points per second
•  Video Surveillance
•  Medical Imaging
•  Mobile Sensors

All industries need technology to process and store data

How can connected devices in our home be smart enough to make daily life easier?

How can we know a tree has fallen on a power line before the residents complain?

How can we use data to prevent airplane accidents?

Aerospace Industry is Embracing IoT

•  Engines are being fitted with more and more sensors
•  Aircraft data networks are improving data transfer speeds
•  Real-time analytics is improving efficiency and performance

Pratt & Whitney’s Geared Turbofan Engine

•  5,000 sensors
•  10 GB data per second
•  12 hours of flight = 844 TB data

WHY IS THIS A DATA SCIENCE PROBLEM?

How does this…

…become this?

By recognizing this

HOW CAN IT SOLVE JET ENGINE CHALLENGES?

But what can we do with this much data?

•  Predict thrust demands of an engine: reduction in fuel consumption
•  Monitor engine health and degradation: reduced maintenance costs with increased performance, efficiency, and engine lifetime
•  Detect faults and anomalies during a flight: prevention of equipment failures and accidents

What We Will Cover Today

•  Jet Engine Sensor Data
•  Enabling Technologies for Data Science
•  Building Models on Large-Scale Datasets
–  Detecting Engine “end-of-life” Signal via Clustering
–  Tracing Engine Health Degradation using Classification

Commercial Modular Aero-Propulsion System Simulation

Introduction to C-MAPSS

•  C-MAPSS is a MATLAB program that simulates a large, high-bypass commercial turbofan engine capable of ~90k lbs of thrust
–  GUI allows point-and-click operation of engine models
–  Simulates deterioration and faults

[Figure: Simplified diagram of 90k engine]

Overview of Flights

•  6,875 flights
–  5,244 flights from nominal engines
–  1,631 flights from fault engines
•  Flight lengths range from 74 to 85 minutes
•  Average length of flight is ~80 minutes

[Figure: histogram of flight lengths; x-axis: Length of Flight (Seconds), y-axis: # of Flights]

Flight Parameters

Parameter   Description                           Units

Flight Conditions
time        Flight time                           sec
alt         Altitude                              ft
MN          Mach number                           pct
TRA         Throttle resolver angle               deg
Wf          Fuel flow                             pps
Fn          Net thrust                            lbf

Measurement Temperatures
T48         Total temperature at HPT outlet       R
T2          Total temperature at fan inlet        R
T24         Total temperature at LPC outlet       R
T30         Total temperature at HPC outlet       R
T50         Total temperature at LPT outlet       R

Other Measurements
Nf          Physical fan speed                    rpm
Nc          Physical core speed                   rpm
epr         Engine pressure ratio (P50/P2)        --
phi         Ratio of fuel flow to Ps30            pps/psi
Ps30        Static pressure at HPC outlet         psia
NfR         Corrected fan speed                   rpm
NcR         Corrected core speed                  rpm
BPR         Bypass ratio                          --
farB        Burner fuel-air ratio                 --
htBleed     Bleed enthalpy                        --
PCNfRdmd    Percent corrected fan speed           pct
W31         HPT coolant bleed                     lbm/s
W32         LPT coolant bleed                     lbm/s

Health Indicators
SmHPC       HPC stall margin                      --
SmLPC       LPC stall margin                      --
SmFan       Fan stall margin                      --

Pressure Measurements
P2          Pressure at fan inlet                 psia
P15         Total pressure in bypass-duct         psia
P30         Total pressure at HPC outlet          psia

LARGE DATASETS REQUIRE NEW TECHNOLOGIES

At-Scale Modeling

Need for new environments to process big data?

HDFS STORAGE AND MPP ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT

VARIETY/VELOCITY

DISTRIBUTED COMPUTATION FOR PARALLELIZATION

PETABYTES OF DATA

OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK TO ACCESS COMMON LANGUAGES

RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND TOOLS

SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP

MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE

FLEXIBLE, SCALABLE, ENABLING, ACCESSIBLE

A single address for everything analytics

Analytics with Pivotal

Time-to-Insights

FORECASTING, CLUSTERING, REGRESSION, CLASSIFICATION, OPTIMIZATION

Pivotal Greenplum MPP DB

•  Think of it as multiple PostgreSQL servers
•  Rows are distributed across segments by a particular field (or randomly)

[Figure: architecture diagram with labels Master and Segments/Workers]
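
To make the row-distribution idea concrete, here is a minimal Python sketch of hashing a distribution key to pick a segment. It is only an illustration: `assign_segment`, `distribute`, the `flight_id` key, and the use of CRC32 are assumptions for this example and not Greenplum's internal hash. In Greenplum itself the key is declared with the `DISTRIBUTED BY` clause of `CREATE TABLE` (or `DISTRIBUTED RANDOMLY`).

```python
import zlib
from collections import defaultdict


def assign_segment(key, num_segments):
    """Map a distribution-key value to a segment index with a stable hash."""
    # CRC32 is used only because it is deterministic across runs; it is a
    # stand-in, not the hash function Greenplum uses internally.
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Group rows by the segment that their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[assign_segment(row[key_field], num_segments)].append(row)
    return segments


# Hypothetical sensor rows: 8 flights with 3 samples each.
rows = [{"flight_id": f, "time": t, "alt": 1000 * t}
        for f in range(1, 9) for t in range(3)]
segments = distribute(rows, "flight_id", num_segments=4)

for seg_id in sorted(segments):
    flights = sorted({r["flight_id"] for r in segments[seg_id]})
    print(f"segment {seg_id}: flights {flights}")
```

Because every row of a flight hashes to the same segment, per-flight computations can run on each segment without moving data between hosts.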

Greenplum Database Features for Data Scientists

•  Window functions: perform calculations across a set of table rows that are related to the current row
•  Analytics extensions: in-database machine learning at scale using MADlib
•  Procedural language extensions: extended functionality using non-SQL programming languages and packages (e.g. Python and R)
•  Client access: ODBC and JDBC access to support connections to third-party tools

* Only a subset of Greenplum Database features
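
As a rough illustration of what a window function computes, the Python sketch below mimics a three-row moving average per flight, roughly what `avg(alt) OVER (PARTITION BY flight_id ORDER BY time ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` would return in SQL. The helper name `moving_average` and the sample rows are assumptions made for this example.

```python
from collections import defaultdict, deque


def moving_average(rows, partition_key, order_key, value_key, window=3):
    """Trailing moving average computed separately within each partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    out = []
    for key in sorted(partitions):
        recent = deque(maxlen=window)  # holds the current row and up to window-1 preceding rows
        for row in sorted(partitions[key], key=lambda r: r[order_key]):
            recent.append(row[value_key])
            out.append({**row, "moving_avg": sum(recent) / len(recent)})
    return out


# Hypothetical altitude samples for two flights.
rows = [
    {"flight_id": 1, "time": 0, "alt": 0.0},
    {"flight_id": 1, "time": 1, "alt": 300.0},
    {"flight_id": 1, "time": 2, "alt": 600.0},
    {"flight_id": 1, "time": 3, "alt": 900.0},
    {"flight_id": 2, "time": 0, "alt": 100.0},
    {"flight_id": 2, "time": 1, "alt": 500.0},
]
result = moving_average(rows, "flight_id", "time", "alt")
for row in result:
    print(row)
```

Each output row keeps its original columns, which is the key difference from a plain GROUP BY aggregate.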

MADlib: Scalable, In-database ML

•  Open source: https://github.com/madlib/madlib
•  Works on Greenplum DB, HAWQ and PostgreSQL
•  In active development by Pivotal
•  Downloads and docs: http://madlib.net/

Data Parallelism through PL/X

•  For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R or C/C++
•  The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment

[Figure: MPP architecture. A Master Host (with Standby Master) receives SQL and communicates over the Interconnect with four Segment Hosts, each running two Segments]

User Defined Functions

    -- SQL wrapper
    CREATE FUNCTION pymax (a integer, b integer)
      RETURNS integer
    AS $$
      # Source language code
      if a > b:
        return a
      return b
    $$ LANGUAGE plpythonu;  -- Source language declaration

What does a typical flight look like?

[Figure: Altitude over time for some example flights]

•  A flight consists of a series of ascents, cruises, and descents
•  The average cruise at 35,000 ft lasts ~21 minutes
–  Engine health is calculated from a snapshot of parameters during this cruise

Time Series: Pressure Parameters

•  P2, P15, and P30 appear to be positively correlated except during the middle cruise
–  Correlation may differ depending on regime

P2 = Pressure at fan inlet
P15 = Total pressure in bypass-duct
P30 = Total pressure at HPC outlet

Life of a Nominal Engine

•  Engine health is modeled to degrade exponentially over time
•  5,244 flights from 25 nominal engines
•  Median number of flights for a nominal engine is 201
•  Median health score of nominal engines across all flights is ~0.81

Opportunity for Clustering of Engines

•  Nominal engines seem to degrade in at least 4 different ways
–  Cluster engines based on degradation trend
–  Caveat: small sample size (35 engines)
•  Additional modeling opportunity:
–  Predict engine health score

Life of a Fault Engine

•  A significant drop in engine health is apparent after a fault flight
•  1,631 flights from 10 fault engines
•  Median number of flights for a fault engine is 137
•  Median health score of fault engines across all flights is ~0.72

[Figure annotation: Fault Flight]

What happens when there is a fault?

[Figure: Engine Pressure Ratio (EPR) for flight 32-15, a flight with a fan fault]

•  At first glance, a fault’s effects are not noticeable
–  Need to zoom in to see the effects of a fault

Feature Engineering: Transforming Timeseries

•  Many modeling approaches require feature extraction
–  Clustering of engines
–  Regression to reverse-engineer engine health

Engineering Features From Time Series

•  Goal: Represent time series data as variables
•  Approach:
1.  Identify the different phases of the flight: takeoff, climbs, cruises, descents, landing
2.  For each phase and parameter, calculate summary statistics: mean, min, max, stddev, max – min, median
3.  Calculate summary stats on the rate of change for features

Example statistics for two phases:
mean: 13,674   stddev: 0       max: 13,674   min: 13,674   max-min: 0        median: 13,674
mean: 33,596   stddev: 5,732   max: 45,575   min: 25,959   max-min: 19,616   median: 32,556
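
The approach above can be sketched in a few lines of Python. This is a simplified illustration, not the code used in the talk: the phase rule (a fixed altitude-change tolerance `tol`), the three phase labels, and the helper names `label_phases`, `summarize`, and `phase_features` are all assumptions for this example.

```python
import statistics
from itertools import groupby


def label_phases(altitudes, tol=50.0):
    """Label each sample as climb, descent, or cruise from its altitude change."""
    labels = []
    for i in range(len(altitudes)):
        # Backward difference, except for the first sample, which looks ahead.
        j = i if i > 0 else 1
        delta = altitudes[j] - altitudes[j - 1]
        if delta > tol:
            labels.append("climb")
        elif delta < -tol:
            labels.append("descent")
        else:
            labels.append("cruise")
    return labels


def summarize(values):
    """Summary statistics listed on the slide for one phase of one parameter."""
    return {
        "mean": statistics.mean(values),
        "stddev": statistics.pstdev(values),
        "max": max(values),
        "min": min(values),
        "max-min": max(values) - min(values),
        "median": statistics.median(values),
    }


def phase_features(altitudes, values, tol=50.0):
    """Summarize `values` over each consecutive run of the same flight phase."""
    features, start = [], 0
    for phase, run in groupby(label_phases(altitudes, tol)):
        length = len(list(run))
        features.append({"phase": phase, **summarize(values[start:start + length])})
        start += length
    return features


# Hypothetical altitude trace (ft): climb, short cruise, descent.
altitudes = [0, 1000, 2000, 3000, 3000, 3000, 3000, 2000, 1000, 0]
features = phase_features(altitudes, altitudes)
for row in features:
    print(row)
```

For step 3, the same `summarize` helper could be applied to successive differences of a parameter to capture its rate of change.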

Calculating Correlations between Sensors

•  How correlated are two sensors?
•  Are correlations between the sensors different from flight to flight?
•  Approach:
–  1) Calculate correlations over the entire flight data set and observe trends
–  2) Calculate correlations over each flight and observe trends
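
A minimal Python sketch of the pairwise calculation follows, using the Pearson coefficient. The column values are synthetic and the helper names `pearson` and `pairwise_correlations` are assumptions for this example. Note that n parameters yield n(n-1)/2 unique pairs, so the 30 parameters in the flight parameter table give 435 pairs.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(columns):
    """Correlation for every unique pair of named columns."""
    return {(a, b): pearson(columns[a], columns[b])
            for a, b in combinations(sorted(columns), 2)}


# Synthetic samples, for illustration only.
columns = {
    "alt": [0, 10000, 20000, 30000, 35000],
    "p2": [14.7, 10.1, 6.8, 4.4, 3.5],
    "nc": [8000, 8600, 8900, 9100, 9150],
}
corrs = pairwise_correlations(columns)
for pair, r in corrs.items():
    print(pair, round(r, 3))

print(math.comb(30, 2))  # 435 unique pairs for 30 parameters
```

Running `pairwise_correlations` once on the full data set and once per flight corresponds to the two steps of the approach.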

Sensor Parameter Correlations

•  Correlations calculated on entire flight data set
•  435 total unique parameter pairs
–  162 pairs are strongly positively correlated (> 0.8)
–  45 pairs are strongly negatively correlated (< -0.8)
–  228 pairs are weakly correlated

[Figure: histogram of pairwise correlations; x-axis: Correlation, y-axis: # of Measurement Pairs]

Top Correlated Parameter Pairs

Negatively Correlated

Parameter 1   Parameter 2   Correlation
p2            alt           -0.985
t2            alt           -0.974
p15           alt           -0.972
w31           alt           -0.931
w32           alt           -0.931

p2: pressure at fan inlet
t2: total temp at fan inlet
p15: total pressure in bypass-duct
w31: HPT coolant bleed
w32: LPT coolant bleed

Positively Correlated

Parameter 1   Parameter 2   Correlation
nc            htbleed       0.999
t30           nc            0.999
t30           htbleed       0.999
ps30          p30           0.999
w31           w32           0.999

nc: physical core speed
htbleed: bleed enthalpy
t30: total temperature at HPC outlet
ps30: static pressure at HPC outlet

Top Negatively Correlated Sensors

p2: pressure at fan inlet
t2: total temp at fan inlet
p15: total pressure in bypass-duct
w31: HPT coolant bleed
w32: LPT coolant bleed

•  Potential Analysis: Calculating correlations at a regime level may reveal anomalies

Top Positively Correlated Sensors

nc: physical core speed
htbleed: bleed enthalpy
t30: total temperature at HPC outlet
ps30: static pressure at HPC outlet

•  Potential Analysis: Calculating correlations at a regime level may reveal anomalies

Correlation Between Altitude and P2

[Figure: correlation between altitude and P2 for each flight; x-axis: Flight ID]

Clustering Flights

Insights on engine degradation and end of life

K-Means Clustering Algorithm

Objective: Group flights based on their parameter time series

•  Time series for single sensor data
•  Extract summary statistics for all phases
•  Feature reduction using VIF
•  Cluster using K-means algorithm in MADlib with summary statistics as feature vector

Source: http://www.naftaliharris.com/

Repeat process for 29 parameters
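
For the VIF step, a minimal Python sketch is shown below. The variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features; equivalently, it is the j-th diagonal entry of the inverse correlation matrix, which is what this sketch computes. The helper names and sample columns are assumptions for this example, and the slides do not state which VIF cutoff was applied.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length feature columns."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]


def invert(matrix):
    """Gauss-Jordan matrix inverse with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: features are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per feature: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical features: x2 is nearly 2 * x1, x3 is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
vifs = vif([x1, x2, x3])
print([round(v, 1) for v in vifs])
```

Features with a very large VIF are nearly redundant with the others, so dropping one of each redundant group shrinks the feature vector before clustering.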

Flights in Cluster 4 Indicate Engine’s End of Life

[Figure: clustering results on SmFan time series features]
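
The talk runs K-means inside MADlib; the sketch below is a plain-Python stand-in for the same idea (Lloyd's iterations: assign each point to its nearest centroid, then recompute centroids). The farthest-point initialization, the function names, and the two-dimensional feature vectors are assumptions chosen to keep the example deterministic, not MADlib's behavior.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: mean of the members of each cluster.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return labels, centroids


# Hypothetical per-flight feature vectors, e.g. two summary statistics each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

In the setting of the slides, each point would be one flight's vector of summary statistics for a parameter, and a cluster dominated by late-life flights would play the role of Cluster 4.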

Classification-Based Similarity Metric

Understanding similarities between flights

Classification-Based Distance Metric

•  Binary classification methods build models to differentiate between two groups using available attributes
–  Algorithms allow us to use the optimal subset of attributes to differentiate classes (feature selection)
–  Ability to differentiate becomes a proxy for dissimilarity

[Figure: three example classes (Class 1, Class 2, Class 3)]
–  Classes differentiated by size and color. Model accuracy HIGH: able to predict class
–  These classes are indistinguishable. Model accuracy LOW: unable to predict classes
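
A minimal Python sketch of accuracy-as-dissimilarity follows, using a small logistic regression trained by gradient descent. The held-out split (even-indexed windows for training, odd-indexed for scoring), the feature centering, the hyperparameters, and the synthetic windows are all assumptions for this example rather than details taken from the talk.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


def dissimilarity(flight_a, flight_b):
    """Held-out accuracy of a classifier that tries to tell two flights apart.
    Near 0.5 means the flights look alike; near 1.0 means they differ."""
    train_X = flight_a[0::2] + flight_b[0::2]
    train_y = [0] * len(flight_a[0::2]) + [1] * len(flight_b[0::2])
    test_X = flight_a[1::2] + flight_b[1::2]
    test_y = [0] * len(flight_a[1::2]) + [1] * len(flight_b[1::2])
    # Center features on the training mean so the decision boundary starts mid-way.
    means = [sum(col) / len(train_X) for col in zip(*train_X)]
    center = lambda x: [v - m for v, m in zip(x, means)]
    w, b = train_logistic([center(x) for x in train_X], train_y)
    hits = sum(predict(w, b, center(x)) == t for x, t in zip(test_X, test_y))
    return hits / len(test_X)


# Synthetic window features: `late` is shifted relative to `early`.
early = [[math.sin(i), math.cos(i)] for i in range(40)]
late = [[5 + math.sin(i), 5 + math.cos(i)] for i in range(40)]
same = [list(window) for window in early]

print(dissimilarity(early, late))  # clearly separable flights
print(dissimilarity(early, same))  # indistinguishable flights
```

With identical flights, every held-out window appears once under each label, so the classifier can be right on only half of them.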

Classification-Based Flight Similarity Metric

•  For a given pre-takeoff phase
–  Create a non-overlapping set of all 5-second windows
–  Extract features
▪  Summary statistics (402) for each parameter in the time-window
▪  Correlations between all pairs of parameters in the time-window, used for propulsion data only

[Figure: for Engine 1, a classifier is trained for each pair of flights (Flight 1, Flight 2; Flight 1, Flight 3; ... Flight m, Flight n), each yielding a Classification Accuracy Score]
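
The windowing step can be sketched in Python as follows. The sketch assumes one sample per second, so a 5-second window holds five rows; that sampling rate, the helper names, and the sample values are assumptions for this example. Only summary statistics are shown; per-window correlations between parameter pairs could be added in the same loop.

```python
import statistics


def non_overlapping_windows(samples, window_size):
    """Split samples into consecutive windows, dropping a trailing partial window."""
    usable = len(samples) - len(samples) % window_size
    return [samples[i:i + window_size] for i in range(0, usable, window_size)]


def window_features(window, params):
    """Summary statistics of each parameter within one window."""
    features = {}
    for p in params:
        values = [row[p] for row in window]
        features[f"{p}_mean"] = statistics.mean(values)
        features[f"{p}_stddev"] = statistics.pstdev(values)
        features[f"{p}_min"] = min(values)
        features[f"{p}_max"] = max(values)
    return features


# Hypothetical 12 seconds of data for two parameters.
samples = [{"time": t, "Nf": 2000 + 10 * t, "T48": 1500.0} for t in range(12)]
windows = non_overlapping_windows(samples, window_size=5)
features = [window_features(w, ["Nf", "T48"]) for w in windows]
for row in features:
    print(row)
```

Each window becomes one training example for the pairwise flight classifier, labeled by the flight it came from.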

Expected Results

•  745,281 total models built
–  For each flight, a classifier is trained against each other flight from the same engine
–  Modeling run-time ~11 min on 128-segment cluster
•  As engines begin to degrade, adjacent flights should be similar (low accuracy)
•  Model accuracy HIGH: able to distinguish flights that occur after degradation
•  Model accuracy LOW: unable to distinguish adjacent flights (little difference)

[Figure: Model Accuracy vs. Flight number, relative to the REFERENCE FLIGHT]

Engine 1 results

•  Model accuracy HIGH: able to distinguish flights that occur before and after degradation
•  Model accuracy LOW: unable to distinguish adjacent flights (little difference)

[Figure: Engine 1 model accuracy relative to the REFERENCE FLIGHT]

Logistic Regression Results

•  Earlier flights are more similar to each other
•  Earlier flights are more dissimilar to later flights
•  Flights up to the 50th are similar to each other
•  Flights after the 50th are only similar to neighboring flights but start to differ from earlier flights
•  Indicates change/degradation over time

[Figure legend: Similar to Dissimilar]

Examining Engine Degradation Over Time

•  Summary statistics over flights provide insights into degradation patterns
–  Median/mean accuracies over PRECEDING flights indicate how much degradation has occurred since the engine start
–  Observations over adjacent windows may be of interest
–  Detecting anomalies

[Figure: Model Accuracy vs. Flight number for two choices of REFERENCE FLIGHT]
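
As a small illustration of the summary over preceding flights, the Python sketch below takes a matrix of pairwise classification accuracies for one engine, ordered by flight, and returns each flight's median accuracy against all earlier flights. The matrix values and the function name `median_accuracy_to_prior` are hypothetical.

```python
import statistics


def median_accuracy_to_prior(acc):
    """For flight i, the median of acc[i][j] over all earlier flights j < i.
    The first flight has no predecessors, so its entry is None."""
    result = []
    for i in range(len(acc)):
        prior = [acc[i][j] for j in range(i)]
        result.append(statistics.median(prior) if prior else None)
    return result


# Hypothetical accuracy matrix: flight 3 is easy to tell apart from the rest.
acc = [
    [0.50, 0.55, 0.60, 0.90],
    [0.55, 0.50, 0.58, 0.88],
    [0.60, 0.58, 0.50, 0.85],
    [0.90, 0.88, 0.85, 0.50],
]
medians = median_accuracy_to_prior(acc)
print(medians)
```

A rising sequence of medians suggests that later flights are increasingly easy to distinguish from the engine's earlier behavior.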

Engine Health and Engine Classification-based Similarity

•  Median accuracy score of a flight to prior flights increases as engine health decreases
•  Abrupt changes in engine health can be found using future flights (to find an inflection)

Accuracy Scores Show both Time and Degradation

•  With many more flights median accuracy increases
•  Degradation in engine causes median accuracy to drop faster

Example of Fault: engine 32

•  Flight before fault occurs
•  Average score of flights before the fault flight is slightly higher
•  Flight after fault: more flights with score > 0.8

HPT Fault

Classification-Based Similarity Changes at Faults

LPC Fault – low engine health change still detected

Classification-Based Similarity Changes at Faults

Engine Health and Median Accuracy Correlations

HPT fault flights

[Figure: engine health and median accuracy correlations by fault type; panels: fan, hpc, hpt, lpc, lpt]

What Did We Learn? What is next?

•  Through technology, data exploration and feature generation become easier
–  What we learned: rapidly transforming large volumes of sensor data
–  What’s next: time series analysis, interpolation on missing data
•  Experimentation with building models to predict engine decay and faults
–  What we learned: unsupervised techniques for clustering and distance metrics enable us to discover signals of decay
–  What’s next: supervised approaches to detect known faults

Opportunities in the Digital Brain

CONNECTED CARS

PERSONALIZED MEDICINE

SMART METERS

SECURITY

PREDICTIVE MAINTENANCE

SPORT TRACKING

OPTIMIZATION AND EFFICIENCY

Appendix

•  Propulsion dataset can be downloaded at: https://c3.nasa.gov/dashlink/resources/140/