Data Mining for Aviaon Safety · 2013-08-13 · Machine w/NLP Decision Tree w/NLP Random Forest w/...

Post on 11-Aug-2020

2 views 0 download

Transcript of Data Mining for Aviaon Safety · 2013-08-13 · Machine w/NLP Decision Tree w/NLP Random Forest w/...

DataMiningforAvia.onSafety

NikunjC.Oza,Ph.D.NASAAmesResearchCenterNikunj.C.Oza@nasa.gov

hBp://..arc.nasa.gov/people/oza

AmericanSta.s.calAssocia.on,SanFranciscoBayAreaChapter,October21,2010

Outline

•  Thebreadth/depthoftheproblems•  Combiningdiscreteandcon.nuoussequences:Mul.pleKernelAnomalyDetec.on(MKAD)Derivedfromanomalydetec.onmethodsondiscreteandcon.nuoussequences.

•  TextMining:classifica.on,topicmodeling

•  Ongoing,futurework

AccidentData

Opera.onalData(DiscreteandCon.nuousData)

Opera.onalSurveys(Text)

AnecdotalReports

Projec.ons

Forensics

WhatHappened?

DiscoveringCausalFactors

WhydiditHappen?

SUBJEC

TIVE

‐‐datacon.

nuum

‐‐‐OBJEC

TIVE

Predic.ons

WhatwillHappenNext?

Avia.onSafetyMapping

FromIrvingStatler,Avia.onSafetyMonitoringandModelingProject

DataArisesfromMoleculartoGlobalScales

VehicleTechnology Opera.ons

Molecules

MaterialsSensors So/ware Engines Aircra/

10‐2

100 101 102 10310‐1

10‐3

104 105 10610‐4

10‐5

10‐6

DataMininginSupportofGlobalOpera.ons

105

DatasharingJustculture/safetyculture

10‐2

100 101 102 10310‐1

10‐3

104 105 10610‐4

10‐5

10‐6

• 6

DataMining•  IVHM/SSATprojectgoalsrequireleveragingsubstan.aldata.– Aircrab‐produced:Sensordata,flight‐relateddata(e.g.,origin,des.na.on),coveringmanyflightsovermanyyears.

– Other:Safetyreports,trajectories•  Transformdataintousefulknowledge

– ToolsforDetec.on,Diagnosis,Prognosis,Mi.ga.on

– Levelsrangingfromflight‐leveltona.onalairspacelevel

Outline

•  Thebreadth/depthoftheproblems•  Combiningdiscreteandcon.nuoussequences:Mul.pleKernelAnomalyDetec.on(MKAD)Derivedfromanomalydetec.onmethodsondiscreteandcon.nuoussequences.

•  TextMining:classifica.on,topicmodeling

•  Ongoing,futurework

•  Needtomodelthebehaviorofdiscretesensorsandswitchesinanaircrabduringflight.

•  Focusisonprimarysensorsthatrecordpilotac.ons.

•  Theaimistodiscoveratypicalbehaviorthathaspossibleopera.onalsignificance.

•  WedevelopedsequenceMiner:Eachflightisanalyzedasasequenceofevents,accoun.ngfor

•  orderinwhichswitcheschangevalues•  frequencyofoccurrenceofswitches

•  TwoTasks:

•  Givenagroupofflights,findflightsthatareanomalous.

•  Givenananomalousflight,describetheanomaliesandthedegreeofanomalousness.

•  Methodbasedontechniquesusedinbioinforma.cs.

C=

O=

S1= S2=

S3=

–  Sequences Si in a cluster is dependent on a prototype C

–  The outlier O is dependent on the Si

–  Therefore:

–  This forms the basis of our objective function F that allows us to describe why sequences are anomalous.

P(O|C)=ΣiP(O|Si)xP(Si|C)

A B _ _ E _ A B C _ E F A B C _ _ F A _ C D E F

Missinga“B” Hasextra“D”

{ClusterMembers

OutlierSequence}

Alignment

•  P(O|Si) is proportional to the normalized length of the longest common subsequence of O and Si •  Discovery of missing and extra symbols is done by aligning the

sequences •  Missing and extra symbols are those whose addition to or deletion from an

outlier sequence O would make O more normal with respect to the objective function F

SequenceMiner

NormalizedLongestCommonSubsequence(NLCS)

wherethefunc.onsandcalculatethelongestcommonsubsequenceandlengthofthesequences.€

L h si,s j( )( )L si( ) × L s j( )

SME opinion: Possible evidence of mode confusion.

MKADforFleetwideanalysis

How to integrate all information in a concise and intuitive manner?

Compression, Feature extraction,

Fusion, Anomaly detection

Sequences D and continuous data streams C interactions

MKADFramework

Kernelonmeasurements

Preprocessingdiscretedata

Preprocessingcon.nuousdata

Model

SequenceKernelonswitching

SAXrepresenta.on

Compression Fusion Detection

0 20 40 60 80 100 120

CC

baabccbc

Firstconvertthe.meseriestoPiecewiseAggregateApproxima.on(PAA)representa.on,thenconverttosymbols

Ittakeslinear.me

SlidecourtesyofEamonnKeoghandJessicaLin

Op.miza.onproblem

Discretekernel Con.nuouskernelIntheobjec.vefunc.on,eachentryofthediscretekernelandthecon.nuouskernelrepresentsthescoreobtainedusinglongestcommonsubsequence(LCS)ofdiscretesignalsandSAXifiedcon.nuoussignals,respec.vely.

Unique Linear Decision Boundary

PairwiseSimilarityMeasure

Kernelondiscrete:NormalizedLongestCommonSubsequence(NLCS)

Kerneloncon.nuous:Inverselypropor.onaltodistancebetweenSAXrepresenta.onsofsequences

GeneralCase,Mul.pleKernels

minQ =12

α iα j βλKλ xi,x j( )λ

i, j∑

Subjectto:

OneclassSVMstrainingalgorithmsrequiresolvingthequadra.cproblem

Dual form

Linear equality constraint

Bounds on design variables

Control parameter

:Lagrangemul.pliersoftheprimalQPproblem

βλλ

∑ =1Also:

Anomalyscores

Datapoints with will be the support vectors

Magnitude of h: degree of normality/anomalousness

Sign of h: if negative – outlier if positive - normal

Indicator

Decisionboundaryisdeterminedonlybymarginandnon‐marginsupportvectorsobtainedbysolvingtheQPproblem

Synthe.cExperimentSimulation data

Type 1 – (Missing event) Flaps were not extended to normal full deployment at landing.

Type 2 - (Extra event) Landing gear was retracted after being deployed on final approach.

Type 3 – (Out of order event) Gear deployed before initial flaps below flaps limit.

Type 4 – (Continuous anomaly) High bank angles or rate of descent below 1,000 ft.

Casestudy:FOQAanomalydetec.on

•  Thetradi.onmethodscannotdetectandmonitortheseanomalousac.vi.esthatmayhaveoccurredsimultaneouslyandareheterogeneousinnature.

Touchdownpoint

MKADSummary

1.  Support flights safety experts 2.  Schedule maintenance

Application

.… anomaly detection on multivariate mixed attributes where discrete may influences the system dynamics which is reflected on the continuous data streams.

Performs

.. High detection rate on most operationally significant anomalies in fleet wide analysis on large datasets

.. Discover some “unknown unknowns”

Highlights

Outline

•  Combiningdiscreteandcon.nuoussequences:Mul.pleKernelAnomalyDetec.on(MKAD)– Derivedfromanomalydetec.onmethodsondiscreteandcon.nuoussequences.

•  TextMining:classifica.on,topicmodeling

•  Ongoing,futurework

Avia.onSafetyTextReports

•  Pilotsareencouragedtoreportincidents,concerns,orunsafecondi.ons.Tworepositories:–  TheNASA‐FAAAvia.onSafetyRepor.ngSystem(ASRS).

–  EachindividualairlinehasanAvia.onSafetyAc.onProgram(ASAP).

•  Reportscanbeanalyzedtoimproveavia.onsafety.

•  Reportsneedtobecorrectlyandconsistentlycategorizedbyeventtypetodeterminedangeroussitua.onsandtracktrendsofincidentsandevents.

•  Eventtypesarenotenoughtocaptureallanomalies.Topiciden.fica.onneededtoiden.fyneweventtypes,combina.onsofeventtypes.

ManuallyClassifyingReports•  Reviewandanalysisofreportsislaborintensive.

–  InASRS,reviewershaveclassifiedabout135,000ASRSreportsof715,000totalreportssubmiBedinto60overlappinganomalycategories.

–  Thelengthofthereportsrangefrom0.5‐4pageslong.

–  ASRSreportsaccrueatabout3,000permonth.

–  Classifica.onsnotalwaysconsistentduetomul.plereviewers,changingexperiences.

•  Historicalreportsmustberereadwhennewcategoriesareaddedorchanged.

•  Currentsystemsrelyheavilyonhumanmemoriesforhistoricalperspec.ves.

ASRSReportExcerpt

JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT IN FRONT OF US. BOTH THE COPLT AND I, HOWEVER, UNDERSTOOD TWR TO SAY, 'CLRED TO LAND, ACFT ON THE RWY.' SINCE THE ACFT IN FRONT OF US WAS CLR OF THE RWY AND WE BOTH MISUNDERSTOOD TWR'S RADIO CALL AND CONSIDERED IT AN ADVISORY, WE LANDED...

Note:Industryspecificvocabularyandabbrevia.ons.

Automa.cCategoriza.onofASRSReports

Non Adherence to ATC Clearance

Critical Equipment Problem

Runway Incursion

Landing without a Clearance

Air Space Violation

Altitude Deviation Overshoot

Fumes

Altitude Deviation Undershoot

Ground Encounter, Less Severe

...

JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT IN FRONT OF US. BOTH THE COPLT AND I, HOWEVER, UNDERSTOOD TWR TO SAY, 'CLRED TO LAND, ACFT ON THE RWY.' SINCE THE ACFT IN FRONT OF US WAS CLR OF THE RWY AND WE BOTH MISUNDERSTOOD TWR'S RADIO CALL AND CONSIDERED IT AN ADVISORY, WE LANDED…

ASRS Report ExcerptSample of 60 ASRS Anomaly Categories

Asinglereportcanbeinmul.plecategories.

MarianaSta.s.calModelOp.miza.onFlowchart

Start at Co, γo, wo

Train SVM; Build model

Test Model w/ Validation

Data

Compare to ‘best’ AUROC

Update ‘best’ AUROC;

Save parameters

Move to new C, γ, w

Calculate AUROC

Apply statistical optimization

method

Yes

Save Model; Quit

ConvergedtoaSolu.on?

BeBerthanPrevious‘best’?

Yes

NoNo

AUROC‐areaunderreceiveroperatorcurve

Mariana w/ raw text only

Support Vector Machine w/NLP

Decision Tree w/NLP

Random Forest w/ NLP

Logistic Regression w/NLP

Linear Discriminant Analysis w/NLP

Marianavs.MethodsusingNaturalLanguageProcessing

TheMarianaalgorithmcangivesubstan.allybeBerresultsoverothermethods(greenbox),evenwhenprocessingrawtext.

Areaun

derRO

CCu

rve

Categories*

0.8

1.0

0.6

0.4

0.2

0.0

Outline

•  Combiningdiscreteandcon.nuoussequences:Mul.pleKernelAnomalyDetec.on(MKAD)– Derivedfromanomalydetec.onmethodsondiscreteandcon.nuoussequences.

•  TextMining:classifica.on,topicmodeling

•  Ongoing,futurework

SubspaceApproxima.on

•  Goals–  Systemobenassumedexplainablebysimplemodelplusnoise.Construc.onofsimpleapproxima.onmayreducenoise.

–  Derivesmall,keysetoffeaturestoexplainsystembehavior.

–  Lowerstoragerequirements.

•  Issue–  Derivedfeaturesmoreintui.veiftheyareposi.ve,reflec.ngpresenceofkeycomponentsthatadduptocharacterizetheitemofinterest.

Non‐nega.veMatrixFactoriza.on

•  NMFfindsfactorsrepresen.ngpartsoffaces(e.g.,nose,mouth).Easytointerpret.

•  VectorQuan.za.onandPCAfindholis.crepresenta.ons(face+/‐otherstuff).Difficulttointerpret.

Non‐nega.veMatrixFactoriza.on

Problem:Givenanonnega.vematrixandaposi.veintegerfindnonnega.vematricestominimizethefunc.on

A ∈ Rmxn

k <min{m,n}

WHisanonnega.vematrixfactoriza.on.ColumnsofWrepresentkbasisvectorscapturedfromnexampleshavingmfeaturevalueseach.Hgivesfactor‐exampleweights.kisproblem‐specificandchosenbyuser.

NMFforclassifica.on

•  NMFfactorscanbeusedforclustering,classifica.on,regression.

•  Classifica.onofASAPreportsintooneormorecontribu.ngfactors.

Sample(ASRS)BasisVectorsBasisVector1 BasisVector2

Run 1 2 3 1 2 3

ASRSClassifica.onResults

Outline

•  Combiningdiscreteandcon.nuoussequences:Mul.pleKernelAnomalyDetec.on(MKAD)– Derivedfromanomalydetec.onmethodsondiscreteandcon.nuoussequences.

•  TextMining:classifica.on,topicmodeling

•  Ongoing,futurework

Ongoing,FutureWork

•  Anomalydetec.onoverdiscrete,con.nuous,andtext(u.lizewhenavailable,don’tpenalizewhennot).

•  Anomalycause/precursoriden.fica.on.

•  Predic.onovermul.plescales:withinflight,acrossflights,acrossfleetsoveryears.

HowDoesHumanPerformanceAffectAvia.onSafety?

PhysiologicalMeasurements* Sample Text Report JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT IN FRONT OF US. ...

NASADataMiningLab(MountainView,CA)

Luton,UK

ToNASADataMiningTeamDailydata300GBflightspermonthPhysiology,text,cockpit,enginesandflightparameters,flightpath,networkinforma.on.

ToEasyJetNearreal‐.medecisionsupport

Understandingpilotfa.gue

Mul.pleKernelAnomalyDetec.on(KDD2010)

*Diagramforno.onalpurposesonly

MiningHeterogeneousDataistheKey

K(xi,xj)

Discrete

Textual

Networks

Con.nuous

PrimarySource:aircrabCananswerwhathappenedinanaircrabduringanAvia.onSafetyIncident

PrimarySource:Humans.CananswerwhyanAvia.onSafetyIncidenthappened

PrimarySource:Radardata.CananswerwhathappenedintheNa.onalAirspaceduringAvia.onSafetyIncident

Sample Text Report JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT IN FRONT OF US. ...

MassiveData:Archivesgrowingat100Kflightspermonth.

TheAnatomyofanAvia.onSafetyIncident

Time

. . . . . .

State k-1 State k State k+1

Transition k-1

State n+1State n-1 State n

Transitionk

Transitionk+1

Transitionn-2

Transitionn-1

Transitionn

Compromised States Safe States or AccidentsSafe

Incident

AnomalousStates

Scenario=(ContextBehaviorOutcome)

FromIrvingStatler,Avia.onSafetyMonitoringandModelingProject

DASHlinkdisseminate.collaborate.innovate.hBps://dashlink.arc.nasa.gov/

DASHlinkisacollabora.vewebsitedesignedtopromote:• Sustainability• Reproducibility• Dissemina.on

• Communitybuilding

Userscancreateprofiles

• Sharepapers,uploadanddownloadopensourcealgorithms• FindNASAdatasets.

ComingSoon…DASHlink2.0.

JoinDASHlink!

ConferenceonIntelligentDataUnderstanding‐2010

•  Conferencefocusedontheoryandapplica.onsofdataminingandmachinelearningtoEarthScience,SpaceScience,EngineeringSystems

•  Loca.on:ComputerHistoryMuseum,MountainView,CA

•  Date:October5‐6,2010•  Registra.on:Free✔

•  SteeringCommiBee•  AshokSrivastava(chair)•  StephenBoyd•  JiaweiHan•  EamonnKeogh•  VipinKumar•  ZoranObradovic•  NikunjOza•  RaghuRamakrishnan•  RamasamyUthurusamy•  RamasubbuVenkatesh•  XindongWu

ProgramChairs:NiteshChawlaandPhilipYu

CallforPar.cipa.on

References

•  S.Budalako.,A.N.Srivastava,andM.E.Otey,Anomalydetec.onanddiagnosisalgorithmsfordiscretesymbolsequenceswithapplica.onstoairlinesafety,inIEEESMCPartC,39(1),2008.

•  SantanuDas,KanishkaBhaduri,NikunjC.Oza,andAshokN.Srivastava,nu‐Anomica:AFastSupportVectorBasedNoveltyDetec.onTechnique,inICDM2009.

•  SantanuDas,BryanL.MaBhews,AshokN.Srivastava,andNikunjC.Oza,Mul.plekernellearningforheterogeneousanomalydetec.on:algorithmandavia.onsafetycasestudy,inKDD2010.

•  NikunjC.Oza,J.PatrickCastle,andJohnStutz,Classifica.onofAeronau.csSystemHealthandSafetyDocuments,inIEEESMCPartC,39(6),2009.