Data Mining in Aeronautics, Science, and Exploration Systems 2007 Conference

June 26-27, 2007

Computer History Museum, Mountain View, California, USA

Sponsored by
NASA Engineering and Safety Center
Science Mission Directorate
Aeronautics Research Mission Directorate - IVHM

Numerous disciplines, including aeronautics, physical sciences, and space exploration, have benefited from recent advances in data and text mining, machine learning, and statistics. The Data Mining in Aeronautics, Science, and Exploration Systems (DMASES) 2007 conference provides the data mining community with an opportunity to share these advances across the larger communities of engineers and scientists working in aeronautics, aerospace, and science. This single-track conference features in-depth lectures, tutorials, discussion, and a poster session.

Conference Organizers

Ashok N. Srivastava, Ph.D., Intelligent Systems Division, NASA Ames Research Center
Dawn M. McIntosh, Intelligent Systems Division, NASA Ames Research Center
Bob Beil, Systems Engineering Office, NASA Engineering and Safety Center

Session Chairs

Kevin H. Knuth, Ph.D. (Sciences), Department of Physics, State University of New York, Albany
Michael D. New, Capt., Ph.D. (Aeronautics), Delta Airlines, Inc.
Anindya Ghoshal, Ph.D. (Exp. Systems), United Technologies Research Center, United Technologies Corp.

Conference Agenda

Tuesday, June 26

8:00 AM REGISTRATION
8:30 AM Morning Announcements/Introductions
8:35 AM Mining Future Datascapes - Srivastava/NASA Ames Research Center
9:15 AM Ascent Summary Data Analysis Tool for Shuttle Wing Leading Edge Impact Detection - McIntosh/NASA Ames Research Center

Exploration Systems Session
9:35 AM Distributed Mobility Management for Target Tracking in Mobile Sensor Networks - Chakrabarty/Duke University
10:20 AM * break *
10:45 AM A Structural Neural System for Data Mining and Anomaly Detection - Schulz/University of Cincinnati
11:25 AM Current Trends in Performance Prognostics Using Integrated Simulation and Sensors - Baca/Sandia National Laboratories
12:25 PM * Poster Session/Lunch *

Sciences Session
2:00 PM Problem Solving Strategies: Sampling & Heuristics - Knuth/State University of New York, Albany
2:20 PM Making the Sky Searchable: Rapid Indexing for Automated Astrometry - Roweis/Google
2:30 PM Bayesian Analysis of the Cosmic Microwave Background - Jewell/NASA Jet Propulsion Laboratory
3:00 PM Efficient & Stable Gaussian Process Calculations - Foster/San Jose State University
3:30 PM * break *
4:00 PM Understanding Large-Scale Structure in Earth Science Remote Sensing Data Sets - Braverman/NASA Jet Propulsion Laboratory
4:30 PM Data-driven Modeling for Understanding Climate-Vegetation Interactions - Nemani/NASA Ames Research Center
5:00 PM END

Wednesday, June 27

8:00 AM REGISTRATION
8:30 AM Morning Announcements
8:35 AM Tutorial, session I - Principles of Bayesian Methods - Sansó/University of California, Santa Cruz
10:00 AM * break *
10:30 AM Tutorial, session II - Principles of Bayesian Methods - Sansó/University of California, Santa Cruz
12:30 PM * Collaboration Discussions & Networking/Lunch *

Aeronautics Session
1:30 PM National Aeronautics Research & Development Policy – Overview and Outreach - Schlickenmaier/NASA Headquarters
2:00 PM Applying Knowledge Representation to Runway Incursion - Wilczynski/University of Southern California
3:00 PM The Role of Data Mining in Aviation Safety Decision Making - McVenes/Air Line Pilots Association, International
3:30 PM * break *
4:00 PM Sifting NOAA Archived ACARS Data for Wind Variation to Improve Traffic Efficiency - Ren/Georgia Institute of Technology
4:30 PM Data & Text Mining in Boeing - Kao/Boeing Phantom Works
5:00 PM Concluding Remarks - Srivastava
5:10 PM END

Invited Presentations

Conference Coordinator Presentations

Mining Future Datascapes
Ashok Srivastava, NASA Ames Research Center

Ascent Summary Data Analysis Tool for Shuttle Wing Leading Edge Impact Detection
Dawn McIntosh, NASA Ames Research Center

NASA Engineering and Safety Center Data Mining and Trending Working Group
Bob Beil, NASA Engineering and Safety Center

Tuesday, June 26

Distributed Mobility Management for Target Tracking in Mobile Sensor Networks
Krishnendu Chakrabarty, Duke University

A Structural Neural System for Data Mining and Anomaly Detection
Mark Schulz, University of Cincinnati

Current Trends in Performance Prognostics Using Integrated Simulation and Sensors
Thomas J. Baca, Sandia National Laboratories

Problem Solving Strategies: Sampling and Heuristics
Kevin Knuth, SUNY Albany

Making the Sky Searchable: Rapid Indexing for Automated Astrometry
Sam Roweis, Google

Bayesian Analysis of the Cosmic Microwave Background
Jeff Jewell, NASA Jet Propulsion Laboratory

Efficient and Stable Gaussian Process Calculations
Leslie Foster, San Jose State University

Understanding Large-Scale Structure in Earth Science Remote Sensing Data Sets
Amy Braverman, NASA Jet Propulsion Laboratory

Data-Driven Modeling for Understanding Climate-Vegetation Interfaces
Ramakrishna Nemani, NASA Ames Research Center

Efficient & Stable Gaussian Process Calculations

Leslie Foster, San Jose State University

The Gaussian process technique is one popular approach for analyzing and making predictions related to large data sets. However, the traditional Gaussian process approach requires solving a system of linear equations that, in many cases, is too large to solve in a reasonable amount of time. We describe how low-rank approximations can be used to solve these equations approximately. The resulting algorithm is fast, accurate, numerically stable, and general. We illustrate the application of the algorithm to the prediction of redshifts using broad spectrum measurements of the light from galaxies.

OUTLINES | THE PROBLEM AND BACKGROUND | LOW RANK APPROXIMATION | NUMERICAL STABILITY AND RANK SELECTION | RESULTS | CONCLUSIONS

EFFICIENT AND STABLE GAUSSIAN PROCESS CALCULATIONS

Leslie Foster, Nabeela Aijaz, Michael Hurley, Apolo Luis, Joel Rinsky, Chandrika Satyavolu, Alex Waagen (team leader)

Mathematics, San Jose State University

[email protected]

June 26, 2007, DMASES 2007

ABSTRACT

The Gaussian process technique is one popular approach for analyzing and making predictions related to large data sets. However the traditional Gaussian process approach requires solving a system of linear equations that, in many cases, is so large that it is not practical to solve in a reasonable amount of time. We describe how low rank approximations can be used to solve these equations approximately. The resulting algorithm is fast, accurate, numerically stable and general. We illustrate the application of the algorithm to the prediction of redshifts using broad spectrum measurements of the light from galaxies.

OUTLINE

I. The Problem and Background

II. Low Rank Approximation

III. Numerical Stability and Rank Selection

IV. Results

V. Conclusions

PREDICTION AND ESTIMATION

Training Data:

$X$ – data matrix of observations – $n \times d$

$y$ – vector of target data – $n \times 1$

Testing Data:

$X^*$ – matrix of new observations – $n^* \times d$

Goals:

predict $y^*$ corresponding to $X^*$

estimate $y$ corresponding to $X$

Approaches for prediction with large data sets:

Traditional regression

Neural networks

Support Vector Machines

E-model

. . .

Gaussian Processes

GAUSSIAN PROCESS SOLUTION

Form covariance matrix $K$ ($n \times n$), cross covariance matrix $K^*$ ($n^* \times n$) and select parameter $\lambda$

predict $y^*$ using

$$y^* = K^* (\lambda^2 I + K)^{-1} y$$

$(\lambda^2 I + K)$ is large – for example $180000 \times 180000$
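The prediction formula above can be sketched in a few lines of NumPy for a problem small enough to solve directly. This is an illustrative sketch only: the squared exponential kernel, the `length_scale` parameter, the function names, and the synthetic sine data are assumptions, not taken from the talk.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared exponential covariance k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * length_scale**2))

def gp_predict(X, y, X_star, lam, kernel=sq_exp_kernel):
    """Return y* = K* (lam^2 I + K)^{-1} y without forming the inverse."""
    K = kernel(X, X)            # n x n covariance matrix
    K_star = kernel(X_star, X)  # n* x n cross covariance matrix
    alpha = np.linalg.solve(lam**2 * np.eye(len(X)) + K, y)
    return K_star @ alpha

# Synthetic example (assumed data): noisy samples of sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)
X_star = np.linspace(-2.5, 2.5, 50)[:, None]

y_star = gp_predict(X, y, X_star, lam=0.1)
max_err = np.max(np.abs(y_star - np.sin(X_star[:, 0])))
```

The dense solve is the step that becomes impractical at the sizes quoted on the slide, which is what motivates the rest of the talk.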

COVARIANCE FUNCTIONS AND MATRICES

Definition: A covariance function $k(x, x')$ is the measure of covariance between input points $x$ and $x'$.

covariance matrix (SPD): $K_{ij} = k(x_i, x_j)$

Examples: Polynomial, Squared Exponential, Neural Network, Rational Quadratic, Matern Class, . . .

COMPUTATIONAL CHALLENGES

Memory: Storing covariance matrix – $O(n^2)$

Time: Solving linear system – $O(n^3)$

Numerical stability: accurate calculations.
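To make these orders of growth concrete, the following back-of-the-envelope calculation uses the $180000 \times 180000$ example size. The 8-byte double precision entries and the $n^3/3$ leading-term operation count for a Cholesky factorization are assumptions chosen for illustration.

```python
n = 180_000                 # example matrix dimension
bytes_per_entry = 8         # assumption: IEEE double precision

dense_bytes = n * n * bytes_per_entry
dense_gb = dense_bytes / 1e9

# Assumption: a Cholesky-based solve, whose leading-order cost is n^3 / 3 flops.
cholesky_flops = n**3 / 3

print(f"dense storage: {dense_gb:.1f} GB")
print(f"Cholesky work: {cholesky_flops:.2e} flops")
```

Under these assumptions the dense matrix alone needs roughly 259 GB, far more than fits in memory on a typical workstation.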

APPLICATION: REDSHIFT CALCULATION

Indicates that an object is moving away from you

A redshift is the change in wavelength divided by the initial wavelength

For example, the sound from this train is shifted and changes pitch when moving away
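Written as a formula, the definition above reads as follows. The symbol names and the numerical values are illustrative assumptions: a spectral line emitted at 656.3 nm and observed at 721.93 nm.

```latex
z = \frac{\lambda_{\mathrm{observed}} - \lambda_{\mathrm{emitted}}}{\lambda_{\mathrm{emitted}}},
\qquad
z = \frac{721.93\ \mathrm{nm} - 656.3\ \mathrm{nm}}{656.3\ \mathrm{nm}} = 0.1
```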

Scientists want to determine the position of galaxies in the universe.

Useful for understanding the structure of the universe.

Five photometric observations for each galaxy denoted U, G, R, I, Z

We have 180,045 examples with a known U, G, R, I, Z and redshift.

The goal is to be able to predict a new redshift given new U, G, R, I, Z data from a new galaxy.

Testing set: 20,229 galaxies

BACKGROUND: LEAST SQUARES PROBLEMS

Given:
$n \times m$ matrix $A$, $n \geq m$
$n \times 1$ vector $y$
$n^* \times m$ matrix $A^*$

Solve $\min_x \|y - Ax\|$

Estimate $y$: $\hat{y} = Ax$

Predict $y^*$: $y^* = A^* x$

BACKGROUND: NORMAL EQUATIONS

$$x = (A^T A)^{-1} A^T y$$

Advantage: Fast

Disadvantage: $\mathrm{cond}(A^T A) = \mathrm{cond}^2(A)$

relative error in $x \propto \mathrm{cond}^2(A)$ always

potential numerical instability
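The squaring of the condition number can be checked numerically. The sketch below builds a small test matrix with prescribed singular values (the sizes, the seed, and the target condition number of $10^4$ are assumptions for illustration) and compares the 2-norm condition numbers of $A$ and $A^T A$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a 50 x 5 matrix A = U diag(s) W^T with chosen singular values s,
# so that cond(A) = max(s) / min(s) = 1e4 by construction.
U, _ = np.linalg.qr(rng.standard_normal((50, 5)))   # orthonormal columns
W, _ = np.linalg.qr(rng.standard_normal((5, 5)))    # orthogonal matrix
singular_values = np.logspace(0, -4, 5)
A = U @ np.diag(singular_values) @ W.T

cond_A = np.linalg.cond(A)          # 2-norm condition number of A
cond_AtA = np.linalg.cond(A.T @ A)  # condition number of the normal equations matrix

print(f"cond(A)     = {cond_A:.3e}")
print(f"cond(A^T A) = {cond_AtA:.3e}")
```

With these values the normal equations matrix has a condition number near $10^8$, so about half of the available double precision digits are at risk before the system is even solved.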

BACKGROUND: ORTHOGONAL (QR) FACTORIZATION

Form $A = QR$ where
$Q$ is $n \times m$ with orthonormal columns
$R$ is $m \times m$ right triangular

$$x = R^{-1} Q^T y$$

Disadvantages: can be slower, more memory (in Matlab)

Advantage:
numerically stable
can be more accurate
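A small experiment illustrates the accuracy difference between the two approaches. This is a sketch under assumptions: the test matrix is synthetic with condition number $10^6$, the right-hand side is chosen so that the exact solution is known, and the helper names `lstsq_qr` and `lstsq_normal` are hypothetical.

```python
import numpy as np

def lstsq_qr(A, y):
    """Solve min ||y - A x|| via the thin QR factorization A = QR."""
    Q, R = np.linalg.qr(A)               # Q: n x m, R: m x m upper triangular
    return np.linalg.solve(R, Q.T @ y)   # x = R^{-1} Q^T y

def lstsq_normal(A, y):
    """Solve the same problem through the normal equations."""
    return np.linalg.solve(A.T @ A, A.T @ y)

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((100, 6)))
W, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = U @ np.diag(np.logspace(0, -6, 6)) @ W.T    # cond(A) = 1e6 by construction
x_true = rng.standard_normal(6)
y = A @ x_true                                  # consistent system, exact solution known

err_qr = np.linalg.norm(lstsq_qr(A, y) - x_true) / np.linalg.norm(x_true)
err_normal = np.linalg.norm(lstsq_normal(A, y) - x_true) / np.linalg.norm(x_true)
print(f"relative error, QR:               {err_qr:.1e}")
print(f"relative error, normal equations: {err_normal:.1e}")
```

For this kind of matrix the QR error scales with cond(A) while the normal equations error scales with cond(A) squared, so the gap is typically several orders of magnitude.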

LOW RANK APPROXIMATION

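$$K = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix} = \begin{pmatrix} K_1 & K_2 \end{pmatrix}$$

where $K_{11}$ is $m \times m$, $K_{22}$ is $(n-m) \times (n-m)$, $K_1$ is $n \times m$ and $K_2$ is $n \times (n-m)$

$$K^* = \begin{pmatrix} K_1^* & K_2^* \end{pmatrix}$$

where $K_1^*$ is $n^* \times m$ and $K_2^*$ is $n^* \times (n-m)$

$$K \cong \hat{K} \equiv K_1 K_{11}^{-1} K_1^T$$

$$K^* \cong \hat{K}^* \equiv K_1^* K_{11}^{-1} K_1^T$$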
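The approximation $\hat{K} = K_1 K_{11}^{-1} K_1^T$ can be sketched as follows. The kernel (squared exponential with unit length scale), the random 2-D inputs, the rank $m = 10$, and the use of the first $m$ columns are all assumptions made for illustration; the function name `low_rank_approx` is hypothetical.

```python
import numpy as np

def low_rank_approx(K, m):
    """Return K_hat = K1 K11^{-1} K1^T built from the first m columns of K."""
    K1 = K[:, :m]      # n x m block of selected columns
    K11 = K[:m, :m]    # m x m leading block
    return K1 @ np.linalg.solve(K11, K1.T)

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 4.0, size=(200, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)        # assumed squared exponential covariance matrix

m = 10
K_hat = low_rank_approx(K, m)
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative Frobenius error with m = {m}: {rel_err:.3f}")
```

By construction $\hat{K}$ reproduces the selected columns of $K$ exactly (up to rounding) and has rank at most $m$; how well it matches the remaining entries depends on which columns are selected.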

LOW RANK APPROXIMATION: SR FORMULA

Recall $y^* = K^* (\lambda^2 I + K)^{-1} y$

Replace $K$ with $\hat{K}$ and $K^*$ with $\hat{K}^*$ so that

$$y^* \cong \hat{K}^* (\lambda^2 I + \hat{K})^{-1} y = \dots$$

$$y^* \cong K_1^* (\lambda^2 K_{11} + K_1^T K_1)^{-1} K_1^T y$$

Subset of Regressors Formula [Wahba, 1990]
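The equivalence between the two expressions can be checked numerically on a small problem. The sketch below is an illustration under assumptions (synthetic 2-D inputs, a squared exponential kernel, the first $m$ columns as the selected subset, and arbitrary values for $n$, $n^*$, $m$ and $\lambda$); it is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_star, m, lam = 120, 15, 8, 0.3

# Training rows first, then test rows; squared exponential kernel assumed.
Z = rng.uniform(0.0, 4.0, size=(n + n_star, 2))
sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
K_all = np.exp(-0.5 * sq_dists)

K = K_all[:n, :n]
K1, K11 = K[:, :m], K[:m, :m]
K1_star = K_all[n:, :m]              # n* x m block of K*
y = rng.standard_normal(n)

# Direct use of the approximations K_hat and K_hat*: an n x n solve.
K_hat = K1 @ np.linalg.solve(K11, K1.T)
K_hat_star = K1_star @ np.linalg.solve(K11, K1.T)
y_direct = K_hat_star @ np.linalg.solve(lam**2 * np.eye(n) + K_hat, y)

# Subset of regressors form: only an m x m system is solved.
y_sr = K1_star @ np.linalg.solve(lam**2 * K11 + K1.T @ K1, K1.T @ y)

max_diff = np.max(np.abs(y_direct - y_sr))
print(f"largest difference between the two forms: {max_diff:.1e}")
```

The second form never builds an $n \times n$ matrix, which is where the memory and time savings come from.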

COMPUTATIONAL CHALLENGES OVERCOME

$$y^* \cong K_1^* (\lambda^2 K_{11} + K_1^T K_1)^{-1} K_1^T y$$

Memory: Storing covariance matrix – $O(nm)$

Time: Solving linear system – $O(nm^2)$

Numerical stability: ???

SR FORMULA AND LEAST SQUARES

In SR formula consider special case $\lambda = 0$

$$y^* = K_1^* (K_1^T K_1)^{-1} K_1^T y$$

Exactly normal equations solution to the least squares prediction problem: $\min \|y - K_1 x\|$ and $y^* = K_1^* x$

Note: can be easily extended for $\lambda \neq 0$

Potential numerical instability

CURES FOR NUMERICAL INSTABILITY

1. Use stable technique for least squares problem

QR factorization

"V method"

2. Make K1 as well conditioned as possible

THE V METHOD

Factor $K_1 = V V_{11}^T$ where $V$ is $n \times m$ and $V_{11}$ is $m \times m$ lower triangular

$$y^* = K_1^* V_{11}^{-T} (\lambda^2 I + V^T V)^{-1} V^T y$$

$V$ is a rescaling of a well conditioned matrix

method is numerically stable

can be faster and need less memory

related to [Peters and Wilkinson, 1970], [Wahba, 1990, p. 136]
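A literal transcription of the V method formula might look like the sketch below. Several things here are assumptions: the selected columns are taken to be the first $m$ (no pivoting), $V_{11}$ is obtained from a Cholesky factorization of $K_{11}$, the data and kernel are synthetic, and the function names are hypothetical. The subset of regressors formula is included only to check that both give the same prediction in exact arithmetic.

```python
import numpy as np

def factor_v(K1):
    """Return (V, V11) with K1 = V V11^T and V11 lower triangular."""
    m = K1.shape[1]
    V11 = np.linalg.cholesky(K1[:m, :m])   # K11 = V11 V11^T
    V = np.linalg.solve(V11, K1.T).T       # V = K1 V11^{-T}, shape n x m
    return V, V11

def v_method_predict(K1, K1_star, y, lam):
    """y* = K1* V11^{-T} (lam^2 I + V^T V)^{-1} V^T y."""
    V, V11 = factor_v(K1)
    m = V.shape[1]
    z = np.linalg.solve(lam**2 * np.eye(m) + V.T @ V, V.T @ y)
    return K1_star @ np.linalg.solve(V11.T, z)

def sr_predict(K1, K1_star, y, lam):
    """Subset of regressors formula, used here only as a cross-check."""
    m = K1.shape[1]
    return K1_star @ np.linalg.solve(lam**2 * K1[:m, :m] + K1.T @ K1, K1.T @ y)

rng = np.random.default_rng(6)
n, n_star, m, lam = 120, 15, 8, 0.3
Z = rng.uniform(0.0, 4.0, size=(n + n_star, 2))
K_all = np.exp(-0.5 * ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
K1, K1_star = K_all[:n, :m], K_all[n:, :m]
y = rng.standard_normal(n)

V, V11 = factor_v(K1)
y_v = v_method_predict(K1, K1_star, y, lam)
y_sr = sr_predict(K1, K1_star, y, lam)
```

The top $m \times m$ block of $V$ coincides with $V_{11}$, which is why only $V$ needs to be stored.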

COLUMN SELECTION

Use partial Cholesky factorization with pivoting to form $V$

selects appropriate columns for $K_1$

$K_1$ will be well conditioned: $\mathrm{cond}(K_1)$ is $O$(condition of optimal low rank approximation)

[Higham, 2002, pp. 196-208]
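One common way to implement a partial Cholesky factorization with diagonal pivoting is sketched below. The talk does not show the authors' code, so the function name, the greedy largest-diagonal pivot rule as written here, and the synthetic kernel matrix are assumptions. The routine touches only the diagonal of $K$ and the $m$ selected columns, and keeps the rows of the factor in their original order instead of permuting them.

```python
import numpy as np

def pivoted_partial_cholesky(K, m):
    """Run m steps of Cholesky with diagonal pivoting on an SPD matrix K.

    Returns (L, pivots): L is n x m with rows in the original order, and
    pivots lists the selected column indices, so that
    K[:, pivots] = L @ L[pivots].T  (the relation K1 = V V11^T).
    """
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # diagonal of the current residual
    L = np.zeros((n, m))
    pivots = []
    for j in range(m):
        p = int(np.argmax(d))             # largest remaining diagonal entry
        pivots.append(p)
        # Column p of the residual K - L L^T, scaled by the pivot's square root.
        L[:, j] = (K[:, p] - L[:, :j] @ L[p, :j]) / np.sqrt(d[p])
        d -= L[:, j] ** 2
        d[pivots] = 0.0                   # selected entries are fully resolved
    return L, np.array(pivots)

rng = np.random.default_rng(7)
X = rng.uniform(0.0, 4.0, size=(150, 2))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

m = 10
L, pivots = pivoted_partial_cholesky(K, m)
residual_trace = np.trace(K - L @ L.T)
```

Here `L[pivots]` plays the role of the lower triangular $V_{11}$ and `K[:, pivots]` plays the role of $K_1$.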

CHOICE OF RANK

For least squares problems there are efficient techniques to drop columns [Björck, 1996, p. 133]

The techniques can be easily adapted

Solve GP problem with rank $m$ approximation

Small additional cost to determine the accuracy of all lower rank $k$ approximations, $k = 1, \dots, m$

COMPUTING TIMES

ESTIMATING A METHOD’S ACCURACY: BOOTSTRAP

Bootstrap: standard statistical resampling technique

Generate multiple (100) samples to test methods

Determine reliability, error bounds

Stable methods have smaller range of error
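The resampling idea can be sketched generically as follows. The slides do not spell out the exact protocol, so this sketch assumes that the training rows are resampled with replacement, the model is refit on each resample, and the RMSE is measured on a fixed test set. The ordinary least squares model and the synthetic data are stand-ins, and the names `bootstrap_rmse` and `linear_fit_predict` are hypothetical.

```python
import numpy as np

def bootstrap_rmse(fit_predict, X, y, X_test, y_test, n_resamples=100, seed=0):
    """Refit on bootstrap resamples of the training set; return test RMSEs."""
    rng = np.random.default_rng(seed)
    n = len(y)
    rmses = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)      # n training rows, with replacement
        y_pred = fit_predict(X[idx], y[idx], X_test)
        rmses[b] = np.sqrt(np.mean((y_pred - y_test) ** 2))
    return rmses

def linear_fit_predict(X, y, X_test):
    """Stand-in model: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.column_stack([X_test, np.ones(len(X_test))]) @ coef

# Synthetic data (assumed): linear signal plus noise of standard deviation 0.1.
rng = np.random.default_rng(5)
w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.standard_normal((500, 5))
y = X @ w + 0.1 * rng.standard_normal(500)
X_test = rng.standard_normal((200, 5))
y_test = X_test @ w + 0.1 * rng.standard_normal(200)

rmses = bootstrap_rmse(linear_fit_predict, X, y, X_test, y_test)
spread = rmses.max() - rmses.min()
print(f"mean RMSE {rmses.mean():.4f}, range {spread:.4f}")
```

A numerically stable method should show a narrow range of RMSE values across resamples; a wide or erratic range points to instability in the fit.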

BOOTSTRAP RESAMPLING, n = 180045, m = 100

BOOTSTRAP RESAMPLING: V METHOD

V method with pivoting, n = 36009, m = 1000

BOOTSTRAP RESAMPLING: V + SR METHOD

V method with pivoting and SR method

BOOTSTRAP RESAMPLING: V, WITH AND W/O PIVOTING

RMSE ERROR VS. RANK

RMSE ERROR VS. NUMBER OF GALAXIES

GP VS. ALTERNATIVE METHODS

Way and Srivastava, 2006:

Way and Srivastava, 2006 + our results:

SUMMARY OF RESULTS

Code solves linear algebra issues in the Gaussian process approach:

Fast - $O(nm^2)$, $m \ll n$

Accurate - good predictions

Stable - bootstrap error curves flat

General - works for any kernel

FURTHER WORK

Outliers

hyperparameters using low rank approximation (we used minimize from [Rasmussen and Williams, 2006])

additional covariance functions

lower bound on errors (e.g., for redshift 0.02?)

REFERENCES

Å. Björck, Numerical Methods for Least Squares Problems, SIAM, 1996.

N. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, 2002.

G. Peters and J. Wilkinson, Comput. J. (13), pp. 309-316, 1970.

C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.

G. Wahba, Spline Models for Observational Data, SIAM, 1990.

M. Way and A. Srivastava, Astrophysical Journal (647), pp. 102-115, 2006.

ACKNOWLEDGEMENT

We would like to thank the Woodward Fund for the financial support and the following people for their guidance.

Drs. Michael Way, Ashok Srivastava, Tim Lee, Paul Gazis (NASA scientists)

Dr. Tim Hsu (CAMCOS director)

Drs. Bem Cayco, Wasin So and Steve Crunk (SJSU faculty)
