
Handling Outliers and Missing Data in Statistical Data Models

Kaushik Mitra

Date: 17/1/2011

ECSU Seminar, ISI

Statistical Data Models

• Goal: Find structure in data

• Applications

– Finance
– Engineering
– Sciences (e.g., biological)
– Wherever we deal with data

• Some examples

– Regression
– Matrix factorization

• Challenges: Outliers and Missing data

Outliers Are Quite Common

Google search results for 'male faces'

Need to Handle Outliers Properly

Removing salt-and-pepper (outlier) noise

[Figure: noisy image, Gaussian-filtered image, desired result]

Missing Data Problem

Completing missing tracks (missing tracks in structure from motion)

[Figure: incomplete tracks, tracks completed by a sub-optimal method, desired result]

Our Focus

• Outliers in regression

– Linear regression

– Kernel regression

• Matrix factorization in presence of missing data

Robust Linear Regression for High-Dimensional Problems

What is Regression?

• Regression

– Find functional relation between y and x

• x: independent variable

• y: dependent variable

– Given

• data: (yi,xi) pairs

• Model: y = f(x, w) + n

– Estimate w

– Predict y for a new x
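To make this concrete, here is a minimal least-squares sketch of the linear case y = x^T w + n (toy data, sizes, and variable names are my own, not from the talk):

```python
import numpy as np

# Toy data: y_i = x_i^T w_true + Gaussian noise
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.standard_normal((N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(N)

# Estimate w by least squares, then predict y for a new x
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
x_new = rng.standard_normal(D)
y_pred = x_new @ w_hat
```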

Robust Regression

• Real-world data is corrupted with outliers

• Outliers make estimates unreliable

• Robust regression

– Unknowns

• Parameter w
• Outliers

– Combinatorial problem

• N data points, k outliers
• C(N, k) ways to choose the outliers

Prior Work

• Combinatorial algorithms

– Random sample consensus (RANSAC)

– Least Median of Squares (LMedS)

• Exponential in dimension

• M-estimators

– Robust cost functions

– Prone to local minima

Robust Linear Regression Model

• Linear regression model: yi = xi^T w + ei

– ei, Gaussian noise

• Proposed robust model: ei = ni + si

– ni, inlier noise (Gaussian)
– si, outlier noise (sparse)

• Matrix-vector form

– y = Xw + n + s

• Estimate w, s

In expanded form:

[y1; y2; …; yN] = [x1^T; x2^T; …; xN^T] [w1; w2; …; wD] + [n1; n2; …; nN] + [s1; s2; …; sN]

i.e., y = Xw + n + s, with X the N x D matrix of stacked rows xi^T.
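As an illustration, a small simulation of this model might look as follows (the sizes, noise level, and outlier count are arbitrary choices, not values from the talk):

```python
import numpy as np

# Simulate the robust model y = Xw + n + s:
# n is dense Gaussian inlier noise, s is sparse outlier noise.
rng = np.random.default_rng(1)
N, D, k = 200, 5, 20                              # k = assumed number of outliers
X = rng.standard_normal((N, D))
w = rng.standard_normal(D)
n = 0.05 * rng.standard_normal(N)
s = np.zeros(N)
outlier_idx = rng.choice(N, size=k, replace=False)
s[outlier_idx] = 5.0 * rng.standard_normal(k)     # large, sparse corruptions
y = X @ w + n + s
```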

Simplification

• Objective (RANSAC): Find w that minimizes the number of outliers

• Original problem:

min_{w,s} ||s||_0 subject to ||y − Xw − s||_2 ≤ ε

• Eliminate w

– Model: y = Xw + n + s
– Premultiply by C with CX = 0 (possible since N ≥ D)
– Cy = CXw + Cs + Cn
– z = Cs + g, with g Gaussian

• Problem becomes:

min_s ||s||_0 subject to ||z − Cs||_2 ≤ ε

• Solve for s -> identify outliers -> least squares on the inliers -> w
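A minimal sketch of the elimination step, reusing X and y from the simulation above and assuming X has full column rank:

```python
import numpy as np
from scipy.linalg import null_space

# Rows of C span the left null space of X, so CX = 0 (needs N >= D).
C = null_space(X.T).T          # shape (N - D, N); C @ X is (numerically) zero
z = C @ y                      # z = Cs + g, with g Gaussian
# Outlier identification now reduces to finding a sparse s with z ≈ C s.
```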

Relation to Sparse Learning

• Solve:

min_s ||s||_0 subject to ||z − Cs||_2 ≤ ε

– Combinatorial problem

• Sparse basis selection / sparse learning

• Two approaches:

– Basis Pursuit (Chen, Donoho, Saunders 1995)
– Bayesian Sparse Learning (Tipping 2001)

Basis Pursuit Robust Regression (BPRR)

• Solve:

min_s ||s||_1 subject to ||z − Cs||_2 ≤ ε

– Basis Pursuit Denoising (Chen et al. 1995)
– Convex problem
– Cubic complexity: O(N^3)

• From compressive sensing theory (Candes 2005)

– Equivalent to the original problem if

• s is sparse
• C satisfies the Restricted Isometry Property (RIP)

– Isometry: ||s1 − s2|| ≈ ||C(s1 − s2)||
– Restricted: to the class of sparse vectors

• In general, no guarantees for our problem
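Continuing the sketch above, BPRR can be approximated with an off-the-shelf L1 solver; here the constrained basis-pursuit-denoising problem is replaced by its Lagrangian (Lasso) form, so the penalty weight and the outlier threshold below are arbitrary stand-ins:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative stand-in for BPRR: the constrained problem
#   min ||s||_1  s.t.  ||z - Cs||_2 <= eps
# is replaced by the Lagrangian (Lasso) form  min ||z - Cs||^2 + alpha * ||s||_1.
lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(C, z)
s_hat = lasso.coef_

# Identify outliers as entries with large |s_i|, then refit w on the inliers.
inliers = np.abs(s_hat) < 1e-3          # arbitrary threshold
w_hat, *_ = np.linalg.lstsq(X[inliers], y[inliers], rcond=None)
```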

Bayesian Sparse Robust Regression (BSRR)

• Sparse Bayesian learning technique (Tipping 2001)

– Puts a sparsity-promoting prior on s: p(s) = ∏_{i=1..N} 1/|si|
– Likelihood: p(z|s) = N(Cs, εI)
– Solves the MAP problem for p(s|z)
– Cubic complexity: O(N^3)
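A minimal EM-style sparse-Bayesian-learning loop for z = Cs + g is sketched below; this follows generic evidence/EM updates rather than Tipping's exact implementation, and the inlier-noise variance is assumed known:

```python
import numpy as np

# Sparse Bayesian learning sketch for z = Cs + g with s_i ~ N(0, gamma_i);
# hyperparameters gamma_i that shrink toward zero prune the corresponding s_i.
eps = 0.05 ** 2                          # assumed inlier-noise variance
gamma = np.ones(C.shape[1])              # one hyperparameter per data point
for _ in range(100):
    Sigma = np.linalg.inv(C.T @ C / eps + np.diag(1.0 / gamma))
    mu = Sigma @ C.T @ z / eps           # posterior mean of s
    gamma = mu ** 2 + np.diag(Sigma)     # EM update of the hyperparameters
    gamma = np.maximum(gamma, 1e-12)     # numerical floor
s_map = mu                               # estimate of the sparse outlier vector
```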

Setup for Empirical Studies

• Synthetically generated data

• Performance criteria

– Angle between the ground-truth and estimated hyperplanes

Vary Outlier Fraction

• BSRR performs well in all dimensions

• Combinatorial algorithms like RANSAC, MSAC, and LMedS are not practical in high dimensions

[Plots: angle error vs. outlier fraction for dimension = 2, 8, and 32]

Facial Age Estimation

• FG-NET dataset: 1002 images of 82 subjects

• Regression

– y: age
– x: geometric feature vector

Outlier Removal by BSRR

• Label data as inliers and outliers

• Detected 177 outliers in 1002 images

BSRR results (leave-one-out testing):

– Inlier MAE: 3.73
– Outlier MAE: 19.14
– Overall MAE: 6.45

Summary for Robust Linear Regression

• Modeled outliers as a sparse variable

• Formulated robust regression as Sparse Learning problem

– BPRR and BSRR

• BSRR gives the best performance

• Limitation: linear regression model

– Addressed next with a kernel model

Robust RVM Using Sparse Outlier Model

Relevance Vector Machine (RVM)

• RVM model: y = w0 + Σ_{i=1..N} wi k(x, xi) + e

– k(x, xi): kernel function

• Examples of kernels

– k(xi, xj) = (xi^T xj)^2 : polynomial kernel
– k(xi, xj) = exp(−||xi − xj||^2 / 2σ^2) : Gaussian kernel

• Kernel trick: k(xi, xj) = ψ(xi)^T ψ(xj)

– Maps xi to the feature space ψ(xi)
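The two example kernels can be written directly as Gram-matrix functions; a small sketch (function names and data are mine):

```python
import numpy as np

# The two example kernels from the slide, evaluated as full Gram matrices.
def polynomial_kernel(X1, X2):
    """k(xi, xj) = (xi^T xj)^2"""
    return (X1 @ X2.T) ** 2

def gaussian_kernel(X1, X2, sigma=1.0):
    """k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))"""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# K as used in the RVM model (training points against themselves)
X_train = np.random.default_rng(2).standard_normal((50, 2))
K = gaussian_kernel(X_train, X_train)
```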

RVM: A Bayesian Approach

• Bayesian approach

– Prior distribution: p(w)
– Likelihood: p(y | x, w)

• Prior specification

– p(w): sparsity-promoting prior, p(wi) = 1/|wi|
– Why sparse?

• Use a smaller subset of the training data for prediction
• As in the support vector machine

• Likelihood

– Gaussian noise
– Non-robust: susceptible to outliers

Robust RVM model

• Original RVM model: y = w0 + Σ_{j=1..N} wj k(x, xj) + e

– e, Gaussian noise

• Explicitly model outliers, ei= ni + si

– ni, inlier noise (Gaussian)

– si, outlier noise (sparse and heavy-tailed)

• Matrix-vector form

– y = Kw + n + s

• Parameters to be estimated: w and s


Robust RVM Algorithms

• y = [K | I] w_s + n

– w_s = [w^T s^T]^T : sparse vector

• Two approaches

– Bayesian

– Optimization

Robust Bayesian RVM (RB-RVM)

• Prior specification

– w and s independent : p(w, s) = p(w)p(s)

– Sparsity promoting prior for s: p(si)= 1/|si|

• Solve for posterior p(w, s|y)

• Prediction: use w inferred above

• Computation: a bigger RVM

– ws instead of w

– [K|I] instead of K
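Computationally, RB-RVM is just an RVM run on an augmented problem. A sketch of the augmentation, reusing the kernel matrix K from the sketch above (the actual sparse-Bayesian solve is omitted):

```python
import numpy as np

# RB-RVM reuses the RVM machinery on an augmented problem:
#   y = [K | I] w_s + n,   with w_s = [w^T  s^T]^T treated as one sparse vector.
N = K.shape[0]
K_aug = np.hstack([K, np.eye(N)])   # augmented design matrix [K | I], shape (N, 2N)
# Running a standard RVM / sparse-Bayesian solver on (K_aug, y) returns w_s;
# its first N entries are the kernel weights w, the last N are the outliers s.
```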

Basis Pursuit RVM (BP-RVM)

• Optimization approach:

min_{w_s} ||w_s||_0 subject to ||y − [K | I] w_s||_2 ≤ ε

– Combinatorial

• Closest convex approximation:

min_{w_s} ||w_s||_1 subject to ||y − [K | I] w_s||_2 ≤ ε

• From compressive sensing theory

– Same solution if [K | I] satisfies RIP

• In general, this cannot be guaranteed

Experimental Setup

Prediction: Asymmetric Outliers Case

Image Denoising

• Salt and pepper noise

– Outliers

• Regression formulation

– Image as a surface over 2D grid

• y: Intensity

• x: 2D grid

• Denoised image obtained by prediction
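A rough sketch of this regression formulation on a single patch; the talk uses RB-RVM as the regressor, whereas this placeholder uses kernel ridge regression (not robust to outliers) purely to show the grid-to-intensity setup:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Denoising as regression: x = pixel coordinates, y = intensities.
# A robust regressor (RB-RVM in the talk) would suppress salt-and-pepper
# outliers; KernelRidge here is only a stand-in for the regression step.
def denoise_patch(patch, gamma=0.5, alpha=1e-2):
    h, w = patch.shape
    xx, yy = np.meshgrid(np.arange(w), np.arange(h))
    coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)
    model = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha)
    model.fit(coords, patch.ravel().astype(float))
    return model.predict(coords).reshape(h, w)   # denoised patch by prediction
```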

Salt and Pepper Noise

Some More Results

[Figure: denoising results for RVM, RB-RVM, and median filter]

Age Estimation from Facial Images

• RB-RVM detected 90 outliers

• Leave-one-person-out testing

Summary for Robust RVM

• Modeled outliers as sparse variables

• Jointly estimated the parameters and the outliers

• The Bayesian approach gives very good results

Limitations of Regression

• Regression: y = f(x, w) + n

– Noise in only "y"
– Not always reasonable

• All variables have noise

– M = [x1 x2 … xN]
– Principal component analysis (PCA): [x1 x2 … xN] = AB^T

• A: principal components
• B: coefficients

– M = AB^T: matrix factorization (our next topic)

Matrix Factorization in the presence of Missing Data

Applications in Computer Vision

• Matrix factorization: M=ABT

• Applications: build 3-D models from images

– Geometric approach (multiple views)
– Photometric approach (multiple lightings)


[Figures: Structure from Motion (SfM); photometric stereo]

Matrix Factorization

• Applications in Vision

– Affine Structure from Motion (SfM)
– Photometric stereo

• Solution: SVD

– M = USV^T
– Truncate S to rank r
– A = US^0.5, B = VS^0.5


• Affine SfM: M = [xij; yij] = C S^T, a rank-4 matrix

• Photometric stereo: M = N S^T, rank = 3
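For a fully observed M, the SVD recipe above is a few lines (a minimal sketch; the helper name and toy data are mine):

```python
import numpy as np

# Rank-r factorization of a fully observed measurement matrix M via the SVD:
# M = U S V^T, truncate to r singular values, A = U S^0.5, B = V S^0.5.
def factorize_svd(M, r):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    S_sqrt = np.sqrt(s[:r])
    A = U[:, :r] * S_sqrt          # scales each of the r columns by S^0.5
    B = Vt[:r, :].T * S_sqrt
    return A, B                    # M ≈ A @ B.T

M = np.random.default_rng(3).standard_normal((20, 15))
A, B = factorize_svd(M, r=4)
```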

Missing Data Scenario

• Missing feature tracks in SfM

• Specularities and shadows in photometric stereo


[Figure: incomplete feature tracks]

Challenges in Missing Data Scenario

• Can’t use SVD

• Solve:

• W: binary weight matrix, λ: regularization parameter

• Challenges

– Non-convex problem

– Newton's-method-based algorithm (Buchanan et al. 2005)

• Very slow

• Goal: design an algorithm that is

– Fast (handles large-scale data)
– Flexible enough to handle additional constraints

• e.g., orthonormality constraints in orthographic SfM

Objective:

min_{A,B} ||W ⊙ (M − AB^T)||_F^2 + λ(||A||_F^2 + ||B||_F^2)
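For reference, a simple alternating-least-squares baseline for the objective above is sketched below; this is not the LRSDP method proposed in the talk, and as a local method for a non-convex problem it can get stuck in poor minima:

```python
import numpy as np

# Alternating least squares for
#   min_{A,B} ||W * (M - A B^T)||_F^2 + lam * (||A||_F^2 + ||B||_F^2)
# where W is the binary observation mask. Reference baseline only,
# NOT the LRSDP formulation proposed in the talk.
def als_factorize(M, W, r, lam=0.1, iters=100):
    m, n = M.shape
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((m, r)), rng.standard_normal((n, r))
    for _ in range(iters):
        for i in range(m):                        # update row i of A
            Bi = B[W[i] > 0]
            A[i] = np.linalg.solve(Bi.T @ Bi + lam * np.eye(r),
                                   Bi.T @ M[i, W[i] > 0])
        for j in range(n):                        # update row j of B
            Aj = A[W[:, j] > 0]
            B[j] = np.linalg.solve(Aj.T @ Aj + lam * np.eye(r),
                                   Aj.T @ M[W[:, j] > 0, j])
    return A, B
```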

Proposed Solution

• Formulate matrix factorization as a low-rank semidefinite program (LRSDP)

– LRSDP: fast implementation of SDP (Burer, 2001)

• Quasi-Newton algorithm

• Advantages of the proposed formulation:

– Solve large-scale matrix factorization problem

– Handle additional constraints


Low-rank Semidefinite Programming (LRSDP)

• Stated as:

min_R trace(C R R^T) subject to trace(A_l R R^T) = b_l, l = 1, 2, …, k

• Variable: R

• Constants

– C: cost matrix
– A_l, b_l: constraint matrices and values

• Challenge

– Formulating matrix factorization as an LRSDP
– Designing C, A_l, b_l

Matrix factorization as LRSDP: Noiseless Case

• We want to formulate:

min_{A,B} ||A||_F^2 + ||B||_F^2 subject to (AB^T)_{i,j} = M_{i,j} for the observed entries (i, j)

• As:

||A||_F^2 + ||B||_F^2 = trace(AA^T) + trace(BB^T) = trace(R R^T), where R = [A; B]

(AB^T)_{i,j} = (R R^T)_{i, j+m} = M_{i,j}

• LRSDP formulation:

min_R trace(C R R^T) subject to trace(A_l R R^T) = b_l, l = 1, 2, …, number of observed entries

– C: identity matrix, A_l: indicator matrices, b_l: the observed entries M_{i,j}
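A small numpy check of the identity underlying this formulation, with R = [A; B] and a symmetrized indicator matrix A_l (all names and sizes here are illustrative):

```python
import numpy as np

# Check: with R = [A; B], (A B^T)_{ij} equals (R R^T)_{i, j+m}, and that entry
# can be written as trace(A_l R R^T) with A_l a symmetrized indicator matrix.
rng = np.random.default_rng(4)
m, n, r = 6, 5, 3
A, B = rng.standard_normal((m, r)), rng.standard_normal((n, r))
R = np.vstack([A, B])                         # shape (m + n, r)

i, j = 2, 4                                   # an arbitrary observed entry
A_l = np.zeros((m + n, m + n))
A_l[i, j + m] = A_l[j + m, i] = 0.5           # symmetric indicator for (i, j+m)

lhs = (A @ B.T)[i, j]
rhs = np.trace(A_l @ R @ R.T)
assert np.isclose(lhs, rhs)
```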

Affine SfM

• Dinosaur sequence (72% missing data)

• MF-LRSDP gives the best reconstruction

Photometric Stereo

• Face sequence (42% missing data)

• MF-LRSDP and damped Newton give the best results

Additional Constraints: Orthographic Factorization

• Dinosaur sequence

Summary

• Formulated missing-data matrix factorization as an LRSDP

– Large-scale problems
– Handles additional constraints

• Overall summary

– Two statistical data models

• Regression in the presence of outliers

– Role of sparsity

• Matrix factorization in the presence of missing data

– Low-rank semidefinite programming

Thank you! Questions?
