Overview: Missing Data Problems in Data Theory of Missing ...marlin/research/phd... · Benjamin...

5
Missing Data Problems in Machine Learning Benjamin Marlin 1 Department of Computer Science, University of Toronto Missing Data Problems in Missing Data Problems in Missing Data Problems in Missing Data Problems in Machine Learning Machine Learning Machine Learning Machine Learning Senate Thesis Defense Senate Thesis Defense Senate Thesis Defense Senate Thesis Defense Ben Marlin Machine Learning Group Department of Computer Science University of Toronto April 8, 2008 Missing Data Problems in Machine Learning Benjamin Marlin 2 Department of Computer Science, University of Toronto Overview: Overview: Overview: Overview: Theory of Missing Data Unsupervised Learning - MAR Classification with Missing Features Classification with Complete Data Unsupervised Learning - NMAR Collaborative Prediction Missing Data Problems in Machine Learning Benjamin Marlin 3 Department of Computer Science, University of Toronto Missing Data Observed Data Missing Dimensions Observed Dimensions Response Vector Data Vector Introduction: Introduction: Introduction: Introduction: Notation for Missing Data 0.3 0.7 0.2 0.9 0.1 1 1 0 0 1 5 4 1 3 2 0.3 0.7 0.1 0.2 0.9 Missing Data Problems in Machine Learning Benjamin Marlin 4 Department of Computer Science, University of Toronto Theory of Missing Data: Factorizations Data/Selection Model Factorization: • The probability of selection depends on the true values of the data variables and latent variables. MAR: MCAR: NMAR: No simplification in general. Classification of Missing Data: X Missing Data Problems in Machine Learning Benjamin Marlin 5 Department of Computer Science, University of Toronto Theory of Missing Data: Inference MCAR/MAR Posterior: NMAR Posterior: Missing Data Problems in Machine Learning Benjamin Marlin 6 Department of Computer Science, University of Toronto Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning - MAR: MAR: MAR: MAR: Models Finite Bayesian Mixture Model Dirichlet Process Mixture Model Factor Analysis Mixture Model

Transcript of Overview: Missing Data Problems in Data Theory of Missing ...marlin/research/phd... · Benjamin...

Page 1: Overview: Missing Data Problems in Data Theory of Missing ...marlin/research/phd... · Benjamin Marlin Department of Computer Science, University of Toronto 1 Missing Data Problems

Missing Data Problems in Machine Learning

Benjamin Marlin

1Department of Computer Science, University of Toronto

Missing Data Problems in Missing Data Problems in Missing Data Problems in Missing Data Problems in

Machine LearningMachine LearningMachine LearningMachine Learning

Senate Thesis DefenseSenate Thesis DefenseSenate Thesis DefenseSenate Thesis Defense

Ben MarlinMachine Learning Group

Department of Computer ScienceUniversity of Toronto

April 8, 2008

Missing Data Problems in Machine Learning

Benjamin Marlin

2Department of Computer Science, University of Toronto

Overview: Overview: Overview: Overview:

Theory of Missing Data

Unsupervised Learning - MAR

Classification with Missing Features

Classification with Complete Data

Unsupervised Learning - NMAR

Collaborative Prediction

Missing Data Problems in Machine Learning

Benjamin Marlin

3Department of Computer Science, University of Toronto

Missing Data

Observed Data

Missing

Dimensions

Observed

Dimensions

Response Vector

Data Vector

Introduction: Introduction: Introduction: Introduction: Notation for Missing Data

0.30.70.20.90.1

11001

541

32

0.30.70.1

0.20.9

Missing Data Problems in Machine Learning

Benjamin Marlin

4Department of Computer Science, University of Toronto

Theory of Missing Data: Factorizations

Data/Selection Model Factorization:

• The probability of selection depends on the true values

of the data variables and latent variables.

MAR:

MCAR:

NMAR: No simplification in general.

Classification of Missing Data:

X

Missing Data Problems in Machine Learning

Benjamin Marlin

5Department of Computer Science, University of Toronto

Theory of Missing Data: Inference

MCAR/MAR Posterior:

NMAR Posterior:

Missing Data Problems in Machine Learning

Benjamin Marlin

6Department of Computer Science, University of Toronto

Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning ---- MAR:MAR:MAR:MAR: Models

Finite Bayesian

Mixture Model

Dirichlet Process

Mixture Model

Factor Analysis

Mixture Model

Page 2: Overview: Missing Data Problems in Data Theory of Missing ...marlin/research/phd... · Benjamin Marlin Department of Computer Science, University of Toronto 1 Missing Data Problems

Missing Data Problems in Machine Learning

Benjamin Marlin

7Department of Computer Science, University of Toronto

Restricted Boltzmann

Machine for Missing Data

Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning ---- MAR:MAR:MAR:MAR: Models

Restricted Boltzmann

Machine for Missing Data

Missing Data Problems in Machine Learning

Benjamin Marlin

8Department of Computer Science, University of Toronto

Unsupervised Learning – NMAR:

Data Sets: Yahoo!

Collected ratings for randomly selected songs and combined them with existing ratings for user selected songs to form a novel collaborative filtering data set.

User Selected Randomly Selected

Missing Data Problems in Machine Learning

Benjamin Marlin

9Department of Computer Science, University of Toronto

Unsupervised Learning – NMAR:

Data Sets: JesterJester gauge set of 10 jokes used as complete data. Synthetic missing data was added.

• 15,000 users randomly selected

• Missing data model: µv(s) = s(v-3)+0.5

Missing Data Problems in Machine Learning

Benjamin Marlin

10Department of Computer Science, University of Toronto

Unsupervised Learning – NMAR:

Basic Experimental Protocol• We train on ratings for user selected items, and test on ratings for both user selected items, and randomly select items.

User Selected Ratings

Randomly Selected

Ratings

23455

134

4132

23455

134

4132

23455

134

4132

23455

134

4132

Observed Ratings

Test Ratings for User Selected Items

Test Ratings

for Randomly Selected items23455

134

4132

Missing Data Problems in Machine Learning

Benjamin Marlin

11Department of Computer Science, University of Toronto

Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning ---- NMAR:NMAR:NMAR:NMAR: Models

Finite Bayesian

Mixture Model/CPT-v

Dirichlet Process

Mixture Model/CPT-v

Missing Data Problems in Machine Learning

Benjamin Marlin

12Department of Computer Science, University of Toronto

Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning ---- NMAR:NMAR:NMAR:NMAR: Models

Conditional Restricted

Boltzmann Machine/E-v

Finite Bayesian

Mixture Model/LOGIT-vd

Page 3: Overview: Missing Data Problems in Data Theory of Missing ...marlin/research/phd... · Benjamin Marlin Department of Computer Science, University of Toronto 1 Missing Data Problems

Missing Data Problems in Machine Learning

Benjamin Marlin

14Department of Computer Science, University of Toronto

Unsupervised Learning – NMAR:

Comparison of Results on Yahoo! Data

Missing Data Problems in Machine Learning

Benjamin Marlin

15Department of Computer Science, University of Toronto

Unsupervised Learning – NMAR:

NEW: Ranking Results

• : mean of posterior predictive distribution for test item i.

• : rank of test item i according to .

• : rank of test item i according to .

Missing Data Problems in Machine Learning

Benjamin Marlin

16Department of Computer Science, University of Toronto

Unsupervised Learning – NMAR:

NEW: Comparison of Yahoo! Ranking Results

Strong Generalization:

Missing Data Problems in Machine Learning

Benjamin Marlin

17Department of Computer Science, University of Toronto

Classification: Imputation

Multiple Imputation: Replace missing feature values with samples of xm given xo drawn from several imputation models.

1 5

Mixture of Factor Analyzers

K=3, Q=1

Missing Data Problems in Machine Learning

Benjamin Marlin

18Department of Computer Science, University of Toronto

Classification: Reduced Models

Reduced Models: Each observed data subspace defined by a pattern of missing data gives a separate classification problem.

R=[0,1] R=[1,1]

R=[1,0]

Missing Data Problems in Machine Learning

Benjamin Marlin

19Department of Computer Science, University of Toronto

Classification: Response Augmentation

Response Augmentation: Set missing features to zero and augment feature representation with response indicators.

X=[0.0, 1.1, 0, 1]~

X=[2.0, 1.1, 1, 1]~

X=[-2.8, 0, 1, 0]~

Page 4: Overview: Missing Data Problems in Data Theory of Missing ...marlin/research/phd... · Benjamin Marlin Department of Computer Science, University of Toronto 1 Missing Data Problems

Missing Data Problems in Machine Learning

Benjamin Marlin

20Department of Computer Science, University of Toronto

Classification: Generative Models

Generative Model (LDA-FA):

Predictive Distribution with Missing Data:

Missing Data Problems in Machine Learning

Benjamin Marlin

21Department of Computer Science, University of Toronto

Results: Linear Discriminant Analysis

Training Data Generative

Learning

Discriminative

Learning

Missing Data Problems in Machine Learning

Benjamin Marlin

22Department of Computer Science, University of Toronto

Classification: Logistic RegressionTraining Data Zero Imputation Mean Imputation

Mix FA Imputation Augmented Reduced

Missing Data Problems in Machine Learning

Benjamin Marlin

23Department of Computer Science, University of Toronto

Classification: : Gaussian KLRTraining Data Zero Imputation Mean Imputation

Augmented Reduced Mix FA Imputation

Missing Data Problems in Machine Learning

Benjamin Marlin

24Department of Computer Science, University of Toronto

Classification: Neural NetworksTraining Data Zero Imputation Mean Imputation

Augmented Reduced Mix FA Imputation

Missing Data Problems in Machine Learning

Benjamin Marlin

25Department of Computer Science, University of Toronto

Classification:

UCI Thyroid-Sick

Page 5: Overview: Missing Data Problems in Data Theory of Missing ...marlin/research/phd... · Benjamin Marlin Department of Computer Science, University of Toronto 1 Missing Data Problems

Missing Data Problems in Machine Learning

Benjamin Marlin

26Department of Computer Science, University of Toronto

Classification:

MNIST Digit Classification with Missing Data

Missing Data Problems in Machine Learning

Benjamin Marlin

27Department of Computer Science, University of Toronto

Classification:

MNIST Digit Classification with Missing Data

Missing Data Problems in Machine Learning

Benjamin Marlin

28Department of Computer Science, University of Toronto

The End