
Overview of Predictive Learning

Electrical and Computer Engineering

Vladimir Cherkassky University of Minnesota

cherk001@umn.edu

Presented at the University of Cyprus, 2009


OUTLINE

• Background and motivation

• Application study: real-time pricing of mutual funds

• Inductive Learning and Philosophy

• Two methodologies: classical statistics and predictive learning

• Statistical Learning Theory and SVM

• Summary and discussion


Recall: Learning ~ Function Estimation

Math terminology:

• Past observations ~ data points

• Explanation (model) ~ function

Learning ~ function estimation (from data points)

Prediction ~ using estimated model to make predictions


Statistical vs Predictive Approach

• Binary classification problem: estimate a decision boundary from training data (xi, yi).

Assuming the distribution P(x,y) were known, the optimal decision boundary could be derived directly.

[Figure: training samples plotted in the (x1, x2) plane]


Classical Statistical Approach

(1) Parametric form of the unknown distribution P(x,y) is known
(2) Estimate the parameters of P(x,y) from the training data
(3) Construct the decision boundary using the estimated distribution and the given misclassification costs

Modeling assumption: the unknown P(x,y) can be accurately estimated from the available data.

[Figure: estimated decision boundary in the (x1, x2) plane]
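To make the three steps concrete, here is a minimal sketch in Python, assuming Gaussian class-conditional densities and equal misclassification costs; the dataset X, y and the function names are illustrative, not from the talk.

```python
# A minimal sketch of the classical statistical approach, assuming
# (step 1) Gaussian class-conditional densities. Equal misclassification
# costs are assumed, so the rule reduces to picking the most probable class.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X, y):
    """Step (2): estimate the parameters of P(x,y) from the training data."""
    params = {}
    for label in np.unique(y):
        Xc = X[y == label]
        params[label] = (Xc.mean(axis=0),            # class mean
                         np.cov(Xc, rowvar=False),   # class covariance
                         len(Xc) / len(X))           # class prior P(y)
    return params

def classify(x, params):
    """Step (3): decision rule implied by the estimated distribution."""
    posteriors = {label: prior * multivariate_normal.pdf(x, mean=mu, cov=cov)
                  for label, (mu, cov, prior) in params.items()}
    return max(posteriors, key=posteriors.get)
```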


Predictive Modeling Approach

(1) Parametric form of the decision boundary f(x,w) is given
(2) Explain the available data by fitting f(x,w), i.e., by minimizing some loss function (e.g., squared error)
(3) The function f(x,w*) providing the smallest fitting error is then used for prediction

Modeling assumptions:
- Need to specify f(x,w) and the loss function a priori
- No need to estimate P(x,y)

[Figure: estimated decision boundary in the (x1, x2) plane]
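As a contrast with the previous sketch, here is a minimal predictive-style sketch in Python: the parametric form f(x,w) (a linear boundary) and the loss (squared error) are fixed a priori, and no distribution is estimated. Names are illustrative.

```python
# A minimal sketch of the predictive approach: fix f(x,w) = w.x + b
# a priori and pick w* by minimizing squared fitting error on the data.
import numpy as np

def fit_linear_boundary(X, y):
    """Least-squares fit of f(x,w) to labels y in {-1, +1}."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(Xa, y, rcond=None)  # w* = argmin ||Xa w - y||^2
    return w

def predict(X, w):
    """Step (3): use f(x,w*) for prediction."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xa @ w)
```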


Philosophical Interpretation

Unknown system, observed data (input x, output y), unknown P(x,y). The goal is to estimate a function y = f(x).

Probabilistic approach ~ the goal is to estimate the true model for the data (x,y), i.e., System Identification: REALISM

Predictive approach ~ the goal is to imitate (predict) the system output y, i.e., System Imitation: INSTRUMENTALISM


Classification with High-Dimensional Data

• Digit recognition, 5 vs 8: each example is a 16 x 16 pixel image, i.e., a 256-dimensional vector x
• Given a finite number of labeled examples, estimate a decision rule y = f(x) for classifying new images

Note: x ~ 256-dimensional vector, y ~ binary class label 0/1
• Estimation of P(x,y) with finite data is not possible
• Accurate estimation of a decision boundary in 256-dimensional space is possible, using just a few hundred samples (see the sketch below)
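The data encoding described above is a one-line reshape; the sketch below is hypothetical (random placeholder images and a linear SVM stand in for the real 5/8 data and classifier).

```python
# Sketch: encode 16x16 pixel images as 256-dimensional vectors and
# estimate a decision rule y = f(x) from a few hundred labeled examples.
import numpy as np
from sklearn.svm import LinearSVC

images = np.random.rand(300, 16, 16)        # placeholder for labeled 5/8 images
labels = np.random.randint(0, 2, size=300)  # placeholder labels: 0 ~ '5', 1 ~ '8'

X = images.reshape(len(images), -1)         # 16x16 image -> 256-dim vector x
f = LinearSVC().fit(X, labels)              # decision rule y = f(x)
```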


Statistical vs Predictive

The predictive approach:
- estimates certain properties of the unknown P(x,y) that are useful for predicting y
- has solid theoretical foundations (VC-theory)
- has been used successfully in many applications

BUT its methodology and concepts differ from classical statistical estimation:
- understanding of the application
- a priori specification of a loss function (necessary for imitation)
- interpretation of predictive models is hard
- several good models may be estimated from the same data


OUTLINE

• Background and motivation

• Application study: real-time pricing of mutual funds

• Inductive Learning and Philosophy

• Two methodologies: classical statistics and predictive learning

• Statistical Learning Theory and SVM

• Summary and discussion


Quick Tour of VC-theory - 1

Goals of Predictive Learning:
- explain (or fit) the available training data
- predict well on future (yet unobserved) data
- ample empirical evidence in many applications

Similar to biological learning. Example: given 1, 3, 7, ..., predict the rest of the sequence.

Rule 1: x_{k+1} = 2 x_k + 1
Rule 2: randomly chosen odd numbers
Rule 3: x_k = k^2 - k + 1

BUT for the sequence 1, 3, 7, 15, 31, 63, ..., Rule 1 seems very reliable (why?)
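A short sketch, assuming the rule forms reconstructed above: Rules 1 and 3 both explain the observed data 1, 3, 7 but diverge afterwards, which is exactly why the longer sequence 1, 3, 7, 15, 31, 63 singles out Rule 1.

```python
# Rules 1 and 3 both fit 1, 3, 7 -- they differ only in their predictions.
def rule1(n):
    """x_{k+1} = 2*x_k + 1, starting from x_1 = 1."""
    seq, x = [], 1
    for _ in range(n):
        seq.append(x)
        x = 2 * x + 1
    return seq

def rule3(n):
    """x_k = k^2 - k + 1."""
    return [k * k - k + 1 for k in range(1, n + 1)]

print(rule1(6))  # [1, 3, 7, 15, 31, 63] -- matches the longer sequence
print(rule3(6))  # [1, 3, 7, 13, 21, 31] -- fits only the first three points
```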


Quick Tour of VC-theory - 2

Main practical result of VC-theory:
If a model explains past data well AND is simple, then it can predict well.

• This explains why Rule 1 is a good model for the sequence 1, 3, 7, 15, 31, 63, ...
• Measure of model complexity ~ VC-dimension ~ the ability to explain past data: Rule 1 explains 1, 3, 7, 15, 31, 63 BUT cannot explain all other possible sequences, so it has low VC-dimension (~ large falsifiability)
• For linear models, VC-dimension = DoF (as in statistics)
• But for nonlinear models the two are different


Quick Tour of VC-theory - 3

Strategy for modeling high-dimensional data:
Find a model f(x) that explains the past data AND has low VC-dimension, even when the dimensionality is large.

SVM methods for high-dimensional data:
Large margin = low VC-dimension ~ easy to falsify
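A minimal sketch of this strategy with scikit-learn (an assumption; the talk does not prescribe a library): a linear SVM whose margin width 2/||w|| is read off the fitted weights. The random data is a placeholder.

```python
# Large margin = low VC-dimension: fit a linear SVM and compute its margin.
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(200, 256)             # placeholder high-dimensional inputs
y = np.random.choice([-1, 1], size=200)   # placeholder binary labels

svm = SVC(kernel='linear', C=1.0).fit(X, y)
margin = 2.0 / np.linalg.norm(svm.coef_)  # margin width = 2 / ||w||
print("margin width:", margin)
```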


Non-separable data: classification

L(y, f(x,w)) = max(Δ - y·f(x,w), 0),   Margin = 2Δ
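Written out in code, a sketch assuming the soft-margin loss reconstructed above (delta = 1 gives the standard SVM hinge loss):

```python
import numpy as np

def margin_loss(y, f_x, delta=1.0):
    """L(y, f(x,w)) = max(delta - y*f(x,w), 0): zero loss for samples
    beyond their margin border, growing linearly for violations."""
    return np.maximum(delta - y * f_x, 0.0)

print(margin_loss(+1, 2.0))   # 0.0 -- correct side, outside the margin
print(margin_loss(+1, 0.5))   # 0.5 -- inside the margin
print(margin_loss(+1, -1.0))  # 2.0 -- misclassified
```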


Support Vectors

• SVs ~ training samples with non-zero loss
• SVs are the samples that falsify the model
• The model depends only on the SVs: SVs ~ a robust characterization of the data (see the sketch below)

WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."

• SVM generalization ~ data compression
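A short sketch of reading the support vectors off a fitted model, assuming scikit-learn; the data is a placeholder:

```python
# The fitted SVM model depends only on the samples in support_vectors_.
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(200, 2)               # placeholder training inputs
y = np.random.choice([-1, 1], size=200)   # placeholder labels

svm = SVC(kernel='linear', C=1.0).fit(X, y)
print(len(svm.support_vectors_), "of", len(X), "samples define the model")
```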


Nonlinear Decision Boundary

• A fixed (linear) parameterization is too rigid
• A nonlinear, curved boundary may yield a larger margin (falsifiability) and lower error: a nonlinear kernel SVM (see the sketch below)
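A minimal sketch of the nonlinear extension, assuming scikit-learn; the concentric-circles data (not from the talk) is a case where no linear boundary works but an RBF kernel SVM finds a curved one:

```python
# Nonlinear kernel SVM: an RBF kernel lets the margin curve in input space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
rbf_svm = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)  # curved boundary
print("training accuracy:", rbf_svm.score(X, y))
```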


Handwritten Digit Recognition (mid-90's)

• Data set: postal (zip-code) images, segmented and cropped; ~7K training samples and 2K test samples
• Data encoding: 16 x 16 pixel image, i.e., a 256-dimensional vector
• Summary: test error rate ~3-4%
  - prediction accuracy better than custom NNs
  - accuracy does not depend on the kernel type
  - 100-400 support vectors per class (digit)


Interpretation of SVM Models

Humans cannot provide an interpretation of high-dimensional data, even when they make good decisions (predictions) using such data, e.g., in digit recognition (5 vs 8).

How to interpret high-dimensional models?
- Project the data samples onto the normal direction w of the SVM decision boundary D(x) = (w · x) + b
- Interpret the univariate histograms of the projections


Univariate Histogram (of Projections)

• Project the training data onto the normal vector w of the trained SVM

[Figure: projections onto w; the decision boundary maps to 0 and the margin borders to -1 and +1]
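A sketch of this interpretation technique, assuming scikit-learn and placeholder data: decision_function computes D(x) = (w · x) + b for each sample, and the histogram of these projections is then inspected.

```python
# Project samples onto the normal direction w of a trained linear SVM
# and plot the univariate histogram of projections D(x) = (w.x) + b.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

X = np.random.randn(200, 256)             # placeholder training data
y = np.random.choice([-1, 1], size=200)   # placeholder labels

svm = SVC(kernel='linear', C=1.0).fit(X, y)
proj = svm.decision_function(X)           # D(x) for every training sample

plt.hist(proj, bins=40)
for v in (-1, 0, +1):                     # margin borders and decision boundary
    plt.axvline(v, linestyle='--')
plt.xlabel('projection D(x)')
plt.show()
```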


Projections for High-Dimensional Data - 1

• Most training samples cluster on the margin borders
• For the 5 vs 8 recognition data, with 100 training samples: explanation (~ fitting of the training data) is easy

[Figure: histogram of training-data projections]


Continued...

• BUT the test-data projections (for this SVM model) have a completely different distribution
• For the 5 vs 8 recognition data, with 1000 test samples: test error ~6%, so prediction is more difficult

[Figure: histogram of test-data projections]


Projections for High-Dimensional Data - 2

• For the 5 vs 8 recognition data, 1000 training samples. Projections of the training data:

[Figure: histogram of training-data projections (1000 samples)]


Continued...

For this SVM model, the test error is ~1.35%. Histogram of the projections for 1000 test samples:

[Figure: histogram of test-data projections (1000 samples)]


OUTLINE

• Background and motivation

• Application study: real-time pricing of mutual funds

• Inductive Learning and Philosophy

• Two methodologies: classical statistics and predictive learning

• Statistical Learning Theory and SVM

• Summary and discussion


Summary

In many real-life applications:

1. Estimation of models that can explain available data is easy

2. Estimation of models that can make useful predictions is very difficult

3. It is important to make a clear distinction between (1) and (2).

Usually this constitutes the difference between beliefs (opinions) and predictive models.


Current Challenges

• Non-technical:

- lack of agreement on the understanding of uncertainty and risk

• Technical:

- many different, fragmented disciplines dealing with predictive learning

• VC-theory gives a consistent, practical approach for handling uncertainty and risk, but it is often misinterpreted by scientists


Acknowledgements

• Parts of this presentation are taken:
- from the forthcoming book Introduction to Predictive Learning by V. Cherkassky and Y. Ma, Springer, 2010
- and from the course EE 4389 at www.ece.umn.edu/users/cherkass/ee4389