Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML -...

28
Statistical Machine Learning Part I: Statistical Learning Theory [email protected] SML - 2015 1

Transcript of Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML -...

Page 1: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Statistical Machine Learning

Part I: Statistical Learning Theory

[email protected]

SML - 2015 1

Page 2: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Course Introduction

• The Promises of Big Data

• What kind of tools will we use?

• Do we have to program?

• For starters... a first assignment

• Why is this useful for me?

SML - 2015 2

Page 3: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

The Promises of Big Data

SML - 2015 3

Page 4: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Personal Health

• Data can help us predict when people will have to go to the hospital

Heritage Health Prize

SML - 2015 4

Page 5: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Small Businesses

• Data can help us predict the dynamics of restaurants’ popularity

Yelp.com dataset challenge

SML - 2015 5

Page 6: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Lending Money

• Data can help us predict who we can lend money to

www.lendingclub.com

SML - 2015 6

Page 7: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Lending Money

• Data can help us predict who we can lend money to

www.lendingclub.com

SML - 2015 7

Page 8: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Movies

• Data can help us predict whether people will like a given movie

Netflix Prize, Research@ATT Hao Zhang

SML - 2015 8

Page 9: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

All these problems have in common that...

Data is Available

all you have to do, is download it... and analyze it!

SML - 2015 9

Page 10: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

What we will do in 7 lectures

The graduate school has many courses on how to handle data.Check the course offerings.

In these 7 lectures, we will focus on 3 things:

• Present elementary tools: regression and classification

• Study the mathematical foundations of statistical learning theory:

◦ Choose the right models, address computational issues,◦ Address the problem of overfitting.

• Introduce advanced topics: kernel methods, sparsity.

SML - 2015 10

Page 11: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

What kind of mathematical tools?

We will adopt a mathematical formalism to propose and study algorithms.

Probability & Statistics, Linear Algebra, Optimization

SML - 2015 11

Page 12: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Mathematical Tools

• Probability & Statistics (to handle uncertainty & randomness)

◦ Probability Spaces, Random variables◦ Expectation, variance, inequalities◦ Central limit theorem, convergence in probability

• Linear Algebra (to handle high-dimensional problems)

◦ Matrix inverse, eigenvalues/vectors◦ Positive-definiteness.

• Optimization (to give the best possible answer)

◦ convex programs,◦ lagrangean, Lagrange multipliers etc.

SML - 2015 12

Page 13: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Programming

This is not a course about programming, but we will implement algorithms

I encourage you to use MATLAB

but you can use any other program (R, Python, etc...)

I do not recommend using C/C++ or other compiled languages.

SML - 2015 13

Page 14: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

For Starters...

Some simple ideas and a 1st assignment.

SML - 2015 14

Page 15: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

A function

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

x

(x−1)4−(x−3)2+(x−2)3

a polynomial plotted between 0 and 4...

SML - 2015 15

Page 16: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

A function

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

x

(x−1)4−(x−3)2+(x−2)3

... can be seen as a very detailed scatter plot.

SML - 2015 16

Page 17: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

A function

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

x

(x−1)4−(x−3)2+(x−2)3

Yet, when less points are available...

SML - 2015 17

Page 18: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

A function

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

x

(x−1)4−(x−3)2+(x−2)3

can we still guess the whole blue line?

SML - 2015 18

Page 19: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

A partially observed function

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

100

Assume we only have the red points.

SML - 2015 19

Page 20: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

We can guess by using interpolating polynomials

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

100

data 1 quadratic cubic 4th degree

Curve fitting tools can help us get back the original function.We can actually reconstruct it perfectly.

SML - 2015 20

Page 21: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Polynomial Interpolation

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

100

even if points are not evenly spaced...

SML - 2015 21

Page 22: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Polynomial Interpolation

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

100

data 1 quadratic cubic 4th degree

SML - 2015 22

Page 23: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Uncertainty in measurements

0 0.5 1 1.5 2 2.5 3 3.5 4−20

0

20

40

60

80

100

sometimes, we do not have access to the correct information...

SML - 2015 23

Page 24: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Uncertainty in measurements

0 0.5 1 1.5 2 2.5 3 3.5 4−40

−20

0

20

40

60

80

100

but rather an information corrupted by “noise”.

SML - 2015 24

Page 25: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Things become a lot more difficult

0 0.5 1 1.5 2 2.5 3 3.5 4−40

−20

0

20

40

60

80

100

data 1data 2 4th degree

If we use standard tools...

SML - 2015 25

Page 26: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Things become a lot more difficult

0 0.5 1 1.5 2 2.5 3 3.5 4

−20

0

20

40

60

80

x

(x−1)4−(x−3)2+(x−2)3

data 1data 2 4th degreeoriginal curve

we might be very far from the original function.

SML - 2015 26

Page 27: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Things become a lot more difficult

0 0.5 1 1.5 2 2.5 3 3.5 4

−20

0

20

40

60

80

x

(x−1)4−(x−3)2+(x−2)3

data 1data 2 4th degreeoriginal curve

Can we handle uncertainty in a better way?Quantify how far we might be from the true function?

How many points do we need to reconstruct a more general curve?Does this work for surfaces in higher dimensions?

SML - 2015 27

Page 28: Statistical Machine Learning Part I: Statistical Learning ... · Yelp.com dataset challenge SML - 2015 5. Lending Money • Data can help us predict who we can lend money to SML -

Things become a lot more difficult

0 0.5 1 1.5 2 2.5 3 3.5 4

−20

0

20

40

60

80

x

(x−1)4−(x−3)2+(x−2)3

data 1data 2 4th degreeoriginal curve

First assignment - due Monday October 13th 23:59 by email

• Look for a definition of interpolation, e.g. check the wikipedia page.

• Do what I just did with Matlab and send me an email with the results:

◦ Choose a function.. you can use fancier functions (sin, cos, exp etc.)◦ Plot it. Scatter plot a few points.◦ Use these points with the curve fitting tool. Interpolate & Compare.

• Finally: give me a hint of what might go wrong in higher dimensions?

SML - 2015 28