Transcript of ECE-271A Statistical Learning I

Page 1

ECE-271A Statistical Learning I

Nuno Vasconcelos

ECE Department, UCSD

Page 2

The course

the course is an introductory level course in statistical learning

by introductory I mean that you will not need any previous exposure to the field, not that it is basic

we will cover the foundations of Bayesian or generative learning

271B is a follow-up course on discriminant learning, offered in alternating years

more on generative vs discriminant later

Page 3

Logistics

Exams: 1 mid-term – 35%

1 final – 45% (covers everything)

Homework (20%):

• one problem set every week

• will include a small computational problem. By small, I mean in terms of concepts, thinking, etc.

• some computational problems will require a fair amount of computer power, e.g. a few hours on a low-end PC.

• be sure to start early

• will count 20%, but doing well in the course is almost impossible without it

• will give you the hands-on experience needed to be able to claim that you really know learning!

Page 4

Homework policies

homework is individual

OK to work on problems with someone else

but you have to:

• write your own solution

• write down the names of who you collaborated with

homework is due one week after it is issued.

Page 5

Cheetah

statistical learning only makes sense when you try it on data

we will test what we learn on an image processing problem:

• given the cheetah image, can we teach a computer to segment it into foreground (object) and background?

• the question will be answered with different techniques, typically one problem per week

• a total of 5 computer problems

try to keep an eye on the big picture, e.g. “did this improve over what we had done before?”

Page 6

Resources

Course web page: http://www.svcl.ucsd.edu/~nuno

• all handouts, problem sets, code will be available there

TA: TBA

Me: Nuno Vasconcelos, [email protected], EBU1-5603

Office hours:

• TA: TBA

• mine: Fridays, 9:30-10:30AM

• for homework, talk to the TA first; for everything else, see me

My assistant:

• Travis Spackman ([email protected]), outside my office, may sometimes be involved in administrative issues

Page 7

Texts

required:

• “Pattern Classification”, Duda, Hart, and Stork, John Wiley and Sons, 2001

• will follow it closely, with hand-outs where needed

various other good, but optional, texts:

• “Pattern Recognition and Machine Learning”, Bishop, 2006

• “Elements of Statistical Learning”, Hastie, Tibshirani, and Friedman, 2001

• “Bayesian Data Analysis”, Gelman, Carlin, Stern, and Rubin, 2003

• “A Probabilistic Theory of Pattern Recognition”, Devroye, Gyorfi, Lugosi, 1998 (more than what we need)

stuff you must know really well:

• “Linear Algebra”, Gilbert Strang, 1988

• “Fundamentals of Applied Probability”, Drake, McGraw-Hill, 1967

Page 8

The course

why statistical learning?

there are many processes in the world that are ruled by deterministic equations:

• e.g. f = ma; linear systems and convolution, Fourier, etc., various chemical laws

• there may be some “noise”, “error”, “variability”, but we can live with those

• we don’t need statistical learning

learning is needed when:

• there is a need for predictions about variables in the world, Y

• that depend on factors (other variables) X

• in a way that is impossible or too difficult to derive an equation for.

Page 9

Examples

data-mining view:

• large amounts of data that do not follow deterministic rules

• e.g. given a history of thousands of customer records and some questions that I can ask you, how do I predict whether you will pay on time?

• impossible to derive a theorem for this; it must be learned

while many associate learning with data-mining, it is by no means the only or most important application

signal processing view:

• signals combine in ways that depend on “hidden structure” (e.g. speech waveforms depend on language, grammar, etc.)

• signals are usually subject to significant amounts of “noise” (which sometimes means “things we do not know how to model”)

Page 10

Examples (cont’d)

signal processing view:

• e.g. the cocktail party problem:

• although there are all these people talking, I can figure everything out.

• how do I build a chip to separate the speakers?

• model the hidden dependence as

• a linear combination of independent sources

• noise

• many other examples in the areas of wireless, communications, signal restoration, etc.

Page 11

Examples (cont’d)

perception/AI view:

• it is a complex world, I cannot model everything in detail

• rely on probabilistic models that explicitly account for the variability

• use the laws of probability to make inferences (worked numerically in the sketch after this list), e.g.:

• P( burglar | alarm, no earthquake) is high

• P( burglar | alarm, earthquake) is low

• a whole field that studies “perception as Bayesian inference”

• perception really just confirms what you already know

• priors + observations = robust inference
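To make these inferences concrete, here is a minimal worked sketch of the burglar/alarm example. All probabilities below are hypothetical (the slides give none); the only ingredient is Bayes rule, P(B | A, E) = P(A | B, E) P(B) / P(A | E), with the denominator expanded over the presence or absence of a burglar.

```python
# Hypothetical numbers for the burglar/alarm example (illustration only).
p_b = 0.001                      # assumed prior probability of a burglary

# Assumed alarm likelihoods P(alarm | burglar?, earthquake?):
p_alarm = {
    (True,  False): 0.95,        # burglar, no earthquake
    (False, False): 0.001,       # no burglar, no earthquake
    (True,  True):  0.95,        # burglar, earthquake
    (False, True):  0.30,        # no burglar, earthquake (false alarms)
}

def p_burglar_given_alarm(earthquake: bool) -> float:
    """Bayes rule: P(B | A, E) = P(A | B, E) P(B) / P(A | E)."""
    num = p_alarm[(True, earthquake)] * p_b
    den = num + p_alarm[(False, earthquake)] * (1 - p_b)
    return num / den

print(p_burglar_given_alarm(earthquake=False))   # ~0.49: high vs. the 0.001 prior
print(p_burglar_given_alarm(earthquake=True))    # ~0.003: low
```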

Page 12

Examples (cont’d)

communications view:

• detection problems:

• I see Y, and know something about the statistics of the channel. What was X?

• this is the canonical detection problem that appears all over learning.

• for example, face detection in computer vision: “I see pixel array Y. Is it a face?”

[diagram: X → channel → Y]
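A minimal sketch of this canonical detection problem, under assumptions that are not in the slides: X ∈ {−1, +1} with equal priors, sent through an additive Gaussian noise channel Y = X + N. In that special case the minimum probability-of-error detector reduces to thresholding Y at zero.

```python
import numpy as np

# Assumed setting: X in {-1, +1}, equal priors, Y = X + Gaussian noise.
rng = np.random.default_rng(0)

x = rng.choice([-1, 1], size=10_000)            # transmitted symbols
y = x + rng.normal(scale=0.5, size=x.shape)     # received, noisy observations

# With equal priors and symmetric Gaussian noise, the minimum-error
# detector is simply the sign of Y (threshold at zero).
x_hat = np.where(y >= 0, 1, -1)
print("error rate:", np.mean(x_hat != x))       # ~0.023 for this noise level
```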

Page 13

Statistical learning

goal: given a function

y = f(x)   [diagram: x → f(·) → y]

and a collection of example data points, learn what the function f(·) is.

this is called training.

two major types of learning:

• unsupervised: only X is known, usually referred to as clustering;

• supervised: both X and Y are known during training, only X is known at test time; usually referred to as classification or regression.
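As a toy illustration of training (the data and the choice of a linear model class below are hypothetical): given example pairs (x_i, y_i), we pick a model class and estimate f(·) within it.

```python
import numpy as np

# Hypothetical example pairs (x_i, y_i), assumed to come from some unknown f
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

# "Training": choose a model class (here a line, y = a*x + b) and fit it
a, b = np.polyfit(x, y, deg=1)

# The learned f_hat can now predict y for x values never seen in training
def f_hat(x_new):
    return a * x_new + b

print(f_hat(5.0))   # prediction for a point outside the training set
```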

Page 14

Supervised learning

X can be anything, but the type of Y dictates the type of supervised learning problem:

• Y in {0,1} referred to as detection

• Y in {0, ..., M-1} referred to as classification

• Y real referred to as regression

the theory is quite similar, and the algorithms are similar most of the time

we will emphasize classification, but will talk about regression when particularly insightful

Page 15

Example

classifying fish:

• fish roll down a conveyor belt

• camera takes a picture

• goal: is this a salmon or a sea-bass?

Q: what is X? What features do I use to distinguish between the two fish?

this is somewhat of an art form. Frequently, the best approach is to ask experts.

e.g. “obvious! use length and scale width!”

Page 16

Classification/detection

two major types of classifiers:

• discriminant: directly recover the decision boundary that best separates the classes;

• generative: fit a probability model to each class and then “analyze” the models to find the border.

a lot more on this later!

focus will be on generative learning.

discriminant learning will be covered in 271B.

Page 17

Caution

how do we know learning worked?

we care about generalization, i.e. accuracy outside the training set

models that are too powerful can lead to over-fitting:

• e.g. in regression I can always fit exactly n points with a polynomial of order n-1 (see the sketch after this slide)

• is this good? how likely is the error to be small outside the training set?

• similar problem for classification

fundamental LAW: only test set results matter!!!
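A small sketch of the polynomial remark above, on synthetic data (the true function, noise level, and degrees compared are assumptions for illustration): a degree n−1 fit drives the training error to zero, while the test error typically tells a very different story.

```python
import numpy as np

# Over-fitting sketch: a degree n-1 polynomial fits n points exactly,
# yet can do far worse off the training set than a simpler model.
rng = np.random.default_rng(1)

def f(x):                       # hypothetical "true" function
    return np.sin(x)

n = 8
x_train = np.linspace(0, 3, n)
y_train = f(x_train) + rng.normal(scale=0.1, size=n)   # noisy training data

x_test = np.linspace(0, 3, 200)
for deg in (1, 3, n - 1):       # n-1 interpolates the training points exactly
    coef = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coef, x_test) - f(x_test)) ** 2)
    print(f"degree {deg}: train MSE {train_err:.2e}, test MSE {test_err:.2e}")
```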

Page 18

Generalization

good generalization requires controlling the trade-off between training and test error:

• training error large, test error large

• training error smaller, test error smaller

• training error smallest, test error largest

this trade-off is known by many names

in the generative classification world it is usually due to the bias-variance trade-off of the class models

will look at this in detail

Page 19

Class-modeling

each class is characterized by a probability density function (class-conditional density)

a model is adopted, e.g. a Gaussian

training data used to estimate model parameters

overall the process is referred to as density estimation

the simplest example would be to use histograms
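A minimal sketch of this process on synthetic one-dimensional data (the two classes and their parameters are hypothetical): adopt a Gaussian model per class, estimate its parameters from training samples, and compare with the histogram alternative mentioned above.

```python
import numpy as np

# Class-modeling sketch: one Gaussian class-conditional density per class,
# with parameters estimated from (synthetic) training data.
rng = np.random.default_rng(2)
x_class0 = rng.normal(loc=0.0, scale=1.0, size=500)   # samples from class 0
x_class1 = rng.normal(loc=3.0, scale=1.0, size=500)   # samples from class 1

def fit_gaussian(x):
    """Density estimation under a Gaussian model: estimate (mu, sigma)."""
    return x.mean(), x.std()

mu0, s0 = fit_gaussian(x_class0)
mu1, s1 = fit_gaussian(x_class1)
print(f"class 0: mu={mu0:.2f} sigma={s0:.2f};  class 1: mu={mu1:.2f} sigma={s1:.2f}")

# The simplest alternative from the slide: a histogram density estimate
counts, edges = np.histogram(x_class0, bins=20, density=True)
print("histogram bin width:", edges[1] - edges[0])
```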

Page 20

Density estimation

there are, however, much better models

usually, the problem has two components:

• selecting a model

• estimating model parameters

models:

• we will cover the whole gamut

• from the exponential family (e.g. Gaussian)

• to kernel-based density estimates

• including mixture models

• and non-parametric approaches (nearest neighbors, histograms, etc.)

Page 21

Parameter estimation

two main camps:

• maximum likelihood

• Bayesian estimates

for ML we will devote most attention to the quality of the estimates:

• the bias/variance trade-off

a lot more emphasis on Bayes:

• subjective probability

• what is really a prior?

• mechanics: predictive distribution, MAP estimates, etc.

• priors: conjugate, non-informative, improper

• why is the exponential family special?
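A small sketch contrasting the two camps on the simplest case, estimating a Gaussian mean with known variance (all numbers, and the conjugate Gaussian prior, are assumptions for illustration): the ML estimate is the sample mean, while the Bayesian posterior mean shrinks it toward the prior.

```python
import numpy as np

# ML vs. Bayesian estimation of a Gaussian mean (known variance sigma^2).
# A Gaussian prior mu ~ N(mu0, tau^2) is conjugate: the posterior is Gaussian.
rng = np.random.default_rng(3)

sigma = 1.0                                      # assumed known noise std
x = rng.normal(loc=2.0, scale=sigma, size=10)    # synthetic data

mu_ml = x.mean()                                 # maximum likelihood estimate

mu0, tau = 0.0, 1.0                              # hypothetical prior N(mu0, tau^2)
n = len(x)
w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
mu_bayes = w * x.mean() + (1 - w) * mu0          # posterior mean: shrinks toward prior

print(f"ML: {mu_ml:.3f}   Bayes (posterior mean): {mu_bayes:.3f}")
```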

Page 22

Decision rules

given class models, Bayesian decision theory provides us with optimal rules for classification

optimal here means minimum probability of error, for example

we will:

• study BDT in detail

• establish connections to other decision principles (e.g. linear discriminants)

• show that Bayesian decisions are usually intuitive

• derive optimal rules for a range of classifiers (see the sketch below)
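A minimal sketch of the basic rule (the class models and priors below are hypothetical, matching the two-Gaussian setting used earlier): to minimize the probability of error, pick the class i that maximizes the posterior P(Y=i | x), equivalently p(x | Y=i) P(Y=i).

```python
import numpy as np

# Bayes decision rule sketch: pick the class maximizing p(x | Y=i) P(Y=i).
def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = [0.5, 0.5]                     # assumed class priors P(Y=i)
params = [(0.0, 1.0), (3.0, 1.0)]       # assumed (mu, sigma) per class

def decide(x):
    posteriors = [gaussian_pdf(x, mu, s) * p
                  for (mu, s), p in zip(params, priors)]
    return int(np.argmax(posteriors))   # minimum probability-of-error rule

print(decide(0.2), decide(2.9))         # -> 0 1
```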

Page 23

Reasons to take the course

statistical learning:

• tremendous amount of theory

• but things invariably go wrong

• too little data, noise, too many dimensions, training sets that do not reflect all possible variability, etc.

good learning solutions require:

• knowledge of the domain (e.g. “these are the features to use”)

• knowledge of the available techniques, their limitations, etc. (e.g. here a Gaussian is enough for AB&C, but there I need a mixture)

in the absence of either of these you will fail!

we will cover the basics, but will also talk about quite advanced concepts

this is an easier scenario in which to understand them

Page 24

Reasons to take the course

theory together with hands-on experience:

• will cover all the theory, with 5-6 problems every week

• “hands-on” component: one computational problem per week

• this will center around cheetah segmentation

• allows evaluation of the benefits of more advanced techniques as they are introduced

• forces you to deal with real, noisy, data

• exposes you to working on a new domain

Page 25

Page 26

Cheetah day

last class, we will have “Cheetah Day”

what:

• 5 teams

• each team will write a report on the 5 cheetah problems

• each team will give a presentation on one of the problems

why:

• to make sure that we get the “big picture” out of all this work

• presenting is always good practice

Page 27

Cheetah Day

how much:

• 10% of the final grade (5% report, 5% presentation)

what to talk about:

• report: comparative analysis of all solutions of the problem

• as if you were writing a conference paper

• presentation: will be on one single problem

• review what the solution was

• what did this problem teach us about learning?

• what “tricks” did we learn solving it?

• how well did this solution do compared to others?

will talk about this in due time