LECTURE 07: CLASSIFICATION PT. 3 (February 15, 2016), SDS 293 Machine Learning


Page 1:

LECTURE 07: CLASSIFICATION PT. 3
February 15, 2016
SDS 293
Machine Learning

Page 2:

Announcements

• On Monday, SDS 293 will have guests:

Page 3:

Q&A: questions about labs

• Q1: when are they “due”?
  - Answer: Ideally you should submit your post before you leave class on the day we do the lab. While there’s no “penalty” for turning them in later, it’s harder for me to judge where everyone is without feedback.

• Q2: resolved or unresolved?
  - Answer: Leave them unresolved when you post, and I’ll mark them resolved as I enter them into my grade book.

Page 4:

Outline

• Motivation
• Bayes classifier
• K-nearest neighbors
• Logistic regression
  - Logistic model
  - Estimating coefficients with maximum likelihood
  - Multivariate logistic regression
  - Multiclass logistic regression
  - Limitations
• Linear discriminant analysis (LDA)
  - Bayes’ theorem
  - LDA on one predictor
  - LDA on multiple predictors
• Comparing Classification Methods

Page 5:

Warning!

Page 6:

Recap: logistic regression

Question: what were we trying to do using the logistic function?

Answer: estimate the conditional distribution of the response, given the predictors
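For reference, the logistic model we fit in the last few lectures (single-predictor form):

  p(X) = Pr(Y = 1 | X) = e^(β_0 + β_1 X) / (1 + e^(β_0 + β_1 X))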

Page 7:

Discussion

• What could go wrong?

Page 8:

Flashback: superheroes

According to this sample, we can perfectly predict the 3 classes using only:

Page 9:

Flashback: superheroes

Page 10:

Perfect separation

• If a predictor happens to perfectly align with the response, we call this perfect separation

• When this happens, logistic regression will grossly inflate the coefficients of that predictor (why?)
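A minimal sketch (not from the slides) of what that inflation looks like in practice, using scikit-learn with regularization effectively turned off; the toy data set below is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-predictor data set in which x perfectly separates the two classes
X = np.array([[0.1], [0.4], [0.8], [1.2], [2.1], [2.5], [3.0], [3.3]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# With essentially no regularization (huge C), the likelihood keeps improving
# as the slope grows, so the optimizer pushes the coefficient toward +/- infinity
# and stops only because of its iteration and tolerance limits.
clf = LogisticRegression(C=1e10, max_iter=10_000).fit(X, y)
print(clf.coef_, clf.intercept_)  # expect a very large slope estimate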

Page 11:

Another approach

• We could try modeling the distribution of the predictors in each class separately:

f_k(x) ≡ Pr(X = x | Y = k), the “density function of X” for class k

• How would this help?

Page 12:

Refresher: Bayes’ rule

Page 13:

Refresher: Bayes’ rule

Pr(Y = k | X = x) = π_k f_k(x) / Σ_{l=1..K} π_l f_l(x)

- Pr(Y = k | X = x): the probability that the true class is k, given that we observed predictor vector x
- f_k(x): the probability that we observe predictor vector x, given that the true class is k
- π_k: the probability that the true class is k
- Σ_{l=1..K} π_l f_l(x): the probability that we observe predictor vector x

Page 14:

Refresher: Bayes’ rule

• The prior π_k isn’t too hard to estimate*: it’s just the proportion of training observations that fall in class k

*Assuming our training data is randomly sampled

• For f_k(x), we just need to estimate the density function!

Page 15:

Flashback: toy example

(Figure: the Bayes’ decision boundary on the toy data set, i.e. the set of points where the class probability is 50%.)

Page 16:

Using Bayes’ rule for classification

• Let’s start by assuming we have just a single predictor

• In order to estimate f_k(X), we’ll need to make some assumptions about its form

Page 17:

Assumption 1: f_k(X) is normally distributed

Page 18:

Assumption 1: f_k(X) is normally distributed

• If the density is normal (a.k.a. Gaussian), then we can calculate the function as:

  f_k(x) = (1 / (√(2π) σ_k)) exp( −(x − μ_k)² / (2σ_k²) )

  where μ_k and σ_k² are the mean & variance of the kth class
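As a quick numerical check (numbers chosen only for illustration): if class k has μ_k = 0 and σ_k = 1, then f_k(1) = (1/√(2π)) e^(−1/2) ≈ 0.24.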

Page 19:

Assumption 2: classes have equal variance

• For simplicity, we’ll also assume that:

  σ_1² = σ_2² = … = σ_K²

• That gives us a single variance term, which we’ll denote σ²

Page 20:

Plugging in…

  Pr(Y = k | X = x) = [ π_k (1/(√(2π)σ)) exp(−(x − μ_k)²/(2σ²)) ] / [ Σ_{l=1..K} π_l (1/(√(2π)σ)) exp(−(x − μ_l)²/(2σ²)) ]

For our purposes, the denominator is a constant: it’s the same for every class k!

Page 21:

More algebra!

• So really, after taking logs and dropping the terms that don’t depend on k, we just need to maximize:

  δ_k(x) = x · μ_k/σ² − μ_k²/(2σ²) + log(π_k)

• This is called a discriminant function of x

Page 22:

Okay, we need an example

Bayes’ Decision Boundary at x=0
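Why x = 0? If the two classes in this example have equal priors (π_1 = π_2) and share a common variance, then setting δ_1(x) = δ_2(x) and cancelling the matching terms gives x = (μ_1 + μ_2)/2, the midpoint of the two class means, which here sits at 0.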

Page 23:

LDA: estimating the mean

• As usual, in practice we don’t know the actual values for the parameters and so we have to estimate them

• The linear discriminant analysis method uses the following:

  μ̂_k = (1/n_k) Σ_{i: y_i = k} x_i

  (the average of all the n_k training examples from class k)

Page 24:

LDA: estimating the variance

• Then we’ll use that estimate to get:

  σ̂² = (1/(n − K)) Σ_{k=1..K} Σ_{i: y_i = k} (x_i − μ̂_k)²

  (a weighted average of the sample variances of each class)

Page 25:

Flashback

• Remember that time I said:

  “This isn’t too hard to estimate*”
  *Assuming our training data is randomly sampled

• If we don’t have additional knowledge about the class membership distribution, we just use the proportion of training examples that belong to class k:

  π̂_k = n_k / n

Page 26:

LDA classifier

• The LDA classifier plugs in all these estimates, and assigns the observation to the class for which

  δ̂_k(x) = x · μ̂_k/σ̂² − μ̂_k²/(2σ̂²) + log(π̂_k)

  is the largest

• The “linear” in LDA comes from the fact that this equation is linear in x
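A minimal NumPy sketch (not the course’s code) of these plug-in estimates and the resulting classification rule; the function and variable names are made up for illustration:

import numpy as np

def lda_1d_fit(x, y):
    # Plug-in estimates: class means, class priors, and the pooled variance
    classes = np.unique(y)
    n, K = len(x), len(classes)
    mu = np.array([x[y == k].mean() for k in classes])
    pi = np.array([np.mean(y == k) for k in classes])
    sigma2 = sum(((x[y == k] - x[y == k].mean()) ** 2).sum() for k in classes) / (n - K)
    return classes, mu, sigma2, pi

def lda_1d_predict(x0, classes, mu, sigma2, pi):
    # delta_k(x0) = x0 * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    delta = x0 * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)
    return classes[np.argmax(delta)]

# Tiny example: two well-separated classes
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(lda_1d_predict(0.3, *lda_1d_fit(x, y)))  # assigns the point 0.3 to class 1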

Page 27:

Back to our example

LDA Decision Boundary

Page 28:

Quick recap: LDA

• LDA on 1 predictor makes 2 assumptions: what are they?
  1. Observations within each class are normally distributed
  2. All classes have common variance

• So what would we need to change to make this work with multiple predictors?

Page 29:

LDA on multiple predictors

• Nothing! We just assume the density functions are multivariate normal:

  f_k(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp( −½ (x − μ_k)^T Σ^(-1) (x − μ_k) )

(Figure: two bivariate normal densities; left: uncorrelated predictors, right: correlation = 0.7)

Page 30:

LDA on multiple predictors

• Well okay, not nothing…

• What happens to the mean?

• What happens to the variance?

Page 31:

LDA on multiple predictors

• Plugging in as before, we get:

  δ_k(x) = x^T Σ^(-1) μ_k − ½ μ_k^T Σ^(-1) μ_k + log(π_k)

• While this looks a little terrifying, it’s actually just the matrix/vector version of what we had before:

  δ_k(x) = x · μ_k/σ² − μ_k²/(2σ²) + log(π_k)
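In case the matrix/vector version is easier to digest as code, here is a small NumPy sketch with hypothetical parameter estimates (the numbers are invented just to show the linear algebra):

import numpy as np

mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # class mean vectors mu_k
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])                      # shared covariance matrix
pi = [0.6, 0.4]                                     # class priors pi_k

Sigma_inv = np.linalg.inv(Sigma)

def discriminants(x):
    # delta_k(x) = x' Sigma^{-1} mu_k - 0.5 * mu_k' Sigma^{-1} mu_k + log(pi_k)
    return [x @ Sigma_inv @ m - 0.5 * m @ Sigma_inv @ m + np.log(p)
            for m, p in zip(mu, pi)]

x0 = np.array([1.0, 0.5])
print(np.argmax(discriminants(x0)))  # index of the class with the largest discriminant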

Page 32:

Quick recap: LDA

• LDA on 1 predictor makes 2 assumptions: what are they?
  1. Each class is normally distributed
  2. All classes have common variance

• What can we do about this second assumption?

Page 33:

Quadratic discriminant analysis

• What if we relax the assumption that the classes have uniform variance?

• If we plug this into Bayes’, we get:

  δ_k(x) = −½ (x − μ_k)^T Σ_k^(-1) (x − μ_k) − ½ log|Σ_k| + log(π_k)

  (this multiplies two x terms together, hence “quadratic”)

Page 34:

Discussion: QDA vs. LDA

• Question: why does it matter whether or not we assume that the classes have common variance?

• Answer: the bias/variance trade-off
  - One covariance matrix on p predictors has p(p+1)/2 parameters to estimate
  - The more parameters we have to estimate, the higher the variance in the model (but the lower the bias)
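To make that concrete: a single covariance matrix on p = 50 predictors already has 50 · 51 / 2 = 1,275 parameters. LDA pools the classes into one such matrix, while QDA estimates a separate matrix per class, so with K classes it needs K · 1,275 of them.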

Page 35:

Lab: LDA and QDA

• To do today’s lab in R: nothing special
• To do today’s lab in Python: nothing special
• Instructions and code: http://www.science.smith.edu/~jcrouser/SDS293/labs/lab5/

• You’ll notice that the labs are beginning to get a bit more open-ended – this may cause some discomfort!
  - Goal: to help you become more fluent in using these methods
  - Ask questions early and often: of me, of each other, of whatever resource helps you learn!

• Full version can be found beginning on p. 161 of ISLR
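For anyone who wants to poke at these methods outside the lab, here is a minimal scikit-learn sketch (this is not the lab code at the URL above, and the iris data set is just a stand-in):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)     # one shared covariance
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)  # one covariance per class

print("LDA accuracy:", lda.score(X_test, y_test))
print("QDA accuracy:", qda.score(X_test, y_test))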

Page 36:

Coming up

• Wednesday is Rally Day
  - NO AFTERNOON CLASSES
  - Jordan’s office hours will be shifted ahead one hour (9am to 11am)

• Monday – last day of classification: comparing methods
• Solutions for A1 have been posted to the course website and Moodle
• A2 due on Monday, Feb. 22nd by 11:59pm