BAS 250 Lecture 8

Machine Learning: Classification Algorithms Review

Transcript of BAS 250 Lecture 8

Page 1: BAS 250 Lecture 8

Machine Learning: Classification Algorithms

Review

Page 3: BAS 250 Lecture 8

For each algorithm, the summary chart lists 13 characteristics, in this order: problem type; results interpretable by you?; easy to explain the algorithm to others?; average predictive accuracy; training speed; prediction speed; amount of parameter tuning needed (excluding feature selection); performs well with a small number of observations?; handles lots of irrelevant features well (separates signal from noise)?; automatically learns feature interactions?; gives calibrated probabilities of class membership?; parametric?; features might need scaling?

KNN: Either; Yes; Yes; Lower; Fast; Depends on n; Minimal; No; No; No; Yes; No; Yes

Linear regression: Regression; Yes; Yes; Lower; Fast; Fast; None (excluding regularization); Yes; No; No; N/A; Yes; No (unless regularized)

Logistic regression: Classification; Somewhat; Somewhat; Lower; Fast; Fast; None (excluding regularization); Yes; No; No; Yes; Yes; No (unless regularized)

Naive Bayes: Classification; Somewhat; Somewhat; Lower; Fast (excluding feature extraction); Fast; Some (for feature extraction); Yes; Yes; No; No; Yes; No

Decision trees: Either; Somewhat; Somewhat; Lower; Fast; Fast; Some; No; No; Yes; Possibly; No; No

Random Forests: Either; A little; No; Higher; Slow; Moderate; Some; No; Yes (unless noise ratio is very high); Yes; Possibly; No; No

AdaBoost: Either; A little; No; Higher; Slow; Fast; Some; No; Yes; Yes; Possibly; No; No

Neural networks: Either; No; No; Higher; Slow; Fast; Lots; No; Yes; Yes; Possibly; No; Yes

parametric: assumes an underlying distribution; non-parametric: no underlying distributional assumptions.

calibrated probabilities: a probability between 0 and 1 is computed, rather than simply determining the class.

tuning parameters: variables that you can manipulate to get better fits.

SUMMARY OF MACHINE LEARNING ALGORITHM FEATURES

Page 4: BAS 250 Lecture 8

Nearest Neighbor Classifiers
• Basic idea:

– If it walks like a duck, quacks like a duck, then it’s probably a duck

[Figure: given a set of training records and a test record, compute the distance from the test record to each training record, then choose the k "nearest" records.]

Page 5: BAS 250 Lecture 8

Nearest-Neighbor Classifiers
• Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute its distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)


Page 6: BAS 250 Lecture 8

Definition of Nearest Neighbor

[Figure: three panels showing a test record X and (a) its 1-nearest neighbor, (b) its 2-nearest neighbors, (c) its 3-nearest neighbors.]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

Page 7: BAS 250 Lecture 8

Nearest Neighbor Classification
• Compute the distance between two points:
– Euclidean distance: d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}
• Determine the class from the nearest-neighbor list:
– Take the majority vote of class labels among the k nearest neighbors
– Or weigh each vote according to distance, using the weight factor w = 1/d^2
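To make the distance and voting rules concrete, here is a minimal Python sketch (illustrative, not from the lecture; the data and function names are made up) of a k-nearest-neighbor classifier that uses Euclidean distance and the w = 1/d^2 weighting described above:

import math
from collections import defaultdict

def euclidean(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_X, train_y, x, k=3, weighted=True):
    # Compute the distance from x to every training record, smallest first.
    dists = sorted((euclidean(p, x), label) for p, label in zip(train_X, train_y))
    votes = defaultdict(float)
    for d, label in dists[:k]:                         # the k nearest records
        votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)                   # (weighted) majority vote

# Toy example: two features per record, two classes.
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.8)]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, (1.1, 0.9), k=3))              # expected to print "A"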

Page 8: BAS 250 Lecture 8

Nearest Neighbor Classification…

• Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes

Page 9: BAS 250 Lecture 8

Nearest Neighbor Classification…

• Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
• height of a person may vary from 1.5 m to 1.8 m
• weight of a person may vary from 90 lb to 300 lb
• income of a person may vary from $10K to $1M
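As a quick illustration of the fix (not from the slides; the values echo the example ranges above), min-max scaling puts each attribute on a common 0-to-1 range before distances are computed:

def min_max_scale(column):
    # Rescale a list of values to the range [0, 1].
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

heights = [1.5, 1.6, 1.8]                # metres
incomes = [10_000, 250_000, 1_000_000]   # dollars
print(min_max_scale(heights))            # [0.0, 0.333..., 1.0]
print(min_max_scale(incomes))            # [0.0, 0.242..., 1.0]
# After scaling, income no longer dominates the Euclidean distance.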

Page 10: BAS 250 Lecture 8

What is a Decision Tree?

• An inductive learning task
– Use particular facts to make more generalized conclusions

• A predictive model based on a branching series of Boolean tests
– These smaller Boolean tests are less complex than a one-stage classifier

• Let’s look at a sample decision tree…

Page 11: BAS 250 Lecture 8

Predicting Commute Time

[Figure: a decision tree for predicting commute time. The root node splits on "Leave At" (8 AM, 9 AM, 10 AM); internal nodes test "Stall?" and "Accident?" (No/Yes); the leaves predict a Short, Medium, or Long commute.]

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?

Page 12: BAS 250 Lecture 8

Inductive Learning
• In this decision tree, we made a series of Boolean decisions and followed the corresponding branch:
– Did we leave at 10 AM?
– Did a car stall on the road?
– Is there an accident on the road?
• By answering each of these yes/no questions, we then came to a conclusion on how long our commute might take.
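As a rough sketch of how such a tree reads as code (not from the slides; the exact branch outcomes are assumptions inferred from the figure), the branching series of Boolean tests might look like this:

def commute_time(leave_at, stalled_car, accident):
    # Each nested test corresponds to one decision node in the assumed tree.
    if leave_at == "10 AM":
        return "Long" if stalled_car else "Short"
    elif leave_at == "9 AM":
        return "Long" if accident else "Medium"
    else:  # leaving at 8 AM (assumed to always give a long commute)
        return "Long"

# Leaving at 10 AM with no cars stalled on the road:
print(commute_time("10 AM", stalled_car=False, accident=False))  # "Short"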

Page 13: BAS 250 Lecture 8

Decision Tree Algorithms
• The basic idea behind any decision tree algorithm is as follows:
– Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
– Repeat this process recursively for each child
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances

Page 14: BAS 250 Lecture 8

How to determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1.

[Figure: three candidate test conditions on the same 20 records.
– Own Car? splits into Yes (C0: 6, C1: 4) and No (C0: 4, C1: 6)
– Car Type? splits into Family (C0: 1, C1: 3), Sports (C0: 8, C1: 0), and Luxury (C0: 1, C1: 7)
– Student ID? splits into one branch per ID (c1 … c20), each holding a single record, e.g. c1 (C0: 1, C1: 0), …, c11 (C0: 0, C1: 1), …, c20 (C0: 0, C1: 1)]

Which test condition is the best?

Page 15: BAS 250 Lecture 8

How to determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
– C0: 5, C1: 5: non-homogeneous, high degree of impurity
– C0: 9, C1: 1: homogeneous, low degree of impurity

Page 16: BAS 250 Lecture 8

Measures of Node Impurity
• Gini Index

• Entropy

• Misclassification error

Page 17: BAS 250 Lecture 8

Measure of Impurity: GINI

• Gini Index for a given node t:

GINI(t) = 1 - \sum_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information

Examples:
– C1: 0, C2: 6 → Gini = 0.000
– C1: 2, C2: 4 → Gini = 0.444
– C1: 3, C2: 3 → Gini = 0.500
– C1: 1, C2: 5 → Gini = 0.278

Page 18: BAS 250 Lecture 8

Examples for computing GINI

GINI(t) = 1 - \sum_j [p(j|t)]^2

• C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

• C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

• C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
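A minimal Python sketch (illustrative, not from the slides) that reproduces the Gini values above from raw class counts:

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) is the class frequency at node t.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5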

Page 19: BAS 250 Lecture 8

Alternative Splitting Criteria based on INFO

• Entropy at a given node t:

Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Measures the homogeneity of a node.
• Maximum (\log_2 n_c) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are similar to the GINI index computations

Page 20: BAS 250 Lecture 8

Examples for computing Entropy

Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

• C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0

• C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65

• C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) = 0.92
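The same worked examples as a short Python sketch (illustrative only):

import math

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0*log(0) treated as 0.
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92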

Page 21: BAS 250 Lecture 8

Splitting Based on INFO...
• Information Gain:

GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

where the parent node p is split into k partitions and n_i is the number of records in partition i.

– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5.
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
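A small sketch tying the pieces together (illustrative helper names, not from the slides), computing the information gain of a candidate split from the class counts in each child partition:

import math

def entropy(counts):
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent_counts, partitions):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition_i)
    n = sum(parent_counts)
    weighted = sum(sum(part) / n * entropy(part) for part in partitions)
    return entropy(parent_counts) - weighted

# The "Own Car?" split from the earlier slide: parent is 10 vs. 10,
# children are [6, 4] (Yes) and [4, 6] (No).
print(round(information_gain([10, 10], [[6, 4], [4, 6]]), 3))  # about 0.029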

Page 22: BAS 250 Lecture 8

Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination (to be discussed later)

Page 23: BAS 250 Lecture 8

Decision Tree Based Classification

• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets

Page 24: BAS 250 Lecture 8

Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values

• Costs of Classification

Page 25: BAS 250 Lecture 8

Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary

• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records

• Need new ways for estimating errors

Page 26: BAS 250 Lecture 8

How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if the number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

Page 27: BAS 250 Lecture 8

Bayes Classifiers

Intuitively, Naïve Bayes computes the probability of a previously unseen instance belonging to each class, then simply picks the most probable class.

http://blog.yhat.com/posts/naive-bayes-in-python.html

Page 28: BAS 250 Lecture 8

Bayes Classifiers
• Bayesian classifiers use Bayes theorem, which says

p(c_j | d) = p(d | c_j) p(c_j) / p(d)

• p(c_j | d) = probability of instance d being in class c_j. This is what we are trying to compute.
• p(d | c_j) = probability of generating instance d given class c_j. We can imagine that being in class c_j causes you to have feature d with some probability.
• p(c_j) = probability of occurrence of class c_j. This is just how frequent the class c_j is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.
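A minimal numeric sketch of the rule (the priors and likelihoods below are made-up values for illustration): because p(d) is the same for every class, each class can be scored by p(d | c_j) p(c_j) and the largest score picked.

# Hypothetical priors p(c_j) and likelihoods p(d | c_j) for one observed instance d.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.05, "ham": 0.01}   # p(d | c_j)

scores = {c: likelihoods[c] * priors[c] for c in priors}   # proportional to p(c_j | d)
prediction = max(scores, key=scores.get)
print(scores)       # {'spam': 0.015, 'ham': 0.007}
print(prediction)   # 'spam'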

Page 29: BAS 250 Lecture 8

Different Naïve Bayes Models
• Multivariate Bernoulli Naive Bayes: the binomial model is useful if your feature vectors are binary (i.e., 0s and 1s). One application would be text classification with a bag-of-words model, where the 1s and 0s are "word occurs in the document" and "word does not occur in the document".
• Multinomial Naive Bayes: the multinomial naive Bayes model is typically used for discrete counts. E.g., in a text classification problem, we can take the idea of Bernoulli trials one step further, and instead of "word occurs in the document" we have "count how often the word occurs in the document"; you can think of it as "the number of times outcome x_i is observed over the n trials".
• Gaussian Naive Bayes: here we assume that the features follow a normal distribution. Instead of discrete counts, we have continuous features (e.g., the popular Iris dataset, where the features are sepal width, petal width, sepal length, and petal length).
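A hedged sketch of how the three variants are typically used in practice, assuming scikit-learn is installed (the toy data below is made up and not part of the lecture):

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

y = np.array([0, 0, 1, 1])

# Binary word-occurrence features -> Bernoulli model.
X_binary = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(BernoulliNB().fit(X_binary, y).predict(X_binary))

# Word-count features -> multinomial model.
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 1, 3]])
print(MultinomialNB().fit(X_counts, y).predict(X_counts))

# Continuous features (e.g., petal/sepal measurements) -> Gaussian model.
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1]])
print(GaussianNB().fit(X_cont, y).predict(X_cont))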

Page 31: BAS 250 Lecture 8

Logistic Regression vs. Naïve Bayes

• Logistic Regression idea:
– Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
– Why not learn P(Y|X) directly?

Page 32: BAS 250 Lecture 8

The Logistic Function
• We want a model that predicts probabilities between 0 and 1, that is, S-shaped.
• There are lots of S-shaped curves. We use the logistic model:
• Probability: P = exp(\beta_0 + \beta_1 X) / [1 + exp(\beta_0 + \beta_1 X)], or equivalently \log_e[P/(1-P)] = \beta_0 + \beta_1 X
• The function on the left, \log_e[P/(1-P)], is called the logit (the log odds).

[Figure: the S-shaped logistic curve P(y|x) = e^{\beta_0 + \beta_1 x} / (1 + e^{\beta_0 + \beta_1 x}), with the probability on the vertical axis ranging from 0.0 to 1.0 and x on the horizontal axis.]

Page 33: BAS 250 Lecture 8

Logistic Regression Function
• Logistic regression models the logit of the outcome, instead of the outcome itself: instead of winning or losing, we build a model for the log odds of winning or losing.
• That is, the natural logarithm of the odds of the outcome: ln(probability of the outcome (p) / probability of not having the outcome (1-p))

\ln[P/(1-P)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_i x_i

Equivalently, P(y|x) = e^{\alpha + \beta_1 x_1 + ... + \beta_i x_i} / (1 + e^{\alpha + \beta_1 x_1 + ... + \beta_i x_i})
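A small sketch (the coefficient values are made up for illustration) showing that the logistic function and the logit are inverses of each other:

import math

def logistic(x, a=0.0, b=1.0):
    # P(y|x) = e^(a + b*x) / (1 + e^(a + b*x))
    z = a + b * x
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    # ln(P / (1 - P)), the log odds modelled by logistic regression
    return math.log(p / (1.0 - p))

p = logistic(2.0, a=-1.0, b=0.8)   # hypothetical coefficients alpha = -1.0, beta = 0.8
print(round(p, 3))                  # about 0.646
print(round(logit(p), 3))           # recovers alpha + beta*x = 0.6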

Page 34: BAS 250 Lecture 8

ROC Curves

• Originated from signal detection theory
– A binary signal corrupted by Gaussian noise
– What is the optimal threshold (i.e., operating point)?
• Depends on three factors
– Signal strength
– Noise variance
– Personal tolerance for the hit / false alarm rate

Page 35: BAS 250 Lecture 8

ROC Curves

• Receiver Operating Characteristic

• Summarizes and presents the performance of any binary classification model

• Measures the model's ability to distinguish between false and true positives

Page 36: BAS 250 Lecture 8

Use Multiple Contingency Tables

• Sample contingency tables across a range of thresholds/probabilities.

• TRUE POSITIVE RATE (also called SENSITIVITY):
True Positives / [(True Positives) + (False Negatives)]

• FALSE POSITIVE RATE (also called 1 - SPECIFICITY):
False Positives / [(False Positives) + (True Negatives)]

• Plot sensitivity vs. (1 - specificity) for each sampled threshold and you are done.
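A minimal sketch (with a small made-up set of scores and labels) of how ROC points are produced by sweeping a threshold over predicted probabilities:

def roc_points(scores, labels, thresholds):
    # For each threshold, build the contingency table and compute TPR and FPR.
    points = []
    for t in thresholds:
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        tn = sum((not p) and (not y) for p, y in zip(preds, labels))
        tpr = tp / (tp + fn)          # sensitivity
        fpr = fp / (fp + tn)          # 1 - specificity
        points.append((fpr, tpr))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # predicted probabilities
labels = [1, 1, 0, 1, 0, 0]               # true classes
print(roc_points(scores, labels, [0.2, 0.5, 0.85]))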

Pages 37-41: BAS 250 Lecture 8 (no transcribed text)
Page 42: BAS 250 Lecture 8

Pros/Cons of Various Classification Algorithms

Logistic regression: no distributional requirement; performs well with categorical variables that have few categories; computes the logistic distribution; easy to interpret; provides confidence intervals; suffers from multicollinearity.

Decision trees: no distributional requirement; heuristic; good for variables with few categories; do not suffer from multicollinearity (by choosing one of the correlated variables); interpretable.

Naïve Bayes: generally no requirements; good for variables with few categories; computes the product of independent distributions; suffers from multicollinearity.

SVM: no distributional requirement; computes hinge loss; flexible selection of kernels for nonlinear correlation; does not suffer from multicollinearity; hard to interpret.

Bagging, boosting, and ensemble methods (Random Forests, AdaBoost, etc.): generally outperform any single algorithm listed above.

Source: Quora

Page 43: BAS 250 Lecture 8

Prediction Error and the Bias-Variance Tradeoff

• A good measure of the quality of an estimator \hat{f}(x) is the mean squared error. Let f_0(x) be the true value of f(x) at the point x. Then

Mse[\hat{f}(x)] = E[(\hat{f}(x) - f_0(x))^2]

• This can be written as variance + bias^2:

Mse[\hat{f}(x)] = Var[\hat{f}(x)] + [E\hat{f}(x) - f_0(x)]^2

• Typically, when bias is low, variance is high and vice versa. Choosing estimators often involves a tradeoff between bias and variance.
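For completeness, here is the standard short derivation of that decomposition (not from the slides), obtained by adding and subtracting E\hat{f}(x) inside the square:

\begin{aligned}
\mathrm{Mse}[\hat{f}(x)] &= E\big[(\hat{f}(x) - f_0(x))^2\big] \\
&= E\big[(\hat{f}(x) - E\hat{f}(x) + E\hat{f}(x) - f_0(x))^2\big] \\
&= E\big[(\hat{f}(x) - E\hat{f}(x))^2\big] + \big(E\hat{f}(x) - f_0(x)\big)^2 \\
&= \mathrm{Var}[\hat{f}(x)] + \mathrm{Bias}\big[\hat{f}(x)\big]^2
\end{aligned}

The cross term 2\,E\big[\hat{f}(x) - E\hat{f}(x)\big]\big(E\hat{f}(x) - f_0(x)\big) vanishes because E\big[\hat{f}(x) - E\hat{f}(x)\big] = 0.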

Page 44: BAS 250 Lecture 8

Note the tradeoff between Bias and Variance!