BAS 250 Lecture 8

Machine Learning: Classification Algorithms Review

Transcript of BAS 250 Lecture 8

Page 1: BAS 250 Lecture 8

Machine Learning: Classification Algorithms

Review

Page 3: BAS 250 Lecture 8

For each algorithm, the summary chart lists 13 characteristics, in this order: problem type; results interpretable by you?; easy to explain the algorithm to others?; average predictive accuracy; training speed; prediction speed; amount of parameter tuning needed (excluding feature selection); performs well with a small number of observations?; handles lots of irrelevant features well (separates signal from noise)?; automatically learns feature interactions?; gives calibrated probabilities of class membership?; parametric?; features might need scaling?

KNN: Either; Yes; Yes; Lower; Fast; Depends on n; Minimal; No; No; No; Yes; No; Yes

Linear regression: Regression; Yes; Yes; Lower; Fast; Fast; None (excluding regularization); Yes; No; No; N/A; Yes; No (unless regularized)

Logistic regression: Classification; Somewhat; Somewhat; Lower; Fast; Fast; None (excluding regularization); Yes; No; No; Yes; Yes; No (unless regularized)

Naive Bayes: Classification; Somewhat; Somewhat; Lower; Fast (excluding feature extraction); Fast; Some (for feature extraction); Yes; Yes; No; No; Yes; No

Decision trees: Either; Somewhat; Somewhat; Lower; Fast; Fast; Some; No; No; Yes; Possibly; No; No

Random Forests: Either; A little; No; Higher; Slow; Moderate; Some; No; Yes (unless noise ratio is very high); Yes; Possibly; No; No

AdaBoost: Either; A little; No; Higher; Slow; Fast; Some; No; Yes; Yes; Possibly; No; No

Neural networks: Either; No; No; Higher; Slow; Fast; Lots; No; Yes; Yes; Possibly; No; Yes

parametric: assumes an underlying distribution; non-parametric: no underlying distributional assumptions.

calibrated probabilities: a probability between 0 and 1 is computed, rather than simply determining the class.

tuning parameters: variables that you can manipulate to get better fits.

SUMMARY OF MACHINE LEARNING ALGORITHM FEATURES

Page 4: BAS 250 Lecture 8

Nearest Neighbor Classifiers
• Basic idea:

– If it walks like a duck, quacks like a duck, then it’s probably a duck

[Figure: given a set of training records and a test record, compute the distance from the test record to each training record, then choose the k "nearest" records.]

Page 5: BAS 250 Lecture 8

Nearest-Neighbor Classifiers
• Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute its distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)


Page 6: BAS 250 Lecture 8

Definition of Nearest Neighbor

[Figure: three panels showing a test record X and (a) its 1-nearest neighbor, (b) its 2-nearest neighbors, (c) its 3-nearest neighbors.]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

Page 7: BAS 250 Lecture 8

Nearest Neighbor Classification
• Compute the distance between two points:
– Euclidean distance: d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}
• Determine the class from the nearest-neighbor list:
– Take the majority vote of class labels among the k nearest neighbors
– Or weigh each vote according to distance, using the weight factor w = 1/d^2
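To make the distance and voting rules concrete, here is a minimal Python sketch (illustrative, not from the lecture; the data and function names are made up) of a k-nearest-neighbor classifier that uses Euclidean distance and the w = 1/d^2 weighting described above:

import math
from collections import defaultdict

def euclidean(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_X, train_y, x, k=3, weighted=True):
    # Compute the distance from x to every training record, smallest first.
    dists = sorted((euclidean(p, x), label) for p, label in zip(train_X, train_y))
    votes = defaultdict(float)
    for d, label in dists[:k]:                         # the k nearest records
        votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)                   # (weighted) majority vote

# Toy example: two features per record, two classes.
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.8)]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, (1.1, 0.9), k=3))              # expected to print "A"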

Page 8: BAS 250 Lecture 8

Nearest Neighbor Classification…

• Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes

Page 9: BAS 250 Lecture 8

Nearest Neighbor Classification…

• Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
• height of a person may vary from 1.5 m to 1.8 m
• weight of a person may vary from 90 lb to 300 lb
• income of a person may vary from $10K to $1M
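As a quick illustration of the fix (not from the slides; the values echo the example ranges above), min-max scaling puts each attribute on a common 0-to-1 range before distances are computed:

def min_max_scale(column):
    # Rescale a list of values to the range [0, 1].
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

heights = [1.5, 1.6, 1.8]                # metres
incomes = [10_000, 250_000, 1_000_000]   # dollars
print(min_max_scale(heights))            # [0.0, 0.333..., 1.0]
print(min_max_scale(incomes))            # [0.0, 0.242..., 1.0]
# After scaling, income no longer dominates the Euclidean distance.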

Page 10: BAS 250 Lecture 8

What is a Decision Tree?

• An inductive learning task
– Use particular facts to make more generalized conclusions

• A predictive model based on a branching series of Boolean tests
– These smaller Boolean tests are less complex than a one-stage classifier

• Let’s look at a sample decision tree…

Page 11: BAS 250 Lecture 8

Predicting Commute Time

[Figure: a decision tree for predicting commute time. The root node splits on "Leave At" (8 AM, 9 AM, 10 AM); internal nodes test "Stall?" and "Accident?" (No/Yes); the leaves predict a Short, Medium, or Long commute.]

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?

Page 12: BAS 250 Lecture 8

Inductive Learning
• In this decision tree, we made a series of Boolean decisions and followed the corresponding branch:
– Did we leave at 10 AM?
– Did a car stall on the road?
– Is there an accident on the road?
• By answering each of these yes/no questions, we then came to a conclusion on how long our commute might take.
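As a rough sketch of how such a tree reads as code (not from the slides; the exact branch outcomes are assumptions inferred from the figure), the branching series of Boolean tests might look like this:

def commute_time(leave_at, stalled_car, accident):
    # Each nested test corresponds to one decision node in the assumed tree.
    if leave_at == "10 AM":
        return "Long" if stalled_car else "Short"
    elif leave_at == "9 AM":
        return "Long" if accident else "Medium"
    else:  # leaving at 8 AM (assumed to always give a long commute)
        return "Long"

# Leaving at 10 AM with no cars stalled on the road:
print(commute_time("10 AM", stalled_car=False, accident=False))  # "Short"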

Page 13: BAS 250 Lecture 8

Decision Tree Algorithms
• The basic idea behind any decision tree algorithm is as follows:
– Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
– Repeat this process recursively for each child
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances

Page 14: BAS 250 Lecture 8

How to determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1.

[Figure: three candidate test conditions on the same 20 records.
– Own Car? splits into Yes (C0: 6, C1: 4) and No (C0: 4, C1: 6)
– Car Type? splits into Family (C0: 1, C1: 3), Sports (C0: 8, C1: 0), and Luxury (C0: 1, C1: 7)
– Student ID? splits into one branch per ID (c1 … c20), each holding a single record, e.g. c1 (C0: 1, C1: 0), …, c11 (C0: 0, C1: 1), …, c20 (C0: 0, C1: 1)]

Which test condition is the best?

Page 15: BAS 250 Lecture 8

How to determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
– C0: 5, C1: 5: non-homogeneous, high degree of impurity
– C0: 9, C1: 1: homogeneous, low degree of impurity

Page 16: BAS 250 Lecture 8

Measures of Node Impurity
• Gini Index

• Entropy

• Misclassification error

Page 17: BAS 250 Lecture 8

Measure of Impurity: GINI

• Gini Index for a given node t:

GINI(t) = 1 - \sum_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information

Examples:
– C1: 0, C2: 6 → Gini = 0.000
– C1: 2, C2: 4 → Gini = 0.444
– C1: 3, C2: 3 → Gini = 0.500
– C1: 1, C2: 5 → Gini = 0.278

Page 18: BAS 250 Lecture 8

Examples for computing GINI

GINI(t) = 1 - \sum_j [p(j|t)]^2

• C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

• C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

• C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
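A minimal Python sketch (illustrative, not from the slides) that reproduces the Gini values above from raw class counts:

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) is the class frequency at node t.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5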

Page 19: BAS 250 Lecture 8

Alternative Splitting Criteria based on INFO

• Entropy at a given node t:

Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Measures the homogeneity of a node.
• Maximum (\log_2 n_c) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are similar to the GINI index computations

Page 20: BAS 250 Lecture 8

Examples for computing Entropy

Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

• C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0

• C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65

• C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) = 0.92
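The same worked examples as a short Python sketch (illustrative only):

import math

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0*log(0) treated as 0.
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92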

Page 21: BAS 250 Lecture 8

Splitting Based on INFO...
• Information Gain:

GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

where the parent node p is split into k partitions and n_i is the number of records in partition i.

– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5.
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
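A small sketch tying the pieces together (illustrative helper names, not from the slides), computing the information gain of a candidate split from the class counts in each child partition:

import math

def entropy(counts):
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent_counts, partitions):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition_i)
    n = sum(parent_counts)
    weighted = sum(sum(part) / n * entropy(part) for part in partitions)
    return entropy(parent_counts) - weighted

# The "Own Car?" split from the earlier slide: parent is 10 vs. 10,
# children are [6, 4] (Yes) and [4, 6] (No).
print(round(information_gain([10, 10], [[6, 4], [4, 6]]), 3))  # about 0.029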

Page 22: BAS 250 Lecture 8

Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination (to be discussed later)

Page 23: BAS 250 Lecture 8

Decision Tree Based Classification

• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets

Page 24: BAS 250 Lecture 8

Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values

• Costs of Classification

Page 25: BAS 250 Lecture 8

Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary

• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records

• Need new ways for estimating errors

Page 26: BAS 250 Lecture 8

How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if the number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

Page 27: BAS 250 Lecture 8

Bayes Classifiers

Intuitively, Naïve Bayes computes the probability of a previously unseen instance belonging to each class, then simply picks the most probable class.

http://blog.yhat.com/posts/naive-bayes-in-python.html

Page 28: BAS 250 Lecture 8

Bayes Classifiers
• Bayesian classifiers use Bayes theorem, which says

p(c_j | d) = p(d | c_j) p(c_j) / p(d)

• p(c_j | d) = probability of instance d being in class c_j. This is what we are trying to compute.
• p(d | c_j) = probability of generating instance d given class c_j. We can imagine that being in class c_j causes you to have feature d with some probability.
• p(c_j) = probability of occurrence of class c_j. This is just how frequent the class c_j is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.
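A minimal numeric sketch of the rule (the priors and likelihoods below are made-up values for illustration): because p(d) is the same for every class, each class can be scored by p(d | c_j) p(c_j) and the largest score picked.

# Hypothetical priors p(c_j) and likelihoods p(d | c_j) for one observed instance d.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.05, "ham": 0.01}   # p(d | c_j)

scores = {c: likelihoods[c] * priors[c] for c in priors}   # proportional to p(c_j | d)
prediction = max(scores, key=scores.get)
print(scores)       # {'spam': 0.015, 'ham': 0.007}
print(prediction)   # 'spam'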

Page 29: BAS 250 Lecture 8

Different Naïve Bayes Models
• Multivariate Bernoulli Naive Bayes: the binomial model is useful if your feature vectors are binary (i.e., 0s and 1s). One application would be text classification with a bag-of-words model, where the 1s and 0s are "word occurs in the document" and "word does not occur in the document".
• Multinomial Naive Bayes: the multinomial naive Bayes model is typically used for discrete counts. E.g., in a text classification problem, we can take the idea of Bernoulli trials one step further, and instead of "word occurs in the document" we have "count how often the word occurs in the document"; you can think of it as "the number of times outcome x_i is observed over the n trials".
• Gaussian Naive Bayes: here we assume that the features follow a normal distribution. Instead of discrete counts, we have continuous features (e.g., the popular Iris dataset, where the features are sepal width, petal width, sepal length, and petal length).
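A hedged sketch of how the three variants are typically used in practice, assuming scikit-learn is installed (the toy data below is made up and not part of the lecture):

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

y = np.array([0, 0, 1, 1])

# Binary word-occurrence features -> Bernoulli model.
X_binary = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(BernoulliNB().fit(X_binary, y).predict(X_binary))

# Word-count features -> multinomial model.
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 1, 3]])
print(MultinomialNB().fit(X_counts, y).predict(X_counts))

# Continuous features (e.g., petal/sepal measurements) -> Gaussian model.
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1]])
print(GaussianNB().fit(X_cont, y).predict(X_cont))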

Page 31: BAS 250 Lecture 8

Logistic Regression vs. Naïve Bayes

• Logistic Regression idea:
– Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
– Why not learn P(Y|X) directly?

Page 32: BAS 250 Lecture 8

The Logistic Function
• We want a model that predicts probabilities between 0 and 1, that is, S-shaped.
• There are lots of S-shaped curves. We use the logistic model:
• Probability: P = exp(\beta_0 + \beta_1 X) / [1 + exp(\beta_0 + \beta_1 X)], or equivalently \log_e[P/(1-P)] = \beta_0 + \beta_1 X
• The function on the left, \log_e[P/(1-P)], is called the logit (the log odds).

[Figure: the S-shaped logistic curve P(y|x) = e^{\beta_0 + \beta_1 x} / (1 + e^{\beta_0 + \beta_1 x}), with the probability on the vertical axis ranging from 0.0 to 1.0 and x on the horizontal axis.]

Page 33: BAS 250 Lecture 8

Logistic Regression Function
• Logistic regression models the logit of the outcome, instead of the outcome itself: instead of winning or losing, we build a model for the log odds of winning or losing.
• That is, the natural logarithm of the odds of the outcome: ln(probability of the outcome (p) / probability of not having the outcome (1-p))

\ln[P/(1-P)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_i x_i

Equivalently, P(y|x) = e^{\alpha + \beta_1 x_1 + ... + \beta_i x_i} / (1 + e^{\alpha + \beta_1 x_1 + ... + \beta_i x_i})
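A small sketch (the coefficient values are made up for illustration) showing that the logistic function and the logit are inverses of each other:

import math

def logistic(x, a=0.0, b=1.0):
    # P(y|x) = e^(a + b*x) / (1 + e^(a + b*x))
    z = a + b * x
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    # ln(P / (1 - P)), the log odds modelled by logistic regression
    return math.log(p / (1.0 - p))

p = logistic(2.0, a=-1.0, b=0.8)   # hypothetical coefficients alpha = -1.0, beta = 0.8
print(round(p, 3))                  # about 0.646
print(round(logit(p), 3))           # recovers alpha + beta*x = 0.6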

Page 34: BAS 250 Lecture 8

ROC Curves

• Originated from signal detection theory
– A binary signal corrupted by Gaussian noise
– What is the optimal threshold (i.e., operating point)?
• Depends on three factors
– Signal strength
– Noise variance
– Personal tolerance for the hit / false alarm rate

Page 35: BAS 250 Lecture 8

ROC Curves

• Receiver Operating Characteristic

• Summarizes and presents the performance of any binary classification model

• Measures the model's ability to distinguish between false and true positives

Page 36: BAS 250 Lecture 8

Use Multiple Contingency Tables

• Sample contingency tables across a range of thresholds/probabilities.

• TRUE POSITIVE RATE (also called SENSITIVITY):
True Positives / [(True Positives) + (False Negatives)]

• FALSE POSITIVE RATE (also called 1 - SPECIFICITY):
False Positives / [(False Positives) + (True Negatives)]

• Plot sensitivity vs. (1 - specificity) for each sampled threshold and you are done.
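A minimal sketch (with a small made-up set of scores and labels) of how ROC points are produced by sweeping a threshold over predicted probabilities:

def roc_points(scores, labels, thresholds):
    # For each threshold, build the contingency table and compute TPR and FPR.
    points = []
    for t in thresholds:
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        tn = sum((not p) and (not y) for p, y in zip(preds, labels))
        tpr = tp / (tp + fn)          # sensitivity
        fpr = fp / (fp + tn)          # 1 - specificity
        points.append((fpr, tpr))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # predicted probabilities
labels = [1, 1, 0, 1, 0, 0]               # true classes
print(roc_points(scores, labels, [0.2, 0.5, 0.85]))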

Pages 37-41: BAS 250 Lecture 8 (no transcribed text)
Page 42: BAS 250 Lecture 8

Pros/Cons of Various Classification Algorithms

Logistic regression: no distributional requirement; performs well with categorical variables that have few categories; computes the logistic distribution; easy to interpret; provides confidence intervals; suffers from multicollinearity.

Decision trees: no distributional requirement; heuristic; good for variables with few categories; do not suffer from multicollinearity (by choosing one of the correlated variables); interpretable.

Naïve Bayes: generally no requirements; good for variables with few categories; computes the product of independent distributions; suffers from multicollinearity.

SVM: no distributional requirement; computes hinge loss; flexible selection of kernels for nonlinear correlation; does not suffer from multicollinearity; hard to interpret.

Bagging, boosting, and ensemble methods (Random Forests, AdaBoost, etc.): generally outperform any single algorithm listed above.

Source: Quora

Page 43: BAS 250 Lecture 8

Prediction Error and the Bias-Variance Tradeoff

• A good measure of the quality of an estimator \hat{f}(x) is the mean squared error. Let f_0(x) be the true value of f(x) at the point x. Then

Mse[\hat{f}(x)] = E[(\hat{f}(x) - f_0(x))^2]

• This can be written as variance + bias^2:

Mse[\hat{f}(x)] = Var[\hat{f}(x)] + [E\hat{f}(x) - f_0(x)]^2

• Typically, when bias is low, variance is high and vice versa. Choosing estimators often involves a tradeoff between bias and variance.
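For completeness, here is the standard short derivation of that decomposition (not from the slides), obtained by adding and subtracting E\hat{f}(x) inside the square:

\begin{aligned}
\mathrm{Mse}[\hat{f}(x)] &= E\big[(\hat{f}(x) - f_0(x))^2\big] \\
&= E\big[(\hat{f}(x) - E\hat{f}(x) + E\hat{f}(x) - f_0(x))^2\big] \\
&= E\big[(\hat{f}(x) - E\hat{f}(x))^2\big] + \big(E\hat{f}(x) - f_0(x)\big)^2 \\
&= \mathrm{Var}[\hat{f}(x)] + \mathrm{Bias}\big[\hat{f}(x)\big]^2
\end{aligned}

The cross term 2\,E\big[\hat{f}(x) - E\hat{f}(x)\big]\big(E\hat{f}(x) - f_0(x)\big) vanishes because E\big[\hat{f}(x) - E\hat{f}(x)\big] = 0.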

Page 44: BAS 250 Lecture 8

Note the tradeoff between Bias and Variance!