Last lecture summary (SVM)


Support Vector Machine
- Supervised algorithm.
- Works both as a classifier (binary) and as a regressor.
- De facto the best linear classification method.
- Two main ingredients: maximum margin and kernel functions.

Maximum margin
- Which line is best?
- Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

- The maximum margin linear classifier is the optimum linear classifier.
- This is the simplest kind of SVM (linear SVM).
- Maximum margin intuitively feels safest.
- Only support vectors are important.
- Works very well.
- The decision boundary is found by constrained quadratic optimization.
- The solution is found in the form w = Σᵢ αᵢ yᵢ xᵢ.

Only points on the margin (i.e. the support vectors xᵢ) have αᵢ > 0.

The αᵢ are Lagrange multipliers. w does not need to be formed explicitly, because the decision function can be written purely in terms of dot products:

f(x) = sign(Σᵢ αᵢ yᵢ (xᵢ · x) + b)
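As an illustrative check (not part of the original slides), the following scikit-learn sketch verifies both identities on made-up synthetic data: the fitted attribute dual_coef_ stores the products αᵢyᵢ for the support vectors, so w can be recovered as Σᵢ αᵢ yᵢ xᵢ, and the decision function needs only dot products with the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_.ravel()   # alpha_i * y_i for the support vectors only
sv = clf.support_vectors_
b = clf.intercept_[0]

# w = sum_i alpha_i y_i x_i ...
w = alpha_y @ sv
# ... and f(x) uses only dot products between x and the support vectors:
x_new = np.array([0.5, -0.3])
f_dot = alpha_y @ (sv @ x_new) + b

print(np.allclose(w, clf.coef_.ravel()))                       # True
print(np.allclose(f_dot, clf.decision_function([x_new])[0]))   # True
```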

Training an SVM: find the set of parameters αᵢ and b.
Classification with an SVM: evaluate the decision function above for the new point x.

Soft margin
- Allows misclassification errors, i.e. misclassified points are allowed to lie inside the margin.
- The penalty for classification errors is given by the capacity parameter C (a user-adjustable parameter).
- Large C: a high penalty for classification errors.
- Decreasing C: more points move inside the margin.
(CSE 802, prepared by Martin Law)

Kernel functions
- The soft margin introduces the possibility to linearly classify linearly non-separable data sets.
- What else could be done? Can we propose an approach that generates a non-linear classification boundary just by extending the linear classifier machinery?
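As an illustrative aside (not from the slides), this scikit-learn sketch shows both ideas at once: the soft-margin constant C is passed as a hyperparameter, and swapping the linear kernel for a Gaussian (RBF) kernel yields a non-linear boundary on data that no line can separate. The synthetic ring data set and parameter values are made up for the demonstration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Concentric "rings": class -1 inside, class +1 outside -- not linearly separable.
radius = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
angle = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.array([-1] * 100 + [1] * 100)

# The soft-margin constant C is a hyperparameter of both classifiers.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # roughly chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # close to 1.0
```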


Kernels
- Linear (dot) kernel: k(x, x') = x · x'

- Polynomial kernel: k(x, x') = (x · x' + 1)^d; simple, efficient for non-linear relationships; d is the degree.

- Gaussian kernel: k(x, x') = exp(-||x - x'||² / (2σ²)); σ is the width of the Gaussian.
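A small illustrative sketch (not from the slides) of these three kernels in NumPy; the kernel forms are the standard textbook definitions and the sample vectors are arbitrary.

```python
import numpy as np

def linear_kernel(x, z):
    """Linear (dot) kernel: k(x, z) = x . z"""
    return np.dot(x, z)

def polynomial_kernel(x, z, d=3):
    """Polynomial kernel of degree d: k(x, z) = (x . z + 1)^d"""
    return (np.dot(x, z) + 1.0) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=2), gaussian_kernel(x, z))
```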

SVM parameters
- Training sets the parameters αᵢ and b.
- The SVM has another set of parameters called hyperparameters:
  - the soft margin constant C,
  - any parameters the kernel function depends on:
    - linear kernel: no hyperparameter (except for C),
    - polynomial kernel: the degree d,
    - Gaussian kernel: the width σ of the Gaussian.

Finishing SVM

Multiclass SVM
- SVM is defined for binary classification.
- How to predict more than two classes (multiclass)?
- Simplest approach: decompose the multiclass problem into several binary problems and train several binary SVMs.
- One-vs-one: train a classifier for each pair of classes (1/2, 1/3, 1/4, 2/3, 2/4, 3/4) and combine their votes.
- One-vs-rest: train one classifier per class against all the others (1/rest, 2/rest, 3/rest, 4/rest).

Resources
- Support Vector Machines and Kernels for Computational Biology, Rätsch et al., PLoS Computational Biology, 4 (10), 1-10, 2008
- What is a support vector machine?, W. S. Noble, Nature Biotechnology, 24 (12), 1565-1567, 2006
- A tutorial on support vector machines for pattern recognition, C. J. C. Burges, Data Mining and Knowledge Discovery, 2, 121-167, 1998
- A User's Guide to Support Vector Machines, Asa Ben-Hur and Jason Weston
- http://support-vector-machines.org/
- http://www.kernel-machines.org/
- http://www.support-vector.net/ - companion to the book An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor
- http://www.kernel-methods.net/ - companion to the book Kernel Methods for Pattern Analysis by Shawe-Taylor and Cristianini
- http://www.learning-with-kernels.org/ - several chapters of the book Learning with Kernels by Schölkopf and Smola are available from this site

Software
- SVMlight: one of the most widely used SVM packages; fast optimization, can handle very large datasets, very efficient implementation of leave-one-out cross-validation, C++ code.
- SVMstruct: can model complex data, such as trees, sequences, or sets.
- LIBSVM: multiclass, weighted SVM for unbalanced data, cross-validation, automatic model selection, C++ and Java.

Naïve Bayes Classifier

Example: Play Tennis

Example: Learning Phase

Outlook      | Play=Yes | Play=No
Sunny        |   2/9    |   3/5
Overcast     |   4/9    |   0/5
Rain         |   3/9    |   2/5

Temperature  | Play=Yes | Play=No
Hot          |   2/9    |   2/5
Mild         |   4/9    |   2/5
Cool         |   3/9    |   1/5

Humidity     | Play=Yes | Play=No
High         |   3/9    |   4/5
Normal       |   6/9    |   1/5

Wind         | Play=Yes | Play=No
Strong       |   3/9    |   3/5
Weak         |   6/9    |   2/5

P(Play=Yes) = 9/14
P(Play=No) = 5/14
P(Outlook=Sunny|Play=Yes) = 2/9

Example - prediction
- Answer this question: will we play tennis given that it is cool but sunny, the humidity is high, and a strong wind is blowing?
- i.e. predict this new instance: x = (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)
- A good strategy is to predict argmax over Y of P(Y|cool, sunny, high, strong), where Y is Yes or No.

Example - prediction
x = (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)

Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9      P(Outlook=Sunny|Play=No) = 3/5
P(Temp=Cool|Play=Yes) = 3/9          P(Temp=Cool|Play=No) = 1/5
P(Hum=High|Play=Yes) = 3/9           P(Hum=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9        P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                   P(Play=No) = 5/14

P(Yes|x) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] · P(Play=Yes) = 0.0053
P(No|x) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] · P(Play=No) = 0.0206

Given that P(Yes|x) < P(No|x), we label x as No. Now calculate the probability that, given the data x, tennis will be played (or not).
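The arithmetic above is easy to check in a few lines of Python; this is an illustrative sketch (not from the slides) using the conditional probabilities from the lookup tables.

```python
# Naive Bayes prediction for x = (Sunny, Cool, High, Strong),
# using the conditional probabilities from the learning-phase tables.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~0.0053 (unnormalized)
p_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # ~0.0206 (unnormalized)

print("prediction:", "Yes" if p_yes > p_no else "No")
# Normalizing gives the posterior probability of each label:
print("P(No|x) =", p_no / (p_yes + p_no))    # ~0.80
print("P(Yes|x) =", p_yes / (p_yes + p_no))  # ~0.20
```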

Another Application: Digit Recognition

X₁, …, Xₙ ∈ {0, 1} (black vs. white pixels)
Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)
(Figure: an image of a handwritten digit is fed to the classifier, which outputs 5.)

What is the probability that the image represents a 5 given its pixels?

Bayes Rule
So how do we compute the posterior probability that the image represents a 5 given its pixels?

Why did this help? Well, we think that we might be able to specify how the features are generated by the class label (i.e. we will try to compute the likelihood).

Bayes rule:

P(Y|X) = P(X|Y) P(Y) / P(X)
(posterior = likelihood × prior / normalization constant)

Let's expand this for our digit recognition task:

P(Y=5|X₁, …, Xₙ) = P(X₁, …, Xₙ|Y=5) P(Y=5) / P(X₁, …, Xₙ)
P(Y=6|X₁, …, Xₙ) = P(X₁, …, Xₙ|Y=6) P(Y=6) / P(X₁, …, Xₙ)

To classify, we'll simply compute these two probabilities and predict based on which one is greater.

For the Bayes classifier, we need to learn two functions, the likelihood and the prior.

Learning the prior
Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution P(Y), and that a teacher then labels each example with its Y value.

A hundred independently drawn training examples will usually suffice to obtain a reasonable estimate of P(Y).

Learning the likelihood
- This corresponds to two distinct parameters for each distinct instance in the instance space of X.
- Worse yet, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times.
- For example, if X is a vector containing 30 boolean features, then we will need to estimate more than 3 billion parameters!

The problem with explicitly modeling P(X₁, …, Xₙ|Y) is that there are usually way too many parameters:
- We'll run out of space.
- We'll run out of time.
- And we'll need tons of training data (which is usually not available).

The Naïve Bayes Model
- Assume the features are conditionally independent given the class label:
  P(X₁, …, Xₙ|Y) = Πᵢ P(Xᵢ|Y)

Naïve Bayes Training

MNIST Training Data

Training Naïve Bayes is easy:
- Estimate P(Y=v) as the fraction of records with Y=v.

- Estimate P(Xᵢ=u|Y=v) as the fraction of records with Y=v for which Xᵢ=u.

In practice, some of these counts can be zero. Fix this by adding virtual counts, e.g. Laplace (add-one) smoothing:

P(Xᵢ=u|Y=v) ≈ (#{Xᵢ=u and Y=v} + 1) / (#{Y=v} + number of possible values of Xᵢ)

This is called Smoothing.

Naïve Bayes Training
For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.
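A minimal training sketch (illustrative, not from the slides), assuming the digits are already binarized NumPy arrays; the virtual count of 1 is the usual Laplace choice.

```python
import numpy as np

def train_naive_bayes(images, labels, classes=(5, 6)):
    """Estimate Bernoulli Naive Bayes parameters from binary images.

    images: array of shape (n_examples, n_pixels) with 0/1 entries
    labels: array of shape (n_examples,) with values from `classes`
    Returns priors P(Y=v) and per-pixel probabilities P(X_i=1 | Y=v).
    """
    priors, pixel_probs = {}, {}
    for v in classes:
        class_images = images[labels == v]
        priors[v] = len(class_images) / len(images)
        # "Averaging the fives together": the mean of the binary pixels is the
        # fraction of training images of class v in which pixel i is on.
        # Virtual counts (+1 / +2) smooth away zero probabilities.
        pixel_probs[v] = (class_images.sum(axis=0) + 1) / (len(class_images) + 2)
    return priors, pixel_probs
```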

Naïve Bayes Classification
- Predict the class with the largest posterior: argmax over v of P(Y=v) Πᵢ P(Xᵢ|Y=v).

Assorted remarks
- What's nice about Naïve Bayes is that it returns probabilities.
- These probabilities can tell us how confident the algorithm is, so don't throw them away!
- The Naïve Bayes assumption is almost never true.
- Still, Naïve Bayes often performs surprisingly well even when its assumptions do not hold.
- It is a very good method in text processing.
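A classification sketch to pair with the illustrative training function above (again, not from the slides); it works in log space to avoid underflow and returns the posterior probabilities that the remarks above recommend keeping.

```python
import numpy as np

def classify_naive_bayes(image, priors, pixel_probs):
    """Return the predicted class and posterior probabilities for one binary image."""
    log_posteriors = {}
    for v, theta in pixel_probs.items():
        # log P(Y=v) + sum_i log P(X_i = image_i | Y=v) for Bernoulli pixels
        log_lik = np.sum(image * np.log(theta) + (1 - image) * np.log(1 - theta))
        log_posteriors[v] = np.log(priors[v]) + log_lik
    # Normalize in log space to obtain actual posterior probabilities.
    m = max(log_posteriors.values())
    unnorm = {v: np.exp(lp - m) for v, lp in log_posteriors.items()}
    total = sum(unnorm.values())
    posteriors = {v: p / total for v, p in unnorm.items()}
    prediction = max(posteriors, key=posteriors.get)
    return prediction, posteriors
```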

Binary classifier performance

Confusion matrix
- TP - True Positives: the instance is positive and is classified as positive.
- TN - True Negatives: the instance is negative and is classified as negative.
- FP - False Positives: the instance is negative, but is classified as positive.
- FN - False Negatives: the instance is positive, but is classified as negative.

                       Known label
                     positive   negative
Predicted positive      TP         FP
Predicted negative      FN         TN

(also called a contingency table)

Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Information retrieval (IR)
- A query by the user to find documents in the database.
- IR systems allow the user to narrow down the set of documents that are relevant to a particular problem.

- documents containing what I am looking for
- documents not containing what I am looking for

(Figure notes: the white + hatched rectangle is the whole set being searched over; the white area is the set of documents containing what I am looking for, the hatched area the documents that do not contain it; the red rectangle is the set of documents returned by my retrieval system. TP - true positive, FP - false positive, FN - false negative. β is the relative weight of precision: β = 1 weights precision and recall equally (the common choice), β < 1 puts more weight on precision, β > 1 puts more weight on recall.)

Precision = TP / (TP + FP) - how many of the things I consider to be true are actually true?
Recall = TP / (TP + FN) - how much of the true things do I find?

- A perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved).
- A perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved).

Precision
- A measure of exactness.
- A perfect precision score of 1.0 means that every result retrieved by a search was relevant.
- But it says nothing about whether all relevant documents were retrieved.

Recall
- A measure of completeness.
- A perfect recall score of 1.0 means that all relevant documents were retrieved by the search.
- But it says nothing about how many irrelevant documents were also retrieved.

Precision-recall tradeoff
- Returning all documents leads to a perfect recall of 1.0, i.e. all relevant documents are present in the returned set.
- However, precision is then not that great, as not every result is relevant.
- Apparently the relationship between them is inverse: it is possible to increase one at the cost of reducing the other.
- They are not discussed in isolation. Either values of one measure are compared at a fixed level of the other measure (e.g. precision at a recall level of 0.75), or both are combined into the F-measure.

F-measure
Common F1 measure:
F1 = 2 · precision · recall / (precision + recall)

General Fβ measure:
Fβ = (1 + β²) · precision · recall / (β² · precision + recall)

β is the relative weight of precision:
- β = 1 weights precision and recall by the same amount,
- β < 1 puts more weight on precision,
- β > 1 puts more weight on recall.
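A short sketch (illustrative only) of these formulas; the retrieval counts used in the example call are hypothetical.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn, beta=1.0):
    """General F_beta measure; beta=1 gives the common F1 score."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical retrieval result: 25 relevant documents returned (TP),
# 5 irrelevant documents returned (FP), 10 relevant documents missed (FN).
print(precision(25, 5))                # ~0.83
print(recall(25, 10))                  # ~0.71
print(f_measure(25, 5, 10))            # F1
print(f_measure(25, 5, 10, beta=0.5))  # weights precision more
```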

Sensitivity & Specificity
- Measure how good a test is at detecting a binary feature of interest (disease / no disease).
- There are 100 patients; 30 have disease A.
- A test is designed to identify who has the disease and who does not.
- We want to evaluate how good the test is.

Sensitivity & Specificity

          Disease+   Disease-   Total
Test+        25          2        27
Test-         5         68        73
Total        30         70       100

Sensitivity = 25/30; Specificity = 68/70.

Sensitivity measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are identified as sick): TP / (TP + FN).

Specificity measures the proportion of negatives which are correctly identified (e.g. the percentage of healthy people who are identified as healthy): TN / (TN + FP).
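For the 100-patient example above, both quantities can be computed directly from the table; a small illustrative snippet:

```python
# Counts from the disease-test table: 25 true positives, 2 false positives,
# 5 false negatives, 68 true negatives.
tp, fp, fn, tn = 25, 2, 5, 68

sensitivity = tp / (tp + fn)                # 25/30 ~ 0.833
specificity = tn / (tn + fp)                # 68/70 ~ 0.971
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 93/100 = 0.93

print(sensitivity, specificity, accuracy)
```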

Performance Evaluation
- Precision, Positive Predictive Value (PPV): TP / (TP + FP)
- Recall, Sensitivity, True Positive Rate (TPR), Hit rate: TP / P = TP / (TP + FN)
- False Positive Rate (FPR), Fall-out: FP / N = FP / (FP + TN)
- Specificity, True Negative Rate (TNR): TN / (TN + FP) = 1 - FPR
- Accuracy: (TP + TN) / (TP + TN + FP + FN)

Types of classifiers
- A discrete (crisp) classifier: the output is only a class label, e.g. a decision tree.
- A soft classifier: yields a probability (score, confidence) for the given pattern, i.e. a number representing the degree to which an instance is a member of a class. A threshold is used to assign the instance to the (+) or the (-) class. E.g. SVM, neural networks, naïve Bayes.
- The probability (score) can be a strict probability (e.g. a Bayes classifier), or a general, uncalibrated score whose only meaning is that a higher score indicates a higher probability.

ROC Graph
- Receiver Operating Characteristics.
- Plot TPR vs. FPR, i.e. sensitivity vs. (1 - specificity).
- TPR is on the Y axis, FPR on the X axis.
- An ROC graph depicts the relative trade-off between benefits (true positives) and costs (false positives).

(Figure: the ROC square, after Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861-874. The point (0, 0) means never issuing positive classifications; (1, 1) means always issuing positive classifications; (0, 1) is perfect classification; the diagonal through (0.5, 0.5) is random guessing; points above it are better than random, points below it are worse.)

- Any classifier that appears in the lower right triangle performs worse than random guessing. This triangle is therefore usually empty in ROC graphs.
- If we negate a classifier, that is, reverse its classification decisions on every instance, its true positive classifications become false negative mistakes, and its false positives become true negatives. Therefore, any classifier that produces a point in the lower right triangle can be negated to produce a point in the upper left triangle.
- In the figure, E performs much worse than random and is in fact the negation of B.
- Any classifier on the diagonal may be said to have no information about the class. A classifier below the diagonal may be said to have useful information, but it is applying the information incorrectly.
- A is more conservative than B.

- Conservative classifiers make positive classifications only with strong evidence, so they make few false positive errors, but they often have low true positive rates as well.
- Liberal classifiers make positive classifications with weak evidence, so they classify nearly all positives correctly, but they often have high false positive rates.

ROC Curve

(Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers.)
- The data set contains 10 positives and 10 negatives.
- The score is the classifier score, to be thresholded.

Lowering the threshold corresponds to moving from the conservative to the liberal areas of the graph.

(Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers.)
- Each point in the ROC graph is labeled by the score threshold that produces it.
- A threshold of +Inf produces the point (0, 0). As we lower the threshold to 0.9, the first positive instance is classified positive, yielding (0, 0.1).
- As the threshold is further reduced, the curve climbs up and to the right, ending up at (1, 1) with a threshold of 0.1.
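The threshold sweep described above is easy to reproduce; a minimal sketch (illustrative, with made-up scores and assuming all scores are distinct) that emits one ROC point per threshold:

```python
def roc_points(scores, labels):
    """Trace the ROC curve by sweeping a threshold over the classifier scores.

    scores: classifier scores (higher = more positive)
    labels: true labels, 1 for positive and 0 for negative
    Returns (FPR, TPR) points from threshold +inf down to the lowest score.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort instances by decreasing score; lowering the threshold admits them one by one.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0   # threshold = +inf -> (0, 0)
    for score, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical scores for 5 positives (1) and 5 negatives (0):
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   1,   0,   0,   0  ]
print(roc_points(scores, labels))
```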


We have 6 positives and 4 negatives in a test set. All are scored equally.
(Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers.)
- Assume we have a test set containing a sequence of instances, four negatives and six positives, all scored equally by f.
- What happens when we create an ROC curve? In one extreme case, all the positives end up at the beginning of the sequence and we generate the optimistic upper L segment shown in the figure.
- In the opposite extreme, all the negatives end up at the beginning of the sequence and we get the pessimistic lower L shown in the figure.
- Any mixed ordering of the instances will give a different set of step segments within the rectangle formed by these two extremes.
- However, the ROC curve should represent the expected performance of the classifier, which, lacking any other information, is the average of the pessimistic and optimistic segments.

ROC curves are insensitive to changes in class distribution.
(Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers.)
- The class distribution is the proportion of positive to negative instances.
- To see why this is so, consider the confusion matrix from the earlier slide. Note that the class distribution, the proportion of positive to negative instances, is the relationship of the left (+) column to the right (-) column.
- Any performance metric that uses values from both columns will be inherently sensitive to class skews. Metrics such as accuracy, precision, and the F score use values from both columns of the confusion matrix. As the class distribution changes, these measures will change as well, even if the fundamental classifier performance does not.
- ROC graphs are based on the TP rate and FP rate, in which each dimension is a strict columnar ratio, and so do not depend on the class distribution.


(Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers.)

AUC
- To compare classifiers we may want to reduce ROC performance to a single scalar value representing expected performance.
- A common method is to calculate the area under the ROC curve, abbreviated AUC.
- Its value will always be between 0 and 1.0.
- Random guessing has an area of 0.5.
- Any realistic classifier should have an AUC between 0.5 and 1.0.
- The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
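That probabilistic interpretation gives a direct way to compute the AUC; a short sketch (illustrative, reusing the hypothetical scores from the ROC example above):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive is scored higher than
    a random negative (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   1,   0,   0,   0  ]
print(auc(scores, labels))  # 0.84 for this toy example
```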

Classifier B is generally better than A: it has a higher AUC.

However, there is an exception at FPR > 0.55, where A has a slight advantage.

So it is possible for a high-AUC classifier to perform worse in a specific region of ROC space than a low-AUC classifier.

But in practice the AUC performs very well and is often used when a general measure of predictiveness is desired.