Computational BioMedical Informatics
1
Computational BioMedical Informatics
SCE 5095: Special Topics Course
Instructor: Jinbo Bi, Computer Science and Engineering Dept.
2
Course Information
Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: [email protected]
– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Mon./Wed. 2:00pm – 3:15pm
– Location: CAST 204
– Office hours: Mon. 3:30-4:30pm
HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password
– Illustration
3
Review of last chapter
General introduction to the topics in medical informatics, and the data mining techniques involved
Review of some basics of probability and statistics
– More slides on probability and linear algebra uploaded to HuskyCT
In this class, we start to discuss supervised learning: classification and regression
4
Regression and classification
Both regression and classification problems are typically supervised learning problems
The main property of supervised learning:
– A training example contains the input variables and the corresponding target label
– The goal is to find a good mapping from the input variables to the target variable
5
Classification: Definition
Given a collection of examples (training set):
– Each example contains a set of variables (features) and the target variable class
Find a model for the class attribute as a function of the values of the other variables
Goal: previously unseen examples should be assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
6
Classification Application 1
Fraud detection – goal: predict fraudulent cases in credit card transactions.

Training set (past transaction records, already labeled):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute types: Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class label.

Test set (current data, to be predicted with the learned model):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Workflow: Training Set → Learn Classifier → Model → applied to Test Set.
7
Classification: Application 2
Handwritten Digit Recognition
Goal: identify the digit of a handwritten number
Approach:
– Align all images to derive the features
– Model the class (identity) based on these features
8
Illustrating Classification Task
Training set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Induction: a learning algorithm learns a model from the training set.
Deduction: the learned model is applied to the test set.
9
Classification algorithms
K-Nearest-Neighbor classifiers
Naïve Bayes classifier
Neural Networks
Linear Discriminant Analysis (LDA)
Support Vector Machines (SVM)
Decision Tree
Logistic Regression
Graphical models
10
Regression: Definition
Goal: predict the value of one or more continuous target attributes given the values of the input attributes
The difference between classification and regression lies only in the target attribute
– Classification: discrete or categorical target
– Regression: continuous target
Greatly studied in statistics and in the neural network field.
11
Regression application 1
Goal: predict the possible loss from a customer.

Training set (past transaction records, already labeled):

Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3    No      Single          70K             -200
4    Yes     Married         120K            -300
5    No      Divorced        95K             -400
6    No      Married         60K             -500
7    Yes     Divorced        220K            -190
8    No      Single          85K             300
9    No      Married         75K             -240
10   No      Single          90K             90

Attribute types: Refund and Marital Status are categorical, Taxable Income is continuous, Loss is the continuous target.

Test set (current data, to be predicted with the learned model):

Refund  Marital Status  Taxable Income  Loss
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Workflow: Training Set → Learn Regressor → Model → applied to Test Set.
12
Regression applications
Examples:
– Predicting sales amounts of a new product based on advertising expenditure
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices
13
Regression algorithms
Least squares methods
Regularized linear regression (ridge regression)
Neural networks
Support vector machines (SVM)
Bayesian linear regression
14
Practical issues in the training
Underfitting
Overfitting
Before introducing these important concepts, let us study a simple regression algorithm – linear regression
15
Least squares
We wish to use some real-valued input variables x to predict the value of a target y
We collect training data of pairs (x_i, y_i), i = 1, …, N
Suppose we have a model f that maps each example x to a predicted value y'
Sum-of-squares function:
– Sum of the squares of the deviations between the observed target value y and the predicted value y':

E = Σ_{i=1}^{N} (y_i − y'_i)^2 = Σ_{i=1}^{N} (y_i − f(x_i))^2
16
Least squares
Find a function f such that the sum of squares is minimized:

min_f Σ_{i=1}^{N} (y_i − f(x_i))^2

For example, your function is in the form of a linear function f(x) = w^T x:

min_w Σ_{i=1}^{N} (y_i − w^T x_i)^2

Least squares with a linear function of the parameters w is called "linear regression"
17
Linear regression
Linear regression has a closed-form solution for w

E(w) = (1/2) Σ_{i=1}^{N} (y_i − w^T x_i)^2 = (1/2) (y − Xw)^T (y − Xw)

The minimum is attained at the zero derivative:

∂E(w)/∂w = −X^T (y − Xw) = 0

w = (X^T X)^{-1} X^T y
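As an illustration (not from the slides), a minimal MATLAB/Octave sketch of this closed-form solution; the synthetic data and variable names are assumptions made for the example.

% Minimal sketch of the least-squares closed-form solution w = (X'X)^{-1} X'y.
% X is an N-by-d input matrix, y an N-by-1 target vector.
N = 50; d = 3;
X = [ones(N,1) rand(N,d-1)];        % first column of ones acts as an intercept term
w_true = [1; -2; 0.5];
y = X * w_true + 0.1 * randn(N,1);  % targets with additive Gaussian noise

w = (X' * X) \ (X' * y);            % closed-form solution (backslash avoids an explicit inverse)
y_hat = X * w;                      % fitted values
sse = sum((y - y_hat).^2);          % sum-of-squares error from the previous slide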
18
x is evenly distributed in [0,1]
y = f(x) + random error
y = sin(2πx) + ε, ε ~ N(0, σ)
Polynomial Curve Fitting
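A minimal MATLAB/Octave sketch of this curve-fitting experiment, using only built-in polyfit/polyval; the sample size, noise level, and the polynomial orders follow the slides, everything else is an illustrative assumption.

% Fit polynomials of increasing order to noisy sin(2*pi*x) data and compare
% training RMS error with error on fresh data (overfitting shows at order 9).
N = 10;  sigma = 0.3;
x  = linspace(0, 1, N)';                 % evenly spaced inputs in [0,1]
y  = sin(2*pi*x) + sigma * randn(N,1);   % noisy targets
xt = linspace(0, 1, 100)';               % independent test inputs
yt = sin(2*pi*xt) + sigma * randn(100,1);

for M = [0 1 3 9]                        % polynomial orders shown on the slides
    p = polyfit(x, y, M);                % least-squares polynomial coefficients
    rms_train = sqrt(mean((polyval(p, x)  - y ).^2));
    rms_test  = sqrt(mean((polyval(p, xt) - yt).^2));
    fprintf('order %d: train RMS %.3f, test RMS %.3f\n', M, rms_train, rms_test);
end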
19
Polynomial Curve Fitting
20
Sum-of-Squares Error Function
21
0th Order Polynomial
22
1st Order Polynomial
23
3rd Order Polynomial
24
9th Order Polynomial
25
Over-fitting
Root-Mean-Square (RMS) Error:
26
Polynomial Coefficients
27
Data Set Size:
9th Order Polynomial
28
Data Set Size:
9th Order Polynomial
29
Regularization
Penalize large coefficient values
Ridge regression
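A minimal sketch of ridge regression on polynomial features; the order M and the regularization weight lambda are arbitrary choices for illustration.

% Ridge regression: w = (Phi'Phi + lambda*I)^{-1} Phi'y penalizes large coefficients.
M = 9;  lambda = 1e-3;
x = linspace(0, 1, 10)';
y = sin(2*pi*x) + 0.3 * randn(10,1);
Phi = bsxfun(@power, x, 0:M);            % N-by-(M+1) design matrix, columns x.^0 ... x.^M
w_ridge = (Phi' * Phi + lambda * eye(M+1)) \ (Phi' * y);
y_fit = Phi * w_ridge;                   % fit with shrunken coefficients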
30
Regularization:
31
Regularization:
32
Regularization: vs.
33
Polynomial Coefficients
34
Classification
Underfitting or overfitting can also happen in classification approaches
We will illustrate these practical issues on a classification problem
Before the illustration, we introduce a simple classification technique – the K-nearest neighbor method
35
K-nearest neighbor (K-NN)
K-NN is one of the simplest machine learning algorithms
K-NN is a method for classifying test examples based on the closest training examples in the feature space
An example is classified by a majority vote of its neighbors
k is a positive integer, typically small. If k = 1, then the example is simply assigned to the class of its nearest neighbor.
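A minimal MATLAB/Octave sketch of such a K-NN classifier (it would live in a file knn_classify.m); Euclidean distance and integer class labels are assumptions, and this is only one of many possible implementations.

% Classify one test point by majority vote among its k nearest training points.
% Xtrain is N-by-d, ytrain is N-by-1 with integer class labels, xtest is 1-by-d.
function label = knn_classify(Xtrain, ytrain, xtest, k)
    dists = sqrt(sum((Xtrain - repmat(xtest, size(Xtrain,1), 1)).^2, 2));
    [~, idx] = sort(dists);          % ascending distance
    label = mode(ytrain(idx(1:k)));  % majority vote among the k nearest neighbors
end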
36
K-NN
K = 1 and K = 3 (illustration)
37
K-NN on real problem data
• Oil data set
• K acts as a smoother; choosing K is model selection
• As the number of training examples goes to infinity, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).
38
Limitation of K-NN
K-NN is a nonparametric model (no particular functional form is fitted)
Nonparametric models require storing and computing with the entire data set.
Parametric models, once fitted, are much more efficient in terms of storage and computation.
39
Probabilistic interpretation of K-NN
Given a data set with N_k data points from class C_k and Σ_k N_k = N, consider a small region around x of volume V containing K points in total, K_k of them from class C_k. Then

p(x | C_k) = K_k / (N_k V)

and correspondingly p(x) = K / (N V).

Since p(C_k) = N_k / N, Bayes' theorem gives

p(C_k | x) = p(x | C_k) p(C_k) / p(x) = K_k / K
40
Underfit and Overfit (Classification)
500 circular and 500 triangular data points.
Circular points: 0.5 ≤ sqrt(x1^2 + x2^2) ≤ 1
Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5
41
Underfit and Overfit (Classification)
500 circular and 500 triangular data points.
Circular points: 0.5 ≤ sqrt(x1^2 + x2^2) ≤ 1
Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5
42
Underfitting and Overfitting
Overfitting
Underfitting: when the model is too simple, both training and test errors are large
(Figure: training and test error versus number of iterations)
43
Overfitting due to Noise
Decision boundary is distorted by noise point
44
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly. An insufficient number of training records in the region causes the classifier (here a neural net) to predict the test examples using other training records that are irrelevant to the classification task.
45
Notes on Overfitting
Overfitting results in classifiers (a neural net, or a support vector machine) that are more complex than necessary
Training error no longer provides a good estimate of how well the classifier will perform on previously unseen records
Need new ways for estimating errors
46
Occam’s Razor
Given two models with similar generalization errors, one should prefer the simpler model over the more complex one
For complex models, there is a greater chance that the model was fitted accidentally to errors in the data
Therefore, one should include model complexity when evaluating a model
47
How to Address Overfitting
Minimizing training error no longer guarantees a good model (a classifier or a regressor)
We need a better estimate of the error on the true population – the generalization error P_population( f(x) ≠ y )
In practice, design a procedure that gives a better estimate of the error than the training error
In theoretical analysis, find an analytical bound on the generalization error, or use a Bayesian formulation
48
Model Evaluation (pp. 295–304 of the data mining textbook)
Metrics for Performance Evaluation
– How to evaluate the performance of a model?
Methods for Performance Evaluation
– How to obtain reliable estimates?
Methods for Model Comparison
– How to compare the relative performance among competing models?
49
Model Evaluation
Metrics for Performance Evaluation
– How to evaluate the performance of a model?
Methods for Performance Evaluation
– How to obtain reliable estimates?
Methods for Model Comparison
– How to compare the relative performance among competing models?
50
Metrics for Performance Evaluation
Regression:
– Sum of squares
– Sum of deviations
– Exponential function of the deviation
51
Metrics for Performance Evaluation
Focus on the predictive capability of a model
– Rather than how fast it takes to classify or build models, scalability, etc.
Confusion Matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
52
Metrics for Performance Evaluation…
Most widely-used metric:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
53
Limitation of Accuracy
Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example
54
Cost Matrix
                       PREDICTED CLASS
C(i|j)                 Class=Yes     Class=No
ACTUAL    Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS     Class=No     C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i
55
Computing Cost of Classification
Cost Matrix:
                       PREDICTED CLASS
C(i|j)                 +      -
ACTUAL        +        -1     100
CLASS         -        1      0

Model M1:
                       PREDICTED CLASS
                       +      -
ACTUAL        +        150    40
CLASS         -        60     250
Accuracy = 80%, Cost = 3910

Model M2:
                       PREDICTED CLASS
                       +      -
ACTUAL        +        250    45
CLASS         -        5      200
Accuracy = 90%, Cost = 4255
56
Cost vs Accuracy
Count:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

Cost:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    p           q
CLASS     Class=No     q           p

N = a + b + c + d
Accuracy = (a + d) / N
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N − a − d)
     = q N − (q − p)(a + d)
     = N [q − (q − p) Accuracy]

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
57
Cost-Sensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)

Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)

Count:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

A model that declares every record to be the positive class: b = d = 0, so Recall is high
A model that assigns the positive class only to the test records it is sure about: c is small, so Precision is high
58
Cost-Sensitive Measures (Cont’d)
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2 r p / (r + p) = 2a / (2a + b + c)

F-measure is biased towards all except C(No|No)

Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)

Count:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d
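A minimal MATLAB/Octave sketch computing these metrics; y_true and y_pred are assumed to be 0/1 label vectors supplied by the user.

% Confusion-matrix entries and metrics from two label vectors (1 = positive, 0 = negative).
a = sum(y_true == 1 & y_pred == 1);   % TP
b = sum(y_true == 1 & y_pred == 0);   % FN
c = sum(y_true == 0 & y_pred == 1);   % FP
d = sum(y_true == 0 & y_pred == 0);   % TN

accuracy  = (a + d) / (a + b + c + d);
precision = a / (a + c);
recall    = a / (a + b);
F         = 2 * recall * precision / (recall + precision);   % = 2a / (2a + b + c)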
59
Model Evaluation
Metrics for Performance Evaluation
– How to evaluate the performance of a model?
Methods for Performance Evaluation
– How to obtain reliable estimates?
Methods for Model Comparison
– How to compare the relative performance among competing models?
60
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets
61
Learning Curve
The learning curve shows how accuracy changes with varying sample size
Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley et al.)
– Geometric sampling (Provost et al.)
Effect of small sample size:
– Bias in the estimate
– Variance of the estimate
62
Methods of Estimation
Holdout
– Reserve 2/3 for training and 1/3 for testing
Random subsampling
– Repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k = n
Stratified sampling
– oversampling vs undersampling
Bootstrap
– Sampling with replacement
63
Methods of Estimation (Cont'd)
Holdout method
– Given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout; repeat holdout k times, accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
– At the i-th iteration, use D_i as the test set and the others as the training set
– Leave-one-out: k folds where k = # of tuples, for small sized data
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
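A minimal MATLAB/Octave sketch of k-fold cross-validation; train_fn and predict_fn are hypothetical placeholders for whatever classifier is being evaluated, and X, y are the training data.

% k-fold cross-validation: random partition, train on k-1 folds, test on the held-out fold.
k = 10;
N = size(X, 1);
perm = randperm(N);                      % random ordering of the examples
fold = mod(0:N-1, k) + 1;                % fold index 1..k for each position
acc = zeros(k, 1);
for i = 1:k
    test_idx  = perm(fold == i);
    train_idx = perm(fold ~= i);
    model     = train_fn(X(train_idx,:), y(train_idx));      % placeholder training routine
    y_hat     = predict_fn(model, X(test_idx,:));             % placeholder prediction routine
    acc(i)    = mean(y_hat == y(test_idx));
end
cv_accuracy = mean(acc);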
64
Methods of Estimation (Cont'd)
Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
Several bootstrap methods; a common one is the .632 bootstrap
– Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 − 1/d)^d ≈ e^{-1} = 0.368)
– Repeat the sampling procedure k times; the overall accuracy of the model is

acc(M) = Σ_{i=1}^{k} ( 0.632 × acc(M_i)_test_set + 0.368 × acc(M_i)_train_set )
65
Model Evaluation
Metrics for Performance Evaluation
– How to evaluate the performance of a model?
Methods for Performance Evaluation
– How to obtain reliable estimates?
Methods for Model Comparison
– How to compare the relative performance among competing models?
66
ROC (Receiver Operating Characteristic)
Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms
The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)
The performance of each classifier is represented as a point on the ROC curve
If the classifier returns a real-valued prediction,
– changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point
67
ROC Curve
At threshold t: TP = 50, FN = 50, FP = 12, TN = 88

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
68
ROC Curve
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

(TPR, FPR) points:
(0,0): declare everything to be the negative class – TP = 0, FP = 0
(1,1): declare everything to be the positive class – FN = 0, TN = 0
(1,0): ideal – FN = 0, FP = 0
69
ROC Curve
(TPR, FPR) points:
(0,0): declare everything to be the negative class
(1,1): declare everything to be the positive class
(1,0): ideal
Diagonal line:
– Random guessing
– Below the diagonal line: the prediction is opposite of the true class
70
How to Construct an ROC curve
Instance P(+|A) True Class
1 0.95 +
2 0.93 +
3 0.87 -
4 0.85 -
5 0.85 -
6 0.85 +
7 0.76 -
8 0.53 +
9 0.43 -
10 0.25 +
• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP+TN)
71
How to Construct an ROC curve
Instance P(+|A) True Class
1 0.95 +
2 0.93 +
3 0.87 -
4 0.85 -
5 0.85 -
6 0.85 +
7 0.76 -
8 0.53 +
9 0.43 -
10 0.25 +
• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Pick a threshold, e.g., 0.85
• p >= 0.85: predicted positive; p < 0.85: predicted negative
• TP = 3, FP = 3, TN = 2, FN = 2
• TP rate, TPR = 3/5 = 60%
• FP rate, FPR = 3/5 = 60%
72
How to construct an ROC curve

Class (sorted by P):  +     -     +     -     -     -     +     -     +     +
Threshold >=          0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP                    5     4     4     3     3     3     3     2     2     1     0
FP                    5     5     4     4     3     2     1     1     0     0     0
TN                    0     0     1     1     2     3     4     4     5     5     5
FN                    0     1     1     2     2     2     2     3     3     4     5
TPR                   1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR                   1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: plot of the (FPR, TPR) pairs above
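A minimal MATLAB/Octave sketch of this construction using the ten instances from the table; the particular threshold sweep and the plotting call are illustrative choices.

% Sort by predicted probability, sweep a threshold, and collect (FPR, TPR) pairs.
scores = [0.95 0.93 0.87 0.85 0.85 0.85 0.76 0.53 0.43 0.25]';
labels = [ 1    1   -1   -1   -1    1   -1    1   -1    1 ]';
P = sum(labels == 1);  Nneg = sum(labels == -1);
thr = [unique(scores); Inf];             % include a threshold above every score
TPR = zeros(size(thr));  FPR = zeros(size(thr));
for i = 1:numel(thr)
    pred = scores >= thr(i);             % predict positive at or above the threshold
    TPR(i) = sum(pred & labels == 1)  / P;
    FPR(i) = sum(pred & labels == -1) / Nneg;
end
plot(FPR, TPR, '-o'); xlabel('FPR'); ylabel('TPR');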
73
Using ROC for Model Comparison
No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR
Area Under the ROC curve (AUC)
– Ideal: area = 1
– Random guess: area = 0.5
74
Revisit K-Nearest Neighbor
K-NN:
– Instance-based algorithm: uses the k "closest" points (nearest neighbors) to perform classification
– k-NN classifiers are lazy learners (they do not build models explicitly)
– Classifying unknown examples is relatively expensive compared to model-learning algorithms (or parametric approaches)
75
Nearest Neighbor Classifiers
Basic idea:
– If it walks like a duck and quacks like a duck, then it's probably a duck
(Figure: compute the distance from the test record to the training records, then choose the k "nearest" records)
76
Nearest-Neighbor Classifiers
Requires three things:
– The set of stored examples
– A distance metric to compute the distance between examples
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute the distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
77
Definition of Nearest Neighbor
(Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
78
1 nearest-neighbor
Voronoi Diagram
79
Nearest Neighbor Classification
Compute the distance between two points:
– Euclidean distance

d(p, q) = sqrt( Σ_i (p_i − q_i)^2 )

Determine the class from the nearest neighbor list
– take the majority vote of the class labels among the k nearest neighbors
– Weigh the vote according to distance, e.g., weight factor w = 1/d^2
80
Nearest Neighbor Classification…
Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
81
Nearest Neighbor Classification…
Scaling issues
– Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
– Example:
  height of a person may vary from 1.5m to 1.8m
  weight of a person may vary from 90lb to 300lb
  income of a person may vary from $10K to $1M
82
Nearest Neighbor Classification…
Problems with the Euclidean measure:
– High dimensional data: curse of dimensionality; one solution is to do dimension reduction first
– It can produce counter-intuitive results, e.g., for binary vectors:

1 1 1 1 1 1 1 1 1 1 1 0   vs   0 1 1 1 1 1 1 1 1 1 1 1     d = 1.4142
1 0 0 0 0 0 0 0 0 0 0 0   vs   0 0 0 0 0 0 0 0 0 0 0 1     d = 1.4142

Solution: normalize the data
83
Data normalization
Example-wise normalization
– Each example is normalized and mapped to the unit sphere
Feature-wise normalization
– [0,1]-normalization: normalize each feature into the unit interval
– Standard normalization: normalize each feature to have mean 0 and standard deviation 1
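A minimal MATLAB/Octave sketch of the three normalization schemes for a data matrix X (rows = examples, columns = features); repmat is used so that no toolboxes or recent language features are needed.

% Example-wise: scale each row to unit Euclidean length.
row_norms = sqrt(sum(X.^2, 2));
X_unit = X ./ repmat(row_norms, 1, size(X,2));

% Feature-wise [0,1] normalization: map each column into [0,1].
mn = min(X, [], 1);  mx = max(X, [], 1);
X_01 = (X - repmat(mn, size(X,1), 1)) ./ repmat(mx - mn, size(X,1), 1);

% Feature-wise standardization: zero mean, unit standard deviation per column.
X_std = (X - repmat(mean(X,1), size(X,1), 1)) ./ repmat(std(X,0,1), size(X,1), 1);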
84
Training data is given.
– Each object is associated with a class label Y ∈ {1, 2, …, K} and a feature vector of d measurements: X = (X1, …, Xd).
Build a model from the training data.
Unseen objects are to be classified as belonging to one of a number of predefined classes {1, 2, …, K}.
Linear Discriminant Analysis / Fisher's linear discriminant
Classification
85
Two classes
(Figure: two classes plotted against Variable 1 and Variable 2, with class means μ1 and μ2 and the best projection axis)
86
Three classes
87
Classifiers are built from a training set (learning set) L = (X1, Y1), ..., (Xn, Yn)
A classifier C built from a learning set L is a mapping
C: X → {1, 2, ..., K}
The Bayes classifier is based on the conditional densities p(Ck | X): C(X) = arg max_k p(Ck | X)
This is a maximum a posteriori rule, and p(Ck | X) is a posterior density
Classifiers
88
The Rules of Probability
Sum Rule: p(X) = Σ_Y p(X, Y)
Product Rule: p(X, Y) = p(X | Y) p(Y)
Bayes' Rule: p(Y | X) = p(X | Y) p(Y) / p(X)
posterior ∝ likelihood × prior: p(Y = C | X = data) ∝ p(X | Y) p(Y)
– the denominator p(X) is irrelevant to Y = C
89
p(Ck | X) = p(X | Ck) p(Ck) / p(X)
Find a class label C(X) so that max_k p(Ck | X) = max_k p(X | Ck) p(Ck)
Naïve Bayes assumes independence among all features (last class):
– p(X | Ck) = p(x1 | Ck) p(x2 | Ck) . . . p(xd | Ck)
– a very strong assumption
Maximum a posteriori
90
Assume multivariate Gaussian (normal) class densities X | Y = k ~ N(μ_k, Σ_k):

p(X | C_k) = 1 / ( (2π)^{d/2} det(Σ_k)^{1/2} ) exp( −(1/2) (X − μ_k)^T Σ_k^{-1} (X − μ_k) )

Maximizing the posterior is equivalent to maximizing p(X | C_k) p(C_k), and equivalent to maximizing the logarithm of p(X | C_k) p(C_k):

log p(X | C_k) p(C_k) = −(1/2) (X − μ_k)^T Σ_k^{-1} (X − μ_k) − (1/2) log( (2π)^d det(Σ_k) ) + log p(C_k)

C(X) = arg min_k { (X − μ_k)^T Σ_k^{-1} (X − μ_k) + log |Σ_k| − 2 log p(C_k) }

Multivariate normal distribution for each class
91
Two-class case
C(X) = C1 if p(X | C1) p(C1) ≥ p(X | C2) p(C2), otherwise C(X) = C2

Equivalently,

p(X | C1) / p(X | C2) ≥ p(C2) / p(C1),   i.e.   log( p(X | C1) / p(X | C2) ) ≥ log( p(C2) / p(C1) )

With Gaussian class densities, the rule compares

(X − μ1)^T Σ1^{-1} (X − μ1) − (X − μ2)^T Σ2^{-1} (X − μ2) + log |Σ1| − log |Σ2|

against a threshold T.
92
Guassian discriminant rule
Gaussian discriminant rule
For multivariate Gaussian (normal) class densities X | Y = k ~ N(μ_k, Σ_k), the classification rule is

C(X) = arg min_k { (X − μ_k)^T Σ_k^{-1} (X − μ_k) + log |Σ_k| }

In general, this is a quadratic rule (Quadratic Discriminant Analysis, or QDA)
In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated by the corresponding sample quantities
93
Sample mean and variance

Class mean: μ_i = (1 / |C_i|) Σ_{x ∈ C_i} x

Class covariance: Σ_i = (1 / |C_i|) Σ_{x ∈ C_i} (x − μ_i)(x − μ_i)^T
94
Example: three data points X1 = (1, 0, 1)^T, X2 = (2, 1, 1)^T, X3 = (0, 2, 1)^T

Sample mean: μ = (1/3)(X1 + X2 + X3) = (1, 1, 1)^T

Sample covariance:
Σ = (1/3) [ (X1 − μ)(X1 − μ)^T + (X2 − μ)(X2 − μ)^T + (X3 − μ)(X3 − μ)^T ]
  = (1/3) ( [0 0 0; 0 1 0; 0 0 0] + [1 0 0; 0 0 0; 0 0 0] + [1 -1 0; -1 1 0; 0 0 0] )
  = (1/3) [2 -1 0; -1 2 0; 0 0 0]
95
Two-class case
If the two classes have the same covariance matrix, Σ_k = Σ, the discriminant rule is linear (Linear Discriminant Analysis, or LDA; FLDA for K = 2):

The quadratic rule

(X − μ1)^T Σ1^{-1} (X − μ1) − (X − μ2)^T Σ2^{-1} (X − μ2) + log |Σ1| − log |Σ2| ≤ T

becomes

(μ1 − μ2)^T Σ^{-1} X ≥ c,   i.e.   w^T X ≥ c   where   w = Σ^{-1} (μ1 − μ2)

Usually, Σ = (n1 Σ1 + n2 Σ2) / n
96
Illustration
(Figure: two Gaussian classes with means μ1 and μ2)
97
Two-class case

Maximize the signal-to-noise ratio (between-class separation over within-class cohesion):

max_w ( w^T Σ_between w ) / ( w^T Σ_within w )

where Σ_between = (μ1 − μ2)(μ1 − μ2)^T and Σ_within = (n1 Σ1 + n2 Σ2) / n

The solution is w = Σ_within^{-1} (μ1 − μ2)
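A minimal MATLAB/Octave sketch of this two-class Fisher/LDA direction; X1 and X2 are assumed to hold the examples of the two classes (one row per example), and the midpoint threshold is one simple choice.

% Class means, pooled within-class covariance, LDA direction, and a threshold.
mu1 = mean(X1, 1)';  mu2 = mean(X2, 1)';
n1 = size(X1, 1);    n2 = size(X2, 1);
S1 = cov(X1, 1);     S2 = cov(X2, 1);           % normalized by n (second argument 1)
Sw = (n1 * S1 + n2 * S2) / (n1 + n2);           % pooled within-class covariance
w  = Sw \ (mu1 - mu2);                          % LDA direction
threshold = w' * (mu1 + mu2) / 2;               % classify x as class 1 if w'*x >= threshold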
98
Two-class case (illustration)
LDA gives the yellow direction
Two classes overlap
Two classes are separated
99
Two-class case (illustration)
(Figure: class means μ1 and μ2, the LDA axis along μ2 − μ1, and the best threshold on the projected axis)
100
Multi-class case
Two approaches
– Apply Fisher LDA to each "one-versus-rest" problem
101
Multi-class case
Between-class matrix: S_b = (1/n) Σ_{k=1}^{K} n_k (μ_k − μ)(μ_k − μ)^T

Within-class matrix: S_w = (1/n) Σ_{i=1}^{K} Σ_{x ∈ C_i} (x − μ_i)(x − μ_i)^T

Second approach: similarly, find multiple directions that form a low-dimensional space.
The transformation matrix W that projects the data to be most separable is the matrix that maximizes

max_W (W^T S_b W) / (W^T S_w W)

The correct way to write it is

max_W trace( (W^T S_w W)^{-1} (W^T S_b W) )
102
Intuition

The goal is to simultaneously maximize the between-class separation and minimize the within-class cohesion
The solution to max_W trace( (W^T S_w W)^{-1} (W^T S_b W) ) is a generalized eigenvalue problem S_b g = λ S_w g
The generalized eigenvectors are the eigenvectors obtained by solving S_w^{-1} S_b
103
Graphic view of the transformation (projection)
– Training data matrix A of size n × d
– Transformation matrix W of size d × (K − 1)
– Reduced training data L = A W of size n × (K − 1)
104
Graphical view of classification
– Training data matrix A of size n × d is projected with G of size d × (K − 1) to L = A G of size n × (K − 1)
– A test data point h of size 1 × d is projected to h G of size 1 × (K − 1)
– Find the nearest neighbor or nearest centroid in the reduced space
105
First applied by M. Barnard at the suggestion of R. A. Fisher (1936); Fisher linear discriminant analysis (FLDA) consists of:
Dimension reduction
– Find linear combinations of the features X = X1, ..., Xd with large ratios of between-groups to within-groups sums of squares – the discriminant variables
Classification
– Predict the class of an observation X by the class whose mean vector is closest to X in terms of the discriminant variables
Summary
106
We just introduced Fisher discriminant analysis, particularly linear discriminant analysis
Now let us discuss Support Vector Machine
107
History of SVM
SVM is inspired by statistical learning theory [3].
SVM was first introduced in 1992 [1].
SVM became popular because of its success in handwritten digit recognition [2].
SVM is now regarded as an important example of "kernel methods", arguably the hottest area in machine learning. http://www.kernel-machines.org/

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 1st edition, Springer, 1996.
108
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
109
Support Vector Machines
One Possible Solution
B1
110
Support Vector Machines
Another possible solution
B2
111
Support Vector Machines
Other possible solutions
B2
112
Support Vector Machines
Which one is better? B1 or B2? How do you define better?
B1
B2
113
Support Vector Machines
Find the hyperplane that maximizes the margin => B1 is better than B2
B1
B2
b11
b12
b21b22
margin
114
Support Vector Machines
B1: w · x + b = 0
Margin hyperplanes through b11 and b12: w · x + b = 1 and w · x + b = −1

f(x) = 1 if w · x + b ≥ 1
f(x) = −1 if w · x + b ≤ −1

Margin = 2 / ||w||
115
Support Vector Machines
What if the problem is not linearly separable?
116
Nonlinear Support Vector Machines
What if decision boundary is not linear?
117
Nonlinear Support Vector Machines
Transform data into higher dimensional space
118
Outline of SVM lecture
Linear classifier
Maximum margin classifier
– Estimate the margin
SVM for separable data
SVM for non-separable data
119
Linear classifiers
f(x, w, b) = sign(w · x + b)
(Plot legend: one marker denotes +1, the other denotes −1)
How would you classify this data?
120
Linear classifiers
f(x, w, b) = sign(w · x + b)
(Plot legend: one marker denotes +1, the other denotes −1)
How would you classify this data?
121
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.
122
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM): the linear SVM.
123
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM): the linear SVM.
Support vectors are those data points that the margin pushes up against.
124
Why Maximum Margin?
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin; this is the simplest kind of SVM (called an LSVM).
Support vectors are those data points that the margin pushes up against.
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. The model is immune to removal of any non-support-vector data points.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
125
Estimate the Margin
What is the distance expression for a point x to a line w·x + b = 0?

d(x) = |w · x + b| / ||w||_2 = |w · x + b| / sqrt( Σ_{i=1}^{d} w_i^2 )
126
Estimate the Margin
(Figure: hyperplane w·x + b = 0, a point x, and a point y on the hyperplane; the distance from x to the hyperplane is measured along w)

Let y be a point on the hyperplane, so w·y + b = 0. Writing x = y + r w / ||w|| for some scalar r, and using w·y + b = 0, we have w·x + b = r ||w||, so the distance is

distance(x) = |w · x + b| / ||w|| = |w · x + b| / sqrt(w · w)
127
Estimate the Margin
What is the expression for the margin?

margin = min_{x ∈ D} d(x) = min_{x ∈ D} |w · x + b| / sqrt( Σ_{i=1}^{d} w_i^2 )
128
Maximize Margin
(Figure: separating hyperplane w·x + b = 0 with its margin)

argmax_{w,b} margin(w, b, D)
= argmax_{w,b} min_{x_i ∈ D} d(x_i)
= argmax_{w,b} min_{x_i ∈ D} |w · x_i + b| / sqrt( Σ_{j=1}^{d} w_j^2 )
129
Maximize Margin
denotes +1
denotes -1wx +b =
0
2,1
argmax arg min
subject to : 0
i
i
db Dii
i i i
b
w
D y b
w x
x w
x x w
Margin
Min-max problem
130
Maximize Margin

argmax_{w,b} min_{x_i ∈ D} |w · x_i + b| / sqrt( Σ_{j=1}^{d} w_j^2 )
subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0

Strategy: require ∀ x_i ∈ D : |x_i · w + b| ≥ 1. The problem then becomes

argmin_{w,b} Σ_{i=1}^{d} w_i^2
subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1
131
Maximum Margin Linear Classifier

{w*, b*} = argmin_{w,b} Σ_{k=1}^{d} w_k^2
subject to
y_1 (w · x_1 + b) ≥ 1
y_2 (w · x_2 + b) ≥ 1
...
y_N (w · x_N + b) ≥ 1

How to solve it?
132
Learning via Quadratic Programming
QP is a well-studied class of optimization problems: maximize (or minimize) a quadratic function of some real-valued variables subject to linear constraints.
Available open-source solvers:
– SVMLight http://svmlight.joachims.org/
– LibSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
– Matlab optimization toolbox
133
Quadratic Programming

Find argmax_u c + d^T u + (1/2) u^T R u   (quadratic criterion)

subject to n additional linear inequality constraints

a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
...
a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

and e additional linear equality constraints

a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
...
a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)
134
Quadratic Programming of SVM

{w*, b*} = argmin_{w,b} w^T w
subject to
y_1 (w · x_1 + b) ≥ 1
y_2 (w · x_2 + b) ≥ 1
...                                     N inequality constraints
y_N (w · x_N + b) ≥ 1

i.e., minimize a quadratic objective subject to one inequality constraint per training example (x_i, y_i).
135
Non-separable
(Figure: data that are not linearly separable)
This is going to be a problem! What should we do?
136
This is going to be a problem! What should we do?
Idea 1: find the minimum ||w||^2 while minimizing the number of training set errors.
Problemette: two things to minimize makes for an ill-defined optimization.

Separable case, for reference:
argmin_{w,b} Σ_{i=1}^{d} w_i^2
subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1
137
This is going to be a problem! What should we do?
Idea 1.1: minimize ||w||^2 + C (# train errors), where C is a tradeoff parameter.

Some points will violate y_i (w · x_i + b) ≥ 1.
We allow errors to occur: y_i (w · x_i + b) ≥ 1 − ε_i, with ε_i ≥ 0 (hinge loss).
138
This is going to be a problem! What should we do?
Idea 2.0: minimize ||w||^2 + C (distance of error points to their correct place), i.e.

minimize ||w||^2 + C Σ_{i=1}^{N} ε_i
subject to y_i (w · x_i + b) ≥ 1 − ε_i, ε_i ≥ 0
139
Balance the trade-off between margin and classification errors

{w*, b*} = argmin_{w,b} Σ_{j=1}^{d} w_j^2 + c Σ_{i=1}^{N} ε_i
subject to
y_1 (w · x_1 + b) ≥ 1 − ε_1, ε_1 ≥ 0
y_2 (w · x_2 + b) ≥ 1 − ε_2, ε_2 ≥ 0
...
y_N (w · x_N + b) ≥ 1 − ε_N, ε_N ≥ 0

(Figure: linearly inseparable case with slack variables ε_1, ε_2, ε_3 marked)
140
Determining value for c
How do we determine the appropriate value for c? Cross-validation on the training data:
– Take possible choices for c
– For each choice, run a cross-validation procedure and calculate the error metric (chosen properly)
– Find the choice that achieves the best metric
– Use the best choice on all training data
141
A toy example on SVM (assignment 2)
Training data (X, y):

x1       x2       y
0.8281   1.3162   1
2.0391   1.1447   2
1.9653   2.2966   2
0.4878   2.3856   1
0.3570   0.5606   1
1.4951   1.4693   2
2.8792   1.3368   2
1.0212   1.9389   1
1.7558   2.1281   2
0.6714   2.2641   1

(Figure: scatter plot of the 10 training points in the x1–x2 plane)
142
Separable case
Separable case:

argmin_{w,b} Σ_{i=1}^{d} w_i^2
subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1

Matlab script (assumes the labels y are coded as +1/-1):

[N,d] = size(X);
% constraints: y_i * ([x_i 1] * [w; b]) >= 1
A = diag(y) * [X ones(N,1)];
Rhs = ones(N,1);
% objective: quadratic in w, no penalty on b
H = [eye(d) zeros(d,1)];
H = [H; [zeros(1,d) 0]];
f = zeros(d+1, 1);
% solve the QP; sol = [w; b]
[sol,FVAL,EXITFLAG,OUTPUT] = quadprog(H,f,-A,-Rhs);
143
Inseparable case
min_{w,b,ε} Σ_{i=1}^{d} w_i^2 + c Σ_{i=1}^{N} ε_i
subject to y_i (w · x_i + b) ≥ 1 − ε_i, ε_i ≥ 0

Matlab script (assumes the labels y are coded as +1/-1 and that c has been set):

[N,d] = size(X);
% constraints: y_i * ([x_i 1] * [w; b]) + eps_i >= 1
A = [diag(y) * [X ones(N,1)] eye(N)];
Rhs = ones(N,1);
% objective: quadratic in w only, linear in the slacks
H = [eye(d) zeros(d,1+N)];
H = [H; zeros(1+N, d+1+N)];
f = [zeros(d+1, 1); c*ones(N,1)];
% bound constraints: w and b free, slacks nonnegative
Lb = [-Inf * ones(d+1,1); zeros(N,1)];
% solve the QP; sol = [w; b; eps]
[sol,FVAL,EXITFLAG,OUTPUT] = quadprog(H,f,-A,-Rhs,[],[],Lb);
144
Next couple of slides are backup slides (not required in this class)
145
Support Vector Machine for Noisy Data
Interpretation of the slack variables ε_i:
– ε_i > 1: misclassification, i.e. y_i (w x_i + b) < 0
– 0 < ε_i < 1: x_i is correctly classified, but lies inside the margin
– ε_i = 0: x_i is correctly classified, and lies outside the margin

Σ_{i=1}^{k} ε_i is an upper bound on the number of training errors.

(Figure: Class 1 and Class 2 with some points inside the margin)
146
Support Vector Machine for Noisy Data
{w*, b*} = argmin_{w,b} Σ_{j=1}^{d} w_j^2 + c Σ_{i=1}^{N} ε_i
subject to
y_1 (w · x_1 + b) ≥ 1 − ε_1, ε_1 ≥ 0
y_2 (w · x_2 + b) ≥ 1 − ε_2, ε_2 ≥ 0      N inequality constraints
...
y_N (w · x_N + b) ≥ 1 − ε_N, ε_N ≥ 0

How do we determine the appropriate value for c?
• Cross-validation
147
Support Vector Machine for Noisy Data
General optimization problem: minimize f(w) subject to g_i(w) ≥ 0, i = 1, ..., k.

Define the Lagrangian: L_p(w, α) = f(w) − Σ_{i=1}^{k} α_i g_i(w) = f(w) − α^T g(w)

Lagrangian dual problem: maximize L_D(α) = inf_w L_p(w, α) subject to α ≥ 0

Weak duality theorem: L_D(α) ≤ f(w) for any feasible w and any α ≥ 0

Duality gap: f(w*) − L_D(α*), where w* is the minimizer of the Lagrangian with respect to w and α* is the maximizer of the Lagrangian dual with respect to α

If the constraints g are linear functions of w, then the duality gap is 0, and α_i* g_i(w*) = 0 for all i.
148
Support Vector Machine for Noisy Data
Karush-Kuhn-Tucker Conditions

∂L_p(w*, α*)/∂w = 0
∂L_p(w*, α*)/∂α = 0
α_i* g_i(w*) = 0, i = 1, ..., k   (complementarity condition)
g_i(w*) ≥ 0, i = 1, ..., k        (feasibility condition)
α_i* ≥ 0, i = 1, ..., k
149
Support Vector Machine for Noisy Data
Use the Lagrangian formulation for the optimization problem.
Introduce a positive Lagrangian multiplier for each inequality constraint:
– α_i ≥ 0 for y_i (w x_i + b) − 1 + ε_i ≥ 0, for all i
– μ_i ≥ 0 for ε_i ≥ 0, for all i

We get the following Lagrangian:

L_p = (1/2) ||w||^2 + c Σ_i ε_i − Σ_i α_i [ y_i (w x_i + b) − 1 + ε_i ] − Σ_i μ_i ε_i
150
Support Vector Machine for Noisy Data
Take the derivatives of L_p with respect to w, b, and ε_i:

∂L_p/∂b = −Σ_i α_i y_i = 0   →   Σ_i α_i y_i = 0

∂L_p/∂w = w − Σ_i α_i y_i x_i = 0   →   w = Σ_i α_i y_i x_i

∂L_p/∂ε_i = c − α_i − μ_i = 0   →   0 ≤ α_i ≤ c

Substituting back gives the dual function

L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)

Both ε_i and its multiplier μ_i are not involved in the dual function.
151
The Dual Form of QP
Maximize

L_D(α) = Σ_{k=1}^{R} α_k − (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,   where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:

0 ≤ α_k ≤ c for all k,   Σ_{k=1}^{R} α_k y_k = 0

Then define:

w = Σ_{k=1}^{R} α_k y_k x_k
152
The Dual Form of QP
Maximize

L_D(α) = Σ_{k=1}^{R} α_k − (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,   where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:

0 ≤ α_k ≤ C for all k,   Σ_{k=1}^{R} α_k y_k = 0

Then define:

w = Σ_{k=1}^{R} α_k y_k x_k

Then classify with:

f(x, w, b) = sign(w · x + b)
153
An Equivalent QP
Maximize

Σ_{k=1}^{R} α_k − (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,   where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:

0 ≤ α_k ≤ c for all k,   Σ_{k=1}^{R} α_k y_k = 0

Then define:

w = Σ_{k=1}^{R} α_k y_k x_k

Data points with α_k > 0 will be the support vectors, so this sum only needs to be over the support vectors.
154
Support Vectors
(Figure: separating hyperplane with margin hyperplanes w·x + b = 1 and w·x + b = −1; the points lying on the margin are the support vectors)

The decision boundary is determined only by the support vectors!

w = Σ_{k=1}^{R} α_k y_k x_k

For support vectors i: α_i > 0 and y_i (w · x_i + b) − 1 = 0
α_i = 0 for non-support vectors
155
The Dual Form of QP
Maximize

Σ_{k=1}^{R} α_k − (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,   where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:

0 ≤ α_k ≤ c for all k,   Σ_{k=1}^{R} α_k y_k = 0

Then define:

w = Σ_{k=1}^{R} α_k y_k x_k

Then classify with:

f(x, w, b) = sign(w · x + b)

How to determine b?
156
An Equivalent QP: Determine b

First approach: fix w and solve a linear programming problem for b and the slacks:

ε* = argmin_{b, ε} Σ_{i=1}^{N} ε_i
subject to
y_1 (w · x_1 + b) ≥ 1 − ε_1, ε_1 ≥ 0
y_2 (w · x_2 + b) ≥ 1 − ε_2, ε_2 ≥ 0
...
y_N (w · x_N + b) ≥ 1 − ε_N, ε_N ≥ 0

(Compare with the full problem {w*, b*} = argmin_{w,b} Σ_j w_j^2 + c Σ_i ε_i under the same constraints.)

Another approach based on support vectors: for any support vector x_i with 0 < α_i < c we have ε_i = 0 and

y_i (w · x_i + b) − 1 = 0,   so   b = 1/y_i − w · x_i = y_i − w · x_i
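A minimal MATLAB/Octave sketch of recovering w and b from a dual solution; alpha is assumed to be the vector of multipliers returned by some dual QP solver, c the trade-off parameter, X the N-by-d data matrix, and y the +1/-1 labels.

% Recover the primal solution from the dual variables.
w = X' * (alpha .* y);                        % w = sum_i alpha_i y_i x_i
tol = 1e-6;
sv = find(alpha > tol & alpha < c - tol);     % margin support vectors (epsilon_i = 0)
b  = mean(y(sv) - X(sv,:) * w);               % y_i (w'x_i + b) = 1  =>  b = y_i - w'x_i
f  = sign(X * w + b);                         % classify the training points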