© Vipin Kumar CSci 8980 Fall 2002 1
CSci 8980: Data Mining (Fall 2002)
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science
University of Minnesota
http://www.cs.umn.edu/~kumar
Model Evaluation
Metrics for Performance Evaluation
– How to evaluate the performance of a model?
Methods for Performance Evaluation
– How to obtain reliable estimates?
Methods for Model Comparison
– How to compare the relative performance among competing models?
Metrics for Performance Evaluation
Focus on the predictive capability of a model
– Rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:

                  PREDICTED CLASS
ACTUAL CLASS      Class=Yes   Class=No
Class=Yes         a           b
Class=No          c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation…
Most widely-used metric:

                  PREDICTED CLASS
ACTUAL CLASS      Class=Yes   Class=No
Class=Yes         a (TP)      b (FN)
Class=No          c (FP)      d (TN)

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
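As a minimal sketch of the accuracy metric, the snippet below counts the four confusion-matrix cells from paired label lists and then computes accuracy from them. The function names and the example labels are illustrative assumptions, not part of the lecture material.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (TP, FN, FP, TN) for two equally long label sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1  # cell a
            else:
                fn += 1  # cell b
        else:
            if p == positive:
                fp += 1  # cell c
            else:
                tn += 1  # cell d
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical labels, used only to exercise the functions.
actual = ["Yes", "Yes", "No", "No", "No"]
predicted = ["Yes", "No", "Yes", "No", "No"]
print(accuracy(*confusion_counts(actual, predicted)))
```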
Cost Matrix

                  PREDICTED CLASS
C(i|j)            Class=Yes    Class=No
Class=Yes         C(Yes|Yes)   C(No|Yes)
Class=No          C(Yes|No)    C(No|No)

(rows are the ACTUAL CLASS)
C(i|j): Cost of misclassifying a class j example as class i
Accuracy is a useful measure if
– C(Yes|No) = C(No|Yes) and C(Yes|Yes) = C(No|No)
– P(Yes) = P(No) (class distributions are equal)
Cost vs Accuracy
Cost Matrix

                  PREDICTED CLASS
C(i|j)            +       -
+                 -1      100
-                 1       0

(rows are the ACTUAL CLASS)

Model M1
                  PREDICTED CLASS
ACTUAL CLASS      +       -
+                 150     40
-                 60      250

Accuracy = 80%
Cost = 3910

Model M2
                  PREDICTED CLASS
ACTUAL CLASS      +       -
+                 250     45
-                 5       200

Accuracy = 90%
Cost = 4255
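The accuracy and cost figures for M1 and M2 can be reproduced with a short sketch. The dictionary layout (keys are `(actual, predicted)` pairs) and the function names are assumptions made for this illustration; the numbers come from the cost matrix and the two confusion matrices above.

```python
# Keys are (actual class, predicted class).
COST = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
M1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
M2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}


def total_cost(counts, cost):
    """Sum of count * cost over all four confusion-matrix cells."""
    return sum(counts[cell] * cost[cell] for cell in counts)


def accuracy(counts):
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = counts[("+", "+")] + counts[("-", "-")]
    return correct / sum(counts.values())


for name, model in (("M1", M1), ("M2", M2)):
    print(name, accuracy(model), total_cost(model, COST))
```

M2 has the higher accuracy but also the higher cost, because each of its 45 false negatives is charged 100.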
Cost-Sensitive Measures
$$\text{Precision } (p) = \frac{a}{a + c}$$
$$\text{Recall } (r) = \frac{a}{a + b}$$
$$\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$
Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)
$$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$$
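The measures on this slide can be sketched directly from the confusion-matrix cells a, b, c, d. The function names are illustrative assumptions, and the example counts are those of model M1 from the Cost vs Accuracy slide (a=150, b=40, c=60, d=250).

```python
def precision(a, c):
    """p = a / (a + c): fraction of predicted positives that are correct."""
    return a / (a + c)


def recall(a, b):
    """r = a / (a + b): fraction of actual positives that are found."""
    return a / (a + b)


def f_measure(a, b, c):
    """F = 2a / (2a + b + c), equal to 2rp / (r + p)."""
    return 2 * a / (2 * a + b + c)


def weighted_accuracy(a, b, c, d, w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d)."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)


a, b, c, d = 150, 40, 60, 250
p, r = precision(a, c), recall(a, b)
print(p, r, f_measure(a, b, c), weighted_accuracy(a, b, c, d))
```

With all weights equal to 1, weighted accuracy reduces to ordinary accuracy.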
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets
Learning Curve
Learning curve shows how accuracy changes with varying sample size
Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley et al.)
– Geometric sampling (Provost et al.)
Effect of small sample size:
– Bias in the estimate
– Variance of the estimate
Methods of Estimation
Holdout
– Reserve 2/3 for training and 1/3 for testing
Random subsampling
– Repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
Stratified sampling
– Oversampling vs undersampling
Bootstrap
– Sampling with replacement
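Two of the estimation methods above can be sketched with the standard library alone: k-fold cross-validation (with leave-one-out as the case k = n) and bootstrap sampling. The function names, the shuffling step, and the `seed` parameter are assumptions made for this illustration; the code only produces index lists and leaves model training to the caller.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    The n indices are shuffled, split into k disjoint folds, and each
    fold serves as the test set exactly once. k == n is leave-one-out.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def bootstrap_sample(n, seed=0):
    """Draw n indices from range(n) with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


for train, test in k_fold_indices(10, 5):
    print(sorted(train), sorted(test))
```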
ROC (Receiver Operating Characteristic)
Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms
ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
Performance of each classifier is represented as a point on the ROC curve
– Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point
ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
At threshold t:
TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
ROC Curve
(TPR, FPR):
– (0,0): declare everything to be negative class
– (1,1): declare everything to be positive class
– (1,0): ideal
Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class
Using ROC for Model Comparison
Neither model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR
Area under the ROC curve
– Ideal: Area = 1
– Random guess: Area = 0.5
How to Construct an ROC curve
Instance P(+|A) True Class
1 0.95 +
2 0.93 +
3 0.87 -
4 0.85 -
5 0.85 -
6 0.85 +
7 0.76 -
8 0.53 +
9 0.43 -
10 0.25 +
• Use a classifier that produces a posterior probability P(+|A) for each test instance
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP + FN)
• FP rate, FPR = FP/(FP + TN)
How to construct an ROC curve

Class            +     -     +     -     -     -     +     -     +     +
Threshold >=     0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP               5     4     4     3     3     3     3     2     2     1     0
FP               5     5     4     4     3     2     1     1     0     0     0
TN               0     0     1     1     2     3     4     4     5     5     5
FN               0     1     1     2     2     2     2     3     3     4     5
TPR              1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR              1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

[Figure: ROC curve]
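The construction procedure can be sketched as follows, using the ten scored instances from the previous slide. The function names are illustrative assumptions. This sketch applies one threshold per unique score, so the three instances tied at 0.85 produce a single point (FPR 0.6, TPR 0.6) rather than the three per-instance columns shown in the table. The area under the curve is computed with the trapezoidal rule, which is one common choice and not something the slides specify.

```python
def roc_points(scores, labels, positive="+"):
    """Return (FPR, TPR) points, one per unique threshold, from strictest to loosest."""
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == positive)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != positive)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points sorted by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
curve = roc_points(scores, labels)
print(curve)
print(area_under_curve(curve))
```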
Test of Significance
Given two models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?
– How much confidence can we place on the accuracy of M1 and M2?
– Can the difference in performance measure be explained as a result of random fluctuations in the test set?
Confidence Interval for Accuracy
Prediction can be regarded as a Bernoulli trial
– A Bernoulli trial has 2 possible outcomes
– Possible outcomes for prediction: correct or wrong
– Collection of Bernoulli trials has a Binomial distribution:
  x ~ Bin(N, p), x: number of correct predictions
  e.g.: Toss a fair coin 50 times, how many heads would turn up?
  Expected number of heads = N × p = 50 × 0.5 = 25
Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances),
can we predict p (true accuracy of model)?
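Confidence Interval for Accuracy
For large test sets (N > 30),
– acc has a normal distribution with mean p and variance p(1-p)/N
$$P\left( Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2} \right) = 1 - \alpha$$
[Figure: standard normal density; the area between Z_{α/2} and Z_{1-α/2} is 1 - α]
Confidence Interval for p:
$$p = \frac{2 \times N \times acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^2 + 4 \times N \times acc - 4 \times N \times acc^2}}{2 (N + Z_{\alpha/2}^2)}$$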
Confidence Interval for Accuracy
Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
– N = 100, acc = 0.8
– Let 1 - α = 0.95 (95% confidence)
– From probability table, Z_{α/2} = 1.96

1 - α    Z
0.99     2.58
0.98     2.33
0.95     1.96
0.90     1.65

N          50     100    500    1000   5000
p(lower)   0.670  0.711  0.763  0.774  0.789
p(upper)   0.888  0.866  0.833  0.824  0.811
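The interval for p can be evaluated numerically to check the table. This is a minimal sketch of the confidence-interval formula for accuracy given on the previous slide; the function name and the default z = 1.96 (95% confidence) are assumptions for illustration.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Return (lower, upper) bounds on the true accuracy p.

    Implements p = (2*N*acc + z^2 +/- z*sqrt(z^2 + 4*N*acc - 4*N*acc^2))
                   / (2*(N + z^2)).
    """
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom


for n in (50, 100, 500, 1000, 5000):
    lower, upper = accuracy_interval(0.8, n)
    print(n, round(lower, 3), round(upper, 3))
```

The bounds agree with the table to within rounding in the third decimal place, and the interval narrows as N grows.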
Comparing Performance of 2 Models
Given two models, say M1 and M2, which is better?
– M1 is tested on D1 (size = n1), found error rate = e1
– M2 is tested on D2 (size = n2), found error rate = e2
– Assume D1 and D2 are independent
– If n1 and n2 are sufficiently large, then
$$e_1 \sim N(\mu_1, \sigma_1^2), \qquad e_2 \sim N(\mu_2, \sigma_2^2)$$
– Approximate the variances:
$$\hat{\sigma}_i^2 = \frac{e_i (1 - e_i)}{n_i}$$
Comparing Performance of 2 Models
To test if the performance difference is statistically significant: d = e1 - e2
– d ~ N(d_t, σ_t^2), where d_t is the true difference
– Since D1 and D2 are independent, their variances add up:
$$\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \cong \hat{\sigma}_1^2 + \hat{\sigma}_2^2 = \frac{e_1 (1 - e_1)}{n_1} + \frac{e_2 (1 - e_2)}{n_2}$$
– At (1 - α) confidence level,
$$d_t = d \pm Z_{\alpha/2} \, \hat{\sigma}_t$$
An Illustrative Example
Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
d = |e2 - e1| = 0.1 (2-sided test)
$$\hat{\sigma}_d^2 = \frac{0.15 (1 - 0.15)}{30} + \frac{0.25 (1 - 0.25)}{5000} = 0.0043$$
At 95% confidence level, Z_{α/2} = 1.96
$$d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$$
=> Interval contains 0 => difference may not be statistically significant
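The arithmetic of this example can be checked with a few lines. The function name is an assumption for illustration; the inputs are the values given on the slide.

```python
import math


def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference in error rates.

    The variance of the difference is the sum of the two estimated
    variances e_i * (1 - e_i) / n_i, since the test sets are independent.
    """
    d = abs(e2 - e1)
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(variance)
    return d - margin, d + margin


lower, upper = difference_interval(0.15, 30, 0.25, 5000)
print(lower, upper)
print("contains 0:", lower <= 0 <= upper)
```

Almost all of the variance comes from M1, which was tested on only 30 instances.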
Comparing Performance of 2 Algorithms
Each learning algorithm may produce k models:
– L1 may produce M11, M12, …, M1k
– L2 may produce M21, M22, …, M2k
If models are generated on the same test sets D1, D2, …, Dk (e.g., via cross-validation)
– For each set: compute d_j = e_1j - e_2j
– d_j has mean d_t and variance σ_t^2
– Estimate:
$$\hat{\sigma}_t^2 = \frac{\sum_{j=1}^{k} (d_j - \bar{d})^2}{k (k - 1)}$$
$$d_t = \bar{d} \pm t_{1-\alpha, k-1} \, \hat{\sigma}_t$$
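A minimal sketch of this paired comparison is given below. The function name and the per-fold error rates are hypothetical, chosen only to exercise the formula. The critical value of the t distribution is passed in as `t_crit` rather than computed, to keep the sketch free of external libraries; 2.776 is the standard two-sided 95% value for k - 1 = 4 degrees of freedom.

```python
import math


def paired_difference_interval(errors1, errors2, t_crit):
    """Confidence interval for the true mean difference d_t.

    errors1[j] and errors2[j] are the error rates of the two algorithms
    on the same test set D_j. The variance estimate is
    sum((d_j - d_bar)^2) / (k * (k - 1)).
    """
    k = len(errors1)
    if k != len(errors2) or k < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    diffs = [a - b for a, b in zip(errors1, errors2)]
    d_bar = sum(diffs) / k
    variance = sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1))
    margin = t_crit * math.sqrt(variance)
    return d_bar - margin, d_bar + margin


# Hypothetical error rates from 5-fold cross-validation.
errors_l1 = [0.20, 0.22, 0.18, 0.25, 0.21]
errors_l2 = [0.15, 0.20, 0.17, 0.19, 0.16]
print(paired_difference_interval(errors_l1, errors_l2, t_crit=2.776))
```

If the resulting interval excludes 0, the difference between the two algorithms is statistically significant at the chosen confidence level.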
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
– Based on information found in the data that describes the objects and their relationships.
– Also known as unsupervised classification.
Many applications
– Understanding: group related documents for browsing or to find genes and proteins that have similar functionality.
– Summarization: Reduce the size of large data sets.
What is not Cluster Analysis?
Supervised classification.
– Have class label information.
Simple segmentation.
– Dividing students into different registration groups alphabetically, by last name.
Results of a query.
– Groupings are a result of an external specification.
Graph partitioning.
– Some mutual relevance and synergy, but areas are not identical.
Notion of a Cluster is Ambiguous
[Figure: the same initial points grouped into two clusters, four clusters, and six clusters]
Types of Clusterings
A clustering is a set of clusters.
One important distinction is between hierarchical and partitional sets of clusters.
Partitional clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree.
Partitional Clustering
[Figure: original points and a partitional clustering of them]
Hierarchical Clustering
[Figure: a traditional and a non-traditional hierarchical clustering of points p1, p2, p3, p4, each shown with its dendrogram]
Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple clusters.
– Can represent multiple classes or ‘border’ points.
Fuzzy versus non-fuzzy
– In fuzzy clusterings, a point belongs to every cluster with some weight between 0 and 1.
– Weights must sum to 1.
– Probabilistic clustering has similar characteristics.
Partial versus complete
– In some cases, we only want to cluster some of the data.
Types of Clusters: Well-Separated
Well-separated clusters
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster.
Types of Clusters: Contiguity-Based
Contiguous cluster (nearest neighbor or transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
Types of Clusters: Density-Based
Density-based
– A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
– The three curves don’t form clusters since they fade into the noise, as does the bridge between the two small circular clusters.
Similarity and Dissimilarity
Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1].
Dissimilarity
– Numerical measure of how different two data objects are.
– Is lower when objects are more alike.
– Minimum dissimilarity is often 0.
– Upper limit varies.
Proximity refers to a similarity or dissimilarity.
Summary of Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Euclidean Distance
$$dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary if scales differ.
Euclidean Distance
0
1
2
3
0 1 2 3 4 5 6
p1
p2
p3 p4
point x yp1 0 2p2 2 0p3 3 1p4 5 1
Distance Matrix
p1 p2 p3 p4p1 0 2.828 3.162 5.099p2 2.828 0 1.414 3.162p3 3.162 1.414 0 2p4 5.099 3.162 2 0
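The distance matrix can be reproduced from the four points with a short sketch. The function name and the dictionary of points are illustrative; the coordinates are those listed on the slide.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names] for a in names]

for name, row in zip(names, matrix):
    print(name, row)
```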
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance
$$dist = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$
where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.
r = 2. Euclidean distance.
r → ∞. “Supremum” (Lmax norm, L∞ norm) distance.
– This is the maximum difference between any component of the vectors.
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix
L1    p1      p2      p3      p4
p1    0       4       4       6
p2    4       0       2       4
p3    4       2       0       2
p4    6       4       2       0

L2    p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0

L∞    p1      p2      p3      p4
p1    0       2       3       5
p2    2       0       1       3
p3    3       1       0       2
p4    5       3       2       0
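All three matrices follow from one function with a different value of r. In this sketch the function name is an assumption, and the supremum distance is selected by passing `math.inf` as r, which is a convention of this example rather than of the slides.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance (sum |p_k - q_k|^r)^(1/r); r = math.inf gives the maximum difference."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(d ** r for d in diffs) ** (1 / r)


points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
for r in (1, 2, math.inf):
    print("r =", r)
    for a in points:
        print(a, [round(minkowski(points[a], points[b], r), 3) for b in points])
```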
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
A distance that satisfies these properties is a metric.
Common Properties of a Similarity
Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects) p and q.
Similarity Between Binary Vectors
A common situation is that objects p and q have only binary attributes.
Compute similarities using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
  = (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
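The example above can be checked with a small sketch that counts the four quantities and returns both coefficients. The function name is illustrative, and returning 0.0 for the Jaccard coefficient when both vectors are all zeros is an assumption of this sketch, since the slides do not cover that case.

```python
def smc_and_jaccard(p, q):
    """Return (SMC, Jaccard) for two equally long binary vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    not_both_zero = m01 + m10 + m11
    # Assumption: define J as 0.0 when no attribute is 1 in either vector.
    jaccard = m11 / not_both_zero if not_both_zero else 0.0
    return smc, jaccard


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))
```

SMC is high here only because most attributes are 0 in both vectors; Jaccard ignores those 0-0 matches.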
Cosine Similarity
If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / (||d1|| ||d2||),
where • indicates vector dot product and || d || is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 5 / (6.481 * 2.449) = 0.3150
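The worked example can be verified with a short sketch. The function name is illustrative; the vectors are the two document vectors from the slide.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))
```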
Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count attributes
– Reduces to Jaccard for binary attributes
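The formula itself did not survive in this transcript, so the sketch below assumes the commonly used definition T(p, q) = (p • q) / (||p||² + ||q||² - p • q). The function name is illustrative. For binary vectors the dot product equals M11 and the squared lengths equal M11 + M10 and M11 + M01, so the expression reduces to the Jaccard coefficient, which the second call demonstrates.

```python
def extended_jaccard(p, q):
    """Tanimoto coefficient: p.q / (|p|^2 + |q|^2 - p.q) (assumed standard definition)."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(p, q))
    denom = sum(a * a for a in p) + sum(b * b for b in q) - dot
    if denom == 0:
        raise ValueError("undefined when both vectors are all zeros")
    return dot / denom


# Count vectors (the document vectors from the cosine similarity example).
print(extended_jaccard([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]))
# Binary vectors: M11 = 2, M10 = 1, M01 = 1, so Jaccard = 2 / 4.
print(extended_jaccard([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```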
Correlation
Correlation measures the linear relationship between objects.
To compute correlation, we standardize data objects, p and q, and then take the dot product.
$$p'_k = \frac{p_k - \operatorname{mean}(p)}{\operatorname{std}(p)}$$
$$q'_k = \frac{q_k - \operatorname{mean}(q)}{\operatorname{std}(q)}$$
$$\operatorname{correlation}(p, q) = p' \cdot q'$$
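A minimal sketch of the standardize-then-dot-product recipe follows. One detail is not stated on the slide: when std is the sample standard deviation (dividing by n - 1), the dot product has to be divided by n - 1 as well for the result to fall in [-1, 1]. That scaling is applied here and labeled as an addition of this sketch; the function names are illustrative.

```python
import math


def standardize(x):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    if std == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mean) / std for v in x]


def correlation(p, q):
    """Dot product of the standardized vectors, scaled by 1 / (n - 1)."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    ps, qs = standardize(p), standardize(q)
    # Scaling by n - 1 is this sketch's addition; it matches the sample std above.
    return sum(a * b for a, b in zip(ps, qs)) / (len(p) - 1)


print(correlation([1, 2, 3], [2, 4, 6]))
print(correlation([1, 2, 3], [3, 2, 1]))
```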
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
Mahalanobis Distance
$$\operatorname{mahalanobis}(p, q) = (p - q) \, \Sigma^{-1} (p - q)^T$$
where Σ is the covariance matrix of the input data.
For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
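For two-dimensional data the Mahalanobis formula can be sketched without a linear-algebra library by inverting the 2x2 covariance matrix by hand. The function name and the example covariance matrices are hypothetical; the data behind the figure's values of 14.7 and 6 is not available in this transcript, so they are not reproduced. The sketch follows the form on the slide, which has no square root.

```python
def mahalanobis_2d(p, q, cov):
    """(p - q) * inverse(cov) * (p - q)^T for 2-D points and a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


# With the identity covariance the result is the squared Euclidean distance.
print(mahalanobis_2d((0, 0), (3, 4), ((1, 0), (0, 1))))
# A larger variance along x shrinks the contribution of differences in x.
print(mahalanobis_2d((0, 0), (2, 0), ((4, 0), (0, 1))))
```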
A General Approach for Combining Similarities
Sometimes attributes are of many different types, but an overall similarity is needed.
Using Weights to Combine Similarities
May not want to treat all attributes the same.
– Use weights w_k which are between 0 and 1 and sum to 1.