Classification.. continued. Prediction and Classification Last week we discussed the classification...
-
Upload
hailee-madkins -
Category
Documents
-
view
222 -
download
5
Transcript of Classification.. continued. Prediction and Classification Last week we discussed the classification...
Classification.. continued
Prediction and Classification
• Last week we discussed the classification problem..– Used the Naïve Bayes Method
• Today..we will dive into more details..
• But first how do we evaluate classifier
Abstract Binary Classification Problem
• Given n data samples where xi is a
data vector and yi is label {-1,1}.
• Aim is to learn a function
• Such that f is “accurate” on unseen data.
• [ill-specified as defined]
Algorithms to Learn Classifier
• We can use an algorithm A to learn the function f: X Y
• Then we write f as fA
• One example of A is Naïve Bayes.
• Other examples {Logistic Regression, Neural Networks, Support Vector Machines, Decision Trees, Random Forests,….}
Training vs. Test Data
• In practice to take care of the “unseen” part…we split the data into training and test sets
• We learn fA on the training set using an algorithm A
• The learned function fA is then evaluated on the test set.
Example
• Suppose we learn a function F on training set.
• Our test set consists of four data points (z1,1),(z2,-1),(z3,1),(z4,-1).
• We apply F on the four data points (without labels) and we get F(z1)=1, F(z2)=1,F(z3)=-1 and F(z4) = -1.
• Then F correctly classified z1 and z4 but incorrectly classified z2 and z3.
Confusion Matrix
Actual Label (1) Actual Label (-1)
Predicted Label (1) True Positive (N1) False Positive (N2)
Predicted Label (-1) False Negatives (N3) True Negatives (N4)
Label 1 is called Positive, Label -1 is called Negative
Let the number of test samples be N
N = N1 + N2 + N3 + N4.
True Positive Rate (TPR) = N1/(N1+N3)
True Negative Rate (TNR) = N4/(N4+N2)
False Positive Rate (FPR) = N2/(N2+N4)
False Negative Rate (FNR) = N3/(N1+N3)
Accuracy = (N1+N4)/(N1+N2+N3+N4)
Precision = N1/(N1+N2) Recall = N1/(N1+N3)
Example
Actual Label (1) Actual Label (-1)
Predicted Label (1) 10 3
Predicted Label (-1) 2 20
TPR = 5/6; TNR = 20/23; FPR = 3/23; FNR = 2/12;
Accuracy = 30/35
Precision = 10/13 and Recall = 10/12
ROC (Receiver Operating Characteristic) Curves
• Generally a learning algorithm A will return a real number…but what we want is a label {1 or -1}
• We can apply a threshold..TA 0.7 0.6 0.5 0.2 0.1 0.09 0.08 0.02 0.01
T=0.1 1 1 1 1 1 -1 -1 -1 -1
True Label
1 1 -1 -1 1 1 -1 -1 -1
A 0.7 0.6 0.5 0.2 0.1 0.09 0.08 0.02 0.01
T=0.2 1 1 1 1 -1 -1 -1 -1 -1
True Label
1 1 -1 -1 1 1 -1 -1 -1
TPR = 3/4FPR = 2/5
TPR = 2/4FPR = 2/5
ROC Curve
• An ROC Curve is the plot where the x-axis is FPR, the y-axis is the TPR and for each threshold t, the point on the plot represents the pair (FPR(t), TPR(t))
• Lets Look at the Wikipedia ROC Entry
Discussion..
• If F: Symptoms {Disease, No-Disease}– Higher Recall or Precision ?– What is the relative cost of a mis-diagnosis (and
which way)
• If F: Banner Ad {Click, No-Click}– Higher Precision means more revenue?
Random Variables
• A r.v. is a numerical quantity associated with events in an experiment.
• Suppose we roll two dice. Let X = k be the sum of the two faces.
• X can take values ranging from {2….12}.• P(X=12) = 1/36. Why ?
– Event associated with X=12 is {(6,6)}• P(X=7) = 6/36 = 1/6
– Associated Event: {(1,6),(6,1),(2,5),(5,2),(3,4),(4,3)}
Random Variable
• A random variable X can take values in a set which is:– discrete and finite.
• Lets toss a coin and X = 1 if it’s a head and X=0 if it’s a tail. X is random variable
– discrete and infinite (countable)• Let X be the number of accidents in Sydney in a day.. Then
X = 0,1,2,…..
– Infinite (uncountable)• Let X be the height of a Sydney-sider.
– X = 150, 150.11,150.112,……
Random Variable Properties
• Let X be a discrete valued random variable taking values in a set S.
The Expected (average) Value of X, E(X) is
• The Variance is
Examples
• Let X be a random variable which takes values 1 with probability p and 0 with probability 1-p. Then
Examples
• Let X be a random variable which denotes the number of “spam emails” in a batch of n emails. Assuming the probability of spam email is p.
• X={0,1,2,3,4,5}
X is a r.v. which follows a binomial distribution with parameters (n,p)… X ~ Binomial(n,p)– E(X) = np ; Var(X) = np(1-p)
Examples
• Let X be a random variable which denotes the number of tcp packets that arrive in a unit time. Then X can be modeled to follow a Poisson distribution..
• E(X) = Var(X) = λ
Continuous Distribution
• Ofcourse the most common continuous distribution is the Normal/Gaussian distribution… denoted
•
How to use r.v. for classification
• To use r.v. in classification…we have to make an assumption.– For example..Sepal Length follows a Normal Distribution.– Is this a good/reasonable assumption.
• Then we use data to estimate the parameters of the distribution..– The parameters of a Normal distribution are the mean and the
variance (square of standard deviation).– For the moment we can just use Matlab/program to do that…– Once we have the parameters we can use the distribution to
estimate the “probability” of Sepal Length taking a new value..
Fitting Distributions..Examples• 0,1,0,1,0,0
– Assume data from a Binomial distribution with 6 trials and 2 successes• In Matlab:>> binofit(2,6) = 0.3333
• 10,20,5,3,3,100– Assume data is from a Poisson distribution– X=[10 20 5 3 3 100]; – Poissfit(X); – Ans: 23.50
• What is happening ? We are just taking sample averages. The more data we have the more reliable these estimates become..
• Suppose we take Sepal Length…data vector x>> [mean,std] = normfit(x);>> ans: mean = 5.8, std=0.81
Return to the Iris Example
• We will redo the Iris Classification Example..but now will use “continuous” values for the attributes…