Discrimination and Classification
Discrimination
Situation:
We have two or more populations π1, π2, etc.
(possibly p-variate normal).
The populations are known (or we have data from each population).
We have data for a new case (population unknown) and we want to identify the population to which the new case belongs.
The Basic Problem
Suppose that the data from a new case x1, … , xp has joint density function either:
π1: g(x1, … , xp) or
π2: h(x1, … , xp)
We want to make the decision to
D1: Classify the case in π1 (g is the correct distribution) or
D2: Classify the case in π2 (h is the correct distribution)
The Two Types of Errors
1. Misclassifying the case in π1 when it actually lies in π2.
Let P[1|2] = P[D1|π2] = probability of this type of error
2. Misclassifying the case in π2 when it actually lies in π1.
Let P[2|1] = P[D2|π1] = probability of this type of error
This is similar to Type I and Type II errors in hypothesis testing.
Note:
A discrimination scheme is defined by splitting p-dimensional space into two regions.
1. C1 = the region where we make the decision D1
(the decision to classify the case in π1).
2. C2 = the region where we make the decision D2
(the decision to classify the case in π2).
There are several approaches to determining the regions C1 and C2. All are concerned with taking into account the probabilities of misclassification P[2|1] and P[1|2].
1. Set up the regions C1 and C2 so that one of the probabilities of misclassification, P[2|1] say, is at some low acceptable value α. Accept the resulting level of the other probability of misclassification, P[1|2] = β.
2. Set up the regions C1 and C2 so that the total probability of misclassification:
P[Misclassification] = P[π1] P[2|1] + P[π2] P[1|2]
is minimized, where
P[π1] = P[the case belongs to π1]
P[π2] = P[the case belongs to π2]
3. Set up the regions C1 and C2 so that the total expected cost of misclassification:
E[Cost of Misclassification] = ECM
= c2|1 P[π1] P[2|1] + c1|2 P[π2] P[1|2]
is minimized, where
P[π1] = P[the case belongs to π1]
P[π2] = P[the case belongs to π2]
c2|1 = the cost of misclassifying the case in π2 when the case belongs to π1.
c1|2 = the cost of misclassifying the case in π1 when the case belongs to π2.
The Optimal Classification Rule
Suppose that the data x1, … , xp has joint density function
f(x1, … , xp ; θ)
where θ is either θ1 or θ2. Let
g(x1, … , xp) = f(x1, … , xp ; θ1) and
h(x1, … , xp) = f(x1, … , xp ; θ2)
We want to make the decision
D1: θ = θ1 (g is the correct distribution) against
D2: θ = θ2 (h is the correct distribution)
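Let
$$C_1 = \left\{ (x_1, \dots, x_p) : \frac{L(\theta_1)}{L(\theta_2)} = \frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} > k \right\}$$
and
$$C_2 = \left\{ (x_1, \dots, x_p) : \frac{L(\theta_1)}{L(\theta_2)} = \frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} \le k \right\}$$
where
$$k = \frac{c_{1|2}\, P[\pi_2]}{c_{2|1}\, P[\pi_1]}$$
Then the optimal regions (minimizing ECM, the expected cost of misclassification) for making the decisions D1 and D2 respectively are C1 and C2.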
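Proof:
ECM = E[Cost of Misclassification]
= c2|1 P[π1] P[2|1] + c1|2 P[π2] P[1|2]
where
$$P[2|1] = \int_{C_2} g(x_1, \dots, x_p)\, dx_1 \cdots dx_p = 1 - \int_{C_1} g(x_1, \dots, x_p)\, dx_1 \cdots dx_p$$
and
$$P[1|2] = \int_{C_1} h(x_1, \dots, x_p)\, dx_1 \cdots dx_p$$
Thus
$$ECM = c_{2|1} P[\pi_1] \left( 1 - \int_{C_1} g(x_1, \dots, x_p)\, dx_1 \cdots dx_p \right) + c_{1|2} P[\pi_2] \int_{C_1} h(x_1, \dots, x_p)\, dx_1 \cdots dx_p$$
Therefore
$$ECM = \int_{C_1} \left[ c_{1|2} P[\pi_2]\, h(x_1, \dots, x_p) - c_{2|1} P[\pi_1]\, g(x_1, \dots, x_p) \right] dx_1 \cdots dx_p + c_{2|1} P[\pi_1]$$
Thus ECM is minimized if C1 contains all of the points (x1, …, xp) such that the integrand is negative:
$$c_{1|2} P[\pi_2]\, h(x_1, \dots, x_p) - c_{2|1} P[\pi_1]\, g(x_1, \dots, x_p) < 0$$
that is,
$$\frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} > \frac{c_{1|2}\, P[\pi_2]}{c_{2|1}\, P[\pi_1]}$$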
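The rule can be checked numerically on a small case. The sketch below is illustrative only: the two univariate normal densities, the priors and the costs are hypothetical values chosen for the example, not taken from the slides. It computes k, classifies by the ratio g/h, and compares the ECM of the optimal cut-off with the ECM of other cut-offs.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean, sd=1.0):
    """Distribution function of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Hypothetical setting (assumed for illustration):
# pi_1 ~ N(0, 1) with density g, pi_2 ~ N(2, 1) with density h.
MU1, MU2 = 0.0, 2.0
P1, P2 = 0.6, 0.4        # prior probabilities P[pi_1], P[pi_2]
C21, C12 = 1.0, 5.0      # costs c2|1 and c1|2

# Threshold k = c1|2 P[pi_2] / (c2|1 P[pi_1])
k = (C12 * P2) / (C21 * P1)

def classify(x):
    """Return 1 (decision D1) if g(x)/h(x) > k, else 2 (decision D2)."""
    ratio = normal_pdf(x, MU1) / normal_pdf(x, MU2)
    return 1 if ratio > k else 2

def ecm(t):
    """ECM of the rule 'classify in pi_1 when x < t'."""
    p_2_given_1 = 1.0 - normal_cdf(t, MU1)   # P[2|1]
    p_1_given_2 = normal_cdf(t, MU2)         # P[1|2]
    return C21 * P1 * p_2_given_1 + C12 * P2 * p_1_given_2

# With equal variances, g/h > k is equivalent to x < t_opt.
t_opt = (MU1 + MU2) / 2.0 - math.log(k) / (MU2 - MU1)

for t in (t_opt - 0.3, t_opt, t_opt + 0.3, 1.0):
    print(f"cut-off {t:6.3f}  ECM {ecm(t):.4f}")
```

Because misclassifying a π2 case is assumed to be five times as costly here, the cut-off moves below the midpoint of the two means, shrinking the region C1.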
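Fisher's Linear Discriminant Function
Suppose that x = (x1, … , xp)' is data from a p-variate Normal distribution with mean vector μ1 or μ2, that is, with density either
$$g(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)}$$
or
$$h(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)}$$
The covariance matrix Σ is the same for both populations π1 and π2.
The likelihood ratio is
$$\lambda = \frac{g(\mathbf{x})}{h(\mathbf{x})} = \frac{\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)}}{\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)}}$$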
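The Neyman-Pearson Lemma states that we should classify into populations π1 and π2 using the likelihood ratio:
$$\lambda = \frac{g(\mathbf{x})}{h(\mathbf{x})} = e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)}$$
That is, make the decision
D1 : population is π1
if λ > k,
or
$$\ln \lambda = \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) > \ln k$$
or
$$(\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) > 2 \ln k$$
or
$$\mathbf{x}' \Sigma^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_2' \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 - \left[ \mathbf{x}' \Sigma^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_1' \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 \right] > 2 \ln k$$
or
$$(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} \mathbf{x} > \ln k + \tfrac{1}{2} \left[ \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 \right]$$
Finally we make the decision
D1 : population is π1
if
$$\mathbf{a}' \mathbf{x} > K$$
where
$$\mathbf{a} = \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \quad \text{and} \quad K = \ln k + \tfrac{1}{2} \left[ \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 \right]$$
and
$$k = \frac{c_{1|2}\, P[\pi_2]}{c_{2|1}\, P[\pi_1]}$$
Note: k = 1 and ln k = 0 if c1|2 = c2|1 and P[π1] = P[π2], and then
$$K = \tfrac{1}{2} \left[ \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 \right] = \tfrac{1}{2} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)$$
The function
$$\mathbf{a}' \mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} \mathbf{x}$$
is called Fisher's linear discriminant function.
In the case where the populations are unknown but estimated from data, Fisher's linear discriminant function is
$$\hat{\mathbf{a}}' \mathbf{x} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)' S^{-1} \mathbf{x}$$
where x̄1 and x̄2 are the sample mean vectors and S is the pooled sample covariance matrix.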
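A minimal sketch of the rule a'x > K for known population parameters follows. The helper names (`solve`, `fisher_rule`, `classify`) and the numerical values of μ1, μ2 and Σ are assumptions made for illustration; the code is kept dependency-free, so a small Gaussian-elimination routine stands in for a linear-algebra library.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def fisher_rule(mu1, mu2, sigma, k=1.0):
    """Return (a, K) with a = Sigma^{-1}(mu1 - mu2) and
    K = ln k + (1/2)(mu1 - mu2)' Sigma^{-1} (mu1 + mu2)."""
    diff = [m1 - m2 for m1, m2 in zip(mu1, mu2)]
    total = [m1 + m2 for m1, m2 in zip(mu1, mu2)]
    a = solve(sigma, diff)
    K = math.log(k) + 0.5 * dot(a, total)
    return a, K

def classify(x, a, K):
    """Decision D1 (population 1) when a'x > K, otherwise D2."""
    return 1 if dot(a, x) > K else 2

# Hypothetical parameters, equal costs and priors (k = 1).
mu1, mu2 = [1.0, 0.0], [-1.0, 0.0]
sigma = [[2.0, 1.0], [1.0, 2.0]]
a, K = fisher_rule(mu1, mu2, sigma)
print(a, K, classify([0.5, 0.2], a, K))
```

The second form of K is used because it needs only the one linear solve that already produces a.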
[Figure: A pictorial representation of Fisher's procedure for two populations π1 and π2 in the (x1, x2) plane, showing the regions "Classify as π1" and "Classify as π2".]
Example 1

π1: Riding-mower owners              π2: Nonowners
x1 (Income      x2 (Lot size         x1 (Income      x2 (Lot size
in $1000s)      in 1000 sq ft)       in $1000s)      in 1000 sq ft)
20.0             9.2                 25.0             9.8
28.5             8.4                 17.6            10.4
21.6            10.8                 21.6             8.6
20.5            10.4                 14.4            10.2
29.0            11.8                 28.0             8.8
36.7             9.6                 16.4             8.8
36.0             8.8                 19.8             8.0
27.6            11.2                 22.0             9.2
23.0            10.0                 15.8             8.2
31.0            10.4                 11.0             9.4
17.0            11.0                 17.0             7.0
27.0            10.0                 21.0             7.4
[Figure: Scatterplot of lot size (in thousands of square feet) against income (in thousands of dollars) for riding-mower owners and nonowners.]
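As a rough illustration, the sample version of Fisher's linear discriminant function can be computed from the riding-mower data of Example 1. The sketch assumes equal costs and equal prior probabilities (so ln k = 0) and a pooled covariance matrix with divisor n1 + n2 − 2; the variable and function names are illustrative.

```python
# (income in $1000s, lot size in 1000 sq ft), copied from the Example 1 table
owners = [(20.0, 9.2), (28.5, 8.4), (21.6, 10.8), (20.5, 10.4),
          (29.0, 11.8), (36.7, 9.6), (36.0, 8.8), (27.6, 11.2),
          (23.0, 10.0), (31.0, 10.4), (17.0, 11.0), (27.0, 10.0)]
nonowners = [(25.0, 9.8), (17.6, 10.4), (21.6, 8.6), (14.4, 10.2),
             (28.0, 8.8), (16.4, 8.8), (19.8, 8.0), (22.0, 9.2),
             (15.8, 8.2), (11.0, 9.4), (17.0, 7.0), (21.0, 7.4)]

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, m):
    """2 x 2 matrix of sums of squares and cross-products about the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - m[0], r[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

m1, m2 = mean_vector(owners), mean_vector(nonowners)
A1, A2 = scatter(owners, m1), scatter(nonowners, m2)

# Pooled sample covariance matrix S (assumed divisor n1 + n2 - 2)
dof = len(owners) + len(nonowners) - 2
S = [[(A1[i][j] + A2[i][j]) / dof for j in range(2)] for i in range(2)]

# Explicit inverse of the 2 x 2 matrix S
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det, S[0][0] / det]]

# a = S^{-1}(xbar1 - xbar2),  K = (1/2) a'(xbar1 + xbar2) since ln k = 0
d = [m1[0] - m2[0], m1[1] - m2[1]]
a = [S_inv[0][0] * d[0] + S_inv[0][1] * d[1],
     S_inv[1][0] * d[0] + S_inv[1][1] * d[1]]
K = 0.5 * (a[0] * (m1[0] + m2[0]) + a[1] * (m1[1] + m2[1]))

def classify(x):
    """Classify a (income, lot size) pair using a'x > K."""
    return "owner" if a[0] * x[0] + a[1] * x[1] > K else "nonowner"

print("a =", a, "K =", K)
print("correctly classified owners:",
      sum(classify(x) == "owner" for x in owners), "of", len(owners))
```

Counting how many of the original cases fall on the right side of the line gives only an optimistic impression of the error rates, since the same data were used to fit the rule.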
Example 2
Annual financial data are collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data on the four variables
• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales)
are given in the following table:

Bankrupt Firms                                   Nonbankrupt Firms
       x1       x2       x3       x4                    x1       x2       x3       x4
Firm   CF/TD    NI/TA    CA/CL    CA/NS          Firm   CF/TD    NI/TA    CA/CL    CA/NS
 1    -0.4485  -0.4106   1.0865   0.4526          1     0.5135   0.1001   2.4871   0.5368
 2    -0.5633  -0.3114   1.5314   0.1642          2     0.0769   0.0195   2.0069   0.5304
 3     0.0643   0.0156   1.0077   0.3978          3     0.3776   0.1075   3.2651   0.3548
 4    -0.0721  -0.0930   1.4544   0.2589          4     0.1933   0.0473   2.2506   0.3309
 5    -0.1002  -0.0917   1.5644   0.6683          5     0.3248   0.0718   4.2401   0.6279
 6    -0.1421  -0.0651   0.7066   0.2794          6     0.3132   0.0511   4.4500   0.6852
 7     0.0351   0.0147   1.5046   0.7080          7     0.1184   0.0499   2.5210   0.6925
 8    -0.6530  -0.0566   1.3737   0.4032          8    -0.0173   0.0233   2.0538   0.3484
 9     0.0724  -0.0076   1.3723   0.3361          9     0.2169   0.0779   2.3489   0.3970
10    -0.1353  -0.1433   1.4196   0.4347         10     0.1703   0.0695   1.7973   0.5174
11    -0.2298  -0.2961   0.3310   0.1824         11     0.1460   0.0518   2.1692   0.5500
12     0.0713   0.0205   1.3124   0.2497         12    -0.0985  -0.0123   2.5029   0.5778
13     0.0109   0.0011   2.1495   0.6969         13     0.1398  -0.0312   0.4611   0.2643
14    -0.2777  -0.2316   1.1918   0.6601         14     0.1379   0.0728   2.6123   0.5151
15     0.1454   0.0500   1.8762   0.2723         15     0.1486   0.0564   2.2347   0.5563
16     0.3703   0.1098   1.9914   0.3828         16     0.1633   0.0486   2.3080   0.1978
17    -0.0757  -0.0821   1.5077   0.4215         17     0.2907   0.0597   1.8381   0.3786
18     0.0451   0.0263   1.6756   0.9494         18     0.5383   0.1064   2.3293   0.4835
19     0.0115  -0.0032   1.2602   0.6038         19    -0.3330  -0.0854   3.0124   0.4730
20     0.1227   0.1055   1.1434   0.1655         20     0.4875   0.0910   1.2444   0.1847
21    -0.2843  -0.2703   1.2722   0.5128         21     0.5603   0.1112   4.2918   0.4443
                                                 22     0.2029   0.0792   1.9936   0.3018
                                                 23     0.4746   0.1380   2.9166   0.4487
                                                 24     0.1661   0.0351   2.4527   0.1370
                                                 25     0.5808   0.0371   5.0594   0.1268
Examples using SPSS
Classification or Cluster Analysis
Have data from one or several populations
Situation
• Have multivariate (or univariate) data from one or several populations (the number of populations is unknown)
• Want to determine the number of populations and identify the populations
Example
Table: Numerals in eleven languages

English  Norwegian  Danish  Dutch  German  French  Spanish  Italian  Polish    Hungarian  Finnish
one      en         en      een    ein     un      uno      uno      jeden     egy        yksi
two      to         to      twee   zwei    deux    dos      due      dwa       ketto      kaksi
three    tre        tre     drie   drei    trois   tres     tre      trzy      harom      kolme
four     fire       fire    vier   vier    quatre  cuatro   quattro  cztery    negy       nelja
five     fem        fem     vijf   funf    cinq    cinco    cinque   piec      ot         viisi
six      seks       seks    zes    sechs   six     seis     sei      szesc     hat        kuusi
seven    sju        syv     zeven  sieben  sept    siete    sette    siedem    het        seitseman
eight    atte       otte    acht   acht    huit    ocho     otto     osiem     nyolc      kahdeksan
nine     ni         ni      negen  neun    neuf    nueve    nove     dziewiec  kilenc     yhdeksan
ten      ti         ti      tien   zehn    dix     diez     dieci    dziesiec  tiz        kymmenen
Distance Matrix
Distance = # of numerals (1 to 10) differing in first letter

      E   N  Da  Du   G  Fr  Sp   I   P   H  Fi
E     0   2   2   7   6   6   6   6   7   9   9
N         0   1   5   4   6   6   6   7   8   9
Da            0   6   5   6   5   5   6   8   9
Du                0   5   9   9   9  10   8   9
G                     0   7   7   7   8   9   9
Fr                        0   2   1   5  10   9
Sp                            0   1   3  10   9
I                                 0   4  10   9
P                                     0  10   9
H                                         0   8
Fi                                            0
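The distance used here is easy to compute directly. The following sketch rebuilds the matrix from the numerals table; the dictionary layout and function name are illustrative choices, and only the first letter of each word affects the result.

```python
LANGUAGES = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]

# Numerals one to ten in each language, as listed in the table.
NUMERALS = {
    "E":  "one two three four five six seven eight nine ten",
    "N":  "en to tre fire fem seks sju atte ni ti",
    "Da": "en to tre fire fem seks syv otte ni ti",
    "Du": "een twee drie vier vijf zes zeven acht negen tien",
    "G":  "ein zwei drei vier funf sechs sieben acht neun zehn",
    "Fr": "un deux trois quatre cinq six sept huit neuf dix",
    "Sp": "uno dos tres cuatro cinco seis siete ocho nueve diez",
    "I":  "uno due tre quattro cinque sei sette otto nove dieci",
    "P":  "jeden dwa trzy cztery piec szesc siedem osiem dziewiec dziesiec",
    "H":  "egy ketto harom negy ot hat het nyolc kilenc tiz",
    "Fi": "yksi kaksi kolme nelja viisi kuusi seitseman kahdeksan yhdeksan kymmenen",
}
words = {lang: text.split() for lang, text in NUMERALS.items()}

def distance(a, b):
    """Number of numerals (1 to 10) whose first letters differ."""
    return sum(1 for wa, wb in zip(words[a], words[b]) if wa[0] != wb[0])

matrix = [[distance(a, b) for b in LANGUAGES] for a in LANGUAGES]

# Print the full symmetric matrix with row and column labels.
print("    " + "".join(f"{lang:>4}" for lang in LANGUAGES))
for lang, row in zip(LANGUAGES, matrix):
    print(f"{lang:<4}" + "".join(f"{d:>4}" for d in row))
```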
Hierarchical Clustering Methods
The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables).
1. Start with N clusters, each consisting of a single entity, and an N × N symmetric matrix (table) of distances (or similarities) D = (dij).
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the "most similar" clusters U and V be dUV.
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by
a) deleting the rows and columns corresponding to clusters U and V and
b) adding a row and column giving the distances between cluster (UV) and the remaining clusters.
4. Repeat steps 2 and 3 a total of N-1 times. (All objects will be in a single cluster at termination of this algorithm.) Record the identity of clusters that are merged and the levels (distances or similarities) at which the mergers take place.
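The four steps can be sketched in a few lines. The function name `agglomerate`, the `linkage` argument and the 4 × 4 distance matrix below are assumptions for illustration. Rather than editing the matrix in place, this sketch recomputes each inter-cluster distance from the original dij, which gives the same merges when the linkage is a function of the pairwise distances.

```python
def agglomerate(dist, linkage=min):
    """Agglomerative hierarchical clustering.

    dist    : symmetric N x N list of lists of distances
    linkage : reduces the pairwise distances between two clusters to one
              number (min = single linkage, max = complete linkage)
    Returns a list of (cluster_u, cluster_v, level), one entry per merge.
    """
    # Step 1: N clusters, each holding a single object.
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def between(u, v):
        return linkage([dist[i][j] for i in u for j in v])

    # Step 4: N - 1 merges in total.
    while len(clusters) > 1:
        # Step 2: find the nearest pair of clusters.
        level, p, q = min(
            (between(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        u, v = clusters[p], clusters[q]
        # Step 3: drop U and V, add the merged cluster (UV).
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(tuple(sorted(u + v)))
        merges.append((u, v, level))
    return merges

# Hypothetical distances between four objects (labelled 0 to 3).
D = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]

for u, v, level in agglomerate(D, linkage=min):
    print("merge", u, "and", v, "at level", level)
```

Passing `linkage=max` gives complete linkage; for this matrix the same clusters are merged, but the final merge happens at level 6 instead of 3.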
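Different methods of computing inter-cluster distance
[Figure: Cluster distance between the clusters {1, 2} and {3, 4, 5} under three methods. Single linkage: the smallest distance between a point in one cluster and a point in the other. Complete linkage: the largest such distance. Average linkage: (d13 + d14 + d15 + d23 + d24 + d25)/6.]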
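To make the three definitions concrete, the short sketch below evaluates them for clusters {1, 2} and {3, 4, 5}. The six pairwise distances are illustrative values chosen for the example.

```python
# Illustrative pairwise distances d[(i, j)] between a point i in
# cluster {1, 2} and a point j in cluster {3, 4, 5}.
d = {(1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10}

cluster_u, cluster_v = (1, 2), (3, 4, 5)
pairwise = [d[(i, j)] for i in cluster_u for j in cluster_v]

single_linkage = min(pairwise)                     # nearest pair
complete_linkage = max(pairwise)                   # farthest pair
average_linkage = sum(pairwise) / len(pairwise)    # mean over all 6 pairs

print(single_linkage, complete_linkage, average_linkage)
```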
Example
To illustrate the single linkage algorithm, we consider the hypothetical distance matrix between pairs of five objects given below:

D = {dik} =

      1    2    3    4    5
1     0    9    3    6   11
2          0    7    5   10
3               0    9    2
4                    0    8
5                         0
Treating each object as a cluster, the clustering begins by merging the two closest items (3 & 5).
To implement the next level of clustering we need to compute the distances between cluster (35) and the remaining objects:
d(35)1 = min{3,11} = 3
d(35)2 = min{7,10} = 7
d(35)4 = min{9,8} = 8
The new distance matrix becomes:

       (35)    1    2    4
(35)     0     3    7    8
1              0    9    6
2                   0    5
4                        0
The next two closest clusters ((35) & 1) are merged to form cluster (135). Distances between this cluster and the remaining clusters become:
d(135)2 = min{7,9} = 7
d(135)4 = min{8,6} = 6
The distance matrix now becomes:

        (135)    2    4
(135)     0      7    6
2                0    5
4                     0
Continuing, the next two closest clusters (2 & 4) are merged to form cluster (24).
Distances between this cluster and the remaining clusters become:
d(135)(24) = min{d(135)2, d(135)4} = min{7,6} = 6
The final distance matrix now becomes:

        (135)  (24)
(135)     0      6
(24)             0

At the final step clusters (135) and (24) are merged to form the single cluster (12345) of all five items.
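The worked example can be replayed in code. This sketch follows the matrix-update form of the algorithm (delete the rows and columns of the merged clusters, then add a row for the new cluster using the minimum rule); the dictionary-based representation is an illustrative choice.

```python
# Upper triangle of the 5 x 5 distance matrix from the example.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2,
     (4, 5): 8}

# A cluster is a tuple of object labels; a distance is keyed by the
# (unordered) pair of clusters it connects.
dist = {frozenset([(i,), (j,)]): d for (i, j), d in D.items()}
clusters = [(i,) for i in range(1, 6)]
history = []

while len(clusters) > 1:
    # Find the closest pair of clusters in the current matrix.
    pair, level = min(dist.items(), key=lambda kv: kv[1])
    u, v = sorted(pair)
    merged = tuple(sorted(u + v))
    history.append((merged, level))

    # Delete the rows and columns for U and V ...
    clusters = [c for c in clusters if c not in (u, v)]
    new_dist = {key: val for key, val in dist.items()
                if u not in key and v not in key}
    # ... and add a row for (UV) using the single linkage (minimum) rule.
    for w in clusters:
        new_dist[frozenset([merged, w])] = min(dist[frozenset([u, w])],
                                               dist[frozenset([v, w])])
    clusters.append(merged)
    dist = new_dist

for merged, level in history:
    print("formed cluster", merged, "at distance", level)
```

The merge levels 2, 3, 5 and 6 are the heights at which the branches join in the single linkage dendrogram for these five objects.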
The results of this algorithm can be summarized graphically in the following "dendrogram".
[Figure: Single linkage dendrogram for distances between five objects. Leaf order: 1, 3, 5, 2, 4; vertical axis (distance) from 0 to 6.]
Dendrograms for clustering the 11 languages on the basis of the ten numerals
[Figure: Single linkage dendrogram for distances between numbers in 11 languages. Leaf order: Du, G, E, N, Da, Fr, I, Sp, P, H, Fi; vertical axis (distance) from 0 to 10.]
[Figure: Complete linkage dendrogram for distances between numbers in 11 languages. Leaf order: Du, G, E, N, Da, Fr, I, Sp, P, H, Fi; vertical axis (distance) from 0 to 10.]
[Figure: Average linkage dendrogram for distances between numbers in 11 languages. Leaf order: Du, G, E, N, Da, Fr, I, Sp, P, H, Fi; vertical axis (distance) from 0 to 10.]
Example 2: Public Utility data

Variables:
X1: Fixed charge coverage ratio (income/debt)
X2: Rate of return on capital
X3: Cost per KW capacity in place
X4: Annual load factor
X5: Peak KWH demand growth from 1974 to 1975
X6: Sales (KWH per year)
X7: Percent nuclear
X8: Total fuel costs (cents per KWH)

No.  Company                          X1     X2   X3    X4    X5     X6    X7     X8
 1   Arizona Public Service          1.06    9.2  151  54.4   1.6   9077   0.0  0.628
 2   Boston Edison Co                0.89   10.3  202  57.9   2.2   5088  25.3  1.555
 3   Central Louisiana Electric Co   1.43   15.4  113  53.0   3.4   9212   0.0  1.058
 4   Commonwealth Edison Co          1.02   11.2  168  56.0   0.3   6423  34.3  0.700
 5   Consolidated Edison Co (NY)     1.49    8.8  192  51.2   1.0   3300  15.6  2.044
 6   Florida Power & Light Co        1.32   13.5  111  60.0  -2.2  11127  22.5  1.241
 7   Hawaiian Electric Co            1.22   12.2  175  67.6   2.2   7642   0.0  1.652
 8   Idaho Power Co                  1.10    9.2  245  57.0   3.3  13082   0.0  0.309
 9   Kentucky Utilities Co           1.34   13.0  168  60.4   7.2   8406   0.0  0.862
10   Madison Gas & Electric Co       1.12   12.4  197  53.0   2.7   6455  39.2  0.623
11   Nevada Power Co                 0.75    7.5  173  51.5   6.5  17441   0.0  0.768
12   New England Electric Co         1.13   10.9  178  62.0   3.7   6154   0.0  1.897
13   Northern States Power Co        1.15   12.7  199  53.7   6.4   7179  50.2  0.527
14   Oklahoma Gas & Electric Co      1.09   12.0   96  49.8   1.4   9673   0.0  0.588
15   Pacific Gas & Electric Co       0.96    7.6  164  62.2  -0.1   6468   0.9  1.400
16   Puget Sound Power & Light Co    1.16    9.9  252  56.0   9.2  15991   0.0  0.620
17   San Diego Gas & Electric Co     0.76    6.4  136  61.9   9.0   5714   8.3  1.920
18   The Southern Co                 1.05   12.6  150  56.7   2.7  10140   0.0  1.108
19   Texas Utilities Co              1.16   11.7  104  54.0  -2.1  13507   0.0  0.636
20   Wisconsin Electric Power Co     1.20   11.8  148  59.9   3.5   7287  41.1  0.702
21   United Illuminating Co          1.04    8.6  204  61.0   3.5   6650   0.0  2.116
22   Virginia Electric & Power Co    1.07    9.3  174  54.3   5.9  10093  26.6  1.306
Table: Distances between 22 Utilities

Firm
number   1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22
 1      0.00
 2      3.10  0.00
 3      3.68  4.92  0.00
 4      2.46  2.16  4.11  0.00
 5      4.12  3.85  4.47  4.13  0.00
 6      3.61  4.22  2.99  3.20  4.60  0.00
 7      3.90  3.45  4.22  3.97  4.60  3.35  0.00
 8      2.74  3.89  4.99  3.69  5.16  4.91  4.36  0.00
 9      3.25  3.96  2.75  3.75  4.49  3.73  2.80  3.59  0.00
10      3.10  2.71  3.93  1.49  4.05  3.83  4.51  3.67  3.57  0.00
11      3.49  4.79  5.90  4.86  6.46  6.00  6.00  3.46  5.18  5.08  0.00
12      3.22  2.43  4.03  3.50  3.60  3.74  1.66  4.06  2.74  3.94  5.21  0.00
13      3.96  3.43  4.39  2.58  4.76  4.55  5.01  4.14  3.66  1.41  5.31  4.50  0.00
14      2.11  4.32  2.74  3.23  4.82  3.47  4.91  4.34  3.82  3.61  4.32  4.34  4.39  0.00
15      2.59  2.50  5.16  3.19  4.26  4.07  2.93  3.85  4.11  4.26  4.74  2.33  5.10  4.24  0.00
16      4.03  4.84  5.26  4.97  5.82  5.84  5.04  2.20  3.63  4.53  3.43  4.62  4.41  5.17  5.18  0.00
17      4.40  3.62  6.36  4.89  5.63  6.10  4.58  5.43  4.90  5.48  4.75  3.50  5.61  5.56  3.40  5.56  0.00
18      1.88  2.90  2.72  2.65  4.34  2.85  2.95  3.24  2.43  3.07  3.95  2.45  3.78  2.30  3.00  3.97  4.43  0.00
19      2.41  4.63  3.18  3.46  5.13  2.58  4.52  4.11  4.11  4.13  4.52  4.41  5.01  1.88  4.03  5.23  6.09  2.47  0.00
20      3.17  3.00  3.73  1.82  4.39  2.91  3.54  4.09  2.95  2.05  5.35  3.43  2.23  3.74  3.78  4.82  4.87  2.92  3.90  0.00
21      3.45  2.32  5.09  3.88  3.64  4.63  2.68  3.98  3.74  4.36  4.88  1.38  4.94  4.93  2.10  4.57  3.10  3.19  4.97  4.15  0.00
22      2.51  2.42  4.11  2.58  3.77  4.03  4.00  3.24  3.21  2.56  3.44  3.00  2.74  3.51  3.35  3.46  3.63  2.55  3.97  2.62  3.01  0.00
[Figure: Dendrogram, cluster analysis of N = 22 utility companies. Euclidean distance, average linkage. Leaf order: WEP, CE, NSP, MGE, VEP, FPL, TU, OGE, S, AP, KU, CLE, SDGE, BE, PGE, UI, NE, HE, CENY, NV, PSPL, IP; distance axis marked at 0.000, 1.000, 2.000.]
[Figure: Dendrogram, cluster analysis of N = 22 utility companies. Euclidean distance, single linkage. Leaf order: CENY, VEP, SDGE, PGE, VI, NEE, HE, BE, WEP, NSP, CE, TU, OGE, S, AP, MGE, KU, FPL, CLE, PSPL, IP, NP; distance axis marked at 0.000, 1.000, 2.000.]