Discrimination and Classification. Discrimination Situation: We have two or more populations 1, 2,...

Discrimination and Classification

Discrimination

Situation:

We have two or more populations 1, 2, etc. (possibly p-variate normal).

The populations are known (or we have data from each population).

We have data for a new case (population unknown) and we want to identify the population of which the new case is a member.

The Basic Problem

Suppose that the data from a new case, x1, … , xp, has joint density function either:

1: g(x1, … , xp) or

2: h(x1, … , xp)

We want to make the decision to

D1: Classify the case in 1 (g is the correct distribution) or

D2: Classify the case in 2 (h is the correct distribution)

The Two Types of Errors

1. Misclassifying the case in 1 when it actually lies in 2.

Let P[1|2] = P[D1|2] = probability of this type of error

2. Misclassifying the case in 2 when it actually lies in 1.

Let P[2|1] = P[D2|1] = probability of this type of error

This is similar to Type I and Type II errors in hypothesis testing.

Note:

A discrimination scheme is defined by splitting p-dimensional space into two regions.

1. C1 = the region where we make the decision D1 (the decision to classify the case in 1).

2. C2 = the region where we make the decision D2 (the decision to classify the case in 2).

There can be several approaches to determining the regions C1 and C2, all concerned with taking into account the probabilities of misclassification P[2|1] and P[1|2].

1. Set up the regions C1 and C2 so that one of the probabilities of misclassification, P[2|1] say, is at some low acceptable value α. Accept the resulting level of the other probability of misclassification, P[1|2] = β.

2. Set up the regions C1 and C2 so that the total probability of misclassification:

P[Misclassification] = P[1] P[2|1] + P[2]P[1|2]

is minimized, where

P[1] = P[the case belongs to 1] and

P[2] = P[the case belongs to 2]


3. Set up the regions C1 and C2 so that the total expected cost of misclassification:

E[Cost of Misclassification] = ECM

= c2|1P[1] P[2|1] + c1|2 P[2]P[1|2]

is minimized



c2|1= the cost of misclassifying the case in 2 when the case belongs to 1.

c1|2= the cost of misclassifying the case in 1 when the case belongs to 2.
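As a minimal sketch of the ECM formula above, the following Python function evaluates the expected cost of misclassification for given costs, prior probabilities and misclassification probabilities. The function name and all numerical values are hypothetical and chosen only for illustration. Setting both costs to 1 gives the total probability of misclassification from approach 2.

```python
def expected_cost(c21, c12, p1, p2, p2_given_1, p1_given_2):
    """ECM = c(2|1) P[1] P[2|1] + c(1|2) P[2] P[1|2]."""
    return c21 * p1 * p2_given_1 + c12 * p2 * p1_given_2


# Hypothetical inputs (not taken from the source).
ecm = expected_cost(c21=10.0, c12=5.0, p1=0.2, p2=0.8,
                    p2_given_1=0.1, p1_given_2=0.05)

# With unit costs the ECM reduces to the total probability of misclassification.
tpm = expected_cost(c21=1.0, c12=1.0, p1=0.2, p2=0.8,
                    p2_given_1=0.1, p1_given_2=0.05)

print(ecm, tpm)
```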

The Optimal Classification Rule

Suppose that the data x1, … , xp has joint density function

f(x1, … , xp ; θ)

where θ is either 1 or 2. Let

g(x1, … , xp) = f(x1, … , xp ; 1) and

h(x1, … , xp) = f(x1, … , xp ; 2)

We want to make the decision

D1: θ = 1 (g is the correct distribution) against

D2: θ = 2 (h is the correct distribution)

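Then the optimal regions (minimizing ECM, the expected cost of misclassification) for making the decisions D1 and D2 respectively are C1 and C2, where

\[
C_1 = \left\{ (x_1, \dots, x_p) : \frac{L_1}{L_2} = \frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} > k \right\}
\]

and

\[
C_2 = \left\{ (x_1, \dots, x_p) : \frac{L_1}{L_2} = \frac{g(x_1, \dots, x_p)}{h(x_1, \dots, x_p)} \le k \right\}
\]

and

\[
k = \frac{c_{1|2}\, P[2]}{c_{2|1}\, P[1]}
\]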

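Proof:

ECM = E[Cost of Misclassification]

= c2|1P[1] P[2|1] + c1|2 P[2]P[1|2]

where

\[
P[2|1] = \int_{C_2} g(x_1, \dots, x_p)\, dx_1 \cdots dx_p = 1 - \int_{C_1} g(x_1, \dots, x_p)\, dx_1 \cdots dx_p
\]

and

\[
P[1|2] = \int_{C_1} h(x_1, \dots, x_p)\, dx_1 \cdots dx_p
\]

Thus

\[
ECM = c_{2|1} P[1] \left[ 1 - \int_{C_1} g(x_1, \dots, x_p)\, dx_1 \cdots dx_p \right] + c_{1|2} P[2] \int_{C_1} h(x_1, \dots, x_p)\, dx_1 \cdots dx_p
\]

\[
= c_{2|1} P[1] + \int_{C_1} \left[ c_{1|2} P[2]\, h(x_1, \dots, x_p) - c_{2|1} P[1]\, g(x_1, \dots, x_p) \right] dx_1 \cdots dx_p
\]

The integrand is negative exactly where c1|2 P[2] h < c2|1 P[1] g, that is, where g/h > k with k = c1|2 P[2] / (c2|1 P[1]). ECM is therefore minimized by taking C1 to be the set of points where g/h > k and C2 to be its complement. (Points where g/h = k can be assigned to either region without changing ECM.)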
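The optimal rule can be checked numerically in a simple case. The sketch below assumes two univariate normal populations with unit variance and means 0 and 2, together with hypothetical costs and priors; none of these values come from the source. In this setting g(x)/h(x) = exp(2 - 2x), so the region where g/h > k is x < t*, with t* = (mu1 + mu2)/2 - ln(k)/(mu2 - mu1). A grid search over thresholds should then find its minimum ECM close to t*.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical setup: population 1 is N(0, 1), population 2 is N(2, 1).
MU1, MU2 = 0.0, 2.0
P1, P2 = 0.5, 0.5        # prior probabilities P[1], P[2]
C21, C12 = 2.0, 1.0      # costs c(2|1), c(1|2)


def ecm(t):
    """ECM of the rule 'classify as 1 when x < t'."""
    p2_given_1 = 1.0 - normal_cdf(t - MU1)   # case from 1 falls in C2
    p1_given_2 = normal_cdf(t - MU2)         # case from 2 falls in C1
    return C21 * P1 * p2_given_1 + C12 * P2 * p1_given_2


# Threshold implied by the likelihood-ratio rule g/h > k.
k = (C12 * P2) / (C21 * P1)
t_star = (MU1 + MU2) / 2.0 - math.log(k) / (MU2 - MU1)

# Brute-force search over a grid of candidate thresholds.
grid = [-3.0 + 0.001 * i for i in range(8001)]
t_grid = min(grid, key=ecm)

print(round(t_star, 3), round(t_grid, 3))
```

The two printed thresholds should agree to within the grid spacing, which is what the proof predicts for this particular choice of densities.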

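Fisher's Linear Discriminant Function

Suppose that x = (x1, … , xp)' is data from a p-variate Normal distribution with mean vector μ1 or μ2.

The covariance matrix Σ is the same for both populations 1 and 2.

The density is either

\[
g(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \mu_1)' \Sigma^{-1} (\mathbf{x} - \mu_1)}
\]

or

\[
h(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \mu_2)' \Sigma^{-1} (\mathbf{x} - \mu_2)}
\]

so that

\[
\frac{g(\mathbf{x})}{h(\mathbf{x})}
= \frac{\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \mu_1)' \Sigma^{-1} (\mathbf{x} - \mu_1)}}
       {\dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \mu_2)' \Sigma^{-1} (\mathbf{x} - \mu_2)}}
\]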

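The Neyman-Pearson Lemma states that we should classify into populations 1 and 2 using:

\[
\lambda = \frac{g(\mathbf{x})}{h(\mathbf{x})} = e^{-\frac{1}{2} (\mathbf{x} - \mu_1)' \Sigma^{-1} (\mathbf{x} - \mu_1) + \frac{1}{2} (\mathbf{x} - \mu_2)' \Sigma^{-1} (\mathbf{x} - \mu_2)}
\]

That is, make the decision

D1 : population is 1

if λ > k

\[
\text{or } \ln \lambda = -\tfrac{1}{2} (\mathbf{x} - \mu_1)' \Sigma^{-1} (\mathbf{x} - \mu_1) + \tfrac{1}{2} (\mathbf{x} - \mu_2)' \Sigma^{-1} (\mathbf{x} - \mu_2) > \ln k
\]

\[
\text{or } (\mathbf{x} - \mu_2)' \Sigma^{-1} (\mathbf{x} - \mu_2) - (\mathbf{x} - \mu_1)' \Sigma^{-1} (\mathbf{x} - \mu_1) > 2 \ln k
\]

\[
\text{or } \mathbf{x}' \Sigma^{-1} \mathbf{x} - 2 \mu_2' \Sigma^{-1} \mathbf{x} + \mu_2' \Sigma^{-1} \mu_2
- \left( \mathbf{x}' \Sigma^{-1} \mathbf{x} - 2 \mu_1' \Sigma^{-1} \mathbf{x} + \mu_1' \Sigma^{-1} \mu_1 \right) > 2 \ln k
\]

\[
\text{or } (\mu_1 - \mu_2)' \Sigma^{-1} \mathbf{x} > \ln k + \tfrac{1}{2} \left( \mu_1' \Sigma^{-1} \mu_1 - \mu_2' \Sigma^{-1} \mu_2 \right)
\]

\[
\text{or } \mathbf{a}' \mathbf{x} > K
\]

where

\[
\mathbf{a} = \Sigma^{-1} (\mu_1 - \mu_2) \quad \text{and} \quad K = \ln k + \tfrac{1}{2} \left( \mu_1' \Sigma^{-1} \mu_1 - \mu_2' \Sigma^{-1} \mu_2 \right)
\]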

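Finally we make the decision

D1 : population is 1

if

\[
\mathbf{a}' \mathbf{x} = (\mu_1 - \mu_2)' \Sigma^{-1} \mathbf{x} > K
\]

where

\[
\mathbf{a} = \Sigma^{-1} (\mu_1 - \mu_2), \qquad K = \ln k + \tfrac{1}{2} \left( \mu_1' \Sigma^{-1} \mu_1 - \mu_2' \Sigma^{-1} \mu_2 \right)
\]

and

\[
k = \frac{c_{1|2}\, P[2]}{c_{2|1}\, P[1]}
\]

Note: k = 1 and ln k = 0 if c1|2 = c2|1 and P[1] = P[2]. In that case

\[
\mathbf{a} = \Sigma^{-1} (\mu_1 - \mu_2) \quad \text{and} \quad K = \tfrac{1}{2} \left( \mu_1' \Sigma^{-1} \mu_1 - \mu_2' \Sigma^{-1} \mu_2 \right) = \tfrac{1}{2} (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 + \mu_2)
\]

The function

\[
\mathbf{a}' \mathbf{x} = (\mu_1 - \mu_2)' \Sigma^{-1} \mathbf{x}
\]

is called Fisher's linear discriminant function.

In the case where the populations are unknown but estimated from data, Fisher's linear discriminant function is

\[
\hat{\mathbf{a}}' \mathbf{x} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)' S^{-1} \mathbf{x}
\]

where the sample mean vectors and S are the estimates of μ1, μ2 and the common covariance matrix Σ.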

[Figure: A pictorial representation of Fisher's procedure for two populations. Axes x1 and x2; the plane is split into a region labelled "Classify as 1" and a region labelled "Classify as 2".]

Example 1

Population 1: Riding-mower owners. Population 2: Nonowners.

x1 = Income (in $1000s), x2 = Lot size (in 1000 sq ft)

  Riding-mower owners      Nonowners
      x1       x2             x1       x2
    20.0      9.2           25.0      9.8
    28.5      8.4           17.6     10.4
    21.6     10.8           21.6      8.6
    20.5     10.4           14.4     10.2
    29.0     11.8           28.0      8.8
    36.7      9.6           16.4      8.8
    36.0      8.8           19.8      8.0
    27.6     11.2           22.0      9.2
    23.0     10.0           15.8      8.2
    31.0     10.4           11.0      9.4
    17.0     11.0           17.0      7.0
    27.0     10.0           21.0      7.4

[Figure: Scatter plot of lot size (in thousands of square feet) against income (in thousands of dollars) for riding-mower owners and nonowners.]
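The sample version of Fisher's linear discriminant function can be sketched on the Example 1 data in plain Python. Two assumptions are made here that the source does not state: S is taken to be the pooled sample covariance matrix, and costs and prior probabilities are taken to be equal, so that ln k = 0 and the cutoff is K = (1/2) a'(x̄1 + x̄2). All function and variable names are illustrative.

```python
owners = [(20.0, 9.2), (28.5, 8.4), (21.6, 10.8), (20.5, 10.4),
          (29.0, 11.8), (36.7, 9.6), (36.0, 8.8), (27.6, 11.2),
          (23.0, 10.0), (31.0, 10.4), (17.0, 11.0), (27.0, 10.0)]
nonowners = [(25.0, 9.8), (17.6, 10.4), (21.6, 8.6), (14.4, 10.2),
             (28.0, 8.8), (16.4, 8.8), (19.8, 8.0), (22.0, 9.2),
             (15.8, 8.2), (11.0, 9.4), (17.0, 7.0), (21.0, 7.4)]


def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_matrix(rows, m):
    """2 x 2 matrix of sums of squares and cross-products about the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - m[0], r[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


m1, m2 = mean_vector(owners), mean_vector(nonowners)
A1, A2 = scatter_matrix(owners, m1), scatter_matrix(nonowners, m2)

# Pooled covariance matrix (assumed form of S).
df = len(owners) + len(nonowners) - 2
S = [[(A1[i][j] + A2[i][j]) / df for j in range(2)] for i in range(2)]

# Inverse of the 2 x 2 matrix S.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det, S[0][0] / det]]

# a = S^{-1} (xbar1 - xbar2)
diff = (m1[0] - m2[0], m1[1] - m2[1])
a = [S_inv[i][0] * diff[0] + S_inv[i][1] * diff[1] for i in range(2)]

# Cutoff with ln k = 0 (equal costs and priors, an assumption).
K = 0.5 * (a[0] * (m1[0] + m2[0]) + a[1] * (m1[1] + m2[1]))


def classify(x):
    """Return 1 (owner) if a'x > K, otherwise 2 (nonowner)."""
    return 1 if a[0] * x[0] + a[1] * x[1] > K else 2


correct = (sum(classify(x) == 1 for x in owners)
           + sum(classify(x) == 2 for x in nonowners))
print("a =", a, "K =", K)
print(correct, "of", len(owners) + len(nonowners), "training cases classified correctly")
```

The count printed at the end is an apparent (resubstitution) success count on the same data used to fit the rule, so it tends to be optimistic.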

Example 2

Annual financial data are collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data on the four variables

• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales)

are given in the following table:

             Bankrupt Firms                            Nonbankrupt Firms
            x1       x2       x3       x4                x1       x2       x3       x4
Firm     CF/TD    NI/TA    CA/CL    CA/NS    Firm     CF/TD    NI/TA    CA/CL    CA/NS
   1   -0.4485  -0.4106   1.0865   0.4526       1    0.5135   0.1001   2.4871   0.5368
   2   -0.5633  -0.3114   1.5314   0.1642       2    0.0769   0.0195   2.0069   0.5304
   3    0.0643   0.0156   1.0077   0.3978       3    0.3776   0.1075   3.2651   0.3548
   4   -0.0721  -0.0930   1.4544   0.2589       4    0.1933   0.0473   2.2506   0.3309
   5   -0.1002  -0.0917   1.5644   0.6683       5    0.3248   0.0718   4.2401   0.6279
   6   -0.1421  -0.0651   0.7066   0.2794       6    0.3132   0.0511   4.4500   0.6852
   7    0.0351   0.0147   1.5046   0.7080       7    0.1184   0.0499   2.5210   0.6925
   8   -0.6530  -0.0566   1.3737   0.4032       8   -0.0173   0.0233   2.0538   0.3484
   9    0.0724  -0.0076   1.3723   0.3361       9    0.2169   0.0779   2.3489   0.3970
  10   -0.1353  -0.1433   1.4196   0.4347      10    0.1703   0.0695   1.7973   0.5174
  11   -0.2298  -0.2961   0.3310   0.1824      11    0.1460   0.0518   2.1692   0.5500
  12    0.0713   0.0205   1.3124   0.2497      12   -0.0985  -0.0123   2.5029   0.5778
  13    0.0109   0.0011   2.1495   0.6969      13    0.1398  -0.0312   0.4611   0.2643
  14   -0.2777  -0.2316   1.1918   0.6601      14    0.1379   0.0728   2.6123   0.5151
  15    0.1454   0.0500   1.8762   0.2723      15    0.1486   0.0564   2.2347   0.5563
  16    0.3703   0.1098   1.9914   0.3828      16    0.1633   0.0486   2.3080   0.1978
  17   -0.0757  -0.0821   1.5077   0.4215      17    0.2907   0.0597   1.8381   0.3786
  18    0.0451   0.0263   1.6756   0.9494      18    0.5383   0.1064   2.3293   0.4835
  19    0.0115  -0.0032   1.2602   0.6038      19   -0.3330  -0.0854   3.0124   0.4730
  20    0.1227   0.1055   1.1434   0.1655      20    0.4875   0.0910   1.2444   0.1847
  21   -0.2843  -0.2703   1.2722   0.5128      21    0.5603   0.1112   4.2918   0.4443
                                               22    0.2029   0.0792   1.9936   0.3018
                                               23    0.4746   0.1380   2.9166   0.4487
                                               24    0.1661   0.0351   2.4527   0.1370
                                               25    0.5808   0.0371   5.0594   0.1268

Examples using SPSS

Classification or Cluster Analysis

Have data from one or several populations

Situation

• Have multivariate (or univariate) data from one or several populations (the number of populations is unknown)

• Want to determine the number of populations and identify the populations

Example

Table: Numerals in eleven languages

English  Norwegian  Danish  Dutch  German  French  Spanish  Italian  Polish    Hungarian  Finnish
one      en         en      een    ein     un      uno      uno      jeden     egy        yksi
two      to         to      twee   zwei    deux    dos      due      dwa       ketto      kaksi
three    tre        tre     drie   drei    trois   tres     tre      trzy      harom      kolme
four     fire       fire    vier   vier    quatre  cuatro   quattro  cztery    negy       nelja
five     fem        fem     vijf   funf    cinq    cinco    cinque   piec      ot         viisi
six      seks       seks    zes    sechs   six     seis     sei      szesc     hat        kuusi
seven    sju        syv     zeven  sieben  sept    siete    sette    siedem    het        seitseman
eight    atte       otte    acht   acht    huit    ocho     otto     osiem     nyolc      kahdeksan
nine     ni         ni      negen  neun    neuf    nueve    nove     dziewiec  kilenc     yhdeksan
ten      ti         ti      tien   zehn    dix     diez     dieci    dziesiec  tiz        kymmenen

Distance Matrix

Distance = # of numerals (1 to 10) differing in first letter

      E   N  Da  Du   G  Fr  Sp   I   P   H  Fi
E     0
N     2   0
Da    2   1   0
Du    7   5   6   0
G     6   4   5   5   0
Fr    6   6   6   9   7   0
Sp    6   6   5   9   7   2   0
I     6   6   5   9   7   1   1   0
P     7   7   6  10   8   5   3   4   0
H     9   8   8   8   9  10  10  10  10   0
Fi    9   9   9   9   9   9   9   9   9   8   0
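The distance defined above is easy to compute directly from the numerals. The following sketch stores the ten numerals for each language (spelled without diacritics, as in the table) and counts how many of them differ in first letter. The dictionary and function names are illustrative.

```python
numerals = {
    "E":  "one two three four five six seven eight nine ten".split(),
    "N":  "en to tre fire fem seks sju atte ni ti".split(),
    "Da": "en to tre fire fem seks syv otte ni ti".split(),
    "Du": "een twee drie vier vijf zes zeven acht negen tien".split(),
    "G":  "ein zwei drei vier funf sechs sieben acht neun zehn".split(),
    "Fr": "un deux trois quatre cinq six sept huit neuf dix".split(),
    "Sp": "uno dos tres cuatro cinco seis siete ocho nueve diez".split(),
    "I":  "uno due tre quattro cinque sei sette otto nove dieci".split(),
    "P":  "jeden dwa trzy cztery piec szesc siedem osiem dziewiec dziesiec".split(),
    "H":  "egy ketto harom negy ot hat het nyolc kilenc tiz".split(),
    "Fi": "yksi kaksi kolme nelja viisi kuusi seitseman kahdeksan yhdeksan kymmenen".split(),
}


def first_letter_distance(lang_a, lang_b):
    """Number of numerals (1 to 10) whose first letters differ."""
    return sum(x[0] != y[0] for x, y in zip(numerals[lang_a], numerals[lang_b]))


# A few entries of the distance matrix.
for pair in [("E", "N"), ("N", "Da"), ("E", "Du"), ("Fr", "I"), ("P", "H")]:
    print(pair, first_letter_distance(*pair))
```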

Hierarchical Clustering Methods

The following are the steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables).

1. Start with N clusters, each consisting of a single entity and an N × N symmetric matrix (table) of distances (or similarities) D = (dij).

2. Search the distance matrix for the nearest (most similar) pair of clusters. Let the distance between the "most similar" clusters U and V be dUV.

3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by

a) deleting the rows and columns corresponding to clusters U and V and

b) adding a row and column giving the distances between cluster (UV) and the remaining clusters.

4. Repeat steps 2 and 3 a total of N-1 times. (All objects will be in a single cluster at termination of this algorithm.) Record the identity of clusters that are merged and the levels (distances or similarities) at which the mergers take place.
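Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration rather than an efficient implementation: instead of updating the distance matrix in place, it recomputes each inter-cluster distance from the original object distances, which gives the same result when the cluster distance is the minimum, maximum or mean of the pairwise distances. The function name, the `linkage` parameter and the data layout are assumptions; the distances are those of the five-object single linkage example in this section.

```python
def agglomerate(dist, linkage=min):
    """Agglomerative clustering.

    dist    : dict mapping frozenset({i, j}) to the distance between objects i and j
    linkage : function turning a list of pairwise distances into a cluster
              distance (min gives single linkage, max gives complete linkage)
    Returns the list of merges as (cluster U, cluster V, level).
    """
    # Step 1: every object starts as its own cluster.
    objects = sorted({x for pair in dist for x in pair})
    clusters = [frozenset([x]) for x in objects]

    def cluster_distance(u, v):
        return linkage([dist[frozenset((i, j))] for i in u for j in v])

    merges = []
    # Step 4: repeat until a single cluster remains (N - 1 merges).
    while len(clusters) > 1:
        # Step 2: find the nearest pair of clusters.
        level, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Step 3: merge them and record the merge level.
        u, v = clusters[a], clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(u | v)
        merges.append((sorted(u), sorted(v), level))
    return merges


pairs = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
         (2, 3): 7, (2, 4): 5, (2, 5): 10,
         (3, 4): 9, (3, 5): 2, (4, 5): 8}
D = {frozenset(p): d for p, d in pairs.items()}

merges = agglomerate(D)          # single linkage by default
for u, v, level in merges:
    print(u, v, level)
```

Passing `linkage=max` or `linkage=lambda ds: sum(ds) / len(ds)` switches the same loop to complete or average linkage.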

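Different methods of computing inter-cluster distance

[Figure: three panels, each showing objects 1 to 5 grouped into the clusters (12) and (345), with the distances d15 and d24 marked.]

Single Linkage: cluster distance = the smallest distance between an object in one cluster and an object in the other.

Complete Linkage: cluster distance = the largest such distance.

Average Linkage: cluster distance = the average of all such distances. For the clusters (12) and (345) this is

(d13 + d14 + d15 + d23 + d24 + d25) / 6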
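The three inter-cluster distances can be written as small Python functions. The sketch below evaluates them for the clusters (12) and (345); the pairwise distances are borrowed from the five-object example matrix in this section purely for illustration, and the function names are hypothetical.

```python
def pair_distances(u, v, d):
    """All distances between an object in cluster u and an object in cluster v."""
    return [d[min(i, j), max(i, j)] for i in u for j in v]


def single_linkage(u, v, d):
    return min(pair_distances(u, v, d))


def complete_linkage(u, v, d):
    return max(pair_distances(u, v, d))


def average_linkage(u, v, d):
    ds = pair_distances(u, v, d)
    return sum(ds) / len(ds)


# Distances between members of cluster (12) and members of cluster (345).
d = {(1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10}
U, V = [1, 2], [3, 4, 5]

print(single_linkage(U, V, d))    # smallest of the six distances: 3
print(complete_linkage(U, V, d))  # largest of the six distances: 11
print(average_linkage(U, V, d))   # (3 + 6 + 11 + 7 + 5 + 10) / 6 = 7.0
```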

Example

To illustrate the single linkage algorithm, we consider the hypothetical distance matrix between pairs of five objects given below:

D = {dik} =

        1    2    3    4    5
   1    0
   2    9    0
   3    3    7    0
   4    6    5    9    0
   5   11   10    2    8    0

Treating each object as a cluster, the clustering begins by merging the two closest items (3 & 5).

To implement the next level of clustering we need to compute the distances between cluster (35) and the remaining objects:

d(35)1 = min{3,11} = 3

d(35)2 = min{7,10} = 7

d(35)4 = min{9,8} = 8

The new distance matrix becomes:

        (35)    1    2    4
 (35)     0
    1     3     0
    2     7     9    0
    4     8     6    5    0

The next two closest clusters ((35) & 1) are merged to form cluster (135). Distances between this cluster and the remaining clusters become:


d(135)2 = min{7,9} = 7

d(135)4 = min{8,6} = 6

The distance matrix now becomes:

        (135)    2    4
 (135)     0
     2     7     0
     4     6     5    0

Continuing, the next two closest clusters (2 & 4) are merged to form cluster (24).

Distances between this cluster and the remaining clusters become:

d(135)(24) = min{d(135)2, d(135)4} = min{7,6} = 6

The final distance matrix now becomes:

        (135)  (24)
 (135)     0
  (24)     6     0

At the final step clusters (135) and (24) are merged to form the single cluster (12345) of all five items.

The results of this algorithm can be summarized graphically in the following "dendrogram".

[Figure: Single linkage dendrogram for distances between five objects. Leaf order: 1, 3, 5, 2, 4; vertical axis (merge distance) from 0 to 6.]

Dendrograms for clustering the 11 languages on the basis of the ten numerals

[Figure: Single linkage dendrogram for distances between numbers in 11 languages. Leaf order: Du, G, E, N, Da, Fr, I, Sp, P, H, Fi; vertical axis from 0 to 10.]

[Figure: Complete linkage dendrogram for distances between numbers in 11 languages. Leaf order: Du, G, E, N, Da, Fr, I, Sp, P, H, Fi; vertical axis from 0 to 10.]

[Figure: Average linkage dendrogram for distances between numbers in 11 languages. Leaf order: Du, G, E, N, Da, Fr, I, Sp, P, H, Fi; vertical axis from 0 to 10.]

Example 2: Public Utility data

No  Company                            X1    X2   X3    X4    X5     X6    X7     X8
 1  Arizona Public Service           1.06   9.2  151  54.4   1.6   9077   0.0  0.628
 2  Boston Edison Co                 0.89  10.3  202  57.9   2.2   5088  25.3  1.555
 3  Central Louisiana Electric Co    1.43  15.4  113  53.0   3.4   9212   0.0  1.058
 4  Commonwealth Edison Co           1.02  11.2  168  56.0   0.3   6423  34.3  0.700
 5  Consolidated Edison Co (NY)      1.49   8.8  192  51.2   1.0   3300  15.6  2.044
 6  Florida Power & Light Co         1.32  13.5  111  60.0  -2.2  11127  22.5  1.241
 7  Hawaiian Electric Co             1.22  12.2  175  67.6   2.2   7642   0.0  1.652
 8  Idaho Power Co                   1.10   9.2  245  57.0   3.3  13082   0.0  0.309
 9  Kentucky Utilities Co            1.34  13.0  168  60.4   7.2   8406   0.0  0.862
10  Madison Gas & Electric Co        1.12  12.4  197  53.0   2.7   6455  39.2  0.623
11  Nevada Power Co                  0.75   7.5  173  51.5   6.5  17441   0.0  0.768
12  New England Electric Co          1.13  10.9  178  62.0   3.7   6154   0.0  1.897
13  Northern States Power Co         1.15  12.7  199  53.7   6.4   7179  50.2  0.527
14  Oklahoma Gas & Electric Co       1.09  12.0   96  49.8   1.4   9673   0.0  0.588
15  Pacific Gas & Electric Co        0.96   7.6  164  62.2  -0.1   6468   0.9  1.400
16  Puget Sound Power & Light Co     1.16   9.9  252  56.0   9.2  15991   0.0  0.620
17  San Diego Gas & Electric Co      0.76   6.4  136  61.9   9.0   5714   8.3  1.920
18  The Southern Co                  1.05  12.6  150  56.7   2.7  10140   0.0  1.108
19  Texas Utilities Co               1.16  11.7  104  54.0  -2.1  13507   0.0  0.636
20  Wisconsin Electric Power Co      1.20  11.8  148  59.9   3.5   7287  41.1  0.702
21  United Illuminating Co           1.04   8.6  204  61.0   3.5   6650   0.0  2.116
22  Virginia Electric & Power Co     1.07   9.3  174  54.3   5.9  10093  26.6  1.306

Variables:

X1: Fixed charge coverage ratio (income/debt)
X2: Rate of return on capital
X3: Cost per KW capacity in place
X4: Annual load factor
X5: Peak KWH demand growth from 1974 to 1975
X6: Sales (KWH per year)
X7: Percent Nuclear
X8: Total fuel costs (cents per KWH)

Table: Distances between 22 Utilities

Firm     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22
   1  0.00
   2  3.10  0.00
   3  3.68  4.92  0.00
   4  2.46  2.16  4.11  0.00
   5  4.12  3.85  4.47  4.13  0.00
   6  3.61  4.22  2.99  3.20  4.60  0.00
   7  3.90  3.45  4.22  3.97  4.60  3.35  0.00
   8  2.74  3.89  4.99  3.69  5.16  4.91  4.36  0.00
   9  3.25  3.96  2.75  3.75  4.49  3.73  2.80  3.59  0.00
  10  3.10  2.71  3.93  1.49  4.05  3.83  4.51  3.67  3.57  0.00
  11  3.49  4.79  5.90  4.86  6.46  6.00  6.00  3.46  5.18  5.08  0.00
  12  3.22  2.43  4.03  3.50  3.60  3.74  1.66  4.06  2.74  3.94  5.21  0.00
  13  3.96  3.43  4.39  2.58  4.76  4.55  5.01  4.14  3.66  1.41  5.31  4.50  0.00
  14  2.11  4.32  2.74  3.23  4.82  3.47  4.91  4.34  3.82  3.61  4.32  4.34  4.39  0.00
  15  2.59  2.50  5.16  3.19  4.26  4.07  2.93  3.85  4.11  4.26  4.74  2.33  5.10  4.24  0.00
  16  4.03  4.84  5.26  4.97  5.82  5.84  5.04  2.20  3.63  4.53  3.43  4.62  4.41  5.17  5.18  0.00
  17  4.40  3.62  6.36  4.89  5.63  6.10  4.58  5.43  4.90  5.48  4.75  3.50  5.61  5.56  3.40  5.56  0.00
  18  1.88  2.90  2.72  2.65  4.34  2.85  2.95  3.24  2.43  3.07  3.95  2.45  3.78  2.30  3.00  3.97  4.43  0.00
  19  2.41  4.63  3.18  3.46  5.13  2.58  4.52  4.11  4.11  4.13  4.52  4.41  5.01  1.88  4.03  5.23  6.09  2.47  0.00
  20  3.17  3.00  3.73  1.82  4.39  2.91  3.54  4.09  2.95  2.05  5.35  3.43  2.23  3.74  3.78  4.82  4.87  2.92  3.90  0.00
  21  3.45  2.32  5.09  3.88  3.64  4.63  2.68  3.98  3.74  4.36  4.88  1.38  4.94  4.93  2.10  4.57  3.10  3.19  4.97  4.15  0.00
  22  2.51  2.42  4.11  2.58  3.77  4.03  4.00  3.24  3.21  2.56  3.44  3.00  2.74  3.51  3.35  3.46  3.63  2.55  3.97  2.62  3.01  0.00

[Figure: Dendrogram, cluster analysis of N = 22 utility companies, Euclidean distance, average linkage. Leaf order: WEP, CE, NSP, MGE, VEP, FPL, TU, OGE, S, AP, KU, CLE, SDGE, BE, PGE, UI, NE, HE, CENY, NV, PSPL, IP; distance axis with ticks at 0.000, 1.000, 2.000.]

[Figure: Dendrogram, cluster analysis of N = 22 utility companies, Euclidean distance, single linkage. Leaf order: CENY, VEP, SDGE, PGE, VI, NEE, HE, BE, WEP, NSP, CE, TU, OGE, S, AP, MGE, KU, FPL, CLE, PSPL, IP, NP; distance axis with ticks at 0.000, 1.000, 2.000.]
