
Project #2 Answers, STAT 873, Fall 2013

Complete the problems below. Within each part, include your R output (with the code inside of it) and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.

1) This problem continues examining the data set from project #1, and it involves performing an FA for the data. Use the correlation matrix for the analysis.

a) (5 points) Determine the number of common factors to use. Make sure to justify your choice.

I started with three common factors because three PCs were the minimum number that appeared to be needed for the PCA.

> mod.fit3<-factanal(x = set1[,-1], factors = 3, rotation = "none")
> print(x = mod.fit3, cutoff = 0.0)

Call:
factanal(x = set1[, -1], factors = 3, rotation = "none")

Uniquenesses:
   FL   APP    AA    LA    SC    LC   HON   SMS   EXP   DRV   AMB   GSP   POT    KJ  SUIT
0.536 0.699 0.944 0.005 0.117 0.198 0.442 0.145 0.356 0.239 0.155 0.198 0.178 0.419 0.191

Loadings:
     Factor1 Factor2 Factor3
FL    0.199   0.326   0.564
APP   0.362   0.412   0.027
AA    0.119   0.012   0.203
LA   -0.083   0.994  -0.003
SC    0.788   0.369  -0.355
LC    0.698   0.543  -0.143
HON  -0.008   0.648  -0.371
SMS   0.818   0.432   0.005
EXP   0.249   0.164   0.745
DRV   0.730   0.457   0.137
AMB   0.815   0.416  -0.091
GSP   0.695   0.565   0.027
POT   0.612   0.660   0.104
KJ    0.272   0.711   0.033
SUIT  0.453   0.368   0.684

               Factor1 Factor2 Factor3
SS loadings      4.364   4.106   1.708
Proportion Var   0.291   0.274   0.114
Cumulative Var   0.291   0.565   0.679

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 128.86 on 63 degrees of freedom.
The p-value is 1.98e-06

> #Compare estimates of correlation matrix
> resid3<-round(mod.fit3$correlation - (mod.fit3$loadings[,] %*% t(mod.fit3$loadings[,])
    + diag(mod.fit3$uniqueness)), 4)


> sum(abs(resid3)>0.1)
[1] 18
> sum(abs(resid3)>0.2)
[1] 2
> max(abs(resid3))
[1] 0.3711
> colMeans(abs(resid3))
        FL        APP         AA         LA         SC
0.03814000 0.04914000 0.07768667 0.00032000 0.01366667
        LC        HON        SMS        EXP        DRV
0.02905333 0.03975333 0.02171333 0.01890667 0.03352667
       AMB        GSP        POT         KJ       SUIT
0.02781333 0.03504000 0.03658667 0.06781333 0.01831333
> # max(abs(resid3)) == abs(resid3)  #Omitted output here to save space

The LRT gives a very small p-value, indicating that more factors are needed. However, the residuals are not too large in absolute value. For example, only one distinct residual is greater than 0.2 in absolute value (it corresponds to the AA and KJ pair; the count of 2 above reflects the symmetry of the residual matrix).

Using more than 3 common factors results in the following LRT p-values:

> mod.fit4<-factanal(x = set1[,-1], factors = 4, rotation = "none")
> mod.fit4$PVAL
  objective
0.002469286
> mod.fit5<-factanal(x = set1[,-1], factors = 5, rotation = "none")
> mod.fit5$PVAL
 objective
0.01791472
> mod.fit6<-factanal(x = set1[,-1], factors = 6, rotation = "none")
> mod.fit6$PVAL
 objective
0.07840516
> mod.fit7<-factanal(x = set1[,-1], factors = 7, rotation = "none")
> mod.fit7$PVAL
objective
0.4876583
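As a side check, the degrees of freedom reported with these tests (63 for three factors, 51 for four) follow the usual closed form for the ML factor analysis LRT, ((p − m)² − (p + m))/2 for p variables and m factors; a quick sketch:

```r
# Degrees of freedom for the LRT that m common factors suffice,
# given p observed variables: ((p - m)^2 - (p + m)) / 2
fa.df <- function(p, m) {
  ((p - m)^2 - (p + m)) / 2
}
fa.df(p = 15, m = 3)  # 63, matching the 3-factor output above
fa.df(p = 15, m = 4)  # 51, matching the 4-factor output
```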

Depending on the significance level, the LRT suggests at least five (α = 0.01) or six (α = 0.05) common factors are needed. To help me evaluate the residuals, I wrote my own function:

eval.model<-function(mod.fit) {
  resid.FA<-round(mod.fit$correlation - (mod.fit$loadings[,] %*% t(mod.fit$loadings[,])
      + diag(mod.fit$uniqueness)), 4)
  larger0.1<-sum(abs(resid.FA)>0.1)
  larger0.2<-sum(abs(resid.FA)>0.2)
  max.resid<-max(abs(resid.FA))
  mean.resid<-colMeans(abs(resid.FA))
  list(LRT.pvalue = mod.fit$PVAL, resid.FA = resid.FA, larger0.1 = larger0.1,
       larger0.2 = larger0.2, max.resid = max.resid, mean.resid = mean.resid)
}

With four common factors, below are the results:

> eval.model(mod.fit = mod.fit4)


$LRT.pvalue
  objective
0.002469286

$resid.FA
          FL     APP      AA      LA      SC      LC
FL    0.0000  0.0697  0.0237  0.0043  0.0185  0.0142
APP   0.0697  0.0000 -0.0401  0.0449  0.0050 -0.1075
AA    0.0237 -0.0401  0.0000 -0.0116  0.0040 -0.0376
LA    0.0043  0.0449 -0.0116  0.0000 -0.0096  0.0217
SC    0.0185  0.0050  0.0040 -0.0096  0.0000  0.0127
LC    0.0142 -0.1075 -0.0376  0.0217  0.0127  0.0000
HON  -0.0715  0.0927 -0.0060  0.0157  0.0492 -0.0462
SMS  -0.0464  0.0348 -0.0051  0.0133 -0.0054  0.0236
EXP   0.0042 -0.0264  0.0492 -0.0152  0.0297 -0.0106
DRV  -0.0599 -0.0966  0.0316 -0.0247  0.0083 -0.0351
AMB   0.0196  0.0980 -0.0023  0.0102  0.0110 -0.0443
GSP   0.0164 -0.0044  0.0027 -0.0323 -0.0150  0.0787
POT   0.0023 -0.0241  0.0231  0.0024 -0.0078 -0.0168
KJ    0.0008 -0.0017  0.0001 -0.0001 -0.0002 -0.0003
SUIT -0.0023  0.0728 -0.0804  0.0236 -0.0017  0.0033
         HON     SMS     EXP     DRV     AMB     GSP
FL   -0.0715 -0.0464  0.0042 -0.0599  0.0196  0.0164
APP   0.0927  0.0348 -0.0264 -0.0966  0.0980 -0.0044
AA   -0.0060 -0.0051  0.0492  0.0316 -0.0023  0.0027
LA    0.0157  0.0133 -0.0152 -0.0247  0.0102 -0.0323
SC    0.0492 -0.0054  0.0297  0.0083  0.0110 -0.0150
LC   -0.0462  0.0236 -0.0106 -0.0351 -0.0443  0.0787
HON   0.0000  0.0008  0.0304  0.0412 -0.0478 -0.0114
SMS   0.0008  0.0000 -0.0127  0.0151  0.0003 -0.0138
EXP   0.0304 -0.0127  0.0000 -0.0165 -0.0126  0.0058
DRV   0.0412  0.0151 -0.0165  0.0000 -0.0045 -0.0509
AMB  -0.0478  0.0003 -0.0126 -0.0045  0.0000 -0.0053
GSP  -0.0114 -0.0138  0.0058 -0.0509 -0.0053  0.0000
POT  -0.0094 -0.0134 -0.0098  0.0305  0.0160  0.0095
KJ   -0.0001 -0.0001  0.0004  0.0006  0.0001  0.0007
SUIT  0.0430  0.0406  0.0392  0.0442 -0.0288 -0.0109
         POT      KJ    SUIT
FL    0.0023  0.0008 -0.0023
APP  -0.0241 -0.0017  0.0728
AA    0.0231  0.0001 -0.0804
LA    0.0024 -0.0001  0.0236
SC   -0.0078 -0.0002 -0.0017
LC   -0.0168 -0.0003  0.0033
HON  -0.0094 -0.0001  0.0430
SMS  -0.0134 -0.0001  0.0406
EXP  -0.0098  0.0004  0.0392
DRV   0.0305  0.0006  0.0442
AMB   0.0160  0.0001 -0.0288
GSP   0.0095  0.0007 -0.0109
POT   0.0000  0.0001 -0.0196
KJ    0.0001  0.0000 -0.0013
SUIT -0.0196 -0.0013  0.0000

$larger0.1
[1] 2

$larger0.2
[1] 0

$max.resid
[1] 0.1075

$mean.resid
        FL        APP         AA         LA         SC
0.02358667 0.04791333 0.02116667 0.01530667 0.01187333
        LC        HON        SMS        EXP        DRV
0.03017333 0.03102667 0.01502667 0.01751333 0.03064667
       AMB        GSP        POT         KJ       SUIT
0.02005333 0.01718667 0.01232000 0.00044000 0.02744667

Overall, this model looks better than the one with three common factors. Only one distinct residual, for the APP and LC pair, is greater than 0.1 in absolute value (again counted twice by symmetry). Despite the LRT results, I think 4 common factors are sufficient.

Note that it takes 7 common factors before all of the residuals are less than 0.1 in absolute value (code not shown), and even then only barely.
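The scan over candidate numbers of factors can be sketched in a self-contained way on simulated data (set1 itself is not reproduced here, so this only illustrates the mechanics; the data and seed are made up):

```r
# Sketch on simulated data: scan candidate numbers of common factors,
# recording the LRT p-value and the largest absolute residual in
# R - (L L' + Psi). Variable pairs (1,2) and (3,4) share a common signal.
set.seed(873)
x <- matrix(rnorm(300 * 8), nrow = 300, ncol = 8)
x[, 2] <- x[, 1] + rnorm(300, sd = 0.6)
x[, 4] <- x[, 3] + rnorm(300, sd = 0.6)
for (m in 1:3) {
  fit <- factanal(x = x, factors = m)
  L <- unclass(fit$loadings)
  resid.FA <- fit$correlation - (L %*% t(L) + diag(fit$uniquenesses))
  cat("m =", m, " p-value =", round(fit$PVAL, 4),
      " max |resid| =", round(max(abs(resid.FA)), 4), "\n")
}
```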

b) (5 points) Using the varimax method, state the FA model for the number of common factors chosen.

Below is the output from using the varimax rotation method:

> mod.fit4v<-factanal(x = set1[,-1], factors = 4, rotation = "varimax", scores = "regression")
> print(x = mod.fit4v, cutoff = 0.0)

Call:
factanal(x = set1[, -1], factors = 4, rotation = "varimax")

Uniquenesses:
   FL   APP    AA    LA    SC    LC   HON   SMS   EXP   DRV
0.443 0.685 0.521 0.185 0.119 0.198 0.339 0.138 0.357 0.226
  AMB   GSP   POT    KJ  SUIT
0.137 0.153 0.090 0.005 0.252

Loadings:
     Factor1 Factor2 Factor3 Factor4
FL    0.129   0.717   0.113  -0.117
APP   0.458   0.142   0.243   0.164
AA    0.076   0.126   0.000   0.677
LA    0.231   0.239   0.838  -0.051
SC    0.918  -0.100   0.142  -0.089
LC    0.838   0.111   0.291   0.055
HON   0.252  -0.216   0.742  -0.024
SMS   0.885   0.258   0.095  -0.059
EXP   0.092   0.778  -0.051   0.165
DRV   0.767   0.389   0.172  -0.067
AMB   0.904   0.181   0.097  -0.066
GSP   0.792   0.275   0.351   0.148
POT   0.735   0.349   0.432   0.247
KJ    0.424   0.389   0.554  -0.598
SUIT  0.364   0.770   0.050   0.142

               Factor1 Factor2 Factor3 Factor4
SS loadings      5.570   2.473   2.099   1.013
Proportion Var   0.371   0.165   0.140   0.068
Cumulative Var   0.371   0.536   0.676   0.744

Test of the hypothesis that 4 factors are sufficient.


The chi square statistic is 84 on 51 degrees of freedom.
The p-value is 0.00247

The model is quite large, so I write out only part of it:

z1 = 0.129f1 + 0.717f2 + 0.113f3 − 0.117f4 + ε1

z15 = 0.364f1 + 0.770f2 + 0.050f3 + 0.142f4 + ε15

where z1 is the standardized value for FL, …, z15 is the standardized value for SUIT; f1, …, f4 are the common factors; and ε1, …, ε15 are the specific factors.
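A useful sanity check on a written-out model like this is the variance decomposition it implies: for standardized variables, each z's communality (sum of squared loadings) plus its uniqueness should equal 1. A minimal sketch on simulated data (the project data are not reproduced here):

```r
# Sketch: for each variable, communality + uniqueness = 1 in the
# (orthogonally rotated) ML factor analysis model on standardized data.
set.seed(873)
x <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.5)   # induce one common factor
x[, 4] <- x[, 3] + rnorm(200, sd = 0.5)   # and another
fit <- factanal(x = x, factors = 2, rotation = "varimax")
communality <- rowSums(unclass(fit$loadings)^2)
round(communality + fit$uniquenesses, 4)  # all approximately 1
```

This is why, e.g., FL's squared loadings above sum to about 1 − 0.443, its uniqueness.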

c) (5 points) Interpret the common factors resulting from the model in part b). Make sure to specifically comment on whether positive or negative common factor scores (or scores close to 0) would likely be preferred by the firm.

I printed the output again, but now using a cutoff of 0.1 for the factor loadings.

> print(x = mod.fit4v, cutoff = 0.1)

Call:
factanal(x = set1[, -1], factors = 4, scores = "regression", rotation = "varimax")

Uniquenesses:
   FL   APP    AA    LA    SC    LC   HON   SMS   EXP   DRV
0.443 0.685 0.521 0.185 0.119 0.198 0.339 0.138 0.357 0.226
  AMB   GSP   POT    KJ  SUIT
0.137 0.153 0.090 0.005 0.252

Loadings:
     Factor1 Factor2 Factor3 Factor4
FL    0.129   0.717   0.113  -0.117
APP   0.458   0.142   0.243   0.164
AA            0.126           0.677
LA    0.231   0.239   0.838
SC    0.918           0.142
LC    0.838   0.111   0.291
HON   0.252  -0.216   0.742
SMS   0.885   0.258
EXP           0.778           0.165
DRV   0.767   0.389   0.172
AMB   0.904   0.181
GSP   0.792   0.275   0.351   0.148
POT   0.735   0.349   0.432   0.247
KJ    0.424   0.389   0.554  -0.598
SUIT  0.364   0.770           0.142

               Factor1 Factor2 Factor3 Factor4
SS loadings      5.570   2.473   2.099   1.013
Proportion Var   0.371   0.165   0.140   0.068
Cumulative Var   0.371   0.536   0.676   0.744

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 84 on 51 degrees of freedom.
The p-value is 0.00247


Factor 1: This common factor appears to be an overall measure of an applicant, without comparatively much weight on AA and EXP. Variables such as SC, LC, SMS, AMB, GSP, and POT have strong positive associations with this common factor.

Factor 2: This common factor appears to be an overall measure of FL, EXP, and SUIT (other variables have weaker relationships with this common factor). Note that HON does have a small negative relationship with the common factor.

Factor 3: This common factor appears to be an overall measure of LA, HON, POT, and KJ (other variables have weaker relationships with this common factor).

Factor 4: This common factor consists mostly of a contrast between AA and KJ.

For common factors 1, 2, and 3, the larger the corresponding factor score, the better the applicant. Due to the contrast for common factor 4, it is more difficult to judge what an ideal applicant would have as a score. Overall, one would expect the ideal applicant to be somewhere in the middle of the common factor 4 scores.

d) (5 points) Examine plots of the common factor scores (regression method) and interpret them in the context of the problem. For example, what do you think of applicant #42?

Below are a number of 3D and 4D plots:

> FA3.positive<-mod.fit4v$scores[,3] - min(mod.fit4v$scores[,3]) #Bubble needs all values > 0
> common.limits<-c(min(mod.fit4v$scores[,1:2]), max(mod.fit4v$scores[,1:2]))
> col.symbol<-ifelse(test = mod.fit4v$scores[,3]>0, yes = "red", no = "blue")
> symbols(x = mod.fit4v$scores[,1], y = mod.fit4v$scores[,2], circles = FA3.positive,
    xlab = "Common factor #1", ylab = "Common factor #2", main = "Common factor scores",
    inches = 0.25, xlim = common.limits, ylim = common.limits, panel.first = grid(),
    fg = col.symbol)
> text(x = mod.fit4v$scores[,1], y = mod.fit4v$scores[,2])
> abline(h = 0)
> abline(v = 0)

[Bubble plot titled "Common factor scores": common factor #1 (x-axis) vs. common factor #2 (y-axis), bubble size proportional to common factor #3 (red = positive score, blue = negative), points labeled by applicant number 1-48.]

> library(rgl)
> plot3d(x = mod.fit4v$scores[,1], y = mod.fit4v$scores[,2], z = mod.fit4v$scores[,3],
    xlab = "Common factor #1", ylab = "Common factor #2", zlab = "Common factor #3",
    type = "h", xlim = common.limits, ylim = common.limits)
> plot3d(x = mod.fit4v$scores[,1], y = mod.fit4v$scores[,2], z = mod.fit4v$scores[,3],
    add = TRUE, col = "red", size = 6)
> persp3d(x = common.limits, y = common.limits, z = matrix(data = c(0,0,0,0), nrow = 2,
    ncol = 2), add = TRUE, col = "green")
> text3d(x = mod.fit4v$scores[,1], y = mod.fit4v$scores[,2], z = mod.fit4v$scores[,3] + 0.2,
    text = 1:nrow(set1))
> grid3d(side = c("x", "y", "z"), col = "lightgray")

> library(MASS)
> FA.score<-data.frame(Applicant = 1:nrow(set1), mod.fit4v$scores)
> color.select<-ifelse(test = set1$Applicant == 39 | set1$Applicant == 40, yes = "red",
    no = ifelse(test = set1$Applicant == 42, yes = "purple", no = "black"))
> lwd.select<-ifelse(test = set1$Applicant == 39 | set1$Applicant == 40, yes = 2,
    no = ifelse(test = set1$Applicant == 42, yes = 2, no = 1))
> parcoord(x = FA.score, main = "Common factor scores (#39 and #40 in red; #42 in purple)",
    col = color.select, lwd = lwd.select)


[Parallel coordinate plot titled "Common factor scores (#39 and #40 in red; #42 in purple)" with axes Applicant, Factor1, Factor2, Factor3, Factor4.]

Applicant #42 has among the smallest common factor #1 and #3 scores, while having one of the largest common factor #2 scores. Because larger scores are better for these three common factors, this helps to highlight where applicant #42 has good and bad qualities. For example, applicant #42 had ratings of 10 for FL, EXP, and SUIT, and common factor 2 is an overall measure of these attributes.

Applicants #39 and #40 stand out as having among the largest common factor 1, 2, and 3 scores. Using these common factors alone, they would appear to be the best applicants. These applicants have middle values for common factor #4, which may also correspond to being among the best applicants (see common factor #4’s interpretation from earlier).

Overall, good applicants are those with large common factor 1, 2, and 3 values and middle values for common factor 4.

e) (5 points) Suppose a late applicant submits his/her application after the FA has been completed. The applicant receives 10’s for all 15 original variables! Through using the previous FA results, find this applicant’s common factor scores (regression method) and discuss how this particular individual would compare to the other applicants.

I used the regression-method formula f̂ = L̂′(L̂L̂′ + Ψ̂)⁻¹zr, where L̂ is the estimated loading matrix, Ψ̂ is the diagonal matrix of estimated uniquenesses, and zr is the standardized data vector for the new applicant, to find the scores.

> mean.var<-apply(X = set1[,-1], MARGIN = 2, FUN = mean)
> mean.var
      FL      APP       AA       LA       SC       LC
6.000000 7.083333 7.083333 6.145833 6.937500 6.312500
     HON      SMS      EXP      DRV      AMB      GSP
8.041667 4.854167 4.229167 5.312500 5.979167 6.250000


     POT       KJ     SUIT
5.687500 5.562500 5.958333
> sd.var<-apply(X = set1[,-1], MARGIN = 2, FUN = sd)
> sd.var
      FL      APP       AA       LA       SC       LC
2.673749 1.966023 1.987550 2.805690 2.418072 3.170048
     HON      SMS      EXP      DRV      AMB      GSP
2.534514 3.439381 3.308529 2.947457 2.935401 3.035254
     POT       KJ     SUIT
3.183443 2.657036 3.300279
> new.app<-data.frame(FL = 10, APP = 10, AA = 10, LA = 10, SC = 10, LC = 10, HON = 10,
    SMS = 10, EXP = 10, DRV = 10, AMB = 10, GSP = 10, POT = 10, KJ = 10, SUIT = 10)
> Z.new<-(new.app - mean.var)/sd.var
> Z.new
        FL      APP       AA       LA       SC       LC
1 1.496026 1.483536 1.467468 1.373696 1.266505 1.163232
       HON      SMS      EXP      DRV      AMB      GSP
1 0.7726664 1.496151 1.744229 1.590354 1.369773 1.235482
       POT       KJ     SUIT
1 1.354665 1.670094 1.224644
> #I had difficulty getting Z.new to work with the matrix algebra below. I needed
> #  to use as.numeric() to avoid error messages
> new.score<-t(mod.fit4v$loadings[,]) %*% solve(mod.fit4v$loadings[,] %*%
    t(mod.fit4v$loadings[,]) + diag(mod.fit4v$uniqueness)) %*% as.numeric(Z.new)
> new.score
              [,1]
Factor1  1.0943427
Factor2  1.3626963
Factor3  0.8328489
Factor4 -0.3399626

The common factor scores for the new applicant are (1.094, 1.363, 0.833, −0.340). Other applicants with common factor scores close to these values will be desirable applicants for the job.
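The matrix computation above can be sketched in a self-contained way on simulated data. One caveat: factanal()'s own regression scores invert the observed correlation matrix, whereas this formula inverts the model-implied matrix L̂L̂′ + Ψ̂, so the two agree only approximately when the model fits well:

```r
# Sketch: regression-method factor scores for a new observation,
# f = L' (L L' + Psi)^{-1} z, using a fit on simulated data.
set.seed(873)
x <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.5)
x[, 4] <- x[, 3] + rnorm(200, sd = 0.5)
fit <- factanal(x = x, factors = 2, rotation = "varimax")
L <- unclass(fit$loadings)
# Standardize a hypothetical new row (all 2's) with the training means/sds
z.new <- (rep(2, 6) - colMeans(x)) / apply(x, 2, sd)
f.new <- t(L) %*% solve(L %*% t(L) + diag(fit$uniquenesses)) %*% z.new
f.new  # one score per common factor
```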

f) (5 points) If the overall goal is to find the best individuals to hire, where each of the 15 criteria are given equal weighting for the FA, suggest which applicants are the best. Remember that it is most desirable to score as high as possible among the fifteen variables, so you need to take this into account when using the FA to make your judgments.

Because we want individuals with large common factor 1, 2, and 3 scores, below are the orderings from smallest to largest:

> order(mod.fit4v$scores[,1])
 [1] 42 41 43 35 29 34 47 28 48 25 15 13 26  5 14 36  4 30
[19] 31 33 45 32 19  6 46 18 27 20  9 17 21 16 24  7  1 39
[37]  3  8 44 40 22 23 38  2 37 12 11 10

> order(mod.fit4v$scores[,2])
 [1] 47 48 37 33 38 30 32 27 28 31 12 46 34 18 21 45 35 11
[19] 19 36 29 25 10 20 44 26 15  1  6 14  3  4 17 23 13  2
[37] 22 24 43 16  5  9  7  8 40 39 41 42

> order(mod.fit4v$scores[,3])
 [1] 42 41 11 10 29 37  1 48 28 19 12 18 47 38  3 44 16 17
[19]  2 35 34 43  7  6  8 30  9 36  4 25 31 26 21 22 27 15
[37] 33 32 40 14 23 39  5 20 13 45 24 46

Below are the common factor scores for some good applicants:

> data.frame(good.app, mod.fit4v$scores[c(7,8,39,40),])
  good.app   Factor1  Factor2   Factor3     Factor4
1        7 0.6584032 1.388328 0.1029503 -0.06438636
2        8 0.8577643 1.402397 0.2205970 -0.41671078
3       39 0.7715668 1.515808 0.9455243 -0.36912823
4       40 0.9426586 1.483695 0.7964616 -0.41022530
> t(new.score)
      Factor1  Factor2   Factor3    Factor4
[1,] 1.094343 1.362696 0.8328489 -0.3399626

Next, I added the new applicant to the previous common factor score plots (only new code is given):

New code for bubble plot:

text(x = new.score[1], y = new.score[2], label = "Yes!")

[Bubble plot "Common factor scores" as before (common factor #1 vs. #2, applicants labeled 1-48), now with the new applicant's position marked by the label "Yes!".]

New code for 3D scatter plot:

> plot3d(x = new.score[1], y = new.score[2], z = new.score[3], add = TRUE, col = "blue", size = 12) #Perfect applicant


New code for parallel coordinate plot:

> #Find parallel coordinate value for new applicant:
> find.y<-function(x, min.val = min(x), max.val = max(x)) {
    (x - min.val)/(max.val - min.val)
  }

> #Example with common factor 1
> head(find.y(x = mod.fit4v$scores[,1]))
[1] 0.71752430 0.81092811 0.73773618 0.40793806 0.34000512
[6] 0.54034599

> min.val<-apply(X = mod.fit4v$scores, MARGIN = 2, FUN = min)
> max.val<-apply(X = mod.fit4v$scores, MARGIN = 2, FUN = max)
> new.score1<-find.y(x = new.score[1], min.val = min.val[1], max.val = max.val[1])
> new.score2<-find.y(x = new.score[2], min.val = min.val[2], max.val = max.val[2])
> new.score3<-find.y(x = new.score[3], min.val = min.val[3], max.val = max.val[3])
> new.score4<-find.y(x = new.score[4], min.val = min.val[4], max.val = max.val[4])
> matplot(c(NA, new.score1, new.score2, new.score3, new.score4), add = TRUE, col = "green",
    type = "l", lwd = 5)
> title(sub = "Green line is for perfect applicant")
> #Note that I also ran an example with observation #39 to make sure the line was in the
> #  same place as with the original parcoord() function implementation.


[Parallel coordinate plot "Common factor scores (#39 and #40 in red; #42 in purple)" with axes Applicant, Factor1, Factor2, Factor3, Factor4 and the subtitle "Green line is for perfect applicant".]

In order to choose among the applicants, we could use the results from e) to determine a "sweet spot" on the common factor score plots. Applicants with common factor scores similar to this spot may be the best to hire. Using this as a criterion, applicants #40 and #39 are the best.
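One concrete way to implement the "sweet spot" idea is to rank applicants by the Euclidean distance from their common factor scores to the perfect applicant's scores from e). A toy sketch (the score rows here are hypothetical stand-ins, not values from the analysis; the target vector is the perfect applicant's scores):

```r
# Sketch: order rows of a score matrix by distance to a target point.
scores <- rbind(a = c(0.7, 1.4, 0.1, -0.1),   # hypothetical applicants
                b = c(0.9, 1.5, 0.8, -0.4),
                c = c(-1.5, 0.2, -0.9, 0.3))
target <- c(1.094, 1.363, 0.833, -0.340)      # perfect applicant's scores
d <- sqrt(colSums((t(scores) - target)^2))    # distance per row
names(sort(d))                                # closest (best) first
```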

2) This problem continues to examine the data set from project #1, and it involves performing a CA for the data. Use standardized data for the analysis.

a) (12 points) Determine an appropriate number of clusters using the furthest neighbor hierarchical clustering method. Fully justify your answer. Include the appropriate plots with your written justification.

> Z<-scale(set1[,-1])
> head(Z, n = 2)
          FL         APP        AA         LA        SC
[1,] 0.00000 -0.04238674 -2.557588 -0.4083963 0.4393996
[2,] 1.12202  1.48353605 -1.048192  0.6608594 1.2665046
            LC        HON       SMS        EXP       DRV
[1,] 0.2168737 -0.01643971 0.9146509 -0.3715145 0.9118031
[2,] 0.8477791  0.37811332 1.4961509  0.2329837 1.2510787
          AMB       GSP        POT        KJ     SUIT
[1,] 1.029104 0.2470963 -0.2159612 0.5410164 1.224644
[2,] 1.029104 0.5765580  0.7264148 0.9173756 1.224644
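As an aside, scale() stores the centers and standard deviations it used as attributes, which gives another route to standardizing a new observation than recomputing means and standard deviations by hand as in 1e). A small sketch with toy numbers:

```r
# Sketch: reuse scale()'s stored centers/sds to standardize a new row.
x <- matrix(c(1, 2, 3,
              4, 5, 6), ncol = 2)
Z <- scale(x)
ctr <- attr(Z, "scaled:center")  # column means: 2 and 5
s <- attr(Z, "scaled:scale")     # column sds: 1 and 1
new.obs <- c(10, 10)
(new.obs - ctr) / s              # standardized new observation: 8 and 5
```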


> dist.mat<-dist(x = Z, method = "euclidean")
> clust.fn<-hclust(d = dist.mat, method = "complete")
> plot(clust.fn)
> abline(h = 9, lty = "dashed", lwd = 2)    #2 clusters
> abline(h = 7, lty = "dashed", lwd = 2)    #4 clusters
> abline(h = 5.25, lty = "dashed", lwd = 2) #8 clusters

[Cluster dendrogram from hclust (*, "complete") on dist.mat, height on the vertical axis, with dashed cut lines at heights 9, 7, and 5.25 corresponding to 2, 4, and 8 clusters.]

From examining the hierarchical tree diagram, justifiable numbers of clusters are 2, 4, and possibly 8. For 2 and 4 clusters, there are larger distance separations at those cut points in the tree. There is also a larger distance separation for 8 clusters if one includes observation 43 in the cluster with 41 and 42.
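The cut heights can also be examined numerically: hclust() stores the merge heights, and a large jump between successive heights marks a well-separated cut. A toy sketch on simulated data with two clearly separated groups:

```r
# Sketch: with two well-separated groups, complete linkage recovers them;
# the final merge height is much larger than the earlier ones.
set.seed(873)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # group 1: rows 1-10
             matrix(rnorm(20, mean = 5), ncol = 2))   # group 2: rows 11-20
cl <- hclust(d = dist(toy), method = "complete")
rev(cl$heights)[1:3]          # heights of the last few merges
memb <- cutree(tree = cl, k = 2)
table(memb)                   # cluster sizes at the 2-cluster cut
```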

I modified my PCA.CA.plot() function so that parallel coordinate plots would be created for the first 4 PCs. Below is the code that I put at the end of the function:

win.graph(width = 10)
par(pty = "m")
N<-nrow(score.cor)
parcoord(x = data.frame(ID = 1:N, score.cor[,1:4]), main = "Parallel coordinate plot",
  col = clusters, lwd = 2)

I obtained the following plots (the scatter plot of the first two PCs is excluded because 3 or 4 PCs are needed for the data as determined in project #1).


Two clusters:

[Scatter plot "PCs with furthest neighbor CA method and 2 clusters" (PC #1 vs. PC #2, applicants labeled 1-48) and "Parallel coordinate plot with furthest neighbor CA method and 2 clusters" (axes ID, Comp.1, Comp.2, Comp.3, Comp.4).]


Four clusters:

[Scatter plot "PCs with furthest neighbor CA method and 4 clusters" (PC #1 vs. PC #2, applicants labeled 1-48) and "Parallel coordinate plot with furthest neighbor CA method and 4 clusters" (axes ID, Comp.1, Comp.2, Comp.3, Comp.4).]


Eight clusters appeared to be too many after viewing their plots so they are not reproduced here.

With two clusters, the separation between clusters occurs along PC #1. With four clusters, the blue and green clusters are separated from the black and red clusters by PC #1. The blue and green clusters are separated from each other by PC #2. The black and red clusters are somewhat separated by PC #1 (black is larger than red). PC #2 and #3 have some larger red values than black, which appears to further separate the observations. PC #4 does not provide any obvious separation among the clusters (though black tends to be a little higher than red).

Both two and four clusters are appropriate for this data. Students can choose either to receive full credit. I will continue examining both cases here for the answer key.

b) (6 points) Using the results from a), describe what types of applicants the clusters represent. Make sure to include the cluster memberships.

Memberships:

> clusters2<-cutree(tree = clust.fn, k = 2)
> clusters4<-cutree(tree = clust.fn, k = 4)
   set1.Applicant clusters2 clusters4
1               1         1         1
2               2         1         2
3               3         1         1
4               4         1         1
5               5         1         1
6               6         1         1
7               7         1         2
8               8         1         2
9               9         1         2
10             10         1         2
11             11         1         2
12             12         1         2
13             13         1         1
14             14         1         1
15             15         1         1
16             16         1         2
17             17         1         1
18             18         1         1
19             19         1         1
20             20         1         1
21             21         1         1
22             22         1         2
23             23         1         2
24             24         1         2
25             25         1         1
26             26         1         1
27             27         1         1
28             28         2         3
29             29         2         3
30             30         1         1
31             31         1         1
32             32         1         1
33             33         1         1
34             34         2         3
35             35         2         3
36             36         1         1
37             37         1         1
38             38         1         1
39             39         1         2
40             40         1         2
41             41         2         4
42             42         2         4
43             43         2         4
44             44         1         1
45             45         1         1
46             46         1         1
47             47         2         3
48             48         2         3

Using two clusters, we obtain the following:

Because PC #1 separates the two clusters into two distinct parts, we can use PC #1 to interpret the clusters. Remember that PC #1 represents an overall measure of the quality of the applicants, where smaller values represent better candidates. Thus, the black cluster (#1) represents overall better applicants and the red cluster (#2) represents overall worse applicants.

Using four clusters, we obtain the following:

Black cluster (#1): These candidates generally have PC #1 scores that are in the middle so they may be “second-tier” applicants. Some of these applicants have larger PC #4 values indicating larger APP, AA, LA, HON values and lower KJ values.

Red cluster (#2): These applicants are among the best because the PC #1 scores tend to be the lowest. There are also some larger PC #3 scores indicating some of these individuals may have higher AA, SC, AMB and lower FL, LA, and KJ in comparison to others.

Green cluster (#3): Similar to the blue cluster, these applicants generally have larger PC #1 values, indicating they are among the worse overall applicants. They differ from the blue cluster in that they have much smaller PC #2 values.

Blue cluster (#4): Similar to the green cluster, these applicants generally have larger PC #1 values indicating they are among the worse overall applicants. They also usually have the largest PC #2 values indicating larger FL, AA, EXP, SUIT and lower SC, HON values.
