
Case Study

Quality and Reliability Engineering International 2011; 27:843–854. Published online 14 January 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/qre.1173

Addressing Multicollinearity in Semiconductor Manufacturing

Yu-Ching Chang and Christina Mastrangelo∗†

When building prediction models in the semiconductor environment, many variables, such as input/output variables, have causal relationships which may lead to multicollinearity. There are several approaches to address multicollinearity: variable elimination, orthogonal transformation, and adoption of biased estimates. This paper reviews these methods with respect to an application that has a structure more complex than simple pairwise correlations. We also present two algorithmic variable elimination approaches and compare their performance with that of the existing principal component regression and ridge regression approaches in terms of residual mean square and R2. Copyright © 2011 John Wiley & Sons, Ltd.

Keywords: multicollinearity; variable elimination; principal components regression; variance inflation factor

Introduction

Semiconductor manufacturing is a complex system with a long processing time: several hundred steps and many recurrent processes that use the same tool groups. One characteristic of semiconductor manufacturing is recurrence. Recurrence occurs for several reasons: processing is done layer by layer; expensive equipment is used repeatedly; and precision requirements in alignment necessitate that some processes be performed by the same machine. This especially occurs in the photolithography area. Figure 1 shows the front-end fabrication process by functional area; a layer is completed after every loop. Owing to this recurrent characteristic, processes tend to interact, making this kind of system more difficult to analyze than sequential systems. Therefore, when multivariate statistical analysis is used to analyze semiconductor systems, multicollinearity among the variables will likely exist most of the time.

If two variables are correlated, this is referred to as pairwise correlation. When that pair of variables is also highly correlated with one or more other pairs, as often happens in this situation, this will be referred to in this paper as 'crossover multicollinearity'. The data used in this paper to demonstrate how to address crossover multicollinearity is from a central processing unit (CPU) that has 32 end-of-line electronic measurements (regressors) and a chip operating speed measurement (the response) per observation. To disguise the data, it has been standardized. A subset of the correlation matrix is given in Table I. Each cell with a value denotes a 'high' pairwise correlation, defined here as an absolute value greater than 0.7, the 'cutoff' value used in this paper. If only pairwise correlations were present, there would be only one entry in each row or column. Table I also shows that the relationships between the variables are not transitive. For example, parameters 11 and 12 (P_11 and P_12) are highly correlated, as are P_12 and P_30. We would expect P_11 and P_30 to also be highly correlated. However, the correlation between P_11 and P_30 is not high; it is actually −0.54.
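The screening rule used to build Table I (flag every pair whose absolute correlation exceeds the 0.7 cutoff) is easy to automate. The following minimal Python/NumPy sketch is illustrative only; the simulated data and the function name high_corr_pairs are our own assumptions, but the cutoff logic mirrors the rule described above.

    import numpy as np

    def high_corr_pairs(X, names, cutoff=0.7):
        """List all regressor pairs whose absolute pairwise correlation exceeds the cutoff."""
        R = np.corrcoef(X, rowvar=False)           # p x p correlation matrix of the columns of X
        p = R.shape[0]
        return [(names[i], names[j], round(float(R[i, j]), 3))
                for i in range(p) for j in range(i + 1, p)
                if abs(R[i, j]) > cutoff]

    # Illustrative use on simulated, standardized measurements (not the CPU data):
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 4))
    X[:, 1] = 0.95 * X[:, 0] + 0.05 * rng.standard_normal(200)   # plant one strong correlation
    print(high_corr_pairs(X, ["P_01", "P_02", "P_03", "P_04"]))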

Crossover multicollinearity occurs in semiconductor manufacturing because many process steps are highly dependent on the previous steps. Figure 2 shows an example of how this multicollinearity may occur in the etching process: etching step i is largely affected by pattern definition in lithography process i. Therefore, the measurements from the lithography and etch processes would be highly correlated. Moreover, the lithography results at the next lithography step would be highly correlated with the previous lithography process because they are processed by the same machine. As a result, lithography i is highly correlated with both etching i and lithography i+1, but etching i and lithography i+1 may or may not be highly correlated.

Industrial and Systems Engineering, University of Washington, Seattle, WA 98195-2650, U.S.A.
∗Correspondence to: Christina Mastrangelo, Industrial and Systems Engineering, University of Washington, Seattle, WA 98195-2650, U.S.A.
†E-mail: [email protected]


Figure 1. Front-end process (from Wafer Start, a loop through the functional areas Diffusion, Photo, Etch, Implant, and optional CMP; a layer is completed after each loop)

Multicollinearity review

Multicollinearity is a well-known condition that affects the estimates in regression models.1,2 Two regressors, x1 and x2, are said to be exactly collinear if there are constants c0, c1, c2 such that c1x1 + c2x2 = c0. Exact multicollinearity is said to exist if there are more than two regressors that are exactly collinear. Multicollinearity, or more precisely approximate linearity, is defined as a set of regressors x1, x2, . . ., xp and constants c0, c1, . . ., cp that have a near-linear relationship

c1x1 + c2x2 + ··· + cpxp ≈ c0. (1)

The significance of collinearity can be demonstrated by a two-regressor model

y = β1x1 + β2x2 + ε,

where E(ε) = 0 and Var(ε) = σ². The least-squares normal equations are (X′X)b = X′y, and the estimates of the regression coefficients are

b̂ = (X′X)−1X′y. (2)

The covariance matrix of b̂ is

cov(b̂) = σ²(X′X)−1. (3)

Let r12 denote the correlation between x1 and x2, and rjy the correlation between xj and y, where j = 1, 2. The inverse of (X′X) may be written as

(X′X)−1 = [  1/(1−r12²)     −r12/(1−r12²)
            −r12/(1−r12²)    1/(1−r12²)  ].

If x1 and x2 are highly correlated, r12² will be close to 1. As a result, by (2) the signs and values of the regression coefficients b̂ will be poorly estimated and unstable. Also, (3) shows that the variances and covariances are much larger than in the case when x1 and x2 are less correlated.
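A quick numerical check of (3) makes the inflation concrete. The short sketch below (Python/NumPy; standardized regressors, σ² = 1, and illustrative values of r12 chosen by us) evaluates Var(b̂1)/σ² = 1/(1 − r12²), which grows from 1 at r12 = 0 to roughly 50 at r12 = 0.99.

    import numpy as np

    # Var(b1)/sigma^2 is the (1,1) element of (X'X)^{-1} with X'X in correlation form.
    for r12 in (0.0, 0.7, 0.9, 0.99):
        XtX = np.array([[1.0, r12],
                        [r12, 1.0]])
        var_b1 = np.linalg.inv(XtX)[0, 0]          # equals 1 / (1 - r12**2)
        print(f"r12 = {r12:4.2f}   Var(b1)/sigma^2 = {var_b1:6.2f}")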

Multicollinearity diagnostics

There are three widely accepted indicators to detect multicollinearity: simple pairwise correlations, variance inflation factors, and eigenvalues. The first and simplest indicator is the correlation matrix, with the following rule of thumb: if the absolute value of an off-diagonal element is larger than 0.8 or 0.9, the two involved regressors are usually considered highly correlated.3,4 If the pairwise correlation is between 0.7 and 0.8, there may be mild collinearity, and this should also be taken into consideration.5

Another widely accepted measure of multicollinearity is the variance inflation factor (VIF).6 The VIF measures the combined effect of the dependences among the regressors on the variance of a term. It is defined as

VIFj = (1 − Rj²)−1,

where Rj² is the coefficient of determination when xj is regressed on the remaining regressors. In general, VIFs exceeding 10 indicate serious multicollinearity, and VIFs between 5 and 10 suggest there might be some mild multicollinearity. O'Brien7 recommends that the VIF threshold be 5 or as low as 4.
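This definition translates directly into code. The sketch below is a minimal Python/NumPy implementation (the function name vif and the intercept handling are our own choices, not part of the paper); it regresses each column on the remaining columns and returns 1/(1 − Rj²).

    import numpy as np

    def vif(X):
        """Variance inflation factors: VIF_j = 1/(1 - R_j^2) from regressing x_j on the other regressors."""
        n, p = X.shape
        out = np.empty(p)
        for j in range(p):
            y = X[:, j]
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])   # other regressors plus intercept
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            resid = y - Z @ beta
            r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
            out[j] = 1.0 / (1.0 - r2)               # exact collinearity gives r2 = 1 and an infinite VIF
        return out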


Table I. Correlation matrix (only highly correlated pairs > 0.7 are shown)

Variable   Highly correlated pairs (r)                                                         # of pairs   Sum of pairs (absolute values)
P_01       P_03 (0.929)                                                                            1            0.929
P_02       P_03 (0.727), P_04 (0.882)                                                              2            1.609
P_03       P_01 (0.929), P_02 (0.727), P_04 (0.782)                                                3            2.438
P_04       P_02 (0.882), P_03 (0.782)                                                              2            1.664
P_11       P_12 (0.840), P_20 (−0.734), P_21 (0.816), P_22 (−0.830), P_28 (−1.000), P_29 (−0.840)  6            5.060
P_12       P_11 (0.840), P_21 (0.790), P_22 (−0.790), P_28 (−0.841), P_29 (−1.000), P_30 (−0.768)  6            5.029
P_17       P_23 (−0.869)                                                                           1            0.869
P_18       P_24 (−0.811), P_31 (−0.839)                                                            2            1.650
P_19       P_21 (0.762), P_22 (−0.760)                                                             2            1.522
P_20       P_11 (−0.734), P_28 (0.733)                                                             2            1.467
P_21       P_11 (0.816), P_12 (0.790), P_19 (0.762), P_22 (−0.978), P_28 (−0.816), P_29 (−0.791)   6            4.952
P_22       P_11 (−0.830), P_12 (−0.790), P_19 (−0.760), P_21 (−0.978), P_28 (0.830), P_29 (0.790)  6            4.979
P_23       P_17 (−0.869)                                                                           1            0.869
P_24       P_18 (−0.811), P_31 (0.731)                                                             2            1.542
P_28       P_11 (−1.000), P_12 (−0.841), P_20 (0.733), P_21 (−0.816), P_22 (0.830), P_29 (0.841)   6            5.062
P_29       P_11 (−0.840), P_12 (−1.000), P_21 (−0.791), P_22 (0.790), P_28 (0.841), P_30 (0.767)   6            5.030
P_30       P_12 (−0.768), P_29 (0.767)                                                             2            1.535
P_31       P_18 (−0.839), P_24 (0.731)                                                             2            1.569


Figure 2. An example of the causes of 'crossover multicollinearity' (process flow lithography i → etching i → deposition i, followed by lithography i+1 → etching i+1 → deposition i+1 for the next layer; lithography i is highly correlated with both etching i and lithography i+1, while etching i and lithography i+1 are not highly correlated)

The third method is eigenvalue analysis or, essentially, principal component analysis. The condition number is defined as

λmax / λmin,

where λmax and λmin are the largest and smallest eigenvalues of the X′X matrix. A condition number larger than 1000 is an indicator of strong multicollinearity.8 If the condition number is between 100 and 1000, there might be moderate multicollinearity involved. Generally, multicollinearity occurs when there is a small λmin very close to zero. An example of how to use principal component analysis to detect the linear relationship among variables is given in a later section.
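In code, the condition number defined above is just a ratio of eigenvalues of X′X. The sketch below (Python/NumPy; the function name is our own) standardizes the columns first, as was done for the CPU data; an eigenvalue near zero drives the ratio toward infinity, signalling strong multicollinearity.

    import numpy as np

    def condition_number(X):
        """lambda_max / lambda_min of X'X for column-standardized data."""
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        eigvals = np.linalg.eigvalsh(Z.T @ Z)      # symmetric matrix -> eigenvalues in ascending order
        # A numerically zero (or tiny negative) smallest eigenvalue indicates exact collinearity.
        return eigvals[-1] / eigvals[0]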

Techniques to remedy multicollinearity

To address multicollinearity and increase the accuracy of the estimates, four approaches are recommended:

(1) Obtain additional data.
(2) Eliminate variables.
(3) Transform orthogonally.
(4) Adopt biased estimates.

(1) Additional data. Note that the regression coefficients are severely affected by the sample data if multicollinearity exists. Farrar and Glauber1 suggest that collecting additional data may relieve the problem of multicollinearity. Unfortunately, collecting additional data is not always possible due to the cost of sampling and the availability of data. Weisberg9 uses a two-regressor model with simulated data to demonstrate that it takes several times more data to obtain roughly the same variance/covariance values, or the same accuracy.

(2) Variable elimination. Dropping one of the highly correlated variables is a widely used approach due to its simplicity. However, there are two drawbacks to this method. First, the information about the dropped variables is lost to the model. This is a serious problem if the dropped variables are significantly explanatory of the response. Second, determining which of the highly correlated variables should be removed from the model is another challenge; we will propose a solution to this later in the paper.

(3) Orthogonal transformation. The purpose of orthogonal transformation is to introduce another set of variables which are independent of each other. The X matrix is transformed into an orthogonal matrix. Such techniques include Principal Components Analysis10 and Gram–Schmidt Orthogonalization.11 However, the new variables may be difficult to interpret if many variables are involved in the transformation.

(4) Adopting biased estimates. The idea of adopting biased estimates is to trade unbiased estimates for ones with a smaller mean-squared error (MSE). Such methods include Ridge Regression12 and Latent Root Regression.13 Though biased estimators provide considerable improvement in the MSE, the method is criticized for not using the unbiased estimators. Another problem is that it is difficult to assess how much improvement in MSE has been achieved or how much bias has been introduced.

Clearly, these approaches have their own advantages and disadvantages in addressing multicollinearity. In this paper, we propose two simple methods motivated by the above methods: one is a VIF approach and the other is a pairwise-elimination approach. These two algorithmic approaches suggest which variables should be dropped at each stage.


Table II. Intermediate results based on VIF algorithm

(A dropped variable's row ends at the last run before it was removed from the model.)

Variable  Original   1st run   2nd run   3rd run   4th run   5th run   6th run   7th run   8th run   9th run   10th run   11th run
P_01      71411.20   50682.13  4.46      4.44      4.41      4.26      4.16      4.01      3.78      3.73      3.66       3.63
P_02      65875.00   2.88      2.88      2.88      2.88      2.87      2.76      2.74      2.68      2.68      2.65       2.60
P_03      87307.80   61993.16
P_04      93522.50
P_05      2.20       2.20      2.19      2.19      2.18      2.18      2.04      2.04      2.01      1.98      1.92       1.80
P_06      1.60       1.64      1.64      1.63      1.63      1.63      1.60      1.58      1.58      1.57      1.52       1.51
P_07      1.60       1.59      1.58      1.58      1.58      1.58      1.51      1.50      1.49      1.49      1.45       1.45
P_08      2.40       2.36      2.36      2.36      2.36      2.34      2.33      2.33      2.33      2.30      2.17       2.16
P_09      10276.10   7300.67   3.23      3.22      3.19      3.11      3.07      2.95      2.91      2.56      2.32       2.32
P_10      19119.10   4.77      4.76      4.76      4.74      4.66      4.49      3.65      3.45      3.34      2.89       2.79
P_11      23962.30   22449.84  22254.21
P_12      19461.10   19457.29  19031.39  14081.15
P_13      1.30       1.35      1.33      1.33      1.32      1.30      1.29      1.29      1.29      1.26      1.26       1.26
P_14      1.40       1.35      1.35      1.33      1.33      1.30      1.30      1.30      1.27      1.27      1.26       1.25
P_15      2.50       2.46      2.46      2.45      2.44      2.39      2.08      2.03      2.00      1.94      1.88       1.88
P_16      2.70       2.67      2.67      2.67      2.66      2.65      2.65      2.64      2.49      2.33      2.33       2.32
P_17      7.20       7.18      7.18      7.18      7.15      7.14      7.13      7.13      6.92
P_18      8.10       8.09      8.09      8.09      8.09      8.09      8.08
P_19      3.40       3.37      3.37      3.37      3.37      3.32      2.99      2.99      2.79      2.70      2.69       2.27
P_20      2.70       2.75      2.75      2.75      2.74      2.74      2.64      2.58      2.58      2.58      2.56       2.22
P_21      50.10      50.05     50.03     49.51     49.08
P_22      48.80      48.81     48.81     48.74     47.92     11.54
P_23      6.70       6.71      6.71      6.71      6.67      6.67      6.67      6.57      6.46      3.57      3.25       3.14
P_24      4.20       4.19      4.19      4.19      4.18      4.18      4.17      3.88      3.88      3.85      3.18       3.13
P_25      1.80       1.83      1.83      1.83      1.83      1.83      1.81      1.81      1.80      1.80      1.79       1.77
P_26      2.70       2.66      2.66      2.66      2.64      2.57      2.54      2.54      2.54      2.40      2.32       2.30
P_27      2.60       2.63      2.63      2.62      2.62      2.56      2.55      2.55      2.55      2.54      2.26       2.24
P_28      22952.10   22532.34  22335.47  9.52      9.24      9.17      6.24      6.24      4.37      4.26      4.01
P_29      19277.50   19224.44  19195.96  14115.56  9.02      8.07      7.85      7.75
P_30      4.60       4.56      4.54      4.53      4.53      4.52      3.90      3.78      3.21      3.21      3.19       3.07
P_31      7.30       7.27      7.27      7.27      7.26      7.00      7.00      5.08      5.00      4.56
P_32      3.20       3.16      3.16      3.16      3.12      2.51      2.37      2.33      2.14      2.14      1.90       1.89
RMS       0.434      0.435     0.436     0.436     0.437     0.444     0.452     0.508     0.509     0.509     0.509      0.512
R2(%)     56.9       56.9      56.7      56.7      56.3      56.0      55.1      49.6      49.4      49.4      49.4       49.1

In the following sections of this paper, the proposed VIF and pairwise variable elimination algorithms are given, and the results are presented. Two existing methods, principal components regression and ridge regression, are briefly introduced and their results summarized. We evaluate the performance of these four approaches using the residual mean square (RMS) and R2. A larger R2 is preferred since it represents the proportion of variation explained by the regressors, and a smaller RMS is also preferred since it leads to a narrower prediction interval. After the comparison, we address the differences between variable elimination and variable selection. An improvement on the pairwise dropping approach can be made by utilizing existing variable selection techniques. The advantages and disadvantages are discussed subsequently. Finally, conclusions regarding the four methods are given.

Variable elimination: VIF approach

Because larger VIFs indicate a higher likelihood of multicollinearity, an intuitive variable elimination approach is to drop the variable that has the largest VIF at each run until a threshold is met. Table II shows all intermediate results of dropping the variable with the highest VIF until all VIFs are smaller than 4. In the first run, P_04 is dropped because it has the largest VIF. Based on the model without P_04, a new set of VIFs is generated and a new variable with the largest VIF is picked for dropping in the next run. At the 9th run, this simple method achieves all VIFs smaller than 5; at the 11th run, all VIFs are smaller than 4. This VIF variable elimination can 'break' the multicollinearity. However, the major disadvantage of this method is that interesting process variables may be dropped. A simple variant of this algorithm to overcome this disadvantage is to select two or more of the largest VIFs and drop the least 'interesting' variable, where 'interesting' is defined from the perspective of a user who has prior process knowledge.
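A sketch of this elimination loop is given below (Python/NumPy). It is our own illustrative implementation, not the authors' code: VIFs are read off the diagonal of the inverse correlation matrix of the remaining regressors, and the variable with the largest VIF is removed until every VIF falls below the chosen threshold (4 here, matching the stopping rule used for Table II).

    import numpy as np

    def vif_eliminate(X, names, threshold=4.0):
        """Drop the regressor with the largest VIF, one per pass, until all VIFs are below threshold."""
        keep = list(range(X.shape[1]))
        dropped = []
        while len(keep) > 1:
            R = np.corrcoef(X[:, keep], rowvar=False)
            # VIF_j is the j-th diagonal element of R^{-1}; exactly collinear data makes R singular,
            # in which case np.linalg.pinv(R) can be substituted.
            vifs = np.diag(np.linalg.inv(R))
            worst = int(np.argmax(vifs))
            if vifs[worst] < threshold:
                break
            dropped.append(names[keep[worst]])
            del keep[worst]
        return [names[k] for k in keep], dropped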


Variable elimination: pairwise dropping approach

Another commonly used method for combating multicollinearity is a pairwise elimination approach. This approach drops one variable from every highly correlated pair. A simple guide for this method is to drop the redundant variables first and keep the remaining ones. However, if crossover multicollinearity exists among the variables, it is more difficult to decide which variable should be dropped. For example, from Table I we can see that P_03 is highly correlated with P_01, P_02, and P_04, while P_01 is highly correlated with P_03 only. If we drop P_01, then P_03 must be preserved in the model because it is the only variable that can 'represent' P_01. However, P_03 is still highly correlated with P_02 and P_04, and we need to further remove these last two parameters. The problem of retaining the most variables while breaking all the existing highly correlated pairs becomes more difficult when the degree of crossover becomes higher. For example, P_11 and P_12 each have six highly correlated relationships. A simple idea to break ties as quickly as possible is to remove the variable that has the most relationships with the rest. For example, if we remove P_03, since it has three large correlations, we only need to remove either P_02 or P_04. We propose a simple algorithm as follows:

Algorithm:

(1) Select the pair having the largest absolute correlation among all the pairs.
(2) Remove the variable that has the largest number of highly correlated pairs.
(3) If there is a tie in step 2, remove the variable that is least 'interesting'.
(4) Repeat until all highly correlated pairs are removed.
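The steps above can be written down directly. The following Python sketch is an illustrative implementation (the function name pairwise_eliminate and the interest scores are hypothetical); it takes a correlation matrix, repeatedly picks the strongest remaining pair, and removes one member per the rules in steps (2) and (3).

    import numpy as np

    def pairwise_eliminate(R, names, cutoff=0.7, interest=None):
        """Drop one variable per highly correlated pair following steps (1)-(4).
        `interest` maps a variable name to a user-assigned score; lower means less interesting."""
        interest = {} if interest is None else interest
        R = np.asarray(R, dtype=float)
        active, dropped = set(range(len(names))), []
        while True:
            pairs = [(i, j) for i in active for j in active
                     if i < j and abs(R[i, j]) > cutoff]
            if not pairs:
                break
            i, j = max(pairs, key=lambda p: abs(R[p[0], p[1]]))                    # step (1)
            deg = {k: sum(abs(R[k, m]) > cutoff for m in active if m != k) for k in (i, j)}
            if deg[i] != deg[j]:
                victim = i if deg[i] > deg[j] else j                               # step (2)
            else:
                victim = min((i, j), key=lambda k: interest.get(names[k], 0.0))    # step (3)
            dropped.append(names[victim])
            active.discard(victim)                                                 # step (4): repeat
        return [names[k] for k in sorted(active)], dropped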

Applying the algorithm to the data set shown in Table I, P_01 and P_03 are selected first because they have the highest correlation of 0.929. Since these two parameters have '1 pair' and '3 pairs', respectively (the '# of pairs' entries in Table I), assume P_03 is less interesting and remove it. Continuing in this manner, P_12 and P_28 are selected on the 4th iteration. Because both P_12 and P_28 have '6 pairs', we drop P_28 by assuming that it is less interesting. In the next iteration, P_11 and P_29 are selected; because both P_11 and P_29 have six pairs, we drop P_29, assuming it is less interesting. If we continue the algorithm, we will drop variables P_03, P_04, P_12, P_18, P_20, P_21, P_22, P_23, P_28, P_29, and P_31. Column 2 of Table III shows the VIFs of the original data set, which indicate that there is severe multicollinearity in the model. Column 3 of Table III shows the VIFs after these 11 variables are dropped; there is no longer obvious multicollinearity, in that all the VIFs are smaller than 5. Comparing this method to the above VIF elimination method when 11 variables are dropped (Table II), we can see that the two methods have similar results: similar variables are dropped; the remaining VIFs are roughly the same; and the RMS (0.5133 and 0.512) and R2 (49.1 and 49.1%) are close.

Principal components regression

Principal components regression (PCR) addresses multicollinearity by transforming the regressors into a new set of coordinate axes which make the transformed regressors orthogonal to each other. In other words, the correlation matrix estimated from the transformed data would be a diagonal matrix, and the transformed regressors are independent of each other. Refer to Johnson and Wichern14 for more details. An advantage of PCR is being able to reduce the data set in the new coordinate system.15 To illustrate the idea, say the original model is

y = Xb + e,

where the rank of X is p. Now, let T be a p×p orthogonal matrix whose columns are the eigenvectors of X′X, and let K = diag(λ1, λ2, . . ., λp), where λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0 are the eigenvalues of X′X. The transformed model may be written as

y = Za + e,

where Z = XT and a = T′b. Since T is an orthogonal matrix, Z = [Z1, Z2, . . ., Zp] becomes a new set of orthogonal regressors, referred to as the principal components. Because

Z′Z = T′X′XT = K, (4)

the eigenvalue λk is the variance due to the kth principal component. Thus, the first k principal components account for (λ1 + ··· + λk) / (λ1 + ··· + λp) of the total population variance. The implication for PCR is that if we are interested in a model that can explain at least 90% of the variability, and we assume that the first k principal components can achieve this, we can use only the first k principal components without much loss of information. In other words, the total number of regressors in the new coordinate system is reduced by p − k. The ordered eigenvalues of the 32 principal components are given in the first column of Table IV (the cumulative variance is in the second column). The rest of Table IV shows the eigenvectors of the last seven principal components. Note that many values of these vectors are ≈ 0; these correspond to the components with λj ≈ 0. The estimates based on the first k principal components are given in Table V. Note that since the last four eigenvalues are close to zero, we would expect that using the first 28 PCs, in terms of RMS and R2, would not make much difference compared with using all of the PCs.
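For reference, the transformation and truncation described above amount to only a few lines of linear algebra. The sketch below (Python/NumPy; the function name pcr_fit is ours) assumes X and y are standardized, as in this paper, regresses y on the first k principal components, and maps the result back to coefficients on the original regressors.

    import numpy as np

    def pcr_fit(X, y, k):
        """Principal components regression using the first k components of X."""
        eigvals, T = np.linalg.eigh(X.T @ X)       # eigenvalues (ascending) and eigenvectors of X'X
        eigvals, T = eigvals[::-1], T[:, ::-1]     # reorder so lambda_1 >= ... >= lambda_p
        Z = X @ T[:, :k]                           # scores on the first k principal components
        a, *_ = np.linalg.lstsq(Z, y, rcond=None)  # a = T'b restricted to the retained components
        b = T[:, :k] @ a                           # back-transform to coefficients on the X scale
        explained = eigvals[:k].sum() / eigvals.sum()
        return b, explained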

The pairwise variable elimination algorithm achieves an RMS of 0.5133 and an R2 of 49.1% by dropping 11 variables. However, comparing these results to those of PCR using the first 21 PCs (i.e. also dropping 11 PCs) would be inappropriate, because


Table III. VIF results

Columns: (1) original VIF; (2) VIF after pairwise dropping; (3) VIF by stepwise regression; (4) VIF by BMA; (5) stepwise on the dropped model; (6) BMA on the dropped model. A variable has no entry in a column if it is not included in that model; such blanks are omitted from the rows below.

P_01   71411.2   3.3       21.2      2.6
P_02   65875     2.6       46676.1   40128.3   1.6
P_03   87307.8   23.9
P_04   93522.5   66263.6   56963.6
P_05   2.2       1.9       2.2       2         1.9       1.8
P_06   1.6       1.5       1.5       1.3       1.5       1.3
P_07   1.6       1.4       1.5       1.3       1.2
P_08   2.4       2.2       2
P_09   10276.1   2.5       1.9       1.7
P_10   19119.1   2.8       13557.3   11668.4   2.2       1.9
P_11   23962.3   3.7       23837.8   3.3       3.2
P_12   19461.1   19120.9
P_13   1.3       1.3       1.2       1.1       1.2       1.2
P_14   1.4       1.3       1.3       1.2       1.2
P_15   2.5       2         2.5       2         1.7       1.6
P_16   2.7       2.4       2.7       2.2       2.2       2.2
P_17   7.2       3.2       7.1
P_18   8.1       8.1       7.1
P_19   3.4       2.7       3.4       2.9       2.4       2.3
P_20   2.7       2.7
P_21   50.1      49.7      36
P_22   48.8      48.2      39.7
P_23   6.7       6.7
P_24   4.2       3.1       3.7       3.6       2.6
P_25   1.8       1.8       1.8       1.7       1.7       1.6
P_26   2.7       2.5       2.7       2.4       1.3       1.2
P_27   2.6       2.3       2.6       2.3
P_28   22952.1   22918.5   212.8
P_29   19277.5   19272.2
P_30   4.6       3         4.5       2.9       3         3
P_31   7.3       6.9       6
P_32   3.2       1.9       3.2       1.9       1.8
RMS    0.4344    0.5133    0.4347    0.4385    0.5137    0.5161
R2(%)  56.9      49.1      56.9      56.5      49.1      48.8

even when 11 PCs are dropped, the first 21 PCs still correspond to all of the original 32 variables. Nevertheless, let us assume this comparison is meaningful. The PCR method achieves an RMS of 0.501 and an R2 of 50.15% by using the first 21 PCs. The performance of the two methods is similar.

Ridge regression

Ridge regression remedies the instability due to multicollinearity by introducing bias into the model. Instead of solving (2), the ridge estimate, b̂R, is obtained from b̂R = (X′X + kI)−1X′y, where k ≥ 0 is a constant. By increasing k, the ill-conditioning problem is reduced and the estimates become more stable. See Figure 3 for the ridge trace, where each line represents the estimates for a regressor at various values of k. Hoerl et al.16,17 suggest a procedure for choosing a proper k, given by

k = p σ̂² / (β̂′β̂),

where p is the number of regressor variables.

The Hoerl–Kennard–Baldwin (HKB) ridge constant used here is 0.0636. Table VI shows the coefficient estimates at various k. Note that in the process of 'stabilizing' the coefficients by increasing k, some of the coefficients remain relatively constant and have small values as k varies: P_07 and P_08, for example. A small coefficient implies that the variable has little prediction power; as such, if the variable is dropped from the regressors, the prediction will not be affected much. As a result, we can use ridge regression to drop some variables which have small coefficients at the value suggested by the HKB ridge constant. Here, it is arbitrarily assumed that a coefficient is 'small' if its absolute value is below 0.06. Column 6 of Table VI indicates that 8 coefficients are small, and these variables may be dropped from the model. The last column of Table VI shows the coefficients after those variables are dropped. In this example, these coefficients are fairly similar to the case with all variables. Finally, comparing the algorithm to the ridge regression results, it is not surprising that ridge regression has a smaller RMS (0.43 versus 0.52).
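The following Python/NumPy sketch shows one way to compute the HKB constant and the corresponding ridge fit. It is an illustrative implementation under the usual assumptions (standardized X and y, no intercept, with σ̂² and β̂ taken from the least-squares fit), not the authors' code.

    import numpy as np

    def ridge_hkb(X, y):
        """Ridge estimates with the Hoerl-Kennard-Baldwin constant k = p * sigma2_hat / (b_ols' b_ols)."""
        n, p = X.shape
        b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b_ols
        sigma2 = (resid @ resid) / (n - p)                    # residual mean square from least squares
        k = p * sigma2 / (b_ols @ b_ols)                      # HKB ridge constant
        b_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
        return b_ridge, k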


Table IV. Eigenvalues and the last seven eigenvectors of T

Eigenvalues  Cumulative  PC 26    PC 27    PC 28    PC 29    PC 30    PC 31    PC 32
8.3548       0.2611      −0.045   −0.06    0.027    −0.069   −0.076   0.462    −0.445
4.9315       0.4152      0.011    −0.05    −0.006   −0.086   −0.117   0.392    0.44
3.7121       0.5312      0.013    −0.034   0.008    0.075    0.085    −0.511   0.492
2.217        0.6005      −0.091   0.024    0.002    0.102    0.139    −0.467   −0.525
1.6747       0.6528      −0.034   0.033    −0.018   0.001    0        0        0
1.5648       0.7017      −0.046   0.084    −0.006   0.001    0        0        0
1.3324       0.7434      −0.063   0.017    −0.002   0        0        0        0
1.1487       0.7793      0.009    0.05     −0.017   0        0        0        0
0.8306       0.8052      0.15     0.042    −0.038   −0.027   −0.029   0.175    −0.169
0.8185       0.8308      −0.253   0.154    0.016    −0.047   −0.063   0.211    0.237
0.7797       0.8552      0.23     −0.068   0.01     0.455    0.495    0.21     0.063
0.6468       0.8754      0.029    0.091    −0.051   0.524    −0.474   0.004    −0.027
0.6231       0.8948      0.028    0.018    −0.017   0        0        0        0
0.4918       0.9102      −0.025   0.045    0.013    −0.001   0        0        0
0.4556       0.9244      0.149    −0.082   −0.01    0.001    0        0        0
0.385        0.9365      0.097    0.035    −0.008   0.001    0        0        0
0.3388       0.9471      −0.481   −0.431   0.012    0.001    0        0        0
0.2928       0.9562      −0.283   0.528    0.002    0        0        0        0
0.2614       0.9644      0.021    −0.041   −0.006   0        0        0        0
0.2294       0.9715      −0.064   0.043    −0.012   0        0        0        0
0.2124       0.9782      −0.309   0.08     0.706    0.002    −0.004   −0.001   0
0.1702       0.9835      0.351    −0.022   0.694    0.005    −0.003   0        0
0.1564       0.9884      −0.301   −0.419   0.003    0.001    0        0        0
0.1133       0.9919      −0.08    −0.018   0.011    0        0        0        0
0.0909       0.9948      0.005    0.007    0.01     0        0        0        0
0.0838       0.9974      −0.114   −0.02    0.023    0.001    0        0        0
0.0728       0.9997      0.031    −0.07    −0.023   0        0        0        0
0.0108       0.9999      −0.231   0.069    −0.01    0.459    0.506    0.182    0.032
0.0001       1           −0.029   −0.09    0.041    0.526    −0.472   −0.029   0.002
0            1           −0.267   0.183    −0.032   0.001    0        0        0
0            1           −0.016   0.471    −0.055   −0.001   0        0        0
0            1           0.194    −0.096   −0.071   −0.002   0        0        0

However, the amount of bias introduced into the model is unknown. Another problem with ridge regression is that the 8 dropped variables appear to have no connection with the VIFs. For example, ridge regression suggests that we can drop or ignore P_07 and P_08; however, their VIFs are quite low, so they are not a source of multicollinearity. Nevertheless, P_07 and P_08 are dropped or ignored for the sake of more stable estimates of the coefficients.

Variable elimination and variable selection

Since dropping variables is, in a sense, similar to variable selection (one includes variables while the other excludes variables from the model), it is interesting to see how well the proposed variable elimination algorithms perform compared with existing variable selection algorithms such as Stepwise Regression and Bayesian Model Averaging (BMA).18 Our numerical results, columns 4 and 5 of Table III, show that these classical variable selection techniques cannot handle multicollinearity well: there are many VIFs greater than 5 in columns 4 and 5. Nevertheless, we must point out that variable elimination and variable selection view the problem from very different aspects; variable selection aims to select a good set of regressors, while variable elimination removes the variables that cause collinearity. Like checking basic model assumptions such as normality, reducing the effect of multicollinearity is something practitioners need to do before building a model. Thus, these variable selection techniques can be applied to the models once multicollinearity has been removed, to further exclude insignificant variables. See columns 6 and 7 of Table III for the results of applying stepwise and BMA variable selection to the model obtained with the pairwise dropping approach. Some additional variable selection algorithms19,20 are not discussed here, but it is reported that they can work well in the presence of multicollinearity.

Performance comparison

If we judge the performance of the above four approaches in terms of RMS and R2, ridge regression is the leading choice. However, ridge regression does not result in unbiased estimates. The PCR method has performance similar to that of the other two methods when the same number of regressors is in the model. However, PCR is more difficult to use, and it can be challenging


Table V. Principal components regression (coefficient estimates by number of retained PCs)

Variable      All PCs   31 PCs    30 PCs    29 PCs    28 PCs    27 PCs    26 PCs    21 PCs    16 PCs    15 PCs    14 PCs
Variability   100%      100%      100%      100%      99%       99%       99%       98%       94%       92%       91%
P_01          3.894     6.108     0.165     0.300     0.083     0.057     0.028     0.006     0.038     0.037     0.009
P_02          7.415     5.224     0.182     0.388     0.119     0.125     0.101     0.065     0.041     0.046     0.073
P_03          −4.156    −6.603    −0.027    −0.176    0.058     0.050     0.034     −0.007    0.023     0.020     −0.013
P_04          −8.761    −6.150    −0.143    −0.388    −0.068    −0.070    −0.059    −0.047    −0.026    −0.025    0.012
P_05          0.199     0.201     0.203     0.204     0.205     0.223     0.239     0.199     0.090     0.093     0.065
P_06          −0.100    −0.101    −0.103    −0.103    −0.101    −0.095    −0.055    −0.047    −0.088    −0.092    −0.025
P_07          0.027     0.027     0.024     0.025     0.025     0.028     0.036     0.026     0.047     0.044     0.098
P_08          0.013     0.011     0.011     0.011     0.011     0.028     0.052     0.094     0.086     0.083     0.112
P_09          1.406     2.245     −0.011    0.040     −0.044    −0.007    0.013     −0.044    −0.035    −0.040    −0.058
P_10          3.566     2.386     −0.333    −0.222    −0.370    −0.386    −0.313    −0.224    −0.128    −0.133    −0.104
P_11          2.469     2.154     −0.545    −1.416    0.012     0.001     −0.031    −0.003    −0.046    −0.049    −0.037
P_12          −2.577    −2.443    −2.494    −1.661    −0.018    0.033     0.076     0.045     0.011     0.010     −0.016
P_13          −0.163    −0.162    −0.167    −0.167    −0.168    −0.151    −0.142    −0.165    −0.175    −0.177    −0.166
P_14          −0.030    −0.031    −0.030    −0.029    −0.032    −0.046    −0.024    −0.042    −0.074    −0.075    −0.126
P_15          −0.071    −0.071    −0.076    −0.076    −0.074    −0.064    −0.103    −0.092    0.016     0.010     0.020
P_16          0.065     0.066     0.066     0.066     0.068     0.076     0.093     0.039     0.024     0.019     0.020
P_17          −0.072    −0.072    −0.074    −0.073    −0.070    −0.082    −0.287    0.045     0.056     0.055     0.054
P_18          −0.655    −0.656    −0.657    −0.657    −0.658    −0.659    −0.408    −0.041    −0.022    −0.020    −0.026
P_19          0.000     0.000     0.002     0.001     0.001     0.006     −0.013    0.045     −0.059    −0.056    −0.078
P_20          −0.040    −0.040    −0.038    −0.039    −0.039    −0.027    −0.007    0.045     0.187     0.184     0.119
P_21          0.565     0.566     0.579     0.586     0.593     −0.114    −0.076    −0.121    −0.035    −0.035    −0.039
P_22          0.802     0.803     0.804     0.808     0.824     0.129     0.119     0.153     0.063     0.062     0.062
P_23          −0.049    −0.049    −0.053    −0.053    −0.049    −0.052    −0.251    0.035     −0.017    −0.019    −0.031
P_24          −0.083    −0.082    −0.078    −0.078    −0.077    −0.087    −0.096    0.059     −0.013    −0.011    −0.016
P_25          −0.120    −0.121    −0.122    −0.122    −0.122    −0.132    −0.129    −0.109    −0.126    −0.124    −0.068
P_26          −0.069    −0.069    −0.069    −0.069    −0.065    −0.088    −0.097    −0.079    −0.056    −0.056    −0.029
P_27          0.005     0.004     0.004     0.004     0.003     0.026     −0.007    0.006     −0.018    −0.021    −0.017
P_28          1.938     1.780     −0.563    −1.451    −0.012    −0.003    0.030     0.002     0.046     0.048     0.037
P_29          −2.835    −2.846    −2.474    −1.644    0.007     −0.034    −0.077    −0.046    −0.011    −0.010    0.016
P_30          0.280     0.278     0.280     0.279     0.281     0.313     0.400     0.364     0.119     0.113     0.108
P_31          −0.398    −0.400    −0.402    −0.402    −0.404    −0.349    −0.125    0.036     0.065     0.065     0.060
P_32          −0.034    −0.035    −0.041    −0.041    −0.046    0.025     −0.021    −0.061    0.071     0.073     0.075
RMS           0.434     0.434     0.436     0.436     0.437     0.448     0.464     0.501     0.543     0.543     0.555
R2(%)         56.94     56.93     56.73     56.73     56.68     55.59     53.95     50.15     45.96     45.95     44.70


Figure 3. Ridge trace of the variables

to interpret the meaning of the new variables. Moreover, the variable elimination methods can work with variable selection techniques such as BMA, but the PCR method cannot. If we apply BMA after the pairwise variable elimination method, we can achieve a similar RMS of 0.5161 and an R2 of 48.8% by dropping six more variables. If the PCR method uses its first 15 PCs, it can only achieve an RMS of 0.543 and an R2 of 46.0%.

Issues related to multicollinearity techniques

Because the multicollinearity methods (VIFs, pairwise correlations, and eigenvalues) use different criteria for detecting collinearity, they do not always lead to the same conclusions. Two discrepancies are given as examples: first, even if a variable is highly correlated with other variables, its VIF is not necessarily very high; second, some variables might have high VIFs, but they do not belong to any highly correlated pairs.

To illustrate the first type of discrepancy, P_19 (VIF=3.4), P_20 (VIF=2.7), P_24 (VIF=4.2), and P_30 (VIF=4.6) have acceptable VIFs, but they are all in highly correlated pairs (see Tables III and I, respectively). P_18 and P_24 have a correlation of −0.81. If the model had only these regressors, we would think there would be collinearity. However, if we regress P_24 on P_18, the R-square is 0.658 and the VIF is 2.92; thus, it does not indicate collinearity according to the suggested threshold of 5–10.

For the second type of discrepancy, P_09 and P_10 have very high VIFs (10 276 and 19 119, respectively), but they do not appear in any highly correlated pairs. A closer look at the entire correlation matrix (not included in this paper) shows that all pairwise correlations with these two variables appear to be moderate (mostly <0.6). The reason is that the correlation matrix involves only two variables in each entry; thus, it has limited ability to detect multicollinearity if the linear relationship includes three or more variables.

Why do these discrepancies occur? The correlation matrix cannot detect a relationship involving three or more variables, and the VIF method calculates a score but does not capture a relationship between variables. The two discrepancies may be partially explained by looking at the principal components. Using (4), say that λp → 0, which implies that Zp → 0 and hence XTp → 0. Applying this to the last principal component in Table IV gives

0 = −0.445 P_01 + 0.44 P_02 + 0.492 P_03 − 0.525 P_04 − 0.169 P_09 + 0.237 P_10 + 0.063 P_11 − 0.027 P_12 + 0.032 P_28 + 0.002 P_29. (5)

The variables in (5) have the largest VIFs in Table III (column 2). Yet again, looking at PC 30 in Table IV shows a similar relationship that now includes P_21 and P_22. Referring to Table III, they also have large VIFs. In other words, the VIFs are aligned with the principal components. If this is the case, dropping any variable in (5) should break the relationship.

However, this is not the case. Recall the definition of multicollinearity in (1). Assume that there are two unknown linear relationships in the system, given by

c1x1 + c2x2 + c3x3 ≈ c7,
c4x4 + c5x5 + c6x6 ≈ c8,

where the ci are constants and the variables x1, x2, x3 are independent of x4, x5, x6 and vice versa. According to (1), we can rewrite the above equations as

(c1/c7)x1 + (c2/c7)x2 + (c3/c7)x3 ≈ (c4/c8)x4 + (c5/c8)x5 + (c6/c8)x6;


Table VI. Coefficients of estimates at different ridge constants. The last column contains the coefficients of the variables retained after dropping (k = 0.0636)

Variable    k=0       k=0.01    k=0.02    k=0.04    k=0.0636  k=0.08    k=0.16    k=0.32    k=0.64    After dropping (k=0.0636)
Intercept   0.0000    0.0000    0.0000    0.0000    0.0000    0.0000    0.0000    0.0000    0.0000    0.0000
P_01        3.8935    3.7352    3.3583    2.7192    2.2062    1.9512    1.2613    0.7631    0.4526    2.1651
P_02        7.4150    5.5538    4.5423    3.3950    2.6500    2.3104    1.4553    0.8772    0.5279    2.8929
P_03        −4.1557   −3.9796   −3.5623   −2.8551   −2.2876   −2.0056   −1.2429   −0.6925   −0.3500   −2.2624
P_04        −8.7606   −6.5430   −5.3378   −3.9708   −3.0832   −2.6785   −1.6598   −0.9712   −0.5551   −3.3700
P_05        0.1990    0.2003    0.2011    0.2019    0.2025    0.2028    0.2036    0.2043    0.2048    0.1954
P_06        −0.0998   −0.1005   −0.1009   −0.1012   −0.1014   −0.1015   −0.1016   −0.1015   −0.1013   −0.1164
P_07        0.0266    0.0264    0.0262    0.0259    0.0257    0.0256    0.0254    0.0253    0.0253    (dropped)
P_08        0.0128    0.0121    0.0118    0.0115    0.0113    0.0112    0.0111    0.0111    0.0112    (dropped)
P_09        1.4059    1.3452    1.2018    0.9587    0.7637    0.6668    0.4044    0.2150    0.0970    0.7586
P_10        3.5657    2.5625    2.0172    1.3985    0.9966    0.8134    0.3520    0.0398    −0.1488   1.1235
P_11        2.4692    1.7758    1.3471    0.8392    0.5164    0.3775    0.0793    −0.0422   −0.0583   0.2682
P_12        −2.5773   −2.3236   −2.1405   −1.8708   −1.6425   −1.5183   −1.1217   −0.7480   −0.4553   −1.8902
P_13        −0.1632   −0.1637   −0.1642   −0.1650   −0.1655   −0.1658   −0.1666   −0.1672   −0.1675   −0.1621
P_14        −0.0300   −0.0302   −0.0303   −0.0304   −0.0306   −0.0307   −0.0310   −0.0314   −0.0318   (dropped)
P_15        −0.0708   −0.0718   −0.0724   −0.0732   −0.0736   −0.0738   −0.0743   −0.0744   −0.0744   −0.0657
P_16        0.0646    0.0652    0.0655    0.0659    0.0662    0.0663    0.0668    0.0672    0.0676    0.0507
P_17        −0.0720   −0.0722   −0.0723   −0.0723   −0.0722   −0.0721   −0.0717   −0.0713   −0.0709   −0.0354
P_18        −0.6554   −0.6557   −0.6559   −0.6562   −0.6563   −0.6564   −0.6565   −0.6565   −0.6560   −0.6627
P_19        0.0001    0.0003    0.0005    0.0006    0.0007    0.0007    0.0008    0.0008    0.0008    (dropped)
P_20        −0.0400   −0.0397   −0.0396   −0.0394   −0.0392   −0.0391   −0.0390   −0.0388   −0.0386   (dropped)
P_21        0.5650    0.5689    0.5716    0.5751    0.5776    0.5787    0.5813    0.5817    0.5783    0.5191
P_22        0.8022    0.8039    0.8051    0.8070    0.8085    0.8092    0.8112    0.8117    0.8087    0.7696
P_23        −0.0485   −0.0492   −0.0496   −0.0500   −0.0501   −0.0501   −0.0500   −0.0496   −0.0492   (dropped)
P_24        −0.0832   −0.0820   −0.0812   −0.0803   −0.0796   −0.0793   −0.0785   −0.0779   −0.0775   −0.0809
P_25        −0.1200   −0.1206   −0.1209   −0.1212   −0.1214   −0.1215   −0.1218   −0.1219   −0.1221   −0.1218
P_26        −0.0688   −0.0686   −0.0684   −0.0681   −0.0678   −0.0677   −0.0672   −0.0666   −0.0663   −0.0659
P_27        0.0046    0.0044    0.0042    0.0040    0.0038    0.0037    0.0035    0.0032    0.0031    (dropped)
P_28        1.9380    1.3762    1.0187    0.5912    0.3203    0.2049    −0.0343   −0.1164   −0.1088   0.0412
P_29        −2.8350   −2.5643   −2.3518   −2.0344   −1.7685   −1.6257   −1.1796   −0.7708   −0.4570   −2.0444
P_30        0.2801    0.2797    0.2797    0.2797    0.2799    0.2800    0.2804    0.2808    0.2813    0.2599
P_31        −0.3984   −0.3995   −0.4002   −0.4009   −0.4014   −0.4016   −0.4021   −0.4023   −0.4018   −0.4153
P_32        −0.0343   −0.0360   −0.0371   −0.0386   −0.0397   −0.0402   −0.0417   −0.0429   −0.0434   (dropped)
RMS         0.4344    0.4345    0.4347    0.4350    0.4353    0.4354    0.4359    0.4363    0.4367    0.4380
R2          0.5694    0.5693    0.5692    0.5687    0.5686    0.5684    0.5680    0.5675    0.5672    0.5659


thus only one equation can be derived by principal component analysis, which looks like

ca x1 + cb x2 + cc x3 − cd x4 − ce x5 − cf x6 ≈ 0,

where ca, . . ., cf are linear combinations of c1, . . ., c8. As a result, two or more linear relationships will be identified as one linear relationship. Thus, dropping one variable from the equation identified by principal component analysis does not guarantee that the linear relationship will be broken.

In short, the three methods have their own issues. The correlation matrix is the simplest method, and it provides an easy option to drop variables; however, relationships involving more than two variables cannot be detected. VIFs detect the existence of multicollinearity but provide no information regarding which variables should be dropped. Principal component analysis sheds some light on the linear relationships among the variables. However, it alone is not enough to break all collinearity, because it usually detects only one complex relationship in the system.
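The eigen-analysis behind (5) can also be automated: each eigenvector of X′X whose eigenvalue is close to zero defines one approximate linear relation among the regressors. The sketch below (Python/NumPy; the function name and the two thresholds are arbitrary illustrative choices of ours) lists those relations, although, as discussed above, separate relations may appear merged into one.

    import numpy as np

    def near_dependencies(X, names, eig_tol=1e-3, load_tol=0.02):
        """Approximate linear relations implied by near-zero eigenvalues of X'X (lambda ~ 0 => Xv ~ 0)."""
        eigvals, T = np.linalg.eigh(X.T @ X)                  # eigenvalues in ascending order
        relations = []
        for lam, v in zip(eigvals, T.T):
            if lam < eig_tol * eigvals[-1]:
                terms = [(names[i], round(float(c), 3))
                         for i, c in enumerate(v) if abs(c) > load_tol]
                relations.append(terms)                       # e.g. the terms appearing in (5)
        return relations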

Conclusions

In this paper, we address a special type of multicollinearity, referred to here as crossover multicollinearity, which is observed in semiconductor manufacturing. This condition may be remedied by using principal components regression or ridge regression. However, we propose the use of a pairwise dropping algorithm, in combination with a VIF variable elimination method, to reduce the effects of multicollinearity. In comparing the results of our algorithm to PCR, both provide similar RMS and R2 values. In addition, the proposed method can work with existing variable selection techniques while PCR cannot. Another advantage of our algorithm is that it is more understandable: the new set of regressors in PCR may have no physical meaning and be difficult to interpret.

Ridge regression does provide excellent results in terms of RMS and R2, and it does make the estimates more stable and reduce the RMS. However, ridge regression ignores the source of the problem, that is, which variables actually cause the multicollinearity.

References

1. Farrar DE, Glauber RR. Multicollinearity in regression analysis: The problem revisited. Review of Economics and Statistics 1967; 49:92–107.
2. Montgomery DC, Peck EA, Vining G. Introduction to Linear Regression Analysis (4th edn). Wiley: Hoboken, NJ, 2006.
3. Mason CH, Perreault WD. Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research 1991; 28:268–280.
4. Mason RL, Gunst RF, Webster JT. Regression analysis and problems of multicollinearity. Communications in Statistics 1975; 4:277–292.
5. Tabachnick BG, Fidell LS. Testing hypotheses in multiple regression. In Using Multivariate Statistics. Allyn and Bacon: Boston, 2001.
6. Marquardt DW. Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation. Technometrics 1970; 12:591–612.
7. O'Brien RM. A caution regarding rules of thumb for variance inflation factors. Quality and Quantity 2007; 41:673–690.
8. Belsley DA, Kuh E, Welsch RE. Regression Diagnostics. Wiley: Hoboken, NJ, 1980.
9. Weisberg S. Applied Linear Regression. Wiley: Hoboken, NJ, 2005.
10. Massy WF. Principal components regression in exploratory statistical research. Journal of the American Statistical Association 1965; 60:234–256.
11. Farebrother RW. Gram–Schmidt regression. Applied Statistics 1974; 23:470–476.
12. Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970; 12:55–67.
13. Webster JT, Gunst RF, Mason RL. Latent root regression analysis. Technometrics 1974; 16:513–522.
14. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Prentice-Hall: Upper Saddle River, NJ, 2002.
15. Hocking RR. The analysis and selection of variables in linear regression. Biometrics 1976; 32:1–49.
16. Hoerl AE, Kennard RW, Baldwin KF. Ridge regression: Some simulations. Communications in Statistics 1975; 4:105–123.
17. Hoerl AE, Kennard RW. Ridge regression: Iterative estimation of the biasing parameter. Communications in Statistics 1976; A5:77–88.
18. Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression models. Journal of the American Statistical Association 1997; 92:179–191.
19. Thall PF, Russell KE, Simon RM. Variable selection in regression via repeated data splitting. Journal of Computational and Graphical Statistics 1997; 6:416–434.
20. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 2005; 67:301–320.

Authors’ biographies

Dr Christina Mastrangelo is an Associate Professor of Industrial & Systems Engineering at the University of Washington. She holds BS, MS, and Ph.D. degrees in Industrial Engineering from Arizona State University. Dr Mastrangelo's research interests lie in the areas of operational modeling for semiconductor manufacturing, system-level modeling for infectious disease control, multivariate quality control, and statistical monitoring methods for continuous and batch processing. She is a member of ASA, ASEE, ASQ, INCOSE, and INFORMS, and a senior member of IIE.

Yu-Ching Chang was a doctoral student in the Department of Industrial and Systems Engineering at the University of Washington while writing this paper and later earned his Ph.D. He had worked for several years for Taiwan Semiconductor Manufacturing Company. His research interests focus on production and operations management in the semiconductor manufacturing industry.
