Partial Least Squares Regression (PLSR)


Partial Least Squares Regression (PLSR)

• Partial least squares (PLS) is a method for constructing predictive models when the predictors are many and highly collinear.

• Note that the emphasis is on predicting the responses and not necessarily on trying to understand the underlying relationship between the variables.

• When prediction is the goal and there is no practical need to limit the number of measured factors, PLS can be a useful tool.

• PLS was developed in the 1960s by Herman Wold as an econometric technique, but some of its most avid proponents (including Wold’s son Svante) are chemical engineers and chemometricians.

• Partial least squares regression (PLSR) is a multivariate data analytical technique designed to handle intercorrelated regressors.

• It is based on Herman Wold’s general PLS principle, in which complicated multivariate systems-analysis problems are solved by a sequence of simple least squares regressions.

How Does PLS Work?

• In principle, MLR can be used with very many predictors.

• However, if the number of predictors gets too large (for example, greater than the number of observations), you are likely to get a model that fits the sampled data perfectly but that will fail to predict new data well.

• This phenomenon is called over-fitting.
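To make the over-fitting point concrete, here is a small hedged sketch on synthetic data (scikit-learn assumed available; all names and numbers are illustrative, not from the slides): with fewer observations than predictors, ordinary least squares fits the training set almost perfectly but predicts new data poorly, while a three-factor PLS model generalizes much better.

```python
# Hedged, illustrative sketch: over-fitting with more predictors than samples.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_train, n_test, p = 20, 200, 50                # fewer observations than predictors
X = rng.normal(size=(n_train + n_test, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n_train + n_test)
X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

mlr = LinearRegression().fit(X_tr, y_tr)
pls = PLSRegression(n_components=3).fit(X_tr, y_tr)

print("MLR train R2:", r2_score(y_tr, mlr.predict(X_tr)))          # ~1.0: perfect fit
print("MLR test  R2:", r2_score(y_te, mlr.predict(X_te)))          # typically poor
print("PLS test  R2:", r2_score(y_te, pls.predict(X_te).ravel()))  # typically much better
```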

• In such cases, although there are many manifest predictors, there may be only a few underlying or latent factors that account for most of the variation in the response.

• The general idea of PLS is to try to extract these latent factors, accounting for as much of the manifest predictor variation as possible while modeling the responses well.

• For this reason, the acronym PLS has also been taken to mean ‘‘projection to latent structure.’’

• The overall goal is to use the predictors to predict the responses in the population.

• This is achieved indirectly by extracting latent variables T and U from sampled factors and responses, respectively.

• The extracted factors T (also referred to as X-scores) are used to predict the Y-scores U, and then the predicted Y-scores are used to construct predictions for the responses.

• This procedure actually covers various techniques, depending on which source of variation is considered most crucial.

• PCR is based on the spectral decomposition of XᵀX, where X is the matrix of predictor values;

• PLS is based on the singular value decomposition of XᵀY.
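As a small illustration of this difference, the sketch below (synthetic, centered data; NumPy assumed) extracts the leading PCR direction from the eigenvectors of XᵀX and the first PLS weight vector from the SVD of XᵀY.

```python
# Synthetic, centered data: compare the decompositions underlying PCR and PLS.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X -= X.mean(axis=0)                       # both methods assume centered data
Y = X @ rng.normal(size=(5, 2))           # two response variables
Y -= Y.mean(axis=0)

# PCR directions: eigenvectors of X^T X (directions of maximum X-variance)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
pcr_dir = eigvecs[:, np.argmax(eigvals)]

# First PLS weight: leading left singular vector of X^T Y
# (direction of maximum covariance between X and Y)
U, s, Vt = np.linalg.svd(X.T @ Y)
pls_w = U[:, 0]

print("leading PCR direction:", np.round(pcr_dir, 3))
print("first PLS weight     :", np.round(pls_w, 3))
```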

• If the number of extracted factors is greater than or equal to the rank of the sample factor space, then PLS is equivalent to MLR.

• An important feature of the method is that usually considerably fewer factors are required.

• One approach to extracting the optimum number of factors is to construct the PLS model for a given number of factors on one set of data and then to test it on another, choosing the number of extracted factors for which the total prediction error is minimized.

• Alternatively, van der Voet (1994) suggests choosing the least number of extracted factors whose residuals are not significantly greater than those of the model with minimum error.

• If no convenient test set is available, then each observation can be used in turn as a test set; this is known as cross-validation.
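A minimal sketch of this cross-validation procedure, assuming scikit-learn and synthetic data: fit PLS models with 1 to 8 factors and keep the count with the smallest cross-validated mean squared error.

```python
# Hedged sketch: pick the number of PLS factors by 5-fold cross-validation.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.3, size=60)

mse_per_a = []
for a in range(1, 9):                     # candidate numbers of factors
    pls = PLSRegression(n_components=a)
    mse = -cross_val_score(pls, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    mse_per_a.append(mse)

best_a = int(np.argmin(mse_per_a)) + 1    # factor count minimizing CV error
print("CV MSE per factor count:", np.round(mse_per_a, 3))
print("chosen number of factors:", best_a)
```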

• The PLSR is a bilinear regression method that extracts a small number of factors, t_a, a = 1, 2, …, A, that are linear combinations of the K X-variables, and uses these factors as regressors for y.

• What is special about PLSR compared to principal component regression (PCR) is that the y variable is used actively in determining how the regression factors t_a are computed from X.

• Each PLSR factor t_a is defined so that it describes as much as possible of the covariance between X and y remaining after the previous a − 1 factors have been estimated and subtracted.

• The purpose of using PLSR in multivariate calibration is to obtain good insight and good predictive ability at the same time.

• In classical stepwise multiple linear regression (SMLR) the collinearity is handled by picking out a small subset of individual, distinctly different X variables from all the available X variables.

• This reduced subset is used as regressors for y, leaving the other X variables unused.

• The estimated factors are often defined to be orthogonal to one another.
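This orthogonality can be checked numerically; the sketch below (synthetic data, scikit-learn's PLSRegression assumed) verifies that TᵀT is essentially diagonal.

```python
# Check (on synthetic data) that the PLS X-score vectors are orthogonal.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.2, size=40)

pls = PLSRegression(n_components=3).fit(X, y)
T = pls.x_scores_
print(np.round(T.T @ T, 6))               # off-diagonal entries should be ~0
```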

• The model for regressions on estimated latent variables can be summarized as follows:

T = w(X)

X = p(T) + E

y = q(T) + f

y = q(w(X)) + f = b(X) + f

• In practice, the model parameters have to be estimated from empirical data.

• Since the regression is intended for later prediction of y from X, the factor scores T are generally defined as functions of X: T = w(X).

• The major difference between calibration methods is how T is estimated.

• For instance, in PCR it is estimated as a series of eigenvector spectra for (X − 1x̄ᵀ)ᵀ(X − 1x̄ᵀ), where x̄ is the vector of column means, etc.

• In PLSR w() is defined as a sequence of X versus y covariances.
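The following is a minimal NIPALS-style PLS1 sketch, written from the standard algorithm rather than taken from these slides, showing that each weight vector w_a is the normalized X–y covariance computed after deflating away the previous factors.

```python
# Minimal NIPALS-style PLS1 (assumes centered X and y); standard algorithm.
import numpy as np

def pls1_nipals(X, y, A):
    X, y = X.copy(), y.copy()
    W, P, T, q = [], [], [], []
    for _ in range(A):
        w = X.T @ y                        # covariance of each X-column with y
        w /= np.linalg.norm(w)             # normalized weight vector
        t = X @ w                          # factor scores (linear combination of X)
        p = X.T @ t / (t @ t)              # X-loadings
        qa = y @ t / (t @ t)               # y-loading: regression of y on t
        X -= np.outer(t, p)                # deflate X ...
        y -= qa * t                        # ... and y, then repeat
        W.append(w); P.append(p); T.append(t); q.append(qa)
    return np.array(W).T, np.array(P).T, np.array(T).T, np.array(q)

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 4)); X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.0, 0.5]); y -= y.mean()
W, P, T, q = pls1_nipals(X, y, A=2)
print("first weight vector w1:", np.round(W[:, 0], 3))
```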

PLS-Regression (PLS-R): A Powerful Alternative to PCR

• It is possible to obtain the same prediction results as PCR, but based on a smaller number of components, by allowing the y-data structure to intervene directly in the X-decomposition.

• This is done by condensing the two-stage PCR process into just one: PLS-R (Partial Least Squares Regression).

• Usually the term used is just PLS, which has also been interpreted to signify Projection to Latent Structures.

• PLS claims to do the same job as PCR, only with fewer bilinear components.
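This claim can be probed with a hedged comparison sketch (synthetic data, scikit-learn assumed): cross-validated error of PCR, built here as PCA followed by least squares, versus PLS at the same number of components. It illustrates the tendency, not a guarantee.

```python
# Hedged sketch: CV error of PCR (PCA + least squares) vs PLS per component count.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 15))
y = X @ rng.normal(size=15) + rng.normal(scale=0.5, size=80)

for a in (1, 2, 3, 5):
    pcr = make_pipeline(PCA(n_components=a), LinearRegression())
    pls = PLSRegression(n_components=a)
    mse_pcr = -cross_val_score(pcr, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    mse_pls = -cross_val_score(pls, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"A={a}: PCR MSE={mse_pcr:.3f}  PLS MSE={mse_pls:.3f}")
```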

PLS(X, Y): Initial Comparison with PCA(X), PCA(Y)

• In a comparison between PCR and PLS, PLS uses the y-data structure, the y-variance, directly as a guiding hand in decomposing the X-matrix, so that the outcome constitutes an optimal regression, precisely in the strict prediction-validation sense.

• A first approximation to understanding how the PLS approach works (though not entirely correct) is simply to view it as two simultaneous PCA analyses: a PCA of X and a PCA of Y.

• The equivalent PCA equations are presented in the following figure.

• Note how the score and loading matrices in X are called T and P respectively (X also has an alternative W-loading in addition to the familiar P-loading), while these are called U and Q respectively for the Y-space.

[Figure: the equivalent PCA-style decompositions of the two spaces]

X = TPᵀ + E

Y = UQᵀ + F

• However, PLS does not really perform two independent PCA analyses on the two spaces.

• On the contrary, PLS actively connects the X- and Y-spaces by specifying the u-score vector(s) to act as the starting points for (actually instead of) the t-score vectors in the X-space decomposition.

w = loading weights; p = X-loadings; q = Y-loadings

• Thus the starting proxy-t1 is actually u1 in the PLS-R method, thereby letting the Y-data structure directly guide the otherwise much more “PCA-like” decomposition of X.

• Subsequently, u1 is substituted by t1 at the relevant stage in the PLS algorithm in which the Y-space is decomposed.

• The crucial point is that it is the u1 (reflecting the Y-space structure) that first influences the X-decomposition leading to calculation of the X-loadings, but these are now termed “w” (for “loading-weights”).

• Then the X-space t-vectors are calculated, formally in a “standard” PCA fashion, but necessarily based on this newly calculated w-vector.

• This t-vector is now immediately used as the starting proxy u-vector, i.e. instead of u1, exactly as described above but with the X- and the Y-space interchanged.

• By this means, the X-data structure also influences the “PCA (Y)-like” decomposition.


• Thus, what might at first sight appear as two sets of independent PCA decompositions is in fact based on these interchanged score vectors.

• In this way we have achieved the goal of modeling the X- and Y-space interdependently. PLS actively reduces the influence of large X-variations which do not correlate with Y.
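A sketch of this score exchange, written as the standard NIPALS PLS2 inner loop for a single factor (assumed centered X and Y; the function name is illustrative): u from Y seeds the X-side weights w, and the resulting t then replaces u on the Y-side.

```python
# Standard NIPALS PLS2 inner loop for one factor (assumes centered X and Y).
import numpy as np

def pls2_one_factor(X, Y, n_iter=500, tol=1e-10):
    u = Y[:, 0].copy()                     # start from a Y-score proxy
    for _ in range(n_iter):
        w = X.T @ u                        # X-side weights seeded by u
        w /= np.linalg.norm(w)
        t = X @ w                          # X-scores
        q = Y.T @ t                        # Y-loadings computed from t
        q /= np.linalg.norm(q)
        u_new = Y @ q                      # t has replaced u on the Y-side
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    p = X.T @ t / (t @ t)                  # X-loadings, used for deflation
    return w, t, p, q, u

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 6)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=(6, 3)) + 0.1 * rng.normal(size=(30, 3)); Y -= Y.mean(axis=0)
w, t, p, q, u = pls2_one_factor(X, Y)
print("converged X-side weights w:", np.round(w, 3))
```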


Interpretation of PLS models

• In principle PLS models are interpreted in much the same way as PCA and PCR models.

• Plotting the X- and the Y-loadings in the same plot allows you to study the inter-variable relationship, now also including the relationship between the X- and Y-variables.

• Since PLS focuses on Y, the Y-relevant information is usually expected to appear already in the early components.

• There are, however, situations where the variation related to Y is very subtle, so that many components are necessary to explain enough of Y.
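A minimal sketch of the joint loading plot mentioned above (synthetic data; matplotlib and scikit-learn assumed): X- and Y-loadings for the first two factors in one scatter plot, so that X-variables lying close to a Y-variable can be read as related to it.

```python
# Synthetic example of a joint X/Y loading plot for the first two factors.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 8))
Y = X @ rng.normal(size=(8, 2)) + 0.2 * rng.normal(size=(50, 2))

pls = PLSRegression(n_components=2).fit(X, Y)
fig, ax = plt.subplots()
ax.scatter(pls.x_loadings_[:, 0], pls.x_loadings_[:, 1], marker="o", label="X-loadings (P)")
ax.scatter(pls.y_loadings_[:, 0], pls.y_loadings_[:, 1], marker="s", label="Y-loadings (Q)")
ax.axhline(0, lw=0.5); ax.axvline(0, lw=0.5)
ax.set_xlabel("Factor 1"); ax.set_ylabel("Factor 2")
ax.legend()
plt.show()
```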

Loadings (p) and loading weights (w)

• The P-loadings are very much like the well-known PCA loadings; they express the relationship between the raw data matrix X and its scores, T (in PLS these may be called PLS scores).

• These loadings may be interpreted in the same way as in PCA or PCR, so long as one is aware that the scores have been calculated by PLS.

• In many PLS applications P and W are quite similar. This means that the dominant structures in X “happen” to be directed more or less along the same directions as those with maximum correlation to Y.

• The loading weights, W, however, represent the effective loadings directly connected to building the sought-for regression relationship between X and Y.

• In PLS there is also a set of Y-loadings, Q, which are the regression coefficients from the Y-variables onto the scores, U.

• Q and W may be used to interpret relationships between the X- and Y-variables, and to interpret the patterns in the score plots related to these loadings.

[Figure: loading plot of non-spectral variables]

[Figure: loading plot of spectral variables]

• The fact that both P and W are important, however, is clear from the construction of the formal regression equation Y = XB from any specific PLS solution with A components.

• This B matrix is calculated from:

B = W(PᵀW)⁻¹Qᵀ

This B-matrix is often used for practical (numerical) prediction purposes.
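The formula can be verified numerically; the sketch below (synthetic, pre-centered data; scikit-learn's PLSRegression with scale=False assumed) builds B from the fitted W, P, and Q and checks that XB reproduces the model's predictions.

```python
# Verify B = W (P^T W)^{-1} Q^T on synthetic, pre-centered data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 6)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=(6, 2)) + 0.1 * rng.normal(size=(40, 2)); Y -= Y.mean(axis=0)

pls = PLSRegression(n_components=3, scale=False).fit(X, Y)
W, P, Q = pls.x_weights_, pls.x_loadings_, pls.y_loadings_
B = W @ np.linalg.inv(P.T @ W) @ Q.T       # the formula above
print("max |XB - predict(X)|:", np.abs(X @ B - pls.predict(X)).max())  # ~0
```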

When to use which method?

• The PLS approach is conceptually easy to understand and is often preferred because it is direct and effective.

• PLS is said to produce results which are easier to interpret because they are less complex (using fewer components).

• Often PCR may give prediction errors as low as those of PLS, but almost invariably by using more PCs to do the job.

• PLS2 is a natural method to start with when there are many Y-variables.

• You quickly get an overview of the basic patterns and see if there is significant correlation between the Y-variables.

• PLS2 may actually in a few cases even give better results if Y is collinear, because it utilises all the available information in Y.

• The drawback is that you may need different numbers of PCs for the different Y-variables, which you must keep in mind during interpretation and prediction.
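A short PLS2 sketch with several Y-variables (synthetic data, scikit-learn assumed): one joint model is fitted, and the per-response fit is inspected as the number of components grows, reflecting the drawback just noted.

```python
# PLS2 with three Y-variables (synthetic): per-response fit vs component count.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(9)
X = rng.normal(size=(60, 10))
Y = X @ rng.normal(size=(10, 3)) + 0.3 * rng.normal(size=(60, 3))

for a in (1, 2, 4):
    pls = PLSRegression(n_components=a).fit(X, Y)
    r2 = r2_score(Y, pls.predict(X), multioutput="raw_values")
    print(f"A={a}: per-response R2 =", np.round(r2, 2))
```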

Exercise - Interpretation of PLS (Jam)