PCA: Loadings Plot (p1/p3)


Page 1: PCA: Loadings Plot (p1/p3)

NAMP Module 17: “Introduction to Multivariate Analysis” Tier 2, Rev.: 5

[Figure: PCA loadings plot, p[1] (horizontal axis, -0.20 to 0.10) vs. p[3] (vertical axis, -0.20 to 0.20). Model: "34-months of 1 day rev. 2 (incl. chip data) no. 2.M4 (PCA-X), Bad residuals removed". Each point is one X variable: process tags (e.g. 52FIC104.PV, 53PIC305.PV), pulp characteristics (Pex_L1_*), chip characteristics (Cop*) and SEASON. The SEASON variable is highlighted.]

Page 2: PCA: Loadings Plot (p1/p3)

Conclusions: p3

[Figure: the same p[1]/p[3] loadings plot as on the previous page, annotated with "Summer" and "Winter" regions.]

INTERPRETATION
Component 3: Summer chips vs. winter chips

Page 3: PCA: Loadings Plot (p1/p3)

So what have we accomplished?

Using PCA, we have determined that 45% of the variability in the original 130 variables can be represented by using just 3 new variables or “components”. These three components are orthogonal, meaning that the variation within each one occurs independently of the others. In other words, the new components are uncorrelated with each other.

Component 1: explains 32% (REFINER THROUGHPUT)

Component 2: explains 7% (BRIGHTNESS)

Component 3: explains 6% (SUMMER / WINTER)
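The idea of "variance explained" and of uncorrelated components can be illustrated with a minimal sketch in Python (NumPy). It assumes autoscaled data and uses synthetic numbers, not the mill dataset; the function name `pca_explained_variance` is hypothetical and the slides do not say which scaling their software applied.

```python
import numpy as np

def pca_explained_variance(X, n_components):
    """Return (scores, loadings, explained_fraction) for autoscaled X."""
    # Autoscale: centre each column and scale it to unit variance
    # (an assumption; the scaling used in the course example is not stated).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the scaled data: the rows of Vt are the loading vectors p[a].
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # t[a] for each observation
    loadings = Vt[:n_components].T                   # one column per component
    explained = s**2 / np.sum(s**2)                  # fraction of total variance
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 observations of 10 correlated variables.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.5 * rng.normal(size=(200, 10))

scores, loadings, explained = pca_explained_variance(X, 3)
print("explained per component:", np.round(explained, 3))
print("cumulative:", round(float(explained.sum()), 3))
# Scores of different components are uncorrelated (off-diagonal entries ~ 0).
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```

The cumulative value plays the role of the 45% quoted above: it is simply the sum of the per-component fractions.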

Example 1

Page 4: PCA: Loadings Plot (p1/p3)

What exactly are the new components?

Each new component is simply a linear combination of the original variables. For instance in this case component 3 is nothing more and nothing less than the following equation:

Component 3 = 0.242472 × “SEASON”
            + 0.159948 × “85LCS320.AI”
            + many more positive terms…
            – 0.224472 × “52ZI144.AI”
            – 0.214372 × “52TR964.AI”
            – many more negative terms…

Obviously this equation, when written out fully, has 130 terms, one for each original variable. Many of these, however, have coefficients close to zero, meaning that they have little impact on that component.
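As a rough illustration of this linear combination, the sketch below evaluates only the four terms quoted on the slide. The scaled values for the example day are made up, the helper name `partial_score` is hypothetical, and because the other 126 terms are omitted the result is not the true score for component 3.

```python
# The four loadings quoted on the slide for component 3; the remaining
# 126 terms are not given in the source and are therefore left out.
loadings_p3 = {
    "SEASON": 0.242472,
    "85LCS320.AI": 0.159948,
    "52ZI144.AI": -0.224472,
    "52TR964.AI": -0.214372,
}

def partial_score(scaled_obs, loadings):
    """Sum of (loading x scaled value) over the variables in `loadings`."""
    return sum(loadings[name] * scaled_obs[name] for name in loadings)

# Made-up scaled (centred, unit-variance) values for one hypothetical day.
day = {"SEASON": 1.0, "85LCS320.AI": 0.5, "52ZI144.AI": -1.0, "52TR964.AI": 0.0}

t3_partial = partial_score(day, loadings_p3)
print(round(t3_partial, 6))
```

A variable whose coefficient is close to zero contributes almost nothing to the sum, whatever its value on that day.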

Example 1

Page 5: PCA: Loadings Plot (p1/p3)

What about the unexplained variance?

Our PCA model has captured 45% of the variability in the original dataset. What about the other 55%?

The unexplained variance has several sources:
• We only retained three components. More variance is captured by the higher-order components, but much of this is noise and of no use to us as process engineers.
• In any case, our linear model is a simplification of the original dataset, and so can never explain 100% of the variance.
• Outliers and other problems with the original data can severely weaken the model (“Garbage in, garbage out”).
• Some of the variables impacting the process were not measured (or may even be unmeasurable).

This last point is very important for our example, since many key chip characteristics including wood species were never measured.

Example 1

Page 6: PCA: Loadings Plot (p1/p3)

Use of PLS

Now we will have a brief look at the use of PLS, using the same data.

An important pulp characteristic is average fibre length, because longer fibres make stronger paper. This characteristic is represented in our data by three variables: “Pex_L1_LMF”, “Pex_L1_R28” and “Pex_L1_R28”. We will designate these three variables as Y’s.

The rest of the pulp characteristics were excluded from the PLS analysis.

All the other variables were designated as X’s.

Example 1

Page 7: PCA: Loadings Plot (p1/p3)

Results for PLS Model

[Figure: bar chart of R2Y(cum) and Q2(cum) versus component number (Comp[1] to Comp[3]), vertical axis 0.00 to 1.00. Model: "34-months of 1 day rev. 2 (incl. chip data) no. 2.M8 (PLS)".]

This is the R2 and Q2 plot for the PLS model. The R2 values tell us that the first component explains 23% of the variability in the original Y’s, the second another 13% and the third another 8%, for a total of 44%.

The Q2 values are only slightly lower, meaning that the model performs relatively well in predicting new Y values.
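The difference between R2 (fit to the data used to build the model) and Q2 (prediction of data left out during cross-validation) can be sketched as follows. This is a minimal single-response PLS (NIPALS) written in NumPy on synthetic data, not the commercial software or the three-Y model used in the example; the function names and the choice of 7 cross-validation groups are assumptions.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Single-response PLS (NIPALS) on centred X, y; returns the regression vector."""
    X = X.copy()
    y = y.copy()
    W, P, c = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)   # X weights for this component
        t = X @ w                # X scores
        tt = t @ t
        p = X.T @ t / tt         # X loadings
        ca = (y @ t) / tt        # y loading
        X -= np.outer(t, p)      # deflate X
        y -= ca * t              # deflate y
        W.append(w)
        P.append(p)
        c.append(ca)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    return W @ np.linalg.solve(P.T @ W, c)

def r2_and_q2(X, y, n_components, n_folds=7):
    """R2 from a fit on all data; Q2 from k-fold cross-validated predictions."""
    ss_total = np.sum((y - y.mean()) ** 2)
    # R2: how well the model reproduces the data it was built from.
    xm, ym = X.mean(axis=0), y.mean()
    b = pls1_coefficients(X - xm, y - ym, n_components)
    fitted = (X - xm) @ b + ym
    r2 = 1.0 - np.sum((y - fitted) ** 2) / ss_total
    # Q2: every observation is predicted by a model that never saw it.
    press = 0.0
    folds = np.arange(len(y)) % n_folds
    for k in range(n_folds):
        test = folds == k
        xm, ym = X[~test].mean(axis=0), y[~test].mean()
        b = pls1_coefficients(X[~test] - xm, y[~test] - ym, n_components)
        pred = (X[test] - xm) @ b + ym
        press += np.sum((y[test] - pred) ** 2)
    q2 = 1.0 - press / ss_total
    return r2, q2

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))                       # synthetic X's
y = 2.0 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2] + rng.normal(size=120)
r2, q2 = r2_and_q2(X, y, n_components=2)
print("R2 =", round(float(r2), 3), " Q2 =", round(float(q2), 3))
```

A Q2 that sits only slightly below R2, as on the slide, suggests the model is not merely fitting noise.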

Example 1

Page 8: PCA: Loadings Plot (p1/p3)

PLS: Score/Loadings Plot

When doing PLS, one of the main things we want to know is which X’s are important to the model. In other words, which X’s are correlated with our Y’s?

We can determine this by studying score and loadings plots which show the X’s and Y’s in relation to the new components. However, these plots can be messy and complicated to read, as shown on the next page.

Note that the axes are labelled differently for the PLS plots. Instead of p[1], for instance, the abscissa is designated w*c[1]. This refers to the dual nature of this plot, showing both X and Y space together.

Example 1

Page 9: PCA: Loadings Plot (p1/p3)

PLS Loadings plot

[Figure: PLS loadings plot, w*c[1] (horizontal axis, -0.30 to 0.30) vs. w*c[2] (vertical axis, -0.20 to 0.30), with X and Y variables plotted together. Model title as shown on the slide: "Jun 20 02(1). 10 seconds COMPLETE WITH 45 min LAG.M9 (PLS)". The plot is densely covered with process tag labels (e.g. 52FIC115.PV, 53PIC305.PV) and pulp characteristics (Pex_L1_Blan, Pex_L1_Cons, Pex_L1_CSF).]

Interpretation of this messy and confusing plot is not obvious. We therefore turn to other outputs…

Page 10: PCA: Loadings Plot (p1/p3)

PLS: Other plots

We will now look at a number of different plots that can help us interpret the PLS results.

The first is the “X/Y Overview plot”, which gives R2 and Q2 for each original X. This tells us how well each original variable was modelled.

Example 1

Page 11: PCA: Loadings Plot (p1/p3)

X/Y Overview

[Figure: bar chart of R2VY[4](cum) and Q2VY[4](cum) for each variable, vertical axis 0.00 to 1.00, horizontal axis "Var ID (Var. Sec. ID:1)". Model: "32-months of 1 day.M2 (PLS)". The variables shown include the pulp characteristics Pex_L1_Blan, Pex_L1_Cons, Pex_L1_CSF, Pex_L1_LMF, Pex_L1_P200, Pex_L1_PFC, Pex_L1_PFL, Pex_L1_PFM, Pex_L1_R100, Pex_L1_R14, Pex_L1_R28 and Pex_L1_R48, together with process variables labelled in French.]

Example 1

Page 12: PCA: Loadings Plot (p1/p3)

PLS: Other plots

The next type of plot is the “coefficient plot”, which shows the actual PLS equation in graphical form. Coefficients for each X are shown as positive or negative bars.

Example 1

Page 13: PCA: Loadings Plot (p1/p3)

PLS coefficients

[Figure: bar chart of CoeffCS[1](53NIC013.PV), one positive or negative bar per X variable, vertical axis -0.030 to 0.050, horizontal axis "Var ID (Var. Sec. ID:1)". Model: "32-months of 1 day.M2 (PLS)". The X variables are labelled with their French process descriptions, beginning with SEASON.]

Example 1

Page 14: PCA: Loadings Plot (p1/p3)

PLS: VIP plots

Another very useful output is the ‘Variable Importance Plot’ (VIP) which ranks the X’s in terms of importance to the model.
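For reference, one commonly used definition of VIP is given below. The slides do not state which formula the software applies, so this should be read as an assumption. Here K is the number of X’s, A the number of components, w_{ja} the weight of variable j in component a (each weight vector scaled to unit length), and SSY_a the sum of squares of Y explained by component a.

```latex
\mathrm{VIP}_j = \sqrt{\, K \cdot \frac{\sum_{a=1}^{A} SSY_a \, w_{ja}^{2}}{\sum_{a=1}^{A} SSY_a} \,}
```

With this definition the average of VIP_j^2 over all X’s equals 1, which is why X’s with VIP above 1 are often treated as more important than average.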

Note that, because no designed experiment has taken place, we cannot infer that these X’s influence the Y’s. MVA on its own does not prove cause and effect. All we can say is that they are correlated, meaning that they tend to change at the same time. The real cause may be external, like a change in raw material quality.

Let’s have a look at the VIP plot.

Example 1

Page 15: PCA: Loadings Plot (p1/p3)

“Variable Importance Plot”

[Figure: bar chart of VIP[3] by Var ID (Primary), vertical axis 0.00 to 2.50. Model: "34-months of 1 day rev. 2 (incl. chip data) no. 2.M8 (PLS), PLS of Y1's only". Variables in the order shown: SEASON, 53PIC305.PV, 52ZI144.AI, 53PIC308.PV, 52PIP193.AI, 52ZIC148.PV, 52PIA193.AI, 52PIP143.AI, 52TIC102.PV, 52TR964.AI, 52FIC167.PV, 52TI011.AI, 811FI102.AI, 52PIC961.PV, 52TI031.AI, 52FIC116.PV, 52PI128.AI, 52FIC164.PV, 52JI189.AI, 52XQI195.AI, 52X_130.AI_split_L1., 53FI012.AI, 52ZIC197.PV, 52FFC117.PV, 52TI118.AI, 52FR960.AI, Cop>3/8, 52PCA161.PV.]

X’s: These are the X’s that have the strongest correlation to our Y’s.

Y’s: Average fibre length variables

Example 1

Page 16: PCA: Loadings Plot (p1/p3)

The most important X’s

The most important X, according to the VIP plot, is “Season”. This means that the fibre length varies more with the season than with any other X variable.

The other X’s on the list are mainly refiner operating parameters such as dilution water flows, hydraulic pressures, and energy inputs. An expert on refiner operation would find these results interesting, but we will not examine them in detail here.

Example 1

Page 17: PCA: Loadings Plot (p1/p3)

The limitations of PLS

PLS results are difficult to interpret. It is always preferable to perform a PCA on the entire dataset first, to get a feel for the overall trends.

One of the trickiest aspects of PLS is that the first component in the X space must correspond to the first component in the Y space, the second with the second, and so forth. Finding a physical interpretation for each of these can be extremely difficult.

It is critical for the student to understand that only those X’s which were measured can be included in the PLS model. There is nothing magical about PCA or PLS. These techniques can only find patterns and correlations that existed in the original data in the first place.

MVA is not magic

Example 1

Page 18: PCA: Loadings Plot (p1/p3)

End of Example 1:

We’re starting to tame the MVA lion!

Page 19: PCA: Loadings Plot (p1/p3)

2.2: Example (2)

Using Fewer Variables

Page 20: PCA: Loadings Plot (p1/p3)

Why use fewer variables?

One obvious problem with the previous example is that the plots are very hard to read, because there are so many variables. We will therefore look at a smaller number of variables from the same dataset.

There is another good reason for doing this. In the previous example, our first “throughput” component dominated the others, probably because so many process variables are associated either directly or indirectly with the overall flowrate through the system.

In other words, there was a great deal of redundancy in our choice of variables. This was not inherently a bad thing, and we did manage to learn some useful things about our process, but perhaps by reducing the number of initial variables we can learn other things as well.

Example 2

Page 21: PCA: Loadings Plot (p1/p3)

Iterative nature of MVA

At this point, our approach is probably starting to look CONFUSED. The student may be wondering:

• Do we use all the data, or remove the outliers first?
• Do we do PCA, or PLS?
• Do we use all the variables, or fewer variables?

The answer is that MVA is very iterative, and there is no foolproof recipe. The results of one step guide the next. Sometimes you have to try different things to get useful results, bearing in mind what you know about the process itself and the dataset.

People who are adept at using MVA have a tendency to try all kinds of things, all kinds of different ways. In fact, just doing a basic PCA is the easy part. The difficult part is deciding what to try next, because there are countless possibilities. Knowledge of the process itself is key, which is why this is a job for chemical engineers and not statisticians.

Example 2

Page 22: PCA: Loadings Plot (p1/p3)

Which variables to use?

Getting back to the example, we made a ‘short list’ of key variables based on our knowledge of the process itself. Just because hundreds of variables are available does not mean that we are obliged to use them all for each MVA trial.

The variables related mainly to chip quality (density and moisture content) and to pulp quality (brightness, consistency, …). Also included were “SEASON”, given its prominence in the previous PCA analysis, bleach consumption and specific refiner energy.

In all, only 14 variables were used.

Example 2

Page 23: PCA: Loadings Plot (p1/p3)

PCA on 14 variables

[Figure: bar chart of R2X(cum) and Q2(cum) versus component number (Comp[1] and Comp[2]), vertical axis 0.00 to 1.00. Model: "34-months of 1 day rev. 2 (incl. chip data) no. 2.M25 (PCA-X), PCA of Tom Browne variables".]

This is the R2 and Q2 plot for the 14 variables. The MVA software only found 2 components, which is not uncommon when there are so few initial variables. The first component explains 28% of the variability in the original data, the second another 16%, for a total of 44%.

The Q2 values are much lower, with a cumulative of barely 24%. This means that the predictive value of the model is much lower than before. This is hardly surprising, since the inherent information contained within the 116 excluded variables is now missing.

Example 2

Page 24: PCA: Loadings Plot (p1/p3)

Score Plot for 14 variables

The score plot for the 14 variables is shown on the next page. It is impossible to create a 3-D score plot in this case, since there are only 2 components.

The vast majority of the days fall on or near the first component. It is plainly obvious from this graph that the first component is related to individual seasons, with clear segregation between the three years.

Note how this first component resembles the second component from example 1 (more on this later…)

Autumn: Sep 1 – Nov 30
Winter: Dec 1 – Feb 28
Spring: Mar 1 – May 31
Summer: Jun 1 – Aug 31
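The season classes defined above could be assigned to each daily observation with a small helper like the one below. This is a sketch only: the function name `season_class` is hypothetical, and treating Feb 29 of a leap year as winter is an assumption, since the slide only lists Feb 28.

```python
from datetime import date

def season_class(d):
    """Map a calendar date to the season classes listed on the slide."""
    if d.month in (9, 10, 11):
        return "Autumn"   # Sep 1 - Nov 30
    if d.month in (12, 1, 2):
        return "Winter"   # Dec 1 - Feb 28 (Feb 29 assumed to be winter too)
    if d.month in (3, 4, 5):
        return "Spring"   # Mar 1 - May 31
    return "Summer"       # Jun 1 - Aug 31

# Example: one of the dates discussed later in this example.
print(season_class(date(2001, 6, 29)))
```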

Example 2

Page 25: PCA: Loadings Plot (p1/p3)

Score Plot

[Figure: PCA score plot, t[1] (horizontal axis, -5 to 5) vs. t[2] (vertical axis, -4 to 14), coloured according to season class (Autumn: Sep 1 – Nov 30; Winter: Dec 1 – Feb 28; Spring: Mar 1 – May 31; Summer: Jun 1 – Aug 31). Model: "34-months of 1 day rev. 2 (incl. chip data) no. 2.M25 (PCA-X), PCA of Tom Browne variables". Annotations: WINTER and SUMMER along the first component; clusters labelled by year (2000, 2001, 2001/2002, 2002); the points for Jun 25 – Jul 1, 2001 and Aug 8 – 12, 2001 are marked "2nd component strongly influenced by these points".]

INTERPRETATION
Component 1: Individual seasons

Page 26: PCA: Loadings Plot (p1/p3)

Second component

The second component is largely influenced by the observations in the upper-right quadrant (remember, it is the observations that influence the components, not the other way around). Looking back at the original data, we saw that these observations fell within certain specific periods in June and August 2001.

What differentiated these periods from the rest of our three-year timeframe?

Trying to figure this out by looking at the original data would be very tedious, if not impossible. We therefore make use of the ‘Contribution plot’ for one of the dates of interest.

The contribution plot shows the values of the original variables for that observation point (June 29, 2001) relative to the average of all the observations taken together. It gives us a quick, visual answer to “What’s different about this observation?”
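The idea behind a contribution plot can be sketched in a few lines of Python (NumPy). The weighting used here, the scaled deviation from the average multiplied by the combined magnitude of the first two loadings, is a plausible reading of "Weight=p1p2" and is an assumption, not a documented formula of the software. The data, loadings and the function name `score_contribution` are all made up for illustration.

```python
import numpy as np

def score_contribution(X, obs_index, loadings):
    """Signed contribution of each variable for one observation vs. the average."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    z = (X[obs_index] - mean) / std                  # deviation in scaled units
    weight = np.sqrt(np.sum(loadings**2, axis=1))    # combined p1/p2 magnitude
    return z * weight

# Made-up data: 5 observations of 3 hypothetical variables.
X = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 11.0, 5.5],
    [3.0,  9.0, 4.5],
    [2.0, 10.0, 5.0],
    [2.0, 10.0, 9.0],   # last day: abnormal value in the third variable
])
# Made-up loadings: one row per variable, columns are p1 and p2.
loadings = np.array([[0.6, 0.1], [0.5, -0.3], [0.4, 0.7]])

contrib = score_contribution(X, obs_index=4, loadings=loadings)
print(np.round(contrib, 3))
```

A positive bar means the observation is above the overall average for that variable, a negative bar means below; variables at their average value contribute nothing.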

Example 2

Page 27: PCA: Loadings Plot (p1/p3)

Contribution Plot: June 29, 2001

[Figure: score contribution plot, "Score Contrib(Obs 584 - Average), Weight=p1p2", vertical axis -10 to 12, one bar per variable (Var ID, Primary): SEASON, 52FIC165.PV, 52XAI130.AI, Pex_L1_Blan, Pex_L1_Cons, Pex_L1_CSF, Pex_L1_LMF, Pex_L1_P200, Pex_L1_R100, Pex_L1_R14, Pex_L1_R28, Pex_L1_R48, CopDENS, CopSICC. Model: "34-months of 1 day rev. 2 (incl. chip data) no. 2.M25 (PCA-X), PCA of Tom Browne variables". Annotations: "More fines than average"; "Fewer long fibres than average".]

Page 28: PCA: Loadings Plot (p1/p3)

Contribution plot results

The bars on the contribution plot graph tell an important story: during the period of interest, the refiners generated more fines than usual, and fewer long fibres. It appears that the refiners were chopping up the fibres, eliminating the longest size fractions while generating fine fragments. This is not a desirable process performance, and therefore a significant finding.

A study of the loadings plot confirms that the second component is definitely related to fibre length (variables in red ovals). Note that a variable does not have to lie directly upon a component to influence it; in this case, very few of the variables are close to the component line, yet clearly they are affecting it. Their distance from the axis merely means that they are also related to the first component. Note that specific energy is also related to the second component (green oval). This is highly significant, since it is this energy that chops the fibres!

Bleach consumption, pulp brightness and season are related to the first component (blue ovals). Again, this is similar to the second component in example 1.

Example 2

Page 29: PCA: Loadings Plot (p1/p3)

Loadings Plot

[Figure: PCA loadings plot, p[1] (horizontal axis, -0.40 to 0.30) vs. p[2] (vertical axis, -0.50 to 0.50). Model: "34-months of 1 day rev. 2 (incl. chip data) no. 2.M25 (PCA-X), PCA of Tom Browne variables". Variables shown: SEASON, HYDRO. SULF.RAFF.NO.5, ENERGI SPECIF LIGNE 1, Pex_L1_Blan, Pex_L1_Cons, Pex_L1_CSF, Pex_L1_LMF, Pex_L1_P200, Pex_L1_R100, Pex_L1_R14, Pex_L1_R28, Pex_L1_R48, Copeaux DENSITE, Copeaux SICCITE. The positive (+) and negative (-) directions of the second component are marked.]

INTERPRETATION
Component 2: Fibre length

Page 30: PCA: Loadings Plot (p1/p3)

Same two components?

The most striking difference between the example 1 results and the example 2 results is that the “throughput” component has disappeared. This is because we have removed all the variables that relate to this process parameter.

This leaves us to wonder if the two components we found in example 2 are just the second and third components from example 1. In other words, now that we’ve eliminated throughput, the next most significant component has been “promoted” to become the first component, and the third to second. Because all components are statistically independent, this is plausible.

The physical interpretations of these components seem to be compatible, so this shift is entirely possible. If so, a comparison of examples 1 and 2 could give us further insights into the process.
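One simple way to check whether two components from different models point in the same direction is to compare their loadings over the variables the models share. The sketch below uses the absolute cosine between the two loading vectors; this is a suggested check, not something done in the slides, and the function name and the numbers are made up.

```python
import numpy as np

def loading_similarity(p_a, p_b):
    """Absolute cosine between two loading vectors over the same variables.

    The absolute value is taken because the sign of a component is arbitrary.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    return abs(p_a @ p_b) / (np.linalg.norm(p_a) * np.linalg.norm(p_b))

# Made-up loadings for three shared variables in two hypothetical models.
p2_example1 = [0.30, -0.25, 0.10]
p1_example2 = [-0.45, 0.40, -0.12]
print(round(float(loading_similarity(p2_example1, p1_example2)), 3))
```

A value near 1 would support the idea that the component was simply "promoted"; a value near 0 would suggest the two components describe different patterns.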

Example 2

Page 31: PCA: Loadings Plot (p1/p3)

Was it worth trying fewer variables?

Absolutely!

We were able to generate cleaner, easier-to-interpret graphs, while focussing on the variables we were the most interested in.

Once again, we saw the importance of Season, lending credence to our physical interpretations in Example 1. Other similarities with the example 1 results, particularly the two components themselves, could yield further insights about what is actually going on in the process.

However, the Q2 for this bare-bones case is quite low, meaning this model has poor predictive value. Also, a great many important variables were left out completely, so this is not a full picture, but rather an additional view of our original dataset.

Example 2

Page 32: PCA: Loadings Plot (p1/p3)

End of Example 2:

Getting smarter…