
Assessment of maximum likelihood PCA missing data imputation

A. Folch-Fortuny1,*, F. Arteaga2, A. Ferrer1

1 Dep. de Estadística e Investigación Operativa Aplicadas y Calidad, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain.

2 Dep. of Biostatistics and Investigation, Universidad Católica de Valencia San Vicente Mártir, C/ Quevedo 2, 46001 Valencia, Spain.

*Correspondence to: A. Folch-Fortuny ([email protected])

Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non-measured values. An assessment of maximum likelihood missing data imputation is performed in this paper, analysing the algorithm of MLPCA and adapting several methods for PCA model building with missing data to their maximum likelihood versions. In this way, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS), and trimmed scores regression (TSR) methods are implemented within the MLPCA method to work as different imputation steps. Six data sets are analysed using several percentages of missing data, comparing the performance of the original algorithm, and its adapted regression-based methods, with other state-of-the-art methods.

Keywords: maximum likelihood principal component analysis, missing data, regression-based methods, PCA model building, trimmed scores regression

1. INTRODUCTION

Principal component analysis1 (PCA) is one of the most applied methods for data understanding. The original variables are projected onto the latent space, where data most vary, and a new set of uncorrelated variables is obtained, the principal components (PCs), summarizing the most relevant features of the data. Wentzell et al.2 proposed in 1997 a new PCA approach, based on maximum likelihood fitting, called maximum likelihood PCA (MLPCA). This methodology allows incorporating information about the measurement errors in the model. MLPCA has been widely applied in several works within chemistry and biology, e.g. to analyse reflectance Fourier transform infrared (FTIR) microspectroscopic data3 and ion mass spectrometric data4, for fault detection in the process industry5, for the characterization of measurement errors in nuclear magnetic resonance (NMR) data6 and gene expression data7, to determine the appropriate number of reactions in stoichiometric modelling8, and as a useful preprocessing tool for metabolomic, proteomic, transcriptomic9 and environmental10 data analysis.

Shortly after the publication of the original MLPCA algorithm, an application of this method was proposed addressing the missing data (MD) problem in PCA model building (PCA-MB)11. MLPCA deals with the missing values by assigning them large variances prior to implementing the method, which guides the algorithm to fit a PCA model disregarding these data points. The MLPCA approach for MD has been applied successfully in the literature to fluorescent, chromatographic, near-infrared spectroscopic11, spectrophotometric12, and environmental13 data.

Folch-Fortuny et al.14 address the problem of PCA-MB with missing data. In this work, several methods originally proposed for PCA model exploitation (PCA-ME)15,16, i.e. when a fixed PCA model is used to infer missing values in new incomplete observations, are adapted to the model building context. Basically, the idea was to adapt the known iterative algorithm (IA)17 for PCA-MB by replacing the prediction of the PCA model with that obtained when each incomplete row in the data set is treated as a new observation with missing values, applying the projection to the model plane (PMP) method for PCA-ME18. This adaptation arose from the fact that PMP is, under some general conditions, equivalent to IA and to the minimization of the squared prediction error (SPE) for PCA-ME, as proved in 15. Thus, the aim in 14 was to assess whether this equivalence held in the PCA-MB context.

The regression-based methods proposed in 15 were also adapted, jointly with PMP, to the PCA-MB context in 14. These methods are: known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR). All these methods impute the missing values in a data set by fitting different regression-based schemes between the available data and the missing positions. Several other methods were compared to the previous ones in 14, including the modified NIPALS algorithm19, the nonlinear programming approach20 and multiple imputation by data augmentation21. The conclusion was that TSR represents a good compromise solution between prediction quality, robustness against data structure and computation time14, outperforming other approaches implemented in commercial software such as ProSensus22, SIMCA23 and PLS Toolbox24. TSR and most of the other approaches compared in 14 are now implemented in a freely available user-friendly MATLAB toolbox25 (http://mseg.webs.upv.es).

Nelson26 showed the equivalence between the scores calculation by columns in MLPCA and the PMP algorithm for PCA-ME. Here, we prove the equivalence between the imputation step by columns in the MLPCA algorithm and the adapted PMP method for PCA-MB.

The aim of this paper is, thus, to answer three questions that arise from the aforementioned equivalence:

i) Once the algorithms converge, are the imputed values of MLPCA and PMP for PCA-MB equal?

ii) Since TSR outperforms PMP, as proven in 14, if the imputation step in MLPCA is substituted by a TSR-based imputation, does the imputation outperform the original MLPCA?

iii) In any case, does MLPCA, or its adapted version with TSR, outperform the original TSR algorithm?

To answer these research questions, we propose here to adapt the regression-based methods14 (KDR, KDR with PCR, KDR with PLS and TSR) to work as different imputation steps within the MLPCA algorithm, providing a framework for maximum likelihood missing data imputation. The performance of these methods is compared to the PMP and TSR methods using six data sets, actual and simulated ones, from different research areas.

The rest of the paper is organised as follows. Section 2 proves the equivalence between the imputation step by columns of MLPCA and the PMP method for PCA model building, and describes how the regression-based methods are adapted to their maximum likelihood (ML) versions. Section 3 describes the data sets used in this study, as well as how the comparative study is performed. Section 4 shows the results of the ML regression-based methods, jointly with the original PMP and TSR algorithms. Finally, the conclusions are highlighted in Section 5.

2. METHODOLOGY

Let $\mathbf{X}$ be an $N \times K$ matrix, $\mathbf{x}_i^T$ its $i$-th row and $\mathbf{y}_k$ its $k$-th column. Each row represents a point in the $K$-dimensional space of the $\mathbf{X}$ observations, and each column a point in the $N$-dimensional space of the $\mathbf{X}$ variables. Row $i$ can be decomposed as $\mathbf{x}_i^T = \mathbf{x}_i^{0T} + \boldsymbol{\varepsilon}_i^T$, where $\mathbf{x}_i^0$ are the true values and $\boldsymbol{\varepsilon}_i$ their measurement errors26. Likewise, column $k$ can be decomposed into its true and error parts: $\mathbf{y}_k = \mathbf{y}_k^0 + \boldsymbol{\eta}_k$. Both errors are assumed normally distributed in each of the $K$ and $N$ dimensions, respectively.

The maximisation of the likelihood is obtained by minimising the following objective function:

$$S^2 = \sum_{i=1}^{N} (\mathbf{x}_i - \hat{\mathbf{x}}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_i - \hat{\mathbf{x}}_i) = \sum_{k=1}^{K} (\mathbf{y}_k - \hat{\mathbf{y}}_k)^T \boldsymbol{\Psi}_k^{-1} (\mathbf{y}_k - \hat{\mathbf{y}}_k) \qquad (1)$$

where $\boldsymbol{\Sigma}_i$ is the covariance matrix of the errors $\boldsymbol{\varepsilon}_i$ of observation $\mathbf{x}_i$, and $\boldsymbol{\Psi}_k$ is the covariance matrix of the errors $\boldsymbol{\eta}_k$ of variable $\mathbf{y}_k$. The estimations of both vectors arise from:

$$\hat{\mathbf{x}}_i = \mathbf{P}(\mathbf{P}^T\boldsymbol{\Sigma}_i^{-1}\mathbf{P})^{-1}\mathbf{P}^T\boldsymbol{\Sigma}_i^{-1}\mathbf{x}_i \qquad (2)$$

$$\hat{\mathbf{y}}_k = \mathbf{U}(\mathbf{U}^T\boldsymbol{\Psi}_k^{-1}\mathbf{U})^{-1}\mathbf{U}^T\boldsymbol{\Psi}_k^{-1}\mathbf{y}_k \qquad (3)$$

where $\mathbf{U}$ ($N \times A$), $\mathbf{D}$ ($A \times A$) and $\mathbf{P}$ ($K \times A$) come from the singular value decomposition of $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{P}^T = [\mathbf{y}_1 \ldots \mathbf{y}_K] = [\mathbf{x}_1 \ldots \mathbf{x}_N]^T$, using $A$ dimensions or components.

The MLPCA algorithm is an alternating least squares procedure that starts by imputing initial guesses for $\mathbf{U}$ and $\mathbf{P}$ based on the SVD of $\mathbf{X}$. At each iteration, the algorithm has two steps. The first one consists of projecting the rows $\mathbf{x}_i^T$ onto the columns of $\mathbf{P}$, computing the objective function, and recalculating $\mathbf{U}$ and $\mathbf{P}$ from an SVD using the estimations. The second step consists of projecting the columns $\mathbf{y}_k$ onto the columns of $\mathbf{U}$, also computing the objective function, and finally recalculating $\mathbf{U}$ and $\mathbf{P}$ again from an SVD. Convergence is achieved when the difference between the estimations of the observations falls below a specified threshold27.

The adaptation of MLPCA to model building with missing data assumes uncorrelated errors for both objects $\mathbf{x}_i^T$ and variables $\mathbf{y}_k$; therefore, matrices $\boldsymbol{\Sigma}_i$ and $\boldsymbol{\Psi}_k$ are diagonal26,27. In this algorithm, large variances ($10^{10}$) are assigned to the missing measurements, and ones to the available ones. Therefore, the inversion of matrices $\boldsymbol{\Sigma}_i$ and $\boldsymbol{\Psi}_k$ produces diagonal matrices with 1s and 0s. The 1s serve to fit these specific measurements in the PCA model and the 0s to disregard the missing measurements in the multivariate model.
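As a minimal illustration of this weighting (a sketch with made-up values, not the authors' code), the diagonal weight matrix $\boldsymbol{\Sigma}_i^{-1}$ for one row could be built in MATLAB as:

x = [1.2 NaN 0.7 NaN 2.3];        % hypothetical row; NaN marks missing values
K = numel(x);
varx = ones(1,K);                 % unit variance for the available measurements
varx(isnan(x)) = 1e10;            % very large variance for the missing ones
Sigma_inv = diag(1./varx);        % approximately a diagonal matrix of 1s and 0s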

Let us assume that row $i$ has missing values. The values in this vector can be rearranged, without loss of generality, to have the missing values in its first $R_i$ positions and the remaining $K - R_i$ available values at the end. This partition in $\mathbf{x}_i^T$ induces a partition in the $\mathbf{X}$ data set, $\mathbf{X}^{\#}$ ($N \times R_i$) being the missing part and $\mathbf{X}^{*}$ ($N \times (K - R_i)$) the available part, according to row $i$. Additionally, this partition can be transferred to an SVD (or PCA) model, $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{P}^T$, $\mathbf{P}^{\#}$ ($R_i \times A$) being the missing part of the loadings matrix, and $\mathbf{P}^{*}$ ($(K - R_i) \times A$) the available part. Figure 1 shows a scheme of this notation.

Figure 1. Partition induced in the $\mathbf{X}$ matrix by the missing data in its $i$-th row. Grey squares denote missing positions in the data set.

Using this partition, the inverse of matrix $\boldsymbol{\Sigma}_i$ can be written as:

$$\boldsymbol{\Sigma}_i^{-1} = \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{K-R_i} \end{bmatrix} \qquad (4)$$


where $\mathbf{I}_{K-R_i}$ is the identity matrix with $K - R_i$ rows/columns, according to the missing data pattern in $\mathbf{x}_i^T$.

Substituting this expression in Equation 2, observation $\mathbf{x}_i^T$ can be computed as:

$$\hat{\mathbf{x}}_i = \begin{bmatrix} \hat{\mathbf{x}}_i^{\#} \\ \hat{\mathbf{x}}_i^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{P}^{\#} \\ \mathbf{P}^{*} \end{bmatrix} \left( \begin{bmatrix} \mathbf{P}^{\#T} & \mathbf{P}^{*T} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{K-R_i} \end{bmatrix} \begin{bmatrix} \mathbf{P}^{\#} \\ \mathbf{P}^{*} \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{P}^{\#T} & \mathbf{P}^{*T} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{K-R_i} \end{bmatrix} \begin{bmatrix} \mathbf{0} \\ \mathbf{x}_i^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{P}^{\#} \\ \mathbf{P}^{*} \end{bmatrix} (\mathbf{P}^{*T}\mathbf{P}^{*})^{-1}\mathbf{P}^{*T}\mathbf{x}_i^{*} = \begin{bmatrix} \mathbf{P}^{\#}(\mathbf{P}^{*T}\mathbf{P}^{*})^{-1}\mathbf{P}^{*T}\mathbf{x}_i^{*} \\ \mathbf{P}^{*}(\mathbf{P}^{*T}\mathbf{P}^{*})^{-1}\mathbf{P}^{*T}\mathbf{x}_i^{*} \end{bmatrix} \qquad (5)$$

Alternatively, the inverse of matrix $\boldsymbol{\Psi}_k$ can be written as:

$$\boldsymbol{\Psi}_k^{-1} = \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{N-R_k} \end{bmatrix} \qquad (6)$$

where $\mathbf{I}_{N-R_k}$ is the identity matrix with $N - R_k$ rows/columns, according to column $\mathbf{y}_k$. Following Equation 3, $\hat{\mathbf{y}}_k$ is therefore computed as:

$$\hat{\mathbf{y}}_k = \begin{bmatrix} \hat{\mathbf{y}}_k^{\#} \\ \hat{\mathbf{y}}_k^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{U}^{\#} \\ \mathbf{U}^{*} \end{bmatrix} \left( \begin{bmatrix} \mathbf{U}^{\#T} & \mathbf{U}^{*T} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{N-R_k} \end{bmatrix} \begin{bmatrix} \mathbf{U}^{\#} \\ \mathbf{U}^{*} \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{U}^{\#T} & \mathbf{U}^{*T} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{N-R_k} \end{bmatrix} \begin{bmatrix} \mathbf{0} \\ \mathbf{y}_k^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{U}^{\#} \\ \mathbf{U}^{*} \end{bmatrix} (\mathbf{U}^{*T}\mathbf{U}^{*})^{-1}\mathbf{U}^{*T}\mathbf{y}_k^{*} = \begin{bmatrix} \mathbf{U}^{\#}(\mathbf{U}^{*T}\mathbf{U}^{*})^{-1}\mathbf{U}^{*T}\mathbf{y}_k^{*} \\ \mathbf{U}^{*}(\mathbf{U}^{*T}\mathbf{U}^{*})^{-1}\mathbf{U}^{*T}\mathbf{y}_k^{*} \end{bmatrix} \qquad (7)$$

where $\mathbf{U}^{\#}$ ($R_k \times A$) and $\mathbf{U}^{*}$ ($(N - R_k) \times A$) are the missing and available parts of $\mathbf{U}$.

The MLPCA imputation step of the missing values $\hat{\mathbf{x}}_i^{\#}$ is the same as the PMP method for PCA model building presented recently in 14. The main difference between the MLPCA algorithm and PMP is that the former performs the imputation iteratively first by observations and then by variables, instead of only by observations, as PMP does. Additionally, convergence in PMP is assessed only on the imputed missing values, instead of on the predictions of the available measurements, as in MLPCA.

In 14 it was shown that the imputation step in the adapted PMP algorithm for PCA-MB could be substituted by the regression-based methods presented in 15 (KDR and its variants, and TSR). Most of these methods showed a superior performance to PMP across several case studies. So, the idea here consists of adapting the alternating imputation of the MLPCA algorithm to include the imputation step of the regression-based methods, thus proposing a maximum likelihood (ML) framework: ML-KDR, ML-KDR with PCR, ML-KDR with PLS and ML-TSR.

The imputation step of the regression-based missing data methods is:

$$\hat{\mathbf{x}}_i = \begin{bmatrix} \hat{\mathbf{x}}_i^{\#} \\ \hat{\mathbf{x}}_i^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{S}^{\#*}\mathbf{L}(\mathbf{L}^T\mathbf{S}^{**}\mathbf{L})^{-1}\mathbf{L}^T\mathbf{x}_i^{*} \\ \mathbf{S}^{**}\mathbf{L}(\mathbf{L}^T\mathbf{S}^{**}\mathbf{L})^{-1}\mathbf{L}^T\mathbf{x}_i^{*} \end{bmatrix} \qquad (8)$$

where $\mathbf{S}$ is the covariance matrix of $\mathbf{X}$, and:

$$\mathbf{S} = [\mathbf{X}^{\#}\ \mathbf{X}^{*}]^T[\mathbf{X}^{\#}\ \mathbf{X}^{*}]/(N-1) = \begin{bmatrix} \mathbf{X}^{\#T}\mathbf{X}^{\#} & \mathbf{X}^{\#T}\mathbf{X}^{*} \\ \mathbf{X}^{*T}\mathbf{X}^{\#} & \mathbf{X}^{*T}\mathbf{X}^{*} \end{bmatrix}/(N-1) = \begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix} \qquad (9)$$

The key matrix $\mathbf{L}$ in Equation 8 particularises which method of the framework is being used for the imputation: $\mathbf{L} = \mathbf{I}$ for KDR; $\mathbf{L} = \mathbf{V}_{1:\rho}$ for KDR with PCR, where $\mathbf{V}_{1:\rho}$ is the eigenvector matrix of $\mathbf{S}^{**}$ and $\rho \leq \mathrm{rank}(\mathbf{S}^{**})$; $\mathbf{L} = \mathbf{W}^{*}$ for KDR with PLS, where $\mathbf{W}^{*}$ is the loadings matrix of the PLS model $\mathbf{T}_{PLS} = \mathbf{X}^{*}\mathbf{W}^{*}$; and $\mathbf{L} = \mathbf{P}^{*}$ for TSR.
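As a minimal sketch of this selection (hypothetical variable names, not the authors' code: S_aa stands for $\mathbf{S}^{**}$, S_ma for $\mathbf{S}^{\#*}$, P_a for $\mathbf{P}^{*}$, W_a for a precomputed PLS weight matrix, x_a for $\mathbf{x}_i^{*}$ and rho for the number of retained components):

% Choose the key matrix L for the regression-based imputation step
switch method
    case 'KDR'
        L = eye(size(S_aa,1));          % regression on all available variables
    case 'KDR-PCR'
        [V,D] = eig(S_aa);
        [~,idx] = sort(diag(D),'descend');
        L = V(:,idx(1:rho));            % first rho eigenvectors of S**
    case 'KDR-PLS'
        L = W_a;                        % matrix of the PLS model T_PLS = X*W* (see text)
    case 'TSR'
        L = P_a;                        % available part of the PCA loadings
end
% Imputation of the missing part of row i (upper block of Equation 8)
x_hat_missing = S_ma*L*((L'*S_aa*L)\(L'*x_a));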

Therefore, to adapt the original MLPCA algorithm11 to use the regression-based methods, we have to substitute the imputation step (Equations 2-3) by:

$$\hat{\mathbf{x}}_i = \mathbf{S}\boldsymbol{\Lambda}_i\mathbf{L}_i(\mathbf{L}_i^T\boldsymbol{\Lambda}_i^T\mathbf{S}\boldsymbol{\Lambda}_i\mathbf{L}_i)^{-1}\mathbf{L}_i^T\boldsymbol{\Lambda}_i^T\mathbf{x}_i \qquad (10)$$

$$\hat{\mathbf{y}}_k = \mathbf{S}\boldsymbol{\Phi}_k\mathbf{L}_k(\mathbf{L}_k^T\boldsymbol{\Phi}_k^T\mathbf{S}\boldsymbol{\Phi}_k\mathbf{L}_k)^{-1}\mathbf{L}_k^T\boldsymbol{\Phi}_k^T\mathbf{y}_k \qquad (11)$$

where the $\mathbf{L}$ matrix is the same as in the regression-based framework, particularised for the missing data pattern in row $i$ or column $k$ (in Equation 11, $\mathbf{S}$ denotes the $N \times N$ covariance matrix computed on the transposed data matrix). And:

$$\boldsymbol{\Lambda}_i = \begin{bmatrix} \mathbf{0} \\ \mathbf{I}_{K-R_i} \end{bmatrix} \qquad (12)$$

$$\boldsymbol{\Phi}_k = \begin{bmatrix} \mathbf{0} \\ \mathbf{I}_{N-R_k} \end{bmatrix} \qquad (13)$$

The equivalence between Equations 10 and 8 is shown in Appendix A. For more details on MLPCA, readers are referred to 26,27 and the original paper11. The Matlab source code of the algorithm for PCA model building with missing data is reproduced here in Appendix B with slight changes to introduce the imputation step of the regression-based methods.


3. DATA SETS AND COMPARATIVE STUDY

Six data sets are used in the present study to compare the results of the different imputation methods included in the framework. The first data set contains FTIR microscopy spectra of a polymer laminate consisting of three layers: polyethylene (PE), isophthalic polyester (IPE, presence originally unknown), and polyethylene terephthalate (PET). The polymer was scanned in a seventeen-point transect across the different layers, obtaining measurements at 81 wavelengths28-30. The second case study consists of a set of measured and inferred fluxes from Pichia pastoris cultures on heterogeneous culture media31. The measured fluxes were collected from a literature review; later on, a grey modelling approach was applied to infer the intracellular fluxes according to the observed extracellular ones. From the original data set with 3600 scenarios and 45 fluxes, a representative sample of 105 individuals is selected for the present comparative study. This data set has 3 biologically relevant PCs. The third data set is a simulated one, with 100 observations and 10 variables, used to compare the performance of the different maximum likelihood methods32,33. This data set has 4 relevant eigenvalues (3, 2.5, 2 and 1.5), explaining 90% of the variance in the data.
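For reference, a minimal MATLAB sketch of how a data set with such a prescribed eigenvalue structure can be simulated is given below. This is a generic construction, not necessarily the procedure of references 32,33; the six smallest eigenvalues are an assumption chosen so that the first four explain 90% of the total variance.

% Simulate N x K normal data whose covariance has a prescribed eigenvalue spectrum
N = 100; K = 10;
eigvals = [3 2.5 2 1.5, ones(1,6)/6];   % first four eigenvalues sum to 9 (90% of 10)
[Q,~] = qr(randn(K));                   % random orthogonal eigenvector basis
C = Q*diag(eigvals)*Q';                 % target covariance matrix
X = randn(N,K)*chol(C);                 % simulated data, cov(X) approximately equal to C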

Three additional data sets are analysed here, taken from 14, where the adaptation of the regression-based methods to PCA-MB was proposed. The first one consists of the percentage composition of eight fatty acids in 75 olive oils from South Apulia34. The second one is a set of NIR spectra (750-1550 nm in 2 nm intervals) measured on 40 diesel fuels35. And the last one is a 100×10 simulated data set with 3 components explaining 40, 30 and 20% of the variance32,33.

Two performance criteria are used to compare the results of the different methods. The first one is the mean squared prediction error (MSPE):

$$MSPE(\text{Method}) = \frac{\sum_{i=1}^{N}\sum_{j=1}^{K}\left(\hat{x}_{ij} - \hat{x}_{ij}^{\text{Method}}\right)^{2}}{NK} \qquad (14)$$

where $\hat{x}_{ij}$ is the predicted value for the $j$-th variable of the $i$-th observation in the prediction matrix $\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^{T}$ obtained from the complete data set; $\hat{x}_{ij}^{\text{Method}}$ is the analogous prediction obtained after applying the corresponding method on the incomplete data set; and $N$ and $K$ are the number of rows and columns of the data set, respectively. The original regression-based framework methods use, as convergence criterion, the difference between consecutive imputations of the missing values. Instead, MLPCA uses the difference between the available measurements and their predictions from the current PCA model. For this reason, we decided to show the MSPEs for the available measurements and for the missing ones separately, using Equation 14.
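A minimal sketch of how these two quantities can be computed (hypothetical variable names: Xhat_full is the prediction from the complete data, Xhat_method the one after imputation, and M the logical missing-data pattern; each subset is averaged over its own number of entries, which is one natural reading of Equation 14):

% MSPE split into available and missing positions
E2 = (Xhat_full - Xhat_method).^2;    % squared prediction errors, N x K
mspe_all       = mean(E2(:));         % overall MSPE (Equation 14)
mspe_available = mean(E2(~M));        % MSPE over the measured entries only
mspe_missing   = mean(E2(M));         % MSPE over the missing entries only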

The second criterion consists of the cosine between the first loading vector obtained using the full data matrix and its corresponding one from the incomplete data set. The cosines of further PCs are not shown, since their values are strongly affected by the deviations of the first PC14.
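This criterion is a normalised inner product; a one-line MATLAB sketch (p_full and p_method being the first loading vectors from the complete and the imputed data sets, hypothetical names) is:

% Cosine between first loading vectors; abs() removes the arbitrary sign of a PC
cos_pc1 = abs(p_full'*p_method)/(norm(p_full)*norm(p_method));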

Six different levels of missing values are generated for all data sets, ranging from 10% to 60% of missing data. Also, 50 different MD patterns are generated for each percentage of missing data, in order to build confidence intervals for the MSPEs. The intervals are built based on the LSD significance of a three-factor mixed-effects ANOVA, where method and percentage of MD are fixed-effect factors (and also their interaction), and the replicate is the random-effect factor (nested within percentage). Given the positive skewness of the MSPE, a logarithmic transformation is used to ease the visualization of the plots.
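For illustration, one random missing-data pattern at a given percentage could be generated as follows (a generic sketch, not the authors' script, assuming the data matrix X and its dimensions N and K are in the workspace; it does not guard against completely empty rows or columns):

% Generate a random missing-data pattern covering a given fraction of the entries
pctMD = 0.30;                          % e.g. 30% of missing data
nMissing = round(pctMD*N*K);           % number of entries to delete
M = false(N,K);
M(randperm(N*K, nMissing)) = true;     % logical pattern of missing positions
Xmiss = X;  Xmiss(M) = NaN;            % incomplete copy of the data matrix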

4. RESULTS

In this section the results of the comparative study are presented. However, we decided to exclude the results of ML-KDR, ML-KDR with PCR and ML-KDR with PLS due to their large computation times, something already observed in 14, and due to the instability of some of them, especially ML-KDR (also observed in 14 with KDR) and ML-KDR with PLS. Therefore, the results of MLPCA, ML-TSR, TSR and PMP are shown, in order to answer the three research questions posed in the Introduction.

4.1. FTIR microspectroscopy

Figure 2 shows the results of the first case study. The two upper plots, A) and B), show the logarithm of the MSPEs and the cosines of PC#1, respectively. Figures 2C-2D also show the logarithm of the MSPEs, but considering the measured values and the imputed values separately. Regarding Figure 2A, there are no statistically significant differences between MLPCA and ML-TSR at any percentage of missing data. TSR and PMP statistically outperform both ML approaches for low percentages of missing data (10-20%). From 50% onwards, TSR is superior to PMP, MLPCA and ML-TSR. The cosines shown in Figure 2B are coherent with the results of the MSPEs, with TSR having the highest cosines from 30% to 60%.

Figure 2. FTIR data set results. A) Logarithm of the MSPE for all measurements. B) Cosines associated to the first PC. C) Logarithm of the MSPE for the available measurements. D) Logarithm of the MSPE for the missing data. The dashed ellipses in A) mark the statistically significant differences between groups of methods. In A) TSR is statistically superior to MLPCA with 30-40% of MD; however, since no method is statistically different from all the rest, a single dashed ellipse encloses all of them.


The results in Figure 2C show that TSR and PMP are superior to the ML approaches in terms of the measured values, which implies that the PCA model fitted once the data are imputed with these methods is closer to the original one than when using maximum likelihood estimations. Figure 2D is indeed very similar to Figure 2A, due to the fact that the errors in the imputed values between the true PCA model and the imputed one are much larger than those in the measured values, as expected.

4.2. P. pastoris cultures on heterogeneous culture media

The results with the P. pastoris data set are similar to the previous ones, both in MSPEs and cosines (see Figures 3A-3B). TSR and PMP achieve the statistically best performance from 20% to 40% of MD; and again, from 50% onwards, TSR becomes the best approach, with PMP superior to MLPCA and ML-TSR. The performances of TSR and PMP are indeed coherent with the results observed in 14, which confirms the results obtained in that paper. The cosines shown in Figure 3B are coherent with the MSPE values: the lower the logarithm of the MSPE, the closer the loading vectors of the reconstructed matrix are to the actual ones.


Figure 3. P. pastoris data set results. A) Logarithm of the MSPE for all measurements. B) Cosines associated to the first PC. C) Logarithm of the MSPE for the available measurements. D) Logarithm of the MSPE for the missing data. The dashed ellipses in A) mark the statistically significant differences between groups of methods.

Regarding Figures 3C-3D, the performance of all methods is also similar to the first example. For low percentages of MD, the differences among methods are smaller for the measured values, but from 30% of MD onwards, the PCA model obtained with TSR imputation resembles the real one more closely.

4.3. Simulated data set

In the simulated data set with 4 PCs, the differences among TSR, ML-TSR, PMP and MLPCA are not statistically significant for low percentages of missing values (10-20%) (see Figure 4A).


With 30% of MD, TSR becomes statistically the best method and PMP the worst one. This is something that was already observed in 14, also using a simulated data set32,33: the higher the percentage of missing data, the more difficult it is for PMP to impute properly. For higher percentages (30-60%), there are statistical differences among all methods: TSR maintains the best performance, followed by ML-TSR, MLPCA and PMP. This is the first case study where there exist differences between MLPCA and ML-TSR, the latter being statistically superior. These differences in the MSPEs can also be seen in Figure 4B, where all methods but TSR show huge deviations from the true principal coordinate of the data with low-medium percentages of MD (10-40%).

Figure 4. Simulated data set results. A) Logarithm of the MSPE for all measurements. B) Cosines associated to the first PC. C) Logarithm of the MSPE for the available measurements. D) Logarithm of the MSPE for the missing data. The dashed ellipses in A) mark the statistically significant differences between groups of methods.


In this third example the differences among methods regarding the measured values are narrower (see Figure 4C), but still show the superiority of TSR.

4.4. Additional data sets

Three more data sets are used to compare the performance of the ML-based methods against PMP and TSR in its original form: the olive oil data set, the diesel NIR data set, and a 3-component simulated data set. The figures containing the logarithm of the MSPEs and the cosines associated to the first component are available as Supporting Information of this paper.

Summarizing the results, in these data sets the performance of TSR is statistically superior to PMP (as proven in 14), and to MLPCA and ML-TSR for medium-high percentages of MD (30-60%), and also for low percentages (10-20%) in the olive oil and diesel NIR data sets. Also, the reconstruction of the available measurements with TSR is more similar to the PCA on complete data than with the ML-based approaches in both data sets. These results are coherent with Sections 4.1-4.2. Comparing ML-TSR and MLPCA, the former yields better results than MLPCA for high percentages of missing data (50-60%) in the 3-component simulated data set, as happened in Section 4.3 with the 4-component simulated data set.

5. CONCLUSIONS

To conclude, it is worth recalling the research questions posed at the beginning of the paper:

• Are the imputed values of MLPCA and PMP for model building equal? The answer is no. The PMP imputation step performed alternately by rows and columns in MLPCA drives the imputation in a different direction than performing it only by columns, as PMP does. Based on the six data sets analysed here, PMP, when it converges, gives better results than MLPCA. However, PMP suffered from convergence problems in some case studies, while MLPCA converged in all data sets and for all MD percentages.


• Does ML-TSR outperform the imputation of MLPCA? The answer, based on the case studies analysed here, is that when the latent structure is complex and the percentage of missing data is high, ML-TSR may outperform MLPCA. In the other cases, the overall results show no statistically significant differences. However, MLPCA tends to be between 2 and 5 times faster than ML-TSR.

• Does MLPCA or ML-TSR outperform the original TSR algorithm? The answer is no. TSR outperforms the ML approaches for medium-high percentages of missing data. For low percentages, depending on the case study analysed, it is either statistically superior or there is no statistical difference compared to the other methods.

Finally, we recommend the use of trimmed scores regression over MLPCA for PCA model building with missing data, since both the reconstruction of the available values and that of the imputed values are statistically more accurate than when using MLPCA or ML-TSR.

Appendix A. Regression-based imputation step in MLPCA.

The equivalence between Equations 10 and 8 is proven here. Let us assume that we rearrange the values in row $\mathbf{x}_i^T$ to have the $R_i$ missing values, $\mathbf{x}_i^{\#T}$, at the first positions, and the remaining $K - R_i$ available ones, $\mathbf{x}_i^{*T}$, at the end. We can use Equation 4 in Equation 10 to introduce the extension of the missing data partition, $\mathbf{X} = [\mathbf{X}^{\#}\ \mathbf{X}^{*}]$. Bearing in mind the decomposition of the covariance matrix of $\mathbf{X}$ (see Equation 9) and matrix $\boldsymbol{\Lambda}_i$ (Equation 12), Equation 10 can be written as:

$$\hat{\mathbf{x}}_i = \mathbf{S}\boldsymbol{\Lambda}_i\mathbf{L}_i(\mathbf{L}_i^T\boldsymbol{\Lambda}_i^T\mathbf{S}\boldsymbol{\Lambda}_i\mathbf{L}_i)^{-1}\mathbf{L}_i^T\boldsymbol{\Lambda}_i^T\mathbf{x}_i = \begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{I}_{K-R_i} \end{bmatrix}\mathbf{L}_i\left(\mathbf{L}_i^T[\mathbf{0}\ \mathbf{I}_{K-R_i}]\begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{I}_{K-R_i} \end{bmatrix}\mathbf{L}_i\right)^{-1}\mathbf{L}_i^T[\mathbf{0}\ \mathbf{I}_{K-R_i}]\mathbf{x}_i$$

$$= \begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{L}_i \end{bmatrix}\left([\mathbf{0}\ \mathbf{L}_i^T]\begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{L}_i \end{bmatrix}\right)^{-1}[\mathbf{0}\ \mathbf{L}_i^T]\begin{bmatrix} \mathbf{0} \\ \mathbf{x}_i^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{S}^{\#*}\mathbf{L}_i(\mathbf{L}_i^T\mathbf{S}^{**}\mathbf{L}_i)^{-1}\mathbf{L}_i^T\mathbf{x}_i^{*} \\ \mathbf{S}^{**}\mathbf{L}_i(\mathbf{L}_i^T\mathbf{S}^{**}\mathbf{L}_i)^{-1}\mathbf{L}_i^T\mathbf{x}_i^{*} \end{bmatrix} \qquad (15)$$

The proof using Equation 11 is analogous: substituting $\mathbf{x}_i$ by $\mathbf{y}_k$, changing the subindices $i$ by $k$ and the matrix $\boldsymbol{\Lambda}_i$ by $\boldsymbol{\Phi}_k$, and bearing in mind that the $\mathbf{L}_k$ matrix is obtained using the missing data pattern of $\mathbf{y}_k$.
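The algebra above can also be checked numerically. The following MATLAB sketch (made-up sizes and hypothetical variable names) compares the unpartitioned form of Equation 10 with the partitioned form of Equation 8 for the TSR choice of $\mathbf{L}_i$:

% Numerical check that Equation 10 reduces to Equation 8 (TSR case, missing values first)
K = 8; A = 3; Ri = 2;                          % n. of variables, components, missing values
S = cov(randn(50,K));                          % some K x K covariance matrix
[P,~] = eigs(S,A);                             % loadings of an A-component PCA model
Lambda = [zeros(Ri,K-Ri); eye(K-Ri)];          % Lambda_i (Equation 12)
L = P(Ri+1:end,:);                             % L_i = P*, available rows of the loadings
xi = [zeros(Ri,1); randn(K-Ri,1)];             % row i, with zeros in the missing positions
xhat10 = S*Lambda*L*((L'*Lambda'*S*Lambda*L)\(L'*Lambda'*xi));          % Equation 10
Sma = S(1:Ri,Ri+1:end); Saa = S(Ri+1:end,Ri+1:end); xa = xi(Ri+1:end);
xhat8 = [Sma*L*((L'*Saa*L)\(L'*xa)); Saa*L*((L'*Saa*L)\(L'*xa))];       % Equation 8
disp(max(abs(xhat10 - xhat8)))                 % should be zero up to rounding error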


Appendix B. Matlab source code.

The source code for MLPCA is shown here. It consists of a modification of the original MLPCA algorithm11 to perform the imputation step using the regression-based framework methods. The function mlpmp.m corresponds to the original imputation step of MLPCA, which is equivalent to the PMP method for PCA model building, as proved in this paper. Only the ML version of TSR is shown, since the other approaches are not competitive. The source code for the original TSR and PMP can be found in 14.

function [U,S,V,SOBJ,ErrFlag,count]=mlpca_generic(X,stdX,p,type)

%

%From Andrews and Wentzell (1997). Analytica Chimica Acta 350, 341-352

%

% This function performs MLPCA with missing data

%

% X mxn matrix of observations.

% stdX mxn matrix of standard deviations associated with X (zeros for
%      missing measurements).

% p is the model dimensionality.

% type MD method chosen:

% 0 MLPCA, equivalent to PMP

% 1 MLTSR (maximum likelihood TSR)

% 2 MLPCR (maximum likelihood KDR with PCR)

%

% U,S,V the pseudo-svd parameters.

% SOBJ value of the objective function.

% ErrFlag indicates exit conditions: 0 = normal termination, 1 = max
%         iterations exceeded.

% count number of iterations needed


%

% Initialization

%

convlim=1e-10;

maxiter=500;

XX=X;

varX=stdX.^2;

[i,j]=find(varX==0);

errmax=max(max(varX));

for k=1:length(i),

varX(i(k),j(k))=1e10*errmax;

end

n=length(XX(1,:));

%

% Generate initial estimates

%

for i=1:length(X(:,1)),

for j=1:length(X(:,1)),

denom=min([nnz(X(i,:)) nnz(X(j,:))]);

CV(i,j)=(X(i,:)*X(j,:)')/denom;

end

end

[U,S,V]=svd(CV,0);

U0=U(:,1:p);

MLXaux=XX;

%

% Loop for alternating least squares

%

type

count=0;


Sold=0;

ErrFlag=-1;

while ErrFlag<0,

count=count+1; %%%%

Sobj=0;

MLX=zeros(size(XX));

for i=1:n,

% Method selection

switch type

case 0

[MLX, Q]=mlpmp(XX,varX, U0, n, i, MLX);

case 1

[MLX, Q]=mltsr(XX, MLXaux', varX, U0, n, i, MLX);

otherwise

error('Wrong method')

end

dx=XX(:,i)-MLX(:,i);

Sobj=Sobj+dx'*Q*dx;

end

if rem(count,2)==1,

abs(Sold-Sobj)/Sobj;

if(abs(Sold-Sobj)/Sobj)<convlim,

ErrFlag=0;

elseif count>maxiter,

ErrFlag=1;

end

end

if ErrFlag<0,

Sold=Sobj;

[U,S,V]=svd(MLX,0);
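% The problem is transposed at each half-iteration: the next sweep imputes in the
% other direction (rows <-> columns), using the right singular vectors V as basis.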


XX=XX';

varX=varX';

n=length(XX(1,:));

U0=V(:,1:p);

MLXaux=MLX';

end

end

%

% Finished

%

[U,S,V]=svd(MLX,0);

U=U(:,1:p);

S=S(1:p,1:p);

V=V(:,1:p);

SOBJ=Sobj;
end

function [ MLX, Q] = mlpmp( XX, varX, U0, n, i, MLX )

% MLPCA imputation, equivalent to PMP

Q=diag(varX(:,i).^(-1));

F=inv(U0'*Q*U0);

MLX(:,i)=U0*F*U0'*Q*XX(:,i);

end

function [ MLX, Q] = mltsr( XX, MLXaux, varX, U0, n, i, MLX )

% MLPCA with TSR missing data imputation

Q=diag(varX(:,i).^(-1));

CV=cov(MLXaux);
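% Imputation step of Equations 10-11: the (near) 0/1 weights in Q select the available
% entries, and U0 plays the role of the loadings (so Q*U0 corresponds to Lambda*L).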

MLX(:,i)= CV*Q*U0*pinv(U0'*Q*CV*Q*U0)*U0'*Q*XX(:,i);

end
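A minimal usage sketch is given below (an assumption on how the inputs could be prepared, not taken from the original paper: the incomplete data are held in X with NaN marking missing entries, which are replaced by zeros, and stdX carries zeros at the missing positions, as the function header requires):

% Example call: fit a 3-component model with the ML-TSR imputation step (type = 1)
Xin  = X;  Xin(isnan(X)) = 0;                 % missing entries replaced by 0 in the data
stdX = ones(size(X));  stdX(isnan(X)) = 0;    % zero std. dev. flags a missing measurement
[U,S,V,SOBJ,ErrFlag,count] = mlpca_generic(Xin,stdX,3,1);
Xhat = U*S*V';                                % reconstructed (imputed) data matrix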


Supporting Information

Three additional figures with the results of the comparative study using the last three data sets are available online.

Acknowledgements

Research in this study was partially supported by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants DPI2011-28112-C04-02 and DPI2014-55276-C5-1R, and the Spanish Ministry of Economy and Competitiveness through grant ECO2013-43353-R.

REFERENCES

1. Jolliffe IT, Principal Component Analysis. Springer-Verlag, NY, USA, 1986.

2. Wentzell PD, Andrews DT, Hamilton DC, Faber K, Kowalski BR, Maximum likelihood principal component analysis, J. Chemometr. 1997, 11, 339-366.

3. Ristolainen M, Alén R, Malkavaara P, Pere J, Reflectance FTIR microspectroscopy for studying effect of xylan removal on unbleached and bleached birch kraft pulps, Holzforschung 2002, 56(5), 513-521.

4. Keenan MR, Maximum likelihood principal component analysis of time-of-flight secondary ion mass spectrometry spectral images, J. Vac. Sci. Technol. A 2005, 23(4), 746-750.

5. Sang WC, Martin EB, Morris AJ, Lee I-B, Fault detection based on a maximum-likelihood principal component analysis (PCA) mixture, Ind. Eng. Chem. Res. 2005, 44(7), 2316-2327.

6. Karakach TK, Wentzell PD, Walter JA, Characterization of the measurement error structure in 1D 1H NMR data for metabolomics studies, Anal. Chim. Acta 2009, 636(2), 163-174.

7. Wentzell PD, Hou S, Exploratory data analysis with noisy measurements, J. Chemometr. 2012, 26(6), 264-281.

8. Mailier J, Remy M, Vande Wouwer A, Stoichiometric identification with maximum likelihood principal component analysis, J. Math. Biol. 2013, 67(4), 739-765.

9. Hoefsloot HCJ, Verouden MPH, Westerhuis JA, Smilde AK, Maximum likelihood scaling (MALS), J. Chemometr. 2006, 20(3-4), 120-127.

10. Dadashi M, Abdollahi H, Tauler R, Maximum Likelihood Principal Component Analysis as initial projection step in Multivariate Curve Resolution analysis of noisy data, Chemom. Intell. Lab. 2012, 118, 33-40.

11. Andrews DT, Wentzell PD, Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer, Anal. Chim. Acta 1997, 350(3), 341-352.

12. Ho P, Silva MCM, Hogg TA, Multiple imputation and maximum likelihood principal component analysis of incomplete multivariate data from a study of the ageing of port, Chemom. Intell. Lab. 2001, 55, 1-11.

13. Stanimirova I, Practical approaches to principal component analysis for simultaneously dealing with missing and censored elements in chemical data, Anal. Chim. Acta 2013, 796, 27-37.

14. Folch-Fortuny A, Arteaga F, Ferrer A, PCA model building with missing data: new proposals and a comparative study, Chemom. Intell. Lab. 2015, 146, 77-88.

15. Arteaga F, Ferrer A, Dealing with missing data in MSPC: several methods, different interpretations, some examples, J. Chemometr. 2002, 16, 408-418.

16. Arteaga F, Ferrer A, Framework for regression-based missing data imputation methods in on-line MSPC, J. Chemometr. 2005, 19, 439-447.

17. Walczak B, Massart DL, Dealing with missing data Part I, Chemom. Intell. Lab. 2001, 58, 15-27.

18. Nelson PRC, Taylor PA, MacGregor JF, Missing data methods in PCA and PLS: Score calculations with incomplete observations, Chemom. Intell. Lab. 1996, 35, 45-65.

19. Wold S, Albano C, Dunn WJ, Esbensen K, Hellberg S, Johansson E, Sjöström M, Pattern recognition: finding and using regularities in multivariate data, in: Martens H, Russwurm H (Jr.) (Eds.), Food Research and Data Analysis, vol. 3, Applied Science Pub: London, UK, 1983, 183-185.

20. López-Negrete de la Fuente R, García-Muñoz S, Biegler LT, An efficient nonlinear programming strategy for PCA models with incomplete data sets, J. Chemometr. 2010, 24, 301-311.

21. Schafer JL, Analysis of Incomplete Multivariate Data, CRC Press: New York, USA, 1997.

22. ProSensus MultiVariate release 15.02, ProSensus Inc, Ancaster, Ontario, Canada, 2015. (http://www.prosensus.com).

23. SIMCA release 14, Umetrics, Umea, Sweden, 2015. (http://www.umetrics.com).

24. PLS Toolbox release 7.9.5, Eigenvector Research Inc, Manson, Washington, USA, 2015. (http://www.eigenvector.com).

25. Folch-Fortuny A, Arteaga F, Ferrer A, Missing Data Imputation Toolbox for MATLAB, submitted.

26. Nelson PRC, Treatment of missing measurements in PCA and PLS models, Ph.D. Dissertation, Department of Chemical Engineering, McMaster University, Hamilton, Ontario, Canada, 2002.

27. Arteaga F, Control estadístico multivariante de procesos con datos faltantes mediante análisis de componentes principales, Ph.D. thesis, Universidad Politécnica de Valencia, 2003.

28. Guilment J, Markel S, Windig W, Infrared chemical micro-imaging assisted by interactive self-modeling multivariate analysis, Appl. Spectrosc. 1994, 48, 320-326.

29. Windig W, Markel S, Simple-to-use interactive self-modeling mixture analysis of FTIR microscopy data, J. Mol. Struct. 1993, 292, 161-170.

30. Windig W, Spectral data files for self-modeling curve resolution with examples using the Simplisma approach, Chemom. Intell. Lab. 1997, 36, 3-16.

31. González-Martínez JM, Folch-Fortuny A, Llaneras F, Tortajada M, Picó J, Ferrer A, Metabolic flux understanding of Pichia pastoris grown on heterogeneous culture media, Chemom. Intell. Lab. 2014, 134, 89-99.

32. Arteaga F, Ferrer A, How to simulate normal data sets with the desired correlation structure, Chemom. Intell. Lab. 2010, 101, 38-42.

33. Arteaga F, Ferrer A, Building covariance matrices with the desired structure, Chemom. Intell. Lab. 2013, 127, 80-88.

34. Forina M, Armanino C, Lanteri S, Tiscornia E, Classification of olive oils from their fatty acid composition, in: Martens H, Russwurm Jr H. (Eds.), Food Research and Data Analysis, Applied Science Pub, London, 1983, pp. 189-214.

35. Hutzler SA, Bessee GB, Remote Near-Infrared Fuel Monitoring System, Interim Report, U.S. Army TARDEC Fuels and Lubricants Research Facility, Southwest Research Institute, San Antonio, United States, 1997.