Document downloaded from: http://hdl.handle.net/10251/81809
This paper must be cited as: Folch-Fortuny, A.; Arteaga Moreno, F. J.; Ferrer, A. (2016). Assessment of maximum likelihood PCA missing data imputation. Journal of Chemometrics, 30(7), 386-393. doi:10.1002/cem.2804.
The final publication is available at http://doi.org/10.1002/cem.2804
Copyright: Wiley
Assessment of maximum likelihood PCA missing data imputation

A. Folch-Fortuny1,*, F. Arteaga2, A. Ferrer1

1 Dep. de Estadística e Investigación Operativa Aplicadas y Calidad, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain.
2 Dep. of Biostatistics and Investigation, Universidad Católica de Valencia San Vicente Mártir, C/Quevedo 2, 46001 Valencia, Spain.
Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non-measured values. An assessment of maximum likelihood missing data imputation is performed in this paper, analysing the algorithm of MLPCA and adapting several methods for PCA model building with missing data to its maximum likelihood version. In this way, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS), and trimmed scores regression (TSR) methods are implemented within the MLPCA method to work as different imputation steps. Six data sets are analysed using several percentages of missing data, comparing the performance of the original algorithm, and its adapted regression-based methods, with other state-of-the-art methods.
Keywords: maximum likelihood principal component analysis, missing data, regression-based methods, PCA model building, trimmed scores regression
1. INTRODUCTION
Principal component analysis [1] (PCA) is one of the most applied methods for data understanding. The original variables are projected onto the latent space, where data vary most, and a new set of uncorrelated variables is obtained, the principal components (PCs), summarizing the most relevant features of the data. Wentzell et al. [2] proposed in 1997 a new PCA approach, based on maximum likelihood fitting, called maximum likelihood PCA (MLPCA). This methodology allows incorporating information about the measurement errors into the model. MLPCA has been widely applied in several works within chemistry and biology, e.g. to analyse reflectance Fourier transform infrared (FTIR) microspectroscopic data [3] and ion mass spectrometric data [4], for fault detection in the process industry [5], for the characterization of measurement errors in nuclear magnetic resonance (NMR) data [6] and gene expression data [7], to determine the appropriate number of reactions in stoichiometric modelling [8], and as a useful preprocessing tool for metabolomic, proteomic, transcriptomic [9] and environmental [10] data analysis.

* Correspondence to: A. Folch-Fortuny ([email protected])
Shortly after the publication of the original MLPCA algorithm, an application of this method was proposed addressing the missing data (MD) problem in PCA model building (PCA-MB) [11]. MLPCA deals with the missing values by assigning them large variances prior to implementing the method, which guides the algorithm to fit a PCA model disregarding these data points. The MLPCA approach for MD has been applied successfully in the literature to fluorescent, chromatographic, near-infrared spectroscopic [11], spectrophotometric [12], and environmental [13] data.
Folch-Fortuny et al. [14] address the problem of PCA-MB with missing data. In this work, several methods originally proposed for PCA model exploitation (PCA-ME) [15,16], i.e. when a fixed PCA model is used to infer missing values in new incomplete observations, are adapted to the model building context. Basically, the idea was to adapt the known iterative algorithm (IA) [17] for PCA-MB by replacing the prediction of the PCA model with that resulting when we treat each incomplete row in the data set as a new observation with missing values, and applying the projection to the model plane (PMP) method for PCA-ME [18]. This adaptation arose from the fact that PMP is, under some general conditions, equivalent to IA and to the minimization of the squared prediction error (SPE) for PCA-ME, as proved in [15]. Thus, the aim in [14] was to assess whether this equivalence held in the PCA-MB context.
The regression-based methods, proposed in [15], were also adapted, jointly with PMP, to the PCA-MB context in [14]. These methods are: known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR). All these methods impute the missing values in a data set by fitting different regression-based schemes between the available data and the missing positions. Several other methods were compared to the previous ones in [14], including the modified NIPALS algorithm [19], the nonlinear programming approach [20] and multiple imputation by data augmentation [21]. The conclusion was that TSR represents a good compromise solution between prediction quality, robustness against data structure and computation time [14], outperforming other approaches implemented in commercial software such as ProSensus [22], SIMCA [23] and PLS Toolbox [24]. TSR and most of the other approaches compared in [14] are now implemented in a freely available user-friendly MATLAB toolbox [25] (http://mseg.webs.upv.es).
Nelson [26] showed the equivalence between the scores calculation by columns in MLPCA and the PMP algorithm for PCA-ME. Here, we are going to prove the equivalence between the imputation step by columns in the MLPCA algorithm and the adapted PMP method for PCA-MB.
The aim of this paper is, thus, to answer three questions that arise from the aforementioned equivalence:
i) Once the algorithms converge, are the imputed values of MLPCA and PMP for PCA-MB equal?
ii) Since TSR outperforms PMP, as proven in [14], if the imputation step in MLPCA is substituted by a TSR-based imputation, does the imputation outperform the original MLPCA?
iii) In any case, does MLPCA, or its adapted version with TSR, outperform the original TSR algorithm?
To answer these research questions, we propose here to adapt the regression-based methods [14] (KDR, KDR with PCR, KDR with PLS and TSR) to work as different imputation steps within the MLPCA algorithm, providing a framework for maximum likelihood missing data imputation. The performance of these methods is compared to the PMP and TSR methods using six data sets, actual and simulated ones, from different research areas.
The rest of the paper is organised as follows. Section 2 proves the equivalence between the imputation step by columns of MLPCA and the PMP method for PCA model building, and describes how the regression-based methods are adapted to their maximum likelihood (ML) versions. Section 3 describes the data sets used in this study, as well as how the comparative study is performed. Section 4 shows the results of the ML regression-based methods, jointly with the original PMP and TSR algorithms. Finally, the conclusions are highlighted in Section 5.
2. METHODOLOGY
Let X be an N×K matrix, x_i^T its i-th row and y_k its k-th column. Each row represents a point in the K-dimensional space of the X observations, and each column a point in the N-dimensional space of the X variables. Row i can be decomposed as x_i^T = x_i^{0T} + ε_i^T, where x_i^0 are the true values and ε_i are their measurement errors [26]. Likewise, column k can be decomposed into its true and error parts: y_k = y_k^0 + η_k. Both errors are assumed normally distributed in each of the K and N dimensions, respectively.

The maximisation of the likelihood is obtained by minimising the following objective function:

$$S^2 = \sum_{i=1}^{N} (\hat{\mathbf{x}}_i - \mathbf{x}_i)^T \boldsymbol{\Sigma}_i^{-1} (\hat{\mathbf{x}}_i - \mathbf{x}_i) = \sum_{k=1}^{K} (\hat{\mathbf{y}}_k - \mathbf{y}_k)^T \boldsymbol{\Psi}_k^{-1} (\hat{\mathbf{y}}_k - \mathbf{y}_k) \qquad (1)$$

where Σ_i is the covariance matrix of the errors ε_i of observation x_i^T, and Ψ_k is the covariance matrix of the errors η_k of variable y_k. The estimations of both vectors arise from:

$$\hat{\mathbf{x}}_i = \mathbf{P}(\mathbf{P}^T\boldsymbol{\Sigma}_i^{-1}\mathbf{P})^{-1}\mathbf{P}^T\boldsymbol{\Sigma}_i^{-1}\mathbf{x}_i \qquad (2)$$

$$\hat{\mathbf{y}}_k = \mathbf{U}(\mathbf{U}^T\boldsymbol{\Psi}_k^{-1}\mathbf{U})^{-1}\mathbf{U}^T\boldsymbol{\Psi}_k^{-1}\mathbf{y}_k \qquad (3)$$

where U (N×A), D (A×A) and P (K×A) represent the singular value decomposition of X = UDP^T = [y_1 ... y_K] = [x_1 ... x_N]^T, using A dimensions or components.
The MLPCA algorithm is an alternating least squares procedure that starts by computing initial guesses for U and P based on the SVD of X. At each iteration, the algorithm has two steps. The first one consists of projecting the rows x_i^T on the columns of P, computing the objective function, and recalculating U and P from an SVD using the estimations. The second step consists of projecting the columns y_k on the columns of U, also computing the objective function, and finally recalculating U and P again from an SVD. Convergence is achieved when the difference between the estimations of the observations falls below a specified threshold [27].
The adaptation of MLPCA to model building with missing data assumes uncorrelated errors for both objects x_i^T and variables y_k; therefore matrices Σ_i and Ψ_k are diagonal [26,27]. In this algorithm, large variances (10^10) are assigned to the missing measurements, and variances of one to the available ones. Therefore, the inversion of matrices Σ_i and Ψ_k produces diagonal matrices with (approximately) 1s and 0s. The 1s serve to fit these specific measurements in the PCA model and the 0s to disregard the missing measurements in the multivariate model.
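This weighting scheme is easy to sketch. The following NumPy fragment (our own illustrative helper, not the paper's code) builds the diagonal inverse error-covariance used to downweight missing cells:

```python
import numpy as np

def inverse_error_weights(missing_mask):
    """Diagonal inverse error covariance for one row (or column).

    missing_mask: boolean array, True where the value is missing.
    Available cells get unit variance; missing cells get a huge
    variance (1e10), so their inverse weight is effectively zero and
    they are disregarded when fitting the PCA model.
    """
    variances = np.where(missing_mask, 1e10, 1.0)
    return np.diag(1.0 / variances)

mask = np.array([True, False, False, True])  # 1st and 4th values missing
W = inverse_error_weights(mask)
# The diagonal is approximately [0, 1, 1, 0]: zeros drop the missing cells.
```

Any sufficiently large variance would do; 10^10 follows the value quoted above.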
Let us assume that row i has missing values. The values in this vector can be rearranged, without loss of generality, to have the missing values in its first R_i positions and the remaining K−R_i available values at the end. This partition in x_i^T induces a partition in the X data set, X# (N×R_i) being the missing part and X* (N×(K−R_i)) the available part, according to row i. Additionally, this partition can be transferred to an SVD (or PCA) model, X = UDP^T, P# (R_i×A) being the missing part of the loadings matrix, and P* ((K−R_i)×A) the available part. Figure 1 shows a scheme of this notation.
Figure 1. Partition induced in the X matrix by the missing data in its i-th row. Grey squares denote missing positions in the data set.
Using this partition, the inverse of matrix Σ_i can be written as:

$$\boldsymbol{\Sigma}_i^{-1} = \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{K-R_i} \end{bmatrix} \qquad (4)$$
where I_{K−R_i} is the identity matrix with K−R_i rows/columns, according to the missing data pattern in x_i^T.
Substituting this expression in Equation 2, observation x_i can be computed as:

$$\hat{\mathbf{x}}_i = \begin{bmatrix} \hat{\mathbf{x}}_i^{\#} \\ \hat{\mathbf{x}}_i^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{P}^{\#} \\ \mathbf{P}^{*} \end{bmatrix} \left( \begin{bmatrix} \mathbf{P}^{\#T} & \mathbf{P}^{*T} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{K-R_i} \end{bmatrix} \begin{bmatrix} \mathbf{P}^{\#} \\ \mathbf{P}^{*} \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{P}^{\#T} & \mathbf{P}^{*T} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{K-R_i} \end{bmatrix} \begin{bmatrix} \mathbf{0} \\ \mathbf{x}_i^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{P}^{\#} \\ \mathbf{P}^{*} \end{bmatrix} (\mathbf{P}^{*T}\mathbf{P}^{*})^{-1}\mathbf{P}^{*T}\mathbf{x}_i^{*} = \begin{bmatrix} \mathbf{P}^{\#}(\mathbf{P}^{*T}\mathbf{P}^{*})^{-1}\mathbf{P}^{*T}\mathbf{x}_i^{*} \\ \mathbf{P}^{*}(\mathbf{P}^{*T}\mathbf{P}^{*})^{-1}\mathbf{P}^{*T}\mathbf{x}_i^{*} \end{bmatrix} \qquad (5)$$
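The row-wise step of Equation 5 can be sketched in a few lines of NumPy (an illustration under our own naming, not the authors' code): the available part of the observation is regressed on the available rows of the loadings, and the resulting scores reconstruct both parts.

```python
import numpy as np

def pmp_impute_row(P_miss, P_avail, x_avail):
    """Sketch of Equation 5.

    P_miss:  loading rows at the missing positions (P#, R_i x A).
    P_avail: loading rows at the available positions (P*, (K-R_i) x A).
    x_avail: available part of the observation (x_i*).
    Returns the imputed missing part and the reconstructed available part.
    """
    # Least-squares scores from the available part: t = (P*'P*)^-1 P*' x_i*
    t = np.linalg.solve(P_avail.T @ P_avail, P_avail.T @ x_avail)
    return P_miss @ t, P_avail @ t
```

For a one-component example with P* = [1, 1]^T and x* = [2, 2], the score is 2, so a missing position with loading 0.5 is imputed as 1.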
Alternatively, the inverse of matrix Ψ_k can be written as:

$$\boldsymbol{\Psi}_k^{-1} = \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{N-R_k} \end{bmatrix} \qquad (6)$$

where I_{N−R_k} is the identity matrix with N−R_k rows/columns, according to the missing data pattern in column y_k. Following Equation 3, ŷ_k is therefore computed as:
$$\hat{\mathbf{y}}_k = \begin{bmatrix} \hat{\mathbf{y}}_k^{\#} \\ \hat{\mathbf{y}}_k^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{U}^{\#} \\ \mathbf{U}^{*} \end{bmatrix} \left( \begin{bmatrix} \mathbf{U}^{\#T} & \mathbf{U}^{*T} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{N-R_k} \end{bmatrix} \begin{bmatrix} \mathbf{U}^{\#} \\ \mathbf{U}^{*} \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{U}^{\#T} & \mathbf{U}^{*T} \end{bmatrix} \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{N-R_k} \end{bmatrix} \begin{bmatrix} \mathbf{0} \\ \mathbf{y}_k^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{U}^{\#}(\mathbf{U}^{*T}\mathbf{U}^{*})^{-1}\mathbf{U}^{*T}\mathbf{y}_k^{*} \\ \mathbf{U}^{*}(\mathbf{U}^{*T}\mathbf{U}^{*})^{-1}\mathbf{U}^{*T}\mathbf{y}_k^{*} \end{bmatrix} \qquad (7)$$
where U# (R_k×A) and U* ((N−R_k)×A) are the missing and available parts of U.
The MLPCA imputation step for the missing values x_i^{#T} is the same as in the PMP method for PCA model building presented recently in [14]. The main difference between the MLPCA algorithm and PMP is that the former performs the imputation iteratively first by observations and then by variables, instead of only by observations, as PMP does. Additionally, convergence in PMP is assessed based only on the imputed missing values, instead of on the prediction of the available measurements, as in MLPCA.
In [14] it was shown that the imputation step in the adapted PMP algorithm for PCA-MB could be substituted by the regression-based methods presented in [15] (KDR and its variants, and TSR). Most of these methods showed a superior performance to PMP across several case studies. So, the idea here consists of adapting the alternating imputation of the MLPCA algorithm to include the imputation step of the regression-based methods, thus proposing a maximum likelihood (ML) framework: ML-KDR, ML-KDR with PCR, ML-KDR with PLS and ML-TSR.
The imputation step of the regression-based missing data methods is:

$$\hat{\mathbf{x}}_i = \begin{bmatrix} \hat{\mathbf{x}}_i^{\#} \\ \hat{\mathbf{x}}_i^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{S}^{\#*}\mathbf{L}(\mathbf{L}^T\mathbf{S}^{**}\mathbf{L})^{-1}\mathbf{L}^T\mathbf{x}_i^{*} \\ \mathbf{S}^{**}\mathbf{L}(\mathbf{L}^T\mathbf{S}^{**}\mathbf{L})^{-1}\mathbf{L}^T\mathbf{x}_i^{*} \end{bmatrix} \qquad (8)$$

where S is the covariance matrix of X, and:

$$\mathbf{S} = [\mathbf{X}^{\#}\;\mathbf{X}^{*}]^T[\mathbf{X}^{\#}\;\mathbf{X}^{*}]/(N-1) = \begin{bmatrix} \mathbf{X}^{\#T}\mathbf{X}^{\#} & \mathbf{X}^{\#T}\mathbf{X}^{*} \\ \mathbf{X}^{*T}\mathbf{X}^{\#} & \mathbf{X}^{*T}\mathbf{X}^{*} \end{bmatrix}/(N-1) = \begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix} \qquad (9)$$

The key matrix L in Equation 8 particularises which method of the framework is being used for the imputation: L = I for KDR; L = V_{1:ρ} for KDR with PCR, where V_{1:ρ} is the eigenvector matrix of S** and ρ ≤ rank(S**); L = W* for KDR with PLS, where W* is the loadings matrix of the PLS model T_{PLS} = X*W*; and L = P* for TSR.
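To make the role of L concrete, Equation 8 can be sketched as a single NumPy routine (function and variable names are ours): the same code implements KDR, KDR with PCR, KDR with PLS or TSR depending on the L that is passed in.

```python
import numpy as np

def regression_impute_row(S_ma, S_aa, L, x_avail):
    """Sketch of the regression-based imputation step (Equation 8).

    S_ma:    the S#* covariance block (missing vs available variables).
    S_aa:    the S** covariance block (available vs available variables).
    L:       key matrix selecting the method (identity for KDR, the
             available loadings P* for TSR, etc.).
    x_avail: available part of the observation (x_i*).
    Returns the imputed missing part and the reconstructed available part.
    """
    # core = (L' S** L)^-1 L' x_i*
    core = np.linalg.solve(L.T @ S_aa @ L, L.T @ x_avail)
    return S_ma @ L @ core, S_aa @ L @ core
```

With L = I this reduces to the KDR regression of the missing variables on the available ones; swapping in P* gives TSR without touching the rest of the code.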
Therefore, to adapt the original MLPCA algorithm [11] to use the regression-based methods, we have to substitute the imputation step (Equations 2-3) by:

$$\hat{\mathbf{x}}_i = \mathbf{S}\boldsymbol{\Lambda}_i\mathbf{L}_i(\mathbf{L}_i^T\boldsymbol{\Lambda}_i^T\mathbf{S}\boldsymbol{\Lambda}_i\mathbf{L}_i)^{-1}\mathbf{L}_i^T\boldsymbol{\Lambda}_i^T\mathbf{x}_i \qquad (10)$$

$$\hat{\mathbf{y}}_k = \mathbf{S}\boldsymbol{\Phi}_k\mathbf{L}_k(\mathbf{L}_k^T\boldsymbol{\Phi}_k^T\mathbf{S}\boldsymbol{\Phi}_k\mathbf{L}_k)^{-1}\mathbf{L}_k^T\boldsymbol{\Phi}_k^T\mathbf{y}_k \qquad (11)$$

where the L matrix is the same as in the regression-based framework, particularised for the missing data pattern in row i or column k, and:

$$\boldsymbol{\Lambda}_i = \begin{bmatrix} \mathbf{0} \\ \mathbf{I}_{K-R_i} \end{bmatrix} \qquad (12)$$

$$\boldsymbol{\Phi}_k = \begin{bmatrix} \mathbf{0} \\ \mathbf{I}_{N-R_k} \end{bmatrix} \qquad (13)$$

The equivalence between Equations 10 and 8 is shown in Appendix A. For more details on MLPCA, readers are referred to [26,27] and the original paper [11]. The Matlab source code of the algorithm for PCA model building with missing data is reproduced in Appendix B with slight changes to introduce the imputation step of the regression-based methods.
3. DATA SETS AND COMPARATIVE STUDY
Six data sets are used in the present study to compare the results of the different imputation methods included in the framework. The first data set contains FTIR microscopy spectra of a polymer laminate consisting of three layers: polyethylene (PE), isophthalic polyester (IPE, presence originally unknown), and polyethylene terephthalate (PET). The polymer was scanned in a seventeen-point transect across the different layers, obtaining measurements from 81 wavelengths [28-30]. The second case study consists of a set of measured and inferred fluxes from Pichia pastoris cultures on heterogeneous culture media [31]. The measured fluxes were collected from a literature review; later on, a grey modelling approach was applied to infer the intracellular fluxes according to the observed extracellular ones. From the original data set with 3600 scenarios and 45 fluxes, a representative sample of 105 individuals is selected for the present comparative study. This data set has 3 biologically relevant PCs. Finally, a simulated data set, with 100 observations and 10 variables, is used to compare the performance of the different maximum likelihood methods [32,33]. This data set has 4 eigenvalues (3, 2.5, 2 and 1.5) explaining 90% of the variance in the data.

Three additional data sets are analysed here, taken from [14], where the adaptation of the regression-based methods to PCA-MB was proposed. The first one consists of the percentage composition of eight fatty acids in 75 olive oils from South Apulia [34]. The second one is a set of NIR spectra (750-1550 nm in 2 nm intervals) measured on 40 diesel fuels [35]. And the last one is a 100×10 simulated data set with 3 components explaining 40, 30 and 20% of the variance [32,33].
Two performance criteria are used to compare the results of the different methods. The first one is the mean squared prediction error (MSPE):

$$MSPE_{Method} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{K}\left(\hat{x}_{ij} - \hat{x}_{ij}^{Method}\right)^2}{NK} \qquad (14)$$
where x̂_ij is the predicted value for the j-th variable of the i-th observation in the prediction matrix X̂ = TP^T obtained from the complete data set; x̂_ij^Method is the analogous prediction obtained after applying the corresponding method on the incomplete data set; and N and K are the numbers of rows and columns in the data set, respectively. The original regression-based framework methods use, as convergence criterion, the difference between consecutive imputations of missing values. Instead, MLPCA uses the difference between the available measurements and their predictions from the current PCA model. For this reason, we decided to show the MSPEs for the available measurements and the missing ones separately, using Equation 14.
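Equation 14 amounts to a mean of squared differences; a minimal NumPy sketch (our own helper, with an optional mask as an assumption to mimic the separate available/missing computation described above) is:

```python
import numpy as np

def mspe(X_hat_full, X_hat_method, mask=None):
    """Mean squared prediction error between the PCA reconstruction of
    the complete data and the one obtained by a missing-data method.

    mask (optional): boolean array selecting the positions over which
    the error is averaged, e.g. only the missing or only the available
    cells. With mask=None, the average runs over all N*K cells.
    """
    diff = X_hat_full - X_hat_method
    if mask is not None:
        diff = diff[mask]
    return np.mean(diff ** 2)
```

A logarithm of this quantity is what the result figures plot, given the positive skewness of the MSPE.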
The second criterion consists of the cosine between the first loading vector obtained using the full data matrix and its corresponding one from the incomplete data set. The cosines of further PCs are not shown, since their values are strongly affected by the deviations of the first PC [14].
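The cosine criterion is a normalised inner product of the two loading vectors; a small sketch follows (our own helper; taking the absolute value to handle the sign indeterminacy of loadings is our assumption, not stated in the paper):

```python
import numpy as np

def loading_cosine(p_full, p_method):
    """Cosine between the first loading vector from the complete data
    and the one recovered from the incomplete data set."""
    c = (p_full @ p_method) / (np.linalg.norm(p_full) * np.linalg.norm(p_method))
    # PCA loadings are defined up to a sign flip, so compare magnitudes.
    return abs(c)
```

A cosine close to 1 means the imputation method recovered the dominant direction of variability of the complete data.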
Six different levels of missing values are generated for all data sets, ranging from 10% to 60% of missing data. Also, 50 different MD patterns are generated for each percentage of missing data, in order to build confidence intervals for the MSPEs. The intervals are built based on the LSD significance of a three-factor mixed-effects ANOVA, where method and percentage of MD are fixed-effect factors (and also their interaction), and the replicate is the random-effect factor (nested within percentage). Given the positive skewness of the MSPE, a logarithmic transformation is used to ease the visualization of the plots.
4. RESULTS
In this section the results of the comparative study are presented. However, we decided to exclude the results of ML-KDR, ML-KDR with PCR and ML-KDR with PLS due to large computation times, something already observed in [14], and due to the instability of some of them, especially ML-KDR (also observed in [14] with KDR) and ML-KDR with PLS. Therefore, the results of MLPCA, ML-TSR, TSR and PMP are shown, in order to answer the three research questions posed in the Introduction.
4.1. FTIR microspectroscopy
Figure 2 shows the results of the first case study. The two upper plots, A) and B), show the logarithm of the MSPEs and the cosines of PC#1, respectively. Figures 2C-2D also show the logarithm of the MSPEs, but considering the measured values and the imputed values separately. Regarding Figure 2A, there are no statistical differences between MLPCA and ML-TSR at any percentage of missing data. TSR and PMP statistically outperform both ML approaches for low percentages of missing data (10-20%). From 50% onwards, TSR is superior to PMP, MLPCA and ML-TSR. The cosines shown in Figure 2B are coherent with the results of the MSPEs, TSR having the highest cosines from 30% to 60%.
Figure 2. FTIR data set results. A) Logarithm of the MSPE for all measurements. B) Cosines associated with the first PC. C) Logarithm of the MSPE for the available measurements. D) Logarithm of the MSPE for the missing data. The dashed ellipses in A) mark the statistically significant differences between groups of methods. In A), TSR is statistically superior to MLPCA with 30-40% of MD; however, since no method is statistically different from all the rest, a single dashed ellipse encloses all of them.
The results in Figure 2C show that TSR and PMP are superior to the ML approaches in terms of the measured values, which implies that the PCA model fitted once the data are imputed with these methods is closer to the original one than the one obtained using maximum likelihood estimations. Figure 2D is indeed very similar to Figure 2A, due to the fact that the errors in the imputed values between the true PCA model and the imputed one are much larger than those in the measured values, as expected.
4.2. P. pastoris cultures on heterogeneous culture media
The results with the P. pastoris data set are similar to the previous ones, both in MSPEs and cosines (see Figures 3A-3B). TSR and PMP achieve the statistically best performance from 20% to 40% of MD; and again, from 50% onwards, TSR becomes the best approach, with PMP superior to MLPCA and ML-TSR. The performances of TSR and PMP are indeed coherent with the results observed in [14], which confirms the results obtained in that paper. The cosines shown in Figure 3B are coherent with the MSPE values: the lower the logarithm of the MSPE, the closer the loading vectors of the reconstructed matrix are to the actual ones.
Figure 3. P. pastoris data set results. A) Logarithm of the MSPE for all measurements. B) Cosines associated with the first PC. C) Logarithm of the MSPE for the available measurements. D) Logarithm of the MSPE for the missing data. The dashed ellipses in A) mark the statistically significant differences between groups of methods.
Regarding Figures 3C-3D, the performance of all methods is also similar to the first example. For low percentages of MD, the differences among methods are smaller in the measured values, but from 30% of MD onwards, the PCA model obtained with TSR imputation more closely resembles the real one.
4.3. Simulated data set

In the simulated data set with 4 PCs, the differences among TSR, ML-TSR, PMP and MLPCA are not statistically significant for low percentages of missing values (10-20%) (see Figure 4A). With 30%
of MD, TSR becomes statistically the best method and PMP the worst one. This is something that was observed in [14], also using a simulated data set [32,33]: the higher the percentage of missing data, the more difficult it is for PMP to impute properly. For higher percentages (30-60%), there are statistical differences among all methods: TSR maintains the best performance, followed by ML-TSR, MLPCA and PMP. This is the first case study where there are differences between MLPCA and ML-TSR, the latter being statistically superior. These differences in the MSPEs can also be seen in Figure 4B, where all methods but TSR show huge deviations from the true principal coordinate of the data with low-medium percentages of MD (10-40%).
Figure 4. Simulated data set results. A) Logarithm of the MSPE for all measurements. B) Cosines associated with the first PC. C) Logarithm of the MSPE for the available measurements. D) Logarithm of the MSPE for the missing data. The dashed ellipses in A) mark the statistically significant differences between groups of methods.
In this third example the differences among methods regarding the measured values are narrower (see Figure 4C), but they still show the superiority of TSR.
4.4. Additional data sets
Three more data sets are used to compare the performance of the ML-based methods against PMP and TSR in its original form: the olive oil data set, the diesel NIR data set, and a 3-component simulated data set. The figures containing the logarithm of the MSPEs and the cosines associated with the first component are available as Supporting Information of this paper.

Summarizing the results, in these data sets the performance of TSR is statistically superior to PMP (as proven in [14]), and to MLPCA and ML-TSR for medium-high percentages (30-60%), and also for low percentages (10-20%) in the olive oil and diesel NIR data sets. Also, the reconstruction of the available measurements with TSR is more similar to the PCA on complete data than with the ML-based approaches in both data sets. These results are coherent with Sections 4.1-4.2. Comparing ML-TSR and MLPCA, the former yields better results than MLPCA for high percentages of missing data (50-60%) in the 3-component simulated data set, as happened in Section 4.3 with the 4-component simulated data set.
5. CONCLUSIONS
To conclude, it is worth recalling the research questions posed at the beginning of the paper:
• Are the imputed values of MLPCA and PMP for model building equal? The answer is no. The PMP imputation step performed alternately by rows and columns in MLPCA drives the imputation in a different direction than performing it only by columns, as PMP does. Based on the six data sets analysed here, PMP, if it converges, gives better results than MLPCA. However, PMP suffered from convergence problems in some case studies, while MLPCA converged in all data sets and at all MD percentages.
• Does ML-TSR outperform the imputation of MLPCA? The answer, based on the case studies analysed here, is that when the latent structure is complex and the percentage of missing data is high, ML-TSR may outperform MLPCA. In other cases, the overall results show no statistically significant differences. However, MLPCA tends to be 2-5 times faster than ML-TSR.
• Does MLPCA or ML-TSR outperform the original TSR algorithm? The answer is no. TSR outperforms the ML approaches for medium-high percentages of missing data. For low percentages, depending on the case study analysed, it is either statistically superior or there is no statistically significant difference compared to the other methods.
Finally, we recommend the use of trimmed scores regression over MLPCA for PCA model building with missing data, since the reconstruction of both the available and the imputed values is statistically more accurate than using MLPCA or ML-TSR.
Appendix A. Regression-based imputation step in MLPCA.
The equivalence between Equations 10 and 8 is proven here. Let us assume that we rearrange the values in row x_i^T to have the R_i missing values, x_i^{#T}, at the first positions, and the remaining K−R_i available ones, x_i^{*T}, at the end. We can use Equation 4 in Equation 10 to introduce the extension of the missing data partition, X = [X# X*]. Bearing in mind the decomposition of the covariance matrix of X (see Equation 9) and matrix Λ_i (Equation 12), Equation 10 can be written as:

$$\hat{\mathbf{x}}_i = \mathbf{S}\boldsymbol{\Lambda}_i\mathbf{L}_i(\mathbf{L}_i^T\boldsymbol{\Lambda}_i^T\mathbf{S}\boldsymbol{\Lambda}_i\mathbf{L}_i)^{-1}\mathbf{L}_i^T\boldsymbol{\Lambda}_i^T\mathbf{x}_i = \begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{I}_{K-R_i} \end{bmatrix}\mathbf{L}_i\left(\mathbf{L}_i^T\begin{bmatrix} \mathbf{0} & \mathbf{I}_{K-R_i} \end{bmatrix}\begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{I}_{K-R_i} \end{bmatrix}\mathbf{L}_i\right)^{-1}\mathbf{L}_i^T\begin{bmatrix} \mathbf{0} & \mathbf{I}_{K-R_i} \end{bmatrix}\mathbf{x}_i =$$

$$= \begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{L}_i \end{bmatrix}\left(\begin{bmatrix} \mathbf{0} & \mathbf{L}_i^T \end{bmatrix}\begin{bmatrix} \mathbf{S}^{\#\#} & \mathbf{S}^{\#*} \\ \mathbf{S}^{*\#} & \mathbf{S}^{**} \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{L}_i \end{bmatrix}\right)^{-1}\begin{bmatrix} \mathbf{0} & \mathbf{L}_i^T \end{bmatrix}\begin{bmatrix} \mathbf{0} \\ \mathbf{x}_i^{*} \end{bmatrix} = \begin{bmatrix} \mathbf{S}^{\#*}\mathbf{L}_i(\mathbf{L}_i^T\mathbf{S}^{**}\mathbf{L}_i)^{-1}\mathbf{L}_i^T\mathbf{x}_i^{*} \\ \mathbf{S}^{**}\mathbf{L}_i(\mathbf{L}_i^T\mathbf{S}^{**}\mathbf{L}_i)^{-1}\mathbf{L}_i^T\mathbf{x}_i^{*} \end{bmatrix} \qquad (15)$$

The proof using Equation 11 is analogous: substituting x_i by y_k, changing the subindices i by k and the matrix Λ_i by Φ_k, and bearing in mind that the L_k matrix is obtained using the missing data pattern of y_k.
Appendix B. Matlab source code.
The source code for MLPCA is shown here. It consists of a modification of the original MLPCA algorithm [11] to perform the imputation step using the regression-based framework methods. The function mlpmp.m corresponds to the original imputation step of MLPCA, which is equivalent to the PMP method for PCA model building, as proved in this paper. Only the ML version of TSR is shown, since the other approaches are not competitive. The source code for the original TSR and PMP can be found in [14].
function [U,S,V,SOBJ,ErrFlag,count] = mlpca_generic(X,stdX,p,type)
%
% From Andrews and Wentzell (1997). Analytica Chimica Acta 350, 341-352
%
% This function performs MLPCA with missing data
%
% X       mxn matrix of observations.
% stdX    mxn matrix of standard deviations associated with X (zeros for
%         missing measurements).
% p       model dimensionality.
% type    MD method chosen:
%           0  MLPCA, equivalent to PMP
%           1  MLTSR (maximum likelihood TSR)
%
% U,S,V   the pseudo-svd parameters.
% SOBJ    value of the objective function.
% ErrFlag indicates exit conditions:
%           0 = normal termination, 1 = max iterations exceeded.
% count   number of iterations needed
%
% Initialization
%
convlim = 1e-10;
maxiter = 500;
XX = X;
varX = stdX.^2;
[i,j] = find(varX==0);
errmax = max(max(varX));
for k = 1:length(i)
    varX(i(k),j(k)) = 1e10*errmax;  % large variance for missing cells
end
n = length(XX(1,:));
%
% Generate initial estimates
%
for i = 1:length(X(:,1))
    for j = 1:length(X(:,1))
        denom = min([nnz(X(i,:)) nnz(X(j,:))]);
        CV(i,j) = (X(i,:)*X(j,:)')/denom;
    end
end
[U,S,V] = svd(CV,0);
U0 = U(:,1:p);
MLXaux = XX;
%
% Loop for alternating least squares
%
count = 0;
Sold = 0;
ErrFlag = -1;
while ErrFlag < 0
    count = count+1;
    Sobj = 0;
    MLX = zeros(size(XX));
    for i = 1:n
        % Method selection
        switch type
            case 0
                [MLX,Q] = mlpmp(XX,varX,U0,n,i,MLX);
            case 1
                [MLX,Q] = mltsr(XX,MLXaux',varX,U0,n,i,MLX);
            otherwise
                error('Wrong method')
        end
        dx = XX(:,i)-MLX(:,i);
        Sobj = Sobj+dx'*Q*dx;
    end
    if rem(count,2)==1
        if (abs(Sold-Sobj)/Sobj) < convlim
            ErrFlag = 0;
        elseif count > maxiter
            ErrFlag = 1;
        end
    end
    if ErrFlag < 0
        % Switch between row and column spaces for the next half-iteration
        Sold = Sobj;
        [U,S,V] = svd(MLX,0);
        XX = XX';
        varX = varX';
        n = length(XX(1,:));
        U0 = V(:,1:p);
        MLXaux = MLX';
    end
end
%
% Finished
%
[U,S,V] = svd(MLX,0);
U = U(:,1:p);
S = S(1:p,1:p);
V = V(:,1:p);
SOBJ = Sobj;
end

function [MLX,Q] = mlpmp(XX,varX,U0,n,i,MLX)
% MLPCA imputation, equivalent to PMP
Q = diag(varX(:,i).^(-1));
F = inv(U0'*Q*U0);
MLX(:,i) = U0*F*U0'*Q*XX(:,i);
end

function [MLX,Q] = mltsr(XX,MLXaux,varX,U0,n,i,MLX)
% MLPCA with TSR missing data imputation
Q = diag(varX(:,i).^(-1));
CV = cov(MLXaux);
MLX(:,i) = CV*Q*U0*pinv(U0'*Q*CV*Q*U0)*U0'*Q*XX(:,i);
end
Supporting Information

Three additional figures with the results of the comparative study using the last three data sets are available online.
Acknowledgements

Research in this study was partially supported by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants DPI2011-28112-C04-02 and DPI2014-55276-C5-1R, and the Spanish Ministry of Economy and Competitiveness through grant ECO2013-43353-R.
REFERENCES

1. Jolliffe IT, Principal Component Analysis. Springer-Verlag, NY, USA, 1986.
2. Wentzell PD, Andrews DT, Hamilton DC, Faber K, Kowalski BR, Maximum likelihood principal component analysis, J. Chemometr. 1997, 11, 339-366.
3. Ristolainen M, Alén R, Malkavaara P, Pere J, Reflectance FTIR microspectroscopy for studying effect of xylan removal on unbleached and bleached birch kraft pulps, Holzforschung 2002, 56(5), 513-521.
4. Keenan MR, Maximum likelihood principal component analysis of time-of-flight secondary ion mass spectrometry spectral images, J. Vac. Sci. Technol. A 2005, 23(4), 746-750.
5. Sang WC, Martin EB, Morris AJ, Lee I-B, Fault detection based on a maximum-likelihood principal component analysis (PCA) mixture, Ind. Eng. Chem. Res. 2005, 44(7), 2316-2327.
6. Karakach TK, Wentzell PD, Walter JA, Characterization of the measurement error structure in 1D 1H NMR data for metabolomics studies, Anal. Chim. Acta 2009, 636(2), 163-174.
7. Wentzell PD, Hou S, Exploratory data analysis with noisy measurements, J. Chemometr. 2012, 26(6), 264-281.
8. Mailier J, Remy M, Vande Wouwer A, Stoichiometric identification with maximum likelihood principal component analysis, J. Math. Biol. 2013, 67(4), 739-765.
9. Hoefsloot HCJ, Verouden MPH, Westerhuis JA, Smilde AK, Maximum likelihood scaling (MALS), J. Chemometr. 2006, 20(3-4), 120-127.
10. Dadashi M, Abdollahi H, Tauler R, Maximum likelihood principal component analysis as initial projection step in multivariate curve resolution analysis of noisy data, Chemom. Intell. Lab. 2012, 118, 33-40.
11. Andrews DT, Wentzell PD, Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer, Anal. Chim. Acta 1997, 350(3), 341-352.
12. Ho P, Silva MCM, Hogg TA, Multiple imputation and maximum likelihood principal component analysis of incomplete multivariate data from a study of the ageing of port, Chemom. Intell. Lab. 2001, 55, 1-11.
13. Stanimirova I, Practical approaches to principal component analysis for simultaneously dealing with missing and censored elements in chemical data, Anal. Chim. Acta 2013, 796, 27-37.
14. Folch-Fortuny A, Arteaga F, Ferrer A, PCA model building with missing data: new proposals and a comparative study, Chemom. Intell. Lab. 2015, 146, 77-88.
15. Arteaga F, Ferrer A, Dealing with missing data in MSPC: several methods, different interpretations, some examples, J. Chemometr. 2002, 16, 408-418.
16. Arteaga F, Ferrer A, Framework for regression-based missing data imputation methods in on-line MSPC, J. Chemometr. 2005, 19, 439-447.
17. Walczak B, Massart DL, Dealing with missing data. Part I, Chemom. Intell. Lab. 2001, 58, 15-27.
18. Nelson PRC, Taylor PA, MacGregor JF, Missing data methods in PCA and PLS: score calculations with incomplete observations, Chemom. Intell. Lab. 1996, 35, 45-65.
19. Wold S, Albano C, Dunn WJ, Esbensen K, Hellberg S, Johansson E, Sjöström M, Pattern recognition: finding and using regularities in multivariate data, in: Martens H, Russwurm H (Jr.) (Eds.), Food Research and Data Analysis, vol. 3, Applied Science Pub: London, UK, 1983, 183-185.
20. López-Negrete de la Fuente R, García-Muñoz S, Biegler LT, An efficient nonlinear programming strategy for PCA models with incomplete data sets, J. Chemometr. 2010, 24, 301-311.
21. Schafer JL, Analysis of Incomplete Multivariate Data, CRC Press: New York, USA, 1997.
22. ProSensus MultiVariate release 15.02, ProSensus Inc, Ancaster, Ontario, Canada, 2015. (http://www.prosensus.com).
23. SIMCA release 14, Umetrics, Umeå, Sweden, 2015. (http://www.umetrics.com).
24. PLS Toolbox release 7.9.5, Eigenvector Research Inc, Manson, Washington, USA, 2015. (http://www.eigenvector.com).
25. Folch-Fortuny A, Arteaga F, Ferrer A, Missing Data Imputation Toolbox for MATLAB, submitted.
26. Nelson PRC, Treatment of missing measurements in PCA and PLS models, Ph.D. Dissertation, Department of Chemical Engineering, McMaster University, Hamilton, Ontario, Canada, 2002.
27. Arteaga F, Control estadístico multivariante de procesos con datos faltantes mediante análisis de componentes principales, Ph.D. thesis, Universidad Politécnica de Valencia, 2003.
28. Guilment J, Markel S, Windig W, Infrared chemical micro-imaging assisted by interactive self-modeling multivariate analysis, Appl. Spectrosc. 1994, 48, 320-326.
29. Windig W, Markel S, Simple-to-use interactive self-modeling mixture analysis of FTIR microscopy data, J. Mol. Struct. 1993, 292, 161-170.
30. Windig W, Spectral data files for self-modeling curve resolution with examples using the Simplisma approach, Chemom. Intell. Lab. 1997, 36, 3-16.
31. González-Martínez JM, Folch-Fortuny A, Llaneras F, Tortajada M, Picó J, Ferrer A, Metabolic flux understanding of Pichia pastoris grown on heterogeneous culture media, Chemom. Intell. Lab. 2014, 134, 89-99.
32. Arteaga F, Ferrer A, How to simulate normal data sets with the desired correlation structure, Chemom. Intell. Lab. 2010, 101, 38-42.
33. Arteaga F, Ferrer A, Building covariance matrices with the desired structure, Chemom. Intell. Lab. 2013, 127, 80-88.
34. Forina M, Armanino C, Lanteri S, Tiscornia E, Classification of olive oils from their fatty acid composition, in: Martens H, Russwurm Jr H. (Eds.), Food Research and Data Analysis, Applied Science Pub, London, 1983, pp. 189-214.
35. Hutzler SA, Bessee GB, Remote Near-Infrared Fuel Monitoring System, Interim Report, U.S. Army TARDEC Fuels and Lubricants Research Facility, Southwest Research Institute, San Antonio, United States, 1997.