Bayesian Wavelet Regression on Curves With Application to a Spectroscopic Calibration Problem

P. J. Brown, T. Fearn, and M. Vannucci

Motivated by calibration problems in near-infrared (NIR) spectroscopy, we consider the linear regression setting in which the many predictor variables arise from sampling an essentially continuous curve at equally spaced points and there may be multiple predictands. We tackle this regression problem by calculating the wavelet transforms of the discretized curves, then applying a Bayesian variable selection method using mixture priors to the multivariate regression of predictands on wavelet coefficients. For prediction purposes, we average over a set of likely models. Applied to a particular problem in NIR spectroscopy, this approach was able to find subsets of the wavelet coefficients with overall better predictive performance than the more usual approaches. In the application, the available predictors are measurements of the NIR reflectance spectrum of biscuit dough pieces at 256 equally spaced wavelengths. The aim is to predict the composition (i.e., the fat, flour, sugar, and water content) of the dough pieces using the spectral variables. Thus we have a multivariate regression of four predictands on 256 predictors with quite high intercorrelation among the predictors. A training set of 39 samples is available to fit this regression. Applying a wavelet transform replaces the 256 measurements on each spectrum with 256 wavelet coefficients that carry the same information. The variable selection method could use subsets of these coefficients that gave good predictions for all four compositional variables on a separate test set of samples. Selecting in the wavelet domain rather than from the original spectral variables is appealing in this application, because a single wavelet coefficient can carry information from a band of wavelengths in the original spectrum. This band can be narrow or wide, depending on the scale of the wavelet selected.

KEY WORDS: Markov chain Monte Carlo; Mixture prior; Model averaging; Multivariate regression; Near-infrared spectroscopy; Variable selection.

1. INTRODUCTION

This article presents a new way of tackling linear regression problems in which the predictor variables arise from sampling an essentially continuous curve at equally spaced points. The work was motivated by calibration problems in near-infrared (NIR) spectroscopy, of which the following example is typical.

1.1 Near-Infrared Spectroscopy of Biscuit Doughs

Quantitative NIR spectroscopy is used to analyze such diverse materials as food and drink, pharmaceutical products, and petrochemicals. The NIR spectrum of a sample of, say, wheat flour is a continuous curve measured by modern scanning instruments at hundreds of equally spaced wavelengths. The information contained in this curve can be used to predict the chemical composition of the sample. The problem lies in extracting the relevant information from possibly thousands of overlapping peaks. Osborne, Fearn, and Hindle (1993) described applications in food analysis and reviewed some of the standard approaches to the calibration problem.

The example studied in detail here arises from an experiment done to test the feasibility of NIR spectroscopy to measure the composition of biscuit dough pieces (formed but unbaked biscuits), for possible on-line implementation. (For a full description of the experiment, see Osborne, Fearn, Miller, and Douglas 1984.) Briefly, two similar sample sets were made up, with the standard recipe varied to provide a large range for each of the four constituents under investigation: fat, sucrose, dry flour, and water. The calculated percentages of these four ingredients represent the q = 4 responses. There were n = 39 samples in the calibration or training set, with sample 23 excluded from the original 40 as an outlier, and a further m = 39 in the separate prediction or validation set, again after one outlier was excluded. Thus Y and Y_f, the matrices of compositional data for the training and validation sets, are both of dimension 39 × 4.

An NIR reflectance spectrum is available for each dough piece. The original spectral data consist of 700 points measured from 1100 to 2498 nanometers (nm) in steps of 2 nm. For our analyses using wavelets, we have chosen to reduce the number of spectral points to save computational time. The first 140 and last 49 wavelengths, which were thought to contain little useful information, were removed, leaving a wavelength range from 1380 nm to 2400 nm, over which we took every other point, thus increasing the gap to 4 nm and reducing the number of points to p = 256. The matrices X and X_f of spectral data are then 39 × 256. Samples of three centered spectra are given on the left side of Figure 1.

The aim is to derive an equation that will predict the response values Y from the spectral data X for future samples where Y is unknown but X can be measured cheaply and rapidly.

P. J. Brown is Pfizer Professor of Medical Statistics, Institute of Mathematics and Statistics, University of Kent, Canterbury, Kent, CT2 7NF, U.K. (E-mail: [email protected]). T. Fearn is Professor of Applied Statistics, Department of Statistical Science, University College London, WC1E 6BT, U.K. (E-mail: [email protected]). M. Vannucci is Assistant Professor of Statistics, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: [email protected]). This work was supported by the U.K. Engineering and Physical Sciences Research Council under the Stochastic Modelling in Science and Technology Initiative, grant GK/K73343. M. Vannucci also acknowledges support from the Texas Higher Advanced Research Program, grant 010366-0075 from the Texas A&M International Research Travel Assistant Program, and from the National Science Foundation CAREER award DMS-0093208. The spectroscopic calibration problem was provided by the Flour Milling and Baking Research Association. The authors thank the associate editor and a referee, as well as Mike West, Duke University, and Adrian Raftery, University of Washington, for suggestions that helped improve the article.

© 2001 American Statistical Association, Journal of the American Statistical Association, June 2001, Vol. 96, No. 454, Applications and Case Studies



[Figure 1. Original Spectra (left column) and Wavelet Transforms (right column). Left panels: reflectance versus wavelength (1400-2400 nm) for three centered spectra; right panels: the corresponding DWT coefficients versus coefficient index.]

1.2 Standard Analyses

The most commonly used approaches to this calibration problem regress Y on X, with the linear form

Y = XB + E

being justified either by appeals to the Beer-Lambert law (Osborne et al. 1993) or on the grounds that it works in practice. In Section 6 we also investigate logistic transformations of the responses, showing that overall their impact on the prediction performance of the model is not beneficial.

The problem is not straightforward, because there are many more predictor variables (256) than training samples (39) in our example. The most commonly used methods for overcoming this difficulty fall into two broad classes: variable selection and factor-based methods. When scanning NIR instruments first appeared, the standard approach was to select (typically using a stepwise procedure) predictors at a small number of wavelengths and use multiple linear regression with this subset (Hrushka 1987). Later, this approach was largely superseded by methods that reduce the p spectral variables to scores on a much smaller number of factors and then regress on these scores. Two variants, principal components regression (PCR; Cowe and McNicol 1985) and partial least squares regression (PLS; Wold, Martens, and Wold 1983), are now widely used with equal effectiveness as the standard approaches. The increasing power of computers has triggered renewed research interest in wavelength selection, now using computer-intensive search methods.

1.3 Selecting Wavelet Coefficients

The approach that we investigate here involves selecting variables, but we select from derived variables. The idea is to transform each spectrum into a set of wavelet coefficients, the whole of which would suffice to reconstruct the spectrum, and select good predictors from among these. There are good reasons for thinking that this approach might have advantages over selecting from the original variables.

In previous work (Brown, Fearn, and Vannucci 1999; Brown, Vannucci, and Fearn 1998a,b), we explored Bayesian approaches to the problem of selecting predictor variables in this multivariate regression context. We apply this methodology to wavelet selection here.

We know that wavelets can be used successfully for compression of curves like the spectra in our example, in the sense that the curves can be accurately reconstructed from a fraction of the full set of wavelet coefficients (Trygg and Wold 1998; Walczak and Massart 1997). Furthermore, the wavelet decomposition of the curve is a local one, so that if the information relevant to our prediction problem is contained in a particular part or parts of the curve, as it typically is, then this information will be carried by a very small number of wavelet coefficients. Thus we may expect selection to work. The ability of wavelets to model the curve at different levels of resolution gives us the option of selecting from our curve at a range of bandwidths. In some situations it may be advantageous to select a sharp band, as we do when we select one of the original variables; in other situations a broad band averaging over many adjacent points may be preferable. Selecting from the wavelet coefficients gives us both of these options automatically and in a very computationally efficient framework.


Not surprisingly, Fourier analysis has also been successfully used for data compression and denoising of NIR spectral data (McClure, Hamid, Giesbrecht, and Weeks 1984). For this purpose there is probably very little to choose between the Fourier and wavelet approaches. When it comes to selecting small numbers of coefficients for prediction, however, the local nature of wavelets makes them the obvious choice.

It is worth emphasizing that what we are doing here is not what is commonly described as wavelet regression. We are not fitting a smooth function to a single noisy curve by using either thresholding or shrinkage of wavelet coefficients (see Clyde and George 2000 for Bayesian approaches that use mixture modeling, and Donoho, Johnstone, Kerkyacharian, and Picard 1995). Unlike those authors, we have several curves, the spectra of 39 dough pieces, each of which is transformed to wavelet coefficients. We then select some of these wavelet coefficients (the same ones for each spectrum), not because they give a good reconstruction of the curves (which they do not) or to remove noise (of which there is very little to start with) from the curves, but rather because the selected coefficients are useful for predicting some other quantity measured on the dough pieces. One consequence of this is that it is not necessarily the large wavelet coefficients that will be useful; small coefficients in critical regions of the spectrum also may carry important predictive information. Thus the standard thresholding or shrinkage approaches are just not relevant to this problem.

2. PRELIMINARIES

2.1 Wavelet Bases and Wavelet Transforms

Wavelets are families of functions that can accurately describe other functions in a parsimonious way. In L²(R), for example, an orthonormal wavelet basis is obtained as translations and dilations of a mother wavelet ψ, as ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k), with j, k integers. A function f is then represented by a wavelet series as

f(x) = Σ_{j,k ∈ Z} d_{j,k} ψ_{j,k}(x),   (1)

with wavelet coefficients d_{j,k} = ∫ f(x) ψ_{j,k}(x) dx describing features of the function f at the spatial location 2^{−j}k and frequency proportional to 2^j (or scale j).

Daubechies (1992) proposed a class of wavelet families that have compact support and a maximum number of vanishing moments for any given smoothness. These are used extensively in statistical applications.

Wavelets have been an extremely useful tool for the analysis and synthesis of discrete data. Let Y = (y_1, …, y_n), n = 2^J, be a sample of a function at equally spaced points. This vector of observations can be viewed as an approximation to the function at the fine scale J. A fast algorithm, the discrete wavelet transform (DWT), exists for decomposing Y into a set of wavelet coefficients (Mallat 1989) in only O(n) operations. The DWT operates in practice by means of linear recursive filters. For illustration purposes, we can write it in matrix form as Z = WY, where W is an orthogonal matrix corresponding to the discrete wavelet transform and Z is a vector of wavelet coefficients describing features of the function at scales from the fine J − 1 to a coarser one, say J − r. An algorithm for the inverse construction also exists.

Wavelet transforms can be computed very rapidly and have good compression properties. Because they are localized in both time and frequency, wavelets have the ability to represent many classes of functions in a sparse form by describing important features with few coefficients. (For a general exposition of wavelet theory, see Daubechies 1992.)
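As a concrete illustration of the DWT described above, the following minimal sketch transforms one discretized curve of n = 2^J = 256 points and reconstructs it from its coefficients. It assumes the PyWavelets package; the wavelet family ('db4', four vanishing moments, close in spirit to the wavelets used later in the article) and the periodized boundary handling are illustrative choices, not a prescription from the article.

    import numpy as np
    import pywt

    # A discretized "spectrum": n = 2^J equally spaced samples of a smooth curve.
    n = 256
    x = np.linspace(0.0, 1.0, n)
    y = np.exp(-30 * (x - 0.4) ** 2) + 0.5 * np.exp(-100 * (x - 0.7) ** 2)

    # Full decomposition; pywt returns coefficients ordered from coarsest to finest.
    # 'periodization' keeps the total number of coefficients equal to n, mimicking
    # the orthogonal matrix form Z = W Y used in the text.
    coeffs = pywt.wavedec(y, 'db4', mode='periodization')
    z = np.concatenate(coeffs)      # length-n vector of wavelet coefficients
    print(len(z))                   # 256

    # The transform is invertible: recover the curve from its coefficients.
    y_rec = pywt.waverec(coeffs, 'db4', mode='periodization')
    print(np.allclose(y, y_rec))    # True, up to floating-point error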

2.2 Matrix-Variate Distributions

In what follows, we use notation for matrix-variate distributions due to Dawid (1981). We write

V − M ∼ N(Γ, Σ)

when the random matrix V has a matrix-variate normal distribution with mean M and covariance matrices γ_ii Σ and σ_jj Γ for its ith row and jth column. Such a V could be generated as V = M + A′UB, where M, A, and B are fixed matrices such that A′A = Γ and B′B = Σ, and U is a random matrix with independent standard normal entries. This notation has the advantage of preserving the matrix structure instead of reshaping V as a vector. It also makes for much easier formal Bayesian manipulation.
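To make Dawid's notation concrete, here is a small sketch (an illustration under assumed toy dimensions and covariance matrices, not code from the article) that generates a draw of V with V − M ∼ N(Γ, Σ) via the construction V = M + A′UB given above.

    import numpy as np

    rng = np.random.default_rng(0)

    n, q = 5, 3                       # V will be n x q
    M = np.zeros((n, q))              # mean matrix
    # Row covariance Gamma (AR(1)-type) and column covariance Sigma.
    Gamma = 0.5 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    Sigma = np.array([[1.0, 0.3, 0.0],
                      [0.3, 1.0, 0.2],
                      [0.0, 0.2, 1.0]])

    # Square roots A and B with A'A = Gamma and B'B = Sigma (Cholesky factors).
    A = np.linalg.cholesky(Gamma).T   # upper triangular, so A.T @ A = Gamma
    B = np.linalg.cholesky(Sigma).T   # upper triangular, so B.T @ B = Sigma

    U = rng.standard_normal((n, q))   # independent N(0, 1) entries
    V = M + A.T @ U @ B               # V - M ~ N(Gamma, Sigma) in Dawid's notation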

The other notation that we use is

W ∼ IW(δ; Σ)

for a random matrix W with an inverse Wishart distribution with scale matrix Σ and shape parameter δ. The shape parameter differs from the more conventional degrees of freedom, again making for very easy Bayesian manipulations. With U and B defined as earlier, and with U as n × p with n > p, W = B′(U′U)^{−1}B has an inverse Wishart distribution with shape parameter δ = n − p + 1 and scale matrix Σ. The expectation of W exists for δ > 2 and is then Σ/(δ − 2). (More details of these notations and a corresponding form for the matrix-variate T can be found in Brown 1993, App. A, or Dawid 1981.)

3. MODELING

3.1 Multivariate Regression Model

The basic setup that we consider is a multivariate linear regression model with n observations on a q-variate response and p explanatory variables. Let Y denote the n × q matrix of observed values of the responses, and let X be the n × p matrix of predictor variables. Our special concern is with functional predictor data, that is, the situation in which each row of X is a vector of observations of a curve x(t) at p equally spaced points.

The standard multivariate normal regression model has, conditional on α, B, Σ, and X,

Y − 1_n α′ − XB ∼ N(I_n, Σ),   (2)

where 1_n is an n × 1 vector of 1's, α is a q × 1 vector of intercepts, and B = (β_1, …, β_q) is a p × q matrix of regression coefficients. Without loss of generality, we assume that the columns of X have been centered by subtracting their means.


The unknown parameters are α, B, and the q × q error covariance matrix Σ. A conjugate prior for this model is as follows. First, given Σ,

α′ − α_0′ ∼ N(h, Σ)   (3)

and, independently,

B − B_0 ∼ N(H, Σ).   (4)

The marginal distribution of Σ is then

Σ ∼ IW(δ; Q).   (5)

Note that the priors on both α and B have covariances dependent on Σ, in a way that directly extends the univariate regression natural conjugate prior distributions.

In practice, we let h → ∞ to represent vague prior knowledge about α and take B_0 = 0, leaving the specification of H, δ, and Q to incorporate prior knowledge about our particular application.

3.2 Transformation to Wavelets

We now transform the predictor variables by applying to each row of X a wavelet transform, as described in Section 2.1. In matrix form, multiplying each row of X by the same matrix W is equivalent to multiplying X on the right side by W′.

The wavelet transform is orthogonal (i.e., W′W = I), and thus (2) can be written as

Y − 1_n α′ − XW′WB ∼ N(I_n, Σ).   (6)

We can now express the model in terms of wavelet transformations of the predictors as

Y − 1_n α′ − Z B̃ ∼ N(I_n, Σ),   (7)

where Z = XW′ is now a matrix of wavelet coefficients and B̃ = WB is a matrix of regression coefficients. The transformed prior on B̃, in the case of inclusion of all predictors, is

B̃ ∼ N(H̃, Σ),   (8)

where H̃ = WHW′, and the parameters α and Σ are unchanged by the orthogonal transformations, as are the priors (3) and (5).

In practice, wavelets exploit the recursive application of filters, and the W-matrix notation is more useful for explanation than for computation. Vannucci and Corradi (1999) proposed a fast recursive algorithm for computing quantities such as WHW′. Their algorithm has a useful link to the two-dimensional DWT (DWT2), making computations simple. The matrix WHW′ can be computed from H with an O(n²) algorithm. (For more details, see secs. 3.1 and 3.2 of Vannucci and Corradi 1999.)
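The following sketch illustrates these transformations numerically: it builds the orthogonal matrix W explicitly by transforming the columns of an identity matrix, then forms Z = XW′ and H̃ = WHW′. This brute-force construction (assuming PyWavelets with periodized 'db4' filters, and placeholder data X and prior covariance H) is only for illustration; it is not the fast recursive algorithm of Vannucci and Corradi (1999), which obtains WHW′ without ever constructing W.

    import numpy as np
    import pywt

    def dwt_vector(v, wavelet='db4'):
        """Full periodized DWT of a length-2^J vector, returned as a single vector."""
        return np.concatenate(pywt.wavedec(v, wavelet, mode='periodization'))

    p = 256
    # Column j of W is the DWT of the jth standard basis vector, since z = W y.
    W = np.column_stack([dwt_vector(np.eye(p)[:, j]) for j in range(p)])
    print(np.max(np.abs(W.T @ W - np.eye(p))))   # ~0: W is (numerically) orthogonal

    rng = np.random.default_rng(1)
    n = 39
    X = np.cumsum(rng.standard_normal((n, p)), axis=1) * 0.01   # toy smooth-ish "spectra"
    sigma2, rho = 1.0, 0.32                                     # placeholder AR(1) parameters
    H = sigma2 * rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

    Z = X @ W.T            # wavelet coefficients of each spectrum (row-wise DWT)
    H_tilde = W @ H @ W.T  # prior covariance of the wavelet-domain coefficients B~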

3.3 A Framework for Variable Selection

To perform selection in the wavelet coefficient domain, we further elaborate the prior on B̃ by introducing a latent binary p-vector γ. The jth element of γ, γ_j, may be either 1 or 0, depending on whether the jth column of Z is or is not included in the model. When γ_j is unity, the covariance matrix of the corresponding row of B̃ is "large," and when γ_j is 0, the covariance matrix is a zero matrix. We have assumed that the prior expectation of B̃ is 0, and so γ_j = 0 effectively deletes the jth explanatory variable (or wavelet coefficient) from the model. This gives, conditional on γ,

B̃_γ ∼ N(H̃_γ, Σ),   (9)

where B̃_γ and H̃_γ are just B̃ and H̃ with the rows (and, in the case of H̃, columns) for which γ_j = 0 deleted. Under this prior, each row of B̃ is modeled as having a scale mixture of the type

[B̃]_j ∼ (1 − γ_j) I_0 + γ_j N(0, h̃_jj Σ),   (10)

with h̃_jj equal to the jth diagonal element of the matrix H̃ = WHW′ and I_0 a distribution placing unit mass on the 1 × q zero vector. Note that the rows of B̃ are not independent.

A simple prior distribution π(γ) for γ takes the γ_j to be independent, with Pr(γ_j = 1) = w_j and Pr(γ_j = 0) = 1 − w_j, with hyperparameters w_j to be specified for j = 1, …, p. In our example, we take all of the w_j equal to a common w, so that the number of nonzero elements of γ has a binomial distribution with expectation pw.
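A small sketch of this selection prior follows (illustrative only; dimensions and hyperparameter values are placeholders rather than the settings used later in the article). It draws γ with independent Bernoulli(w) entries and then draws each included row of B̃ from N(0, h̃_jj Σ); for simplicity the rows are drawn independently here, although under the full prior (9) they are correlated through the off-diagonal elements of H̃_γ.

    import numpy as np

    rng = np.random.default_rng(2)

    p, q = 256, 4
    w = 20 / 256                             # common w, so expected model size pw = 20
    Sigma = np.eye(q)                        # placeholder error covariance
    h_tilde_diag = np.linspace(2.0, 0.5, p)  # placeholder diagonal of H~

    gamma = rng.random(p) < w                # latent indicators, Pr(gamma_j = 1) = w
    L = np.linalg.cholesky(Sigma)
    B_tilde = np.zeros((p, q))
    for j in np.flatnonzero(gamma):
        # jth row ~ N(0, h~_jj * Sigma); rows with gamma_j = 0 stay exactly zero.
        B_tilde[j] = np.sqrt(h_tilde_diag[j]) * (L @ rng.standard_normal(q))

    print(gamma.sum(), "coefficients included (expected about", p * w, ")")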

Mixture priors have been widely used for variable selection in the original model space, originally by Leamer (1978) and more recently by George and McCulloch (1997) and Mitchell and Beauchamp (1988) for the linear multiple regression case. Carlin and Chib (1995), Chipman (1996), and Geweke (1996), among others, concentrated on special features of these priors. Clyde, DeSimone, and Parmigiani (1996) used model mixing in prediction problems with correlated predictors, when expressing the space of models in terms of an orthogonalization of the design matrix. Their methods are not directly applicable to our situation because the wavelet transforms do not leave us with an orthogonal design. The use of mixture priors for selection in the multivariate regression setup has been investigated by Brown et al. (1998a,b).

4. SELECTING WAVELET COEFFICIENTS

4.1 Posterior Distribution of γ

The posterior distribution of γ given the data, π(γ | Y, Z), assigns a posterior probability to each γ-vector and thus to each possible subset of predictors (wavelet coefficients). This posterior arises from the combination of a likelihood that gives great weight to subsets explaining a high proportion of the variation in the responses Y and a prior for γ that penalizes large subsets. It can be computed by integrating out α, B, and Σ from the joint posterior distribution of these parameters and γ given the data. With the vague (h → ∞) prior for α, this parameter is essentially estimated by the mean Ȳ in the calibration data (see Smith 1973), and to simplify the formulas that follow, we now assume that the columns of Y have been centered. (Full details of the prior to posterior analysis have been given by Brown et al. 1998b, who also considered other prior structures.) After some manipulation, we have

π(γ | Y, Z) ∝ g(γ) = |H̃_γ|^{−q/2} |Z′_γ Z_γ + H̃_γ^{−1}|^{−q/2} |Q_γ|^{−(n+δ+q−1)/2} π(γ),   (11)

where Q_γ = Q + Y′Y − Y′Z_γ(Z′_γ Z_γ + H̃_γ^{−1})^{−1}Z′_γ Y and Z_γ is Z with the columns for which γ_j = 0 deleted. Care is needed in computing (11); the alternative forms discussed later may be useful.

A simplifying feature of this setup is that all of the computations can be formulated as least squares problems with modified Y and Z matrices. By writing

Z̃_γ = [ Z_γ H̃_γ^{1/2} ; I_{p_γ} ],   Ỹ = [ Y ; 0 ],

where the semicolon denotes stacking of row blocks, H̃_γ^{1/2} is a matrix square root of H̃_γ, and p_γ is the number of 1's in γ, the relevant quantities entering into (11) can be computed as

|H̃_γ| |Z′_γ Z_γ + H̃_γ^{−1}| = |Z̃′_γ Z̃_γ|   (12)

and

Q_γ = Q + Ỹ′Ỹ − Ỹ′Z̃_γ(Z̃′_γ Z̃_γ)^{−1} Z̃′_γ Ỹ;   (13)

that is, Q_γ is given by Q plus the residual sum of products matrix from the least squares regression of Ỹ on Z̃_γ. The QR decomposition can then be used (see, e.g., Seber 1984, chap. 10, sec. 11b), which avoids "squaring" as in (12) and (13).
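To illustrate this computational device, the sketch below evaluates log g(γ) for a given γ through the augmented matrices Z̃_γ and Ỹ and a QR decomposition, following (11)-(13). It is schematic: the inputs (Z, Y, H̃, Q, δ, w) are assumed to be supplied by the caller, the prior π(γ) is taken to be the independent Bernoulli(w) prior of Section 3.3, and the numerical details follow standard practice rather than the authors' MATLAB code.

    import numpy as np
    from scipy.linalg import cholesky, qr

    def log_g(gamma, Z, Y, H_tilde, Q, delta, w):
        """Log of the (unnormalized) posterior g(gamma) in (11), computed via (12)-(13)."""
        n, q = Y.shape
        idx = np.flatnonzero(gamma)
        p_g = idx.size
        log_prior = p_g * np.log(w) + (len(gamma) - p_g) * np.log(1 - w)
        if p_g == 0:                          # null model: Q_gamma = Q + Y'Y
            Q_g = Q + Y.T @ Y
            return -0.5 * (n + delta + q - 1) * np.linalg.slogdet(Q_g)[1] + log_prior

        Zg = Z[:, idx]
        Hg = H_tilde[np.ix_(idx, idx)]
        Hg_half = cholesky(Hg, lower=True)    # Hg = Hg_half @ Hg_half.T

        # Augmented least squares problem: Z~ = [Zg Hg^{1/2} ; I], Y~ = [Y ; 0].
        Z_aug = np.vstack([Zg @ Hg_half, np.eye(p_g)])
        Y_aug = np.vstack([Y, np.zeros((p_g, q))])

        # Thin QR of Z~: |Z~'Z~| = prod(diag(R))^2, and the residual sum of
        # products matrix is Y~'Y~ - (Q1'Y~)'(Q1'Y~), giving (12) and (13).
        Q1, R = qr(Z_aug, mode='economic')
        QtY = Q1.T @ Y_aug
        Q_g = Q + Y_aug.T @ Y_aug - QtY.T @ QtY

        log_det = 2.0 * np.sum(np.log(np.abs(np.diag(R))))   # log|H~_g| + log|Zg'Zg + H~_g^{-1}|
        return (-0.5 * q * log_det
                - 0.5 * (n + delta + q - 1) * np.linalg.slogdet(Q_g)[1]
                + log_prior)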

4.2 Metropolis Search

Equation (11) gives the posterior probability of each of the 2^p different γ vectors and thus of each choice of wavelet coefficient subsets. What remains to do is to look for "good" wavelet components by computing these posterior probabilities. When p is much greater than about 25, too many subsets exist for this to be feasible. Fortunately, we can use simulation methods that will find the γ vectors with relatively high posterior probabilities. We can then quickly identify useful coefficients that have high marginal probabilities of γ_j = 1.

Here we use a Metropolis search, as suggested for model selection by Madigan and York (1995) and applied to variable selection for regression by Brown et al. (1998a), George and McCulloch (1997), and Raftery, Madigan, and Hoeting (1997). The search starts from a randomly chosen γ_0 and then moves through a sequence of further values of γ. At each step, the algorithm generates a new candidate γ by randomly modifying the current one. Two types of moves are used:

- Add or delete a component by choosing at random one component in the current γ and changing its value. This move is chosen with probability φ.
- Swap two components by choosing independently at random a 0 and a 1 in the current γ and changing both of them. This move is chosen with probability 1 − φ.

The new candidate model γ* is accepted with probability

min{ g(γ*) / g(γ), 1 }.   (14)

Thus a more probable γ* is always accepted, and a less probable one may be accepted. There is scope for further ingenuity in designing the sequence of random moves. For example, moves that add or subtract or swap two or three or more at a time, or a combination of these, may be useful.

The sequence of γ's generated by the search is a realization of a Markov chain, and the choice of acceptance probabilities ensures that the equilibrium distribution of this chain is the distribution given by (11). In typical uses of such schemes, the realizations are monitored after a suitable burn-in period to verify that they appear stationary. Here we have a closed form for the posterior distribution and are using the chain simply to explore this distribution. Thus we have not been so concerned about strict convergence of the Markov chain. Following Brown et al. (1998a), we adopt a strategy of running the chain from a number of different starting points (four here) and looking at the four marginal distributions provided by the computed g(·) values of the visited γ's. We also look for good indication of mixing and explorations with returns. Because we know the relative probabilities, we do not need to worry about using a burn-in period.

Note that, given the form of the acceptance probability (14), γ-vectors with high posterior probability have a greater chance of appearing in the sequence. Thus we might expect that a long run of such a chain would visit many of the best subsets.
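A compact sketch of this search is given below. It reuses the log_g function sketched after (13) and is illustrative only: the starting rule, the number of iterations, and the move probability φ are placeholders, and no attempt is made to reproduce the bookkeeping of the authors' implementation.

    import numpy as np

    def metropolis_search(Z, Y, H_tilde, Q, delta, w, n_iter=10000, phi=0.5, seed=0):
        """Metropolis search over gamma with add/delete and swap moves (Section 4.2)."""
        rng = np.random.default_rng(seed)
        p = Z.shape[1]
        gamma = rng.random(p) < w                       # random starting subset
        current = log_g(gamma, Z, Y, H_tilde, Q, delta, w)
        visited = {}                                    # distinct models and their log g

        for _ in range(n_iter):
            cand = gamma.copy()
            ones, zeros = np.flatnonzero(cand), np.flatnonzero(~cand)
            if rng.random() < phi or ones.size == 0 or zeros.size == 0:
                j = rng.integers(p)                     # add or delete one component
                cand[j] = ~cand[j]
            else:                                       # swap a randomly chosen 0 and 1
                cand[rng.choice(ones)] = False
                cand[rng.choice(zeros)] = True
            proposal = log_g(cand, Z, Y, H_tilde, Q, delta, w)
            if np.log(rng.random()) < proposal - current:   # accept w.p. min{g*/g, 1}
                gamma, current = cand, proposal
            visited[tuple(np.flatnonzero(gamma))] = current
        return visited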

5. PREDICTION

Suppose now that we wish to predict Y_f, an m × q matrix of further Y-vectors, given the corresponding X-vectors X_f (m × p). First, we treat X_f exactly as the training data have been treated, by subtracting the training data means and transforming to wavelet coefficients Z_f (m × p). The model for Y_f, following the model for the training data (6), is

Y_f − 1_m α′ − Z_f B̃ ∼ N(I_m, Σ).   (15)

If we believe our Bayesian mixture model, then logically we should apply the same latent structure model to prediction as well as to training. This has the practical appeal of providing averaging over a range of likely models (Madigan and Raftery 1994).

Results of Brown et al. (1998b) demonstrate that, with the columns of Y_f centered using the mean Ȳ from the training set, the expectation of the predictive distribution p(Y_f | γ, Z, Y) is given by Z_{fγ} B̂_γ, with

B̂_γ = (Z′_γ Z_γ + H̃_γ^{−1})^{−1} Z′_γ Y = H̃_γ^{1/2} (Z̃′_γ Z̃_γ)^{−1} Z̃′_γ Ỹ.   (16)

Averaging over the posterior distribution of γ gives

Ŷ_f = Σ_γ Z_{fγ} B̂_γ π(γ | Z, Y),   (17)


and we might choose to approximate this by some restricted set of γ values, perhaps the r most likely values from the Metropolis search.
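The model-averaged prediction (17) can be assembled directly from the output of such a search. The sketch below does this for the r most probable visited models, computing each B̂_γ in the stabilized form of (16); as before it is an illustration with assumed inputs (including the visited-model dictionary produced by the Metropolis sketch above), not the authors' code.

    import numpy as np
    from scipy.linalg import cholesky

    def bma_predict(Z_f, Z, Y, H_tilde, visited, r=500):
        """Approximate (17): average Z_f,gamma @ B_hat_gamma over the r best visited models."""
        models = sorted(visited.items(), key=lambda kv: kv[1], reverse=True)[:r]
        log_gs = np.array([lg for _, lg in models])
        probs = np.exp(log_gs - log_gs.max())
        probs /= probs.sum()                  # normalize g over the retained models

        Y_hat = np.zeros((Z_f.shape[0], Y.shape[1]))
        for (idx, _), prob in zip(models, probs):
            idx = np.array(idx, dtype=int)
            if idx.size == 0:                 # null model predicts the (centered) mean, i.e. 0
                continue
            Zg, Zfg = Z[:, idx], Z_f[:, idx]
            Hg = H_tilde[np.ix_(idx, idx)]
            L = cholesky(Hg, lower=True)
            # B_hat = (Zg'Zg + Hg^{-1})^{-1} Zg'Y = L (L'Zg'Zg L + I)^{-1} L' Zg'Y.
            A = L.T @ Zg.T @ Zg @ L + np.eye(idx.size)
            B_hat = L @ np.linalg.solve(A, L.T @ Zg.T @ Y)
            Y_hat += prob * (Zfg @ B_hat)
        return Y_hat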

6. APPLICATION TO NEAR-INFRARED SPECTROSCOPY OF BISCUIT DOUGHS

We now apply the methodology developed earlier to the spectroscopic calibration problem described in Section 1.1. First, however, we report the results of some other analyses of these data.

6.1 Analysis by Standard Methods

For all of the analyses carried out here, both compositional and spectral data were centered by subtracting the training set means from the training and validation data. The responses, but not the spectral data, were also scaled to give each of the four variables unit variance in the training set.

Mean squared prediction errors were converted back to the original scale by multiplying them by the training sample variances. This preprocessing of the data makes no difference to the standard analyses, which treat the response variables separately, but it simplifies the prior specification for our multivariate wavelet analysis.

Osborne et al. (1984) derived calibrations by multiple regression, using various stepwise procedures to select wavelengths for each constituent separately. The mean squared errors of predictions on the 39 validation samples for their calibrations are reported in the first row of Table 1.

Brown et al. (1999) also selected small numbers of wavelengths to find calibration equations for this example. Their Bayesian decision theory approach differed from the approach of Osborne in being multivariate (i.e., in trying to find one small subset of wavelengths suitable for predicting all four constituents) and in using a more extensive search using simulated annealing. The results for this alternative wavelength selection approach are given in the second row of Table 1.

For the purpose of comparison, we carried out two other analyses using partial least squares regression (PLS) and principal components regression (PCR). These approaches, both of which construct factors from the full spectral data and then regress constituents on the factors, are very much the standard tools in NIR spectroscopy (see, e.g., Geladi and Martens 1996; Geladi, Martens, Hadjiiski, and Hopke 1996). For the computations, we used the PLS Toolbox 2.0 of Wise and Gallagher (Eigenvector Research, Manson, WA). Although there are multivariate versions of PLS, we took the usual approach of calibrating for each constituent separately.

Table 1. Mean Squared Errors of Prediction on the 39 Biscuit Dough Pieces in the Validation Set Using Four Calibration Methods

Method            Fat     Sugar   Flour   Water
Stepwise MLR      0.044   1.188   0.722   0.221
Decision theory   0.076   0.566   0.265   0.176
PLS               0.151   0.583   0.375   0.105
PCR               0.160   0.614   0.388   0.106

The number of factors used, selected in each case by cross-validation on the training set, was five for each of the PLS equations and six for each of the PCR equations. The results given in rows three and four of Table 1 show that, as usual, there is little to choose between the two methods. These results are for PLS and PCR using the reduced 256-point spectrum that we used for our wavelet analysis. Repeating the analyses using the original 700-point spectrum yielded results very similar to those for PLS and somewhat worse than those reported for PCR.

Because shortly we need to specify a prior distribution on regression coefficients, it is interesting to examine those resulting from a factor-type approach. Combining the coefficients for the regression of constituent on factor scores with the loadings that produce scores from the original spectral variables gives the coefficient vector that would be applied to a measured spectrum to give a prediction. Figure 2 plots these vectors for the PLS equations for the four constituents, showing the smoothness in the 256 coefficients that we attempt to reflect in our prior distribution for B.

6.2 Wavelet Transforms of Spectra

To each spectrum we apply a wavelet transform, converting it to a set of 256 wavelet coefficients. We used the MATLAB toolbox Wavbox 4.3 (Taswell 1995) for this step. Using spectra with 2^m (m integer) points is not a real restriction here; methods exist to overcome the limitation, allowing the DWT to be applied on any length of data. We used MP(4) (Daubechies 1992, p. 194) wavelets with four vanishing moments. The Daubechies wavelets have compact support, important for good localization, and a maximum number of vanishing moments for a given smoothness. A large number of vanishing moments leads to high compressibility, because the fine-scale wavelet coefficients are essentially 0 where the functions are smooth. On the other hand, the support of the wavelets increases with an increasing number of vanishing moments, so there is a trade-off with the localization properties. Some rather limited exploration suggested that the chosen wavelet family is a good compromise for these data.

The graphs on the right side of Figure 1 show the wavelet transforms corresponding to the three NIR spectra in the left column. Coefficients are ordered from coarsest to finest.

6.3 Prior Settings

We need to specify the values of H, δ, and Q in (3), (4), and (5) and the prior probability w that an element of γ is 1. We wish to put in weak but proper prior information about Σ. We choose δ = 3, because this is the smallest integer value such that the expectation of Σ, E(Σ) = Q/(δ − 2), exists. The scale matrix Q is chosen as Q = k I_q, with k = 0.05, comparable in size to the expected error variances of the standardized Y given X. With δ small, the choice of Q is unlikely to be critical.

Much more likely to be influential are the choices of H and w in the priors for B and γ. To reflect the smoothness in the coefficients B, as exemplified in Figure 2, while keeping the form of H simple, we have taken H to be the variance matrix of a first-order autoregressive process, with h_ij = σ² ρ^{|i−j|}. We derived the values σ² = 254 and ρ = 0.32 by maximizing a type II likelihood (Good 1965).


[Figure 2. Coefficient Vectors From 5-Factor PLS Equations. Four panels (Fat, Sugar, Flour, Water) showing the coefficient versus wavelength, 1400-2400 nm.]

Integrating B and Σ from the joint distribution given by (2), (3), (4), and (5) for the regression on the full untransformed spectra, with h → ∞ and B_0 = 0, we get

f ∝ |K|^{−q/2} |Q|^{(δ+q−1)/2} |Q + Y′K^{−1}Y|^{−(δ+n+q−1)/2},   (18)

where

K = I_n + XHX′

and the columns of Y are centered as in Section 4.1. With k = 0.05 and δ = 3 already fixed, (18) is a function, via H, of σ² and ρ. We used for our prior the values of these hyperparameters that maximize (18). Possible underestimation due to the use of the full spectra was taken into account by multiplying the estimate of σ² by the inflation factor 256/20, reflecting our prior belief as to the expected number of included coefficients.
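A sketch of this empirical Bayes step follows: it builds the AR(1) prior covariance H from (σ², ρ), evaluates the log of the type II likelihood (18), and maximizes it over a grid. The grid search and all data inputs are placeholders chosen for transparency; the article does not state how the maximization was carried out.

    import numpy as np

    def log_type2_likelihood(sigma2, rho, X, Y, Q, delta):
        """Log of (18), up to an additive constant, as a function of sigma^2 and rho."""
        n, q = Y.shape
        p = X.shape[1]
        H = sigma2 * rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
        K = np.eye(n) + X @ H @ X.T
        S = Q + Y.T @ np.linalg.solve(K, Y)
        return (-0.5 * q * np.linalg.slogdet(K)[1]
                + 0.5 * (delta + q - 1) * np.linalg.slogdet(Q)[1]
                - 0.5 * (delta + n + q - 1) * np.linalg.slogdet(S)[1])

    def maximize_type2(X, Y, Q, delta, sigma2_grid, rho_grid):
        """Crude grid maximization of (18) over (sigma^2, rho)."""
        best = (None, None, -np.inf)
        for s2 in sigma2_grid:
            for r in rho_grid:
                ll = log_type2_likelihood(s2, r, X, Y, Q, delta)
                if ll > best[2]:
                    best = (s2, r, ll)
        return best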

Figure 3 shows the diagonal elements of the matrix H̃ = WHW′ implied by our choice of H. The variance matrix of the ith column of B̃ in (8) is σ_ii H̃, so this plot shows the pattern in the prior variances of the regression coefficients when the predictors are wavelet coefficients. The wavelet coefficients are ordered from coarsest to finest, so the decreasing prior variance means that there will be more shrinkage at the finer levels. This is a logical consequence of the smoothness that we have tried to express in the prior distribution. The spikes in the plot at the level transitions are from the boundary condition problems of the discrete wavelet transform.

We know from experience that good predictions can usually be obtained with 10 or so selected spectral points in examples of this type. Having no previous experience in selecting wavelet coefficients, and wanting to induce a similarly "small" model without constraining the possibilities too severely, we chose w in the prior for γ so that the expected model size was pw = 20. We have given equal prior probability here to coefficients at the different levels. Although we considered the possibility of varying w in blocks, we had no strong prior opinions about which levels were likely to provide the most useful coefficients, apart from a suspicion that neither the coarsest nor the finest levels would feature strongly.

6.4 Implementing the Metropolis Search

We chose widely different starting points for the four Metropolis chains, by setting to 1 the first 1, the first 20, the first 128 (i.e., half), and all elements of γ. There were 100,000 iterations in each run, where an iteration comprised either adding/deleting or swapping as described in Section 4.2. The two moves were chosen with equal probability φ = 1/2. Acceptance of the possible move by (14) was by generation of a Bernoulli random variable. Computation of g(γ) and g(γ*) was done using the QR decomposition of MATLAB.

For each chain, we recorded the visited γ's and their corresponding relative probability g(γ). No burn-in was necessary, as relatively unlikely γ's would automatically be down-weighted in our analysis. There were approximately 38,000-40,000 successful moves for each of the four runs. Of these moves, around 95% were swaps. The relative probabilities of the set of distinct visited γ were then normalized to 1 over this set. Figure 4 plots the marginal probabilities for components of γ, P(γ_j = 1), j = 1, …, 256. The spikes show where regressor variables have been included in subsets with high probability.
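The sketch below shows how such marginal inclusion probabilities can be computed from the search output: the relative probabilities g(γ) of the distinct visited models are normalized over that set, and P(γ_j = 1) is estimated by summing the normalized probabilities of the visited models that include coefficient j. It assumes the visited-model dictionary produced by the earlier Metropolis sketch.

    import numpy as np

    def marginal_inclusion_probabilities(visited, p):
        """Estimate P(gamma_j = 1), j = 1, ..., p, from the distinct visited models."""
        models = list(visited.items())
        log_gs = np.array([lg for _, lg in models])
        probs = np.exp(log_gs - log_gs.max())
        probs /= probs.sum()                  # normalize over the set of visited models
        marg = np.zeros(p)
        for (idx, _), prob in zip(models, probs):
            marg[list(idx)] += prob           # add the model's probability to each included j
        return marg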


[Figure 3. Diagonal Elements of H̃ = WHW′, With h̃_ii Plotted Against i. (Prior variance of the regression coefficient versus wavelet coefficient index.)]

[Figure 4. Marginal Probabilities of Components of γ for Four Runs. Panels (i)-(iv): probability P(γ_j = 1) versus wavelet coefficient index.]


[Figure 5. Plots in Sequence Order for Run (iii): (a) the number of 1's; (b) log relative probabilities. Both plotted against iteration number over the 100,000 iterations.]

For one of the runs, (iii), Figure 5 gives two more plots: the number of 1's and the log-relative probabilities, log(g(γ)), of the visited γ, plotted over the 100,000 iterations. The other runs produced very similar plots, quickly moving toward models of similar dimensions and posterior probability values.

Despite the very different starting points, the regions explored by the four chains have clear similarities, in that plots of marginals are overall broadly similar. However, there are also clear differences, with some chains making frequent use of variables not picked up by others. Although we would not claim convergence, all four chains arrive at some similarly "good," albeit different, subsets. With mutually correlated predictor variables, as we have here, there will always tend to be many solutions to the problem of finding the best predictors. We adopt the pragmatic stance that all we are trying to do is identify some of the good solutions. If we happen to miss some other good ones, then this is unfortunate but not disastrous.

6.5 Results

We pooled the distinct γ's visited by the four chains, normalized the relative probabilities, and ordered them according to probability. Then we predicted the further 39 unseen samples using Bayes model averaging. Mean squared prediction errors, converted to the original scale, were 0.063, 0.449, 0.348, and 0.050. These results use the best 500 models, accounting for almost 99% of the total visited probability and using 219 wavelet coefficients. They improve considerably on all of the standard methods reported in Table 1. The single best subset among the visited ones had 10 coefficients, accounted for 9% of the total visited probability, and least squares predictions gave mean squared errors of 0.059, 0.466, 0.351, and 0.047.

Examining the scales of the selected coefficients is interesting. The model making the best predictions used coefficients (10, 11, 14, 17, 33, 34, 82, 132, 166, 255), which include (0, 0, 0, 37, 6, 6, 1, 2)% of all of the coefficients at the eight levels from the coarsest to the finest scales. The most useful coefficients seem to be in the middle of the scale, with some more toward the finer end. Note that the very coarsest coefficients, which would be essential to any reconstruction of the spectra, are not used, despite the smoothness implicit in H and the consequent increased shrinkage associated with finer scales, as seen in Figure 3.

Some idea of the locations of the wavelet coefficients selected by the modal model can be obtained from Figure 6. Here we used the linearity of the wavelet transform and the linearity of the prediction equation to express the prediction equation as a vector of coefficients to be applied to the original spectral data. This vector is obtained by applying the inverse wavelet transform to the columns of the matrix of the least squares estimates of the regression coefficients. Because selection has discarded unnecessary details, most of the coefficients are very close to 0. We thus display only the 1600-1800 nm range, which permits a better view of the features of coefficients in the range of interest. These coefficient vectors can be compared directly with those shown in Figure 2. They are applied to the same spectral data to produce predictions, and despite the different scales, some comparisons are possible. For example, the two coefficient vectors for fat can be


easily interpreted. Fats and oils have absorption bands at around 1730 and 1765 nm (Osborne et al. 1993), and strong positive coefficients in this region are the major features of both plots. The wavelet-based equation (Fig. 6) is simpler, the selection having discarded much unnecessary detail. The other coefficient vectors show less resemblance and are also harder to interpret. One thing to bear in mind in interpreting these plots is that the four constituents add up to 100%. Thus it should not be surprising to find the fat measurement peak in all eight plots, nor that the coefficient vectors for sugar and flour are so strongly inversely related.

Finally, we comment briefly on results that we obtained by investigating logistic transformations of the data. Our response variables are in fact percentages and are constrained to sum up to 100; thus they lie on a simplex rather than in the full q-dimensional space. Sample ranges are 18 ± 3 for fat, 17 ± 7 for sugar, 51 ± 6 for flour, and 14 ± 3 for water.

Following Aitchison (1986), we transformed the original Y's into log ratios of the form

Z_1 = ln(Y_1/Y_3),   Z_2 = ln(Y_2/Y_3),   and   Z_3 = ln(Y_4/Y_3).   (19)

The choice of the third ingredient for the denominator was the most natural, in that flour is the major constituent and also because ingredients in recipes often are expressed as a ratio to flour content. We centered and scaled the Z variables and recomputed empirical Bayes estimates for σ² and ρ. (The other hyperparameters were not affected by the logistic transformation.) We then ran four Metropolis chains using the starting points used previously with the data in the original scale.

[Figure 6. Coefficient Vectors From the "Best" Wavelet Equations. Four panels (Fat, Sugar, Flour, Water) showing the coefficient versus wavelength, 1400-1800 nm.]

Diagnostic plots and plots of the marginals appeared very similar to those of Figures 4 and 5. The four chains visited 151,183 distinct models. We finally computed Bayes model averaging and least squares predictions with the best model, unscaling the predicted values and transforming them back to the original scale as

Y_i = 100 exp(Z_i) / (Σ_{j=1}^{3} exp(Z_j) + 1),   i = 1, 2,

Y_3 = 100 / (Σ_{j=1}^{3} exp(Z_j) + 1),   and   Y_4 = 100 exp(Z_3) / (Σ_{j=1}^{3} exp(Z_j) + 1).

The best 500 visited models accounted for 99.4% of the total visited probability, used 214 wavelet coefficients, and gave Bayes mean squared prediction errors of 0.058, 0.819, 0.457, and 0.080. The single best subset among the visited ones had 10 coefficients, accounted for 15.8% of the total visited probability, and gave least squares prediction errors of 0.091, 0.793, 0.496, and 0.119. The logistic transformation does not seem to have a positive impact on the predictive performance, and overall the simpler linear model seems to be adequate. Our approach gives Bayes predictions on the original scale satisfying the constraint of summing exactly to 100. This stems from the conjugate prior, linearity in Y, zero mean for the prior distribution of the regression coefficients, and vague prior for the intercept. It is easily seen that, with X centered, α̂′1 = 100 and B̂1 = 0 for either least squares or Bayes estimates, and hence all predictions sum to 100. In addition, had we imposed a singular distribution to deal with the four responses summing to 100, then a proper analysis would suggest eliminating one component, but then the predictions for the remaining three components would be as we have derived. Thus our analysis is Bayes for the singular problem, even though superficially it ignores this aspect. This does not address the positivity constraint, but the composition variables are all so far from the boundaries compared to residual error that this is not an issue. Our desire to stick with the original scale is supported by the Beer-Lambert law, which linearly relates absorbance to composition. The logistic transform distorts this linear relationship.
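For completeness, here is a small sketch of this transformation pair. It assumes the four response columns are ordered fat, sugar, flour, water (so that flour, Y_3, is the denominator in (19)); the forward map follows (19), and the inverse map reverses it so that back-transformed predictions again sum exactly to 100.

    import numpy as np

    def to_log_ratios(Y):
        """Columns of Y: fat, sugar, flour, water (percentages summing to 100) -> (19)."""
        fat, sugar, flour, water = Y[:, 0], Y[:, 1], Y[:, 2], Y[:, 3]
        return np.column_stack([np.log(fat / flour),
                                np.log(sugar / flour),
                                np.log(water / flour)])

    def from_log_ratios(Z):
        """Inverse of (19): recover percentages that sum exactly to 100."""
        denom = np.exp(Z).sum(axis=1) + 1.0
        fat = 100.0 * np.exp(Z[:, 0]) / denom
        sugar = 100.0 * np.exp(Z[:, 1]) / denom
        flour = 100.0 / denom
        water = 100.0 * np.exp(Z[:, 2]) / denom
        return np.column_stack([fat, sugar, flour, water])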

7. DISCUSSION

In specifying the prior parameters for our example, we made a number of arbitrary choices. In particular, the choice of H as the variance matrix of an autoregressive process is a rather crude representation of the smoothness in the coefficients. It might be interesting to try to model this in more detail. However, although there may be room for improvement, the simple structure used here does appear to work well.

Another area for future investigation is the use of more sophisticated wavelet systems in this context. The additional flexibility of wavelet packets (Coifman, Meyer, and Wickerhauser 1992) or m-band wavelets (Mallet, Coomans, Kautsky, and De Vel 1997) might lead to improved predictions or might be just an unnecessary complication.

[Received October 1998. Revised November 2000.]

REFERENCES

Aitchison, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman and Hall.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Fearn, T., and Vannucci, M. (1999), "The Choice of Variables in Multivariate Regression: A Bayesian Non-Conjugate Decision Theory Approach," Biometrika, 86, 635-648.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Journal of Chemometrics, 12, 173-182.
Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627-641.
Carlin, B. P., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473-484.
Chipman, H. (1996), "Bayesian Variable Selection With Related Predictors," Canadian Journal of Statistics, 24, 17-36.
Clyde, M., DeSimone, H., and Parmigiani, G. (1996), "Prediction via Orthogonalized Model Mixing," Journal of the American Statistical Association, 91, 1197-1208.
Clyde, M., and George, E. I. (2000), "Flexible Empirical Bayes Estimation for Wavelets," Journal of the Royal Statistical Society, Ser. B, 62, 681-698.
Coifman, R. R., Meyer, Y., and Wickerhauser, M. V. (1992), "Wavelet Analysis and Signal Processing," in Wavelets and Their Applications, eds. M. B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, S. Mallat, Y. Meyer, and L. Raphael, Boston: Jones and Bartlett, pp. 153-178.
Cowe, I. A., and McNicol, J. W. (1985), "The Use of Principal Components in the Analysis of Near-Infrared Spectra," Applied Spectroscopy, 39, 257-266.
Daubechies, I. (1992), Ten Lectures on Wavelets (Vol. 61, CBMS-NSF Regional Conference Series in Applied Mathematics), Philadelphia: Society for Industrial and Applied Mathematics.
Dawid, A. P. (1981), "Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application," Biometrika, 68, 265-274.
Donoho, D., Johnstone, I., Kerkyacharian, G., and Picard, D. (1995), "Wavelet Shrinkage: Asymptopia?" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 301-369.
Geladi, P., and Martens, H. (1996), "A Calibration Tutorial for Spectral Data. Part 1: Data Pretreatment and Principal Component Regression Using Matlab," Journal of Near Infrared Spectroscopy, 4, 225-242.
Geladi, P., Martens, H., Hadjiiski, L., and Hopke, P. (1996), "A Calibration Tutorial for Spectral Data. Part 2: Partial Least Squares Regression Using Matlab and Some Neural Network Results," Journal of Near Infrared Spectroscopy, 4, 243-255.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339-373.
Geweke, J. (1996), "Variable Selection and Model Comparison in Regression," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Clarendon Press, pp. 609-620.
Good, I. J. (1965), The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, MA: MIT Press.
Hrushka, W. R. (1987), "Data Analysis: Wavelength Selection Methods," in Near-Infrared Technology in the Agricultural and Food Industries, eds. P. Williams and K. Norris, St. Paul, MN: American Association of Cereal Chemists, pp. 35-55.
Leamer, E. E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580-587.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535-1546.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215-232.
Mallat, S. G. (1989), "Multiresolution Approximations and Wavelet Orthonormal Bases of L²(R)," Transactions of the American Mathematical Society, 315, 69-87.
Mallet, Y., Coomans, D., Kautsky, J., and De Vel, O. (1997), "Classification Using Adaptive Wavelets for Feature Extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1058-1066.
McClure, W. F., Hamid, A., Giesbrecht, F. G., and Weeks, W. W. (1984), "Fourier Analysis Enhances NIR Diffuse Reflectance Spectroscopy," Applied Spectroscopy, 38, 322-329.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression," Journal of the American Statistical Association, 83, 1023-1036.
Osborne, B. G., Fearn, T., and Hindle, P. H. (1993), Practical NIR Spectroscopy, Harlow, U.K.: Longman.
Osborne, B. G., Fearn, T., Miller, A. R., and Douglas, S. (1984), "Application of Near-Infrared Reflectance Spectroscopy to Compositional Analysis of Biscuits and Biscuit Doughs," Journal of the Science of Food and Agriculture, 35, 99-105.
Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997), "Bayesian Model Averaging for Linear Regression Models," Journal of the American Statistical Association, 92, 179-191.
Seber, G. A. F. (1984), Multivariate Observations, New York: Wiley.
Smith, A. F. M. (1973), "A General Bayesian Linear Model," Journal of the Royal Statistical Society, Ser. B, 35, 67-75.
Taswell, C. (1995), "Wavbox 4: A Software Toolbox for Wavelet Transforms and Adaptive Wavelet Packet Decompositions," in Wavelets and Statistics, eds. A. Antoniadis and G. Oppenheim, New York: Springer-Verlag, pp. 361-375.
Trygg, J., and Wold, S. (1998), "PLS Regression on Wavelet-Compressed NIR Spectra," Chemometrics and Intelligent Laboratory Systems, 42, 209-220.
Vannucci, M., and Corradi, F. (1999), "Covariance Structure of Wavelet Coefficients: Theory and Models in a Bayesian Perspective," Journal of the Royal Statistical Society, Ser. B, 61, 971-986.
Walczak, B., and Massart, D. L. (1997), "Noise Suppression and Signal Compression Using the Wavelet Packet Transform," Chemometrics and Intelligent Laboratory Systems, 36, 81-94.
Wold, S., Martens, H., and Wold, H. (1983), "The Multivariate Calibration Problem in Chemistry Solved by PLS," in Matrix Pencils, eds. A. Ruhe and B. Kagstrom, Heidelberg: Springer, pp. 286-293.

[Figure 1. Original Spectra (left column) and Wavelet Transforms (right column). Left panels: reflectance against wavelength (nm); right panels: DWT values against wavelet coefficient index.]

1.2 Standard Analyses

The most commonly used approaches to this calibration problem regress Y on X, with the linear form

Y = XB + E

being justified either by appeals to the Beer-Lambert law (Osborne et al. 1993) or on the grounds that it works in practice. In Section 6 we also investigate logistic transformations of the responses, showing that overall their impact on the prediction performance of the model is not beneficial.

The problem is not straightforward, because there are many more predictor variables (256) than training samples (39) in our example. The most commonly used methods for overcoming this difficulty fall into two broad classes: variable selection and factor-based methods. When scanning NIR instruments first appeared, the standard approach was to select (typically using a stepwise procedure) predictors at a small number of wavelengths and use multiple linear regression with this subset (Hrushka 1987). Later, this approach was largely superseded by methods that reduce the p spectral variables to scores on a much smaller number of factors and then regress on these scores. Two variants, principal components regression (PCR; Cowe and McNicol 1985) and partial least squares regression (PLS; Wold, Martens, and Wold 1983), are now widely used, with roughly equal effectiveness, as the standard approaches. The increasing power of computers has triggered renewed research interest in wavelength selection, now using computer-intensive search methods.

1.3 Selecting Wavelet Coefficients

The approach that we investigate here involves selecting variables, but we select from derived variables. The idea is to transform each spectrum into a set of wavelet coefficients, the whole of which would suffice to reconstruct the spectrum, and select good predictors from among these. There are good reasons for thinking that this approach might have advantages over selecting from the original variables.

In previous work (Brown, Fearn, and Vannucci 1999; Brown, Vannucci, and Fearn 1998a,b), we explored Bayesian approaches to the problem of selecting predictor variables in this multivariate regression context. We apply this methodology to wavelet selection here.

We know that wavelets can be used successfully for compression of curves like the spectra in our example, in the sense that the curves can be accurately reconstructed from a fraction of the full set of wavelet coefficients (Trygg and Wold 1998; Walczak and Massart 1997). Furthermore, the wavelet decomposition of the curve is a local one, so that if the information relevant to our prediction problem is contained in a particular part or parts of the curve, as it typically is, then this information will be carried by a very small number of wavelet coefficients. Thus we may expect selection to work. The ability of wavelets to model the curve at different levels of resolution gives us the option of selecting from our curve at a range of bandwidths. In some situations it may be advantageous to select a sharp band, as we do when we select one of the original variables; in other situations a broad band averaging over many adjacent points may be preferable. Selecting from the wavelet coefficients gives us both of these options automatically and in a very computationally efficient framework.

Not surprisingly, Fourier analysis has also been successfully used for data compression and denoising of NIR spectral data (McClure, Hamid, Giesbrecht, and Weeks 1984). For this purpose there is probably very little to choose between the Fourier and wavelet approaches. When it comes to selecting small numbers of coefficients for prediction, however, the local nature of wavelets makes them the obvious choice.

It is worth emphasizing that what we are doing here is not what is commonly described as wavelet regression. We are not fitting a smooth function to a single noisy curve by using either thresholding or shrinkage of wavelet coefficients (see Clyde and George 2000 for Bayesian approaches that use mixture modeling, and Donoho, Johnstone, Kerkyacharian, and Picard 1995). Unlike those authors, we have several curves, the spectra of 39 dough pieces, each of which is transformed to wavelet coefficients. We then select some of these wavelet coefficients (the same ones for each spectrum), not because they give a good reconstruction of the curves (which they do not) or to remove noise (of which there is very little to start with) from the curves, but rather because the selected coefficients are useful for predicting some other quantity measured on the dough pieces. One consequence of this is that it is not necessarily the large wavelet coefficients that will be useful; small coefficients in critical regions of the spectrum also may carry important predictive information. Thus the standard thresholding or shrinkage approaches are just not relevant to this problem.

2. PRELIMINARIES

2.1 Wavelet Bases and Wavelet Transforms

Wavelets are families of functions that can accurately describe other functions in a parsimonious way. In L^2(R), for example, an orthonormal wavelet basis is obtained as translations and dilations of a mother wavelet ψ as ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k), with j, k integers. A function f is then represented by a wavelet series as

f(x) = \sum_{j,k \in \mathbb{Z}} d_{j,k} \, \psi_{j,k}(x),   (1)

with wavelet coefficients d_{j,k} = \int f(x) \psi_{j,k}(x) \, dx describing features of the function f at the spatial location 2^{−j}k and frequency proportional to 2^j (or scale j).

Daubechies (1992) proposed a class of wavelet families that have compact support and a maximum number of vanishing moments for any given smoothness. These are used extensively in statistical applications.

Wavelets have been an extremely useful tool for the analysis and synthesis of discrete data. Let Y = (y_1, …, y_n)', n = 2^J, be a sample of a function at equally spaced points. This vector of observations can be viewed as an approximation to the function at the fine scale J. A fast algorithm, the discrete wavelet transform (DWT), exists for decomposing Y into a set of wavelet coefficients (Mallat 1989) in only O(n) operations. The DWT operates in practice by means of linear recursive filters. For illustration purposes, we can write it in matrix form as Z = WY, where W is an orthogonal matrix corresponding to the discrete wavelet transform and Z is a vector of wavelet coefficients describing features of the function at scales from the fine J − 1 to a coarser one, say J − r. An algorithm for the inverse construction also exists.

Wavelet transforms can be computed very rapidly and have good compression properties. Because they are localized in both time and frequency, wavelets have the ability to represent many classes of functions in a sparse form by describing important features with few coefficients. (For a general exposition of wavelet theory, see Daubechies 1992.)
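As a minimal illustration of the transform just described (a sketch only: the PyWavelets package and the "db4" filter are assumptions of this example, not the software used by the authors), a 256-point curve can be decomposed into 256 coefficients and reconstructed exactly:

```python
# Hedged sketch: full DWT of one 256-point curve using PyWavelets.
import numpy as np
import pywt

x = np.random.randn(256)                    # stand-in for one discretized curve

# Decompose to the default maximum depth; with mode="periodization" the
# concatenated output has exactly 256 coefficients, ordered coarse to fine.
coeffs = pywt.wavedec(x, "db4", mode="periodization")
z = np.concatenate(coeffs)                  # the coefficient vector Z = W y

# The transform is orthogonal, so the curve is recovered exactly.
x_back = pywt.waverec(coeffs, "db4", mode="periodization")
assert np.allclose(x, x_back)
```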

2.2 Matrix-Variate Distributions

In what follows, we use notation for matrix-variate distributions due to Dawid (1981). We write

V - M \sim \mathcal{N}(\Gamma, \Sigma)

when the random matrix V has a matrix-variate normal distribution with mean M and covariance matrices γ_ii Σ and σ_jj Γ for its ith row and jth column. Such a V could be generated as V = M + A'UB, where M, A, and B are fixed matrices such that A'A = Γ and B'B = Σ, and U is a random matrix with independent standard normal entries. This notation has the advantage of preserving the matrix structure instead of reshaping V as a vector. It also makes for much easier formal Bayesian manipulation.
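A small numpy sketch (illustrative only, with arbitrary dimensions) generates such a draw directly from the definition V = M + A'UB:

```python
# Hedged sketch: one draw from the matrix-variate normal V - M ~ N(Gamma, Sigma).
import numpy as np

rng = np.random.default_rng(0)
n, q = 5, 3
M = np.zeros((n, q))
Gamma = np.eye(n)                      # row covariance scale
Sigma = 0.5 * np.eye(q) + 0.5          # column covariance scale (equicorrelated)

A = np.linalg.cholesky(Gamma).T        # A'A = Gamma
B = np.linalg.cholesky(Sigma).T        # B'B = Sigma
U = rng.standard_normal((n, q))        # independent standard normal entries
V = M + A.T @ U @ B                    # a draw with the stated covariance structure
```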

The other notation that we use is

W \sim \mathcal{IW}(\delta; \Sigma)

for a random matrix W with an inverse Wishart distribution with scale matrix Σ and shape parameter δ. The shape parameter differs from the more conventional degrees of freedom, again making for very easy Bayesian manipulations. With U and B defined as earlier, and with U as n × p with n > p, W = B'(U'U)^{−1}B has an inverse Wishart distribution with shape parameter δ = n − p + 1 and scale matrix Σ. The expectation of W exists for δ > 2 and is then Σ/(δ − 2). (More details of these notations and a corresponding form for the matrix-variate T can be found in Brown 1993, App. A, or Dawid 1981.)
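A quick Monte Carlo check of this expectation under the shape parameterization, again only a sketch with arbitrary values of q and δ, might look as follows:

```python
# Hedged sketch: verify E(W) = Sigma / (delta - 2) for W = B'(U'U)^{-1} B.
import numpy as np

rng = np.random.default_rng(1)
q, delta = 3, 5
n = delta + q - 1                      # so that delta = n - p + 1 with p = q
Sigma = np.eye(q)
B = np.linalg.cholesky(Sigma).T        # B'B = Sigma

draws = [B.T @ np.linalg.inv(U.T @ U) @ B
         for U in (rng.standard_normal((n, q)) for _ in range(20000))]
print(np.mean(draws, axis=0))          # close to Sigma / (delta - 2) = Sigma / 3
```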

3. MODELING

3.1 Multivariate Regression Model

The basic setup that we consider is a multivariate linear regression model with n observations on a q-variate response and p explanatory variables. Let Y denote the n × q matrix of observed values of the responses, and let X be the n × p matrix of predictor variables. Our special concern is with functional predictor data, that is, the situation in which each row of X is a vector of observations of a curve x(t) at p equally spaced points.

The standard multivariate normal regression model has, conditional on α, B, Σ, and X,

Y - 1_n \alpha' - X B \sim \mathcal{N}(I_n, \Sigma),   (2)

where 1_n is an n × 1 vector of 1's, α is a q × 1 vector of intercepts, and B = (β_1, …, β_q) is a p × q matrix of regression coefficients. Without loss of generality, we assume that the columns of X have been centered by subtracting their means.

The unknown parameters are α, B, and the q × q error covariance matrix Σ. A conjugate prior for this model is as follows. First, given Σ,

\alpha' - \alpha_0' \sim \mathcal{N}(h, \Sigma)   (3)

and, independently,

B - B_0 \sim \mathcal{N}(H, \Sigma).   (4)

The marginal distribution of Σ is then

\Sigma \sim \mathcal{IW}(\delta; Q).   (5)

Note that the priors on both α and B have covariances dependent on Σ, in a way that directly extends the univariate regression natural conjugate prior distributions.

In practice, we let h → ∞ to represent vague prior knowledge about α and take B_0 = 0, leaving the specification of H, δ, and Q to incorporate prior knowledge about our particular application.

3.2 Transformation to Wavelets

We now transform the predictor variables by applying to each row of X a wavelet transform, as described in Section 2.1. In matrix form, multiplying each row of X by the same matrix W is equivalent to multiplying X on the right side by W'.

The wavelet transform is orthogonal (i.e., W'W = I), and thus (2) can be written as

Y - 1_n \alpha' - X W' W B \sim \mathcal{N}(I_n, \Sigma).   (6)

We can now express the model in terms of wavelet transformations of the predictors as

Y - 1_n \alpha' - Z \tilde{B} \sim \mathcal{N}(I_n, \Sigma),   (7)

where Z = XW' is now a matrix of wavelet coefficients and B̃ = WB is a matrix of regression coefficients. The transformed prior on B̃, in the case of inclusion of all predictors, is

\tilde{B} \sim \mathcal{N}(\tilde{H}, \Sigma),   (8)

where H̃ = WHW', and the parameters α and Σ are unchanged by the orthogonal transformations, as are the priors (3) and (5).

In practice, wavelets exploit the recursive application of filters, and the W-matrix notation is more useful for explanation than for computation. Vannucci and Corradi (1999) proposed a fast recursive algorithm for computing quantities such as WHW'. Their algorithm has a useful link to the two-dimensional DWT (DWT2), making computations simple. The matrix WHW' can be computed from H with an O(n^2) algorithm. (For more details, see secs. 3.1 and 3.2 of Vannucci and Corradi 1999.)
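The link can be sketched as follows. Because W is orthogonal, applying the one-dimensional DWT to every row of H and then to every column of the result yields WHW', which is a two-dimensional DWT of H. The code below is only illustrative: it uses PyWavelets rather than the recursive filter algorithm of Vannucci and Corradi (1999), and the AR(1) values anticipate Section 6.3.

```python
# Hedged sketch: compute W H W' by row-wise and column-wise 1-D DWTs.
import numpy as np
import pywt

def dwt_matrix_rows(A, wavelet="db4"):
    """Apply the 1-D DWT to every row of A (i.e., compute A @ W.T)."""
    rows = [np.concatenate(pywt.wavedec(r, wavelet, mode="periodization"))
            for r in A]
    return np.vstack(rows)

def whw_t(H, wavelet="db4"):
    """Two-dimensional DWT of H, returning W @ H @ W.T."""
    HWt = dwt_matrix_rows(H, wavelet)          # H W'
    return dwt_matrix_rows(HWt.T, wavelet).T   # W (H W')

# Example: AR(1)-type prior covariance (values assumed, see Section 6.3).
p = 256
sigma2, rho = 254.0, 0.32
H = sigma2 * rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
H_tilde = whw_t(H)
prior_vars = np.diag(H_tilde)                  # pattern comparable with Figure 3
```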

3.3 A Framework for Variable Selection

To perform selection in the wavelet coefficient domain, we further elaborate the prior on B̃ by introducing a latent binary p-vector γ. The jth element of γ, γ_j, may be either 1 or 0, depending on whether the jth column of Z is or is not included in the model. When γ_j is unity, the covariance matrix of the corresponding row of B̃ is "large," and when γ_j is 0, the covariance matrix is a zero matrix. We have assumed that the prior expectation of B̃ is 0, and so γ_j = 0 effectively deletes the jth explanatory variable (or wavelet coefficient) from the model. This gives, conditional on γ,

\tilde{B}_\gamma \sim \mathcal{N}(\tilde{H}_\gamma, \Sigma),   (9)

where B̃_γ and H̃_γ are just B̃ and H̃ with the rows (and, in the case of H̃, columns) for which γ_j = 0 deleted. Under this prior, each row of B̃ is modeled as having a scale mixture of the type

[\tilde{B}]_{(j)} \sim (1 - \gamma_j) I_0 + \gamma_j \, \mathcal{N}(0, \tilde{h}_{jj} \Sigma),   (10)

with h̃_jj equal to the jth diagonal element of the matrix H̃ = WHW' and I_0 a distribution placing unit mass on the 1 × q zero vector. Note that the rows of B̃ are not independent.

A simple prior distribution π(γ) for γ takes the γ_j to be independent, with Pr(γ_j = 1) = w_j and Pr(γ_j = 0) = 1 − w_j, with hyperparameters w_j to be specified for j = 1, …, p. In our example we take all of the w_j equal to a common w, so that the number of nonzero elements of γ has a binomial distribution with expectation pw.

Mixture priors have been widely used for variable selection in the original model space, originally by Leamer (1978) and more recently by George and McCulloch (1997) and Mitchell and Beauchamp (1988) for the linear multiple regression case. Carlin and Chib (1995), Chipman (1996), and Geweke (1996), among others, concentrated on special features of these priors. Clyde, DeSimone, and Parmigiani (1996) used model mixing in prediction problems with correlated predictors, expressing the space of models in terms of an orthogonalization of the design matrix. Their methods are not directly applicable to our situation, because the wavelet transforms do not leave us with an orthogonal design. The use of mixture priors for selection in the multivariate regression setup has been investigated by Brown et al. (1998a,b).

4. SELECTING WAVELET COEFFICIENTS

4.1 Posterior Distribution of γ

The posterior distribution of γ given the data, π(γ | Y, Z), assigns a posterior probability to each γ-vector and thus to each possible subset of predictors (wavelet coefficients). This posterior arises from the combination of a likelihood that gives great weight to subsets explaining a high proportion of the variation in the responses Y and a prior for γ that penalizes large subsets. It can be computed by integrating out α, B, and Σ from the joint posterior distribution of these parameters and γ given the data. With the vague (h → ∞) prior for α, this parameter is essentially estimated by the mean Ȳ in the calibration data (see Smith 1973).

To simplify the formulas that follow, we now assume that the columns of Y have been centered. (Full details of the prior to posterior analysis have been given by Brown et al. 1998b, who also considered other prior structures.) After some manipulation, we have

\pi(\gamma \mid Y, Z) \propto g(\gamma) = \{ |\tilde{H}_\gamma| \, |Z_\gamma' Z_\gamma + \tilde{H}_\gamma^{-1}| \}^{-q/2} \, |Q_\gamma|^{-(n+\delta+q-1)/2} \, \pi(\gamma),   (11)

where Q_γ = Q + Y'Y − Y'Z_γ (Z_γ'Z_γ + H̃_γ^{−1})^{−1} Z_γ'Y and Z_γ is Z with the columns for which γ_j = 0 deleted. Care is needed in computing (11); the alternative forms discussed later may be useful.

A simplifying feature of this setup is that all of the computations can be formulated as least squares problems with modified Y and Z matrices. By writing

\tilde{Z}_\gamma = \begin{pmatrix} Z_\gamma \tilde{H}_\gamma^{1/2} \\ I_{p_\gamma} \end{pmatrix}, \qquad \tilde{Y} = \begin{pmatrix} Y \\ 0 \end{pmatrix},

where H̃_γ^{1/2} is a matrix square root of H̃_γ and p_γ is the number of 1's in γ, the relevant quantities entering into (11) can be computed as

|\tilde{H}_\gamma| \, |Z_\gamma' Z_\gamma + \tilde{H}_\gamma^{-1}| = |\tilde{Z}_\gamma' \tilde{Z}_\gamma|   (12)

and

Q_\gamma = Q + \tilde{Y}'\tilde{Y} - \tilde{Y}'\tilde{Z}_\gamma (\tilde{Z}_\gamma' \tilde{Z}_\gamma)^{-1} \tilde{Z}_\gamma' \tilde{Y};   (13)

that is, Q_γ is given by Q plus the residual sum of products matrix from the least squares regression of Ỹ on Z̃_γ. The QR decomposition can then be used (see, e.g., Seber 1984, chap. 10, sec. 11b), which avoids "squaring" as in (12) and (13).
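A minimal sketch of this computation (with assumed variable names; Python/numpy here stands in for the authors' MATLAB implementation) evaluates log g(γ) through the augmented matrices of (12) and (13), using a QR decomposition to supply both the determinant and the residual sum of products:

```python
import numpy as np

def log_g(gamma, Z, Y, H_tilde, Q, delta, w):
    """Log of g(gamma) in (11), up to an additive constant (hedged sketch).

    gamma : boolean vector of length p selecting wavelet coefficients.
    Z, Y  : column-centred wavelet coefficients (n x p) and responses (n x q).
    """
    n, q = Y.shape
    p = Z.shape[1]
    p_g = int(gamma.sum())
    if p_g == 0:
        Q_g = Q + Y.T @ Y
        log_det_aug = 0.0
    else:
        Zg = Z[:, gamma]
        Hg = H_tilde[np.ix_(gamma, gamma)]
        L = np.linalg.cholesky(Hg)                   # a square root of H_tilde_gamma
        Z_aug = np.vstack([Zg @ L, np.eye(p_g)])     # augmented Z of (12)
        Y_aug = np.vstack([Y, np.zeros((p_g, q))])
        Qr, R = np.linalg.qr(Z_aug)                  # avoids "squaring"
        log_det_aug = 2.0 * np.sum(np.log(np.abs(np.diag(R))))
        QtY = Qr.T @ Y_aug
        Q_g = Q + Y_aug.T @ Y_aug - QtY.T @ QtY      # (13)
    _, log_det_Qg = np.linalg.slogdet(Q_g)
    log_prior = p_g * np.log(w) + (p - p_g) * np.log(1.0 - w)
    return (-0.5 * q * log_det_aug
            - 0.5 * (n + delta + q - 1) * log_det_Qg
            + log_prior)
```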

4.2 Metropolis Search

Equation (11) gives the posterior probability of each of the 2^p different γ vectors, and thus of each choice of wavelet coefficient subsets. What remains to do is to look for "good" wavelet components by computing these posterior probabilities. When p is much greater than about 25, too many subsets exist for this to be feasible. Fortunately, we can use simulation methods that will find the γ vectors with relatively high posterior probabilities. We can then quickly identify useful coefficients that have high marginal probabilities of γ_j = 1.

Here we use a Metropolis search, as suggested for model selection by Madigan and York (1995) and applied to variable selection for regression by Brown et al. (1998a), George and McCulloch (1997), and Raftery, Madigan, and Hoeting (1997). The search starts from a randomly chosen γ_0 and then moves through a sequence of further values of γ. At each step, the algorithm generates a new candidate γ by randomly modifying the current one. Two types of moves are used:

• Add or delete a component, by choosing at random one component in the current γ and changing its value. This move is chosen with probability φ.

• Swap two components, by choosing independently at random a 0 and a 1 in the current γ and changing both of them. This move is chosen with probability 1 − φ.

The new candidate model γ* is accepted with probability

\min\left\{ \frac{g(\gamma^*)}{g(\gamma)}, \, 1 \right\}.   (14)

Thus a more probable γ* is always accepted, and a less probable one may be accepted. There is scope for further ingenuity in designing the sequence of random moves. For example, moves that add or subtract or swap two or three or more at a time, or a combination of these, may be useful.

The sequence of γ's generated by the search is a realization of a Markov chain, and the choice of acceptance probabilities ensures that the equilibrium distribution of this chain is the distribution given by (11). In typical uses of such schemes, the realizations are monitored after a suitable burn-in period to verify that they appear stationary. Here we have a closed form for the posterior distribution and are using the chain simply to explore this distribution. Thus we have not been so concerned about strict convergence of the Markov chain. Following Brown et al. (1998a), we adopt a strategy of running the chain from a number of different starting points (four here) and looking at the four marginal distributions provided by the computed g(·) values of the visited γ's. We also look for good indication of mixing and explorations with returns. Because we know the relative probabilities, we do not need to worry about using a burn-in period.

Note that, given the form of the acceptance probability (14), γ-vectors with high posterior probability have a greater chance of appearing in the sequence. Thus we might expect that a long run of such a chain would visit many of the best subsets.
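A compact sketch of such a search, reusing the hypothetical log_g function above and assumed argument names, is:

```python
import numpy as np

def metropolis_search(Z, Y, H_tilde, Q, delta, w, n_iter=100_000,
                      phi=0.5, seed=0):
    """Add/delete/swap Metropolis search over gamma (a hedged sketch)."""
    rng = np.random.default_rng(seed)
    p = Z.shape[1]
    gamma = rng.random(p) < w                        # random starting point
    log_g_cur = log_g(gamma, Z, Y, H_tilde, Q, delta, w)
    visited = {}
    for _ in range(n_iter):
        cand = gamma.copy()
        if rng.random() < phi or gamma.all() or not gamma.any():
            j = rng.integers(p)                      # add or delete one component
            cand[j] = ~cand[j]
        else:                                        # swap a 0 and a 1
            cand[rng.choice(np.flatnonzero(gamma))] = False
            cand[rng.choice(np.flatnonzero(~gamma))] = True
        log_g_cand = log_g(cand, Z, Y, H_tilde, Q, delta, w)
        if np.log(rng.random()) < log_g_cand - log_g_cur:   # acceptance rule (14)
            gamma, log_g_cur = cand, log_g_cand
        visited[tuple(np.flatnonzero(gamma))] = log_g_cur
    return visited                                   # distinct models and their log g
```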

5. PREDICTION

Suppose now that we wish to predict Y_f, an m × q matrix of further Y-vectors, given the corresponding X-vectors X_f (m × p). First, we treat X_f exactly as the training data have been treated, by subtracting the training data means and transforming to wavelet coefficients Z_f (m × p). The model for Y_f, following the model for the training data (6), is

Y_f - 1_m \alpha' - Z_f \tilde{B} \sim \mathcal{N}(I_m, \Sigma).   (15)

If we believe our Bayesian mixture model, then logically we should apply the same latent structure model to prediction as well as to training. This has the practical appeal of providing averaging over a range of likely models (Madigan and Raftery 1994).

Results of Brown et al. (1998b) demonstrate that, with the columns of Y_f centered using the mean Ȳ from the training set, the expectation of the predictive distribution p(Y_f | γ, Z, Y) is given by Z_{f,γ} B̂_γ, with

\hat{B}_\gamma = (Z_\gamma' Z_\gamma + \tilde{H}_\gamma^{-1})^{-1} Z_\gamma' Y = \tilde{H}_\gamma^{1/2} (\tilde{Z}_\gamma' \tilde{Z}_\gamma)^{-1} \tilde{Z}_\gamma' \tilde{Y}.   (16)

Averaging over the posterior distribution of γ gives

\hat{Y}_f = \sum_\gamma Z_{f,\gamma} \hat{B}_\gamma \, \pi(\gamma \mid Z, Y),   (17)

and we might choose to approximate this by some restricted set of γ values, perhaps the r most likely values from the Metropolis search.
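Continuing the hypothetical names from the earlier sketches, the model-averaged prediction (17) restricted to the r most probable visited models could be computed as follows:

```python
import numpy as np

def predict_bma(Zf, Z, Y, H_tilde, visited, r=500):
    """Model-averaged prediction (17) over the r most probable visited models."""
    top = sorted(visited.items(), key=lambda kv: kv[1], reverse=True)[:r]
    log_probs = np.array([lg for _, lg in top])
    weights = np.exp(log_probs - log_probs.max())
    weights /= weights.sum()                       # normalise over the retained set
    Yf_hat = np.zeros((Zf.shape[0], Y.shape[1]))
    for (idx, _), wgt in zip(top, weights):
        idx = list(idx)
        Zg, Zfg = Z[:, idx], Zf[:, idx]
        Hg_inv = np.linalg.inv(H_tilde[np.ix_(idx, idx)])
        B_hat = np.linalg.solve(Zg.T @ Zg + Hg_inv, Zg.T @ Y)   # (16)
        Yf_hat += wgt * (Zfg @ B_hat)
    return Yf_hat
```

Here Zf is assumed to hold the wavelet coefficients of the new spectra, centered with the training means, and visited is the dictionary of models and log g values returned by the search sketch above.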

6. APPLICATION TO NEAR-INFRARED SPECTROSCOPY OF BISCUIT DOUGHS

We now apply the methodology developed earlier to the spectroscopic calibration problem described in Section 1.1. First, however, we report the results of some other analyses of these data.

6.1 Analysis by Standard Methods

For all of the analyses carried out here, both compositional and spectral data were centered by subtracting the training set means from the training and validation data. The responses, but not the spectral data, were also scaled to give each of the four variables unit variance in the training set.

Mean squared prediction errors were converted back to the original scale by multiplying them by the training sample variances. This preprocessing of the data makes no difference to the standard analyses, which treat the response variables separately, but it simplifies the prior specification for our multivariate wavelet analysis.

Osborne et al. (1984) derived calibrations by multiple regression, using various stepwise procedures to select wavelengths for each constituent separately. The mean squared errors of prediction on the 39 validation samples for their calibrations are reported in the first row of Table 1.

Brown et al. (1999) also selected small numbers of wavelengths to find calibration equations for this example. Their Bayesian decision theory approach differed from the approach of Osborne in being multivariate (i.e., in trying to find one small subset of wavelengths suitable for predicting all four constituents) and in using a more extensive search based on simulated annealing. The results for this alternative wavelength selection approach are given in the second row of Table 1.

For the purpose of comparison, we carried out two other analyses using partial least squares regression (PLS) and principal components regression (PCR). These approaches, both of which construct factors from the full spectral data and then regress constituents on the factors, are very much the standard tools in NIR spectroscopy (see, e.g., Geladi and Martens 1996; Geladi, Martens, Hadjiiski, and Hopke 1996). For the computations, we used the PLS Toolbox 2.0 of Wise and Gallagher (Eigenvector Research, Manson, WA). Although there are multivariate versions of PLS, we took the usual approach of calibrating for each constituent separately. The number of factors used, selected in each case by cross-validation on the training set, was five for each of the PLS equations and six for each of the PCR equations.

Table 1. Mean Squared Errors of Prediction on the 39 Biscuit Dough Pieces in the Validation Set, Using Four Calibration Methods

Method             Fat      Sugar    Flour    Water
Stepwise MLR       0.044    1.188    0.722    0.221
Decision theory    0.076    0.566    0.265    0.176
PLS                0.151    0.583    0.375    0.105
PCR                0.160    0.614    0.388    0.106

The results given in rows three and four of Table 1 show that, as usual, there is little to choose between the two methods. These results are for PLS and PCR using the reduced 256-point spectrum that we used for our wavelet analysis. Repeating the analyses using the original 700-point spectrum yielded results very similar to those for PLS and somewhat worse than those reported for PCR.

Because shortly we need to specify a prior distribution on regression coefficients, it is interesting to examine those resulting from a factor-type approach. Combining the coefficients for the regression of constituent on factor scores with the loadings that produce scores from the original spectral variables gives the coefficient vector that would be applied to a measured spectrum to give a prediction. Figure 2 plots these vectors for the PLS equations for the four constituents, showing the smoothness in the 256 coefficients that we attempt to reflect in our prior distribution for B.

6.2 Wavelet Transforms of Spectra

To each spectrum we apply a wavelet transform, converting it to a set of 256 wavelet coefficients. We used the MATLAB toolbox Wavbox 4.3 (Taswell 1995) for this step. Using spectra with 2^m (m integer) points is not a real restriction here; methods exist to overcome the limitation, allowing the DWT to be applied on any length of data. We used MP(4) (Daubechies 1992, p. 194) wavelets with four vanishing moments. The Daubechies wavelets have compact support, important for good localization, and a maximum number of vanishing moments for a given smoothness. A large number of vanishing moments leads to high compressibility, because the fine-scale wavelet coefficients are essentially 0 where the functions are smooth. On the other hand, support of the wavelets increases with an increasing number of vanishing moments, so there is a trade-off with the localization properties. Some rather limited exploration suggested that the chosen wavelet family is a good compromise for these data.

The graphs on the right side of Figure 1 show the wavelet transforms corresponding to the three NIR spectra in the left column. Coefficients are ordered from coarsest to finest.

6.3 Prior Settings

We need to specify the values of H, δ, and Q in (3), (4), and (5) and the prior probability w that an element of γ is 1. We wish to put in weak but proper prior information about Σ. We choose δ = 3, because this is the smallest integer value such that the expectation of Σ, E(Σ) = Q/(δ − 2), exists. The scale matrix Q is chosen as Q = kI_q with k = 0.05, comparable in size to the expected error variances of the standardized Y given X. With δ small, the choice of Q is unlikely to be critical.

Much more likely to be influential are the choices of H and w in the priors for B and γ. To reflect the smoothness in the coefficients B, as exemplified in Figure 2, while keeping the form of H simple, we have taken H to be the variance matrix of a first-order autoregressive process, with h_ij = σ²ρ^{|i−j|}. We derived the values σ² = 254 and ρ = 0.32 by maximizing a type II likelihood (Good 1965).

[Figure 2. Coefficient Vectors From 5-Factor PLS Equations. Panels: Fat, Sugar, Flour, Water; coefficient plotted against wavelength (nm).]

Integrating B and Σ from the joint distribution given by (2), (3), (4), and (5) for the regression on the full untransformed spectra, with h → ∞ and B_0 = 0, we get

f \propto |K|^{-q/2} \, |Q|^{(\delta+q-1)/2} \, |Q + Y'K^{-1}Y|^{-(\delta+n+q-1)/2},   (18)

where

K = I_n + X H X'

and the columns of Y are centered as in Section 4.1. With k = 0.05 and δ = 3 already fixed, (18) is a function, via H, of σ² and ρ. We used for our prior the values of these hyperparameters that maximize (18). Possible underestimation due to the use of the full spectra was taken into account by multiplying the estimate of σ² by the inflation factor 256/20, reflecting our prior belief as to the expected number of included coefficients.
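A direct, if brute-force, sketch of this empirical Bayes step (assumed names; a coarse grid search stands in for whatever optimizer was actually used) evaluates the log of (18) as a function of σ² and ρ:

```python
import numpy as np

def log_type2(sigma2, rho, X, Y, Q, delta):
    """Log of (18) as a function of the AR(1) hyperparameters (a sketch)."""
    n, q = Y.shape
    idx = np.arange(X.shape[1])
    H = sigma2 * rho ** np.abs(np.subtract.outer(idx, idx))   # h_ij = s^2 rho^|i-j|
    K = np.eye(n) + X @ H @ X.T
    _, log_det_K = np.linalg.slogdet(K)
    _, log_det_Q = np.linalg.slogdet(Q)
    _, log_det_M = np.linalg.slogdet(Q + Y.T @ np.linalg.solve(K, Y))
    return (-0.5 * q * log_det_K + 0.5 * (delta + q - 1) * log_det_Q
            - 0.5 * (delta + n + q - 1) * log_det_M)

# Crude grid search over the two hyperparameters (grid values assumed):
# best = max(((s2, r) for s2 in np.linspace(50, 500, 20)
#             for r in np.linspace(0.05, 0.95, 19)),
#            key=lambda sr: log_type2(*sr, X, Y, Q, delta))
```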

Figure 3 shows the diagonal elements of the matrix H̃ = WHW' implied by our choice of H. The variance matrix of the ith column of B̃ in (8) is σ_ii H̃, so this plot shows the pattern in the prior variances of the regression coefficients when the predictors are wavelet coefficients. The wavelet coefficients are ordered from coarsest to finest, so the decreasing prior variance means that there will be more shrinkage at the finer levels. This is a logical consequence of the smoothness that we have tried to express in the prior distribution. The spikes in the plot at the level transitions are from the boundary condition problems of the discrete wavelet transform.

We know from experience that good predictions can usually be obtained with 10 or so selected spectral points in examples of this type. Having no previous experience in selecting wavelet coefficients, and wanting to induce a similarly "small" model without constraining the possibilities too severely, we chose w in the prior for γ so that the expected model size was pw = 20. We have given equal prior probability here to coefficients at the different levels. Although we considered the possibility of varying w in blocks, we had no strong prior opinions about which levels were likely to provide the most useful coefficients, apart from a suspicion that neither the coarsest nor the finest levels would feature strongly.

6.4 Implementing the Metropolis Search

We chose widely different starting points for the four Metropolis chains by setting to 1 the first 1, the first 20, the first 128 (i.e., half), and all elements of γ. There were 100,000 iterations in each run, where an iteration comprised either adding/deleting or swapping, as described in Section 4.2. The two moves were chosen with equal probability, φ = 1/2. Acceptance of the possible move by (14) was by generation of a Bernoulli random variable. Computation of g(γ) and g(γ*) was done using the QR decomposition of MATLAB.

For each chain, we recorded the visited γ's and their corresponding relative probability g(γ). No burn-in was necessary, as relatively unlikely γ's would automatically be downweighted in our analysis. There were approximately 38,000-40,000 successful moves for each of the four runs. Of these moves, around 95% were swaps. The relative probabilities of the set of distinct visited γ were then normalized to 1 over this set. Figure 4 plots the marginal probabilities for components of γ, P(γ_j = 1), j = 1, …, 256. The spikes show where regressor variables have been included in subsets with high probability.

[Figure 3. Diagonal Elements of H̃ = WHW', With h̃_ii Plotted Against i. Prior variance of regression coefficient against wavelet coefficient index.]

[Figure 4. Marginal Probabilities of Components of γ for Four Runs. Panels (i)-(iv): probability against wavelet coefficient index.]

[Figure 5. Plots in Sequence Order for Run (iii): (a) the number of 1's; (b) log relative probabilities. Both plotted against iteration number.]

For one of the runs, (iii), Figure 5 gives two more plots: the number of 1's and the log-relative probabilities, log(g(γ)), of the visited γ, plotted over the 100,000 iterations. The other runs produced very similar plots, quickly moving toward models of similar dimensions and posterior probability values.

Despite the very different starting points, the regions explored by the four chains have clear similarities, in that plots of marginals are overall broadly similar. However, there are also clear differences, with some chains making frequent use of variables not picked up by others. Although we would not claim convergence, all four chains arrive at some similarly "good," albeit different, subsets. With mutually correlated predictor variables, as we have here, there will always tend to be many solutions to the problem of finding the best predictors. We adopt the pragmatic stance that all we are trying to do is identify some of the good solutions. If we happen to miss some other good ones, then this is unfortunate but not disastrous.

6.5 Results

We pooled the distinct γ's visited by the four chains, normalized the relative probabilities, and ordered them according to probability. Then we predicted the further 39 unseen samples using Bayes model averaging. Mean squared prediction errors, converted to the original scale, were 0.063, 0.449, 0.348, and 0.050. These results use the best 500 models, accounting for almost 99% of the total visited probability and using 219 wavelet coefficients. They improve considerably on all of the standard methods reported in Table 1. The single best subset among the visited ones had 10 coefficients, accounted for 9% of the total visited probability, and gave least squares predictions with mean squared errors of 0.059, 0.466, 0.351, and 0.047.

Examining the scales of the selected coefficients is interesting. The model making the best predictions used coefficients (10, 11, 14, 17, 33, 34, 82, 132, 166, 255), which include (0, 0, 0, 37, 6, 6, 1, 2)% of all of the coefficients at the eight levels from the coarsest to the finest scales. The most useful coefficients seem to be in the middle of the scale, with some more toward the finer end. Note that the very coarsest coefficients, which would be essential to any reconstruction of the spectra, are not used, despite the smoothness implicit in H and the consequent increased shrinkage associated with finer scales, as seen in Figure 3.

Some idea of the locations of the wavelet coefficients selected by the modal model can be obtained from Figure 6. Here we used the linearity of the wavelet transform and the linearity of the prediction equation to express the prediction equation as a vector of coefficients to be applied to the original spectral data. This vector is obtained by applying the inverse wavelet transform to the columns of the matrix of the least squares estimates of the regression coefficients. Because selection has discarded unnecessary details, most of the coefficients are very close to 0. We thus display only the 1600-1800 nm range, which permits a better view of the features of coefficients in the range of interest. These coefficient vectors can be compared directly with those shown in Figure 2. They are applied to the same spectral data to produce predictions, and despite the different scales, some comparisons are possible. For example, the two coefficient vectors for fat can be

easily interpreted. Fats and oils have absorption bands at around 1730 and 1765 nm (Osborne et al. 1993), and strong positive coefficients in this region are the major features of both plots. The wavelet-based equation (Fig. 6) is simpler, the selection having discarded much unnecessary detail. The other coefficient vectors show less resemblance and are also harder to interpret. One thing to bear in mind in interpreting these plots is that the four constituents add up to 100%. Thus it should not be surprising to find the fat measurement peak in all eight plots, nor that the coefficient vectors for sugar and flour are so strongly inversely related.

Finally, we comment briefly on results that we obtained by investigating logistic transformations of the data. Our response variables are in fact percentages and are constrained to sum to 100; thus they lie on a simplex rather than in the full q-dimensional space. Sample ranges are 18 ± 3 for fat, 17 ± 7 for sugar, 51 ± 6 for flour, and 14 ± 3 for water.

Following Aitchison (1986), we transformed the original Y's into log ratios of the form

Z_1 = \ln(Y_1/Y_3), \quad Z_2 = \ln(Y_2/Y_3), \quad \text{and} \quad Z_3 = \ln(Y_4/Y_3).   (19)

The choice of the third ingredient for the denominator was the most natural, in that flour is the major constituent and also because ingredients in recipes often are expressed as a ratio to flour content. We centered and scaled the Z variables and recomputed empirical Bayes estimates for σ² and ρ. (The other hyperparameters were not affected by the logistic transformation.) We then ran four Metropolis chains, using the starting points used previously, with the data in the original scale.

[Figure 6. Coefficient Vectors From the "Best" Wavelet Equations. Panels: Fat, Sugar, Flour, Water; coefficient plotted against wavelength (nm), 1400-1800 nm shown.]

Diagnostic plots and plots of the marginals appeared very similar to those of Figures 4 and 5. The four chains visited 151,183 distinct models. We finally computed Bayes model averaging and least squares predictions with the best model, unscaling the predicted values and transforming them back to the original scale as

Y_i = 100 \exp(Z_i) \Big/ \Big( \sum_{j=1}^{3} \exp(Z_j) + 1 \Big), \quad i = 1, 2,

Y_3 = 100 \Big/ \Big( \sum_{j=1}^{3} \exp(Z_j) + 1 \Big), \quad \text{and} \quad Y_4 = 100 \exp(Z_3) \Big/ \Big( \sum_{j=1}^{3} \exp(Z_j) + 1 \Big).
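A small sketch of this pair of transformations (hypothetical helper names; response columns assumed ordered as fat, sugar, flour, water) is:

```python
import numpy as np

def to_log_ratios(Y):
    """Aitchison log ratios (19): columns of Y are fat, sugar, flour, water (%)."""
    return np.column_stack([np.log(Y[:, 0] / Y[:, 2]),
                            np.log(Y[:, 1] / Y[:, 2]),
                            np.log(Y[:, 3] / Y[:, 2])])

def from_log_ratios(Z):
    """Back-transform predicted log ratios to percentages summing to 100."""
    denom = np.exp(Z).sum(axis=1) + 1.0
    y1, y2, y4 = (100.0 * np.exp(Z[:, j]) / denom for j in range(3))
    y3 = 100.0 / denom
    return np.column_stack([y1, y2, y3, y4])
```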

The best 500 visited models accounted for 99.4% of the total visited probability, used 214 wavelet coefficients, and gave Bayes mean squared prediction errors of 0.058, 0.819, 0.457, and 0.080. The single best subset among the visited ones had 10 coefficients, accounted for 15.8% of the total visited probability, and gave least squares prediction errors of 0.091, 0.793, 0.496, and 0.119. The logistic transformation does not seem to have a positive impact on the predictive performance, and overall the simpler linear model seems to be adequate. Our approach gives Bayes predictions on the original scale satisfying the constraint of summing exactly to 100. This stems from the conjugate prior, linearity in Y, zero mean for the prior distribution of the regression coefficients, and vague prior for the intercept. It is easily seen that, with X centered, α̂'1 = 100 and B̂1 = 0 for either least squares or Bayes estimates, and hence all predictions sum to 100.

In addition, had we imposed a singular distribution to deal with the four responses summing to 100, then a proper analysis would suggest eliminating one component, but then the predictions for the remaining three components would be as we have derived. Thus our analysis is Bayes for the singular problem, even though superficially it ignores this aspect. This does not address the positivity constraint, but the composition variables are all so far from the boundaries, compared to residual error, that this is not an issue. Our desire to stick with the original scale is supported by the Beer-Lambert law, which linearly relates absorbance to composition. The logistic transform distorts this linear relationship.

7. DISCUSSION

In specifying the prior parameters for our example, we made a number of arbitrary choices. In particular, the choice of H as the variance matrix of an autoregressive process is a rather crude representation of the smoothness in the coefficients. It might be interesting to try to model this in more detail. However, although there may be room for improvement, the simple structure used here does appear to work well.

Another area for future investigation is the use of more sophisticated wavelet systems in this context. The additional flexibility of wavelet packets (Coifman, Meyer, and Wickerhauser 1992) or m-band wavelets (Mallet, Coomans, Kautsky, and De Vel 1997) might lead to improved predictions or might be just an unnecessary complication.

[Received October 1998. Revised November 2000.]

REFERENCES

Aitchison, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman and Hall.
Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Fearn, T., and Vannucci, M. (1999), "The Choice of Variables in Multivariate Regression: A Bayesian Non-Conjugate Decision Theory Approach," Biometrika, 86, 635-648.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Journal of Chemometrics, 12, 173-182.
Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627-641.
Carlin, B. P., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473-484.
Chipman, H. (1996), "Bayesian Variable Selection With Related Predictors," Canadian Journal of Statistics, 24, 17-36.
Clyde, M., DeSimone, H., and Parmigiani, G. (1996), "Prediction via Orthogonalized Model Mixing," Journal of the American Statistical Association, 91, 1197-1208.
Clyde, M., and George, E. I. (2000), "Flexible Empirical Bayes Estimation for Wavelets," Journal of the Royal Statistical Society, Ser. B, 62, 681-698.
Coifman, R. R., Meyer, Y., and Wickerhauser, M. V. (1992), "Wavelet Analysis and Signal Processing," in Wavelets and Their Applications, eds. M. B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, S. Mallat, Y. Meyer, and L. Raphael, Boston: Jones and Bartlett, pp. 153-178.
Cowe, I. A., and McNicol, J. W. (1985), "The Use of Principal Components in the Analysis of Near-Infrared Spectra," Applied Spectroscopy, 39, 257-266.
Daubechies, I. (1992), Ten Lectures on Wavelets (Vol. 61, CBMS-NSF Regional Conference Series in Applied Mathematics), Philadelphia: Society for Industrial and Applied Mathematics.
Dawid, A. P. (1981), "Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application," Biometrika, 68, 265-274.
Donoho, D., Johnstone, I., Kerkyacharian, G., and Picard, D. (1995), "Wavelet Shrinkage: Asymptopia?" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 301-369.
Geladi, P., and Martens, H. (1996), "A Calibration Tutorial for Spectral Data. Part 1: Data Pretreatment and Principal Component Regression Using Matlab," Journal of Near Infrared Spectroscopy, 4, 225-242.
Geladi, P., Martens, H., Hadjiiski, L., and Hopke, P. (1996), "A Calibration Tutorial for Spectral Data. Part 2: Partial Least Squares Regression Using Matlab and Some Neural Network Results," Journal of Near Infrared Spectroscopy, 4, 243-255.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339-373.
Geweke, J. (1996), "Variable Selection and Model Comparison in Regression," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Clarendon Press, pp. 609-620.
Good, I. J. (1965), The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, MA: MIT Press.
Hrushka, W. R. (1987), "Data Analysis: Wavelength Selection Methods," in Near-Infrared Technology in the Agricultural and Food Industries, eds. P. Williams and K. Norris, St. Paul, MN: American Association of Cereal Chemists, pp. 35-55.
Leamer, E. E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580-587.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535-1546.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215-232.
Mallat, S. G. (1989), "Multiresolution Approximations and Wavelet Orthonormal Bases of L²(R)," Transactions of the American Mathematical Society, 315, 69-87.
Mallet, Y., Coomans, D., Kautsky, J., and De Vel, O. (1997), "Classification Using Adaptive Wavelets for Feature Extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1058-1066.
McClure, W. F., Hamid, A., Giesbrecht, F. G., and Weeks, W. W. (1984), "Fourier Analysis Enhances NIR Diffuse Reflectance Spectroscopy," Applied Spectroscopy, 38, 322-329.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression," Journal of the American Statistical Association, 83, 1023-1036.
Osborne, B. G., Fearn, T., and Hindle, P. H. (1993), Practical NIR Spectroscopy, Harlow, U.K.: Longman.
Osborne, B. G., Fearn, T., Miller, A. R., and Douglas, S. (1984), "Application of Near-Infrared Reflectance Spectroscopy to Compositional Analysis of Biscuits and Biscuit Doughs," Journal of the Science of Food and Agriculture, 35, 99-105.
Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997), "Bayesian Model Averaging for Linear Regression Models," Journal of the American Statistical Association, 92, 179-191.
Seber, G. A. F. (1984), Multivariate Observations, New York: Wiley.
Smith, A. F. M. (1973), "A General Bayesian Linear Model," Journal of the Royal Statistical Society, Ser. B, 35, 67-75.
Taswell, C. (1995), "Wavbox 4: A Software Toolbox for Wavelet Transforms and Adaptive Wavelet Packet Decompositions," in Wavelets and Statistics, eds. A. Antoniadis and G. Oppenheim, New York: Springer-Verlag, pp. 361-375.
Trygg, J., and Wold, S. (1998), "PLS Regression on Wavelet-Compressed NIR Spectra," Chemometrics and Intelligent Laboratory Systems, 42, 209-220.
Vannucci, M., and Corradi, F. (1999), "Covariance Structure of Wavelet Coefficients: Theory and Models in a Bayesian Perspective," Journal of the Royal Statistical Society, Ser. B, 61, 971-986.
Walczak, B., and Massart, D. L. (1997), "Noise Suppression and Signal Compression Using the Wavelet Packet Transform," Chemometrics and Intelligent Laboratory Systems, 36, 81-94.
Wold, S., Martens, H., and Wold, H. (1983), "The Multivariate Calibration Problem in Chemistry Solved by PLS," in Matrix Pencils, eds. A. Ruhe and B. Kagstrom, Heidelberg: Springer, pp. 286-293.

Page 3: 17-Bayesian Wavelet Regression on Curves With Application to a Spectroscopic Calibration Problem.pdf

400 Journal of the American Statistical Association June 2001

Not surprisingly Fourier analysis has also been success-fully used for data compression and denoising of NIR spec-tral data (McClure Hamid Giesbrecht and Weeks 1984) Forthis purpose there is probably very little to choose betweenthe Fourier and wavelet approaches When it comes to select-ing small numbers of coef cients for prediction however thelocal nature of wavelets makes them the obvious choice

It is worth emphasizing that what we are doing here isnot what is commonly described as wavelet regression Weare not tting a smooth function to a single noisy curve byusing either thresholding or shrinkage of wavelet coef cients(see Clyde and George 2000 for Bayesian approaches that usemixture modeling and Donoho Johnstone Kerkyacharian andPicard 1995) Unlike those authors we have several curvesthe spectra of 39 dough pieces each of which is transformedto wavelet coef cients We then select some of these waveletcoef cients (the same ones for each spectrum) not becausethey give a good reconstruction of the curves (which they donot) or to remove noise (of which there is very little to startwith) from the curves but rather because the selected coef -cients are useful for predicting some other quantity measuredon the dough pieces One consequence of this is that it isnot necessarily the large wavelet coef cients that will be use-ful small coef cients in critical regions of the spectrum alsomay carry important predictive information Thus the standardthresholding or shrinkage approaches are just not relevant tothis problem

2 PRELIMINARIES

21 Wavelet Bases and Wavelet Transforms

Wavelets are families of functions that can accuratelydescribe other functions in a parsimonious way In L2425 forexample an orthonormal wavelet basis is obtained as trans-lations and dilations of a mother wavelet ndash as ndashj1 k4x5 D2j=2ndash42jx ƒ k5 with j1 k integers A function f is then repre-sented by a wavelet series as

f4x5 DX

j1 k2

dj1 kndashj1 k4x51 (1)

with wavelet coef cients dj1 kD

Rf 4x5ndashj1 k4x5dx describing

features of the function f at the spatial location 2ƒjk andfrequency proportional to 2j (or scale j)

Daubechies (1992) proposed a class of wavelet families thathave compact support and a maximum number of vanishingmoments for any given smoothness These are used exten-sively in statistical applications

Wavelets have been an extremely useful tool for the analysisand synthesis of discrete data Let Y D 4y11 1 yn5 n D 2J be a sample of a function at equally spaced points This vec-tor of observations can be viewed as an approximation to thefunction at the ne scale J A fast algorithm the discretewavelet transform (DWT) exists for decomposing Y into a setof wavelet coef cients (Mallat 1989) in only O4n5 operationsThe DWT operates in practice by means of linear recursive l-ters For illustration purposes we can write it in matrix formas Z D WY where W is an orthogonal matrix correspondingto the discrete wavelet transform and Z is a vector of waveletcoef cients describing features of the function at scales from

the ne J ƒ 1 to a coarser one say J ƒ r An algorithm forthe inverse construction also exists

Wavelet transforms can be computed very rapidly and havegood compression properties Because they are localized inboth time and frequency wavelets have the ability to repre-sent many classes of functions in a sparse form by describingimportant features with few coef cients (For a general expo-sition of wavelet theory see Daubechies 1992)

22 Matrix-Variate Distributions

In what follows we use notation for matrix-variate distribu-tions due to Dawid (1981) We write

V ƒ M reg 4acirc 1egrave5

when the random matrix V has a matrix-variate normal distri-bution with mean M and covariance matrices ƒiiegrave and lsquo jjacirc

for its ith row and jth column Such a V could be gener-ated as V D MCA0UB where M1 A1 and B are xed matricessuch that A0A D acirc and B0B D egrave and U is a random matrixwith independent standard normal entries This notation hasthe advantage of preserving the matrix structure instead ofreshaping V as a vector It also makes for much easier formalBayesian manipulation

The other notation that we use is

W copymiddot4bdquo3 egrave5

for a random matrix W with an inverse Wishart distributionwith scale matrix egrave and shape parameter bdquo The shape param-eter differs from the more conventional degrees of freedomagain making for very easy Bayesian manipulations With Uand B de ned as earlier and with U as n p with n gt pW D B04U0U5ƒ1B has an inverse Wishart distribution withshape parameter bdquo D nƒpC 1 and scale matrix egrave The expec-tation of W exists for bdquo gt 2 and is then egrave=4bdquo ƒ 25 (Moredetails of these notations and a corresponding form for thematrix-variate T can be found in Brown 1993 App A orDawid 1981)

3 MODELING

31 Multivariate Regression Model

The basic setup that we consider is a multivariate linearregression model with n observations on a q-variate responseand p explanatory variables Let Y denote the n q matrix ofobserved values of the responses and let X be the n p matrixof predictor variables Our special concern is with functionalpredictor data that is the situation in which each row of X isa vector of observations of a curve x4t5 at p equally spacedpoints

The standard multivariate normal regression model hasconditional on 1B1egrave and X

Y ƒ 1n 0 ƒ XB reg 4 In1egrave51 (2)

where 1n is an n 1 vector of 1rsquos is a q 1 vector of inter-cepts and B D 4sbquo11 1sbquoq5 is a p q matrix of regressioncoef cients Without loss of generality we assume that thecolumns of X have been centered by subtracting their means

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 401

The unknown parameters are B and the q q errorcovariance matrix egrave A conjugate prior for this model is asfollows First given egrave

0 ƒ 00 reg 4h1egrave5 (3)

and independently

B ƒ B0 reg 4H1 egrave50 (4)

The marginal distribution of egrave is then

egrave copymiddot4bdquo3 Q50 (5)

Note that the priors on both and B have covariances depen-dent on egrave in a way that directly extends the univariate regres-sion natural conjugate prior distributions

In practice we let h ˆ to represent vague prior knowl-edge about and take B0 D 0 leaving the speci cation of Hbdquo and Q to incorporate prior knowledge about our particularapplication

32 Transformation to Wavelets

We now transform the predictor variables by applying toeach row of X a wavelet transform as described in Section 21In matrix form multiplying each row of X by the same matrixW is equivalent to multiplying X on the right side by W0

The wavelet transform is orthogonal (ie W0W D I) andthus (2) can be written as

Y ƒ 1n 0 ƒ XW0WB reg 4 In1 egrave50 (6)

We can now express the model in terms of wavelet transfor-mations of the predictors as

Y ƒ 1n0 ƒ Z QB reg 4 In1egrave51 (7)

where Z D XW0 is now a matrix of wavelet coef cients andQB D WB is a matrix of regression coef cients The trans-formed prior on QB in the case of inclusion of all predictors is

QB reg 4 QH1 egrave51 (8)

where QH D WHW0 and the parameters and egrave are unchangedby the orthogonal transformations as are the priors (3) and (5)

In practice wavelets exploit the recursive application of lters and the W-matrix notation is more useful for expla-nation than for computation Vannucci and Corradi (1999)proposed a fast recursive algorithm for computing quantitiessuch as WHW0 Their algorithm has a useful link to the two-dimensional DWT (DWT2) making computations simple Thematrix WHW0 can be computed from H with an O4n25 algo-rithm (For more details see secs 31 and 32 of Vannucci andCorradi 1999)

33 A Framework for Variable Selection

To perform selection in the wavelet coef cient domain wefurther elaborate the prior on QB by introducing a latent binaryp-vector ƒ The jth element of ƒ1 ƒj may be either 1 or0 depending on whether the jth column of Z is or is notincluded in the model When ƒj is unity the covariance matrixof the corresponding row of QB is ldquolargerdquo and when ƒj is 0 thecovariance matrix is a zero matrix We have assumed that theprior expectation of QB is 0 and so ƒj

D 0 effectively deletesthe jth explanatory variable (or wavelet coef cient) from themodel This gives conditional on ƒ

QBƒ reg 4 QHƒ1egrave51 (9)

where QBƒ and QHƒ are just QB and QH with the rows and inthe case of QH columns for which ƒj

D 0 deleted Under thisprior each row of QB is modeled as having a scale mixture ofthe type

6 QB76j27 41 ƒ ƒj5I0C ƒjN 401 Qhjjegrave51 (10)

with Qhjj equal to the jth diagonal element of the matrix QH DWHW0 and I0 a distribution placing unit mass on the 1 qzero vector Note that the rows of QB are not independent

A simple prior distribution 4ƒ5 for ƒ takes the ƒj to beindependent with Pr4ƒj

D 15 D wj and Pr4ƒjD 05 D 1 ƒ wj

with hyperparameters wj to be speci ed for j D 11 1 p Inour example we take all of the wj equal to a common w sothat the nonzero elements of ƒ have a binomial distributionwith expectation pw

Mixture priors have been widely used for variable selectionin the original model space originally by Leamer (1978) andmore recently by George and McCulloch (1997) and Mitchelland Beauchamp (1988) for the linear multiple regression caseCarlin and Chib (1995) Chipman (1996) and Geweke (1996)among others concentrated on special features of these priorsClyde DeSimone and Parmigiani (1996) used model mix-ing in prediction problems with correlated predictors whenexpressing the space of models in terms of an orthogonaliza-tion of the design matrix Their methods are not directly appli-cable to our situation because the wavelet transforms do notleave us with an orthogonal design The use of mixture pri-ors for selection in the multivariate regression setup has beeninvestigated by Brown et al (1998ab)

4 SELECTING WAVELET COEFFICIENTS

41 Posterior Distribution of ƒ

The posterior distribution of ƒ given the data 4ƒ mdash Y1Z5assigns a posterior probability to each ƒ-vector and thus toeach possible subset of predictors (wavelet coef cients) Thisposterior arises from the combination of a likelihood that givesgreat weight to subsets explaining a high proportion of thevariation in the responses Y and a prior for ƒ that penalizeslarge subsets It can be computed by integrating out 1B1

and egrave from the joint posterior distribution of these parametersand ƒ given the data With the vague (h ˆ) prior for this parameter is essentially estimated by the mean Y in thecalibration data (see Smith 1973) and to simplify the formulas

402 Journal of the American Statistical Association June 2001

that follow we now assume that the columns of Y have beencentered (Full details of the prior to posterior analysis havebeen given by Brown et al 1998b who also considered otherprior structures) After some manipulation we have

4ƒ mdash Y1Z5 g4ƒ5 D QHƒ Z0ƒZƒ

C QHƒ1ƒ

ƒq=2

ƒ4nCbdquoCqƒ15=2 4ƒ51 (11)

where QƒD Q C Y0Y ƒ Y0Zƒ4Z0

ƒ ZƒC QHƒ1

ƒ 5ƒ1Z0ƒ Y and Zƒ is

Z with the columns for which ƒjD 0 deleted Care is needed

in computing (11) the alternative forms discussed later maybe useful

A simplifying feature of this setup is that all of the com-putations can be formulated as least squares problems withmodi ed Y and Z matrices By writing

QZƒD

sup3Zƒ

QH12ƒ

Ipƒ

acute1 QY D Y

01

where QH1=2ƒ is a matrix square root of QHƒ and pƒ is the number

of 1rsquos in ƒ the relevant quantities entering into (11) can becomputed as

QHƒ Z0ƒZƒ

C QHƒ1ƒ

D QZ0ƒ

QZƒ (12)

and

QƒD Q C QY0 QY ƒ QY0 QZƒ4 QZ0

ƒQZƒ5ƒ1 QZ0

ƒQY3 (13)

that is Qƒ is given by Q plus the residual sum of prod-ucts matrix from the least squares regression of QY on QZƒ The QR decomposition can then be used (see eg Seber1984 chap 10 sec 11b) which avoids ldquosquaringrdquo as in (12)and (13)

42 Metropolis Search

Equation (11) gives the posterior probability of each of the 2^p different γ vectors, and thus of each choice of wavelet coefficient subsets. What remains to do is to look for "good" wavelet components by computing these posterior probabilities. When p is much greater than about 25, too many subsets exist for this to be feasible. Fortunately, we can use simulation methods that will find the γ vectors with relatively high posterior probabilities. We can then quickly identify useful coefficients as those with high marginal probabilities of γ_j = 1.

Here we use a Metropolis search, as suggested for model selection by Madigan and York (1995) and applied to variable selection for regression by Brown et al. (1998a), George and McCulloch (1997), and Raftery, Madigan, and Hoeting (1997). The search starts from a randomly chosen γ_0 and then moves through a sequence of further values of γ. At each step the algorithm generates a new candidate γ by randomly modifying the current one. Two types of moves are used:

• Add or delete a component, by choosing at random one component in the current γ and changing its value. This move is chosen with probability φ.

• Swap two components, by choosing independently at random a 0 and a 1 in the current γ and changing both of them. This move is chosen with probability 1 − φ.

The new candidate model γ* is accepted with probability

$$
\min\left\{ \frac{g(\gamma^{*})}{g(\gamma)},\; 1 \right\}. \qquad (14)
$$

Thus a more probable γ* is always accepted, and a less probable one may be accepted. There is scope for further ingenuity in designing the sequence of random moves. For example, moves that add, subtract, or swap two, three, or more components at a time, or a combination of these, may be useful.

The sequence of γ's generated by the search is a realization of a Markov chain, and the choice of acceptance probabilities ensures that the equilibrium distribution of this chain is the distribution given by (11). In typical uses of such schemes, the realizations are monitored after a suitable burn-in period to verify that they appear stationary. Here we have a closed form for the posterior distribution and are using the chain simply to explore this distribution. Thus we have not been so concerned about strict convergence of the Markov chain. Following Brown et al. (1998a), we adopt a strategy of running the chain from a number of different starting points (four here) and looking at the four marginal distributions provided by the computed g(·) values of the visited γ's. We also look for good indication of mixing and explorations with returns. Because we know the relative probabilities, we do not need to worry about using a burn-in period.

Note that, given the form of the acceptance probability (14), γ-vectors with high posterior probability have a greater chance of appearing in the sequence. Thus we might expect that a long run of such a chain would visit many of the best subsets.
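The following sketch (again ours, with illustrative names, and relying on the log_g() sketch of Section 4.1) shows one way such a search over γ might be coded, with the add/delete and swap moves and the acceptance rule (14):

```python
import numpy as np

def metropolis_search(Y, Z, H_tilde, Q, delta, w, n_iter=100_000,
                      phi=0.5, gamma0=None, rng=None):
    """Minimal Metropolis search over gamma (Section 4.2)."""
    rng = np.random.default_rng() if rng is None else rng
    p = Z.shape[1]
    gamma = np.zeros(p, dtype=bool) if gamma0 is None else gamma0.copy()
    current = log_g(gamma, Y, Z, H_tilde, Q, delta, w)
    visited = {}                                # distinct visited models -> log g

    for _ in range(n_iter):
        cand = gamma.copy()
        ones = np.flatnonzero(cand)
        zeros = np.flatnonzero(~cand)
        if rng.random() < phi or len(ones) == 0 or len(zeros) == 0:
            # add/delete: flip one randomly chosen component
            j = rng.integers(p)
            cand[j] = ~cand[j]
        else:
            # swap: pick one 1 and one 0 and exchange them
            cand[rng.choice(ones)] = False
            cand[rng.choice(zeros)] = True
        proposal = log_g(cand, Y, Z, H_tilde, Q, delta, w)
        # accept with probability min{ g(cand)/g(gamma), 1 }  -- equation (14)
        if np.log(rng.random()) < proposal - current:
            gamma, current = cand, proposal
        visited[tuple(gamma.nonzero()[0])] = current
    return visited
```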

5. PREDICTION

Suppose now that we wish to predict Y_f, an m × q matrix of further Y-vectors, given the corresponding X-vectors X_f (m × p). First we treat X_f exactly as the training data have been treated, by subtracting the training data means and transforming to wavelet coefficients Z_f (m × p). The model for Y_f, following the model for the training data (6), is

$$
Y_f - 1_m\alpha' - Z_f\tilde B \;\sim\; \mathcal{N}(I_m, \Sigma). \qquad (15)
$$

If we believe our Bayesian mixture model, then logically we should apply the same latent structure model to prediction as well as to training. This has the practical appeal of providing averaging over a range of likely models (Madigan and Raftery 1994).

Results of Brown et al. (1998b) demonstrate that, with the columns of Y_f centered using the mean Ȳ from the training set, the expectation of the predictive distribution p(Y_f | γ, Z, Y) is given by Z_{fγ}B̂_γ, with

$$
\hat B_\gamma \;=\; \bigl(Z_\gamma' Z_\gamma + \tilde H_\gamma^{-1}\bigr)^{-1} Z_\gamma' Y
\;=\; \tilde H_\gamma^{1/2}\bigl(\tilde Z_\gamma'\tilde Z_\gamma\bigr)^{-1}\tilde Z_\gamma'\tilde Y. \qquad (16)
$$

Averaging over the posterior distribution of γ gives

$$
\hat Y_f \;=\; \sum_{\gamma} Z_{f\gamma}\,\hat B_\gamma\,\pi(\gamma \mid Z, Y), \qquad (17)
$$


and we might choose to approximate this by some restricted set of γ values, perhaps the r most likely values from the Metropolis search.
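A hedged sketch of this model-averaged prediction, reusing the output of the Metropolis sketch above (names are illustrative, not the authors' code):

```python
import numpy as np

def bma_predict(Zf, Y, Z, H_tilde, visited, r=500):
    """Approximate (17): keep the r most probable visited models, renormalize
    their g-values over that set, and average the predictions Zf_gamma @ Bhat_gamma.
    'visited' maps tuples of selected column indices to log g values."""
    top = sorted(visited.items(), key=lambda kv: kv[1], reverse=True)[:r]
    logs = np.array([lg for _, lg in top])
    weights = np.exp(logs - logs.max())
    weights /= weights.sum()                    # normalized posterior over the set

    Yf_hat = np.zeros((Zf.shape[0], Y.shape[1]))
    for (cols, _), wgt in zip(top, weights):
        cols = list(cols)
        Zg = Z[:, cols]
        Hg_inv = np.linalg.inv(H_tilde[np.ix_(cols, cols)])
        # Bhat_gamma = (Z_g'Z_g + H~_g^{-1})^{-1} Z_g'Y   -- equation (16)
        Bhat = np.linalg.solve(Zg.T @ Zg + Hg_inv, Zg.T @ Y)
        Yf_hat += wgt * (Zf[:, cols] @ Bhat)
    # the training-set mean of Y still has to be added back to return to the
    # original (uncentered) scale
    return Yf_hat
```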

6. APPLICATION TO NEAR-INFRARED SPECTROSCOPY OF BISCUIT DOUGHS

We now apply the methodology developed earlier to the spectroscopic calibration problem described in Section 1.1. First, however, we report the results of some other analyses of these data.

6.1 Analysis by Standard Methods

For all of the analyses carried out here, both compositional and spectral data were centered by subtracting the training set means from the training and validation data. The responses, but not the spectral data, were also scaled to give each of the four variables unit variance in the training set.

Mean squared prediction errors were converted back to the original scale by multiplying them by the training sample variances. This preprocessing of the data makes no difference to the standard analyses, which treat the response variables separately, but it simplifies the prior specification for our multivariate wavelet analysis.

Osborne et al. (1984) derived calibrations by multiple regression, using various stepwise procedures to select wavelengths for each constituent separately. The mean squared errors of prediction on the 39 validation samples for their calibrations are reported in the first row of Table 1.

Brown et al. (1999) also selected small numbers of wavelengths to find calibration equations for this example. Their Bayesian decision theory approach differed from that of Osborne et al. in being multivariate (i.e., in trying to find one small subset of wavelengths suitable for predicting all four constituents) and in using a more extensive search based on simulated annealing. The results for this alternative wavelength selection approach are given in the second row of Table 1.

For the purpose of comparison, we carried out two other analyses, using partial least squares regression (PLS) and principal components regression (PCR). These approaches, both of which construct factors from the full spectral data and then regress constituents on the factors, are very much the standard tools in NIR spectroscopy (see, e.g., Geladi and Martens 1996; Geladi, Martens, Hadjiiski, and Hopke 1996). For the computations we used the PLS Toolbox 2.0 of Wise and Gallagher (Eigenvector Research, Manson, WA). Although there are multivariate versions of PLS, we took the usual approach of calibrating for each constituent separately. The number of factors used, selected in each case by cross-validation on the training set, was five for each of the PLS equations and six

Table 1. Mean Squared Errors of Prediction on the 39 Biscuit Dough Pieces in the Validation Set Using Four Calibration Methods

Method            Fat     Sugar   Flour   Water
Stepwise MLR      0.044   1.188   0.722   0.221
Decision theory   0.076   0.566   0.265   0.176
PLS               0.151   0.583   0.375   0.105
PCR               0.160   0.614   0.388   0.106

for each of the PCR equations. The results, given in rows three and four of Table 1, show that, as usual, there is little to choose between the two methods. These results are for PLS and PCR using the reduced 256-point spectrum that we used for our wavelet analysis. Repeating the analyses using the original 700-point spectrum yielded results very similar to those for PLS and somewhat worse than those reported for PCR.
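For readers wanting to reproduce this kind of baseline, here is a minimal sketch using scikit-learn (our assumption; the paper used the MATLAB PLS Toolbox) that picks the number of PLS factors for a single constituent by cross-validated mean squared error:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def pick_pls_factors(X_train, y_train, max_factors=15, cv=5):
    """Choose the number of PLS factors for one constituent by cross-validation."""
    mses = []
    for k in range(1, max_factors + 1):
        mse = -cross_val_score(PLSRegression(n_components=k), X_train, y_train,
                               scoring='neg_mean_squared_error', cv=cv).mean()
        mses.append(mse)
    return int(np.argmin(mses)) + 1
```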

Because we shortly need to specify a prior distribution on regression coefficients, it is interesting to examine those resulting from a factor-type approach. Combining the coefficients for the regression of constituent on factor scores with the loadings that produce scores from the original spectral variables gives the coefficient vector that would be applied to a measured spectrum to give a prediction. Figure 2 plots these vectors for the PLS equations for the four constituents, showing the smoothness in the 256 coefficients that we attempt to reflect in our prior distribution for B.

6.2 Wavelet Transforms of Spectra

To each spectrum we apply a wavelet transform, converting it to a set of 256 wavelet coefficients. We used the MATLAB toolbox Wavbox 4.3 (Taswell 1995) for this step. Using spectra with 2^m (m integer) points is not a real restriction here; methods exist to overcome this limitation, allowing the DWT to be applied to data of any length. We used MP(4) (Daubechies 1992, p. 194) wavelets with four vanishing moments. The Daubechies wavelets have compact support, important for good localization, and a maximum number of vanishing moments for a given smoothness. A large number of vanishing moments leads to high compressibility, because the fine-scale wavelet coefficients are essentially 0 where the functions are smooth. On the other hand, the support of the wavelets increases with an increasing number of vanishing moments, so there is a trade-off with the localization properties. Some rather limited exploration suggested that the chosen wavelet family is a good compromise for these data.
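A small sketch of this step, assuming PyWavelets' 'db4' filters (Daubechies, four vanishing moments) and periodized boundaries as a stand-in for the Wavbox routine actually used:

```python
import numpy as np
import pywt

def spectra_to_wavelet_coeffs(X):
    """Transform each 256-point spectrum (row of X) into 256 wavelet
    coefficients ordered from coarsest to finest."""
    def dwt_row(x):
        coeffs = pywt.wavedec(x, 'db4', mode='periodization', level=7)
        return np.concatenate(coeffs)       # [scaling block, detail levels coarse->fine]
    return np.apply_along_axis(dwt_row, 1, X)
```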

The graphs on the right side of Figure 1 show the wavelet transforms corresponding to the three NIR spectra in the left column. Coefficients are ordered from coarsest to finest.

6.3 Prior Settings

We need to specify the values of H, δ, and Q in (3), (4), and (5), and the prior probability w that an element of γ is 1. We wish to put in weak but proper prior information about Σ. We choose δ = 3 because this is the smallest integer value such that the expectation of Σ, E(Σ) = Q/(δ − 2), exists. The scale matrix Q is chosen as Q = kI_q with k = 0.05, comparable in size to the expected error variances of the standardized Y given X. With δ small, the choice of Q is unlikely to be critical.

Much more likely to be influential are the choices of H and w in the priors for B and γ. To reflect the smoothness in the coefficients B, as exemplified in Figure 2, while keeping the form of H simple, we have taken H to be the variance matrix of a first-order autoregressive process, with h_ij = σ²ρ^|i−j|. We derived the values σ² = 254 and ρ = 0.32 by maximizing a type II likelihood (Good 1965). Integrating B and Σ from the joint distribution given by (2), (3), (4), and (5) for the


[Figure 2. Coefficient Vectors From 5-Factor PLS Equations. Four panels (Fat, Sugar, Flour, Water); x-axis: Wavelength (1400–2400 nm); y-axis: Coefficient.]

regression on the full untransformed spectra, with h → ∞ and B_0 = 0, we get

$$
f \;\propto\; |K|^{-q/2}\,|Q|^{(\delta+q-1)/2}\,\bigl|Q + Y'K^{-1}Y\bigr|^{-(\delta+n+q-1)/2}, \qquad (18)
$$

where

$$
K = I_n + XHX',
$$

and the columns of Y are centered as in Section 4.1. With k = 0.05 and δ = 3 already fixed, (18) is a function, via H, of σ² and ρ. We used for our prior the values of these hyperparameters that maximize (18). Possible underestimation due to the use of the full spectra was taken into account by multiplying the estimate of σ² by the inflation factor 256/20, reflecting our prior belief as to the expected number of included coefficients.
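A sketch of this empirical Bayes step, assuming the AR(1) form of H and the settings above (function names and the optimizer are our own choices):

```python
import numpy as np
from scipy.optimize import minimize

def log_type2_likelihood(sigma2, rho, X, Y, k=0.05, delta=3):
    """Log of (18), the marginal (type II) likelihood used to set sigma^2 and rho."""
    n, q = Y.shape
    p = X.shape[1]
    Q = k * np.eye(q)
    idx = np.arange(p)
    H = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])   # h_ij = sigma^2 rho^|i-j|
    K = np.eye(n) + X @ H @ X.T
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_Q = np.linalg.slogdet(Q)
    _, logdet_M = np.linalg.slogdet(Q + Y.T @ np.linalg.solve(K, Y))
    return (-0.5 * q * logdet_K
            + 0.5 * (delta + q - 1) * logdet_Q
            - 0.5 * (delta + n + q - 1) * logdet_M)

def empirical_bayes_ar1(X, Y):
    """Maximize (18) over (sigma^2, rho), parameterized on an unconstrained scale."""
    def neg(theta):
        sigma2 = np.exp(theta[0])
        rho = 1.0 / (1.0 + np.exp(-theta[1]))
        return -log_type2_likelihood(sigma2, rho, X, Y)
    res = minimize(neg, x0=np.array([np.log(100.0), 0.0]), method='Nelder-Mead')
    return np.exp(res.x[0]), 1.0 / (1.0 + np.exp(-res.x[1]))
```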

Figure 3 shows the diagonal elements of the matrix H̃ = WHW′ implied by our choice of H. The variance matrix of the ith column of B̃ in (8) is σ_ii H̃, so this plot shows the pattern in the prior variances of the regression coefficients when the predictors are wavelet coefficients. The wavelet coefficients are ordered from coarsest to finest, so the decreasing prior variance means that there will be more shrinkage at the finer levels. This is a logical consequence of the smoothness that we have tried to express in the prior distribution. The spikes in the plot at the level transitions are from the boundary condition problems of the discrete wavelet transform.
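One way to reproduce the quantities behind Figure 3, assuming the same wavelet filters as before and building W by brute force rather than with the fast recursive algorithm of Vannucci and Corradi (1999) that the paper uses:

```python
import numpy as np
import pywt

def dwt_matrix(p=256, wavelet='db4', level=7):
    """Build the orthogonal DWT matrix W column by column by transforming unit
    vectors, so that W @ x stacks coefficients from coarsest to finest."""
    W = np.empty((p, p))
    for i in range(p):
        e = np.zeros(p)
        e[i] = 1.0
        W[:, i] = np.concatenate(
            pywt.wavedec(e, wavelet, mode='periodization', level=level))
    return W

# Prior variance pattern of Figure 3: diagonal of H~ = W H W',
# with the AR(1) matrix H (sigma^2 = 254, rho = 0.32) from Section 6.3.
p = 256
idx = np.arange(p)
H = 254.0 * 0.32 ** np.abs(idx[:, None] - idx[None, :])
W = dwt_matrix(p)
H_tilde = W @ H @ W.T
prior_variances = np.diag(H_tilde)
```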

We know from experience that good predictions can usually be obtained with 10 or so selected spectral points in examples of this type. Having no previous experience in selecting wavelet coefficients, and wanting to induce a similarly "small" model without constraining the possibilities too severely, we chose w in the prior for γ so that the expected model size was pw = 20. We have given equal prior probability here to coefficients at the different levels. Although we considered the possibility of varying w in blocks, we had no strong prior opinions about which levels were likely to provide the most useful coefficients, apart from a suspicion that neither the coarsest nor the finest levels would feature strongly.

6.4 Implementing the Metropolis Search

We chose widely different starting points for the four Metropolis chains, by setting to 1 the first 1, the first 20, the first 128 (i.e., half), and all of the elements of γ. There were 100,000 iterations in each run, where an iteration comprised either adding/deleting or swapping, as described in Section 4.2. The two moves were chosen with equal probability, φ = 1/2. Acceptance of the possible move by (14) was by generation of a Bernoulli random variable. Computation of g(γ) and g(γ*) was done using the QR decomposition of MATLAB.

For each chain we recorded the visited γ's and their corresponding relative probability g(γ). No burn-in was necessary, as relatively unlikely γ's would automatically be down-weighted in our analysis. There were approximately 38,000–40,000 successful moves for each of the four runs. Of these moves, around 95% were swaps. The relative probabilities of the set of distinct visited γ's were then normalized to 1 over this set. Figure 4 plots the marginal probabilities for components of γ, P(γ_j = 1), j = 1, ..., 256. The spikes show where regressor variables have been included in subsets with


[Figure 3. Diagonal Elements of H̃ = WHW′, With h̃_ii Plotted Against i. x-axis: wavelet coefficient; y-axis: prior variance of regression coefficient.]

[Figure 4. Marginal Probabilities of Components of γ for Four Runs, panels (i)–(iv). x-axis: wavelet coefficient; y-axis: probability.]


[Figure 5. Plots in Sequence Order for Run (iii): (a) the number of 1's; (b) log relative probabilities. x-axis: iteration number.]

high probability. For one of the runs, (iii), Figure 5 gives two more plots: the number of 1's and the log-relative probabilities log(g(γ)) of the visited γ's, plotted over the 100,000 iterations. The other runs produced very similar plots, quickly moving toward models of similar dimensions and posterior probability values.

Despite the very different starting points, the regions explored by the four chains have clear similarities, in that the plots of marginals are overall broadly similar. However, there are also clear differences, with some chains making frequent use of variables not picked up by others. Although we would not claim convergence, all four chains arrive at some similarly "good", albeit different, subsets. With mutually correlated predictor variables, as we have here, there will always tend to be many solutions to the problem of finding the best predictors. We adopt the pragmatic stance that all we are trying to do is identify some of the good solutions. If we happen to miss some other good ones, then this is unfortunate but not disastrous.

6.5 Results

We pooled the distinct γ's visited by the four chains, normalized the relative probabilities, and ordered them according to probability. Then we predicted the further 39 unseen samples using Bayes model averaging. Mean squared prediction errors, converted to the original scale, were 0.063, 0.449, 0.348, and 0.050. These results use the best 500 models, accounting for almost 99% of the total visited probability and using 219 wavelet coefficients. They improve considerably on all of the standard methods reported in Table 1. The single best subset among the visited ones had 10 coefficients, accounted for 9% of the total visited probability, and least squares predictions gave mean squared errors of 0.059, 0.466, 0.351, and 0.047.

Examining the scales of the selected coefficients is interesting. The model making the best predictions used coefficients (10, 11, 14, 17, 33, 34, 82, 132, 166, 255), which include (0, 0, 0, 37, 6, 6, 1, 2)% of all of the coefficients at the eight levels, from the coarsest to the finest scales. The most useful coefficients seem to be in the middle of the scale, with some more toward the finer end. Note that the very coarsest coefficients, which would be essential to any reconstruction of the spectra, are not used, despite the smoothness implicit in H and the consequent increased shrinkage associated with finer scales, as seen in Figure 3.

Some idea of the locations of the wavelet coefficients selected by the modal model can be obtained from Figure 6. Here we used the linearity of the wavelet transform and the linearity of the prediction equation to express the prediction equation as a vector of coefficients to be applied to the original spectral data. This vector is obtained by applying the inverse wavelet transform to the columns of the matrix of least squares estimates of the regression coefficients. Because selection has discarded unnecessary detail, most of the coefficients are very close to 0. We thus display only the 1600–1800 nm range, which permits a better view of the features of the coefficients in the range of interest. These coefficient vectors can be compared directly with those shown in Figure 2. They are applied to the same spectral data to produce predictions, and despite the different scales some comparisons are possible. For example, the two coefficient vectors for fat can be


easily interpreted. Fats and oils have absorption bands at around 1730 and 1765 nm (Osborne et al. 1993), and strong positive coefficients in this region are the major features of both plots. The wavelet-based equation (Fig. 6) is simpler, the selection having discarded much unnecessary detail. The other coefficient vectors show less resemblance and are also harder to interpret. One thing to bear in mind in interpreting these plots is that the four constituents add up to 100%. Thus it should not be surprising to find the fat measurement peak in all eight plots, nor that the coefficient vectors for sugar and flour are so strongly inversely related.

Finally, we comment briefly on results that we obtained by investigating logistic transformations of the data. Our response variables are in fact percentages and are constrained to sum to 100; thus they lie on a simplex rather than in the full q-dimensional space. Sample ranges are 18 ± 3 for fat, 17 ± 7 for sugar, 51 ± 6 for flour, and 14 ± 3 for water.

Following Aitchison (1986), we transformed the original Y's into log ratios of the form

$$
Z_1 = \ln(Y_1/Y_3), \qquad Z_2 = \ln(Y_2/Y_3), \qquad Z_3 = \ln(Y_4/Y_3). \qquad (19)
$$

The choice of the third ingredient for the denominator was the most natural, in that flour is the major constituent, and also because ingredients in recipes are often expressed as a ratio to flour content. We centered and scaled the Z variables and recomputed empirical Bayes estimates for σ² and ρ. (The other hyperparameters were not affected by the logistic transformation.) We then ran four Metropolis chains using the

[Figure 6. Coefficient Vectors From the "Best" Wavelet Equations. Four panels (Fat, Sugar, Flour, Water); x-axis: Wavelength (1400–1800 nm); y-axis: Coefficient.]

starting points used previously with the data in the original scale. Diagnostic plots and plots of the marginals appeared very similar to those of Figures 4 and 5. The four chains visited 151,183 distinct models. We finally computed Bayes model averaging and least squares predictions with the best model, unscaling the predicted values and transforming them back to the original scale as

$$
Y_i = \frac{100\,\exp(Z_i)}{\sum_{j=1}^{3}\exp(Z_j) + 1}, \quad i = 1, 2, \qquad
Y_3 = \frac{100}{\sum_{j=1}^{3}\exp(Z_j) + 1}, \qquad
Y_4 = \frac{100\,\exp(Z_3)}{\sum_{j=1}^{3}\exp(Z_j) + 1}.
$$
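A small sketch of the transform (19) and this inverse mapping (the column order fat, sugar, flour, water is our assumption):

```python
import numpy as np

def to_log_ratios(Y):
    """Forward transform (19): Y columns are (fat, sugar, flour, water) in
    percent; flour (column index 2) is the reference constituent."""
    return np.log(Y[:, [0, 1, 3]] / Y[:, [2]])

def from_log_ratios(Z):
    """Inverse of (19): recover percentages that sum exactly to 100."""
    denom = np.exp(Z).sum(axis=1, keepdims=True) + 1.0
    Y1, Y2, Y4 = (100.0 * np.exp(Z) / denom).T
    Y3 = (100.0 / denom).ravel()
    return np.column_stack([Y1, Y2, Y3, Y4])
```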

The best 500 visited models accounted for 99.4% of the total visited probability, used 214 wavelet coefficients, and gave Bayes mean squared prediction errors of 0.058, 0.819, 0.457, and 0.080. The single best subset among the visited ones had 10 coefficients, accounted for 15.8% of the total visited probability, and gave least squares prediction errors of 0.091, 0.793, 0.496, and 0.119. The logistic transformation does not seem to have a positive impact on the predictive performance, and overall the simpler linear model seems to be adequate. Our approach gives Bayes predictions on the original scale satisfying the constraint of summing exactly to 100. This stems from the conjugate prior, linearity in Y, the zero mean for the prior distribution of the regression coefficients, and the vague prior for the intercept. It is easily seen that, with X centered, α̂′1 = 100 and B̂1 = 0 for either least squares or Bayes estimates, and hence all predictions sum to 100. In addition, had we imposed a singular distribution to deal with the four responses summing


to 100, then a proper analysis would suggest eliminating one component; but the predictions for the remaining three components would then be as we have derived. Thus our analysis is Bayes for the singular problem, even though superficially it ignores this aspect. This does not address the positivity constraint, but the composition variables are all so far from the boundaries, compared to residual error, that this is not an issue. Our desire to stick with the original scale is supported by the Beer–Lambert law, which linearly relates absorbance to composition. The logistic transform distorts this linear relationship.

7. DISCUSSION

In specifying the prior parameters for our example, we made a number of arbitrary choices. In particular, the choice of H as the variance matrix of an autoregressive process is a rather crude representation of the smoothness in the coefficients. It might be interesting to try to model this in more detail. However, although there may be room for improvement, the simple structure used here does appear to work well.

Another area for future investigation is the use of more sophisticated wavelet systems in this context. The additional flexibility of wavelet packets (Coifman, Meyer, and Wickerhauser 1992) or m-band wavelets (Mallet, Coomans, Kautsky, and De Vel 1997) might lead to improved predictions, or might be just an unnecessary complication.

[Received October 1998 Revised November 2000]

REFERENCES

Aitchison, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman and Hall.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Fearn, T., and Vannucci, M. (1999), "The Choice of Variables in Multivariate Regression: A Bayesian Non-Conjugate Decision Theory Approach," Biometrika, 86, 635–648.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Journal of Chemometrics, 12, 173–182.
——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Carlin, B. P., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473–484.
Chipman, H. (1996), "Bayesian Variable Selection With Related Predictors," Canadian Journal of Statistics, 24, 17–36.
Clyde, M., DeSimone, H., and Parmigiani, G. (1996), "Prediction via Orthogonalized Model Mixing," Journal of the American Statistical Association, 91, 1197–1208.
Clyde, M., and George, E. I. (2000), "Flexible Empirical Bayes Estimation for Wavelets," Journal of the Royal Statistical Society, Ser. B, 62, 681–698.
Coifman, R. R., Meyer, Y., and Wickerhauser, M. V. (1992), "Wavelet Analysis and Signal Processing," in Wavelets and Their Applications, eds. M. B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, S. Mallat, Y. Meyer, and L. Raphael, Boston: Jones and Bartlett, pp. 153–178.
Cowe, I. A., and McNicol, J. W. (1985), "The Use of Principal Components in the Analysis of Near-Infrared Spectra," Applied Spectroscopy, 39, 257–266.
Daubechies, I. (1992), Ten Lectures on Wavelets (Vol. 61, CBMS-NSF Regional Conference Series in Applied Mathematics), Philadelphia: Society for Industrial and Applied Mathematics.
Dawid, A. P. (1981), "Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application," Biometrika, 68, 265–274.
Donoho, D., Johnstone, I., Kerkyacharian, G., and Picard, D. (1995), "Wavelet Shrinkage: Asymptopia?" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 301–369.
Geladi, P., and Martens, H. (1996), "A Calibration Tutorial for Spectral Data. Part 1: Data Pretreatment and Principal Component Regression Using Matlab," Journal of Near Infrared Spectroscopy, 4, 225–242.
Geladi, P., Martens, H., Hadjiiski, L., and Hopke, P. (1996), "A Calibration Tutorial for Spectral Data. Part 2: Partial Least Squares Regression Using Matlab and Some Neural Network Results," Journal of Near Infrared Spectroscopy, 4, 243–255.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Geweke, J. (1996), "Variable Selection and Model Comparison in Regression," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Clarendon Press, pp. 609–620.
Good, I. J. (1965), The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, MA: MIT Press.
Hrushka, W. R. (1987), "Data Analysis: Wavelength Selection Methods," in Near-Infrared Technology in the Agricultural and Food Industries, eds. P. Williams and K. Norris, St. Paul, MN: American Association of Cereal Chemists, pp. 35–55.
Leamer, E. E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580–587.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535–1546.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
Mallat, S. G. (1989), "Multiresolution Approximations and Wavelet Orthonormal Bases of L²(R)," Transactions of the American Mathematical Society, 315, 69–87.
Mallet, Y., Coomans, D., Kautsky, J., and De Vel, O. (1997), "Classification Using Adaptive Wavelets for Feature Extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1058–1066.
McClure, W. F., Hamid, A., Giesbrecht, F. G., and Weeks, W. W. (1984), "Fourier Analysis Enhances NIR Diffuse Reflectance Spectroscopy," Applied Spectroscopy, 38, 322–329.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression," Journal of the American Statistical Association, 83, 1023–1036.
Osborne, B. G., Fearn, T., and Hindle, P. H. (1993), Practical NIR Spectroscopy, Harlow, U.K.: Longman.
Osborne, B. G., Fearn, T., Miller, A. R., and Douglas, S. (1984), "Application of Near-Infrared Reflectance Spectroscopy to Compositional Analysis of Biscuits and Biscuit Doughs," Journal of the Science of Food and Agriculture, 35, 99–105.
Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997), "Bayesian Model Averaging for Linear Regression Models," Journal of the American Statistical Association, 92, 179–191.
Seber, G. A. F. (1984), Multivariate Observations, New York: Wiley.
Smith, A. F. M. (1973), "A General Bayesian Linear Model," Journal of the Royal Statistical Society, Ser. B, 35, 67–75.
Taswell, C. (1995), "Wavbox 4: A Software Toolbox for Wavelet Transforms and Adaptive Wavelet Packet Decompositions," in Wavelets and Statistics, eds. A. Antoniadis and G. Oppenheim, New York: Springer-Verlag, pp. 361–375.
Trygg, J., and Wold, S. (1998), "PLS Regression on Wavelet-Compressed NIR Spectra," Chemometrics and Intelligent Laboratory Systems, 42, 209–220.
Vannucci, M., and Corradi, F. (1999), "Covariance Structure of Wavelet Coefficients: Theory and Models in a Bayesian Perspective," Journal of the Royal Statistical Society, Ser. B, 61, 971–986.
Walczak, B., and Massart, D. L. (1997), "Noise Suppression and Signal Compression Using the Wavelet Packet Transform," Chemometrics and Intelligent Laboratory Systems, 36, 81–94.
Wold, S., Martens, H., and Wold, H. (1983), "The Multivariate Calibration Problem in Chemistry Solved by PLS," in Matrix Pencils, eds. A. Ruhe and B. Kågström, Heidelberg: Springer, pp. 286–293.

Page 4: 17-Bayesian Wavelet Regression on Curves With Application to a Spectroscopic Calibration Problem.pdf

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 401

The unknown parameters are B and the q q errorcovariance matrix egrave A conjugate prior for this model is asfollows First given egrave

0 ƒ 00 reg 4h1egrave5 (3)

and independently

B ƒ B0 reg 4H1 egrave50 (4)

The marginal distribution of egrave is then

egrave copymiddot4bdquo3 Q50 (5)

Note that the priors on both and B have covariances depen-dent on egrave in a way that directly extends the univariate regres-sion natural conjugate prior distributions

In practice we let h ˆ to represent vague prior knowl-edge about and take B0 D 0 leaving the speci cation of Hbdquo and Q to incorporate prior knowledge about our particularapplication

32 Transformation to Wavelets

We now transform the predictor variables by applying toeach row of X a wavelet transform as described in Section 21In matrix form multiplying each row of X by the same matrixW is equivalent to multiplying X on the right side by W0

The wavelet transform is orthogonal (ie W0W D I) andthus (2) can be written as

Y ƒ 1n 0 ƒ XW0WB reg 4 In1 egrave50 (6)

We can now express the model in terms of wavelet transfor-mations of the predictors as

Y ƒ 1n0 ƒ Z QB reg 4 In1egrave51 (7)

where Z D XW0 is now a matrix of wavelet coef cients andQB D WB is a matrix of regression coef cients The trans-formed prior on QB in the case of inclusion of all predictors is

QB reg 4 QH1 egrave51 (8)

where QH D WHW0 and the parameters and egrave are unchangedby the orthogonal transformations as are the priors (3) and (5)

In practice wavelets exploit the recursive application of lters and the W-matrix notation is more useful for expla-nation than for computation Vannucci and Corradi (1999)proposed a fast recursive algorithm for computing quantitiessuch as WHW0 Their algorithm has a useful link to the two-dimensional DWT (DWT2) making computations simple Thematrix WHW0 can be computed from H with an O4n25 algo-rithm (For more details see secs 31 and 32 of Vannucci andCorradi 1999)

33 A Framework for Variable Selection

To perform selection in the wavelet coef cient domain wefurther elaborate the prior on QB by introducing a latent binaryp-vector ƒ The jth element of ƒ1 ƒj may be either 1 or0 depending on whether the jth column of Z is or is notincluded in the model When ƒj is unity the covariance matrixof the corresponding row of QB is ldquolargerdquo and when ƒj is 0 thecovariance matrix is a zero matrix We have assumed that theprior expectation of QB is 0 and so ƒj

D 0 effectively deletesthe jth explanatory variable (or wavelet coef cient) from themodel This gives conditional on ƒ

QBƒ reg 4 QHƒ1egrave51 (9)

where QBƒ and QHƒ are just QB and QH with the rows and inthe case of QH columns for which ƒj

D 0 deleted Under thisprior each row of QB is modeled as having a scale mixture ofthe type

6 QB76j27 41 ƒ ƒj5I0C ƒjN 401 Qhjjegrave51 (10)

with Qhjj equal to the jth diagonal element of the matrix QH DWHW0 and I0 a distribution placing unit mass on the 1 qzero vector Note that the rows of QB are not independent

A simple prior distribution 4ƒ5 for ƒ takes the ƒj to beindependent with Pr4ƒj

D 15 D wj and Pr4ƒjD 05 D 1 ƒ wj

with hyperparameters wj to be speci ed for j D 11 1 p Inour example we take all of the wj equal to a common w sothat the nonzero elements of ƒ have a binomial distributionwith expectation pw

Mixture priors have been widely used for variable selectionin the original model space originally by Leamer (1978) andmore recently by George and McCulloch (1997) and Mitchelland Beauchamp (1988) for the linear multiple regression caseCarlin and Chib (1995) Chipman (1996) and Geweke (1996)among others concentrated on special features of these priorsClyde DeSimone and Parmigiani (1996) used model mix-ing in prediction problems with correlated predictors whenexpressing the space of models in terms of an orthogonaliza-tion of the design matrix Their methods are not directly appli-cable to our situation because the wavelet transforms do notleave us with an orthogonal design The use of mixture pri-ors for selection in the multivariate regression setup has beeninvestigated by Brown et al (1998ab)

4 SELECTING WAVELET COEFFICIENTS

41 Posterior Distribution of ƒ

The posterior distribution of ƒ given the data 4ƒ mdash Y1Z5assigns a posterior probability to each ƒ-vector and thus toeach possible subset of predictors (wavelet coef cients) Thisposterior arises from the combination of a likelihood that givesgreat weight to subsets explaining a high proportion of thevariation in the responses Y and a prior for ƒ that penalizeslarge subsets It can be computed by integrating out 1B1

and egrave from the joint posterior distribution of these parametersand ƒ given the data With the vague (h ˆ) prior for this parameter is essentially estimated by the mean Y in thecalibration data (see Smith 1973) and to simplify the formulas

402 Journal of the American Statistical Association June 2001

that follow we now assume that the columns of Y have beencentered (Full details of the prior to posterior analysis havebeen given by Brown et al 1998b who also considered otherprior structures) After some manipulation we have

4ƒ mdash Y1Z5 g4ƒ5 D QHƒ Z0ƒZƒ

C QHƒ1ƒ

ƒq=2

ƒ4nCbdquoCqƒ15=2 4ƒ51 (11)

where QƒD Q C Y0Y ƒ Y0Zƒ4Z0

ƒ ZƒC QHƒ1

ƒ 5ƒ1Z0ƒ Y and Zƒ is

Z with the columns for which ƒjD 0 deleted Care is needed

in computing (11) the alternative forms discussed later maybe useful

A simplifying feature of this setup is that all of the com-putations can be formulated as least squares problems withmodi ed Y and Z matrices By writing

QZƒD

sup3Zƒ

QH12ƒ

Ipƒ

acute1 QY D Y

01

where QH1=2ƒ is a matrix square root of QHƒ and pƒ is the number

of 1rsquos in ƒ the relevant quantities entering into (11) can becomputed as

QHƒ Z0ƒZƒ

C QHƒ1ƒ

D QZ0ƒ

QZƒ (12)

and

QƒD Q C QY0 QY ƒ QY0 QZƒ4 QZ0

ƒQZƒ5ƒ1 QZ0

ƒQY3 (13)

that is Qƒ is given by Q plus the residual sum of prod-ucts matrix from the least squares regression of QY on QZƒ The QR decomposition can then be used (see eg Seber1984 chap 10 sec 11b) which avoids ldquosquaringrdquo as in (12)and (13)

42 Metropolis Search

Equation (11) gives the posterior probability of each of the2p different ƒ vectors and thus of each choice of waveletcoef cient subsets What remains to do is to look for ldquogoodrdquowavelet components by computing these posterior probabili-ties When p is much greater than about 25 too many subsetsexist for this to be feasible Fortunately we can use simula-tion methods that will nd the ƒ vectors with relatively highposterior probabilities We can then quickly identify usefulcoef cients that have high marginal probabilities of ƒj

D 1Here we use a Metropolis search as suggested for model

selection by Madigan and York (1995) and applied to variableselection for regression by Brown et al (1998a) George andMcCulloch (1997) and Raftery Madigan and Hoeting (1997)The search starts from a randomly chosen ƒ0 and then movesthrough a sequence of further values of ƒ At each step thealgorithm generates a new candidate ƒ by randomly modify-ing the current one Two types of moves are used

iexcl Add or delete a component by choosing at random onecomponent in the current ƒ and changing its value Thismove is chosen with probability rdquo

iexcl Swap two components by choosing independently at ran-dom a 0 and a 1 in the current ƒ and changing both ofthem This move is chosen with probability 1ƒ rdquo

The new candidate model ƒ uuml is accepted with probability

ming4ƒ uuml 5

g4ƒ511 0 (14)

Thus a more probable ƒ uuml is always accepted and a less prob-able one may be accepted There is scope for further ingenu-ity in designing the sequence of random moves For examplemoves that add or subtract or swap two or three or more at atime or a combination of these may be useful

The sequence of ƒrsquos generated by the search is a realizationof a Markov chain and the choice of acceptance probabilitiesensures that the equilibrium distribution of this chain is thedistribution given by (11) In typical uses of such schemesthe realizations are monitored after a suitable burn-in periodto verify that they appear stationary Here we have a closedform for the posterior distribution and are using the chainsimply to explore this distribution Thus we have not beenso concerned about strict convergence of the Markov chainFollowing Brown et al (1998a) we adopt a strategy of runningthe chain from a number of different starting points (four here)and looking at the four marginal distributions provided by thecomputed g4cent5 values of the visited ƒrsquos We also look for goodindication of mixing and explorations with returns Becausewe know the relative probabilities we do not need to worryabout using a burn-in period

Note that given the form of the acceptance probability (14)ƒ-vectors with high posterior probability have a greater chanceof appearing in the sequence Thus we might expect that along run of such a chain would visit many of the best subsets

5 PREDICTION

Suppose now that we wish to predict Yf an m q matrix offurther Y-vectors given the corresponding X-vectors Xf 1 4mp5 First we treat Xf exactly as the training data have beentreated by subtracting the training data means and transform-ing to wavelet coef cients Zf 1 4m p5 The model for Yf following the model for the training data (6) is

Yfƒ 1m 0 ƒ Zf

QB reg 4 Im1egrave50 (15)

If we believe our Bayesian mixture model then logically weshould apply the same latent structure model to prediction aswell as to training This has the practical appeal of providingaveraging over a range of likely models (Madigan and Raftery1994)

Results of Brown et al (1998b) demonstrate that with thecolumns of Yf centered using the mean Y from the train-ing set the expectation of the predictive distribution p4Yf

mdashƒ1 Z1 Y5 is given by Zf

OBƒ with

OBƒD Z0

ƒZƒC QHƒ1

ƒ

ƒ1Z0

ƒY D QH12ƒ

QZ0ƒ

QZƒ

ƒ1 QZ0ƒ

QY0 (16)

Averaging over the posterior distribution of ƒ gives

OYfD

ZfOBƒ 4ƒ mdash Z1Y51 (17)

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 403

and we might choose to approximate this by some restrictedset of ƒ values perhaps the r most likely values from theMetropolis search

6 APPLICATION TO NEAR-INFRAREDSPECTROSCOPY OF BISCUIT DOUGHS

We now apply the methodology developed earlier to thespectroscopic calibration problem described in Section 11First however we report the results of some other analyses ofthese data

61 Analysis by Standard Methods

For all of the analyses carried out here both compositionaland spectral data were centered by subtracting the training setmeans from the training and validation data The responsesbut not the spectral data were also scaled to give each of thefour variables unit variance in the training set

Mean squared prediction errors were converted back to theoriginal scale by multiplying them by the training sample vari-ances This preprocessing of the data makes no difference tothe standard analyses which treat the response variables sep-arately but it simpli es the prior speci cation for our multi-variate wavelet analysis

Osborne et al (1984) derived calibrations by multipleregression using various stepwise procedures to select wave-lengths for each constituent separately The mean squarederror of predictions on the 39 validation samples for their cal-ibrations are reported in the rst row of Table 1

Brown et al (1999) also selected small numbers of wave-lengths to nd calibration equations for this example TheirBayesian decision theory approach differed from the approachof Osborne in being multivariate (ie in trying to nd onesmall subset of wavelengths suitable for predicting all fourconstituents) and in using a more extensive search using sim-ulated annealing The results for this alternative wavelengthselection approach are given in the second row of Table 1

For the purpose of comparison we carried out two otheranalyses using partial least squares regression (PLS) and prin-cipal components regression (PCR) These approaches bothof which construct factors from the full spectral data and thenregress constituents on the factors are very much the standardtools in NIR spectroscopy (see eg Geladi and Martens 1996Geladi Martens Hadjiiski and Hopke 1996) For the compu-tations we used the PLS Toolbox 20 of Wise and Gallagher(Eigenvector Research Manson WA) Although there aremultivariate versions of PLS we took the usual approach ofcalibrating for each constituent separately The number of fac-tors used selected in each case by cross-validation on thetraining set was ve for each of the PLS equations and six

Table 1 Mean Squared Errors of Prediction on the 39 Biscuit DoughPieces in the Validation Set Using Four Calibration Methods

Method Fat Sugar Flour Water

Stepwise MLR 0044 10188 0722 0221Decision theory 0076 0566 0265 0176PLS 0151 0583 0375 0105PCR 0160 0614 0388 0106

for each of the PCR equations The results given in rows threeand four of Table 1 show that as usual there is little to choosebetween the two methods These results are for PLS and PCRusing the reduced 256-point spectrum that we used for ourwavelet analysis Repeating the analyses using the original700-point spectrum yielded results very similar to those forPLS and somewhat worse than those reported for PCR

Because shortly we need to specify a prior distribution onregression coef cients it is interesting to examine those result-ing from a factor-type approach Combining the coef cientsfor the regression of constituent on factor scores with theloadings that produce scores from the original spectral vari-ables gives the coef cient vector that would be applied to ameasured spectrum to give a prediction Figure 2 plots thesevectors for the PLS equations for the four constituents show-ing the smoothness in the 256 coef cients that we attempt tore ect in our prior distribution for B

62 Wavelet Transforms of Spectra

To each spectrum we apply a wavelet transform convert-ing it to a set of 256 wavelet coef cients We used theMATLAB toolbox Wavbox 43 (Taswell 1995) for this stepUsing spectra with 2m (m integer) points is not a real restric-tion here Methods exist to overcome the limitation allow-ing the DWT to be applied on any length of data We usedMP(4) (Daubechies 1992 p 194) wavelets with four vanish-ing moments The Daubechies wavelets have compact supportimportant for good localization and a maximum number ofvanishing moments for a given smoothness A large number ofvanishing moments leads to high compressibility because the ne-scale wavelet coef cients are essentially 0 where the func-tions are smooth On the other hand support of the waveletsincreases with an increasing number of vanishing momentsso there is a trade-off with the localization properties Somerather limited exploration suggested that the chosen waveletfamily is a good compromise for these data

The graphs on the right side of Figure 1 show the wavelettransforms corresponding to the three NIR spectra in the leftcolumn Coef cients are ordered from coarsest to nest

63 Prior Settings

We need to specify the values of H bdquo and Q in (3) (4)and (5) and the prior probability w that an element of ƒ is1 We wish to put in weak but proper prior information aboutegrave We choose bdquo D 3 because this is the smallest integer valuesuch that the expectation of egrave E4egrave5 D Q=4bdquoƒ25 exists Thescale matrix Q is chosen as Q D k Iq with k D 005 compara-ble in size to the expected error variances of the standardizedY given X With bdquo small the choice of Q is unlikely to becritical

Much more likely to be in uential are the choices of H andw in the priors for B and ƒ To re ect the smoothness in thecoef cients B as exempli ed in Figure 2 while keeping theform of H simple we have taken H to be the variance matrixof a rst-order autoregressive process with hij

D lsquo 2mdashiƒjmdash Wederived the values lsquo 2 D 254 and D 032 by maximizing atype II likelihood (Good 1965) Integrating B and egrave fromthe joint distribution given by (2) (3) (4) and (5) for the

404 Journal of the American Statistical Association June 2001

1400 1600 1800 2000 2200 2400ndash 2

0

2

4

6

Coe

ffici

ent

Fat

1400 1600 1800 2000 2200 2400ndash 4

ndash 2

0

2

4Sugar

1400 1600 1800 2000 2200 2400ndash 4

ndash 2

0

2

4

Wavelength

Coe

ffici

ent

Flour

1400 1600 1800 2000 2200 2400ndash 4

ndash 2

0

2

4

6

Wavelength

Water

Figure 2 Coefrsquo cient Vectors From 5-Factor PLS Equations

regression on the full untransformed spectra with h ˆ andB0

D 0 we get

f mdashKmdashƒq=2mdashQmdash4bdquoCqƒ15=2mdashQ C Y0Kƒ1Ymdashƒ4bdquoCnCqƒ15=21 (18)

where

K D In C XHX0

and the columns of Y are centered as in Section 41 Withk D 005 and bdquo D 3 already xed (18) is a function via H oflsquo 2 and We used for our prior the values of these hyperpa-rameters that maximize (18) Possible underestimation due tothe use of the full spectra was taken into account by multiply-ing the estimate of lsquo 2 by the in ation factor 256=20 re ect-ing our prior belief as to the expected number of includedcoef cients

Figure 3 shows the diagonal elements of the matrix QH DWHW0 implied by our choice of H The variance matrix of theith column of QB in (8) is lsquo ii

QH so this plot shows the patternin the prior variances of the regression coef cients when thepredictors are wavelet coef cients The wavelet coef cientsare ordered from coarsest to nest so the decreasing priorvariance means that there will be more shrinkage at the nerlevels This is a logical consequence of the smoothness that wehave tried to express in the prior distribution The spikes in theplot at the level transitions are from the boundary conditionproblems of the discrete wavelet transform

We know from experience that good predictions can usuallybe obtained with 10 or so selected spectral points in exam-ples of this type Having no previous experience in selecting

wavelet coef cients and wanting to induce a similarly ldquosmallrdquomodel without constraining the possibilities too severely wechose w in the prior for ƒ so that the expected model sizewas pw D 20 We have given equal prior probability hereto coef cients at the different levels Although we consid-ered the possibility of varying w in blocks we had no strongprior opinions about which levels were likely to provide themost useful coef cients apart from a suspicion that neitherthe coarsest nor the nest levels would feature strongly

64 Implementing the Metropolis Search

We chose widely different starting points for the fourMetropolis chains by setting to 1 the rst 1 the rst 20 the rst 128 (ie half) and all elements of ƒ There were 100000iterations in each run where an iteration comprised eitheraddingdeleting or swapping as described in Section 42The two moves were chosen with equal probability rdquo D 1=2Acceptance of the possible move by (14) was by generation ofa Bernoulli random variable Computation of g4ƒ5 and g4ƒ uuml 5

was done using the QR decomposition of MATLABFor each chain we recorded the visited ƒrsquos and their corre-

sponding relative probability g4ƒ5 No burn-in was necessaryas relatively unlikely ƒrsquos would automatically be down-weighted in our analysis There were approximately 38000ndash40000 successful moves for each of the four runs Of thesemoves around 95 were swaps The relative probabilities ofthe set of distinct visited ƒ were then normalized to 1 overthis set Figure 4 plots the marginal probabilities for com-ponents of ƒ1 P4ƒj

D 151 j D 11 1256 The spikes showwhere regressor variables have been included in subsets with

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 405

0 50 100 150 200 250150

200

250

300

350

400

450

500

Wavelet coefficient

Prio

r va

rianc

e of

reg

ress

ion

coef

ficie

nt

Figure 3 Diagonal Elements of QH D WHW 0 With Qh i i Plotted Against i

0 50 100 150 200 2500

02

04

06

08

1(i)

Pro

babi

lity

0 50 100 150 200 2500

02

04

06

08

1(ii)

0 50 100 150 200 2500

02

04

06

08

1(iii)

Pro

babi

lity

Wavelet coefficient0 50 100 150 200 250

0

02

04

06

08

1

Wavelet coefficient

(iv)

Figure 4 Marginal Probabilities of Components of ƒ for Four Runs

406 Journal of the American Statistical Association June 2001

0 1 2 3 4 5 6 7 8 9 10

x 104

0

20

40

60

80

100

120

(a)

Iteration number

Num

ber

of o

nes

0 1 2 3 4 5 6 7 8 9 10

x 104

ndash 400

ndash 300

ndash 200

ndash 100

0(b)

Iteration number

Log

prob

s

Figure 5 Plots in Sequence Order for Run (iii) (a) The number of 1rsquos (b) log relative probabilities

high probability For one of the runs (iii) Figure 5 gives twomore plots the number of 1rsquos and the log-relative probabil-ities log4g4ƒ55 of the visited ƒ plotted over the 100000iterations The other runs produced very similar plots quicklymoving toward models of similar dimensions and posteriorprobability values

Despite the very different starting points the regionsexplored by the four chains have clear similarities in that plotsof marginals are overall broadly similar However there arealso clear differences with some chains making frequent useof variables not picked up by others Although we would notclaim convergence all four chains arrive at some similarlyldquogoodrdquo albeit different subsets With mutually correlated pre-dictor variables as we have here there will always tend tobe many solutions to the problem of nding the best predic-tors We adopt the pragmatic stance that all we are trying todo is identify some of the good solutions If we happen tomiss some other good ones then this is unfortunate but notdisastrous

65 Results

We pooled the distinct ƒrsquos visited by the four chains nor-malized the relative probabilities and ordered them accordingto probability Then we predicted the further 39 unseen sam-ples using Bayes model averaging Mean squared predictionerrors converted to the original scale were 00631 04491 0348and 0050 These results use the best 500 models accountingfor almost 99 of the total visited probability and using 219wavelet coef cients They improve considerably on all of thestandard methods reported in Table 1 The single best subset

among the visited ones had 10 coef cients accounted for 9of the total visited probability and least squares predictionsgave mean squared errors of 0059 0466 0351 and 0047

Examining the scales of the selected coef cients is interest-ing The model making the best predictions used coef cients(10 11 14 17 33 34 82 132 166 255) which include (00 0 37 6 6 1 2) of all of the coef cients atthe eight levels from the coarsest to the nest scales The mostuseful coef cients seem to be in the middle of the scale withsome more toward the ner end Note that the very coarsestcoef cients which would be essential to any reconstruction ofthe spectra are not used despite the smoothness implicit in Hand the consequent increased shrinkage associated with nerscales as seen in Figure 3

Some idea of the locations of the wavelet coef cientsselected by the modal model can be obtained from Figure 6Here we used the linearity of the wavelet transform and thelinearity of the prediction equation to express the predictionequation as a vector of coef cients to be applied to the originalspectral data This vector is obtained by applying the inversewavelet transform to the columns of the matrix of the leastsquares estimates of the regression coef cients Because selec-tion has discarded unnecessary details most of the coef cientsare very close to 0 We thus display only the 1600ndash1800 nmrange which permits a better view of the features of coef -cients in the range of interest These coef cient vectors canbe compared directly with those shown in Figure 2 They areapplied to the same spectral data to produce predictions anddespite the different scales some comparisons are possibleFor example the two coef cient vectors for fat can be eas-

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 407

ily interpreted Fats and oils have absorption bands at around1730 and 1765 nm (Osborne et al 1993) and strong positivecoef cients in this region are the major features of both plotsThe wavelet-based equation (Fig 6) is simpler the selectionhaving discarded much unnecessary detail The other coef -cient vectors show less resemblance and are also harder tointerpret One thing to bear in mind in interpreting these plotsis that the four constituents add up to 100 Thus it shouldnot be surprising to nd the fat measurement peak in all eightplots nor that the coef cient vectors for sugar and our are sostrongly inversely related

Finally we comment brie y on results that we obtained byinvestigating logistic transformations of the data Our responsevariables are in fact percentages and are constrained to sumup to 100 thus they lie on a simplex rather than in the fullq-dimensional space Sample ranges are 18 3 for fat 17 7for sugar 51 6 for our and 14 3 for water

Following Aitchison (1986) we transformed the originalY 0s into log ratios of the form

Z1 D ln4Y1=Y351 Z2 D ln4Y2=Y351 and

Z3 D ln4Y4=Y350 (19)

The choice of the third ingredient for the denominator wasthe most natural in that our is the major constituent andalso because ingredients in recipes often are expressed as aratio to our content We centered and scaled the Z variablesand recomputed empirical Bayes estimates for lsquo 2 and (Theother hyperparameters were not affected by the logistic trans-formation) We then ran four Metropolis chains using the start-

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Coe

ffici

ent

Fat

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Sugar

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Flour

Wavelength

Coe

ffici

ent

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Wavelength

Water

Figure 6 Coefrsquo cient Vectors From the ldquoBestrdquo Wavelet Equations

ing points used previously with the data in the original scaleDiagnostic plots and plots of the marginals appeared verysimilar to those of Figures 4 and 5 The four chains visited151183 distinct models We nally computed Bayes modelaveraging and least squares predictions with the best modelunscaling the predicted values and transforming them back tothe original scale as

YiD 100exp4Zi5P3

jD1 exp4Zj5 C 11 i D 11 21

Y3D 100

P3jD1 exp4Zj5 C 1

1 and Y4D 100exp4Z35P3

jD1 exp4Zj5 C 10

The best 500 visited models accounted for 994 of the totalvisited probability used 214 wavelet coef cients and gaveBayes mean squared prediction errors of 0058 0819 0457 and0080 The single best subset among the visited ones had 10coef cients accounted for 158 of the total visited proba-bility and gave least squares prediction errors of 00911 079310496 and 0119 The logistic transformation does not seem tohave a positive impact on the predictive performance andoverall the simpler linear model seems to be adequate Ourapproach gives Bayes predictions on the original scale satisfy-ing the constraint of summing exactly to 100 This stems fromthe conjugate prior linearity in Y 1 zero mean for the prior dis-tribution of the regression coef cients and vague prior for theintercept It is easily seen that with X centered O 01 D 100 andOB01 D 0 for either least squares or Bayes estimates and henceall predictions sum to 100 In addition had we imposed asingular distribution to deal with the four responses summing

408 Journal of the American Statistical Association June 2001

to 100 then a proper analysis would suggest eliminating onecomponent but then the predictions for the remaining threecomponents would be as we have derived Thus our analysisis Bayes for the singular problem even though super ciallyit ignores this aspect This does not address the positivityconstraint but the composition variables are all so far fromthe boundaries compared to residual error that this is not anissue Our desire to stick with the original scale is supportedby the BeerndashLambert law which linearly relates absorbanceto composition The logistic transform distorts this linearrelationship

7 DISCUSSION

In specifying the prior parameters for our example we madea number of arbitrary choices In particular the choice of H

as the variance matrix of an autoregressive process is a rathercrude representation of the smoothness in the coef cients Itmight be interesting to try to model this in more detail How-ever although there may be room for improvement the simplestructure used here does appear to work well

Another area for future investigation is the use of moresophisticated wavelet systems in this context The addi-tional exibility of wavelet packets (Coifman Meyer andWickerhauser 1992) or m-band wavelets (Mallet CoomansKautsky and De Vel 1997) might lead to improved predic-tions or might be just an unnecessary complication

[Received October 1998. Revised November 2000.]

REFERENCES

Aitchison, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman and Hall.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Fearn, T., and Vannucci, M. (1999), "The Choice of Variables in Multivariate Regression: A Bayesian Non-Conjugate Decision Theory Approach," Biometrika, 86, 635–648.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Journal of Chemometrics, 12, 173–182.
Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Carlin, B. P., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473–484.
Chipman, H. (1996), "Bayesian Variable Selection With Related Predictors," Canadian Journal of Statistics, 24, 17–36.
Clyde, M., DeSimone, H., and Parmigiani, G. (1996), "Prediction via Orthogonalized Model Mixing," Journal of the American Statistical Association, 91, 1197–1208.
Clyde, M., and George, E. I. (2000), "Flexible Empirical Bayes Estimation for Wavelets," Journal of the Royal Statistical Society, Ser. B, 62, 681–698.
Coifman, R. R., Meyer, Y., and Wickerhauser, M. V. (1992), "Wavelet Analysis and Signal Processing," in Wavelets and Their Applications, eds. M. B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, S. Mallat, Y. Meyer, and L. Raphael, Boston: Jones and Bartlett, pp. 153–178.
Cowe, I. A., and McNicol, J. W. (1985), "The Use of Principal Components in the Analysis of Near-Infrared Spectra," Applied Spectroscopy, 39, 257–266.
Daubechies, I. (1992), Ten Lectures on Wavelets (Vol. 61, CBMS-NSF Regional Conference Series in Applied Mathematics), Philadelphia: Society for Industrial and Applied Mathematics.
Dawid, A. P. (1981), "Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application," Biometrika, 68, 265–274.
Donoho, D., Johnstone, I., Kerkyacharian, G., and Picard, D. (1995), "Wavelet Shrinkage: Asymptopia?" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 301–369.
Geladi, P., and Martens, H. (1996), "A Calibration Tutorial for Spectral Data. Part 1: Data Pretreatment and Principal Component Regression Using Matlab," Journal of Near Infrared Spectroscopy, 4, 225–242.
Geladi, P., Martens, H., Hadjiiski, L., and Hopke, P. (1996), "A Calibration Tutorial for Spectral Data. Part 2: Partial Least Squares Regression Using Matlab and Some Neural Network Results," Journal of Near Infrared Spectroscopy, 4, 243–255.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Geweke, J. (1996), "Variable Selection and Model Comparison in Regression," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Clarendon Press, pp. 609–620.
Good, I. J. (1965), The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, MA: MIT Press.
Hrushka, W. R. (1987), "Data Analysis: Wavelength Selection Methods," in Near-Infrared Technology in the Agricultural and Food Industries, eds. P. Williams and K. Norris, St. Paul, MN: American Association of Cereal Chemists, pp. 35–55.
Leamer, E. E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580–587.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535–1546.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
Mallat, S. G. (1989), "Multiresolution Approximations and Wavelet Orthonormal Bases of L2(R)," Transactions of the American Mathematical Society, 315, 69–87.
Mallet, Y., Coomans, D., Kautsky, J., and De Vel, O. (1997), "Classification Using Adaptive Wavelets for Feature Extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1058–1066.
McClure, W. F., Hamid, A., Giesbrecht, F. G., and Weeks, W. W. (1984), "Fourier Analysis Enhances NIR Diffuse Reflectance Spectroscopy," Applied Spectroscopy, 38, 322–329.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression," Journal of the American Statistical Association, 83, 1023–1036.
Osborne, B. G., Fearn, T., and Hindle, P. H. (1993), Practical NIR Spectroscopy, Harlow, U.K.: Longman.
Osborne, B. G., Fearn, T., Miller, A. R., and Douglas, S. (1984), "Application of Near-Infrared Reflectance Spectroscopy to Compositional Analysis of Biscuits and Biscuit Doughs," Journal of the Science of Food and Agriculture, 35, 99–105.
Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997), "Bayesian Model Averaging for Linear Regression Models," Journal of the American Statistical Association, 92, 179–191.
Seber, G. A. F. (1984), Multivariate Observations, New York: Wiley.
Smith, A. F. M. (1973), "A General Bayesian Linear Model," Journal of the Royal Statistical Society, Ser. B, 35, 67–75.
Taswell, C. (1995), "Wavbox 4: A Software Toolbox for Wavelet Transforms and Adaptive Wavelet Packet Decompositions," in Wavelets and Statistics, eds. A. Antoniadis and G. Oppenheim, New York: Springer-Verlag, pp. 361–375.
Trygg, J., and Wold, S. (1998), "PLS Regression on Wavelet-Compressed NIR Spectra," Chemometrics and Intelligent Laboratory Systems, 42, 209–220.
Vannucci, M., and Corradi, F. (1999), "Covariance Structure of Wavelet Coefficients: Theory and Models in a Bayesian Perspective," Journal of the Royal Statistical Society, Ser. B, 61, 971–986.
Walczak, B., and Massart, D. L. (1997), "Noise Suppression and Signal Compression Using the Wavelet Packet Transform," Chemometrics and Intelligent Laboratory Systems, 36, 81–94.
Wold, S., Martens, H., and Wold, H. (1983), "The Multivariate Calibration Problem in Chemistry Solved by PLS," in Matrix Pencils, eds. A. Ruhe and B. Kagstrom, Heidelberg: Springer, pp. 286–293.



Page 6: 17-Bayesian Wavelet Regression on Curves With Application to a Spectroscopic Calibration Problem.pdf

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 403

and we might choose to approximate this by some restrictedset of ƒ values perhaps the r most likely values from theMetropolis search

6 APPLICATION TO NEAR-INFRAREDSPECTROSCOPY OF BISCUIT DOUGHS

We now apply the methodology developed earlier to thespectroscopic calibration problem described in Section 11First however we report the results of some other analyses ofthese data

61 Analysis by Standard Methods

For all of the analyses carried out here both compositionaland spectral data were centered by subtracting the training setmeans from the training and validation data The responsesbut not the spectral data were also scaled to give each of thefour variables unit variance in the training set

Mean squared prediction errors were converted back to theoriginal scale by multiplying them by the training sample vari-ances This preprocessing of the data makes no difference tothe standard analyses which treat the response variables sep-arately but it simpli es the prior speci cation for our multi-variate wavelet analysis

Osborne et al (1984) derived calibrations by multipleregression using various stepwise procedures to select wave-lengths for each constituent separately The mean squarederror of predictions on the 39 validation samples for their cal-ibrations are reported in the rst row of Table 1

Brown et al (1999) also selected small numbers of wave-lengths to nd calibration equations for this example TheirBayesian decision theory approach differed from the approachof Osborne in being multivariate (ie in trying to nd onesmall subset of wavelengths suitable for predicting all fourconstituents) and in using a more extensive search using sim-ulated annealing The results for this alternative wavelengthselection approach are given in the second row of Table 1

For the purpose of comparison we carried out two otheranalyses using partial least squares regression (PLS) and prin-cipal components regression (PCR) These approaches bothof which construct factors from the full spectral data and thenregress constituents on the factors are very much the standardtools in NIR spectroscopy (see eg Geladi and Martens 1996Geladi Martens Hadjiiski and Hopke 1996) For the compu-tations we used the PLS Toolbox 20 of Wise and Gallagher(Eigenvector Research Manson WA) Although there aremultivariate versions of PLS we took the usual approach ofcalibrating for each constituent separately The number of fac-tors used selected in each case by cross-validation on thetraining set was ve for each of the PLS equations and six

Table 1 Mean Squared Errors of Prediction on the 39 Biscuit DoughPieces in the Validation Set Using Four Calibration Methods

Method Fat Sugar Flour Water

Stepwise MLR 0044 10188 0722 0221Decision theory 0076 0566 0265 0176PLS 0151 0583 0375 0105PCR 0160 0614 0388 0106

for each of the PCR equations The results given in rows threeand four of Table 1 show that as usual there is little to choosebetween the two methods These results are for PLS and PCRusing the reduced 256-point spectrum that we used for ourwavelet analysis Repeating the analyses using the original700-point spectrum yielded results very similar to those forPLS and somewhat worse than those reported for PCR

Because shortly we need to specify a prior distribution onregression coef cients it is interesting to examine those result-ing from a factor-type approach Combining the coef cientsfor the regression of constituent on factor scores with theloadings that produce scores from the original spectral vari-ables gives the coef cient vector that would be applied to ameasured spectrum to give a prediction Figure 2 plots thesevectors for the PLS equations for the four constituents show-ing the smoothness in the 256 coef cients that we attempt tore ect in our prior distribution for B

62 Wavelet Transforms of Spectra

To each spectrum we apply a wavelet transform convert-ing it to a set of 256 wavelet coef cients We used theMATLAB toolbox Wavbox 43 (Taswell 1995) for this stepUsing spectra with 2m (m integer) points is not a real restric-tion here Methods exist to overcome the limitation allow-ing the DWT to be applied on any length of data We usedMP(4) (Daubechies 1992 p 194) wavelets with four vanish-ing moments The Daubechies wavelets have compact supportimportant for good localization and a maximum number ofvanishing moments for a given smoothness A large number ofvanishing moments leads to high compressibility because the ne-scale wavelet coef cients are essentially 0 where the func-tions are smooth On the other hand support of the waveletsincreases with an increasing number of vanishing momentsso there is a trade-off with the localization properties Somerather limited exploration suggested that the chosen waveletfamily is a good compromise for these data

The graphs on the right side of Figure 1 show the wavelettransforms corresponding to the three NIR spectra in the leftcolumn Coef cients are ordered from coarsest to nest

63 Prior Settings

We need to specify the values of H bdquo and Q in (3) (4)and (5) and the prior probability w that an element of ƒ is1 We wish to put in weak but proper prior information aboutegrave We choose bdquo D 3 because this is the smallest integer valuesuch that the expectation of egrave E4egrave5 D Q=4bdquoƒ25 exists Thescale matrix Q is chosen as Q D k Iq with k D 005 compara-ble in size to the expected error variances of the standardizedY given X With bdquo small the choice of Q is unlikely to becritical

Much more likely to be in uential are the choices of H andw in the priors for B and ƒ To re ect the smoothness in thecoef cients B as exempli ed in Figure 2 while keeping theform of H simple we have taken H to be the variance matrixof a rst-order autoregressive process with hij

D lsquo 2mdashiƒjmdash Wederived the values lsquo 2 D 254 and D 032 by maximizing atype II likelihood (Good 1965) Integrating B and egrave fromthe joint distribution given by (2) (3) (4) and (5) for the

404 Journal of the American Statistical Association June 2001

1400 1600 1800 2000 2200 2400ndash 2

0

2

4

6

Coe

ffici

ent

Fat

1400 1600 1800 2000 2200 2400ndash 4

ndash 2

0

2

4Sugar

1400 1600 1800 2000 2200 2400ndash 4

ndash 2

0

2

4

Wavelength

Coe

ffici

ent

Flour

1400 1600 1800 2000 2200 2400ndash 4

ndash 2

0

2

4

6

Wavelength

Water

Figure 2 Coefrsquo cient Vectors From 5-Factor PLS Equations

regression on the full untransformed spectra with h ˆ andB0

D 0 we get

f mdashKmdashƒq=2mdashQmdash4bdquoCqƒ15=2mdashQ C Y0Kƒ1Ymdashƒ4bdquoCnCqƒ15=21 (18)

where

K D In C XHX0

and the columns of Y are centered as in Section 41 Withk D 005 and bdquo D 3 already xed (18) is a function via H oflsquo 2 and We used for our prior the values of these hyperpa-rameters that maximize (18) Possible underestimation due tothe use of the full spectra was taken into account by multiply-ing the estimate of lsquo 2 by the in ation factor 256=20 re ect-ing our prior belief as to the expected number of includedcoef cients

Figure 3 shows the diagonal elements of the matrix QH DWHW0 implied by our choice of H The variance matrix of theith column of QB in (8) is lsquo ii

QH so this plot shows the patternin the prior variances of the regression coef cients when thepredictors are wavelet coef cients The wavelet coef cientsare ordered from coarsest to nest so the decreasing priorvariance means that there will be more shrinkage at the nerlevels This is a logical consequence of the smoothness that wehave tried to express in the prior distribution The spikes in theplot at the level transitions are from the boundary conditionproblems of the discrete wavelet transform

We know from experience that good predictions can usuallybe obtained with 10 or so selected spectral points in exam-ples of this type Having no previous experience in selecting

wavelet coef cients and wanting to induce a similarly ldquosmallrdquomodel without constraining the possibilities too severely wechose w in the prior for ƒ so that the expected model sizewas pw D 20 We have given equal prior probability hereto coef cients at the different levels Although we consid-ered the possibility of varying w in blocks we had no strongprior opinions about which levels were likely to provide themost useful coef cients apart from a suspicion that neitherthe coarsest nor the nest levels would feature strongly

64 Implementing the Metropolis Search

We chose widely different starting points for the fourMetropolis chains by setting to 1 the rst 1 the rst 20 the rst 128 (ie half) and all elements of ƒ There were 100000iterations in each run where an iteration comprised eitheraddingdeleting or swapping as described in Section 42The two moves were chosen with equal probability rdquo D 1=2Acceptance of the possible move by (14) was by generation ofa Bernoulli random variable Computation of g4ƒ5 and g4ƒ uuml 5

was done using the QR decomposition of MATLABFor each chain we recorded the visited ƒrsquos and their corre-

sponding relative probability g4ƒ5 No burn-in was necessaryas relatively unlikely ƒrsquos would automatically be down-weighted in our analysis There were approximately 38000ndash40000 successful moves for each of the four runs Of thesemoves around 95 were swaps The relative probabilities ofthe set of distinct visited ƒ were then normalized to 1 overthis set Figure 4 plots the marginal probabilities for com-ponents of ƒ1 P4ƒj

D 151 j D 11 1256 The spikes showwhere regressor variables have been included in subsets with

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 405

0 50 100 150 200 250150

200

250

300

350

400

450

500

Wavelet coefficient

Prio

r va

rianc

e of

reg

ress

ion

coef

ficie

nt

Figure 3 Diagonal Elements of QH D WHW 0 With Qh i i Plotted Against i

0 50 100 150 200 2500

02

04

06

08

1(i)

Pro

babi

lity

0 50 100 150 200 2500

02

04

06

08

1(ii)

0 50 100 150 200 2500

02

04

06

08

1(iii)

Pro

babi

lity

Wavelet coefficient0 50 100 150 200 250

0

02

04

06

08

1

Wavelet coefficient

(iv)

Figure 4 Marginal Probabilities of Components of ƒ for Four Runs

406 Journal of the American Statistical Association June 2001

0 1 2 3 4 5 6 7 8 9 10

x 104

0

20

40

60

80

100

120

(a)

Iteration number

Num

ber

of o

nes

0 1 2 3 4 5 6 7 8 9 10

x 104

ndash 400

ndash 300

ndash 200

ndash 100

0(b)

Iteration number

Log

prob

s

Figure 5 Plots in Sequence Order for Run (iii) (a) The number of 1rsquos (b) log relative probabilities

high probability For one of the runs (iii) Figure 5 gives twomore plots the number of 1rsquos and the log-relative probabil-ities log4g4ƒ55 of the visited ƒ plotted over the 100000iterations The other runs produced very similar plots quicklymoving toward models of similar dimensions and posteriorprobability values

Despite the very different starting points the regionsexplored by the four chains have clear similarities in that plotsof marginals are overall broadly similar However there arealso clear differences with some chains making frequent useof variables not picked up by others Although we would notclaim convergence all four chains arrive at some similarlyldquogoodrdquo albeit different subsets With mutually correlated pre-dictor variables as we have here there will always tend tobe many solutions to the problem of nding the best predic-tors We adopt the pragmatic stance that all we are trying todo is identify some of the good solutions If we happen tomiss some other good ones then this is unfortunate but notdisastrous

65 Results

We pooled the distinct ƒrsquos visited by the four chains nor-malized the relative probabilities and ordered them accordingto probability Then we predicted the further 39 unseen sam-ples using Bayes model averaging Mean squared predictionerrors converted to the original scale were 00631 04491 0348and 0050 These results use the best 500 models accountingfor almost 99 of the total visited probability and using 219wavelet coef cients They improve considerably on all of thestandard methods reported in Table 1 The single best subset

among the visited ones had 10 coef cients accounted for 9of the total visited probability and least squares predictionsgave mean squared errors of 0059 0466 0351 and 0047

Examining the scales of the selected coef cients is interest-ing The model making the best predictions used coef cients(10 11 14 17 33 34 82 132 166 255) which include (00 0 37 6 6 1 2) of all of the coef cients atthe eight levels from the coarsest to the nest scales The mostuseful coef cients seem to be in the middle of the scale withsome more toward the ner end Note that the very coarsestcoef cients which would be essential to any reconstruction ofthe spectra are not used despite the smoothness implicit in Hand the consequent increased shrinkage associated with nerscales as seen in Figure 3

Some idea of the locations of the wavelet coef cientsselected by the modal model can be obtained from Figure 6Here we used the linearity of the wavelet transform and thelinearity of the prediction equation to express the predictionequation as a vector of coef cients to be applied to the originalspectral data This vector is obtained by applying the inversewavelet transform to the columns of the matrix of the leastsquares estimates of the regression coef cients Because selec-tion has discarded unnecessary details most of the coef cientsare very close to 0 We thus display only the 1600ndash1800 nmrange which permits a better view of the features of coef -cients in the range of interest These coef cient vectors canbe compared directly with those shown in Figure 2 They areapplied to the same spectral data to produce predictions anddespite the different scales some comparisons are possibleFor example the two coef cient vectors for fat can be eas-

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 407

ily interpreted Fats and oils have absorption bands at around1730 and 1765 nm (Osborne et al 1993) and strong positivecoef cients in this region are the major features of both plotsThe wavelet-based equation (Fig 6) is simpler the selectionhaving discarded much unnecessary detail The other coef -cient vectors show less resemblance and are also harder tointerpret One thing to bear in mind in interpreting these plotsis that the four constituents add up to 100 Thus it shouldnot be surprising to nd the fat measurement peak in all eightplots nor that the coef cient vectors for sugar and our are sostrongly inversely related

Finally we comment brie y on results that we obtained byinvestigating logistic transformations of the data Our responsevariables are in fact percentages and are constrained to sumup to 100 thus they lie on a simplex rather than in the fullq-dimensional space Sample ranges are 18 3 for fat 17 7for sugar 51 6 for our and 14 3 for water

Following Aitchison (1986) we transformed the originalY 0s into log ratios of the form

Z1 D ln4Y1=Y351 Z2 D ln4Y2=Y351 and

Z3 D ln4Y4=Y350 (19)

The choice of the third ingredient for the denominator wasthe most natural in that our is the major constituent andalso because ingredients in recipes often are expressed as aratio to our content We centered and scaled the Z variablesand recomputed empirical Bayes estimates for lsquo 2 and (Theother hyperparameters were not affected by the logistic trans-formation) We then ran four Metropolis chains using the start-

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Coe

ffici

ent

Fat

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Sugar

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Flour

Wavelength

Coe

ffici

ent

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Wavelength

Water

Figure 6 Coefrsquo cient Vectors From the ldquoBestrdquo Wavelet Equations

ing points used previously with the data in the original scaleDiagnostic plots and plots of the marginals appeared verysimilar to those of Figures 4 and 5 The four chains visited151183 distinct models We nally computed Bayes modelaveraging and least squares predictions with the best modelunscaling the predicted values and transforming them back tothe original scale as

YiD 100exp4Zi5P3

jD1 exp4Zj5 C 11 i D 11 21

Y3D 100

P3jD1 exp4Zj5 C 1

1 and Y4D 100exp4Z35P3

jD1 exp4Zj5 C 10

The best 500 visited models accounted for 994 of the totalvisited probability used 214 wavelet coef cients and gaveBayes mean squared prediction errors of 0058 0819 0457 and0080 The single best subset among the visited ones had 10coef cients accounted for 158 of the total visited proba-bility and gave least squares prediction errors of 00911 079310496 and 0119 The logistic transformation does not seem tohave a positive impact on the predictive performance andoverall the simpler linear model seems to be adequate Ourapproach gives Bayes predictions on the original scale satisfy-ing the constraint of summing exactly to 100 This stems fromthe conjugate prior linearity in Y 1 zero mean for the prior dis-tribution of the regression coef cients and vague prior for theintercept It is easily seen that with X centered O 01 D 100 andOB01 D 0 for either least squares or Bayes estimates and henceall predictions sum to 100 In addition had we imposed asingular distribution to deal with the four responses summing

408 Journal of the American Statistical Association June 2001

to 100 then a proper analysis would suggest eliminating onecomponent but then the predictions for the remaining threecomponents would be as we have derived Thus our analysisis Bayes for the singular problem even though super ciallyit ignores this aspect This does not address the positivityconstraint but the composition variables are all so far fromthe boundaries compared to residual error that this is not anissue Our desire to stick with the original scale is supportedby the BeerndashLambert law which linearly relates absorbanceto composition The logistic transform distorts this linearrelationship

7 DISCUSSION

In specifying the prior parameters for our example we madea number of arbitrary choices In particular the choice of H

as the variance matrix of an autoregressive process is a rathercrude representation of the smoothness in the coef cients Itmight be interesting to try to model this in more detail How-ever although there may be room for improvement the simplestructure used here does appear to work well

Another area for future investigation is the use of moresophisticated wavelet systems in this context The addi-tional exibility of wavelet packets (Coifman Meyer andWickerhauser 1992) or m-band wavelets (Mallet CoomansKautsky and De Vel 1997) might lead to improved predic-tions or might be just an unnecessary complication

[Received October 1998 Revised November 2000]

REFERENCES

Aitchison, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman and Hall.
Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Fearn, T., and Vannucci, M. (1999), "The Choice of Variables in Multivariate Regression: A Bayesian Non-Conjugate Decision Theory Approach," Biometrika, 86, 635–648.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Journal of Chemometrics, 12, 173–182.
——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Carlin, B. P., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473–484.
Chipman, H. (1996), "Bayesian Variable Selection With Related Predictors," Canadian Journal of Statistics, 24, 17–36.
Clyde, M., DeSimone, H., and Parmigiani, G. (1996), "Prediction via Orthogonalized Model Mixing," Journal of the American Statistical Association, 91, 1197–1208.
Clyde, M., and George, E. I. (2000), "Flexible Empirical Bayes Estimation for Wavelets," Journal of the Royal Statistical Society, Ser. B, 62, 681–698.
Coifman, R. R., Meyer, Y., and Wickerhauser, M. V. (1992), "Wavelet Analysis and Signal Processing," in Wavelets and Their Applications, eds. M. B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, S. Mallat, Y. Meyer, and L. Raphael, Boston: Jones and Bartlett, pp. 153–178.
Cowe, I. A., and McNicol, J. W. (1985), "The Use of Principal Components in the Analysis of Near-Infrared Spectra," Applied Spectroscopy, 39, 257–266.
Daubechies, I. (1992), Ten Lectures on Wavelets (Vol. 61, CBMS-NSF Regional Conference Series in Applied Mathematics), Philadelphia: Society for Industrial and Applied Mathematics.
Dawid, A. P. (1981), "Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application," Biometrika, 68, 265–274.
Donoho, D., Johnstone, I., Kerkyacharian, G., and Picard, D. (1995), "Wavelet Shrinkage: Asymptopia?" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 301–369.
Geladi, P., and Martens, H. (1996), "A Calibration Tutorial for Spectral Data. Part 1: Data Pretreatment and Principal Component Regression Using Matlab," Journal of Near Infrared Spectroscopy, 4, 225–242.
Geladi, P., Martens, H., Hadjiiski, L., and Hopke, P. (1996), "A Calibration Tutorial for Spectral Data. Part 2: Partial Least Squares Regression Using Matlab and Some Neural Network Results," Journal of Near Infrared Spectroscopy, 4, 243–255.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Geweke, J. (1996), "Variable Selection and Model Comparison in Regression," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Clarendon Press, pp. 609–620.
Good, I. J. (1965), The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, MA: MIT Press.
Hrushka, W. R. (1987), "Data Analysis: Wavelength Selection Methods," in Near-Infrared Technology in the Agricultural and Food Industries, eds. P. Williams and K. Norris, St. Paul, MN: American Association of Cereal Chemists, pp. 35–55.
Leamer, E. E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580–587.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535–1546.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
Mallat, S. G. (1989), "Multiresolution Approximations and Wavelet Orthonormal Bases of L2(R)," Transactions of the American Mathematical Society, 315, 69–87.
Mallet, Y., Coomans, D., Kautsky, J., and De Vel, O. (1997), "Classification Using Adaptive Wavelets for Feature Extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1058–1066.
McClure, W. F., Hamid, A., Giesbrecht, F. G., and Weeks, W. W. (1984), "Fourier Analysis Enhances NIR Diffuse Reflectance Spectroscopy," Applied Spectroscopy, 38, 322–329.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression," Journal of the American Statistical Association, 83, 1023–1036.
Osborne, B. G., Fearn, T., and Hindle, P. H. (1993), Practical NIR Spectroscopy, Harlow, U.K.: Longman.
Osborne, B. G., Fearn, T., Miller, A. R., and Douglas, S. (1984), "Application of Near-Infrared Reflectance Spectroscopy to Compositional Analysis of Biscuits and Biscuit Doughs," Journal of the Science of Food and Agriculture, 35, 99–105.
Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997), "Bayesian Model Averaging for Linear Regression Models," Journal of the American Statistical Association, 92, 179–191.
Seber, G. A. F. (1984), Multivariate Observations, New York: Wiley.
Smith, A. F. M. (1973), "A General Bayesian Linear Model," Journal of the Royal Statistical Society, Ser. B, 35, 67–75.
Taswell, C. (1995), "Wavbox 4: A Software Toolbox for Wavelet Transforms and Adaptive Wavelet Packet Decompositions," in Wavelets and Statistics, eds. A. Antoniadis and G. Oppenheim, New York: Springer-Verlag, pp. 361–375.
Trygg, J., and Wold, S. (1998), "PLS Regression on Wavelet-Compressed NIR Spectra," Chemometrics and Intelligent Laboratory Systems, 42, 209–220.
Vannucci, M., and Corradi, F. (1999), "Covariance Structure of Wavelet Coefficients: Theory and Models in a Bayesian Perspective," Journal of the Royal Statistical Society, Ser. B, 61, 971–986.
Walczak, B., and Massart, D. L. (1997), "Noise Suppression and Signal Compression Using the Wavelet Packet Transform," Chemometrics and Intelligent Laboratory Systems, 36, 81–94.
Wold, S., Martens, H., and Wold, H. (1983), "The Multivariate Calibration Problem in Chemistry Solved by PLS," in Matrix Pencils, eds. A. Ruhe and B. Kågström, Heidelberg: Springer, pp. 286–293.

The best 500 visited models accounted for 994 of the totalvisited probability used 214 wavelet coef cients and gaveBayes mean squared prediction errors of 0058 0819 0457 and0080 The single best subset among the visited ones had 10coef cients accounted for 158 of the total visited proba-bility and gave least squares prediction errors of 00911 079310496 and 0119 The logistic transformation does not seem tohave a positive impact on the predictive performance andoverall the simpler linear model seems to be adequate Ourapproach gives Bayes predictions on the original scale satisfy-ing the constraint of summing exactly to 100 This stems fromthe conjugate prior linearity in Y 1 zero mean for the prior dis-tribution of the regression coef cients and vague prior for theintercept It is easily seen that with X centered O 01 D 100 andOB01 D 0 for either least squares or Bayes estimates and henceall predictions sum to 100 In addition had we imposed asingular distribution to deal with the four responses summing

408 Journal of the American Statistical Association June 2001

to 100 then a proper analysis would suggest eliminating onecomponent but then the predictions for the remaining threecomponents would be as we have derived Thus our analysisis Bayes for the singular problem even though super ciallyit ignores this aspect This does not address the positivityconstraint but the composition variables are all so far fromthe boundaries compared to residual error that this is not anissue Our desire to stick with the original scale is supportedby the BeerndashLambert law which linearly relates absorbanceto composition The logistic transform distorts this linearrelationship

7 DISCUSSION

In specifying the prior parameters for our example we madea number of arbitrary choices In particular the choice of H

as the variance matrix of an autoregressive process is a rathercrude representation of the smoothness in the coef cients Itmight be interesting to try to model this in more detail How-ever although there may be room for improvement the simplestructure used here does appear to work well

Another area for future investigation is the use of moresophisticated wavelet systems in this context The addi-tional exibility of wavelet packets (Coifman Meyer andWickerhauser 1992) or m-band wavelets (Mallet CoomansKautsky and De Vel 1997) might lead to improved predic-tions or might be just an unnecessary complication

[Received October 1998 Revised November 2000]

REFERENCES

Aitchison J (1986) The Statistical Analysis of Compositional Data LondonChapman and Hall

Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press

Brown P J Fearn T and Vannucci M (1999) ldquoThe Choice of Variablesin Multivariate Regression A Bayesian Non-Conjugate Decision TheoryApproachrdquo Biometrika 86 635ndash648

Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian WavelengthSelection in Multicomponent Analysisrdquo Journal of Chemometrics 12173ndash182

(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641

Carlin B P and Chib S (1995) ldquoBayesian Model Choice via MarkovChain Monte Carlordquo Journal of the Royal Statistical Society Ser B 57473ndash484

Chipman H (1996) ldquoBayesian Variable Selection With Related PredictorsrdquoCanadian Journal of Statistics 24 17ndash36

Clyde M DeSimone H and Parmigiani G (1996) ldquoPrediction via Orthog-onalized Model Mixingrdquo Journal of the American Statistical Association91 1197ndash1208

Clyde M and George E I (2000) ldquoFlexible Empirical Bayes Estima-tion for Waveletsrdquo Journal of the Royal Statistical Society Ser B 62681ndash698

Coifman R R Meyer Y and Wickerhauser M V (1992) ldquoWavelet Anal-ysis and Signal Processingrdquo in Wavelets and Their Applications eds M BRuskai G Beylkin R Coifman I Daubechies S Mallat Y Meyer andL Raphael Boston Jones and Bartlett pp 153ndash178

Cowe I A and McNicol J W (1985) ldquoThe Use of Principal Compo-nents in the Analysis of Near- Infrared Spectrardquo Applied Spectroscopy 39257ndash266

Daubechies I (1992) Ten Lectures on Wavelets (Vol 61 CBMS-NSF

Regional Conference Series in Applied Mathematics) Philadephia Societyfor Industrial and Applied Mathematics

Dawid A P (1981) ldquoSome Matrix-Variate Distribution Theory NotationalConsiderations and a Bayesian Applicationrdquo Biometrika 68 265ndash274

Donoho D Johnstone I Kerkyacharian G and Picard D (1995) ldquoWaveletShrinkage Asymptopiardquo (with discussion) Journal of the Royal StatisticalSociety Ser B 57 301ndash369

Geladi P and Martens H (1996) ldquoA Calibration Tutorial for Spectral DataPart 1 Data Pretreatment and Principal Component Regression Using Mat-labrdquo Journal of Near Infrared Spectroscopy 4 225ndash242

Geladi P Martens H Hadjiiski L and Hopke P (1996) ldquoA CalibrationTutorial for Spectral Data Part 2 Partial Least Squares Regression UsingMatlab and Some Neural Network Resultsrdquo Journal of Near Infrared Spec-troscopy 4 243ndash255

George E I and McCulloch R E (1997) ldquoApproaches for Bayesian Vari-able Selectionrdquo Statistica Sinica 7 339ndash373

Geweke J (1996) ldquoVariable Selection and Model Comparison in Regres-sionrdquo in Bayesian Statistics 5 eds J M Bernardo J O Berger A PDawid and A F M Smith Oxford UK Clarendon Press pp 609ndash620

Good I J (1965) The Estimation of Probabilities An Essay on ModernBayesian Methods Cambridge MA MIT Press

Hrushka W R (1987) ldquoData Analysis Wavelength Selection Methodsrdquoin Near- Infrared Technology in the Agricultural and Food Industries edsP Williams and K Norris St Paul MN American Association of CerealChemists pp 35ndash55

Leamer E E (1978) ldquoRegression Selection Strategies and Revealed PriorsrdquoJournal of the American Statistical Association 73 580ndash587

Madigan D and Raftery A E (1994) ldquoModel Selection and Accounting forModel Uncertainty in Graphical Models Using Occamrsquos Windowrdquo Journalof the American Statistical Association 89 1535ndash1546

Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232

Mallat S G (1989) ldquoMultiresolution Approximations and Wavelet Orthonor-mal Bases of L24r5rdquo Transactions of the American Mathematical Society315 69ndash87

Mallet Y Coomans D Kautsky J and De Vel O (1997) ldquoClassi cationUsing Adaptive Wavelets for Feature Extractionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence 19 1058ndash1066

McClure W F Hamid A Giesbrecht F G and Weeks W W (1984)ldquoFourier Analysis Enhances NIR Diffuse Re ectance SpectroscopyrdquoApplied Spectroscopy 38 322ndash329

Mitchell T J and Beauchamp J J (1988) ldquoBayesian Variable Selectionin Linear Regressionrdquo Journal of the American Statistical Association 831023ndash1036

Osborne B G Fearn T and Hindle P H (1993) Practical NIR Spec-troscopy Harlow UK Longman

Osborne B G Fearn T Miller A R and Douglas S (1984) ldquoApplica-tion of Near- Infrared Re ectance Spectroscopy to Compositional Analysisof Biscuits and Biscuit Doughsrdquo Journal of the Science of Food and Agri-culture 35 99ndash105

Raftery A E Madigan D and Hoeting J A (1997) ldquoBayesian ModelAveraging for Linear Regression Modelsrdquo Journal of the American Statis-tical Association 92 179ndash191

Seber G A F (1984) Multivariate Observations New York WileySmith A F M (1973) ldquoA General Bayesian Linear Modelrdquo Journal of the

Royal Statistical Society Ser B 35 67ndash75Taswell C (1995) ldquoWavbox 4 A Software Toolbox for Wavelet Transforms

and Adaptive Wavelet Packet Decompositionsrdquo in Wavelets and Statis-tics eds A Antoniadis and G Oppenheim New York Springer-Verlagpp 361ndash375

Trygg J and Wold S (1998) ldquoPLS Regression on Wavelet-CompressedNIR Spectrardquo Chemometrics and Intelligent Laboratory Systems 42209ndash220

Vannucci M and Corradi F (1999) ldquoCovariance Structure of Wavelet Coef- cients Theory and Models in a Bayesian Perspectiverdquo Journal of theRoyal Statistical Society Ser B 61 971ndash986

Walczak B and Massart D L (1997) ldquoNoise Suppression and Signal Com-pression Using the Wavelet Packet Transformrdquo Chemometrics and Intelli-gent Laboratory Systems 36 81ndash94

Wold S Martens H and Wold H (1983) ldquoThe Multivariate CalibrationProblem in Chemistry Solved by PLSrdquo in Matrix Pencils eds A Ruhe andB Kagstrom Heidelberg Springer pp 286ndash293

Page 8: 17-Bayesian Wavelet Regression on Curves With Application to a Spectroscopic Calibration Problem.pdf

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 405

0 50 100 150 200 250150

200

250

300

350

400

450

500

Wavelet coefficient

Prio

r va

rianc

e of

reg

ress

ion

coef

ficie

nt

Figure 3 Diagonal Elements of QH D WHW 0 With Qh i i Plotted Against i

0 50 100 150 200 2500

02

04

06

08

1(i)

Pro

babi

lity

0 50 100 150 200 2500

02

04

06

08

1(ii)

0 50 100 150 200 2500

02

04

06

08

1(iii)

Pro

babi

lity

Wavelet coefficient0 50 100 150 200 250

0

02

04

06

08

1

Wavelet coefficient

(iv)

Figure 4 Marginal Probabilities of Components of ƒ for Four Runs

406 Journal of the American Statistical Association June 2001

0 1 2 3 4 5 6 7 8 9 10

x 104

0

20

40

60

80

100

120

(a)

Iteration number

Num

ber

of o

nes

0 1 2 3 4 5 6 7 8 9 10

x 104

ndash 400

ndash 300

ndash 200

ndash 100

0(b)

Iteration number

Log

prob

s

Figure 5 Plots in Sequence Order for Run (iii) (a) The number of 1rsquos (b) log relative probabilities

high probability For one of the runs (iii) Figure 5 gives twomore plots the number of 1rsquos and the log-relative probabil-ities log4g4ƒ55 of the visited ƒ plotted over the 100000iterations The other runs produced very similar plots quicklymoving toward models of similar dimensions and posteriorprobability values

Despite the very different starting points the regionsexplored by the four chains have clear similarities in that plotsof marginals are overall broadly similar However there arealso clear differences with some chains making frequent useof variables not picked up by others Although we would notclaim convergence all four chains arrive at some similarlyldquogoodrdquo albeit different subsets With mutually correlated pre-dictor variables as we have here there will always tend tobe many solutions to the problem of nding the best predic-tors We adopt the pragmatic stance that all we are trying todo is identify some of the good solutions If we happen tomiss some other good ones then this is unfortunate but notdisastrous

65 Results

We pooled the distinct ƒrsquos visited by the four chains nor-malized the relative probabilities and ordered them accordingto probability Then we predicted the further 39 unseen sam-ples using Bayes model averaging Mean squared predictionerrors converted to the original scale were 00631 04491 0348and 0050 These results use the best 500 models accountingfor almost 99 of the total visited probability and using 219wavelet coef cients They improve considerably on all of thestandard methods reported in Table 1 The single best subset

among the visited ones had 10 coef cients accounted for 9of the total visited probability and least squares predictionsgave mean squared errors of 0059 0466 0351 and 0047

Examining the scales of the selected coef cients is interest-ing The model making the best predictions used coef cients(10 11 14 17 33 34 82 132 166 255) which include (00 0 37 6 6 1 2) of all of the coef cients atthe eight levels from the coarsest to the nest scales The mostuseful coef cients seem to be in the middle of the scale withsome more toward the ner end Note that the very coarsestcoef cients which would be essential to any reconstruction ofthe spectra are not used despite the smoothness implicit in Hand the consequent increased shrinkage associated with nerscales as seen in Figure 3

Some idea of the locations of the wavelet coef cientsselected by the modal model can be obtained from Figure 6Here we used the linearity of the wavelet transform and thelinearity of the prediction equation to express the predictionequation as a vector of coef cients to be applied to the originalspectral data This vector is obtained by applying the inversewavelet transform to the columns of the matrix of the leastsquares estimates of the regression coef cients Because selec-tion has discarded unnecessary details most of the coef cientsare very close to 0 We thus display only the 1600ndash1800 nmrange which permits a better view of the features of coef -cients in the range of interest These coef cient vectors canbe compared directly with those shown in Figure 2 They areapplied to the same spectral data to produce predictions anddespite the different scales some comparisons are possibleFor example the two coef cient vectors for fat can be eas-

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 407

ily interpreted Fats and oils have absorption bands at around1730 and 1765 nm (Osborne et al 1993) and strong positivecoef cients in this region are the major features of both plotsThe wavelet-based equation (Fig 6) is simpler the selectionhaving discarded much unnecessary detail The other coef -cient vectors show less resemblance and are also harder tointerpret One thing to bear in mind in interpreting these plotsis that the four constituents add up to 100 Thus it shouldnot be surprising to nd the fat measurement peak in all eightplots nor that the coef cient vectors for sugar and our are sostrongly inversely related

Finally we comment brie y on results that we obtained byinvestigating logistic transformations of the data Our responsevariables are in fact percentages and are constrained to sumup to 100 thus they lie on a simplex rather than in the fullq-dimensional space Sample ranges are 18 3 for fat 17 7for sugar 51 6 for our and 14 3 for water

Following Aitchison (1986) we transformed the originalY 0s into log ratios of the form

Z1 D ln4Y1=Y351 Z2 D ln4Y2=Y351 and

Z3 D ln4Y4=Y350 (19)

The choice of the third ingredient for the denominator wasthe most natural in that our is the major constituent andalso because ingredients in recipes often are expressed as aratio to our content We centered and scaled the Z variablesand recomputed empirical Bayes estimates for lsquo 2 and (Theother hyperparameters were not affected by the logistic trans-formation) We then ran four Metropolis chains using the start-

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Coe

ffici

ent

Fat

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Sugar

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Flour

Wavelength

Coe

ffici

ent

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Wavelength

Water

Figure 6 Coefrsquo cient Vectors From the ldquoBestrdquo Wavelet Equations

ing points used previously with the data in the original scaleDiagnostic plots and plots of the marginals appeared verysimilar to those of Figures 4 and 5 The four chains visited151183 distinct models We nally computed Bayes modelaveraging and least squares predictions with the best modelunscaling the predicted values and transforming them back tothe original scale as

YiD 100exp4Zi5P3

jD1 exp4Zj5 C 11 i D 11 21

Y3D 100

P3jD1 exp4Zj5 C 1

1 and Y4D 100exp4Z35P3

jD1 exp4Zj5 C 10

The best 500 visited models accounted for 994 of the totalvisited probability used 214 wavelet coef cients and gaveBayes mean squared prediction errors of 0058 0819 0457 and0080 The single best subset among the visited ones had 10coef cients accounted for 158 of the total visited proba-bility and gave least squares prediction errors of 00911 079310496 and 0119 The logistic transformation does not seem tohave a positive impact on the predictive performance andoverall the simpler linear model seems to be adequate Ourapproach gives Bayes predictions on the original scale satisfy-ing the constraint of summing exactly to 100 This stems fromthe conjugate prior linearity in Y 1 zero mean for the prior dis-tribution of the regression coef cients and vague prior for theintercept It is easily seen that with X centered O 01 D 100 andOB01 D 0 for either least squares or Bayes estimates and henceall predictions sum to 100 In addition had we imposed asingular distribution to deal with the four responses summing

408 Journal of the American Statistical Association June 2001

to 100 then a proper analysis would suggest eliminating onecomponent but then the predictions for the remaining threecomponents would be as we have derived Thus our analysisis Bayes for the singular problem even though super ciallyit ignores this aspect This does not address the positivityconstraint but the composition variables are all so far fromthe boundaries compared to residual error that this is not anissue Our desire to stick with the original scale is supportedby the BeerndashLambert law which linearly relates absorbanceto composition The logistic transform distorts this linearrelationship

7 DISCUSSION

In specifying the prior parameters for our example we madea number of arbitrary choices In particular the choice of H

as the variance matrix of an autoregressive process is a rathercrude representation of the smoothness in the coef cients Itmight be interesting to try to model this in more detail How-ever although there may be room for improvement the simplestructure used here does appear to work well

Another area for future investigation is the use of moresophisticated wavelet systems in this context The addi-tional exibility of wavelet packets (Coifman Meyer andWickerhauser 1992) or m-band wavelets (Mallet CoomansKautsky and De Vel 1997) might lead to improved predic-tions or might be just an unnecessary complication

[Received October 1998 Revised November 2000]

REFERENCES

Aitchison J (1986) The Statistical Analysis of Compositional Data LondonChapman and Hall

Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press

Brown P J Fearn T and Vannucci M (1999) ldquoThe Choice of Variablesin Multivariate Regression A Bayesian Non-Conjugate Decision TheoryApproachrdquo Biometrika 86 635ndash648

Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian WavelengthSelection in Multicomponent Analysisrdquo Journal of Chemometrics 12173ndash182

(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641

Carlin B P and Chib S (1995) ldquoBayesian Model Choice via MarkovChain Monte Carlordquo Journal of the Royal Statistical Society Ser B 57473ndash484

Chipman H (1996) ldquoBayesian Variable Selection With Related PredictorsrdquoCanadian Journal of Statistics 24 17ndash36

Clyde M DeSimone H and Parmigiani G (1996) ldquoPrediction via Orthog-onalized Model Mixingrdquo Journal of the American Statistical Association91 1197ndash1208

Clyde M and George E I (2000) ldquoFlexible Empirical Bayes Estima-tion for Waveletsrdquo Journal of the Royal Statistical Society Ser B 62681ndash698

Coifman R R Meyer Y and Wickerhauser M V (1992) ldquoWavelet Anal-ysis and Signal Processingrdquo in Wavelets and Their Applications eds M BRuskai G Beylkin R Coifman I Daubechies S Mallat Y Meyer andL Raphael Boston Jones and Bartlett pp 153ndash178

Cowe I A and McNicol J W (1985) ldquoThe Use of Principal Compo-nents in the Analysis of Near- Infrared Spectrardquo Applied Spectroscopy 39257ndash266

Daubechies I (1992) Ten Lectures on Wavelets (Vol 61 CBMS-NSF

Regional Conference Series in Applied Mathematics) Philadephia Societyfor Industrial and Applied Mathematics

Dawid A P (1981) ldquoSome Matrix-Variate Distribution Theory NotationalConsiderations and a Bayesian Applicationrdquo Biometrika 68 265ndash274

Donoho D Johnstone I Kerkyacharian G and Picard D (1995) ldquoWaveletShrinkage Asymptopiardquo (with discussion) Journal of the Royal StatisticalSociety Ser B 57 301ndash369

Geladi P and Martens H (1996) ldquoA Calibration Tutorial for Spectral DataPart 1 Data Pretreatment and Principal Component Regression Using Mat-labrdquo Journal of Near Infrared Spectroscopy 4 225ndash242

Geladi P Martens H Hadjiiski L and Hopke P (1996) ldquoA CalibrationTutorial for Spectral Data Part 2 Partial Least Squares Regression UsingMatlab and Some Neural Network Resultsrdquo Journal of Near Infrared Spec-troscopy 4 243ndash255

George E I and McCulloch R E (1997) ldquoApproaches for Bayesian Vari-able Selectionrdquo Statistica Sinica 7 339ndash373

Geweke J (1996) ldquoVariable Selection and Model Comparison in Regres-sionrdquo in Bayesian Statistics 5 eds J M Bernardo J O Berger A PDawid and A F M Smith Oxford UK Clarendon Press pp 609ndash620

Good I J (1965) The Estimation of Probabilities An Essay on ModernBayesian Methods Cambridge MA MIT Press

Hrushka W R (1987) ldquoData Analysis Wavelength Selection Methodsrdquoin Near- Infrared Technology in the Agricultural and Food Industries edsP Williams and K Norris St Paul MN American Association of CerealChemists pp 35ndash55

Leamer E E (1978) ldquoRegression Selection Strategies and Revealed PriorsrdquoJournal of the American Statistical Association 73 580ndash587

Madigan D and Raftery A E (1994) ldquoModel Selection and Accounting forModel Uncertainty in Graphical Models Using Occamrsquos Windowrdquo Journalof the American Statistical Association 89 1535ndash1546

Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232

Mallat S G (1989) ldquoMultiresolution Approximations and Wavelet Orthonor-mal Bases of L24r5rdquo Transactions of the American Mathematical Society315 69ndash87

Mallet Y Coomans D Kautsky J and De Vel O (1997) ldquoClassi cationUsing Adaptive Wavelets for Feature Extractionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence 19 1058ndash1066

McClure W F Hamid A Giesbrecht F G and Weeks W W (1984)ldquoFourier Analysis Enhances NIR Diffuse Re ectance SpectroscopyrdquoApplied Spectroscopy 38 322ndash329

Mitchell T J and Beauchamp J J (1988) ldquoBayesian Variable Selectionin Linear Regressionrdquo Journal of the American Statistical Association 831023ndash1036

Osborne B G Fearn T and Hindle P H (1993) Practical NIR Spec-troscopy Harlow UK Longman

Osborne B G Fearn T Miller A R and Douglas S (1984) ldquoApplica-tion of Near- Infrared Re ectance Spectroscopy to Compositional Analysisof Biscuits and Biscuit Doughsrdquo Journal of the Science of Food and Agri-culture 35 99ndash105

Raftery A E Madigan D and Hoeting J A (1997) ldquoBayesian ModelAveraging for Linear Regression Modelsrdquo Journal of the American Statis-tical Association 92 179ndash191

Seber G A F (1984) Multivariate Observations New York WileySmith A F M (1973) ldquoA General Bayesian Linear Modelrdquo Journal of the

Royal Statistical Society Ser B 35 67ndash75Taswell C (1995) ldquoWavbox 4 A Software Toolbox for Wavelet Transforms

and Adaptive Wavelet Packet Decompositionsrdquo in Wavelets and Statis-tics eds A Antoniadis and G Oppenheim New York Springer-Verlagpp 361ndash375

Trygg J and Wold S (1998) ldquoPLS Regression on Wavelet-CompressedNIR Spectrardquo Chemometrics and Intelligent Laboratory Systems 42209ndash220

Vannucci M and Corradi F (1999) ldquoCovariance Structure of Wavelet Coef- cients Theory and Models in a Bayesian Perspectiverdquo Journal of theRoyal Statistical Society Ser B 61 971ndash986

Walczak B and Massart D L (1997) ldquoNoise Suppression and Signal Com-pression Using the Wavelet Packet Transformrdquo Chemometrics and Intelli-gent Laboratory Systems 36 81ndash94

Wold S Martens H and Wold H (1983) ldquoThe Multivariate CalibrationProblem in Chemistry Solved by PLSrdquo in Matrix Pencils eds A Ruhe andB Kagstrom Heidelberg Springer pp 286ndash293

Page 9: 17-Bayesian Wavelet Regression on Curves With Application to a Spectroscopic Calibration Problem.pdf

406 Journal of the American Statistical Association June 2001

0 1 2 3 4 5 6 7 8 9 10

x 104

0

20

40

60

80

100

120

(a)

Iteration number

Num

ber

of o

nes

0 1 2 3 4 5 6 7 8 9 10

x 104

ndash 400

ndash 300

ndash 200

ndash 100

0(b)

Iteration number

Log

prob

s

Figure 5 Plots in Sequence Order for Run (iii) (a) The number of 1rsquos (b) log relative probabilities

high probability For one of the runs (iii) Figure 5 gives twomore plots the number of 1rsquos and the log-relative probabil-ities log4g4ƒ55 of the visited ƒ plotted over the 100000iterations The other runs produced very similar plots quicklymoving toward models of similar dimensions and posteriorprobability values

Despite the very different starting points the regionsexplored by the four chains have clear similarities in that plotsof marginals are overall broadly similar However there arealso clear differences with some chains making frequent useof variables not picked up by others Although we would notclaim convergence all four chains arrive at some similarlyldquogoodrdquo albeit different subsets With mutually correlated pre-dictor variables as we have here there will always tend tobe many solutions to the problem of nding the best predic-tors We adopt the pragmatic stance that all we are trying todo is identify some of the good solutions If we happen tomiss some other good ones then this is unfortunate but notdisastrous

65 Results

We pooled the distinct ƒrsquos visited by the four chains nor-malized the relative probabilities and ordered them accordingto probability Then we predicted the further 39 unseen sam-ples using Bayes model averaging Mean squared predictionerrors converted to the original scale were 00631 04491 0348and 0050 These results use the best 500 models accountingfor almost 99 of the total visited probability and using 219wavelet coef cients They improve considerably on all of thestandard methods reported in Table 1 The single best subset

among the visited ones had 10 coef cients accounted for 9of the total visited probability and least squares predictionsgave mean squared errors of 0059 0466 0351 and 0047

Examining the scales of the selected coef cients is interest-ing The model making the best predictions used coef cients(10 11 14 17 33 34 82 132 166 255) which include (00 0 37 6 6 1 2) of all of the coef cients atthe eight levels from the coarsest to the nest scales The mostuseful coef cients seem to be in the middle of the scale withsome more toward the ner end Note that the very coarsestcoef cients which would be essential to any reconstruction ofthe spectra are not used despite the smoothness implicit in Hand the consequent increased shrinkage associated with nerscales as seen in Figure 3

Some idea of the locations of the wavelet coef cientsselected by the modal model can be obtained from Figure 6Here we used the linearity of the wavelet transform and thelinearity of the prediction equation to express the predictionequation as a vector of coef cients to be applied to the originalspectral data This vector is obtained by applying the inversewavelet transform to the columns of the matrix of the leastsquares estimates of the regression coef cients Because selec-tion has discarded unnecessary details most of the coef cientsare very close to 0 We thus display only the 1600ndash1800 nmrange which permits a better view of the features of coef -cients in the range of interest These coef cient vectors canbe compared directly with those shown in Figure 2 They areapplied to the same spectral data to produce predictions anddespite the different scales some comparisons are possibleFor example the two coef cient vectors for fat can be eas-

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 407

ily interpreted Fats and oils have absorption bands at around1730 and 1765 nm (Osborne et al 1993) and strong positivecoef cients in this region are the major features of both plotsThe wavelet-based equation (Fig 6) is simpler the selectionhaving discarded much unnecessary detail The other coef -cient vectors show less resemblance and are also harder tointerpret One thing to bear in mind in interpreting these plotsis that the four constituents add up to 100 Thus it shouldnot be surprising to nd the fat measurement peak in all eightplots nor that the coef cient vectors for sugar and our are sostrongly inversely related

Finally we comment brie y on results that we obtained byinvestigating logistic transformations of the data Our responsevariables are in fact percentages and are constrained to sumup to 100 thus they lie on a simplex rather than in the fullq-dimensional space Sample ranges are 18 3 for fat 17 7for sugar 51 6 for our and 14 3 for water

Following Aitchison (1986) we transformed the originalY 0s into log ratios of the form

Z1 D ln4Y1=Y351 Z2 D ln4Y2=Y351 and

Z3 D ln4Y4=Y350 (19)

The choice of the third ingredient for the denominator wasthe most natural in that our is the major constituent andalso because ingredients in recipes often are expressed as aratio to our content We centered and scaled the Z variablesand recomputed empirical Bayes estimates for lsquo 2 and (Theother hyperparameters were not affected by the logistic trans-formation) We then ran four Metropolis chains using the start-

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Coe

ffici

ent

Fat

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Sugar

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Flour

Wavelength

Coe

ffici

ent

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Wavelength

Water

Figure 6 Coefrsquo cient Vectors From the ldquoBestrdquo Wavelet Equations

ing points used previously with the data in the original scaleDiagnostic plots and plots of the marginals appeared verysimilar to those of Figures 4 and 5 The four chains visited151183 distinct models We nally computed Bayes modelaveraging and least squares predictions with the best modelunscaling the predicted values and transforming them back tothe original scale as

YiD 100exp4Zi5P3

jD1 exp4Zj5 C 11 i D 11 21

Y3D 100

P3jD1 exp4Zj5 C 1

1 and Y4D 100exp4Z35P3

jD1 exp4Zj5 C 10

The best 500 visited models accounted for 994 of the totalvisited probability used 214 wavelet coef cients and gaveBayes mean squared prediction errors of 0058 0819 0457 and0080 The single best subset among the visited ones had 10coef cients accounted for 158 of the total visited proba-bility and gave least squares prediction errors of 00911 079310496 and 0119 The logistic transformation does not seem tohave a positive impact on the predictive performance andoverall the simpler linear model seems to be adequate Ourapproach gives Bayes predictions on the original scale satisfy-ing the constraint of summing exactly to 100 This stems fromthe conjugate prior linearity in Y 1 zero mean for the prior dis-tribution of the regression coef cients and vague prior for theintercept It is easily seen that with X centered O 01 D 100 andOB01 D 0 for either least squares or Bayes estimates and henceall predictions sum to 100 In addition had we imposed asingular distribution to deal with the four responses summing

408 Journal of the American Statistical Association June 2001

to 100 then a proper analysis would suggest eliminating onecomponent but then the predictions for the remaining threecomponents would be as we have derived Thus our analysisis Bayes for the singular problem even though super ciallyit ignores this aspect This does not address the positivityconstraint but the composition variables are all so far fromthe boundaries compared to residual error that this is not anissue Our desire to stick with the original scale is supportedby the BeerndashLambert law which linearly relates absorbanceto composition The logistic transform distorts this linearrelationship

7 DISCUSSION

In specifying the prior parameters for our example we madea number of arbitrary choices In particular the choice of H

as the variance matrix of an autoregressive process is a rathercrude representation of the smoothness in the coef cients Itmight be interesting to try to model this in more detail How-ever although there may be room for improvement the simplestructure used here does appear to work well

Another area for future investigation is the use of moresophisticated wavelet systems in this context The addi-tional exibility of wavelet packets (Coifman Meyer andWickerhauser 1992) or m-band wavelets (Mallet CoomansKautsky and De Vel 1997) might lead to improved predic-tions or might be just an unnecessary complication

[Received October 1998 Revised November 2000]

REFERENCES

Aitchison J (1986) The Statistical Analysis of Compositional Data LondonChapman and Hall

Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press

Brown P J Fearn T and Vannucci M (1999) ldquoThe Choice of Variablesin Multivariate Regression A Bayesian Non-Conjugate Decision TheoryApproachrdquo Biometrika 86 635ndash648

Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian WavelengthSelection in Multicomponent Analysisrdquo Journal of Chemometrics 12173ndash182

(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641

Carlin B P and Chib S (1995) ldquoBayesian Model Choice via MarkovChain Monte Carlordquo Journal of the Royal Statistical Society Ser B 57473ndash484

Chipman H (1996) ldquoBayesian Variable Selection With Related PredictorsrdquoCanadian Journal of Statistics 24 17ndash36

Clyde M DeSimone H and Parmigiani G (1996) ldquoPrediction via Orthog-onalized Model Mixingrdquo Journal of the American Statistical Association91 1197ndash1208

Clyde M and George E I (2000) ldquoFlexible Empirical Bayes Estima-tion for Waveletsrdquo Journal of the Royal Statistical Society Ser B 62681ndash698

Coifman R R Meyer Y and Wickerhauser M V (1992) ldquoWavelet Anal-ysis and Signal Processingrdquo in Wavelets and Their Applications eds M BRuskai G Beylkin R Coifman I Daubechies S Mallat Y Meyer andL Raphael Boston Jones and Bartlett pp 153ndash178

Cowe I A and McNicol J W (1985) ldquoThe Use of Principal Compo-nents in the Analysis of Near- Infrared Spectrardquo Applied Spectroscopy 39257ndash266

Daubechies I (1992) Ten Lectures on Wavelets (Vol 61 CBMS-NSF

Regional Conference Series in Applied Mathematics) Philadephia Societyfor Industrial and Applied Mathematics

Dawid A P (1981) ldquoSome Matrix-Variate Distribution Theory NotationalConsiderations and a Bayesian Applicationrdquo Biometrika 68 265ndash274

Donoho D Johnstone I Kerkyacharian G and Picard D (1995) ldquoWaveletShrinkage Asymptopiardquo (with discussion) Journal of the Royal StatisticalSociety Ser B 57 301ndash369

Geladi P and Martens H (1996) ldquoA Calibration Tutorial for Spectral DataPart 1 Data Pretreatment and Principal Component Regression Using Mat-labrdquo Journal of Near Infrared Spectroscopy 4 225ndash242

Geladi P Martens H Hadjiiski L and Hopke P (1996) ldquoA CalibrationTutorial for Spectral Data Part 2 Partial Least Squares Regression UsingMatlab and Some Neural Network Resultsrdquo Journal of Near Infrared Spec-troscopy 4 243ndash255

George E I and McCulloch R E (1997) ldquoApproaches for Bayesian Vari-able Selectionrdquo Statistica Sinica 7 339ndash373

Geweke J (1996) ldquoVariable Selection and Model Comparison in Regres-sionrdquo in Bayesian Statistics 5 eds J M Bernardo J O Berger A PDawid and A F M Smith Oxford UK Clarendon Press pp 609ndash620

Good I J (1965) The Estimation of Probabilities An Essay on ModernBayesian Methods Cambridge MA MIT Press

Hrushka W R (1987) ldquoData Analysis Wavelength Selection Methodsrdquoin Near- Infrared Technology in the Agricultural and Food Industries edsP Williams and K Norris St Paul MN American Association of CerealChemists pp 35ndash55

Leamer E E (1978) ldquoRegression Selection Strategies and Revealed PriorsrdquoJournal of the American Statistical Association 73 580ndash587

Madigan D and Raftery A E (1994) ldquoModel Selection and Accounting forModel Uncertainty in Graphical Models Using Occamrsquos Windowrdquo Journalof the American Statistical Association 89 1535ndash1546

Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232

Mallat S G (1989) ldquoMultiresolution Approximations and Wavelet Orthonor-mal Bases of L24r5rdquo Transactions of the American Mathematical Society315 69ndash87

Mallet Y Coomans D Kautsky J and De Vel O (1997) ldquoClassi cationUsing Adaptive Wavelets for Feature Extractionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence 19 1058ndash1066

McClure W F Hamid A Giesbrecht F G and Weeks W W (1984)ldquoFourier Analysis Enhances NIR Diffuse Re ectance SpectroscopyrdquoApplied Spectroscopy 38 322ndash329

Mitchell T J and Beauchamp J J (1988) ldquoBayesian Variable Selectionin Linear Regressionrdquo Journal of the American Statistical Association 831023ndash1036

Osborne B G Fearn T and Hindle P H (1993) Practical NIR Spec-troscopy Harlow UK Longman

Osborne B G Fearn T Miller A R and Douglas S (1984) ldquoApplica-tion of Near- Infrared Re ectance Spectroscopy to Compositional Analysisof Biscuits and Biscuit Doughsrdquo Journal of the Science of Food and Agri-culture 35 99ndash105

Raftery A E Madigan D and Hoeting J A (1997) ldquoBayesian ModelAveraging for Linear Regression Modelsrdquo Journal of the American Statis-tical Association 92 179ndash191

Seber G A F (1984) Multivariate Observations New York WileySmith A F M (1973) ldquoA General Bayesian Linear Modelrdquo Journal of the

Royal Statistical Society Ser B 35 67ndash75Taswell C (1995) ldquoWavbox 4 A Software Toolbox for Wavelet Transforms

and Adaptive Wavelet Packet Decompositionsrdquo in Wavelets and Statis-tics eds A Antoniadis and G Oppenheim New York Springer-Verlagpp 361ndash375

Trygg J and Wold S (1998) ldquoPLS Regression on Wavelet-CompressedNIR Spectrardquo Chemometrics and Intelligent Laboratory Systems 42209ndash220

Vannucci M and Corradi F (1999) ldquoCovariance Structure of Wavelet Coef- cients Theory and Models in a Bayesian Perspectiverdquo Journal of theRoyal Statistical Society Ser B 61 971ndash986

Walczak B and Massart D L (1997) ldquoNoise Suppression and Signal Com-pression Using the Wavelet Packet Transformrdquo Chemometrics and Intelli-gent Laboratory Systems 36 81ndash94

Wold S Martens H and Wold H (1983) ldquoThe Multivariate CalibrationProblem in Chemistry Solved by PLSrdquo in Matrix Pencils eds A Ruhe andB Kagstrom Heidelberg Springer pp 286ndash293

Page 10: 17-Bayesian Wavelet Regression on Curves With Application to a Spectroscopic Calibration Problem.pdf

Brown Fearn and Vannucci Bayesian Wavelet Regression on Curves 407

ily interpreted Fats and oils have absorption bands at around1730 and 1765 nm (Osborne et al 1993) and strong positivecoef cients in this region are the major features of both plotsThe wavelet-based equation (Fig 6) is simpler the selectionhaving discarded much unnecessary detail The other coef -cient vectors show less resemblance and are also harder tointerpret One thing to bear in mind in interpreting these plotsis that the four constituents add up to 100 Thus it shouldnot be surprising to nd the fat measurement peak in all eightplots nor that the coef cient vectors for sugar and our are sostrongly inversely related

Finally we comment brie y on results that we obtained byinvestigating logistic transformations of the data Our responsevariables are in fact percentages and are constrained to sumup to 100 thus they lie on a simplex rather than in the fullq-dimensional space Sample ranges are 18 3 for fat 17 7for sugar 51 6 for our and 14 3 for water

Following Aitchison (1986) we transformed the originalY 0s into log ratios of the form

Z1 D ln4Y1=Y351 Z2 D ln4Y2=Y351 and

Z3 D ln4Y4=Y350 (19)

The choice of the third ingredient for the denominator wasthe most natural in that our is the major constituent andalso because ingredients in recipes often are expressed as aratio to our content We centered and scaled the Z variablesand recomputed empirical Bayes estimates for lsquo 2 and (Theother hyperparameters were not affected by the logistic trans-formation) We then ran four Metropolis chains using the start-

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Coe

ffici

ent

Fat

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Sugar

1400 1500 1600 1700 1800ndash 400

ndash 200

0

200

400Flour

Wavelength

Coe

ffici

ent

1400 1500 1600 1700 1800ndash 200

ndash 100

0

100

200

Wavelength

Water

Figure 6 Coefrsquo cient Vectors From the ldquoBestrdquo Wavelet Equations

ing points used previously with the data in the original scaleDiagnostic plots and plots of the marginals appeared verysimilar to those of Figures 4 and 5 The four chains visited151183 distinct models We nally computed Bayes modelaveraging and least squares predictions with the best modelunscaling the predicted values and transforming them back tothe original scale as

YiD 100exp4Zi5P3

jD1 exp4Zj5 C 11 i D 11 21

Y3D 100

P3jD1 exp4Zj5 C 1

1 and Y4D 100exp4Z35P3

jD1 exp4Zj5 C 10

The best 500 visited models accounted for 99.4% of the total visited probability, used 214 wavelet coefficients, and gave Bayes mean squared prediction errors of 0.058, 0.819, 0.457, and 0.080. The single best subset among the visited ones had 10 coefficients, accounted for 15.8% of the total visited probability, and gave least squares prediction errors of 0.091, 0.793, 1.496, and 0.119. The logistic transformation does not seem to have a positive impact on the predictive performance, and overall the simpler linear model seems to be adequate.

Our approach gives Bayes predictions on the original scale satisfying the constraint of summing exactly to 100. This stems from the conjugate prior, linearity in Y, the zero mean for the prior distribution of the regression coefficients, and the vague prior for the intercept. It is easily seen that, with X centered, α̂′1 = 100 and B̂1 = 0 for either least squares or Bayes estimates, and hence all predictions sum to 100. In addition, had we imposed a singular distribution to deal with the four responses summing to 100, then a proper analysis would suggest eliminating one component, but then the predictions for the remaining three components would be as we have derived. Thus our analysis is Bayes for the singular problem, even though superficially it ignores this aspect. This does not address the positivity constraint, but the composition variables are all so far from the boundaries, compared to residual error, that this is not an issue. Our desire to stick with the original scale is supported by the Beer–Lambert law, which linearly relates absorbance to composition. The logistic transform distorts this linear relationship.
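The algebra behind this sum-to-100 property is easy to verify numerically. The sketch below uses simulated data rather than the biscuit doughs (the dimensions and random compositions are illustrative assumptions, and only the least squares case is shown): with centered predictors, the intercept estimates sum to 100, the coefficient estimates sum to zero across responses, and every fitted composition therefore sums to 100.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 39, 8, 4

X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                      # centered predictors

# Simulated compositions: each row sums to 100 (like fat, sugar, flour, water).
raw = rng.gamma(shape=5.0, size=(n, q))
Y = 100.0 * raw / raw.sum(axis=1, keepdims=True)

alpha_hat = Y.mean(axis=0)                   # intercept estimate with centered X
B_hat = np.linalg.lstsq(Xc, Y - alpha_hat, rcond=None)[0]

print(alpha_hat.sum())                       # 100, since rows of Y sum to 100
print(np.abs(B_hat.sum(axis=1)).max())       # ~0: coefficient rows sum to zero
Y_pred = alpha_hat + Xc @ B_hat
print(Y_pred.sum(axis=1))                    # every prediction sums to 100
```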

7. DISCUSSION

In specifying the prior parameters for our example, we made a number of arbitrary choices. In particular, the choice of H as the variance matrix of an autoregressive process is a rather crude representation of the smoothness in the coefficients. It might be interesting to try to model this in more detail. However, although there may be room for improvement, the simple structure used here does appear to work well.
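As an indication of the kind of structure under discussion, the sketch below builds a stationary AR(1) variance matrix with entries ρ^|i−j|; the specific form and the value ρ = 0.9 are illustrative assumptions, not the exact specification used for H in this paper.

```python
import numpy as np

def ar1_covariance(p, rho):
    """Variance matrix of a stationary AR(1) process: H[i, j] = rho**|i - j|.
    Larger rho encodes a belief in smoother (more strongly correlated) coefficients."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

H = ar1_covariance(p=256, rho=0.9)   # illustrative values only
```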

Another area for future investigation is the use of more sophisticated wavelet systems in this context. The additional flexibility of wavelet packets (Coifman, Meyer, and Wickerhauser 1992) or m-band wavelets (Mallet, Coomans, Kautsky, and De Vel 1997) might lead to improved predictions, or might be just an unnecessary complication.

[Received October 1998. Revised November 2000.]

REFERENCES

Aitchison, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman and Hall.

Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Fearn, T., and Vannucci, M. (1999), "The Choice of Variables in Multivariate Regression: A Bayesian Non-Conjugate Decision Theory Approach," Biometrika, 86, 635–648.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Journal of Chemometrics, 12, 173–182.

——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.

Carlin, B. P., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo," Journal of the Royal Statistical Society, Ser. B, 57, 473–484.

Chipman, H. (1996), "Bayesian Variable Selection With Related Predictors," Canadian Journal of Statistics, 24, 17–36.

Clyde, M., DeSimone, H., and Parmigiani, G. (1996), "Prediction via Orthogonalized Model Mixing," Journal of the American Statistical Association, 91, 1197–1208.

Clyde, M., and George, E. I. (2000), "Flexible Empirical Bayes Estimation for Wavelets," Journal of the Royal Statistical Society, Ser. B, 62, 681–698.

Coifman, R. R., Meyer, Y., and Wickerhauser, M. V. (1992), "Wavelet Analysis and Signal Processing," in Wavelets and Their Applications, eds. M. B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, S. Mallat, Y. Meyer, and L. Raphael, Boston: Jones and Bartlett, pp. 153–178.

Cowe, I. A., and McNicol, J. W. (1985), "The Use of Principal Components in the Analysis of Near-Infrared Spectra," Applied Spectroscopy, 39, 257–266.

Daubechies, I. (1992), Ten Lectures on Wavelets (Vol. 61, CBMS-NSF Regional Conference Series in Applied Mathematics), Philadelphia: Society for Industrial and Applied Mathematics.

Dawid, A. P. (1981), "Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application," Biometrika, 68, 265–274.

Donoho, D., Johnstone, I., Kerkyacharian, G., and Picard, D. (1995), "Wavelet Shrinkage: Asymptopia?" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 301–369.

Geladi, P., and Martens, H. (1996), "A Calibration Tutorial for Spectral Data. Part 1: Data Pretreatment and Principal Component Regression Using Matlab," Journal of Near Infrared Spectroscopy, 4, 225–242.

Geladi, P., Martens, H., Hadjiiski, L., and Hopke, P. (1996), "A Calibration Tutorial for Spectral Data. Part 2: Partial Least Squares Regression Using Matlab and Some Neural Network Results," Journal of Near Infrared Spectroscopy, 4, 243–255.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Geweke, J. (1996), "Variable Selection and Model Comparison in Regression," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Clarendon Press, pp. 609–620.

Good, I. J. (1965), The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, MA: MIT Press.

Hrushka, W. R. (1987), "Data Analysis: Wavelength Selection Methods," in Near-Infrared Technology in the Agricultural and Food Industries, eds. P. Williams and K. Norris, St. Paul, MN: American Association of Cereal Chemists, pp. 35–55.

Leamer, E. E. (1978), "Regression Selection Strategies and Revealed Priors," Journal of the American Statistical Association, 73, 580–587.

Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535–1546.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.

Mallat, S. G. (1989), "Multiresolution Approximations and Wavelet Orthonormal Bases of L2(R)," Transactions of the American Mathematical Society, 315, 69–87.

Mallet, Y., Coomans, D., Kautsky, J., and De Vel, O. (1997), "Classification Using Adaptive Wavelets for Feature Extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1058–1066.

McClure, W. F., Hamid, A., Giesbrecht, F. G., and Weeks, W. W. (1984), "Fourier Analysis Enhances NIR Diffuse Reflectance Spectroscopy," Applied Spectroscopy, 38, 322–329.

Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression," Journal of the American Statistical Association, 83, 1023–1036.

Osborne, B. G., Fearn, T., and Hindle, P. H. (1993), Practical NIR Spectroscopy, Harlow, U.K.: Longman.

Osborne, B. G., Fearn, T., Miller, A. R., and Douglas, S. (1984), "Application of Near-Infrared Reflectance Spectroscopy to Compositional Analysis of Biscuits and Biscuit Doughs," Journal of the Science of Food and Agriculture, 35, 99–105.

Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997), "Bayesian Model Averaging for Linear Regression Models," Journal of the American Statistical Association, 92, 179–191.

Seber, G. A. F. (1984), Multivariate Observations, New York: Wiley.

Smith, A. F. M. (1973), "A General Bayesian Linear Model," Journal of the Royal Statistical Society, Ser. B, 35, 67–75.

Taswell, C. (1995), "Wavbox 4: A Software Toolbox for Wavelet Transforms and Adaptive Wavelet Packet Decompositions," in Wavelets and Statistics, eds. A. Antoniadis and G. Oppenheim, New York: Springer-Verlag, pp. 361–375.

Trygg, J., and Wold, S. (1998), "PLS Regression on Wavelet-Compressed NIR Spectra," Chemometrics and Intelligent Laboratory Systems, 42, 209–220.

Vannucci, M., and Corradi, F. (1999), "Covariance Structure of Wavelet Coefficients: Theory and Models in a Bayesian Perspective," Journal of the Royal Statistical Society, Ser. B, 61, 971–986.

Walczak, B., and Massart, D. L. (1997), "Noise Suppression and Signal Compression Using the Wavelet Packet Transform," Chemometrics and Intelligent Laboratory Systems, 36, 81–94.

Wold, S., Martens, H., and Wold, H. (1983), "The Multivariate Calibration Problem in Chemistry Solved by PLS," in Matrix Pencils, eds. A. Ruhe and B. Kagstrom, Heidelberg: Springer, pp. 286–293.
