
LETTER Communicated by Michael Jordan

Issues in Bayesian Analysis of Neural Network Models

Peter Müller
Duke University, Durham NC 27708-0251, U.S.A., and Department of Artificial Intelligence, Madrid Technical University, 28660 Madrid, Spain

David Rios Insua
Department of Artificial Intelligence, Madrid Technical University, 28660 Madrid, Spain, and CNR-IAMI, 20131 Milano, Italy

Stemming from work by Buntine and Weigend (1991) and MacKay (1992), there is a growing interest in Bayesian analysis of neural network models. Although conceptually simple, this problem is computationally involved. We suggest a very efficient Markov chain Monte Carlo scheme for inference and prediction with fixed-architecture feedforward neural networks. The scheme is then extended to the variable architecture case, providing a data-driven procedure to identify sensible architectures.

1 Introduction

Neural networks (NN) constitute the central theme of a huge amount of recent research. Introductions from the physical (Muller & Reinhardt, 1990), computational (Beale & Jackson, 1990), mathematical (Amari, 1993), and statistical (Cheng & Titterington, 1994; Stern, 1996) points of view are available. Recently, Wang (1995) has suggested the importance of incorporating human knowledge in NN models to improve their performance. This naturally leads to efforts to model this knowledge through prior distributions over the parameters.

This article discusses these issues, exploring the potential of Bayesian ideas in the analysis of NN models. We propose methods to incorporate priors on the size of the hidden layer, in particular priors that could favor smaller-size networks. Buntine and Weigend (1991) and MacKay (1992, 1995) have provided frameworks for implementing Bayesian inference based on gaussian approximations, and Neal (1993, 1996) has applied hybrid Monte Carlo methods. Ripley (1993) and Cheng and Titterington (1994) have dwelled on the power of these ideas, including interpretation and architecture selection. MacKay (1995), Neal (1996), and Bishop (1995) provide excellent recent reviews of and elaborations on Bayesian approaches to NNs.

Neural Computation 10, 749–770 (1998) © 1998 Massachusetts Institute of Technology

We concentrate on approximation problems, though many of our suggestions can be translated to other areas. For those problems, NNs are viewed as highly nonlinear (semiparametric) approximators, with parameters typically estimated by least squares. Applications of interest for practitioners comprise nonlinear regression, stochastic optimization, and regression metamodels for simulation output. Our main focus is the computational aspects. Our contributions include an efficient, novel Markov chain Monte Carlo scheme and its extension to a scheme for handling a variable architecture model and combining NNs with more traditional models, in our case, linear regression. This scheme allows for identification of promising architectures and hence provides a step forward in the problem of NN architecture choice. In section 2, after introducing our basic model, we introduce and discuss our Markov chain Monte Carlo scheme. This leads us to study, in section 3, variable architecture models and their combination with linear regression. Several examples illustrate the discussion.

2 Posterior Analysis of Feed-Forward Neural Networks

Let (x1, x2, . . . , xp) be explanatory variables for a response y, which, for now, we assume to be real valued. A feedforward neural network (FFNN) with activation function ψ, p input units, one hidden layer with M hidden nodes, and one output node is defined by

y(x) = Σ_{j=1}^{M} βj ψ(x′γj + δj),    (2.1)

with βj ∈ R, γj ∈ R^p, M ∈ N. The terms δj are designated biases and may be assimilated with the rest of the γj vector if we consider an additional input with constant value one, say, x0 = 1. Interest in these models stems from results, by Cybenko (1989) and others, suggesting them as universal approximators, for appropriate choices of functions ψ. In most of the article, we shall assume that they are logistic functions. We shall undertake Bayesian analyses of the above model.
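Equation 2.1 can be read directly as code: the output is a weighted sum of M logistic hidden units, with the biases δj absorbed into γj through the constant input x0 = 1. A minimal sketch (function names are ours):

```python
import numpy as np

def psi(eta):
    """Logistic activation, psi(eta) = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

def ffnn(x, beta, gamma):
    """Evaluate y(x) = sum_j beta_j * psi(x' gamma_j) for a one-hidden-layer FFNN.

    x     : array of shape (p+1,)   -- covariates, with x[0] == 1
    beta  : array of shape (M,)     -- hidden-to-output weights
    gamma : array of shape (M, p+1) -- input-to-hidden weights (biases in column 0)
    """
    return float(beta @ psi(gamma @ x))
```

For instance, with β = (2, 2) and γ rows (0, 1) and (0, −1), the input x = (1, 0) activates both hidden units at ψ(0) = 0.5, so y(x) = 2.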

2.1 The Model. We view the above model as an implementation of a nonlinear regression of a response y on covariates x = (x0, x1, . . . , xp):

yi = Σ_{j=1}^{M} βj ψ(x′i γj) + εi,   i = 1, . . . , N,
εi ∼ N(0, σ²),   ψ(η) = exp(η)/(1 + exp(η)).    (2.2)

To undertake Bayesian inference with this model, we shall use the following prior:

βj ∼ N(µβ, σ²β),   γj ∼ N(µγ, Sγ),   j = 1, . . . , M.    (2.3)
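To make the data-generating model concrete, here is a short simulation sketch of equation 2.2. The parameter values are those of Example 2 in section 2.3; the covariate design and random seed are our own illustration:

```python
import numpy as np

def psi(eta):
    """Logistic activation psi(eta) = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

rng = np.random.default_rng(0)

# Parameter values as in Example 2 (section 2.3):
M, N, sigma = 2, 100, 0.1
gamma = np.array([[2.0, -1.0],      # gamma_1 = (gamma_10, gamma_11)
                  [1.0,  1.5]])     # gamma_2 = (gamma_20, gamma_21)
beta = np.array([20.0, 10.0])

x1 = rng.uniform(-3.0, 3.0, size=N)           # illustrative covariate design
X = np.column_stack([np.ones(N), x1])         # constant input x0 = 1

# Equation 2.2: y_i = sum_j beta_j psi(x_i' gamma_j) + eps_i, eps_i ~ N(0, sigma^2)
y = psi(X @ gamma.T) @ beta + rng.normal(0.0, sigma, size=N)
```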


When there is nonnegligible uncertainty about the prior (hyper-)parameters, we may complete the prior model with a hyperprior over them. We shall use the following: µβ ∼ N(aβ, Aβ), µγ ∼ N(aγ, Aγ), σβ^{−2} ∼ Gamma(cb/2, cbCb/2), Sγ^{−1} ∼ Wish(cγ, (cγCγ)^{−1}), and σ^{−2} ∼ Gamma(s/2, sS/2). The particular choice of normal, gamma, and inverse Wishart distributions is for technical convenience. Similar hyperpriors are fairly common in Bayesian modeling (see, e.g., Lavine & West, 1992). In general, posterior inference is reasonably robust with respect to the choice of hyperpriors (see, e.g., Berger, 1990). However, if available prior information suggests different hyperpriors and hyperprior parameters, the model should be adjusted appropriately. MacKay (1995) and Neal (1996) emphasize the role of hyperparameters in NN models.

We use an informative prior probability model because of the meaning and interpretation of the parameters. For example, the βj's should reflect the order of magnitude of the data yi. Typically positive and negative values for βj would be equally likely, calling for a symmetric prior around aβ = 0 with a standard deviation reflecting the range of plausible values for yi. Similarly, a range of reasonable values for the logistic coefficients γj will be determined by the meaning of the data yi being modeled, mainly to address smoothness issues. The Appendix describes our specific choices in the examples.
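As one concrete, hypothetical reading of this recommendation (the responses and the resulting numbers are ours, not the paper's), the βj prior could be centered at zero and scaled by the spread of the observed responses:

```python
import numpy as np

y = np.array([120.0, 250.0, 180.0, 310.0])   # hypothetical responses y_i

a_beta = 0.0                   # symmetric prior: +/- values of beta_j equally likely
sd_beta = y.max() - y.min()    # standard deviation reflecting plausible y_i range
```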

2.2 A Markov Chain Monte Carlo Method for FFNNs. We assume we have data D = {(x1, y1), . . . , (xN, yN)}. Let β = (β1, . . . , βM) denote the network weights and γ = (γ10, γ11, . . . , γ1p, . . . , γM0, . . . , γMp) the logistic slopes, and let ν combine all the hyperparameters, ν = (µβ, σβ, µγ, Sγ, σ). Let θ = (β, γ, ν) be the full parameter vector.

For inference purposes, we are interested in computing the posterior distribution

p(β, γ, ν|D) = p(β, γ, ν) p(D|β, γ, ν) / ∫ p(β, γ, ν) p(D|β, γ, ν) dβ dγ dν,

and, specifically, the marginal posterior p(β, γ|D) = ∫ p(β, γ, ν|D) dν, perhaps summarized through moments and regions. Above, p(D|β, γ, ν) designates the likelihood. We shall be mainly interested, for predictive purposes, in the predictive distribution,

p(yN+1|D, xN+1) = ∫ p(yN+1|β, γ, ν, xN+1) p(β, γ, ν|D) dβ dγ dν,

possibly summarized through moments or probability regions. Here p(y|β, γ, ν, x) is the conditional N(Σj βj ψ[x′γj], σ²) distribution of the response y given parameters (β, γ) and covariate x.

One possibility to undertake the computations would be to appeal to several normal approximations (see Buntine & Weigend, 1991; MacKay, 1992; Thodberg, 1996, for examples). Assessment of these and other techniques for posterior inference problems may be seen in Robert (1994, chapter 9). In the specific context of NN models, posterior inference for these schemes may be misled by local modes of the posterior distribution.

Buntine and Weigend (1991) mitigate this by finding several modes and basing the analysis on weighted mixtures of the corresponding normal approximations. Of course, we return to the same problem, since we are probably leaving out undiscovered local modes. An alternative view is argued by MacKay (1995): inference from such schemes is best considered as approximate posterior inference in a submodel defined by constraining the parameters to a neighborhood of the particular local mode. Depending on the emphasis of the analysis, this might be reasonable, especially if in a final implementation our aim is to set the parameters at specific values. However, we prefer to propagate the uncertainty in the parameters, since this allows better predictions (see Raftery, Madigan, & Volinsky, 1996).

To do this, we appeal to Markov chain Monte Carlo (MCMC) methods to implement posterior inference. The essential idea is to obtain by computer simulation a sample from the posterior and base inference on that sample by, for example, replacing posterior expectations with sample means over the simulated posterior sample. The difficulty resides in simulating a sample from the posterior p(θ|D). The rationale of MCMC is to consider a Markov chain {θn} with state θ and having p(θ|D) as stationary distribution. Tierney (1994) describes various ways of defining such chains, including Metropolis, Gibbs, and independence chains. The strategy is to start with arbitrary values of θ, let the Markov chain run until it has practically reached convergence, say after T iterations, and use the next k observed values of the chain as an approximate posterior sample {θ1, . . . , θk}. MacKay (1995) implements an MCMC method for neural networks based on BUGS (Spiegelhalter, Thomas, & Gilks, 1994), a program for Bayesian inference using the Gibbs sampler. Neal (1996) proposes using a hybrid Monte Carlo algorithm merging conventional Hastings-Metropolis chains with sampling techniques based on dynamic simulation. Both authors warn against the potential inefficiency of straightforward implementations of MCMC methods in Bayesian analysis of NN models. Also, Besag and Green (1993), albeit in a different application context, dwell on the special care required when using MCMC in multimodal problems.

We introduce here a hybrid Markov chain Monte Carlo scheme. The method is hybrid in the sense that we sample from the posterior conditionals (steps 3 and 4 in our algorithm) when they are available and use Metropolis steps otherwise (step 2). To fight potential inefficiencies due to multimodality, our method has two additional features. First, whenever possible, we integrate out some of the parameters (the weights β) by partial marginalization. Second, we update some of the parameters in blocks (specifically, we resample the weights β jointly). These two features allow for fast and effective mixing over the various local modes in the posterior distribution. Combined with model augmentation to a variable architecture, as described in section 3, this leads to a practically useful MCMC scheme for NN analyses.

The key observation in our scheme is that given the currently imputed values of the γ's, we actually have a standard hierarchical normal linear model (Lindley & Smith, 1972; Bernardo & Smith, 1994). On one hand, this will allow us to sample easily from the posterior marginals of the weights β and the hyperparameters, given the γ's. On the other hand, this allows us to marginalize the model represented in equations 2.2 and 2.3 with respect to βj, j = 1, . . . , M, to obtain the marginal likelihood p(D|γ, ν). This computation will be instrumental in the Metropolis step (step 2) of our algorithm.

The following lemma provides the marginalized likelihood, where, for the sake of simplified notation, we omit dependence on the hyperparameters.

Lemma 2.1. Let zij = zij(γ) = ψ(x′i γj), Z = (zij), i = 1, . . . , N, j = 1, . . . , M, let 1 = (1, . . . , 1)′ be the M-vector of ones, and let A = Z′Z/σ², ρ = Z′y/σ², C = (1/σ²β) I, δ = (µβ/σ²β) 1. Let mb(γ) = (A + C)^{−1}(ρ + δ) and Sb(γ) = (A + C)^{−1}. Then,

p(D|γ) = [p(β = mb(γ)) / p(β = mb(γ)|y, γ)] Π_{i=1}^{N} p(yi|β = mb(γ), γ)
       = (2π)^{M/2} |Sb(γ)|^{1/2} p(β = mb(γ)) Π_{i=1}^{N} p(yi|β = mb(γ), γ).

Proof. Conditional on γ, the model in equations 2.2 and 2.3 becomes a normal linear regression model. The posterior p(β|D, γ) takes the form of a multivariate normal distribution N[mb(γ), Sb(γ)], with posterior moments mb(γ) and Sb(γ) given, for example, in Bernardo and Smith (1994). By Bayes' theorem, p(β|D, γ) = p(β) Π_{i=1}^{N} p(yi|β, γ) / p(D|γ). Substituting β = mb(γ) in the last equation, we obtain the expression for p(D|γ).
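Numerically, Lemma 2.1 is the conjugate normal linear model identity p(D|γ) = p(β) p(D|β) / p(β|D) evaluated at β = mb(γ). A sketch in our notation (on the log scale, as one would use in practice), checked against the direct marginal y|γ ∼ N(µβ Z1, σ²I + σ²β ZZ′):

```python
import numpy as np

def log_marglik(y, Z, mu_b, sig_b, sigma):
    """log p(D | gamma) with the weights beta integrated out (Lemma 2.1).

    Prior beta ~ N(mu_b * 1, sig_b^2 I); likelihood y ~ N(Z beta, sigma^2 I),
    where Z has entries z_ij = psi(x_i' gamma_j)."""
    N, M = Z.shape
    A = Z.T @ Z / sigma**2
    rho = Z.T @ y / sigma**2
    C = np.eye(M) / sig_b**2
    delta = (mu_b / sig_b**2) * np.ones(M)
    S_b = np.linalg.inv(A + C)                 # posterior covariance of beta
    m_b = S_b @ (rho + delta)                  # posterior mean of beta
    lp_prior = (-0.5 * M * np.log(2 * np.pi * sig_b**2)
                - 0.5 * np.sum((m_b - mu_b)**2) / sig_b**2)
    lp_lik = (-0.5 * N * np.log(2 * np.pi * sigma**2)
              - 0.5 * np.sum((y - Z @ m_b)**2) / sigma**2)
    # minus the log of the conditional posterior N(m_b, S_b) at its own mode:
    lp_inv_post = 0.5 * M * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(S_b)[1]
    return lp_prior + lp_lik + lp_inv_post

# Sanity check against the direct marginal y | gamma ~ N(mu_b Z 1, sigma^2 I + sig_b^2 Z Z')
rng = np.random.default_rng(1)
N, M = 12, 3
Z = rng.normal(size=(N, M))
y = rng.normal(size=N)
mu_b, sig_b, sigma = 1.5, 10.0, 0.5
V = sigma**2 * np.eye(N) + sig_b**2 * Z @ Z.T
r = y - mu_b * Z @ np.ones(M)
direct = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(V)[1]
          - 0.5 * r @ np.linalg.solve(V, r))
assert abs(log_marglik(y, Z, mu_b, sig_b, sigma) - direct) < 1e-8
```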

Our hybrid, blocking, partially marginalized MCMC algorithm for inference and prediction with FFNNs is as follows:

1. Start with θ equal to some initial guess (for example, the prior means).

Until convergence is achieved, iterate through steps 2 through 4:

2. Given current values of ν only (marginalizing over β), replace γ by Metropolis steps: For each γj, j = 1, . . . , M, generate a proposal γ̃j ∼ gj(·|γj), with gj described below. Compute

a(γj, γ̃j) = min[1, p(D|γ̃, ν) p(γ̃|ν) / (p(D|γ, ν) p(γ|ν))],    (2.4)

where γ̃ = (γ1, . . . , γj−1, γ̃j, γj+1, . . . , γM). With probability a(γj, γ̃j), replace γj by the new candidate γ̃j. Otherwise leave γj unchanged. Use Lemma 2.1 to evaluate p(D|γ, ν).

An alternative Metropolis step, which updates each coordinate of γj separately, is described in section 3.2. For the probing distribution gj(·), we use a multivariate normal N(γj, c²Cγ) with c = 0.1. Conceptually, any probing distribution that is symmetric in its arguments, that is, g(γ̃j|γj) = g(γj|γ̃j), would imply the desired posterior as stationary distribution of the corresponding Markov chain. For practical implementation, a probing distribution with acceptance rates not too close to zero or one is desirable. For a specialized setup, Gelman, Roberts, and Gilks (1996) showed that acceptance rates of around 25% are optimal. In the examples, we found appropriate values for c by trying a few alternative choices until we achieved acceptance rates in this range.

3. Given current values of (γ, ν), generate new values for β by a draw from the complete conditional p(β|γ, ν, D). This is a multivariate normal distribution with moments described in Lemma 2.1.

4. Given current values of (β, γ), replace the hyperparameters by a draw from the respective complete conditional posterior distributions: p(µβ|β, σβ) is a normal distribution, p(µγ|γ, Sγ) is multivariate normal, p(σβ^{−2}|β, µβ) is a gamma distribution, p(Sγ^{−1}|γ, µγ) is Wishart, and p(σ^{−2}|β, γ, y) is gamma, as corresponds to a normal linear model (see Bernardo & Smith, 1994).
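A compact sketch of one possible implementation of steps 2 and 3 for a single-input network. To stay short, the hyperparameters ν are held fixed here instead of being resampled in step 4, µβ and µγ are set to zero, the γ prior is taken as independent N(0, σ²γ) coordinates, and all function names and tuning values are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def psi(eta):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-eta))

def log_marglik(y, Z, sig_b, sigma):
    """log p(D | gamma) with beta integrated out (Lemma 2.1, mu_beta = 0)."""
    N, M = Z.shape
    S_b = np.linalg.inv(Z.T @ Z / sigma**2 + np.eye(M) / sig_b**2)
    m_b = S_b @ Z.T @ y / sigma**2
    return (-0.5 * M * np.log(sig_b**2) - 0.5 * m_b @ m_b / sig_b**2
            + 0.5 * np.linalg.slogdet(S_b)[1]
            - 0.5 * N * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum((y - Z @ m_b)**2) / sigma**2)

def mcmc(x, y, M=2, n_iter=1000, sig_b=10.0, sig_g=2.0, sigma=0.1, c=0.3):
    """Metropolis-within-Gibbs: step 2 updates each gamma_j with beta
    marginalized out; step 3 then redraws the whole beta vector in one block."""
    N = len(y)
    X = np.column_stack([np.ones(N), x])        # constant input x0 = 1
    gamma = rng.normal(0.0, 1.0, size=(M, 2))   # gamma_j = (gamma_j0, gamma_j1)
    draws = []
    for _ in range(n_iter):
        Z = psi(X @ gamma.T)
        ll = log_marglik(y, Z, sig_b, sigma)
        for j in range(M):                      # step 2: Metropolis on gamma_j
            prop = gamma.copy()
            prop[j] += rng.normal(0.0, c, size=2)   # symmetric probing density
            Zp = psi(X @ prop.T)
            llp = log_marglik(y, Zp, sig_b, sigma)
            # only the j-th factor of p(gamma | nu) changes; the rest cancels
            dprior = -0.5 * (np.sum(prop[j]**2) - np.sum(gamma[j]**2)) / sig_g**2
            if np.log(rng.uniform()) < (llp - ll) + dprior:   # equation 2.4
                gamma, Z, ll = prop, Zp, llp
        S_b = np.linalg.inv(Z.T @ Z / sigma**2 + np.eye(M) / sig_b**2)
        m_b = S_b @ Z.T @ y / sigma**2
        S_b = (S_b + S_b.T) / 2                 # symmetrize for the sampler
        beta = rng.multivariate_normal(m_b, S_b)    # step 3: block draw of beta
        draws.append((beta, gamma.copy()))
    return draws
```

A call like `mcmc(x, y, M=3)` returns a list of (β, γ) draws; a full implementation would add the step 4 conjugate draws for the hyperparameters and impose the order constraint of section 2.3 before summarizing the output.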

The proof of the convergence of this chain follows from arguments in Tierney (1994). To judge convergence in practice, we rely on both sampled paths of parameters of interest and a convergence diagnostic proposed by Geweke (1992), as illustrated in examples 3 and 4. Once we have an approximate posterior sample {θ1, . . . , θk}, we may undertake various posterior and predictive tasks as usual. For example, predictive means f(x) = E(yN+1|xN+1 = x, D) can be evaluated via

f̂(x) = (1/k) Σ_{t=1}^{k} E(yN+1|xN+1 = x, θ = θt).

We illustrate some of these calculations in the examples below.
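The sample average above is immediate to code once a posterior sample is available; under draw θt, E(yN+1|x, θt) is simply the network mean Σj βj ψ(x′γj). A sketch in our notation:

```python
import numpy as np

def psi(eta):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-eta))

def predictive_mean(x_new, sample):
    """fhat(x) = (1/k) sum_t E(y | x, theta_t), averaging the network mean
    sum_j beta_j psi(x' gamma_j) over an approximate posterior sample.

    x_new  : array (p+1,) including the constant x0 = 1
    sample : list of (beta, gamma) pairs from the MCMC output
    """
    return float(np.mean([beta @ psi(gamma @ x_new) for beta, gamma in sample]))
```

For a single-node draw with γ = (0, 0) and β = 2, every hidden activation is ψ(0) = 0.5, so the predictive mean at any x is 1.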

Example 1: Galaxy Data. We try to relate velocity (yi) and radial position (xi1) of galaxy NGC7531 at 323 different locations (Buta, 1987). For this example, we use only the first 80 observations. The data are shown in Figure 1. Radial positions are centered and scaled to have zero mean and unit variance, and velocities have been shifted by a constant offset of 1400. A constant covariate xi0 adds an intercept to the logistic regression terms ψ(x′γj) of the NN model. For this problem, we fit an FFNN with three hidden nodes and moments described in the appendix. Note that we did not use a hierarchical model. Figure 2 shows some aspects of the posterior inference. The two panels show the estimated marginal posterior distributions for β1 and γ11, showing multimodality of posteriors, a feature that hinders the use of other approximate integration methods based on normal approximations (see Bishop, 1995). As discussed in section 2.3, an order constraint on the γj was used to avoid nonidentifiability.

Figure 1: Example 1. Estimated regression curve f̂(x) (left panel) and a few draws from the posterior distribution on the regression curve (right panel) induced by the posterior distribution p(θ|D) on the parameters. In both panels, the dots show the data points.

Figure 2: Example 1. p(β1|D) and p(γ11|D) estimated by MCMC.

We also illustrate predictive inference. Figure 1a plots the fitted curve f̂(x). In addition to estimating the nonlinear regression curve, the MCMC allows a complete probabilistic description of the involved uncertainties. Figure 1b, for example, visualizes the posterior distribution on the nonlinear regression curve induced by the posterior distribution p(θ|D). We will revisit this example (after Example 2) to illustrate issues of multimodality in neural networks.

2.3 Posterior Multimodality. Multimodality issues have pervaded discussions of classic analysis of NN models (see Ripley, 1993). They are also important issues to be considered when implementing Bayesian inference in NNs (MacKay, 1995), since they affect the choice of the integration scheme and illuminate the discussion on model (architecture) selection. This issue of architecture choice has received relatively little attention in the literature.

Besides inherent multimodality due to the nonlinearity of FFNNs, multiple modes can occur for at least two more reasons, related to ambiguities in the parameterization. First, multiple modes occur because prior and likelihood, and hence the posterior, are invariant with respect to arbitrary relabeling of the nodes. This problem is easily avoided by introducing an arbitrary ordering of the nodes. For example, we could impose the constraint that the γjp be ordered, that is, γ1p ≤ γ2p ≤ . . . ≤ γMp. We used this constraint in the examples. Note that the prior p(γ) under the constraint is a factor M! larger than it would be under the same prior probability model without the constraint. The implementation of the Markov chain Monte Carlo scheme in section 2.2 may be simplified by the following observation. Define an alternative probing distribution γ′j ∼ g′j(·) by first generating a proposal γ̃j ∼ gj(·) without regard to the order constraint. If γ̃j violates the order constraint, permute the indices appropriately to get γ′; otherwise γ′ = γ̃. Also, instead of scanning over j = 1, . . . , M when updating γj, randomly choose j ∈ {1, . . . , M}. The resulting probing distribution on the constrained parameter space still defines a Metropolis step with a symmetric probing distribution. The implementation can be even further simplified by doing the reordering only before saving or printing the imputed values (γ1, . . . , γM). For the evaluation of equation 2.4 and the updating of other parameters, the ordering is irrelevant, since the posterior is invariant under permutations of the indices. In this case, the random scanning (i.e., randomly choosing the index of the next γj to be updated) is not required.
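The reordering before saving draws amounts to a single argsort per draw on the last input coordinate, carrying the output weights along (a sketch; names are ours):

```python
import numpy as np

def relabel(beta, gamma):
    """Sort hidden nodes so that gamma_{1p} <= ... <= gamma_{Mp} (last input
    coordinate), permuting the output weights beta with the same order.
    Prior and likelihood are permutation invariant, so this leaves the
    posterior unchanged."""
    order = np.argsort(gamma[:, -1])
    return beta[order], gamma[order]
```

Applying `relabel` to each saved draw maps all M! relabelings of a configuration to one representative, so the relabeling modes disappear from posterior summaries.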

Second, a more serious source of multimodality is the duplication of terms in the network and the inclusion of irrelevant nodes. Node duplication occurs when multiple hidden nodes with practically the same γ parameters are included. An irrelevant node is a hidden node j with practically zero hidden-to-output layer weight βj. We may see both as a manifestation of model mixing as follows. Denote with M0 the fixed architecture model (see equation 2.2) with M∗ hidden nodes. Denote with MM the fixed architecture model (see equation 2.2) with M distinct hidden nodes and nonzero weights βj. Model M0 contains MM, M = 1, . . . , M∗, as special cases by setting, for example, βi = 0 or γi = γM for i = M + 1, . . . , M∗. While exact equality of γ's or βj = 0 has zero posterior probability because of the continuous priors we have adopted over the parameters, approximate equality can have considerable posterior probability.

In fact, denote with pm(θ|D) the posterior distribution under model Mm. The posterior distribution p(θ|D) in model M0 can be rewritten as a mixture Σ_m Pr(M = m|D) pm(θ|D). If the terms of this mixture are spiked and well enough separated, p(θ|D) exhibits local modes corresponding to the submodels, with additional multimodality entering through the different ways of nesting MM in M0 (for example, nodes could be duplicated, or some of the weights βj could be set to zero). For demonstration, we generated data from an NN model with two distinct hidden nodes.

Example 2: Simulated Data. We simulated y1, . . . , yN from equations 2.2 and 2.3 with M = 2, γ1 = (γ10, γ11) = (2, −1), γ2 = (γ20, γ21) = (1, 1.5), β = (20, 10), N = 100, and σ = 0.1, and estimated models M3 and M2. The marginal posterior p(γ21|y), shown in Figure 3, has at least three local modes. The first local mode (around −1) and the third mode (around 1.5) are due to model M2 contained in M3 by duplicating node 1 in node 2 (first mode) or by duplicating node 3 (third mode). Also, under M3, p(γ21|y) shows a local mode around 0. This is due to nesting model M2 in M3 by setting β2 = 0. Conditional on β2 = 0, the conditional posterior for γ21 would coincide with the prior (i.e., centered around the prior mean zero).

Example 1 (continued). Figure 4 shows some more aspects of the posterior inference in Example 1, relating to multimodality due to node duplication. The patterns are similar to the simulated data. However, even under model M2, we still see some multimodality, some of which could be due to model M1.

From a predictive point of view, node duplication is no issue. If the focus of the analysis is prediction, for example, fitting a nonlinear regression surface, one could ignore the possibility of node duplication. However, it is important to be aware of the implications for the particular estimation scheme: routine application of any numerical posterior integration scheme based on approximate posterior normality and unimodality would be hindered. This includes widely used algorithms like direct normal approximation, Laplace integration, importance sampling, and iterative gaussian quadrature. If used, inference will only be applicable to the local mode (i.e., the particular submodel) on which the normal approximation was based. Node duplication will accentuate the problem of posterior multiple modes and hence hinder the efficiency of MCMC methods, especially random walk Metropolis schemes, which could easily get trapped in a particular local mode. This is of particular concern since most commonly used convergence diagnostics are based on analyzing the simulation output and could falsely diagnose practical convergence.

Figure 3: Example 2 (simulated). Posterior distributions under model M3 (top row) and under model M2 (bottom row). The panels on the left show the marginal posterior distributions p(γ11|D) (solid line), p(γ21|D) (dotted line), and p(γ31|D) (dashed line in the top left panel). The posterior multimodality can be clearly seen in the marginal posterior distributions under M3. Note how p(γ21|D) takes the form of a mixture of p(γ11|D), the prior p(γ21), and p(γ31|D). This multimodality vanishes under M2. Panels on the right show the joint posterior distribution p(γ11, γ21|D) under M3 (top) and M2 (bottom). The line indicates the 45-degree line γ11 = γ21. All multimodality is removed by constraining to M = 2 nodes. The data were generated from a model with two hidden nodes. In general, some multimodality might be left due to M1.

As a consequence, we are interested in removing ambiguities in the pa-

Page 11: Issues in Bayesian Analysis of Neural Network Models

Issues in Bayesian Analysis of Neural Network Models 759

Cj1-10 -5 0 5 10

0.0

0.05

0.15

0.25

• ••••

• •••••••

••

• •••

•••

••

••

• ••• •• ••

• •••••

• •

••

•••

•••

•••

••

••

•••• ••

• • •••

•• •••• • ••

•••••• • •

••

••••

•••

• •

• •

••

•• •••• • •• •

••

•••

•••

••

•• •••

••• •

• ••

• •

• ••

••

• ••• ••

••••••

•••• • • • ••• ••

• ••••••••

••••••••

•• •

Figure 4: Example 1. Posterior distributions under model M3 (top row) and M2 (bottom). The features are very similar to the posterior distribution for the simulated data in the previous figure. In particular, note that p(γ21|D) is a mixture of p(γ11|D), p(γ31|D), and the prior p(γ21). Also, the second mode of p(γ11|D) duplicates p(γ31|D), indicating positive posterior probability for model M1 using node 3 only. As in the simulated example, most of the multimodality from the posterior under M3 is lost under M2. Only some minor mode due to model M1 remains. It can be seen as a secondary mode in p(γ11|D), as well as a secondary mode close to the 45-degree line in p(γ11, γ21|D).

rameterization whenever possible. Often multimodality can be partly removed by picking a lower-order model; that is, when symptoms of node duplication are noticed in the posterior distribution, one could consider models with M < M∗ nodes, as shown in Figures 3 and 4.

This discussion leads us to issues of model (architecture) selection. We believe that a conceptually clear and straightforward approach is to include

Page 12: Issues in Bayesian Analysis of Neural Network Models

760 Peter Muller and David Rios Insua

explicitly the number of hidden nodes M as a parameter in the model, that is, use variable architecture NN models. This is fairly natural since, apart from simulated examples, we do not expect a "true" M for the NN model; hence we need to model uncertainty about it. There remains the problem of providing procedures for modeling it and a scheme for estimating the model. This is the topic of section 3.

3 Variable Architecture NN Models

Our considerations in section 2 lead us to contemplate M as another parameter, for two main reasons. First, a random M with a prior distribution favoring smaller values reduces posterior multimodality, as discussed in section 2.3. Although in principle posterior multimodality does not prevent valid Bayesian inference, we consider it good Bayesian modeling to avoid parameterizations leading to posterior multimodality. Second, the marginalization over βj (see Lemma 2.1) requires the inversion of matrices of dimension related to M. Avoiding unnecessarily large M is critical to reduce computational effort.

We provide here a scheme for modeling and estimating uncertainty about M, thereby dropping the assumption of a fixed, known architecture. We actually allow the model to "select" the size of the hidden layer by including indicators dj, with dj = 1 if a node is included and dj = 0 if a node is dropped. The extension of our algorithm to this case will allow the identification of architectures supported by the data.

We generalize the fixed architecture model in yet another direction, by including a linear regression term x′λ to model level and linear effects efficiently. Typically this would tend to reduce the size of the network. We will always use d1 = 1, assuming that only problems that cannot be described by a linear model alone would be analyzed by an NN model. This corresponds to a model-building strategy based on blocks, where the linear part models linear aspects of the process of interest and the NN part models nonlinear aspects. Of course, the constraint d1 = 1 is not required and could be removed if desired. We included it in our implementation for technical convenience, to avoid separate code for the special case M = 0.

Neal (1996) suggests an alternative approach based on choosing a big enough number M∗ of hidden nodes and an appropriate prior. Our model comprises M = M∗ as a special case, but for the aforementioned reasons, we favor the approach with random M.

3.1 The Model. The model we use is:

yi = x′iλ + Σ_{j=1}^{M∗} dj βj ψ(x′iγj) + εi,   i = 1, . . . , N,

εi ∼ N(0, σ²),   ψ(η) = exp(η)/(1 + exp(η)).   (3.1)
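For concreteness, the mean function of equation 3.1 can be evaluated directly. The following is a minimal sketch in Python/NumPy; it is our own illustration, not the authors' code, and all names are ours:

```python
import numpy as np

def psi(eta):
    """Logistic activation of equation 3.1."""
    return 1.0 / (1.0 + np.exp(-eta))

def mean_function(x, lam, beta, gamma, d):
    """Mean of y_i in equation 3.1: x'lambda + sum_j d_j beta_j psi(x'gamma_j).

    x     : (p+1,)        covariates, including the intercept term
    lam   : (p+1,)        linear regression coefficients lambda
    beta  : (M_star,)     hidden-node output weights
    gamma : (M_star, p+1) hidden-node input weights (one row per node)
    d     : (M_star,)     0/1 inclusion indicators
    """
    hidden = psi(gamma @ x)               # psi(x' gamma_j) for every node j
    return x @ lam + np.sum(d * beta * hidden)
```

Excluded nodes (dj = 0) contribute nothing to the mean, which is exactly what makes the indicator formulation convenient for the variable architecture sampler below.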

We include at most M∗ hidden nodes, allowing for architectures from one hidden node (when only d1 = 1) to M∗ hidden nodes (when di = 1 for all i). Again, we recommend including an order constraint, γ1p ≤ γ2p ≤ · · · ≤ γM∗p, to avoid trivial posterior multimodality due to permutations of indices.

The prior we introduce over the indicators is:

Pr(dj = d | dj−1 = 1) = { 1 − α, for d = 0;  α, for d = 1 },   j = 1, . . . , M∗,   (3.2)

and dj = 0 if dj−1 = 0, with d1 = 1. Observe that the indicators dj are ordered decreasingly, so that dj = 0 implies dk = 0 for k > j. The prior distribution for the indicators dj allows a priori any architecture with M ≤ M∗ hidden nodes. It actually implies a geometric prior with parameter α, truncated at M∗, on the size M of the hidden layer. This prior enables efficient implementation and favors parsimony, in the sense of supporting architectures with a smaller number of hidden nodes.
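The prior in equation 3.2 is easy to simulate, and the implied truncated geometric prior on M is available in closed form. A small illustrative sketch (our own notation, not the authors' implementation):

```python
import numpy as np

def sample_indicators(alpha, M_star, rng):
    """Draw d_1, ..., d_{M*} from the prior in equation 3.2: d_1 = 1, and
    each subsequent node is kept with probability alpha until the first 0,
    which switches off all remaining nodes."""
    d = np.zeros(M_star, dtype=int)
    d[0] = 1
    for j in range(1, M_star):
        if rng.random() < alpha:
            d[j] = 1
        else:
            break                  # d_j = 0 implies d_k = 0 for all k > j
    return d

def prior_pmf_M(alpha, M_star):
    """Implied prior on M = sum_j d_j: geometric(alpha) truncated at M*."""
    return np.array([alpha ** (m - 1) * (1 - alpha) for m in range(1, M_star)]
                    + [alpha ** (M_star - 1)])
```

Smaller α concentrates the prior on small architectures, which is the parsimony effect described above.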

The prior over the other network parameters is similar to that of the model in equations 2.2 and 2.3, with an additional prior for α:

βj ∼ N(µβ, σ²β),   λ ∼ N(µβ, σ²β),

γj ∼ N(µγ, Sγ),   α ∼ Beta(aα, bα).   (3.3)

Finally, we complete the model with the same hyperpriors on (µβ, µγ, σβ, Sγ, σ²) as in the fixed architecture model.

3.2 An MCMC Method for the Variable Architecture Case. The computational scheme is a natural extension of that in section 2.2, yet another advantage of favoring an MCMC approach to NN modeling. The simulation scheme outlined for inference in the fixed architecture model requires only minor modifications to be used for the models in equations 3.1, 3.2, and 3.3. Conditional on currently imputed values for the indicators dj, the model reduces to the fixed-architecture one. Given the other model parameters, the conditional posterior probabilities for dj = 0 and dj = 1 are easily computed.

Denote with M = Σ_{j=1}^{M∗} dj the number of hidden nodes currently included in the model. By definition of the indices, we always have dj = 1, j = 1, . . . , M, and dj = 0, j = M + 1, . . . , M∗. Before discussing details of the algorithm, we outline the updating scheme, which cycles through the following steps until convergence is judged. The notation x|y, z indicates that x is being updated using the current values of y and z, γ−jk denotes γ without

γjk, d−j = (d1, . . . , dj−1, dj+1, . . . , dM+1), and so forth. Updating details are discussed below.

1. γjk | γ−jk, M, ν, D,   j = 1, . . . , M + 1,  k = 0, . . . , p.

2. dj | d−j, γ1, . . . , γM, γM+1, ν, D,   j = 1, . . . , M + 1.

3. β1, . . . , βM, λ | γ1, . . . , γM, M, ν, D.

4. ν | β1, γ1, . . . , βM, γM, λ, D.

In step 1, we marginalize over (λ, β) and include γM+1. Conditional on M and the hyperparameters, the conditional posterior on γM+1 is just the N(µγ, Sγ) prior. All other γj (j = 1, . . . , M) are updated through Metropolis steps, similar to step 2 in section 2.2. Also, the comments in section 2.3 about resampling under the order constraint on the γj in the fixed architecture model apply equally to the variable architecture model. Randomly select j ∈ {1, . . . , M}, k ∈ {1, . . . , p}, and generate a proposal γ̃jk ∼ g(γ̃jk), with g(·) described below. If γ̃jk violates the order constraint, permute the indices appropriately to get γ′; otherwise γ′ = γ̃, where γ̃ is the γ vector with γjk replaced by γ̃jk. Compute

a(γ, γ′) = min [ 1, { p(D|γ′, M, ν) p(γ′|M, ν) } / { p(D|γ, M, ν) p(γ|M, ν) } ].

Use Lemma 2.1 to evaluate p(D|γ, ν). With probability a(γ, γ′), replace γ by the new candidate γ′; otherwise, leave γ unchanged. For the probing distribution g(·), we use a normal N(γjk, c²Cγ,kk) distribution, where Cγ,kk is the kth element on the diagonal of Cγ, and c is a fixed scaling factor. The specific choices for c used in our examples are reported in the appendix. Alternatively, one could consider proposals changing all coordinates of γj jointly, as we did in equation 2.4.
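The single-coordinate Metropolis update for γjk can be sketched as follows, with the marginalized likelihood p(D|γ, ν) of Lemma 2.1 passed in as a black-box callable. This is a hypothetical sketch; function names are ours:

```python
import numpy as np

def metropolis_gamma_step(gamma, log_marg_lik, log_prior, c, C_diag, rng):
    """One random-walk Metropolis update of a single element gamma[j, k].

    gamma        : (M, p+1) current weights, rows ordered by the last column
                   (the order constraint on gamma_{jp})
    log_marg_lik : callable, log p(D | gamma, nu) with (lambda, beta)
                   marginalized out (Lemma 2.1)
    log_prior    : callable, log p(gamma | nu)
    c, C_diag    : proposal scale factor and diagonal of C_gamma
    """
    M, P = gamma.shape
    j, k = rng.integers(M), rng.integers(P)
    prop = gamma.copy()
    prop[j, k] += c * np.sqrt(C_diag[k]) * rng.normal()
    # permute rows to restore the order constraint on the last column
    prop = prop[np.argsort(prop[:, -1])]
    log_a = (log_marg_lik(prop) + log_prior(prop)
             - log_marg_lik(gamma) - log_prior(gamma))
    return prop if np.log(rng.random()) < log_a else gamma
```

Because the proposal is symmetric, g cancels and only the posterior ratio appears in the acceptance probability, exactly as in the display above.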

Step 2 refers to updating the number of hidden nodes. Again we marginalize over (β, λ). Denote with γ the list of regression parameters, including the (M + 1)st term: γ = (γ1, . . . , γM+1). Similarly, d = (d1, . . . , dM+1). To update d we use a Metropolis step with the following probing distribution g(γ̃, d̃|γ, d) to generate a proposal (γ̃, d̃). We include γ in the notation because the proposal might include a permutation of indices, which would affect the γ and the d vector. First, randomly select an index j ∈ {1, . . . , M + 1}, with uniform probability 1/(M + 1) each. Second, define d̃j by flipping dj, that is, d̃j = 1 − dj. Third, permute the indices to maintain the order constraint on γjp and the constraint on the dj. These three steps implicitly define g(γ̃, d̃|γ, d). Note that by definition of g, we have g(γ̃, d̃|γ, d) = 1/(M + 1) for all possible (γ̃, d̃), where M = Σ dj is the number of terms before generating the proposal. Also g(γ, d|γ̃, d̃) = 1/(M̃ + 1), where M̃ = Σ d̃j is the number

of terms in the proposal. Having generated the proposal, evaluate

a(γ, d; γ̃, d̃) = min [ 1, { p(γ̃, d̃|ν, D) / p(γ, d|ν, D) } · { g(γ, d|γ̃, d̃) / g(γ̃, d̃|γ, d) } ]

= min [ 1, { p(D|γ̃, d̃, ν) p(γ̃|d̃, ν) p(d̃|ν) } / { p(D|γ, d, ν) p(γ|d, ν) p(d|ν) } · (M + 1)/(M̃ + 1) ].

With probability a(γ, d; γ̃, d̃), accept the proposal (γ̃, d̃) as the new value for (γ, d); otherwise keep (γ, d) unchanged. Note that for a proposal with M̃ = M − 1, we get p(γ̃|d̃, ν)/p(γ|d, ν) = 1/M when using an order constraint on γ1p, . . . , γMp. This is because under the proposal (γ̃, d̃) only γj, j = 1, . . . , M − 1, are subject to the order constraint, as opposed to γj, j = 1, . . . , M, under the current parameter vector. For the same reason, p(γ̃|d̃, ν)/p(γ|d, ν) = M̃ if the proposal increases the number of hidden nodes by one. Step 2 can be repeated several times. If this is done, it is important to note two more implementation details. First, if a dj is set to zero, then the corresponding γj becomes γM+1, after reordering the indices and changing M, and is used for the next iteration. Second, if step 2 is repeated, say, T times, then step 1 needs to be modified to generate γM+j for j = 1, . . . , T.
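The indicator flip of step 2 can be sketched as follows. This is our own simplified illustration: it takes the marginalized likelihood as a black box, writes the order-constraint correction as the factorial ratio M̃!/M! (which reduces to 1/M or M̃ for single flips), and ignores the truncation of the geometric prior at M∗:

```python
import math
import numpy as np

def flip_indicator_step(gamma, d, log_lik, alpha, rng):
    """One Metropolis flip of a hidden-node inclusion indicator (step 2).

    d       : 0/1 vector of length M*, decreasing, with at least one 1
    log_lik : callable, log p(D | gamma, d, nu), marginalized over (beta, lambda)
    alpha   : parameter of the geometric prior of equation 3.2
    """
    M = int(d.sum())
    j = rng.integers(min(M + 1, len(d)))   # j in {1, ..., M+1} of the paper
    d_prop = d.copy()
    d_prop[j] = 1 - d_prop[j]
    d_prop = np.sort(d_prop)[::-1].copy()  # keep the indicators decreasing
    M_new = int(d_prop.sum())
    if M_new == 0:                         # keep at least one node (d_1 = 1)
        return d
    log_a = (log_lik(gamma, d_prop) - log_lik(gamma, d)
             + (M_new - M) * math.log(alpha)                # prior p(d | nu)
             + math.lgamma(M_new + 1) - math.lgamma(M + 1)  # order constraint
             + math.log(M + 1) - math.log(M_new + 1))       # proposal ratio
    return d_prop if np.log(rng.random()) < log_a else d
```

With a flat likelihood the chain simply samples the prior on M, which is a convenient sanity check.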

It is important to marginalize over β when updating the d. Conditioning on β would lead to a very slowly mixing Markov chain and render the scheme of little practical value. For example, when updating dM+1, a move to d̃M+1 = 1, that is, M := M + 1, would only rarely be accepted: only when, by chance, the weight βM+1, previously sampled from the prior, happens to be "right." Compare with the discussion in Example 3 and Figure 5. A further marginalization is possible by analytically integrating over µγ. We used this in our implementation but consider it far less critical than the marginalization over β.

In step 3, we sample all βj and λ jointly, drawing from the complete conditional posterior. The complete conditional p(λ, β1, . . . , βM | γ1, . . . , γM, M, ν, D) is a multivariate normal distribution, arising as the posterior in a linear normal regression model, as described in Lemma 2.1. Step 4 is unchanged from the fixed architecture case. Convergence of the algorithm follows from the arguments in Tierney (1994).
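Step 3 is a standard conjugate draw. Writing θ = (λ, β) and letting Z denote the design matrix that stacks the covariates and the hidden-unit outputs ψ(x′iγj), the complete conditional is the usual normal posterior of a Bayesian linear regression. A sketch, assuming for simplicity an independent N(m0, v0 I) prior on θ (our own simplification of the hierarchical prior):

```python
import numpy as np

def draw_linear_weights(Z, y, sigma2, m0, v0, rng):
    """Draw theta = (lambda, beta) jointly from its complete conditional.

    Z      : (N, q) design matrix [x_i, psi(x_i' gamma_1), ..., psi(x_i' gamma_M)]
    y      : (N,)   responses
    sigma2 : error variance
    m0, v0 : prior mean vector and scalar prior variance of theta
    """
    q = Z.shape[1]
    prec = Z.T @ Z / sigma2 + np.eye(q) / v0        # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (Z.T @ y / sigma2 + m0 / v0)
    return rng.multivariate_normal(mean, cov)
```

The inversion of the q × q precision matrix is the cost referred to in the opening of this section: q grows with M, which is one reason to keep M small.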

We illustrate the variable architecture model with two examples. The first shows how our model may cope with a multivariate problem. The second is structurally complicated and suggests how our model may adapt to sharp edges. The flexibility of this model in coping with these features makes it very competitive with respect to other smoothing methods, including the model in equations 2.2 and 2.3.

Figure 5: Example 3 (robot arm). Predictive mean squared error (MSE) for the test data set. The figure plots MSE averaged over batches of 10 iterations versus iteration. The solid curve corresponds to simulations using the MCMC scheme described in section 3.2. The dashed curve shows the MSE for the same MCMC scheme, but without marginalizing over (β, λ) when updating γ and M. The horizontal dashed line indicates the predictive MSE after 20,000 iterations (at MSE = 0.00545).

Example 3: Robot Arm. This test problem is analyzed in MacKay (1992) and reanalyzed in Neal (1993, 1996). We have to learn a mapping from two real-valued inputs representing joint angles for a robot arm to two real-valued outputs that predict the resulting arm position, defined by

yi1 = 2.0 cos(xi1) + 1.3 cos(xi1 + xi2) + εi1,   i = 1, . . . , N,

yi2 = 2.0 sin(xi1) + 1.3 sin(xi1 + xi2) + εi2,

εik iid∼ N(0, 0.05²).

To accommodate the bivariate response, we generalize the model in equation 3.1 to

yik = x′iλk + Σ_{j=1}^{M∗} dj βjk ψ(x′iγj) + εik,   i = 1, . . . , N,  k = 1, . . . , K,

εik iid∼ N(0, σ²),   ψ(η) = tanh(η).   (3.4)
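The data-generating mechanism of Example 3 is easy to reproduce. A sketch (the input design is left to the caller; the noise standard deviation 0.05 matches the theoretical minimum predictive MSE 2σ² = 0.0050 quoted below):

```python
import numpy as np

def robot_arm(x1, x2, rng, sd=0.05):
    """Noisy robot-arm mapping of Example 3: joint angles -> arm position."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + sd * rng.standard_normal(x1.shape)
    y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + sd * rng.standard_normal(x1.shape)
    return y1, y2
```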

Following MacKay (1992), we replace the logistic activation function by tanh(·). To avoid nonidentifiability in the likelihood, we add a constraint

Table 1: Example 3. Geweke's (1992) Convergence Diagnostic and Lag 10 Autocorrelations.

            With Marginalizing (λ, β)      Without Marginalizing (λ, β)
Variable    Geweke   Autocorrelation       Geweke   Autocorrelation
M           −1.81      0.81                  NA        0.99
σ            1.75      0.05                 13.40      0.97
µβ          −0.46      0.03                  2.59     −0.00
µγ,0        −1.28      0.24                 −1.86      0.03
µγ,1        −0.75     −0.03                 −2.63      0.01
µγ,2         0.08     −0.05                  0.31     −0.00
λ10          0.19      0.34                −21.80      0.99
λ11         −1.77      0.58                −12.40      0.90
λ12          1.87      0.70                −19.70      0.97
λ20         −0.53      0.27                −15.90      0.97
λ21          0.35      0.48                  7.63      0.94
λ22          0.48      0.91                −11.10      0.97

Note: Estimates are based on 20,000 iterations, discarding the first 1000 as burn-in and thinning to every tenth iteration thereafter. Without marginalization, the simulated chain did not change M over the last 50% of the iterations, making evaluation of Geweke's convergence diagnostic impossible for M.

γj1 > 0. Without this constraint, one could change (βj, γj) to (−βj, −γj) without changing the likelihood. If the prior is symmetric around 0, the posterior distribution would remain invariant under such transformations.

The prior model for (λk, γj, βjk) remains the independent normal model (see equation 3.3), with hyperparameters as in the appendix. We used the same data set as MacKay (1992). We split the data into a training data set (the first 200 observations) and a test data set (the last 200 observations). Figure 5 reports the mean squared predictive error for the test data set as a function of the number of iterations. After around 300 iterations, the mean squared error is already close to the asymptotic value 0.00545 (note the theoretical minimum 2σ² = 0.0050), indicating that short run lengths of several hundred iterations are sufficient for predictive purposes. However, to monitor convergence diagnostics on some selected parameters, we needed 20,000 iterations to achieve practical convergence. Details are reported in Table 1. The estimated marginal posterior probabilities p(M|D) for the number of hidden nodes are 0.22, 0.47, 0.27, and 0.04 for M = 6, 7, 8, and 9, respectively.

For the Markov chain Monte Carlo scheme to be of practical use, the marginalization over (β, λ) is crucial. This is illustrated in Figure 5 and the right column of Table 1.

Figure 6: Example 4 (reservoir). Data and fitted surface using the NN model of section 3. The solid triangles indicate the data points.

Example 4: Reservoir Management. We apply the described methods to a case study from Rios Insua and Salewicz (1995). The example concerns a reservoir operation problem, complicated by the existence of multiple objectives, uncertainty about the inflows, and the effect of time. The decisions to be made each month were the volumes of water to be released through turbines and spilled, based on maximizing a predictive expected utility. There was no analytic expression for the expected utility, so we could appeal to an optimization method based on function evaluations only, such as the Nelder-Mead method.

Alternatively, we could evaluate the expected utility at a grid of controls, fit a surface, and optimize it. We illustrate this last approach, fitting an NN model. Figure 6 shows the data and the fitted surface. Note how the NN model fitted the sharp edge in front, a case in which many commonly used smoothing methods might fail. Figure 7 shows the marginal posterior on M. Table 2 reports convergence diagnostics.

In addition to the normal prior (see equation 3.3), we constrained the γjk by |γjk| < 10.0 to avoid numerical problems. Otherwise, proposals for the γ vector could lead to degenerate design matrices in the regression problems required for the evaluation of p(γ|ν) (see Lemma 2.1).

4 Discussion

Neural network models are used to model nonlinear features in problems like approximation, regression, smoothing, forecasting, and classification. Although they are typically presented as black box models, allowing the incorporation of prior knowledge in those models enhances their performance. This begs naturally for a Bayesian approach to NN models. Among other advantages, this allows for the coherent incorporation of all uncertainties, including those relating to the hidden layer size. This approach, however, leads to difficult computational problems.

Figure 7: Example 4 (reservoir). Posterior p(M|D) on the size of the hidden layer. (a) Plot of the estimated posterior distribution p(M|D). (b) The trajectory of M along the Markov chain.

Table 2: Example 4. Geweke's (1992) Convergence Diagnostic and Lag 10 Autocorrelations.

Variable    Geweke   Autocorrelation
M            0.29      0.67
σ           −0.07      0.27
µβ          −1.35      0.00
µγ,0        −0.87      0.25
µγ,1        −1.63      0.38
µγ,2        −1.44      0.19
λ0          −1.88      0.18
λ1           0.96      0.51
λ2           0.38      0.22

Note: Estimates are based on 40,000 iterations, discarding the first 1000 as burn-in and thinning to every tenth iteration thereafter.

Specifically, we have noted potential problems of normal approximation-based approaches due to multimodality. As an alternative, we have provided a powerful Markov chain Monte Carlo scheme that avoids those problems and permits routine Bayesian analysis of FFNN models. The scheme allows the consideration of variable architecture networks and consequently automates the choice of the network architecture. We have also shown that

the scheme allows the combination with more conventional models such as linear regression.

In summary, we have provided a general framework for the Bayesian analysis of FFNN models. Future work will deal with a somewhat inverse problem: how FFNN models may enhance the Bayesian tool kit. In particular, from a statistical modeling point of view, NNs are very close to mixture models. Many issues about posterior multimodality and computational strategies in NN modeling are of relevance in the wider class of mixture models (see Escobar & West, 1995; West, Muller, & Escobar, 1994; West & Turner, 1994). For example, we could explore the potential of our framework when dealing with uncertainty in the number of components of a mixture model.

Appendix: Implementation, Initialization and Convergence Diagnostic

In the examples, we have used the following initialization and hyperparameters. The covariates were standardized to have mean x̄i = 0 and var(xi) = 1.0 (except for the dummy intercept x0i = 1). In Examples 1 and 2 we fixed the hyperparameters µβ, µγ, σβ, and Sγ at µβ = µγ,j = 0, σ²β = 10,000, and Sγ = diag(25, 25). In Examples 3 and 4 we used initial values µβ = µγ,j = 0, σ²β = 10, and Sγ = diag(4, 10, 10). The sampling variance σ² was fixed to σ² = 100 and 1.0 in Examples 1 and 2, respectively, and initialized as σ² = 0.0025 and σ² = 0.5 in Examples 3 and 4. The remaining hyperparameters in Examples 3 and 4 were chosen as aβ = aγ,j = 0, Aβ = 1, Aγ = diag(1, 1, 1), cb = 11, cγ = 13, Cb = σ², and Cγ = Sγ. In Examples 3 and 4, we initialized α = 0.2 and used hyperparameters aα = bα = 1. The prior on M was truncated at M∗ = 20, and M was initialized with M = 3. For the scaling parameter c in the probing distribution for γjk, we used c = 0.1 in Examples 3 and 4.

We simulated 10,000, 10,000, 20,000, and 40,000 iterations in Examples 1, 2, 3, and 4, respectively. The decision to terminate simulations was based on the convergence diagnostic proposed by Geweke (1992); compare with Tables 1 and 2. The relatively long simulation lengths in the simple examples (1 and 2) were required to obtain sufficiently large Monte Carlo posterior samples for the posterior scatter plots. In Examples 1 and 2 we discarded the initial 100 iterations as burn-in and saved every tenth iteration thereafter to collect an approximate posterior Monte Carlo sample, used for the ergodic averages. In Examples 3 and 4 we discarded the first 1000 as burn-in and saved every tenth.
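Geweke's (1992) diagnostic compares the means of an early and a late segment of a chain. A simplified sketch (using naive variance estimates in place of Geweke's spectral-density estimates, so it is only indicative for strongly autocorrelated chains):

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke (1992) score: compare the mean of the first `first`
    fraction of a scalar chain against the last `last` fraction; values of
    |z| well above 2 flag lack of convergence."""
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1.0 - last) * n):]
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)
```

For an approximately stationary chain the score is approximately standard normal, which is how the entries in Tables 1 and 2 are read.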

Acknowledgments

Research was supported by grants from the National Science Foundation, CICYT, and the Iberdrola Foundation. It was completed while P.M. visited

the Department of Artificial Intelligence of Madrid Technical University and D.R.I. was visiting CNR-IAMI. We are very grateful for the numerous remarks from the referees and for discussions with Fabrizio Ruggeri.

References

Amari, S. (1993). Mathematical methods of neurocomputing. In O. E. Barndorff-Nielsen, J. L. Jensen, & W. S. Kendall (Eds.), Networks and chaos. London: Chapman and Hall.

Beale, R., & Jackson, T. (1990). Neural computing. Bristol: Hilger.

Berger, J. O. (1990). Robust Bayesian analysis: Sensitivity to the prior. Journal of Statistical Planning and Inference, 25, 303–328.

Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.

Besag, J., & Green, P. J. (1993). Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society, Series B, 55, 25–37.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Buntine, W. L., & Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5, 603–643.

Buta, R. (1987). The structure and dynamics of ringed galaxies, III. Astrophysical Journal Supplement Series, 64, 1–37.

Cheng, B., & Titterington, D. M. (1994). Neural networks: A review from a statistical perspective (with discussion). Statistical Science, 9, 2–54.

Cybenko, G. (1989). Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems, 2, 303–314.

Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588.

Gelman, A., Roberts, G. O., & Gilks, W. R. (1996). Efficient Metropolis jumping rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 5. Oxford: Oxford University Press.

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4. Oxford: Oxford University Press.

Lavine, M., & West, M. (1992). A Bayesian method for classification and discrimination. Canadian Journal of Statistics, 20, 451–461.

Lindley, D., & Smith, A. F. M. (1972). Bayes estimates for the linear model (with discussion). Journal of the Royal Statistical Society, Series B, 34, 1–41.

MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computation, 4, 448–472.

MacKay, D. J. C. (1995). Bayesian methods for neural networks: Theory and applications (Technical Rep.). Cambridge: Cavendish Laboratory, Cambridge University.

Muller, B., & Reinhardt, J. (1990). Neural networks. Berlin: Springer-Verlag.

Neal, R. M. (1993). Bayesian learning via stochastic dynamics. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5. San Francisco: Morgan Kaufmann.

Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.

Raftery, A. E., Madigan, D. M., & Volinsky, C. (1996). Accounting for model uncertainty in survival analysis improves predictive performance. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 5. Oxford: Oxford University Press.

Rios Insua, D., & Salewicz, K. A. (1995). The operation of Kariba Lake: A multiobjective decision analysis. Journal of Multi-Criteria Decision Analysis, 4, 203–222.

Ripley, B. D. (1993). Statistical aspects of neural networks. In O. E. Barndorff-Nielsen, J. L. Jensen, & W. S. Kendall (Eds.), Networks and chaos. London: Chapman and Hall.

Robert, C. P. (1994). The Bayesian choice. New York: Springer-Verlag.

Spiegelhalter, D. J., Thomas, A., & Gilks, W. R. (1994). BUGS manual. Cambridge: MRC Biostatistics Unit, IPH.

Stern, H. S. (1996). Neural networks in applied statistics. Technometrics, 38, 205–220.

Thodberg, H. H. (1996). Review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7, 56–72.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701–1762.

Wang, Y. (1995). Unpredictability of standard back propagation neural networks. Management Science, 41, 555–559.

West, M., Muller, P., & Escobar, M. D. (1994). Hierarchical priors and mixture models, with application in regression and density estimation. In A. F. M. Smith & P. R. Freeman (Eds.), Aspects of uncertainty: A tribute to D. V. Lindley. New York: Wiley.

West, M., & Turner, M. D. (1994). Deconvolution of mixtures in analysis of synaptic transmission. The Statistician, 43, 31–43.

Received March 6, 1996; accepted August 5, 1997.
