
Statistics and Computing
https://doi.org/10.1007/s11222-018-9816-4

Gauss–Christoffel quadrature for inverse regression: applications to computer experiments

Andrew Glaws1 · Paul G. Constantine1

Received: 26 October 2017 / Accepted: 29 May 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
Sufficient dimension reduction (SDR) provides a framework for reducing the predictor space dimension in statistical regression problems. We consider SDR in the context of dimension reduction for deterministic functions of several variables such as those arising in computer experiments. In this context, SDR can reveal low-dimensional ridge structure in functions. Two algorithms for SDR—sliced inverse regression (SIR) and sliced average variance estimation (SAVE)—approximate matrices of integrals using a sliced mapping of the response. We interpret this sliced approach as a Riemann sum approximation of the particular integrals arising in each algorithm. We employ the well-known tools from numerical analysis—namely, multivariate numerical integration and orthogonal polynomials—to produce new algorithms that improve upon the Riemann sum-based numerical integration in SIR and SAVE. We call the new algorithms Lanczos–Stieltjes inverse regression (LSIR) and Lanczos–Stieltjes average variance estimation (LSAVE) due to their connection with Stieltjes’ method—and Lanczos’ related discretization—for generating a sequence of polynomials that are orthogonal with respect to a given measure. We show that this approach approximates the desired integrals, and we study the behavior of LSIR and LSAVE with two numerical examples. The quadrature-based LSIR and LSAVE eliminate the first-order algebraic convergence rate bottleneck resulting from the Riemann sum approximation, thus enabling high-order numerical approximations of the integrals when appropriate. Moreover, LSIR and LSAVE perform as well as the best-case SIR and SAVE implementations (e.g., adaptive partitioning of the response space) when low-order numerical integration methods (e.g., simple Monte Carlo) are used.

Keywords Sufficient dimension reduction · Sliced inverse regression · Sliced average variance estimation · Orthogonal polynomials

1 Introduction and background

Increases in computational power have enabled the simulation of complex physical processes by models that more

Glaws’ work is supported by the Ben L. Fryrear Ph.D. Fellowship in Computational Science at the Colorado School of Mines and the Department of Defense, Defense Advanced Research Projects Agency’s program Enabling Quantification of Uncertainty in Physical Systems. Constantine’s work is supported by the US Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Number DE-SC-0011077.

✉ Andrew Glaws, [email protected]

Paul G. Constantine, [email protected]

1 Department of Computer Science, University of Colorado Boulder, Boulder, CO, USA

accurately reflect reality. Scientific studies that employ such models are often referred to as computer experiments (Sacks et al. 1989; Koehler and Owen 1996; Santner et al. 2003). For algorithm development in this context, the computer simulation is represented by a deterministic function that maps the simulation inputs to outputs of interest. These functions rarely have closed-form expressions and are often expensive to evaluate. To enable computer experiments with expensive models, we may construct a cheaper surrogate—e.g., a response surface approximation (Myers and Montgomery 1995; Jones 2001)—that can be evaluated quickly. However, building a surrogate suffers from the curse of dimensionality (Traub and Werschulz 1998; Donoho 2000); loosely speaking, the number of model evaluations required to achieve a prescribed accuracy grows exponentially with the input space dimension.

One approach to combat the curse of dimensionality is to reduce the dimension of the input space. For example,


when the inputs are correlated, then principal component analysis (Jolliffe 2002) can identify a set of fewer uncorrelated variables that linearly map to the correlated simulation inputs. When the inputs are independent, one may use global sensitivity analysis (Saltelli et al. 2008) to identify a subset of inputs whose variability does not significantly change the outputs; these inputs can be fixed at nominal values, thus reducing the input dimension. Active subspaces (Constantine 2015) generalize the coordinate-based global sensitivity analysis to important directions in the input space; perturbing inputs along the orthogonal complement of the active subspace changes the outputs relatively little, on average.

In the related context of statistical regression, sufficient dimension reduction (SDR) provides a theoretical framework for subspace-based dimension reduction in the predictor space relative to the regression response (Cook 1998; Ma and Zhu 2013); SDR approaches have also been used as heuristics for dimension reduction in deterministic functions arising in computer experiments (Cook 1994; Li et al. 2016; Glaws et al. 2018). There are several methods for SDR including ordinary least squares (OLS) (Li and Duan 1989), principal Hessian directions (pHd) (Li 1992), among others (Cook and Ni 2005; Li and Wang 2007; Cook and Forzani 2009). In this paper, we consider two well-known methods for SDR—sliced inverse regression (SIR) (Li 1991) and sliced average variance estimation (SAVE) (Cook and Weisberg 1991). Although SIR and SAVE are two of the earlier methods developed for SDR, they remain popular today due to their effectiveness and ease of implementation. Moreover, many alternative methods for SDR are variations of SIR and SAVE in some way.

SIR and SAVE each approximate a specific matrix of expectations by slicing the response space based on a given data set of predictor/response pairs. If we consider applying SIR and SAVE for dimension reduction in deterministic functions, these expectations become Lebesgue integrals over the function’s range (i.e., the space of simulation outputs). Assuming the output distributions are sufficiently smooth, the integrals can be written as Riemann integrals against the push-forward density induced by the function and the distribution on the input space. Then the slicing from SIR and SAVE can be interpreted as a Riemann sum approximation of these integrals (Davis and Rabinowitz 1984). The approximation accuracy of SIR and SAVE depends on the number of terms in the Riemann sum—i.e., the number of slices. The Riemann sum approximation acts as an accuracy bottleneck for these algorithms; obtaining additional data or using a better design on the input space cannot maximally improve the approximation. We introduce new algorithms—Lanczos–Stieltjes inverse regression (LSIR) and Lanczos–Stieltjes average variance estimation (LSAVE)—that improve the accuracy and convergence rates compared to slicing by employing high-order

Gauss–Christoffel quadrature to approximate the integrals with respect to the output. The new algorithms eliminate the bottleneck due to Riemann sums and place the burden of accuracy on the input space design. Thus, LSIR and LSAVE enable the use of higher-order numerical integration rules on the input space such as sparse grids or tensor product quadrature when such approaches are appropriate.

This work contains two main contributions: First, by characterizing the matrices of expectations from SIR and SAVE as integrals in the context of dimension reduction for deterministic functions, we interpret the slice-based methods as Riemann sum approximations of these integrals. We recognize that this approach produces an accuracy bottleneck on the approximation of the matrices of expectations underlying SIR and SAVE. Second, we develop high-order quadrature methods for these integrals that remove this accuracy bottleneck imposed by the Riemann sums on the output space. When the input space dimension is sufficiently small and the function of interest is sufficiently smooth so that high-order numerical integration is justified on the input space, our new LSIR and LSAVE methods produce exponentially converging estimates of the matrices of expectations used for subspace-based dimension reduction—compared to the maximal first-order algebraic convergence rate resulting from the slicing. When low-order, dimension-independent numerical integration methods (e.g., simple Monte Carlo) are more appropriate due to a high input space dimension, the LSIR and LSAVE methods perform as well as the best slice-based approaches (e.g., adaptive partitioning of the output space).

The remainder of this paper is structured as follows: In Sect. 2, we review SDR theory and the SIR and SAVE algorithms. We interpret these methods in the context of computer experiments in Sect. 3. In Sect. 4, we review key elements of classical numerical analysis—specifically Gauss–Christoffel quadrature and orthogonal polynomials—that we employ in the new Lanczos–Stieltjes algorithms. We introduce and discuss the new algorithms in Sect. 5. Finally, in Sect. 6, we numerically study various aspects of the newly proposed algorithms on several test problems and compare the results to those from the traditional SIR and SAVE algorithms.

Throughout this paper we use the following notational conventions unless otherwise noted. Bold uppercase letters represent matrices, while bold lowercase letters are vectors. The elements of matrices and vectors are denoted using parentheses and subscript indices. For example, (A)_{i,j} refers to the (i, j)th element of the matrix A and (v)_i refers to the ith element of the vector v. Also, we use zero-based indexing throughout this paper. We do this because matrix and vector elements will often be associated with polynomials of increasing degree, starting with degree zero (i.e., constant) polynomials. We make use of a variety of probability measures, which we denote by π with the appropriate subscripts to indicate their domain.


2 Inverse regression methods

Sufficient dimension reduction (SDR) enables dimension reduction in regression problems. In this section, we provide a brief overview of SDR and two methods for its implementation. More detailed discussions of SDR can be found in Cook (1998) and Ma and Zhu (2013).

The regression problem considers a given set of predictor/response pairs, denoted by
\[
\{[\, \mathbf{x}_i^\top,\; y_i \,]\}, \quad i = 0, \dots, N-1, \tag{1}
\]
where x_i ∈ R^m is the vector-valued predictor and y_i ∈ R is the scalar-valued response. These pairs are assumed to be drawn according to an unknown joint distribution π_{x,y}. The goal of regression is to approximate statistical properties (e.g., the CDF, expectation, variance) of the conditional random variable y|x using the predictor/response pairs. However, such characterization of y|x becomes difficult when the dimension m of the predictor space becomes large.

SDR seeks out a low-dimensional linear transform of the predictors under which the statistical behavior of y|x is unchanged. Consider A ∈ R^{m×n} with n < m. If
\[
y \perp\!\!\!\perp \mathbf{x} \mid \mathbf{A}^\top \mathbf{x}, \tag{2}
\]
then y|x ∼ y|A^⊤x (Cook 1998, Ch. 6). Equation (2)—referred to as conditional independence—states that if the reduced predictors A^⊤x are known, then the response and the original predictors are independent. Under conditional independence, y|x and y|A^⊤x have the same distribution. Furthermore, any A that satisfies (2) is a basis for sufficient dimension reduction. Methods for SDR attempt to compute A for a given regression problem using the predictor/response pairs {[x_i^⊤, y_i]}, i = 0, ..., N − 1.

We consider two methods for SDR—sliced inverse regression (SIR) (Li 1991) and sliced average variance estimation (SAVE) (Cook and Weisberg 1991). These methods fall into a class of techniques known as inverse regression methods, named as such because they search for dimension reduction in y|x by studying the inverse regression x|y. The inverse regression is an m-dimensional conditional random variable parameterized by the scalar-valued response. This replaces the single m-dimensional problem with m one-dimensional problems. A detailed discussion on inverse regression can be found in Adragni and Cook (2009). We focus on the SIR and SAVE methods.

Before describing the SIR and SAVE algorithms, we introduce the assumption of standardized predictors,
\[
\mathbb{E}[\mathbf{x}] = \mathbf{0} \quad\text{and}\quad \mathrm{Cov}[\mathbf{x}] = \mathbf{I}. \tag{3}
\]
This assumption simplifies discussion and does not restrict the SDR problem (Cook 1998, Ch. 11). We assume (3) holds for the remainder of this section.

The SIR and SAVE algorithms approximate population matrices constructed from statistical characteristics of x|y. The matrix underlying the SIR algorithm is
\[
\mathbf{C}_{\mathrm{IR}} = \mathrm{Cov}\bigl[\mathbb{E}[\mathbf{x} \mid y]\bigr]. \tag{4}
\]
We denote this matrix by C_IR because it does not yet include slicing (i.e., the “S” in SIR). We introduce the slicing later in this section. The matrix C_IR is the covariance of the inverse regression function E[x|y]. This function draws out a curve through m-dimensional space that is parameterized by the scalar-valued response y. Under certain conditions (Li 1991), the column space of C_IR can be used to approximate A from (2). This matrix is known to perform well for regressions exhibiting strong linear trends, but can struggle on problems where the response is symmetric about the origin.

SAVE is based on the population matrix
\[
\mathbf{C}_{\mathrm{AVE}} = \mathbb{E}\bigl[(\mathbf{I} - \mathrm{Cov}[\mathbf{x} \mid y])^2\bigr]. \tag{5}
\]
Similar to (4), we drop the “S” in the notation as this matrix does not include slicing. We see the motivation for this matrix by recognizing that, under (3),
\[
\mathbb{E}\bigl[\mathrm{Cov}[\mathbf{x} \mid y]\bigr] = \mathrm{Cov}[\mathbf{x}] - \mathrm{Cov}\bigl[\mathbb{E}[\mathbf{x} \mid y]\bigr] = \mathbf{I} - \mathrm{Cov}\bigl[\mathbb{E}[\mathbf{x} \mid y]\bigr], \tag{6}
\]
which implies that Cov[E[x|y]] = E[I − Cov[x|y]]. The square of I − Cov[x|y] is used to ensure that the eigenvalues of the resulting matrix are nonnegative (Cook and Weisberg 1991). By using higher moments of the inverse regression, this method overcomes the issues with symmetry in SIR. However, in practice accurately estimating these higher moments can be difficult, especially when the number of predictor/response pairs is small.

The regression problem begins with predictor/response pairs {[x_i^⊤, y_i]}, i = 0, ..., N − 1. We want to use these data to approximate C_IR and C_AVE. This is difficult due to the conditional expectations in these matrices. The SIR and SAVE algorithms use a slicing approach to enable numerical approximation of (4) and (5). Let J_r for r = 0, ..., R − 1 denote R intervals that partition the range of response values into slices. Define the sliced mapping of the response
\[
h(y) = r \quad\text{where } y \in J_r. \tag{7}
\]
The SIR and SAVE algorithms consider dimension reduction relative to the conditional random variable h(y)|x instead of


y|x. That is, these methods approximate the matrices
\[
\mathbf{C}_{\mathrm{SIR}} = \mathrm{Cov}\bigl[\mathbb{E}[\mathbf{x} \mid h(y)]\bigr], \qquad
\mathbf{C}_{\mathrm{SAVE}} = \mathbb{E}\bigl[(\mathbf{I} - \mathrm{Cov}[\mathbf{x} \mid h(y)])^2\bigr], \tag{8}
\]
respectively, where the notation C_SIR and C_SAVE indicates that they are defined with respect to the sliced response h(y).

Algorithm 1 provides an outline for estimating C_SIR by computing a sample estimate Ĉ_SIR from the given predictor/response pairs; Algorithm 2 shows the same for SAVE. In practice, the eigendecomposition of the sample estimates Ĉ_SIR and Ĉ_SAVE leads to an approximation of the basis A from (2). We focus not on the eigenspaces, but on how we can interpret the outputs from the SIR and SAVE algorithms as approximations of C_IR and C_AVE from (4) and (5), respectively.

Algorithm 1 Sliced inverse regression (SIR)

Given: Predictor/response pairs {[x_i^⊤, y_i]}, i = 0, ..., N − 1 drawn according to π_{x,y}.
Assumptions: The predictors are standardized such that (3) is satisfied.

1. Define a sliced partition of the observed response space, J_r for r = 0, ..., R − 1. Let I_r ⊂ {0, ..., N − 1} be the indices i for which y_i ∈ J_r and N_r = |I_r|.
2. For r = 0, ..., R − 1, compute the in-slice sample expectation
\[
\hat{\boldsymbol{\mu}}(r) = \frac{1}{N_r} \sum_{i \in I_r} \mathbf{x}_i. \tag{9}
\]
3. Compute the sample matrix
\[
\hat{\mathbf{C}}_{\mathrm{SIR}} = \frac{1}{N} \sum_{r=0}^{R-1} N_r\, \hat{\boldsymbol{\mu}}(r)\, \hat{\boldsymbol{\mu}}(r)^\top. \tag{10}
\]

Output: The matrix Ĉ_SIR.

The sliced mapping h enables the computation of the sample estimates
\[
\hat{\boldsymbol{\mu}}(r) \approx \mathbb{E}[\mathbf{x} \mid h(y)] \quad\text{and}\quad \hat{\boldsymbol{\Sigma}}(r) \approx \mathrm{Cov}[\mathbf{x} \mid h(y)] \tag{14}
\]
by binning the response data. The approximations in (14) depend on the number of predictor/response pairs N. The sample estimates of C_SIR and C_SAVE have been shown to be N^{−1/2} consistent (Li 1991; Cook 2000). Accuracy of C_SIR and C_SAVE in approximating C_IR and C_AVE, respectively, depends on the slicing applied to the response space. Defining the slices J_r, r = 0, ..., R − 1, such that they each contain approximately equal numbers of samples can improve the approximation by balancing accuracy in the sample estimates (14) across all slices (Li 1991). By increasing the number R of slices, we increase accuracy due to slicing, but may reduce accuracy of (14) by decreasing the number of samples

Algorithm 2 Sliced average variance estimation (SAVE)

Given: Predictor/response pairs {[x_i^⊤, y_i]}, i = 0, ..., N − 1 drawn according to π_{x,y}.
Assumptions: The predictors are standardized such that (3) is satisfied.

1. Define a sliced partition of the observed response space, J_r for r = 0, ..., R − 1. Let I_r ⊂ {0, ..., N − 1} be the indices i for which y_i ∈ J_r and N_r = |I_r|.
2. For r = 0, ..., R − 1,
   (i) Compute the in-slice sample expectation
\[
\hat{\boldsymbol{\mu}}(r) = \frac{1}{N_r} \sum_{i \in I_r} \mathbf{x}_i. \tag{11}
\]
   (ii) Compute the in-slice sample covariance
\[
\hat{\boldsymbol{\Sigma}}(r) = \frac{1}{N_r - 1} \sum_{i \in I_r} \bigl(\mathbf{x}_i - \hat{\boldsymbol{\mu}}(r)\bigr) \bigl(\mathbf{x}_i - \hat{\boldsymbol{\mu}}(r)\bigr)^\top. \tag{12}
\]
3. Compute the sample matrix
\[
\hat{\mathbf{C}}_{\mathrm{SAVE}} = \frac{1}{N} \sum_{r=0}^{R-1} N_r \bigl(\mathbf{I} - \hat{\boldsymbol{\Sigma}}(r)\bigr)^2. \tag{13}
\]

Output: The matrix Ĉ_SAVE.

within each slice (Cook 1998, Ch. 6). We discuss the idea of convergence in terms of the number of slices in more detail in Sect. 3. Generally, the slicing approach in Algorithms 1 and 2 is justified by properties of SDR that relate dimension reduction of h(y)|x to dimension reduction of y|x (Cook 1998, Ch. 6).
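For concreteness, the following sketch implements the in-slice sample estimates of Algorithms 1 and 2 in Python/NumPy. It is a minimal illustration under the standardized-predictor assumption (3), not the authors' reference implementation; the function name, the quantile-based slicing, and the variable names are our own choices.

```python
import numpy as np

def sliced_estimates(X, y, R=10):
    """Sample estimates of C_SIR (Algorithm 1) and C_SAVE (Algorithm 2).

    X : (N, m) array of standardized predictors, see (3).
    y : (N,) array of responses.
    R : number of slices; quantile edges give roughly N/R samples per slice.
    """
    N, m = X.shape
    edges = np.quantile(y, np.linspace(0.0, 1.0, R + 1))
    edges[-1] += 1e-12                  # include the largest response in the last slice
    C_sir = np.zeros((m, m))
    C_save = np.zeros((m, m))
    for r in range(R):
        in_slice = (y >= edges[r]) & (y < edges[r + 1])
        Nr = int(in_slice.sum())
        if Nr == 0:                     # empty slice (e.g., tied responses); skip it
            continue
        Xr = X[in_slice]
        mu_r = Xr.mean(axis=0)                            # in-slice mean, Eq. (9)/(11)
        C_sir += Nr * np.outer(mu_r, mu_r) / N            # Eq. (10)
        Sigma_r = np.cov(Xr, rowvar=False) if Nr > 1 else np.zeros((m, m))  # Eq. (12)
        D = np.eye(m) - Sigma_r
        C_save += Nr * (D @ D) / N                        # Eq. (13)
    return C_sir, C_save
```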

3 SIR and SAVE for deterministic functions

In this section, we consider the application of SIR and SAVE to deterministic functions, such as those arising in computer experiments. Computer experiments employ models of physical phenomena, which we represent as deterministic functions that map m simulation inputs to a scalar-valued output (Sacks et al. 1989; Koehler and Owen 1996; Santner et al. 2003). We write this as
\[
y = f(\mathbf{x}), \quad \mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^m, \quad y \in \mathcal{F} \subseteq \mathbb{R}, \tag{15}
\]
where f : X → F is measurable and the random variables x, y are the function inputs and output, respectively. Let π_x denote the probability measure induced by x over X. Unlike the joint distribution π_{x,y}, this measure is assumed to be known. Furthermore, we may choose the values of x at which to evaluate (15) according to π_x. Define the probability measure induced by y over F by π_y. This measure is fully determined by π_x and f, but its form is assumed to be unknown. We can estimate π_y using point evaluations of f.


Remark 1 In the context of deterministic functions, SDR may be viewed as a method for ridge recovery (Glaws et al. 2018). That is, it uses point queries of f(x) to search for A ∈ R^{m×n} with n < m assuming that
\[
y = f(\mathbf{x}) = g(\mathbf{A}^\top \mathbf{x}) \tag{16}
\]
for some g : R^n → R. When (16) holds for some g and A, f(x) is called a ridge function (Pinkus 2015).

In Sect. 5, we introduce new algorithms for estimating C_IR and C_AVE. First, we must provide context for applying these methods to deterministic functions. We start by introducing an assumption on the input space X.

Assumption 1 The inputs have finite fourth moments and are standardized such that
\[
\int \mathbf{x} \, d\pi_{\mathbf{x}}(\mathbf{x}) = \mathbf{0} \quad\text{and}\quad \int \mathbf{x}\,\mathbf{x}^\top \, d\pi_{\mathbf{x}}(\mathbf{x}) = \mathbf{I}. \tag{17}
\]
Assumption 1 is similar to the assumption of standardized predictors in (3). However, it also assumes finite fourth moments of the predictors. This assumption is important for the new algorithms introduced in Sect. 5. We assume Assumption 1 holds throughout.

Recall that SIR and SAVE are based on the inverse regression x|y. For deterministic functions, this is the inverse image of the function f,
\[
f^{-1}(y) = \{\, \mathbf{x} \in \mathcal{X} : f(\mathbf{x}) = y \,\}, \tag{18}
\]
and we define the conditional measure π_{x|y} as the restriction of π_x to f^{-1} through disintegration (Chang and Pollard 1997). We denote this measure by π_{x|y} to emphasize its connection to the inverse regression x|y. Since the inverse regression methods for deterministic functions operate on the inverse image f^{-1}(y), these methods do not require that the function f be uniquely invertible.

The matrices C_IR and C_AVE contain the conditional expectation and the conditional covariance of the inverse regression (see (4) and (5), respectively). These statistical quantities can be expressed as integrals with respect to the conditional measure π_{x|y},
\[
\boldsymbol{\mu}(y) = \int \mathbf{x} \, d\pi_{\mathbf{x}|y}(\mathbf{x}), \qquad
\boldsymbol{\Sigma}(y) = \int (\mathbf{x} - \boldsymbol{\mu}(y)) (\mathbf{x} - \boldsymbol{\mu}(y))^\top \, d\pi_{\mathbf{x}|y}(\mathbf{x}). \tag{19}
\]
We denote the conditional expectation and covariance by μ(y) and Σ(y), respectively, to emphasize that these quantities are functions of the output. Using (19), we can express the C_IR and C_AVE matrices as integrals over the output space F,
\[
\mathbf{C}_{\mathrm{IR}} = \int \boldsymbol{\mu}(y)\, \boldsymbol{\mu}(y)^\top \, d\pi_y(y), \qquad
\mathbf{C}_{\mathrm{AVE}} = \int (\mathbf{I} - \boldsymbol{\Sigma}(y))^2 \, d\pi_y(y). \tag{20}
\]

The integrals in (19) are difficult to approximate due to the complex structure of π_{x|y}. Algorithms 1 and 2 slice the range of output values to enable this approximation. Let h(y) denote the slicing function from (7) and define the probability mass function
\[
\omega(r) = \int_{J_r} d\pi_y(y), \quad r = 0, \dots, R-1, \tag{21}
\]
where J_r is the rth interval over the range of outputs. We can then express the slice-based matrices C_SIR and C_SAVE as sums over the R slices,
\[
\mathbf{C}_{\mathrm{SIR}} = \sum_{r=0}^{R-1} \omega(r)\, \boldsymbol{\mu}(r)\, \boldsymbol{\mu}(r)^\top, \qquad
\mathbf{C}_{\mathrm{SAVE}} = \sum_{r=0}^{R-1} \omega(r)\, (\mathbf{I} - \boldsymbol{\Sigma}(r))^2, \tag{22}
\]
where μ(r) and Σ(r) are as in (19), but applied to the sliced output h(y) = r.

This provides a proper interpretation for the application of Algorithms 1 and 2 to (15). We can construct a set of input/output pairs {[x_i^⊤, y_i]}, i = 0, ..., N − 1, by randomly sampling the input space X according to π_x and evaluating y_i = f(x_i). SIR and SAVE then produce numerical approximations Ĉ_SIR and Ĉ_SAVE to C_SIR and C_SAVE, respectively.

This exercise of expressing the various elements of SIR and SAVE as integrals also emphasizes the two levels of approximation occurring in these algorithms: (i) approximation in terms of the number of samples N and (ii) approximation in terms of the number of slices R. That is,
\[
\hat{\mathbf{C}}_{\mathrm{SIR}} \approx \mathbf{C}_{\mathrm{SIR}} \approx \mathbf{C}_{\mathrm{IR}}, \qquad
\hat{\mathbf{C}}_{\mathrm{SAVE}} \approx \mathbf{C}_{\mathrm{SAVE}} \approx \mathbf{C}_{\mathrm{AVE}}. \tag{23}
\]
As mentioned in Sect. 2, approximation due to sampling (i.e., the leftmost approximations in (23)) has been shown to be N^{−1/2} consistent (Li 1991; Cook 2000). From the integral perspective, the slicing approach can be interpreted as a Riemann sum approximation of the integrals in (20). Riemann sum approximations estimate integrals by a finite sum. These approximations converge like R^{−1} for continuous functions


on compact domains, where R denotes the number of terms in the Riemann sum (Davis and Rabinowitz 1984, Ch. 2).

In the next section, we introduce several numerical tools and link them to the various elements of the SIR and SAVE algorithms discussed so far. Ultimately, we use these tools, including orthogonal polynomials and numerical quadrature, in the proposed algorithms to enable approximation of C_IR and C_AVE without using Riemann sums.

4 Orthogonal polynomials and Gauss–Christoffel quadrature

Orthogonal polynomials and numerical quadrature are fundamental to numerical analysis and have been studied extensively; detailed discussions are available in Liesen and Strakoš (2013), Gautschi (2004), Meurant (2006), and Golub and Meurant (2010). Our discussion is based on these references, reviewing key concepts necessary to develop the new algorithms for approximating the matrices in (20). We begin with the Stieltjes procedure for constructing orthonormal polynomials with respect to a given measure. We then relate this procedure to Gauss–Christoffel quadrature, polynomial expansions, and the Lanczos algorithm.

4.1 The Stieltjes procedure

The Stieltjes procedure recursively constructs a sequence of polynomials that are orthonormal with respect to a given measure (Liesen and Strakoš 2013, Ch. 3). Let π denote a given probability measure with sample space R, and let φ, ψ : R → R be two scalar-valued functions that are square-integrable with respect to π. The continuous inner product relative to π is
\[
(\phi, \psi)_\pi = \int \phi(y)\, \psi(y) \, d\pi(y), \tag{24}
\]
and the induced norm is ||φ||_π = √((φ, φ)_π). A sequence of polynomials {φ_0, φ_1, φ_2, ...} is orthonormal with respect to π if
\[
(\phi_i, \phi_j)_\pi = \delta_{i,j}, \quad i, j = 0, 1, 2, \dots, \tag{25}
\]
where δ_{i,j} is the Kronecker delta. Algorithm 3 contains a method for constructing a sequence of orthonormal polynomials {φ_0, φ_1, φ_2, ...} relative to π. This algorithm is known as the Stieltjes procedure and was first introduced in Stieltjes (1884).

The last step of Algorithm 3 defines the three-term recurrence relationship for orthonormal polynomials,
\[
\beta_{i+1}\, \phi_{i+1}(y) = (y - \alpha_i)\, \phi_i(y) - \beta_i\, \phi_{i-1}(y), \tag{26}
\]

Algorithm 3 Stieltjes procedure (Gautschi 2004, Section 2.2.3.1)

Given: The probability measure π.
Assumptions: Let φ_{−1}(y) = 0 and φ_0(y) = 1.

1. For i = 0, 1, 2, ...
   (i) β_i = ||φ_i(y)||_π
   (ii) φ_i(y) = φ_i(y) / β_i
   (iii) α_i = (y φ_i(y), φ_i(y))_π
   (iv) φ_{i+1}(y) = (y − α_i) φ_i(y) − β_i φ_{i−1}(y)

Output: The orthonormal polynomials {φ_0, φ_1, φ_2, ...} and recurrence coefficients α_i, β_i for i = 0, 1, 2, ....

for i = 0, 1, 2, .... Any sequence of polynomials that satisfies (26) is orthonormal with respect to the given measure. If we consider the first k terms, then we can rearrange it to obtain
\[
y\, \phi_i(y) = \beta_i\, \phi_{i-1}(y) + \alpha_i\, \phi_i(y) + \beta_{i+1}\, \phi_{i+1}(y), \tag{27}
\]
for i = 0, 1, 2, ..., k − 1. Let
\[
\boldsymbol{\phi}(y) = [\, \phi_0(y),\; \phi_1(y),\; \dots,\; \phi_{k-1}(y) \,]^\top. \tag{28}
\]

We can then write (27) in matrix form as
\[
y\, \boldsymbol{\phi}(y) = \mathbf{J}\, \boldsymbol{\phi}(y) + \beta_k\, \phi_k(y)\, \mathbf{e}_k, \tag{29}
\]
where e_k ∈ R^k is the vector of zeros with a one in the kth entry and J ∈ R^{k×k} is
\[
\mathbf{J} =
\begin{bmatrix}
\alpha_0 & \beta_1 & & & \\
\beta_1 & \alpha_1 & \beta_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & \beta_{k-2} & \alpha_{k-2} & \beta_{k-1} \\
 & & & \beta_{k-1} & \alpha_{k-1}
\end{bmatrix}, \tag{30}
\]
where α_i, β_i are the recurrence coefficients from Algorithm 3. This matrix—known as the Jacobi matrix—is symmetric and tridiagonal. Let the eigendecomposition of J be
\[
\mathbf{J} = \mathbf{Q}\, \boldsymbol{\Lambda}\, \mathbf{Q}^\top. \tag{31}
\]
From (29), the eigenvalues of J, denoted by λ_i, i = 0, ..., k − 1, are the zeros of the degree-k polynomial φ_k(y). Furthermore, the eigenvector associated with λ_i is φ(λ_i). We assume the eigenvectors of J are normalized such that Q is an orthogonal matrix with entries
\[
(\mathbf{Q})_{i,j} = \frac{\phi_i(\lambda_j)}{\|\boldsymbol{\phi}(\lambda_j)\|_2}, \quad i, j = 0, \dots, k-1, \tag{32}
\]
where || · ||_2 is the vector 2-norm.


We end this section with a brief note about the Fourier expansion of functions in terms of orthonormal polynomials. If a given function g(y) is square-integrable with respect to π, then g admits a mean-squared convergent Fourier series in terms of the orthonormal polynomials,
\[
g(y) = \sum_{i=0}^{\infty} g_i\, \phi_i(y), \tag{33}
\]
for y in the support of π and where equality is in the L^2(π) sense. By orthogonality of the polynomials, the Fourier coefficients g_i are
\[
g_i = (g, \phi_i)_\pi. \tag{34}
\]
This polynomial approximation plays an important role in the algorithms introduced in Sect. 5.

4.2 Gauss–Christoffel quadrature

We next discuss the Gauss–Christoffel quadrature for numerical integration and show its connection to orthonormal polynomials (Liesen and Strakoš 2013, Ch. 3). Given a measure π and an integrable function g(y), a k-point quadrature rule approximates the integral of g with respect to π by a weighted sum of g evaluated at k input values,
\[
\int g(y) \, d\pi(y) = \sum_{i=0}^{k-1} \omega_i\, g(\lambda_i) + r_k. \tag{35}
\]
The λ_i’s are the quadrature nodes, and the ω_i’s are the associated quadrature weights. The k-point quadrature approximation error is contained in the residual term r_k. We can minimize |r_k| by choosing the quadrature nodes and weights appropriately. The nodes and weights of the Gauss–Christoffel quadrature maximize the polynomial degree of exactness, which refers to the highest degree polynomial that the quadrature rule exactly integrates (i.e., r_k = 0). The k-point Gauss–Christoffel quadrature rule has polynomial degree of exactness 2k − 1. Furthermore, the Gauss–Christoffel quadrature has been shown to converge exponentially at a rate O(ρ^{−k}) for integrals defined on a compact domain when the integrand is analytic (Trefethen 2013, Ch. 19). The base ρ > 1 relates to the size of the function’s domain of analytical continuability. For functions with p − 1 continuous derivatives on a compact domain, the Gauss–Christoffel quadrature converges like O(k^{−(2p+1)}).

The Gauss–Christoffel quadrature nodes and weights depend on the given measure π. They can be obtained through the eigendecomposition of J (Golub and Welsch 1969). Recall from (30) that J is the matrix of recurrence coefficients resulting from k steps of the Stieltjes procedure. The eigenvalues of J are the zeros of the degree-k orthonormal polynomial φ_k(y). These zeros are the nodes of the k-point Gauss–Christoffel quadrature rule with respect to π (i.e., φ_k(λ_i) = 0 for i = 0, ..., k − 1). The associated weights are the squares of the first entry of each normalized eigenvector,
\[
\omega_i = (\mathbf{Q})_{0,i}^2 = \frac{1}{\|\boldsymbol{\phi}(\lambda_i)\|_2^2}, \quad i = 0, \dots, k-1. \tag{36}
\]
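This eigendecomposition route (the Golub–Welsch construction) is easy to sketch in code. The following minimal Python/NumPy example, with names of our own choosing, builds the Jacobi matrix (30) from given recurrence coefficients and returns the quadrature nodes and weights of (35)–(36).

```python
import numpy as np

def gauss_rule_from_recurrence(alpha, beta):
    """k-point Gauss-Christoffel rule from recurrence coefficients.

    alpha : (k,) diagonal coefficients alpha_0, ..., alpha_{k-1} of the Jacobi matrix (30).
    beta  : (k-1,) off-diagonal coefficients beta_1, ..., beta_{k-1}.
    """
    J = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)   # Jacobi matrix, Eq. (30)
    lam, Q = np.linalg.eigh(J)                                   # J = Q Lambda Q^T, Eq. (31)
    nodes = lam                      # zeros of phi_k are the quadrature nodes
    weights = Q[0, :] ** 2           # squared first eigenvector entries, Eq. (36)
    return nodes, weights
```

For a probability measure, the weights returned this way sum to one because the rows of the orthogonal matrix Q have unit norm.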

The Stieltjes procedure employs the inner product from (24) to define the recurrence coefficients α_i, β_i. In the next section, we consider the Lanczos algorithm and explore conditions under which it may be considered a discrete analog to the Stieltjes procedure in Sect. 4.1. Before we make this connection, we define the discrete inner product. Consider an N-point numerical integration rule with respect to π with nodes λ_i and weights ω_i for i = 0, ..., N − 1. For example, these could be the Gauss–Christoffel quadrature nodes and weights, or we could use Monte Carlo as a randomized numerical integration method, where the λ_i’s are drawn randomly according to π and ω_i = 1/N for all i. The discrete inner product is the numerical approximation of the continuous inner product,
\[
(\phi, \psi)_{\pi^{(N)}} = \sum_{i=0}^{N-1} \omega_i\, \phi(\lambda_i)\, \psi(\lambda_i) \;\approx\; (\phi, \psi)_\pi, \tag{37}
\]
provided that the weights ω_i are all positive. The “N” in π^{(N)} denotes the number of points in the numerical integration rule used. The discrete norm is ||φ||_{π^{(N)}} = √((φ, φ)_{π^{(N)}}).

The weights ω_i are guaranteed to be positive for Gauss–Christoffel quadrature and Monte Carlo, regardless of dimension. This is not the case for other numerical integration methods such as sparse grids. However, sparse grid integration can still be used to define a discrete inner product as long as care is taken to combine function evaluations appropriately; see Constantine et al. (2012) for details. For simplicity, we only consider discrete inner products defined by either Gauss–Christoffel quadrature or Monte Carlo in this paper.

The pseudospectral expansion approximates the Fourier expansion from (33) by truncating the series after k terms and approximating the Fourier coefficients in (34) using the discrete inner product (Constantine et al. 2012). We write this series for a given square-integrable function g(y) as
\[
g(y) \approx \hat{g}(y) = \sum_{i=0}^{k-1} \hat{g}_i\, \phi_i(y), \tag{38}
\]
where the pseudospectral coefficients are


\[
\hat{g}_i = (g, \phi_i)_{\pi^{(N)}}. \tag{39}
\]
Note that the approximation of g(y) by ĝ(y) depends on two factors: (i) the approximation accuracy of the coefficients included in the pseudospectral expansion, ĝ_i, i = 0, ..., k − 1, and (ii) the magnitude of the coefficients in the omitted terms, g_i, i = k, k + 1, .... We can improve (i) by using a higher-order integration rule or by increasing N. We improve (ii) by including more terms in the truncated series—that is, increasing k. The pseudospectral approximation and this two-level convergence play an important role in the new algorithms proposed in Sect. 5.

4.3 The Lanczos algorithm

The Lanczos algorithm was originally introduced as an iterative scheme for approximating eigenvalues and eigenvectors of linear differential operators (Lanczos 1950). Given a symmetric N × N matrix A, it constructs a symmetric, tridiagonal k × k matrix T whose eigenvalues approximate those of A. Additionally, it produces an N × k matrix, denoted by V = [v_0, v_1, ..., v_{k−1}], where the v_i’s are the Lanczos vectors. These vectors transform the eigenvectors of T into approximate eigenvectors of A. Algorithm 4 contains the steps of the Lanczos algorithm. Note that the inner products and norms in Algorithm 4 are given by
\[
(\mathbf{w}, \mathbf{u}) = \mathbf{w}^\top \mathbf{u}, \qquad \|\mathbf{w}\| = \sqrt{(\mathbf{w}, \mathbf{w})}, \tag{40}
\]
for vectors w, u ∈ R^N. These relate to the discrete inner products introduced in Sect. 4.2, and we make precise connections later in this section.

After k iterations, the Lanczos algorithm yields the relationship
\[
\mathbf{A}\mathbf{V} = \mathbf{V}\mathbf{T} + \beta_k\, \mathbf{v}_k\, \mathbf{e}_k^\top, \tag{43}
\]
where e_k ∈ R^k is the vector of zeros with a one in the kth entry and V and T are as in (41) and (42), respectively. The matrix T, similar to J in (30), is the Jacobi matrix (Gautschi 2004). The relationship between the matrices J and T has been studied extensively (Gautschi 2002; Forsythe 1957; Boor and Golub 1978).

We are interested in the use of the Lanczos algorithm as a discrete approximation to the Stieltjes procedure. Recall that Algorithm 3 (the Stieltjes procedure) assumes a given measure π. Let λ_i, ω_i for i = 0, ..., N − 1 be the nodes and weights for some N-point numerical integration rule with respect to π. This integration rule defines a discrete approximation of π, which we denote by π^{(N)} (see Fig. 1). Inner products with respect to π^{(N)} take the form of the discrete inner product in (37). If we perform the Lanczos algorithm with

Algorithm 4 Lanczos algorithm (Gautschi 2004, Section 3.1.7.1)

Given: An N × N symmetric matrix A.
Assumptions: Let v_{−1} = 0 ∈ R^N and v_0 be an arbitrary nonzero vector of length N.

1. For i = 0, 1, ..., k − 1,
   (i) β_i = ||v_i||
   (ii) v_i = v_i / β_i
   (iii) α_i = (A v_i, v_i)
   (iv) v_{i+1} = (A − α_i I) v_i − β_i v_{i−1}
2. Define
\[
\mathbf{V} = \begin{bmatrix} \mathbf{v}_0 & \mathbf{v}_1 & \cdots & \mathbf{v}_{k-1} \end{bmatrix} \tag{41}
\]
and
\[
\mathbf{T} =
\begin{bmatrix}
\alpha_0 & \beta_1 & & & \\
\beta_1 & \alpha_1 & \beta_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & \beta_{k-2} & \alpha_{k-2} & \beta_{k-1} \\
 & & & \beta_{k-1} & \alpha_{k-1}
\end{bmatrix}. \tag{42}
\]

Output: The matrix of Lanczos vectors V and the matrix of recurrence coefficients T.

\[
\mathbf{A} = \begin{bmatrix} \lambda_0 & & \\ & \ddots & \\ & & \lambda_{N-1} \end{bmatrix}, \qquad
\mathbf{v}_0 = \begin{bmatrix} \sqrt{\omega_0} \\ \vdots \\ \sqrt{\omega_{N-1}} \end{bmatrix}, \tag{44}
\]

then the result is equivalent to running the Stieltjes procedure using the discrete inner product with respect to π^{(N)}. Increasing N improves the approximation of π^{(N)} to π. It can be shown that the recurrence coefficients in the resulting Jacobi matrix will converge to the recurrence coefficients related to the Stieltjes procedure with respect to π as N increases (Gautschi 2004, Section 2.2). In the next section, we show how this relationship between Stieltjes and Lanczos can be used to approximate composite functions, which is essential to understand the underpinnings of LSIR and LSAVE in Sect. 5.
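To make the discrete Stieltjes interpretation concrete, here is a minimal Python/NumPy sketch of Algorithm 4 applied to the diagonal matrix and starting vector in (44). The routine and its names are ours; it omits reorthogonalization, which may be needed in floating point for larger k.

```python
import numpy as np

def lanczos_tridiag(lam, w, k):
    """k Lanczos iterations on A = diag(lam) with starting vector sqrt(w); see (44).

    Returns the Lanczos vectors V (N x k) and the Jacobi matrix T (k x k)
    holding the recurrence coefficients alpha_i (diagonal) and beta_i (off-diagonal).
    """
    lam = np.asarray(lam, dtype=float)
    N = len(lam)
    V = np.zeros((N, k))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    v_prev = np.zeros(N)
    v = np.sqrt(np.asarray(w, dtype=float))     # unnormalized starting vector
    for i in range(k):
        beta[i] = np.linalg.norm(v)             # step (i)
        v = v / beta[i]                         # step (ii)
        V[:, i] = v
        Av = lam * v                            # A v is elementwise since A is diagonal
        alpha[i] = v @ Av                       # step (iii)
        v, v_prev = Av - alpha[i] * v - beta[i] * v_prev, v   # step (iv)
    T = np.diag(alpha) + np.diag(beta[1:], 1) + np.diag(beta[1:], -1)
    return V, T
```

Feeding T to the gauss_rule_from_recurrence sketch from Sect. 4.2 then yields the Gauss–Christoffel nodes and weights with respect to the discrete measure π^{(N)}.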

4.4 Composite function approximation

The connection between Algorithms 3 and 4 can be exploited for polynomial approximation of composite functions (Constantine and Phipps 2012). Consider a function of the form
\[
h(x) = g(f(x)), \quad x \in \mathcal{X} \subseteq \mathbb{R}, \tag{45}
\]
where
\[
f : \mathcal{X} \to \mathcal{F} \subseteq \mathbb{R}, \qquad g : \mathcal{F} \to \mathcal{G} \subseteq \mathbb{R}. \tag{46}
\]


[Fig. 1 Distribution functions associated with the probability measure π and the discrete approximation π^{(N)}. Panel (a) constructs π^{(N)} using a Monte Carlo integration rule with respect to π, while panel (b) uses Gauss–Christoffel quadrature to construct π^{(N)}.]

Assume the input space X is weighted with a given probability measure π_x. This measure and f induce a measure on F that we denote by π_y. Note that the methodology described in this section can be extended to multivariate inputs (i.e., X ⊆ R^m) through tensor product constructions and to multivariate outputs (i.e., G ⊆ R^n) by considering each output individually. We need both of these extensions in Sect. 5; however, we consider the scalar case here for clarity.

The goal is to construct a pseudospectral expansion of g using orthonormal polynomials with respect to π_y and a Gauss–Christoffel quadrature rule defined over F. In Sect. 4.1, we examined the Stieltjes procedure, which constructs a sequence of orthonormal polynomials with respect to a measure—π_y in this context. In Sect. 4.2, we showed this algorithm also produces the nodes and weights of the Gauss–Christoffel quadrature rule with respect to π_y. In Sect. 4.3, we saw how the Lanczos algorithm can be used to produce similar results for a discrete approximation to the measure π_y. All of this suggests a methodology for constructing a pseudospectral approximation of g. However, we constructed the discrete approximation of π_y in Sect. 4.3 using a numerical integration rule. We cannot do this here since π_y is unknown. We can construct an N-point numerical integration rule on X since π_x is known. Let x_i, ν_i for i = 0, ..., N − 1 denote the nodes and weights for our integration rule of choice with respect to π_x. This rule defines a discrete approximation of π_x, which we write as π_x^{(N)}. We approximate π_y by the discrete measure π_y^{(N)} by evaluating f_i = f(x_i) for i = 0, ..., N − 1. We then perform the Lanczos algorithm on
\[
\mathbf{A} = \begin{bmatrix} f_0 & & \\ & \ddots & \\ & & f_{N-1} \end{bmatrix}, \qquad
\mathbf{v}_0 = \begin{bmatrix} \sqrt{\nu_0} \\ \sqrt{\nu_1} \\ \vdots \\ \sqrt{\nu_{N-1}} \end{bmatrix} \tag{47}
\]
to obtain the system AV = VT + β_k v_k e_k^⊤.

Let the eigendecomposition of the resulting Jacobi matrix be
\[
\mathbf{T} = \mathbf{Q}\, \boldsymbol{\Lambda}\, \mathbf{Q}^\top. \tag{48}
\]
The eigenvalues of T define the k-point Gauss–Christoffel quadrature nodes relative to π_y^{(N)}. We denote these quadrature nodes by λ_i^{(N)}, i = 0, ..., k − 1, where the superscript indicates that these nodes are relative to the discrete measure π_y^{(N)}. From (32), the normalized eigenvectors of T have the form
\[
(\mathbf{Q})_i = \frac{\boldsymbol{\phi}^{(N)}(\lambda_i^{(N)})}{\|\boldsymbol{\phi}^{(N)}(\lambda_i^{(N)})\|_2}, \quad i = 0, \dots, k-1, \tag{49}
\]
where φ^{(N)}(y) = [φ_0^{(N)}(y), φ_1^{(N)}(y), ..., φ_{k−1}^{(N)}(y)]^⊤ and φ_i^{(N)}(y) denotes the ith orthonormal polynomial relative to π_y^{(N)}. The quadrature weight associated with λ_i^{(N)} is given by the square of the first element of the ith normalized eigenvector,
\[
\omega_i^{(N)} = (\mathbf{Q})_{0,i}^2, \quad i = 0, \dots, k-1. \tag{50}
\]

From Sect. 4.3 we have convergence of the quadrature nodes λ_i^{(N)} and weights ω_i^{(N)} to λ_i and ω_i, respectively, as N increases. In this sense, we consider these quantities to be approximations of the quadrature nodes and weights relative to π_y,
\[
\lambda_i^{(N)} \approx \lambda_i \quad\text{and}\quad \omega_i^{(N)} \approx \omega_i. \tag{51}
\]

In Sect. 6, we numerically study this approximation.

An alternative perspective on this k-point Gauss–Christoffel quadrature rule is that of a second discrete measure π_y^{(N,k)} that approximates the measure π_y^{(N)}. That is,
\[
\pi_y^{(N,k)} \approx \pi_y^{(N)} \approx \pi_y. \tag{52}
\]


By taking more Lanczos iterations, we improve the leftmost approximation, and for k = N, we have π_y^{(N,N)} = π_y^{(N)} (in exact arithmetic, assuming that f_0, ..., f_{N−1} from (47) are distinct). By increasing N (the number of points in the numerical integration rule with respect to π_x), we improve the rightmost approximation. This may be viewed as convergence of the discrete Lanczos algorithm to the continuous Stieltjes procedure. This mirrors the two-level approximation from (23). Both contain approximation over X by a chosen integration rule and approximation over F by an integration rule resulting from the different algorithms. The key difference is in the quality of those integration rules over F—Gauss–Christoffel quadrature in (52) versus Riemann sums in (23).

Constantine and Phipps (2012) show that the Lanczos vectors resulting from performing the Lanczos algorithm on (47) also contain useful information. The Lanczos vectors are of the form
\[
(\mathbf{V})_{i,j} \approx \sqrt{\nu_i}\, \phi_j(f_i), \quad i = 0, \dots, N-1, \quad j = 0, \dots, k-1, \tag{53}
\]

where f_i = f(x_i), ν_i is the weight associated with the node x_i, and φ_j is the jth-degree orthonormal polynomial with respect to π_y. The approximation in (53) is due to the approximation of π_y by π_y^{(N)} and is in the same vein as the approximation in (51). We also numerically study this approximation in Sect. 6.

The approximation method in Constantine and Phipps (2012) suggests evaluating g at the k ≪ N quadrature nodes obtained from the Jacobi matrix and using these evaluations to construct a pseudospectral approximation. This allows for accurate estimation of g while placing a majority of the computational cost on evaluating f instead of both f and g in (45). Such an approach is valuable when g is difficult to compute relative to f. In the next section, we explain how this methodology can be used to construct approximations to C_IR and C_AVE from (20).
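A compact sketch of this composite-approximation strategy, assuming the lanczos_tridiag routine sketched in Sect. 4.3 and with function names of our own choosing: Lanczos is run on the f evaluations, g is evaluated only at the k ≪ N Gauss–Christoffel nodes of the discrete output measure, and the pseudospectral coefficients (39) follow from the eigendecomposition (48)–(50).

```python
import numpy as np

def composite_pseudospectral(f_vals, nu, g, k):
    """Pseudospectral coefficients of g with respect to pi_y^(N); see (38)-(39).

    f_vals : (N,) evaluations f_i = f(x_i) at the input integration nodes.
    nu     : (N,) positive integration weights with respect to pi_x.
    g      : callable outer function of the composition h(x) = g(f(x)).
    k      : number of Lanczos iterations / retained polynomial terms.
    """
    V, T = lanczos_tridiag(f_vals, nu, k)       # discrete Stieltjes via Lanczos, Eq. (47)
    lam, Q = np.linalg.eigh(T)                   # Eq. (48)
    weights = Q[0, :] ** 2                       # Eq. (50)
    g_at_nodes = np.array([g(t) for t in lam])   # g evaluated at only k << N nodes
    # phi_i(lambda_j) recovered from the eigenvectors, Eq. (49): Q_{i,j} / sqrt(w_j)
    phi_at_nodes = Q / np.sqrt(weights)
    g_hat = phi_at_nodes @ (weights * g_at_nodes)  # g_hat_i = sum_j w_j g(lam_j) phi_i(lam_j)
    return g_hat, lam, weights
```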

5 Lanczos–Stieltjes methods for inverse regression

In this section, we use the tools reviewed in Sect. 4 to develop a new Lanczos–Stieltjes approach to inverse regression methods—specifically to approximate the matrices C_IR and C_AVE. This approach avoids approximating C_IR and C_AVE by Riemann sums (or slicing) as discussed in Sect. 3. Instead, we use orthonormal polynomials and quadrature approximations to build more accurate estimates of these matrices. For notation reference, we restate the problem setup from (15),
\[
y = f(\mathbf{x}), \quad \mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^m, \quad y \in \mathcal{F} \subseteq \mathbb{R}, \tag{54}
\]
with X weighted by the known probability measure π_x and F weighted by the unknown probability measure π_y.

5.1 Lanczos–Stieltjes inverse regression (LSIR)

In Sect. 3, we showed that the SIR algorithm approximates the matrix
\[
\mathbf{C}_{\mathrm{IR}} = \int \boldsymbol{\mu}(y)\, \boldsymbol{\mu}(y)^\top \, d\pi_y(y) \tag{55}
\]
using a sliced mapping of the output (i.e., a Riemann sum approximation). We wish to approximate C_IR without such slicing; however, the structure of μ(y) makes this difficult. Recall that the conditional expectation is the average of the inverse image of f(x) for a fixed value of y,
\[
\boldsymbol{\mu}(y) = \int \mathbf{x} \, d\pi_{\mathbf{x}|y}(\mathbf{x}). \tag{56}
\]
Approximating μ(y) requires knowledge of the conditional measure π_{x|y}, which may not be available if f is complex. However, using the tools from Sect. 4, we can exploit composite structure in μ(y).

The conditional expectation (56) is a function that maps values of y to values in R^m. Furthermore, y is itself a function of x; see (54). Thus, the conditional expectation has composite structure, μ(f(x)), where (using the notation from (46))
\[
f : (\mathcal{X} \subseteq \mathbb{R}^m) \to (\mathcal{F} \subseteq \mathbb{R}), \qquad
\boldsymbol{\mu} : (\mathcal{F} \subseteq \mathbb{R}) \to (\mathcal{G} \subseteq \mathbb{R}^m). \tag{57}
\]
Using the techniques from Sect. 4.4, we seek to construct a pseudospectral expansion of μ using the orthogonal polynomials and Gauss–Christoffel quadrature with respect to π_y.

Recall that Assumption 1 ensures the inputs have finite fourth moments and are standardized according to (17). By Jensen’s inequality,
\[
\boldsymbol{\mu}(y)^\top \boldsymbol{\mu}(y) \leq \int \mathbf{x}^\top \mathbf{x} \, d\pi_{\mathbf{x}|y}(\mathbf{x}). \tag{58}
\]
Integrating both sides of (58) with respect to π_y,
\[
\int \boldsymbol{\mu}(y)^\top \boldsymbol{\mu}(y) \, d\pi_y(y)
\leq \int\!\!\int \mathbf{x}^\top \mathbf{x} \, d\pi_{\mathbf{x}|y}(\mathbf{x}) \, d\pi_y(y)
= \int \mathbf{x}^\top \mathbf{x} \, d\pi_{\mathbf{x}}(\mathbf{x}). \tag{59}
\]
Under Assumption 1, the right-hand side of (59) is finite. This guarantees that each component of μ(y) is square-integrable


with respect to π_y, ensuring that μ(y) has a Fourier expansion in orthonormal polynomials with respect to π_y,
\[
\boldsymbol{\mu}(y) = \sum_{i=0}^{\infty} \boldsymbol{\mu}_i\, \phi_i(y) \tag{60}
\]
with equality in the L^2(π_y) sense. The Fourier coefficients in (60) are
\[
\boldsymbol{\mu}_i = \int \boldsymbol{\mu}(y)\, \phi_i(y) \, d\pi_y(y). \tag{61}
\]

Plugging (60) into (55),
\[
\begin{aligned}
\mathbf{C}_{\mathrm{IR}} &= \int \boldsymbol{\mu}(y)\, \boldsymbol{\mu}(y)^\top \, d\pi_y(y) \\
&= \int \left[ \sum_{i=0}^{\infty} \boldsymbol{\mu}_i\, \phi_i(y) \right] \left[ \sum_{j=0}^{\infty} \boldsymbol{\mu}_j\, \phi_j(y) \right]^\top d\pi_y(y) \\
&= \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \boldsymbol{\mu}_i\, \boldsymbol{\mu}_j^\top \left[ \int \phi_i(y)\, \phi_j(y) \, d\pi_y(y) \right] \\
&= \sum_{i=0}^{\infty} \boldsymbol{\mu}_i\, \boldsymbol{\mu}_i^\top.
\end{aligned} \tag{62}
\]
Thus, the C_IR matrix can be computed as the sum of the outer products of Fourier coefficients from (61).

We cannot compute the Fourier coefficients directly since they require knowledge of μ(y). However, we can rewrite (61) as
\[
\begin{aligned}
\boldsymbol{\mu}_i &= \int \boldsymbol{\mu}(y)\, \phi_i(y) \, d\pi_y(y) \\
&= \int \left[ \int \mathbf{x} \, d\pi_{\mathbf{x}|y}(\mathbf{x}) \right] \phi_i(y) \, d\pi_y(y) \\
&= \int\!\!\int \mathbf{x}\, \phi_i(y) \, d\pi_{\mathbf{x}|y}(\mathbf{x}) \, d\pi_y(y) \\
&= \int \mathbf{x}\, \phi_i(f(\mathbf{x})) \, d\pi_{\mathbf{x}}(\mathbf{x}).
\end{aligned} \tag{63}
\]
This form of the Fourier coefficients is more amenable to numerical approximation. Since π_x is known, we can obtain an N-point integration rule with nodes x_j and weights ν_j for j = 0, ..., N − 1. Recall from Sect. 4.2 that we require all of the quadrature weights ν_j to be positive so that they define a valid inner product and norm. We then define the pseudospectral coefficients with respect to the N-point multivariate integration rule,
\[
\hat{\boldsymbol{\mu}}_i = \sum_{j=0}^{N-1} \nu_j\, \mathbf{x}_j\, \phi_i(f(\mathbf{x}_j)). \tag{64}
\]
To evaluate φ_i at f(x_j), we use the Lanczos vectors in V. Recall from (53) that these vectors approximate the orthonormal polynomials from Stieltjes at f evaluated at the nodes, scaled by the square root of the associated weights. Algorithm 5 formalizes this process as the Lanczos–Stieltjes inverse regression (LSIR) algorithm.

Algorithm 5 Lanczos–Stieltjes inverse regression (LSIR)

Given: The function f : R^m → R and input probability measure π_x.
Assumptions: Assumption 1 holds.

1. Obtain the nodes and weights for a valid N-point integration rule with respect to π_x,
\[
\mathbf{x}_i,\; \nu_i, \quad \text{for } i = 0, \dots, N-1. \tag{65}
\]
2. Evaluate f_i = f(x_i) for i = 0, ..., N − 1.
3. Perform k iterations of Algorithm 4 on
\[
\mathbf{A} = \begin{bmatrix} f_0 & & \\ & \ddots & \\ & & f_{N-1} \end{bmatrix}, \qquad
\mathbf{v}_0 = \begin{bmatrix} \sqrt{\nu_0} \\ \sqrt{\nu_1} \\ \vdots \\ \sqrt{\nu_{N-1}} \end{bmatrix} \tag{66}
\]
to obtain AV = VT + η_k v_k e_k^⊤.
4. For i = 0, ..., m − 1, ℓ = 0, ..., k − 1, compute the ith component of the ℓth pseudospectral coefficient
\[
(\hat{\boldsymbol{\mu}}_\ell)_i = \sum_{p=0}^{N-1} \sqrt{\nu_p}\, (\mathbf{x}_p)_i\, (\mathbf{V})_{p,\ell}. \tag{67}
\]
5. For i, j = 0, ..., m − 1, compute the (i, j)th component of Ĉ_IR
\[
(\hat{\mathbf{C}}_{\mathrm{IR}})_{i,j} = \sum_{\ell=0}^{k-1} (\hat{\boldsymbol{\mu}}_\ell)_i\, (\hat{\boldsymbol{\mu}}_\ell)_j. \tag{68}
\]

Output: The matrix Ĉ_IR.
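The following Python/NumPy sketch condenses Algorithm 5 into a few matrix operations, reusing the lanczos_tridiag routine sketched in Sect. 4.3. It is our own illustrative implementation, not the authors' released MATLAB code, and the names are our own.

```python
import numpy as np

def lsir(X, nu, f_vals, k):
    """Lanczos-Stieltjes inverse regression, Algorithm 5.

    X      : (N, m) integration nodes on the input space (standardized per Assumption 1).
    nu     : (N,) positive integration weights with respect to pi_x.
    f_vals : (N,) output evaluations f_i = f(x_i).
    k      : number of Lanczos iterations.
    Returns the m x m estimate of C_IR.
    """
    V, _ = lanczos_tridiag(f_vals, nu, k)        # step 3: (V)_{p,l} ~ sqrt(nu_p) phi_l(f_p)
    # Step 4, Eq. (67): column l of M holds the pseudospectral coefficient mu_hat_l.
    M = (np.sqrt(nu)[:, None] * X).T @ V          # shape (m, k)
    # Step 5, Eq. (68): C_IR_hat = sum_l mu_hat_l mu_hat_l^T.
    return M @ M.T
```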

Algorithm 5 depends on two levels of approximation: (i) approximation due to the numerical integration rule on X and (ii) approximation due to truncating the polynomial expansion of μ(y). The former depends on the number N of points in the numerical integration rule over X, while the latter depends on the number k of Lanczos iterations. Performing more Lanczos iterations includes more terms in the approximation of C_IR; however, the additional terms correspond to integrals against higher-degree polynomials in (61). For fixed N, the quality of the pseudospectral approximation deteriorates as the polynomial degree increases. Therefore, sufficiently many points are needed to ensure quality estimates of the orthonormal polynomials of high degree. Quantifying this statement precisely is outside the scope of the current paper; the numerical experiments in Sect. 6 show these phenomena.

To compare the computational cost of the LSIR algorithm to its slice-based counterpart, we consider the approximate


costs of the Lanczos method and the slicing procedure. Due to its origins as an iterative procedure for approximating eigenvalues, the computational costs of the Lanczos algorithm have been well studied. For the diagonal matrix A ∈ R^{N×N}, performing k iterations of the Lanczos algorithm requires O(kN) operations (Golub and Van Loan 1996, Ch. 9). The costs associated with the SIR algorithm arise from sorting the outputs in order to define the slices. Sorting algorithms are known to take an average of O(N log(N)) operations (Ajtai et al. 1983). However, in practice the most significant cost is typically the cost of the evaluations f_i = f(x_i)—a necessary step for both the Lanczos–Stieltjes and the slice-based approaches.

5.2 Lanczos–Stieltjes average variance estimation (LSAVE)

In this section, we apply the Lanczos–Stieltjes approach from Sect. 5.1 to the C_AVE matrix to construct an alternative algorithm to the slice-based SAVE. Recall from (20) that
\[
\mathbf{C}_{\mathrm{AVE}} = \int (\mathbf{I} - \boldsymbol{\Sigma}(y))^2 \, d\pi_y(y), \tag{69}
\]
where the conditional covariance of the inverse image of f(x) for a fixed value of y is
\[
\boldsymbol{\Sigma}(y) = \int (\mathbf{x} - \boldsymbol{\mu}(y)) (\mathbf{x} - \boldsymbol{\mu}(y))^\top \, d\pi_{\mathbf{x}|y}(\mathbf{x}). \tag{70}
\]

Similar to the conditional expectation, Σ(y) has composite structure due to the relationship y = f(x), such that we can write Σ(f(x)), where
\[
f : (\mathcal{X} \subseteq \mathbb{R}^m) \to (\mathcal{F} \subseteq \mathbb{R}), \qquad
\boldsymbol{\Sigma} : (\mathcal{F} \subseteq \mathbb{R}) \to (\mathcal{G} \subseteq \mathbb{R}^{m \times m}). \tag{71}
\]
We want to build a pseudospectral expansion of Σ with respect to π_y similar to μ in Sect. 5.1.

Recall Assumption 1 as we consider the Frobenius norm of the conditional covariance. By Jensen’s inequality,
\[
\|\boldsymbol{\Sigma}(y)\|_F^2 \leq \int \bigl\| (\mathbf{x} - \boldsymbol{\mu}(y)) (\mathbf{x} - \boldsymbol{\mu}(y))^\top \bigr\|_F^2 \, d\pi_{\mathbf{x}|y}(\mathbf{x}). \tag{72}
\]
Integrating each side with respect to π_y,
\[
\begin{aligned}
\int \|\boldsymbol{\Sigma}(y)\|_F^2 \, d\pi_y(y)
&\leq \int\!\!\int \bigl\| (\mathbf{x} - \boldsymbol{\mu}(y)) (\mathbf{x} - \boldsymbol{\mu}(y))^\top \bigr\|_F^2 \, d\pi_{\mathbf{x}|y}(\mathbf{x}) \, d\pi_y(y) \\
&\leq \int \bigl\| (\mathbf{x} - \boldsymbol{\mu}(f(\mathbf{x}))) (\mathbf{x} - \boldsymbol{\mu}(f(\mathbf{x})))^\top \bigr\|_F^2 \, d\pi_{\mathbf{x}}(\mathbf{x}).
\end{aligned} \tag{73}
\]
Expanding the integrand on the right-hand side produces sums and products of fourth and lower conditional moments of the inputs. Assumption 1 guarantees that all of these conditional moments are finite. Thus,
\[
\int \|\boldsymbol{\Sigma}(y)\|_F^2 \, d\pi_y(y) < \infty, \tag{74}
\]

which implies that each component of Σ(y) is square-integrable with respect to π_y; therefore, it has a convergent Fourier expansion in terms of orthonormal polynomials with respect to π_y,
\[
\boldsymbol{\Sigma}(y) = \sum_{i=0}^{\infty} \boldsymbol{\Sigma}_i\, \phi_i(y), \tag{75}
\]
where equality is in the L^2(π_y) sense. The coefficients in (75) are
\[
\boldsymbol{\Sigma}_i = \int \boldsymbol{\Sigma}(y)\, \phi_i(y) \, d\pi_y(y). \tag{76}
\]

Plugging (75) into (69),
\[
\begin{aligned}
\mathbf{C}_{\mathrm{AVE}} &= \int (\mathbf{I} - \boldsymbol{\Sigma}(y))^2 \, d\pi_y(y) \\
&= \int \left( \mathbf{I} - \sum_{i=0}^{\infty} \boldsymbol{\Sigma}_i\, \phi_i(y) \right)^2 d\pi_y(y) \\
&= \int \mathbf{I} \, d\pi_y(y) - 2 \sum_{i=0}^{\infty} \boldsymbol{\Sigma}_i \int \phi_i(y) \, d\pi_y(y)
+ \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \boldsymbol{\Sigma}_i\, \boldsymbol{\Sigma}_j \int \phi_i(y)\, \phi_j(y) \, d\pi_y(y) \\
&= \mathbf{I} - 2 \boldsymbol{\Sigma}_0 + \sum_{i=0}^{\infty} \boldsymbol{\Sigma}_i^2.
\end{aligned} \tag{77}
\]

Therefore, we can compute C_AVE using the Fourier coefficients of Σ(y). To simplify the computation of Σ_i, we rewrite (76) as
\[
\begin{aligned}
\boldsymbol{\Sigma}_i &= \int \boldsymbol{\Sigma}(y)\, \phi_i(y) \, d\pi_y(y) \\
&= \int\!\!\int (\mathbf{x} - \boldsymbol{\mu}(y)) (\mathbf{x} - \boldsymbol{\mu}(y))^\top \, d\pi_{\mathbf{x}|y}(\mathbf{x})\; \phi_i(y) \, d\pi_y(y) \\
&= \int\!\!\int (\mathbf{x} - \boldsymbol{\mu}(y)) (\mathbf{x} - \boldsymbol{\mu}(y))^\top \phi_i(y) \, d\pi_{\mathbf{x}|y}(\mathbf{x}) \, d\pi_y(y) \\
&= \int (\mathbf{x} - \boldsymbol{\mu}(f(\mathbf{x}))) (\mathbf{x} - \boldsymbol{\mu}(f(\mathbf{x})))^\top \phi_i(f(\mathbf{x})) \, d\pi_{\mathbf{x}}(\mathbf{x}).
\end{aligned} \tag{78}
\]

We approximate this integral using the N-point numerical integration rule with respect to π_x to obtain


\[
\hat{\boldsymbol{\Sigma}}_i = \sum_{j=0}^{N-1} \nu_j \bigl(\mathbf{x}_j - \boldsymbol{\mu}(f(\mathbf{x}_j))\bigr) \bigl(\mathbf{x}_j - \boldsymbol{\mu}(f(\mathbf{x}_j))\bigr)^\top \phi_i(f(\mathbf{x}_j)). \tag{79}
\]

We again approximate φ_i(f(x_j)) using the Lanczos vectors, similar to LSIR; see (53). Notice that (79) also depends on μ(f(x_j)) for j = 0, ..., N − 1. To obtain these values, we compute the pseudospectral coefficients of μ(f(x)) from (64) and construct its pseudospectral expansion at each x_j. Algorithm 6 provides an outline for Lanczos–Stieltjes average variance estimation (LSAVE).

Algorithm 6 contains the same two-level approximation as the LSIR algorithm. As such, it also requires a sufficiently high-order integration rule to accurately approximate the high-degree polynomials resulting from k Lanczos iterations. In the next section, we provide numerical studies of the LSIR and LSAVE algorithms on several test problems as well as comparisons to the traditional SIR and SAVE algorithms.

6 Numerical results

In this section, we numerically study the LSIR and LSAVE algorithms. In Sect. 6.1, we study the approximation of the Gauss–Christoffel quadrature and orthonormal polynomials on the output space by the Lanczos algorithm as described in Sect. 4.4. This study is performed on a multivariate quadratic function of dimension m = 3. In Sect. 6.2, we examine the approximation of C_IR and C_AVE by the Lanczos–Stieltjes estimates Ĉ_IR and Ĉ_AVE, and we compare these results to the slice-based estimates Ĉ_SIR and Ĉ_SAVE. This study is performed using a standard test problem in dimension reduction and uncertainty quantification: the OTL circuit function (Surjanovic and Bingham 2015). This is a model of the output from a transformerless push–pull circuit with m = 6 inputs. MATLAB code for performing these studies is available at https://bitbucket.org/aglaws/gauss-christoffel-quadrature-for-inverse-regression.

Recall that the goal of the proposed algorithms is to approximate the C_IR and C_AVE matrices more efficiently than the slice-based (Riemann sum) approach in SIR and SAVE. Therefore, most of the presented results are in terms of the Frobenius norm of the relative matrix errors. Recall that the Frobenius norm is

\| E \|_F = \left( \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} (E)_{i,j}^2 \right)^{1/2}, \qquad (86)

which we use to measure error in the matrix approximations.
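For reference, a one-line helper matching one common reading of a relative matrix error (the Frobenius norm of the difference scaled by the Frobenius norm of the reference matrix); the exact normalization used in the plots is an assumption here.

import numpy as np

def relative_frobenius_error(C_approx, C_ref):
    # ||C_approx - C_ref||_F / ||C_ref||_F, cf. (86)
    return np.linalg.norm(C_approx - C_ref, 'fro') / np.linalg.norm(C_ref, 'fro')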

Algorithm 6 Lanczos–Stieltjes average variance estimation (LSAVE)

Given: The function f : R^m → R and the input probability measure π_x.
Assumptions: Assumption 1 holds.

1. Obtain the nodes and weights of a valid N-point integration rule with respect to π_x,

   x_i, ν_i, for i = 0, ..., N − 1. \qquad (80)

2. Evaluate f_i = f(x_i) for i = 0, ..., N − 1.

3. Perform k iterations of Algorithm 4 on

   A = \mathrm{diag}(f_0, \ldots, f_{N-1}), \quad v_0 = \left[ \sqrt{\nu_0}, \sqrt{\nu_1}, \ldots, \sqrt{\nu_{N-1}} \right]^\top \qquad (81)

   to obtain A V = V T + \eta_k v_k e_k^\top.

4. For i = 0, ..., m − 1 and ℓ = 0, ..., k − 1, compute the i-th component of the ℓ-th pseudospectral coefficient of μ(y),

   (\hat{\mu}_\ell)_i = \sum_{p=0}^{N-1} \sqrt{\nu_p}\, (x_p)_i \, (V)_{p,\ell}. \qquad (82)

5. For i = 0, ..., m − 1, compute the i-th component of the pseudospectral expansion of μ(f(x_p)),

   \big( \mu(f(x_p)) \big)_i = \sum_{\ell=0}^{k-1} \frac{1}{\sqrt{\nu_p}}\, (\hat{\mu}_\ell)_i \, (V)_{p,\ell}. \qquad (83)

6. For i, j = 0, ..., m − 1 and ℓ = 0, ..., k − 1, compute the (i, j)-th component of the ℓ-th pseudospectral coefficient of Σ(y),

   (\hat{\Sigma}_\ell)_{i,j} = \sum_{p=0}^{N-1} \sqrt{\nu_p}\, \big( (x_p)_i - (\mu(f(x_p)))_i \big)\, \big( (x_p)_j - (\mu(f(x_p)))_j \big)\, (V)_{p,\ell}. \qquad (84)

7. For i, j = 0, ..., m − 1, compute the (i, j)-th component of Ĉ_AVE,

   (\hat{C}_{\mathrm{AVE}})_{i,j} = \delta_{i,j} - 2\,(\hat{\Sigma}_0)_{i,j} + \sum_{\ell=0}^{k-1} \sum_{p=0}^{m-1} (\hat{\Sigma}_\ell)_{i,p}\, (\hat{\Sigma}_\ell)_{p,j}. \qquad (85)

Output: The matrix Ĉ_AVE.
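The sketch below strings the steps of Algorithm 6 together in Python under simplifying assumptions: the integration rule is passed in as nodes xs and weights nu, and a plain symmetric Lanczos iteration with full reorthogonalization stands in for Algorithm 4 (not reproduced here). All function and variable names are illustrative, not the authors' implementation.

import numpy as np

def lanczos_diag(f_vals, v0, k):
    # k Lanczos iterations on A = diag(f_vals) starting from v0, with full reorthogonalization
    N = f_vals.size
    V = np.zeros((N, k))
    V[:, 0] = v0 / np.linalg.norm(v0)
    beta_prev = 0.0
    for j in range(k):
        w = f_vals * V[:, j]                         # A v_j for diagonal A
        if j > 0:
            w -= beta_prev * V[:, j - 1]
        w -= (V[:, j] @ w) * V[:, j]
        w -= V[:, : j + 1] @ (V[:, : j + 1].T @ w)   # reorthogonalize against previous vectors
        beta_prev = np.linalg.norm(w)
        if j + 1 < k:
            V[:, j + 1] = w / beta_prev
    return V

def lsave(f, xs, nu, k):
    # Steps 2-7 of Algorithm 6 for nodes xs (N, m) and weights nu (N,)
    N, m = xs.shape
    fx = np.array([f(x) for x in xs])                # step 2: f_i = f(x_i)
    V = lanczos_diag(fx, np.sqrt(nu), k)             # step 3
    phi = V / np.sqrt(nu)[:, None]                   # phi_ell(f(x_p)) = V[p, ell] / sqrt(nu_p)
    mu_coeff = (np.sqrt(nu)[:, None] * xs).T @ V     # (82): m x k coefficients of mu(y)
    mu_fx = phi @ mu_coeff.T                         # (83): mu(f(x_p)) at every node
    centered = xs - mu_fx
    C = np.eye(m)
    for ell in range(k):
        w = nu * phi[:, ell]
        Sigma_ell = (centered * w[:, None]).T @ centered   # (84)
        if ell == 0:
            C -= 2.0 * Sigma_ell                           # the -2 Sigma_0 term in (85)
        C += Sigma_ell @ Sigma_ell                         # the Sigma_ell^2 terms in (85)
    return C                                               # estimate of C_AVE

For simple Monte Carlo, xs holds independent draws from π_x and nu is the constant vector with entries 1/N.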

6.1 Example 1: Lanczos–Stieltjes convergence

To study numerically the convergence of the Lanczos–Stieltjes approach, we consider the function

y = f(x) = g^\top x + x^\top H x, \qquad x \in [-1, 1]^3 \subset \mathbb{R}^3, \qquad (87)

where g ∈ R^3 is a constant vector and H ∈ R^{3×3} is a constant matrix. We assume the inputs are weighted by the uniform


density over the input space X = [−1, 1]^3,

d\pi_x(x) = \begin{cases} \tfrac{1}{2^3}\, dx & \text{if } \|x\|_\infty \le 1, \\ 0 & \text{otherwise.} \end{cases} \qquad (88)
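A concrete instance of (87)-(88) in Python; the particular g and H below are illustrative choices (the paper does not report the values it used), and the Monte Carlo rule simply draws from the uniform density on [-1, 1]^3 with equal weights.

import numpy as np

rng = np.random.default_rng(0)
g = np.array([1.0, -0.5, 0.25])                 # illustrative constant vector
H = np.array([[0.6, 0.2, 0.0],
              [0.2, -0.3, 0.1],
              [0.0, 0.1, 0.8]])                 # illustrative constant matrix

def f_quadratic(x):
    # test function (87): y = g^T x + x^T H x
    return g @ x + x @ H @ x

N = 10_000
xs = rng.uniform(-1.0, 1.0, size=(N, 3))        # nodes drawn from the uniform density (88)
nu = np.full(N, 1.0 / N)                        # equal Monte Carlo weights
fx = np.array([f_quadratic(x) for x in xs])     # output samples

These xs, nu, and f_quadratic can be fed directly to the lsave sketch given after Algorithm 6.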

Recall from Sect. 4.4 that we can use the Lanczos algorithm to obtain a k-point Gauss–Christoffel quadrature rule and the first k orthonormal polynomials relative to the discrete measure π_y^{(N)}. We treat these as approximations to the quadrature rule and orthonormal polynomials relative to the continuous measure π_y [see (51) and (53)]. In this section, we study the behavior of these approximations for (87).

We use two different integration rules for this study: a tensor product Clenshaw–Curtis quadrature rule (Clenshaw and Curtis 1960) and simple Monte Carlo (Owen 2013). We do this to emphasize that the approximation accuracy of the Gauss–Christoffel quadrature rule and the orthonormal polynomials with respect to π_y depends on the quality of the integration rule chosen with respect to π_x.
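A sketch of a nested tensor product Clenshaw–Curtis rule of the kind used here, with weights normalized to integrate against the uniform density (88); clenshaw_curtis and tensor_rule are illustrative helper names, and per-dimension sizes n = 2^l + 1 are chosen so that successive rules are nested.

import numpy as np
from itertools import product

def clenshaw_curtis(n):
    # n-point Clenshaw-Curtis nodes/weights on [-1, 1], n >= 2
    N = n - 1
    j = np.arange(n)
    x = np.cos(np.pi * j / N)
    w = np.ones(n)
    for k in range(1, N // 2 + 1):
        b = 1.0 if 2 * k == N else 2.0
        w -= b * np.cos(2.0 * np.pi * k * j / N) / (4.0 * k**2 - 1.0)
    w *= 2.0 / N
    w[0] *= 0.5
    w[-1] *= 0.5
    return x, w

def tensor_rule(n, dim=3):
    # tensor product rule on [-1, 1]^dim, weights normalized for the uniform density (88)
    x1, w1 = clenshaw_curtis(n)
    nodes = np.array(list(product(x1, repeat=dim)))
    weights = np.array([np.prod(ws) for ws in product(w1, repeat=dim)]) / 2.0**dim
    return nodes, weights

xs, nu = tensor_rule(2**4 + 1)   # 17 points per dimension; doubling 2^l reuses all previous nodes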

Recall from (53) that the Lanczos vectors contain evaluations of the first k (corresponding to the number of Lanczos iterations performed) orthonormal polynomials with respect to π_y^{(N)} at the points f_i = f(x_i), weighted by the square roots of the associated weights, √ν_i. To compare the approximations for increasing numbers of samples, we must ensure that our integration rules are nested. That is, the x_i's of the N_j-point integration rule must be included in the N_{j+1}-point integration rule, where N_j and N_{j+1} denote subsequently increasing numbers of points. We use the Clenshaw–Curtis quadrature rule here because it is a nested quadrature rule. For Monte Carlo, we append new independent random samples to our current set to ensure the nested structure.
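For the Monte Carlo rules, nesting is enforced simply by appending new independent draws to the existing sample set, e.g. (variable names illustrative):

import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=(100, 3))                   # N_j samples
xs = np.vstack([xs, rng.uniform(-1.0, 1.0, size=(900, 3))])  # grow to N_{j+1}, keeping the old nodes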

First, we study convergence of the Gauss–Christoffel quadrature rule produced on the output space F by the Lanczos algorithm. Gautschi (2004) shows the convergence of the Jacobi matrix (42) to (30) as the discrete approximation π_y^{(N)} approaches π_y; however, we are specifically interested in the quadrature rule resulting from the eigendecomposition of the Jacobi matrix. Recall from (51) that λ_i^{(N)} and ω_i^{(N)} denote the i-th Gauss–Christoffel quadrature node and weight with respect to π_y^{(N)}. Figure 2 shows the differences in the 5-point quadrature rules with increasing samples over X,

\left| \lambda_i^{(N_{j+1})} - \lambda_i^{(N_j)} \right| \quad \text{and} \quad \left| \omega_i^{(N_{j+1})} - \omega_i^{(N_j)} \right|, \qquad (89)

for i = 0, ..., 4. Using Clenshaw–Curtis quadrature rules over the input space (Fig. 2a, b), the Gauss–Christoffel quadrature rule with respect to π_y converges exponentially. In Fig. 2c, d, we see the expected N^{-1/2} convergence in the Gauss–Christoffel quadrature rule when using Monte Carlo to approximate π_x. Quality estimates of the quadrature rule over the output space F are required to produce good approximations of C_IR and C_AVE using the Lanczos–Stieltjes approach.
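The node/weight extraction from the Jacobi matrix follows the Golub–Welsch construction: the eigenvalues give the nodes, and the squared first components of the eigenvectors give the weights (scaled by the total mass of π_y, which is 1 for a probability measure). The sketch assumes the recurrence coefficients alpha and beta produced by the Lanczos run are available; names are illustrative.

import numpy as np

def gauss_from_jacobi(alpha, beta):
    # alpha: (k,) diagonal, beta: (k-1,) off-diagonal of the Jacobi matrix
    J = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    nodes, Q = np.linalg.eigh(J)
    weights = Q[0, :]**2          # Golub-Welsch weights (total mass mu_0 = 1)
    return nodes, weights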

Next, we examine the convergence of the Lanczos vectors to the orthonormal polynomials with respect to π_y. Let V^{(N_j)} be the matrix of k Lanczos vectors resulting from performing Lanczos with an N_j-point integration rule on the input space X with respect to π_x. Define the matrix W_ν = diag([√ν_0 ... √ν_{N_j−1}]), and let V̄^{(N_j)} = W_ν^{−1} V^{(N_j)} be the matrix of orthonormal polynomials evaluated at the f_i's (no longer scaled by the √ν_i's). Due to the nestedness of the integration rules, V̄^{(N_j)} contains evaluations of the same k orthonormal polynomials at nested sets of f_i's as we increase the number of samples. Let P ∈ R^{N_j × N_{j+1}} be the matrix that removes the rows of V̄^{(N_{j+1})} that do not correspond to the rows in V̄^{(N_j)}. We write the maximum difference in the i-th polynomial for subsequent orders of the quadrature rule as

\left\| P \big( \bar{V}^{(N_{j+1})} \big)_i - \big( \bar{V}^{(N_j)} \big)_i \right\|_\infty, \qquad (90)

where (·)_i denotes the i-th column of the given matrix, for i = 0, ..., k − 1.
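Computationally, (90) amounts to selecting the rows of the finer polynomial matrix that correspond to the coarser node set and taking a column-wise maximum difference; Vbar_coarse, Vbar_fine, and idx below are illustrative names for the rescaled Lanczos-vector matrices and the row indices realizing the selection matrix P.

import numpy as np

def max_poly_differences(Vbar_coarse, Vbar_fine, idx):
    # idx[p] gives the row of Vbar_fine holding the same f_i as row p of Vbar_coarse
    return np.max(np.abs(Vbar_fine[idx, :] - Vbar_coarse), axis=0)   # one value per polynomial i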

Figure 3a, b contains plots of the maximum differences in (90) for increasing numbers of Clenshaw–Curtis points and Monte Carlo points, respectively. In both plots, higher-degree polynomials require more samples to produce accurate approximations. This is not surprising, but it does highlight an important relationship between the numerical integration rule used with respect to π_x and the number of Lanczos iterations performed. Namely, as the number of Lanczos iterations increases, more samples are required to ensure accurate approximation of the orthonormal polynomials (and, in turn, C_IR and C_AVE). Additionally, we see that the convergence rate of the orthonormal polynomials depends on the integration rule chosen over X.

6.2 Example 2: OTL circuit function

We consider the physically motivated OTL circuit function. This function models the midpoint voltage (V_m) of a transformerless push–pull circuit [see Surjanovic and Bingham (2015) and Ben-Ari and Steinberg (2007)]. It has the form

V_m = \frac{(V_{b1} + 0.74)\, \beta\, (R_{c2} + 9)}{\beta\, (R_{c2} + 9) + R_f} + \frac{11.35\, R_f}{\beta\, (R_{c2} + 9) + R_f} + \frac{0.74\, R_f\, \beta\, (R_{c2} + 9)}{\big( \beta\, (R_{c2} + 9) + R_f \big)\, R_{c1}}, \qquad (91)


[Fig. 2: Differences in the 5-point Gauss–Christoffel quadrature nodes λ_i and weights ω_i resulting from the Lanczos algorithm applied to (87), plotted as error versus number of samples for i = 0, ..., 4. Panels a, b use a tensor product Clenshaw–Curtis quadrature rule over X; panels c, d use Monte Carlo samples.]

[Fig. 3: Maximum differences in the approximated orthonormal polynomials as the number of quadrature points on X increases, plotted as error versus number of samples. Panel a uses a tensor product Clenshaw–Curtis quadrature rule over the input space; panel b uses Monte Carlo integration.]

where

V_{b1} = \frac{12\, R_{b2}}{R_{b1} + R_{b2}}. \qquad (92)

The six physical inputs to the OTL circuit model are described in Table 1. The table also contains the ranges of the inputs. We assume the density π_x on the input space, defined by the product of the ranges in Table 1, is a uniform density.
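A direct Python transcription of (91)-(92) together with uniform sampling over the ranges in Table 1; the sampler and variable names are illustrative.

import numpy as np

# Input ranges from Table 1: Rb1, Rb2, Rf, Rc1, Rc2, beta
LOWER = np.array([50.0, 25.0, 0.5, 1.2, 0.25, 50.0])
UPPER = np.array([150.0, 70.0, 3.0, 2.5, 1.2, 300.0])

def otl_circuit(x):
    # midpoint voltage Vm from (91)-(92); x = (Rb1, Rb2, Rf, Rc1, Rc2, beta)
    Rb1, Rb2, Rf, Rc1, Rc2, beta = x
    Vb1 = 12.0 * Rb2 / (Rb1 + Rb2)                 # (92)
    denom = beta * (Rc2 + 9.0) + Rf
    return ((Vb1 + 0.74) * beta * (Rc2 + 9.0) / denom
            + 11.35 * Rf / denom
            + 0.74 * Rf * beta * (Rc2 + 9.0) / (denom * Rc1))

rng = np.random.default_rng(1)
xs = rng.uniform(LOWER, UPPER, size=(10_000, 6))   # uniform density over the Table 1 ranges
vm = np.array([otl_circuit(x) for x in xs])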

For our first study, we use tensor product Gauss–Christoffel quadrature rules on the input space. The number of points in a tensor product rule grows exponentially with dimension, so they are not appropriate for more than a handful of inputs. The point of this study is to emphasize and demonstrate how the LSIR and LSAVE algorithms remove the approximation bottleneck caused by the Riemann sums in SIR and SAVE and place the burden of accuracy on the numerical integration rule over the input space. The results for this study are shown in Fig. 4.

Figure 4a demonstrates the convergence of the LSIR algorithm in terms of the number N of Gauss–Christoffel


Table 1  Six physical inputs for the OTL circuit function with their ranges. We assume a uniform density over these ranges.

Input   Description (units)        Range
Rb1     Resistance b1 (K-Ohms)     [50, 150]
Rb2     Resistance b2 (K-Ohms)     [25, 70]
Rf      Resistance f (K-Ohms)      [0.5, 3]
Rc1     Resistance c1 (K-Ohms)     [1.2, 2.5]
Rc2     Resistance c2 (K-Ohms)     [0.25, 1.2]
β       Current gain (Amperes)     [50, 300]

quadrature nodes on the input space, and Fig. 4b shows convergence in terms of the number k of Lanczos iterations performed. These plots show the Frobenius norm of the matrix differences between subsequent Lanczos–Stieltjes approximations of C_IR computed using increasing numbers of quadrature nodes (with k = 35 fixed) and Lanczos iterations (with N = 17^6 = 24,137,569 fixed), respectively. We see a decay in these differences as we increase N and k, suggesting that the LSIR algorithm is converging.

For Fig. 4c, we perform Algorithm 5 for N = 17^6 = 24,137,569 and k = 35 and treat the resulting matrix as the "true" value of C_IR. We then compute errors relative to this matrix for various values of N and k. Figure 4c shows the relative error decaying as we increase both the number of quadrature nodes and Lanczos iterations (i.e., as we move down and to the right). Consider this plot for increasing N with k fixed (i.e., moving downward at a fixed point along the horizontal axis). The error decays up to a point at which it remains constant. This decay corresponds to more accurate computation of the pseudospectral coefficients μ_i from (64) by taking more quadrature nodes. The leveling off corresponds to the point at which errors in the coefficients are smaller than errors due to truncating the pseudospectral expansion at k (the number of Lanczos iterations). Conversely, if we fix N and study the error as we increase k (i.e., fix a point along the vertical axis and move right), we see the error decay until a point at which it begins to grow again. This behavior agrees with the results from Sect. 6.1 that suggest that sufficiently many quadrature nodes are needed to accurately estimate the high-degree polynomials associated with large values of k. As we move right in Fig. 4c, we are approximating higher-degree polynomials using the Lanczos algorithm. Poor approximation of these polynomials results in an inaccurate estimate of C_IR.

Figure 4d–f shows the results of the same studies as above, but performed on the LSAVE algorithm (Algorithm 6). The results and interpretations are similar to those for the LSIR algorithm.

Next we examine how the approximated SIR and SAVE matrices from Algorithms 1 and 2, respectively, compare to the LSIR and LSAVE approximations. Recall that Ĉ_SIR and Ĉ_SAVE contain two levels of approximation: one due to the number of samples N and one due to the number of terms in the Riemann sum (i.e., the number of slices) R over the output space. We first focus on convergence in terms of Riemann sums. Figure 5 shows the comparison of the

[Fig. 4: Convergence studies for the LSIR algorithm (a, b, c) and the LSAVE algorithm (d, e, f) on the OTL circuit function from (91). Panels a, d plot error versus number of samples; panels b, e plot error versus number of Lanczos iterations; panels c, f plot relative error over the grid of Lanczos iterations and number of samples.]


[Fig. 5: A comparison of the Riemann sum approximation and the Lanczos–Stieltjes approximation of Ĉ_IR (a) and Ĉ_AVE (b) for an increasing number of Riemann sums (or slices) for (91), plotted as error versus number of slices. For the Lanczos–Stieltjes approximations, we used N = 17^6 = 24,137,569 Gauss–Christoffel quadrature nodes on X and k = 35 Lanczos iterations. For the slice-based approximations, we used N = 10^8 Monte Carlo samples.]

[Fig. 6: A comparison of the SIR/SAVE and LSIR/LSAVE algorithms for (91) using Monte Carlo integration on the input space, plotted as relative matrix error versus number of samples. Panels a and c show the errors of the SIR and SAVE algorithms, respectively, for various values of R (the number of slices: R = 5, 25, 50, 100, 500, 1000). Panels b and d show the errors of the LSIR and LSAVE algorithms, respectively, for various values of k (the number of Lanczos iterations: k = 5, 10, 15, 20, 25, 30); these panels also show the best-case results from their slice-based counterparts for reference.]

approximated SIR and SAVE matrices for increasing R to their Lanczos–Stieltjes counterparts. For the Lanczos–Stieltjes approximations, Ĉ_IR and Ĉ_AVE, we use N = 17^6 = 24,137,569 Gauss–Christoffel quadrature nodes and k = 35 Lanczos iterations, as this produces sufficiently converged matrices; see the previous numerical study. For the SIR and


SAVE algorithms, we use N = 10^8 samples randomly drawn according to π_x. Additionally, we define the slices adaptively based on the sampling to balance the number of samples in each slice, as discussed in Sect. 2. The sliced approximations converge to their Lanczos–Stieltjes counterparts at a rate of R^{-1}, as expected for Riemann sums (Davis and Rabinowitz 1984, Ch. 2).

Lastly, we compare the slice-based algorithms SIR and SAVE to their Lanczos–Stieltjes counterparts LSIR and LSAVE with Monte Carlo integration on the input space. This comparison is the most appropriate for practical models with several input parameters, where tensor product quadrature on the input space is infeasible.

We again use the Lanczos–Stieltjes algorithms with N = 17^6 = 24,137,569 quadrature nodes and k = 35 Lanczos iterations as the "true" values of C_IR and C_AVE for computation of the relative matrix errors. Figure 6a, b shows the comparison of the SIR and LSIR algorithms (Algorithms 1 and 5, respectively) for increasing numbers of Monte Carlo samples. Additionally, we perform this comparison for various values of R (the number of terms in the Riemann sum, or slices) and k (the number of Lanczos iterations) for each of the methods. We notice less variance in the LSIR plot as a function of k than in the SIR plot as a function of R. The Lanczos–Stieltjes approach refines the approximation over the output space such that the final approximation depends most strongly on the chosen integration rule over the input space. That is, the issue of choosing how to slice up the output space does not exist in the Lanczos–Stieltjes approach, as the method automatically chooses the best integration rule over F. Figure 6b also shows the best-case error from the SIR plot. This is the minimum error among all of the tested values of R for each value of N. The LSIR algorithm performs approximately as well as the best case of SIR, regardless of the value of k chosen. Figure 6c, d performs the same study as above, comparing the SAVE and LSAVE algorithms (Algorithms 2 and 6, respectively). The results and interpretations are similar to those for the SIR/LSIR study.

7 Conclusion

We propose alternative approaches to the sliced inverse regression (SIR) and sliced average variance estimation (SAVE) algorithms for approximating C_IR and C_AVE. The traditional methods approximate these matrices by applying a partitioning (i.e., slicing) to the range of output values. In the context of deterministic functions, this slice-based approach can be interpreted as a Riemann sum approximation of the integrals in (20). The proposed algorithms use orthonormal polynomials and Gauss–Christoffel quadrature to produce high-order approximations of C_IR and C_AVE. We call the new algorithms Lanczos–Stieltjes inverse regression (LSIR) and Lanczos–Stieltjes average variance estimation (LSAVE).

We use two numerical test problems to study convergence of the Lanczos–Stieltjes algorithms with respect to the algorithm parameters. We first examine the convergence of the approximate quadrature and orthonormal polynomial components resulting from the Lanczos method's discrete approximation of the Stieltjes procedure. This study highlights the interplay between the number of quadrature nodes on the input space and the number of Lanczos iterations. More Lanczos iterations correspond to higher-degree polynomials, which require more samples for the same accuracy. Poor approximations of these polynomials lead to poor approximations of C_IR and C_AVE. We then compare the Lanczos–Stieltjes approximations of C_IR and C_AVE to their slice-based counterparts. These numerical studies emphasize a key characteristic of the Lanczos–Stieltjes approaches. Due to the composite structure of C_IR and C_AVE, both the slicing approach and Lanczos–Stieltjes contain two levels of approximation: (i) numerical integration on the input space X and (ii) approximation on the output space F. There is a trade-off between (i) and (ii) in terms of which approximation is the dominant source of numerical error for various choices of N and R. The Lanczos–Stieltjes approach significantly reduces the errors due to approximation over F, placing the burden of accuracy on the approximation over X. This enables Gauss–Christoffel quadrature on X to produce high-order accuracy when such integration rules are appropriate, e.g., when the number of inputs is sufficiently small. When tensor product quadrature rules are infeasible, the Lanczos–Stieltjes approach allows Monte Carlo integration to perform as expected without significant dependence on the approximation on the output space F.

References

Adragni, K.P., Cook, R.D.: Sufficient dimension reduction and prediction in regression. Philos. Trans. R. Soc. Lond. A Math. Phys. Eng. Sci. 367(1906), 4385–4405 (2009)
Ajtai, M., Janos, K., Szemeredi, E.: An O(n log n) sorting network. In: Fifteenth Annual ACM Symposium on Theory of Computing, pp. 1–9 (1983)
Ben-Ari, E.N., Steinberg, D.M.: Modeling data from computer experiments: an empirical comparison of kriging with MARS and projection pursuit regression. Qual. Eng. 19(4), 327–338 (2007)
Chang, J.T., Pollard, D.: Conditioning as disintegration. Stat. Neerlandica 51(3), 287–317 (1997)
Clenshaw, C.W., Curtis, A.R.: A method for numerical integration on an automatic computer. Numer. Math. 2(1), 197–205 (1960)
Constantine, P.G.: Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies. SIAM, Philadelphia (2015)
Constantine, P.G., Phipps, E.T.: A Lanczos method for approximating composite functions. Appl. Math. Comput. 218(24), 11751–11762 (2012)


Constantine, P.G., Eldred, M.S., Phipps, E.T.: Sparse pseudospectral approximation method. Comput. Methods Appl. Mech. Eng. 229, 1–12 (2012)
Cook, R.D.: Using dimension-reduction subspaces to identify important inputs in models of physical systems. In: Proceedings of the Section on Physical and Engineering Sciences, pp. 18–25. American Statistical Association, Alexandria, VA (1994)
Cook, R.D.: Regression Graphics: Ideas for Studying Regression through Graphics. Wiley, New York (1998)
Cook, R.D.: SAVE: a method for dimension reduction and graphics in regression. Commun. Stat. Theory Methods 29(9–10), 2109–2121 (2000)
Cook, R.D., Forzani, L.: Likelihood-based sufficient dimension reduction. J. Am. Stat. Assoc. 104(485), 197–208 (2009)
Cook, R.D., Ni, L.: Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. J. Am. Stat. Assoc. 100(470), 410–428 (2005)
Cook, R.D., Weisberg, S.: Sliced inverse regression for dimension reduction: comment. J. Am. Stat. Assoc. 86(414), 328–332 (1991)
Davis, P.J., Rabinowitz, P.: Methods of Numerical Integration. Academic Press, San Diego (1984)
de Boor, C., Golub, G.H.: The numerically stable reconstruction of a Jacobi matrix from spectral data. Linear Algebra Appl. 21(3), 245–260 (1978)
Donoho, D.L.: High-dimensional data analysis: the curses and blessings of dimensionality. In: AMS Conference on Math Challenges of the 21st Century (2000)
Forsythe, G.E.: Generation and use of orthogonal polynomials for data-fitting with a digital computer. J. Soc. Ind. Appl. Math. 5(2), 74–88 (1957)
Gautschi, W.: The interplay between classical analysis and (numerical) linear algebra–a tribute to Gene H. Golub. Electron. Trans. Numer. Anal. 13, 119–147 (2002)
Gautschi, W.: Orthogonal Polynomials. Oxford Press, Oxford (2004)
Glaws, A., Constantine, P.G., Cook, R.D.: Inverse regression for ridge recovery I: Theory (2018). arXiv:1702.02227v2
Golub, G.H., Meurant, G.: Matrices, Moments, and Quadrature with Applications. Princeton University Press, Princeton (2010)
Golub, G.H., Van Loan, C.F.: Matrix Computations. JHU Press, Baltimore (1996)
Golub, G.H., Welsch, J.H.: Calculation of Gauss quadrature rules. Math. Comput. 23(106), 221–230 (1969)
Jolliffe, I.T.: Principal Component Analysis. Springer, New York (2002)
Jones, D.R.: A taxonomy of global optimization methods based on response surfaces. J. Glob. Optim. 21(4), 345–383 (2001)
Koehler, J.R., Owen, A.B.: Computer experiments. Handb. Stat. 13(9), 261–308 (1996)
Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Natl. Bureau Stand. 45(4), 255–282 (1950)
Li, B., Wang, S.: On directional regression for dimension reduction. J. Am. Stat. Assoc. 102(479), 997–1008 (2007)
Li, K.C.: Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86(414), 316–327 (1991)
Li, K.C.: On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J. Am. Stat. Assoc. 87(420), 1025–1039 (1992)
Li, K.C., Duan, N.: Regression analysis under link violation. Ann. Stat. 17(3), 1009–1052 (1989)
Li, W., Lin, G., Li, B.: Inverse regression-based uncertainty quantification algorithms for high-dimensional models: theory and practice. J. Comput. Phys. 321, 259–278 (2016)
Liesen, J., Strakoš, Z.: Krylov Subspace Methods: Principles and Analysis. Oxford Press, Oxford (2013)
Ma, Y., Zhu, L.: A review on dimension reduction. Int. Stat. Rev. 81(1), 134–150 (2013)
Meurant, G.: The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations. SIAM, Philadelphia (2006)
Myers, R.H., Montgomery, D.C.: Response Surface Methodology: Process and Product Optimization Using Designed Experiments. Wiley, New York (1995)
Owen, A.B.: Monte Carlo Theory, Methods and Examples (2013). http://statweb.stanford.edu/~owen/mc/
Pinkus, A.: Ridge Functions. Cambridge University Press, Cambridge (2015)
Sacks, J., Welch, W.J., Mitchell, T.J., Wynn, H.P.: Design and analysis of computer experiments. Stat. Sci. 4(4), 409–423 (1989)
Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M., Tarantola, S.: Global Sensitivity Analysis: The Primer. Wiley, England (2008)
Santner, T.J., Williams, B.J., Notz, W.I.: The Design and Analysis of Computer Experiments. Springer, New York (2003)
Stieltjes, T.J.: Quelques recherches sur la théorie des quadratures dites mécaniques. Annales scientifiques de l'École Normale Supérieure 1, 409–426 (1884)
Surjanovic, S., Bingham, D.: Virtual library of simulation experiments: test functions and datasets (2015). http://www.sfu.ca/~ssurjano
Traub, J.F., Werschulz, A.G.: Complexity and Information. Cambridge University Press, Cambridge (1998)
Trefethen, L.N.: Approximation Theory and Approximation Practice. SIAM, Philadelphia (2013)
