
Integrative Methods for Post-Selection Inference Under Convex Constraints

Snigdha Panigrahi†, Jonathan Taylor⋆, and Asaf Weinstein‡

†Department of Statistics, University of Michigan; ⋆Department of Statistics, Stanford University

    ‡School of Computer Science and Engineering, Hebrew University of Jerusalem

    Abstract

Inference after model selection has been an active research topic in the past few years, with numerous works offering different approaches to addressing the perils of the reuse of data. In particular, major progress has been made recently on large and useful classes of problems by harnessing general theory of hypothesis testing in exponential families, but these methods have their limitations. Perhaps most immediate is the gap between theory and practice: implementing the exact theoretical prescription in realistic situations—for example, when new data arrives and inference needs to be adjusted accordingly—may be a prohibitive task.

In this paper we propose a Bayesian framework for carrying out inference after model selection in the linear model. Our framework is very flexible in the sense that it naturally accommodates different models for the data, instead of requiring a case-by-case treatment. At the core of our methods is a new approximation to the exact likelihood conditional on selection, the latter being generally intractable. We prove that, under appropriate conditions, our approximation is asymptotically consistent with the exact truncated likelihood. The advantages of our methods in practical data analysis are demonstrated in simulations and in application to HIV drug-resistance data.

    1 Introduction

    1.1 Background

Any meaningful statistical problem consists of a model and a set of matching parameters for which decisions are required. In the classical, "textbook" paradigm, the model and this set of target parameters are assumed to be chosen independently of the data subsequently used for statistical inference (e.g., estimating the parameters). In practice, however, more often than not, data analysts examine some aspect of the data before deciding on a model and/or the target parameters. For example, in fitting a simple regression model, a plot of the data might help the analyst decide whether to model the relationship between the response and the explanatory variable as linear or nonlinear; in fitting linear regression, one might decide to discard variables with large p-values and re-fit the smaller model before reporting any findings; and in multiple hypothesis testing, the analyst might be tempted to report confidence intervals only for rejected nulls.

Of course, ignoring such forms of adaptivity in choosing the model and/or the target parameters may result in the loss of inferential guarantees and lead to flawed conclusions. Still, most would agree that instructing the analyst to avoid such exploration of the data altogether is not only impractical, but also not recommended. This realization on the one hand, and a serious concern about a replication problem in science on the other hand, have elicited an effort in the statistical community to develop tools for selective inference.

arXiv:1605.08824v7 [stat.ME] 30 May 2020

In general, such tools allow one to take into account the fact that the same data used to select target parameters is used when providing inference, thus attempting to restore the validity of inference following selection. This includes bias-reducing methods based on an extended definition of Uniform Minimum-Variance Unbiased estimation (Robbins, 1988), Bayesian approaches (Efron, 2011) and bootstrapping (Simon and Simon, 2013); as well as methods that work by performing simultaneous inference (Berk et al., 2013). Recently, tools from information theory (Russo and Zou, 2015) and Differential Privacy (Dwork et al., 2014, 2015) have also been proposed to quantify and control the effect of selection.

Another common approach is to condition on selection, namely, base inference on the likelihood of the observed data when truncated to the set of all realizations that would result in the analyst posing the same question. A conditional approach has been pursued by several authors that addressed selective inference in so-called large-scale inference problems (Efron, 2012). Underlying such problems is, classically, a sequence model, where each observation corresponds to a single parameter, and the parameters are typically not assumed to have any relationship with one another. Conditional inference for the effects corresponding to the K largest statistics in the sequence model was proposed in Reid et al. (2014). Zöllner and Pritchard (2007) and Zhong and Prentice (2008) suggested point and interval estimates for the parameter of a univariate gaussian distribution conditional on exceeding a fixed threshold, in the context of genome-wide association studies; Weinstein et al. (2013) constructed confidence intervals, and Benjamini and Meir (2014) explored point estimators, for the univariate truncated gaussian problem with a more flexible (random) choice of cutoff; Yekutieli (2012) proposed to base inference on the truncated likelihood while incorporating a prior on the parameters; and Simonsohn et al. (2014) proposed a frequentist method to assess effect sizes from the distribution of the p-values corresponding to only the selected.

More recently, the practicability of the conditional approach has been extended considerably from the sequence model to the realm of linear models and GLMs (Lee et al., 2016; Taylor et al., 2013, 2014; Lee and Taylor, 2014; Fithian et al., 2014; Tian et al., 2018, among others). In these works the selection protocol is assumed to partition the sample space into polyhedral (or at least convex) sets, fitting many popular 'automatic' model selection procedures such as marginal screening, the Lasso, forward selection, etc. The techniques developed in that line of work have practical importance as they allow one to carry out exact inference after model selection, which is one of the most popular situations where the problem of selective inference arises. At the core of these new contributions is the realization, first made in Lee et al. (2016), that when inference for the projection of the mean vector of Y (in a fixed-X homoscedastic, gaussian linear model with known σ) onto a one-dimensional subspace is desired, then conditioning further on the projection of Y onto the orthogonal complement of this subspace reduces the problem to inference for a univariate truncated normal variable, whose distribution depends only on the (scalar) parameter of interest.

The polyhedral lemma of Lee et al. (2016) was indeed a significant step forward, because it made feasible exact inference after variable selection in the gaussian regression case, for many popular variable selection rules. However, this method has its limitations. If attention is restricted to hypothesis testing, then the polyhedral lemma yields a Uniformly Most Powerful Unbiased (UMPU) test in the fixed-X setting and under a specific model for the data, entailing that the expectation of the response is an arbitrary vector in $\mathbb{R}^n$; Fithian et al. (2014) call this the saturated model. Furthermore, the selection rule is assumed to involve the same set of observations used for subsequent inference. Lastly, the target of inference in Lee et al. (2016) is restricted to a real-valued linear function of the model parameters. If these assumptions are violated, the polyhedral lemma may not be applicable, or may lead to suboptimal procedures.

    1.2 Our contribution

In the current paper we explore the problem of selective inference when one moves away from the aforementioned assumptions. To circumvent the limitations related to the polyhedral lemma, we offer an approach that works directly with a full selective likelihood function while implementing the principles of data carving (Fithian et al., 2014). Carving is a form of randomization that resembles data-splitting in that selection operates on a subset of the data, but is different from data-splitting in that inference uses the entire data set instead of relying on the held-out portion alone. While the flavor is frequentist (we use a conditional likelihood), in our framework we incorporate a prior on the parameters of the (adaptively) chosen model. This allows us to give inference for general functions of the parameter vector by integrating out nuisance parameters. Our approach contrasts with the strictly frequentist approach of Fithian et al. (2014), which deals with nuisance parameters by conditioning them out. By adopting formally a Bayesian approach, we are able to exploit the usual advantages of the Bayesian machinery. For example, because we have a mechanism for generating samples from the posterior distribution, credible intervals can be estimated easily by finding appropriate quantiles in the posterior sample of the parameter. Because our methods are amenable to randomization, we handle data carving naturally, and in that sense offer considerably more flexibility as compared to the methods of Lee et al. (2016).

Our main contribution is a tractable approximation to the full truncated likelihood function, which allows us to overcome serious computational objections. On the theoretical side, much of our effort is focused on motivating the approximation and especially on proving that it enjoys various desirable properties. Specifically, our main technical results establish that in a fixed-p, growing-n regime, and under our randomization framework, the proposed approximation:

    • is consistent for the exact selection probability, see Theorem 5.1.

    • leads to consistency of the approximate posterior distribution in the sense of Theorem 6.11.

• presents, from a computational perspective, a convex optimization problem when solving for the approximate posterior mode, see Theorem 6.12.

To emphasize the utility of the proposed methods, in our framework we divide the data into two parts, where selection operates on the first part only. The second, held-out portion is reserved for inference after selection. By preventing the statistician from using the held-out portion of the data for selection, this scheme ensures that there is enough leftover information (Fithian et al., 2014) after selection, thereby allowing us to give more "powerful" inference, for example, to construct shorter confidence intervals; see also Kivaranovic and Leeb (2018). This paradigm arises naturally in many realistic situations: in almost any field of science, the researcher begins with an initial data set which she might use for selection and (post-selection) inference. But usually, further observations are made available at a future point in time. This is either because the researcher decided to collect more data after seeing the outcome of the initial analysis, or simply because another data set comes in later on. At this point the researcher is faced with the question of how to combine the two data sets to provide inference for parameters selected based only on the first data set. As was already pointed out in Fithian et al. (2014), this situation can be cast into the selective inference framework, where the data is the augmented set of observations, and selection "so happens" to operate on only part of the data (see the simple Example 4 in Fithian et al., 2014). In other words, the optimal thing to do is to base inference on both the initial and the follow-up parts of the data, while taking into account the fact that selection affects the distribution in the first part—what we referred to as "data carving" earlier. Our Bayesian methods naturally implement the principles of data carving, through the updating of the prior distribution by the truncated likelihood.

Another difference between our work and, e.g., Lee et al. (2016), is that we allow the researcher to refine the choice of the model after seeing the output of the automatic algorithm. Indeed, in many realistic situations this kind of flexibility is essential. For example, suppose that we run the Lasso to select among p main effects, but after observing the output, we want to add interactions between some of the selected variables. Including such interactions may account for some unexplained variance (Gunter et al., 2011; Yu et al., 2019).


In other cases, domain-specific information might be available a priori for the predictors. For instance, in statistical genomics it is common to use existing biological annotations to inform selection of variables (Gao et al., 2017). Such information may suggest including a group of genes known to interact with each other, even if some of them were not selected by the automatic protocol (Stingo et al., 2011).

[Figure 1: coverage, interval length, and risk of the naive, Lee, split, and carved methods across three signal regimes (SR-1, SR-2, SR-3); the percentages 1.40%, 1.95%, and 4.73% in the length panel mark the proportion of infinitely long Lee et al. intervals.]

Figure 1 showcases the advantages of our methods on simulated data. In this experiment, the data obeys a linear model with a 500-by-100 design matrix X, such that the rows of X are i.i.d. copies of a correlated multivariate normal vector. The model coefficients $\beta_j$ are i.i.d. from a mixture of two zero-mean gaussians, one with small variance 0.01 and the other with large variance V. We select variables with the Lasso, using a fixed value for the tuning parameter. Four different methods are compared on three criteria, for three different values of V: average coverage of interval estimates for the $\beta_j$ with nominal FCR (false coverage rate) level of 10%; length of the interval estimates; and the prediction risk. The figure shows that unadjusted inference, labeled as "naive", is indeed not valid. FCR is roughly maintained at the nominal level for the other methods, including our new "carving" method that constructs (approximate) Bayesian intervals with respect to a vague prior. It can be seen that the carving method produces the shortest intervals. In the middle panel of the figure we also indicate the percentage of intervals constructed with the methods of Lee et al. (2016) that had infinite length. Our method also attains smaller prediction risk than the other three estimators. The simulation setting and all four methods are described precisely in Section 5, where we also provide a table corresponding to Figure 1.

The rest of the paper is organized as follows. Section 2 describes the general setup and introduces the central component, namely, a selection-adjusted likelihood. In the following two sections we instantiate the general framework by considering two particular cases. Section 3 begins with the simple example of univariate data and a simple selection rule. In Section 4 we analyze the example of Lasso selection in the linear model, which features all of the complexity of the general framework; further examples appear in the Appendix. Section 5, which includes the main technical novelty, develops an approximation to the selection-adjusted likelihood, provides supporting analysis, and demonstrates its usefulness in a simulation. We treat point estimation separately in Section 6 and prove the consistency guarantees associated with our estimates. Computational details for implementing our methods are provided in Section 8, and Section 9 concludes with a discussion. Proofs are deferred to the Appendix.

    2 Basic setup

Let $(X_{i1}, \dots, X_{ip}, Y_i) \sim P$, $i = 1,\dots,n$,


be $(p+1)$-dimensional i.i.d. vectors with $Y_i \in \mathbb{R}$. Denote $y = (Y_1, \dots, Y_n)^T$, and denote by $X$ the $n\times p$ matrix with $(X_{i1}, \dots, X_{ip})$ in its $i$-th row. Further, for any matrix $D \in \mathbb{R}^{k\times l}$, any subset $N \subseteq \{1,\dots,k\}$ and any subset $M \subseteq \{1,\dots,l\}$, we denote by $D_N$ the $|N|\times l$ matrix obtained from $D$ by keeping only the rows with indices in $N$, and we denote by $D^M$ the $k\times|M|$ matrix obtained from $D$ by keeping only the columns with indices in $M$. Lastly, $\mathcal{N}_k(\cdot;\eta,\Gamma)$ is the density of a $k$-dimensional multivariate normal vector with mean $\eta$ and covariance $\Gamma$.

Our framework consists of a model selection stage followed by an inference stage. For a subset $S \subset \{1,\dots,n\}$ determined independently of the data, let $(X_S, y_S)$ be a subsample of size $|S| = n_1$. We will refer to $(X_S, y_S)$ as the original data henceforth. On observing the original data, the statistician looks for relevant variables by applying some predetermined variable selection rule, a mapping

\[
(X_S, y_S) \longmapsto \hat{E} \subseteq \{1,\dots,p\}.
\]

The statistician is allowed at this point to refine the selection by dropping variables from (or adding variables to) the output $\hat{E}$. Specifically, let $\hat{E}' \subseteq \{1,\dots,p\}$ be a subset obtained by applying an arbitrary but deterministic function to $\hat{E}$. We denote by $E$ and $E'$ the realizations of the variables $\hat{E}$ and $\hat{E}'$, respectively.

    In the second stage inference is provided for P assuming that

\[
y \mid X \sim \mathcal{N}_n\big(y;\, X_{E'}\beta^{E'}, \sigma^2 I\big). \tag{1}
\]

If not indicated otherwise, we treat $\sigma^2 = \sigma^2_{E'}$ as known. Nevertheless, we would like to emphasize that our method easily accommodates the case where $\sigma^2$ is unknown and itself given a prior; see Sections 5 and 8. Naive inference would now proceed under the model (1), ignoring the fact that $E'$ was chosen after seeing (part of) the data. Instead, the fact that $E'$ is data-dependent will be accounted for by truncating the likelihood in (1) to the event $\{\hat{E} = E\}$. Formally, this means that, conditionally on $X$, (1) should be replaced by

\[
y \mid X, \hat{E} = E \;\sim\; \frac{\mathcal{N}_n\big(y;\, X_{E'}\beta^{E'}, \sigma^2 I\big)}{\int \mathbf{1}_{\{\hat{E}=E\}}(y)\,\mathcal{N}_n\big(y;\, X_{E'}\beta^{E'}, \sigma^2 I\big)\,dy}\;\mathbf{1}_{\{\hat{E}=E\}}(y), \tag{2}
\]

remembering that $\hat{E}'$ is determined by $\hat{E}$, and that $\hat{E}$ is in turn measurable with respect to $(X_S, y_S)$. In principle, post-selection inference for $\beta^{E'}$ could now be based on (2). Unfortunately, in practice, the denominator in (2) is intractable, and we will seek simplifications. Taking a step back, note that if no selection had been involved in specifying $E'$, standard inference for $\beta^{E'}$ would rely on the distribution of the least squares estimator,
\[
\hat{\beta}^{E'} = X_{E'}^{\dagger}\, y, \tag{3}
\]
where $X_{E'}^{\dagger}$ is the pseudo-inverse of $X_{E'}$; if $X_{E'}$ has full rank, this is $(X_{E'}^T X_{E'})^{-1}X_{E'}^T$. If the selection event could be expressed in terms of $\hat{\beta}^{E'}$ and nothing else, this would give rise to a truncated distribution for $\hat{\beta}^{E'}$. This is, however, not the case for the selection rules considered in this paper, for two reasons. First of all, because we implement data carving, the choice of $E$ cannot be a deterministic function of $\hat{\beta}^{E'}$, the latter being a function of the full data $(X, y)$ rather than $(X_S, y_S)$. Second, even if we allowed selection to be based on the full data $(X, y)$, we cannot restrict it to be a function of $\hat{\beta}^{E'}$ alone. For example, Lasso selection cannot be expressed in terms of only the least squares statistic even if selection operates on the entire data set. Suppose for the moment, then, that the event $\hat{E} = E$ could be written in a polyhedral form,

\[
A_E\sqrt{n}\,T_n + B_E\,\Omega_n < b_E, \tag{4}
\]
with
\[
T_n = \begin{pmatrix} \hat{\beta}^{E'} \\ U \end{pmatrix},
\]


where $U$ is a vector that may itself depend on $E'$ and, importantly, $\Omega_n$ has the property that (i) it is independent of $T_n$ and (ii) it follows a distribution that does not depend on $\beta^{E'}$. Using the terminology of Tian et al. (2018), $\Omega_n$ is a randomization term. As we will see later on, the selection rules that we will be concerned with can indeed be approximately represented in the form (4) with appropriate choices of $U$, $\Omega_n$ and $A_E, B_E, b_E$; see equations (11) and (4.2). By "approximately represented" we mean that we allow a term that goes to zero in probability in (4), and also that $\Omega_n$ satisfies the properties (i) and (ii) above asymptotically, in the sense formalized in Section 4.1. We would like to emphasize at this point that one of the technical challenges is to find an explicit representation as such for a given selection rule. This task will be relatively easy in Section 3, but considerably more involved in Section 4, when we address selection with the Lasso.

Treating (4) first as if it holds exactly instead of asymptotically, we start by considering the truncated distribution of $(T_n, \Omega_n)$: its density, denoted $\mathcal{L}_n(\beta^{E'}; t_n, \omega_n)$ in (5), is the unconditional density of $(T_n, \Omega_n)$ at $(t_n, \omega_n)$, divided by the selection probability $\mathbb{P}\big(A_E\sqrt{n}\,T_n + B_E\,\Omega_n < b_E \mid \beta^{E'}\big)$, and restricted to the event $\{A_E\sqrt{n}\,t_n + B_E\,\omega_n < b_E\}$.

The central component in our Bayesian model is the selection-adjusted likelihood (5), which also presents the main technical challenge. In order to sample from (7), we need to be able to evaluate the adjustment factor,

\[
\mathbb{P}\big(A_E\sqrt{n}\,T_n + B_E\,\Omega_n < b_E \mid \beta^{E'}\big), \tag{8}
\]
as a function of $\beta^{E'}$, which is still (generally) intractable. Hence, most of our effort will be devoted to developing an amenable approximation that can be plugged in instead.

    3 A Primer

We find it instructive to first demonstrate our methods in a simple situation, namely, the special case where there are no covariates and the selection rule takes on a simple form. Besides priming the sequel, the example below is a case where an exact analysis can be carried out. Hence, unlike in the general situation, we can compare our approximate methods against the ground truth when assessing their effectiveness.

    Throughout this section, then, suppose that the observations are

\[
Y_i \sim \mathcal{N}(\beta, 1), \quad i = 1,\dots,n. \tag{9}
\]

For the theoretical analysis that follows, $\beta = \beta_n$ depends in general on $n$, and we use the parameterization $\sqrt{n}\,\beta_n \equiv n^{\delta}\beta^{*}$ with $\beta^{*}$ a constant and $\delta \in (0, 1/2]$. Let $S \subseteq \{1,\dots,n\}$ be a random subset of size $n_1 = \rho n$, and denote
\[
\bar{Y}^{S} := \frac{1}{n_1}\sum_{i\in S} Y^{(i)}.
\]

    We provide inference for βn conditionally on

\[
\sqrt{n_1}\,\bar{Y}^{S} > 0, \tag{10}
\]

that is, the selection event entails that the (scaled) least squares estimate for $\beta_n$ from the original data exceeds a fixed threshold (zero). We first note that, on defining

\[
W_n := \bar{Y}^{S} - \bar{Y}_n,
\]
we have
\[
\sqrt{n}\,W_n \sim \mathcal{N}\!\Big(0, \tfrac{1-\rho}{\rho}\Big), \qquad \sqrt{n}\,W_n \perp\!\!\!\perp \sqrt{n}\,\bar{Y}_n.
\]

Therefore, letting
\[
\sqrt{n}\,\hat{\beta}_n = \sqrt{n}\,\bar{Y}_n, \qquad \Omega_n = \sqrt{n}\,W_n, \tag{11}
\]
we see that the selection event (10) can be written exactly in the form (4) with $T_n = \hat{\beta}_n$, and $A_E = -1$, $B_E = -1$, $b_E = 0$.

    In this simple example, we have an exact form for the probability of the selection event,

\[
\mathbb{P}\big(A_E\sqrt{n}\,T_n + B_E\,\Omega_n < b_E \mid \beta_n\big) = \mathbb{P}\big(\sqrt{n_1}\,\bar{Y}^{S} > 0 \mid \beta_n\big) = \bar{\Phi}\big(-\sqrt{\rho}\cdot\sqrt{n}\,\beta_n\big). \tag{12}
\]

This will not be the case in the more complicated example of the following section, and approximations to the selection probability will be crucial. The following result suggests an approximation for the selection probability (12) in the simple univariate example, which we will generalize in the next section.


Theorem 3.1. Let $\beta = \beta_n = n^{\delta}\beta^{*}/\sqrt{n}$, where $\beta^{*}$ is a constant. Let $K = (0,\infty)$. Then
\[
\log \mathbb{P}\big(\sqrt{n}(\bar{Y}_n + W_n)/n^{\delta} \in K \mid \beta_n\big) \;\le\; -n^{2\delta}\cdot \inf_{(z,w):\, z+w\in K} \left\{ \frac{(z-\beta^{*})^2}{2} + \frac{\rho w^2}{2(1-\rho)} \right\} \quad \text{for any } n \in \mathbb{N} \tag{13}
\]
whenever $0 < \delta \le 1/2$. Furthermore, the logarithm of the sequence of selection probabilities satisfies
\[
\lim_{n\to\infty} \frac{1}{n^{2\delta}} \log \mathbb{P}\big(\sqrt{n}(\bar{Y}_n + W_n)/n^{\delta} \in K \mid \beta_n\big) = -\inf_{(z,w):\, z+w\in K} \left\{ \frac{(z-\beta^{*})^2}{2} + \frac{\rho w^2}{2(1-\rho)} \right\}. \tag{14}
\]

We call the exponent of the right hand side of (13) the Chernoff approximation to the selection probability (12). Note that the infimum on the right hand side of (13) is taken over a constrained set in $\mathbb{R}^2$. To obtain a more computationally amenable expression, we propose to replace the right hand side of (13) with the unconstrained optimization problem

\[
-n^{2\delta}\cdot \inf_{(z,w)\in\mathbb{R}^2} \left\{ \frac{(z-\beta^{*})^2}{2} + \frac{\rho w^2}{2(1-\rho)} + \frac{1}{n^{2\delta}}\,\psi_{n^{-\delta}}(z+w) \right\}, \tag{15}
\]

where $\psi_{n^{-\delta}}$ is a function satisfying that, in the limit as $n\to\infty$,
\[
\frac{1}{n^{2\delta}}\,\psi_{n^{-\delta}}(x) \longrightarrow I(x) = \begin{cases} 0 & \text{if } x \in (0,\infty) \\ \infty & \text{otherwise.} \end{cases}
\]

Specifically, using
\[
\psi_{n^{-\delta}}(x) = \log\big(1 + n^{-\delta}/x\big) \tag{16}
\]
whenever $x \in (0,\infty)$, and $\infty$ otherwise, in (15), we obtain our barrier approximation to the selection probability (12), so called because (16) serves as a "soft" barrier alternative to the indicator function $I(x)$. Figure 2a compares the approximations in (13) and (15) to the exact expression (12). The plots for both approximations follow the exact curve fairly closely. Note that the curve corresponding to the Chernoff approximation lies above that for the true probability, as predicted by Theorem 3.1.

Remark 3.2. In practice, of course, we will not have access to $\delta$, but, by substituting $\sqrt{n}\,z' = n^{\delta}z$, $\sqrt{n}\,w' = n^{\delta}w$, (15) is equivalent to
\[
-n\cdot \inf_{(z',w')\in\mathbb{R}^2} \left\{ \frac{(z'-\beta_n)^2}{2} + \frac{\rho w'^2}{2(1-\rho)} + \frac{1}{n}\,\psi_{n^{-1/2}}(z'+w') \right\},
\]
and we can simply work with $\beta_n$ without knowing the underlying parameterization.
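To illustrate the quantities compared in Figure 2a, the following minimal Python sketch evaluates the exact log selection probability (12) alongside the Chernoff bound (13) and the barrier approximation (15) and (16), written in the practical parameterization of Remark 3.2; the values of n, ρ and the grid of β values are illustrative choices, not taken from the paper.

```python
# Minimal sketch (not the authors' code): exact vs. Chernoff vs. barrier
# approximations to the log selection probability in the univariate setting.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

n, rho = 200, 0.5                      # illustrative sample size and carving fraction

def exact_log_prob(beta):
    # (12): log P(sqrt(n1) Ybar_S > 0 | beta) = log Phi-bar(-sqrt(rho n) beta)
    return norm.logsf(-np.sqrt(rho * n) * beta)

def chernoff_log_prob(beta):
    # (13): -n inf_{z+w>0} {(z-beta)^2/2 + rho w^2/(2(1-rho))}; closed form here
    return 0.0 if beta > 0 else -n * rho * beta ** 2 / 2

def barrier_log_prob(beta):
    # (15)-(16): -n inf_{z,w} {(z-beta)^2/2 + rho w^2/(2(1-rho)) + psi(z+w)/n},
    # with psi(x) = log(1 + 1/(sqrt(n) x)) for x > 0 and +infinity otherwise
    def obj(v):
        z, w = v
        if z + w <= 0:
            return np.inf
        return ((z - beta) ** 2 / 2 + rho * w ** 2 / (2 * (1 - rho))
                + np.log1p(1.0 / (np.sqrt(n) * (z + w))) / n)
    res = minimize(obj, x0=[abs(beta) + 1.0, 0.0], method="Nelder-Mead")
    return -n * res.fun

for beta in (-0.3, -0.1, 0.0, 0.1):
    print(beta, exact_log_prob(beta), chernoff_log_prob(beta), barrier_log_prob(beta))
```

As stated in Theorem 3.1, the Chernoff values lie above the exact ones, while both approximations track the exact curve fairly closely.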

    The logarithm of the selection-adjusted posterior is

\[
\log \pi_S(\beta_n \mid \bar{y}_n) = \log \pi(\beta_n) - n(\bar{y}_n - \beta_n)^2/2 - \log \mathbb{P}\big(\sqrt{n}(\bar{Y}_n + W_n)/n^{\delta} \in K \mid \beta_n\big), \tag{17}
\]

and a corresponding approximate selection-adjusted posterior, denoted $\log \tilde{\pi}_S(\beta_n \mid \bar{y}_n)$, is
\[
\log \pi(\beta_n) - n(\bar{y}_n - \beta_n)^2/2 + n^{2\delta}\cdot \inf_{(z,w)\in\mathbb{R}^2} \left\{ \frac{(z-\beta^{*})^2}{2} + \frac{\rho w^2}{2(1-\rho)} + \frac{1}{n^{2\delta}}\,\psi_{n^{-\delta}}(z+w) \right\}, \tag{18}
\]

where $\psi_{n^{-\delta}}(\cdot)$ is a (convex) barrier function associated with $K = (0,\infty)$. The following corollary, a consequence of Theorem 3.1, says that the sequence of selection-adjusted posteriors based on a suitable barrier approximation indeed converges to the corresponding (true) selection-adjusted posterior.


Corollary 3.3. For $\beta = \beta_n$ such that $\sqrt{n}\,\beta_n \equiv n^{\delta}\beta^{*}$, $\delta \in (0,1/2]$, we have
\[
\lim_{n\to\infty} \frac{1}{n^{2\delta}}\big\{\log \pi_S(\beta_n \mid \bar{y}_n) - \log \tilde{\pi}_S(\beta_n \mid \bar{y}_n)\big\} = 0,
\]
whenever $n^{-2\delta}\psi_{n^{-\delta}}(x)$ converges pointwise to
\[
I_K(x) = \begin{cases} 0 & \text{if } x \in K \\ \infty & \text{otherwise.} \end{cases}
\]
Here $K = (0,\infty)$, and $\log\pi_S(\beta_n\mid\bar{y}_n)$, $\log\tilde{\pi}_S(\beta_n\mid\bar{y}_n)$ are given by (17) and (18), respectively.

In Section 6 we discuss estimation with an approximate maximum a-posteriori (MAP) statistic, here an approximation to the maximizer of $\log\pi_S(\beta \mid \bar{y}_n)$, and focus on establishing theoretical guarantees relating the approximate MAP (using our barrier approximation to the adjustment factor) to the exact MAP. Specifically, if we use a constant prior, the MAP is equivalent to the maximum-likelihood estimate (MLE). Without delving into the technical details that will follow, Figure 2b shows how the approximate MLE compares to the exact MLE (and to the unadjusted MLE) in the univariate gaussian setting of this section.

Figure 2: (a) Approximations to the log-adjustment factor in the univariate normal setting. In the legend, "Chernoff" refers to the approximation from (13), and "Barrier" to the approximation from (15); "True" corresponds to the exact expression (12). Note that the curve for the Chernoff approximation lies above the true one, as predicted by our theory. (b) Maximum-likelihood estimation for the univariate gaussian setting. The broken line corresponds to the approximate MLE, incorporating the proposed approximation to the selection probability. The solid black curve is the exact (true) MLE, and the solid red line is the identity line, representing the unadjusted estimate.

    4 Selection with the Lasso

We now turn to the case of primary interest, specializing the general framework of Section 2 to selection with the Lasso in the linear model. It is important to note again that our methodology accommodates a rich class of selection protocols, of which the Lasso is no more than a special case: in the Appendix we show how marginal screening and the logistic Lasso for a binary response can be reduced to fit our framework.

    From now on, unless specified otherwise, we consider

\[
\hat{E} = \{j : \hat{\beta}^{\lambda}_j \neq 0\}, \tag{19}
\]

    where β̂λ is the solution to

\[
\underset{\beta\in\mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{2\sqrt{n\rho}}\,\big\| y_S - X_S\beta \big\|_2^2 + \lambda\|\beta\|_1, \tag{20}
\]

and $\rho = n_1/n$. We emphasize that $\hat{\beta}^{\lambda} = \hat{\beta}^{\lambda}_S$ is obtained from only the original data $(y_S, X_S)$, but we suppress the subscript throughout. We also denote by $\hat{\beta}^{\lambda}_E$ the vector in $\mathbb{R}^{|E|}$ consisting of only the nonzero coordinates of $\hat{\beta}^{\lambda}$. While the general prescription in Section 2 would now entail conditioning on the event $\hat{E} = E$, as in Lee et al. (2016) we in fact condition on the more specific event
\[
\big(\hat{E}, \hat{S}_{\hat{E}}\big) = \big(E, s_E\big), \tag{21}
\]

where $\hat{S}_{\hat{E}} = \operatorname{sgn}(\hat{\beta}^{\lambda}_E)$ is the vector of signs of $\hat{\beta}^{\lambda}_E$. This refinement is important for obtaining a convex truncation region, crucial for our methods to be applicable. As in the general setting, the object of inference is $\beta^{E'}$, with $E' \subseteq \{1,\dots,p\}$ chosen upon observing (21).

To fit the general framework of Section 2, in the previous section we re-expressed the selection event, originally involving $\hat{\beta}^S = \bar{Y}^S$, as a sum involving $\hat{\beta}_n = \bar{Y}_n$ and an independent variable $W_n$. Furthermore, the distribution of $W_n$ was fully specified; in particular, it did not depend on the unknown parameter $\beta$. We will now work out an analogous asymptotic representation for the Lasso selection event (21). Thus, let
\[
\sqrt{n}\,T_n := \begin{pmatrix} \sqrt{n}\,\hat{\beta}^{E'} \\ \sqrt{n}\,N_{-E'} \end{pmatrix} := \begin{pmatrix} \sqrt{n}\,\hat{\beta}^{E'} \\ \frac{1}{\sqrt{n}}\, X_{-E'}^T\big( y - X_{E'}\hat{\beta}^{E'}\big) \end{pmatrix}, \tag{22}
\]

where $X_{-E'}$ is the matrix obtained from $X$ by deleting the columns with indices in $E'$. Define also
\[
\Omega_n = \sqrt{n}\,W_n = \frac{\partial}{\partial\beta}\left\{ -\frac{1}{2\sqrt{n\rho}}\big\| y_S - X_S\beta \big\|_2^2 + \frac{1}{2\sqrt{n}}\big\| y - X\beta \big\|_2^2 \right\}\bigg|_{\beta=\hat{\beta}^{\lambda}}. \tag{23}
\]
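For concreteness, here is a minimal Python sketch (not the authors' code) that computes, on simulated data, the selected set (19), the signs conditioned on in (21), and the statistics in (22) and (23); the data-generating choices, the value of λ, and the mapping of the objective (20) onto scikit-learn's Lasso parameterization are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions throughout) of Lasso selection on the
# subsample (X_S, y_S) and of the statistics sqrt(n) T_n in (22) and Omega_n in (23).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, rho, lam = 500, 20, 0.5, 1.0
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[1.0, -1.0, 0.5], np.zeros(p - 3)])
y = X @ beta_true + rng.standard_normal(n)

S = rng.choice(n, size=int(rho * n), replace=False)   # carving: select on (X_S, y_S)
X_S, y_S, n1 = X[S], y[S], int(rho * n)

# Objective (20) is (1/(2 sqrt(n rho))) ||y_S - X_S b||^2 + lam ||b||_1, with sqrt(n rho) = sqrt(n1);
# scikit-learn minimizes (1/(2 n1)) ||y_S - X_S b||^2 + alpha ||b||_1, so alpha = lam / sqrt(n1).
beta_lasso = Lasso(alpha=lam / np.sqrt(n1), fit_intercept=False).fit(X_S, y_S).coef_

E = np.flatnonzero(beta_lasso)          # selected set, (19)
s_E = np.sign(beta_lasso[E])            # signs conditioned on in (21)
E_prime = E                             # here we simply take E' = E
not_E = np.setdiff1d(np.arange(p), E_prime)

# sqrt(n) T_n of (22): refitted least squares on X_{E'} plus the inactive part
beta_hat = np.linalg.lstsq(X[:, E_prime], y, rcond=None)[0]
resid = y - X[:, E_prime] @ beta_hat
sqrt_n_T = np.concatenate([np.sqrt(n) * beta_hat, X[:, not_E].T @ resid / np.sqrt(n)])

# Omega_n of (23): difference of (scaled) least-squares gradients at the Lasso solution
omega = (X_S.T @ (y_S - X_S @ beta_lasso) / np.sqrt(n1)
         - X.T @ (y - X @ beta_lasso) / np.sqrt(n))
print("selected:", E, "signs:", s_E)
```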

As observed in Markovic and Taylor (2016), $\Omega_n$ can be treated asymptotically as a randomization term. We verify this in the following proposition.

Proposition 4.1. Denote $\mu := (\beta^{E'}, 0)^T \in \mathbb{R}^p$ where $\beta^{E'} \in \mathbb{R}^{|E'|}$. Define also the matrices
\[
Q := E_P\big(X_{E'}^T X_{E'}/n\big), \qquad N := \Big\{ E_P\big(X_{-E'}^T X_{-E'}/n\big) - E_P\big(X_{-E'}^T X_{E'}/n\big)\,Q^{-1}\, E_P\big(X_{-E'}^T X_{E'}/n\big)^T \Big\}^{-1}.
\]
Then, under the modeling assumption (1), the distribution of
\[
\begin{pmatrix} \sqrt{n}\,(T_n - \mu) \\ \Omega_n \end{pmatrix} \in \mathbb{R}^{2p}
\]
is asymptotically normal with mean zero and covariance
\[
\Sigma = \begin{bmatrix} \Sigma_P & 0 \\ 0 & \Sigma_G \end{bmatrix},
\]
where $\Sigma_P = \sigma^2 \begin{bmatrix} Q^{-1} & 0 \\ 0 & N^{-1} \end{bmatrix}$ and $\Sigma_G$ are $p\times p$ matrices.


To fit the framework of Section 2, we now show that the event (21) can be asymptotically written as in (4). This is established in the following proposition.

Proposition 4.2. The event $\big(\hat{E}, \hat{S}_E\big) = \big(E, s_E\big)$ is equivalent to
\[
A_E\sqrt{n}\,T_n + B_E\,\Omega_n + o_p(1) < b_E, \tag{24}
\]
where
\[
A_E = \begin{bmatrix}
-\operatorname{diag}(s_E)(Q^E)^{-1}P^{E,E'} & -\operatorname{diag}(s_E)(Q^E)^{-1}I^{E,E'} \\
F^{E,E'} - C^E(Q^E)^{-1}P^{E,E'} & J^{E,E'} - C^E(Q^E)^{-1}I^{E,E'} \\
-F^{E,E'} + C^E(Q^E)^{-1}P^{E,E'} & -J^{E,E'} + C^E(Q^E)^{-1}I^{E,E'}
\end{bmatrix},
\]
\[
B_E = \begin{bmatrix}
-\operatorname{diag}(s_E)(Q^E)^{-1} & 0 \\
-C^E(Q^E)^{-1} & I \\
C^E(Q^E)^{-1} & -I
\end{bmatrix},
\qquad
b_E = \lambda\begin{bmatrix}
-\operatorname{diag}(s_E)(Q^E)^{-1}s_E \\
1 - C^E(Q^E)^{-1}s_E \\
1 + C^E(Q^E)^{-1}s_E
\end{bmatrix};
\]
\[
P^{E,E'} = E_P(X_E^T X_{E'}/n), \quad F^{E,E'} = E_P(X_{-E}^T X_{E'}/n), \quad Q^E = E_P(X_E^T X_E/n), \quad C^E = E_P(X_{-E}^T X_E/n);
\]
$I^{E,E'} \in \mathbb{R}^{|E|\times(p-|E'|)}$ and $J^{E,E'} \in \mathbb{R}^{(p-|E|)\times(p-|E'|)}$ are matrices with entries that are all zero except for the $(j,j)$ entries, which are defined as follows. Suppose that $E = \{k_1, k_2, \dots, k_{|E|}\}$ and that $E^c = \{l_1, l_2, \dots, l_{p-|E|}\}$. Then
\[
I^{E,E'}_{j,j} = \begin{cases} 1 & \text{if } k_j \notin E' \\ 0 & \text{otherwise,} \end{cases} \quad j = 1,\dots,\min(|E|,\, p-|E'|);
\]
\[
J^{E,E'}_{j,j} = \begin{cases} 1 & \text{if } l_j \notin E' \\ 0 & \text{otherwise,} \end{cases} \quad j = 1,\dots,\min(p-|E|,\, p-|E'|).
\]

Remark 4.3. When $E' = E$, note that $I^{E,E} = 0$ and $J^{E,E} = I$. In particular, the polyhedron in Proposition 4.2 has
\[
A_E = \begin{bmatrix} -\operatorname{diag}(s_E) & 0 \\ 0 & I \\ 0 & -I \end{bmatrix}, \quad
B_E = \begin{bmatrix} -\operatorname{diag}(s_E)(Q^E)^{-1} & 0 \\ -C^E(Q^E)^{-1} & I \\ C^E(Q^E)^{-1} & -I \end{bmatrix}, \quad
b_E = \lambda\begin{bmatrix} -\operatorname{diag}(s_E)(Q^E)^{-1}s_E \\ 1 - C^E(Q^E)^{-1}s_E \\ 1 + C^E(Q^E)^{-1}s_E \end{bmatrix}.
\]
Lemma 14 in Tian et al. (2018) establishes asymptotic normality for $\sqrt{n}(T_n - \mu)$ (compare to Proposition 4.1), and gives the approximate polyhedron for the specific case $E' = E$, under the Lasso objective with additive heavy-tailed randomization.

    5 Approximations to the selection-adjusted likelihood

As indicated before, to be able to put the Bayesian machinery to work, an immediate challenge that presents itself is evaluating the selection-adjusted likelihood—more specifically, the adjustment factor (8)—as a function of $\beta^{E'}$. Since (8) is in general unmanageable, we will propose to replace it with a computationally tractable approximation. In this section, we use $\beta^{E'}_n$ instead of $\beta^{E'}$ and represent $\mu$ defined in Proposition 4.1 by $\mu_n$ to make explicit the dependence of the parameter vector in (2) on $n$. Furthermore, we assume without loss of generality that the variance term $\sigma^2 = 1$. The case where $\sigma^2$ is unknown is discussed after stating Theorem 5.1.

Theorem 5.1 below is a moderate deviations-type result allowing us to obtain the limiting value of the probability of a polyhedral region, and it provides motivation for our approximation. Before stating the theorem, we recall the following representation from the proof of Proposition 4.1,
\[
\begin{pmatrix} \sqrt{n}\,T_n \\ \Omega_n \end{pmatrix} - \sqrt{n}\begin{pmatrix} \mu_n \\ 0 \end{pmatrix} = \sqrt{n}\,\bar{Z}_n + E_n,
\]
where $E_n = o_p(1)$ and $\mu_n = (\beta^{E'}_n, 0)^T \in \mathbb{R}^p$, and $\bar{Z}_n$ is a mean statistic based on $n$ i.i.d. centered observations in our framework.

Theorem 5.1. Suppose that (1) holds for some vector $\beta^{E'}_n$ satisfying $\sqrt{n}\,\beta^{E'}_n = n^{\delta}\beta^* \in \mathbb{R}^{|E'|}$, where $\beta^*$ does not depend on $n$, and where $\delta \in (0, 1/2)$. Defining $W_n$ through $\Omega_n := \sqrt{n}\,W_n$, we assume that
\[
E_P\Big[\exp\big(\lambda_0\,\|X(Y - X_{E'}^T\beta^{E'}_n)\|\big)\Big],\;\; E_P\Big[\exp\big(\eta_0\,\|X(Y - X_E^T\, E[\hat{\beta}^{\lambda}_E])\|\big)\Big] < \infty \tag{25}
\]
for some $\lambda_0, \eta_0 > 0$, where $(X, Y) = (X_1, \dots, X_p, Y) \sim P$. We also assume that
\[
\lim_{n\to\infty}\; n^{-2\delta}\cdot\log\mathbb{P}\big[n^{-\delta}\|E_n\| > \epsilon\big] = -\infty \quad\text{for every } \epsilon > 0, \tag{26}
\]
and that
\[
\lim_{n\to\infty}\; \frac{1}{n^{2\delta}}\Big\{\log\mathbb{P}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n + o_p(1) < b_E \mid \beta^{E'}_n\big) - \log\mathbb{P}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < 0 \mid \beta^{E'}_n\big)\Big\} = 0. \tag{27}
\]
Then, denoting $H_n = \{(b, \eta, w) : A_E(b, \eta)^T + B_E w < n^{-\delta}b_E\}$, we have
\[
\lim_{n\to\infty}\; \frac{1}{n^{2\delta}}\log\mathbb{P}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \mid \beta^{E'}_n\big) + \inf_{(b,\eta,w)\in H_n}\Big\{\frac{(b-\beta^*)^T Q(b-\beta^*)}{2} + \frac{\eta^T N\eta}{2} + \frac{w^T\Sigma_G^{-1}w}{2}\Big\} = 0, \tag{28}
\]
where the optimization variables $(b, \eta, w) \in \mathbb{R}^{E'}\times\mathbb{R}^{p-|E'|}\times\mathbb{R}^{p}$, and the matrices $Q$, $N$ and $\Sigma_G$ are defined in Proposition 4.1.

Theorem 5.1 suggests using the negative of the second term in (28), scaled by $n^{2\delta}$, as an approximation in computing the log of the probability in (8), whenever $\sqrt{n}(T_n, W_n)$ satisfies a central limit property and has exponential moments. In fact, the polyhedron can more generally be replaced with any open and convex subset $K \subset \mathbb{R}^{2p}$. Furthermore, if $\sigma^2$ is unknown, we can put
\[
K = \{(t, w) : A_E\sqrt{n}\,t + B_E\sqrt{n}\,w \le b_E/\sigma\}
\]
and compute the probability with respect to the law of $(\sqrt{n}\,T_n/\sigma, \sqrt{n}\,W_n/\sigma)^T$, where $\mu_n := (\beta^{E'}_n/\sigma, 0)^T$.

In this case, we note that (28) becomes
\[
\lim_{n\to\infty}\; \frac{1}{n^{2\delta}}\log\mathbb{P}\big(A_E(\sqrt{n}\,T_n/\sigma) + B_E(\sqrt{n}\,W_n/\sigma) < b_E/\sigma \mid \beta^{E'}_n, \sigma\big) + (\sigma^2)^{-1}\cdot\inf_{(b,\eta,w)\in H_n}\Big\{\frac{(b-\beta^*)^T Q(b-\beta^*)}{2} + \frac{\eta^T N\eta}{2} + \frac{w^T\Sigma_G^{-1}w}{2}\Big\} = 0, \tag{29}
\]
where $H_n$ is defined in Theorem 5.1. Hence our approximation readily accommodates the case of unknown $\sigma^2$. In Section 8 we use this to outline a scheme for posterior sampling when imposing a joint prior on $(\beta^{E'}, \sigma)^T$.


Remark 5.2. Assumption (25) is a condition on the existence of an exponential moment. This is a necessary condition for a mean statistic based on i.i.d. observations, in our case $\bar{Z}_n$, to satisfy a moderate deviation principle; see Arcones (2002); Eichelsbacher and Löwe (2003). Assumption (26) is necessary to allow a moderate deviation principle for the statistic $(\sqrt{n}(T_n - \mu_n), \Omega_n)^T$, which is obtained by adding to the centered mean statistic $\sqrt{n}\,\bar{Z}_n$ a remainder term that is converging in probability to 0. This assumption is typically needed to apply moderate deviations to statistics such as M-estimators; see Arcones (2002). Moderate deviations approximations are used to compute probabilities of the form
\[
\mathbb{P}\big(\sqrt{n}(T_n, W_n)^T/n^{\delta} \in K\big).
\]
If we take $K = \big\{(t, w) : \begin{bmatrix} A_E & B_E\end{bmatrix}(t, w)^T < 0\big\}$, then assumption (27) allows us to approximate
\[
n^{-2\delta}\cdot\log\mathbb{P}\big(\begin{bmatrix} A_E & B_E\end{bmatrix}\sqrt{n}(T_n, W_n)^T/n^{\delta} + n^{-\delta}o_p(1) < n^{-\delta}b_E \mid \beta^{E'}_n\big)
\]
by
\[
n^{-2\delta}\cdot\log\mathbb{P}\big(\begin{bmatrix} A_E & B_E\end{bmatrix}\sqrt{n}(T_n, W_n)^T/n^{\delta} < 0 \mid \beta^{E'}_n\big).
\]

From now on, we write the selection event as $\{A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E\}$, ignoring the $o_p(1)$ term. In practice, to obtain an approximation for $\log\mathbb{P}(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \mid \beta^{E'}_n)$, we solve an unconstrained version of the optimization problem in (28), as we describe next.

First we introduce a change of variable for the optimization arguments, which simplifies the constraints in the left hand side of (28) considerably.

Proposition 5.3. For $0 < \delta < 1/2$ and for a sequence $\beta^{E'}_n$ such that $\sqrt{n}\,\beta^{E'}_n = n^{\delta}\beta^*$, define a change of variable $w \mapsto o$ through
\[
n^{\delta}w = n^{\delta}\mathcal{P}_E\begin{pmatrix} b \\ \eta \end{pmatrix} + n^{\delta}\mathcal{Q}_E\, o + r_E, \tag{30}
\]
where
\[
\mathcal{P}_E = -\begin{bmatrix} P^{E,E'} & I^{E,E'} \\ F^{E,E'} & J^{E,E'} \end{bmatrix}, \qquad
\mathcal{Q}_E = \begin{bmatrix} Q^E & 0 \\ C^E & I \end{bmatrix}, \qquad
r_E = \begin{pmatrix} \lambda s_E \\ 0 \end{pmatrix}.
\]
Then, minimizing $n^{2\delta}\cdot\inf_{(b,\eta,w)\in H_n}\big\{(b-\beta^*)^T Q(b-\beta^*)/2 + \eta^T N\eta/2 + w^T\Sigma_G^{-1}w/2\big\}$ is equivalent to minimizing
\[
n^{2\delta}\cdot \inf_{\{(b,\eta,o)\in\mathbb{R}^{2p}:\, o\in O_n\}} \Big\{ (b-\beta^*)^T Q(b-\beta^*)/2 + \eta^T N\eta/2 + \Big(\mathcal{P}_E\begin{pmatrix} b \\ \eta\end{pmatrix} + \mathcal{Q}_E\, o + r_E/n^{\delta}\Big)^T \Sigma_G^{-1}\Big(\mathcal{P}_E\begin{pmatrix} b \\ \eta\end{pmatrix} + \mathcal{Q}_E\, o + r_E/n^{\delta}\Big)/2 \Big\}, \tag{31}
\]
where the constraints in the two objectives are given, respectively, by
\[
H_n = \Big\{ (b,\eta,w)\in\mathbb{R}^{2p} : A_E\begin{pmatrix} b \\ \eta\end{pmatrix} + B_E w \le n^{-\delta}b_E \Big\}; \qquad
O_n = \{ o\in\mathbb{R}^p : \operatorname{sgn}(n^{\delta}o_E) = s_E,\; \|n^{\delta}o_{-E}\|_{\infty} \le \lambda \}.
\]

Note that in the new form of the optimization problem, the variables $b$ and $\eta$ are unconstrained, while $o$ has the simple constraint given above. For the parametrization of $\beta^{E'}_n$ considered, we obtain a more flexible form of the optimization problem as
\[
\log\tilde{\mathbb{P}}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \mid \beta^{E'}_n\big) = -n^{2\delta}\cdot \inf_{(b,\eta,o)\in\mathbb{R}^{2p}} \Big\{ (b-\beta^*)^T Q(b-\beta^*)/2 + \eta^T N\eta/2 + \big(\mathcal{P}_E(b\;\eta)^T + \mathcal{Q}_E\, o + r_E/n^{\delta}\big)^T \Sigma_G^{-1}\big(\mathcal{P}_E(b\;\eta)^T + \mathcal{Q}_E\, o + r_E/n^{\delta}\big)/2 + n^{-2\delta}\psi_{n^{-\delta}}(o_E, o_{-E})\Big\}, \tag{32}
\]


where $\psi_s(o) = \psi_s(o_E, o_{-E})$ is some penalty function corresponding to the set $O_n$, with a scaling factor $s$. This specializes to (31) by taking $\psi_s(o_E, o_{-E})$ to be the characteristic function
\[
I_{O}(o_E, o_{-E}) = \begin{cases} 0 & \text{if } o \in O \\ \infty & \text{otherwise.} \end{cases}
\]

In the next and crucial step, instead of the characteristic function that restricts the optimizing variables to the set $O_n$, we use a smoother nonnegative penalty function: we replace $I_{O_n}(o_E, o_{-E})$ with a suitable "barrier" penalty function $\psi_s$ that reflects preference for values of $o$ farther away from the boundary and inside the constraint region $O_n$, by taking on smaller values for such $o$. Specifically, we use $\psi_{n^{-\delta}}$ defined by
\[
\psi_{n^{-\delta}}(o) \equiv \psi_{n^{-\delta}}(o_E, o_{-E}) = \sum_{i=1}^{|E|} \log\Big(1 + \frac{1}{s_{i,E}\, n^{\delta} o_{i,E}}\Big) + \sum_{i=1}^{p-|E|}\Big\{ \log\Big(1 + \frac{1}{\lambda - n^{\delta} o_{i,-E}}\Big) + \log\Big(1 + \frac{1}{\lambda + n^{\delta} o_{i,-E}}\Big)\Big\}. \tag{33}
\]
In line with Remark 3.2 for the univariate thresholding example, we remark that $\delta$ in the parameterization of $\beta^{E'}_n$ is only a theoretical construct. This ultimately leads to an approximation to the log of the selection-adjusted posterior (7), as

\[
\log\tilde{\pi}_S(\beta^{E'}_n \mid \hat{\beta}^{E'}) = \log\pi(\beta^{E'}_n) - n(\hat{\beta}^{E'} - \beta^{E'}_n)^T Q(\hat{\beta}^{E'} - \beta^{E'}_n)/2 - \log\tilde{\mathbb{P}}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \mid \beta^{E'}_n\big), \tag{34}
\]

where we use (33) in (32). To be more explicit, the last term in (34) can be obtained from (32) as
\[
- n\cdot \inf_{(b',\eta',o')\in\mathbb{R}^{2p}} \Big\{ (b' - \beta^{E'}_n)^T Q(b' - \beta^{E'}_n)/2 + \eta'^T N\eta'/2 + \big(\mathcal{P}_E(b'\;\eta')^T + \mathcal{Q}_E\, o' + r_E/\sqrt{n}\big)^T \Sigma_G^{-1}\big(\mathcal{P}_E(b'\;\eta')^T + \mathcal{Q}_E\, o' + r_E/\sqrt{n}\big)/2 + \psi_{n^{-1/2}}(o'_E, o'_{-E})/n \Big\}
\]
by substituting $\sqrt{n}\,b' = n^{\delta}b$, $\sqrt{n}\,\eta' = n^{\delta}\eta$, $\sqrt{n}\,o' = n^{\delta}o$.
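As a computational illustration, the following Python sketch carries out this unconstrained minimization with the barrier (33) at scale $n^{-1/2}$, returning the approximate log selection probability. All inputs (Q, N, Σ_G, the matrices $\mathcal{P}_E$, $\mathcal{Q}_E$, $r_E$, the signs $s_E$, and λ) are random stand-ins generated for illustration only, and the optimizer choice is an assumption rather than the authors' implementation.

```python
# Minimal sketch (toy inputs; not the authors' implementation) of the barrier
# approximation to the log selection probability via the unconstrained problem above.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, nE = 400, 4, 2          # sample size, total and selected number of variables
lam = 2.0                     # Lasso tuning parameter
s_E = np.array([1.0, -1.0])   # signs of the selected Lasso coefficients

def rand_pd(k):
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

Q, N, Sigma_G = rand_pd(nE), rand_pd(p - nE), rand_pd(p)
P_E = rng.standard_normal((p, p))                     # stand-in for calligraphic P_E
Q_E = rng.standard_normal((p, p))                     # stand-in for calligraphic Q_E
r_E = np.concatenate([lam * s_E, np.zeros(p - nE)])
Sigma_G_inv = np.linalg.inv(Sigma_G)

def barrier(o):                                       # the penalty (33), scale n^{-1/2}
    oE, oI = np.sqrt(n) * o[:nE], np.sqrt(n) * o[nE:]
    if np.any(s_E * oE <= 0) or np.any(np.abs(oI) >= lam):
        return np.inf                                 # outside the constraint region O_n
    return (np.sum(np.log1p(1.0 / (s_E * oE)))
            + np.sum(np.log1p(1.0 / (lam - oI)))
            + np.sum(np.log1p(1.0 / (lam + oI))))

def approx_log_selection_prob(beta):
    def obj(v):
        b, eta, o = v[:nE], v[nE:p], v[p:]
        lin = P_E @ np.concatenate([b, eta]) + Q_E @ o + r_E / np.sqrt(n)
        return ((b - beta) @ Q @ (b - beta) / 2 + eta @ N @ eta / 2
                + lin @ Sigma_G_inv @ lin / 2 + barrier(o) / n)
    o0 = np.concatenate([0.5 * s_E / np.sqrt(n), np.zeros(p - nE)])   # interior start
    x0 = np.concatenate([beta, np.zeros(p - nE), o0])
    res = minimize(obj, x0, method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-10})
    return -n * res.fun

print(approx_log_selection_prob(np.array([0.5, -0.5])))
```

In an actual analysis, the matrices above would be replaced by the empirical versions of the quantities defined in Propositions 4.1, 4.2 and 5.3.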

    We now present some simulation results which demonstrate the effectiveness of our methods.

Example 1. In each of 50 rounds, we draw an $n\times p$ matrix $X$ with $n = 500$, $p = 100$ such that the rows $x^{(i)} \sim \mathcal{N}_p(0, \Sigma)$, $i = 1, 2, \dots, 500$, where the $(j,k)$-th entry of $\Sigma$ equals $0.20^{|j-k|}$. We then draw a pair $(\beta, y)$, where the components of $\beta \in \mathbb{R}^{100\times 1}$ are i.i.d. from
\[
0.9\cdot\mathcal{N}(0, 0.1) + 0.1\cdot\mathcal{N}(0, V), \tag{35}
\]

a mixture of two zero-mean normal distributions, one with small variance 0.1 and the other with larger variance $V$, and $y \mid X, \beta \sim \mathcal{N}_n(X\beta, I)$. The variance $V \in \{5, 3, 2\}$, roughly corresponding to signal-to-noise ratios 0.50, 0.25, 0.10, respectively. In the selection stage, we select at random a subset $S \subset \{1,\dots,n\}$ of size $|S| = n/2 = 250$, and denote by $(y_S, X_S)$ the corresponding data. For a theoretical value of the tuning parameter, $\lambda = E[\|X^T\Psi\|_{\infty}]$, $\Psi \sim \mathcal{N}_n(0, I)$ (as proposed in Negahban et al., 2009), denote by $E \subset \{1,\dots,100\}$ the set corresponding to the nonzero Lasso estimates, obtained by solving
\[
\underset{\beta}{\operatorname{argmin}} \;\; \frac{1}{2\rho}\,\|y_S - X_S\beta\|^2 + \lambda\|\beta\|_1.
\]
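The theoretical tuning parameter can be estimated by straightforward Monte Carlo; the short sketch below does this for a design drawn as in Example 1 (the number of Monte Carlo replicates and the seed are arbitrary illustrative choices).

```python
# Minimal sketch (illustrative, not the authors' code): Monte Carlo estimate of the
# theoretical tuning parameter lambda = E[ || X^T Psi ||_inf ], Psi ~ N_n(0, I),
# as proposed in Negahban et al. (2009) and used in Example 1.
import numpy as np

rng = np.random.default_rng(0)
n, p, n_mc = 500, 100, 200

# Correlated design as in Example 1: rows i.i.d. N_p(0, Sigma), Sigma_jk = 0.2^|j-k|
Sigma = 0.2 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

lam = np.mean([np.max(np.abs(X.T @ rng.standard_normal(n))) for _ in range(n_mc)])
print("theoretical lambda:", lam)
```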

In the inference stage we have access to the entire data $(X^{(i)}, Y^{(i)})$, $i = 1,\dots,n$. We give inference assuming that the model is
\[
y \mid X \sim \mathcal{N}(X_E\beta^E, \sigma^2 I),
\]


              V = 5 (SNR = 0.5)            V = 3.5 (SNR = 0.25)          V = 2 (SNR = 0.10)
              1-FCR   Length   Rel. Risk   1-FCR   Length   Rel. Risk    1-FCR   Length   Rel. Risk
Carving       0.884   3.92     0.21        0.887   4.04     0.39         0.905   4.22     0.66
Naive         0.753   3.33     0.25        0.728   3.33     0.44         0.598   3.31     0.75
Split         0.900   4.80     0.25        0.915   4.79     0.44         0.902   4.77     0.70
Lee et al     0.913   8.38     0.29        0.939   9.58     0.46         0.863   10.73    0.83

Table 1: Summary of Example 1. Four methods are compared in three SNR regimes. For Lee et al, the numbers displayed for length of CI are averages of constructed intervals that had finite length; for V = 5, 3.5, 2, Lee et al produced infinitely long intervals for 1.4%, 1.95% and 4.73% of the selected parameters, on average.

in other words, we take $E' = E$ in (2); of course, this model is (usually) misspecified, in the sense that $E \neq \{j : \beta_j \neq 0\}$. Ancillarity now entails conditioning on $X$ when giving inference for $\beta^{E'}$. Four different methods for inference are compared:

Unadjusted (Naive). Bayesian inference for $\beta^E$ using a noninformative prior $\pi(\beta^E) \propto 1$ and the unadjusted likelihood $y \mid X \sim \mathcal{N}_n(X_E\beta^E, \sigma^2 I)$.

Split. Bayesian inference using only the confirmatory (held-out) data: a noninformative prior $\pi(\beta^E) \propto 1$ is prepended to the unadjusted likelihood $y_{S^c} \mid X_{S^c} \sim \mathcal{N}(X_{S^c, E}\,\beta^E, \sigma^2 I)$.

Carving. Bayesian inference for $\beta^E$ using a noninformative prior $\pi(\beta^E) \propto 1$ and the approximate selection-adjusted likelihood,
\[
\tilde{\pi}_S(\beta^E \mid \hat{\beta}^E) \propto \frac{\exp\big(-n(\hat{\beta}^E - \beta^E)^T Q(\hat{\beta}^E - \beta^E)/2\sigma^2\big)}{\tilde{\mathbb{P}}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \mid \beta^E\big)}, \tag{36}
\]
where the denominator on the right-hand side is given by (32) with the penalty defined in (33).

Lee et al. We use the Selective Inference package in R to obtain estimates with the methods of Lee et al. (2016). Because there is no implementation for carving, the entire data is used for selection; that is, $\hat{E}$ is the subset of indices corresponding to the nonzero elements of
\[
\underset{\beta}{\operatorname{argmin}}\;\; \frac{1}{2}\,\|y - X\beta\|^2 + \lambda\|\beta\|_1.
\]
Exact inference conditional on $\hat{E}$ and the corresponding signs is given coordinate-wise using the methods of Lee et al. (2016), which correct for selection.
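Credible intervals for the Bayesian methods above are read off posterior quantiles (cf. Section 1.2); the paper's own sampling scheme is described in Section 8 and is not reproduced here. As one generic illustration of this step, a random-walk Metropolis sampler over an approximate selection-adjusted log-posterior could look as follows, where approx_log_posterior is a hypothetical user-supplied callable returning the logarithm of (36) up to an additive constant, and the step size and iteration count are arbitrary.

```python
# Minimal sketch (generic Metropolis sampler, not the authors' sampler) for drawing
# posterior samples of beta_E from an approximate selection-adjusted posterior.
import numpy as np

def metropolis(approx_log_posterior, beta0, n_iter=5000, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta0, dtype=float)
    logp = approx_log_posterior(beta)
    samples = []
    for _ in range(n_iter):
        prop = beta + step * rng.standard_normal(beta.shape)
        logp_prop = approx_log_posterior(prop)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept / reject
            beta, logp = prop, logp_prop
        samples.append(beta.copy())
    return np.array(samples)

# Usage sketch: samples = metropolis(approx_log_posterior, beta0=np.zeros(k));
# marginal 90% credible intervals: np.quantile(samples, [0.05, 0.95], axis=0)
```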

We compare the methods above on the following criteria: (i) we use each method to construct (marginal) 90% interval estimates, and calculate the average (over simulation rounds) proportion of covering intervals (this is reported as 1−FCR in the tables below); (ii) lengths of constructed CIs; and (iii) relative prediction risk,
\[
\frac{(\hat{\beta} - \beta)^T (X^T X)(\hat{\beta} - \beta)}{\beta^T (X^T X)\beta},
\]
where the 'inactive' coordinates $\{j \notin E\}$ of $\beta$ and $\hat{\beta}$ are set to zero, and the estimates $\hat{\beta}$ are the posterior means for the first three methods, and the plain Lasso estimate for "Lee et al.".

We see that for all methods except the unadjusted, the coverage, as measured by one minus the false coverage rate (FCR), is roughly the nominal level 0.9. In particular, the CIs constructed based on the proposed approximation to the selection-adjusted posterior have good coverage; we emphasize again that this is in spite of the fact that the assumed model is (usually) misspecified. Meanwhile, the length of the intervals for the proposed method (this is "carving" in the tables) is much smaller than that for Lee et al intervals, which are sometimes infinitely long. More importantly, the carved intervals based on our method are smaller in length than the intervals for sample splitting. This matches our expectations, because leftover information is not utilized at all in sample-splitting.

    6 Point estimates

This section elaborates on point estimation after selection. In our Bayesian selection-adjusted framework, a natural estimate for the model parameters $\beta^{E'}_n$ is the selection-adjusted maximum a-posteriori (MAP) statistic, the maximizer in $\beta^{E'}_n$ of (7). The selection-adjusted MLE obtains as a special case when we use a constant prior. In any case, this would again require the ability to evaluate the adjustment factor as a function of $\beta^{E'}_n$, hence we cannot implement the exact MAP. However, we may consider an approximate MAP by replacing the adjustment factor in the selection-adjusted likelihood with our workable approximation. Specifically, we will show that using the approximation given by (32) with the choice of penalty (33) has various desirable and nontrivial features. Before we proceed, it is worth mentioning that point estimates can be obtained with the methods of Lee et al. (2016), which rely on an exact truncated univariate normal distribution. Unsurprisingly, such estimates might be suboptimal because, stated informally, they do not utilize all of the information about the parameters in the sample and are tied exclusively to the saturated model. By contrast, the methods suggested below are based on the full (conditional) likelihood, defined for a fairly flexible class of models.

We study first the approximate MAP in the univariate gaussian example of Section 3. Again, in that case we can implement the exact MAP. As before, we assume (9) with $\beta = \beta_n$ such that $\sqrt{n}\,\beta_n = n^{\delta}\beta^*$, $0 < \delta \le 1/2$. The selection event $\{\sqrt{n_1}\,\bar{Y}^S > 0\} \equiv \{\sqrt{n}(\bar{Y}_n + W_n) > 0\}$ is of the form
\[
\sqrt{n}(\bar{Y}_n + W_n)/n^{\delta} \in K,
\]

where $K$ is an interval on the real line. The selection-adjusted (log-)likelihood is
\[
\mathcal{L}^n_S(\beta_n) = -n(\bar{y}_n - \beta_n)^2/2 - \log\mathbb{P}\big(\sqrt{n}(\bar{Y}_n + W_n)/n^{\delta}\in K \mid \beta_n\big), \tag{37}
\]
and our approximate version of the selection-adjusted (log-)likelihood is
\[
\tilde{\mathcal{L}}^n_S(\beta_n) = -n(\bar{y}_n - \beta_n)^2/2 + n^{2\delta}\cdot\inf_{(z,w)\in\mathbb{R}^2}\left\{ \frac{(z-\beta^*)^2}{2} + \frac{\rho w^2}{2(1-\rho)} + \frac{1}{n^{2\delta}}\,\psi_{n^{-\delta}}(z+w)\right\}. \tag{38}
\]
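To make the estimation step concrete, the sketch below (illustrative values of n, ρ and a simulated data set; not the authors' code) computes the approximate selective MLE by maximizing (38) in the practical parameterization of Remark 3.2, using the barrier (16).

```python
# Minimal sketch of the approximate selective MLE in the univariate setting.
import numpy as np
from scipy.optimize import minimize, minimize_scalar

n, rho = 200, 0.5

def neg_log_sel_prob(beta):
    # n * inf_{z,w} {(z-beta)^2/2 + rho w^2/(2(1-rho)) + psi(z+w)/n} approximates
    # -log P(selection | beta), with psi(x) = log(1 + 1/(sqrt(n) x)) for x > 0
    def obj(v):
        z, w = v
        if z + w <= 0:
            return np.inf
        return ((z - beta) ** 2 / 2 + rho * w ** 2 / (2 * (1 - rho))
                + np.log1p(1.0 / (np.sqrt(n) * (z + w))) / n)
    return n * minimize(obj, x0=[max(beta, 0.0) + 1.0, 0.0], method="Nelder-Mead").fun

def approx_selective_mle(ybar):
    # maximize (38): -n (ybar - beta)^2 / 2 + [n * inf{...}], i.e. minimize its negative
    neg_loglik = lambda b: n * (ybar - b) ** 2 / 2 - neg_log_sel_prob(b)
    return minimize_scalar(neg_loglik, bounds=(ybar - 3.0, ybar + 3.0),
                           method="bounded").x

rng = np.random.default_rng(1)
beta_true = -0.1
while True:                                   # simulate one data set passing (10)
    y = rng.normal(beta_true, 1.0, size=n)
    S = rng.choice(n, size=int(rho * n), replace=False)
    if y[S].mean() > 0:
        break
print("unadjusted:", y.mean(), "approximate selective MLE:", approx_selective_mle(y.mean()))
```

Theorem 6.3 below guarantees consistency of this estimate under the carving scheme, while Theorem 6.4 shows that consistency fails without randomization.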

First we prove consistency of the approximate selective MLE. As in Theorem 3.1, the statement in Theorem 6.3 holds also for δ = 1/2; compare to Theorem 6.9. We will note that consistency of the approximate selective MLE hinges on strong convexity of the negative logarithm of the approximate selection-adjusted likelihood, which leads to a contraction identity as in Lemma 6.2. In Theorem 6.3 we additionally use the fact that the variance of a gaussian random variable is smaller when restricted to a convex set (Kanter and Proppe, 1977). Before stating the main theorem, in the following two lemmas we make crucial observations about the sequence of approximate likelihoods.

Lemma 6.1 (Strong convexity). Let $\sqrt{n}\,\beta_n = n^{\delta}\beta^*$, $0 < \delta \le 1/2$. The approximate selection-adjusted log-likelihood in (38) equals
\[
\tilde{\mathcal{L}}^n_S(\beta_n) = n\bar{y}_n\beta_n - n^{2\delta}\cdot\tilde{C}_n(\beta^*) - n\bar{y}_n^2/2,
\]
where
\[
\tilde{C}_n(\beta^*) = (1-\rho)\cdot\beta^{*2}/2 + \bar{H}^*_n(\rho\beta^*), \tag{39}
\]
and with $\bar{H}^*_n(\cdot)$ denoting the convex conjugate of $\bar{H}_n(\bar{z}) = \rho\cdot\bar{z}^2/2 + n^{-2\delta}\cdot\psi_{n^{-\delta}}(\bar{z})$. Moreover, $\tilde{C}_n(\cdot)$ is strongly convex with index of convexity lower bounded by $(1-\rho)$.

Lemma 6.2. For a sequence $\beta_n$ as in Lemma 6.1, the maximizer $\hat{\beta}_S$ of (38) satisfies
\[
n^{1-2\delta}(\hat{\beta}_S - \beta_n)^2 \le \frac{1}{(1-\rho)^2}\big(n^{1/2-\delta}\bar{y}_n - \nabla\tilde{C}_n(\beta^*)\big)^2,
\]
where $\tilde{C}_n$ is given by (39).

    We are now ready to prove the consistency guarantees associated with our approximate selective MLE.

Theorem 6.3. Let $K \subset \mathbb{R}$ be a convex set, and denote by $\hat{\beta}_S$ the maximizer of (38). Then, for $0 < \delta \le 1/2$ and any $\epsilon > 0$,
\[
\mathbb{P}\big(n^{1/2-\delta}\,|\hat{\beta}_S - \beta_n| > \epsilon \;\big|\; \sqrt{n}(\bar{Y}_n + W_n)/n^{\delta}\in K\big) \longrightarrow 0
\]
as $n\to\infty$.

Figure 2b shows the approximate MLE against the exact MLE. To complete the picture, we show that without randomization, the maximizer of the approximate truncated likelihood is not consistent. This further highlights the importance of holding out some samples at the selection stage, exclusively for inference. If selection were based on the entire data, then we could still employ our approximation for the selection probability $\mathbb{P}(\sqrt{n}\,\bar{Y}_n/n^{\delta}\in K \mid \beta_n)$. In particular, when $K = (0,\infty)$ (as considered before), the approximate log-selection probability takes the form
\[
-n\cdot\inf_{z\in\mathbb{R}}\left\{ \frac{(z-\beta_n)^2}{2} + \frac{1}{n}\log\Big(1 + \frac{1}{\sqrt{n}\,z}\Big)\right\}.
\]

Theorem 6.4. Let $\beta_n \equiv \beta^* < 0$. Consider the approximate (non-randomized) selection-adjusted log-likelihood,
\[
\tilde{\mathcal{L}}^n_S(\beta_n) = -n(\bar{y}_n - \beta_n)^2/2 + n\cdot\inf_{z\in\mathbb{R}}\left\{ \frac{(z-\beta_n)^2}{2} + \frac{1}{n}\log\Big(1 + \frac{1}{\sqrt{n}\,z}\Big)\right\}. \tag{40}
\]
Then the maximizer $\hat{\beta}_S$ of (40) does not converge in probability to $\beta^*$ as $n\to\infty$.

Remark 6.5. Theorem 6.4 is stated for the barrier approximation in (16) for clarity of exposition. More generally, the approximate non-randomized selective MLE is not consistent as long as the barrier function $\psi(\cdot)$ satisfies
\[
x\,\nabla\psi(x) \to \text{constant} \quad\text{as } x\downarrow 0.
\]

As a consequence of Theorem 6.3, we show next a form of consistency of the selection-adjusted posterior law with respect to a fixed prior. This guarantees that the approximate selection-adjusted posterior concentrates around the true parameter $\beta_n$ in an asymptotic sense.

Theorem 6.6. Consider a ball of radius $\delta$ around $\beta_n$,
\[
B(\beta_n, \delta) := \{b_n : |b_n - \beta_n| \le \delta\},
\]
and suppose that $\pi$ is a prior which assigns nonzero probability to $B(\beta_n, \delta)$ for any $\delta > 0$. Under the conditions in Theorem 6.3, for any $\epsilon > 0$ we have
\[
\mathbb{P}\big(\Pi_S(B^c(\beta_n, \delta) \mid \bar{Y}_n) > \epsilon \;\big|\; \sqrt{n}(\bar{Y}_n + W_n)/n^{\delta}\in K\big) \to 0
\]
as $n\to\infty$, where
\[
\Pi_S\big(B^c(\beta_n,\delta) \mid \bar{y}_n\big) := \frac{\int_{B^c(\beta_n,\delta)} \pi(b_n)\cdot\exp(\tilde{\mathcal{L}}^n_S(b_n))\,db_n}{\int \pi(b_n)\cdot\exp(\tilde{\mathcal{L}}^n_S(b_n))\,db_n}
\]
is the posterior probability under the approximate truncated likelihood.

Having established consistency for the univariate example, we now move on to the general case. Thus, consider the MAP estimator, given as the maximizer of (34). For a constant prior this reduces to the approximate maximum-likelihood estimate,
\[
\hat{\beta}^{E'}_S = \underset{\beta^{E'}_n}{\operatorname{argmin}}\;\Big\{ n(\hat{\beta}^{E'} - \beta^{E'}_n)^T Q(\hat{\beta}^{E'} - \beta^{E'}_n)/2 + \log\tilde{\mathbb{P}}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \mid \beta^{E'}_n\big)\Big\}, \tag{41}
\]

where in (32) we use (33). Before presenting the main result for this section, we state two key lemmas. Recall the quantities $\mathcal{P}_E$, $\mathcal{Q}_E$, $r_E$ defined in Proposition 5.3. Let $\mathcal{P}_E^{E'}$ denote the matrix that consists of the columns in $\mathcal{P}_E$ corresponding to the coordinates in $E'$. Similarly, $\mathcal{P}_E^{-E'}$ denotes the matrix consisting of the remaining $p - |E'|$ columns.

Lemma 6.7. Under the parameterization $\sqrt{n}\,\beta^{E'}_n = n^{\delta}\beta^*$ with $\delta\in(0,1/2)$, the approximate log-partition function
\[
n\,\beta^{E'\,T}_n Q\,\beta^{E'}_n/2 + \log\tilde{\mathbb{P}}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \mid \beta^{E'}_n\big) = n^{2\delta}\,\tilde{C}_n(Q\beta^*),
\]
where we define
\[
\tilde{C}_n(Q\beta^*) := \beta^{*T}Q\big(Q + \mathcal{P}_E^{E'\,T}\Sigma_G^{-1}\mathcal{P}_E^{E'}\big)^{-1}Q\beta^*/2 + h^*\big(M_1 Q\beta^*\big) + \beta^{*T}Q M_2
\]
with
\[
M_1 = -\begin{bmatrix}\mathcal{P}_E^{-E'} & \mathcal{Q}_E\end{bmatrix}^T\Sigma_G^{-1}\mathcal{P}_E^{E'}\big(Q + \mathcal{P}_E^{E'\,T}\Sigma_G^{-1}\mathcal{P}_E^{E'}\big)^{-1}, \qquad
M_2 = -\big(Q + \mathcal{P}_E^{E'\,T}\Sigma_G^{-1}\mathcal{P}_E^{E'}\big)^{-1}\mathcal{P}_E^{E'\,T}\Sigma_G^{-1}r_E/n^{\delta},
\]
and where $h^*(\cdot)$ is the convex conjugate of the function
\[
h(\eta, o) = \eta^T N\eta/2 + L(\eta, o)^T\Big(\Sigma_G^{-1} - \Sigma_G^{-1}\mathcal{P}_E^{E'}\big(Q + \mathcal{P}_E^{E'\,T}\Sigma_G^{-1}\mathcal{P}_E^{E'}\big)^{-1}\mathcal{P}_E^{E'\,T}\Sigma_G^{-1}\Big)L(\eta, o)/2 + n^{-2\delta}\psi_{n^{-\delta}}(o_E, o_{-E})
\]
with $L(\eta, o) = \mathcal{P}_E^{-E'}\eta + \mathcal{Q}_E\, o + r_E/n^{\delta}$.

Lemma 6.8. Under the parameterization $\sqrt{n}\,\beta^{E'}_n = n^{\delta}\beta^*$ with $\delta\in(0,1/2)$ and when $X_{E'}$ is of full column rank, the approximate log-partition function $n^{2\delta}\tilde{C}_n(Q\beta^*)$, corresponding to the approximate negative log-likelihood,
\[
n(\hat{\beta}^{E'} - \beta^{E'}_n)^T Q(\hat{\beta}^{E'} - \beta^{E'}_n)/2 + \log\tilde{\mathbb{P}}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \mid \beta^{E'}_n\big),
\]
is strongly convex. Furthermore, the index of strong convexity for $\tilde{C}_n(Q\beta^*)$ is bounded below by $\lambda_{\min}$, the smallest eigenvalue of
\[
\big(Q + \mathcal{P}_E^{E'\,T}\Sigma_G^{-1}\mathcal{P}_E^{E'}\big)^{-1}.
\]


Using Lemma 6.8, we are now able to prove a consistency result for the approximate selective MLE. We note here again that randomization is crucial for the theorem below to hold, as we saw already for the univariate example.

Theorem 6.9. When $\sqrt{n}\,\beta^{E'}_n = n^{\delta}\beta^*$ for $\delta\in(0,1/2)$, the approximate selective MLE given in (41) is $n^{1/2-\delta}$-consistent for $\beta^{E'}_n$ under the selection-adjusted law (5):
\[
\mathbb{P}\big(n^{1/2-\delta}\,\|\hat{\beta}^{E'}_S - \beta^{E'}_n\| > \epsilon \;\big|\; A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E\big) \longrightarrow 0
\]
as $n\to\infty$.

We move on to show a consistency result for the posterior distribution with respect to a fixed prior. We will need the following lemma, where we obtain finite-sample bounds on the log-likelihood ratios at the approximate selective MLE and an arbitrary value of the parameter.

Lemma 6.10. Assume the conditions and parameterization in Lemma 6.8. Denote the logarithm of the approximate truncated likelihood by
\[
\tilde{\mathcal{L}}^n_S(\beta^{E'}_n) = \sqrt{n}\,\hat{\beta}^{E'\,T} Q\,\sqrt{n}\,\beta^{E'}_n - n^{2\delta}\,\tilde{C}_n(n^{1/2-\delta}Q\beta^{E'}_n) - n\,\hat{\beta}^{E'\,T} Q\hat{\beta}^{E'}/2,
\]
where $\tilde{C}_n(n^{1/2-\delta}Q\beta^{E'}_n) = \tilde{C}_n(Q\beta^*)$ is defined in Lemma 6.8. Then we have
\[
-n\cdot(\hat{\beta}^{E'}_S - \beta^{E'}_n)^T Q(\hat{\beta}^{E'}_S - \beta^{E'}_n)/2 \;\le\; \tilde{\mathcal{L}}^n_S(\beta^{E'}_n) - \tilde{\mathcal{L}}^n_S(\hat{\beta}^{E'}_S) \;\le\; -n\,\tilde{\lambda}_{\min}\cdot(\hat{\beta}^{E'}_S - \beta^{E'}_n)^T(\hat{\beta}^{E'}_S - \beta^{E'}_n)/2,
\]
where $\tilde{\lambda}_{\min}$ is the smallest eigenvalue of $Q\big(Q + \mathcal{P}_E^{E'\,T}\Sigma_G^{-1}\mathcal{P}_E^{E'}\big)^{-1}Q$, and $\hat{\beta}^{E'}_S$ is the approximate selective MLE in (41).

Theorem 6.11. Assume the conditions in Theorem 6.9. Consider a ball of radius $\delta$ around the truth $\beta^{E'}_n$,
\[
B(\beta^{E'}_n, \delta) \equiv \{b_n : \|b_n - \beta^{E'}_n\| \le \delta\},
\]
and suppose that $\pi$ is a prior that assigns nonzero probability to $B(\beta^{E'}_n, \delta)$ for any $\delta > 0$. Then
\[
\mathbb{P}\big(\Pi_S(B^c(\beta^{E'}_n, \delta)\,|\,\hat{\beta}^{E'}) > \epsilon \,\big|\, A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E\big) \to 0 \quad \text{as } n \to \infty
\]
for any $\epsilon > 0$, where $\Pi_S(\cdot)$ denotes the posterior probability under the approximate truncated likelihood,
\[
\Pi_S\big(B^c(\beta^{E'}_n, \delta)\,\big|\,\hat{\beta}^{E'}\big) := \frac{\int_{B^c(\beta^{E'}_n,\delta)} \pi(b_n)\exp\big(\tilde{L}^n_S(b_n)\big)\,db_n}{\int \pi(b_n)\exp\big(\tilde{L}^n_S(b_n)\big)\,db_n},
\]
for $\tilde{L}^n_S(b_n) = \sqrt{n}\,(\hat{\beta}^{E'})^T Q \sqrt{n}\,b_n - n^{2\delta}\tilde{C}_n(n^{1/2-\delta} Q b_n) - n\,(\hat{\beta}^{E'})^T Q \hat{\beta}^{E'}/2$.
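In practice, the posterior probability appearing in Theorem 6.11 can be estimated directly from draws of the approximate selection-adjusted posterior, for instance those produced by the sampler described in Section 8. The following minimal sketch (all names are ours and purely illustrative) computes the empirical mass outside a ball from such draws.

```python
import numpy as np

def posterior_mass_outside_ball(draws, center, delta):
    """draws: (n_samples, |E'|) array of posterior samples of beta^{E'};
    returns the fraction of draws outside the ball B(center, delta)."""
    dist = np.linalg.norm(draws - np.asarray(center), axis=1)
    return np.mean(dist > delta)
```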

    The next result has implications for the computation of the approximate selective MLE.

Theorem 6.12 (Convexity of approximate MAP with general approximation). Let $\pi(\cdot)$ be a log-concave prior. For any barrier function $\psi_s(\cdot)$, minimizing the negative of the approximate log-posterior based on the approximation in (32),
\[
-\log \pi(\beta^{E'}_n) + n(\hat{\beta}^{E'} - \beta^{E'}_n)^T Q (\hat{\beta}^{E'} - \beta^{E'}_n)/2 + \log \tilde{\mathbb{P}}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \,\big|\, \beta^{E'}_n\big),
\]
over $\beta^{E'}_n$, is a convex optimization problem for any $n \in \mathbb{N}$.


Remark 6.13 (Uniform convergence on compact sets). The approximation $\log \tilde{\mathbb{P}}(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \,|\, \beta^{E'}_n)$ for the (log-)selection probability is continuous in $\beta^{E'}_n$, and so is the true selection probability in $\beta^{E'}_n$. Hence the difference
\[
n^{-2\delta}\Big\{\log \tilde{\mathbb{P}}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \,\big|\, \beta^{E'}_n\big) - \log \mathbb{P}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \,\big|\, \beta^{E'}_n\big)\Big\}
\]
converges uniformly on a compact subset $\Theta \subset \mathbb{R}^{E'}$ of the parameter space.

    Finally, it is natural to ask how our approximate MLE compares to the exact MLE,

\[
\breve{\beta}^{E'}_S = \operatorname*{argmin}_{\beta^{E'}_n} \Big\{ n(\hat{\beta}^{E'} - \beta^{E'}_n)^T Q (\hat{\beta}^{E'} - \beta^{E'}_n)/2 + \log \mathbb{P}\big(A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E \,\big|\, \beta^{E'}_n\big) \Big\}. \tag{42}
\]

The following theorem asserts that the approximate version converges to the exact MLE. Readers may consult Hjort and Pollard (2011) for the arguments underlying the proof of Theorem 6.14; we provide a proof in the Appendix for completeness.

Theorem 6.14. Under the parameterization in Theorem 6.9 for a $\delta \in (0, 1/2)$, for any $\epsilon > 0$ we have
\[
\mathbb{P}\big(n^{1/2-\delta}\|\hat{\beta}^{E'}_S - \breve{\beta}^{E'}_S\| > \epsilon \,\big|\, A_E\sqrt{n}\,T_n + B_E\sqrt{n}\,W_n < b_E\big) \to 0
\]
as $n \to \infty$.

    7 HIV drug-resistance data

In this section we apply our methods to the HIV dataset analyzed in Rhee et al. (2006) and Bi et al. (2020). In an attempt to understand the genetic basis of drug resistance in HIV, Rhee et al. (2006) used markers of inhibitor mutations to predict susceptibility to 16 antiretroviral drugs. We follow Bi et al. (2020) and focus on the protease inhibitor subset of the data and on one particular drug, Lamivudine (3TC), where the goal is to identify mutations associated with response to 3TC. There are $n = 633$ cases and $p = 91$ different mutations occurring more than 10 times in the sample.

In the selection stage we applied the Lasso to an 80% split of the data, with the regularization parameter set to the theoretical value proposed in Negahban et al. (2009). This resulted in 17 selected mutations, corresponding to the nonzero Lasso estimates. Figure 3 shows 90% interval estimates constructed for the selected variables (excluding mutation 'P184V') according to four different methods: "naive" is the usual, unadjusted intervals using the entire data; "split" uses only the 20% left-out portion of the data to construct the intervals; "Lee" is the adjusted confidence intervals of Lee et al. (2016), based on the original 80% portion used for selection; finally, "carved" are the intervals relying on the methods we propose in the current paper. Of course, we cannot assume a given $\sigma^2$ in this analysis, and we offer two different implementations to handle an unknown variance. The first is to estimate $\sigma^2$ from a least squares fit against the 91 available predictors, and then simply plug this estimate in (as if it were data-independent). The second approach is to add a prior on $\sigma^2$ as described in Section 8. Specifically, we model

\[
\pi(\beta^{E'}\,|\,\sigma^2) \propto \sigma^{-|E'|}\exp\big(-\|\beta^{E'}\|^2/(2c^2\sigma^2)\big), \qquad \pi(\sigma^2) \propto \text{Inv-Gamma}(a; b), \tag{43}
\]

where we denote $\text{Inv-Gamma}(a; b) := (\sigma^2)^{-a-1}\exp(-b/\sigma^2)$, and where $c^2$ is a large constant. In this example we used $a = b = 0.1$ for the other hyperparameters. The two implementations are referred to as "plug-in" and "full Bayes" in Figure 3.
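For concreteness, the sketch below illustrates the selection stage just described, with synthetic data standing in for the HIV design matrix and 3TC response. The regularization level, of order $\sigma\sqrt{2\log p/n}$, is only meant to be in the spirit of the theoretical value, and the mapping to scikit-learn's alpha parameter (which scales the squared loss by $1/(2n)$) is an assumption of this sketch rather than the exact recipe used in our analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 633, 91                                        # sizes matching the data description
X = rng.binomial(1, 0.1, size=(n, p)).astype(float)   # placeholder mutation indicators
X = (X - X.mean(axis=0)) / X.std(axis=0)              # standardize columns
beta_true = np.zeros(p)
beta_true[:5] = 0.5
y = X @ beta_true + rng.normal(size=n)

# carve the data: 80% of the rows for selection, 20% held out for inference
sel = rng.choice(n, size=int(0.8 * n), replace=False)
held = np.setdiff1d(np.arange(n), sel)

# estimate sigma from a least-squares fit on all p predictors, then set lambda
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma_hat = np.sqrt(resid @ resid / (n - p))
lam = sigma_hat * np.sqrt(2 * np.log(p) / len(sel))

fit = Lasso(alpha=lam, fit_intercept=False).fit(X[sel], y[sel])
E = np.flatnonzero(fit.coef_)                         # selected variables
s_E = np.sign(fit.coef_[E])                           # their signs, fixed by the conditioning
print(len(E), "variables selected")
```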


All four methods are implemented assuming a linear model consisting of the selected variables. If the model were correctly specified, "split" and "Lee" would both produce valid interval estimates, although the first uses only the held-out data, and the second relies only on the portion of the data used for selection. Our "carved" intervals are approximate, but they utilize information from both portions of the data. The "naive" intervals are invalid.

In terms of length, it can be seen that the two implementations of carving ("plug-in" and "full Bayes") produce similar interval estimates, which are shorter than both "split" and "Lee". Note that we could have used the entire data to construct the intervals of Lee et al. (2016), but we chose to use only the "exploratory" portion to ensure that the same variables are selected as with the other methods. That the "Lee" interval estimates are considerably longer than "split" matches our expectations: the leftover Fisher information in this scenario is smaller than the "marginal" information from independent data, which results in longer intervals.

The figure also shows point estimates: for "naive" these are just the (nonzero) Lasso estimates, for "split" these are least squares, and for "carved" these are (approximate) posterior modes. In the absence of knowledge about the underlying true means, it is hard to compare the different estimates and directions of shrinkage; "split" estimates are unbiased under the assumed model, while "carved" estimates arguably have smaller variance (because they use additional data). Figure 4 in the Appendix depicts the effect-size estimates of all variables in the selected set, including mutation 'P184V'.

Figure 3: Interval and point estimates for selected features. To allow convenient visualization, the figure does not show mutation 'P184V', which has a different scale from the other variables in the selected set.


8 Computation

We provide computational details for sampling from the approximate posterior and solving for the approximate MAP (MLE) problem. Note that both of these problems require computing the gradient of the approximate log-posterior, which amounts to solving for the optimizing variables in a certain convex optimization problem. Throughout this section, we assume that $\pi$ is a log-concave prior. We also assume that the columns of $X$ are scaled by $\sqrt{n}$ (but suppress the subscript $n$). The crucial elements in our implementation are:

• Equation (46) gives the exact form of the gradient of the log-posterior, which we compute at each draw of the sampler. Note that it involves $\hat{b}(Q\beta^{E'}_{(K)})$, the optimizing variables of the optimization problem
\[
\sup_{b\in\mathbb{R}^{|E'|}}\big\{b^T Q\beta^{E'}_{(K)} - \big(b^T Q b/2 + H_{\psi}(b)\big)\big\},
\]
where $H_{\psi}$ is defined within the section and $\beta^{E'}_{(K)}$ is the $K$-th draw of the sampler.

• To solve for an approximate selection-adjusted MAP/MLE, we employ gradient descent on the objective of the MAP/MLE problem. The computation at each step of the descent again involves solving for the optimizing variables of the optimization problem stated above.

Treating first the case where $\sigma^2$ is known, assume again without loss of generality that $\sigma^2 = 1$. To sample from the log-concave (approximate) posterior incorporating (32), we use a Langevin random walk to obtain a sample of size $n_S$ from the (approximate) selection-adjusted posterior $\tilde{\pi}_S(\cdot)$. As a function of the previous draw, the $(K+1)$-th draw for $\beta^{E'}$ of the Langevin sampler is computed as
\[
\beta^{E'}_{(K+1)} = \beta^{E'}_{(K)} + \gamma \cdot \nabla \log \tilde{\pi}_S\big(\beta^{E'}_{(K)}\,\big|\,\hat{\beta}^{E'}\big) + \sqrt{2\gamma}\,\epsilon_{(K)}, \tag{44}
\]

where $\epsilon_{(K)}$ for $K = 1, 2, \ldots, n_S$ are independent draws from a centered Gaussian with unit variance, and $\gamma$ is a predetermined step size. The sampler takes a noisy step along the gradient of the log-posterior, with no accept-reject step, in contrast to the usual Metropolis-Hastings (MH) algorithm. Hence, at each draw of the sampler, the main computational cost is calculating the gradient of the approximate (log-)selection-adjusted posterior.
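As an illustration, the following minimal sketch implements the recursion in (44); the gradient function, step size and starting point are placeholders to be supplied by the user (a sketch of the gradient computation in (46) appears further below).

```python
import numpy as np

def langevin_sample(grad_log_post, beta_init, step, n_samples, rng=None):
    """Noisy gradient steps as in (44); no Metropolis accept-reject correction."""
    rng = np.random.default_rng() if rng is None else rng
    beta = np.asarray(beta_init, dtype=float).copy()
    draws = np.empty((n_samples, beta.size))
    for k in range(n_samples):
        noise = rng.normal(size=beta.size)
        beta = beta + step * grad_log_post(beta) + np.sqrt(2.0 * step) * noise
        draws[k] = beta
    return draws
```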

Recalling the approximation based on (32), we note that the approximate log-posterior is
\[
\log\pi(\beta^{E'}_{(K)}) - (\beta^{E'}_{(K)})^T Q\beta^{E'}_{(K)}/2 + (\beta^{E'}_{(K)})^T Q\hat{\beta}^{E'} - \log\tilde{\mathbb{P}}\big(\hat{E} = E, \hat{S}_{\hat{E}} = s_E\,\big|\,\beta^{E'}_{(K)}\big),
\]
where
\[
\log\tilde{\mathbb{P}}\big(\hat{E} = E, \hat{S}_{\hat{E}} = s_E\,\big|\,\beta^{E'}_{(K)}\big) = -\inf_{b\in\mathbb{R}^{|E'|}}\big\{(b - \beta^{E'}_{(K)})^T Q(b - \beta^{E'}_{(K)})/2 + H_{\psi}(b)\big\}. \tag{45}
\]
Above, $H_{\psi}$ is a function of $b$, with the infimum ranging over $(\eta, o) \in \{(\eta, o): \eta\in\mathbb{R}^{p-|E'|},\ o\in\mathbb{R}^p\}$, and is given by
\[
H_{\psi}(b) = \inf_{(\eta,o)\in\mathbb{R}^{2p-|E'|}}\bigg\{\eta^T N\eta/2 + \Big(P_E\begin{pmatrix}b\\ \eta\end{pmatrix} + Q_E o + r_E\Big)^T\Sigma_G^{-1}\Big(P_E\begin{pmatrix}b\\ \eta\end{pmatrix} + Q_E o + r_E\Big)/2 + \sum_{i=1}^{|E|}\log\big(1 + 1/(s_{i,E}\,o_{i,E})\big) + \sum_{i=1}^{p-|E|}\Big[\log\big(1 + 1/(\lambda_{i,-E} - o_{i,-E})\big) + \log\big(1 + 1/(\lambda_{i,-E} + o_{i,-E})\big)\Big]\bigg\}.
\]
Denoting by $\bar{H}^*(\cdot)$ the conjugate of $\bar{H}(b) = b^T Q b/2 + H_{\psi}(b)$, we can write
\[
\log\tilde{\mathbb{P}}\big(\hat{E} = E, \hat{S}_{\hat{E}} = s_E\,\big|\,\beta^{E'}_{(K)}\big) = -(\beta^{E'}_{(K)})^T Q\beta^{E'}_{(K)}/2 + \sup_{b\in\mathbb{R}^{|E'|}}\big\{b^T Q\beta^{E'}_{(K)} - \big(b^T Q b/2 + H_{\psi}(b)\big)\big\} = -(\beta^{E'}_{(K)})^T Q\beta^{E'}_{(K)}/2 + \bar{H}^*\big(Q\beta^{E'}_{(K)}\big).
\]


Letting $\hat{b}(Q\beta^{E'}_{(K)})$ be the maximizer of $\{b^T Q\beta^{E'}_{(K)} - (b^T Q b/2 + H_{\psi}(b))\}$, we have
\[
\nabla \log\tilde{\mathbb{P}}\big(\hat{E} = E, \hat{S}_{\hat{E}} = s_E\,\big|\,\beta^{E'}_{(K)}\big) = -Q\beta^{E'}_{(K)} + Q\,(\nabla\bar{H})^{-1}\big(Q\beta^{E'}_{(K)}\big) = -Q\beta^{E'}_{(K)} + Q\,\hat{b}\big(Q\beta^{E'}_{(K)}\big),
\]
and, finally, the gradient of the log-posterior in (44) equals
\[
\nabla \log\pi(\beta^{E'}_{(K)}) + Q\hat{\beta}^{E'} - Q\,\hat{b}\big(Q\beta^{E'}_{(K)}\big). \tag{46}
\]
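A minimal sketch of this computation is given below, assuming $H_{\psi}$ is available as a convex function of $b$ (in our implementation it is itself the value of a smaller convex program) and handing the inner maximization to a generic quasi-Newton solver; the function names are illustrative rather than part of our software.

```python
import numpy as np
from scipy.optimize import minimize

def b_hat(Q, beta, H_psi, b0=None):
    """Maximize b^T Q beta - (b^T Q b / 2 + H_psi(b)) over b, as in the display above."""
    Qbeta = Q @ beta
    neg_obj = lambda b: -(b @ Qbeta - 0.5 * b @ Q @ b - H_psi(b))
    b0 = np.asarray(beta, dtype=float) if b0 is None else b0
    return minimize(neg_obj, b0, method="BFGS").x

def grad_log_post(beta, Q, beta_hat_ls, H_psi, grad_log_prior):
    """Gradient in (46): grad log pi(beta) + Q betahat - Q b_hat(Q beta)."""
    return grad_log_prior(beta) + Q @ beta_hat_ls - Q @ b_hat(Q, beta, H_psi)
```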

Alternatively, we could employ a Metropolis-Hastings algorithm, which would require, each time the acceptance ratio is computed, the value of the optimization problem in (45) in order to approximate $\log\mathbb{P}(\hat{E} = E, \hat{S}_{\hat{E}} = s_E\,|\,\beta^{E'}_{(K)})$, equal to
\[
-(\beta^{E'}_{(K)})^T Q\beta^{E'}_{(K)}/2 + \bar{H}^*\big(Q\beta^{E'}_{(K)}\big).
\]

For the MAP problem, we note that the approximate MAP minimizes the convex objective
\[
\operatorname*{minimize}_{\beta^{E'}\in\mathbb{R}^{|E'|}}\ -\log\pi(\beta^{E'}) - (\beta^{E'})^T Q\hat{\beta}^{E'} + \bar{H}^*\big(Q\beta^{E'}\big).
\]
This reduces to the MLE problem when $\pi(\beta^{E'}) \propto 1$. Employing a gradient descent algorithm to compute the approximate MAP, we note that the $K$-th update can be written as
\[
\hat{\beta}^{E'}_{S;(K+1)} = \hat{\beta}^{E'}_{S;(K)} - \eta\cdot\Big(Q\,\hat{b}\big(Q\hat{\beta}^{E'}_{S;(K)}\big) - \nabla\log\pi\big(\hat{\beta}^{E'}_{S;(K)}\big) - Q\hat{\beta}^{E'}\Big), \tag{47}
\]
where $\eta$ is a step size, involving again the optimizer $\hat{b}(Q\beta^{E'})$, obtained by solving
\[
\operatorname*{maximize}_{b\in\mathbb{R}^{|E'|}}\ \big\{b^T Q\beta^{E'} - \big(b^T Q b/2 + H_{\psi}(b)\big)\big\}.
\]
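The recursion (47) can be sketched as a plain gradient-ascent loop on the approximate log-posterior, as below; the step size, iteration budget and stopping rule are illustrative choices, and grad_log_post stands for the gradient in (46).

```python
import numpy as np

def approx_map(grad_log_post, beta_init, step=1e-2, n_iter=1000, tol=1e-8):
    """Ascend the approximate log-posterior as in (47); a flat prior gives the MLE."""
    beta = np.asarray(beta_init, dtype=float).copy()
    for _ in range(n_iter):
        beta_new = beta + step * grad_log_post(beta)   # minus the descent direction in (47)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```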

Next, suppose that $\sigma^2$ is unknown. In that case, we propose to replace (6) with a joint prior on $\beta^{E'}$ and $\sigma^2$. For example, we can take the conjugate prior (43), with a large $c^2$ entailing a diffuse prior for $\beta^{E'}$ conditionally on $\sigma^2$. Below, we describe a Gibbs sampler to alternately sample $\beta^{E'}$ and $\sigma^2$.

Let $\tilde{L}_S(\beta^{E'}, \sigma^2)$ denote the approximate (log) selection-adjusted likelihood where we plug in our approximation for the selection probability, given by (29):
\[
\log\tilde{\mathbb{P}}\big(\hat{E} = E, \hat{S}_{\hat{E}} = s_E\,\big|\,\beta^{E'}_{(K)}, \sigma^2\big) = -(\sigma^2)^{-1}\cdot\inf_{b\in\mathbb{R}^{|E'|}}\big\{(b - \beta^{E'}_{(K)})^T Q(b - \beta^{E'}_{(K)})/2 + H_{\psi}(b)\big\}. \tag{48}
\]

We employ a Gibbs scheme to draw a new sample $(\beta^{E'}_{(K+1)}, \sigma^2_{(K+1)})^T$ from the resulting posterior, whose logarithm equals
\[
\log\pi(\beta^{E'}, \sigma^2) + \tilde{L}_S(\beta^{E'}, \sigma^2)
\]
up to a constant, under the prior in (43). Using our previous notation, let us denote the posterior distribution for $\beta^{E'}$ conditional on $\sigma^2$ by $\tilde{\pi}_S(\beta^{E'}\,|\,\hat{\beta}^{E'}, \sigma^2)$. Conditional on the $K$-th update $\sigma^2_{(K)}$, we make a fresh draw $\beta^{E'}_{(K+1)}$ through a noisy update along the gradient of the log-posterior, as described previously in (44):
\[
(\beta^{E'})\text{-Update:}\quad \beta^{E'}_{(K+1)} \leftarrow \beta^{E'}_{(K)} + \gamma\cdot\nabla\log\tilde{\pi}_S\big(\beta^{E'}_{(K)}\,\big|\,\hat{\beta}^{E'}, \sigma^2_{(K)}\big) + \sqrt{2\gamma}\,\epsilon_{(K)}.
\]


Observe that the gradient of the log-posterior which leads to our sample update equals
\[
-\beta^{E'}_{(K)}/(c^2\sigma^2_{(K)}) + Q\hat{\beta}^{E'}/\sigma^2_{(K)} - Q\,\hat{b}\big(Q\beta^{E'}_{(K)}\big)/\sigma^2_{(K)}. \tag{49}
\]
We alternate this step with updates for the variance parameter; see for example Tong et al. (2019). Conditional on $\beta^{E'}_{(K+1)}$, we see that the choice of an inverse-gamma prior for the variance parameter serves as a conjugate prior, as it usually does in Bayesian linear regression. Specifically, the posterior distribution of $\sigma^2_{(K+1)}$ conditional on $\beta^{E'}_{(K+1)}$ is an inverse-gamma random variable with density proportional to
\[
(\sigma^2)^{-|E'|/2 - p - a - 1}\exp\Big(-\big(b - \tilde{L}_S(\beta^{E'}_{(K+1)}, 1) + (\beta^{E'}_{(K+1)})^T\beta^{E'}_{(K+1)}/(2c^2)\big)\big/\sigma^2\Big).
\]
That is, we now sample
\[
(\sigma^2)\text{-Update:}\quad \sigma^2_{(K+1)} \leftarrow \text{Inv-Gamma}\big(|E'|/2 + p + a;\ \tilde{b}\big),
\]
where the updated hyperparameter $\tilde{b}(\beta^{E'}_{(K+1)})$ equals
\[
b - \tilde{L}_S(\beta^{E'}_{(K+1)}, 1) + (\beta^{E'}_{(K+1)})^T\beta^{E'}_{(K+1)}/(2c^2).
\]
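A minimal sketch of the resulting Gibbs scheme is given below; grad_log_post_sigma stands for the gradient in (49), loglik_sigma1 for the approximate selection-adjusted log-likelihood evaluated at $\sigma^2 = 1$, and the remaining arguments are the prior hyperparameters in (43). The sketch assumes the resulting scale parameter is positive.

```python
import numpy as np
from scipy.stats import invgamma

def gibbs_carved(grad_log_post_sigma, loglik_sigma1, beta_init, sigma2_init,
                 a, b, c2, p, step, n_samples, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    beta = np.asarray(beta_init, dtype=float).copy()
    sigma2 = float(sigma2_init)
    draws = []
    for _ in range(n_samples):
        # beta-update: noisy step along the gradient in (49), as in (44)
        noise = rng.normal(size=beta.size)
        beta = beta + step * grad_log_post_sigma(beta, sigma2) \
                    + np.sqrt(2.0 * step) * noise
        # sigma^2-update: conjugate inverse-gamma draw with the updated scale b_tilde
        shape = beta.size / 2.0 + p + a
        scale = b - loglik_sigma1(beta) + beta @ beta / (2.0 * c2)
        sigma2 = invgamma.rvs(a=shape, scale=scale, random_state=rng)
        draws.append((beta.copy(), sigma2))
    return draws
```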

    9 Discussion

To address the problem of inference after variable selection, we adopt the point of view where inference is based on the likelihood when truncated to the event including all possible realizations that lead the researcher to posing the same question. The methods we propose are based on an approximation to the adjustment factor (the denominator in the selection-adjusted likelihood), which applies to a large class of selection rules, including ones that involve randomization. By working directly with the full truncated likelihood, we obviate the need to differentiate between various cases according to choices of the statistician: for example, the approach of Fithian et al. (2014) requires computations that are different in essence under the selected model and under the saturated model, whereas with the tools we develop, there is no essential difference between the two. Similarly, our methods are amenable to data-carving. In Panigrahi et al. (2019) the approximation proposed in our paper is employed to obtain tractable pivotal quantities, and more recently Panigrahi and Taylor (2019) used this approximation in a maximum-likelihood approach.

There is certainly room for further research and extensions of the current work. On the methodological side, it would be interesting to investigate if a variational Bayes approach can be taken instead of implementing MCMC sampling schemes for posterior updates. From a theoretical point of view, we have shown consistency properties of the posterior that appends a "carved" likelihood to a prior; that is, we proved that such a posterior concentrates around the true underlying parameter with probability converging to one as the sample size increases. Empirical evidence showing that our credible intervals (under a diffuse prior) are similar to the frequentist post-selection intervals suggests that the frequentist guarantees that we provided can be strengthened, for example by presenting a Bernstein–von Mises-type result.

    10 Acknowledgements

The idea of providing Bayesian adjusted inference after variable selection was previously proposed by Daniel Yekutieli and Edward George and presented at the 2012 Joint Statistical Meetings in San Diego. Our interest re-arose with the recent developments on exact post-selection inference in the linear model. A.W. is thankful to Daniel and Ed for helpful conversations and to Ed for pointing out the reference to Bayarri and DeGroot (1987). A.W. was partially supported by ERC grant 030-8944 and by ISF grant 039-9325.


J.T. was supported in part by ARO grant 70940MA. S.P. is thankful to Xuming He, Liza Levina, Rina Foygel Barber and Yi Wang for reading an initial version of the draft and offering valuable comments and insights. The authors would also like to sincerely thank and acknowledge Chiara Sabatti for the long discussions and for her suggestions along the way.

    References

    Miguel A Arcones. Moderate deviations for m-estimators. Test, 11(2):465–500, 2002.

MJ Bayarri and MH DeGroot. Bayesian analysis of selection models. The Statistician, pages 137–146, 1987.

Yoav Benjamini and Amit Meir. Selective correlations - the conditional estimators. arXiv preprint arXiv:1412.3242, 2014.

Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, Linda Zhao, et al. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.

Nan Bi, Jelena Markovic, Lucy Xia, and Jonathan Taylor. Inferactive data analysis. Scandinavian Journal of Statistics, 47(1):212–249, 2020.

AA Borovkov and AA Mogul'skii. Probabilities of large deviations in topological spaces. I. Siberian Mathematical Journal, 19(5):697–709, 1978.

Andreas Buja, Richard A Berk, Lawrence D Brown, Edward I George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. Models as approximations - a conspiracy of random regressors and model deviations against classical inference in regression. Statistical Science, page 1, 2015.

    GTEx Consortium et al. Genetic effects on gene expression across human tissues. Nature, 550(7675):204–213, 2017.

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. arXiv preprint arXiv:1411.2664, 2014.

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. arXiv preprint arXiv:1506.02629, 2015.

Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.

Bradley Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012.

Peter Eichelsbacher and Matthias Löwe. Moderate deviations for i.i.d. random variables. ESAIM: Probability and Statistics, 7:209–218, 2003.

William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.

Ning Gao, Johannes WR Martini, Zhe Zhang, Xiaolong Yuan, Hao Zhang, Henner Simianer, and Jiaqi Li. Incorporating gene annotation into genomic prediction of complex phenotypes. Genetics, 207(2):489–501, 2017.


L Gunter, J Zhu, and SA Murphy. Variable selection for qualitative interactions. Statistical Methodology, 8(1):42–55, 2011.

Nils Lid Hjort and David Pollard. Asymptotics for minimisers of convex processes. arXiv preprint arXiv:1107.3806, 2011.

Marek Kanter and Harold Proppe. Reduction of variance for Gaussian densities via restriction to convex sets. Journal of Multivariate Analysis, 7(1):74–81, 1977.

Danijel Kivaranovic and Hannes Leeb. Expected length of post-model-selection confidence intervals conditional on polyhedral constraints. arXiv preprint arXiv:1803.01665, 2018.

Jason D Lee and Jonathan E Taylor. Exact post model selection inference for marginal screening. In Advances in Neural Information Processing Systems, pages 136–144, 2014.

Jason D Lee, Dennis L Sun, Yuekai Sun, Jonathan E Taylor, et al. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.

Jelena Markovic and Jonathan Taylor. Bootstrap inference after using multiple queries for model selection. arXiv preprint arXiv:1612.07811, 2016.

Ian W McKeague and Min Qian. An adaptive resampling test for detecting the presence of significant predictors. Journal of the American Statistical Association, 110(512):1422–1433, 2015.

Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

    Snigdha Panigrahi. Carving model-free inference. arXiv preprint arXiv:1811.03142, 2018.

Snigdha Panigrahi and Jonathan Taylor. Approximate selective inference via maximum likelihood. arXiv preprint arXiv:1902.07884, 2019.

Snigdha Panigrahi, Junjie Zhu, and Chiara Sabatti. Selection-adjusted inference: an application to confidence intervals for cis-eQTL effect sizes. Biostatistics, 2019. doi: 10.1093/biostatistics/kxz024. URL https://doi.org/10.1093/biostatistics/kxz024.

Stephen Reid, Jonathan Taylor, and Robert Tibshirani. Post-selection point and interval estimation of signal sizes in Gaussian samples. arXiv preprint arXiv:1405.3340, 2014.

Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L Brutlag, and Robert W Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360, 2006.

H Robbins. The UV method of estimation. Statistical Decision Theory and Related Topics IV, 1:265–270, 1988.

Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. arXiv preprint arXiv:1511.05219, 2015.

Noah Simon and Richard Simon. On estimating many means, selection bias, and the bootstrap. arXiv preprint arXiv:1311.3709, 2013.



Uri Simonsohn, Leif D Nelson, and Joseph P Simmons. P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2):534, 2014.

Francesco C Stingo, Yian A Chen, Mahlet G Tadesse, and Marina Vannucci. Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes. The Annals of Applied Statistics, 5(3), 2011.

J Taylor, R Lockhart, RJ Tibshirani, and R Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2014.

Jonathan Taylor and Robert Tibshirani. Post-selection inference for ℓ1-penalized likelihood models. Canadian Journal of Statistics, 46(1):41–61, 2018.

Jonathan Taylor, Joshua Loftus, and Ryan Tibshirani. Tests in adaptive regression via the Kac-Rice formula. arXiv preprint arXiv:1308.3020, 2013.

Xiaoying Tian, Jonathan Taylor, et al. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.

XT Tong, M Morzfeld, and YM Marzouk. MALA-within-Gibbs samplers for high-dimensional distributions with sparse conditional structure. arXiv preprint arXiv:1908.09429, 2019.

Asaf Weinstein, William Fithian, and Yoav Benjamini. Selection adjusted confidence intervals with more power to determine the sign. Journal of the American Statistical Association, 108(501):165–176, 2013.

Daniel Yekutieli. Adjusted Bayesian inference for selected parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):515–541, 2012.

Guo Yu, Jacob Bien, and Ryan Tibshirani. Reluctant interaction modeling. arXiv preprint arXiv:1907.08414, 2019.

Hua Zhong and Ross L Prentice. Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics, 9(4):621–634, 2008.

Sebastian Zöllner and Jonathan K Pritchard. Overcoming the winner's curse: estimating penetrance parameters from case-control data. The American Journal of Human Genetics, 80(4):605–615, 2007.

    11 Appendix

    11.1 Proofs for Section 3

Proof. Theorem 3.1: We begin by noting that $\log\mathbb{P}(n^{1/2-\delta}(\bar{Y}_n + W_n)\in K\,|\,\beta_n)$ can be bounded above by
\[
\log\mathbb{E}\big(\exp\big(n^{1/2+\delta}\alpha\bar{Y}_n + n^{1/2+\delta}\gamma W_n + u\big)\,\big|\,\beta_n\big)
\]
for a choice of $\alpha$, $\gamma$ and $u$ such that
\[
n^{1/2+\delta}\cdot(\alpha\bar{z} + \gamma\bar{w}) + u \ge 0 \quad\text{whenever } n^{1/2-\delta}(\bar{z} + \bar{w})\in K.
\]
Setting
\[
u = -\inf_{\bar{z},\bar{w}:\,n^{1/2-\delta}(\bar{z}+\bar{w})\in K}\big(n^{1/2+\delta}\alpha\bar{z} + n^{1/2+\delta}\gamma\bar{w}\big),
\]


it follows that
\[
\begin{aligned}
\log\mathbb{P}\big(n^{1/2-\delta}(\bar{Y}_n + W_n)\in K\,\big|\,\beta_n\big)
&\le \sup_{\bar{z},\bar{w}:\,n^{1/2-\delta}(\bar{z}+\bar{w})\in K}\Big\{-n^{1/2+\delta}\alpha\bar{z} - n^{1/2+\delta}\gamma\bar{w} + n^{2\delta}\alpha\beta^* + \frac{n^{2\delta}\alpha^2}{2} + \frac{n^{2\delta}(1-\rho)}{2\rho}\gamma^2\Big\} \\
&= n^{2\delta}\cdot \sup_{z,w:\,(z+w)\in K}\Big\{-\alpha z - \gamma w + \alpha\beta^* + \frac{\alpha^2}{2} + \frac{(1-\rho)\gamma^2}{2\rho}\Big\}.
\end{aligned}
\]
Since the above display holds for any arbitrary $\alpha$, $\gamma$, we can write
\[
\begin{aligned}
\log\mathbb{P}\big(n^{1/2-\delta}(\bar{Y}_n + W_n)\in K\,\big|\,\beta_n\big)
&\le -n^{2\delta}\cdot \sup_{\alpha,\gamma}\Big\{\inf_{z,w:\,(z+w)\in K}\ \alpha z - \Big(\alpha\beta^* + \frac{\alpha^2}{2}\Big) + \gamma w - \frac{(1-\rho)\gamma^2}{2\rho}\Big\} \\
&= -n^{2\delta}\cdot \inf_{z,w:\,(z+w)\in K}\Big\{\sup_{\alpha,\gamma}\ \alpha z - \Big(\alpha\beta^* + \frac{\alpha^2}{2}\Big) + \gamma w - \frac{(1-\rho)\gamma^2}{2\rho}\Big\} \\
&= -n^{2\delta}\inf_{z,w:\,(z+w)\in K}\ \frac{(z-\beta^*)^2}{2} + \frac{\rho w^2}{2(1-\rho)}.
\end{aligned}
\]
The penultimate equation follows by a minimax argument, and the last step calculates the conjugates of the moment generating functions of Gaussian random variables with variances $1$ and $\frac{1-\rho}{\rho}$, respectively.

The large deviation limit when $\delta = 1/2$ follows from Cramér's Theorem on the real line, and the moderate deviations limit follows from Theorem 2.2 and Remark 2.32 of Eichelsbacher and Löwe (2003) when $\delta \in (0, 1/2)$. These theorems state the following limit:
\[
\lim_{n\to\infty}\ \frac{1}{n^{2\delta}}\log\mathbb{P}\big[\sqrt{n}(\bar{Y}_n + W_n)/n^{\delta}\in K\,\big|\,\beta_n\big] = -\inf_{(z,w):\,z+w\in K} R(z,w);
\]
\[
R(z,w) = \sup_{x,y}\big\{xz + yw - \log\mathbb{E}[\exp(xZ + yW)\,|\,\beta^*]\big\}, \qquad (Z,W)\sim\mathcal{N}(\mu,\Sigma), \quad \mu = \begin{pmatrix}\beta^*\\ 0\end{pmatrix}, \quad \Sigma = \begin{bmatrix} 1 & 0\\ 0 & (1-\rho)/\rho\end{bmatrix}.
\]
We obtain the proof by plugging in the Gaussian MGF and, finally, observing that the rate function equals $R(z,w) = (z-\beta^*)^2/2 + \rho w^2/(2(1-\rho))$.
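As a sanity check (not part of the argument above), the Legendre transform computed in the last step can be verified symbolically:

```python
import sympy as sp

x, y, z, w, beta_star = sp.symbols("x y z w beta_star", real=True)
rho = sp.symbols("rho", positive=True)

# log-MGF of (Z, W) with Z ~ N(beta*, 1), W ~ N(0, (1 - rho)/rho), independent
log_mgf = x * beta_star + x**2 / 2 + (1 - rho) * y**2 / (2 * rho)
inner = x * z + y * w - log_mgf

# the concave inner objective is maximized at its stationary point
sol = sp.solve([sp.diff(inner, x), sp.diff(inner, y)], [x, y], dict=True)[0]
rate = sp.simplify(inner.subs(sol))
target = (z - beta_star) ** 2 / 2 + rho * w**2 / (2 * (1 - rho))
print(sp.simplify(rate - target))   # prints 0
```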

Proof of Corollary 3.3. It follows from Theorem 3.1 that the sequence of true truncated posteriors, written as
\[
\log\pi_S(\beta_n\,|\,\bar{y}_n) = \log\pi(\beta_n) - n(\bar{y}_n - \beta_n)^2/2 - \log\mathbb{P}\big(\sqrt{n}(\bar{Y}_n + W_n)/n^{\delta}\in K\,\big|\,\beta_n\big),
\]
can be approximated by
\[
\log\pi(\beta_n) - n(\bar{y}_n - \beta_n)^2/2 + n^{2\delta}\cdot\inf_{(z,w):\,z+w\in K}\Big\{\frac{(z-\beta^*)^2}{2} + \frac{\rho w^2}{2(1-\rho)}\Big\}.
\]
Note that the limiting sequence of objectives
\[
\frac{(z-\beta^*)^2}{2} + \frac{\rho w^2}{2(1-\rho)} + \frac{1}{n^{2\delta}}\psi_{n^{-\delta}}(z+w)
\]


is convex in $(z,w)$. Furthermore, the above sequence converges to the continuous, convex objective
\[
\frac{(z-\beta^*)^2}{2} + \frac{\rho w^2}{2(1-\rho)} + I_K(z+w), \qquad I_K(\bar{z}) = \begin{cases} 0 & \text{if } \bar{z}\in K\\ \infty & \text{otherwise}\end{cases}
\]
under the condition $n^{-2\delta}\psi_{n^{-\delta}}(z+w)\to I_K(z+w)$ for all $(z,w)\in\mathbb{R}^2$ as $n\to\infty$. Finally, observing that the limiting objective has a unique minimum, we reach the conclusion of the Corollary.

    11.2 Proofs for Section 4

Proof. Proposition 4.1: Noting that the least squares estimator $\hat{\beta}^{E'}$ satisfies
\[
\frac{1}{\sqrt{n}}X^T_{E'}\big(y - X_{E'}\hat{\beta}^{E'}\big) = 0, \tag{50}
\]
we can write
\[
\begin{aligned}
0 &= \frac{1}{\sqrt{n}}X^T_{E'}\big(y - X_{E'}\hat{\beta}^{E'}\big)\\
&= \frac{1}{\sqrt{n}}X^T_{E'}\big(y - X_{E'}\beta^{E'}\big) - \Big(\frac{X^T_{E'}X_{E'}}{n}\Big)\sqrt{n}\big(\hat{\beta}^{E'} - \beta^{E'}\big)\\
&= \frac{1}{\sqrt{n}}X^T_{E'}\big(y - X_{E'}\beta^{E'}\big) - Q\sqrt{n}\big(\hat{\beta}^{E'} - \beta^{E'}\big) - \Big(\frac{X^T_{E'}X_{E'}}{n} - Q\Big)\sqrt{n}\big(\hat{\beta}^{E'} - \beta^{E'}\big).
\end{aligned}
\]
Thus, it follows from observing $X^T_{E'}X_{E'}/n - Q = o_p(1)$ and $\sqrt{n}(\hat{\beta}^{E'} - \beta^{E'}) = O_p(1)$ (see Proposition 7 of Buja et al. (2015)) that $\big(X^T_{E'}X_{E'}/n - Q\big)\sqrt{n}(\hat{\beta}^{E'} - \beta^{E'}) = o_p(1)$ and hence
\[
\sqrt{n}\big(\hat{\beta}^{E'} - \beta^{E'}\big) = Q^{-1}\frac{X^T_{E'}}{\sqrt{n}}\big(y - X_{E'}\beta^{E'}\big) + o_p(1). \tag{51}
\]
Further, it is easy to note that the asymptotic variance of $\sqrt{n}(\hat{\beta}^{E'} - \beta^{E'})$ equals $\sigma^2 Q^{-1}$ under the above model assumptions.

Defining $C$ as $\mathbb{E}_{\mathbb{P}}(X^T_{-E'}X_{E'}/n)$, we can expand the term $\frac{X^T_{-E'}}{\sqrt{n}}\big(y - X_{E'}\hat{\beta}^{E'}\big)$, which identifies
\[
\sqrt{n}\,N_{-E'} = \frac{X^T_{-E'}}{\sqrt{n}}\big(y - X_{E'}\beta^{E'}\big) - C\sqrt{n}\big(\hat{\beta}^{E'} - \beta^{E'}\big) - \Big(\frac{X^T_{-E'}X_{E'}}{n} - C\Big)\sqrt{n}\big(\hat{\beta}^{E'} - \beta^{E'}\big).
\]
Plugging in $\sqrt{n}(\hat{\beta}^{E'} - \beta^{E'})$ from (51) and noting that $\Big(\frac{X^T_{-E'}X_{E'}}{n} - C\Big)\sqrt{n}(\hat{\beta}^{E'} - \beta^{E'}) = o_p(1)$, we have
\[
\sqrt{n}\,N_{-E'} = \frac{X^T_{-E'}}{\sqrt{n}}\big(y - X_{E'}\beta^{E'}\big) - CQ^{-1}\frac{X^T_{E'}}{\sqrt{n}}\big(y - X_{E'}\beta^{E'}\big) + o_p(1). \tag{52}
\]
Note that the asymptotic variance of $\sqrt{n}\,N_{-E'}$ equals $\sigma^2 N^{-1}$, where $N^{-1} = (P - CQ^{-1}C^T)$, with $P = \mathbb{E}_{\mathbb{P}}(X^T_{-E'}X_{-E'}/n)$. Further, the asymptotic covariance between $\sqrt{n}(\hat{\beta}^{E'} - \beta^{E'})$ and $\sqrt{n}\,N_{-E'}$ is $0$, which follows from
\[
\mathbb{E}_{\mathbb{P}}\Big(\operatorname{Cov}\Big(Q^{-1}\frac{X^T_{E'}}{\sqrt{n}}\big(y - X_{E'}\beta^{E'}\big),\ \frac{X^T_{-E'}}{\sqrt{n}}\big(y - X_{E'}\beta^{E'}\big) - CQ^{-1}\frac{X^T_{E'}}{\sqrt{n}}\big(y - X_{E'}\beta^{E'}\big)\ \Big|\ X\Big)\Big) = 0.
\]


Now we look at the randomization term. Let $\beta^*_E(\beta^{E'}) = \mathbb{E}_{\mathbb{P}}[\hat{\beta}^{\lambda}_E]$, where $\hat{\beta}^{\lambda}_E\in\mathbb{R}^{|E|}$ is the vector of the nonzero coordinates of the Lasso solution in (20). The randomization term in (23) equals
\[
\Omega_n = -\frac{X^T}{\sqrt{n}}\big(y - X_E\beta^*_E\big) + \frac{X_S^T}{\rho\sqrt{n}}\big(y_S - X_{S,E}\beta^*_E\big) + o_p(1), \tag{53}
\]
where the remainder term is $\Big(\frac{X^T X_E}{n} - A\Big)\sqrt{n}(\hat{\beta}^{\lambda}_E - \beta^*_E) - \Big(\frac{X_S^T X_{S,E}}{\rho n} - A\Big)\sqrt{n}(\hat{\beta}^{\lambda}_E - \beta^*_E)$, and $A = \mathbb{E}_{\mathbb{P}}(X^T X_E/n) = \mathbb{E}_{\mathbb{P}}(X_S^T X_{S,E}/(\rho n))$. Clearly, from equations (51), (52) and (53), it follows that
\[
\sqrt{n}\big(T_n - \mathbb{E}[T_n]\big)\sim\mathcal{N}(0,\Sigma_P); \qquad \Omega_n\sim\mathcal{N}(0,\Sigma_G); \qquad \Sigma_P = \sigma^2\begin{bmatrix} Q^{-1} & 0\\ 0 & N^{-1}\end{bmatrix}.
\]
The proof is finally complete by noting the block-diagonal structure of the covariance between $\sqrt{n}\,T_n$ and $\Omega_n$. Observe that the covariance
\[
\operatorname{Cov}\Big(\frac{X^T}{\sqrt{n}}\big(y - X_E\beta^*_E\big) - \frac{X_S^T}{\rho\sqrt{n}}\big(y_S - X_{S,E}\beta^*_E\big),\ X^T\big(y - X_{E'}\beta^{E'}\big)\Big) = 0,
\]
which proves the asymptotic independence between $\sqrt{n}(T_n - \mathbb{E}[T_n])$ and $\Omega_n$.
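The zero asymptotic covariance derived above can also be illustrated numerically; the simulation below (with an arbitrary Gaussian design and arbitrary coefficients, not the setting of the paper) shows that the empirical cross-covariance between $\sqrt{n}(\hat{\beta}^{E'} - \beta^{E'})$ and $\sqrt{n}\,N_{-E'}$ is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, pE, pO, reps = 2000, 3, 4, 2000          # |E'| = 3 fitted columns, 4 remaining columns
beta_E = np.array([1.0, -0.5, 0.25])

A = np.empty((reps, pE))
B = np.empty((reps, pO))
for r in range(reps):
    X = rng.normal(size=(n, pE + pO))
    XE, XO = X[:, :pE], X[:, pE:]
    y = XE @ beta_E + rng.normal(size=n)
    bhat = np.linalg.lstsq(XE, y, rcond=None)[0]
    A[r] = np.sqrt(n) * (bhat - beta_E)               # sqrt(n) (betahat - beta)
    B[r] = XO.T @ (y - XE @ bhat) / np.sqrt(n)        # sqrt(n) N_{-E'}

cross_cov = (A - A.mean(0)).T @ (B - B.mean(0)) / (reps - 1)
print(np.round(cross_cov, 2))                         # all entries close to 0
```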

Proof. Proposition 4.2: We denote by $\hat{\beta}^{\lambda}_E\in\mathbb{R}^{|E|}$ the vector of the selected coordinates of the Lasso solution, not shrunk to $0$. From the definition of $\Omega_n$ in (23) and the K.K.T. conditions of the Lasso, it follows that
\[
\Omega_n = -\frac{X^T}{\sqrt{n}}\big(y - X_E\hat{\beta}^{\lambda}_E\big) + \begin{pmatrix}\lambda s_E\\ z_{-E}\end{pmatrix},
\]
where the subgradient vector at the solution equals
\[
\partial_{\beta}\big(\lambda\|\beta\|_1\big) = \begin{pmatrix}\lambda s_E\\ z_{-E}\end{pmatrix}, \qquad \|z_{-E}\|_{\infty} < \lambda.
\]
Noting that
\[
-X^T y/\sqrt{n} = -X^T X_{E'}\hat{\beta}^{E'}/\sqrt{n} - X^T\big(y - X_{E'}\hat{\beta}^{E'}\big)/\sqrt{n},
\]
the K.K.T. map now equals
\[
\Omega_n = -\frac{X^T X_{E'}}{n}\sqrt{n}\,\hat{\beta}^{E'} - X^T\big(y - X_{E'}\hat{\beta}^{E'}\big)/\sqrt{n} + \frac{X^T X_E}{n}\sqrt{n}\,\hat{\beta}^{\lambda}_E + \begin{pmatrix}\lambda s_E\\ z_{-E}\end{pmatrix}.
\]
Using the facts that $X^T_E X_{E'}/n - P_{E,E'} = o_p(1)$, $X^T_{-E}X_{E'}/n - F_{E,E'} = o_p(1)$, $X^T_E X_E/n - Q_E = o_p(1)$ and $X^T_{-E}X_E/n - C_E = o_p(1)$, coupled with the K.K.T. map above, we can write
\[
\Omega_n = -\begin{bmatrix} P_{E,E'} & I_{E,E'}\\ F_{E,E'} & J_{E,E'}\end{bmatrix}\sqrt{n}\,T_n + \begin{bmatrix} Q_E & 0\\ C_E & I\end{bmatrix}\begin{pmatrix}\sqrt{n}\,\hat{\beta}^{\lambda}_E\\ z_{-E}\end{pmatrix} + \begin{pmatrix}\lambda s_E\\ 0\end{pmatrix} + o_p(1). \tag{54}
\]
The constraints equivalent to selection of $(\hat{E}, \hat{S}_E) = (E, s_E)$ are given by
\[
\operatorname{diag}\big(\operatorname{sgn}(\hat{\beta}^{\lambda}_E)\big) = s_E, \qquad \|z_{-E}\|_{\infty} < \lambda,
\]
which can be equivalently written using (54) as
\[
-\operatorname{diag}(s_E)(Q_E)^{-1}\big(P_{E,E'} \;\; I_{E,E'}\big)\sqrt{n}\,T_n - \big(\operatorname{diag}(s_E)(Q_E)^{-1} \;\; 0\big)\,\Omega_n + o_p(1) < -\lambda\cdot\operatorname{diag}(s_E)(Q_E)^{-1}s_E;
\]


\[
\Big(\big(F_{E,E'} \;\; J_{E,E'}\big) - C_E(Q_E)^{-1}\big(P_{E,E'} \;\; I_{E,E'}\big)\Big)\sqrt{n}\,T_n + \big(-C_E(Q_E)^{-1} \;\; I\big)\,\Omega_n + o_p(1) < \lambda\cdot\big(1 - C_E(Q_E)^{-1}s_E\big);
\]
and
\[
\Big(-\big(F_{E,E'} \;\; J_{E,E'}\big) + C_E(Q_E)^{-1}\big(P_{E,E'} \;\; I_{E,E'}\big)\Big)\sqrt{n}\,T_n + \big(C_E(Q_E)^{-1} \;\; -I\big)\,\Omega_n + o_p(1) < \lambda\cdot\big(1 + C_E(Q_E)^{-1}s_E\big).
\]

    11.3 Proofs for Section 5

    Proof. Theorem 5.1: Under the parameteri