multiclass and multilabel prediction - arXiv · 2020-04-22 · Knowing what you know: valid con...

Knowing what you know: valid confidence sets in

multiclass and multilabel prediction∗

Maxime Cauchois†1, Suyash Gupta†1, and John C. Duchi1, 2

1Department of Statistics, Stanford University2Department of Electrical Engineering, Stanford University

{maxcauch, suyash28, jduchi}@stanford.edu

Abstract

We develop conformal prediction methods for constructing valid predictive confidencesets in multiclass and multilabel problems without assumptions on the data generatingdistribution. A challenge here is that typical conformal prediction methods—which givemarginal validity (coverage) guarantees—provide uneven coverage, in that they addresseasy examples at the expense of essentially ignoring difficult examples. By leveragingideas from quantile regression, we build methods that always guarantee correct coveragebut additionally provide (asymptotically optimal) conditional coverage for both multiclassand multilabel prediction problems. To address the potential challenge of exponentiallylarge confidence sets in multilabel prediction, we build tree-structured classifiers thatefficiently account for interactions between labels. Our methods can be bolted on top ofany classification model—neural network, random forest, boosted tree—to guarantee itsvalidity. We also provide an empirical evaluation suggesting the more robust coverage ofour confidence sets.

Keywords Multilabel and Multiclass Classification · Conformal Inference · QuantileRegression · Graphical Models

1 Introduction

The average accuracy of a machine-learned model by itself is insufficient to trust the model’sapplication; instead, we should ask for valid confidence in its predictions. Valid here doesnot mean “valid under modeling assumptions,” or “trained to predict confidence,” but honestvalidity, independent of the underlying distribution. In particular, for a supervised learningtask with inputs x ∈ X , targets y ∈ Y, and a given confidence level α ∈ (0, 1), we seekconfidence sets C(x) such that P (Y ∈ C(X)) ≥ 1 − α; that is, we cover the true target Ywith a given probability 1− α.

The typical approach in supervised learning is to learn a scoring function s : X × Y → Rwhere high scores s(x, y) mean that y is more likely for a given x. Given such a score function,

∗Research supported by NSF CAREER award CCF-1553086, ONR Young Investigator Program awardN00014-19-2288, and NSF award HDR-1934578 (the Stanford Data Science Collaboratory)†Equal contribution.

1

arX

iv:2

004.

1018

1v1

[st

at.M

L]

21

Apr

202

0

a natural goal for prediction with confidence is to compute a quantile function qα of the scoresatisfying

P (s(x, Y ) ≥ qα(x) | X = x) ≥ 1− α, (1)

where α > 0 is some pre-fixed level of acceptable error. We could then output conditionallyvalid confidence sets at level 1− α by simply returning

{y ∈ Y | s(x, y) ≥ qα(x)}

for each x ∈ X . Unfortunately, such conditionally valid coverage is impossible without eithervacuously loose thresholds [32] or strong modeling assumptions [32, 1], but this idea formsthe basis for our coming approach.

To address this impossibility, conformal inference [33] resolves instead to a marginal cov-erage guarantee: given n observations and a desired confidence level 1−α, conformal inferencemethods construct confidence sets C(x) such that for a new pair (Xn+1, Yn+1) from the samedistribution, Yn+1 ∈ C(Xn+1) with probability at least 1− α, where the probability is jointlyover X and Y . Conformal inference algorithms can build upon arbitrary predictors, neuralnetworks, random forests, kernel methods and treat them as black boxes, “conformalizing”them post-hoc.

This distribution-free coverage is only achievable marginally, and standard conformal pre-dictions for classification achieve it (as we see later) by providing good coverage on easy ex-amples at the expense of miscoverage on harder instances. We wish to provide more uniformcoverage, and we address this in multiclass and multilabel problems. We combine the ideas ofconformal prediction [33] with an approach to fit a quantile function q on the scores s(x, y) ofthe prediction model—an approach Romano et al. [27] originate for regression problems—andbuild feature-adaptive quantile predictors, which output sets of labels, allowing us to guar-antee valid marginal coverage (independent of the data generating distribution) while betterapproximating the conditional coverage (1).

Conformal inference in classification We consider both multiclass (where each examplebelongs to a single class) and multilabel (where each example may belong to several classes)classification problems.

In the multiclass case, we propose a method that fits a quantile function on the scores,conformalizing it on held-out data. While this immediately provides valid marginal coverage,the accuracy of the quantile function—how well it approximates the conditional quantiles—determines conditional coverage performance. Under certain consistency assumptions on thelearned scoring functions and quantiles as sample size increases, we show in Section 2 that werecover the conditional coverage (1) asymptotically.

The multilabel case poses new statistical and computational challenges, as a K-class prob-lem entails 2K potential responses. In this case, we seek efficiently representable inner andouter sets Cin(x) and Cout(x) ⊂ {1, . . . ,K} such that

P(Cin(X) ⊂ Y ⊂ Cout(X)) ≥ 1− α. (2)

We propose two approaches to guarantee the containments (2). The first directly fits innerand outer sets by solving two separate quantile regression problems. The second begins byobserving that labels are frequently correlated—think, for example, of chairs, which frequentlyco-occur with a table—and learns a tree-structured graphical model [19] to efficiently addresssuch correlation. We show how to build these in top of any predictive model.

2

Related work and background Vovk et al. [33] introduce (split-)conformal inference,which splits the first n samples of the exchangable pairs {(Xi, Yi)}n+1

i=1 into two sets (say, eachof size n1 and n2 respectively) where the first training set (I1) is used to learn a scoringfunction s : X × Y → R and the second validation set ( I2) use the scoring function to“conformalize” to construct a confidence set over potential labels (or targets) Y, definingconfidence sets of the form

C(x) := {y ∈ Y | s(x, y) ≥ t}

for some threshold t. The basic split-conformal method chooses the (1+1/n2)(1−α)-empiricalquantile Qmarg

1−α of the negative of the scores {−s(Xi, Yi)}i∈I2 (on the validation set) and definesthe confidence sets

C(x) := {y ∈ Y | s(x, y) ≥ −Qmarg1−α }. (3)

Such a confidence set satisfies 1− α marginal coverage guarantee for the pair (Xn+1, Yn+1).We refer to the procedure (3) as the Marginal conformal prediction method.While conformal inference guarantees marginal coverage without assumption on the distri-

bution that generates the data, Vovk [32] shows it is virtually impossible to attain distribution-free conditional coverage, and Barber et al. [1] prove that in regression, one can only achievea weaker form of approximately-conditional coverage without conditions on the underlyingdistribution. Because of this theoretical limitation, work in conformal inference often fo-cuses on minimizing confidence set sizes or guaranteeing different types of coverage. Forinstance, Sadinle et al. [28] propose conformal prediction algorithms for multiclass problemsthat minimize the expected size of the confidence set and conformalize the scores separatelyfor each class, providing class-wise coverage. In the same vein, Hechtlinger et al. [15] usedensity estimates of p(x | y) as conformal scores to build a “cautious” predictor, the ideabeing that it should output an empty set when the new sample differs too much from theoriginal distribution. In regression problems, Romano et al. [27] conformalize a quantile pre-dictor, which allows them to build marginally valid confidence sets that are adaptive to featureheterogeneity and empirically smaller on average than purely marginal confidence sets. Weadapt this regression approach to classification tasks, learning quantile functions to constructvalid—potentially conditionally valid—confidence predictions.

Notation P is a set of distributions on X × Y, where Y = {0, 1}K is a discrete set oflabels, K ≥ 2 is the number of classes, and X ⊂ Rd. We assume that we observe a finite

sequence (Xi, Yi)1≤i≤niid∼ P from some distribution P = PX×PY |X ∈ P and wish to predict a

confidence set for a new example Xn+1 ∼ PX . P stands over the randomness of both the newsample (Xn+1, Yn+1) and the full procedure. Finally, we tacitly identify the one-hot vectorY ∈ Y = {0, 1}K with the subset of [K] = {1, . . . ,K} it represents, and the notation X ⇒ [K]indicates a set-valued mapping between X and [K].

2 Conformal multiclass classification

We begin with multiclass classification problems, developing Conformalized Quantile Classi-fication (CQC) to construct finite sample marginally valid confidence sets. CQC is similar toRomano et al.’s Conformalized Quantile Regression (CQR): we estimate a quantile functionof the scores, which we use to construct valid confidence sets after conformalization. The ideais to split the data into subsets, I1, I2, and I3 with sample sizes n1, n2 and n3 respectively,where I3 is disjoint from I1 ∪ I2. The basic idea is as follows, as Fig. 1 outlines: we use the

3

set I1 for fitting the scoring function with an (arbitrary) given learning algorithm A, use I2for fitting a quantile function from a family Q ⊂ X → R of possible quantile functions to theresulting scores, and use I3 for calibration. In the algorithm, we recall the “pinball loss” [18]ρα(t) = (1−α) [−t]+ +α [t]+, which satisfies argminq∈R E[ρα(Z − q)] = inf{q | P(Z ≤ q) ≥ α}for any random variable Z.

Algorithm 1: Split Conformalized Quantile Classification (CQC).

Input: Sample {(Xi, Yi)}ni=1, index sets I1, I2, I3 ⊂ [n], fitting algorithmA, quantilefunctions Q, and desired confidence level α

1. Fit scoring function vias := A ((Xi, Yi)i∈I1) . (4)

2. Fit quantile function via

qα ∈ argminq∈Q

{1

|I2|∑i∈I2

ρα (s(Xi, Yi)− q(Xi))

}(5)

3. Calibrate by computing conformity scores Si = qα(Xi)− s(Xi, Yi), defining

Q1−α(S, I3) := (1 + 1/n3)(1− α) empirical quantile of {Si}i∈I3

and return prediction set function

C1−α(x) := {k ∈ [K] | s(x, k) ≥ qα(x)−Q1−α(S, I3)} . (6)

The CQC algorithm in Fig. 1 returns an estimated confidence set function C1−α, which on anew input Xn+1, predicts the set C1−α(Xn+1).

2.1 Finite sample validity of CQC

We begin by showing that Alg. 1 enjoys the coverage guarantees we expect, which is moreor less immediate by the marginal guarantees for the method (3). We include the proof forcompleteness and because its cleanness highlights the ease of achieving validity.

Theorem 1. Let (Xi, Yi) ∼ P, i = 1, . . . , n + 1 be exchangeable, where P is an arbitrarydistribution. Then the prediction set C1−α of the split CQC Algorithm 1 satisfies

P{Yn+1 ∈ C1−α(Xn+1)} ≥ 1− α.

Proof The argument is due to Romano et al. [27, Thm. 1]. Observe that Yn+1 ∈ C1−α(Xn+1)if and only if Sn+1 ≤ Q1−α(S, I3). Define the σ-field F12 = σ {(Xi, Yi) | i ∈ I1 ∪ I2}. Then

P(Yn+1 ∈ C1−α(Xn+1) | F12) = P(Sn+1 ≤ Q1−α(S, I3) | F12).

We use the following lemma.

Lemma 2.1 (Lemma 2, Romano et al. [27]). Let Z1, . . . , Zn+1 be exchangable random vari-ables and Qn(·) be the empirical quantile function of Z1, . . . Zn. Then for any α ∈ (0, 1),

P(Zn+1 ≤ Qn((1 + n−1)α)

)≥ α.

4

If Z1, . . . , Zn are almost surely distinct, then

P(Zn+1 ≤ Qn((1 + n−1)α)

)≤ α+

1

n.

Since the original sample is exchangeable, so are the conformity scores Si for i ∈ I3,conditionally on F12. Lemma 2.1 implies P(Sn+1 ≤ Q1−α(S, I3) | F12) ≥ 1 − α, and takingexpectations over F12 yields the theorem.

The conditional distribution of scores given X is discrete, so the conformalized quantileconfidence set may be conservative: it is possible that for any q such that P (s(X,Y ) ≥ q |X) ≥ 1 − α, we have P (s(X,Y ) ≥ q | X) ≥ 1 − ε for some ε � α. Conversely, as the CQCprocedure 1 seeks 1−α marginal coverage, it may sacrifice a few examples to bring the coveragedown to 1 − α. Moreover, there may be no unique quantile function for the scores. Thereare several possibilities to address these issues: the first is to estimate a quantile function onI2 (recall step (5) of the CQC method) so that we can guarantee higher coverage (which ismore conservative, but is free in the ε� α case). An alternative, which we outline here, is torandomize scores without changing the relative order.

To achieve exact 1 − α asymptotic coverage and a unique quantile, we randomize byadding a continuous random variable with full support on the real line simultaneously to each

sample score. Let Ziiid∼ π for some distribution π with continuous density that is supported

on the entire real line, and let σ > 0 be a noise parameter. Then for any scoring functions : X × Y → R, we define the randomized scores

sσ(x, y, z) := s(x, y, z) + σz.

Note that sσ(x, y, z)− sσ(x, y′, z) = s(x, y)− s(x, y′), so that this does not change the relativeordering of label scores, only giving the score function a conditional density.

Now, consider replacing the quantile estimator (5) with the randomized estimator

qσα ∈ argminq∈Q

{1

n2

∑i∈I2

ρα(sσ(Xi, Yi, Zi)− q(Xi))

}, (7)

and let Sσi := qσα(Xi)−sσ(Xi, Yi, Zi) be the corresponding conformity scores. Similarly, replacethe prediction set (6) with

Cσ1−α(x, z) := {k ∈ [K] | sσ(x, k, z) ≥ qσα(x)−Q1−α(Sσ, I3)} . (8)

where Q1−α(Sσ, I3) is the (1−α)(1+1/n3)-th empirical quantile of {Sσi }i∈I3 . Then for a newinput Xn+1 ∈ X , we simulate a new independent variable Zn+1 ∼ π, and give the confidenceset Cσ1−α(Xn+1, Zn+1).

As the next theorem shows, this gives nearly perfectly calibrated coverage.

Theorem 2. Let (Xi, Yi) ∼ P, i = 1, . . . , n + 1 be exchangeable, where P is an arbitrarydistribution. Let the estimators (7) and (8) replace the estimators (5) and (6) in the CQCAlgorithm 1, respectively. Then the prediction set Cσ1−α satisfies

1− α ≤ P{Yn+1 ∈ Cσ1−α(Xn+1, Zn+1)} ≤ 1− α+1

1 + n3.

5

Proof The argument is again more or less due to Romano et al. [27]. First, we observe thatYn+1 ∈ Cσ1−α(Xn+1, Zn+1) if and only if Sσn+1 ≤ Q1−α(Sσ, I3), which implies, conditionallyon the σ-field F2 = σ((Xi, Yi) | i ∈ I2), that

P(Yn+1 ∈ Cσ1−α(Xn+1, Zn+1) | F2

)= P

(Sσn+1 ≤ Q1−α(Sσ, I3) | F2

).

As the original observations are exchangeable, so are the conformity scores Sσi for i ∈ I3, aswe add independent noise to each, which guarantees the Sσi are distinct almost surely. Thenapplying Lemma 2.1 and taking expectations gives

1− α ≤ P(Sσn+1 ≤ Q1−α(Sσ, I3)) ≤ 1− α+1

n3 + 1,

as desired.

2.2 Asymptotic optimality of CQC method

Under appropriate assumptions typical in proving the consistency of prediction methods,Conformalized Quantile Classification (CQC) guarantees conditional coverage asymptotically.To set the stage, assume that as n ↑ ∞, the fit score functions sn in Eq. (4) converge to afixed s : X × Y → R (cf. Assumption A1). Let

qα(x) := inf {z ∈ R | α ≤ P (s(x, Y ) ≤ z | X = x)}

be the α-quantile function of the limiting scores s(x, Y ) and for σ > 0 define

qσα(x) := inf {z ∈ R | α ≤ P (sσ(x, Y, Z) ≤ z | X = x)} ,

where Z ∼ π (for a continuous distribution π as in the preceding section) is independent ofx, y, to be the α-quantile function of the noisy scores sσ(x, Y, Z) = s(x, Y ) +σZ. With these,we can make the following natural definitions of our desired asymptotic confidence sets.

Definition 2.1. The randomized-oracle and super-oracle confidence sets are Cσ1−α(X,Z) :={k ∈ [K] | sσ(X, k, Z) ≥ qσα(X)} and C1−α(X) = {k ∈ [K] | s(X, k) ≥ qα(X}, respectively.

Our aim will be to show that the confidence sets of the split CQC method 1 (or its randomizedvariant) converge to these confidence sets under appropriate consistency conditions.

To that end, we consider the following consistency assumption.

Assumption A1 (Consistency of scores and quantile functions). The score functions s andquantile estimator qσα are mean-square consistent, that is,

‖s− s‖2L2(PX) :=

∫X‖s(x, ·)− s(x, ·)‖2∞dPX(x)

p→ 0

and

‖qσα − qσα‖2L2(PX) :=

∫X

(qσα(x)− qσα(x))2dPX(x)p→ 0

as n1, n2, n3 →∞.

With this assumption, we have the following theorem, whose proof we provide in Appendix A.1.

6

Theorem 3. Let Assumption A1 hold. Then the confidence sets Cσ1−α satisfy

limn→∞

P(Cσ1−α(Xn+1, Zn+1) 6= Cσ1−α(Xn+1, Zn+1)) = 0.

While we do not require this for Theorem 3, in the ideal scenario that the limiting scorefunction s is optimal (cf. [2, 36, 31]), so that s(x, y) > s(x, y′) whenever P (Y = y | X = x) >P (Y = y′ | X = x). Under this additional condition, the super-oracle confidence set C1−α(X)in Def. 2.1 is the smallest conditionally valid confidence set at level 1− α. Conveniently, ourrandomization is consistent as σ ↓ 0: the super oracle confidence set C1−α contains Cσ1−α(with high probability). We provide the proof of the following theorem in Appendix A.2.

Theorem 4. The confidence sets Cσ1−α satisfy

limσ→0

P(Cσ1−α(Xn+1, Zn+1) ⊆ C1−α(Xn+1)) = 1.

Because we always have P(Y ∈ Cσ1−α(X,Z)) ≥ 1 − α, Theorem 4 shows that we maintainvalidity while potentially shrinking the confidence sets.

3 The multilabel setting

In multilabel classification, we observe a single feature vector x ∈ X and wish to predict avector y ∈ {−1, 1}K where yk = 1 indicates that label k is present and yk = 0 indicates itsabsence. For example, in object detection problems [26, 30, 22], we want to detect severalentities in an image, while in text classification tasks [23, 24, 17], a single document canpotentially share multiple topics. To conformalize such predictions, we wish to output anaggregated set of predictions {y(x)} ⊂ {−1, 1}K—a collection of −1-1-valued vectors—thatcontains the true configuration Y = (Y1, . . . , YK) with probability at least 1− α.

The multilabel setting poses both statistical and computational challenges. On the sta-tistical side, multilabel prediction engenders a multiple testing challenge: even if each taskhas an individual confidence set Ck(x) such that P (Yk ∈ Ck(X)) ≥ 1 − α, in general we canonly conclude that P (Y ∈ C1(X)× · · · × CK(X)) ≥ 1−Kα, though as all predictions sharethe same features x, we wish to leverage correlation through x. Additionally, as we discussin the introduction (recall Eq. (3)), we require a scoring function s : X × Y → R. A naivescoring function, given a predictor y : X → {−1, 1}k, for multilabel problems is to simply uses(x, y) = 1 {y(x) = y}, but this fails, as the confidence sets contain either all configurationsor a single configuration. On the computational side, the total number of label configurations(2K) grows exponentially, so a “standard” multiclass-like approach outputting confidence setsC(X) ⊂ Y = {−1, 1}K , although feasible for small values of K, is computationally impracticaleven for moderate K.

We instead propose using inner and outer set functions Cin, Cout : X ⇒ [K] to efficientlydescribe a confidence set on Y, where we require they satisfy the coverage guarantee

P(Cin(X) ⊂ Y ⊂ Cout(X)

)≥ 1− α, (9a)

or equivalently, we learn two functions yin, yout : X → Y = {−1, 1}K such that

P (yin(X) � Y � yout(X)) ≥ 1− α. (9b)

We thus say that coverage is valid if the inner set exclusively contains positive labels and theouter set contains at least all positive labels. For example, yin(x) = 0 and yout(x) = 1 are

7

always valid, though uninformative, while the smaller the set difference between Cin(X) andCout(X) the more confident we may be in a single prediction.

In the remainder of this section, we propose methods to conformalize multilabel predic-tors in varying generality, using tree-structured graphical models to both address correlationsamong labels and computational efficiency. We begin in Section 3.1 with a general methodfor conformalizing an arbitrary scoring function s : X × Y → R on multilable vectors, whichguarantees validity no matter the score. We then show different strategies to efficiently imple-ment the method, depending on the structure of available scores, in Section 3.2, showing how“tree-structured” scores allow computational efficiency while modeling correlations among thetask labels y. Finally, in Section 3.3, we show how to build such a tree-structured score func-tion s from both arbitrary predictive models (e.g. [4, 35]) and those more common multilabelpredictors—in particular, those based on neural networks—that learn and output per-taskscores sk : X → R for each task [16, 25].

3.1 A generic split-conformal method for multilabel sets

We begin by assuming we have a general score function s : X ×Y → R for Y = {−1, 1}K thatevaluates the quality of a given set of labels y ∈ Y for an instance x; in the next subsection,we describe how to construct such scores from multilabel prediction methods, presenting ourgeneral method first. We consider the variant of the CQC method 1 in Alg. 2.

Algorithm 2: Split Conformalized Inner/Outer method for classification (CQioC)

Input: Sample {(X1, Y1)}ni=1, disjoint index sets I2, I3 ⊂ [n], quantile functions Q,desired confidence level α, and score function s : X × Y → R.

1. Fit quantile function qα ∈ argminq∈Q{∑

i∈I2 ρα(s(Xi, Yi)− q(Xi))}, as in (5).

2. Compute conformity scores Si = qα(Xi)− s(Xi, Yi), define Q1−α as (1 + 1/|I3|) ·(1− α) empirical quantile of S = {Si}i∈I3 , and return prediction set function

Cio(x) := {y ∈ Y | yin(x) � y � yout(x)} , (10)

where yin and yout satisfy

yin(x)k = 1 if and only if maxy∈Y:yk=−1

s(x, y) < qα(x)− Q1−α

yout(x)k = −1 if and only if maxy∈Y:yk=1

s(x, y) < qα(x)− Q1−α.(11)

There are two considerations for Algorithm 2: its computational efficiency and its validityguarantees. Deferring the efficiency questions to the coming subsections, we begin with thelatter. The naive approach is to simply use the “standard” or implicit conformalizationapproach, used for regression or classification, by defining

Cimp(x) :={y ∈ Y | s(x, y) ≥ qα(x)− Q1−α

}, (12)

where qα and Q1−α are as in the CQioC method 2. This does guarantee validity, as we havethe following corollary of Theorem 1.

8

Corollary 3.1. Assume that (Xi, Yi)n+1i=1 ∼ P are exchangeable, where P is an arbitrary

distribution. Then for any confidence set C :⇒ Y satisfying C(x) ⊃ Cimp(x) for all x,

P(Yn+1 ∈ C(Xn+1)) ≥ 1− α.

Instead of the inner/outer set Cio in Eq. (10), we could use any confidence set C(x) ⊃ Cimp(x)

and maintain validity. Unfortunately, as we note above, the set Cimp may be exponentiallycomplex to represent and compute, necessitating a simplifying construction, such as our in-ner/outer approach. Conveniently, the set (10) we construct via yin and yout satisfying theconditions (11) satisfies Corollary 3.1. Indeed, we have the following.

Proposition 1. Let yin and yout satisfy the conditions (11). Then the confidence set Cio(x)in Algorithm 2 is the smallest set containing Cimp(x) and admitting the form (10).

Proof The conditions (11) immediately imply

yin(x)k = miny∈Cimp(x)

yk and yout(x)k = maxy∈Cimp(x)

yk,

which shows that Cimp(x) ⊂ Cio(x). On the other hand, suppose that yin(x) and yout(x) are

configurations inducing a confidence set C(x) that satisfies Cimp(x) ⊂ C(x). Then for any

label k included in yin(x), all configurations y ∈ Cimp(x) satisfy yk = 1, as Cimp(x) ⊂ C(x),so we must have yin(x)k ≤ yin(x)k. The argument to prove yout(x)k ≥ yout(x)k is similar.The expansion from Cimp(x) to Cio(x) may increase the size of the confidence set; in practice,we see it is rarely a substantial increase, as our predictors address correlation between indi-vidual labels yk, making it unlikely that “opposing” configurations y and −y both belong toCio(x). Moreover, in the case that the standard implicit confidence set Cimp is a singleton,

so that there is only a single label vector y ∈ Y satisfying s(x, y) ≥ qα(x) − Q1−α, then bydefinition yin(x) = yout(x), so that Cio = Cimp. Thus, at least when we have strong predictors,the expansion (11) has limited downside.

3.2 Efficient construction of inner and outer confidence sets

With the validity of any inner/outer construction verifying the conditions (11) established,we turn to two approaches to efficiently satisfy these. The first focuses on the scenario wherea prediction method provides individual scoring functions sk : X → R for each task k ∈ [K](as frequent for complex classifiers, such as random forests or deep networks [e.g. 16, 25]),while the second considers the case when we have a scoring function s : X × Y → R that istree-structured in a sense we make precise; in the next section, we will show how to constructsuch tree-structured scores using any arbitrary multilabel prediction method.

A direct inner/outer method using individual task scores We assume here that weobserve individual scoring functions sk : X → R for each task. We can construct inner andouter sets using only these scores while neglecting label correlations by learning thresholdfunctions tin ≥ tout : X → R, where we would like to have the (true) labels yk satisfy

sign(sk(x)− tin(x)) ≤ yk ≤ sign(sk(x)− tout(x)).

In Algorithm 3, we accomplish this via quantile threshold functions on the maximal andminimal values of the scores sk for positive and negative labels, respectively.

9

Algorithm 3: Split Conformalized Direct Inner/Outer method for Classification(CDioC).

Input: Sample {(Xi, Yi)}ni=1, disjoint index sets I2, I3 ⊂ [n], quantile functions Q,desired confidence level α, and K score functions sk : X → R.

1. Fit threshold functions (noting the sign conventions)

tin = − argmint∈Q

{∑i∈I2

ρα/2

(−max

k{sk(Xi) | Yi,k = −1} − t(Xi)

)}tout = argmin

t∈Q

{∑i∈I2

ρα/2

(mink{sk(Xi) | Yi,k = 1} − t(Xi)

)} (13)

2. Define score s(x, y) = min{mink:yk=1 sk(x) − tout(x), tin(x) − maxk:yk=−1 sk(x)}and compute conformity scores Si := −s(Xi, Yi).

3. Let Q1−α be the (1 + 1/|I3|)(1− α)-empirical quantile of S = {Si}i∈I3 and

tin(x) := tin(x) + Q1−α and tout(x) := tout(x)− Q1−α.

4. Define yin(x)k = sign(sk(x) − tin(x)) and yout(x)k = sign(sk(x) − tout(x)) andreturn prediction set Cio as in Eq. (10).

The method is a slightly modified instantiation of the CQioC method in Alg. 2 that allowseasier computation. We can also see that it guarantees validity.

Corollary 3.2. Assume that (Xi, Yi)n+1i=1 ∼ P are exchangeable, where P is an arbitrary

distribution. Then the confidence set Cio in Algorithm 3 satisfies

P(Yn+1 ∈ Cio(Xn+1)) ≥ 1− α.

Proof We show that the definitions of yin and yout in Alg. 3 are special cases of thecondition (11), which then allows us to apply Corollary 3.1. We focus on the inner set, as theouter is similar. Suppose in Alg. 3 that yin(x)k = 1. Then sk(x)− tin(x)− Q1−α ≥ 0, whichimplies that tin(x)− sk(x) ≤ −Q1−α, and for any y ∈ Y satisfying yk = −1, we have

tin(x)− maxl:yl=−1

sl(x) ≤ −Q1−α.

But of course, for the scores s(x, y) in line 2 of Alg. 3, we then immediately obtain s(x, y) ≤−Q1−α for any y ∈ Y with yk = −1. This is the first condition (11), so that (performing,mutatis-mutandis, the same argument with yout(x)k) Corollary 3.1 then implies the validityof Cio in Algorithm 3.

A prediction method for tree-structured scores We now turn to efficient computationof the inner and outer vectors (11) when the scoring function s : X ×Y → R is tree-structured.By this we mean that there is a tree T = ([K], E), with nodes [K] and edges E ⊂ [K]2, andan associated set of pairwise and singleton factors ψe : {−1, 1}2 × X → R for e ∈ E and

10

ϕk : {−1, 1} × X → R for k ∈ [K] such that

s(x, y) =K∑k=1

ϕk(yk, x) +∑

e=(k,l)∈E

ψe(yk, yl, x). (14)

Such a score allows us to consider interactions between tasks k, l while maintaining computa-tional efficiency, and we will show how to construct such a score function both from arbitrarymultilabel prediction methods and from those with individual scores as above. When wehave a tree-structured score (14), we can use efficient message-passing algorithms [7, 19] tocompute the collections (maximum marginals) of scores

S− :={

maxy∈Y:yk=−1

s(x, y)}Kk=1

and S+ :={

maxy∈Y:yk=1

s(x, y)}Kk=1

(15)

in time O(K), from which it is immediate to construct yin and yout as in the conditions (11).We outline the approach in Appendix A.3, as it is not the central theme of this paper, thoughthis efficiency highlights the importance of the tree-structured scores for practicality.

3.3 Building tree-structured scores

With the descriptions of the generic multilabel conformalization method in Alg. 2 and that wecan efficiently compute predictions using tree-structured scoring functions (14), we now turnto constructing such scoring functions from predictors, which trade between label dependencystructure, computational efficiency, and accuracy of the predictive function. We begin witha general case of an arbitrary predictor function, then describe a heuristic graphical modelconstruction when individual label scores are available (as we assume in Alg. 3).

From arbitrary predictions to scores We begin with the most general case that we haveaccess only to a predictive function y : X → RK . This prediction function is typically theoutput of some learning algorithm, and in the generality here, may either output real-valuedscores yk(x) ∈ R or simply output yk(x) ∈ {−1, 1}, indicating element k’s presence.

We compute a regularized scoring function based on a tree-structured graphical model(cf. [19]) as follows. Given a tree T = ([K], E) on the labels [K] and parameters α ∈ RK ,β ∈ RE , we define

sT ,α,β(x, y) :=K∑k=1

αkykyk(x) +∑

e=(k,l)∈E

βeykyl (16)

for all (x, y) ∈ X ×{−1, 1}K , where we recall that we allow yk(x) ∈ R. We will find the tree Tassigning the highest (regularized) scores to the true data (Xi, Yi)

ni=1 using efficient dynamic

programs reminiscent of the Chow-Liu algorithm [6]. To that end, we use Algorthim 4.

Algorithm 4: Method to find optimal tree from arbitrary predictor y.

Input: Convex regularizers r1, r2 : R→ R, data (Xi, Yi)ni=1, predictor y : X → RK .

Set

(T , α, β) := argmaxT =([K],E),α,β

{ n∑i=1

sT ,α,β(Xi, Yi)−K∑k=1

r1(αk)−∑e∈E

r2(βe)

}. (17)

and return score function sT ,α,β of form (16).

11

Because the regularizers r1, r2 decompose along the edges and nodes of the tree, we canimplement Alg. 4 using a maximum spanning tree algorithm. Indeed, recall the familiarconvex conjugate r∗(t) := supα{αt− r(α)} [5]. Then immediately

supα,β

{ n∑i=1

sT ,α,β(Xi, Yi)−K∑k=1

r1(α1)−∑e∈E

r2(βe)

}

=

K∑k=1

r∗1

(n∑i=1

Yi,kyk(Xi)

)+

∑e=(k,l)∈E

r∗2

(n∑i=1

Yi,kYi,l

),

which decomposes along the edges of the putative tree. As a consequence, we may solveproblem (17) by finding the maximum weight spanning tree in a graph with edge weightsr∗2(∑n

i=1 Yi,kYi,l) for each edge (k, l), then choosing α, β to maximize the objective (17), whichis a collection of 1-dimensional convex optimization problems.

From single-task scores to a tree-based probabilistic model While Algorithm 4 willwork regardless of the predictor it is given—which may simply output a vector y ∈ {−1, 1}K ,as in Alg. 3 it is frequently the case that multilabel methods output scores sk : X → R foreach task. To that end, a natural strategy is to model the distribution of Y | X directly viaa tree-structured graphical model [21]. Similar to the score in Eq. (14), we define interactionfactors ψ : {−1, 1}2 → R4 by ψ(−1,−1) = e1, ψ(1,−1) = e2, ψ(−1, 1) = e3 and ψ(1, 1) = e4,the standard basis vectors, and marginal factors ϕk : {−1, 1} × X → R2 with

ϕk(yk, x) :=1

2

[(yk − 1) · sk(x)(yk + 1) · sk(x)

],

incorporating information sk(x) provides on yk. For a tree T = ([K], E), the label model is

pT ,α,β (y | x) ∝ exp

( ∑e=(k,l)∈E

βTe ψ(yk, yl) +K∑k=1

αTk ϕk(yk, x)

), (18)

where (α, β) where {ηα}α∈E∪[K] is a set of parameters such that, for each edge e ∈ E, βe ∈ R4

and 1Tβe = 0 (for identifiability), while αk ∈ R2 for each label k ∈ [K]. Because we viewthis as a “bolt-on” approach, applicable to any method providing scores sk, we include onlypairwise label interaction factors independent of x, allowing singleton factors to depend onthe observed feature vector x through the scores sk.

The log-likelihood log pT ,α,β is convex in (α, β) for any fixed tree T , and the Chow-Liudecomposition [6] of the likelihood of a tree T = ([K], E) gives

log pT ,α,β(y | x) =

K∑k=1

log pT ,α,β(yk | x) +∑

e=(k,l)∈E

logpT ,α,β(yk, yl | x)

pT ,α,β(yk | x)pT ,α,β(yl | x), (19)

that is, the sum of the marginal log-likelihoods and pairwise mutual information terms, con-ditional on X = x. Given a sample (Xi, Yi)

ni=1, the goal is to then solve

maximizeT ,α,β

Ln(T , α, β) :=

n∑i=1

log pT ,α,β (Yi | Xi) . (20)

12

When there is no conditioning on x, the maximal pairwise mutual information terms logpT ,α,β(yk,yl)p(yk)p(yl)

are independent of the tree T , allowing Chow and Liu [6] to use a maximum-spanning-treealgorithm to find the tree maximizing the likelihood (20). We heuristically compute empiricalconditional mutual informations between each pair (k, l) of tasks, choosing the tree T thatmaximizes these values to approximate problem (20) in Algorithm 5, using the selected treeT to choose α, β maximizing Ln(T , α, β). (In the algorithm we superscript Y to make clearertask labels versus observations.)

Algorithm 5: Chow-Liu-type approximate Maximum Likelihood Tree and ScoringFunction

Input: Sample (X(i), Y (i))ni=1

For each pair e = (k, l) ∈ [K]2

1. Define the single-edge tree Te = ({k, l}, {e})

2. Fit model (19) for tree T via (α, β) := argmaxα,β Ln(Te, α, β)

3. Estimate empirical mutual information

Ie :=n∑i=1

log

(pTe,α,β(Y

(i)k , Y

(i)l | X(i))

pTe,α,β(Y(i)k | X(i))pTe,α,β(Y

(i)l | X(i))

)

Set T = MaxSpanningTree((Ie)e∈[K]2) and (α, β) = argmaxα,β Ln(T , α, β).Return scoring function

sT ,α,β(x, y) :=∑

e=(k,l)∈E

βTe ψ(yk, yl) +

K∑k=1

αTk ϕk(yk, x)

The algorithm takes time roughly O(nK2 + K2 log(K)), as each optimization step solves aconcave maximization problem, which is straightforward via a Newton method (or gradientdescent). The approach does not guarantee recovery of the correct tree structure even ifthe model is well-specified, as we neglect any potential information coming from labels otherthan k and l in the estimates Ie for edge e = (k, l), although we expect the heuristic to returnsufficiently reasonable tree structures. In any case, the actual scoring function s it returns stillallows efficient conformalization and valid predictions via Alg. 2, regardless of its accuracy; amore accurate tree will simply allow smaller and more accurate confidence sets (11).

4 Experiments

JCD Comment: We need to have much better explanations in all the figures, aswell as to fix their labeling to correspond to the text in a better way. This shouldnot be too hard using tikz and other packages.

Our main motivation is to design methods with more robust conditional coverage than the“marginal” split-conformal method (3). Accordingly, the methods we propose in Sections 2and 3 fit conformalization scores that depend on the features x and, in some cases, model

13

dependencies among y variables. Our experiments consequently focus on more robust notionsof coverage than the nominal marginal coverage the methods guarantee, and we develop anew evaluation metric for validation and testing of coverage, looking at connected subsets ofthe space X and studying coverage over these. Broadly, we expect our proposed methods tomaintain coverage of (nearly) 1 − α across subsets; as we see presently, the experiments areconsistent with this expectation.

Measures beyond marginal coverage Except in simulated experiments, we cannot com-pute conditional coverage of each instance. Instead, we look at coverage over slabs

Sv,a,b :={x ∈ Rd | a ≤ vTx ≤ b

},

where v ∈ Rd and a < b ∈ R, which (as we see presently) allow some calculation of conditionalcoverage in X -space and statistical validity. For a direction v and threshold 0 < δ ≤ 1, wethen consider the worst coverage over all such slabs containing δ mass in {Xi}ni=1, defining

WSCn(C, v) := infa<b

{Pn(Y ∈ C(X) | a ≤ vTX ≤ b) s.t. Pn(a ≤ vTX ≤ b) ≥ δ

}, (21)

where Pn denotes the empirical distribution on (Xi, Yi)ni=1. As long as the mapping C : X ⇒ Y

is constructed independently of Pn, we can show that these quantities concentrate. Indeed,let us temporarily assume that the confidence set has the form C(x) = {y | s(x, y) ≥ q(x)}for some (arbitrary) scoring s : X ×Y → R and threshold functions q. Let V ⊂ Rd; we abusenotation to let VC(V ) be the VC-dimension of the set of halfspaces it induces, where we notethat VC(V ) ≤ min{d, log2 |V |}. Then for some numerical constant C, for all t > 0

supv∈V,a≤b:Pn(X∈Sv,a,b)≥δ

{|Pn(Y ∈ C(X) | X ∈ Sv,a,b)− P (Y ∈ C(X) | X ∈ Sv,a,b)|

}(22)

≤ C√

VC(V ) log n+ t

δn≤ C

√min{d, log |V |} log n+ t

δn

with probability at least 1− e−t. (See Appendix A.4 for a brief derivation of inequality (22)and a few other related inequalities.)

Each of the confidence sets we develop in this paper satisfy C(x) ⊃ {y | s(x, y) ≥ q(x)} forsome scoring function s and function q. Thus, if C : X ⇒ Y effectively provides conditionalcoverage at level 1− α, we should observe that

infv∈V

WSCn(C, v) ≥ 1− α−O(1)

√VC(V ) log n

δn≥ 1− α−O(1)

√min{d, log |V |} log n

δn.

In each of our coming experiments, we draw M = 1000 samples vj uniformly on Sd−1,computing the worst-slice coverage (21) for each vj . In the multiclass case, we expect ourconformal quantile classification (CQC, Alg. 1) method to provide larger worst-slab cover-age than the standard marginal method (3), while in the multilabel case, we expect thatthe combination of tree-based scores (Algorithms 4 or 5) with the conformalized quantile in-ner/outer classification (CQioC, Alg. 2) should provide larger worst-slab coverage than theconformalized direct inner/outer classification (CDioC, Alg. 3). In both cases, we expectthat our more sophisticated methods should provide confidence sets of comparable size to themarginal methods. Unlike in the multiclass case, we know of no baseline method for multil-abel problems, as the “marginal” method (3) is computationally inefficient when the number

14

Figure 1. Gaussian mixture

with µ0 = (1, 0), µ1 = (− 12 ,√32 ),

µ2 = (− 12 ,−

√32 ), and µ3 = (− 1

2 , 0);and Σ0 = diag(0.3, 1), Σ1 =Σ2 = diag(0.2, 0.4), and Σ3 =diag(0.2, 0.5).

of labels grows. For this reason, we focus on the methods in this paper, highlighting thepotential advantages of each one while comparing them to an oracle (conditionally perfect)confidence set in simulation.

We present four experiments: two with simulated datasets and two with real datasets(CIFAR-10 [20] and Pascal VOC 2012 [12]), and both in multiclass and multilabel settings.In each experiment, a single trial corresponds to a realized random split of the data betweentraining, validation and calibration sets I1, I2 and I3, and in each figure, the red dottedline represents the desired level of coverage 1− α. Unless otherwise specified, we summarizeresults via boxplots that display the lower and upper quartiles as the hinges of the box, themedian as a bold line, and whiskers that extend to the 5% and 95% quantiles of the statisticof interest (typically coverage or average confidence set size).

4.1 Simulation

4.1.1 More uniform coverage on a multiclass example

Our first experiment is a simulation to allow us to compute the conditional coverage of eachsample, studying the performance of our CQC method 1 specifically studies the performanceof our CQC method on small sub-populations, a challenge for traditional machine learningmodels [10, 13]. In contrast to the traditional split-conformal algorithm (method (3), cf. [33]),we expect the quantile estimator (5) in the CQC method 1 to better account for data hetero-geneity, maintaining higher coverage on subsets of the data.

To test this, we generate n = 105 data points {Xi, Yi}i∈[N ] i.i.d. from a Gaussian mixturewith one majority group and three minority ones,

Y ∼ Mult(π) and X | Y = y ∼ N(µy,Σy). (23)

where π = (.7, .1, .1, .1) (see Fig. 1). We purposely introduce more confusion for the threeminority groups (1, 2 and 3), whereas the majority (0) has a clear linear separation. Wechoose α = 10%, the same as the size of the smaller sub-populations, then we apply our CQCmethod and compare it to both the Marginal (3) and the Oracle (which outputs the smallestconditionally valid 1− α confidence set) methods.

The results in Figure 2 are consistent with our expectations. The oracle method dominates,of course, but CQC appears to provide more robust coverage for individual classes than themarginal method. While all methods maintain 1− α coverage marginally, the left plot showsthat 80% of sampled X have coverage at least 1− α (where α = .1) using the CQC method,

15

0.5

0.6

0.7

0.8

0.9

1.0

0.00 0.25 0.50 0.75 1.00Level of Coverage

Fra

ctio

n of

the

popu

latio

n

MethodMarginal

CQC

Oracle

0.00

0.25

0.50

0.75

1.00

Marginal CQC OracleMethod

Cla

ss−w

ise

cove

rage

Class0

1

2

3

X-m

easu

rePX

({x

:P(Y∈C

(X)|X

=x

)≥t}

)

Conditional coverage probability t

Per

-cla

ssco

vera

ge

Figure 2. Simulation results on multiclass problem. Left: X-probability of achieving a given

level t of conditional coverage versus coverage t, i.e., t 7→ PX(P(Y ∈ CMethod(X) | X) ≥ t). The

ideal is to observe t 7→ 1 {t ≤ 1− α}. Right: class-wise coverage P(Y ∈ CMethod(X) | Y = y)on the distribution (23) (as in Fig. 1) for each method. Confidence bands and error bars displaythe range of the statistic over M = 20 trials.

while the marginal method (3) covers 70% of the time (the “bumps” in plot occur because ofthe discrete probabilities of observing each class y). Consequently, the CQC method provideshigher coverage on the sub-populations of the data, as it also higher coverage for minorityclasses (see right plot of Fig. 2).

4.1.2 Improved coverage with graphical models

Our second simulation addresses the multilabel setting, where we have a predictive modeloutputting scores sk(x) ∈ R for each task k. We run three methods. First, the directInner/Outer method (CDioC), Alg. 3. Second, we use the graphical model score from thelikelihood model in Alg. 5 to choose a scoring function sT : X × Y → R, which we call thePGM-CQC method; we then either use the CQC method 1 with this scoring function directly,that is, the implicit confidence set (recall (12)) Cimp(x) = {y ∈ Y | sT (x, y) ≥ q(x)−Q1−α} or

the explicit Cio set of Eqs. (10)–(11). Finally, we do the same except that we use the arbitrarypredictor method (Alg. 4) to construct the score sT , where we the {±1}K assignment y insteadof scores as input predictors, which we term ArbiTree-CQC.

We consider a misspecified logistic regression model, where hidden confounders inducecorrelation between labels. Because of the dependency structure, we expect the tree-basedmethods to output smaller and more robust confidence sets the direct CDioC method 3.Comparing the two score-based methods (CDioC and PGM-CQC) with the scoreless treemethod ArbiTree-CQC is perhaps unfair, as the latter uses less information—only signs ofpredicted labels. Yet we expect the label correlation the ArbiTree-CQC method can leverageto mitigate this disadvantage, at least relative to CDioC.

Our setting follows, where we consider K = 5 tasks and dimension d = 2. For eachexperiment, we choose a tree T = ([K], E) uniformly at random, where each edge e ∈ Ehas a correlation strength value τe ∼ Uni[−5, 5], and for every task k we sample a vector

16

0.80

0.85

0.90

0.95

Oracle CDioC ArbiTree−CQC PGM−CQC

Method

Cov

erag

e

Type of CS

Implicit

Explicit

1.0

1.5

2.0

2.5

3.0

3.5

Oracle CDioC ArbiTree−CQC PGM−CQC

Method

Ave

rage

Siz

e Type of Set

Inner

Outer

P(Y∈C

(X))

E|y

in(X

)|an

dE|y

out(X

)|

Figure 3. Results for simulated multilabel experiment with label distribution (24). Methodsare the true oracle confidence set; the conformalized direct inner/outer method (CDioC), Alg. 3;

and tree-based methods with implicit confidence sets Cimp or explicit inner/outer sets Cio,labeled ArbiTree and PGM(see description in Sec. 4.1.2). Left: Marginal coverage probabilityat level 1− α = .8 for different methods. Right: confidence set sizes of methods.

θk ∼ Uni(rSd−1) with radius r = 5. For each observation, we draw Xiid∼ N(0, Id) and draw

independent uniform hidden variables He ∈ {−1, 1} for each edge e of the tree. Letting Ekbeing the set of edges adjacent to k, we draw Yk ∈ {−1, 1} from the logistic model

P (Yk = yk | X,H) ∝ exp

{−yk

(1 +XT θk +

∑e∈Ek

τeHe

)}. (24)

We simulate ntotal = 200,000 data points, using ntr = 100,000 for training, nv = 40,000 forvalidation, nc = 40,000 for calibration, and nte = 20,000 for testing.

The methods CDioC and PGM-CQC require per-task scores sk, so for each method we fitK separate logistic regressions of Yk against X (leaving the hidden variables H unobserved)on the training data. We then use the fit parameters θk ∈ Rd to define scores sk(x) = θTk x (forthe methods CDioC and PGM-CQC) and the “arbitrary” predictor yk(x) = sign(sk(x)) (formethod ArbiTree-CQC). We use a one-layer fully-connected neural network with 16 hiddenneurons as the class Q in the quantile regression step (5) of the methods; no matter our choiceof Q, the final conformalization step guarantees (marginal) validity.

Figure 3 and 4 show our results. In Fig. 3, we give the marginal coverage of each method(left plot), including both the implicit Cimp and explicity Cio confidence sets, and the average

confidence set sizes for the explicit confidence set Cio in the right plot. We see that theexplicit sets Cio in the ArbiTree-CQC and PGM-CQC methods both overcover, and thesizes of Cio for the PGM-CQC and CDioC methods are similar, though the latter has alarger inner and outer sets. On the other hand, the PGM-CQC explicit confidence sets(as they expand (11) the implicit set Cimp) provide substantially better coverage than thedirect Inner/Outer method CDioC. The confidence sets of the scoreless method ArbiTree-CQC are wider, which is consistent with the limited information available to the method viathe predictions yk = sign(sk).

Figure 4 shows our results on the worst-slab coverage measures (21), dovetailing with ourexpectations that the PGM-CQC and ArbiTree-CQC methods are more robust and feature-

adaptive. Each of the M = 103 experiments corresponds to a draw of viid∼ Uni(Sd−1), then

17

0.6

0.7

0.8

0.9

CDioC ATree−Imp ATree−IO PGM−Imp PGM−IO

Method

Wor

st−S

lab

Cov

erag

e

0.0

0.1

0.2

ATree−Imp ATree−IO PGM−Imp PGM−IO

Methods comparedW

orst

−Sla

b C

over

age

Diff

eren

ce

WS

Cn(C

Meth

od,v

)

WS

Cn( C

Meth

od,v

)−

WS

Cn(C

CDioC,v

)Figure 4. Distribution of the worst-slab coverage (21) (with δ = 20%) on mis-specifed logisticregression model (24) over M = 1000 i.i.d. choices of direction v ∈ Sd−1. Left: distribution

of WSCn(C, v). Right: distribution of difference WSCn(CMethod1, v) −WSCn(CMethod2

, v),Method2 is always the CDioC method (Alg. 3), and the other four are ArbiTree-CQC and

PGM-CQC with both implicit Cimp and explicit Cio confidence sets.

evaluating the worst-slab coverage (21) with δ = .2. In the left plot, we show the variabilityacross draws of v for each method, which shows a substantial difference in coverage betweenthe CDioC method and the others. We also perform direct comparisons in the right plot: wedraw the same directions v for each method, and then (for the given v) evaluate the differenceWSCn(C, v)−WSCn(C ′, v), where C and C ′ are the confidence sets each method produces,respectively. Thus, we see that both tree-based methods always provide better worst-slabcoverage, whether we use the implicit confidence sets Cimp or the larger direct inner/outer

(explicit) confidence sets Cio, though in the latter case, some of the difference likely comesfrom the differences in marginal coverage.

4.2 More robust coverage on the CIFAR 10 dataset

In our first real experiment, we study a multiclass image classification problem with thestandard CIFAR-10 dataset, which consists of n = 60,000 32 × 32 images across 10 classes.We use ntr = 50,000 of them to train the full model, a validation set of size nv = 6,000 forfitting quantile functions and hyperparameter tuning, nc = 3,000 for calibration and the lastnte = 1,000 for testing. We train a standard ResNet 50 [14] architecture for 200 epochs, whichobtains an accuracy of 92.5 ± 0.5% on the test set. We use the d = 256-dimensional outputof the final layer of the ResNet as the inputs X to the quantile estimation.

We then apply the CQC method 1 with α = 5%, and we compare it to the benchmarkmarginal method (3). We expect the latter to typically output small confidence sets, asthe the neural network’s accuracy is close to the confidence level 1 − α; this allows almost(typically) predicting a single label while maintaining marginal coverage. Figure 5 shows thishappens precisely. In Figure 6, however, we compare worst-slab coverage (21) over M = 1000

18

0.92

0.93

0.94

0.95

0.96

0.97

Marginal CQCMethod

Cov

erag

e

1.05

1.10

1.15

1.20

1.25

1.30

Marginal CQCMethod

Ave

rage

Siz

e

P(Y∈C

(X))

E∣ ∣ ∣C(X

)∣ ∣ ∣Figure 5. Marginal coverage and average confidence set size on CIFAR-10 over M = 20 trials.

0.84

0.87

0.90

0.93

Marginal CQCMethod

Wor

st−S

lab

Cov

erag

e

−0.025

0.000

0.025

0.050

CQC − Marginal

Comparison

Wor

st−S

lab

Cov

erag

e D

iffer

ence

WS

Cn(C

Meth

od,v

)

WS

Cn(C

Meth

od1,v

)−

WS

Cn(C

Meth

od2,v

)

Figure 6. Worst-slab coverage for CIFAR-10 with δ = .1 over M = 1000 draws viid∼ Uni(Sd−1).

The dotted line is the desired (marginal) coverage. Left: distribution of worst-slab coverage.

Right: distribution of the coverage difference WSCn(CCQC, v)−WSCn(CMarginal, v).

19

0.75

0.80

0.85

CDioC ArbiTree−CQC PGM−CQCMethod

Cov

erag

eType of CS

Implicit

Explicit

0.25

1.00

4.00

16.00

CDioC ArbiTree−CQC PGM−CQCMethod

Ave

rage

Siz

e

Type of Set

Inner

Outer

P(Y∈C

(X))

E|y

in(X

)|an

dE|y

out(X

)|(l

og-

scal

e)Figure 7. Results for PASCAL VOC dataset [12] over M = 20 trials. Methods are theconformalized direct inner/outer method (CDioC), Alg. 3; and tree-based methods with implicit

confidence sets Cimp or explicit inner/outer sets Cio, labeled ArbiTree and PGM(see descriptionin Sec. 4.1.2). Left: Marginal coverage probability at level 1 − α = .8 for different methods.Right: confidence set sizes of methods.

draws of viid∼ Uni(Sd−1). Here, we see that the CQC method provides slight—but statistically

significant—gain in the worst-slab coverage over the marginal method; we expect this morerobust coverage from CQC, though in this case the effect is relatively small (about .5–1%better coverage).

4.3 A multilabel image recognition dataset

Our final set of experiments considers the multilabel image classification problems in thePASCAL VOC 2007 and VOC 2012 datasets [11, 12], which consist of n2012 = 11540 andn2007 = 9963 distinct 224 × 224 images, where the goal is to predict the presence of entitiesfrom K = 20 different classes, (e.g. birds, boats, people). We compare the direct inner outermethod (CDioC) 3 with the split-conformalized inner/outer method (CQioC) 2, where we usethe tree-based score functions that Algorthims 4 and 5 output.

In this problem, we use the d = 2048 dimensional output of a ResNet-101 with pretrainedweights on the ImageNet dataset [9] as our feature vectors X. For each of the K classes, we fita binary logistic regression model θk ∈ Rd of Yk ∈ {±1} against X, then use sk(x) = θTk x asthe scores for the PGM-CQC method (Alg. 5). We use yk(x) = sign(sk(x)) for the ArbiTree-CQC method 4. The F1-score of the fit predictors on held-out data is 0.77, so we do not expectour confidence sets to be uniformly small while maintaining the required level of coverage, inparticular for ArbiTree-CQC. We use a validation set of nv = 3493 images to fit the quantilefunctions (5) as above using a one layer fully-connected neural network with 16 hidden neuronsand tree parameters.

The results from Figure 8 are consistent with our hypotheses that the tree-based modelsshould improve robustness of the coverage. Indeed, while all methods ensure marginal coverage

20

0.65

0.70

0.75

0.80

0.85

CDioC ATree−Imp ATree−IO PGM−Imp PGM−IO

Method

Wor

st−S

lab

Cov

erag

e

0.0

0.1

ATree−Imp ATree−IO PGM−Imp PGM−IO

Methods comparedW

orst

−Sla

b C

over

age

Diff

eren

ce

WS

Cn(C

Meth

od,v

)

WS

Cn( C

Meth

od,v

)−

WS

Cn(C

CDioC,v

)

Figure 8. Worst-slab coverage for Pascal-VOC with δ = .2 over M = 1000 draws viid∼

Uni(Sd−1). For tree-structured methods labeled ATree and PGM, we compute the worst-slab

coverage using implicit confidence sets Cimp and explicit inner/outer sets Cio (resp. Imp andIO in this figure). The dotted line is the desired (marginal) coverage. Left: distribution

of worst-slab coverage. Right: distribution of the coverage difference WSCn(CMethod, v) −WSCn(CCDioC, v) for Methodi ∈ {ArbiTree-CQC,PGM-CQC} with implicit Cimp or explicit

inner/outer Cio confidence sets.

at level α = .8 (see Fig. 7), Figure 8 shows that worst-case slabs (21) for the tree-basedmethods have closer to 1 − α coverage. In particular, for most random slab directions v,the tree-based methods have higher worst-slab coverage than the direct inner/outer method(CDioC, Alg. 3). At the same time, both the CDioC method and PGM-CQC method (usingCio) provide similarly-sized confidence sets, as the inner set of the PGM-CQC method istypically smaller, as is its outer set. The ArbiTree-CQC method performs poorly on thisexample: its coverage is at the required level, but the confidence sets are too large and thusessentially uninformative.

5 Conclusions

As long as we have access to a validation set, independent of (or at least exchangeable with)the sets used to fit a predictive model, split conformal methods guarantee marginal validity,which gives great freedom in modeling. It is thus of interest to fit models more adaptive to thesignal inputs x at hand—say, by quantile predictions as we have done, or other methods yet tobe discovered—that can then in turn be conformalized. As yet we have limited understandingof what “near” conditional coverage might be possible: Barber et al. [1] provide proceduresthat can guarantee coverage uniformly across subsets of X-space as long as those subsets arenot too complex, but it appears computationally challenging to fit the models they provide,and our procedures empirically appear to have strong coverage across slabs of the data as inEq. (21). Our work, then, is a stepping stone toward more uniform notions of validity, and we

21

believe exploring approaches to this—perhaps by distributionally robust optimization [8, 10],perhaps by uniform convergence arguments [1, Sec. 4]—will be both interesting and essentialfor trusting applications of statistical machine learning.

References

[1] Rina Foygel Barber, Emmanuel J. Candes, Aaditya Ramdas, and Ryan J. Tibshirani. Thelimits of distribution-free conditional predictive inference. arXiv:1903.04684v2 [math.ST],2019.

[2] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101:138–156, 2006.

[3] Stephane Boucheron, Olivier Bousquet, and Gabor Lugosi. Theory of classification: asurvey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

[4] Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. Learningmulti-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.

[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge UniversityPress, 2004.

[6] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with de-pendence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[7] A. P. Dawid. Applications of a general propagation algorithm for probabilistic expertsystems. Statistics and Computing, 2(1):25–36, 1992.

[8] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncer-tainty with application to data-driven problems. Operations Research, 58(3):595–612,2010.

[9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scalehierarchical image database. In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition, pages 248–255, 2009.

[10] John C. Duchi and Hongseok Namkoong. Learning models with uniform performancevia distributionally robust optimization. arXiv:1810.08750 [stat.ML], 2018.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. ThePASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007. URL http:

//www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[12] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. ThePASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, 2012. URL http:

//www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

[13] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fair-ness without demographics in repeated loss minimization. In Proceedings of the 35thInternational Conference on Machine Learning, 2018.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning forimage recognition. In Proceedings of the IEEE Conference on Computer Vision andPattern Recognition, pages 770–778, 2016.

[15] Yotam Hechtlinger, Barnabas Poczos, and Larry Wasserman. Cautious deep learning.arXiv:1805.09460 [stat.ML], 2019.

[16] Francisco Herrera, Francisco Charte, Antonio J Rivera, and Marıa J Del Jesus. Multilabelclassification. In Multilabel Classification, pages 17–31. Springer, 2016.

22

http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html




[17] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricksfor efficient text classification. arXiv:1607.01759 [cs.CL], 2016.

[18] Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: Journal ofthe Econometric Society, 46(1):33–50, 1978.

[19] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Tech-niques. MIT Press, 2009.

[20] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tinyimages. Technical report, University of Toronto, 2009.

[21] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic modelsfor segmenting and labeling sequence data. In Proceedings of the Eighteenth InternationalConference on Machine Learning, pages 282–289, 2001.

[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ra-manan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects incontext. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer In-ternational Publishing.

[23] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins.Text classification using string kernels. Journal of Machine Learning Research, 2:419–444,2002.

[24] Andrew McCallum and Kamal Nigam. Employing EM in pool-based active learningfor text classification. In Machine Learning: Proceedings of the Fifteenth InternationalConference, 1998.

[25] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains formulti-label classification. Machine Learning, 85:333–359, 2011.

[26] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once:Unified, real-time object detection. In Proceedings of the 26th IEEE Conference onComputer Vision and Pattern Recognition, pages 779–788, 2016.

[27] Yaniv Romano, Evan Patterson, and Emmanuel J. Candes. Conformalized quantileregression. In Advances in Neural Information Processing Systems 32, 2019. URL https:

//arxiv.org/abs/1905.03222.

[28] Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least ambiguous set-valued classifierswith bounded error levels. Journal of the American Statistical Association, 114(525):223–234, 2019. doi: 10.1080/01621459.2017.1395341. URL https://doi.org/10.1080/

01621459.2017.1395341.

[29] Matteo Sesia and Emmanuel J. Candes. A comparison of some conformal quantile re-gression methods. arXiv:1909.05433 [stat.ME], 2019.

[30] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient objectdetection. arXiv:1911.09070 [cs.CV], 2019.

[31] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods.Journal of Machine Learning Research, 8:1007–1025, 2007.

[32] Vladimir Vovk. Conditional validity of inductive conformal predictors. In Proceedingsof the Asian Conference on Machine Learning, volume 25 of Proceedings of MachineLearning Research, pages 475–490, 2012. URL http://proceedings.mlr.press/v25/

vovk12.html.

[33] Vladimir Vovk, Alexander Grammerman, and Glenn Shafer. Algorithmic Learning in aRandom World. Springer, 2005.

23

https://arxiv.org/abs/1905.03222

https://arxiv.org/abs/1905.03222

https://doi.org/10.1080/01621459.2017.1395341

https://doi.org/10.1080/01621459.2017.1395341

http://proceedings.mlr.press/v25/vovk12.html

http://proceedings.mlr.press/v25/vovk12.html

[34] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families,and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305,2008.

[35] Min-Ling Zhang and Zhi-Hua Zhou. Multilabel neural networks with applications tofunctional genomics and text categorization. IEEE transactions on Knowledge and DataEngineering, 18(10):1338–1351, 2006.

[36] T. Zhang. Statistical analysis of some multi-category large margin classification methods.Journal of Machine Learning Research, 5:1225–1251, 2004.

A Technical proofs and appendices

A.1 Proof of Theorem 3

Our proof adapts arguments similar to those that Sesia and Candes [29] use in the regression

setting, repurposing and modifying them for classification. We have that ‖s− s‖L2(PX)p→ 0

and ‖qσα − qσα‖L2(PX)p→ 0. We additionally have that

Q := Q1−α(Eσ, I3)p→ 0. (25)

This is item (ii) in the proof of Theorem 1 of Sesia and Candes [29] (see Appendix A of theirpaper), which proves the convergence (25) of the marginal quantile of the error precisely whenqσα and s are L2-consistent in probability (when the randomized quantity s(x, Y ) + σZ has adensity), as in our Assumption A1.

Recalling that s, q tacitly depend on the sample size n, let ε > 0 be otherwise arbitrary,and define the sets

Bn :={x ∈ X | ‖s(x, ·)− s(x, ·)‖∞ > ε2 or |qσα(x)− qσα(x)| > ε2

}.

Then Bn ⊂ X is measurable, and by Markov’s inequality,

PX(Bn) ≤‖s− s‖L2(PX)

ε+‖qσα − qσα‖L2(PX)

ε,

soP(PX(Bn) ≥ ε) ≤ P(‖s− s‖L2(PX) > ε2) + P(‖qσα − qσα‖L2(PX) > ε2)→ 0. (26)

Thus, the measure of the sets Bn tends to zero in probability, i.e., PX(Bn)p→ 0.

Now recall the shorthand (25) that Q = Q1−α(Eσ, I3). Let us consider the event thatCσ1−α(x, z) 6= Cσ1−α(x, z). If this is the case, then we must have one of

A1(x, k, z) :={s(x, k) + σz ≥ qσα(x)− Q and s(x, k) + σz < qσα(x)

}or

A2(x, k, z) :={s(x, k) + σz < qσα(x)− Q and s(x, k) + σz ≥ qσα(x)

}.

(27)

We show that the probability of the set A1 is small; showing that the probability of set A2

is small is similar. Using the convergence (25), let us assume that |Q| ≤ ε, and supposethat x 6∈ Bn. Then for A1 to occur, we must have both s(x, k) + ε + σz ≥ qσα(x) − 2ε ands(x, k) + σz < qσα(x), or

qσα(x)− s(x, k)− 3ε ≤ σz < qσα(x)− s(x, k).

24

As Z has a bounded density, we have lim supε→0 supa∈R PZ(a ≤ σZ ≤ a + 3ε) = 0, or (withsome notational abuse) lim supε→0 supx 6∈Bn PZ(A1(x, k, Z)) = 0.

Now, let Fn = σ({Xi}ni=1, {Yi}ni=1, {Zi}ni=1) be the σ-field of the observed sample. Thenby the preceding derivation (mutatis mutandis for the set A2 in definition (27)) for any η > 0,there is an ε > 0 (in the definition of Bn) such that

supx∈X

P(Cσ1−α(x, Zn+1) 6= Cσ1−α(x, Zn+1) | Fn

)1 {x 6∈ Bn} 1

{|Q| ≤ ε

}≤ sup

x 6∈Bn

K∑k=1

P (A1(x, k, Zn+1) or A2(x, k, Zn+1) | Fn) 1{|Q| ≤ ε

}≤ η.

In particular, by integrating the preceding inequality, we have

P(Cσ1−α(Xn+1, Zn+1) 6= C(Xn+1, Zn+1), Xn+1 6∈ Bn, |Q| ≤ ε

)≤ η.

As P(|Q| ≤ ε) → 1 and P(Xn+1 6∈ Bn) → 1 by the convergence guarantees (25) and (26), wehave the theorem.

A.2 Proof of Theorem 4

We wish to show that limσ→0 P(Cσ1−α(X,Z) ⊂ C1−α(X)

)= 1, which, by a union bound over

k ∈ [K], is equivalent to showing

P (s(X, k) + σZ ≥ qσα(X), s(X, k) < qα(X)) −→σ→0

0

for all k ∈ [K]. Fix k ∈ [K], and define the events

A := {s(X, k) < qα(X)}

andBσ := {s(X, k) + σZ ≥ qσα(X)}.

Now, for any δ > 0, consider Aδ := {s(X, k) ≤ qα(X) − δ}. On the event Aδ ∩ Bσ, it musthold that

δ ≤ qα(X)− qσα(X) + σZ.

The following lemma—whose proof we defer to Section A.2.1—shows that the latter can onlyoccur with small probability.

Lemma A.1. With probability 1 over X, the quantile function satisfies

lim infσ→0

qσα(X) ≥ qα(X),

and hence lim supσ→0 {qα(X)− qσα(X) + σZ} ≤ 0 almost surely.

Lemma A.1 implies that

P (qα(X)− qσα(X) + σZ ≥ δ) −→σ→0

0,

which, in turn, shows that, for every fixed δ > 0,

P (Aδ ∩Bσ) −→σ→0

0.

25

To conclude the proof, fix ε > 0. The event Aδ increases to A when δ → 0, so there existsδ > 0 so that

P (A \Aδ) ≤ ε.

Finally,

lim supσ→0

P (A ∩Bσ) ≤ P (A \Aδ) + lim supσ→0

P (Aδ ∩Bσ) ≤ ε,

as lim supσ→0 P (Aδ ∩Bσ) = 0. We conclude the proof by sending ε→ 0.

A.2.1 Proof of Lemma A.1

Fix x ∈ X . Let Fσ,x and Fx be the respective cumulative distribution functions of sσ(x, Y, Z)and s(x, Y ) conditionally on X = x, and define the (left-continuous) inverse CDFs

F−1σ,x(u) = inf{t ∈ R : u ≤ Fσ,x(t)} and F−1x (u) = inf{t ∈ R : u ≤ Fx(t)}.

We use a standard lemma about the convergence of inverse CDFs, though for lack of a properreference, we include a proof in Section A.2.2.

Lemma A.2. Let (Fn)n≥1 be a sequence of cumulative distribution functions convergingweakly to F , with inverses F−1n and F−1. Then for each u ∈ (0, 1),

F−1(u) ≤ lim infn→∞

F−1n (u) ≤ lim supn→∞

F−1n (u) ≤ F−1(u+). (28)

As Fσ,x converges weakly to Fx as σ → 0, Lemma A.2 implies that

F−1x (α) ≤ lim infσ→0

F−1σ,x(α).

But observe that qσα(x) = F−1σ,x(α) and qα(x) = F−1x (α), so that we have the desired resultqα(x) ≤ lim infσ→0 q

σα(x).

A.2.2 Proof of Lemma A.2

We prove only the first inequality, as the last inequality is similar. Fix u ∈ (0, 1) and ε > 0.F is right-continuous and non-decreasing, so its set of continuity points is dense in R; thus,there exists a continuity point w of F such that w < F−1(u) ≤ w + ε.

Since w < F−1(u), it must hold that F (w) < u, by definition of F−1. As w is a continuitypoint of F , limn→∞ Fn(w) = F (w) < u, which means that Fn(w) < u for large enough n, orequivalently, that w < F−1n (u). We can thus conclude that

lim inf F−1n (u) ≥ w ≥ F−1(u)− ε.

Taking ε→ 0 proves the first inequality.

A.3 Efficient computation of maximal marginals for condition (11)

We describe a more or less standard dynamic programming approach to efficiently compute themaximum marginals (15) (i.e. maximal values of a tree-structured score s : Y = {−1, 1}K →R), referring to standard references on max-product message passing [7, 34, 19] for more. Let

26

T = ([K], E) be a tree with nodes [K] and undirected edges E, though we also let edges(T )and nodes(T ) denote the edges and nodes of the tree T . Assume

s(y) =

K∑k=1

ϕk(yk) +∑

e=(k,l)∈E

ψe(yk, yl).

To compute the maximum marginals maxy∈Y{s(y) | yk = yk}, we perform two messagepassing steps on the tree: an upward and downward pass. Choose a node r arbitrarily to bethe root of T , and let Tdown be the directed tree whose edges are E with r as its root (whichis evidently unique); let Tup be the directed tree with all edges reversed from Tdown.

A maximum marginal message passing algorithm then computes a single downward anda single upward pass through each tree, each in topological order of the tree. The downardmessages ml→k : {−1, 1} → R are defined for y ∈ {−1, 1} by

ml→k(y) = maxyl∈{−1,1}

ϕl(yl) + ψ(l,k)(yl, y) +∑

i:(i→l)∈edges(Tdown)

mi→l(yl)

,

while the upward pass is defined similarly except that Tup replaces Tdown. After a singledownward and upward pass through the tree, which takes time O(K), the invariant of messagepassing on the tree [19, Ch. 13.3] then guarantees that for each k ∈ [K],

maxy∈Y{s(y) | yk = yk} = ϕk(yk) +

∑e=(l,k)∈T

ml→k(yk), (29)

where we note that there exists exactly one message to k (whether from the downward orupward pass) from each node l neighboring k in T .

We can evidently then compute all of these maximal values (the sets S± in Eq. (15))simultaneously in time of the order of the number of edges in the tree T , or O(K), as eachmessage ml→k can appear in at most one of the maxima (29), and there are 2K messages.

A.4 Concentration of coverage quantities

We sketch a derivation of inequality (22); see also [1, Theorem 5] for related arguments.We begin with a technical lemma that is the basis for our result. In the lemma, we abuse

notation briefly, and let F ⊂ Z → {0, 1} be a collection of functions with VC-dimension d.We define Pf =

∫f(z)dP (z) and Pnf = 1

n

∑ni=1 f(Zi), as is standard.

Lemma A.3 (Relative concentration bounds, e.g. [3], Theorem 5.1). Let VC(F) ≤ d. Thereis a numerical constant C such that for any t > 0, with probability at least 1− e−t,

supf∈F

{|Pf − Pnf | − C

√min{Pf, Pnf}

d log n+ t

n

}≤ Cd log n+ t

n.

Proof By Boucheron et al. [3, Thm. 5.1] for t > 0, with probability at least 1−e−t we have

supf∈F

Pf − Pnf√Pf

≤ C√d log n+ t

nand sup

f∈F

Pnf − Pf√Pnf

≤ C√d log n+ t

n.

27

Let εn = C√

(d log n+ t)/n for shorthand. Then the second inequality is equivalent to thestatement that for all f ∈ F ,

Pnf − Pf ≤1

2ηPnf +

η

2ε2n for all η > 0.

Rearranging the preceding display, we have(1− 1

2η

)(Pnf − Pf) ≤ 1

2ηPf +

η

2ε2n.

If√Pf ≥ εn, we set η =

√Pf/εn and obtain 1

2(Pnf − Pf) ≤√Pfεn, while if

√Pf < εn,

then setting η = 1 yields 12(Pnf − Pf) ≤ 1

2(Pf + ε2n) ≤ 12(√Pfεn + ε2n). In either case,

Pnf − Pf ≤ C

[√Pf

d log n+ t

n+d log n+ t

n

].

A symmetric argument replacing each Pn with P (and vice versa) gives the lemma.

We can now demonstrate inequality (22). Let V ⊂ Rd and V := {{x ∈ Rd | vTx ≤ 0}}v∈Vthe collection of halfspaces it induces. The collection S = {Sv,a,b}v∈V,a<b of slabs has VC-dimension VC(S) ≤ O(1)VC(V). Let f : X × Y → R and c : S → R be arbitrary functions.If for S ⊂ X we define S+ := {(x, y) | x ∈ S, f(x, y) ≥ c(S)} and the collection S+ := {S+ |S ∈ S}, then VC(S+) ≤ VC(S) + 1 [1, Lemma 5]. As a consequence, for the conformal setsinequality (22) specifies, for any t > 0 we have with probability at least 1− e−t that∣∣∣Pn(Y ∈ C(X), X ∈ Sv,a,b)− P (Y ∈ C(X), X ∈ Sv,a,b)

∣∣∣≤ O(1)

[√min{P (Y ∈ C(X), X ∈ Sv,a,b), Pn(Y ∈ C(X), X ∈ Sv,a,b)}

VC(V) log n+ t

n+

VC(V) log n+ t

n

]simultaneously for all v ∈ V, a < b ∈ R, and similarly

|Pn(X ∈ Sv,a,b)− P (X ∈ Sv,a,b)|

≤ O(1)

[√min{P (X ∈ Sv,a,b), Pn(X ∈ Sv,a,b)}

VC(V) log n+ t

n+

VC(V) log n+ t

n

].

Now, we use the following simple observation. For any ε and nonnegative α, β, γ withα ≤ γ and 2|ε| ≤ γ, ∣∣∣∣ α

γ + ε− β

γ

∣∣∣∣ ≤ |α− β|γ+|αε|

γ2 + γε≤ |α− β|

γ+

2|ε|γ.

Thus, as soon as δ ≥ VC(V) logn+tn , we have with probability at least 1− e−t that

|Pn(Y ∈ C(X) | X ∈ Sv,a,b)− P (Y ∈ C(X) | X ∈ Sv,a,b)|

=

∣∣∣∣∣Pn(Y ∈ C(X), X ∈ Sv,a,b)Pn(X ∈ Sv,a,b)

−P (Y ∈ C(X), X ∈ Sv,a,b)

P (X ∈ Sv,a,b)

∣∣∣∣∣≤ O(1)

[√VC(V) log n+ t

δn+

VC(V) log n+ t

δn

]

28

simultaneously for all v ∈ V, a < b ∈ R, where we substitute γ = P (X ∈ Sv,a,b), α = P (Y ∈C(X), X ∈ Sv,a,b), β = Pn(Y ∈ C(X), X ∈ Sv,a,b), and ε = (Pn − P )(X ∈ Sv,a,b). Note that

if VC(V) logn+tδn ≥ 1, the bound is vacuous in any case.

29

multiclass and multilabel prediction - arXiv · 2020-04-22 · Knowing what you know: valid con...

Documents

Transcript of multiclass and multilabel prediction - arXiv · 2020-04-22 · Knowing what you know: valid con...