
Stochastic Gradient Methods for Distributionally Robust Optimization with $f$-divergences

Hongseok Namkoong
Stanford University

[email protected]

John C. Duchi
Stanford University

[email protected]

Abstract

We develop efficient solution methods for a robust empirical risk minimization problem designed to give calibrated confidence intervals on performance and provide optimal tradeoffs between bias and variance. Our methods apply to distributionally robust optimization problems proposed by Ben-Tal et al., which put more weight on observations inducing high loss via a worst-case approach over a non-parametric uncertainty set on the underlying data distribution. Our algorithm solves the resulting minimax problems with nearly the same computational cost as stochastic gradient descent through the use of several carefully designed data structures. For a sample of size $n$, the per-iteration cost of our method scales as $O(\log n)$, which allows us to give the optimality certificates that distributionally robust optimization provides at little extra cost compared to empirical risk minimization and stochastic gradient methods.

1 Introduction

In statistical learning or other data-based decision-making problems, it is desirable to give solutions that come with guarantees on performance, at least to some specified confidence level. For tasks such as driving or medical diagnosis where safety and reliability are crucial, confidence levels have additional importance. Classical techniques in machine learning and statistics, including regularization, stability, concentration inequalities, and generalization guarantees [6, 25] provide such guarantees, though often a more fine-tuned certificate, one with calibrated confidence, is desirable. In this paper, we leverage techniques from the robust optimization literature [e.g. 2], building an uncertainty set around the empirical distribution of the data and studying worst-case performance in this uncertainty set. Recent work [15, 13] shows how this approach can (i) give calibrated statistical optimality certificates for stochastic optimization problems, (ii) perform a natural type of regularization based on the variance of the objective, and (iii) achieve fast rates of convergence under more general conditions than empirical risk minimization by trading off bias (approximation error) and variance (estimation error) optimally. In this paper, we propose efficient algorithms for such distributionally robust optimization problems.

We now provide our formal setting. Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact convex set, and for a convex function $f : \mathbb{R}_+ \to \mathbb{R}$ with $f(1) = 0$, define the $f$-divergence between distributions $P$ and $Q$ by $D_f(P \,\|\, Q) = \int f\big(\frac{dP}{dQ}\big)\,dQ$. Letting
$$\mathcal{P}_{\rho,n} := \Big\{p \in \mathbb{R}^n : p^\top \mathbf{1} = 1,\; p \ge 0,\; D_f\big(p \,\|\, \mathbf{1}/n\big) \le \frac{\rho}{n}\Big\}$$
be an uncertainty set around the uniform distribution $\mathbf{1}/n$, we develop methods for solving the robust empirical risk minimization problem
$$\mathop{\mathrm{minimize}}_{x \in \mathcal{X}} \;\; \sup_{p \in \mathcal{P}_{\rho,n}} \sum_{i=1}^n p_i \ell_i(x). \tag{1}$$

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.


In problem (1), the functions $\ell_i : \mathcal{X} \to \mathbb{R}_+$ are convex and subdifferentiable, and we consider the situation in which $\ell_i(x) = \ell(x; \xi_i)$ for $\xi_i \stackrel{\mathrm{iid}}{\sim} P_0$. We let $\ell(x) = [\ell_1(x) \,\cdots\, \ell_n(x)]^\top \in \mathbb{R}^n$ denote the vector of convex losses, so the robust objective (1) is $\sup_{p \in \mathcal{P}_{\rho,n}} p^\top \ell(x)$.

A number of authors show how the robust formulation (1) provides guarantees. Duchi et al. [15] show that the objective (1) is a convex approximation to regularizing the empirical risk by variance,
$$\sup_{p \in \mathcal{P}_{\rho,n}} \sum_{i=1}^n p_i\ell_i(x) = \frac{1}{n}\sum_{i=1}^n \ell_i(x) + \sqrt{\frac{\rho}{n}\mathrm{Var}_{P_0}(\ell(x;\xi))} + o_{P_0}(n^{-\frac{1}{2}}) \tag{2}$$
uniformly in $x \in \mathcal{X}$. Since the right hand side naturally trades off good loss performance (approximation error) against minimizing variance (estimation error), which is usually non-convex, the robust formulation (1) provides a convex regularization for the standard empirical risk minimization (ERM) problem. This trading between bias and variance leads to certificates on the optimal value $\inf_{x\in\mathcal{X}}\mathbb{E}_{P_0}[\ell(x;\xi)]$, so that under suitable conditions, we have
$$\lim_{n\to\infty}\mathbb{P}\Big(\inf_{x\in\mathcal{X}}\mathbb{E}_{P_0}[\ell(x;\xi)] \le u_n\Big) = P(W \ge -\sqrt{\rho}) \quad\text{for } W \sim N(0, 1), \tag{3}$$
where $u_n = \inf_{x\in\mathcal{X}}\sup_{p\in\mathcal{P}_{\rho,n}} p^\top\ell(x)$ is the optimal robust objective. Duchi and Namkoong [13] provide finite sample guarantees for the special case that $f(t) = \frac{1}{2}(t-1)^2$, making the expansion (2) more explicit and providing a number of consequences for estimation and optimization based on this expansion (including fast rates for risk minimization). A special case of their results [13, §3.1] is as follows. Let $x^{\mathrm{rob}} \in \arg\min_{x\in\mathcal{X}}\sup_{p\in\mathcal{P}_{\rho,n}} p^\top\ell(x)$, let $\mathrm{VC}(\mathcal{F})$ denote the VC-(subgraph)-dimension of the class of functions $\mathcal{F} := \{\ell(x;\cdot) \mid x \in \mathcal{X}\}$, assume that $M \ge \ell(x;\xi)$ for all $x \in \mathcal{X}$, $\xi \in \Xi$, and for some fixed $\delta > 0$, define $\rho = \log\frac{1}{\delta} + 10\,\mathrm{VC}(\mathcal{F})\log\mathrm{VC}(\mathcal{F})$. Then, with probability at least $1 - \delta$,
$$\mathbb{E}_{P_0}[\ell(x^{\mathrm{rob}};\xi)] \le u_n + \frac{O(1)}{n} \le \inf_{x\in\mathcal{X}}\Bigg\{\mathbb{E}_{P_0}[\ell(x;\xi)] + 2\sqrt{\frac{2\rho\,\mathrm{Var}_{P_n}(\ell(x;\xi))}{n}}\Bigg\} + O(1)\frac{M\rho}{n}. \tag{4}$$

For large $n$, evaluating the objective (1) may be expensive; with fixed $p = \mathbf{1}/n$, this expense has motivated an extensive literature in stochastic and online optimization [27, 23, 19, 16, 18]. The problem (1) does not admit quite such a straightforward approach. A first idea, common in the robust optimization literature [3], is to obtain a problem that may be written as a sum of individual terms by taking the dual of the inner supremum, yielding the convex problem
$$\inf_{x\in\mathcal{X}}\sup_{p\in\mathcal{P}_{\rho,n}} p^\top\ell(x) = \inf_{x\in\mathcal{X},\,\lambda\ge 0,\,\eta\in\mathbb{R}}\;\frac{1}{n}\sum_{i=1}^n \lambda f^*\Big(\frac{\ell_i(x) - \eta}{\lambda}\Big) + \frac{\rho}{n}\lambda + \eta. \tag{5}$$
Here $f^*(s) = \sup_{t\ge 0}\{st - f(t)\}$ is the Fenchel conjugate of the convex function $f$. While the above dual reformulation is jointly convex in $(x, \lambda, \eta)$, canonical stochastic gradient descent (SGD) procedures [23] generally fail because the variance of the objective (and its subgradients) explodes as $\lambda \to 0$. (This is not just a theoretical issue: in extensive simulations that we omit because they are a bit boring, SGD and other heuristic approaches that impose shrinking bounds of the form $\lambda_t \ge c_t > 0$ at each iteration $t$ all fail to optimize the objective (5).)
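To make the dual (5) concrete, the following minimal Python sketch (ours, not the paper's implementation; the names `f2_conj` and `dual_objective` are hypothetical) evaluates the dual objective for the $\chi^2$ case $f(t) = \frac{1}{2}(t-1)^2$, whose conjugate is $f^*(s) = \frac{1}{2}((s+1)_+^2 - 1)$. Since (5) is an infimum over $(\lambda, \eta)$, any fixed pair upper-bounds the robust objective, so even a coarse grid search upper-bounds the empirical mean loss:

```python
import numpy as np

def f2_conj(s):
    # Fenchel conjugate of f(t) = (t - 1)^2 / 2 over t >= 0: f*(s) = ((s + 1)_+^2 - 1) / 2
    return 0.5 * (np.maximum(s + 1.0, 0.0) ** 2 - 1.0)

def dual_objective(ell, rho, lam, eta):
    # Objective of the dual reformulation (5) for the chi^2 divergence at fixed (lam, eta)
    n = len(ell)
    return lam * np.mean(f2_conj((ell - eta) / lam)) + rho * lam / n + eta

# Any (lam, eta) upper-bounds sup_{p in P_rho,n} p^T ell(x), which in turn upper-bounds
# the empirical mean; a coarse grid over (lam, eta) illustrates this.
ell = np.random.rand(100)
vals = [dual_objective(ell, rho=1.0, lam=lam, eta=eta)
        for lam in np.logspace(-3, 2, 50) for eta in np.linspace(-1.0, 1.0, 50)]
print(min(vals), ">=", ell.mean())
```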

Instead, we view the robust ERM problem (1) as a game between the $x$ (minimizing) player and the $p$ (maximizing) player. Each player performs a variant of mirror descent (ascent), and we show how such an approach yields strong convergence guarantees, as well as good empirical performance. In particular, we show (for many suitable divergences $f$) that if each $\ell_i$ is $L$-Lipschitz and $\mathcal{X}$ has radius bounded by $R$, then our procedure requires at most $O\big(\frac{R^2 L^2 + \rho}{\epsilon^2}\big)$ iterations to achieve an $\epsilon$-accurate solution to problem (1), which is comparable to the number of iterations required by SGD [23]. Our solution strategy builds off of similar algorithms due to Nemirovski et al. [23, Sec. 3] and Ben-Tal et al. [4], and more directly off of procedures developed by Clarkson et al. [10] for solving two-player convex games. Most directly relevant to our approach is that of Shalev-Shwartz and Wexler [26], which solves problem (1) under the assumption that $\mathcal{P}_{\rho,n} = \{p \in \mathbb{R}^n_+ : p^\top \mathbf{1} = 1\}$ and that there is some $x$ with perfect loss performance, that is, $\sum_{i=1}^n \ell_i(x) = 0$. We generalize these approaches to more challenging $f$-divergence-constrained problems and, for the $\chi^2$ divergence with $f(t) = \frac{1}{2}(t-1)^2$, develop efficient data structures that give a total run-time for solving problem (1) to $\epsilon$-accuracy scaling as $O\big((\mathrm{Cost(grad)} + \log n)\frac{R^2 L^2 + \rho}{\epsilon^2}\big)$. Here $\mathrm{Cost(grad)}$ is the cost to compute the gradient $\nabla\ell_i(x)$ of a single term and perform a mirror descent step with $x$. Using SGD to solve the empirical minimization problem to $\epsilon$-accuracy has run-time $O\big(\mathrm{Cost(grad)}\frac{R^2 L^2}{\epsilon^2}\big)$, so we see that we can achieve the guarantees (3)–(4) offered by the robust formulation (1) at little additional computational cost.

The remainder of the paper is organized as follows. We present our abstract algorithm in Section 2 and give guarantees on its performance in Section 3. In Section 4, we give efficient computational schemes for the case that $f(t) = \frac{1}{2}(t-1)^2$, presenting experiments in Section 5.

2 A bandit mirror descent algorithm for the minimax problem

Under the conditions that $\ell$ is convex and $\mathcal{X}$ is compact, standard results [7] show that there exists a saddle point $(x^\star, p^\star) \in \mathcal{X} \times \mathcal{P}_{\rho,n}$ for the robust problem (1) satisfying
$$\sup\big\{p^\top\ell(x^\star) \;:\; p \in \mathcal{P}_{\rho,n}\big\} \le p^{\star\top}\ell(x^\star) \le \inf\big\{p^{\star\top}\ell(x) \;:\; x \in \mathcal{X}\big\}.$$

We now describe a procedure for finding this saddle point by alternating a linear bandit-convex optimization procedure [8] for $p$ and a stochastic mirror descent procedure for $x$. Our approach builds off of Nemirovski et al.'s [23] development of mirror descent for two-player stochastic games.

To describe our algorithm, we require a few standard tools. Let $\|\cdot\|_x$ denote a norm on the space $\mathcal{X}$ with dual norm $\|y\|_{x,*} = \sup\{\langle x, y\rangle : \|x\|_x \le 1\}$, and let $\psi_x$ be a differentiable strongly convex function on $\mathcal{X}$, meaning $\psi_x(x + \Delta) \ge \psi_x(x) + \nabla\psi_x(x)^\top\Delta + \frac{1}{2}\|\Delta\|_x^2$ for all $\Delta$. Let $\psi_p$ be a differentiable strictly convex function on $\mathcal{P}_{\rho,n}$. For a differentiable convex function $h$, we define the Bregman divergence $B_h(x, y) = h(x) - h(y) - \langle\nabla h(y), x - y\rangle \ge 0$. The Fenchel conjugate $\psi_p^*$ of $\psi_p$ is
$$\psi_p^*(s) := \sup_p\big\{\langle s, p\rangle - \psi_p(p)\big\} \quad\text{and}\quad \nabla\psi_p^*(s) = \arg\max_p\big\{\langle s, p\rangle - \psi_p(p)\big\}.$$
($\psi_p^*$ is differentiable because $\psi_p$ is strongly convex [20, Chapter X].) We let $g_i(x) \in \partial\ell_i(x)$ be a particular subgradient selection.

With this notation in place, we now give our algorithm, which alternates between gradient ascent steps on $p$ and subgradient descent steps on $x$. Roughly, we would like to alternate gradient ascent steps $p_{t+1} \leftarrow p_t + \alpha_p\ell(x_t)$ for $p$ and descent steps $x_{t+1} \leftarrow x_t - \alpha_x g_i(x_t)$ for $x$, where $i$ is a random index drawn according to $p_t$. This procedure is inefficient, requiring time of order $n\,\mathrm{Cost(grad)}$ in each iteration, so we use stochastic estimates $\hat\ell_t$ of the loss vector $\ell(x_t)$ developed in the linear bandit literature [8] and variants of mirror descent to implement our algorithm.

Algorithm 1 Two-player Bandit Mirror Descent

1: Input: stepsizes $\alpha_x, \alpha_p > 0$; initialize $x_1 \in \mathcal{X}$, $p_1 = \mathbf{1}/n$
2: for $t = 1, 2, \ldots, T$ do
3:   Sample $I_t \sim p_t$, that is, set $I_t = i$ with probability $p_{t,i}$
4:   Compute the estimated loss for $i \in [n]$: $\hat\ell_{t,i}(x_t) = \frac{\ell_i(x_t)}{p_{t,i}}\mathbf{1}\{I_t = i\}$
5:   Update $p$: $w_{t+1} \leftarrow \nabla\psi_p^*\big(\nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t)\big)$, $\;p_{t+1} \leftarrow \arg\min_{p\in\mathcal{P}_{\rho,n}} B_{\psi_p}(p, w_{t+1})$
6:   Update $x$: $y_{t+1} \leftarrow \nabla\psi_x^*\big(\nabla\psi_x(x_t) - \alpha_x g_{I_t}(x_t)\big)$, $\;x_{t+1} \leftarrow \arg\min_{x\in\mathcal{X}} B_{\psi_x}(x, y_{t+1})$
7: end for

We specialize this general algorithm for specific choices of the divergence $f$ and the functions $\psi_x$ and $\psi_p$ presently, first briefly discussing the algorithm. Note that in Step 5, the updates for $p$ depend only on a single index $I_t \in \{1, \ldots, n\}$ (the vector $\hat\ell_t(x_t)$ is 1-sparse), which, as long as the updates for $p$ are efficiently computable, can yield substantial performance benefits. A minimal runnable sketch of the loop appears below.
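The following Python sketch (our own illustration, not the paper's implementation) shows the structure of Algorithm 1 with Euclidean $\psi_x$ and $\psi_p$. For brevity it projects $p$ onto the probability simplex only, i.e. the $\rho = \infty$ case corresponding to the Shalev-Shwartz and Wexler setting [26] mentioned above; the $f$-divergence-constrained projection is the subject of Section 4. Sampling here is $O(n)$ where the paper achieves $O(\log n)$.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {p : p >= 0, sum(p) = 1} (sort-based, cf. [14])."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[k] - 1.0) / (k + 1.0)
    return np.maximum(v - theta, 0.0)

def bandit_mirror_descent(losses, grads, n, d, T, alpha_x, alpha_p, radius):
    """Sketch of Algorithm 1: losses(x) -> vector of l_i(x); grads(x, i) -> subgradient."""
    x, p = np.zeros(d), np.full(n, 1.0 / n)
    x_avg = np.zeros(d)
    for _ in range(T):
        i = np.random.choice(n, p=p)                # line 3: I_t ~ p_t
        ell_hat = np.zeros(n)
        ell_hat[i] = losses(x)[i] / p[i]            # line 4: importance-weighted estimate
        p = project_simplex(p + alpha_p * ell_hat)  # line 5 (simplex only in this sketch)
        x = x - alpha_x * grads(x, i)               # line 6: subgradient step on x ...
        nx = np.linalg.norm(x)
        if nx > radius:                             # ... then project onto {||x||_2 <= R}
            x *= radius / nx
        x_avg += x / T
    return x_avg
```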

3 Regret bounds

With our algorithm described, we now turn to its convergence properties, specializing later to specific families of $f$-divergences. We begin with the following result on pseudo-regret, which (with minor modifications) is known [23, 10, 26]. We provide a proof for completeness in Appendix A.1.


Lemma 1. Let the sequences $x_t$ and $p_t$ be generated by Algorithm 1. Define $\bar x_T := \frac{1}{T}\sum_{t=1}^T x_t$ and $\bar p_T := \frac{1}{T}\sum_{t=1}^T p_t$. Then for the saddle point $(x^\star, p^\star)$ we have
$$T\,\mathbb{E}\big[p^{\star\top}\ell(\bar x_T) - \bar p_T^\top\ell(x^\star)\big] \le \underbrace{\frac{1}{\alpha_x}B_{\psi_x}(x^\star, x_1) + \frac{\alpha_x}{2}\sum_{t=1}^T\mathbb{E}\big[\|g_{I_t}(x_t)\|_{x,*}^2\big]}_{T_1:\ \text{ERM regret}} + \underbrace{\sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p^\star - p_t)\big]}_{T_2:\ \text{robust regret}}$$
where the expectation is taken over the random draws $I_t \sim p_t$. Moreover, $\mathbb{E}[\hat\ell_t(x_t)^\top(p - p_t)] = \mathbb{E}[\ell(x_t)^\top(p - p_t)]$ for any vector $p$.

In the lemma, $T_1$ is the standard regret incurred when applying mirror descent to the ERM problem. In particular, if $B_{\psi_x}(x^\star, x_1) \le R^2$ and each $\ell_i(x)$ is $L$-Lipschitz, then choosing $\alpha_x = \frac{R}{L}\sqrt{2/T}$ yields $T_1 \le RL\sqrt{2T}$. Because it is (relatively) easy to bound the term $T_1$, the remainder of our arguments focus on bounding the second term $T_2$, which is the regret that comes as a consequence of the random sampling for the loss vector $\hat\ell_t$. This regret depends strongly on the distance-generating function $\psi_p$. To the end of bounding $T_2$, we use the following bound on the pseudo-regret of $p$, which is standard [9, Chapter 11], [8, Thm 5.3]. For completeness we outline the proof in Appendix A.2.

Lemma 2. For any $p \in \mathcal{P}_{\rho,n}$, Algorithm 1 satisfies

$$\sum_{t=1}^T \hat\ell_t(x_t)^\top(p - p_t) \le \frac{B_{\psi_p}(p, p_1)}{\alpha_p} + \frac{1}{\alpha_p}\sum_{t=1}^T B_{\psi_p^*}\big(\nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t),\, \nabla\psi_p(p_t)\big). \tag{6}$$

Lemma 2 shows that controlling the Bregman divergences $B_{\psi_p}$ and $B_{\psi_p^*}$ is sufficient to bound $T_2$ in the basic regret bound of Lemma 1.

Now, we narrow our focus slightly to a specialized, but broad, family of divergences for which we can give more explicit results. For $k \in \mathbb{R}$, the Cressie-Read divergence [12] of order $k$ is
$$f_k(t) = \frac{t^k - kt + k - 1}{k(k-1)}, \tag{7}$$
where $f_k(t) = \infty$ for $t < 0$, and for $k \in \{0, 1\}$ we define $f_k$ by its limits as $k \to 0$ or $1$ (we have $f_1(t) = t\log t - t + 1$ and $f_0(t) = -\log t + t - 1$). Inspecting expression (6), we might hope that careful choices of $\psi_p$ could yield regret bounds that grow slowly with $T$ and have small dependence on the sample size $n$. Indeed, this is the case, as we show in the sequel: for each divergence $f_k$, we may carefully choose $\psi_p$ to achieve small regret. To prove our bounds, however, it is crucial that the importance sampling estimator $\hat\ell_t$ has small variance, which in turn necessitates that $p_{t,i}$ not be too small. Generally, this means that in the update (Alg. 1, Line 5) used to construct $p_{t+1}$, we choose $\psi_p(p)$ to grow quickly as $p_i \to 0$ (e.g., $|\frac{\partial}{\partial p_i}\psi_p(p)| \to \infty$), but there is a tradeoff in that this may cause large Bregman divergence terms in (6). In the coming sections, we explore this tradeoff for various $k$, providing regret bounds for each of the Cressie-Read divergences (7).
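As a small illustration (ours, not from the paper), the family (7) and its $k \to 0, 1$ limits can be evaluated directly; the divergence constraint of Section 1 is then $D_f(p \,\|\, \mathbf{1}/n) = \frac{1}{n}\sum_i f_k(np_i) \le \frac{\rho}{n}$:

```python
import numpy as np

def f_k(t, k):
    """Cressie-Read divergence f_k of (7); this sketch assumes t > 0 when k <= 1."""
    t = np.asarray(t, dtype=float)
    if k == 1:
        return t * np.log(t) - t + 1.0   # limit k -> 1: KL
    if k == 0:
        return -np.log(t) + t - 1.0      # limit k -> 0: empirical likelihood
    return (t ** k - k * t + k - 1.0) / (k * (k - 1.0))

# D_f(p || 1/n) = (1/n) * sum_i f_k(n p_i) for each order k
p = np.array([0.1, 0.2, 0.3, 0.4]); n = len(p)
print([float(np.sum(f_k(n * p, k)) / n) for k in (-1.0, 0.0, 0.5, 1.0, 2.0)])
```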

To control the $B_{\psi_p^*}$ terms in the bound (6), we use the curvature of $\psi_p$ (dually, smoothness of $\psi_p^*$) to show that $B_{\psi_p^*}(u, v) \approx \sum_i(u_i - v_i)^2$. For this approximation to hold, we shift our loss functions based on the $f$-divergence. When $k \ge 2$, we assume that $\ell(x) \in [0, 1]^n$. If $k < 2$, we instead apply Algorithm 1 with the shifted losses $\ell'(x) = \ell(x) - \mathbf{1}$, so that $\ell'(x) \in [-1, 0]^n$. We call the method with $\ell'$ Algorithm 1', noting that $\hat\ell_{t,i}(x_t) = \frac{\ell_i(x_t) - 1}{p_{t,i}}\mathbf{1}\{I_t = i\}$ in this case.
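As a quick sanity check on the shifted estimator (a Monte Carlo sketch with arbitrary numbers, not an experiment from the paper), averaging $\hat\ell'_t$ over many draws of $I_t \sim p_t$ recovers $\ell - \mathbf{1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
ell = np.array([0.9, 0.2, 0.5, 0.7, 0.4])      # losses in [0, 1]
p = np.array([0.1, 0.15, 0.2, 0.25, 0.3])      # current iterate p_t
T = 100_000
draws = rng.choice(len(p), size=T, p=p)        # I_t ~ p_t
counts = np.bincount(draws, minlength=len(p))
est = (counts / T) * (ell - 1.0) / p           # Monte Carlo average of hat ell'_t
print(np.allclose(est, ell - 1.0, atol=0.05))  # unbiasedness: E[hat ell'_t] = ell - 1
```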

3.1 Power divergences when $k \notin \{0, 1\}$

For our first results, we prove a generic regret bound for Algorithm 1 when $k \notin \{0, 1\}$ by taking the distance-generating function $\psi_p(p) = \frac{1}{k(k-1)}\sum_{i=1}^n p_i^k$, which is differentiable and strictly convex on $\mathbb{R}^n_+$. Before proceeding further, we first note that for $p \in \mathcal{P}_{\rho,n}$ and $p_1 = \frac{1}{n}\mathbf{1}$, we have
$$B_{\psi_p}(p, p_1) = \psi_p(p) - \psi_p(p_1) - \nabla\psi_p(p_1)^\top(p - p_1) = \frac{n^{-k}}{k(k-1)}\sum_{i=1}^n\big((np_i)^k - knp_i + k - 1\big) = n^{1-k} D_f(p \,\|\, \mathbf{1}/n) \le n^{-k}\rho, \tag{8}$$
bounding the first term in expression (6). From Lemma 2, it remains to bound the Bregman divergence terms $B_{\psi_p^*}$. Using smoothness of $\psi_p^*$ in the positive orthant, we obtain the following bound.

Theorem 1. Assume that $\ell(x) \in [0, 1]^n$. For any real-valued $k \ge 2$ and any $p \in \mathcal{P}_{\rho,n}$, Algorithm 1 satisfies
$$\sum_{t=1}^T\mathbb{E}\big[\ell(x_t)^\top(p - p_t)\big] = \sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \frac{n^{-k}\rho}{\alpha_p} + \frac{\alpha_p}{2}\sum_{t=1}^T\mathbb{E}\Bigg[\sum_{i:\,p_{t,i}>0} p_{t,i}^{1-k}\Bigg]. \tag{9}$$
For $k \le 2$ with $k \notin \{0, 1\}$, an identical bound holds for Algorithm 1' with $\ell'(x) = \ell(x) - \mathbf{1}$.

See Appendix A.3 for the proof. We now use Theorem 1 to obtain concrete convergence guarantees for Cressie-Read divergences with parameter $k < 1$, giving sublinear (in $T$) regret bounds independent of $n$. In the corollary, whose proof we provide in Appendix A.4, we let $C_{k,\rho} = \frac{(1-k)(1-k\rho)}{-k}$, which is positive for $k < 0$.

Corollary 1. For $k \in (-\infty, 0)$ and $\alpha_p = C_{k,\rho}^{\frac{k-1}{2}}\,n^{-k}\sqrt{2\rho/T}$, Algorithm 1' with $\ell'(x) = \ell(x) - \mathbf{1} \in [-1, 0]^n$ achieves the regret bound
$$\sum_{t=1}^T\mathbb{E}\big[\ell(x_t)^\top(p - p_t)\big] = \sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \sqrt{2C_{k,\rho}^{1-k}\,\rho T}.$$
For $k \in (0, 1)$ and $\alpha_p = n^{-k}\sqrt{2\rho/T}$, Algorithm 1' with $\ell'(x) = \ell(x) - \mathbf{1} \in [-1, 0]^n$ achieves the regret bound
$$\sum_{t=1}^T\mathbb{E}\big[\ell(x_t)^\top(p - p_t)\big] = \sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \sqrt{2\rho T}.$$

It is worth noting that despite the robustification, the above regret is independent of $n$. In the special case that $k \in (0, 1)$, Theorem 1 recovers the regret bound for the implicitly normalized forecaster of Audibert and Bubeck [1] (cf. [8, Ch. 5.4]).

3.2 Regret bounds using the KL divergences (k = 1 and k = 0)

The choice $f_1(t) = t\log t - t + 1$ yields $D_f(P\|Q) = D_{\mathrm{kl}}(P\|Q)$, and in this case we take $\psi_p(p) = \sum_{i=1}^n p_i\log p_i$, which means that Algorithm 1 performs entropic gradient ascent. To control the divergence $B_{\psi_p^*}$, we use the rescaled losses $\ell'(x) = \ell(x) - \mathbf{1}$ (as we have $k < 2$). Then we have the following bound, whose proof we provide in Appendix A.5.

Theorem 2. Algorithm 1' with loss $\ell'(x) = \ell(x) - \mathbf{1}$ yields
$$\sum_{t=1}^T\mathbb{E}\big[\ell(x_t)^\top(p - p_t)\big] = \sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \frac{\rho}{n\alpha_p} + \frac{\alpha_p}{2}nT. \tag{10}$$
In particular, when $\alpha_p = \frac{1}{n}\sqrt{\frac{2\rho}{T}}$, we have $\sum_{t=1}^T\mathbb{E}[\ell(x_t)^\top(p - p_t)] \le \sqrt{2\rho T}$.

Using $k = 0$, so that $f_0(t) = -\log t + t - 1$, we obtain $D_f(P\|Q) = D_{\mathrm{kl}}(Q\|P)$, which results in a robustification technique identical to Owen's original empirical likelihood [24]. We again use the rescaled losses $\ell'(x) = \ell(x) - \mathbf{1}$, but in this scenario we use the proximal function $\psi_p(p) = -\sum_{i=1}^n\log p_i$ in Algorithm 1'. Then we have the following regret bound (see Appendix A.6).

Theorem 3. Algorithm 1' with loss $\ell'(x) = \ell(x) - \mathbf{1}$ yields
$$\sum_{t=1}^T\mathbb{E}\big[\ell(x_t)^\top(p - p_t)\big] = \sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \frac{\rho}{\alpha_p} + \frac{\alpha_p}{2}T.$$
In particular, when $\alpha_p = \sqrt{\frac{2\rho}{T}}$, we have $\sum_{t=1}^T\mathbb{E}[\ell(x_t)^\top(p - p_t)] \le \sqrt{2\rho T}$.

In both of these cases, the expected pseudo-regret of our robust gradient procedure is independent of $n$ and grows as $\sqrt{T}$, which is essentially identical to the regret achieved by pure online gradient methods.

3.3 Power divergences (k > 1)

Corollary 1 provides convergence guarantees for power divergences $f_k$ with $k < 1$, but says nothing about the case that $k > 1$; the choice $\psi_p(p) = \frac{1}{k(k-1)}\sum_{i=1}^n p_i^k$ allows the individual probabilities $p_{t,i}$ to be too small, which can cause excess variance of $\hat\ell$. To remedy this, we regularize the robust problem (1) by re-defining our set of robust empirical distributions, taking
$$\mathcal{P}_{\rho,n,\delta} := \Big\{p \in \mathbb{R}^n_+ \;:\; p \ge \frac{\delta}{n},\;\; \sum_{i=1}^n f(np_i) \le \rho\Big\},$$
where we no longer constrain the weights $p$ to satisfy $\mathbf{1}^\top p = 1$. Nonetheless, it is still possible to show that the guarantees (2) and (3) hold with $\mathcal{P}_{\rho,n,\delta}$ replacing $\mathcal{P}_{\rho,n}$. Indeed, we may give bounds for the pseudo-regret of the regularized problem with $\mathcal{P}_{\rho,n,\delta}$, where we apply Algorithm 1 with a slightly modified sampling strategy, drawing indices according to the normalized distribution $p_t/\sum_{i=1}^n p_{t,i}$ and appropriately normalizing the loss estimate via
$$\hat\ell_{t,i}(x_t) = \Big(\sum_{j=1}^n p_{t,j}\Big)\frac{\ell_i(x_t)}{p_{t,i}}\mathbf{1}\{I_t = i\}.$$

This vector is still unbiased for $\ell(x_t)$. Define the constant $C_k := \max\{t : f_k(t) \le t \vee \frac{\rho}{n}\} < \infty$ (so that $C_2 = 2 + \sqrt{3}$). With our choice $\psi_p(p) = \frac{1}{k(k-1)}\sum_{i=1}^n p_i^k$ and for $\delta > 0$, we obtain the following result, whose proof we provide in Appendix A.7.

Theorem 4. For $k \in [2, \infty)$ and any $p \in \mathcal{P}_{\rho,n,\delta}$, Algorithm 1 with $\alpha_p = n^{-k}\sqrt{\rho\delta^{k-1}/(4C_k^3 T)}$ yields
$$\sum_{t=1}^T\mathbb{E}\big[\ell(x_t)^\top(p - p_t)\big] = \sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le 2C_k\sqrt{\rho C_k\delta^{1-k}T}.$$
For $k \in (1, 2)$, assume that $\ell(x) \in [-1, 0]^n$. Then Algorithm 1 gives identical bounds.

4 Efficient updates when k = 2

The previous section shows that Algorithm 1 with a careful choice of $\psi_p$ yields sublinear regret bounds. The projection step $p_{t+1} = \arg\min_{p\in\mathcal{P}_{\rho,n,\delta}} B_{\psi_p}(p, w_{t+1})$, however, can still take time linear in $n$ despite the sparsity of $\hat\ell_t(x_t)$ (see Appendix B for concrete updates for each of our cases). In this section, we show how to compute the bandit mirror descent update in Alg. 1, line 5, in $O(\log n)$ time for $f_2(t) = \frac{1}{2}(t-1)^2$ and $\psi_p(p) = \frac{1}{2}\sum_{i=1}^n p_i^2$. Building off of Duchi et al. [14], we use carefully designed balanced binary search trees (BSTs) to this end.

The Lagrangian for the update $p_{t+1} = \arg\min_{p\in\mathcal{P}_{\rho,n,\delta}} B_{\psi_p}(p, w_{t+1})$ (suppressing $t$) is
$$\mathcal{L}(p, \lambda, \theta) = B_{\psi_p}(p, w) - \frac{\lambda}{n^2}\Big(\rho - \sum_{i=1}^n f_2(np_i)\Big) - \theta^\top\Big(p - \frac{\delta}{n}\mathbf{1}\Big)$$
where $\lambda \ge 0$ and $\theta \in \mathbb{R}^n_+$. The KKT conditions imply $(1+\lambda)p = w + \frac{\lambda}{n}\mathbf{1} + \theta$, and strict complementarity yields
$$p(\lambda) = \Big(\frac{1}{1+\lambda}w + \frac{\lambda}{1+\lambda}\frac{\mathbf{1}}{n} - \frac{\delta}{n}\mathbf{1}\Big)_+ + \frac{\delta}{n}\mathbf{1}, \tag{11}$$
where $p(\lambda) = \arg\min_p\inf_{\theta\in\mathbb{R}^n_+}\mathcal{L}(p, \lambda, \theta)$. Substituting this into the Lagrangian, we obtain the concave dual objective
$$g(\lambda) := \sup_\theta\inf_{p}\mathcal{L}(p, \lambda, \theta) = B_{\psi_p}(p(\lambda), w) - \frac{\lambda}{n^2}\Big(\rho - \sum_{i=1}^n f_2(np_i(\lambda))\Big).$$

We can run a bisection search on the monotone function $g'(\lambda)$ to find $\lambda$ such that $g'(\lambda) = 0$. After algebraic manipulations, we have that
$$\partial_\lambda g(\lambda) = g_1(\lambda)\sum_{i\in I(\lambda)} w_i^2 + g_2(\lambda)\sum_{i\in I(\lambda)} w_i + g_3(\lambda)\,|I(\lambda)| + \frac{(1-\delta)^2}{2n} - \frac{\rho}{n^2},$$
where $I(\lambda) := \big\{1 \le i \le n : w_i \ge \frac{\delta}{n} + \frac{\delta - 1}{n}\lambda\big\}$ and (see expression (18) in Appendix B.4)
$$g_1(\lambda) = \frac{1}{(1+\lambda)^2}, \qquad g_2(\lambda) = \frac{-2}{n(1+\lambda)^2}, \qquad g_3(\lambda) = \frac{1}{n^2(1+\lambda)^2} - \frac{(1-\delta)^2}{2n^2}.$$
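For concreteness, the following sketch (ours; the name `chi2_update` is hypothetical) computes the update $p(\lambda)$ of (11) by bisecting directly on the feasibility residual $\rho - \sum_i f_2(np_i(\lambda))$, which is equivalent to solving $g'(\lambda) = 0$ by complementary slackness. Each evaluation here is a plain $O(n)$ pass; the paper's $O(\log n)$ version evaluates the same sums with the BST of Appendix C.

```python
import numpy as np

def chi2_update(w, rho, delta, tol=1e-10):
    """Bregman projection of w onto P_{rho,n,delta} for psi_p(p) = ||p||^2 / 2 (k = 2)."""
    n = len(w)

    def p_of(lam):
        # truncated update (11): ((w + (lam/n) 1) / (1 + lam) - (delta/n) 1)_+ + (delta/n) 1
        return np.maximum((w + lam / n) / (1.0 + lam) - delta / n, 0.0) + delta / n

    def slack(lam):
        # rho - sum_i f_2(n p_i(lam)); nonnegative iff p(lam) is feasible, and
        # nondecreasing in lam since every p_i(lam) moves monotonically toward 1/n
        return rho - 0.5 * np.sum((n * p_of(lam) - 1.0) ** 2)

    if slack(0.0) >= 0.0:            # unconstrained minimizer already feasible
        return p_of(0.0)
    lo, hi = 0.0, 1.0
    while slack(hi) < 0.0:           # bracket the root
        hi *= 2.0
    while hi - lo > tol:             # bisection on the monotone slack
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if slack(mid) < 0.0 else (lo, mid)
    return p_of(hi)
```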

To see that we can solve for $\lambda^*$ that achieves $|g'(\lambda^*)| \le \epsilon$ in $O(\log n + \log\frac{1}{\epsilon})$ time, it suffices to evaluate $\sum_{i\in I(\lambda)} w_i^q$ for $q = 0, 1, 2$ in time $O(\log n)$. To this end, we store the $w$'s in a balanced search tree (e.g., a red-black tree) keyed on the weights, up to a multiplicative and an additive constant. A key ingredient in our implementation is that the BST stores in each node the sum of the appropriate powers of the values in its left and right subtrees [14]. See Appendix C for detailed pseudocode for all operations required in Algorithm 1: each subroutine (sampling $I_t \sim p_t$, updating $w$, computing $\lambda^*$, and updating $p(\lambda^*)$) requires time $O(\log n)$ using standard BST operations.

5 Experiments

In this section, we present experimental results demonstrating the efficiency of our algorithm. We first compare our method with existing algorithms for solving the robust problem (1) on a synthetic dataset, then investigate the robust formulation on real datasets to show how the calibrated confidence guarantees behave in practice, especially in comparison to ERM. We experiment on naturally high-dimensional datasets as well as datasets with many training examples.

Our implementation uses the efficient updates outlined in Section 4. Throughout our experiments, we use the best tuned step sizes for all methods. For the first two experiments, we set $\rho = \chi^2_{1,0.9}$ so that the resulting robust objective (1) is a calibrated 95% upper confidence bound on the optimal population risk. For our last experiment, the asymptotic regime (3) fails to hold due to the high-dimensional nature of the problem, so we choose $\rho = 50$ (somewhat arbitrarily, but other $\rho$ give similar behavior). We take $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \le R\}$ for our experiments.

For the experiment with synthetic data, we compare our algorithm against two benchmark methods for solving the robust problem (1). The first is an interior point method for the dual reformulation (5) using the Gurobi solver [17]. The second is gradient descent, viewing the robust formulation (1) as a minimization problem with the objective $x \mapsto \sup_{p\in\mathcal{P}_{\rho,n,\delta}} p^\top\ell(x)$. To efficiently compute the gradient, we bisect over the dual form (5) with respect to $\lambda \ge 0$ and $\eta$. We use the best step sizes for both our proposed bandit-based algorithm and gradient descent.

To generate the data, we choose a true classifier $x^* \in \mathbb{R}^d$ and sample the feature vectors $a_i \stackrel{\mathrm{iid}}{\sim} N(0, I)$ for $i \in [n]$. We set the labels to be $b_i = \mathrm{sign}(a_i^\top x^*)$ and flip each with probability 10%. We use the hinge loss $\ell_i(x) = (1 - b_i a_i^\top x)_+$ with $n = 2000$, $d = 500$, and $R = 10$ in our experiment. In Figure 1a, we plot the log optimality ratio (log of current objective value over optimal value) with respect to runtime for the three algorithms. While the interior point method (IPM) obtains accurate solutions, it scales relatively poorly in $n$ and $d$ (the initial flat region in the plot is due to pre-computation for factorization within the solver). Gradient descent performs quite well in this moderately sized example, although each iteration takes time $\Omega(n)$.

We also perform experiments on two datasets with larger $n$: the Adult dataset [22] and the Reuters RCV1 Corpus [21]. The Adult dataset has $n = 32{,}561$ training and $16{,}281$ test examples with 123-dimensional features. We use the binary logistic loss $\ell_i(x) = \log(1 + \exp(-b_i a_i^\top x))$ to classify whether the income level is greater than \$50K. For the Reuters RCV1 Corpus, our task is to classify whether a document belongs to the Corporate category. With $d = 47{,}236$ features, we randomly split the 804,410 examples into 723,969 training examples (90% of the data) and 80,441 test examples (10% of the data). We use the hinge loss and solve the binary classification problem for the document type. To test the efficiency of our method in large-scale settings, we plot the log ratio $\log\frac{R_n(x)}{R_n(x^\star)}$, where $R_n(x) = \sup_{p\in\mathcal{P}_{\rho,n,\delta}} p^\top\ell(x)$, versus CPU time for our algorithm and gradient descent in Figure 1b. As is somewhat typical of stochastic gradient-based methods, our bandit-based optimization algorithm quickly obtains a solution with small optimality gap (about 2% relative error), while the gradient descent method eventually achieves better loss.

In Figures 2a–2d, we plot the loss value and the classification error of the robust solution, compared with applying pure stochastic gradient descent to the standard empirical loss, plotting the confidence bound for the robust method as well.

[Figure 1: Comparison of solvers. (a) Synthetic data ($n = 2000$, $d = 500$); (b) Reuters Corpus ($n = 7.2 \cdot 10^5$, $d \approx 5 \cdot 10^4$). Each panel plots log(optimality gap) against CPU time for bandit-minimax, gradient descent (GD), and, in (a), the interior point method (IPM).]

[Figure 2: Comparison with ERM. (a) Adult: logistic loss; (b) Adult: classification error; (c) Reuters: hinge loss; (d) Reuters: classification error. Each panel plots training and test performance of $x_{\mathrm{erm}}$ and $x_{\mathrm{rob}}$ over iterations; the loss panels also show the robust upper confidence bound $\sup_{P\in\mathcal{P}_{\rho,n}} \mathbb{E}_P[\ell(x_{\mathrm{rob}};\xi)]$.]

As the theory suggests [15, 13], the robust objective provides upper confidence bounds on the true risk (approximated by the average loss on the test sample).


References

[1] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, pages 2635–2686, 2010.
[2] A. Ben-Tal, L. E. Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[3] A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[4] A. Ben-Tal, E. Hazan, T. Koren, and S. Mannor. Oracle-based robust optimization via online learning. Operations Research, 63(3):628–638, 2015.
[5] J. Borwein, A. J. Guirao, P. Hájek, and J. Vanderwerff. Uniformly convex functions on Banach spaces. Proceedings of the American Mathematical Society, 137(3):1081–1091, 2009.
[6] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[10] K. Clarkson, E. Hazan, and D. Woodruff. Sublinear optimization for machine learning. Journal of the Association for Computing Machinery, 59(5), 2012.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001.
[12] N. Cressie and T. R. Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B (Methodological), pages 440–464, 1984.
[13] J. C. Duchi and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv:1610.02581 [stat.ML], 2016. URL https://arxiv.org/abs/1610.02581.
[14] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[15] J. C. Duchi, P. W. Glynn, and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv:1610.03425 [stat.ML], 2016. URL https://arxiv.org/abs/1610.03425.
[16] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
[17] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015. URL http://www.gurobi.com.
[18] E. Hazan. The convex optimization approach to regret minimization. In Optimization for Machine Learning, chapter 10. MIT Press, 2012.
[19] E. Hazan and S. Kale. An optimal algorithm for stochastic strongly convex optimization. In Proceedings of the Twenty Fourth Annual Conference on Computational Learning Theory, 2011.
[20] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.
[21] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[22] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[23] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[24] A. B. Owen. Empirical Likelihood. CRC Press, 2001.
[25] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[26] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why? In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[27] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.


A Proofs of Regret Bounds

A.1 Proof of Lemma 1

From convexity of the loss function $\ell(\cdot)$, we have
$$p^\top\ell(\bar x_T) - \bar p_T^\top\ell(x) \le \frac{1}{T}\sum_{t=1}^T\big(p^\top\ell(x_t) - p_t^\top\ell(x)\big) \le \frac{1}{T}\sum_{t=1}^T\Big(\ell(x_t)^\top(p - p_t) + p_t^\top g(x_t)(x_t - x)\Big) \tag{12}$$
where we have used $g(x_t) \in \mathbb{R}^{n\times d}$ to denote the $n$-by-$d$ matrix whose rows are $g_i(x_t)^\top$. Note that
$$\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t) \mid I_1^{t-1}, x_1^t\big] = \ell(x_t)^\top(p - p_t), \qquad \mathbb{E}\big[g_{I_t}(x_t)^\top(x_t - x) \mid I_1^{t-1}, x_1^t\big] = p_t^\top g(x_t)(x_t - x)$$
since $I_t \sim p_t$ and $p_t$ is $\sigma(I_1^{t-1}, x_1^t)$-measurable. Further, from the standard mirror descent result (e.g., [23, Section 2.3]), we have
$$\sum_{t=1}^T\mathbb{E}\big[g_{I_t}(x_t)^\top(x_t - x)\big] \le \frac{1}{\alpha_x}B_{\psi_x}(x^\star, x_1) + \frac{\alpha_x}{2}\sum_{t=1}^T\mathbb{E}\,\|g_{I_t}(x_t)\|_{x,*}^2.$$
Taking expectations in (12) and applying these facts, the desired result follows.

A.2 Proof of Lemma 2

From Algorithm 1, we have
$$\alpha_p\hat\ell_t(x_t)^\top(p - p_t) = \big(\nabla\psi_p(w_{t+1}) - \nabla\psi_p(p_t)\big)^\top(p - p_t) = B_{\psi_p}(p, p_t) + B_{\psi_p}(p_t, w_{t+1}) - B_{\psi_p}(p, w_{t+1}). \tag{13}$$
For any $p \in \mathcal{P}_{\rho,n}$, we have
$$B_{\psi_p}(p, w_{t+1}) \ge B_{\psi_p}(p, p_{t+1}) + B_{\psi_p}(p_{t+1}, w_{t+1}) \iff \big(\nabla\psi_p(p_{t+1}) - \nabla\psi_p(w_{t+1})\big)^\top(p - p_{t+1}) \ge 0.$$
The latter inequality is just the optimality condition for $p_{t+1} = \arg\min_{p\in\mathcal{P}_{\rho,n}} B_{\psi_p}(p, w_{t+1})$. Applying the equality (13) and summing over $t = 1, \ldots, T$, we obtain
$$\alpha_p\sum_{t=1}^T\hat\ell_t(x_t)^\top(p - p_t) \le B_{\psi_p}(p, p_1) - B_{\psi_p}(p, p_{T+1}) + \sum_{t=1}^T\big(B_{\psi_p}(p_t, w_{t+1}) - B_{\psi_p}(p_{t+1}, w_{t+1})\big)$$
$$\le B_{\psi_p}(p, p_1) + \sum_{t=1}^T B_{\psi_p}(p_t, w_{t+1}) = B_{\psi_p}(p, p_1) + \sum_{t=1}^T B_{\psi_p^*}\big(\nabla\psi_p(w_{t+1}), \nabla\psi_p(p_t)\big).$$
Now, noting that $\nabla\psi_p(w_{t+1}) = \nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t)$, we obtain the result.

A.3 Proof of Theorem 1

The conjugate of $\psi_p$ is
$$\psi_p^*(s) = \frac{1}{k}\sum_{i=1}^n\big((k-1)s_i\big)_+^{k_*} + \infty\cdot\mathbf{1}\{s \le 0,\, k < 1\}.$$
From Taylor's theorem, we have
$$B_{\psi_p^*}(u, v) = \frac{1}{k}\sum_{i=1}^n\Big(\big((k-1)u_i\big)_+^{k_*} - \big((k-1)v_i\big)_+^{k_*}\Big) - \sum_{i=1}^n\big((k-1)v_i\big)_+^{k_*-1}(u_i - v_i)$$
$$= \sum_{i=1}^n\Big(\int_{v_i}^{u_i}\big((k-1)t\big)_+^{k_*-1}\,dt - \big((k-1)v_i\big)_+^{k_*-1}(u_i - v_i)\Big) \le \frac{1}{2}\sum_{i=1}^n\max_{t\in[v_i,u_i]}\big((k-1)t\big)_+^{k_*-2}(u_i - v_i)^2,$$
which gives the following useful lemma.

Lemma 3 (Bubeck and Cesa-Bianchi [8], Lemma 5.9).
$$B_{\psi_p^*}(u, v) \le \frac{1}{2}\sum_{i=1}^n\max_{t\in[v_i,u_i]}\big((k-1)t\big)_+^{k_*-2}(u_i - v_i)^2.$$

For later use, we define the conjugate exponent $k_* = \frac{k}{k-1}$ and note that
$$k_* = \frac{k}{k-1}\;\begin{cases} < 1 & \text{if } k \in (-\infty, 0), \\ < 0 & \text{if } k \in (0, 1), \\ > 2 & \text{if } k \in (1, 2), \\ < 2 & \text{if } k \in (2, \infty). \end{cases}$$
Now, define $u := \nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t) = \frac{1}{k-1}p_t^{k-1} + \alpha_p\hat\ell_t(x_t)$ and $v := \nabla\psi_p(p_t) = \frac{1}{k-1}p_t^{k-1}$, where $p^{k-1}$ denotes the vector with each of its entries raised to the power $k-1$.

When $k \ge 2$, we have from Lemma 3 that
$$B_{\psi_p^*}\big(\nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t),\, \nabla\psi_p(p_t)\big) \le \frac{p_{t,I_t}^{2-k}}{2}\Big(\frac{\alpha_p\ell_{I_t}(x_t)}{p_{t,I_t}}\Big)^2 \le \frac{\alpha_p^2}{2}\,p_{t,I_t}^{-k} \tag{14}$$
where we have used that $\ell_i(x) \in [0, 1]$. Substituting this in the bound (6) and taking expectations, we obtain the result by noting that $I_t \sim p_t$.

For $k < 2$, note that since $p, p_t$ are probability vectors and $p_t$ is $\sigma(I_1^{t-1}, x_1^t)$-measurable, we have
$$\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t) \mid I_1^{t-1}, x_1^t\big] = \ell'(x_t)^\top(p - p_t) = (\ell(x_t) - \mathbf{1})^\top(p - p_t) = \ell(x_t)^\top(p - p_t), \tag{15}$$
from which the first equality of the theorem follows. Following the proof of Lemma 2 verbatim, we have the usual regret bound
$$\sum_{t=1}^T\hat\ell_t(x_t)^\top(p - p_t) \le \frac{B_{\psi_p}(p, p_1)}{\alpha_p} + \frac{1}{\alpha_p}\sum_{t=1}^T B_{\psi_p^*}\big(\nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t),\, \nabla\psi_p(p_t)\big) \tag{16}$$
where now $\hat\ell_{t,i}(x_t) = \frac{\ell_i'(x_t)}{p_{t,i}}\mathbf{1}\{I_t = i\} = \frac{\ell_i(x_t) - 1}{p_{t,i}}\mathbf{1}\{I_t = i\}$. Now, note that if $k \le 2$ with $k \notin \{0, 1\}$, the map $s \mapsto ((k-1)s)_+^{k_*-2}$ is nondecreasing. Hence, we again obtain the bound (14) from Lemma 3.

A.4 Proof of Corollary 1

When $k \in (-\infty, 0)$, the $f$-divergence constraint
$$\frac{1}{nk(k-1)}\sum_{i=1}^n\big((np_i)^k - k(np_i - 1) - 1\big) \le \frac{\rho}{n}$$
implies that $-knp_i \le (np_i)^k - knp_i \le (1-k)(1-k\rho)$, and hence $p_i \le \frac{C_{k,\rho}}{n}$. Using this to bound the sum in (9), we get
$$\sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \frac{n^{-k}\rho}{\alpha_p} + \frac{\alpha_p}{2}Tn^kC_{k,\rho}^{1-k}.$$
Minimizing with respect to $\alpha_p > 0$ gives the first result. When $k \in (0, 1)$, we use Hölder's inequality with $p = \frac{1}{1-k} > 1$ and $q = \frac{1}{k} > 1$:
$$\sum_{i=1}^n p_{t,i}^{1-k} \le \Big(\sum_{i=1}^n\big(p_{t,i}^{1-k}\big)^{\frac{1}{1-k}}\Big)^{1-k}\Big(\sum_{i=1}^n 1\Big)^k = n^k.$$
Applying this bound in (9) and minimizing with respect to $\alpha_p$, the result follows.

11

Page 12: Stochastic Gradient Methods for Distributionally Robust ...jduchi/projects/NamkoongDu16.pdfClassical techniques in machine learning and statistics, including ... (iii) achieves fast

A.5 Proof of Theorem 2

Proceeding as in Section A.3 for $k \le 2$, we obtain the regret bound (16). First, note that $B_{\psi_p}(p, p_1) = \sum_{i=1}^n p_i\log(np_i) \le \frac{\rho}{n}$. We now bound $B_{\psi_p^*}\big(\nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t), \nabla\psi_p(p_t)\big)$. Using $\exp(-x) - 1 + x \le \frac{x^2}{2}$ for $x \ge 0$, we have
$$B_{\psi_p^*}\big(\log p_t + \mathbf{1} + \alpha_p\hat\ell_t(x_t),\, \log p_t + \mathbf{1}\big) = \sum_{i\ne I_t}\exp(\log p_{t,i}) + \exp\big(\log p_{t,I_t} + \alpha_p\hat\ell_{t,I_t}(x_t)\big) - \sum_{i=1}^n\exp(\log p_{t,i}) - \exp(\log p_{t,I_t})\,\alpha_p\hat\ell_{t,I_t}(x_t)$$
$$= p_{t,I_t}\Big(\exp\big(\alpha_p\hat\ell_{t,I_t}(x_t)\big) - 1 - \alpha_p\hat\ell_{t,I_t}(x_t)\Big) \le \frac{p_{t,I_t}}{2}\big(\alpha_p\hat\ell_{t,I_t}(x_t)\big)^2 = \frac{\alpha_p^2(\ell_{I_t}(x_t) - 1)^2}{2\,p_{t,I_t}},$$
where we used $x = -\alpha_p\hat\ell_{t,I_t}(x_t) \ge 0$. Plugging the above observations into (6) and taking expectations, we obtain
$$\sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \frac{\rho}{n\alpha_p} + \frac{\alpha_p}{2}\sum_{t=1}^T\mathbb{E}\Big[\sum_{i=1}^n(\ell_i(x_t) - 1)^2\Big].$$
Bounding $(\ell_i(x_t) - 1)^2 \le 1$, the first claim follows. Optimizing the bound with respect to $\alpha_p > 0$, we obtain the second result.

A.6 Proof of Theorem 3

As in Section A.3, the first equality and the interim regret bound follow from (15) and (16). Now, note that $B_{\psi_p}(p, p_1) = -\sum_{i=1}^n\big(\log(np_i) - np_i + 1\big) \le \rho$ bounds the first term. Next, we use $x - \log(1+x) \le \frac{x^2}{2}$ for $x \ge 0$ to get
$$B_{\psi_p^*}\big(\nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t),\, \nabla\psi_p(p_t)\big) = B_{\psi_p^*}\Big(-\frac{1}{p_t} + \alpha_p\hat\ell_t(x_t),\, -\frac{1}{p_t}\Big) = -\log\big(1 - \alpha_p\ell'_{I_t}(x_t)\big) - \alpha_p\ell'_{I_t}(x_t) \le \frac{\alpha_p^2\,\ell'_{I_t}(x_t)^2}{2},$$
where we have used $x = -\alpha_p\ell'_{I_t}(x_t) \ge 0$ and $\ell'_{I_t}(x_t) \in [-1, 0]$. Plugging these into the bound (6) and taking expectations, we have
$$\sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \frac{\rho}{\alpha_p} + \frac{\alpha_p}{2}\sum_{t=1}^T\sum_{i=1}^n p_{t,i}(\ell_i(x_t) - 1)^2.$$
Bounding $(\ell_i(x_t) - 1)^2 \le 1$, the first claim follows. Minimizing with respect to $\alpha_p$ gives the final claim.

A.7 Proof of Theorem 4

For $k \in [2, \infty)$, we proceed identically as in Section A.3 to obtain
$$\sum_{t=1}^T\mathbb{E}\big[\ell(x_t)^\top(p - p_t)\big] = \sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \frac{B_{\psi_p}(p, p_1)}{\alpha_p} + \frac{\alpha_p}{2}\sum_{t=1}^T\mathbb{E}\Bigg[\Big(\sum_{i=1}^n p_{t,i}\Big)^3\sum_{i:\,p_{t,i}>0} p_{t,i}^{1-k}\Bigg],$$
where the extra summation term appears since the $p_t$ are no longer normalized. We note that $B_{\psi_p}(p, p_1) \le n^{-k}\rho$ since (8) still holds.

From the definition of $C_k = \max\{t : f_k(t) \le t \vee \frac{\rho}{n}\}$, we have
$$\sum_{i=1}^n np_i \le \sum_{i:\,np_i \le C_k} np_i + \sum_{i:\,np_i > C_k} f(np_i) \le nC_k + \rho \le 2nC_k$$
for all $p \in \mathcal{P}_{\rho,n,\delta}$. Hence, it follows that
$$\sum_{t=1}^T\mathbb{E}\big[\hat\ell_t(x_t)^\top(p - p_t)\big] \le \frac{n^{-k}\rho}{\alpha_p} + 8\alpha_p TC_k^3\delta^{1-k}n^k.$$
Minimizing with respect to $\alpha_p$, we obtain the first result.

When $k \in (1, 2]$, we proceed identically and use the fact that $k_* \ge 2$ and $\ell \in [-1, 0]$ in Lemma 3. Plugging this into the bound (6) and taking expectations, we obtain the second claim by following identical steps as in the case $k \ge 2$.

B Updates for p

In this section, we explicitly write down the computations required for the mirror descent updates in $p \in \mathcal{P}_{\rho,n}$. The update for $p$ is
$$p_{t+1} := \arg\min_{p\in\mathcal{P}_{\rho,n}} B_{\psi_p}(p, w_{t+1}) \tag{17}$$
where $w_{t+1} = \nabla\psi_p^*\big(\nabla\psi_p(p_t) + \alpha_p\hat\ell_t(x_t)\big)$. In the following, we omit the subscripts $t$ for ease of notation. Note that for $k \le 1$, since $\|\nabla\psi_p(p)\| \to \infty$ as $p_i \to 0$ for any $1 \le i \le n$, we can ignore the nonnegativity constraint in (17).

B.1 Power divergence for $k \in (-\infty, 1) \setminus \{0\}$

Writing down the Lagrangian for the optimization problem (17) with $\psi_p(p) = \frac{1}{k(k-1)}\sum_{i=1}^n p_i^k$, we have
$$\mathcal{L}(p, \eta, \lambda) = \frac{1}{k(k-1)}\sum_{i=1}^n(p_i^k - w_i^k) - \frac{1}{k-1}\sum_{i=1}^n w_i^{k-1}(p_i - w_i) - \eta(p^\top\mathbf{1} - 1) - n^{-k}\lambda\Big(\rho - \frac{1}{k(k-1)}\sum_{i=1}^n\big((np_i)^k - 1\big)\Big)$$
where $\eta \in \mathbb{R}$ and $\lambda \ge 0$. In any case, the first-order conditions for $p$ yield
$$(1 + \lambda)p^{k-1} = w^{k-1} + (k-1)\eta\mathbf{1}.$$
Plugging this into the $f$-divergence constraint $\sum_{i=1}^n p_i^k \le n^{-k}(k(k-1)\rho + n)$ and using strict complementarity, we have
$$\lambda(\eta) = \Bigg(\Big(\frac{n^k}{k(k-1)\rho + n}\Big)^{1/k_*}\big\|w^{k-1} + (k-1)\eta\mathbf{1}\big\|_{k_*} - 1\Bigg)_+.$$
Plugging this into the Lagrangian gives
$$L(\eta) = \min_{\lambda\ge 0}\max_p\mathcal{L}(p, \eta, \lambda) = B_{\psi_p}(p(\eta), w) - \eta\big(p(\eta)^\top\mathbf{1} - 1\big)$$
where $p(\eta) = (1 + \lambda(\eta))^{1-k_*}\big(w^{k-1} + (k-1)\eta\mathbf{1}\big)^{k_*-1}$. It remains to minimize $L(\eta)$; since the derivative $\frac{d}{d\eta}L(\eta)$ is monotone, we can run a bisection search to find $\eta$ such that $\frac{d}{d\eta}L(\eta) = 0$. To this end, compute
$$\frac{d}{d\eta}L(\eta) = (1+\lambda(\eta))^{1-k_*}\Big(\frac{\lambda'(\eta)}{k-1} - 1\Big)\sum_{i=1}^n\big(w_i^{k-1} + (k-1)\eta\big)^{k_*-1} - (1+\lambda(\eta))^{2-k_*}\frac{\lambda'(\eta)}{k-1}\sum_{i=1}^n w_i^{k-1}\big(w_i^{k-1} + (k-1)\eta\big)^{k_*-2} - \eta\lambda'(\eta)(1+\lambda(\eta))^{2-k_*}\sum_{i=1}^n\big(w_i^{k-1} + (k-1)\eta\big)^{k_*-2}$$
where
$$\lambda'(\eta) = \begin{cases}(k-1)\Big(\dfrac{n^k}{k(k-1)\rho+n}\Big)^{1/k_*}\big\|w^{k-1} + (k-1)\eta\mathbf{1}\big\|_{k_*}^{1-k_*}\displaystyle\sum_{i=1}^n\big(w_i^{k-1} + (k-1)\eta\big)^{k_*-1} & \text{if } \lambda(\eta) \ge 0,\\[1ex] 0 & \text{otherwise.}\end{cases}$$
Since evaluating $\frac{d}{d\eta}L(\eta)$ takes $O(n)$ time, the bisection on $\eta$ finds an $\epsilon$-accurate solution in $O(n\log\frac{1}{\epsilon})$ time. Using this optimal $\eta$ to compute $p(\eta)$ takes another $O(n)$ time.

B.2 KL divergence (k = 1)

The Lagrangian for the optimization problem (17) with $\psi_p(p) = \sum_{i=1}^n p_i\log p_i$ is
$$\mathcal{L}(p, \eta, \lambda) = \sum_{i=1}^n p_i\log\frac{p_i}{w_i} - \eta(p^\top\mathbf{1} - 1) - \frac{\lambda}{n}\Big(\rho - \sum_{i=1}^n np_i\log(np_i)\Big).$$
The first-order conditions for $p$ yield $p_i = w_i^{\frac{1}{1+\lambda}}\,n^{-\frac{\lambda}{1+\lambda}}\exp\big(\frac{\eta}{1+\lambda}\big)$, and from $p^\top\mathbf{1} = 1$ it follows that $p_i = w_i^{\frac{1}{1+\lambda}}\big/\sum_{j=1}^n w_j^{\frac{1}{1+\lambda}}$. Plugging this back into the Lagrangian, we have
$$L(\lambda) = \min_\eta\max_p\mathcal{L}(p, \eta, \lambda) = \lambda\Big(\log n - \frac{\rho}{n}\Big) - \alpha\hat\ell_{I_t}(x_t) - (1+\lambda)\log\sum_{i=1}^n w_i^{\frac{1}{1+\lambda}}.$$
Taking derivatives, we get
$$\frac{d}{d\lambda}L(\lambda) = \log n - \frac{\rho}{n} - \log\sum_{i=1}^n w_i^{\frac{1}{1+\lambda}} - \frac{\sum_{i=1}^n w_i^{-\frac{\lambda}{1+\lambda}}}{\sum_{i=1}^n w_i^{\frac{1}{1+\lambda}}},$$
which can be computed in $O(n)$ flops. Since $L(\lambda)$ is concave, $\lambda \ge 0$ such that $\frac{d}{d\lambda}L(\lambda) = 0$ can be found to $\epsilon$-accuracy in $O(n\log\frac{1}{\epsilon})$ time. Then the update $p(\lambda)$ takes $O(n)$ time to compute.
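A direct route, shown in the sketch below (ours; `kl_update` is a hypothetical name, and we assume $w > 0$), bisects on $\lambda$ itself: the solution family is $p(\lambda) \propto w^{1/(1+\lambda)}$, and $D_{\mathrm{kl}}(p(\lambda)\,\|\,\mathbf{1}/n)$ decreases toward $0$ as $\lambda$ grows, so the smallest feasible $\lambda \ge 0$ can be found by bisection:

```python
import numpy as np

def kl_update(w, rho, tol=1e-10):
    """Sketch of the k = 1 update (17): choose lam >= 0 so that p(lam) ~ w^(1/(1+lam))
    satisfies D_kl(p || 1/n) <= rho/n, with equality unless lam = 0 already suffices."""
    n = len(w)

    def p_of(lam):
        z = w ** (1.0 / (1.0 + lam))
        return z / z.sum()

    def div(lam):
        p = p_of(lam)
        return float(np.sum(p * np.log(n * p)))   # D_kl(p || 1/n)

    if div(0.0) <= rho / n:          # normalized w already feasible
        return p_of(0.0)
    lo, hi = 0.0, 1.0
    while div(hi) > rho / n:         # bracket: divergence decreases to 0 as lam grows
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if div(mid) > rho / n else (lo, mid)
    return p_of(hi)
```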

B.3 EL divergence (k = 0)

The Lagrangian for the optimization problem (17) with $\psi_p(p) = -\sum_{i=1}^n\log p_i$ is
$$\mathcal{L}(p, \eta, \lambda) = -\sum_{i=1}^n\Big(\log\frac{p_i}{w_i} - \frac{p_i - w_i}{w_i}\Big) - \eta(p^\top\mathbf{1} - 1) - \lambda\Big(\rho + \sum_{i=1}^n\log(np_i)\Big).$$
The first-order conditions for $p$ yield $p_i = (1+\lambda)\big(\frac{1}{w_i} - \eta\big)^{-1}$. Plugging this into the divergence constraint and using strict complementarity, we have
$$\lambda(\eta) = \Bigg(\exp\Big(\frac{1}{n}\sum_{i=1}^n\log\Big(\frac{1}{nw_i} - \frac{\eta}{n}\Big) - \frac{\rho}{n}\Big) - 1\Bigg)_+.$$
Then, it suffices to solve
$$L(\eta) = \min_{\lambda\ge 0}\max_p\mathcal{L}(p, \eta, \lambda) = \sum_{i=1}^n p_i(\eta)\log\frac{p_i(\eta)}{w_i} - \eta\big(p(\eta)^\top\mathbf{1} - 1\big).$$
Since the derivative $\frac{d}{d\eta}L(\eta)$ is monotone, we can run a bisection search to find its zero. To this end, compute
$$\frac{d}{d\eta}L(\eta) = \sum_{i=1}^n\Big[p_i'(\eta)\Big(\log\frac{p_i(\eta)}{w_i} - \eta - 1\Big) + p_i(\eta)\Big] + 1$$
where
$$p_i'(\eta) = (1+\lambda(\eta))\Big(\frac{1}{w_i} - \eta\Big)^{-2} + \lambda'(\eta)\Big(\frac{1}{w_i} - \eta\Big)^{-1},$$
$$\lambda'(\eta) = \begin{cases}-\dfrac{1}{n}\exp\Big(\dfrac{1}{n}\displaystyle\sum_{i=1}^n\log\Big(\dfrac{1}{nw_i} - \dfrac{\eta}{n}\Big) - \dfrac{\rho}{n}\Big)\displaystyle\sum_{i=1}^n\Big(\dfrac{1}{w_i} - \eta\Big)^{-1} & \text{if } \lambda(\eta) > 0,\\[1ex] 0 & \text{otherwise.}\end{cases}$$
Hence, the update $p(\eta)$ can be computed in $O(n)$ flops.


B.4 Power divergences (k > 1)

After some calculations, we have that $g'(\lambda) = \frac{\partial}{\partial\lambda}B_{\psi_p}(p(\lambda), w) + \lambda\frac{\partial}{\partial\lambda}\sum_{i=1}^n f_k(np_i(\lambda))$ can be written as a linear combination, with coefficients rational in $\lambda$, $n$, $\delta$, and $k$ (e.g., multiples of $\frac{n^{k+1}\lambda}{(k-1)^2}$, $\frac{n}{k-1}$, and $\lambda(\lambda-1)\frac{n^{2k+1}}{(k-1)^2}$), of the quantities
$$\big(1 + n^k\lambda\big)^{k_*}\ \text{or}\ \big(1 + n^k\lambda\big)^{k_*-1}\quad\text{times}\quad\sum_{i\in I(\lambda)}\big(w_i^{k-1} + n\lambda\big)^{q}\ \text{or}\ \sum_{i\in I(\lambda)} w_i^{k-1}\big(w_i^{k-1} + n\lambda\big)^{q},\quad q \in \{k_*-2,\, k_*-1,\, k_*\},$$
together with the affine terms
$$-\frac{\delta^k - k\delta}{k(k-1)}\,|I(\lambda)| - \rho + \frac{n\lambda}{k} + \frac{n(\delta^k - k\delta)}{k(k-1)},$$
where
$$I(\lambda) = \Big\{1 \le i \le n : w_i^{k-1} \ge \Big(\frac{\delta}{n}\Big)^{k-1}(1 + n^k\lambda) - \lambda n\Big\}.$$
Hence, we can run a bisection search on $\lambda \ge 0$ to find the zero of the monotone function $\frac{\partial}{\partial\lambda}g(\lambda)$ as before.

When $k = 2$, under the change of variables $\bar\lambda = n^2\lambda$, we have
$$\partial_{\bar\lambda}g(\bar\lambda) = \frac{1}{(1+\bar\lambda)^2}\sum_{i\in I(\bar\lambda)}\Big(w_i - \frac{1}{n}\Big)^2 - \frac{\rho}{n^2} + \frac{(1-\delta)^2}{2n^2}\big(n - |I(\bar\lambda)|\big)$$
$$= \frac{1}{(1+\bar\lambda)^2}\sum_{i\in I(\bar\lambda)} w_i^2 - \frac{2}{n(1+\bar\lambda)^2}\sum_{i\in I(\bar\lambda)} w_i + \Big(\frac{1}{n^2(1+\bar\lambda)^2} - \frac{(1-\delta)^2}{2n^2}\Big)|I(\bar\lambda)| + \frac{(1-\delta)^2}{2n} - \frac{\rho}{n^2}. \tag{18}$$

Making the additional change of variables $\alpha = \bar\lambda/(1+\bar\lambda)$ and $I(\alpha) = \big\{i : (1-\alpha)w_i + \alpha/n \ge \frac{\delta}{n}\big\}$, we have
$$\partial_\alpha g(\alpha) = \frac{1}{2}\sum_{i\in I(\alpha)} w_i^2 - \frac{1}{n}\sum_{i\in I(\alpha)} w_i + \frac{1}{2n^2(1-\alpha)^2}\big((1-\alpha)^2 - (1-\delta)^2\big)|I(\alpha)| + \frac{1}{2n^2(1-\alpha)^2}\big(n(1-\delta)^2 - 2\rho\big), \tag{19}$$
which is non-increasing in $\alpha \in [0, 1]$.

C Procedures for Efficient Updates when k = 2

We detail the operations involving the balanced binary search tree (BST) required for Algorithm 1.The weights w are stored up to multiplicative and additive factors mult and addi. Each node in the

15

Page 16: Stochastic Gradient Methods for Distributionally Robust ...jduchi/projects/NamkoongDu16.pdfClassical techniques in machine learning and statistics, including ... (iii) achieves fast

BST stores the following variables:

i = index in 1, . . . , n of nodeleft = pointer to the left child. ∅ if empty (NULL)

right = pointer to the right child. ∅ if empty (NULL)w = weight, stored up to multiplicative and additive factors (mult and addi)Nl = number of weights in the left subtree (smaller weights)Nr = number of weights in the right subtree (bigger weights)Sl = sum of weights in the left subtree (smaller weights)Sr = sum of weights in the right subtree (bigger weights)

S2l = sum of squared weights in the left subtree (smaller weights)

S2r = sum of squared weights in the right subtree (bigger weights)

By computing 1 + Nl + Nr at the root node, the number of elements in the BST is available inconstant time.

We first give the pseudo-code for the sampling procedure used in Line 3 of Algorithm 1. Sample(tree) samples a node from the given tree with probability proportional to the weights of the nodes. At any given node, the procedure decides whether to stay at the current node or recurse down the tree by tossing a coin with probabilities proportional to the current weight w (stay), the sum of weights Sl (go left), and the sum of weights Sr (go right). The algorithm returns the node as soon as a coin flip results in a "stay" decision; at a leaf, where Sl = Sr = 0, this happens with probability one. By virtue of this recursive strategy, the sampling procedure requires $O(\log n)$ time.

Algorithm 2 Sample $I_t$
1: node ← root
2: while true do
3:   coin ← Uniform(0, 1)
4:   if coin < node.w / (node.w + node.Sl + node.Sr) then
5:     return node
6:   else if coin < (node.w + node.Sl) / (node.w + node.Sl + node.Sr) then
7:     node ← node.left
8:   else
9:     node ← node.right
10:  end if
11: end while
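For readers who prefer a concrete reference implementation, the following self-contained Python sketch (ours, not the paper's code) supports the two operations this step needs, point updates and weight-proportional sampling, using a Fenwick (binary indexed) tree instead of a red-black tree; both give $O(\log n)$ per operation. The paper's augmented BST additionally maintains sums of squares for the bisection of Algorithm 4, which a second Fenwick array over $w_i^2$ could mirror.

```python
import numpy as np

class SumTree:
    """Fenwick tree over nonnegative weights: O(log n) update and proportional sampling."""

    def __init__(self, weights):
        self.n = len(weights)
        self.w = np.zeros(self.n)
        self.tree = np.zeros(self.n + 1)    # 1-indexed array of partial range sums
        for i, wi in enumerate(weights):
            self.update(i, wi)

    def update(self, i, new_w):
        """Set weight i to new_w."""
        delta, self.w[i] = new_w - self.w[i], new_w
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & (-j)

    def total(self):
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & (-j)
        return s

    def sample(self):
        """Return index i with probability w[i] / total()."""
        u = np.random.rand() * self.total()
        idx, bit = 0, 1 << self.n.bit_length()
        while bit > 0:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] < u:
                u -= self.tree[nxt]
                idx = nxt
            bit >>= 1
        return idx                          # 0-based index of the sampled coordinate

# e.g.: st = SumTree(p); i = st.sample(); st.update(i, new_weight)
```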

Next, we briefly outline the procedure for updating the sampled node with index $I_t$ from $p_t$ to $w_{t+1}$. Using the standard BST operations Remove and Insert, this step requires time $O(\log n)$. For example, a red-black tree uses subtree rotations to update and maintain the values Nl, Nr, Sl, Sr, S2l, S2r along with the weights in logarithmic time [11]. See Duchi et al. [14] for explicit updates when storing subtree weights and counts, as in our case.

Algorithm 3 Update w
1: Input: $p_{t,I_t}$, $w_{t,I_t}$, $I_t$
2: Remove($p_{t,I_t}$, $I_t$), Insert($w_{t,I_t}$, $I_t$)
3: return root

We next give a procedure that computes an $\epsilon$-accurate solution to $\frac{\partial}{\partial\alpha}g(\alpha) = 0$ as in expression (19). We first bisect on the nodes to find the node whose weight sits at the optimal threshold. Then, we bisect on $\alpha$ to compute the exact value. Since the algorithm proceeds in two bisection stages, it takes only $O(\log n + \log\frac{1}{\epsilon})$ time.


Algorithm 4 Compute $\alpha^*$
1: node ← root, noder, nodel ← $\emptyset$
2: cnum, csum, csum2 ← 0; ℓnum, ℓsum, ℓsum2 ← 0
3: while true do
4:   w ← node.w, $\alpha \leftarrow (\delta - nw)/(1 - nw)$
5:   $g(\alpha) \leftarrow \frac{1}{2}(\text{csum2} + w^2 + \text{node.S2r}) - \frac{1}{n}(\text{csum} + w + \text{node.Sr})$
6:     $+\,\frac{1}{2n^2(1-\alpha)^2}\big((1-\alpha)^2 - (1-\delta)^2\big)(\text{cnum} + 1 + \text{node.Nr}) + \frac{1}{2n^2(1-\alpha)^2}\big(n(1-\delta)^2 - 2\rho\big)$
7:   if $g(\alpha) < 0$ then    // too small, increase $\alpha$
8:     noder ← node
9:     if node.right = $\emptyset$ then break
10:    end if
11:    node ← node.right
12:  else    // too big, decrease $\alpha$
13:    nodel ← node
14:    cnum ← cnum + 1 + node.Nr, csum ← csum + node.w + node.Sr
15:    csum2 ← csum2 + node.w² + node.S2r
16:    ℓnum ← cnum, ℓsum ← csum, ℓsum2 ← csum2
17:    if node.left = $\emptyset$ then break
18:    end if
19:    node ← node.left
20:  end if
21: end while
22: if nodel ≠ $\emptyset$ then
23:   cnum ← ℓnum, csum ← ℓsum, csum2 ← ℓsum2
24: end if
25: u ← 1, l ← 0, α ← 0.5
26: while u − l > ε do
27:   if g(α, ℓ) < 0 then
28:     u ← α
29:   else
30:     l ← α
31:   end if
32:   α ← (u + l)/2
33: end while
34: Update mult ← (1 − α)·mult, addi ← (1 − α)·addi + α/n
35: return α

In Line 27 of Algorithm 4, we use $g(\alpha, \ell)$ to denote $g(\alpha)$ as computed with ℓnum, ℓsum, ℓsum2 as the relevant sums.

Provided $\lambda^* = 1/(1 - \alpha^*)$, Algorithm 5 gives a procedure for updating the tree to $p(\lambda^*)$ in $O(\log n)$ time. By virtue of the updates (11), we have $p_i(\lambda) \ge \frac{\delta}{n}$ for $i \ne I_t$ since $w_i \ge \frac{\delta}{n}$. Hence, the only potential truncation is for index $I_t$, which takes $O(\log n)$ time by removing and reinserting the node into the tree.

Algorithm 5 Update p
1: Input: $\lambda^*$, $w_{t,I_t}$, $I_t$
2: if $w_{t,I_t} < \frac{\delta}{n}$ then
3:   // If the modified weight was too low, truncate.
4:   Remove($w_{t,I_t}$, $I_t$), Insert($\frac{\delta}{n}$, $I_t$)
5: end if
