
A deterministic algorithm for solving multistage stochastic programming problems

Regan Baucke a,b,∗, Anthony Downward a, Golbon Zakeri a,b

a Electrical Power Optimization Centre at The University of Auckland, 70 Symonds Street, Auckland 1010, New Zealand

b The Energy Centre at The University of Auckland, 12 Grafton Road, Auckland 1010, New Zealand

Abstract

Multistage stochastic programming problems are an important class of optimisation problems, especially in energy planning and scheduling. These problems and their solution methods have been of particular interest to researchers in stochastic programming recently.

Because of the large scenario trees that these problems induce, current solution methods require random sampling of the tree in order to build a candidate policy. Candidate policies are then evaluated using Monte Carlo simulation. Under certain sampling assumptions, theoretical convergence is obtained almost surely. In practice, convergence of a given policy requires a statistical test and is only guaranteed at a given level of confidence.

In this paper, we present a deterministic algorithm to solve these problems. The main feature of this algorithm is a deterministic path sampling scheme during the forward pass phase of the algorithm which is guaranteed to reduce the bound gap at all the nodes visited. Because policy simulation is no longer required, there is an improvement in performance over traditional methods for problems in which a high level of confidence is sought.

Keywords: dynamic programming, decomposition, multistage, SDDP, stochastic programming

∗ Corresponding author. Email address: [email protected] (Regan Baucke)

October 5, 2017


1. Introduction

Since the seminal work of [10], convex multistage stochastic programming problems (often implemented as stochastic dual dynamic programmes, or SDDP) have been studied extensively by researchers in the field of stochastic programming [13, 3, 11, 5].

These problems are notoriously difficult: this is due to the exponential growth of the scenario tree that represents the filtration on the problem's probability space, and the often high-dimensional state-spaces that must be represented in each stage. Several algorithms exist that solve SDDP-type problems; readers may be familiar with [3] and [11].

The size of these problems means that direct methods are insufficient; the problems require specialised decomposition algorithms. The essence of these decomposition methods is to break the problem into a series of stage problems which include a representation of the value/cost-to-go associated with future stages. These methods are rooted in the concepts underpinning dynamic programming.

This class of problems has many applications within the energy planning and scheduling industry. A classic example is the hydro-thermal scheduling problem, where a release policy for a hydro-electric reservoir is sought. This policy can be defined in terms of the value of water storage in the reservoir. In New Zealand, such a water valuation tool is made available to market participants by the electricity regulator. This tool (named EMI-DOASA, see [12]) provides a methodology for solving large instances of a simplified model of New Zealand's generation assets and transmission network. This model has been used to compute an optimal reservoir management policy for a hypothetical centrally planned system, and thereby provide a benchmark for comparison to actual releases. Moreover, the marginal value of water that defines the optimal policy can help guide the reservoir management decisions of individual agents.

Most implementations of SDDP algorithms sample the scenario tree at random to build piece-wise linear lower approximations of the cost function at each stage, in a nested fashion. The output of such methods is a set of lower-bound approximate cost functions. From these cost functions, a feasible policy can be computed by solving a sequence of stage problems that incorporate the cost-to-go functions. In order to terminate the algorithm, a convergence test must be performed, which requires the evaluation of the expected cost of the policy. The evaluation of the policy over the whole scenario tree can be a large computational expense and is intractable even for seemingly small problems. It was suggested by [10] that a policy evaluation could be achieved by Monte Carlo simulation, reducing the computational expense, but forgoing a deterministic convergence criterion.


Several proofs (see [11, 5]) exist giving convergence of these methods in an almost-sure sense. However, in practice, these methods can become slow, since parts of the scenario tree that need to be sampled to improve the lower bound can take an arbitrarily large number of iterations to be explored. In the meantime, linear cutting planes which bound the cost function (called 'cuts') are evaluated and included in the sub-problems at every iteration, but may not give new information.

It is the focus of this paper to present a new algorithm to solve these optimisation problems. The algorithm is similar to that of [11], but includes two additions: an upper-bound cost function and a deterministic scenario tree sampling scheme. We consider the problem class presented by [5]; that is, we make no assumptions about linearity of cost functions nor the polyhedral nature of the feasibility sets. The presence of an upper-bound function precludes the use of Monte Carlo simulation in order to evaluate the candidate policy, leading to an increase in computational performance.

Convergence proofs throughout this paper give results in the traditional sense; that is, we show that sequences of value function iterates become arbitrarily close to the true objective value, using our deterministic sampling scheme.

This paper is structured as follows: Section 2 introduces some preliminary concepts required for the algorithm; Section 3 extends Section 2 by describing the algorithm in the complete multistage stochastic setting; and Section 4 presents the results of several computational experiments to conclude the paper.

2. Algorithm preliminaries

This section introduces several aspects of our algorithm for multistage stochastic problems through simpler problems, to build the reader's intuition. Firstly, we provide an alternative proof for Kelley's cutting plane algorithm from [8], which utilises a converging lower bound and upper bound. We will extend this idea to the two-stage and then the multistage case. Initially, we are concerned with solving the following optimisation problem:

w∗ = min_{u∈U} W(u),   (2.1)

where U ⊂ R^n is convex and compact, and W(u) is convex and α-Lipschitz on U. We are interested in solving this problem by forming a series of approximations to the objective function. This was first seen in [8], and works by forming a lower-bound function of W, which is defined as the point-wise maximum of linear functions over the affine hull of U. We extend this method by including an upper-bound function.


2.1. Approximations of convex functions

For a given α-Lipschitz convex function W : U ⊂ R^n → R and a finite set of sample points S ⊂ U, with ū ∈ S, we define the following lower-bound function of W(u):

Definition 2.1.

W̲_S(u) = max_{ū∈S} { W(ū) + ⟨∇W(ū), u − ū⟩ },   (2.2)

where ∇W(ū) is an arbitrary sub-gradient of W at ū.

The lower-bound function possesses many attractive qualities for use in decomposition algorithms. Three that we wish to highlight here are:

S₁ ⊆ S₂ ⟹ W̲_{S₁}(u) ≤ W̲_{S₂}(u), ∀u ∈ U,   (2.3)

W̲_S(u) ≤ W(u), ∀u ∈ U, ∀S ∈ [U],   (2.4)

W̲_S(u) = W(u), ∀u ∈ S, ∀S ∈ [U],   (2.5)

where [U] is the set of all finite subsets of U. These three properties are exploited frequently throughout this paper. Further, it can be seen that because W(u) is α-Lipschitz, so is the lower-bound function W̲_S(u). These facts will not be proved in this paper; we contend that these properties are well known from previous works (for instance, [8]). We now introduce an upper-bound function of W(u), which is defined as the objective of the following optimisation problem:

W̄_S(u) = min_σ  Σ_{ū∈S} W(ū) σ_ū
          s.t.  Σ_{ū∈S} σ_ū = 1,   [μ]
                Σ_{ū∈S} ū σ_ū = u,   [λ]
                σ_ū ≥ 0, ∀ū ∈ S.   (2.6)

Intuitively, W̄_S(u) can be thought of as the minimal value over all convex combinations of points in S that contain u in their convex hull. When (2.6) is infeasible, we take the convention that W̄_S(u) = ∞. This means the function W̄_S(u) is an upper bound of W(u) for all u ∈ U; however, we do lose the favourable property of Lipschitz continuity over U with this definition, since W̄_S(u) is Lipschitz over Conv(S) only. To restore this property, we extend the definition to the following:
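To make the primal form (2.6) concrete, the following sketch (our own illustration; the name `primal_upper` and the toy data are invented) evaluates it in one dimension, where Carathéodory's theorem lets us restrict attention to convex combinations of just two sample points:

```python
import itertools

def primal_upper(u, samples):
    """Evaluate the upper bound (2.6) in 1-D: the cheapest convex combination
    of sample points whose convex hull contains u.  `samples` is a list of
    (u_bar, W(u_bar)) pairs.  Returns float('inf') when u lies outside Conv(S)."""
    best = float("inf")
    for (u1, w1), (u2, w2) in itertools.combinations(samples, 2):
        if min(u1, u2) <= u <= max(u1, u2) and u1 != u2:
            t = (u - u1) / (u2 - u1)          # weight placed on the second point
            best = min(best, (1 - t) * w1 + t * w2)
    for u1, w1 in samples:                    # a sample point covers itself exactly
        if u1 == u:
            best = min(best, w1)
    return best
```

For W(u) = u² sampled at {−1, 0, 1}, the bound interpolates between neighbouring samples (e.g. the value 0.5 at u = 0.5, above W(0.5) = 0.25), while any point outside [−1, 1] receives the value ∞, which is precisely why Definition 2.2 below is needed to restore Lipschitz continuity over all of U.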


Definition 2.2.

W̄_S(u) = max_{μ,λ}  μ + ⟨λ, u⟩
          s.t.  μ + ⟨λ, ū⟩ ≤ W(ū), ∀ū ∈ S,
                ||λ||_q ≤ α,
                μ free,   (2.7)

where ||·||_q denotes the ℓ_q-norm.

This definition is constructed from the dual of (2.6), with the magnitude of λ further constrained. We will proceed by proving that the upper-bound function W̄_S(u) has properties analogous to those of the lower-bound function W̲_S(u). We first show, in Lemma 2.1, that W̄_S(u) is α-Lipschitz, and then in Lemmas 2.2–2.4 we show analogues of the properties (2.5), (2.3), and (2.4), respectively.

Lemma 2.1. W̄_S(u) is α-Lipschitz under the ||·||_p norm, where

1/p + 1/q = 1 and p, q > 0.   (2.8)

Proof. It is easy to see that the value of |W̄_S(u₂) − W̄_S(u₁)| is bounded above by

max_λ ⟨λ, u₂ − u₁⟩  s.t. ||λ||_q ≤ α.   (2.9)

If α = 0, then clearly |W̄_S(u₂) − W̄_S(u₁)| = 0, and so the Lipschitz condition

|W̄_S(u₂) − W̄_S(u₁)| ≤ α||u₂ − u₁||_p, ∀u₁, u₂ ∈ U,   (2.10)

holds. If α > 0, then we can write (2.9) as

α × max_λ ⟨(1/α)λ, u₂ − u₁⟩  s.t. ||(1/α)λ||_q ≤ 1.   (2.11)

By the definition of the dual norm on L_p spaces, (2.11) is equal to α||u₂ − u₁||_p where 1/p + 1/q = 1, completing the proof.

Lemma 2.2. If W(u) is convex and α-Lipschitz on U under an arbitrary but fixed ||·||_p norm, and ū ∈ S, then W̄_S(ū) = W(ū).


Proof. For a given S, the points (μ, λ) in the feasible region of (2.7) represent the coefficients of a supporting hyperplane of the points (ū, W(ū)), ∀ū ∈ S, where μ is the intercept and λ is the gradient of the hyperplane. We note here that the feasible region is always non-empty, as

μ = min_{ū∈S} W(ū), and λ = 0,   (2.12)

is always feasible. Because the function W(u) is convex and α-Lipschitz, there exists a supporting hyperplane

W(ū) + ⟨g, u − ū⟩ ≤ W(u), ∀u ∈ U,   (2.13)

where ||g||_q ≤ α and g is in the set of sub-gradients of W at ū. So a feasible candidate solution for W̄_S(ū) is μ = W(ū) − ⟨g, ū⟩ and λ = g, which gives W̄_S(ū) = W(ū). If another candidate (μ, λ) gives μ + ⟨λ, ū⟩ < W(ū), then the pair is not optimal, because of the existence of the candidate above. If a further candidate (μ, λ) gives μ + ⟨λ, ū⟩ > W(ū), then the pair does not support the function at ū, making it infeasible. This completes the proof.

Lemma 2.3. If S₁ ⊆ S₂, then W̄_{S₁}(u) ≥ W̄_{S₂}(u).

Proof. The function W̄_S(u) is defined by a maximisation problem which is feasible and bounded. S₁ ⊆ S₂ implies that the feasible region of W̄_{S₂} is contained within that of W̄_{S₁}, giving the result.

Lemma 2.4. If W(u) is convex and α-Lipschitz on U under the ||·||_p norm, then W̄_S(u) ≥ W(u) for all u ∈ U.

Proof. From our monotonicity property, we have W̄_{S∗}(u) ≤ W̄_S(u), ∀S ⊆ S∗. Suppose there exists a ū ∈ U for which W̄_S(ū) < W(ū). By defining S∗ = S ∪ {ū}, and by Lemma 2.2, we have W(ū) = W̄_{S∗}(ū). So W̄_S(ū) < W̄_{S∗}(ū), which is a contradiction of the monotonicity of inclusion of W̄_S(u), completing the proof.
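In one dimension, (2.7) is a linear program in only the pair (μ, λ), so it can be evaluated exactly without an LP solver: for a fixed λ the best intercept is μ = min_ū (W(ū) − λū), and the resulting concave piecewise-linear function of λ attains its maximum at ±α or at a breakpoint of the inner minimum. The sketch below (our own naming; a general n-dimensional implementation would call an LP solver) illustrates the properties just proved:

```python
import itertools

def dual_upper(u, samples, alpha):
    """Exact 1-D evaluation of the Lipschitz-regularised upper bound (2.7).
    `samples` is a list of (u_bar, W(u_bar)) pairs; alpha is a Lipschitz constant."""
    # Candidate slopes: the endpoints of [-alpha, alpha] and every breakpoint
    # of the concave inner minimum  min_j (W_j - lam * u_j).
    cands = {alpha, -alpha}
    for (u1, w1), (u2, w2) in itertools.combinations(samples, 2):
        if u1 != u2:
            cands.add(max(-alpha, min(alpha, (w1 - w2) / (u1 - u2))))
    return max(lam * u + min(w - lam * p for p, w in samples) for lam in cands)
```

With W(u) = u² on [−2, 2] (so α = 4 suffices) and S = {−1, 0, 1}, the bound returns exactly W at the sample points (Lemma 2.2), sits above W elsewhere (Lemma 2.4), and can only tighten when a point is added to S (Lemma 2.3).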

Figure 1 shows the bounding functions for an arbitrary convex function f(x) on X ⊂ R. We now proceed to give a new proof of Kelley's Theorem that uses both bounding functions described above. We note here that this proof does not enhance the original result, but serves as a useful introduction to the upper-bound function, and gives the format for further proofs in this paper. We require the conditions set out in (2.1) on W(u) and U for the following lemma.


Figure 1: A convex function f(x) on X ⊂ R, with f̄(x) in blue and f̲(x) in red.

Lemma 2.5. Consider the sequences

u^k = arg min_{u∈U} W̲_{S^{k−1}}(u),  S^k = S^{k−1} ∪ {u^k},   (2.14)

with

w̄^k = min_{u∈U} W̄_{S^{k−1}}(u),  w̲^k = min_{u∈U} W̲_{S^{k−1}}(u).   (2.15)

We have that

lim_{k→∞} (w̄^k − w̲^k) = 0.

We will use W̄^k(u) as a notational shorthand for W̄_{S^k}(u), and similarly for W̲^k(u). We proceed with the proof of Lemma 2.5.

Proof. From (2.14) and (2.15), we have

w̲^k = W̲^{k−1}(u^k), ∀k.

For w̄^k, from (2.15) we have

w̄^k ≤ W̄^{k−1}(u), ∀k, ∀u ∈ U.

It follows then that

w̄^k − w̲^k ≤ W̄^{k−1}(u^k) − W̲^{k−1}(u^k), ∀k.   (2.16)

We will now show that the RHS of (2.16) converges to 0 as k tends to infinity; this will complete the proof. Suppose there exists some ε > 0 for which

ε ≤ W̄^{k−1}(u^k) − W̲^{k−1}(u^k), ∀k.

Subtracting W̄^{k−1}(u^j) − W̲^{k−1}(u^j) from both sides gives

ε − ( W̄^{k−1}(u^j) − W̲^{k−1}(u^j) ) ≤ ( W̄^{k−1}(u^k) − W̲^{k−1}(u^k) ) − ( W̄^{k−1}(u^j) − W̲^{k−1}(u^j) ), ∀k, j.

From the α-Lipschitz continuity of W̄^{k−1} and W̲^{k−1}, we have

ε − ( W̄^{k−1}(u^j) − W̲^{k−1}(u^j) ) ≤ 2α||u^k − u^j||, ∀k, j.

From Lemma 2.2 and (2.5), W̄^{k−1}(u^j) − W̲^{k−1}(u^j) = 0, ∀j < k, so

ε/(2α) ≤ ||u^k − u^j||, ∀j < k,

which is a contradiction of the compactness of U. It must follow, then, that no such ε > 0 exists, and w̄^k − w̲^k → 0 as k → ∞.

Because the sequences w̄^k and w̲^k bound each other and are monotonic, we remark that w̄^k, w̲^k → w∗, as defined in (2.1). We also note that w̄^k, the minimum of the upper-bound function at iteration k, simply collapses to the minimum of the values W(u^j) sampled so far; it is not necessary in this case to store the upper bound as a function over U. This simpler upper bound has been discussed by [9] in relation to its role in the Kelley Method.

2.2. The two-stage case

Suppose that instead of minimising a single convex function using the above method, one was tasked with minimising the sum of several convex functions. This problem is very familiar to the stochastic programming community; the sum in question is an expectation of value functions over different scenario realisations. Without loss of generality, we wish to compute the value of

w∗ = min_{u∈U} Σ_{ω∈Ω} p_ω W_ω(u),   (2.17)

with the same requirements of convexity and α-Lipschitz continuity of W_ω(u) for all ω ∈ Ω, and a convex and compact U. Here, Ω is finite. The first decomposition method for this problem class was developed in [1] and is commonly referred to as Benders decomposition. This decomposition method comes in two main variants: the average-cut technique and the multi-cut technique. Briefly, an average-cut method brings a single cut back to the first-stage problem at every iteration. This cut has a gradient equal to the average of all the second-stage sub-gradients. Similarly, the constant term is the average of the second-stage function evaluations. A multi-cut method includes one cut for every second-stage function; the 'averaging' is done within the first-stage optimisation problem itself.


We propose a different algorithm to solve this problem, which uses the upper-bound function introduced earlier, and a sampling scheme which determines the most 'useful' ω ∈ Ω to sample at every iteration. Our method requires the multi-cut (as opposed to average-cut) approach, in the spirit of [2]; bounding functions are computed and stored for each W_ω(u).

The algorithm is as follows. At iteration 0, define W̲⁰_ω := −∞ and W̄⁰_ω := ∞ for all ω ∈ Ω. Starting with k = 1, solve

u^k = arg min_{u∈U} Σ_{ω∈Ω} p_ω W̲^{k−1}_ω(u).   (2.18)

Next, evaluate¹

ω^k = arg max_{ω∈Ω} p_ω ( W̄^{k−1}_ω(u^k) − W̲^{k−1}_ω(u^k) ).   (2.19)

We now update our approximation W̲^{k−1}_{ω^k}(u) as

W̲^k_{ω^k}(u) := max{ W̲^{k−1}_{ω^k}(u), W_{ω^k}(u^k) + ⟨∇W_{ω^k}(u^k), u − u^k⟩ },

and W̄^{k−1}_{ω^k}(u) as

W̄^k_{ω^k}(u) = max_{μ,λ}  μ + ⟨λ, u⟩
                s.t.  μ + ⟨λ, u^j⟩ ≤ W_{ω^k}(u^j), ∀j < k,
                      μ + ⟨λ, u^k⟩ ≤ W_{ω^k}(u^k),
                      ||λ||_q ≤ α,
                      μ free.   (2.20)

No updates are performed on the other functions. This concludes an iteration of the algorithm. The following theorem gives the convergence result for the above algorithm.

Theorem 2.1. Consider the sequences of functions W̄^k_ω and W̲^k_ω and controls u^k generated by the algorithm. With

w̄^k = min_{u∈U} Σ_{ω∈Ω} p_ω W̄^{k−1}_ω(u),  w̲^k = min_{u∈U} Σ_{ω∈Ω} p_ω W̲^{k−1}_ω(u),   (2.21)

we have

lim_{k→∞} (w̄^k − w̲^k) = 0.

¹ Equation (2.19) may not have a maximal value, in which case the arg max would be empty. In particular, if there exists a D ⊆ Ω for which W̲^{k−1}_ω(u^k) = −∞, ∀ω ∈ D, then define ω^k as an arbitrary element of D.


Proof. From the definition of w̲^k in (2.21) and u^k in (2.18), we have

w̲^k = Σ_{ω∈Ω} p_ω W̲^{k−1}_ω(u^k), ∀k.   (2.22)

For w̄^k, we have

w̄^k ≤ Σ_{ω∈Ω} p_ω W̄^{k−1}_ω(u^k), ∀k.

It follows then that

w̄^k − w̲^k ≤ Σ_{ω∈Ω} p_ω ( W̄^{k−1}_ω(u^k) − W̲^{k−1}_ω(u^k) ), ∀k.

From the ω-criterion in (2.19), we have that

w̄^k − w̲^k ≤ |Ω| p_{ω^k} ( W̄^{k−1}_{ω^k}(u^k) − W̲^{k−1}_{ω^k}(u^k) ), ∀k.

From Lemma 2.5, the RHS goes to zero as k → ∞, concluding the proof.

This demonstrates the usefulness of the upper-bound function: it provides the algorithm with the ability to sample a particular ω ∈ Ω in a fashion which is guaranteed to converge. The difference between this algorithm and Benders decomposition is that only one function is sampled at every iteration, rather than the |Ω| function samples required for Benders. However, this comes at the cost of solving a multi-cut first-stage problem, and of evaluating W̄^{k−1}_ω(u^k) for all ω ∈ Ω at every iteration, in order to determine which function W_ω(u) to sample.

3. Multistage setting

In this section, we extend the ideas in the previous section to the more difficult setting of multistage optimisation. In the previous section, our bounding functions did not converge to the true objective function; rather, their minima converged to the true function's minimum. An analogue holds for the multistage setting: we show by induction that nested cost function approximations (both upper and lower) converge at the state trajectories generated by our proposed algorithm. Once again, we begin by studying the deterministic problem first, and then we move on to the stochastic version of the problem.


3.1. The deterministic case

We consider the following general deterministic multistage optimisation problem:

min_{x,u}  Σ_{t=0}^{T−1} C_t(x_t, u_t) + V_T(x_T)
s.t.  x₀ is given,
      x_{t+1} = f_t(x_t, u_t), ∀t ∈ {0, . . . , T − 1},
      x_t ∈ X_t, ∀t ∈ {0, . . . , T},
      u_t ∈ U_t(x_t), ∀t ∈ {0, . . . , T − 1}.   (3.1)

We can write the above in a dynamic programming formulation for t ∈ {1, . . . , T − 1} as

V_t(x_t) = min_{x_{t+1}, u_t}  C_t(x_t, u_t) + V_{t+1}(x_{t+1})
           s.t.  x_{t+1} = f_t(x_t, u_t),
                 x_{t+1} ∈ X_{t+1},
                 u_t ∈ U_t(x_t).   (3.2)

We require several conditions on the functions and multifunctions above. Briefly, the conditions ensure that two properties hold: that all the optimisation problems defining V_t(x_t) are always convex, feasible and bounded, and that a constraint qualification is satisfied (giving the existence of sub-gradients). We direct the reader to Appendix A for their full technical description. Our algorithm is as follows: at iteration 0, define V̲⁰_t := −∞ and V̄⁰_t := ∞, ∀t ∈ {0, . . . , T − 1}. Because the final cost function is given, we set V̲^k_T = V_T and V̄^k_T = V_T, ∀k ∈ N. Starting with k = 1, x^k_0 = x₀, with t = 0, solve:

(u^k_t, x^k_{t+1}) = arg min_{u_t, x_{t+1}}  C_t(x^k_t, u_t) + V̲^{k−1}_{t+1}(x_{t+1})
                     s.t.  x_{t+1} = f_t(x^k_t, u_t),
                           x_{t+1} ∈ X_{t+1},
                           u_t ∈ U_t(x^k_t).   (3.3)

Repeat the above² with t ← t + 1 until t = T − 1.

² Here we note that we need not make a forward pass to the final stage; we may terminate when V̄^{k−1}_{t+1}(x^k_{t+1}) − V̲^{k−1}_{t+1}(x^k_{t+1}) = 0. This occurs in the final stage automatically, but may occur earlier in the forward pass, saving computational expense.


We then begin the backward pass of the algorithm. Starting at t ← T − 1, we solve

θ̲^k_t = min_{x_{t+1}, u_t}  C_t(x^k_t, u_t) + V̲^k_{t+1}(x_{t+1})
        s.t.  x_{t+1} = f_t(x^k_t, u_t),
              x_{t+1} ∈ X_{t+1},
              u_t ∈ U_t(x^k_t),   (3.4)

and compute a sub-gradient β^k_t with respect to x^k_t. Define the lower-bound cost function as

V̲^k_t(x) := max{ V̲^{k−1}_t(x), θ̲^k_t + ⟨β^k_t, x − x^k_t⟩ }.   (3.5)

Solve

θ̄^k_t = min_{x_{t+1}, u_t}  C_t(x^k_t, u_t) + V̄^k_{t+1}(x_{t+1})
        s.t.  x_{t+1} = f_t(x^k_t, u_t),
              x_{t+1} ∈ X_{t+1},
              u_t ∈ U_t(x^k_t),   (3.6)

in order to update the upper-bound cost function as:

V̄^k_t(x) = max_{μ,λ}  μ + ⟨λ, x⟩
            s.t.  μ + ⟨λ, x^j_t⟩ ≤ θ̄^j_t, ∀j ≤ k − 1,
                  μ + ⟨λ, x^k_t⟩ ≤ θ̄^k_t,
                  ||λ||_q ≤ α,
                  μ free.   (3.7)

We repeat this process with t ← t − 1 until t = 0, and set k ← k + 1. That concludes an iteration of the algorithm.

Before proceeding with the proposed proof of convergence, several important preliminary remarks need to be made. In [5], the authors demonstrate the important result of the Lipschitz continuity of the value functions of our problem described in (3.2) and their lower-bound estimates in (3.5). This result allows the use of arguments similar to those in the proof of convergence of Kelley's Method in their own convergence proof. We take advantage of their result for our proof regarding the Lipschitz continuity of V_t(x_t) and V̲^k_t(x_t). Additionally, our proof requires a similar result with respect to the upper-bounding functions V̄^k_t(x_t). We remind the reader that the Lipschitz continuity of our upper-bound functions is, of course, automatic and is given by (3.7) combined with Lemma 2.1; no further analysis is required.


With regards to the particular Lipschitz constant employed in the proof, [5] give an expression for a bound on the sub-gradients exhibited by the value function estimates. For the purposes of our proof, we require only that such a constant exists.

Theorem 3.1 below gives the result that the upper and lower bounds converge at the optimal state trajectories. However, we first present Lemma 3.1, which provides a useful result that is leveraged in our proof of Theorem 3.1.

Lemma 3.1. Consider the sequences of functions V̄^k_t and V̲^k_t, and state trajectories x^k_t generated by the algorithm. For all t = 0, . . . , T − 1, we have

lim_{k→∞} ( V̄^k_t(x^k_t) − V̲^k_t(x^k_t) ) = 0  ⟹  lim_{k→∞} ( V̄^{k−1}_t(x^k_t) − V̲^{k−1}_t(x^k_t) ) = 0.   (3.8)

Proof. Suppose that there exists an ε > 0 for which

ε ≤ V̄^{k−1}_t(x^k_t) − V̲^{k−1}_t(x^k_t), ∀k.

Subtracting V̄^{k−1}_t(x^j_t) − V̲^{k−1}_t(x^j_t) from both sides gives

ε − ( V̄^{k−1}_t(x^j_t) − V̲^{k−1}_t(x^j_t) ) ≤ ( V̄^{k−1}_t(x^k_t) − V̲^{k−1}_t(x^k_t) ) − ( V̄^{k−1}_t(x^j_t) − V̲^{k−1}_t(x^j_t) ), ∀k, j.

Because both V̄^{k−1}_t and V̲^{k−1}_t are α-Lipschitz on X_t, we have

ε − ( V̄^{k−1}_t(x^j_t) − V̲^{k−1}_t(x^j_t) ) ≤ 2α||x^k_t − x^j_t||, ∀k, j.

Because

lim_{k→∞} ( V̄^k_t(x^k_t) − V̲^k_t(x^k_t) ) = 0,

and V̄^k_t(x) − V̲^k_t(x) is monotonically decreasing in k for all x ∈ X_t, there exists some K large enough for which V̄^{k−1}_t(x^j_t) − V̲^{k−1}_t(x^j_t) ≤ ε/2 for all k and j with K ≤ j < k. So

ε/2 ≤ 2α||x^k_t − x^j_t||, ∀ K ≤ j < k.

Dividing through, we have

ε/(4α) ≤ ||x^k_t − x^j_t||, ∀ K ≤ j < k,

which is a contradiction of the compactness of X_t. So there exists no such ε > 0 for which

ε ≤ V̄^{k−1}_t(x^k_t) − V̲^{k−1}_t(x^k_t), ∀k,

concluding the proof.


Theorem 3.1. Consider the sequences of functions V̄^k_t and V̲^k_t, and state and control trajectories (x^k_t, u^k_t) generated by the algorithm. For all t = 0, . . . , T − 1, we have

lim_{k→∞} ( V̄^k_t(x^k_t) − V̲^k_t(x^k_t) ) = 0.

Proof. The proof proceeds by backward induction. At time t + 1, the induction hypothesis is

lim_{k→∞} ( V̄^k_{t+1}(x^k_{t+1}) − V̲^k_{t+1}(x^k_{t+1}) ) = 0.

Obviously this is true for the last stage by definition. Now we want to show

lim_{k→∞} ( V̄^k_t(x^k_t) − V̲^k_t(x^k_t) ) = 0,

assuming the induction hypothesis. From the definition of our bounding functions in (3.5) and (3.7), we have

V̲^k_t(x^k_t) ≥ θ̲^k_t ≥ min_{u_t∈U_t(x^k_t), f_t(x^k_t,u_t)∈X_{t+1}} C_t(x^k_t, u_t) + V̲^k_{t+1}(f_t(x^k_t, u_t)) ≥ C_t(x^k_t, u^k_t) + V̲^{k−1}_{t+1}(x^k_{t+1}),

and

V̄^k_t(x^k_t) ≤ θ̄^k_t ≤ C_t(x^k_t, u^k_t) + V̄^{k−1}_{t+1}(x^k_{t+1}).

Subtracting these, we have

V̄^k_t(x^k_t) − V̲^k_t(x^k_t) ≤ V̄^{k−1}_{t+1}(x^k_{t+1}) − V̲^{k−1}_{t+1}(x^k_{t+1}).   (3.9)

From Lemma 3.1 (applied at t + 1), we have that the RHS of (3.9) goes to 0 as k → ∞, completing the proof.

3.2. The stochastic case

In this section, we extend the multistage deterministic model by incorporating stochasticity. The problem form is very general; rather than describe a filtration process on a particular discrete probability space, we speak in terms of a general underlying scenario tree object, with N being the set of vertices of the tree, containing the leaf vertices L ⊂ N.

It is not the intention of the authors to discuss the philosophy of modelling multistage stochastic optimisation problems here; we direct the reader to [13] and [3], which both contain interesting discussion on differing perspectives of multistage stochastic modelling. Whether taking a sample-average approximation approach or proposing that a particular tree truly represents the problem's uncertainty is of no concern, as both viewpoints permit scenario tree representations.


Each vertex n ∈ N \ L has a cost function C_n(x, u), an associated state feasibility set X_n, and a control set U_n. Arc weights p_n give the probability of arriving at vertex n from the root node. Finally, the leaves have final value functions V_m(x_m), ∀m ∈ L, defined over X_m. Of course, the deterministic case can be thought of as a special case of the above: one where the scenario tree is simply a path graph and all value functions have unit probabilities. The complete description of the optimisation problem is given by the following:

V(x̄₀) = min_{x,u}  Σ_{n∈N\L} p_n C_n(x_n, u_n) + Σ_{m∈L} p_m V_m(x_m)
         s.t.  x₀ = x̄₀,
               x_m = f_m(x_n, u_n), ∀m ∈ R(n), ∀n ∈ N \ L,
               x_n ∈ X_n, ∀n ∈ N,
               u_n ∈ U_n(x_n), ∀n ∈ N \ L.   (3.10)

The associated dynamic programming equations are given by

V_n(x_n) = min_{x_m, u_n}  C_n(x_n, u_n) + Σ_{m∈R(n)} (p_m/p_n) V_m(x_m)
           s.t.  x_m = f_m(x_n, u_n), ∀m ∈ R(n),
                 x_m ∈ X_m, ∀m ∈ R(n),
                 u_n ∈ U_n(x_n),   (3.11)

where R(n) is the set of immediate children of vertex n, and the vertex P(n) is defined as the parent of vertex n. The symbol n here refers to a vertex in the scenario tree rather than the dimension of an associated Euclidean space. Figure 2 depicts the staging structure used for the problem description. We note that this is not restrictive; differing staging frameworks can be reconciled with an appropriate transformation of variables.

Figure 2: The staging considered in our problem formulation. (In problem n, the state x_n is entered and the control u_n is made; the |R(n)| future states f_m(x_n, u_n) are evaluated; the uncertainty is then realised as some m∗, and problem m∗ is entered with state x_{m∗}.)

We require the same conditions as those laid out for the deterministic problem (3.1), but rather than the conditions holding for every stage of the problem, we have them for every vertex in the tree.

Much like the deterministic multistage algorithm, every iteration of the algorithm has two phases: a forward and a backward pass. Our algorithm is as follows. At iteration 0, define V̲⁰_n := −∞ and V̄⁰_n := ∞, ∀n ∈ N \ L. We set V̲^k_m(x) = V_m(x) and V̄^k_m(x) = V_m(x), ∀m ∈ L, ∀k. We introduce a new object Φ^k_n, which keeps track of the child vertex of vertex n with the maximal difference in bounds at iteration k during the forward pass. Starting with k = 1, x^k_0 = x̄₀, with n = 0, we begin an iteration. For the current node n, solve

(ū_n, x̄_m) = arg min_{u_n, x_m}  C_n(x^k_n, u_n) + Σ_{m∈R(n)} (p_m/p_n) V̲^{k−1}_m(x_m)
              s.t.  x_m = f_m(x^k_n, u_n), ∀m ∈ R(n),
                    x_m ∈ X_m, ∀m ∈ R(n),
                    u_n ∈ U_n(x^k_n).   (3.12)

Next, update Φ^k_n as:

Φ^k_n = arg max_{m∈R(n)} (p_m/p_n) ( V̄^{k−1}_m(x̄_m) − V̲^{k−1}_m(x̄_m) ).   (3.13)

Further, update u^k_n as ū_n, and x^k_{Φ^k_n} as x̄_{Φ^k_n}. Set n ← Φ^k_n; if Φ^k_n is not unique, then select one arbitrarily. If Φ^k_n is a leaf node, terminate the forward pass. Otherwise, continue from (3.12) with the updated vertex n. We now begin the backward pass of the algorithm. Starting with the last vertex n visited in the forward pass, we solve

¯θkn = min

xm,unCn(xkn, un) +

∑m∈R(n)

pmpn ¯V km(xm)

s.t. xm = fm(xkn, un), ∀m ∈ R(n),

xm ∈ Xm, ∀m ∈ R(n),

un ∈ Un(xn).

(3.14)

Compute the sub-gradients $\beta^k_n$ with respect to $x^k_n$, then define the lower-bound value function as
$$
\underline{V}^k_n(x) := \max\left\{\underline{V}^{k-1}_n(x),\; \underline{\theta}^k_n + \langle \beta^k_n, x - x^k_n \rangle\right\}. \tag{3.15}
$$

Solve
$$
\begin{aligned}
\bar{\theta}^k_n = \min_{x_m, u_n}\quad & C_n(x^k_n, u_n) + \sum_{m \in R(n)} \frac{p_m}{p_n}\, \bar{V}^k_m(x_m) \\
\text{s.t.}\quad & x_m = f_m(x^k_n, u_n), \quad \forall m \in R(n), \\
& x_m \in \mathcal{X}_m, \quad \forall m \in R(n), \\
& u_n \in \mathcal{U}_n(x^k_n),
\end{aligned}
\tag{3.16}
$$


in order to update the upper-bound value function as
$$
\begin{aligned}
\bar{V}^k_n(x) = \max_{\mu, \lambda}\quad & \mu + \langle \lambda, x \rangle \\
\text{s.t.}\quad & \mu + \langle \lambda, x^j_n \rangle \le \bar{\theta}^j_n, \quad \forall j \le k, \\
& \|\lambda\|_q \le \alpha, \\
& \mu, \lambda \ \text{free}.
\end{aligned}
\tag{3.17}
$$
We repeat this process from (3.14) with $n \leftarrow P(n)$ until $n = 0$. This concludes an iteration of the algorithm.
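For intuition, the upper-bound problem (3.17) can be evaluated exactly in one state dimension when $q = \infty$: the objective is concave and piecewise-linear in $\lambda$, so the maximum lies either at $\lambda = \pm\alpha$ or at a crossing of two of the linear pieces. This is an illustrative sketch under that assumption, not the paper's implementation; all names are ours.

```python
def upper_bound_1d(x, anchors, thetas, alpha):
    """Evaluate (3.17) with a scalar state and q = inf:
       max over |lam| <= alpha of  lam * x + min_j (theta_j - lam * x_j).
       Candidates: lam = +/-alpha and crossings of pairs of pieces."""
    cands = {alpha, -alpha}
    n = len(anchors)
    for a in range(n):
        for b in range(a + 1, n):
            if anchors[a] != anchors[b]:
                lam = (thetas[a] - thetas[b]) / (anchors[a] - anchors[b])
                if -alpha <= lam <= alpha:
                    cands.add(lam)

    def g(lam):
        return lam * x + min(t - lam * a for t, a in zip(thetas, anchors))

    return max(g(lam) for lam in cands)

# Two recorded points (x_j, theta_j) = (0, 2) and (1, 3); evaluate at x = 0.5:
print(upper_bound_1d(0.5, anchors=[0.0, 1.0], thetas=[2.0, 3.0], alpha=10.0))  # 2.5
```

With a large $\alpha$ this interpolates the recorded upper values; a small $\alpha$ caps the slope, which is how the Lipschitz bound enters.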

In the following theorem, we prove that the algorithm that we have detailed above for solving (3.10) is convergent.

Theorem 3.2. Consider the sequences of functions $\bar{V}^k_n$ and $\underline{V}^k_n$ and the state, control and maximal-child iterates $(x^k_n, u^k_n, \Phi^k_n)$ generated by the algorithm. For all $n \in \mathcal{N} \setminus \mathcal{L}$, we have
$$
\lim_{k \to \infty}\left(\bar{V}^k_n(x^k_n) - \underline{V}^k_n(x^k_n)\right) = 0.
$$

Proof. The proof proceeds by backwards induction. At nodes $m \in R(n)$, the induction hypothesis is
$$
\lim_{k \to \infty}\left(\bar{V}^k_m(x^k_m) - \underline{V}^k_m(x^k_m)\right) = 0.
$$
Obviously this is true for $m \in \mathcal{L}$ by definition. Now we want to show
$$
\lim_{k \to \infty}\left(\bar{V}^k_n(x^k_n) - \underline{V}^k_n(x^k_n)\right) = 0,
$$
assuming the induction hypothesis. By the same token as Theorem 3.1, from the definition of our bounding functions in (3.15) and (3.17), we have
$$
\underline{V}^k_n(x^k_n) \ge C_n(x^k_n, u^k_n) + \sum_{m \in R(n)} \frac{p_m}{p_n}\, \underline{V}^{k-1}_m(x^k_m),
$$
and
$$
\bar{V}^k_n(x^k_n) \le C_n(x^k_n, u^k_n) + \sum_{m \in R(n)} \frac{p_m}{p_n}\, \bar{V}^{k-1}_m(x^k_m).
$$
Subtracting these, we have
$$
\bar{V}^k_n(x^k_n) - \underline{V}^k_n(x^k_n) \le \sum_{m \in R(n)} \frac{p_m}{p_n}\left(\bar{V}^{k-1}_m(x^k_m) - \underline{V}^{k-1}_m(x^k_m)\right).
$$


From (3.13) we have
$$
\bar{V}^k_n(x^k_n) - \underline{V}^k_n(x^k_n) \le |R(n)|\, \frac{p_{\Phi^k_n}}{p_n}\left(\bar{V}^{k-1}_{\Phi^k_n}(x^k_{\Phi^k_n}) - \underline{V}^{k-1}_{\Phi^k_n}(x^k_{\Phi^k_n})\right). \tag{3.18}
$$
From Theorem 3.1, the right-hand side of (3.18) approaches zero as $k \to \infty$, completing the proof.

3.3. Discussion

The algorithm detailed above is deterministic: it requires no random sampling. Furthermore, there is no requirement to form an upper-bound through policy simulation. The algorithm agrees with a certain intuition: if a path which is guaranteed to have 'disagreement' in its bounds is sampled and improved, then the bound gaps over the whole tree must converge in the limit.
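To make that path-selection intuition concrete, here is a small sketch of the maximal-gap child choice in (3.13); the data, names and probabilities are hypothetical.

```python
# Pick the child with the largest probability-weighted bound gap,
# i.e. Phi_n^k = argmax_{m in R(n)} (p_m / p_n) * (ub_m - lb_m).
from dataclasses import dataclass

@dataclass
class Child:
    name: str
    p: float    # unconditional probability p_m
    ub: float   # upper bound at the candidate state x_m
    lb: float   # lower bound at the candidate state x_m

def select_child(children, p_n):
    """Return the child maximising (p_m / p_n) * (ub - lb)."""
    return max(children, key=lambda m: (m.p / p_n) * (m.ub - m.lb))

children = [Child("m1", 0.25, 110.0, 100.0),   # weighted gap 5.0
            Child("m2", 0.25, 130.0, 100.0)]   # weighted gap 15.0
phi = select_child(children, p_n=0.5)
print(phi.name)  # m2
```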

In [5], the authors present an algorithm class. For instance, the CUPPS and DOASA algorithms are both in this class, as they only differ in when cuts are formed and included in the value function estimates per major iteration. In the CUPPS approach of [3], simulation and update are performed at the same time on the same vertex; whereas in the DOASA algorithm, a complete forward pass is performed, followed by a complete backward pass (as illustrated in Figure 3). Both algorithms are convergent. We have refrained from including this generalisation in our algorithm (we present only a DOASA approach) lest we obfuscate our algorithm's upper-bound and tree-sampling aspects.

[Figure 3: CUPPS and DOASA algorithms acting on a scenario tree. The numbers indicate the order in which vertices are visited and updated by each algorithm.]

The algorithm in general does not exploit any particular structure in the scenario tree; the tree object is simply used as a means to describe a proof in the most general setting. In practice, what makes these problems tractable is the effective compression of the scenario trees brought on by a particular structure in the uncertainty. In any one iteration, the information we learn at a particular vertex could be valid for many vertices in our tree. This idea is commonly referred to as cut sharing and is discussed extensively in [7].

Our algorithm provides two major advantages. Firstly, the forward pass is guided and visits vertices in the tree which are guaranteed to 'discover' the most about the value functions. Secondly, a deterministic upper-bound can be computed in a way which takes advantage of the structure of the scenario tree in exactly the same way as described in [7] for the lower-bounds. We refer to this as column sharing; it is the key to forming a deterministic upper-bound for a given policy without enumeration.

For random sampling methods, Philpott and Guan in [11] show that almost-sure convergence can be achieved after a finite number of iterations if value functions are polyhedral. We believe that under the same assumptions, their finite-time convergence arguments hold analogously for our algorithm as well.

4. Computational experiments

In this section, we present the results of several computational experiments. Two algorithms are considered in this section: the first being the algorithm described in this paper, and the second being a multi-cut DOASA algorithm. We note that these algorithms share many similarities. Indeed, the former algorithm becomes the latter when the upper-bound information is neglected and the forward pass is conducted with random sampling.

We present a scalable test problem used to benchmark the implementations of our algorithms. Our implementations are written in Julia, making use of the JuMP library (see [4]) to interface with Gurobi (see [6]) to solve the sub-problems.

4.1. A test problem

In line with SDDP tradition, we will present a multistage hydro-thermal scheduling problem to benchmark the different solution techniques. Our problem consists of a three-reservoir, two-chain system meeting demand at two nodes with a transmission line. The reservoirs receive stage-wise independent inflows. Demand can be met by hydro generation or by a mix of thermal generation whose fuel costs are deterministic but may differ between stages. Many other hydro-thermal models with varying degrees of complexity exist – see [10] and [12] as examples.

One such stage problem is shown in (4.1) where: $l_h$ is the reservoir level; $r_h$ and $s_h$ are the release and spill from reservoir $h$ respectively; $a_{f,i}$ is the supply of type $f$ at node $i$, which has a stage-dependent cost $c^t_{f,i}$ per unit; $\mathcal{U}(l)$ represents the reservoir dynamics and reservoir capacities; and $f_m(r_h, s_h, l^t_h)$ are the linear dynamics of the system (containing a random inflow term).

$$
\begin{aligned}
V_n(l^t_h) = \min_{a,\; l^{t+1}_{h,m},\; r_h,\; s_h}\quad & \sum_{i \in I} \sum_{f \in F} c^t_{f,i}\, a_{f,i} + \mathbb{E}\big[V^{t+1}_m(l^{t+1}_{h,m})\big] \\
\text{s.t.}\quad & 0 \le a_{f,i} \le K_{f,i}, \\
& \sum_{f \in F} a_{f,1} + \sum_{h \in H} r_{h,1} + t = d^t_1, \\
& \sum_{f \in F} a_{f,2} + \sum_{h \in H} r_{h,2} - t = d^t_2, \\
& -T \le t \le T, \\
& (r_h, s_h) \in \mathcal{U}_n(l^t_h), \\
& l^{t+1}_{h,m} = f_m(r_h, s_h, l^t_h), \quad \forall m \in R(n).
\end{aligned}
\tag{4.1}
$$

Because of the independence structure of the uncertainty, we are able to share cuts (and columns!) between vertices in our scenario tree. The norm considered for this problem (which appears in (3.17)) is the $\|\cdot\|_\infty$ norm; this allows the upper-bound function to be computed as a linear programme. Our implementation involves embedding the dual of (3.17) directly into (3.16) – similar to what is done for the lower-bound functions in many SDDP implementations.

4.2. Results

Problem 1 is a 20-stage, 9-scenarios-per-stage instance of the problem above. Table 1 shows the results of several runs of the deterministic algorithm with a decreasing relative $\varepsilon$, where $\varepsilon^k = \frac{\bar{V}^k_0(x_0) - \underline{V}^k_0(x_0)}{\bar{V}^k_0(x_0)}$. As expected, in relatively few iterations, the

Table 1: Results for Problem 1

Relative ε   Lower    Upper    Time (s)   Iterations   LP Solves
20.0%        406.86   506.89   48.02      295          28025
10.0%        413.10   457.63   137.26     533          50605
5.0%         416.25   438.06   396.39     929          88180
1.0%         419.01   423.24   7461.57    3653         346080

algorithm is able to produce a 'good' policy, similar to policies generated by methods that use random sampling. Much more computational effort is required to reduce the relative bound gap as the gap begins to close. We will discuss these ideas further in relation to Figures 4 and 5. It is worth mentioning here that without our method, a policy evaluation (in order to form a deterministic upper-bound)


for this problem would require 1,519,708,182,382,116,100 (≈ 1.52 × 10^18) linear programme evaluations.
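The relative gap $\varepsilon^k$ used as the stopping tolerance in Table 1 can be sketched directly; the check below uses the bounds from the 20.0% row.

```python
def relative_gap(upper, lower):
    """epsilon = (upper - lower) / upper, the Table 1 tolerance."""
    return (upper - lower) / upper

# Bounds from the 20.0% row of Table 1:
print(round(relative_gap(506.89, 406.86), 4))  # 0.1973, just under the 20% tolerance
```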

[Figure 4: Lower-bounds on the Expected Cost between the two algorithms. Expected cost is plotted against iteration k for the deterministic method ("Determ.") and several random-sampling runs ("Random").]

Comparing results between our method and other methods requires careful consideration. We will first discuss the results shown in Figure 4. This figure plots the lower-bound at each iteration, i.e. $\underline{V}^k_0(x_0)$. Note that because DOASA uses random sampling, we have plotted the results of several runs of the algorithm on this problem, each with different random seeds. Here we can see that initially our lower-bound lags behind the lower-bounds generated by DOASA with random sampling. However, at iteration 600, we can see the lower-bound of our method begins to match the others. Beyond this, our method delivers a higher lower-bound for the same number of iterations. We believe the relatively poor early lower-bound performance is due to the fact that the algorithm's sampling prioritises bringing the upper-bound down. We conclude here that a random sampling scheme under stage-wise independence provides a coarse lower-bound estimate quite quickly, but struggles to make the marginal gains towards the later portion of the algorithm.

We conjecture that one reason for the good early performance of the random sampling method is the relatively simple uncertainty structure (stage-wise independence) in our example. Moreover, the reason the random sampling method cannot make those small gains towards the end is that these may be associated with specific sample paths that have not been sampled, whereas our method identifies a path to search which guarantees an improvement at each iteration.


[Figure 5: Convergence plot with Monte Carlo estimates for an upper bound. Expected cost is plotted against iteration k: the deterministic upper and lower bounds ("Determ."), the Monte Carlo mean ("Mean"), and the 95% confidence bound ("95%").]

4.3. Solution verification

Figure 5 shows both the upper-bound and lower-bound of the deterministic method, and an 800-sample Monte Carlo estimate of the expected cost of a policy generated from random sampling³. The dashed line is the upper-bound of a 95% confidence interval constructed about the mean, i.e. we are 95% sure that the expected cost of the current policy lies below this line and above the 95% confidence interval lower bound (not shown in the plot). The construction

Table 2: Monte Carlo estimates required for convergence tests

         Confidence Level
Gap      95%      97.5%    99%      99.9%
10%      440      574      762      1238
5%       1832     2393     3174     5161
1%       48690    63595    84366    137189

of these confidence intervals (to check for convergence) can be computationally expensive and requires the user to determine how often this check is performed, and to what confidence level. We contrast this with our deterministic algorithm: for each iteration of our algorithm, we have a bound on the maximum distance the current policy is from optimality.

³The authors would like to acknowledge Oscar Dowson for kindly providing the code which performed the Monte Carlo simulation from a policy generated by a random sampling DOASA implementation.


In Table 2 above, we investigate the computational expense required to build confidence intervals to various tolerance levels (% gap), assuming the variance given by the policy in Figure 5 at iteration 1000. Specifically, we compute the number of Monte Carlo samples required to deliver a confidence interval with the same width as 1%, 5% or 10% of the best lower-bound for the objective function. For example, almost 140,000 Monte Carlo samples would have to be carried out in order to form a 99.9% confidence interval with a width equal to the difference of the deterministic upper and lower bound at 1%. Because each sample requires the evaluation of 20 stage problems, this number of samples requires more than six times the number of LP solves (simply to evaluate a confidence interval) than our method would take to give a solution with a guaranteed 1% gap.
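The sample counts in Table 2 come from the usual normal-approximation sample-size formula. The sketch below reproduces the calculation with an arbitrary illustrative standard deviation, not the variance from Figure 5.

```python
# Number of Monte Carlo samples so that a two-sided confidence interval
# of the given level has at most the given total width.
from math import ceil
from statistics import NormalDist

def samples_needed(sigma, width, confidence):
    """Smallest n with 2 * z * sigma / sqrt(n) <= width, where z is the
    two-sided normal quantile for the given confidence level."""
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return ceil((2.0 * z * sigma / width) ** 2)

# Illustrative inputs (sigma and width are assumptions, not paper data):
print(samples_needed(sigma=100.0, width=10.0, confidence=0.95))  # 1537
```

Note the quadratic growth in `sigma / width`: halving the target width quadruples the required samples, which is why the 1% rows of Table 2 are so expensive.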

Finally, we note that convergence of a stochastic upper-bound is sensitive to the variance of the distribution of costs given by a particular policy. The higher the policy's variance, the wider the confidence interval will be for a given sample size and confidence level. This means that random sampling algorithms may find no evidence that the policy has not converged, even when the policy is far from optimal. However, since the method that we propose in this paper is deterministic, we do not need to simulate to estimate an upper-bound, and so we never encounter cases where a convergence test gives either false positives or false negatives.

Acknowledgements

The authors would like to thank the members of the Electrical Power Optimisation Centre at the University of Auckland for stimulating discussions on the topic. Secondly, we thank Oscar Dowson for kindly providing the code used to simulate a stochastic upper-bound for Figure 5. Finally, the authors would like to recognise the support of the Energy Education Trust of New Zealand over the period of the development of this work.


Appendix A. Technical conditions

We consider the following general deterministic multistage optimisation problem:

$$
\begin{aligned}
\min_{x, u}\quad & \sum_{t=0}^{T-1} C_t(x_t, u_t) + V_T(x_T) \\
\text{s.t.}\quad & x_0 \ \text{given}, \\
& x_{t+1} = f_t(x_t, u_t), \quad \forall t \in \{0, \dots, T-1\}, \\
& x_t \in \mathcal{X}_t, \quad \forall t \in \{0, \dots, T\}, \\
& u_t \in \mathcal{U}_t(x_t), \quad \forall t \in \{0, \dots, T-1\},
\end{aligned}
\tag{A.1}
$$

with the following conditions:

1. for all $t$, $\mathcal{X}_t$ is a non-empty subset of $\mathbb{R}^n$;

2. for all $t$, the multifunctions $\mathcal{U}_t : \mathbb{R}^n \rightrightarrows \mathbb{R}^m$ are convex and non-empty compact valued;

3. the final cost function $V_T$ is a convex lower semicontinuous proper function, and the cost functions $C_t(x, u)$, $\forall t < T$, are convex lower semicontinuous proper functions of $x$ and $u$;

4. for all $t < T$, the functions $f_t$ are affine;

5. the final cost function $V_T$ is finite-valued and Lipschitz-continuous on $\mathcal{X}_T$;

6. for all $t < T$, there exists $\delta_t > 0$, defining $\mathcal{X}'_t := \mathcal{X}_t + B_t(\delta_t)$, where
$$
B_t(\delta) = \{\, y \in \mathrm{Aff}(\mathcal{X}_t) \mid \|y\| \le \delta \,\},
$$
such that

(a) $\forall x \in \mathcal{X}'_t$, $\forall u \in \mathcal{U}_t(x)$, $C_t(x, u) < \infty$;

(b) $\forall x \in \mathcal{X}'_t$, $f_t(x, \mathcal{U}_t(x)) \cap \mathcal{X}_{t+1}$ is non-empty.

Conditions 1–5 ensure that (A.1) and its value function definitions are convex optimisation problems. Condition 6 is designed to give a constraint qualification condition for the optimisation problem (as it is non-linear in general). In [5], the authors call this condition extended relatively complete recourse, which is a more general class of recourse (one which includes complete recourse). In their analysis, the existence of $\delta_t$ allows the construction of a Lipschitz constant to give their Lipschitz-continuity result for $\bar{V}_t(x_t)$ and $\underline{V}_t(x_t)$. We note that these conditions are identical to those required by [5] in their paper.


References

[1] J. F. Benders. Partitioning procedures for solving mixed-variables programming problems. Numerische Mathematik, 4:238–252, 1962/63. URL http://eudml.org/doc/131533.

[2] John R. Birge and Francois V. Louveaux. A multicut algorithm for two-stage stochastic linear programs. European Journal of Operational Research, 34(3):384–392, 1988. ISSN 03772217. doi: 10.1016/0377-2217(88)90159-2.

[3] Z. Chen and W. Powell. Convergent cutting-plane and partial-sampling algorithm for multistage stochastic linear programs with recourse. Journal of Optimization Theory and Applications, 102(3):497–524, 1999. ISSN 0022-3239. doi: 10.1023/A:1022641805263.

[4] Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. arXiv:1508.01982 [math.OC], pages 1–25, 2015. URL http://arxiv.org/abs/1508.01982.

[5] P. Girardeau, V. Leclere, and A. B. Philpott. On the convergence of decomposition methods for multistage stochastic convex programs. Mathematics of Operations Research, 40(1):130–145, 2014. ISSN 0364-765X. doi: 10.1287/moor.2014.0664. URL http://dx.doi.org/10.1287/moor.2014.0664.

[6] Gurobi Optimization, Inc. Gurobi Optimizer Reference Manual, 2016. URL http://www.gurobi.com.

[7] G. Infanger and D. P. Morton. Cut sharing for multistage stochastic linear programs with interstage dependency. Mathematical Programming, 75:241–256, 1996.

[8] J. E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.

[9] Yu. Nesterov. Introductory Lectures on Convex Programming, Volume I: Basic Course. Lecture Notes, I:1–211, 1998. doi: 10.1007/978-1-4419-8853-9.

[10] M. V. F. Pereira and L. M. V. G. Pinto. Multi-stage stochastic optimization applied to energy planning. Mathematical Programming, 52(1-3):359–375, 1991. ISSN 00255610. doi: 10.1007/BF01582895.

[11] A. B. Philpott and Z. Guan. On the convergence of stochastic dual dynamic programming and related methods. Operations Research Letters, 36(4):450–455, 2008. ISSN 01676377. doi: 10.1016/j.orl.2008.01.013.

[12] Andy Philpott and Geoff Pritchard. EMI-DOASA. 2013.

[13] Alexander Shapiro. Analysis of stochastic dual dynamic programming method. European Journal of Operational Research, 209(1):63–72, 2011. ISSN 03772217. doi: 10.1016/j.ejor.2010.08.007. URL http://dx.doi.org/10.1016/j.ejor.2010.08.007.
