Ch3 Abstract DP

3 Monotonic-Semicontractive Models

Contents

3.1. Semicontractive Models
3.2. Fixed Points and Optimality Conditions
3.3. Algorithms
    3.3.1. Asynchronous Value Iteration
    3.3.2. Asynchronous Policy Iteration
3.4. Notes, Sources, and Exercises

DRAFT - WORK IN PROGRESS: 10/13/2012

In this chapter, we discuss abstract DP models where $T_\mu$ is a monotone mapping that has a contraction-like character for some but not all $\mu \in \mathcal{M}$. The resulting models, called semicontractive, have very strong properties under certain assumptions, which ensure that noncontractive policies cannot be optimal. While these assumptions involve certain restrictions, the models apply to important classes of problems, including the stochastic shortest path problems discussed in Example 1.2.6.

3.1 SEMICONTRACTIVE MODELS

Our basic model in this chapter is the same as the one of Chapter 2, but the assumptions are different. In particular, we will maintain the monotonicity assumption, but we will weaken the contraction assumption, and we will introduce some other conditions in its place.

We consider a set of “states” $X$ and a set of “controls” $U$, and for each $x \in X$, a nonempty control constraint set $U(x) \subset U$. We denote by $\mathcal{M}$ the set of all functions $\mu : X \mapsto U$ with $\mu(x) \in U(x)$ for all $x \in X$, and by $R(X)$ the set of real-valued functions $J : X \mapsto \Re$. We have a mapping $H : X \times U \times R(X) \mapsto \Re$ and for each $\mu \in \mathcal{M}$, we consider the mapping $T_\mu : R(X) \mapsto R(X)$ defined by

$$(T_\mu J)(x) = H\big(x, \mu(x), J\big), \qquad \forall\ x \in X.$$

We also consider the mapping $T$ defined by

$$(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \qquad \forall\ x \in X.$$
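To make these abstract definitions concrete, the following Python sketch instantiates $H$, $T_\mu$, and $T$ for a small deterministic shortest path problem. The two-state model, its costs, and the names `MODEL`, `H`, `T_mu`, and `T` are illustrative assumptions and do not come from the text; termination is modeled as a cost-free absorbing state outside $X$.

```python
# Hypothetical two-state deterministic shortest path model (an assumption,
# not an example from the text). Each (state, control) pair maps to
# (stage cost, next state); a next state of None means termination.
MODEL = {
    0: {"exit": (3.0, None), "move": (1.0, 1)},
    1: {"exit": (1.0, None), "loop": (1.0, 0)},
}

def H(x, u, J):
    """H(x, u, J) = stage cost + cost-to-go J at the next state (0 at termination)."""
    cost, nxt = MODEL[x][u]
    return cost + (0.0 if nxt is None else J[nxt])

def T_mu(mu, J):
    """(T_mu J)(x) = H(x, mu(x), J) for all x in X."""
    return {x: H(x, mu[x], J) for x in MODEL}

def T(J):
    """(T J)(x) = min over u in U(x) of H(x, u, J); U(x) is finite here."""
    return {x: min(H(x, u, J) for u in MODEL[x]) for x in MODEL}

J_zero = {0: 0.0, 1: 0.0}
print(T_mu({0: "exit", 1: "loop"}, J_zero))
print(T(J_zero))
```

Because $U(x)$ is finite in this sketch, the infimum in the definition of $T$ is a minimum; for infinite control sets the `min` would have to be replaced by a problem-specific optimization.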

Let $\Pi$ be the set of all sequences $\pi = \{\mu_0, \mu_1, \ldots\}$ with $\mu_k \in \mathcal{M}$ for all $k$ (nonstationary policies in the DP context), let $\bar J$ be an initial real-valued function, and define

$$J_\pi(x) = \limsup_{k \to \infty}\, (T_{\mu_0} \cdots T_{\mu_k} \bar J)(x), \qquad \forall\ x \in X.$$

We want to find the optimal cost function $J^*$,

$$J^*(x) = \inf_{\pi \in \Pi} J_\pi(x), \qquad \forall\ x \in X,$$

and a corresponding optimal policy.

We will assume throughout the chapter the monotonicity assumption of Chapter 2.

Assumption 3.1.1: (Monotonicity) If $J, J' \in R(X)$ and $J \le J'$, then

$$H(x, u, J) \le H(x, u, J'), \qquad \forall\ x \in X,\ u \in U(x).$$

As in Chapter 2, we introduce a positive function $v : X \mapsto \Re$, and the weighted sup-norm

$$\|J\| = \sup_{x \in X} \frac{|J(x)|}{v(x)}$$

on $B(X)$, the real-valued functions $J$ on $X$ such that $J(x)/v(x)$ is bounded over $x \in X$. We assume throughout this chapter that the initial function $\bar J$ belongs to $B(X)$:

$$\bar J \in B(X).$$
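For a finite state space the weighted sup-norm can be evaluated directly. The sketch below assumes that functions on $X$ are stored as Python dictionaries keyed by state; the helper name `weighted_sup_norm` is hypothetical.

```python
def weighted_sup_norm(J, v):
    """Return max over x of |J(x)| / v(x), for J and v stored as dicts over a finite X."""
    if any(v[x] <= 0 for x in J):
        raise ValueError("the weight function v must be positive")
    return max(abs(J[x]) / v[x] for x in J)

# Example: with weights v = (1, 2), the function J = (2, -6) has norm max(2/1, 6/2) = 3.
print(weighted_sup_norm({0: 2.0, 1: -6.0}, {0: 1.0, 1: 2.0}))
```

With $v \equiv 1$ this reduces to the ordinary sup-norm; on an infinite state space the maximum becomes a supremum and membership in $B(X)$ is a genuine restriction.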

A stationary policy $\mu$ is called proper if $J_\mu$ is the unique fixed point of $T_\mu$ within $B(X)$, and we have

$$T_\mu^k J \to J_\mu, \qquad \forall\ J \in B(X).$$

A policy that is not proper is called improper. These definitions generalize the ones given for stochastic shortest path problems in Example 1.2.6. A typical case where $\mu$ is proper is when $T_\mu$ is an $m$-stage contraction mapping with respect to $\|\cdot\|$ for some $m$ (which may depend on $\mu$); cf. Prop. A.2. However, even in this case, there is no restriction, over the set of all proper policies, that $m$ is bounded above, or that the modulus of contraction is bounded away from 1.

Generally, properness of a policy $\mu$ is associated with the notion of global stability of the dynamic system

$$J_{k+1} = T_\mu J_k, \qquad k = 0, 1, \ldots.$$

For properness of $\mu$, this system should converge to a unique equilibrium in $B(X)$, starting from any point in $B(X)$. This is a useful observation because there is extensive and well-established analysis of the stability properties of dynamic systems, including infinite-dimensional ones.
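The stability interpretation can be observed numerically. The sketch below runs the system $J_{k+1} = T_\mu J_k$ for one proper and one improper policy of a hypothetical two-state deterministic shortest path model (the model, costs, and names are assumptions for illustration only).

```python
# Hypothetical model: (stage cost, next state); None denotes termination.
MODEL = {
    0: {"exit": (3.0, None), "move": (1.0, 1)},
    1: {"exit": (1.0, None), "loop": (1.0, 0)},
}

def H(x, u, J):
    cost, nxt = MODEL[x][u]
    return cost + (0.0 if nxt is None else J[nxt])

def iterate(mu, J, k):
    """Run J_{k+1} = T_mu J_k for k steps and return the final iterate."""
    for _ in range(k):
        J = {x: H(x, mu[x], J) for x in MODEL}
    return J

proper = {0: "move", 1: "exit"}    # reaches termination from every state
improper = {0: "move", 1: "loop"}  # cycles between the two states forever

# The proper policy settles at the same equilibrium from different starting points.
print(iterate(proper, {0: 10.0, 1: -5.0}, 5), iterate(proper, {0: 0.0, 1: 0.0}, 5))
# The improper policy accumulates one unit of cost per step and diverges.
print(iterate(improper, {0: 0.0, 1: 0.0}, 50))
```

In this example the iterates of the improper policy grow without bound at every state, which is the behavior that Eq. (3.1) of the Semicontraction Assumption requires at one state at least.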

3.2 FIXED POINTS AND OPTIMALITY CONDITIONS

We will now develop the main analytical results for semicontractive models. These results are almost as strong as those for the contractive models of Chapter 2. In particular, under assumptions to be given shortly, we show that:

(a) The optimal cost function $J^*$ is the unique fixed point of the mapping $T$ within $B(X)$.

(b) A stationary policy $\mu$ is optimal if and only if $T_\mu J^* = TJ^*$.

(c) The method of value iteration converges to the optimal cost function $J^*$ starting from an arbitrary initial function in $B(X)$.

(d) The method of policy iteration yields an optimal proper policy starting from an arbitrary proper policy.

(e) There are valid synchronous, asynchronous, and optimistic versions of value and policy iteration.

Thus semicontractive models exhibit similar behavior to the contractive models of Chapter 2, where all policies are proper (in fact contractions with a common contraction modulus). To obtain this behavior we will impose conditions guaranteeing that only proper policies can be optimal, and that there is enough mathematical structure to bypass all complications due to the presence of improper policies. These conditions set the direction of the subsequent analysis, which in effect aims to confine attention to proper policies, thereby taking advantage of their inherent contraction-like defining property. With this in mind we introduce the following assumption.

Assumption 3.2.1: (Semicontraction)

(a) For all $J \in B(X)$ and $\mu \in \mathcal{M}$, the functions $T_\mu J$ and $TJ$ belong to $B(X)$.

(b) There exists at least one proper policy, and there exists $\underline{J} \in B(X)$ such that $J_\mu \ge \underline{J}$ for all proper policies $\mu$.

(c) For each improper policy $\mu$ and each $J \in B(X)$, there is at least one state $x \in X$ such that

$$\limsup_{k \to \infty}\, (T_\mu^k J)(x) = \infty. \tag{3.1}$$

(d) The control set $U$ is a metric space, and the set

$$\big\{ u \in U(x) \mid H(x, u, J) \le \lambda \big\} \tag{3.2}$$

is compact for every $J \in B(X)$, $x \in X$, and $\lambda \in \Re$.

(e) For each sequence $\{J_m\} \subset B(X)$ with $J_m \downarrow J$ for some $J \in B(X)$ we have

$$\lim_{m \to \infty} H(x, u, J_m) = H(x, u, J), \qquad \forall\ x \in X,\ u \in U(x).$$

(f) For all scalars $r > 0$ and functions $J \in B(X)$, we have

$$T(J + r\,v) \le TJ + r\,v. \tag{3.3}$$

The two critical conditions of the preceding assumption are (b) and (c), which suggest that improper policies cannot be optimal. The reason is that proper policies have finite cost for all states by definition, while improper policies have infinite cost for at least one state by Eq. (3.1) [taking $J = \bar J$, we obtain $J_\mu(x) = \infty$]. Note that while contractions do not appear explicitly, the notion of a proper policy is central, and it is intimately connected with contractions.

Parts (b) and (c) also contain some additional conditions, which are technical in nature and facilitate the analysis. The lower bound $\underline{J}$ on the cost functions of the proper policies in part (b) will ensure that $J^* \in B(X)$. Note here that even if all the mappings $T_\mu$ are contractions with respect to $\|\cdot\|$, and hence all policies are proper, it does not follow that $T$ is a contraction or that $J^*$ is real-valued (see the blackmailer's problem in the exercises). †

The condition (3.1) in part (c) is satisfied if for every improper policy $\mu$ we have $J_\mu(x) = \infty$ for some $x$ and moreover $T_\mu$ has the (semi)nonexpansiveness property

$$\|T_\mu J - T_\mu \bar J\| \le \|J - \bar J\|, \qquad \forall\ J \in B(X),\ \mu \in \mathcal{M},$$

which typically holds in DP problems. In this case we have

$$\|T_\mu^k J - T_\mu^k \bar J\| \le \|J - \bar J\|, \tag{3.4}$$

so that

$$(T_\mu^k \bar J)(x) \le (T_\mu^k J)(x) + \|J - \bar J\|\, v(x), \qquad \forall\ k \ge 1,$$

and by taking lim sup as $k \to \infty$, it follows that if $J_\mu(x) = \infty$, then Eq. (3.1) holds.

† If all mappings $T_\mu$ are contractions with respect to the weighted sup-norm $\|\cdot\|$, and with the same contraction modulus (i.e., the contraction assumption of Chapter 2 holds), then $T$ is also a contraction with respect to $\|\cdot\|$ and with the same modulus. Then we have $J^* \in B(X)$ and $J_\mu \ge J^*$ for all $\mu \in \mathcal{M}$, and the lower bound $\underline{J} \in B(X)$ in part (b) is unnecessary. The existence of the lower bound $\underline{J}$ may be easily verified in some problems, such as when $H$ satisfies the following monotone increase condition

$$H(x, u, \bar J) \ge \bar J(x), \qquad \forall\ x \in X,\ u \in U(x),$$

or when the number of proper policies is finite (e.g., when $X$ and $U$ are finite sets).

Parts (d)-(f) of the Semicontraction Assumption are technical conditions, which facilitate the analysis and will be reencountered in various forms in the noncontractive models of Chapter 4. The proof of our main results does not go through if part (d) is replaced by the more general assumption that for all $J \in B(X)$, there exists $\mu \in \mathcal{M}$ such that $T_\mu J = TJ$ (cf. the next proposition). Part (e) holds if $H$ is continuous in $J$, viewed as a mapping from $B(X)$ to $\Re$, for fixed $(x, u)$. Part (f) holds if $T_\mu$ is nonexpansive, i.e.,

$$\|T_\mu J - T_\mu J'\| \le \|J - J'\|, \qquad \forall\ J, J' \in B(X),\ \mu \in \mathcal{M},$$

which also implies that $T$ is nonexpansive:

$$\|TJ - TJ'\| \le \|J - J'\|, \qquad \forall\ J, J' \in B(X).$$

The compactness assumption of part (d) holds if for each $x$ and $J \in B(X)$, the set $U(x)$ is compact and $H(x, \cdot, J)$ is lower semicontinuous as a function of $u$ over $U(x)$. This can be easily verified in a few interesting cases, including when $U$ is a finite set, and important cases of finite-state and countable-state MDP with infinite but compact control spaces (see e.g., [BeT91]).

An example where the Semicontraction Assumption is satisfied is the stochastic shortest path problem as described in Example 1.2.6. In fact the theory of the present section is patterned after the analysis of that problem in the book [Ber87] and the paper [BeT91].

Main Results

The following two propositions provide some basic preliminary results.

Proposition 3.2.1: Under Assumption 3.2.1(d), for each $J \in B(X)$, there exists a $\mu \in \mathcal{M}$ such that $T_\mu J = TJ$.

Proof: We show that the infimum of $H(x, u, J)$ over $u \in U(x)$ is attained using a form of Weierstrass' Theorem. In particular, let $\{\lambda_m\}$ be a sequence with

$$\lambda_m \downarrow \inf_{u \in U(x)} H(x, u, J).$$

The set

$$U_m(x) = \big\{ u \in U(x) \mid H(x, u, J) \le \lambda_m \big\}$$

is nonempty and compact, and the set of points attaining the infimum of $H(x, u, J)$ over $U(x)$ is the intersection $\cap_{m=0}^\infty U_m(x)$, which is nonempty by the compactness property of Assumption 3.2.1(d). Let $\mu(x)$ be a point in this intersection, for each $x \in X$. Then $\mu \in \mathcal{M}$ and

$$H\big(x, \mu(x), J\big) \le \lambda_m, \qquad \forall\ m \ge 0.$$

Taking the limit as $m \to \infty$ shows that $\mu$ satisfies $T_\mu J = TJ$. Q.E.D.

Proposition 3.2.2: Let the Monotonicity and Semicontraction Assumptions 3.1.1 and 3.2.1 hold. Then a policy $\mu \in \mathcal{M}$ is proper if and only if it satisfies

$$J \ge T_\mu J$$

for some $J \in B(X)$.

Proof: By definition, if $\mu$ is proper, $J_\mu$ is the unique fixed point of $T_\mu$ within $B(X)$. Thus we have $J \ge T_\mu J$ for $J = J_\mu$. Conversely, if $J \ge T_\mu J$ for some $J \in B(X)$, then by the monotonicity of $T_\mu$, we have $T_\mu^k J \le J$, for all $k \ge 1$. Taking lim sup as $k \to \infty$ and using Assumption 3.2.1(c), it follows that $\mu$ cannot be improper. Q.E.D.
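The criterion of Prop. 3.2.2 can be tested numerically for a given candidate $J$. The sketch below does this on a hypothetical two-state deterministic shortest path model (model, costs, and the helper name `certifies_proper` are assumptions). Note that a failed check for one particular $J$ proves nothing by itself; the proposition only requires that some $J \in B(X)$ works.

```python
# Hypothetical model: (stage cost, next state); None denotes termination.
MODEL = {
    0: {"exit": (3.0, None), "move": (1.0, 1)},
    1: {"exit": (1.0, None), "loop": (1.0, 0)},
}

def H(x, u, J):
    cost, nxt = MODEL[x][u]
    return cost + (0.0 if nxt is None else J[nxt])

def certifies_proper(mu, J):
    """Return True if J >= T_mu J holds at every state."""
    return all(J[x] >= H(x, mu[x], J) for x in MODEL)

# J = (2, 1) is the cost function of the policy (move, exit), so it certifies properness.
print(certifies_proper({0: "move", 1: "exit"}, {0: 2.0, 1: 1.0}))

# For the cycling policy (move, loop) the test would need J(0) >= 1 + J(1) and
# J(1) >= 1 + J(0) simultaneously, which is impossible; a grid search finds no certificate.
grid = [{0: float(a), 1: float(b)} for a in range(-5, 6) for b in range(-5, 6)]
print(any(certifies_proper({0: "move", 1: "loop"}, J) for J in grid))
```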

The following proposition is the main result of this section, and provides analogs to some of the principal results for contractive models given in Section 2.1.

Proposition 3.2.3: Let the Monotonicity and Semicontraction Assumptions 3.1.1 and 3.2.1 hold.

(a) The optimal cost function $J^*$ is the unique fixed point of $T$ within $B(X)$.

(b) We have $T^k J \to J^*$ for every $J \in B(X)$. Moreover, there exists an optimal proper policy.

(c) A policy $\mu \in \mathcal{M}$ is optimal if and only if $T_\mu J^* = TJ^*$.

(d) If $J \in B(X)$ is such that $J \le TJ$, we have $J \le J^*$, and if $J \ge TJ$, we have $J \ge J^*$.

Proof: (a), (b) The proof is long and proceeds in several steps. We first show that $T$ has at most one fixed point within $B(X)$. Then we construct a fixed point $J_\infty \in B(X)$ (necessarily unique) as the limit of the cost functions of a sequence of proper policies obtained through a policy iteration process. We then show that $T^k J \to J_\infty$ for all $J \in B(X)$. We finally show that $J_\infty = J^*$ and construct an optimal proper policy.

To show that $T$ has at most one fixed point, let $J \in B(X)$ and $J' \in B(X)$ be two fixed points, and select $\mu$ and $\mu'$ such that $J = TJ = T_\mu J$ and $J' = TJ' = T_{\mu'} J'$; this is possible in view of Prop. 3.2.1. By Prop. 3.2.2, $\mu$ and $\mu'$ are proper, so we have $J = J_\mu$ and $J' = J_{\mu'}$. Thus $J = T^k J \le T_{\mu'}^k J$ for all $k \ge 1$, and since $\mu'$ is proper, $J \le \lim_{k \to \infty} T_{\mu'}^k J = J_{\mu'} = J'$. Similarly, $J' \le J$, showing that $J = J'$.

We next construct a fixed point of $T$ within $B(X)$. Let $\mu$ be a proper policy [there exists one by Assumption 3.2.1(b)]. Choose $\mu'$ such that

$$T_{\mu'} J_\mu = T J_\mu$$

(cf. Prop. 3.2.1). Then we have $J_\mu = T_\mu J_\mu \ge T_{\mu'} J_\mu$. By Prop. 3.2.2, $\mu'$ is proper, and using the monotonicity of $T_{\mu'}$, we obtain

$$J_\mu \ge \lim_{k \to \infty} T_{\mu'}^k J_\mu = J_{\mu'}. \tag{3.5}$$

Continuing in the same manner, we construct a sequence of proper policies $\{\mu^k\}$ such that

$$J_{\mu^k} \ge T J_{\mu^k} \ge J_{\mu^{k+1}}, \qquad k = 0, 1, \ldots \tag{3.6}$$

Denote

$$J_\infty = \lim_{k \to \infty} T J_{\mu^k} = \lim_{k \to \infty} J_{\mu^k}.$$

We have $J_{\mu^k} \in B(X)$ since $\mu^k$ is proper, and by Assumption 3.2.1(b), we have $J_{\mu^k} \ge \underline{J} \in B(X)$, so it follows that $J_\infty \in B(X)$. By taking the limit in Eq. (3.6), we have

$$J_\infty = \lim_{k \to \infty} T J_{\mu^k} \ge T J_\infty, \tag{3.7}$$

where the inequality follows from the fact $J_{\mu^k} \downarrow J_\infty$. Using also Assumption 3.2.1(e) and the fact $J_\infty \in B(X)$, we have for all $x \in X$ and $u \in U(x)$,

$$H(x, u, J_\infty) = \lim_{k \to \infty} H(x, u, J_{\mu^k}) \ge \lim_{k \to \infty} (T J_{\mu^k})(x) = J_\infty(x).$$

By taking the infimum of the left-hand side over $u \in U(x)$, we obtain $T J_\infty \ge J_\infty$, which combined with Eq. (3.7), yields

$$J_\infty = T J_\infty.$$

Thus $J_\infty$ is a fixed point of $T$, and in view of the uniqueness property shown earlier, it is the unique fixed point within $B(X)$.

Next, before showing that $J_\infty = J^*$, we prove that $T^k J \to J_\infty$ for all $J \in B(X)$. Using Eq. (3.3) with $J = J_\infty - r\,v$ and with $J = J_\infty$, we have for any $r > 0$

$$J_\infty - r\,v = T J_\infty - r\,v \le T(J_\infty - r\,v) \le T(J_\infty + r\,v) \le T J_\infty + r\,v = J_\infty + r\,v,$$

where the middle inequality holds by the monotonicity of $T$. It follows that $T^k(J_\infty - r\,v)$ and $T^k(J_\infty + r\,v)$ converge monotonically to functions in $B(X)$, denoted $J^-$ and $J^+$, respectively. We will show that $J^-$ and $J^+$ are fixed points of $T$ and hence are equal to $J_\infty$ [since $J_\infty$ was shown to be the unique fixed point of $T$ within $B(X)$].

Indeed, since $T^k(J_\infty + r\,v) \ge J^+$, we have by the monotonicity of $T$,

$$T^{k+1}(J_\infty + r\,v) \ge T J^+,$$

and by taking the limit as $k \to \infty$, we obtain $J^+ \ge T J^+$. Using also Assumption 3.2.1(e) and the fact $J^+ \in B(X)$, we have for all $x \in X$ and $u \in U(x)$,

$$H(x, u, J^+) = \lim_{k \to \infty} H\big(x, u, T^k(J_\infty + r\,v)\big) \ge \lim_{k \to \infty} \big(T^{k+1}(J_\infty + r\,v)\big)(x) = J^+(x).$$

By taking the infimum of the left-hand side over $u \in U(x)$, we obtain $T J^+ \ge J^+$, which combined with the inequality $J^+ \ge T J^+$ shown earlier, yields $J^+ = T J^+$, so that $J^+ = J_\infty$.

Also since $T^k(J_\infty - r\,v) \le J^-$, we have by the monotonicity of $T$,

$$T^{k+1}(J_\infty - r\,v) \le T J^-,$$

and by taking the limit as $k \to \infty$, we obtain $J^- \le T J^-$. Assume to arrive at a contradiction that there exists a state $x \in X$ such that

$$J^-(x) < (T J^-)(x). \tag{3.8}$$

For every $k$, consider the sets

$$U_k(x) = \Big\{ u \in U(x) \ \Big|\ H\big(x, u, T^k(J_\infty - r\,v)\big) \le J^-(x) \Big\}. \tag{3.9}$$

Let also $u_k$ be a point attaining the infimum in

$$\big(T^{k+1}(J_\infty - r\,v)\big)(x) = \inf_{u \in U(x)} H\big(x, u, T^k(J_\infty - r\,v)\big);$$

i.e., $u_k$ is such that

$$\big(T^{k+1}(J_\infty - r\,v)\big)(x) = H\big(x, u_k, T^k(J_\infty - r\,v)\big)$$

(such a point exists by Prop. 3.2.1). For every $k$, consider the sequence $\{u_i\}_{i=k}^\infty$. Since $T^k(J_\infty - r\,v) \uparrow J^-$, it follows that for all $i \ge k$,

$$H\big(x, u_i, T^k(J_\infty - r\,v)\big) \le H\big(x, u_i, T^i(J_\infty - r\,v)\big) \le J^-(x).$$

Therefore $\{u_i\}_{i=k}^\infty \subset U_k(x)$, and since $U_k(x)$ is compact, all the limit points of $\{u_i\}_{i=k}^\infty$ belong to $U_k(x)$ and at least one such limit point exists. Hence the same is true of the limit points of the whole sequence $\{u_i\}$. It follows that if $u$ is a limit point of $\{u_i\}$ then

$$u \in \cap_{k=0}^\infty U_k(x).$$

By Eq. (3.9), this implies that for all $k \ge 0$

$$J^-(x) \ge H\big(x, u, T^k(J_\infty - r\,v)\big).$$

Taking the limit as $k \to \infty$ and using Assumption 3.2.1(e), we obtain

$$J^-(x) \ge H(x, u, J^-),$$

implying that

$$J^-(x) \ge (T J^-)(x).$$

This contradicts Eq. (3.8) and shows that $J^- = T J^-$.

In conclusion, we have

$$\lim_{k \to \infty} T^k(J_\infty - r\,v) = \lim_{k \to \infty} T^k(J_\infty + r\,v) = J_\infty, \qquad \forall\ r > 0.$$

For any $J \in B(X)$, let $r > 0$ be such that

$$J_\infty - r\,v \le J \le J_\infty + r\,v.$$

Then by the monotonicity of $T$, we have

$$T^k(J_\infty - r\,v) \le T^k J \le T^k(J_\infty + r\,v), \qquad \forall\ k = 0, 1, \ldots,$$

and it follows that $T^k J \to J_\infty$.

There remains to show that $J_\infty = J^*$ and that there exists an optimal proper policy. Let us choose $\mu \in \mathcal{M}$ such that

$$T_\mu J_\infty = T J_\infty.$$

Then $J_\infty = T_\mu J_\infty$, and by Prop. 3.2.2, $\mu$ is proper and we obtain $J_\mu = J_\infty$. We note that for any policy $\pi = \{\mu_0, \mu_1, \ldots\}$, we have

$$T_{\mu_0} \cdots T_{\mu_{k-1}} \bar J \ge T^k \bar J.$$

Taking the lim sup of both sides as $k \to \infty$ in the preceding inequality, it follows that

$$J_\pi \ge J_\infty = J_\mu,$$

so $\mu$ is an optimal stationary policy and $J_\infty = J_\mu = J^*$.

(c) If $\mu$ is optimal, then $J_\mu = J^* \in B(X)$, so $\mu$ is proper. Hence,

$$T_\mu J^* = T_\mu J_\mu = J_\mu = J^* = T J^*.$$

Conversely, if $J^* = T J^* = T_\mu J^*$, it follows from Prop. 3.2.2 that $\mu$ is proper, so $J^* = J_\mu$. Therefore, $\mu$ is optimal.

(d) If $J \le TJ$, by repeatedly applying $T$ to both sides and using the monotonicity of $T$, we obtain $J \le T^k J$ for all $k$. Taking the limit as $k \to \infty$ and using the fact $T^k J \to J^*$ [cf. part (b)], we obtain $J \le J^*$. The proof that $J \ge J^*$ if $J \ge TJ$ is similar. Q.E.D.

The preceding proof also shows that PI starting from a proper policy generates proper policies whose cost functions decrease monotonically to $J^*$, and yields an optimal proper policy in a finite number of iterations when the set of policies is finite. Exercise 4.8 from the next chapter gives a finite-state finite-control stochastic shortest path example where this is not so, and in fact the method oscillates between two suboptimal policies. In this example, the Semicontraction Assumption 3.2.1 is violated because there exist (nonoptimal) improper policies with finite cost for all initial states.
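The policy iteration process used in the proof can be sketched as follows for a hypothetical two-state deterministic shortest path model (the model, costs, and function names are assumptions). The evaluation step relies on the policy being proper: in this deterministic model a proper policy terminates within `len(MODEL)` steps, so that many applications of $T_\mu$ starting from zero give $J_\mu$ exactly. Starting from an improper policy, the same evaluation routine would return a meaningless finite value.

```python
# Hypothetical model: (stage cost, next state); None denotes termination.
MODEL = {
    0: {"exit": (3.0, None), "move": (1.0, 1)},
    1: {"exit": (1.0, None), "loop": (1.0, 0)},
}

def H(x, u, J):
    cost, nxt = MODEL[x][u]
    return cost + (0.0 if nxt is None else J[nxt])

def evaluate(mu):
    """Cost J_mu of a proper deterministic policy, via len(MODEL) applications of T_mu."""
    J = {x: 0.0 for x in MODEL}
    for _ in range(len(MODEL)):
        J = {x: H(x, mu[x], J) for x in MODEL}
    return J

def improve(J):
    """Pick mu' with T_mu' J = T J (ties broken by the order of controls in MODEL)."""
    return {x: min(MODEL[x], key=lambda u: H(x, u, J)) for x in MODEL}

def policy_iteration(mu):
    """Alternate evaluation and improvement until the policy repeats."""
    while True:
        J = evaluate(mu)
        new_mu = improve(J)
        if new_mu == mu:
            return mu, J
        mu = new_mu

# Start from the proper policy that exits immediately at both states.
print(policy_iteration({0: "exit", 1: "exit"}))
```

In this run the cost functions decrease from $(3, 1)$ to $(2, 1)$, in line with Eq. (3.6), and the method stops at the policy (move, exit).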

3.3 ALGORITHMS

Proposition 3.2.3 provides the basis for computational methods similar to the ones for the contractive model of the preceding chapter. In particular:

(a) The value iteration (VI for short) method converges to the optimal cost function $J^*$ for an arbitrary starting function in $B(X)$ [cf. Prop. 3.2.3(b)]. We will discuss in what follows some special properties of the method along with asynchronous versions.

(b) The policy iteration (PI for short) method yields an optimal proper policy starting from an arbitrary proper policy. This is shown as part of the proof that $T$ has a fixed point [cf. Prop. 3.2.3(b)]. Unfortunately, when there are improper policies, this algorithm is seriously limited, because an initial proper policy may not be known, and also because when asynchronous and approximate versions of the algorithm are used, it is difficult to guarantee that all the policies that it generates are proper. We will discuss an algorithm that is similar to the PI algorithm with a uniform fixed point of Section 2.7.3, which is unaffected by the presence of improper policies and also may be implemented asynchronously.

Throughout this section, we assume that the Monotonicity and Semicontraction Assumptions 3.1.1 and 3.2.1 hold, and in the case of the PI algorithm of Section 3.3.2, we assume that $X$ and $U$ are both finite sets (which implies that the set of policies $\mathcal{M}$ is also finite).
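As a small illustration of item (a), the following sketch runs synchronous VI on a hypothetical two-state deterministic shortest path model that contains an improper policy (the model, costs, and names are assumptions). The iterates reach the same limit from starting functions above and below $J^*$, even though $T$ is not a contraction here.

```python
# Hypothetical model: (stage cost, next state); None denotes termination.
MODEL = {
    0: {"exit": (3.0, None), "move": (1.0, 1)},
    1: {"exit": (1.0, None), "loop": (1.0, 0)},
}

def H(x, u, J):
    cost, nxt = MODEL[x][u]
    return cost + (0.0 if nxt is None else J[nxt])

def value_iteration(J, iters):
    """Apply J <- T J the given number of times."""
    for _ in range(iters):
        J = {x: min(H(x, u, J) for u in MODEL[x]) for x in MODEL}
    return J

print(value_iteration({0: 0.0, 1: 0.0}, 10))
print(value_iteration({0: 5.0, 1: -5.0}, 20))
```

The number of iterations needed depends on the starting function; the counts used above are simply large enough for this particular model.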

3.3.1 Asynchronous Value Iteration

Let us consider the general model of asynchronous distributed solution of the fixed point equation $J = TJ$ and the asynchronous distributed VI method of Section 2.7.1. The model involves a partition of $X$ into disjoint nonempty subsets $X_1, \ldots, X_m$, and a network of $m$ processors, each updating corresponding components of $J$. In particular, $J$ is partitioned as $J = (J_1, \ldots, J_m)$, where $J_\ell$ is the restriction of $J$ on the set $X_\ell$.

We assume that $J_\ell$ is updated only by processor $\ell$, and only for times $t$ in a selected subset $\mathcal{R}_\ell$ of iterations. Moreover processor $\ell$ uses components $J_j$ supplied by other processors with communication “delays” $t - \tau_{\ell j}(t)$:

$$J_\ell^{t+1} = \begin{cases} T\big(J_1^{\tau_{\ell 1}(t)}, \ldots, J_m^{\tau_{\ell m}(t)}\big) & \text{if } t \in \mathcal{R}_\ell, \\ J_\ell^t & \text{if } t \notin \mathcal{R}_\ell. \end{cases} \tag{3.10}$$

We will establish convergence by using the Asynchronous Convergence Theorem (Prop. 2.7.1) of Section 2.7.1, under the following assumption (also made in that section).

Assumption 3.3.1: (Continuous Updating and Information Renewal)

(1) The set of times $\mathcal{R}_\ell$ at which processor $\ell$ updates $J_\ell$ is infinite, for each $\ell = 1, \ldots, m$.

(2) $\lim_{t \to \infty} \tau_{\ell j}(t) = \infty$ for all $\ell, j = 1, \ldots, m$.

While $T$ may not be a contraction, $T$ is still monotone and has $J^*$ as its unique fixed point, and it turns out that this is sufficient for asynchronous convergence. Indeed, for a scalar $c > 0$, denote

$$\underline{J} = J^* - c\,v, \qquad \bar J = J^* + c\,v.$$

Then we have

$$\underline{J} \le T \underline{J} \le T \bar J \le \bar J, \tag{3.11}$$

in view of the form of $T$,

$$(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \qquad x \in X,$$

and the nonexpansiveness of $T$.

We now apply the Asynchronous Convergence Theorem (Prop. 2.7.1) with

$$S(k) = \big\{ J \in \Re^n \mid T^k \underline{J} \le J \le T^k \bar J \big\}, \qquad k = 0, 1, \ldots,$$

and with $c$ sufficiently large so that the initial condition $J^0$ satisfies $J^0 \in S(0)$. Then we can show that the sequence $\{J^t\}$ generated by the asynchronous VI algorithm converges pointwise to $J^*$. Indeed, the sets $S(k)$ clearly satisfy the Synchronous Convergence and Box Conditions of Prop. 2.7.1. They also satisfy

$$S(k + 1) \subset S(k)$$

in view of Eq. (3.11) and the monotonicity of $T$. Thus under Assumption 3.3.1, all the conditions of Prop. 2.7.1 are satisfied, and the convergence of the algorithm follows.
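The asynchronous iteration (3.10) can be simulated on a single machine. The sketch below is a minimal illustration under several assumptions that are not in the text: a hypothetical two-state model, one processor per state, a deterministic schedule in which processor $\ell$ updates when $t \bmod 2 = \ell$, and a fixed communication delay, so that $\tau_{\ell j}(t) = \max(0, t - \text{delay})$ for $j \ne \ell$. This schedule satisfies both parts of Assumption 3.3.1.

```python
# Hypothetical model: (stage cost, next state); None denotes termination.
MODEL = {
    0: {"exit": (3.0, None), "move": (1.0, 1)},
    1: {"exit": (1.0, None), "loop": (1.0, 0)},
}

def H(x, u, J):
    cost, nxt = MODEL[x][u]
    return cost + (0.0 if nxt is None else J[nxt])

def async_vi(J0, steps, delay):
    """Simulate Eq. (3.10) with two processors; processor ell owns X_ell = {ell}."""
    history = [dict(J0)]                  # history[t] holds the iterate J^t
    for t in range(steps):
        ell = t % 2                       # the processor that updates at time t
        stale_t = max(0, t - delay)       # time stamp of the delayed information
        # Processor ell sees its own component at time t and the other one delayed.
        view = {x: history[t][x] if x == ell else history[stale_t][x] for x in MODEL}
        J_next = dict(history[t])
        J_next[ell] = min(H(ell, u, view) for u in MODEL[ell])
        history.append(J_next)
    return history[-1]

print(async_vi({0: 0.0, 1: 0.0}, 40, 3))
```

With delay 0 the schedule reduces to a Gauss-Seidel sweep; with a positive delay some updates use outdated values, yet the iterates still settle at the same limit in this example.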

3.3.2 Asynchronous Policy Iteration

We will now discuss an optimistic asynchronous distributed PI algorithm that addresses the convergence issue in a way similar to the PI algorithm of Section 2.7.3, which is based on a uniform fixed point property. As in that section, the distributed asynchronous framework involves a partition of the state space into sets $X_1, \ldots, X_m$, and assignment of each subset $X_\ell$ to a processor $\ell \in \{1, \ldots, m\}$. For each $\ell$, there are two disjoint subsets of times $\mathcal{R}_\ell, \bar{\mathcal{R}}_\ell \subset \{0, 1, \ldots\}$, corresponding to policy improvement and policy evaluation iterations, respectively.

We introduce a new mapping that is a weighted sup-norm contraction, and has a common fixed point for all $\mu$. As in Section 2.7.3, the mapping operates on a pair $(V, Q)$ where:

• $V$ is a vector with a component $V(x)$ for each $x$.

• $Q$ is a vector with a component $Q(x, u)$ for each pair $(x, u)$.

The mapping produces a pair

$$\big(MF_\mu(V, Q),\ F_\mu(V, Q)\big),$$

where

• $F_\mu(V, Q)$ is a vector with a component $F_\mu(V, Q)(x, u)$ for each $(x, u)$, defined by

$$F_\mu(V, Q)(x, u) \overset{\text{def}}{=} H\big(x, u, \min\{V, Q_\mu\}\big), \tag{3.12}$$

where for any $Q$ and $\mu$, we denote by $Q_\mu$ the function of $x$ defined by

$$Q_\mu(x) = Q\big(x, \mu(x)\big), \qquad x \in X,$$

and for any two functions $V_1$ and $V_2$, we denote by $\min\{V_1, V_2\}$ the function of $x$ given by

$$\min\{V_1, V_2\}(x) = \min\big\{V_1(x), V_2(x)\big\}, \qquad x \in X.$$

• $MF_\mu(V, Q)$ is a vector with a component $\big(MF_\mu(V, Q)\big)(x)$ for each $x$, where $M$ denotes pointwise minimization over $u$, so that †

$$\big(MF_\mu(V, Q)\big)(x) = \min_{u \in U(x)} F_\mu(V, Q)(x, u). \tag{3.13}$$

† As noted earlier, we assume that $X$ and $U$ are finite sets, so the infimum over $u \in U(x)$ in various operations is attained, and we write min in place of inf.

Each processor $\ell$ operates on $V^t(x)$, $Q^t(x, u)$, and $\mu^t(x)$, only for $x$ in its “local” state space $X_\ell$. In particular, at each time $t$, each processor $\ell$ does one of the following:

(a) Local policy improvement: If $t \in \mathcal{R}_\ell$, processor $\ell$ sets for all $x \in X_\ell$,

$$V^{t+1}(x) = \min_{u \in U(x)} H\big(x, u, \min\{V^t, Q^t_{\mu^t}\}\big),$$

sets $\mu^{t+1}(x)$ to a $u$ that attains the minimum, and leaves $Q$ unchanged, i.e., $Q^{t+1}(x, u) = Q^t(x, u)$ for all $x \in X_\ell$ and $u \in U(x)$.

(b) Local policy evaluation: If $t \in \bar{\mathcal{R}}_\ell$, processor $\ell$ sets for all $x \in X_\ell$ and $u \in U(x)$,

$$Q^{t+1}(x, u) = H\big(x, u, \min\{V^t, Q^t_{\mu^t}\}\big),$$

and leaves $V$ and $\mu$ unchanged, i.e., $V^{t+1}(x) = V^t(x)$ and $\mu^{t+1}(x) = \mu^t(x)$ for all $x \in X_\ell$.

(c) No local change: If $t \notin \mathcal{R}_\ell \cup \bar{\mathcal{R}}_\ell$, processor $\ell$ leaves $Q$, $V$, and $\mu$ unchanged, i.e., $Q^{t+1}(x, u) = Q^t(x, u)$ for all $x \in X_\ell$ and $u \in U(x)$, $V^{t+1}(x) = V^t(x)$, and $\mu^{t+1}(x) = \mu^t(x)$ for all $x \in X_\ell$.

The algorithm aims to compute the optimal Q-factor function $Q^*$, from which $J^*$ may also be obtained as $J^* = MQ^*$.
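The three cases above can be sketched in a few lines of Python. The example below is an illustration under stated assumptions rather than a reference implementation: it uses a hypothetical two-state model, one processor per state, no communication delays, and a fixed four-phase schedule (processor 0 improves, processor 1 improves, processor 0 evaluates, processor 1 evaluates). It is started deliberately from an improper policy.

```python
# Hypothetical model: (stage cost, next state); None denotes termination.
MODEL = {
    0: {"exit": (3.0, None), "move": (1.0, 1)},
    1: {"exit": (1.0, None), "loop": (1.0, 0)},
}

def H(x, u, J):
    cost, nxt = MODEL[x][u]
    return cost + (0.0 if nxt is None else J[nxt])

def async_pi(steps):
    V = {x: 0.0 for x in MODEL}
    Q = {(x, u): 0.0 for x in MODEL for u in MODEL[x]}
    mu = {0: "move", 1: "loop"}          # improper starting policy
    for t in range(steps):
        ell = t % 2                      # processor ell owns X_ell = {ell}
        improvement = (t % 4) < 2        # assumed schedule: two improvements, two evaluations
        # The function min{V^t, Q^t_{mu^t}} that enters both update formulas.
        W = {x: min(V[x], Q[(x, mu[x])]) for x in MODEL}
        if improvement:
            # Case (a): update V and mu at the local state, leave Q unchanged.
            best = min(MODEL[ell], key=lambda u: H(ell, u, W))
            V[ell], mu[ell] = H(ell, best, W), best
        else:
            # Case (b): update Q at the local state, leave V and mu unchanged.
            for u in MODEL[ell]:
                Q[(ell, u)] = H(ell, u, W)
    return V, Q, mu

V, Q, mu = async_pi(40)
print(V, mu)
print(Q)
```

In this run the improper starting policy causes no difficulty: the pair $(V, Q)$ settles at $(J^*, Q^*)$ and the policy at (move, exit).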

Asynchronous Convergence

To analyze the algorithm, let us consider the function $Q^*$ defined by

$$Q^*(x, u) = H(x, u, J^*), \qquad x \in X,\ u \in U(x).$$

Note that

$$MQ^* = J^* = TJ^*,$$

where as earlier, $M$ is the operator of pointwise minimization over $u$:

$$(MQ)(x) = \min_{u \in U(x)} Q(x, u).$$

It follows that since by Prop. 3.2.3(a), $J^*$ is the unique fixed point of $T$ within $B(X)$, $Q^*$ is the unique fixed point of the mapping $F$ given by

$$(FQ)(x, u) = H\big(x, u, MQ\big), \qquad x \in X,\ u \in U(x).$$

For the analysis, we will use the asynchronous convergence ideas of Section 2.7.1. To this end, we introduce the $\mu$-dependent mapping

$$L_\mu(V, Q) = \big(MQ,\ F_\mu(V, Q)\big). \tag{3.14}$$

We will show that while $L_\mu$ may not be a sup-norm contraction, it is instead monotone and can be shown to have $(J^*, Q^*)$ as its unique fixed point for all $\mu$. These properties are sufficient to apply the Asynchronous Convergence Theorem of Prop. 2.7.1 and establish the convergence of the algorithm to $(J^*, Q^*)$.
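The common fixed point property can be checked numerically on a small example. The sketch below (hypothetical two-state model and names, assumed for illustration) computes $Q^*$ by iterating the mapping $F$, recovers $J^* = MQ^*$, and then verifies that $L_\mu(J^*, Q^*) = (J^*, Q^*)$ for every one of the four stationary policies, including the improper one. It checks the fixed point property only, not uniqueness.

```python
from itertools import product

# Hypothetical model: (stage cost, next state); None denotes termination.
MODEL = {
    0: {"exit": (3.0, None), "move": (1.0, 1)},
    1: {"exit": (1.0, None), "loop": (1.0, 0)},
}

def H(x, u, J):
    cost, nxt = MODEL[x][u]
    return cost + (0.0 if nxt is None else J[nxt])

def M(Q):
    """Pointwise minimization over u: (MQ)(x) = min_u Q(x, u)."""
    return {x: min(Q[(x, u)] for u in MODEL[x]) for x in MODEL}

def F(Q):
    """(FQ)(x, u) = H(x, u, MQ)."""
    J = M(Q)
    return {(x, u): H(x, u, J) for x in MODEL for u in MODEL[x]}

def L_mu(mu, V, Q):
    """L_mu(V, Q) = (MQ, F_mu(V, Q)) with F_mu(V, Q)(x, u) = H(x, u, min{V, Q_mu})."""
    W = {x: min(V[x], Q[(x, mu[x])]) for x in MODEL}
    return M(Q), {(x, u): H(x, u, W) for x in MODEL for u in MODEL[x]}

Q_star = {(x, u): 0.0 for x in MODEL for u in MODEL[x]}
for _ in range(10):                       # enough iterations for this small model
    Q_star = F(Q_star)
J_star = M(Q_star)

# Enumerate all stationary policies: one control choice per state.
policies = [dict(zip(MODEL, choice)) for choice in product(*(MODEL[x] for x in MODEL))]
fixed = all(L_mu(mu, J_star, Q_star) == (J_star, Q_star) for mu in policies)
print(J_star, fixed)
```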

To this end, we construct a nested sequence of sets that contain the unique fixed point $(J^*, Q^*)$, and satisfy the synchronous convergence and box conditions of that theorem. Let $\underline{F}$ and $\bar F$ denote the mappings obtained by minimization and maximization of $F_\mu$ over the finite set of all stationary policies $\mu$, respectively:

$$\underline{F}(V, Q)(x, u) = \min_\mu F_\mu(V, Q)(x, u),$$

$$\bar F(V, Q)(x, u) = \max_\mu F_\mu(V, Q)(x, u).$$

Note that in view of the definition (3.12) of $F_\mu$ and the monotonicity of $H$, for any fixed $Q$, there exists $\bar\mu$ that attains the maximum above, uniformly for all $V$ and $(x, u)$, namely $\bar\mu$ for which

$$Q_{\bar\mu}(x) = Q\big(x, \bar\mu(x)\big) = \max_{u \in U(x)} Q(x, u), \qquad \forall\ x \in X,$$

[cf. Eq. (3.12)]. Similarly, there exists $\underline{\mu}$ that attains the minimum above, uniformly for all $V$ and $(x, u)$:

$$Q_{\underline{\mu}}(x) = Q\big(x, \underline{\mu}(x)\big) = \min_{u \in U(x)} Q(x, u), \qquad \forall\ x \in X.$$

Consider the mappings $\underline{L}$ and $\bar L$ defined by

$$\underline{L}(V, Q) = \big(MQ,\ \underline{F}(V, Q)\big), \qquad \bar L(V, Q) = \big(MQ,\ \bar F(V, Q)\big). \tag{3.15}$$

We have the following proposition.

Proposition 3.3.1: For any $\mu$, the mapping $L_\mu$ of Eq. (3.14), and the mappings $\underline{L}$ and $\bar L$ of Eq. (3.15) are monotone and have $(J^*, Q^*)$ as their unique fixed point. Furthermore:

(a) For any scalar $c \ge 0$, we have

$$(J^-, Q^-) \le \underline{L}(J^-, Q^-) \le (J^*, Q^*) \le \bar L(J^+, Q^+) \le (J^+, Q^+),$$

where we denote

$$J^- = J^* - c\,e_V, \qquad Q^- = Q^* - c\,e_Q,$$

$$J^+ = J^* + c\,e_V, \qquad Q^+ = Q^* + c\,e_Q,$$

with $e_V$ and $e_Q$ being the unit functions in the spaces of $V$ and $Q$, respectively.

(b) For any $(V, Q)$ the sequences $\underline{L}^k(V, Q)$ and $\bar L^k(V, Q)$ converge to $(J^*, Q^*)$ as $k \to \infty$, where $\underline{L}^k$ (or $\bar L^k$) denotes the $k$-fold composition of $\underline{L}$ (or $\bar L$, respectively).

Proof: For any $\mu$, and $V_1, V_2, Q_1, Q_2$, with $V_1 \le V_2$ and $Q_1 \le Q_2$, we have

$$MQ_1 \le MQ_2, \qquad F_\mu(V_1, Q_1) \le F_\mu(V_2, Q_2),$$

so

$$L_\mu(V_1, Q_1) \le L_\mu(V_2, Q_2)$$

and $L_\mu$ is monotone.

To show that $L_\mu$ has $(J^*, Q^*)$ as its unique fixed point, we note that in view of the definitions of $F$ and $F_\mu$, we have

$$Q^* = FQ^*, \qquad J^* = MQ^*, \qquad F_\mu(J^*, Q^*) = FQ^* = Q^*.$$

The two rightmost relations above imply that $(J^*, Q^*)$ is a fixed point of $L_\mu$. To show uniqueness of the fixed point, let $(\hat V, \hat Q)$ be a fixed point of $L_\mu$, i.e., $\hat V = M\hat Q$ and $\hat Q = F_\mu(\hat V, \hat Q)$. Then

$$\hat Q = F_\mu(\hat V, \hat Q) = F\hat Q,$$

where the last equality follows from $\hat V = M\hat Q$. This shows that $\hat Q$ is a fixed point of $F$. Since $F$ has $Q^*$ as its unique fixed point, $\hat Q = Q^*$. It then follows that $\hat V = MQ^* = J^*$. Thus $(J^*, Q^*)$ is the unique fixed point of $L_\mu$.

The mappings $\underline{L}$ and $\bar L$ are defined by componentwise minimization and maximization of $L_\mu(V, Q)$ over $\mu$, respectively, so they inherit some of the properties that are common to all mappings $L_\mu$: they are monotone and have $(J^*, Q^*)$ as fixed point. To show uniqueness of the fixed point of $\bar L$, suppose that $(V, Q)$ is a fixed point, so $(V, Q) = \bar L(V, Q)$. Then from the property noted earlier, that for a fixed $Q$ there exists $\bar\mu$ that attains the maximum in the definition of $\bar L$, uniformly for all $(x, u)$, we have

$$(V, Q) = \bar L(V, Q) = L_{\bar\mu}(V, Q).$$

Hence $(V, Q)$ is the fixed point of $L_{\bar\mu}$, and it follows that $(V, Q) = (J^*, Q^*)$. Similarly, we show that $(J^*, Q^*)$ is the unique fixed point of $\underline{L}$.

(a) For any $\mu$, we have $L_\mu(J^*, Q^*) = (J^*, Q^*)$, so using also the definition of $L_\mu$,

$$(J^-, Q^-) \le L_\mu(J^-, Q^-) \le (J^*, Q^*) \le L_\mu(J^+, Q^+) \le (J^+, Q^+).$$

By taking minimum and maximum over $\mu$, we obtain the desired inequalities.

(b) For the given $(V, Q)$, let the scalar $c$ be such that

$$(J^-, Q^-) \le (V, Q) \le (J^+, Q^+), \tag{3.16}$$

where $(J^-, Q^-)$ and $(J^+, Q^+)$ are as defined in part (a). Denote for $k = 0, 1, \ldots,$

$$(\bar V_k, \bar Q_k) = \bar L^k(J^+, Q^+), \qquad (\underline{V}_k, \underline{Q}_k) = \underline{L}^k(J^-, Q^-). \tag{3.17}$$

From part (a) and the monotonicity of $\bar L$ and $\underline{L}$, we have that $(\bar V_k, \bar Q_k)$ converges monotonically from above to some $(\bar V, \bar Q) \ge (J^*, Q^*)$, while $(\underline{V}_k, \underline{Q}_k)$ converges monotonically from below to some $(\underline{V}, \underline{Q}) \le (J^*, Q^*)$. By taking the limit in the equation

$$(\bar V_{k+1}, \bar Q_{k+1}) = \bar L(\bar V_k, \bar Q_k),$$

and using the continuity of $\bar L$, it follows that $(\bar V, \bar Q) = \bar L(\bar V, \bar Q)$, so $(\bar V, \bar Q)$ must be equal to $(J^*, Q^*)$, the unique fixed point of $\bar L$. Similarly, $(\underline{V}, \underline{Q}) = (J^*, Q^*)$. In conclusion

$$(\bar V_k, \bar Q_k) \downarrow (J^*, Q^*), \qquad (\underline{V}_k, \underline{Q}_k) \uparrow (J^*, Q^*). \tag{3.18}$$

Now from Eqs. (3.16)-(3.17), and the monotonicity of $\underline{L}$ and $\bar L$, we have for all $k$,

$$(\underline{V}_k, \underline{Q}_k) \le \underline{L}^k(V, Q) \le \bar L^k(V, Q) \le (\bar V_k, \bar Q_k),$$

which combined with Eq. (3.18), shows that $\underline{L}^k(V, Q)$ and $\bar L^k(V, Q)$ converge to $(J^*, Q^*)$ as $k \to \infty$. Q.E.D.

Consider the pairs $(\underline{V}_k, \underline{Q}_k)$, $(\bar V_k, \bar Q_k)$ of Eq. (3.17), and the sets

$$S(k) = \big\{ (V, Q) \mid (\underline{V}_k, \underline{Q}_k) \le (V, Q) \le (\bar V_k, \bar Q_k) \big\}, \qquad k = 0, 1, \ldots,$$

whose intersection is $(J^*, Q^*)$ [cf. Eq. (3.18)]. We note that this set sequence together with the mappings $L_\mu$ satisfy the synchronous convergence and box conditions of the Asynchronous Convergence Theorem of Prop. 2.7.1 (more precisely, its time-varying version of Exercise 2.2). This proves the convergence of the algorithm that updates asynchronously the components of $V$ and $Q$ using the minimization mapping $M$ and the mapping $F_\mu$, respectively, while arbitrarily changing $\mu$ at each iteration.

Finally, let us note some variations of the asynchronous PI algorithm. For the case where the objective is just to calculate $J^*$, we may use a reduced space implementation where instead of $V^t$, $Q^t$, and $\mu^t$, we iterate on $V^t$, $J^t$, and $\mu^t$, with

$$J^t(x) \overset{\text{def}}{=} Q^t\big(x, \mu^t(x)\big), \qquad x \in X.$$

This algorithm is notationally and operationally identical to the one given in Section 2.7.3. There is also a variant with interpolation, similar to the one of Section 2.7.3.

3.4 NOTES, SOURCES, AND EXERCISES

The semicontractive model framework of Sections 3.1 and 3.2 is new. Theanalysis generalizes the one of the stochastic shortest path problem of Ex-ample 1.2.6, which involves finite state and control spaces, as well as atermination state. It is straightforward to verify the Semicontraction As-sumption 3.2.1 for this problem, so with Prop. 3.2.3, we recover the corre-sponding favorable results of [BeT91] that we noted earlier. In the absenceof a termination state, a key idea has been to generalize the notion of aproper policy from one that leads to termination with probability 1, to onethat results in a stable system.

The line of proof of Prop. 3.2.3 was established in [BeT91] for stochas-tic shortest path problems with finite state and control spaces. However,the proof given here is more intricate because the state space is infinite [theproof of [BeT91] was given for the case of a compact control constraint set,and contains an intricate part (Lemma 3 of [BeT91]) to show a lower boundon the cost functions of proper policies, which is assumed here in part (b)of the Semicontraction Assumption]. Moreover, the finiteness of the statespace is essential for the analysis of [BeT91].

The asynchronous PI algorithm of Section 3.3.2 is based on the workof Yu and Bertsekas [YuB11], which focuses on the stochastic shortest pathproblem of Example 1.2.6, and also addresses related stochastic Q-learningalgorithms. In particular, the key Prop. 3.3.1 and its proof closely followthe corresponding result and proof of [YuB11]. A related paper, which alsodeals with asynchronous optimistic PI in an abstract setting, is Bertsekasand Yu [BeY10b].

By allowing an infinite state space, the analysis of the present chapter applies, among others, to stochastic shortest path problems with a countable state space. Such problems often arise in queueing control problems where the termination state corresponds to an empty queue. The problem then is to empty the system with minimum expected cost. Generalized forms of stochastic shortest path problems, which involve an infinite (uncountable) number of states, in addition to the termination state, are analyzed by Pliska [Pli78], Hernandez-Lerma et al. [HCP99], and James and Collins [JaC06]. The latter paper allows improper policies, assumes that $J^*$ is bounded below, and generalizes the results of [BeT91] to infinite (Borel) state space, using a similar line of proof.

An important case of an SSP problem where the state space is infinite arises under imperfect state information. There the problem is converted to a perfect state information problem whose states are the belief states, i.e., the posterior probability distributions of the original state given the observations thus far. Patek [Pat07] proves results that are similar to the ones for SSP problems with perfect state information. These results also follow as special cases of the analysis of this chapter. In particular, the critical condition that the cost functions of proper policies are bounded below by some function J [cf. Assumption 3.2.1(b)] is proved as Lemma 5 in [Pat07], using the fact that the cost functions of the proper policies are bounded below by the optimal cost function of a corresponding perfect state information problem.

The stochastic shortest path problem is sometimes referred to in the literature as the transient programming problem and proper policies are also referred to as transient policies. Our analysis differs from those of the preceding references in that there is no termination state, and no notion of transition probabilities and transience (only stability, which mathematically is a more general notion than transience). Moreover, our analysis provides a bridge between the contractive and noncontractive models of Chapters 2 and 4, and combines elements of both. This results in a more unified treatment, and highlights the critical properties needed to develop the main theory, without reliance on special problem structure.

EXERCISES

3.1 (Counterexamples)

Consider an SSP problem with states $0, 1, \ldots, n$, where 0 is a cost-free and absorbing termination state. There are two controls to choose from: stop, which moves the state to the termination state at cost $\beta$, and continue, which moves from state $x = 1, \ldots, n-1$ to state $x+1$ at no cost, and from state $n$ to state 1 at no cost. The corresponding function $\bar J$ is identically 0, and the mapping $H$ is

$$H(x, u, J) = \begin{cases} \beta & \text{if } u = \text{stop and } x = 1, \ldots, n, \\ J(x+1) & \text{if } u = \text{continue and } x = 1, \ldots, n-1, \\ J(1) & \text{if } u = \text{continue and } x = n, \end{cases}$$

$$H(0, u, J) = 0, \qquad u = \text{stop or continue}.$$

Note: This can be viewed as a deterministic shortest path problem in a graph with nodes $1, \ldots, n$, and a destination node 0. The problem involves a cost-free cycle which introduces an improper policy with finite cost.

(a) Show that all policies are proper, except for the policy that continues at every state.

(b) Show that $T$ has an infinite number of fixed points.

(c) Let $\beta > 0$. Show that $J^*(x) = 0$ for all $x$, and that the improper policy is optimal. Which parts of Assumption 3.2.1 are violated?

(d) Let $\beta < 0$. Show that $J^*(x) = \beta$ for all $x \ne 0$, and that all the proper policies are optimal. Which parts of Assumption 3.2.1 are violated?

(e) Let $\beta = 0$. Show that $J^*(x) = 0$ for all $x$, and that all policies are optimal. Which parts of Assumption 3.2.1 are violated? Note: In this case $J^*$ is the unique fixed point within the set $\{J \mid J \ge \bar J\}$. This is a special case of a general result to be shown in Section 4.3.4.
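A small numerical experiment may help in exploring this exercise before attempting the proofs. The sketch below is not part of the original text: the function names `apply_T`, `is_fixed_point`, and `value_iteration`, and the parameter name `stop_cost` (standing for the cost of the stop control), are assumptions chosen for illustration. It applies the mapping $T$ obtained by minimizing $H(x, u, J)$ over the two controls at states $1, \ldots, n$.

```python
def apply_T(J, stop_cost):
    """Apply (TJ)(x) = min{stop_cost, J(next(x))} for x = 1, ..., n.

    J is a list of length n, with J[i] holding the value at state i + 1;
    next(x) = x + 1 for x < n and next(n) = 1 (the cost-free cycle).
    State 0 is omitted because H(0, u, J) = 0 there.
    """
    n = len(J)
    return [min(stop_cost, J[(i + 1) % n]) for i in range(n)]


def is_fixed_point(J, stop_cost):
    """Check TJ = J exactly; min() returns one of its arguments unchanged."""
    return apply_T(J, stop_cost) == J


def value_iteration(stop_cost, n, num_iters=50):
    """Iterate T starting from the zero function on states 1, ..., n."""
    J = [0.0] * n
    for _ in range(num_iters):
        J = apply_T(J, stop_cost)
    return J
```

For instance, with `stop_cost = 1.0` and `n = 4`, every constant vector whose common value is at most 1.0 passes `is_fixed_point`, which is consistent with part (b). Starting from the zero function, `value_iteration` stays at zero when the stop cost is positive and moves to the stop cost when it is negative, in agreement with the values stated in parts (c) and (d).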

3.2 (Blackmailer’s Dilemma)

Consider a stochastic shortest path problem where there are two states: state 1, and a cost-free and absorbing state 0. At state 1, we can choose a control $u$ with $0 < u \le 1$, while incurring a cost $-u$; we then move to state 0 with probability $u^2$, and stay in state 1 with probability $1 - u^2$. We may regard $u$ as a demand made by a blackmailer, and state 1 as the situation where the victim complies. State 0 is the situation where the victim (permanently) refuses to yield to the blackmailer’s demand. The problem then can be seen as one whereby the blackmailer tries to maximize his total gain by balancing his desire for increased demands with keeping his victim compliant. In terms of abstract DP we have

$$X = \{0, 1\}, \qquad U(0) = U(1) = (0, 1], \qquad \bar J(0) = \bar J(1) = 0,$$

and

$$H(0, u, J) = 0, \qquad H(1, u, J) = -u + (1 - u^2) J(1).$$

(a) Verify that each $\mu$ is a sup-norm contraction, and is therefore proper. Show also that

$$J_\mu(1) = -\frac{1}{\mu(1)}, \qquad J_\mu(0) = 0,$$

so that $J^*(1) = -\infty$, and there is no optimal policy. Which parts of Assumption 3.2.1 are violated?

(b) Consider a variant of the problem where at state 1, we move to state 0 at no cost with probability $u$, and stay in state 1 at a cost $-u$ with probability $1 - u$. Here we have

$$H(0, u, J) = 0, \qquad H(1, u, J) = -u + u^2 + (1 - u) J(1).$$

Verify that each $\mu$ is proper, that

$$J^*(1) = -1, \qquad J^*(0) = 0,$$

but there is no optimal policy. Which parts of Assumption 3.2.1 are violated?

(c) Repeat part (b) for the case where at state 1, we may also choose $u = 0$ at a cost $c$. Show that if $c > 0$, the policy $\mu$ that chooses $\mu(1) = 0$ satisfies $J_\mu(1) = \infty$ and is therefore improper, but still there is no optimal policy. What happens if $c = 0$ or if $c < 0$? Which parts of Assumption 3.2.1 are violated in the three cases where $c > 0$, $c = 0$, and $c < 0$?
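The policy costs in parts (a) and (b) can be checked numerically with a short sketch. It is offered only as an aid to intuition, and the function names `policy_cost_state1`, `cost_part_a`, and `cost_part_b` are assumptions introduced here. For a stationary policy with $\mu(1) = u$, the mapping at state 1 has the form $J(1) \mapsto g + p\,J(1)$, where $g$ is the expected stage cost and $p$ the probability of staying at state 1; the sketch iterates this map starting from 0.

```python
def policy_cost_state1(stage_cost, stay_prob, num_iters=5000):
    """Iterate J <- stage_cost + stay_prob * J at state 1, starting from 0."""
    J = 0.0
    for _ in range(num_iters):
        J = stage_cost + stay_prob * J
    return J


def cost_part_a(u, num_iters=5000):
    """Part (a): H(1, u, J) = -u + (1 - u^2) J(1)."""
    return policy_cost_state1(-u, 1.0 - u * u, num_iters)


def cost_part_b(u, num_iters=5000):
    """Part (b): H(1, u, J) = -u + u^2 + (1 - u) J(1)."""
    return policy_cost_state1(-u + u * u, 1.0 - u, num_iters)
```

Evaluating `cost_part_a(u)` for a decreasing sequence of values of `u` should produce values close to $-1/u$, which decrease without bound, while `cost_part_b(u)` should produce values close to $-(1 - u)$, which approach $-1$ without reaching it for any $u$ in $(0, 1]$. The number of iterations needed grows as `u` shrinks, since the contraction modulus approaches 1.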