
    Dynamic Programming and Optimal Control

    3rd Edition, Volume II

    by

    Dimitri P. Bertsekas

    Massachusetts Institute of Technology

    Chapter 6

    Approximate Dynamic Programming

This is an updated version of the research-oriented Chapter 6 on Approximate Dynamic Programming. It will be periodically updated as new research becomes available, and will replace the current Chapter 6 in the book's next printing.

In addition to editorial revisions, rearrangements, and new exercises, the chapter includes an account of new research, which is collected mostly in Sections 6.3 and 6.8. Furthermore, a lot of new material has been added, such as an account of post-decision state simplifications (Section 6.1), regression-based TD methods (Section 6.3), feature scaling (Section 6.3), policy oscillations (Section 6.3), $\lambda$-policy iteration and exploration-enhanced TD methods, aggregation methods (Section 6.4), new Q-learning algorithms (Section 6.5), and Monte Carlo linear algebra (Section 6.8).

This chapter represents work in progress. It more than likely contains errors (hopefully not serious ones). Furthermore, its references to the literature are incomplete. Your comments and suggestions to the author at [email protected] are welcome. The date of last revision is given below.

    November 11, 2011


6

Approximate Dynamic Programming

Contents

6.1. General Issues of Cost Approximation
  6.1.1. Approximation Architectures
  6.1.2. Approximate Policy Iteration
  6.1.3. Direct and Indirect Approximation
  6.1.4. Simplifications
  6.1.5. Monte Carlo Simulation
  6.1.6. Contraction Mappings and Simulation
6.2. Direct Policy Evaluation - Gradient Methods
6.3. Projected Equation Methods
  6.3.1. The Projected Bellman Equation
  6.3.2. Projected Value Iteration - Other Iterative Methods
  6.3.3. Simulation-Based Methods
  6.3.4. LSTD, LSPE, and TD(0) Methods
  6.3.5. Optimistic Versions
  6.3.6. Multistep Simulation-Based Methods
  6.3.7. Policy Iteration Issues - Exploration
  6.3.8. Policy Oscillations - Chattering
  6.3.9. λ-Policy Iteration
  6.3.10. A Synopsis
6.4. Aggregation Methods
  6.4.1. Cost Approximation via the Aggregate Problem
  6.4.2. Cost Approximation via the Enlarged Problem
6.5. Q-Learning
  6.5.1. Convergence Properties of Q-Learning
  6.5.2. Q-Learning and Approximate Policy Iteration
  6.5.3. Q-Learning for Optimal Stopping Problems
  6.5.4. Finite Horizon Q-Learning


6.6. Stochastic Shortest Path Problems
6.7. Average Cost Problems
  6.7.1. Approximate Policy Evaluation
  6.7.2. Approximate Policy Iteration
  6.7.3. Q-Learning for Average Cost Problems
6.8. Simulation-Based Solution of Large Systems
  6.8.1. Projected Equations - Simulation-Based Versions
  6.8.2. Matrix Inversion and Regression-Type Methods
  6.8.3. Iterative/LSPE-Type Methods
  6.8.4. Multistep Methods
  6.8.5. Extension of Q-Learning for Optimal Stopping
  6.8.6. Bellman Equation Error-Type Methods
  6.8.7. Oblique Projections
  6.8.8. Generalized Aggregation by Simulation
6.9. Approximation in Policy Space
  6.9.1. The Gradient Formula
  6.9.2. Computing the Gradient by Simulation
  6.9.3. Essential Features of Critics
  6.9.4. Approximations in Policy and Value Space
6.10. Notes, Sources, and Exercises
References


In this chapter we consider approximation methods for challenging, computationally intensive DP problems. We discussed a number of such methods in Chapter 6 of Vol. I and Chapter 1 of the present volume, such as for example rollout and other one-step lookahead approaches. Here our focus will be on algorithms that are mostly patterned after two principal methods of infinite horizon DP: policy and value iteration. These algorithms form the core of a methodology known by various names, such as approximate dynamic programming, or neuro-dynamic programming, or reinforcement learning.

A principal aim of the methods of this chapter is to address problems with a very large number of states n. In such problems, ordinary linear algebra operations such as n-dimensional inner products are prohibitively time-consuming, and indeed it may be impossible to even store an n-vector in a computer memory. Our methods will involve linear algebra operations of dimension much smaller than n, and require only that the components of n-vectors are just generated when needed rather than stored.

Another aim of the methods of this chapter is to address model-free situations, i.e., problems where a mathematical model is unavailable or hard to construct. Instead, the system and cost structure may be simulated (think, for example, of a queueing network with complicated but well-defined service disciplines at the queues). The assumption here is that there is a computer program that simulates, for a given control u, the probabilistic transitions from any given state i to a successor state j according to the transition probabilities $p_{ij}(u)$, and also generates a corresponding transition cost g(i, u, j).

Given a simulator, it may be possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and the expected stage costs by averaging, and then to apply the methods discussed in earlier chapters. The methods of this chapter, however, are geared towards an alternative possibility, which is much more attractive when one is faced with a large and complex system, and one contemplates approximations. Rather than estimate explicitly the transition probabilities and costs, we will aim to approximate the cost function of a given policy or even the optimal cost-to-go function by generating one or more simulated system trajectories and associated costs, and by using some form of least squares fit.

Implicit in the rationale of methods based on cost function approximation is of course the hypothesis that a more accurate cost-to-go approximation will yield a better one-step or multistep lookahead policy. This is a reasonable but by no means self-evident conjecture, and may in fact not even be true in a given problem. In another type of method, which we will discuss somewhat briefly, we use simulation in conjunction with a gradient or other method to approximate directly an optimal policy with a policy of a given parametric form. This type of method does not aim at good cost function approximation through which a well-performing policy may be obtained. Rather, it aims directly at finding a policy with good performance.

Let us also mention two other approximate DP methods, which we have discussed at various points in other parts of the book, but will not consider further: rollout algorithms (Sections 6.4, 6.5 of Vol. I, and Section 1.3.5 of Vol. II), and approximate linear programming (Section 1.3.4).

Our main focus will be on two types of methods: policy evaluation algorithms, which deal with approximation of the cost of a single policy (and can also be embedded within a policy iteration scheme), and Q-learning algorithms, which deal with approximation of the optimal cost. Let us summarize each type of method, focusing for concreteness on the finite-state discounted case.

    Policy Evaluation Algorithms

With this class of methods, we aim to approximate the cost function $J_\mu(i)$ of a policy $\mu$ with a parametric architecture of the form $\tilde J(i, r)$, where $r$ is a parameter vector (cf. Section 6.3.5 of Vol. I). This approximation may be carried out repeatedly, for a sequence of policies, in the context of a policy iteration scheme. Alternatively, it may be used to construct an approximate cost-to-go function of a single suboptimal/heuristic policy, which can be used in an on-line rollout scheme, with one-step or multistep lookahead. We focus primarily on two types of methods.

In the first class of methods, called direct, we use simulation to collect samples of costs for various initial states, and fit the architecture $\tilde J$ to the samples through some least squares problem. This problem may be solved by several possible algorithms, including linear least squares methods based on simple matrix inversion. Gradient methods have also been used extensively, and will be described in Section 6.2.

The second and currently more popular class of methods is called indirect. Here, we obtain $\Phi r$ by solving an approximate version of Bellman's equation. We will focus exclusively on the case of a linear architecture, where $\tilde J$ is of the form $\Phi r$, and $\Phi$ is a matrix whose columns can be viewed as basis functions (cf. Section 6.3.5 of Vol. I). In an important method of this type, we obtain the parameter vector r by solving the equation

$$\Phi r = \Pi T(\Phi r), \tag{6.1}$$

where $\Pi$ denotes projection with respect to a suitable norm on the subspace of vectors of the form $\Phi r$, and T is either the mapping $T_\mu$ or a related mapping, which also has $J_\mu$ as its unique fixed point [here $\Pi T(\Phi r)$ denotes the projection of the vector $T(\Phi r)$ on the subspace].

In another type of policy evaluation method, often called the Bellman equation error approach, which we will discuss briefly in Section 6.8.4, the parameter vector r is determined by minimizing a measure of error in satisfying Bellman's equation; for example, by minimizing over r

$$\|\tilde J - T\tilde J\|,$$

where $\|\cdot\|$ is some norm. If $\|\cdot\|$ is a Euclidean norm, and $\tilde J(i, r)$ is linear in r, this minimization is a linear least squares problem.

We can view Eq. (6.1) as a form of projected Bellman equation. We will show that for a special choice of the norm of the projection, $\Pi T$ is a contraction mapping, so the projected Bellman equation has a unique solution $\Phi r^*$. We will discuss several iterative methods for finding $r^*$ in Section 6.3. All these methods use simulation and can be shown to converge under reasonable assumptions to $r^*$, so they produce the same approximate cost function. However, they differ in their speed of convergence and in their suitability for various problem contexts. Here are the methods that we will focus on in Section 6.3 for discounted problems, and also in Sections 6.6-6.8 for other types of problems. They all depend on a parameter $\lambda \in [0, 1]$, whose role will be discussed later.

(1) TD($\lambda$) or temporal differences method. This algorithm may be viewed as a stochastic iterative method for solving a version of the projected equation (6.1) that depends on $\lambda$. The algorithm embodies important ideas and has played an important role in the development of the subject, but in practical terms, it is usually inferior to the next two methods, so it will be discussed in less detail.

(2) LSTD($\lambda$) or least squares temporal differences method. This algorithm computes and solves a progressively more refined simulation-based approximation to the projected Bellman equation (6.1).

(3) LSPE($\lambda$) or least squares policy evaluation method. This algorithm is based on the idea of executing value iteration within the lower dimensional space spanned by the basis functions. Conceptually, it has the form

$$\Phi r_{k+1} = \Pi T(\Phi r_k) + \text{simulation noise}, \tag{6.2}$$

i.e., the current value iterate $T(\Phi r_k)$ is projected on S and is suitably approximated by simulation. The simulation noise tends to 0 asymptotically, so assuming that $\Pi T$ is a contraction, the method converges to the solution of the projected Bellman equation (6.1). There are also a number of variants of LSPE($\lambda$). Both LSPE($\lambda$) and its variants have the same convergence rate as LSTD($\lambda$), because they share a common bottleneck: the slow speed of simulation.

Another method of this type is based on aggregation (cf. Section 6.3.4 of Vol. I) and is discussed in Section 6.4. This approach can also be viewed as a problem approximation approach (cf. Section 6.3.3 of Vol. I): the original problem is approximated with a related aggregate problem, which is then solved exactly to yield a cost-to-go approximation for the original problem. The aggregation counterpart of the equation $\Phi r = \Pi T(\Phi r)$ has the form $\Phi r = \Phi D T(\Phi r)$, where $\Phi$ and D are matrices whose rows are restricted to be probability distributions (the aggregation and disaggregation probabilities, respectively).
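To make the conceptual iteration (6.2) concrete, here is a minimal numpy sketch (not from the text) of the deterministic version $\Phi r_{k+1} = \Pi T(\Phi r_k)$ for a fixed policy, i.e., LSPE without the simulation noise; the model (P, g, $\alpha$), the features, and the projection weights below are invented illustration data, and the simulation-based versions of Section 6.3 replace the exact projection by sampling.

```python
import numpy as np

# Exact projected value iteration  Phi r_{k+1} = Pi T(Phi r_k)  for one policy.
# All problem data below is made up for illustration.

np.random.seed(0)
n, s, alpha = 20, 3, 0.9
P = np.random.rand(n, n)
P /= P.sum(axis=1, keepdims=True)      # transition matrix of the evaluated policy
g = np.random.rand(n)                  # expected one-stage costs g(i)
Phi = np.random.rand(n, s)             # n x s feature matrix (columns = basis functions)

xi = np.ones(n) / n                    # steady-state distribution of P (projection weights)
for _ in range(1000):
    xi = xi @ P

W = Phi.T * xi                              # Phi' Xi, with Xi = diag(xi)
proj_matrix = np.linalg.solve(W @ Phi, W)   # maps a vector J to the r of its weighted projection

r = np.zeros(s)
for k in range(200):
    TJ = g + alpha * P @ (Phi @ r)     # one value iteration step applied to Phi r
    r = proj_matrix @ TJ               # project T(Phi r) back onto the subspace

J_true = np.linalg.solve(np.eye(n) - alpha * P, g)
print("projected-equation solution Phi r:", (Phi @ r)[:4])
print("exact policy cost          J_mu  :", J_true[:4])
```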

    Q-Learning Algorithms

With this class of methods, we aim to compute, without any approximation, the optimal cost function (not just the cost function of a single policy). Q-learning maintains and updates for each state-control pair (i, u) an estimate of the expression that is minimized in the right-hand side of Bellman's equation. This is called the Q-factor of the pair (i, u), and is denoted by $Q^*(i, u)$. The Q-factors are updated with what may be viewed as a simulation-based form of value iteration, as will be explained in Section 6.5. An important advantage of using Q-factors is that when they are available, they can be used to obtain an optimal control at any state i simply by minimizing $Q^*(i, u)$ over $u \in U(i)$, so the transition probabilities of the problem are not needed.

On the other hand, for problems with a large number of state-control pairs, Q-learning is often impractical because there may be simply too many Q-factors to update. As a result, the algorithm is primarily suitable for systems with a small number of states (or for aggregated/few-state versions of more complex systems). There are also algorithms that use parametric approximations for the Q-factors (see Section 6.5), although their theoretical basis is generally less solid.

    Chapter Organization

Throughout this chapter, we will focus almost exclusively on perfect state information problems, involving a Markov chain with a finite number of states i, transition probabilities $p_{ij}(u)$, and single stage costs g(i, u, j). Extensions of many of the ideas to continuous state spaces are possible, but they are beyond our scope. We will consider first, in Sections 6.1-6.5, the discounted problem using the notation of Section 1.3. Section 6.1 provides a broad overview of cost approximation architectures and their uses in approximate policy iteration. Section 6.2 focuses on direct methods for policy evaluation. Section 6.3 is a long section on a major class of indirect methods for policy evaluation, which are based on the projected Bellman equation. Section 6.4 discusses methods based on aggregation. Section 6.5 discusses Q-learning and its variations, and extends the projected Bellman equation approach to the case of multiple policies, and particularly to optimal stopping problems. Stochastic shortest path and average cost problems are discussed in Sections 6.6 and 6.7, respectively. Section 6.8 extends and elaborates on the projected Bellman equation approach of Sections 6.3, 6.6, and 6.7, discusses another approach based on the Bellman equation error, and generalizes the aggregation methodology. Section 6.9 describes methods based on parametric approximation of policies rather than cost functions.

    6.1 GENERAL ISSUES OF COST APPROXIMATION

Most of the methodology of this chapter deals with approximation of some type of cost function (optimal cost, cost of a policy, Q-factors, etc). The purpose of this section is to highlight the main issues involved, without getting too much into the mathematical details.

We start with general issues of parametric approximation architectures, which we have also discussed in Vol. I (Section 6.3.5). We then consider approximate policy iteration (Section 6.1.2), and the two general approaches for approximate cost evaluation (direct and indirect; Section 6.1.3). In Section 6.1.4, we discuss various special structures that can be exploited to simplify approximate policy iteration. In Sections 6.1.5 and 6.1.6 we provide orientation into the main mathematical issues underlying the methodology, and focus on two of its main components: contraction mappings and simulation.

    6.1.1 Approximation Architectures

The major use of cost approximation is for obtaining a one-step lookahead suboptimal policy (cf. Section 6.3 of Vol. I). In particular, suppose that we use $\tilde J(j, r)$ as an approximation to the optimal cost of the finite-state discounted problem of Section 1.3. Here $\tilde J$ is a function of some chosen form (the approximation architecture) and r is a parameter/weight vector. Once r is determined, it yields a suboptimal control at any state i via the one-step lookahead minimization

$$\tilde\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i, u, j) + \alpha \tilde J(j, r)\bigr). \tag{6.3}$$

The degree of suboptimality of $\tilde\mu$, as measured by $\|J_{\tilde\mu} - J^*\|_\infty$, is bounded by a constant multiple of the approximation error according to

$$\|J_{\tilde\mu} - J^*\|_\infty \le \frac{2\alpha}{1 - \alpha}\, \|\tilde J - J^*\|_\infty,$$

as shown in Prop. 1.3.7. This bound is qualitative in nature, as it tends to be quite conservative in practice.

We may also use a multiple-step lookahead minimization, with a cost-to-go approximation at the end of the multiple-step horizon. Conceptually, single-step and multiple-step lookahead approaches are similar, and the cost-to-go approximation algorithms of this chapter apply to both.
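As an illustration of the lookahead rule (6.3), the following sketch assumes the model arrays $p_{ij}(u)$, g(i, u, j), the discount factor $\alpha$, and some approximation $\tilde J$ are available; the numerical data is invented.

```python
import numpy as np

# One-step lookahead (6.3): at each state i pick the control minimizing the
# expected one-stage cost plus discounted approximate cost-to-go.
# The problem data below (3 states, 2 controls) is invented for illustration.

np.random.seed(0)
n, m, alpha = 3, 2, 0.95
p = np.random.rand(m, n, n)
p /= p.sum(axis=2, keepdims=True)      # p[u, i, j] = p_ij(u)
g = np.random.rand(m, n, n)            # g[u, i, j] = g(i, u, j)
J_tilde = np.random.rand(n)            # some cost-to-go approximation J~(j, r)

def one_step_lookahead_policy(J):
    # Q[u, i] = sum_j p_ij(u) * (g(i, u, j) + alpha * J(j))
    Q = (p * (g + alpha * J[None, None, :])).sum(axis=2)
    return Q.argmin(axis=0)            # mu~(i) = argmin over u of Q[u, i]

print("suboptimal policy mu~:", one_step_lookahead_policy(J_tilde))
```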

An alternative possibility is to obtain a parametric approximation $\tilde Q(i, u, r)$ of the Q-factor of the pair (i, u), defined in terms of the optimal cost function $J^*$ as

$$Q^*(i, u) = \sum_{j=1}^n p_{ij}(u)\bigl(g(i, u, j) + \alpha J^*(j)\bigr).$$

Since $Q^*(i, u)$ is the expression minimized in Bellman's equation, given the approximation $\tilde Q(i, u, r)$, we can generate a suboptimal control at any state i via

$$\tilde\mu(i) = \arg\min_{u \in U(i)} \tilde Q(i, u, r).$$

The advantage of using Q-factors is that in contrast with the minimization (6.3), the transition probabilities $p_{ij}(u)$ are not needed in the above minimization. Thus Q-factors are better suited to the model-free context.
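A small illustrative sketch of this point: with a (hypothetical) linear Q-factor architecture already trained, the control at a state is obtained by a plain enumeration over $U(i)$, with no transition probabilities involved.

```python
import numpy as np

# Illustrative sketch (not from the text): a linear Q-factor architecture
# Q~(i, u, r) = phi(i, u)' r with hypothetical features and an already-trained
# parameter vector r.  Choosing the control requires only an enumeration over
# U(i); no transition probabilities appear.

def phi_q(i, u):
    # hypothetical state-control features phi_k(i, u)
    return np.array([1.0, i, u, i * u])

r = np.array([0.5, -0.1, 0.3, 0.02])        # assumed already computed

def q_greedy_control(i, U_i):
    q_values = [phi_q(i, u) @ r for u in U_i]
    return U_i[int(np.argmin(q_values))]

print(q_greedy_control(i=7, U_i=[0, 1, 2]))
```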

Note that we may similarly use approximations to the cost functions $J_\mu$ and Q-factors $Q_\mu(i, u)$ of specific policies $\mu$. A major use of such approximations is in the context of an approximate policy iteration scheme; see Section 6.1.2.

The choice of architecture is very significant for the success of the approximation approach. One possibility is to use the linear form

$$\tilde J(i, r) = \sum_{k=1}^s r_k \phi_k(i), \tag{6.4}$$

where $r = (r_1, \ldots, r_s)$ is the parameter vector, and $\phi_k(i)$ are some known scalars that depend on the state i. Thus, for each state i, the approximate cost $\tilde J(i, r)$ is the inner product $\phi(i)' r$ of r and

$$\phi(i) = \begin{pmatrix} \phi_1(i) \\ \vdots \\ \phi_s(i) \end{pmatrix}.$$

We refer to $\phi(i)$ as the feature vector of i, and to its components as features (see Fig. 6.1.1). Thus the cost function is approximated by a vector in the subspace

$$S = \{\Phi r \mid r \in \Re^s\},$$

where

$$\Phi = \begin{pmatrix} \phi_1(1) & \cdots & \phi_s(1) \\ \vdots & & \vdots \\ \phi_1(n) & \cdots & \phi_s(n) \end{pmatrix} = \begin{pmatrix} \phi(1)' \\ \vdots \\ \phi(n)' \end{pmatrix}.$$


Figure 6.1.1 A linear feature-based architecture. It combines a mapping that extracts the feature vector $\phi(i) = \bigl(\phi_1(i), \ldots, \phi_s(i)\bigr)$ associated with state i, and a parameter vector r to form a linear cost approximator $\phi(i)' r$.

We can view the s columns of $\Phi$ as basis functions, and $\Phi r$ as a linear combination of basis functions.
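A minimal sketch of this architecture, with invented feature functions: stack the feature vectors $\phi(i)$ as the rows of $\Phi$, and the approximate cost vector is simply $\Phi r$.

```python
import numpy as np

# Linear architecture (6.4): J~(i, r) = phi(i)' r for every state.
# The feature functions used here are invented placeholders.

n = 10
def phi(i):
    # phi(i) = (phi_1(i), ..., phi_s(i)); here: constant, i, i^2 features
    return np.array([1.0, i, i ** 2])

Phi = np.array([phi(i) for i in range(1, n + 1)])   # n x s matrix with rows phi(i)'
r = np.array([2.0, -0.5, 0.1])                      # some parameter vector

J_approx = Phi @ r                                  # the vector of approximate costs
print(J_approx)
```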

Features, when well-crafted, can capture the dominant nonlinearities of the cost function, and their linear combination may work very well as an approximation architecture. For example, in computer chess (Section 6.3.5 of Vol. I) where the state is the current board position, appropriate features are material balance, piece mobility, king safety, and other positional factors.

    Example 6.1.1 (Polynomial Approximation)

An important example of linear cost approximation is based on polynomial basis functions. Suppose that the state consists of q integer components $x_1, \ldots, x_q$, each taking values within some limited range of integers. For example, in a queueing system, $x_k$ may represent the number of customers in the kth queue, where $k = 1, \ldots, q$. Suppose that we want to use an approximating function that is quadratic in the components $x_k$. Then we can define a total of $1 + q + q^2$ basis functions that depend on the state $x = (x_1, \ldots, x_q)$ via

$$\phi_0(x) = 1, \qquad \phi_k(x) = x_k, \qquad \phi_{km}(x) = x_k x_m, \qquad k, m = 1, \ldots, q.$$

A linear approximation architecture that uses these functions is given by

$$\tilde J(x, r) = r_0 + \sum_{k=1}^q r_k x_k + \sum_{k=1}^q \sum_{m=k}^q r_{km} x_k x_m,$$

where the parameter vector r has components $r_0$, $r_k$, and $r_{km}$, with $k = 1, \ldots, q$, $m = k, \ldots, q$. In fact, any kind of approximating function that is polynomial in the components $x_1, \ldots, x_q$ can be constructed similarly.

It is also possible to combine feature extraction with polynomial approximations. For example, the feature vector $\phi(i) = \bigl(\phi_1(i), \ldots, \phi_s(i)\bigr)$ transformed by a quadratic polynomial mapping leads to approximating functions of the form

$$\tilde J(i, r) = r_0 + \sum_{k=1}^s r_k \phi_k(i) + \sum_{k=1}^s \sum_{\ell=1}^s r_{k\ell} \phi_k(i) \phi_\ell(i),$$

where the parameter vector r has components $r_0$, $r_k$, and $r_{k\ell}$, with $k, \ell = 1, \ldots, s$. This function can be viewed as a linear cost approximation that uses the basis functions

$$w_0(i) = 1, \qquad w_k(i) = \phi_k(i), \qquad w_{k\ell}(i) = \phi_k(i) \phi_\ell(i), \qquad k, \ell = 1, \ldots, s.$$
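A short sketch of the quadratic architecture of Example 6.1.1 (feature construction only; the parameter values are placeholders):

```python
import numpy as np
from itertools import combinations_with_replacement

# Quadratic polynomial features: for x = (x_1, ..., x_q) the features are 1,
# the x_k, and the products x_k x_m with m >= k; J~(x, r) is their linear
# combination.  The state and parameters below are invented.

def quadratic_features(x):
    x = np.asarray(x, dtype=float)
    feats = [1.0]                                   # phi_0(x) = 1
    feats.extend(x)                                 # phi_k(x) = x_k
    feats.extend(x[k] * x[m]                        # phi_km(x) = x_k x_m, m >= k
                 for k, m in combinations_with_replacement(range(len(x)), 2))
    return np.array(feats)

x = [2, 0, 5]                                       # e.g., queue lengths in a 3-queue system
phi_x = quadratic_features(x)
r = np.random.rand(len(phi_x))                      # parameter vector (placeholder values)
print("J~(x, r) =", phi_x @ r)
```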

    Example 6.1.2 (Interpolation)

A common type of approximation of a function J is based on interpolation. Here, a set I of special states is selected, and the parameter vector r has one component $r_i$ per state $i \in I$, which is the value of J at i:

$$r_i = J(i), \qquad i \in I.$$

The value of J at states $i \notin I$ is approximated by some form of interpolation using r.

Interpolation may be based on geometric proximity. For a simple example that conveys the basic idea, let the system states be the integers within some interval, let I be a subset of special states, and for each state i let $\underline i$ and $\bar i$ be the states in I that are closest to i from below and from above. Then for any state i, $\tilde J(i, r)$ is obtained by linear interpolation of the costs $r_{\underline i} = J(\underline i)$ and $r_{\bar i} = J(\bar i)$:

$$\tilde J(i, r) = \frac{\bar i - i}{\bar i - \underline i}\, r_{\underline i} + \frac{i - \underline i}{\bar i - \underline i}\, r_{\bar i}.$$

The scalars multiplying the components of r may be viewed as features, so the feature vector of i above consists of two nonzero features (the ones corresponding to $\underline i$ and $\bar i$), with all other features being 0. Similar examples can be constructed for the case where the state space is a subset of a multidimensional space (see Example 6.3.13 of Vol. I).
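A sketch of the interpolation architecture for integer states, with an invented set I of special states; the two interpolation weights play the role of the nonzero features mentioned above.

```python
import numpy as np

# Linear interpolation between the nearest special states below and above i.
# The special states I and the values r_i = J(i) are invented.

I = np.array([0, 10, 25, 50])          # special states (illustrative choice)
r = np.array([1.0, 3.0, 2.0, 6.0])     # r_i = J(i) for i in I

def interpolated_value(i):
    if i <= I[0]:
        return r[0]
    if i >= I[-1]:
        return r[-1]
    k = np.searchsorted(I, i)          # I[k-1] <= i <= I[k]
    lo, hi = I[k - 1], I[k]
    w = (hi - i) / (hi - lo)           # weight of the lower special state
    return w * r[k - 1] + (1 - w) * r[k]

print([interpolated_value(i) for i in (0, 5, 10, 37)])
```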

A generalization of the preceding example is approximation based on aggregation; see Section 6.3.4 of Vol. I and the subsequent Section 6.4 in this chapter. There are also interesting nonlinear approximation architectures, including those defined by neural networks, perhaps in combination with feature extraction mappings (see Bertsekas and Tsitsiklis [BeT96], or Sutton and Barto [SuB98] for further discussion). In this chapter, we will mostly focus on the case of linear architectures, because many of the policy evaluation algorithms of this chapter are valid only for that case.

We note that there has been considerable research on automatic basis function generation approaches (see e.g., Keller, Mannor, and Precup [KMP06], and Jung and Polani [JuP07]). Moreover, it is possible to use standard basis functions which may be computed by simulation (perhaps with simulation error). The following example discusses this possibility.


    Example 6.1.3 (Krylov Subspace Generating Functions)

We have assumed so far that the columns of $\Phi$, the basis functions, are known, and the rows $\phi(i)'$ of $\Phi$ are explicitly available to use in the various simulation-based formulas. We will now discuss a class of basis functions that may not be available, but may be approximated by simulation in the course of various algorithms. For concreteness, let us consider the evaluation of the cost vector

$$J_\mu = (I - \alpha P_\mu)^{-1} g_\mu$$

of a policy $\mu$ in a discounted MDP. Then $J_\mu$ has an expansion of the form

$$J_\mu = \sum_{t=0}^\infty \alpha^t P_\mu^t g_\mu.$$

Thus $g_\mu, P_\mu g_\mu, \ldots, P_\mu^s g_\mu$ yield an approximation based on the first $s + 1$ terms of the expansion, and seem suitable choices as basis functions. Also a more general expansion is

$$J_\mu = J + \sum_{t=0}^\infty \alpha^t P_\mu^t q,$$

where J is any vector in $\Re^n$ and q is the residual vector

$$q = T_\mu J - J = g_\mu + \alpha P_\mu J - J;$$

this can be seen from the equation $J_\mu - J = \alpha P_\mu (J_\mu - J) + q$. Thus the basis functions $J, q, P_\mu q, \ldots, P_\mu^{s-1} q$ yield an approximation based on the first $s + 1$ terms of the preceding expansion.

Generally, to implement various methods in subsequent sections with basis functions of the form $P_\mu^m g_\mu$, $m \ge 0$, one would need to generate the ith components $(P_\mu^m g_\mu)(i)$ for any given state i, but these may be hard to calculate. However, it turns out that one can use instead single sample approximations of $(P_\mu^m g_\mu)(i)$, and rely on the averaging mechanism of simulation to improve the approximation process. The details of this are beyond our scope and we refer to the original sources (Bertsekas and Yu [BeY07], [BeY09]) for further discussion and specific implementations.

We finally mention the possibility of optimal selection of basis functions within some restricted class. In particular, consider an approximation subspace

$$S_\theta = \bigl\{\Phi(\theta) r \mid r \in \Re^s\bigr\},$$

where the s columns of the $n \times s$ matrix $\Phi(\theta)$ are basis functions parametrized by a vector $\theta$. Assume that for a given $\theta$, there is a corresponding vector $r(\theta)$, obtained using some algorithm, so that $\Phi(\theta) r(\theta)$ is an approximation of a cost function J (various such algorithms will be presented later in this chapter). Then we may wish to select $\theta$ so that some measure of approximation quality is optimized. For example, suppose that we can compute the true cost values J(i) (or more generally, approximations to these values) for a subset of selected states I. Then we may determine $\theta$ so that

$$\sum_{i \in I} \bigl(J(i) - \phi(i, \theta)' r(\theta)\bigr)^2$$

is minimized, where $\phi(i, \theta)'$ is the ith row of $\Phi(\theta)$. Alternatively, we may determine $\theta$ so that the norm of the error in satisfying Bellman's equation,

$$\bigl\|\Phi(\theta) r(\theta) - T\bigl(\Phi(\theta) r(\theta)\bigr)\bigr\|^2,$$

is minimized. Gradient and random search algorithms for carrying out such minimizations have been proposed in the literature (see Menache, Mannor, and Shimkin [MMS06], and Yu and Bertsekas [YuB09]).

    6.1.2 Approximate Policy Iteration

Let us consider a form of approximate policy iteration, where we compute simulation-based approximations $\tilde J(\cdot, r)$ to the cost functions $J_\mu$ of stationary policies $\mu$, and we use them to compute new policies based on (approximate) policy improvement. We impose no constraints on the approximation architecture, so $\tilde J(i, r)$ may be linear or nonlinear in r.

Suppose that the current policy is $\mu$, and for a given r, $\tilde J(i, r)$ is an approximation of $J_\mu(i)$. We generate an improved policy $\bar\mu$ using the formula

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i, u, j) + \alpha \tilde J(j, r)\bigr), \qquad \text{for all } i. \tag{6.5}$$

The method is illustrated in Fig. 6.1.2. Its theoretical basis was discussed in Section 1.3 (cf. Prop. 1.3.6), where it was shown that if the policy evaluation is accurate to within $\delta$ (in the sup-norm sense), then for an $\alpha$-discounted problem, the method will yield in the limit (after infinitely many policy evaluations) a stationary policy that is optimal to within

$$\frac{2\alpha\delta}{(1 - \alpha)^2},$$

where $\alpha$ is the discount factor. Experimental evidence indicates that this bound is usually conservative. Furthermore, often just a few policy evaluations are needed before the bound is attained.

When the sequence of policies obtained actually converges to some $\bar\mu$, then it can be proved that $\bar\mu$ is optimal to within

$$\frac{2\alpha\delta}{1 - \alpha}$$

(see Section 6.3.8 and also Section 6.4.2, where it is shown that if policy evaluation is done using an aggregation approach, the generated sequence of policies does converge).

Figure 6.1.2 Block diagram of approximate policy iteration.

A simulation-based implementation of the algorithm is illustrated in Fig. 6.1.3. It consists of four parts:

(a) The simulator, which given a state-control pair (i, u), generates the next state j according to the system's transition probabilities.

(b) The decision generator, which generates the control $\bar\mu(i)$ of the improved policy $\bar\mu$ at the current state i for use in the simulator.

(c) The cost-to-go approximator, which is the function $\tilde J(j, r)$ that is used by the decision generator.

(d) The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation $\tilde J(\cdot, \bar r)$ of the cost of $\bar\mu$.

Note that there are two policies $\mu$ and $\bar\mu$, and parameter vectors r and $\bar r$, which are simultaneously involved in this algorithm. In particular, r corresponds to the current policy $\mu$, and the approximation $\tilde J(\cdot, r)$ is used in the policy improvement Eq. (6.5) to generate the new policy $\bar\mu$. At the same time, $\bar\mu$ drives the simulation that generates samples to be used by the algorithm that determines the parameter $\bar r$ corresponding to $\bar\mu$, which will be used in the next policy iteration.

Figure 6.1.3 Simulation-based implementation of the approximate policy iteration algorithm. Given the approximation $\tilde J(i, r)$, we generate cost samples of the improved policy $\bar\mu$ by simulation (the decision generator module). We use these samples to generate the approximator $\tilde J(i, \bar r)$ of $\bar\mu$.
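The following is a minimal sketch of the loop of Figs. 6.1.2-6.1.3, not the book's specific algorithms: the model is assumed known so that the improvement step (6.5) can be carried out exactly, while the evaluation step is a crude direct fit of a linear architecture to sampled discounted costs (direct methods are discussed in Section 6.2). All problem data is invented.

```python
import numpy as np

# Approximate policy iteration: simulation-based evaluation + exact improvement.

np.random.seed(2)
n, m, alpha, horizon = 12, 3, 0.9, 80
p = np.random.rand(m, n, n); p /= p.sum(axis=2, keepdims=True)   # p[u, i, j]
g = np.random.rand(m, n, n)                                      # g[u, i, j]
Phi = np.column_stack([np.ones(n), np.arange(n), np.arange(n) ** 2.0])

def sample_cost(policy, i):
    """One simulated trajectory: truncated discounted cost starting at state i."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        u = policy[i]
        j = np.random.choice(n, p=p[u, i])
        total += discount * g[u, i, j]
        discount *= alpha
        i = j
    return total

def evaluate(policy, samples_per_state=30):
    """Direct evaluation: least squares fit of Phi r to sampled costs."""
    states = np.repeat(np.arange(n), samples_per_state)
    costs = np.array([sample_cost(policy, i) for i in states])
    r, *_ = np.linalg.lstsq(Phi[states], costs, rcond=None)
    return r

def improve(r):
    """Policy improvement (6.5) using the approximation J~ = Phi r."""
    Q = (p * (g + alpha * (Phi @ r)[None, None, :])).sum(axis=2)  # Q[u, i]
    return Q.argmin(axis=0)

policy = np.zeros(n, dtype=int)        # initial guess
for k in range(5):
    r = evaluate(policy)               # approximate policy evaluation
    policy = improve(r)                # policy improvement
    print("iteration", k, "policy", policy)
```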

    The Issue of Exploration

Let us note an important generic difficulty with simulation-based policy iteration: to evaluate a policy $\mu$, we need to generate cost samples using that policy, but this biases the simulation by underrepresenting states that are unlikely to occur under $\mu$. As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate, causing potentially serious errors in the calculation of the improved control policy via the policy improvement Eq. (6.5).

The difficulty just described is known as inadequate exploration of the system's dynamics because of the use of a fixed policy. It is a particularly acute difficulty when the system is deterministic, or when the randomness embodied in the transition probabilities is relatively small. One possibility for guaranteeing adequate exploration of the state space is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset. A related approach, called iterative resampling, is to enrich the sampled set of states in evaluating the current policy $\mu$ as follows: derive an initial cost evaluation of $\mu$, simulate the next policy $\bar\mu$ obtained on the basis of this initial evaluation to obtain a set of representative states $\bar S$ visited by $\bar\mu$, and repeat the evaluation of $\mu$ using additional trajectories initiated from $\bar S$.

Still another frequently used approach is to artificially introduce some extra randomization in the simulation, by occasionally using a randomly generated transition rather than the one dictated by the policy (although this may not necessarily work because all admissible controls at a given state may produce similar successor states). This and other possibilities to improve exploration will be discussed further in Section 6.3.7.


    Limited Sampling/Optimistic Policy Iteration

In the approximate policy iteration approach discussed so far, the policy evaluation of the cost of the improved policy must be fully carried out. An alternative, known as optimistic policy iteration, is to replace the policy $\mu$ with the policy $\bar\mu$ after only a few simulation samples have been processed, at the risk of $\tilde J(\cdot, r)$ being an inaccurate approximation of $J_\mu$.

Optimistic policy iteration has been successfully used, among others, in an impressive backgammon application (Tesauro [Tes92]). However, the associated theoretical convergence properties are not fully understood. As will be illustrated by the discussion of Section 6.3.8 (see also Section 6.4.2 of [BeT96]), optimistic policy iteration can exhibit fascinating and counterintuitive behavior, including a natural tendency for a phenomenon called chattering, whereby the generated parameter sequence $\{r_k\}$ converges, while the generated policy sequence oscillates because the limit of $\{r_k\}$ corresponds to multiple policies.

We note that optimistic policy iteration tends to deal better with the problem of exploration discussed earlier, because with rapid changes of policy, there is less tendency to bias the simulation towards particular states that are favored by any single policy.

    Approximate Policy Iteration Based on Q-Factors

The approximate policy iteration method discussed so far relies on the calculation of the approximation $\tilde J(\cdot, r)$ to the cost function $J_\mu$ of the current policy $\mu$, which is then used for policy improvement using the minimization

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i, u, j) + \alpha \tilde J(j, r)\bigr).$$

Carrying out this minimization requires knowledge of the transition probabilities $p_{ij}(u)$ and calculation of the associated expected values for all controls $u \in U(i)$ (otherwise a time-consuming simulation of these expected values is needed). A model-free alternative is to compute approximate Q-factors

$$\tilde Q(i, u, r) \approx \sum_{j=1}^n p_{ij}(u)\bigl(g(i, u, j) + \alpha J_\mu(j)\bigr), \tag{6.6}$$

and use the minimization

$$\bar\mu(i) = \arg\min_{u \in U(i)} \tilde Q(i, u, r) \tag{6.7}$$

for policy improvement. Here, r is an adjustable parameter vector and $\tilde Q(i, u, r)$ is a parametric architecture, possibly of the linear form

$$\tilde Q(i, u, r) = \sum_{k=1}^s r_k \phi_k(i, u),$$


because there is no guarantee that the policies involved in the oscillation are good policies, and there is often no way to verify how well they perform relative to the optimal.

We note that oscillations can be avoided and approximate policy iteration can be shown to converge under special conditions that arise in particular when aggregation is used for policy evaluation. These conditions involve certain monotonicity assumptions regarding the choice of the matrix $\Phi$, which are fulfilled in the case of aggregation (see Section 6.3.8, and also Section 6.4.2). However, when $\Phi$ is chosen in an unrestricted manner, as often happens in practical applications of the projected equation methods of Section 6.3, policy oscillations tend to occur generically, and often for very simple problems (see Section 6.3.8 for an example).

    6.1.3 Direct and Indirect Approximation

We will now preview two general algorithmic approaches for approximating the cost function of a fixed stationary policy $\mu$ within a subspace of the form $S = \{\Phi r \mid r \in \Re^s\}$. (A third approach, based on aggregation, uses a special type of matrix $\Phi$ and is discussed in Section 6.4.) The first and most straightforward approach, referred to as direct, is to find an approximation $\tilde J \in S$ that matches best $J_\mu$ in some normed error sense, i.e.,

$$\min_{\tilde J \in S} \|J_\mu - \tilde J\|,$$

or equivalently,

$$\min_{r \in \Re^s} \|J_\mu - \Phi r\|$$

(see the left-hand side of Fig. 6.1.5). Here, $\|\cdot\|$ is usually some (possibly weighted) Euclidean norm, in which case the approximation problem is a linear least squares problem, whose solution, denoted $r^*$, can in principle be obtained in closed form by solving the associated quadratic minimization problem. If the matrix $\Phi$ has linearly independent columns, the solution is unique and can also be represented as

$$\Phi r^* = \Pi J_\mu,$$

where $\Pi$ denotes projection with respect to $\|\cdot\|$ on the subspace S. A major difficulty is that specific cost function values $J_\mu(i)$ can only be estimated

Note that direct approximation may be used in other approximate DP contexts, such as finite horizon problems, where we use sequential single-stage approximation of the cost-to-go functions $J_k$, going backwards (i.e., starting with $J_N$, we obtain a least squares approximation of $J_{N-1}$, which is used in turn to obtain a least squares approximation of $J_{N-2}$, etc). This approach is sometimes called fitted value iteration.

In what follows in this chapter, we will not distinguish between the linear operation of projection and the corresponding matrix representation, denoting them both by $\Pi$. The meaning should be clear from the context.
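For reference, a sketch of the direct approach in closed form, under the assumption that the values $J_\mu(i)$ are available (in practice they would be replaced by simulation-generated samples): with a weighted Euclidean norm the minimizer is $r^* = (\Phi'\Xi\Phi)^{-1}\Phi'\Xi J_\mu$, so that $\Phi r^* = \Pi J_\mu$.

```python
import numpy as np

# Weighted least squares projection of a given cost vector onto span(Phi).
# All data below is invented for illustration.

np.random.seed(3)
n, s = 30, 4
Phi = np.random.rand(n, s)             # basis functions (linearly independent w.p. 1)
J_mu = np.random.rand(n)               # stand-in for the policy's cost vector
xi = np.random.rand(n); xi /= xi.sum() # projection weights

W = Phi.T * xi                         # Phi' Xi
r_star = np.linalg.solve(W @ Phi, W @ J_mu)
projection = Phi @ r_star              # Pi J_mu
print("weighted projection error:", np.sqrt(xi @ (J_mu - projection) ** 2))
```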


T, the DP mapping corresponding to multiple/all policies, although there are some interesting exceptions, one of which relates to optimal stopping problems and is discussed in Section 6.5.3.

    6.1.4 Simplifications

We now consider various situations where the special structure of the problem may be exploited to simplify policy iteration or other approximate DP algorithms.

    Problems with Uncontrollable State Components

In many problems of interest the state is a composite (i, y) of two components i and y, and the evolution of the main component i can be directly affected by the control u, but the evolution of the other component y cannot. Then as discussed in Section 1.4 of Vol. I, the value and the policy iteration algorithms can be carried out over a smaller state space, the space of the controllable component i. In particular, we assume that given the state (i, y) and the control u, the next state (j, z) is determined as follows: j is generated according to transition probabilities $p_{ij}(u, y)$, and z is generated according to conditional probabilities $p(z \mid j)$ that depend on the main component j of the new state (see Fig. 6.1.6). Let us assume for notational convenience that the cost of a transition from state (i, y) is of the form g(i, y, u, j) and does not depend on the uncontrollable component z of the next state (j, z). If g depends on z it can be replaced by

$$g(i, y, u, j) = \sum_z p(z \mid j)\, g(i, y, u, j, z)$$

in what follows.

Figure 6.1.6 States and transition probabilities for a problem with uncontrollable state components.


    Problems with Post-Decision States

In some stochastic problems, the transition probabilities and stage costs have the special form

$$p_{ij}(u) = q\bigl(j \mid f(i, u)\bigr), \tag{6.10}$$

where f is some function and $q\bigl(\cdot \mid f(i, u)\bigr)$ is a given probability distribution for each value of f(i, u). In words, the dependence of the transitions on (i, u) comes through the function f(i, u). We may exploit this structure by viewing f(i, u) as a form of state: a post-decision state that determines the probabilistic evolution to the next state. An example where the conditions (6.10) are satisfied are inventory control problems of the type considered in Section 4.2 of Vol. I. There the post-decision state at time k is $x_k + u_k$, i.e., the post-purchase inventory, before any demand at time k has been filled.

Post-decision states can be exploited when the stage cost has no dependence on j, i.e., when we have (with some notation abuse)

$$g(i, u, j) = g(i, u).$$

Then the optimal cost-to-go within an $\alpha$-discounted context at state i is given by

$$J^*(i) = \min_{u \in U(i)} \bigl[g(i, u) + \alpha V^*\bigl(f(i, u)\bigr)\bigr],$$

while the optimal cost-to-go at post-decision state m (optimal sum of costs of future stages) is given by

$$V^*(m) = \sum_{j=1}^n q(j \mid m)\, J^*(j).$$

In effect, we consider a modified problem where the state space is enlarged to include post-decision states, with transitions between ordinary states and post-decision states specified by f and $q\bigl(\cdot \mid f(i, u)\bigr)$ (see Fig. 6.1.7). The preceding two equations represent Bellman's equation for this modified problem.

Combining these equations, we have

$$V^*(m) = \sum_{j=1}^n q(j \mid m) \min_{u \in U(j)} \bigl[g(j, u) + \alpha V^*\bigl(f(j, u)\bigr)\bigr], \qquad \forall\, m, \tag{6.11}$$

which can be viewed as Bellman's equation over the space of post-decision states m. This equation is similar to Q-factor equations, but is defined over the space of post-decision states rather than the larger space of state-control pairs. The advantage of this equation is that once the function $V^*$ is calculated (or approximated), the optimal policy can be computed as

$$\mu^*(i) = \arg\min_{u \in U(i)} \bigl[g(i, u) + \alpha V^*\bigl(f(i, u)\bigr)\bigr],$$

which does not require the knowledge of transition probabilities and computation of an expected value. It involves a deterministic optimization, and it can be used in a model-free context (as long as the functions g and f are known). This is important if the calculation of the optimal policy is done on-line.

If there is dependence on j, one may consider computing, possibly by simulation, (an approximation to) $g(i, u) = \sum_{j=1}^n p_{ij}(u)\, g(i, u, j)$, and using it in place of g(i, u, j).

Figure 6.1.7 Modified problem where the post-decision states are viewed as additional states.
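A sketch of this on-line computation, with placeholder functions g and f and a placeholder vector V standing for the (approximate) post-decision cost-to-go; the point is that only a deterministic minimization over u is needed.

```python
import numpy as np

# Policy computation from a post-decision cost-to-go V: at state i choose the
# control minimizing g(i, u) + alpha * V(f(i, u)).  The functions g, f and the
# vector V below are illustrative placeholders.

alpha = 0.95
num_post_states = 50
V = np.random.rand(num_post_states)    # (approximate) post-decision cost-to-go

def f(i, u):
    # hypothetical post-decision state, e.g. post-purchase inventory i + u
    return min(i + u, num_post_states - 1)

def g(i, u):
    return 0.5 * u + 0.1 * i           # hypothetical one-stage cost

def post_decision_greedy(i, U_i):
    values = [g(i, u) + alpha * V[f(i, u)] for u in U_i]
    return U_i[int(np.argmin(values))]

print(post_decision_greedy(i=3, U_i=range(0, 10)))
```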

It is straightforward to construct a policy iteration algorithm that is defined over the space of post-decision states. The cost-to-go function $V_\mu$ of a stationary policy $\mu$ is the unique solution of the corresponding Bellman equation

$$V_\mu(m) = \sum_{j=1}^n q(j \mid m)\Bigl(g\bigl(j, \mu(j)\bigr) + \alpha V_\mu\bigl(f\bigl(j, \mu(j)\bigr)\bigr)\Bigr), \qquad \forall\, m.$$

Given $V_\mu$, the improved policy is obtained as

$$\bar\mu(i) = \arg\min_{u \in U(i)} \bigl[g(i, u) + \alpha V_\mu\bigl(f(i, u)\bigr)\bigr], \qquad i = 1, \ldots, n.$$

There are also corresponding approximate policy iteration methods with cost function approximation.

An advantage of this method when implemented by simulation is that the computation of the improved policy does not require the calculation of expected values. Moreover, with a simulator, the policy evaluation of $V_\mu$ can be done in model-free fashion, without explicit knowledge of the probabilities $q(j \mid m)$. These advantages are shared with policy iteration algorithms based on Q-factors. However, when function approximation is used in policy iteration, the methods using post-decision states may have a significant advantage over Q-factor-based methods: they use cost function approximation in the space of post-decision states, rather than the larger space of state-control pairs, and they are less susceptible to difficulties due to inadequate exploration.

We note that there is a similar simplification with post-decision states when g is of the form

$$g(i, u, j) = h\bigl(f(i, u), j\bigr),$$

for some function h. Then we have

$$J^*(i) = \min_{u \in U(i)} V^*\bigl(f(i, u)\bigr),$$

where $V^*$ is the unique solution of the equation

$$V^*(m) = \sum_{j=1}^n q(j \mid m)\Bigl(h(m, j) + \alpha \min_{u \in U(j)} V^*\bigl(f(j, u)\bigr)\Bigr), \qquad \forall\, m.$$

Here $V^*(m)$ should be interpreted as the optimal cost-to-go from post-decision state m, including the cost h(m, j) incurred within the stage when m was generated. When h does not depend on j, the algorithm takes the simpler form

$$V^*(m) = h(m) + \alpha \sum_{j=1}^n q(j \mid m) \min_{u \in U(j)} V^*\bigl(f(j, u)\bigr), \qquad \forall\, m. \tag{6.12}$$

    Example 6.1.4 (Tetris)

Let us revisit the game of tetris, which was discussed in Example 1.4.1 of Vol. I in the context of problems with an uncontrollable state component. We will show that it also admits a post-decision state. Assuming that the game terminates with probability 1 for every policy (a proof of this has been given by Burgiel [Bur97]), we can model the problem of finding an optimal tetris playing strategy as a stochastic shortest path problem.

The state consists of two components:

(1) The board position, i.e., a binary description of the full/empty status of each square, denoted by x.

(2) The shape of the current falling block, denoted by y (this is the uncontrollable component).

The control, denoted by u, is the horizontal positioning and rotation applied to the falling block.


Bellman's equation over the space of the controllable state component takes the form

$$J^*(x) = \sum_y p(y) \max_u \bigl[g(x, y, u) + J^*\bigl(f(x, y, u)\bigr)\bigr], \qquad \text{for all } x,$$

where g(x, y, u) and f(x, y, u) are the number of points scored (rows removed), and the board position when the state is (x, y) and control u is applied, respectively [cf. Eq. (6.9)].

This problem also admits a post-decision state. Once u is applied at state (x, y), a new board position m is obtained, and the new state component x is obtained from m after removing a number of rows. Thus we have

$$m = f(x, y, u)$$

for some function f, and m also determines the reward of the stage, which has the form h(m) for some function h [h(m) is the number of complete rows that can be removed from m]. Thus, m may serve as a post-decision state, and the corresponding Bellman's equation takes the form (6.12), i.e.,

$$V^*(m) = h(m) + \sum_{(x, y)} q(m, x, y) \max_u V^*\bigl(f(x, y, u)\bigr), \qquad \forall\, m,$$

where (x, y) is the state that follows m, and q(m, x, y) are the corresponding transition probabilities. Note that both of the simplified Bellman's equations share the same characteristic: they involve a deterministic optimization.

Trading off Complexity of Control Space with Complexity of State Space

Suboptimal control using cost function approximation deals fairly well with large state spaces, but still encounters serious difficulties when the number of controls available at each state is large. In particular, the minimization

$$\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i, u, j) + \alpha \tilde J(j, r)\bigr)$$

using an approximate cost-to-go function $\tilde J(j, r)$ may be very time-consuming. For multistep lookahead schemes, the difficulty is exacerbated, since the required computation grows exponentially with the size of the lookahead horizon. It is thus useful to know that by reformulating the problem, it may be possible to reduce the complexity of the control space by increasing the complexity of the state space. The potential advantage is that the extra state space complexity may still be dealt with by using function approximation and/or rollout.


In particular, suppose that the control u consists of m components,

$$u = (u_1, \ldots, u_m).$$

Then, at a given state i, we can break down u into the sequence of the m controls $u_1, u_2, \ldots, u_m$, and introduce artificial intermediate states $(i, u_1), (i, u_1, u_2), \ldots, (i, u_1, \ldots, u_{m-1})$, and corresponding transitions to model the effect of these controls. The choice of the last control component $u_m$ at state $(i, u_1, \ldots, u_{m-1})$ marks the transition to state j according to the given transition probabilities $p_{ij}(u)$. In this way the control space is simplified at the expense of introducing $m - 1$ additional layers of states, and $m - 1$ additional cost-to-go functions

$$J_1(i, u_1),\ J_2(i, u_1, u_2),\ \ldots,\ J_{m-1}(i, u_1, \ldots, u_{m-1}).$$

To deal with the increase in size of the state space we may use rollout, i.e., when at state $(i, u_1, \ldots, u_k)$, assume that future controls $u_{k+1}, \ldots, u_m$ will be chosen by a base heuristic. Alternatively, we may use function approximation, that is, introduce cost-to-go approximations

$$\tilde J_1(i, u_1, r_1),\ \tilde J_2(i, u_1, u_2, r_2),\ \ldots,\ \tilde J_{m-1}(i, u_1, \ldots, u_{m-1}, r_{m-1}),$$

in addition to $\tilde J(i, r)$. We refer to [BeT96], Section 6.1.4, for further discussion.

A potential complication in the preceding schemes arises when the controls $u_1, \ldots, u_m$ are coupled through a constraint of the form

$$u = (u_1, \ldots, u_m) \in U(i). \tag{6.13}$$

Then, when choosing a control $u_k$, care must be exercised to ensure that the future controls $u_{k+1}, \ldots, u_m$ can be chosen together with the already chosen controls $u_1, \ldots, u_k$ to satisfy the feasibility constraint (6.13). This requires a variant of the rollout algorithm that works with constrained DP problems; see Exercise 6.19 of Vol. I, and also references [Ber05a], [Ber05b].

    6.1.5 Monte Carlo Simulation

In this subsection and the next, we will try to provide some orientation into the mathematical content of this chapter. The reader may wish to skip these subsections at first, but return to them later for a higher level view of some of the subsequent technical material.

The methods of this chapter rely to a large extent on simulation in conjunction with cost function approximation in order to deal with large state spaces. The advantage that simulation holds in this regard can be traced to its ability to compute (approximately) sums with a very large number of terms. These sums arise in a number of contexts: inner product and matrix-vector product calculations, the solution of linear systems of equations and policy evaluation, linear least squares problems, etc.


    Example 6.1.5 (Approximate Policy Evaluation)

Consider the approximate solution of the Bellman equation that corresponds to a given policy $\mu$ of an n-state discounted problem:

$$J_\mu = g_\mu + \alpha P_\mu J_\mu,$$

where $P_\mu$ is the transition probability matrix and $\alpha$ is the discount factor. Let us adopt a hard aggregation approach (cf. Section 6.3.4 of Vol. I; see also Section 6.4 later in this chapter), whereby we divide the n states in two disjoint subsets $I_1$ and $I_2$ with $I_1 \cup I_2 = \{1, \ldots, n\}$, and we use the piecewise constant approximation

$$J(i) = \begin{cases} r_1 & \text{if } i \in I_1, \\ r_2 & \text{if } i \in I_2. \end{cases}$$

This corresponds to the linear feature-based architecture $J \approx \Phi r$, where $\Phi$ is the $n \times 2$ matrix with column components equal to 1 or 0, depending on whether the component corresponds to $I_1$ or $I_2$.

We obtain the approximate equations

$$J(i) \approx g(i) + \alpha\Bigl(\sum_{j \in I_1} p_{ij}\Bigr) r_1 + \alpha\Bigl(\sum_{j \in I_2} p_{ij}\Bigr) r_2, \qquad i = 1, \ldots, n,$$

which we can reduce to just two equations by forming two weighted sums (with equal weights) of the equations corresponding to the states in $I_1$ and $I_2$, respectively:

$$r_1 \approx \frac{1}{n_1} \sum_{i \in I_1} J(i), \qquad r_2 \approx \frac{1}{n_2} \sum_{i \in I_2} J(i),$$

where $n_1$ and $n_2$ are numbers of states in $I_1$ and $I_2$, respectively. We thus obtain the aggregate system of the following two equations in $r_1$ and $r_2$:

$$r_1 = \frac{1}{n_1} \sum_{i \in I_1} g(i) + \frac{\alpha}{n_1} \sum_{i \in I_1} \sum_{j \in I_1} p_{ij}\, r_1 + \frac{\alpha}{n_1} \sum_{i \in I_1} \sum_{j \in I_2} p_{ij}\, r_2,$$

$$r_2 = \frac{1}{n_2} \sum_{i \in I_2} g(i) + \frac{\alpha}{n_2} \sum_{i \in I_2} \sum_{j \in I_1} p_{ij}\, r_1 + \frac{\alpha}{n_2} \sum_{i \in I_2} \sum_{j \in I_2} p_{ij}\, r_2.$$

Here the challenge, when the number of states n is very large, is the calculation of the large sums in the right-hand side, which can be of order $O(n^2)$. Simulation allows the approximate calculation of these sums with complexity that is independent of n. This is similar to the advantage that Monte-Carlo integration holds over numerical integration, as discussed in standard texts on Monte-Carlo methods.
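To make the structure of the aggregate system concrete, the following sketch forms the two equations exactly for a small invented problem and solves them; for large n, the point of the example is that the sums would instead be estimated by simulation.

```python
import numpy as np

# Hard aggregation: 2x2 aggregate system  r = b + alpha * A r  built exactly.
# The data P, g, alpha and the split I1/I2 are invented for illustration.

np.random.seed(4)
n, alpha = 10, 0.9
P = np.random.rand(n, n); P /= P.sum(axis=1, keepdims=True)
g = np.random.rand(n)
I1, I2 = np.arange(n // 2), np.arange(n // 2, n)
n1, n2 = len(I1), len(I2)

b = np.array([g[I1].mean(), g[I2].mean()])
A = np.array([[P[np.ix_(I1, I1)].sum() / n1, P[np.ix_(I1, I2)].sum() / n1],
              [P[np.ix_(I2, I1)].sum() / n2, P[np.ix_(I2, I2)].sum() / n2]])
r = np.linalg.solve(np.eye(2) - alpha * A, b)

J_exact = np.linalg.solve(np.eye(n) - alpha * P, g)
print("aggregate values r1, r2   :", r)
print("exact averages over I1, I2:", J_exact[I1].mean(), J_exact[I2].mean())
```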


To see how simulation can be used with advantage, let us consider the problem of estimating a scalar sum of the form

$$z = \sum_{\omega \in \Omega} v(\omega),$$

where $\Omega$ is a finite set and $v : \Omega \mapsto \Re$ is a function of $\omega$. We introduce a distribution $\xi$ that assigns positive probability $\xi(\omega)$ to every element $\omega \in \Omega$ (but is otherwise arbitrary), and we generate a sequence

$$\{\omega_1, \ldots, \omega_T\}$$

of samples from $\Omega$, with each sample $\omega_t$ taking values from $\Omega$ according to $\xi$. We then estimate z with

$$\hat z_T = \frac{1}{T} \sum_{t=1}^T \frac{v(\omega_t)}{\xi(\omega_t)}. \tag{6.14}$$

Clearly $\hat z_T$ is unbiased:

$$E[\hat z_T] = \frac{1}{T} \sum_{t=1}^T E\biggl[\frac{v(\omega_t)}{\xi(\omega_t)}\biggr] = \frac{1}{T} \sum_{t=1}^T \sum_{\omega \in \Omega} \xi(\omega) \frac{v(\omega)}{\xi(\omega)} = \sum_{\omega \in \Omega} v(\omega) = z.$$

Suppose now that the samples are generated in a way that the long-term frequency of each $\omega \in \Omega$ is equal to $\xi(\omega)$, i.e.,

$$\lim_{T \to \infty} \frac{\sum_{t=1}^T \delta(\omega_t = \omega)}{T} = \xi(\omega), \qquad \forall\, \omega \in \Omega, \tag{6.15}$$

where $\delta(\cdot)$ denotes the indicator function [$\delta(E) = 1$ if the event E has occurred and $\delta(E) = 0$ otherwise]. Then from Eq. (6.14), we have

$$\hat z_T = \sum_{\omega \in \Omega} \frac{\sum_{t=1}^T \delta(\omega_t = \omega)}{T} \cdot \frac{v(\omega)}{\xi(\omega)},$$

and by taking limit as $T \to \infty$ and using Eq. (6.15),

$$\lim_{T \to \infty} \hat z_T = \sum_{\omega \in \Omega} \lim_{T \to \infty} \frac{\sum_{t=1}^T \delta(\omega_t = \omega)}{T} \cdot \frac{v(\omega)}{\xi(\omega)} = \sum_{\omega \in \Omega} v(\omega) = z.$$

Thus in the limit, as the number of samples increases, we obtain the desired sum z. An important case, of particular relevance to the methods of this chapter, is when $\Omega$ is the set of states of an irreducible Markov chain. Then, if we generate an infinitely long trajectory $\{\omega_1, \omega_2, \ldots\}$ starting from any initial state $\omega_1$, the condition (6.15) will hold with probability 1, with $\xi(\omega)$ being the steady-state probability of state $\omega$.
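A minimal sketch of the estimator (6.14) with independent samples drawn from an invented distribution $\xi$ over a large finite set:

```python
import numpy as np

# Estimate z = sum_w v(w) by drawing T samples from xi and averaging v/xi.
# The set Omega and the function v below are arbitrary illustration data.

np.random.seed(5)
num_elements, T = 10_000, 2_000        # |Omega| is much larger than the sample count
v = np.random.rand(num_elements)       # v(omega) for each element of Omega
xi = np.random.rand(num_elements)
xi /= xi.sum()                         # sampling distribution (positive everywhere)

samples = np.random.choice(num_elements, size=T, p=xi)   # omega_1, ..., omega_T
z_hat = np.mean(v[samples] / xi[samples])                # estimate (6.14)

print("exact sum z     :", v.sum())
print("estimate z_hat_T:", z_hat)
```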

The samples $\omega_t$ need not be independent for the preceding properties to hold, but if they are, then the variance of $\hat z_T$ is the sum of the variances of the independent components in the sum of Eq. (6.14), and is given by

$$\mathrm{var}(\hat z_T) = \frac{1}{T^2} \sum_{t=1}^T \sum_{\omega \in \Omega} \xi(\omega) \biggl(\frac{v(\omega)}{\xi(\omega)} - z\biggr)^2 = \frac{1}{T} \sum_{\omega \in \Omega} \xi(\omega) \biggl(\frac{v(\omega)}{\xi(\omega)} - z\biggr)^2. \tag{6.16}$$

An important observation from this formula is that the accuracy of the approximation does not depend on the number of terms in the sum z (the number of elements in $\Omega$), but rather depends on the variance of the random variable that takes values $v(\omega)/\xi(\omega)$, $\omega \in \Omega$, with probabilities $\xi(\omega)$. Thus, it is possible to execute approximately linear algebra operations of very large size through Monte Carlo sampling (with whatever distributions may be convenient in a given context), and this is a principal idea underlying the methods of this chapter.

In the case where the samples are dependent, the variance formula (6.16) does not hold, but similar qualitative conclusions can be drawn under various assumptions, which ensure that the dependencies between samples become sufficiently weak over time (see the specialized literature).

    Monte Carlo simulation is also important in the context of this chap-ter for an additional reason. In addition to its ability to compute efficientlysums of very large numbers of terms, it can often do so in model-free fash-ion (i.e., by using a simulator, rather than an explicit model of the termsin the sum).

    6.1.6 Contraction Mappings and Simulation

    Most of the chapter (Sections 6.3-6.8) deals with the approximate com-putation of a fixed point of a (linear or nonlinear) mapping T within a

The selection of the distribution {ξ(ω) | ω ∈ Ω} can be optimized (at least approximately), and methods for doing this are the subject of the technique of importance sampling. In particular, assuming that the samples are independent and that v(ω) ≥ 0 for all ω ∈ Ω, we have from Eq. (6.16) that the optimal distribution is ξ* = v/z and the corresponding minimum variance value is 0. However, ξ* cannot be computed without knowledge of z. Instead, ξ is usually chosen to be an approximation to v, normalized so that its components add to 1. Note that we may assume that v(ω) ≥ 0 for all ω ∈ Ω without loss of generality: when v takes negative values, we may decompose v as

\[
v = v^{+} - v^{-},
\]

so that both v^{+} and v^{-} are positive functions, and then estimate separately

\[
z^{+} = \sum_{\omega\in\Omega} v^{+}(\omega) \quad \text{and} \quad z^{-} = \sum_{\omega\in\Omega} v^{-}(\omega).
\]


    subspaceS={r|r s}.

    We will discuss a variety of approaches with distinct characteristics, but atan abstract mathematical level, these approaches fall into two categories:

(a) A projected equation approach, based on the equation

\[
\Phi r = \Pi T(\Phi r), \tag{6.17}
\]

where Π is a projection operation with respect to a Euclidean norm (see Section 6.3 for discounted problems, and Sections 7.1-7.3 for other types of problems).

(b) An aggregation approach, based on an equation of the form

\[
\Phi r = \Phi D T(\Phi r), \tag{6.18}
\]

where D is an s × n matrix whose rows are probability distributions, and Φ and D are matrices that satisfy certain restrictions.

When iterative methods are used for the solution of Eqs. (6.17) and (6.18), it is important that ΠT and ΦDT be contractions over the subspace S. Note here that even if T is a contraction mapping (as is ordinarily the case in DP), it does not follow that ΠT and ΦDT are contractions. In our analysis, this is resolved by requiring that T be a contraction with respect to a norm such that Π or ΦD, respectively, is a nonexpansive mapping. As a result, we need various assumptions on T, Φ, and D, which guide the algorithmic development. We postpone further discussion of these issues, but for the moment we note that the projection approach revolves mostly around Euclidean norm contractions and cases where T is linear, while the aggregation/Q-learning approach revolves mostly around sup-norm contractions.

If T is linear, both equations (6.17) and (6.18) may be written as square systems of linear equations of the form Cr = d, whose solution can be approximated by simulation. The approach here is very simple: we approximate C and d with simulation-generated approximations Ĉ and d̂, and we solve the resulting (approximate) linear system Ĉr = d̂ by matrix inversion, thereby obtaining the solution estimate r̂ = Ĉ⁻¹d̂. A primary example is the LSTD methods of Section 6.3.4. We may also try to solve the linear system Ĉr = d̂ iteratively, which leads to the LSPE type of methods, some of which produce estimates of r simultaneously with the generation of the simulation samples (see Section 6.3.4).
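Schematically, the matrix inversion approach amounts to averaging simulation-generated terms and solving the resulting low-dimensional system. The sketch below uses purely synthetic noisy samples in place of the actual simulation terms (whose concrete form for the projected equation is developed later in Section 6.3.3); the sampling mechanism here is entirely made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
s = 4

# A fixed pair C, d (unknown to the "algorithm") with C invertible.
A = rng.normal(size=(s, s))
C_true = A @ A.T + np.eye(s)          # symmetric positive definite, for simplicity
d_true = rng.normal(size=s)
r_true = np.linalg.solve(C_true, d_true)

def sample_term():
    """Simulation stand-in: noisy samples whose averages converge to C and d."""
    return C_true + 0.5 * rng.normal(size=(s, s)), d_true + 0.5 * rng.normal(size=s)

C_hat = np.zeros((s, s))
d_hat = np.zeros(s)
for k in range(1, 100_001):
    Wk, bk = sample_term()
    C_hat += (Wk - C_hat) / k          # running average of the matrix samples
    d_hat += (bk - d_hat) / k          # running average of the vector samples

r_hat = np.linalg.solve(C_hat, d_hat)  # "matrix inversion" step: solve C_hat r = d_hat
print("r_true:", np.round(r_true, 4))
print("r_hat :", np.round(r_hat, 4))
```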

    Stochastic Approximation Methods

Let us also mention some stochastic iterative algorithms that are based on a somewhat different simulation idea, and fall within the framework of


stochastic approximation methods. The TD(λ) and Q-learning algorithms fall in this category. For an informal orientation, let us consider the computation of the fixed point of a general mapping F : ℜ^n → ℜ^n that is a contraction mapping with respect to some norm, and involves an expected value: it has the form

\[
F(x) = E\bigl\{ f(x, w) \bigr\}, \tag{6.19}
\]

where x ∈ ℜ^n is a generic argument of F, w is a random variable, and f(·, w) is a given function. Assume for simplicity that w takes values in a finite set W with probabilities p(w), so that the fixed point equation x = F(x) has the form

\[
x = \sum_{w\in W} p(w) f(x, w).
\]

We generate a sequence of samples {w_1, w_2, . . .} such that the empirical frequency of each value w ∈ W is equal to its probability p(w), i.e.,

\[
\lim_{k\to\infty} \frac{n_k(w)}{k} = p(w), \qquad \forall\, w\in W,
\]

where n_k(w) denotes the number of times that w appears in the first k samples w_1, . . . , w_k. This is a reasonable assumption that may be verified by application of various laws of large numbers to the sampling method at hand.

Given the samples, we may consider approximating the fixed point of F by the (approximate) fixed point iteration

\[
x_{k+1} = \sum_{w\in W} \frac{n_k(w)}{k}\, f(x_k, w), \tag{6.20}
\]

which can also be equivalently written as

\[
x_{k+1} = \frac{1}{k} \sum_{i=1}^{k} f(x_k, w_i). \tag{6.21}
\]

We may view Eq. (6.20) as a simulation-based version of the convergent fixed point iteration

\[
x_{k+1} = F(x_k) = \sum_{w\in W} p(w) f(x_k, w),
\]

where the probabilities p(w) have been replaced by the empirical frequencies n_k(w)/k. Thus we expect that the simulation-based iteration (6.21) converges to the fixed point of F.

On the other hand the iteration (6.21) has a major flaw: it requires, for each k, the computation of f(x_k, w_i) for all sample values w_i, i =


1, . . . , k. An algorithm that requires much less computation than iteration (6.21) is

\[
x_{k+1} = \frac{1}{k} \sum_{i=1}^{k} f(x_i, w_i), \qquad k = 1, 2, \ldots, \tag{6.22}
\]

where only one value of f per sample w_i is computed. This iteration can also be written in the simple recursive form

\[
x_{k+1} = (1 - \gamma_k)\, x_k + \gamma_k f(x_k, w_k), \qquad k = 1, 2, \ldots, \tag{6.23}
\]

with the stepsize γ_k having the form γ_k = 1/k. As an indication of its validity, we note that if it converges to some limit then this limit must be the fixed point of F, since for large k the iteration (6.22) becomes essentially identical to the iteration x_{k+1} = F(x_k). Other stepsize rules, which satisfy γ_k → 0 and Σ_{k=1}^∞ γ_k = ∞, may also be used. However, a rigorous analysis of the convergence of iteration (6.23) is nontrivial and is beyond our scope. The book by Bertsekas and Tsitsiklis [BeT96] contains a fairly detailed development, which is tailored to DP. Other more general references are Benveniste, Metivier, and Priouret [BMP90], Borkar [Bor08], Kushner and Yin [KuY03], and Meyn [Mey07].
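As an illustration of iteration (6.23), the following sketch applies it to a made-up affine contraction f(x, w) = Ax + b + w with zero-mean noise w drawn from a finite set, using the stepsize γ_k = 1/k; the matrix A, vector b, and noise values are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3

# F(x) = E[f(x, w)] with f(x, w) = A x + b + w and E[w] = 0; here ||A|| < 1, so F is a contraction.
A = 0.5 * np.array([[0.4, 0.2, 0.0],
                    [0.1, 0.3, 0.2],
                    [0.0, 0.2, 0.5]])
b = np.array([1.0, -2.0, 0.5])
x_fixed = np.linalg.solve(np.eye(n) - A, b)      # fixed point of F

x = np.zeros(n)
for k in range(1, 100_001):
    w = rng.choice(np.array([-1.0, 1.0]), size=n)   # zero-mean noise from a finite set W
    gamma = 1.0 / k
    x = (1.0 - gamma) * x + gamma * (A @ x + b + w)  # iteration (6.23)

print("fixed point of F:", np.round(x_fixed, 4))
print("SA iterate x_k  :", np.round(x, 4))
```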

    6.2 DIRECT POLICY EVALUATION - GRADIENT METHODS

We will now consider the direct approach for policy evaluation. In particular, suppose that the current policy is μ, and for a given r, J̃(i, r) is an approximation of J_μ(i). We generate an "improved" policy μ̄ using the formula

\[
\bar\mu(i) = \arg\min_{u\in U(i)} \sum_{j=1}^{n} p_{ij}(u)\bigl( g(i,u,j) + \alpha \tilde J(j, r) \bigr), \qquad \text{for all } i. \tag{6.24}
\]

To evaluate approximately J_μ, we select a subset of representative states S̃ (perhaps obtained by some form of simulation), and for each i ∈ S̃, we obtain M(i) samples of the cost J_μ(i). The mth such sample is denoted by

Direct policy evaluation methods have been historically important, and provide an interesting contrast with indirect methods. However, they are currently less popular than the projected equation methods to be considered in the next section, despite some generic advantages (the option to use nonlinear approximation architectures, and the capability of more accurate approximation). The material of this section will not be substantially used later, so the reader may read lightly this section without loss of continuity.


c(i, m), and mathematically, it can be viewed as being J_μ(i) plus some simulation error/noise. Then we obtain the corresponding parameter vector r̄ by solving the following least squares problem

\[
\min_{r} \sum_{i\in \tilde S} \sum_{m=1}^{M(i)} \bigl( \tilde J(i,r) - c(i,m) \bigr)^2, \tag{6.25}
\]

and we repeat the process with μ̄ and r̄ replacing μ and r, respectively (see Fig. 6.1.1).

The least squares problem (6.25) can be solved exactly if a linear approximation architecture is used, i.e., if

\[
\tilde J(i, r) = \phi(i)' r,
\]

where φ(i)' is a row vector of features corresponding to state i. In this case r is obtained by solving the linear system of equations

\[
\sum_{i\in\tilde S} \sum_{m=1}^{M(i)} \phi(i) \bigl( \phi(i)' r - c(i, m) \bigr) = 0,
\]

which is obtained by setting to 0 the gradient with respect to r of the quadratic cost in the minimization (6.25). When a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem (6.25), as we will now discuss.
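For the linear case, here is a minimal sketch of the exact solution of problem (6.25) via the above linear system; the features and cost samples are made-up placeholders, not data from the text.

```python
import numpy as np

# Representative states with feature vectors phi(i) (s = 2 features), and cost samples c(i, m).
phi = {0: np.array([1.0, 0.0]),
       1: np.array([1.0, 1.0]),
       2: np.array([1.0, 2.0])}
cost_samples = {0: [4.1, 3.9, 4.0],           # noisy samples of J_mu(i) for each state i
                1: [6.2, 5.8],
                2: [8.1, 7.9, 8.0, 8.2]}

# Normal equations: sum_i sum_m phi(i) (phi(i)' r - c(i, m)) = 0
s = 2
A = np.zeros((s, s))
b = np.zeros(s)
for i, samples in cost_samples.items():
    for c in samples:
        A += np.outer(phi[i], phi[i])
        b += phi[i] * c
r_bar = np.linalg.solve(A, b)

print("fitted r:", np.round(r_bar, 3))        # roughly [4, 2]: J_tilde(i, r) = 4 + 2*i here
```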

    Batch Gradient Methods for Policy Evaluation

Let us focus on an N-transition portion (i_0, . . . , i_N) of a simulated trajectory, also called a batch. We view the numbers

\[
\sum_{t=k}^{N-1} \alpha^{t-k} g\bigl(i_t, \mu(i_t), i_{t+1}\bigr), \qquad k = 0, \ldots, N-1,
\]

The manner in which the samples c(i, m) are collected is immaterial for the purposes of the subsequent discussion. Thus one may generate these samples through a single very long trajectory of the Markov chain corresponding to μ, or one may use multiple trajectories, with different starting points, to ensure that enough cost samples are generated for a representative subset of states. In either case, the samples c(i, m) corresponding to any one state i will generally be correlated as well as noisy. Still the average

\[
\frac{1}{M(i)} \sum_{m=1}^{M(i)} c(i, m)
\]

will ordinarily converge to J_μ(i) as M(i) → ∞ by a law of large numbers argument [see Exercise 6.2 and the discussion in [BeT96], Sections 5.1, 5.2, regarding the behavior of the average when M(i) is finite and random].


as cost samples, one per initial state i_0, . . . , i_{N-1}, which can be used for least squares approximation of the parametric architecture J̃(i, r) [cf. Eq. (6.25)]:

\[
\min_{r} \sum_{k=0}^{N-1} \frac{1}{2} \left( \tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g\bigl(i_t, \mu(i_t), i_{t+1}\bigr) \right)^2. \tag{6.26}
\]

One way to solve this least squares problem is to use a gradient method, whereby the parameter r associated with μ is updated at time N by

\[
r := r - \gamma \sum_{k=0}^{N-1} \nabla\tilde J(i_k, r) \left( \tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g\bigl(i_t, \mu(i_t), i_{t+1}\bigr) \right). \tag{6.27}
\]

Here, ∇J̃ denotes gradient with respect to r and γ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment). Each of the N terms in the summation in the right-hand side above is the gradient of a corresponding term in the least squares summation of problem (6.26). Note that the update of r is done after processing the entire batch, and that the gradients ∇J̃(i_k, r) are evaluated at the preexisting value of r, i.e., the one before the update.
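A minimal sketch of the batch update (6.27) for a linear architecture J̃(i, r) = φ(i)'r follows; the trajectory, stage costs, stepsize, and helper names are arbitrary placeholders chosen only for illustration.

```python
import numpy as np

alpha, gamma = 0.9, 0.01                      # discount factor and stepsize (placeholders)
traj = [0, 1, 2, 1, 0, 2]                     # states i_0, ..., i_N of one batch (N = 5)
stage_cost = lambda i, j: float(i + j)        # stands in for g(i, mu(i), j)
Phi = np.array([[1.0, 0.0],                   # feature vectors phi(i) as rows
                [1.0, 1.0],
                [1.0, 2.0]])

def J_tilde(i, r):
    return Phi[i] @ r

def grad_J(i, r):
    return Phi[i]                             # gradient of phi(i)'r with respect to r

def batch_gradient_update(r):
    """One update of Eq. (6.27): all gradients evaluated at the incoming r."""
    N = len(traj) - 1
    total = np.zeros_like(r)
    for k in range(N):
        # discounted cost sample accumulated from time k to the end of the batch
        sample = sum(alpha**(t - k) * stage_cost(traj[t], traj[t + 1]) for t in range(k, N))
        total += grad_J(traj[k], r) * (J_tilde(traj[k], r) - sample)
    return r - gamma * total

r = np.zeros(2)
for _ in range(500):                          # repeat the update on the same batch
    r = batch_gradient_update(r)
print("r after 500 batch updates:", np.round(r, 3))
```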

In a traditional gradient method, the gradient iteration (6.27) is repeated, until convergence to the solution of the least squares problem (6.26), i.e., a single N-transition batch is used. However, there is an important tradeoff relating to the size N of the batch: in order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large N, yet to keep the work per gradient iteration small it is necessary to use a small N.

To address the issue of size of N, an expanded view of the gradient method is preferable in practice, whereby batches may be changed after one or more iterations. Thus, in this more general method, the N-transition batch used in a given gradient iteration comes from a potentially longer simulated trajectory, or from one of many simulated trajectories. A sequence of gradient iterations is performed, with each iteration using cost samples formed from batches collected in a variety of different ways and whose length N may vary. Batches may also overlap to a substantial degree.

We leave the method for generating simulated trajectories and forming batches open for the moment, but we note that it influences strongly the result of the corresponding least squares optimization (6.25), providing better approximations for the states that arise most frequently in the batches used. This is related to the issue of ensuring that the state space is adequately explored, with an adequately broad selection of states being represented in the least squares optimization, cf. our earlier discussion on the exploration issue.

The gradient method (6.27) is simple, widely known, and easily understood. There are extensive convergence analyses of this method and


its variations, for which we refer to the literature cited at the end of the chapter. These analyses often involve considerable mathematical sophistication, particularly when multiple batches are involved, because of the stochastic nature of the simulation and the complex correlations between the cost samples. However, qualitatively, the conclusions of these analyses are consistent among themselves as well as with practical experience, and indicate that:

(1) Under some reasonable technical assumptions, convergence to a limiting value of r that is a local minimum of the associated optimization problem is expected.

(2) For convergence, it is essential to gradually reduce the stepsize to 0, the most popular choice being to use a stepsize proportional to 1/m, while processing the mth batch. In practice, considerable trial and error may be needed to settle on an effective stepsize choice method. Sometimes it is possible to improve performance by using a different stepsize (or scaling factor) for each component of the gradient.

(3) The rate of convergence is often very slow, and depends among other things on the initial choice of r, the number of states and the dynamics of the associated Markov chain, the level of simulation error, and the method for stepsize choice. In fact, the rate of convergence is sometimes so slow, that practical convergence is infeasible, even if theoretical convergence is guaranteed.

    Incremental Gradient Methods for Policy Evaluation

We will now consider a variant of the gradient method called incremental. This method can also be described through the use of N-transition batches, but we will see that (contrary to the batch version discussed earlier) the method is suitable for use with very long batches, including the possibility of a single very long simulated trajectory, viewed as a single batch.

For a given N-transition batch (i_0, . . . , i_N), the batch gradient method processes the N transitions all at once, and updates r using Eq. (6.27). The incremental method updates r a total of N times, once after each transition. Each time it adds to r the corresponding portion of the gradient in the right-hand side of Eq. (6.27) that can be calculated using the newly available simulation data. Thus, after each transition (i_k, i_{k+1}):

(1) We evaluate the gradient ∇J̃(i_k, r) at the current value of r.

(2) We sum all the terms in the right-hand side of Eq. (6.27) that involve the transition (i_k, i_{k+1}), and we update r by making a correction


along their sum:

\[
r := r - \gamma \left( \nabla\tilde J(i_k, r)\,\tilde J(i_k, r) - \sum_{t=0}^{k} \alpha^{k-t} \nabla\tilde J(i_t, r)\, g\bigl(i_k, \mu(i_k), i_{k+1}\bigr) \right). \tag{6.28}
\]

By adding the parenthesized incremental correction terms in the above iteration, we see that after N transitions, all the terms of the batch iteration (6.27) will have been accumulated, but there is a difference: in the incremental version, r is changed during the processing of the batch, and the gradient ∇J̃(i_t, r) is evaluated at the most recent value of r [after the transition (i_t, i_{t+1})]. By contrast, in the batch version these gradients are evaluated at the value of r prevailing at the beginning of the batch. Note that the gradient sum in the right-hand side of Eq. (6.28) can be conveniently updated following each transition, thereby resulting in an efficient implementation.

It can now be seen that because r is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant. It is thus possible to have very long batches, and indeed the algorithm can be operated with a single very long simulated trajectory and a single batch. In this case, for each state i, we will have one cost sample for every time when state i is encountered in the simulation. Accordingly state i will be weighted in the least squares optimization in proportion to the frequency of its occurrence within the simulated trajectory.
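For the same made-up linear setting as in the batch sketch above, the incremental update (6.28) can be implemented with a running discounted gradient sum, updated recursively after each transition; all numerical inputs remain arbitrary placeholders.

```python
import numpy as np

alpha, gamma = 0.9, 0.01                     # discount factor and stepsize (placeholders)
traj = [0, 1, 2, 1, 0, 2]                    # states i_0, ..., i_N of one batch
stage_cost = lambda i, j: float(i + j)       # stands in for g(i, mu(i), j)
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # phi(i) as rows; grad of phi(i)'r is phi(i)

r = np.zeros(2)
grad_sum = np.zeros(2)                       # holds sum_{t<=k} alpha^(k-t) * grad J(i_t, r_t)
for k in range(len(traj) - 1):
    i_k, i_next = traj[k], traj[k + 1]
    grad_sum = alpha * grad_sum + Phi[i_k]   # recursive update of the discounted gradient sum
    # Eq. (6.28): correction from the terms involving the transition (i_k, i_{k+1}),
    # with the gradients evaluated at the most recent parameter values
    r = r - gamma * (Phi[i_k] * (Phi[i_k] @ r) - grad_sum * stage_cost(i_k, i_next))

print("r after one incremental pass:", np.round(r, 3))
```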

Generally, within the least squares/policy evaluation context of this section, the incremental versions of the gradient methods can be implemented more flexibly and tend to converge faster than their batch counterparts, so they will be adopted as the default in our discussion. The book by Bertsekas and Tsitsiklis [BeT96] contains an extensive analysis of the theoretical convergence properties of incremental gradient methods (they are fairly similar to those of batch methods), and provides some insight into the reasons for their superior performance relative to the batch versions; see also the author's nonlinear programming book [Ber99] (Section 1.5.2), the paper by Bertsekas and Tsitsiklis [BeT00], and the author's recent survey [Ber10d]. Still, however, the rate of convergence can be very slow.

    Implementation Using Temporal Differences TD(1)

We now introduce an alternative, mathematically equivalent, implementation of the batch and incremental gradient iterations (6.27) and (6.28), which is described with cleaner formulas. It uses the notion of temporal difference (TD for short) given by

\[
q_k = \tilde J(i_k, r) - \alpha\tilde J(i_{k+1}, r) - g\bigl(i_k, \mu(i_k), i_{k+1}\bigr), \qquad k = 0, \ldots, N-2, \tag{6.29}
\]


\[
q_{N-1} = \tilde J(i_{N-1}, r) - g\bigl(i_{N-1}, \mu(i_{N-1}), i_N\bigr). \tag{6.30}
\]

In particular, by noting that the parenthesized term multiplying ∇J̃(i_k, r) in Eq. (6.27) is equal to

\[
q_k + \alpha q_{k+1} + \cdots + \alpha^{N-1-k} q_{N-1},
\]

we can verify by adding the equations below that iteration (6.27) can also be implemented as follows:

After the state transition (i_0, i_1), set

\[
r := r - \gamma q_0 \nabla\tilde J(i_0, r).
\]

After the state transition (i_1, i_2), set

\[
r := r - \gamma q_1 \bigl( \alpha\nabla\tilde J(i_0, r) + \nabla\tilde J(i_1, r) \bigr).
\]

Proceeding similarly, after the state transition (i_{N-1}, i_N), set

\[
r := r - \gamma q_{N-1} \bigl( \alpha^{N-1}\nabla\tilde J(i_0, r) + \alpha^{N-2}\nabla\tilde J(i_1, r) + \cdots + \nabla\tilde J(i_{N-1}, r) \bigr).
\]

The batch version (6.27) is obtained if the gradients ∇J̃(i_k, r) are all evaluated at the value of r that prevails at the beginning of the batch. The incremental version (6.28) is obtained if each gradient ∇J̃(i_k, r) is evaluated at the value of r that prevails when the transition (i_k, i_{k+1}) is processed.

In particular, for the incremental version, we start with some vector r_0, and following the transition (i_k, i_{k+1}), k = 0, . . . , N − 1, we set

\[
r_{k+1} = r_k - \gamma_k q_k \sum_{t=0}^{k} \alpha^{k-t} \nabla\tilde J(i_t, r_t), \tag{6.31}
\]

where the stepsize γ_k may vary from one transition to the next. In the important case of a linear approximation architecture of the form

\[
\tilde J(i, r) = \phi(i)' r, \qquad i = 1, \ldots, n,
\]

where φ(i) ∈ ℜ^s are some fixed vectors, it takes the form

\[
r_{k+1} = r_k - \gamma_k q_k \sum_{t=0}^{k} \alpha^{k-t} \phi(i_t). \tag{6.32}
\]

This algorithm is known as TD(1), and we will see in Section 6.3.6 that it is a limiting version (as λ → 1) of the TD(λ) method discussed there.
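A compact sketch of TD(1) [Eqs. (6.29)-(6.32)] with a linear architecture follows; the trajectory, stage costs, and stepsize are again arbitrary placeholders, and the helper name `td1_pass` is made up for the example.

```python
import numpy as np

alpha = 0.9                                   # discount factor
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # feature vectors phi(i) as rows
stage_cost = lambda i, j: float(i + j)        # stands in for g(i, mu(i), j)

def td1_pass(traj, r, stepsize=0.01):
    """One pass of TD(1) [Eqs. (6.29)-(6.32)] over a trajectory i_0, ..., i_N."""
    z = np.zeros_like(r)                      # z_k = sum_{t<=k} alpha^(k-t) phi(i_t)
    for k in range(len(traj) - 1):
        i_k, i_next = traj[k], traj[k + 1]
        z = alpha * z + Phi[i_k]
        if k < len(traj) - 2:
            # temporal difference, Eq. (6.29)
            q = Phi[i_k] @ r - alpha * (Phi[i_next] @ r) - stage_cost(i_k, i_next)
        else:
            # last transition of the batch, Eq. (6.30)
            q = Phi[i_k] @ r - stage_cost(i_k, i_next)
        r = r - stepsize * q * z              # Eq. (6.32)
    return r

r = np.zeros(2)
for _ in range(200):
    r = td1_pass([0, 1, 2, 1, 0, 2], r)
print("TD(1) parameter vector:", np.round(r, 3))
```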


    6.3 PROJECTED EQUATION METHODS

In this section, we consider the indirect approach, whereby the policy evaluation is based on solving a projected form of Bellman's equation (cf. the right-hand side of Fig. 6.1.5). We will be dealing with a single stationary policy μ, so we generally suppress in our notation the dependence on control of the transition probabilities and the cost per stage. We thus consider a stationary finite-state Markov chain, and we denote the states by i = 1, . . . , n, the transition probabilities by p_ij, i, j = 1, . . . , n, and the stage costs by g(i, j). We want to evaluate the expected cost of μ corresponding to each initial state i, given by

\[
J_\mu(i) = \lim_{N\to\infty} E\left\{ \sum_{k=0}^{N-1} \alpha^k g(i_k, i_{k+1}) \;\Big|\; i_0 = i \right\}, \qquad i = 1, \ldots, n,
\]

where i_k denotes the state at time k, and α ∈ (0, 1) is the discount factor. We approximate J_μ(i) with a linear architecture of the form

\[
\tilde J(i, r) = \phi(i)' r, \qquad i = 1, \ldots, n, \tag{6.33}
\]

where r is a parameter vector and φ(i) is an s-dimensional feature vector associated with the state i. (Throughout this section, vectors are viewed as column vectors, and a prime denotes transposition.) As earlier, we also write the vector

\[
\bigl( \tilde J(1, r), \ldots, \tilde J(n, r) \bigr)'
\]

in the compact form Φr, where Φ is the n × s matrix that has as rows the feature vectors φ(i)', i = 1, . . . , n. Thus, we want to approximate J_μ within

\[
S = \{ \Phi r \mid r \in \Re^s \},
\]

the subspace spanned by s basis functions, the columns of Φ. Our assumptions in this section are the following (we will later discuss how our methodology may be modified in the absence of these assumptions).

Assumption 6.3.1: The Markov chain has steady-state probabilities ξ_1, . . . , ξ_n, which are positive, i.e., for all i = 1, . . . , n,

\[
\lim_{N\to\infty} \frac{1}{N} \sum_{k=1}^{N} P(i_k = j \mid i_0 = i) = \xi_j > 0, \qquad j = 1, \ldots, n.
\]

Assumption 6.3.2: The matrix Φ has rank s.


Assumption 6.3.1 is equivalent to assuming that the Markov chain is irreducible, i.e., has a single recurrent class and no transient states. Assumption 6.3.2 is equivalent to the basis functions (the columns of Φ) being linearly independent, and is analytically convenient because it implies that each vector J in the subspace S is represented in the form Φr with a unique vector r.

    6.3.1 The Projected Bellman Equation

We will now introduce the projected form of Bellman's equation. We use a weighted Euclidean norm on ℜ^n of the form

\[
\|J\|_v = \sqrt{ \sum_{i=1}^{n} v_i \bigl( J(i) \bigr)^2 },
\]

where v is a vector of positive weights v_1, . . . , v_n. Let Π denote the projection operation onto S with respect to this norm. Thus for any J ∈ ℜ^n, ΠJ is the unique vector in S that minimizes ‖J − J̄‖²_v over all J̄ ∈ S. It can also be written as

\[
\Pi J = \Phi r_J,
\]

where

\[
r_J = \arg\min_{r\in\Re^s} \| J - \Phi r \|_v^2, \qquad J \in \Re^n. \tag{6.34}
\]

This is because Φ has rank s by Assumption 6.3.2, so a vector in S is uniquely written in the form Φr.

Note that Π and r_J can be written explicitly in closed form. This can be done by setting to 0 the gradient of the quadratic function

\[
\| J - \Phi r \|_v^2 = (J - \Phi r)' V (J - \Phi r),
\]

where V is the diagonal matrix with v_i, i = 1, . . . , n, along the diagonal [cf. Eq. (6.34)]. We thus obtain the necessary and sufficient optimality condition

\[
\Phi' V (J - \Phi r_J) = 0, \tag{6.35}
\]

from which

\[
r_J = (\Phi' V \Phi)^{-1} \Phi' V J,
\]

and using the formula Φr_J = ΠJ,

\[
\Pi = \Phi (\Phi' V \Phi)^{-1} \Phi' V.
\]

[The inverse (Φ'VΦ)^{-1} exists because Φ is assumed to have rank s; cf. Assumption 6.3.2.] The optimality condition (6.35), through left multiplication with r', can also be equivalently expressed as

\[
(\Phi r)' V (J - \Phi r_J) = 0, \qquad \forall\, \Phi r \in S. \tag{6.36}
\]


The interpretation is that the difference/approximation error J − Φr_J is orthogonal to the subspace S in the scaled geometry of the norm ‖·‖_v (two vectors x, y ∈ ℜ^n are called orthogonal if x'Vy = Σ_{i=1}^n v_i x_i y_i = 0).
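A small numerical sketch of the projection formulas just derived, with arbitrary weights and features (none of the numbers come from the text):

```python
import numpy as np

n, s = 4, 2
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])                  # n x s feature matrix (rank s)
v = np.array([0.1, 0.2, 0.3, 0.4])            # positive weights
V = np.diag(v)

# Pi = Phi (Phi' V Phi)^{-1} Phi' V, and r_J = (Phi' V Phi)^{-1} Phi' V J
Pi = Phi @ np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V)

J = np.array([2.0, -1.0, 0.5, 3.0])
r_J = np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ J)

print("Pi J          :", np.round(Pi @ J, 4))
print("Phi r_J       :", np.round(Phi @ r_J, 4))     # same vector, cf. Pi J = Phi r_J
print("orthogonality :", np.round(Phi.T @ V @ (J - Phi @ r_J), 10))   # Eq. (6.35), ~0
```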

Consider now the mapping T given by

\[
(TJ)(i) = \sum_{j=1}^{n} p_{ij}\bigl( g(i, j) + \alpha J(j) \bigr), \qquad i = 1, \ldots, n,
\]

the mapping ΠT (the composition of Π with T), and the equation

\[
\Phi r = \Pi T(\Phi r). \tag{6.37}
\]

We view this as a projected/approximate form of Bellman's equation, and we view a solution Φr* of this equation as an approximation to J_μ. Note that Φr* depends only on the projection norm and the subspace S, and not on the matrix Φ, which provides just an algebraic representation of S (i.e., all matrices Φ whose range space is S result in identical vectors Φr*).

We know from Section 1.4 that T is a contraction with respect to the sup-norm, but unfortunately this does not necessarily imply that ΠT is a contraction with respect to the norm ‖·‖_v. We will next show an important fact: if v is chosen to be the steady-state probability vector ξ, then ΠT is a contraction with respect to ‖·‖_v, with modulus α. The critical part of the proof is addressed in the following lemma.

Lemma 6.3.1: For any n × n stochastic matrix P that has a steady-state probability vector ξ = (ξ_1, . . . , ξ_n) with positive components, we have

\[
\| P z \|_\xi \le \| z \|_\xi, \qquad z \in \Re^n.
\]

Proof: Let p_ij be the components of P. For all z ∈ ℜ^n, we have

\[
\| P z \|_\xi^2 = \sum_{i=1}^{n} \xi_i \left( \sum_{j=1}^{n} p_{ij} z_j \right)^2
\le \sum_{i=1}^{n} \xi_i \sum_{j=1}^{n} p_{ij} z_j^2
= \sum_{j=1}^{n} \sum_{i=1}^{n} \xi_i p_{ij} z_j^2
= \sum_{j=1}^{n} \xi_j z_j^2
= \| z \|_\xi^2,
\]

where the inequality follows from the convexity of the quadratic function (Jensen's inequality applied to the probabilities p_i1, . . . , p_in), and the next-to-last equality follows from the defining property Σ_{i=1}^n ξ_i p_ij = ξ_j of the steady-state probabilities. Q.E.D.



Figure 6.3.1 Illustration of the contraction property of ΠT due to the nonexpansiveness of Π. If T is a contraction with respect to ‖·‖_v, the Euclidean norm used in the projection, then ΠT is also a contraction with respect to that norm, since Π is nonexpansive and we have

\[
\| \Pi T J - \Pi T \bar J \|_v \le \| T J - T \bar J \|_v \le \beta \| J - \bar J \|_v,
\]

where β is the modulus of contraction of T with respect to ‖·‖_v.

The next proposition gives an estimate of the error in estimating J_μ with the fixed point Φr* of ΠT.

Proposition 6.3.2: Let Φr* be the fixed point of ΠT. We have

\[
\| J_\mu - \Phi r^* \|_\xi \le \frac{1}{\sqrt{1 - \alpha^2}}\, \| J_\mu - \Pi J_\mu \|_\xi.
\]

Proof: We have

\[
\begin{aligned}
\| J_\mu - \Phi r^* \|_\xi^2 &= \| J_\mu - \Pi J_\mu \|_\xi^2 + \| \Pi J_\mu - \Phi r^* \|_\xi^2 \\
&= \| J_\mu - \Pi J_\mu \|_\xi^2 + \| \Pi T J_\mu - \Pi T(\Phi r^*) \|_\xi^2 \\
&\le \| J_\mu - \Pi J_\mu \|_\xi^2 + \alpha^2 \| J_\mu - \Phi r^* \|_\xi^2,
\end{aligned}
\]

where the first equality uses the Pythagorean Theorem [cf. Eq. (6.38) with J̄ = Φr*], the second equality holds because J_μ is the fixed point of T and Φr* is the fixed point of ΠT, and the inequality uses the contraction property of ΠT. From this relation, the result follows. Q.E.D.

Note the critical fact in the preceding analysis: αP (and hence T) is a contraction with respect to the projection norm ‖·‖_ξ (cf. Lemma 6.3.1). Indeed, Props. 6.3.1 and 6.3.2 hold if T is any (possibly nonlinear)


contraction with respect to the Euclidean norm of the projection (cf. Fig. 6.3.1).

    The Matrix Form of the Projected Bellman Equation

Let us now write the projected Bellman equation Φr = ΠT(Φr) in explicit form. We note that this is a linear equation, since the projection Π is linear and also T is linear of the form

\[
T J = g + \alpha P J,
\]

where g is the vector with components Σ_{j=1}^n p_ij g(i, j), i = 1, . . . , n, and P is the matrix with components p_ij. The solution of the projected Bellman equation is the vector J̃ = Φr*, where r* satisfies the orthogonality condition

\[
\Phi' \Xi \bigl( \Phi r^* - (g + \alpha P \Phi r^*) \bigr) = 0, \tag{6.39}
\]

with Ξ being the diagonal matrix with the steady-state probabilities ξ_1, . . . , ξ_n along the diagonal [cf. Eq. (6.36)].

Thus the projected equation is written as

\[
C r^* = d, \tag{6.40}
\]

where

\[
C = \Phi' \Xi (I - \alpha P) \Phi, \qquad d = \Phi' \Xi g, \tag{6.41}
\]

and can be solved by matrix inversion:

\[
r^* = C^{-1} d,
\]

just like the Bellman equation, which can also be solved by matrix inversion,

\[
J_\mu = (I - \alpha P)^{-1} g.
\]

An important difference is that the projected equation has smaller dimension (s rather than n). Still, however, computing C and d using Eq. (6.41) requires computation of inner products of size n, so for problems where n is very large, the explicit computation of C and d is impractical. We will discuss shortly efficient methods to compute inner products of large size by using simulation and low dimensional calculations. The idea is that an inner product, appropriately normalized, can be viewed as an expected value (the weighted sum of a large number of terms), which can be computed by sampling its components with an appropriate probability distribution and averaging the samples, as discussed in Section 6.1.5.
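For a problem with tiny n, the quantities in Eq. (6.41) can of course be formed exactly. The following sketch (all numbers illustrative, with g taken directly as the vector of expected stage costs) computes C and d, solves the projected equation, and compares Φr* with J_μ:

```python
import numpy as np

alpha = 0.9
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])               # transition matrix of the policy
g = np.array([1.0, 2.0, 0.5])                 # expected stage costs g(i) = sum_j p_ij g(i, j)
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])                  # feature matrix (rank 2)

# Steady-state probabilities xi and the diagonal matrix Xi
eigvals, eigvecs = np.linalg.eig(P.T)
xi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
xi = xi / xi.sum()
Xi = np.diag(xi)

# Eq. (6.41) and the solution of the projected equation (6.40)
C = Phi.T @ Xi @ (np.eye(3) - alpha * P) @ Phi
d = Phi.T @ Xi @ g
r_star = np.linalg.solve(C, d)

J_mu = np.linalg.solve(np.eye(3) - alpha * P, g)   # exact cost vector of the policy
print("J_mu   :", np.round(J_mu, 3))
print("Phi r* :", np.round(Phi @ r_star, 3))
```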

Here Φr* is the projection of g + αPΦr*, so Φr* − (g + αPΦr*) is orthogonal to the columns of Φ. Alternatively, r* solves the problem

\[
\min_{r\in\Re^s} \bigl\| \Phi r - (g + \alpha P \Phi r^*) \bigr\|_\xi^2.
\]

Setting to 0 the gradient with respect to r of the above quadratic expression, we obtain Eq. (6.39).



Figure 6.3.2 Illustration of the projected value iteration (PVI) method

\[
\Phi r_{k+1} = \Pi T(\Phi r_k).
\]

At the typical iteration k, the current iterate Φr_k is operated on with T, and the generated vector T(Φr_k) is projected onto S, to yield the new iterate Φr_{k+1}.

    6.3.2 Projected Value Iteration - Other Iterative Methods

We noted in Chapter 1 that for problems where n is very large, an iterative method such as value iteration may be appropriate for solving the Bellman equation J = TJ. Similarly, one may consider an iterative method for solving the projected Bellman equation Φr = ΠT(Φr) or its equivalent version Cr = d [cf. Eqs. (6.40)-(6.41)].

Since ΠT is a contraction (cf. Prop. 6.3.1), the first iterative method that comes to mind is the analog of value iteration: successively apply ΠT, starting with an arbitrary initial vector Φr_0:

\[
\Phi r_{k+1} = \Pi T(\Phi r_k), \qquad k = 0, 1, \ldots. \tag{6.42}
\]

Thus at iteration k, the current iterate Φr_k is operated on with T, and the generated value iterate T(Φr_k) (which does not necessarily lie in S) is projected onto S, to yield the new iterate Φr_{k+1} (see Fig. 6.3.2). We refer to this as projected value iteration (PVI for short). Since ΠT is a contraction, it follows that the sequence {Φr_k} generated by PVI converges to the unique fixed point Φr* of ΠT.

It is possible to write PVI explicitly by noting that

\[
r_{k+1} = \arg\min_{r\in\Re^s} \bigl\| \Phi r - (g + \alpha P \Phi r_k) \bigr\|_\xi^2.
\]

By setting to 0 the gradient with respect to r of the above quadratic expression, we obtain the orthogonality condition

\[
\Phi' \Xi \bigl( \Phi r_{k+1} - (g + \alpha P \Phi r_k) \bigr) = 0,
\]


[cf. Eq. (6.39)], which yields

\[
r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} (C r_k - d), \tag{6.43}
\]

where C and d are given by Eq. (6.41).

From the point of view of DP, the PVI method makes intuitive sense, and connects well with established DP theory. However, the methodology of iterative methods for solving linear equations suggests a much broader set of algorithmic possibilities. In particular, in a generic class of methods, the current iterate r_k is corrected by the residual Cr_k − d (which tends to 0), after scaling with some s × s scaling matrix G, leading to the iteration

\[
r_{k+1} = r_k - \gamma G (C r_k - d), \tag{6.44}
\]

where γ is a positive stepsize, and G is some s × s scaling matrix. When G = (Φ'ΞΦ)^{-1} and γ = 1, we obtain the PVI method, but there are other interesting possibilities. For example, when G is the identity or a diagonal approximation to (Φ'ΞΦ)^{-1}, the iteration (6.44) is simpler than PVI in that it does not require a matrix inversion (it does require, however, the choice of a stepsize γ).
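Continuing with a small illustrative chain like the one used in the earlier sketch (all numbers arbitrary), the following self-contained code runs iteration (6.44) for two choices of G and checks convergence to r* = C⁻¹d:

```python
import numpy as np

alpha = 0.9
P = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
g = np.array([1.0, 2.0, 0.5])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(P.T)
xi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
xi = xi / xi.sum()
Xi = np.diag(xi)
C = Phi.T @ Xi @ (np.eye(3) - alpha * P) @ Phi
d = Phi.T @ Xi @ g
r_star = np.linalg.solve(C, d)

def scaled_iteration(G, gamma, iters=500):
    """Iteration (6.44): r <- r - gamma * G (C r - d)."""
    r = np.zeros(2)
    for _ in range(iters):
        r = r - gamma * (G @ (C @ r - d))
    return r

G_pvi = np.linalg.inv(Phi.T @ Xi @ Phi)       # G = (Phi' Xi Phi)^{-1}, gamma = 1: PVI, Eq. (6.43)
print("PVI iterate      :", np.round(scaled_iteration(G_pvi, 1.0), 4))
print("G = I, small step:", np.round(scaled_iteration(np.eye(2), 0.5), 4))
print("r* = C^{-1} d    :", np.round(r_star, 4))
```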

The iteration (6.44) converges to the solution of the projected equation if and only if the matrix I − γGC has eigenvalues strictly within the unit circle. The following proposition shows that this is true when G is positive definite symmetric, as long as the stepsize γ is small enough to compensate for large components in the matrix G. This hinges on an important property of the matrix C, which we now define. Let us say that a (possibly nonsymmetric) s × s matrix M is positive definite if

\[
r' M r > 0, \qquad \forall\, r \ne 0.
\]

We say that M is positive semidefinite if

\[
r' M r \ge 0, \qquad \forall\, r \in \Re^s.
\]

The following proposition shows that C is positive definite, and if G is positive definite and symmetric, the iteration (6.44) is convergent for