
EE 266 / MS&E 251 Stochastic Control. Prof. S. Lall. June 5–6, 6–7, 7–8, 8–9, 9–10, or 10–11, 2015.

Final exam solutions

This is a 24-hour take-home final.

You may use any books, notes, or computer programs (e.g. MATLAB). You may not discuss the exam with anyone until June 12th, after everyone has taken the exam. The only exception is that you can ask us for clarification by emailing the teaching staff. Our contact information is on the class website. We will keep an updated file with eventual clarifications at the following URL:

http://www.stanford.edu/class/ee266/f2015/typos.txt

Please read this document carefully before asking a question. We have tried to make the exam unambiguous and clear, so we are unlikely to say much more.

Please make a copy of your exam before handing it in.

Please attach the cover page to the front of your exam. Assemble your solutions in order (problem 1, problem 2, problem 3, . . . ), starting each problem on a new page. Put everything associated with each problem (e.g., text, code, plots) together; do not attach code or plots at the end of the final.

We will deduct points from long, needlessly complex solutions, even if they are correct. Our solutions are not long, so if you find that your solution to a problem goes on and on for many pages, you should try to figure out a simpler one. We expect neat, legible exams from everyone, including those enrolled Cr/N.

When a problem involves computation you must give all of the following: a clear discussion and justification of exactly what you did, the source code that produces the result, and the final numerical results or plots. If you do not provide source code and your solutions are different, you will get no credit for the question. To download files containing problem data, you will have to type the whole URL given in the problem into your browser; there are no links on the course web page pointing to these files. To get a file called filename.m, for example, you would retrieve:

http://www.stanford.edu/class/ee266/f2015/filename.m

with your browser.

All problems have equal weight.

Be sure to check your email often during the exam, just in case we need to send out an important announcement.


1. Markov representations of systems. A system is called a Markov decision process if it has the form

x_{k+1} = f(x_k, u_k, w_k)

where x_k is the state, u_k is the control input, and w_k is the disturbance. In addition, we require that x_0, w_0, w_1, . . . are independent.

In each of the following parts, we specify dynamics which are not in the form above, but which can be represented in that form by appropriate choice of state, action, and decision variables. For each, specify how these variables should be chosen, and write down the new dynamics function f. You need not specify initial conditions. Assuming that x_k, u_k, w_k, η_k ∈ R, give the dimension of the state space in each case.

(a) Here, the new state depends on previous states.

x_{k+1} = f(x_k, x_{k−1}, u_k, w_k)

(b) Here we have the same state dependence as in part (a), but in addition the control takes two time steps to take effect.

x_{k+1} = f(x_k, x_{k−1}, u_{k−2}, w_k)

(c) Consider the same dynamics as in part (b), but now the disturbances are no longer independent, and instead satisfy

w_k = −w_{k−1} + η_k

where η_0, η_1, . . . are independent.

Solution

(a) Define y_k = x_{k−1} and define the new state as x̃_k = (x_k, y_k). The system equations are

x_{k+1} = f(x_k, y_k, u_k, w_k)
y_{k+1} = x_k

The new state space is 2-dimensional.

(b) Define y_k = x_{k−1}, s_k = u_{k−1}, v_k = u_{k−2}, and define the new state as x̃_k = (x_k, y_k, s_k, v_k). The system equations are

x_{k+1} = f(x_k, y_k, v_k, w_k)
y_{k+1} = x_k
s_{k+1} = u_k
v_{k+1} = s_k

The new state space is 4-dimensional. Notice that including only v_k but not s_k does not work.


(c) Define y_k = x_{k−1}, s_k = u_{k−1}, v_k = u_{k−2}, z_k = w_{k−1}, and define the new state as x̃_k = (x_k, y_k, s_k, v_k, z_k). The system equations are

x_{k+1} = f(x_k, y_k, v_k, −z_k + η_k)
y_{k+1} = x_k
s_{k+1} = u_k
v_{k+1} = s_k
z_{k+1} = w_k = −z_k + η_k

Here η_k plays the role of the independent disturbance. The new state space is 5-dimensional.
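As a quick sanity check (not part of the original solution), the following MATLAB sketch simulates both the original recursion from part (c) and the augmented Markov recursion, and verifies that they generate the same trajectory. The scalar dynamics f and the input sequence are placeholders chosen only for illustration.

    % Compare the original part (c) recursion with its augmented Markov form.
    % The dynamics f below is a placeholder used only for this check.
    f = @(x, xprev, uold, w) 0.9*x - 0.2*xprev + uold + w;
    N   = 20;
    u   = randn(1, N);                 % arbitrary input sequence
    eta = randn(1, N);                 % IID disturbances eta_k

    % Original recursion, started from zero initial conditions at k = 3.
    x = zeros(1, N+1);  w = zeros(1, N+1);
    for k = 3:N
        w(k)   = -w(k-1) + eta(k);
        x(k+1) = f(x(k), x(k-1), u(k-2), w(k));
    end

    % Augmented recursion with state xt = (x, y, s, v, z).
    xt = zeros(5, N+1);
    xt(:,3) = [x(3); x(2); u(2); u(1); w(2)];    % matching initial condition
    for k = 3:N
        xk = xt(1,k); yk = xt(2,k); sk = xt(3,k); vk = xt(4,k); zk = xt(5,k);
        xt(:,k+1) = [f(xk, yk, vk, -zk + eta(k));   % x_{k+1}
                     xk;                             % y_{k+1} = x_k
                     u(k);                           % s_{k+1} = u_k
                     sk;                             % v_{k+1} = s_k
                     -zk + eta(k)];                  % z_{k+1} = w_k
    end
    max(abs(x - xt(1,:)))              % should be (numerically) zero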


2. First hitting time. Consider a Markov chain with states X = {a, b, c, d}. The transition matrix is given by

P =
[ 0    0    1/2  1/2 ]
[ 1/3  0    1/3  1/3 ]
[ 0    1/2  0    1/2 ]
[ 0    1/2  1/2  0   ]

In each of the parts below we define a type of hitting time, τ. For each, plot the distribution of τ conditioned on x_0 = a. Also report Prob(τ = 10 | x_0 = a).

You should report the approach you use for each part. In particular, if you construct a new Markov chain as part of your method, you should give its transition matrix.

(a) The hitting time τ is the smallest time t such that x_t = d.

(b) The hitting time τ is the smallest time t such that x_t = d and x_s = b for some s < t. That is, τ is the first time state d is visited after visiting state b.

(c) The hitting time τ is the smallest t such that x_t = d and x_s = b, x_u = c for some s < t and u < t. That is, τ is the first time state d is visited after visiting states b and c, in any order.

Solution

(a) Make state d absorbing; the modified transition matrix is

P =
[ 0    0    1/2  1/2 ]
[ 1/3  0    1/3  1/3 ]
[ 0    1/2  0    1/2 ]
[ 0    0    0    1   ]

Using the same method as in Homework 3, question 1, part (a), we obtain Prob(τ = 10 | x_0 = a) = 0.0021.
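One possible MATLAB sketch of this computation (consistent with the method above, though not the original code) is:

    % Hitting-time distribution for part (a): make d absorbing, then
    % Prob(tau <= t | x0 = a) = [Ptilde^t]_{a,d}.
    P = [0    0    1/2  1/2;
         1/3  0    1/3  1/3;
         0    1/2  0    1/2;
         0    1/2  1/2  0  ];
    Pt = P;  Pt(4,:) = [0 0 0 1];    % make state d (index 4) absorbing

    Tmax = 25;
    cdf  = zeros(1, Tmax);           % cdf(t) = Prob(tau <= t | x0 = a)
    mu   = [1 0 0 0];                % start in state a
    for t = 1:Tmax
        mu = mu * Pt;
        cdf(t) = mu(4);
    end
    pmf = diff([0 cdf]);             % Prob(tau = t | x0 = a)
    stem(1:Tmax, pmf)                % plot of the distribution of tau
    pmf(10)                          % reported above as 0.0021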

(b) Consider two copies of the states: a0, b0, c0, d0, a1, b1, c1, d1. Label 1 means that you have visited or are currently visiting state b; label 0 means that you have not visited state b and are not currently visiting b. Notice that b0 is an unreachable state. The transition matrix (with d1 made absorbing) is given below:

P =
[ 0    0    1/2  1/2  0    0    0    0   ]
[ 1/3  0    1/3  1/3  0    0    0    0   ]
[ 0    0    0    1/2  0    1/2  0    0   ]
[ 0    0    1/2  0    0    1/2  0    0   ]
[ 0    0    0    0    0    0    1/2  1/2 ]
[ 0    0    0    0    1/3  0    1/3  1/3 ]
[ 0    0    0    0    0    1/2  0    1/2 ]
[ 0    0    0    0    0    0    0    1   ]

Prob(τ = 10 | x_0 = a) = 0.0231.


[Figure 1: distribution of τ for part (a).]

(c) Since there are two waypoints, we need four labels to encode whether or not we have visited b and/or c. Consider four copies of the states: a0, b0, c0, d0, a1, b1, c1, d1, a2, b2, c2, d2, a3, b3, c3, d3. Label 0 means having visited neither b nor c, 1 means having visited b but not c, 2 means having visited c but not b, and 3 means having visited both b and c. Notice that b0, c0, c1, b2 are unreachable states. The transition matrix (with d3 made absorbing) is given below:

P =
[ 0    0    0    1/2  0    0    0    0    0    0    1/2  0    0    0    0    0   ]
[ 1/3  0    0    1/3  0    0    0    0    0    0    1/3  0    0    0    0    0   ]
[ 0    0    0    1/2  0    1/2  0    0    0    0    0    0    0    0    0    0   ]
[ 0    0    0    0    0    1/2  0    0    0    0    1/2  0    0    0    0    0   ]
[ 0    0    0    0    0    0    0    1/2  0    0    0    0    0    0    1/2  0   ]
[ 0    0    0    0    1/3  0    0    1/3  0    0    0    0    0    0    1/3  0   ]
[ 0    0    0    0    0    1/2  0    1/2  0    0    0    0    0    0    0    0   ]
[ 0    0    0    0    0    1/2  0    0    0    0    0    0    0    0    1/2  0   ]
[ 0    0    0    0    0    0    0    0    0    0    1/2  1/2  0    0    0    0   ]
[ 0    0    0    0    0    0    0    0    1/3  0    1/3  1/3  0    0    0    0   ]
[ 0    0    0    0    0    0    0    0    0    0    0    1/2  0    1/2  0    0   ]
[ 0    0    0    0    0    0    0    0    0    0    1/2  0    0    1/2  0    0   ]
[ 0    0    0    0    0    0    0    0    0    0    0    0    0    0    1/2  1/2 ]
[ 0    0    0    0    0    0    0    0    0    0    0    0    1/3  0    1/3  1/3 ]
[ 0    0    0    0    0    0    0    0    0    0    0    0    0    1/2  0    1/2 ]
[ 0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1   ]

Prob(τ = 10 | x_0 = a) = 0.0322.
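A compact way to build these lifted chains programmatically, consistent with the construction above (an illustrative sketch, not the original code; the waypoint list and the horizon Tmax are the only choices made here), is to attach to each state a bitmask recording which waypoints have been visited:

    % Lifted chain for parts (b) and (c): augment the state with a bitmask of
    % visited waypoints, make the target copy of d absorbing, and proceed as
    % in part (a). Use waypoints = [2] for part (b), [2 3] for part (c).
    P = [0 0 1/2 1/2; 1/3 0 1/3 1/3; 0 1/2 0 1/2; 0 1/2 1/2 0];
    n = 4;  waypoints = [2 3];            % states b and c
    m = 2^numel(waypoints);               % number of visited-set labels
    idx = @(x, s) (s - 1)*n + x;          % linear index of (state x, label s)

    Pl = zeros(n*m, n*m);
    for s = 1:m
        for x = 1:n
            for y = 1:n
                if P(x, y) == 0, continue; end
                snew = s;                 % update the visited-set label
                for j = 1:numel(waypoints)
                    if y == waypoints(j), snew = bitor(snew - 1, 2^(j-1)) + 1; end
                end
                Pl(idx(x, s), idx(y, snew)) = P(x, y);
            end
        end
    end
    target = idx(4, m);                   % state d with all waypoints visited
    Pl(target, :) = 0;  Pl(target, target) = 1;          % absorbing

    Tmax = 40;  mu = zeros(1, n*m);  mu(idx(1, 1)) = 1;  % start at a, nothing visited
    cdf = zeros(1, Tmax);
    for t = 1:Tmax, mu = mu * Pl; cdf(t) = mu(target); end
    pmf = diff([0 cdf]);
    pmf(10)                               % 0.0322 for part (c), 0.0231 for part (b)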


[Figure 2: distribution of τ for part (b).]

[Figure 3: distribution of τ for part (c).]

3. Stock selling with market impact. You have X > 0 shares of stock that you are trying to sell over T days. At the beginning of each day t, you observe last night's closing price p_{t−1}, and decide to sell u_t shares on day t, for t = 0, . . . , T. Since you are a big investor, your trading activities will modify the market. On day t, the price when the market closes is p_t, given by

p_t = p_{t−1} − u_t β + w_t.

Here β > 0 is a known constant, and w_0, w_1, . . . are IID Gaussian with mean 0 and variance 1. Your revenue is u_t p_t (not u_t p_{t−1}) after selling the stocks on day t.

You make T trades given by u_0, . . . , u_T, but the amount you trade at time T is determined by the constraint that all shares must be sold after this final trade. You are allowed to trade fractional shares.

(a) You would like to maximize the expected total revenue. Formulate this problem as a minimum cost problem for a Markov decision process. In particular, define the state space, action space, the dynamics function, the stage cost, and the terminal cost.

(b) This problem has a very special structure which allows it to be solved analytically. Do so, and give expressions for the optimal policy, the total expected revenue, and the value function at each time t. Simplify your solution as much as possible. Note that the only parameters in the problem are X, β and T, and you should express your answers in terms of these.

Solution

(a) The state at the beginning of day t consists of the previous closing price p_{t−1} and the number of shares left x_t; the action is u_t, the number of shares sold on day t. The dynamics are x_{t+1} = x_t − u_t and p_t = p_{t−1} − u_t β + w_t. The stage cost is g_t = −u_t p_t (the negative of the revenue), and the value function satisfies

v_t(x_t, p_{t−1}) = min_{u_t} E[ g_t + v_{t+1}(x_t − u_t, p_t) ].

(b) Since all shares need to be sold by the end of day T, u_T = x_T. Therefore

v_T(x_T, p_{T−1}) = E[g_T] = E[−x_T p_T] = −p_{T−1} x_T + β x_T^2.

For time T − 1:

v_{T−1}(x_{T−1}, p_{T−2})
  = min_{u_{T−1}} E[ g_{T−1} + v_T(x_T, p_{T−1}) ]
  = min_{u_{T−1}} E[ −u_{T−1} p_{T−1} − p_{T−1}(x_{T−1} − u_{T−1}) + β (x_{T−1} − u_{T−1})^2 ]
  = min_{u_{T−1}} ( −x_{T−1} E[p_{T−1}] + β (x_{T−1} − u_{T−1})^2 )
  = min_{u_{T−1}} ( β u_{T−1}^2 − β u_{T−1} x_{T−1} + β x_{T−1}^2 − x_{T−1} p_{T−2} )
  = −x_{T−1} p_{T−2} + (3/4) β x_{T−1}^2,

where we used E[p_{T−1}] = p_{T−2} − u_{T−1} β, and u*_{T−1} = (1/2) x_{T−1}.


For time T − 2:

v_{T−2}(x_{T−2}, p_{T−3})
  = min_{u_{T−2}} E[ g_{T−2} + v_{T−1}(x_{T−1}, p_{T−2}) ]
  = min_{u_{T−2}} E[ −u_{T−2} p_{T−2} − p_{T−2}(x_{T−2} − u_{T−2}) + (3/4) β (x_{T−2} − u_{T−2})^2 ]
  = min_{u_{T−2}} ( −x_{T−2} E[p_{T−2}] + (3/4) β (x_{T−2} − u_{T−2})^2 )
  = min_{u_{T−2}} ( (3/4) β u_{T−2}^2 − (1/2) β u_{T−2} x_{T−2} + (3/4) β x_{T−2}^2 − x_{T−2} p_{T−3} )
  = −x_{T−2} p_{T−3} + (2/3) β x_{T−2}^2,

and u*_{T−2} = (1/3) x_{T−2}.

We now use induction to show that

v_{T−k}(x_{T−k}, p_{T−(k+1)}) = −x_{T−k} p_{T−(k+1)} + ((k+2)/(2k+2)) β x_{T−k}^2,
u*_{T−k} = (1/(k+1)) x_{T−k}.

The result is true for k = 1, 2. Suppose the result is true for k; then for k + 1:

v_{T−(k+1)}(x_{T−(k+1)}, p_{T−(k+2)})
  = min_{u_{T−(k+1)}} E[ g_{T−(k+1)} + v_{T−k}(x_{T−k}, p_{T−(k+1)}) ]
  = min_{u_{T−(k+1)}} E[ −u_{T−(k+1)} p_{T−(k+1)} − p_{T−(k+1)}(x_{T−(k+1)} − u_{T−(k+1)}) + ((k+2)/(2k+2)) β (x_{T−(k+1)} − u_{T−(k+1)})^2 ]
  = min_{u_{T−(k+1)}} ( −x_{T−(k+1)} E[p_{T−(k+1)}] + ((k+2)/(2k+2)) β (x_{T−(k+1)} − u_{T−(k+1)})^2 )
  = min_{u_{T−(k+1)}} ( ((k+2)/(2k+2)) β u_{T−(k+1)}^2 − (1/(k+1)) β u_{T−(k+1)} x_{T−(k+1)} + ((k+2)/(2k+2)) β x_{T−(k+1)}^2 − x_{T−(k+1)} p_{T−(k+2)} )
  = −x_{T−(k+1)} p_{T−(k+2)} + ((k+3)/(2k+4)) β x_{T−(k+1)}^2,

and u*_{T−(k+1)} = (1/(k+2)) x_{T−(k+1)}, which completes the induction. The optimal policy is to sell 1/(k+1) of the remaining shares when k+1 trades remain; along the optimal trajectory this means selling an equal number of shares every day. The expected revenue under the optimal policy is

−v_1(x_1, p_0) = x_1 p_0 − ((T+1)/(2T)) β x_1^2.
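A quick numerical check of this closed form (an illustrative sketch, not part of the original solution): writing v_{T−k}(x, p) = −x p + c_k β x^2, completing the square in the Bellman recursion gives c_{k+1} = 1 − 1/(4 c_k) with c_0 = 1, and the fraction of remaining shares sold is (2 c_k − 1)/(2 c_k). The MATLAB lines below compare these with c_k = (k+2)/(2k+2) and u*_{T−k} = x_{T−k}/(k+1).

    % Verify the coefficients of the closed-form value function and policy.
    K = 20;                              % number of backward steps (placeholder)
    c = zeros(1, K+1);  c(1) = 1;        % c(1) stores c_0 = 1 (terminal step)
    ufrac = zeros(1, K);                 % fraction of remaining shares sold
    for k = 1:K
        ufrac(k) = (2*c(k) - 1) / (2*c(k));   % equals u*_{T-k} / x_{T-k}
        c(k+1)   = 1 - 1/(4*c(k));            % c(k+1) stores c_k
    end
    kk = 0:K;
    max(abs(c - (kk+2)./(2*kk+2)))       % should be ~0
    max(abs(ufrac - 1./((1:K)+1)))       % should be ~0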


4. Valet parking. You are driving to San Francisco for a well-deserved night out after your last exam and are looking for a parking spot on your way to the bar.

The parking places are arranged in a line along the street along which you are driving. You incur a cost k for parking k places away from the bar (to take into account the pain of walking). If you have still not parked when reaching the bar, you will give up trying, and pay γ > 0 for the valet parking option. Each place, at distance k, is free with probability p_k, independent of the others. Finally, everyone seems to have gone out with their SUV tonight, and you are driving a mini, so you cannot observe if a parking spot is free until you reach it.

You start at a distance T = 15 from the bar, and every time step you move forward one place. The state space is X = {P, P̄}, where x_t = P if you are parked at time t. The action space is U = {S, S̄}, where taking action u_t = S means that you stop and park at that time.

(a) What are the dynamics function, stage cost, and terminal cost?

(b) Suppose γ = 50 and p_k = min(1, 0.5 + 0.01k). Find the optimal policy. State the optimal policy in words.

(c) Plot the two components of the optimal value function v_t as a function of time.

(d) Explain the shape of the graph of the value function v_t(P).

(e) Explain the shape of the graph of the value function v_t(P̄).

(f) Suppose now that p_k = p, independent of the location k. Let q = 1 − p. Show that there is some k* such that, for k < k*,

v_k(P̄) = k + γ q^k − (q/p)(1 − q^k).

Solution

(a) Let the state be x_k ∈ {P, P̄}, where P represents the driver having parked before reaching the spot k places from the bar. Let the control at each parking spot be u_k ∈ {S, S̄}, where S represents the choice to stop (i.e., park) in the kth spot. Let the disturbance be

w_k = 1 if the kth spot is free, and w_k = 0 otherwise,

and let the cost be

g_k(x_k, u_k, w_k) =
  k    if x_k = P̄, u_k = S, w_k = 1
  γ    if x_0 = P̄   (terminal cost: reaching the bar unparked)
  ∞    if u_k = S and (x_k = P or w_k = 0)
  0    otherwise.


The system evolves according to

x_{k+1} = P if x_k = P or u_k = S, and x_{k+1} = P̄ otherwise.

(b) Once the driver has parked, the remaining cost is zero, so V_k(P) = 0, and V_k(P̄) is the expected remaining cost given that the driver has not parked before reaching the spot k places from the bar. We solve by value iteration in the distance index k (where t = T − k):

V_0(P̄) = γ,
V_k(P̄) = min( p_k k + (1 − p_k) V_{k−1}(P̄),  V_{k−1}(P̄) )
        = p_k min(k, V_{k−1}(P̄)) + (1 − p_k) V_{k−1}(P̄),

where the first argument of the outer min corresponds to trying to park at the spot k places from the bar (park if it is free, continue otherwise) and the second to driving on. Based on this equation, you should park at spot k whenever k ≤ V_{k−1}(P̄). There exists some k* such that it is optimal to continue if k ≥ k* and optimal to park if k < k*; it is the smallest integer satisfying k* ≥ V_{k*−1}(P̄). Such a k* exists because (V_k(P̄)) is a nonincreasing sequence bounded above by γ, while k grows without bound.

We find that the optimal policy is to take the first free spot once you are fewer than k* = 5 places from the bar.
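A minimal MATLAB sketch of this recursion follows (the indexing of p_k relative to the spot at which the decision is taken is an assumption here, since the original code is not available):

    % Value iteration for part (b), indexed by the distance k to the bar.
    gamma = 50;  T = 15;
    p = @(k) min(1, 0.5 + 0.01*k);      % probability that spot k is free
    V = zeros(1, T+1);                  % V(k+1) stores V_k(Pbar)
    V(1) = gamma;                       % V_0 = gamma: reach the bar unparked
    for k = 1:T
        V(k+1) = p(k)*min(k, V(k)) + (1 - p(k))*V(k);
    end
    kstar = find((1:T) >= V(1:T), 1)    % smallest k with k >= V_{k-1}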

(c) We have the following plot (with k* shown dotted):

[Plot: the two components v(P̄) and v(P) of the value function.]


(d) When the driver has parked, the subsequent costs are all zero, so v_t(P) is identically zero.

(e) As we get closer to the bar, having to pay for the valet parking becomes more likely, all the more so because the probability of finding a free spot is smaller near the bar.

(f) By induction, we prove that for k < k*, using the notation q = 1 − p,

V_k(P̄) = k + γ q^k − (q/p)(1 − q^k).

Using that k* ≥ V_{k*−1}(P̄) and rearranging the inequality,

(q + pγ)^{−1} ≥ q^{k*−1},

so

k* = ⌈ ln((q + γp)^{−1}) / ln(q) + 1 ⌉.

The policy is therefore to stop at the first available spot at distance k < k*.
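The closed form can be checked against the recursion from part (b) for a constant p; the values of p, γ, and T below are placeholders used only for the check.

    % Compare the part (f) closed form with value iteration for constant p.
    p = 0.3;  q = 1 - p;  gamma = 50;  T = 40;
    V = zeros(1, T+1);  V(1) = gamma;            % V(k+1) stores V_k(Pbar)
    for k = 1:T
        V(k+1) = p*min(k, V(k)) + q*V(k);
    end
    kstar_vi = find((1:T) >= V(1:T), 1);         % smallest k with k >= V_{k-1}
    kstar_cf = ceil(log(1/(q + gamma*p))/log(q) + 1);
    [kstar_vi, kstar_cf]                         % the two should agree
    k = 1:(kstar_vi - 1);
    max(abs(V(k+1) - (k + gamma*q.^k - (q/p)*(1 - q.^k))))   % ~0 for k < k*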


5. ID, please. You finally reached the bar after an endless walk (despite parking optimally).

[Diagram: two queues, with lengths q^(1) and q^(2), served by a single bouncer.]

They have a system with two queues: a VIP one, whose length at time t is q_t^(1), and a regular one with length q_t^(2). Queue i has capacity Q^(i). At each time period there are d_t^(i) arrivals in each queue, with d_t^(i) being zero or one. Suppose furthermore that d_0, . . . , d_{T−1} are IID, but the arrivals d_t^(1) and d_t^(2) are not independent. The joint distribution of d_t is given in the data file.

There is only one bouncer (server), who can process one customer from either queue in each time period, for a reward v_i > 0 that depends on the queue served. After service, before the next time period, we count new customers and reject them at a cost r_i > 0 each if the queue is full. The number of customers rejected by queue i is

(q_t^(i) + d_t^(i) − u_t^(i) − Q^(i))_+.

There is also a quadratic cost associated with the queue length, given by

g_t^(i)(q_t^(i), u_t^(i), d_t^(i)) = a^(i) (q_t^(i))^2 + b^(i) q_t^(i).

The terminal cost is zero, and initially both queues are empty. The time horizon is T = 101.

(a) Using the provided bar_queue_data.m file, plot the optimal policy and value function for t = 0, 20, 40, 60, 80, 100. Comment on your findings. What is the mean total cost?

(b) Plot the expected value of the components g_t^(1), g_t^(2), g_t^reject, g_t^reward of the total expected cost as a function of t under the optimal policy.

(c) Suppose now the policy is to always serve the VIP queue if it is non-empty, otherwise to serve the second queue. Plot the policy and value function as in part (a), and the components of the total cost as in part (b). Give the mean total cost, and compare with the optimal policy.

(d) Plot two sample trajectories showing q_t, one using the optimal policy, the other using the heuristic policy.

Solution


(a) The total cost is around 1723. At t = T, we minimise cost by processing a VIP (bigger reward). We also notice that the optimal policy converges (before the end effects due to the finite horizon T). The optimal policy plots are given below, after a sketch of the recursion.
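The backward dynamic programming recursion can be sketched as follows. All problem data (capacities, cost coefficients, rewards, and the joint arrival distribution) come from bar_queue_data.m, which is not reproduced here, so the values below are placeholders chosen only to illustrate the structure of the computation.

    % Backward DP over the joint queue state (q1, q2); placeholder data.
    Q  = [5 10];                  % queue capacities Q^(1), Q^(2)
    a  = [0.5 0.25];  b = [1 1];  % quadratic queue-length cost coefficients
    r  = [20 20];     v = [3 1];  % rejection costs and service rewards
    Pd = [0.3 0.3; 0.2 0.2];      % joint arrival distribution, rows d1 = 0/1, cols d2 = 0/1
    T  = 101;

    n1 = Q(1) + 1;  n2 = Q(2) + 1;
    V  = zeros(n1, n2);           % terminal cost is zero
    mu = zeros(n1, n2, T);        % mu(:,:,t): which queue to serve at time t
    for t = T:-1:1
        Vnew = zeros(n1, n2);
        for i = 1:n1
            for j = 1:n2
                q = [i-1, j-1];  best = inf;
                for u = 1:2                        % queue to serve
                    serve = min(q(u), 1);
                    cost  = a(1)*q(1)^2 + b(1)*q(1) + a(2)*q(2)^2 + b(2)*q(2) ...
                            - v(u)*serve;          % stage cost minus service reward
                    EV = 0;
                    for d1 = 0:1
                        for d2 = 0:1
                            qs  = q;  qs(u) = qs(u) - serve;   % after service
                            d   = [d1 d2];
                            rej = max(qs + d - Q, 0);          % rejected arrivals
                            qn  = min(qs + d, Q);              % next state
                            EV  = EV + Pd(d1+1, d2+1)*(r*rej' + V(qn(1)+1, qn(2)+1));
                        end
                    end
                    if cost + EV < best, best = cost + EV; mu(i,j,t) = u; end
                end
                Vnew(i,j) = best;
            end
        end
        V = Vnew;
    end
    V(1,1)    % expected total cost starting from empty queues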

[Six plots: the optimal policy as a function of (q^(1), q^(2)) at times t = 0, 20, 40, 60, 80, 100.]

Value function plots are below. The value function converges in shape, and is then shifted upwards by the average cost per stage (cf. the lecture on infinite horizon problems).

[Surface plots: the optimal value function over (q^(1), q^(2)) at times t = 0 and t = 20.]


[Surface plots: the optimal value function over (q^(1), q^(2)) at times t = 40, 60, 80, and 100.]

(b) We plot the reward as a positive quantity so that it can be compared with the other components. There are very few rejections. The cost incurred by the VIP queue is higher: the cost structure is such that at full capacity both queues have the same buffer cost, but the VIP queue reaches it more quickly.


[Plot: cost structure for the optimal policy, showing the components for queue 1, queue 2, rejection, and reward as functions of t.]

(c) With the heuristic policy, the total cost starting from the initial (empty) queues is 1870, bigger than under the optimal policy. We obviously serve queue 1 as soon as it is non-empty. Plots are given below.

[Plot: the heuristic policy at t = 0.]

The queue-length cost of queue 2 is the driving factor, because we are focusing on the VIP queue.


[Plot: cost structure for the heuristic policy, showing the components for queue 1, queue 2, rejection, and reward as functions of t.]

(d) As expected, under the heuristic policy the VIP queue has at most one person in it. The plot illustrates why the cost of queue 2 is so high with the heuristic policy: because we mainly serve queue 1, queue 2 has more people than in the optimal scenario, and thus incurs a higher cost. We also notice the effect of the correlated arrivals: rejecting a customer in our model is rather expensive, and since a fifth of the time we have simultaneous arrivals, the optimal policy tries to avoid having two full queues (i.e., it sometimes serves the regular queue despite the VIP queue being full).

[Plots: sample paths of q^(1) and q^(2) under the optimal policy and under the heuristic policy.]

Note: We have one server, and on average one arrival per time step. If we did not have a maximum capacity, the queue length would diverge to ∞, which explains why we have rejections.


6. Uh oh . . . You thought your night out was going well until your friend Tex got into a fight. You bring him to the ER, and a surgeon is about to operate. You believe you are sober enough to help him maximise the success rate of the operation.

The surgeon is using a steerable needle to inject a drug into a precise spot. The needle has an angled tip, and so when pushed it tends to move in a curve. By rotating the shaft of the needle the surgeon can steer the needle through the body, avoiding critical organs, and target the injection to a particular problem area.

We use a very simplified two-dimensional grid model here. The needle can either move North or East. However, there is uncertainty due to unpredictable tissue/needle interaction. If the surgeon pushes in the same direction as in the previous time-step, there is a higher probability of moving in the desired direction than if the surgeon tries to change direction (as illustrated below). The first move is treated as though there was a move in the same direction previously.

[Figure 4: Probability distribution over the next position when the surgeon pushes in the same direction as previously (left: probability 0.8 of moving in the pushed direction, 0.2 of moving in the other one) and when the surgeon changes direction before pushing further (right: probability 0.6 of moving in the pushed direction, 0.4 of moving in the other one). The main direction is plotted solid, the random deviation dotted.]

The operating workspace is a 41 × 41 grid. The starting point is in the bottom left corner and the target is in the top right corner. We assume that at each time step, the surgeon observes the current location via ultrasound and pushes the needle in a chosen direction, North or East. Reaching a critical organ, or a boundary of the workspace, results in an operation failure.

Using the surgeon_data.m file, answer the following questions:

(a) Give the initial direction that maximises the success rate of the operation, and give the corresponding maximum rate.

(b) Plot the optimal policy and value function. Comment on your results.

(c) Simulate the system to estimate the success rate and plot a sample trajectory of a successful operation.

Solution


[Figure 5: Grid representation of the workspace.]

(a) The state space consists of the position of the needle and the direction pushed before reaching that position:

x ∈ {0, . . . , l}^2 × {0, 1},

with l = 40. The input space is simply {0, 1} (i.e., keep the same direction or change it).

We solve the stochastic shortest path problem whose value function satisfies the following equation, where N(x) is the set of points reachable from x in one time step (independent of u in our scenario):

J*(x) = max_u Σ_{v ∈ N(x)} P_{xv}(u) J*(v).

Here J* is the probability of success: J*(x) = 1 for x ∈ T and J*(x) = 0 for x ∈ R, T being the target region and R the rejection (failure) region.

We find a success rate of 0.19 for starting off by cutting in the North direction, and 0.26 by cutting first in the East direction. Therefore, the surgeon should insert the needle in the East direction. (We also notice by inspection that there is a critical region right above the starting point...). Our friend's chances remain slim though.
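The computation can be sketched in MATLAB as follows. The critical-organ map comes from surgeon_data.m, which is not reproduced here, so an empty map is used below as a placeholder; with the real map, the two numbers printed at the end are the 0.19 and 0.26 reported above.

    % Success-probability value iteration on the grid (illustrative sketch).
    l = 41;                                 % workspace is l-by-l
    organ = false(l+1, l+1);                % placeholder critical-organ map (padded)
    p_same = 0.8;  p_change = 0.6;          % prob. of moving in the pushed direction

    % J(i, j, d): success probability at cell (i, j) = (x+1, y+1), given that
    % the previous push was d (1 = North, i.e. +y; 2 = East, i.e. +x). The
    % extra row and column of zeros represent leaving the workspace (failure).
    J = zeros(l+1, l+1, 2);
    J(l, l, :) = 1;                         % target: top right corner of the grid
    for it = 1:2*l                          % enough sweeps: the dynamics are acyclic
        Jold = J;
        for d = 1:2
            for i = 1:l
                for j = 1:l
                    if (i == l && j == l) || organ(i, j), continue; end
                    % value of pushing North (u = 1) or East (u = 2); the next
                    % state records the pushed direction, whatever the outcome
                    if d == 1
                        vN = p_same  *Jold(i, j+1, 1) + (1 - p_same  )*Jold(i+1, j, 1);
                        vE = p_change*Jold(i+1, j, 2) + (1 - p_change)*Jold(i, j+1, 2);
                    else
                        vN = p_change*Jold(i, j+1, 1) + (1 - p_change)*Jold(i+1, j, 1);
                        vE = p_same  *Jold(i+1, j, 2) + (1 - p_same  )*Jold(i, j+1, 2);
                    end
                    J(i, j, d) = max(vN, vE);
                end
            end
        end
    end
    % The first push counts as "same direction as before":
    succN = p_same*J(1, 2, 1) + (1 - p_same)*J(2, 1, 1);
    succE = p_same*J(2, 1, 2) + (1 - p_same)*J(1, 2, 2);
    [succN, succE]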

(b) Plots are given below.

We notice how the critical regions affect the input of the surgeon. We also notice that the upper left and lower regions have zero probability of success, because the surgeon will exit the workspace. The policies only have minor differences.

(c) Using Monte Carlo simulation we find a success rate of 0.26; a sample successful trajectory is shown in Figure 8.
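A Monte Carlo check can be run by simulating the greedy policy with respect to J; the snippet below continues the previous sketch and reuses J, organ, p_same, p_change and l from it.

    % Monte Carlo estimate of the success rate under the computed policy.
    nsim = 1e5;  nsucc = 0;
    for s = 1:nsim
        i = 1;  j = 1;  d = 2;                 % start bottom-left; first push counts as East
        while true
            if i == l && j == l, nsucc = nsucc + 1; break; end
            if i > l || j > l || organ(i, j), break; end    % failure
            % choose the push u maximising the one-step lookahead of J
            pN = p_change;  if d == 1, pN = p_same; end
            pE = p_change;  if d == 2, pE = p_same; end
            vN = pN*J(i, j+1, 1) + (1 - pN)*J(i+1, j, 1);
            vE = pE*J(i+1, j, 2) + (1 - pE)*J(i, j+1, 2);
            if vN >= vE, u = 1; p = pN; else, u = 2; p = pE; end
            mv = 3 - u;  if rand < p, mv = u; end           % realised move
            if mv == 1, j = j + 1; else, i = i + 1; end
            d = u;                             % the state records the pushed direction
        end
    end
    nsucc / nsim                               % about 0.26 with the real organ map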


[Figure 6: Success probability for reaching states with a North cut (top) and East cut (bottom).]


[Figure 7: Optimal policy when reaching states with a North cut (left) and an East cut (right). Red corresponds to a North cut, green to East.]


[Figure 8: A successful operation (sample trajectory).]
