Dynamic Optimization and Learning for Renewal Systems


Transcript of Dynamic Optimization and Learning for Renewal Systems

Page 1: Dynamic Optimization and Learning for Renewal Systems

Dynamic Optimization and Learning for Renewal Systems

Michael J. Neely, University of Southern California
Asilomar Conference on Signals, Systems, and Computers, Nov. 2010

PDF of paper at: http://ee.usc.edu/stochastic-nets/docs/renewal-systems-asilomar2010.pdf
Sponsored in part by the NSF CAREER grant CCF-0747525 and the ARL Network Science Collaborative Tech. Alliance

[Figure: a network of transceiver (T/R) nodes and a Network Coordinator processing Tasks 1-3 over a timeline t of renewal frames T[0], T[1], T[2]]

Page 2: Dynamic Optimization and Learning for Renewal Systems

A General Renewal System

[Figure: timeline t of renewal frames T[0], T[1], T[2], with penalty vectors y[0], y[1], y[2]]

• Renewal frames r in {0, 1, 2, …}.
• π[r] = policy chosen on frame r.
• P = abstract policy space (π[r] in P for all r).
• Policy π[r] affects the frame size and penalty vector on frame r:
   • y[r] = [y0(π[r]), y1(π[r]), …, yL(π[r])] = penalty vector
   • T[r] = T(π[r]) = frame duration

Page 3: Dynamic Optimization and Learning for Renewal Systems

A General Renewal System (continued)

The frame size and penalty vector are random functions of π[r] (their distributions depend on π[r]):

• y[r] = [y0(π[r]), y1(π[r]), …, yL(π[r])] = penalty vector
• T[r] = T(π[r]) = frame duration

Page 4: Dynamic Optimization and Learning for Renewal Systems

A General Renewal System (example realization)

On frame r, the policy π[r] produces a random outcome, e.g.:

• y[r] = [1.2, 1.8, …, 0.4]
• T[r] = 8.1 = frame duration

Page 5: Dynamic Optimization and Learning for Renewal Systems

A General Renewal System (another realization)

• y[r] = [0.0, 3.8, …, -2.0]
• T[r] = 12.3 = frame duration

Page 6: Dynamic Optimization and Learning for Renewal Systems

A General Renewal System (another realization)

• y[r] = [1.7, 2.2, …, 0.9]
• T[r] = 5.6 = frame duration
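A minimal sketch of this abstraction in Python (all names and distributions below are illustrative placeholders, not from the paper): each policy induces a random frame length T and penalty vector y.

```python
import random

class RenewalSystem:
    """Toy renewal system: each policy induces a random (T, y) pair per frame."""
    def run_frame(self, pi):
        # Frame length and penalties are random functions of the policy pi;
        # the exponential/Gaussian choices below are illustrative only.
        T = random.expovariate(1.0 / pi["mean_frame"])
        y = [random.gauss(m, 0.5) for m in pi["mean_penalties"]]
        return T, y

# Two abstract "policies" with different frame/penalty statistics.
P = [{"mean_frame": 8.0,  "mean_penalties": [1.0, 2.0]},
     {"mean_frame": 12.0, "mean_penalties": [0.5, 3.5]}]
system = RenewalSystem()
for r in range(3):
    T, y = system.run_frame(random.choice(P))
    print(f"frame {r}: T[r] = {T:.1f}, y[r] = {[round(v, 1) for v in y]}")
```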

Page 7: Dynamic Optimization and Learning for Renewal Systems

Example 1: Opportunistic Scheduling

S[r] = (S1[r], S2[r], S3[r])

• All frames = 1 slot.
• S[r] = (S1[r], S2[r], S3[r]) = channel states for slot r.
• Policy π[r]: on frame r, first observe S[r], then choose a channel to serve (i.e., one of {1, 2, 3}). A decision sketch follows below.
• Example objectives: throughput, energy, fairness, etc.
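As a sketch of this per-slot decision (hypothetical reward model: serving channel i yields Si[r] units of throughput; the max-weight rule used here is one standard choice, not necessarily the slide's):

```python
import random

def serve_best_channel(S, weights=(1.0, 1.0, 1.0)):
    """Observe channel states S = (S1, S2, S3), then serve the channel with
    the largest weighted rate (weights could encode fairness or energy)."""
    return max(range(len(S)), key=lambda i: weights[i] * S[i])

# One slot: random channel states, then the scheduling decision.
S = tuple(random.choice([0, 1, 2]) for _ in range(3))
print("S[r] =", S, "-> serve channel", serve_best_channel(S) + 1)
```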

Page 8: Dynamic Optimization and Learning for Renewal Systems

Example 2: Markov Decision Problems

• M(t) = recurrent Markov chain (continuous or discrete time).
• Renewals are defined as recurrences to state 1.
• T[r] = random inter-renewal frame size (frame r).
• y[r] = penalties incurred over frame r.
• π[r] = policy that affects the transition probabilities over frame r.

• Objective: minimize the time average of one penalty subject to time-average constraints on the others. (A simulation sketch follows the figure below.)

[Figure: recurrent Markov chain with states 1, 2, 3, 4]
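A small sketch of the renewal structure (the 4-state transition matrix is an arbitrary example): frames are delimited by returns to state 1.

```python
import random

# Illustrative 4-state recurrent chain; row i is the distribution out of state i.
P = [[0.1, 0.5, 0.2, 0.2],
     [0.3, 0.1, 0.4, 0.2],
     [0.5, 0.2, 0.1, 0.2],
     [0.6, 0.1, 0.1, 0.2]]

def one_renewal_frame(start=0):
    """Run the chain from state 1 (index 0) until it first returns there;
    the number of steps taken is the inter-renewal frame size T[r]."""
    state, T = start, 0
    while True:
        state = random.choices(range(4), weights=P[state])[0]
        T += 1
        if state == start:
            return T

print([one_renewal_frame() for _ in range(5)])   # sample frame sizes T[r]
```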

Page 9: Dynamic Optimization and Learning for Renewal Systems

Example 3: Task Processing over Networks

[Figure: network of T/R nodes with a Network Coordinator]

• Infinite sequence of tasks, e.g., query sensors and/or perform computations.
• Renewal frame r = processing time for frame r.
• Policy types:
   • Low level: {specify transmission decisions over the network}
   • High level: {Backpressure1, Backpressure2, Shortest Path}

• Example objective: maximize quality of information per unit time subject to per-node power constraints.


Page 10: Dynamic Optimization and Learning for Renewal Systems

Quick Review of Renewal-Reward Theory (Pop Quiz on the Next Slide!)

Define the frame average of y0[r] over the first R frames (and similarly T̄[R] for T[r]):

  ȳ0[R] = (1/R) Σ_{r=0}^{R-1} y0[r]

The time average of y0[r] is then:

  ( Σ_{r=0}^{R-1} y0[r] ) / ( Σ_{r=0}^{R-1} T[r] )  =  ȳ0[R] / T̄[R]

*If behavior is i.i.d. over frames, by the LLN this converges to E{y0}/E{T}.
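A quick numeric check of this fact, with illustrative i.i.d. uniform distributions; it also previews the pop quiz by showing that E{y0/T} is a different number:

```python
import random

random.seed(1)
R = 100_000
y0 = [random.uniform(0, 2) for _ in range(R)]   # energy per frame, E{y0} = 1.0
T  = [random.uniform(1, 3) for _ in range(R)]   # frame duration,   E{T}  = 2.0

time_avg  = sum(y0) / sum(T)                       # (total energy)/(total time)
avg_ratio = sum(y / t for y, t in zip(y0, T)) / R  # E{y0/T}: a different quantity
print(round(time_avg, 3))    # ~0.5  = E{y0}/E{T}
print(round(avg_ratio, 3))   # ~0.55 = E{y0/T}, not the time average power
```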

Page 11: Dynamic Optimization and Learning for Renewal Systems

Pop Quiz: (10 points)

• Let y0[r] = energy expended on frame r.
• Time average power = (total energy used)/(total time).
• Suppose (for simplicity) behavior is i.i.d. over frames.

To minimize time average power, which one should we minimize?

(a) E{y0}/E{T}        (b) E{y0/T}

Page 12: Dynamic Optimization and Learning for Renewal Systems

Pop Quiz: (10 points)

• Let y0[r] = energy expended on frame r.
• Time average power = (total energy used)/(total time).
• Suppose (for simplicity) behavior is i.i.d. over frames.

To minimize time average power, which one should we minimize?

(a) E{y0}/E{T}        (b) E{y0/T}

Answer: minimize (a), the ratio of expectations E{y0}/E{T}. By renewal-reward theory this ratio equals the time average power, whereas the per-frame expectation E{y0/T} is a different quantity.

Page 13: Dynamic Optimization and Learning for Renewal Systems

Two General Problem Types:

1) Minimize a time average subject to time-average constraints:

   Minimize:    E{y0}/E{T}
   Subject to:  E{yl}/E{T} ≤ cl  for all l in {1, …, L}
                π[r] in P for all frames r

2) Maximize a concave function φ(x1, …, xL) of the time averages:

   Maximize:    φ( E{y1}/E{T}, …, E{yL}/E{T} )
   Subject to:  E{yl}/E{T} ≤ cl  for all l in {1, …, L}
                π[r] in P for all frames r

Page 14: Dynamic Optimization and Learning for Renewal Systems

Solving the Problem (Type 1):

Define a “Virtual Queue” for each inequality constraint:

[Figure: virtual queue Zl[r] with arrival process yl[r] and service process cl·T[r]]

Zl[r+1] = max[Zl[r] – clT[r] + yl[r], 0]
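In code, this update is one line per constraint (a minimal sketch; the example values are arbitrary):

```python
def update_virtual_queues(Z, y, c, T):
    """Z_l[r+1] = max(Z_l[r] - c_l*T[r] + y_l[r], 0), one queue per constraint."""
    return [max(Zl - cl * T + yl, 0.0) for Zl, yl, cl in zip(Z, y, c)]

# Example with arbitrary numbers: two constraints, targets c = (1.0, 0.5).
print(update_virtual_queues(Z=[3.0, 0.2], y=[1.5, 0.1], c=[1.0, 0.5], T=2.0))
# -> [2.5, 0.0]
```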

Page 15: Dynamic Optimization and Learning for Renewal Systems

Lyapunov Function and “Drift-Plus-Penalty Ratio”:

[Figure: sample paths of virtual queues Z1(t) and Z2(t)]

• Scalar measure of queue sizes (the Lyapunov function):

  L[r] = Z1[r]² + Z2[r]² + … + ZL[r]²

• Frame-based Lyapunov drift:

  Δ(Z[r]) = E{L[r+1] – L[r] | Z[r]}

• Algorithm technique: every frame r, observe Z1[r], …, ZL[r]. Then choose a policy π[r] in P to minimize the "drift-plus-penalty ratio":

  [ Δ(Z[r]) + V·E{y0[r] | Z[r]} ] / E{T[r] | Z[r]}

Page 16: Dynamic Optimization and Learning for Renewal Systems

The Algorithm Becomes:

• Observe Z[r] = (Z1[r], …, ZL[r]). Choose π[r] in P to minimize:

  [ Δ(Z[r]) + V·E{y0[r] | Z[r]} ] / E{T[r] | Z[r]}

• Then update the virtual queues:

  Zl[r+1] = max[Zl[r] – cl·T[r] + yl[r], 0]
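A skeleton of the per-frame algorithm, under the simplifying assumption that the conditional expectations can be evaluated for each candidate policy (the `expected_stats` model below is an illustrative placeholder; Δ(Z[r]) is replaced by its standard bound Σl Zl[r]·E{yl – cl·T | Z[r]}, which differs by an additive constant):

```python
V, c = 50.0, [1.0]            # tradeoff parameter V and constraint target c1
policies = [0, 1, 2]          # stand-in for the abstract policy space P

def expected_stats(pi):
    """Placeholder model of E{y0 | pi}, E{y1 | pi}, E{T | pi}."""
    return 1.0 + pi, [2.0 - 0.5 * pi], 1.0 + 0.3 * pi    # illustrative

def dpp_ratio(pi, Z):
    # Numerator uses the drift bound sum_l Z_l*(y_l - c_l*T) in place of
    # Delta(Z[r]); the two differ by a constant that does not affect argmin.
    y0, y, T = expected_stats(pi)
    drift = sum(Zl * (yl - cl * T) for Zl, yl, cl in zip(Z, y, c))
    return (drift + V * y0) / T

Z = [0.0]
for r in range(5):
    pi = min(policies, key=lambda p: dpp_ratio(p, Z))    # policy selection
    y0, y, T = expected_stats(pi)      # here, realized = expected (toy model)
    Z = [max(Zl - cl * T + yl, 0.0) for Zl, yl, cl in zip(Z, y, c)]
    print(f"frame {r}: pi[r]={pi}, Z={[round(z, 2) for z in Z]}")
```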

Page 17: Dynamic Optimization and Learning for Renewal Systems

Theorem: Assume the constraints are feasible. Then under this algorithm (which minimizes the DPP ratio below on every frame r in {1, 2, 3, …}):

  DPP ratio:  [ Δ(Z[r]) + V·E{y0[r] | Z[r]} ] / E{T[r] | Z[r]}

(a) All time-average constraints are satisfied:

  lim sup_{R→∞} ( Σ_{r=0}^{R-1} yl[r] ) / ( Σ_{r=0}^{R-1} T[r] ) ≤ cl  for all l in {1, …, L}

(b) The time-average penalty is within O(1/V) of the optimal value:

  lim sup_{R→∞} ( Σ_{r=0}^{R-1} y0[r] ) / ( Σ_{r=0}^{R-1} T[r] ) ≤ (optimal value) + O(1/V)

Page 18: Dynamic Optimization and Learning for Renewal Systems

Solving the Problem (Type 2):

We reduce it to a problem with the structure of Type 1 via:
• Auxiliary variables γ[r] = (γ1[r], …, γL[r]).
• The following variation on Jensen's inequality:

For any concave function φ(x1, …, xL) and any (arbitrarily correlated) vector of random variables (X1, X2, …, XL, T) with T > 0, we have:

  E{T·φ(X1, …, XL)} / E{T}  ≤  φ( E{T·X1}/E{T}, …, E{T·XL}/E{T} )
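A numeric sanity check of the inequality (the concave test function φ and the joint distribution, including dependence between X1 and T, are arbitrary choices):

```python
import math, random

random.seed(0)
phi = lambda x1, x2: math.sqrt(x1 + x2)       # a concave test function

# Draw (X1, X2, T) jointly; dependence between T and the X's is allowed.
samples = []
for _ in range(100_000):
    t = random.uniform(1, 4)
    samples.append((random.uniform(0, t), random.uniform(0, 3), t))

ET  = sum(t for _, _, t in samples)
lhs = sum(t * phi(x1, x2) for x1, x2, t in samples) / ET
rhs = phi(sum(t * x1 for x1, _, t in samples) / ET,
          sum(t * x2 for _, x2, t in samples) / ET)
print(f"{lhs:.4f} <= {rhs:.4f}")              # the T-weighted Jensen bound holds
```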

Page 19: Dynamic Optimization and Learning for Renewal Systems

The Algorithm (Type 2) Becomes:

• On frame r, observe the virtual queues Z[r] = (Z1[r], …, ZL[r]) and G[r] = (G1[r], …, GL[r]).
• (Auxiliary variables) Choose γ1[r], …, γL[r] to solve the deterministic problem:

  Maximize:   V·φ(γ1[r], …, γL[r]) – Σl Gl[r]·γl[r]
  Subject to: each γl[r] in its bounded feasible range

• (Policy selection) Choose π[r] in P to minimize the drift-plus-penalty ratio, now with the auxiliary queues Gl[r] included in the drift.
• Then update the virtual queues:

  Zl[r+1] = max[Zl[r] – cl·T[r] + yl[r], 0]
  Gl[r+1] = max[Gl[r] + γl[r]·T[r] – yl[r], 0]

Page 20: Dynamic Optimization and Learning for Renewal Systems

Example Problem – Task Processing:

[Figure: network of T/R nodes with a Network Coordinator; Tasks 1, 2, 3 arrive at the coordinator]

• Every task reveals random task parameters η[r]:
  η[r] = [(qual1[r], T1[r]), (qual2[r], T2[r]), …, (qual5[r], T5[r])]
• Choose π[r] = [which node transmits, how much idle time] in {1, 2, 3, 4, 5} × [0, Imax].
• Transmissions incur power.
• We use a quality distribution that tends to be better for higher-numbered nodes.
• Maximize quality/time subject to pav ≤ 0.25 for all nodes (see the decision sketch after the frame diagram below).

[Figure: structure of frame r: Setup | Transmit | Idle I[r]]
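Since the decision space is five nodes crossed with an idle-time interval, the per-frame ratio can be optimized by enumeration over nodes and a grid of idle times. A sketch under stated assumptions (the constants, the linear power model, and the grid resolution are all illustrative, not the paper's):

```python
import random

V, Imax, P_TX, P_AV = 50.0, 10.0, 1.0, 0.25      # illustrative constants

def best_decision(eta, Z):
    """eta = [(qual_i, T_i)] for nodes i = 1..5; Z = per-node power queues.
    Returns the (node, idle time) maximizing the DPP ratio in 'max' form."""
    best, best_val = None, float("-inf")
    for i, (qual, T_i) in enumerate(eta):
        energy = P_TX * T_i                      # energy if node i transmits
        for k in range(21):                      # grid over idle time I
            I = k * Imax / 20
            T = T_i + I                          # total frame length
            # sum_j Z_j*(e_j - P_AV*T), where e_j = energy only for j = i
            penalty = Z[i] * energy - P_AV * T * sum(Z)
            val = (V * qual - penalty) / T       # ratio to maximize
            if val > best_val:
                best, best_val = (i, I), val
    return best

# One frame: random task parameters (quality better at higher-numbered nodes).
eta = [(random.uniform(0, i + 1), random.uniform(1, 3)) for i in range(5)]
print(best_decision(eta, Z=[0.0] * 5))
```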

Page 21: Dynamic Optimization and Learning for Renewal Systems

Minimizing the Drift-Plus-Penalty Ratio:

• Minimizing a pure expectation, rather than a ratio of expectations, is typically easier (see Bertsekas & Tsitsiklis, Neuro-Dynamic Programming).

• Define, for θ ≥ 0 (with a(π) denoting the DPP-ratio numerator and T(π) the frame length under policy π):

  f(θ) = min_{π in P} E{ a(π) – θ·T(π) }

• "Bisection Lemma": f(θ) is nonincreasing in θ, and the minimum ratio θ* = min_{π in P} E{a(π)}/E{T(π)} satisfies f(θ*) = 0, with f(θ) > 0 for θ < θ* and f(θ) < 0 for θ > θ*. Hence θ* can be found by bisection on θ.
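A sketch of the resulting bisection search; `f` is a caller-supplied oracle computing f(θ) = min over π in P of E{a(π) – θ·T(π)}, and the bracket and tolerance are illustrative:

```python
def bisect_ratio(f, lo=0.0, hi=100.0, tol=1e-6):
    """Find theta* with f(theta*) = 0, where f(theta) = min_pi E{a(pi) - theta*T(pi)}
    is nonincreasing in theta. theta* equals the minimum ratio E{a}/E{T}."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:        # theta below the optimal ratio
            lo = mid
        else:                 # theta at or above the optimal ratio
            hi = mid
    return 0.5 * (lo + hi)

# Toy oracle: two policies with (E{a}, E{T}) = (4, 2) and (9, 3).
f = lambda theta: min(4 - theta * 2, 9 - theta * 3)
print(bisect_ratio(f))        # -> 2.0 = min(4/2, 9/3)
```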

Page 22: Dynamic Optimization and Learning for Renewal Systems

Learning via Sampling from the past:

• Suppose the randomness is characterized by {η1, η2, …, ηW} (past random samples).
• Want to compute (over the unknown random distribution of η):

  min_{π in P} E{ a(π, η) – θ·T(π, η) }

• Approximate this via the W samples from the past:

  min_{π in P} (1/W) Σ_{w=1}^{W} [ a(π, ηw) – θ·T(π, ηw) ]
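In code, the unknown expectation is replaced by an empirical average over the W stored samples (a sketch; `cost(pi, eta)` stands for the quantity inside the expectation, e.g. a(π, η) – θ·T(π, η)):

```python
import random

def empirical_min(policies, samples, cost):
    """Approximate argmin over pi of E_eta{cost(pi, eta)} by replacing the
    unknown distribution of eta with W stored past samples."""
    W = len(samples)
    return min(policies, key=lambda pi: sum(cost(pi, eta) for eta in samples) / W)

# Toy usage: eta is a scalar; cost is quadratic around eta, so the best
# policy is the one closest to the sample mean.
random.seed(2)
samples = [random.gauss(1.0, 0.3) for _ in range(10)]          # W = 10
print(empirical_min([0.5, 1.0, 1.5], samples, lambda pi, eta: (pi - eta) ** 2))
```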

Page 23: Dynamic Optimization and Learning for Renewal Systems

Simulation:

[Figure: Quality of Information per Unit Time vs. sample size W, comparing the Drift-Plus-Penalty Ratio Algorithm with Bisection against an Alternative Algorithm with Time Averaging]

Page 24: Dynamic Optimization and Learning for Renewal Systems

Concluding Sims (values for W = 10):

[Table: concluding simulation values for W = 10]

Quick Advertisement: New Book: M. J. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool, 2010.

http://www.morganclaypool.com/doi/abs/10.2200/S00271ED1V01Y201006CNT007

• PDF also available from the "Synthesis Lecture Series" (on the digital library).
• Lyapunov optimization theory (including these renewal system problems).
• Detailed examples and problem set questions.