
Multimedia Communications and Systems Laboratory

Online Learning for Energy-Efficient Multimedia Systems

Nick Mastronarde

[email protected]

PhD Defense

May 6, 2011


• Old: Higher multimedia quality is better

– Optimize rate-distortion performance

• H.264/AVC

– Minimize delay

– Minimize distortion

– …

• New: Quality costs power

Surveillance, video conferencing, sensor networks, data centers, in-home

Resource-intensive multimedia applications are booming over a variety of resource-constrained networks and systems

Delay, Distortion

Energy

My Focus! Energy-efficient resource management


Performance Metrics and High-level System Model

• Performance metric depends on the system and application

– Minimize energy subject to QoS constraint

– Optimize QoS subject to energy budget

– …

• For example:

– E[Cost] = E[Energy] + µE[Delay]

[Figure: high-level system model with a source (multimedia data: I, P, and B frames), a buffer, and a server. Labels: source adaptation, service adaptation, scheduling, QoS (delay, distortion).]


Two types of optimization objectives

• Myopic:

– Minimize expected immediate cost

• Foresighted:

– Minimize expected immediate cost + expected future cost

– Why?

• Power & Delay: Time to transmit current packet impacts time available (and power required) to transmit future packets before their deadlines

• Multimedia Utility: Scheduling decisions at the current time impact future scheduling decisions due to source-coding dependencies

E[Cost] = E[Energy] + µE[Delay]

Myopic: Suboptimal!

Foresighted: My Focus!


Foresighted Optimization

• How does foresighted optimization work?

– In time slot n, take the transmission action a that minimizes:

$$\underbrace{c(s, a)}_{\text{current cost}} + \underbrace{\mathbb{E}_w\left[V\big(f(s, a, w)\big)\right]}_{\text{expected future cost}}$$

– State at time n: $s$ (channel, buffer backlog, MM data state)

– Action: $a$ (scheduling, AMC)

– Dynamics: $w$ (channel, data arrivals, Tx errors)

– State at time n+1: $s' = f(s, a, w)$

Myopic solutions are suboptimal because they ignore the expected future utility


Challenges

• Challenge 1: Unknown dynamic environments

– Dynamic traffic and channel conditions

– Lack of statistical knowledge of dynamics

– Fast learning algorithms

• Challenge 2: Heterogeneous multimedia data

– Different deadlines, priorities, dependencies

• Challenge 3: Multi-user

– Coupling due to shared resources

– Curse of dimensionality


Existing Solutions (1/2)

• Cross-layer optimization in multimedia communications and systems

– Myopic: Ignore the impact of current decisions on the future performance. [Nahrstedt 2006, 2007, He 2005, Sachs 2003, Mohapatra 2005, van der Schaar 2003, 2007]

• Single-layer optimizations

– Hardware layer (dynamic power management): [Benini 1999, Chung 2002, Marculescu 2005]

• Learning solutions require too much memory or are too complex

– Physical layer (transmission power-control)

• Optimal solutions require statistical knowledge of dynamics [Berry 2002]

• Learning solutions are slow to converge [Borkar 2008]

– Application layer (multimedia rate-control) [Ortega 1994]

• Rate-distortion characteristics are assumed to be known


Existing solutions (2/2)

• Multi-user network optimization

– Network utility maximization [Chiang 2007]

• Static utility function

• Ignores network dynamics

• Ignores packet deadlines, priorities, and dependencies

• No learning for unknown environments

– Stability-constrained optimization [Neely 2006]

• Guarantees queue stability, but achieves suboptimal power consumption in low delay region

• Ignores packet deadlines, priorities, and dependencies


Improvement over state-of-the-art

| Problem setting | Previous state-of-the-art | Achieved improvement (the proposed framework achieves...) |
|---|---|---|
| Point-to-point energy-efficient wireless communication [Mastronarde 2011b] | Heuristic policy [Nahrstedt 2007] | Reduce power by up to 33% for same delay (in non-stationary environment) |
| | Reinforcement learning [Borkar, 2008] | Reduce delay and power by up to 50% and 23%, respectively, after 3000 learning steps |
| Cooperative multi-user video transmission [Mastronarde 2011a] | Non-cooperative multi-user video transmission [Fu, van der Schaar, 2010] | Improve 5 – 10 dB PSNR for nodes with feeble direct signals |
| Cross-layer multimedia system optimization* [Mastronarde 2010, 2009b] | Cross-layer adaptation [Nahrstedt 2005] | Improve up to 7 dB PSNR and reduce power by 21% |

*Prior work presented during Qualifying Exam


Overview

• Part I: Fast reinforcement learning for energy-efficient wireless communication [Mastronarde, 2011b]

– Post-decision state learning

– Virtual experience learning

• Part II: A distributed cross-layer approach to cooperative video transmission [Mastronarde, 2011a]

– Multi-user Markov decision process formulation

– Mitigating the curse of dimensionality










The Solved Energy-efficient Wireless Communication Problem (1/2)

• Point-to-point time-slotted wireless communication system

• Minimize power consumption subject to buffer delay constraint

– Little’s law: Average buffer delay is proportional to average buffer occupancy

[Figure: system block diagram labeled with the variables $b^n$, $h^n$, $BEP^n$, $x^n$, $y^n$, $z^n$, $l^n$, and $f^n$.]


The Solved Energy-efficient Wireless Communications Problem (2/2)

• System variables

– Buffer occupancy state: $b^n \in \{0, \ldots, B\}$

– Channel state: $h^n$ -- Finite state Markov chain (e.g. Rayleigh fading)

– Power management state: $x^n \in \{\text{on}, \text{off}\}$

– Data arrivals: $l^n$ -- i.i.d.

• Decision variables (actions)

– Packet throughput: $z^n$, $0 \le z^n \le b^n$

– Bit-error probability: $BEP^n$

– Power management action: $y^n \in \{\text{s\_on}, \text{s\_off}\}$

• Goodput: $f^n$, $0 \le f^n \le z^n$


Buffer Model

• Buffer state: $\{b^n \in \mathcal{B} : n = 0, 1, \ldots\}$, $\mathcal{B} = \{0, 1, \ldots, B\}$

– Buffer recursion:

$$b^0 = b_{\text{init}}, \qquad b^{n+1} = \min\left(b^n - f^n(BEP^n, z^n) + l^n,\ B\right)$$

– Controlled Markov chain with transition probabilities:

$$p^b(b' \mid [b, h, x], BEP, y, z) = \begin{cases} \sum_{f=0}^{z} p^l\big(b' - [b - f]\big)\, p^f(f \mid BEP, z), & \text{if } b' < B \\ \sum_{f=0}^{z} \sum_{l = B - [b - f]}^{\infty} p^l(l)\, p^f(f \mid BEP, z), & \text{if } b' = B \end{cases}$$
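As a minimal sketch of the buffer model above, the following Python builds the distribution of the next buffer state from a goodput distribution and an arrival distribution. The binomial goodput model, the finite arrival pmf, and the names `goodput_pmf`, `buffer_transition`, and `plr` are assumptions made for illustration, not the author's implementation.

```python
from math import comb

def goodput_pmf(z, plr):
    """Assumed binomial goodput: each of z packets succeeds with prob. 1 - plr."""
    return [comb(z, f) * (1 - plr) ** f * plr ** (z - f) for f in range(z + 1)]

def buffer_transition(b, z, plr, arrival_pmf, B):
    """Return p[b'] for b' in 0..B, given backlog b and packet throughput z <= b."""
    assert 0 <= z <= b <= B
    p = [0.0] * (B + 1)
    for f, pf in enumerate(goodput_pmf(z, plr)):
        for l, pl in enumerate(arrival_pmf):
            b_next = min(b - f + l, B)  # buffer recursion; overflow is dropped
            p[b_next] += pf * pl
    return p

# Toy numbers (hypothetical): backlog 3, send 2 packets, 10% packet loss.
p = buffer_transition(b=3, z=2, plr=0.1, arrival_pmf=[0.5, 0.3, 0.2], B=4)
```

All mass that would exceed `B` accumulates in the last entry, which mirrors the $b' = B$ case of the transition probability.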


Power Management Model

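• Power management state: $\{x^n \in \mathcal{X} : n = 0, 1, \ldots\}$

– Controlled Markov chain with transition probabilities [Benini 1999]:

$$\mathbf{P}^x(y) = \left[p^x(x' \mid x, y)\right]_{x, x'}$$

$$\mathbf{P}^x(\text{s\_on}) = \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix} \text{ (switch “on”)}, \qquad \mathbf{P}^x(\text{s\_off}) = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix} \text{ (switch “off”)}$$

(rows: current state on, off; columns: next state on, off)

• Switching wireless card “on” or “off”

– Incurs transition power penalty (watts): $P_{\text{tr}}$

– Incurs expected transition delay: $\Delta t$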


Costs

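We want to achieve the optimal power subject to a buffer constraint

• Power cost:

$$\rho([h, x], BEP, y, z) = \begin{cases} P_{\text{on}} + P_{\text{tx}}(h, BEP, z), & \text{if } x = \text{on},\ y = \text{s\_on} \\ P_{\text{off}}, & \text{if } x = \text{off},\ y = \text{s\_off} \\ P_{\text{tr}}, & \text{otherwise} \end{cases}$$

with $P_{\text{tr}} \ge P_{\text{on}} > P_{\text{off}} \ge 0$

• Buffer cost:

$$g([b, x], BEP, y, z) = \mathbb{E}_{f, l}\Big[\underbrace{[b - f]}_{\text{holding cost}} + \eta\, \underbrace{\max\big([b - f] + l - B,\ 0\big)}_{\text{overflow cost}}\Big]$$

– Holding cost: proportional to the delay (by Little’s law)

– Overflow cost: provides incentive to transmit packets instead of dropping them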
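The two cost terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transmission power `p_tx` is passed in as a precomputed number rather than evaluated from the channel state, and the goodput and arrival distributions are supplied as finite lists; the function names are hypothetical.

```python
def power_cost(x, y, p_tx, p_on, p_off, p_tr):
    """Power cost: x in {'on', 'off'}, y in {'s_on', 's_off'}; p_tx stands in for P_tx(h, BEP, z)."""
    if x == "on" and y == "s_on":
        return p_on + p_tx
    if x == "off" and y == "s_off":
        return p_off
    return p_tr  # any on/off switch pays the transition power

def buffer_cost(b, goodput_pmf, arrival_pmf, B, eta):
    """Expected holding cost plus eta times expected overflow, averaged over f and l."""
    cost = 0.0
    for f, pf in enumerate(goodput_pmf):
        for l, pl in enumerate(arrival_pmf):
            cost += pf * pl * ((b - f) + eta * max((b - f) + l - B, 0))
    return cost

# Toy numbers (hypothetical): exactly 1 packet delivered, exactly 3 arrivals.
g = buffer_cost(b=3, goodput_pmf=[0.0, 1.0], arrival_pmf=[0.0, 0.0, 0.0, 1.0], B=4, eta=10.0)
rho = power_cost("on", "s_on", p_tx=0.05, p_on=0.08, p_off=0.0, p_tr=0.08)
```

In the toy call, two packets are held and one overflows, so the buffer cost is 2 + 10 × 1 = 12.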


Formulation as Markov Decision Process (MDP)

• State: $s \triangleq (b, h, x)$

• Action: $a \triangleq (BEP, y, z)$

• Policy: $\pi : s \to a$

• Cost:

$$c(s, a) = \underbrace{\rho([h, x], BEP, y, z)}_{\text{power cost}} + \mu\, \underbrace{g([b, x], BEP, y, z)}_{\text{buffer cost}}$$

• Transition probability:

$$p(s' \mid s, a) = \underbrace{p^b(b' \mid [b, h, x], BEP, y, z)}_{\text{buffer state}}\ \underbrace{p^h(h' \mid h)}_{\text{channel state}}\ \underbrace{p^x(x' \mid x, y)}_{\text{power state}}$$


Value Functions

• State-value function:

$$V^\pi(s) = \underbrace{c(s, \pi(s))}_{\text{current cost}} + \underbrace{\gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, \pi(s))\, V^\pi(s')}_{\text{expected future cost}}$$

• Optimal state-value function:

$$V^*(s) = \min_{a \in \mathcal{A}} \underbrace{\Big\{ c(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, V^*(s') \Big\}}_{Q^*(s, a)}$$

• Optimal policy:

$$\pi^*(s) = \arg\min_{a \in \mathcal{A}} Q^*(s, a), \quad \forall s \in \mathcal{S}$$

If $c(s, a)$ and $p(s' \mid s, a)$ are known, this is a simple numerical problem…
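When the cost and transition probabilities are known, the optimal value function can be computed numerically, for example with value iteration. The sketch below is a generic textbook routine on a made-up two-state example, shown only to illustrate the "simple numerical problem"; it is not claimed to be the method used in the thesis, and all names and numbers are hypothetical.

```python
def value_iteration(cost, trans, gamma, tol=1e-9):
    """cost[s][a]: immediate cost; trans[s][a][s2]: p(s2 | s, a). Returns (V, policy)."""
    n_states = len(cost)
    V = [0.0] * n_states
    while True:
        # Q[s][a] = c(s, a) + gamma * sum_s2 p(s2 | s, a) * V[s2]
        Q = [[cost[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(trans[s][a]))
              for a in range(len(cost[s]))] for s in range(n_states)]
        V_new = [min(q) for q in Q]
        if max(abs(v1 - v0) for v1, v0 in zip(V_new, V)) < tol:
            policy = [q.index(min(q)) for q in Q]
            return V_new, policy
        V = V_new

# Toy 2-state, 2-action MDP (made-up numbers).
cost = [[1.0, 2.0], [0.0, 3.0]]
trans = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
V, policy = value_iteration(cost, trans, gamma=0.9)
```

In this toy problem a myopic controller would pick action 0 in state 0 (cost 1 instead of 2) and pay that cost forever, while the foresighted solution pays 2 once to reach the zero-cost state.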


Conventional Reinforcement Learning Algorithm:Q-Learning

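1. Initialization at time n = 0: $Q^0(s, a)$, $\forall s \in \mathcal{S}$, $\forall a \in \mathcal{A}$

2. Take Action (Exploration vs. Exploitation):

$$a^n = \begin{cases} \arg\min_{a} Q^n(s^n, a), & \text{with probability } 1 - \varepsilon^n \\ \text{rand}(\mathcal{A}), & \text{with probability } \varepsilon^n \end{cases}$$

3. Observe Experience: $\sigma^n = (s^n, a^n, c^n, s^{n+1})$, where $\mathbb{E}[c^n] = c(s^n, a^n)$ and $s^{n+1} \sim p(s' \mid s^n, a^n)$

4. Update the Action-Value Function:

$$Q^{n+1}(s^n, a^n) \leftarrow (1 - \alpha^n)\, Q^n(s^n, a^n) + \alpha^n \Big[ c^n + \gamma \min_{a' \in \mathcal{A}} Q^n(s^{n+1}, a') \Big]$$

5. Update the Lagrange Multiplier:

$$\mu^{n+1} = \Lambda\big[\mu^n + \beta^n (g^n - \delta)\big]$$

– $\Lambda$: projects onto $[0, \mu_{\max}]$

– $\delta$: average delay constraint

– $\beta^n \in [0, 1]$: learning rate

6. n = n + 1; return to step 2

Problem (callout shown three times on this slide)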
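One iteration of the loop above can be sketched in Python as three small functions. This is a minimal tabular illustration under assumptions: `Q` is a list of lists indexed by integer state and action, and the function names are hypothetical rather than taken from the thesis.

```python
import random

def epsilon_greedy(Q, s, eps, rng=random):
    """Pick argmin_a Q[s][a] with prob. 1 - eps, a uniformly random action otherwise."""
    if rng.random() < eps:
        return rng.randrange(len(Q[s]))
    return min(range(len(Q[s])), key=lambda a: Q[s][a])

def q_update(Q, s, a, c, s_next, alpha, gamma):
    """In-place cost-minimizing Q-learning update for one experience tuple."""
    target = c + gamma * min(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target

def multiplier_update(mu, g, delta, beta, mu_max):
    """Stochastic subgradient step on the Lagrange multiplier, projected onto [0, mu_max]."""
    return min(max(mu + beta * (g - delta), 0.0), mu_max)

# Toy numbers (hypothetical).
Q = [[0.0, 0.0], [1.0, 2.0]]
q_update(Q, s=0, a=1, c=3.0, s_next=1, alpha=0.5, gamma=0.9)
mu = multiplier_update(mu=1.0, g=6.0, delta=4.0, beta=0.1, mu_max=10.0)
```

Only the visited state-action pair is updated per time slot, and a nonzero `eps` deliberately takes suboptimal actions.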


Post-Decision State Definition

Definition: An intermediate state after the known dynamics take place, but before the unknown dynamics take place.

– State (time n): $s^n = (b^n, h^n, x^n)$, with action $a^n = (BEP^n, y^n, z^n)$

– Post-decision state (time n): $\tilde{s}^n = (\tilde{b}^n, \tilde{h}^n, \tilde{x}^n) = ([b^n - f^n],\ h^n,\ x^{n+1})$

– State (time n+1): $s^{n+1} = (b^{n+1}, h^{n+1}, x^{n+1}) = ([b^n - f^n] + l^n,\ h^{n+1},\ x^{n+1})$, reached from $\tilde{s}^n$ through the arrival distribution $p^l(l)$ and the channel transition $p^h(h' \mid h)$

| | Known | Unknown |
|---|---|---|
| Deterministic | PM state transition; power cost | N/A |
| Stochastic | Goodput distribution; holding cost | Traffic arrival distribution; channel state distribution; overflow cost |


Post-Decision State Generalization

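• Transition probability function

$$p(s' \mid s, a) = \sum_{\tilde{s}} \underbrace{p_u(s' \mid \tilde{s}, a)}_{\text{unknown: } \tilde{s} \to s'}\ \underbrace{p_k(\tilde{s} \mid s, a)}_{\text{known: } s \to \tilde{s}}$$

• Cost function

$$c(s, a) = \underbrace{c_k(s, a)}_{\text{known: } s \to \tilde{s}} + \sum_{\tilde{s}} p_k(\tilde{s} \mid s, a)\, \underbrace{c_u(\tilde{s}, a)}_{\text{unknown: } \tilde{s} \to s'}$$

In a large class of wireless systems

$$p_u(s' \mid \tilde{s}, a) = p_u(s' \mid \tilde{s}), \qquad c_u(\tilde{s}, a) = c_u(\tilde{s})$$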


Post-Decision State Value Function

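– State (time n): $s^n = (b^n, h^n, x^n)$, with action $a^n = (BEP^n, y^n, z^n)$ and value $V^*(s)$

– Post-decision state (time n): $\tilde{s}^n = (\tilde{b}^n, \tilde{h}^n, \tilde{x}^n)$, with value $\tilde{V}^*(\tilde{s})$

– State (time n+1): $s^{n+1}$, with value $V^*(s')$

(a) Known, from state to post-decision state:

$$p_k(\tilde{s} \mid s, a) = p^x(\tilde{x} \mid x, y)\, p^f(b - \tilde{b} \mid BEP, z)\, I(\tilde{h} = h)$$

$$c_k(s, a) = \mu \sum_{f=0}^{z} p^f(f \mid BEP, z)\, [b - f] + \rho([h, x], BEP, y, z)$$

$$V^*(s) = \min_{a \in \mathcal{A}} \Big\{ c_k(s, a) + \sum_{\tilde{s}} p_k(\tilde{s} \mid s, a)\, \tilde{V}^*(\tilde{s}) \Big\}$$

(b) Unknown, from post-decision state to next state: $p_u(s' \mid \tilde{s})$ and $c_u(\tilde{s})$

$$\tilde{V}^*(\tilde{s}) = c_u(\tilde{s}) + \gamma \sum_{s'} p_u(s' \mid \tilde{s})\, V^*(s')$$

The PDS value function must be learned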


Post-Decision State Learning

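1. Initialization at time n = 0: $V^0(s)$ and $\tilde{V}^0(\tilde{s})$, $\forall s \in \mathcal{S}$

2. Take the greedy action (No Exploration! Integrates known information!):

$$a^n = \arg\min_{a \in \mathcal{A}} \Big\{ c_k(s^n, a) + \sum_{\tilde{s}} p_k(\tilde{s} \mid s^n, a)\, \tilde{V}^n(\tilde{s}) \Big\}$$

3. Observe the PDS experience: $\tilde{\sigma}^n = (s^n, a^n, \tilde{s}^n, c_u^n, s^{n+1})$, where $\mathbb{E}[c_u^n] = c_u(\tilde{s}^n)$ and $s^{n+1} \sim p_u(s' \mid \tilde{s}^n)$

4. Evaluate the state-value function at the next state:

$$V^n(s^{n+1}) = \min_{a \in \mathcal{A}} \Big\{ c_k(s^{n+1}, a) + \sum_{\tilde{s}} p_k(\tilde{s} \mid s^{n+1}, a)\, \tilde{V}^n(\tilde{s}) \Big\}$$

5. Update the PDS value function:

$$\tilde{V}^{n+1}(\tilde{s}^n) \leftarrow (1 - \alpha^n)\, \tilde{V}^n(\tilde{s}^n) + \alpha^n \big[ c_u^n + \gamma V^n(s^{n+1}) \big]$$

6. Update the Lagrange multiplier:

$$\mu^{n+1} = \Lambda\big[\mu^n + \beta^n (g^n - \delta)\big]$$

– $\Lambda$: projects onto $[0, \mu_{\max}]$

– $\delta$: average delay constraint

– $\beta^n \in [0, 1]$: learning rate

Problem (callout shown once on this slide)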

Virtual Experience Learning

• Problem: PDS learning only updates one PDS in each time slot

• Observation: unknown dynamics $p^l(l)$ and $p^h(h' \mid h)$ are independent of the buffer and power management states

– Learn about all buffer and power management states in each time slot!

– Improve adaptation speed at the expense of increased complexity.

• Actual PDS experience tuple:

$$\tilde{\sigma}^n = (s^n, a^n, \tilde{s}^n, c_u^n, s^{n+1})$$

• Set of virtual experience tuples:

$$\Sigma(\tilde{\sigma}^n) = \Big\{ \big(s^n, a^n, \underbrace{(\tilde{b}, \tilde{h}^n, \tilde{x})}_{\text{current VE state}}, \underbrace{c_u(\tilde{b}; l^n)}_{\text{VE cost}}, \underbrace{(\tilde{b} + l^n, h^{n+1}, \tilde{x})}_{\text{next VE state}}\big) \ \Big|\ \forall (\tilde{b}, \tilde{x}) \in \mathcal{B} \times \mathcal{X} \Big\}$$

• VE cost: $c_u(\tilde{b}; l) = \eta \max(\tilde{b} + l - B,\ 0)$
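Generating the virtual experience set can be sketched as a double loop over buffer and power management states. The tuple layout and the function name are assumptions for illustration; the next buffer state is capped at `B` here to stay consistent with the buffer recursion, which is also an assumption about how the slide's expression is meant.

```python
def virtual_experiences(s, a, h, l, h_next, B, pm_states, eta):
    """Enumerate virtual PDS experience tuples for every (buffer, PM state) pair.

    Reuses the observed arrivals l and channel transition h -> h_next, which
    are assumed independent of the buffer and power management states.
    """
    tuples = []
    for b_pds in range(B + 1):
        for x_pds in pm_states:
            c_u = eta * max(b_pds + l - B, 0)          # VE (overflow) cost
            pds = (b_pds, h, x_pds)                    # current VE state
            nxt = (min(b_pds + l, B), h_next, x_pds)   # next VE state, capped at B
            tuples.append((s, a, pds, c_u, nxt))
    return tuples

# Toy observation (hypothetical): 3 arrivals, channel index 0 -> 1.
ve = virtual_experiences(s=(2, 0, "on"), a=(1e-3, "s_on", 1), h=0, l=3,
                         h_next=1, B=4, pm_states=("on", "off"), eta=10.0)
```

Each tuple can then be fed to the same PDS value update, so one real observation refreshes every buffer and power management state at the cost of more computation per slot.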



Comparison of Learning Algorithms

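| Algorithm | Action Selection Complexity | Learning Update Complexity |
|---|---|---|
| Q-learning | $O(|\mathcal{A}|)$ | $O(|\mathcal{A}|)$ |
| PDS learning | $O(|\tilde{\mathcal{S}}||\mathcal{A}|)$ | $O(|\tilde{\mathcal{S}}||\mathcal{A}|)$ |
| Virtual experience learning | $O(|\tilde{\mathcal{S}}||\mathcal{A}|)$ | $O(|\Sigma||\tilde{\mathcal{S}}||\mathcal{A}|)$ |

where $|\tilde{\mathcal{S}}| = |\mathcal{B}|$ and $|\Sigma| = |\mathcal{B} \times \mathcal{X}|$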


Simulation Setup

• PHY layer: QAM square constellations + Gray code

• Unknown channel transition and packet arrival distributions

Simulation Parameters

| Parameter | Value |
|---|---|
| Arrival rate $\lambda$ | 200 packets/second |
| Buffer size $B$ | 25 packets |
| Channel states $h \in \mathcal{H}$ | {-18.82, -13.79, -11.23, -9.37, -7.80, -6.30, -4.68, -2.08} dB |
| Holding cost constraint | 4 packets |
| “Off” power $P_{\text{off}}$ | 0 watts |
| “On” power $P_{\text{on}}$ | 80 mW, 160 mW, or 320 mW |
| Transition power $P_{\text{tr}}$ | Set equal to $P_{\text{on}}$ |
| Packet loss rates PLR | {1, 2, 4, 8, 16} % |
| Power management actions $y \in \mathcal{Y}$ | {s_on, s_off} |
| Power management states $x \in \mathcal{X}$ | {on, off} |
| Time slot duration $\Delta t$ | 10 ms |
| Transmission actions* $z \in \mathcal{Z}$ | {0, 1, 2, …, 10} packets/time slot |
| Discount factor $\gamma$ | 0.98 |
| Noise power spectral density | $N_0 = 2 \times 10^{-11}$ watts/Hz |

*Symbol rate $1/T_s = 500 \times 10^3$ symbols/s

Packet size = 5000 bits

Bits per symbol $\beta \in \{1, 2, \ldots, 10\}$
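As a quick arithmetic check of how the footnote relates to the transmission actions, the short script below converts bits per symbol into packets per time slot. It assumes the symbol rate reads 500 × 10^3 symbols/s and uses the 10 ms slot and 5000-bit packets from the table; the variable names are illustrative.

```python
symbol_rate = 500e3      # 1/T_s in symbols per second (assumed reading of the footnote)
slot_duration = 10e-3    # time slot duration in seconds
packet_size = 5000       # bits per packet

symbols_per_slot = symbol_rate * slot_duration  # about 5000 symbols per slot
for bits_per_symbol in range(1, 11):
    packets = bits_per_symbol * symbols_per_slot / packet_size
    print(bits_per_symbol, "bits/symbol ->", packets, "packets/slot")

max_packets = 10 * symbols_per_slot / packet_size  # largest transmission action
arrivals_per_slot = 200 * slot_duration             # mean arrivals per slot
```

Under these numbers each additional bit per symbol carries one more packet per slot, which lines up with the transmission actions {0, 1, …, 10}, and the mean arrival rate is about 2 packets per slot.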


Learning Algorithm Performance Comparison (1/2)

[Figure: three panels plotting holding cost (0 to 25), power (mW, 200 to 300), and $\theta_{\text{off}}$ (0 to 0.4) against time slot n (0 to 6 × 10^4). Curves: PDS learning; PDS learning (no DPM); Q-learning [Borkar, 2008]; PDS + virtual experience (update period = 1); PDS + virtual experience (update period = 10).]


Learning Algorithm Performance Comparison (2/2)

[Figure: three panels plotting holding cost (0 to 25), power (mW, 200 to 300), and $\theta_{\text{off}}$ (0 to 0.4) against time slot n (0 to 6 × 10^4). Curves: PDS learning; PDS + virtual experience with update period = 1, 10, 25, and 125.]


Comparison to State-of-the-Art

• Threshold-k [Nahrstedt, ’07]

– If backlog exceeds k, then turn on wireless card and transmit all packets

– After transmitting, turn card off

– Ignore channel conditions

• Non-stationary dynamics

– Markov modulated arrival process using unobservable 5-state Markov chain

– Time-varying channel transition probabilities
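For reference, the Threshold-k baseline described above can be sketched as a per-slot decision rule. How the rule behaves in the slot where the card is still off, and the choice to send the whole backlog in one slot, are assumptions of this sketch; the function name and return format are hypothetical.

```python
def threshold_k_action(backlog, card_on, k):
    """Return (power management action, packets to send) for a Threshold-k style rule."""
    if card_on:
        if backlog > 0:
            return "s_on", backlog  # transmit all queued packets
        return "s_off", 0           # after transmitting, turn the card off
    if backlog > k:
        return "s_on", 0            # wake the card; nothing is sent while it is off
    return "s_off", 0               # stay off; channel conditions are never consulted
```

The rule depends only on the backlog and the card state, which is why it cannot exploit good channel conditions.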

[Figure: power (mW) versus holding cost (packets) for the proposed solution and Threshold-k.]

11% – 33% improvement for same holding cost

*Update period for proposed: T = 50 time slots


Summary: Fast learning for energy-efficient wireless communications

• Proposed first unified power management framework for delay-sensitive wireless communication

– Integrate system-level and physical-layer centric power management

• Exploited structure of the problem to improve learning performance

– Post-decision state

• Separation of known and unknown dynamics

• Eliminate need for exploration


– Virtual experience

• Independence of unknown dynamics and components of state










Multi-user Wireless Video Network With Cooperation

• Direct mode: Transmit at data rate $\beta_t^{i0}$ (bits/s/Hz) for $x_t^i R$ (s)

• Cooperative mode:

– Phase I: transmit at data rate $\beta_t^{i,1}$ (bits/s/Hz) for $\rho_t^i x_t^i R$ (s)

– Phase II: transmit at data rate $\beta_t^{i,2}$ (bits/s/Hz) for $(1 - \rho_t^i)\, x_t^i R$ (s)

– Cooperative data rate: $\beta_t^{i,\text{coop}} = \rho_t^i \beta_t^{i,1} + (1 - \rho_t^i)\, \beta_t^{i,2}$ (bits/s/Hz)

– Cooperative phase II uses randomized space-time block coding rule

• Notation:

– $R$: Time slot duration (seconds)

– $x_t^i$: Transmission time fraction in [0,1]

– $\rho_t^i$: Phase I time fraction in [0,1]

[Figure: example network with channel labels $h_t^{10}$, $h_t^{20}$, $h_t^{30}$, $h_t^{12}$, $h_t^{13}$, showing a direct transmission (duration $x_t^1 R$, rate $\beta_t^{10}$) and a cooperative transmission with Phase I (duration $\rho_t^1 x_t^1 R$, rate $\beta_t^{1,1}$) and Phase II (duration $(1 - \rho_t^1)\, x_t^1 R$, rate $\beta_t^{1,2}$).]
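The timing and rate expressions for the cooperative mode can be evaluated with a couple of helper functions. This is a small numeric sketch of the expressions on the slide; the function names and the example numbers are hypothetical.

```python
def cooperative_rate(beta1, beta2, rho):
    """Cooperative data rate: rho * beta1 + (1 - rho) * beta2, in bits/s/Hz."""
    return rho * beta1 + (1.0 - rho) * beta2

def phase_durations(x, rho, R):
    """Seconds spent in Phase I and Phase II when a fraction x of a slot of R seconds is used."""
    return rho * x * R, (1.0 - rho) * x * R

# Toy numbers (hypothetical): half of a 10 ms slot, split evenly between phases.
rate = cooperative_rate(beta1=2.0, beta2=4.0, rho=0.5)
t1, t2 = phase_durations(x=0.5, rho=0.5, R=0.01)
```

The two phase durations always add up to the allocated share `x * R` of the slot.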


Prior Work

• Throughput-maximizing (opportunistic) multiple access policies [Knopp 1995], [Viswanath 2002], [Tse 2005]

– Schedule nodes with good fades

– Ignore delay deadlines, priorities, and dependencies

• Cross-layer solutions [Katsaggelos 2007, 2008], [Su 2007], [van der Schaar 2010], [Melodia 2010]

– Balance between scheduling easy nodes and most important nodes

– Underlying inefficiency in network resource usage

• Users with high priority data, but worse fades, get access to the channel

Cooperation reduces inefficiency!

Enables users with feeble direct signals, but high priority data, to exploit channel diversity


A Sophisticated Traffic Model for Video

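• Traffic state: $\mathcal{T}_t^i = (\mathcal{F}_t^i, \mathbf{b}_t^i)$

– Schedulable frame set: $\mathcal{F}_t^i = \{ j \mid d_j^i \in \{t, t+1, \ldots, t+W\} \}$

– Buffer state: $\mathbf{b}_t^i = (b_{t,j}^i \mid j \in \mathcal{F}_t^i)$

[Figure: simple IBPB IBPB... GOP structure and an illustrative traffic state.]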


Traffic State Transition Illustration

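| Time | Schedulable frame set | Buffer state (traffic state) | Scheduling action |
|---|---|---|---|
| $t$ | $\mathcal{F}_t = (1,2,3)$ | $\mathbf{b}_t = (4,3,2)$ | $\mathbf{y}_t = (4,0,0)$ |
| $t+1$ | $\mathcal{F}_{t+1} = (2,3,1,4)$ | $\mathbf{b}_{t+1} = (3,2,6,1)$ | $\mathbf{y}_{t+1} = (0,0,2,0)$ |
| $t+2$ | $\mathcal{F}_{t+2} = (1,4,2,3)$ | $\mathbf{b}_{t+2} = (4,-1,4,1)$ | $\mathbf{y}_{t+2} = (4,0,1,0)$ |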


Multi-User Markov Decision Process Formulation

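• States

– Channel state (i.i.d.): $\mathbf{H}_t = \{ h_t^{i\ell} : i \ne \ell,\ i, \ell \in \{0, 1, 2, \ldots, M\} \}$

– Traffic state: $\mathcal{T}_t^i = (\mathcal{F}_t^i, \mathbf{b}_t^i)$

• Actions

– Scheduling action: $\mathbf{y}_t^i = (y_{t,j}^i \mid j \in \mathcal{F}_t^i)$

– Cooperation decision: $z_t^i \in \{0 \text{ (direct)},\ 1 \text{ (cooperative)}\}$

• Utility and Transition Probability

$$u^i(s_t^i, \mathbf{y}_t^i) = \sum_{j \in \mathcal{F}_t^i} q_j^i\, y_{t,j}^i$$

where $q_j^i$ is the distortion reduction for packets belonging to frame j

$$p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{y}_t) = p(\mathbf{H}_{t+1}) \prod_{i=1}^{M} p(\mathcal{T}_{t+1}^i \mid \mathcal{T}_t^i, \mathbf{y}_t^i)$$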


Feasible Scheduling Actions

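• Constraint set: $\mathbf{y}_t^i \in \mathcal{P}^i(\mathcal{T}_t^i, z_t^i, \mathbf{H}_t)$

– Buffer constraint: $0 \le y_{t,j}^i \le b_{t,j}^i$

– Packet constraint: $\|\mathbf{y}_t^i\|_1 \le \dfrac{R\, \beta_t^i(z_t^i)}{P T_s}$

– Dependency constraint: if $k \prec j$, then $(b_{t,k}^i - y_{t,k}^i)\, y_{t,j}^i = 0$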
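The three constraints above can be checked mechanically for a candidate scheduling action. The sketch below is an illustration under assumptions: frames are dictionary keys, `parents` lists the frames each frame depends on, and `max_packets` stands in for the right-hand side of the packet constraint; all names are hypothetical.

```python
def is_feasible(y, b, parents, max_packets):
    """y, b: dicts frame -> packets scheduled / buffered; parents: frame -> frames it depends on."""
    if any(not (0 <= y[j] <= b[j]) for j in y):   # buffer constraint
        return False
    if sum(y.values()) > max_packets:             # packet constraint
        return False
    for j, deps in parents.items():               # dependency constraint
        for k in deps:
            # A frame may be scheduled only once every ancestor is fully sent.
            if k in y and (b[k] - y[k]) * y.get(j, 0) != 0:
                return False
    return True

# Toy traffic state (hypothetical dependencies: frames 2 and 3 depend on frame 1).
b = {1: 4, 2: 3, 3: 2}
parents = {2: [1], 3: [1]}
ok = is_feasible({1: 4, 2: 0, 3: 0}, b, parents, max_packets=4)
bad = is_feasible({1: 2, 2: 1, 3: 0}, b, parents, max_packets=4)
```

The second call fails because packets of frame 2 are scheduled while frame 1 still has unsent packets.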


Optimization Objective

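• Decision variables:

– Scheduling action: $\mathbf{y}_t = (\mathbf{y}_t^1, \mathbf{y}_t^2, \ldots, \mathbf{y}_t^M)^T$

– Cooperation decision: $\mathbf{z}_t = (z_t^1, z_t^2, \ldots, z_t^M)^T$

• Dynamic programming equation:

$$U^*(\mathbf{s}) = \max_{\mathbf{y}, \mathbf{z}} \left\{ \sum_{i=1}^{M} u^i(\mathcal{T}^i, \mathbf{y}^i) + \alpha \sum_{\mathbf{s}' \in \mathcal{S}} p(\mathbf{H}') \prod_{i=1}^{M} p(\mathcal{T}^{i\prime} \mid \mathcal{T}^i, \mathbf{y}^i)\, U^*(\mathbf{s}') \right\}, \quad \forall \mathbf{s}$$

• Subject to $\mathbf{y}^i \in \mathcal{P}^i(\mathcal{T}^i, z^i, \mathbf{H})$ and $\sum_{i=1}^{M} x^i \le 1$, where $x^i = \dfrac{P T_s \|\mathbf{y}^i\|_1}{R\, \beta^i(z^i)}$

• Challenges:

– Complexity is quadratic in $|\mathcal{S}|$, which scales exponentially in $M$ and $M^2$

– Traffic state information is local to users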


Mitigating the Curse of Dimensionality

• Problem 1: Complexity scales exponentially in $M^2$

– Theorem: Cooperation decision that maximizes immediate throughput is long-term optimal [Mastronarde, 2011a]

– Implications of theorem:

• Instead of tracking $\mathbf{H}_t$, track maximum transmission rates $\beta^{i*} = \max_{z^i} \{ \beta^i(z^i, \mathbf{H}) \}$

• Use an opportunistic cooperation scheme for cooperation decision

• Packet constraint becomes $\|\mathbf{y}_t^i\|_1 \le \dfrac{R\, \beta_t^{i*}}{P T_s}$, with time fraction $x_t^i = \dfrac{P T_s \|\mathbf{y}_t^i\|_1}{R\, \beta_t^{i*}}$

• Problem 2: Complexity scales exponentially in $M$

– Solution [Fu, van der Schaar, 2010]: Lagrangian relaxation with a resource price $\lambda$

• The resulting MU-MDP can be decomposed into one local MDP per user

• Optimal resource price can be determined using subgradient method
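The subgradient update of the resource price can be sketched as below. Solving each user's local MDP is replaced here by a made-up linear demand function, so `local_demand`, the weights, and the step size are purely hypothetical stand-ins; only the price update itself reflects the idea of adjusting the price until the time fractions sum to one.

```python
def update_price(price, time_fractions, step):
    """Subgradient step: raise the price if sum_i x^i > 1, lower it otherwise; keep it >= 0."""
    return max(0.0, price + step * (sum(time_fractions) - 1.0))

def local_demand(weight, price):
    """Hypothetical stand-in for solving user i's local MDP at a given resource price."""
    return max(0.0, weight - 0.1 * price)

price, weights = 0.0, [0.6, 0.5, 0.4]
for _ in range(200):
    x = [local_demand(w, price) for w in weights]  # each user responds to the price
    price = update_price(price, x, step=1.0)       # coordinator adjusts the price
total = sum(local_demand(w, price) for w in weights)
```

With these toy demands the price settles where the requested time fractions exactly fill the slot, and each user only needs the scalar price rather than the other users' traffic states.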


Simulation Setup

• Scenarios:

– Homogeneous

• Foreman (CIF, 30 Hz, 1.5 Mb/s)

– Heterogeneous 1

• Coastguard (CIF, 30 Hz, 1.5 Mb/s)

• Mobile (CIF, 30 Hz, 2.0 Mb/s)

• Foreman (CIF, 30 Hz, 1.5 Mb/s)

– Heterogeneous 2

• Coastguard (CIF, 30 Hz, 1.5 Mb/s)

• Foreman (CIF, 30 Hz, 1.5 Mb/s)

• Mobile (CIF, 30 Hz, 2.0 Mb/s)

| Parameter | Description | Value |
|---|---|---|
| $L$ | Length of the STBC | 2 |
| $R_c$ | Rate of orthogonal STBC rule | 1 |
| $\xi$ | Self-selection parameter | 0.20 |
| $P$ | Packet size | 8000 bits |
| BEP | Bit error probability target | $10^{-3}$ |
| $\delta$ | Path loss exponent | 3 |
| $R_{\text{cell}}$ | WLAN coverage radius (5 dB SNR at boundary) | 100 m |
| $M$ | Number of nodes (excluding the AP) | 50 |
| $\alpha$ | Discount factor | 0.80 |
| $1/T_s$ | Symbol rate (symbols per second) | 625000 or 1250000 |

0 1if will self-select itself as a coop , then nod re laye

it c

tict

R

R

βξ

β

+≤

****

****


Network Topology

[Figure: three network topology panels (both axes from -100 to 100) showing the AP, video sources, and potential relays.]

| Source | Distance to AP | Angle |
|---|---|---|
| 1 | 20 m | 25º |
| 2 | 45 m | -30º |
| 3 | 80 m | 0º |


Transmission Rates

• A: Feeble direct

• B: Strong direct

• C: Cooperative gains

• D: Homogeneous allocation

• E: Heterogeneous allocation

[Figure: average transmission rate (Kbps, 0 to 1800) with horizontal axis ticks 1, 2, 3, comparing Cooperative (Low Congestion), Direct (Low Congestion), Cooperative (High Congestion), and Direct (High Congestion). Panels: Homogeneous (Foreman); Heterogeneous 1 (Coastguard, Mobile, Foreman); Heterogeneous 2 (Coastguard, Foreman, Mobile).]




Video Quality Comparison

• A: Feeble direct → video undecodable at receiver

• B: Cooperation achieves 5-10 dB PSNR improvement for nodes with feeble direct signals

• C: Cooperation minimally impacts nodes with strong direct signals

| Streaming Scenario | Transmission Mode | Video User 1 @ 20 m (Low / High) | Video User 2 @ 45 m (Low / High) | Video User 3 @ 80 m (Low / High) |
|---|---|---|---|---|
| Homogeneous (Foreman, Foreman, Foreman) | Direct | 36.82 dB / 36.51 dB | 35.85 dB / 30.20 dB | 29.89 dB / --- dB |
| | Cooperative | 36.69 dB / 35.82 dB | 36.58 dB / 34.83 dB | 36.04 dB / 27.12 dB |
| | Change | -0.13 dB / -0.69 dB | 0.73 dB / 4.63 dB | 6.15 dB / --- dB |
| Heterogeneous 1 (Coastguard, Mobile, Foreman) | Change | -0.36 dB / -0.20 dB | 0.4 dB / 1.27 dB | 9.75 dB / --- dB |
| Heterogeneous 2 (Coastguard, Foreman, Mobile) | Change | 0.35 dB / -0.75 dB | 0.56 dB / -0.36 dB | 4.68 dB / --- dB |


Video Quality Example

• Video user 3 @ 80 m

• Low congestion

[Figure: three frames of the same video]

– Original

– Direct Transmission: 26.9 dB PSNR

– Cooperative Transmission: 34.7 dB PSNR


Optimal Resource Price

| Streaming Scenario | Transmission Mode | Resource Price (Low / High) |
|---|---|---|
| Homogeneous | Direct | 45.79 / 42.97 |
| | Cooperative | 38.72 / 52.56 |
| | Change | -6.93 / 9.59 |
| Heterogeneous 1 | Direct | 51.01 / 53.17 |
| | Change | -2.99 / 18.77 |
| Heterogeneous 2 | Direct | 68.24 / 41.48 |
| | Change | -5.63 / 31.38 |


Summary: Multi-user cooperative video transmission

• Multi-user MDP based approach

– Enables high priority nodes to exploit diversity of channel fading states in the network

– Improves video quality of feeble (distant) nodes by 5-10 dB PSNR

• Reduces quality of nodes with strong direct signals by < 1 dB

– Resource price for managing congestion

• Increases in congested networks

• Decreases in uncongested networks

• Mitigate complexity

– Opportunistic cooperation is long-term optimal

– Decompose problem into local MDPs for each user


Impact in Industry

| Company | Impact |
|---|---|
| Sanyo | (i) Energy-efficient point-to-point wireless communication; (ii) Cooperative video transmission |
| Intel | Optimal video encoder mode decisions |
| IBM | Learning for data exploration |
| Skype | Rigorous modeling and optimization using MDP and reinforcement learning |


Thank you!

http://www.ee.ucla.edu/~nhmastro/

http://medianetlab.ee.ucla.edu/


My Journal Papers

1. [Mastronarde, 2011b] N. Mastronarde and M. van der Schaar, “Fast reinforcement learning for energy efficient wireless communications,” in review.

2. [Mastronarde, 2011a] N. Mastronarde, F. Verde, D. Darsena, A. Scaglione, and M. van der Schaar, “Transmitting important bits and sailing high radio waves: a decentralized cross-layer approach to cooperative video transmission,” in review.

3. [Mastronarde, 2010] N. Mastronarde and M. van der Schaar, “Online reinforcement learning for dynamic multimedia systems,” IEEE Trans. on Image Processing, vol. 19, no. 2, pp. 290-305, Feb. 2010.

4. [Mastronarde, 2009c] N. Mastronarde and M. van der Schaar, “Designing autonomous layered video coders,” Elsevier Journal Signal Processing: Image Communication – Special Issue on Scalable Coded Media Beyond Compression, vol. 24, no. 6, pp. 417-436, July 2009.

5. [Mastronarde, 2009b] N. Mastronarde and M. van der Schaar, “Towards a General Framework for Cross-Layer Decision Making in Multimedia Systems,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 19, no. 5, pp. 719-732, May 2009.

6. [Mastronarde, 2009a] N. Mastronarde and M. van der Schaar, “Automated bidding for media services at the edge of a content delivery network,” IEEE Trans. on Multimedia, vol. 11, no. 3, pp. 543-555, Apr. 2009.

7. [Mastronarde, 2008] N. Mastronarde and M. van der Schaar, “A bargaining theoretic approach to quality-fair system resource allocation for multiple decoding tasks,” IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 3, Mar. 2008.

8. [Mastronarde, 2007b] N. Mastronarde and M. van der Schaar, "A queuing-theoretic approach to task scheduling and processor selection for video decoding applications," IEEE Trans. Multimedia, vol. 8, no. 7, pp. 1493-1507, Nov. 2007.

9. [Mastronarde, 2007a] N. Mastronarde, D. S. Turaga, and M. van der Schaar. “Collaborative resource exchanges for peer-to-peer video streaming over wireless mesh networks,” IEEE J. on Select. Areas in Communications Peer-to-peer Communications and Applications, vol. 25, no. 1, pp. 108-118, Jan. 2007.

10. [Mastronarde, 2006] Y. Andreopoulos, N. Mastronarde, and M. van der Schaar, “Cross-layer optimized video streaming over wireless multi-hop mesh networks,” IEEE J. on Select. Areas in Communications Multi-Hop Wireless Mesh Networks, vol. 24, no. 11, pp. 2104-2115, Nov. 2006.


References

• Myopic Cross-Layer (Multimedia systems and communications)

– [He, 2005] Z. He, Y. Liang, L. Chen, I. Ahmad, and D. Wu, “Power-rate-distortion analysis for wireless video communication under energy constraints,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 15, no. 5, pp. 645-658, May 2005.

– [Sachs, 2003] D. G. Sachs, S. Adve, D. L. Jones, “Cross-layer adaptive video coding to reduce energy on general-purpose processors,” in Proc. International Conference on Image Processing, vol. 3, pp. III-109-112 vol. 2, Sept. 2003.

– [Nahrstedt, 2006] W. Yuan, K. Nahrstedt, S. V. Adve, D. L. Jones, R. H. Kravets, “GRACE-1: cross-layer adaptation for multimedia quality and battery energy,” IEEE Trans. on Mobile Computing, vol. 5, no. 7, pp. 799-815, July 2006.

– [Nahrstedt, 2007] K. Nahrstedt, W. Yuan, S. Shah, Y. Xue, and K. Chen, “QoS support in multimedia wireless environments,” in Multimedia Over IP and Wireless Networks, ed. M. van der Schaar and P. Chou, Academic Press, 2007.

– [Mohapatra, 2005] S. Mohapatra, R. Cornea, H. Oh, K. Lee, M. Kim, N. Dutt, R. Gupta, A. Nicolau, S. Shukla, N. Venkatasubramanian, “A cross-layer approach for power-performance optimization in distributed mobile systems,” 19th IEEE International Parallel and Distributed Processing Symposium, 2005.

– [Pillai, 2003] P. Pillai, H. Huang, and K.G. Shin, “Energy-Aware Quality of Service Adaptation,” Technical Report CSE-TR-479-03, Univ. of Michigan, 2003.

– [van der Schaar 2003] M. van der Schaar, S. Krishnamachari, S. Choi, and X. Xu, “Adaptive cross-layer protection strategies for robust scalable video transmission over 802.11 WLANs,” IEEE JSAC, vol. 21, no. 10, pp. 1752-1763.

– [van der Schaar 2007] M. van der Schaar, Y. Andreopoulos, and Z. Hu, “Optimized scalable video streaming over 802.11 a/e HCCA wireless networks under delay constraints,” IEEE Trans. on Mobile Computing, vol. 5, no. 6, pp. 755-768, June 2006.

• Foresighted Single-Layer (no learning, or heuristic)

– [Benini, 1999] L. Benini, A. Bogliolo, G. A. Paleologo, G. D. Micheli, “Policy optimization for dynamic power management,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 6, pp. 813-833, June 1999.

– [Ortega, 1994] A. Ortega, K. Ramchandran, M. Vetterli, “Optimal trellis-based buffered compression and fast approximations,” IEEE Trans. on Image Processing, vol. 3, no. 1, pp. 26-40, Jan. 1994.

– [Berry, 2002] R. Berry and R. G. Gallager, “Communications over fading channels with delay constraints,” IEEE Trans. Info. Theory, vol. 48, no. 5, pp. 1135-1149, May 2002.



• Foresighted Single Layer (with learning)

– [Chung, 2002] E.-Y. Chung, L. Benini, A. Bogliolo, Y.-H. Lu, and G. De Micheli, “Dynamic power management for nonstationary service requests,” IEEE Trans. on Computers, vol. 51, no. 11, Nov. 2002.

– [Marculescu, 2005] Z. Ren, B. H. Krogh, R. Marculescu, “Hierarchical adaptive dynamic power management,” IEEE Trans. on Computers, vol. 54, no. 4, Apr. 2005.

– [Borkar, 2008] N. Salodkar, A. Bhorkar, A. Karandikar, V. S. Borkar, “An on-line learning algorithm for energy efficient delay constrained scheduling over a fading channel,” IEEE JSAC, vol. 26, no. 4, pp. 732-742, Apr. 2008.

– [Krishnamurthy] M. H. Ngo and V. Krishnamurthy, “Monotonicity of constrained optimal transmission policies in correlated fading channels with ARQ,” IEEE Trans. on Signal Processing, vol. 58, no. 1, pp. 438-451, Jan. 2010.

• Multiuser network optimization

– [Neely, 2010] L. Huang, S. Moeller, M. J. Neely and B. Krishnamachari, “LIFO-Backpressure Achieves Near Optimal Utility-Delay Tradeoff,” Aug. 2010, ArXiv Technical Report, arXiv:1008.4895v1.

– [Neely, 2009] M. J. Neely and R. Urgaonkar, "Optimal Backpressure Routing in Wireless Networks with Multi-Receiver Diversity," Ad Hoc Networks (Elsevier), vol. 7, no. 5, pp. 862-881, July 2009.

– [Neely, 2006] M. J. Neely, "Energy Optimal Control for Time Varying Wireless Networks", IEEE Trans. On Information Theory, vol. 52, no. 7, pp. 2915-2934, July 2006.

– [Fu, van der Schaar, 2010] F. Fu and M. van der Schaar, “A systematic framework for dynamically optimizing multi-user video transmission,” IEEE JSAC, vol. 28, pp. 308-320, Apr. 2010.

– [Chiang 2007] M. Chiang, S. H. Low, A. R. Calderbank, and J.C. Doyle, “Layering as optimization decomposition: A mathematical theory of network architectures,” Proc. of IEEE, vol. 95, no. 1, 2007.



• Other

– [Katsaggelos 2008] J. Huang, Z. Li, M. Chiang, and A.K. Katsaggelos, “Joint Source Adaptation and Resource Allocation for Multi-User Wireless Video Streaming,” IEEE Trans. Circuits and Systems for Video Technology, vol. 18, issue 5, 582-595, May 2008.

– [Katsaggelos 2007] E. Maani, P. Pahalawatta, R. Berry, T.N. Pappas, and A.K. Katsaggelos, “Resource Allocation for Downlink Multiuser Video Transmission over Wireless Lossy Networks,” IEEE Transactions on Image Processing, vol. 17, issue 9, 1663-1671, September 2008.

– [Su 2007] G.-M. Su, Z. Han, M. Wu, and K.J.R. Liu, “Joint Uplink and Downlink Optimization for Real-Time Multiuser Video Streaming Over WLANs,” IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 2, pp. 280-294, August 2007.

– [Knopp 1995] R. Knopp and P. A. Humblet, “Information capacity and power control in single-cell multiuser communications,” Proc. IEEE ICC, 1995.

– [Viswanath 2002] P. Viswanath, D. N. C. Tse, R. Laroia, “Opportunistic beamforming using dumb antennas,” IEEE Trans. on Information Theory, vol. 48, no. 6, June 2002.

– [Tse 2005] D. N. C. Tse and P. Viswanath, Fundamentals of wireless communication. Cambridge, U.K.: Cambridge Univ. Press, 2005.

– [Alay 2009] O. Alay, P. Liu, Z. Guo, L. Wang, Y. Wang, E. Erkip, and S. Panwar, “Cooperative layered video multicast using randomized distributed space time codes”, IEEE INFOCOM Workshops 2009, Rio de Janeiro, Brazil, Oct. 2009, pp. 1–6.

– [Laneman 2003] J.N. Laneman and G.W. Wornell, “Distributed space-time block coded protocols for exploiting cooperative diversity in wireless networks,” IEEE Trans. Inf. Theory, vol. 49, pp. 2415–2425, Oct. 2003.

– [Sendonaris 2003] A. Sendonaris, E. Erkip, and B. Aazhang, “User cooperation diversity – Part I & II,” IEEE Trans. Commun., vol. 51, pp. 1927–1948, Nov. 2003.

– [Melodia, 2010] T. Melodia and W. Heinzelmann, “Cross-layer optimization in video sensor networks,” IEEE COMSOC MMTC E-Letter, vol. 5, no. 3, May 2010.


Supplementary Slides


Multimedia Application Characteristics

• Characteristics

– Stringent delay constraints

– Sophisticated source-coding dependency structures

– Mixed priorities

– Intense resource requirements

[Figure: dependency structures. (a) Sequential dependencies; (b) typical hybrid coder dependencies (MPEG-2, H.264/AVC) [Chou, 2006]; (c) scalable coding dependencies.]

[Figure: decoding complexity (Silent sequence). Complexity profile over time for decoding four layers, Silent.CIF at 1.5 Mb/s: normalized processor ticks (0 to 8000) versus time (0 to 9 seconds).]

Reinforcement Learning Architecture

[Diagram: reinforcement learning loop. Blocks: Policy; Traffic and Channel Dynamics; Value Function. Signals: state, action, cost, error.]


Conventional Reinforcement Learning Algorithm

• Q-learning

– Experience tuple: $\sigma^n = (s^n, a^n, c^n, s^{n+1})$, where $E[c^n] = c(s^n, a^n)$ and $s^{n+1} \sim p(s' \mid s^n, a^n)$

– Q-learning update:

$Q^{n+1}(s^n, a^n) \leftarrow (1 - \alpha^n)\, Q^n(s^n, a^n) + \alpha^n \left[ c^n + \gamma \min_{a' \in \mathcal{A}} Q^n(s^{n+1}, a') \right]$

(revised estimate on the left; starting estimate $Q^n(s^n, a^n)$; new sample in brackets)

• Exploration vs. Exploitation

– Given state $s^n$, how do we choose the action $a^n$?

– Exploitation: Take greedy action w.r.t. action-value function estimate

• Prevents the discovery of potentially better actions

– Exploration: Take suboptimal action w.r.t. action-value function estimate

• Sacrifice immediate performance for possibly improved future performance
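A minimal sketch of the cost-minimizing Q-learning update and of one way to trade off exploration and exploitation. The epsilon-greedy rule, the function names, and the toy numbers are illustrative assumptions; the slides do not specify how exploratory actions are chosen.

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, cost, s_next, actions, alpha, gamma):
    """Apply one cost-minimizing Q-learning update in place and return the new value."""
    # New sample: observed cost plus discounted value of the greedy (min-cost) next action.
    best_next = min(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (cost + gamma * best_next)
    return Q[(s, a)]

def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Assumed exploration rule: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(actions)                # exploration
    return min(actions, key=lambda a: Q[(s, a)])  # exploitation (greedy w.r.t. Q)

# Hypothetical toy experience tuple (s=0, a=1, c=2.0, s'=1).
Q = defaultdict(float)
actions = [0, 1]
new_value = q_learning_update(Q, s=0, a=1, cost=2.0, s_next=1,
                              actions=actions, alpha=0.5, gamma=0.9)
greedy_action = epsilon_greedy(Q, 0, actions, epsilon=0.0)
```

With all estimates starting at zero, the update moves the visited entry halfway toward the new sample, and the greedy rule then avoids the action that was just observed to be costly.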


Partially Known Dynamics

• Known dynamics

– Goodput: $p^f(f^n \mid BEP^n, z^n) = \mathrm{bin}(z^n, 1 - PLR^n)$

– PM state transition distribution

– Power cost

– Holding cost

• Unknown dynamics

– Packet arrival distribution: $p^l(l)$

– Channel state transition: $p^h(h' \mid h)$

– Overflow cost

• Post-decision state

– An intermediate state

• After known dynamics take place

• Before unknown dynamics take place
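A small sketch of the known goodput dynamics, in which the number of successfully delivered packets is binomial in the number of transmitted packets. The mapping from bit-error probability to packet loss rate, PLR = 1 - (1 - BEP)^L, is an assumption made here for illustration (independent bit errors, a packet is lost if any bit is wrong); the function names and numbers are hypothetical.

```python
import random

def packet_loss_rate(bep, packet_bits):
    """Assumed PLR model: a packet is lost if any of its bits is in error."""
    return 1.0 - (1.0 - bep) ** packet_bits

def sample_goodput(z, plr, rng):
    """Draw f ~ bin(z, 1 - PLR): each of the z packets succeeds independently."""
    return sum(1 for _ in range(z) if rng.random() >= plr)

rng = random.Random(0)
plr = packet_loss_rate(bep=1e-4, packet_bits=5000)
goodput = sample_goodput(z=10, plr=plr, rng=rng)
```

Because the goodput distribution depends only on the chosen action (through BEP and z), it can be treated as known, unlike the arrival and channel statistics.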


Decomposition into Known and Unknown Components

• Known and unknown transition probabilities

– Known: $p_{\mathrm{k}}(\tilde{s} \mid s, a) = p^x(\tilde{x} \mid x, y)\, p^f(b - \tilde{b} \mid BEP, z)\, I(\tilde{h} = h)$

– Unknown: $p_{\mathrm{u}}(s' \mid \tilde{s}) = p^h(h' \mid \tilde{h})\, p^l(b' - \tilde{b})\, I(x' = \tilde{x})$

• Known and unknown costs

– Known: $c_{\mathrm{k}}(s, a) = \sum_{f=0}^{z} p^f(f \mid BEP, z)\,[b - f] + \mu\, \rho([h, x], BEP, y, z)$ (holding cost plus power cost)

– Unknown: $c_{\mathrm{u}}(\tilde{s}) = \frac{\gamma}{1-\gamma} \sum_{l=0}^{\infty} p^l(l) \max(\tilde{b} + l - B, 0)$ (overflow cost)

The unknown components do not depend on the action
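To make the split concrete, the sketch below evaluates the two expectations that appear in the costs: the expected holding cost under binomial goodput (known, action-dependent) and the expected overflow under an arrival distribution (unknown, action-independent). Any constant penalty factor multiplying the overflow term is left out, the arrival distribution is a made-up example, and the names are hypothetical.

```python
from math import comb

def binomial_pmf(f, z, p_success):
    """Probability of f successes out of z independent trials."""
    return comb(z, f) * p_success**f * (1 - p_success) ** (z - f)

def expected_holding_cost(b, z, plr):
    """Known part: sum over goodput f of p(f | BEP, z) * (b - f), assuming z <= b."""
    p_ok = 1.0 - plr
    return sum(binomial_pmf(f, z, p_ok) * (b - f) for f in range(z + 1))

def expected_overflow(b_pds, capacity, arrival_pmf):
    """Unknown part: expected number of packets exceeding the buffer capacity."""
    return sum(p * max(b_pds + l - capacity, 0) for l, p in arrival_pmf.items())

holding = expected_holding_cost(b=8, z=4, plr=0.25)      # equals b - z * (1 - PLR)
overflow = expected_overflow(b_pds=5, capacity=6,
                             arrival_pmf={0: 0.5, 2: 0.5})  # hypothetical arrivals
```

Note that `expected_overflow` takes only the post-decision backlog as input, which mirrors the statement that the unknown components do not depend on the action.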


Post-Decision State Learning

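• Post-decision state learning (online):

– PDS experience tuple: $\tilde{\sigma}^n = (s^n, a^n, \tilde{s}^n, c_{\mathrm{u}}^n, s^{n+1})$, where $E[c_{\mathrm{u}}^n] = c_{\mathrm{u}}(\tilde{s}^n)$ and $s^{n+1} \sim p_{\mathrm{u}}(s' \mid \tilde{s}^n)$

(a) $\tilde{V}^*(\tilde{s}) = c_{\mathrm{u}}(\tilde{s}) + \gamma \sum_{s'} p_{\mathrm{u}}(s' \mid \tilde{s})\, V^*(s')$

(b) $V^*(s) = \min_{a \in \mathcal{A}} \left\{ c_{\mathrm{k}}(s, a) + \sum_{\tilde{s}} p_{\mathrm{k}}(\tilde{s} \mid s, a)\, \tilde{V}^*(\tilde{s}) \right\}$

The PDS value function must be learned:

$\tilde{V}^{n+1}(\tilde{s}^n) \leftarrow (1 - \alpha^n)\, \tilde{V}^n(\tilde{s}^n) + \alpha^n \left[ c_{\mathrm{u}}^n + \gamma V^n(s^{n+1}) \right]$

(revised estimate on the left; starting estimate $\tilde{V}^n(\tilde{s}^n)$; new sample in brackets)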
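A minimal sketch of one online PDS learning step, under a hypothetical toy model: the state is a buffer backlog, the action is the number of packets to send (0 or 1), and the known dynamics are deterministic. The state value is obtained from the PDS value function through the known cost and known transitions, so only the PDS value function is stored and updated. All names and numbers are assumptions for illustration.

```python
from collections import defaultdict

def state_value(s, actions, known_cost, known_next_pds, V_pds):
    """V(s) = min over a of c_k(s, a) + sum over PDS of p_k(pds | s, a) * V_pds(pds)."""
    return min(
        known_cost(s, a)
        + sum(p * V_pds[pds] for pds, p in known_next_pds(s, a).items())
        for a in actions
    )

def pds_update(V_pds, pds, unknown_cost_sample, s_next, alpha, gamma,
               actions, known_cost, known_next_pds):
    """Blend the old PDS value with the new sample c_u + gamma * V(s_next)."""
    sample = unknown_cost_sample + gamma * state_value(
        s_next, actions, known_cost, known_next_pds, V_pds)
    V_pds[pds] = (1 - alpha) * V_pds[pds] + alpha * sample
    return V_pds[pds]

# Hypothetical toy model.
MU = 0.5  # weight on the transmission cost

def known_cost(s, a):
    sent = min(a, s)
    return (s - sent) + MU * sent   # holding cost plus weighted transmission cost

def known_next_pds(s, a):
    return {s - min(a, s): 1.0}     # packets leave the buffer with probability 1

V_pds = defaultdict(float)
updated = pds_update(V_pds, pds=1, unknown_cost_sample=0.0, s_next=2,
                     alpha=0.5, gamma=0.9, actions=(0, 1),
                     known_cost=known_cost, known_next_pds=known_next_pds)
```

Unlike the Q-learning update, no action enters the learned table: a single sample improves the estimate for every state-action pair that leads to the same post-decision state.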


Comparison to Prior Work using Post-Decision States [N. Salodkar, 2010]*

Feature                  Salodkar             Proposed
DPM                      No                   Yes
AMC                      Yes                  Yes
Power-control            Yes                  Yes
Packet losses            No                   Yes
Post-decision state      Deterministic        Stochastic
Costs                    Known only           Known and unknown
State transitions        Known and unknown    Known and unknown
Optimization Criteria    Undiscounted         Discounted
Virtual Experience       No                   Yes

* Differences in the proposed work are highlighted in red


Learning Algorithm Performance Comparison

[Figure: learning algorithm performance versus time slot n (0 to 6 × 10^4). Panels: (a) holding cost; (b) power (mW); (e) θ_off. Curves: PDS + Virtual Experience (update period = 1, 10, 25, 125); PDS Learning; PDS Learning (No DPM); Q-learning.]

*[Borkar, 2008]


Comparison to Optimal Policy With Imperfect Statistics

[Figure: holding cost (0 to 12) and power (mW, 0 to 350) versus time slot n (log scale, 10^0 to 10^5). Curves: PDS + Virtual Experience (update period = 1); Optimal policy (imperfect statistics).]


Non-Stationary Arrivals

[Figure: expected arrival rate (packets/s, 0 to 400) versus time slot n (0 to 6 × 10^4).]

• Unobservable 5-state Markov modulated process

– States

• Expected arrival rate for a Poisson arrival process

• (0, 100, 200, 300, 400) packets/s

– Stationary distribution

• (0.0188, 0.3755, 0.0973, 0.4842, 0.0242).
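As a quick worked check of the numbers above, the long-run average arrival rate of the modulated process is the stationary-probability-weighted sum of the per-state rates. The variable names are illustrative.

```python
# Per-state expected arrival rates (packets/s) and stationary probabilities from the slide.
rates = (0, 100, 200, 300, 400)
stationary = (0.0188, 0.3755, 0.0973, 0.4842, 0.0242)

total_probability = sum(stationary)
mean_rate = sum(r * p for r, p in zip(rates, stationary))  # about 211.95 packets/s
```

The probabilities sum to one, and the average rate is close to the 200 packets/s state even though the most likely states are 100 and 300 packets/s.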


Non-Stationary Channel Transitions

• Channel state transition probabilities vary over time as an AR(1) process.

Self-transition probabilities

(White indicates a relatively high self-transition probability)


Information-Theoretic Power Cost

$c(h, z) = \frac{\sigma^2}{h} \left( 2^{2z} - 1 \right)$


Physical Layer: Adaptive Modulation and Power Control

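• Transmission rate: $\beta^n / T_s$

– bits per symbol: $\beta^n \geq 1$

– packet length (bits): $L$

– packet rate (packets/s): $r^n = \beta^n / (L T_s)$

• Bit-error probability (BEP): $BEP^n \leq 4\, Q\!\left( \sqrt{\frac{3 h^n \gamma^n}{2^{\beta^n} - 1}} \right)$, where $Q(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-u^2/2}\, du$

• SNR: $\gamma^n = \frac{T_s P_{\mathrm{tx}}^n}{N_0}$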


Physical Layer: Adaptive Modulation and Power Control

• Variables of interest

– Packet throughput (packets/time slot): $z^n$

– Bits per symbol: $\beta^n$

– Transmission power (watts): $P_{\mathrm{tx}}^n$

– Bit-error probability: $BEP^n$

• Adaptive modulation: decision variable 1, the packet throughput $z^n$, sets the bits per symbol: $\beta^n = z^n L T_s / \Delta t$

• Power control: decision variable 2, the bit-error probability $BEP^n$, sets the transmission power: $P_{\mathrm{tx}}^n \geq \frac{N_0 \left( 2^{\beta^n} - 1 \right)}{3 h^n T_s} \left[ Q^{-1}\!\left( \tfrac{1}{4} BEP^n \right) \right]^2$
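The sketch below walks through the chain from decision variables to physical-layer quantities: packet throughput to bits per symbol, and target bit-error probability to the minimum transmission power, using the standard QAM-style bound BEP ≤ 4 Q(sqrt(3 h SNR / (2^β - 1))) with SNR = T_s P_tx / N_0. The numeric parameters (symbol period, slot length, channel gain, noise density) are hypothetical values chosen only to make the example run.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_inverse(p):
    """Inverse of Q, using Q(x) = Phi(-x)."""
    return -_STD_NORMAL.inv_cdf(p)

def bits_per_symbol(z, packet_bits, symbol_period, slot_duration):
    """Adaptive modulation: beta = z * L * T_s / dt."""
    return z * packet_bits * symbol_period / slot_duration

def bep_bound(beta, h, tx_power, symbol_period, noise_psd):
    """Upper bound on the bit-error probability for a given transmission power."""
    snr = symbol_period * tx_power / noise_psd
    return 4.0 * q_function(math.sqrt(3.0 * h * snr / (2.0**beta - 1.0)))

def required_tx_power(beta, h, target_bep, symbol_period, noise_psd):
    """Power control: smallest power whose BEP bound meets the target."""
    scale = noise_psd * (2.0**beta - 1.0) / (3.0 * h * symbol_period)
    return scale * q_inverse(target_bep / 4.0) ** 2

# Hypothetical operating point.
T_S, DT, L_BITS, H, N0 = 1e-6, 0.01, 5000, 0.5, 1e-9
beta = bits_per_symbol(z=4, packet_bits=L_BITS, symbol_period=T_S, slot_duration=DT)
power = required_tx_power(beta, H, target_bep=1e-3, symbol_period=T_S, noise_psd=N0)
achieved = bep_bound(beta, H, power, T_S, N0)
```

Plugging the computed power back into the bound recovers the target bit-error probability, which is a useful sanity check that the two formulas are inverses of each other.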


Initializing the PDS Value Function

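• PDS Value iteration:

$\tilde{V}_k(\tilde{s}) = c_{\mathrm{u}}(\tilde{s}) + \gamma \sum_{s'} p_{\mathrm{u}}(s' \mid \tilde{s})\, V_k(s')$

$V_{k+1}(s) = \min_{a \in \mathcal{A}} \left\{ c_{\mathrm{k}}(s, a) + \sum_{\tilde{s}} p_{\mathrm{k}}(\tilde{s} \mid s, a)\, \tilde{V}_k(\tilde{s}) \right\}$

• Initialization:

– Define reasonable estimates: $\hat{c}_{\mathrm{u}}(\tilde{s})$ and $\hat{p}_{\mathrm{u}}(s' \mid \tilde{s})$

– Perform PDS value iteration with estimates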
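A rough sketch of the initialization procedure on a hypothetical toy model: guess the unknown arrival statistics, build the corresponding estimated unknown cost and transition function, and run PDS value iteration to obtain a starting PDS value function. The buffer size, cost weights, overflow penalty, and arrival guess are all assumptions for illustration.

```python
def pds_value_iteration(states, pds_states, actions, c_k, p_k, c_u, p_u,
                        gamma, sweeps=500):
    """Alternate the two PDS value-iteration equations for a fixed number of sweeps."""
    V = {s: 0.0 for s in states}
    V_pds = {x: 0.0 for x in pds_states}
    for _ in range(sweeps):
        # PDS value from the estimated unknown cost and unknown transitions.
        V_pds = {
            x: c_u(x) + gamma * sum(p * V[s2] for s2, p in p_u(x).items())
            for x in pds_states
        }
        # State value from the known cost and known transitions.
        V = {
            s: min(
                c_k(s, a) + sum(p * V_pds[x] for x, p in p_k(s, a).items())
                for a in actions
            )
            for s in states
        }
    return V, V_pds

# Hypothetical toy model: buffer of size B, send 0 or 1 packet, at most one arrival per slot.
B, MU, ETA, ARRIVAL_GUESS, GAMMA = 2, 0.5, 10.0, 0.3, 0.9
STATES = PDS_STATES = range(B + 1)
ACTIONS = (0, 1)

def c_k(s, a):
    sent = min(a, s)
    return (s - sent) + MU * sent        # holding cost plus weighted transmission cost

def p_k(s, a):
    return {s - min(a, s): 1.0}          # known dynamics: packets leave the buffer

def c_u_hat(x):
    return ETA * ARRIVAL_GUESS * max(x + 1 - B, 0)  # estimated expected overflow penalty

def p_u_hat(x):
    nxt = {}
    for arrivals, prob in ((0, 1 - ARRIVAL_GUESS), (1, ARRIVAL_GUESS)):
        s2 = min(x + arrivals, B)        # arrivals beyond capacity are dropped
        nxt[s2] = nxt.get(s2, 0.0) + prob
    return nxt

V_init, V_pds_init = pds_value_iteration(
    STATES, PDS_STATES, ACTIONS, c_k, p_k, c_u_hat, p_u_hat, GAMMA)
```

The resulting `V_pds_init` could then seed the online PDS update, so that learning starts from a value function that already reflects the known dynamics instead of from zeros.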


Impact of Initial Conditions

• Initial arrival rate is assumed deterministic or uniform

• Channel state is assumed constant

[Figure: power (mW, 180 to 240) versus holding cost (packets, 3.9 to 4.2). Curves: initial arrival rate = 100, 200, 300, 400, 500, 600 packets/s, and uniform.]


Traffic State Transition Illustration

[Diagram: traffic state and scheduling action over three time slots]

Time t: traffic state $F_t = (1,2,3)$, $b_t = (4,3,2)$; scheduling action $y_t = (4,0,0)$

Time t+1: traffic state $F_{t+1} = (2,3,1,4)$, $b_{t+1} = (3,2,6,1)$; scheduling action $y_{t+1} = (0,0,2,0)$

Time t+2: traffic state $F_{t+2} = (1,4,2,3)$, $b_{t+2} = (4,-1,4,1)$; scheduling action $y_{t+2} = (4,0,1,0)$


Phase I Time Fraction

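• K and Q symbols have to be transmitted in phase I and phase II, respectively

$\rho_t^i\, \beta_t^{i,1}\, Q = (1 - \rho_t^i)\, \beta_t^{i,2}\, K$

$R_c = K/Q \leq 1$ is the rate of the orthogonal STBC rule

[Equation not recoverable from extraction: closed form for the phase I time fraction $\rho_t^i$ and the resulting condition relating $\beta_t^{i,1}$, $\beta_t^{i,2}$, $\beta_t^{i,0}$, and $R_c$]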


Reformulated Multi-user Markov Decision Process

• Global state:

• Decision variables:


• Dynamic programming equation:

• Constraints:

• Challenges:

– Complexity is proportional to $|S|$, which scales exponentially in $M$

– Traffic state information is local to users

Optimization Decomposition

• The multi-user optimization can be decomposed into local MDPs satisfying:

• Requires message exchanges between users and the AP

– Users → AP: Discounted infinite horizon resource consumption

– AP → Users: Uniform resource price to manage congestion


Proposed Opportunistic Cooperation Protocol (1/2)

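• When is cooperative transmission better than direct transmission?

[Equation partially recoverable: cooperation is better when $\beta_t^{i,\mathrm{coop}} > \beta_t^{i,0}$, restated as an inequality in $\beta_t^{i,1}$, $\beta_t^{i,2}$, $\beta_t^{i,0}$, and $R_c$]

• Candidate cooperative nodes can self-select:

[Equation not recoverable from extraction: self-selection set $\mathcal{C}$ defined through a threshold $\xi_t^i$]

• AP verifies fulfillment of the following condition:

[Equation not recoverable from extraction: inequality in $\beta_t^{i,2}$, $\beta_t^{i,0}$, $\xi_t^i$, and $R_c$]

• If satisfied, then cooperation is better; otherwise, choose direct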


Proposed Opportunistic Cooperation Protocol (2/2)

• RTS

– Request to send

• CRS

– Cooperative recruitment signal

• HTS

– Help to send

• CTS

– Clear to send

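[Diagram: RTS/CRS/HTS/CTS message exchange. Recoverable labels: channel gains $h_t$ for all nodes in $\{0, \ldots, M\}$; $\beta_t^{i,0}$ sent to candidates; $z$, $\beta^{,2}$, and $\lambda$. *Dual channels are assumed.]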


Computation of Transmission Rates

• Direct rate

• Phase I rate

• Phase II rate



Cooperation Statistics
