Universal approximators for Direct Policy Search in multi-purpose water reservoir management
Matteo Giuliani, Emanuele Mason, Andrea Castelletti, Francesca Pianosi, Rodolfo Soncini-Sessa
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy
Hydroinformatics Lab, Como Campus, Politecnico di Milano, Italy
Department of Civil and Environmental Engineering, University of Bristol, Bristol, UK
Universal approximators for Direct Policy Search in multi-purpose water reservoir management: A comparative analysis
IFAC 2014, Cape Town, South Africa
Modelling and Control of Water Systems
Controlling hydro-environmental systems
The long-term optimal operation of hydro-environmental systems can be formulated as a q-objective stochastic optimal control problem
$$\min_{\mu_t(\cdot)} \; \mathbf{J} = |J^1 \; J^2 \ldots J^q|$$

where each objective is the expected long-term discounted cost

$$J^i = \lim_{h \to \infty} \mathbb{E}_{\varepsilon_1, \ldots, \varepsilon_h} \left[ \sum_{t=0}^{h-1} \gamma^t \, g^i_t(x_t, u_t, \varepsilon_{t+1}) \right] \quad \forall i = 1, \ldots, q$$

with $g^i_t$ the $i$-th immediate cost and $\gamma$ the discount factor, subject to

$$u_t = \mu_t(x_t), \qquad x_{t+1} = f_t(x_t, u_t, \varepsilon_{t+1}), \qquad \varepsilon_{t+1} \sim \phi(\cdot)$$

where $x_t \in \mathbb{R}^{n_x}$ is the state, $u_t \in \mathbb{R}^{n_u}$ the control, and $\varepsilon_t \in \mathbb{R}^{n_\varepsilon}$ the disturbance.
SDP and the 3 curses
Stochastic Dynamic Programming is, in principle, the best approach to solve the problem; in practice, it suffers from three major shortcomings:
1) Curse of dimensionality: computational cost grows exponentially with state, control and disturbance dimension [Bellman, 1967];
[Figure: the unknown Q-function Q_t(x_t, u_t) is replaced by a look-up table returning the optimal control u*_t]
Computations are numerically performed on a discretized variable domain.
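The discretization-based backward recursion can be sketched as follows. This is a minimal sketch with toy dynamics, a toy cost, and assumed grid sizes, not the case-study model; its point is that the nested loops make the exponential growth with dimension explicit: cost is O(|X|·|U|·|E|) per stage, and |X| itself grows as m^n_x when each of the n_x state variables is discretized into m levels.

```python
import numpy as np

n_states, n_controls, n_dist = 50, 20, 10          # grid sizes (assumptions)
states = np.linspace(0.0, 1.0, n_states)            # discretized storage
controls = np.linspace(0.0, 0.5, n_controls)        # discretized release
dist = np.linspace(0.0, 0.2, n_dist)                # discretized inflow
p_dist = np.full(n_dist, 1.0 / n_dist)              # uniform disturbance pmf

gamma, horizon = 0.99, 30

def step(x, u, e):
    """Toy mass-balance transition: next storage, clipped to the grid."""
    return np.clip(x - u + e, 0.0, 1.0)

def cost(x, u):
    """Toy immediate cost: penalize deviation from a target storage."""
    return (x - 0.6) ** 2 + 0.1 * u

V = np.zeros(n_states)                              # terminal value
for t in range(horizon):                            # backward recursion
    Q = np.empty((n_states, n_controls))
    for i, x in enumerate(states):
        for j, u in enumerate(controls):
            # expectation over the discretized disturbance
            xn = step(x, u, dist)
            Vn = np.interp(xn, states, V)           # interpolate off-grid states
            Q[i, j] = cost(x, u) + gamma * p_dist @ Vn
    V = Q.min(axis=1)                               # Bellman update

best_u = controls[Q.argmin(axis=1)]                 # greedy policy on the grid
```

Adding one more state variable multiplies the grid, and hence the per-stage cost, by another factor of m, which is exactly the curse of dimensionality described above.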
2) Curse of modelling: any variable considered among the operating rule’s arguments has to be modelled [Bertsekas and Tsitsiklis, 1996];
[Figure: one-step transition from time t to t+1, driven by x_t, u_t and ε_{t+1}]
Models are used in a multiple one-step-ahead simulation mode.
3) Curse of multiple objectives: computational cost grows exponentially with the number of objectives considered [Powell, 2011].
[Figure: Pareto frontier in the (J1, J2, J3) objective space]
Multi-objective problems are solved by repeatedly solving single-objective problems.
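The reduction to repeated single-objective solves can be sketched with the classical weighting method: one scalarized solve per weight vector, so the cost grows with the number of weight combinations (exponentially in q for a fixed weight grid). The two toy objectives and the brute-force "solver" below are illustrative assumptions.

```python
import numpy as np

def J(theta):
    """Two toy conflicting objectives of a scalar decision variable."""
    return np.array([theta ** 2, (theta - 2.0) ** 2])

candidates = np.linspace(-1.0, 3.0, 401)            # brute-force search grid
pareto = []
for w in np.linspace(0.0, 1.0, 11):                 # one solve per weight
    weights = np.array([w, 1.0 - w])
    scalar = np.array([weights @ J(th) for th in candidates])
    pareto.append(candidates[scalar.argmin()])      # single-objective optimum

# Each scalarized minimizer lies between the two individual optima (0 and 2),
# tracing an approximation of the Pareto frontier point by point.
```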
Beyond SDP: ADP and RL
Approximate Dynamic Programming and Reinforcement Learning provide a framework to overcome some or all of the SDP curses [Powell, 2007; Busoniu et al., 2011].
VALUE FUNCTION-BASED APPROACHES:
• Approximate value iteration
• Approximate policy iteration
• Approximate policy evaluation
Model-free or model-based // parametric or non-parametric
POLICY SEARCH-BASED APPROACHES:
• Direct policy search
Simulation-based optimization // parametric
Multi-objective Direct Policy Search (MODPS)
Assume the operating rule belongs to a given family of functions and search for the optimal solution in the policy's parameter space.
ORIGINAL PROBLEM

$$\min_{\mu_t(\cdot)} \; \mathbf{J} = |J^1 \; J^2 \ldots J^q|$$

subject to

$$u_t = \mu_t(x_t), \qquad x_{t+1} = f_t(x_t, u_t, \varepsilon_{t+1}), \qquad \varepsilon_{t+1} \sim \phi(\cdot)$$
$$x_t \in \mathbb{R}^{n_x}, \qquad u_t \in \mathbb{R}^{n_u}, \qquad \varepsilon_t \in \mathbb{R}^{n_\varepsilon}$$

POLICY SEARCH PROBLEM

$$\min_{\theta_t} \; \mathbf{J} = |J^1 \; J^2 \ldots J^q|$$

subject to

$$u_t = \mu_t(x_t, \theta_t), \qquad x_{t+1} = f_t(x_t, u_t, \varepsilon_{t+1}), \qquad \varepsilon_{t+1} \sim \phi(\cdot)$$
$$x_t \in \mathbb{R}^{n_x}, \qquad u_t \in \mathbb{R}^{n_u}, \qquad \varepsilon_t \in \mathbb{R}^{n_\varepsilon}, \qquad \theta_t \in \Theta_t \subseteq \mathbb{R}^{n_\theta}$$
WHEN?
1. The system is already operated;
2. The system is simple (i.e., one reservoir) AND/OR the system has one single objective (e.g., water supply).
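The policy search loop can be sketched end to end: fix a parametric rule u_t = μ(x_t; θ) and score each candidate θ by simulating the system over the horizon. The toy dynamics, the two-parameter linear rule, the synthetic inflow, and the random-search optimizer are all assumptions for illustration; in the talk the rules are ANN/RBF and the optimizer is the Borg MOEA.

```python
import numpy as np

rng = np.random.default_rng(42)
inflow = 0.1 + 0.05 * rng.standard_normal(365)      # synthetic daily inflow

def policy(x, theta):
    """Toy linear release rule u = theta0 + theta1 * storage, clipped."""
    return np.clip(theta[0] + theta[1] * x, 0.0, x)

def simulate(theta):
    """Simulate one year; return two objectives (supply deficit, flood)."""
    x, deficit, flood = 0.5, 0.0, 0.0
    for e in inflow:
        u = policy(x, theta)
        deficit += max(0.12 - u, 0.0) ** 2          # water-supply objective
        flood += max(x - 0.9, 0.0) ** 2             # flood objective
        x = np.clip(x - u + e, 0.0, 1.0)            # mass balance
    return np.array([deficit, flood])

# Random search over theta stands in for the evolutionary algorithm.
thetas = rng.uniform([0.0, 0.0], [0.2, 0.5], size=(200, 2))
scores = np.array([simulate(th) for th in thetas])
best = thetas[scores.sum(axis=1).argmin()]          # crude scalarized pick
```

Note that the simulator is treated as a black box: no discretization, no one-step-ahead model structure, only policy parameters and simulated objective values.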
Selecting the policy approximation: Ad hoc/Empiricism
[Figure: piecewise-linear empirical operating rule, release [m³/s] (0-500) vs. storage [Mm³] (0-250), with segments θ1-θ5]
• New York City rule [Clark, 1950]
• Space rule [Clark, 1956]
• Standard Operating Policy [Draper, 2004]
• …..
Identify existing regularities in a sample of the operator behaviour
Empirical rules identified in the past
Selecting the policy approximation: Universal Approx.
Provided that some conditions are met, a universal approximator can approximate any continuous function arbitrarily closely.
ARTIFICIAL NEURAL NETWORKS [Cybenko 1989, Funahashi 1989, Hornik et al. 1989]
GAUSSIAN RADIAL BASIS FUNCTIONS [Busoniu et al. 2011]
Parameter dimension:
• Gaussian RBF with N bases: $n_\theta = N(2 n_x + n_u)$
• ANN with N neurons: $n_\theta = n_u (N(n_x + 2) + 1)$
[Figure: network architectures with inputs x1, x2, x3 and output u1]
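The two parameter-count formulas can be written out and checked against the structure counts (a sketch; the assignment follows the usual bookkeeping: an ANN with N hidden neurons has N(n_x+1) hidden weights and biases plus N+1 output weights and bias per output, giving n_u(N(n_x+2)+1); a Gaussian RBF rule has 2n_x center/radius parameters per basis plus n_u output weights per basis, giving N(2n_x+n_u)).

```python
def n_theta_ann(N, n_x, n_u):
    """Parameter dimension of an ANN rule with N hidden neurons."""
    return n_u * (N * (n_x + 2) + 1)

def n_theta_rbf(N, n_x, n_u):
    """Parameter dimension of a Gaussian RBF rule with N bases."""
    return N * (2 * n_x + n_u)

# Hoa Binh setting from the case study: n_x = 5 state variables, n_u = 1.
counts = {N: (n_theta_ann(N, 5, 1), n_theta_rbf(N, 5, 1)) for N in (4, 6, 8)}
```

For the same N, the RBF parameterization is larger here because each basis carries its own center and radius in every state dimension.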
Selecting the optimization algorithm
Key problem features:
• High-dimensional search spaces (rich parameterizations)
• Complex search spaces (many local minima)
• Sensitivity to parameter initialization (no preconditioning)
• Multiple objectives
• Non-differentiable objective functions
• Sensitivity to noise
BORG [Hadka and Reed 2012; Reed et al. 2013], a MULTI-OBJECTIVE EVOLUTIONARY ALGORITHM, addresses these features: it is self-adaptive and employs
• multiple search operators adaptively selected during the optimization
• ε-dominance archiving with internal operators to detect search stagnation
• randomized restarts to escape local optima
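The ε-dominance archiving idea can be sketched as follows. This is a simplified reading for illustration, not Borg's actual implementation: objective vectors are mapped to ε-boxes, a candidate is rejected if an archived point's box weakly dominates (or equals) its box, and archived points whose boxes it dominates are evicted. Minimization of all objectives is assumed.

```python
import numpy as np

def eps_box(f, eps):
    """Map an objective vector to its epsilon-box index."""
    return np.floor(np.asarray(f, dtype=float) / eps).astype(int)

def box_dominates(a, b):
    """True if box index a weakly dominates box index b (minimization)."""
    return bool(np.all(a <= b) and np.any(a < b))

def accept(archive, f, eps):
    """Return the updated archive after offering candidate f."""
    box_f = eps_box(f, eps)
    kept = []
    for g in archive:
        box_g = eps_box(g, eps)
        if box_dominates(box_g, box_f) or np.array_equal(box_g, box_f):
            return archive                  # f is dominated or shares a box
        if not box_dominates(box_f, box_g):
            kept.append(g)                  # g survives f
    return kept + [f]
```

The ε resolution bounds the archive size, which is what makes stagnation detectable: when few candidates enter new boxes, search has stalled and a restart is triggered.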
CASE STUDY
[Map: the Red-Thai Binh river system in Vietnam, with the Da, Thao and Lo tributaries, the Hoa Binh reservoir, and Hanoi downstream; neighbouring countries: China, Laos, Thailand, Cambodia]
Red-Thai Binh River System - Vietnam
Integrated Management of Red-Thai Binh Rivers System (IMRR) funded by the Italian Ministry of Foreign Affairs http://www.imrr.info/
Hoa Binh reservoir - Vietnam
Main characteristics
• Catchment area 52,000 km2
• Active capacity 6 x 109 m3
• 8 penstocks 2,360 m3/s (240 MW)
• 12 bottom gates 22,000 m3/s
• 6 spillways 14,000 m3/s
• 15% national energy (7,800 GWh)
source: IWRP2008
Operating objectives • Hydropower production
• Flood control (Hanoi)
[Schematic: Da, Thao and Lo rivers; catchment, Hoa Binh reservoir, power plant, diversion dam, and downstream consumptive use]
Experimental Setting: ANN vs RBF
STATE VECTOR (n_x = 5):
• 2 time indexes (sine, cosine)
• Storage
• Previous-day inflow to the reservoir
• Previous-day lateral inflow

CONTROL VECTOR (n_u = 1):
• Release from the reservoir
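The 5-dimensional policy input described above can be sketched as follows: the day of the year enters through a sine/cosine pair so that 31 December and 1 January sit close together in the input space, alongside storage and the two previous-day inflows. Variable names and the 365-day period convention are assumptions.

```python
import numpy as np

def policy_input(day_of_year, storage, inflow_da, inflow_lateral):
    """Build the n_x = 5 state vector fed to the ANN/RBF operating rule."""
    phase = 2.0 * np.pi * day_of_year / 365.0
    return np.array([
        np.sin(phase),      # cyclic time index
        np.cos(phase),      # cyclic time index
        storage,            # reservoir storage
        inflow_da,          # previous-day inflow to the reservoir
        inflow_lateral,     # previous-day lateral inflow
    ])

x = policy_input(1, 0.5, 0.1, 0.05)
```

The cyclic encoding lets a time-invariant function of this input reproduce a seasonal (periodic time-varying) operating rule.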
ALGORITHM SETTING and RUNNING
• Default Borg MOEA parameterization [Hadka and Reed 2013]
• NFE = 500,000 per replication
• 20 replications to avoid dependence on randomness
• Historical horizon 1962-1969, which comprises normal, wet and dry years
Policy performance – operating objectives
[Figure (a): Pareto fronts in the Jhyd-Jflo objective space (hydropower [kWh/d] vs. floods [cm²/d]) for ANN and RBF policies with 4-16 neurons/bases]

FIG. 2. Policy performance obtained with different ANN and RBF structures (a), and evaluation of the associated Pareto front in terms of generational distance (b), additive ε-indicator (c), and hypervolume (d). Solid bars represent the best performance across the multiple runs, while transparent ones the average performance for each policy architecture.
Policy performance – front approximation quality

[Figure: panels (b) generational distance, (c) additive ε-indicator, and (d) hypervolume of FIG. 2, vs. # of neurons/bases for ANN and RBF]
CONVERGENCE CONSISTENCY DIVERSITY
Policy reliability
[Figure: probability of attaining 75% (a) and 95% (b) of the best generational distance, additive ε-indicator, and hypervolume values, vs. # of neurons/bases; annotated convergence, consistency, diversity]

FIG. 3. Probability of attainment with a threshold equal to 75% (a) and to 95% (b) of the best metric values for different ANN (blue bars) and RBF (red bars) architectures in terms of number of neurons/bases.
Run-time search dynamics (NFA = 2×10⁶)

[Figure: generational distance (a), additive ε-indicator (b), and hypervolume (c) vs. NFA (×10⁶) for ANN (6 neurons) and RBF (6 bases); annotated convergence, consistency, diversity]

FIG. 4. Analysis of runtime search dynamics for ANN (red lines) and RBF (blue lines) operating policy optimization in terms of generational distance (a), additive ε-indicator (b), and hypervolume (c).
Policy validation

[Figure: Pareto fronts in the Jhyd-Jflo space (×10⁷; hydropower [kWh/d] vs. floods [cm²/d]) for ANN and RBF over (a) the optimization horizon (1962-1969) and (b) the validation horizon (1995-2004)]

FIG. 5. Comparison of ANN and RBF policy performance over the optimization (a) and the validation (b) horizons.
Conclusions
§ MODPS is an interesting alternative to SDP-family methods for a number of good reasons:
1. No discretization required: NO curse of dimensionality;
2. Does not require separability in time of constraints and objective functions (e.g., duration curves): NO curse of dimensionality;
3. Can easily include any model-free information, as long as it is control-independent: NO curse of modelling;
4. Can be combined with any simulation model (including high-fidelity ones): NO curse of modelling;
5. Can be easily combined with truly multi-objective optimization algorithms: NO curse of multiple objectives.
Conclusions
§ RBFs and ANNs perform comparably well when evaluated in terms of policy performance
§ RBFs outperform ANNs in terms of quality of the Pareto-front approximation, reliability, and run-time search dynamics
§ Future work will focus on exploring multiple-output policies (e.g., networks of reservoirs)
THANKS