

Delft University of Technology

Delft Center for Systems and Control

Technical report 15-043

Residential demand response of

thermostatically controlled loads using

batch reinforcement learning∗

F. Ruelens, B.J. Claessens, S. Vandael, B. De Schutter, R. Babuška,

and R. Belmans

If you want to cite this report, please use the following reference instead:

F. Ruelens, B.J. Claessens, S. Vandael, B. De Schutter, R. Babuška, and R. Belmans,

“Residential demand response of thermostatically controlled loads using batch reinforcement learning,” IEEE Transactions on Smart Grid, vol. 8, no. 5, pp. 2149–2159,

Sept. 2017.

Delft Center for Systems and Control

Delft University of Technology

Mekelweg 2, 2628 CD Delft

The Netherlands

phone: +31-15-278.24.73 (secretary)

URL: https://www.dcsc.tudelft.nl

∗This report can also be downloaded via https://pub.deschutter.info/abs/15_043.html


Residential Demand Response of Thermostatically Controlled Loads Using Batch Reinforcement Learning

Frederik Ruelens, Bert J. Claessens, Stijn Vandael, Bart De Schutter, Robert Babuška, and Ronnie Belmans

Abstract—Driven by recent advances in batch Reinforcement Learning (RL), this paper contributes to the application of batch RL to demand response. In contrast to conventional model-based approaches, batch RL techniques do not require a system identification step, making them more suitable for a large-scale implementation. This paper extends fitted Q-iteration, a standard batch RL technique, to the situation where a forecast of the exogenous data is provided. In general, batch RL techniques do not rely on expert knowledge about the system dynamics or the solution. However, if some expert knowledge is provided, it can be incorporated by using the proposed policy adjustment method. Finally, we tackle the challenge of finding an open-loop schedule required to participate in the day-ahead market. We propose a model-free Monte Carlo method that uses a metric based on the state-action value function or Q-function, and we illustrate this method by finding the day-ahead schedule of a heat-pump thermostat. Our experiments show that batch RL techniques provide a valuable alternative to model-based controllers and that they can be used to construct both closed-loop and open-loop policies.

Index Terms—Batch reinforcement learning, Demand response, Electric water heater, Fitted Q-iteration, Heat pump.

I. INTRODUCTION

THE increasing share of renewable energy sources introduces the need for flexibility on the demand side of the electricity system [1]. A prominent example of loads that offer flexibility at the residential level are thermostatically controlled loads, such as heat pumps, air conditioning units, and electric water heaters. These loads represent about 20% of the total electricity consumption at the residential level in the United States [2]. The market share of these loads is expected to increase as a result of the electrification of heating and cooling [2], making them an interesting domain for demand response [1], [3]–[5]. Demand response programs offer demand flexibility by motivating end users to adapt their consumption profile in response to changing electricity prices or other grid signals. The forecast uncertainty of renewable energy sources [6], combined with their limited controllability, has made demand response the topic of an extensive number of research projects [1], [7], [8] and scientific papers [3], [5], [9]–[11]. The traditional control paradigm defines the demand response problem as a model-based control problem [3], [7], [9], requiring a model of the demand response application, an optimizer, and a forecasting technique.

F. Ruelens, S. Vandael, and R. Belmans are with the Department of Electrical Engineering, KU Leuven/EnergyVille, Leuven, Belgium.
B.J. Claessens is with the energy department of VITO, Belgium.
B. De Schutter and R. Babuška are with the Delft Center for Systems and Control, Delft University of Technology, The Netherlands.

A critical step in setting up a model-based controller includes selecting accurate models and estimating the model parameters. This step becomes more challenging considering the heterogeneity of the end users and their different patterns of behavior. As a result, different end users are expected to have different model parameters and even different models. As such, a large-scale implementation of model-based controllers requires a stable and robust approach that is able to identify the appropriate model and the corresponding model parameters. A detailed report of the implementation issues of a model predictive control strategy applied to the heating system of a building can be found in [12]. Moreover, the authors of [3] and [13] demonstrate a successful implementation of a model predictive control approach at an aggregated level to control a heterogeneous cluster of thermostatically controlled loads.

In contrast, Reinforcement Learning (RL) [14], [15] is a model-free technique that requires no system identification step and no a priori knowledge. Recent developments in the field of reinforcement learning show that RL techniques can either replace or supplement model-based techniques [16]. A number of recent papers provide examples of how a popular RL method, Q-learning [14], can be used for demand response [4], [10], [17], [18]. For example, in [10] O’Neill et al. propose an automated energy management system based on Q-learning that learns how to make optimal decisions for the consumers. In [17], Henze et al. investigate the potential of Q-learning for the operation of commercial cold stores and in [4], Kara et al. use Q-learning to control a cluster of thermostatically controlled loads. Similarly, in [19] Liang et al. propose a Q-learning approach to minimize the electricity cost of the flexible demand and the disutility of the user. Furthermore, inspired by [20], Lee et al. propose a bias-corrected form of Q-learning to operate battery charging in the presence of volatile prices [18].

While being a popular method, one of the fundamental drawbacks of Q-learning is its inefficient use of data, given that Q-learning discards the current data sample after every update. As a result, more observations are needed to propagate already known information through the state space. In order to overcome this drawback, batch RL techniques [21]–[24] can be used. In batch RL, a controller estimates a control policy based on a batch of experiences. These experiences can be a fixed set [23] or can be gathered online by interacting with the environment [25]. Given that batch RL algorithms can reuse past experiences, they converge faster compared to standard temporal difference methods like Q-learning [26] or SARSA [27].


Fig. 1. Building blocks of a model-free Reinforcement Learning (RL) agent (gray) applied to a Thermostatically Controlled Load (TCL).

This makes batch RL techniques suitable for practical implementations, such as demand response. For example, the authors of [28] combine Q-learning with eligibility traces in order to learn the consumer and time preferences of demand response applications. In [5], the authors use a batch RL technique to schedule a cluster of electric water heaters and in [29], Vandael et al. use a batch RL technique to find a day-ahead consumption plan of a cluster of electric vehicles. An excellent overview of batch RL methods can be found in [25] and [30].

Inspired by the recent developments in batch RL, in particular fitted Q-iteration by Ernst et al. [16], this paper builds upon the existing batch RL literature and contributes to the application of batch RL techniques to residential demand response. The contributions of our paper can be summarized as follows: (1) we demonstrate how fitted Q-iteration can be extended to the situation where a forecast of the exogenous data is provided; (2) we propose a policy adjustment method that exploits general expert knowledge about monotonicity conditions of the control policy; (3) we introduce a model-free Monte Carlo method to find a day-ahead consumption plan by making use of a novel metric based on Q-values.

This paper is structured as follows: Section II defines the building blocks of our batch RL approach applied to demand response. Section III formulates the problem as a Markov decision process. Section IV describes our model-free batch RL techniques for demand response. Section V demonstrates the presented techniques in a realistic demand response setting. To conclude, Section VI summarizes the results and discusses further research.

II. BUILDING BLOCKS: MODEL-FREE APPROACH

Fig. 1 presents an overview of our model-free learning agent applied to a Thermostatically Controlled Load (TCL), where the gray building blocks correspond to the learning agent.

At the start of each day the learning agent uses a batch RL method to construct a control policy for the next day, given a batch of past interactions with its environment. The learning agent needs no a priori information on the model dynamics and considers its environment as a black box. Nevertheless, if a model of the exogenous variables, e.g. a forecast of the outside temperature, or a reward model is provided, the batch RL method can use this information to enrich its batch. Once a policy is found, an expert policy adjustment method can be used to shape the policy obtained with the batch RL method. During the day, the learning agent uses an exploration-exploitation strategy to interact with its environment and to collect new transitions that are added systematically to the given batch.

In this paper, the proposed learning agent is applied to two types of TCLs. The first type is a residential electric water heater with a stochastic hot-water demand. The dynamic behavior of the electric water heater is modeled using a nonlinear stratified thermal tank model as described in [31]. Our second TCL is a heat-pump thermostat for a residential building. The temperature dynamics of the building are modeled using a second-order equivalent thermal parameter model [32]. This model describes the temperature dynamics of the indoor air and of the building envelope. However, to develop a realistic implementation, this paper assumes that the learning agent cannot measure the temperature of the building envelope and considers it as a hidden state variable.

In addition, we assume that both TCLs are equipped with a backup controller that guarantees the comfort and safety settings of its users. The backup controller is a built-in overrule mechanism that turns the TCL 'on' or 'off' depending on the current state and a predefined switching logic. The operation and settings of the backup controller are assumed to be unknown to the learning agent. However, the learning agent can measure the overrule action of the backup controller (see the dashed arrow in Fig. 1).

The observable state information contains sensory input data of the state of the process and its environment. Before this information is sent to the batch RL algorithm, the learning agent can apply feature extraction [33]. This feature extraction step can have two functions, namely to extract hidden state information or to reduce the dimensionality of the state vector. For example, in the case of a heat-pump thermostat, this step can be used to extract a feature that represents the hidden state information, e.g. the temperature of the building envelope. Alternatively, a feature extraction mapping can be used to find a low-dimensional representation of the sensory input data. For example, in the case of an electric water heater, the observed state vector consists of the temperature sensors that are installed along the hull of the buffer tank. When the number of temperature sensors is large, it can be interesting to map this high-dimensional state vector to a low-dimensional feature vector. This mapping can be the result of an auto-encoder network or principal component analysis. For example, in [34] Curren et al. indicate that when only a limited number of observations are available, a mapping to a low-dimensional state space can improve the convergence of the learning algorithm.

In this paper, the learning agent is applied to two relevant demand response business models: dynamic pricing and day-ahead scheduling [1], [35]. In dynamic pricing, the learning agent learns a control policy that minimizes its electricity cost by adapting its consumption profile in response to an external price signal. The solution of this optimal control problem is a closed-loop control policy that is a function of the current measurement of the state. The second business model relates to the participation in the day-ahead market. The learning agent constructs a day-ahead consumption plan and then tries to follow it during the day. The objective of the learning agent is to minimize its cost in the day-ahead market and to minimize any deviation between the day-ahead consumption plan and the actual consumption. In contrast to the solution of the dynamic pricing scheme, the day-ahead consumption plan is a feed-forward plan for the next day, i.e. an open-loop policy, which does not depend on future measurements of the state.

III. MARKOV DECISION PROCESS FORMULATION

This section formulates the decision-making problem of the learning agent as a Markov decision process. The Markov decision process is defined by its d-dimensional state space X ⊂ R^d, its action space U ⊂ R, its stochastic discrete-time transition function f, and its cost function ρ. The optimization horizon is considered finite, comprising T ∈ N \ {0} steps, where at each discrete time step k, the state evolves as follows:

x_{k+1} = f(x_k, u_k, w_k)   ∀k ∈ {1, . . . , T − 1},   (1)

with w_k a realization of a random disturbance drawn from a conditional probability distribution p_W(·|x_k), u_k ∈ U the control action, and x_k ∈ X the state. Associated with each state transition, a cost c_k is given by:

c_k = ρ(x_k, u_k, w_k)   ∀k ∈ {1, . . . , T}.   (2)

The goal is to find an optimal control policy h* : X → U that minimizes the expected T-stage return for any state in the state space. The expected T-stage return starting from x_1 and following a policy h is defined as follows:

J^h_T(x_1) = E_{w_k ∼ p_W(·|x_k)} [ ∑_{k=1}^{T} ρ(x_k, h(x_k), w_k) ].   (3)

A convenient way to characterize the policy h is by using a state-action value function or Q-function:

Q^h(x, u) = E_{w ∼ p_W(·|x)} [ ρ(x, u, w) + γ J^h_T(f(x, u, w)) ],   (4)

where γ ∈ (0, 1) is the discount factor. The Q-function is the cumulative return starting from state x, taking action u, and following h thereafter.

The optimal Q-function corresponds to the best Q-function that can be obtained by any policy:

Q*(x, u) = min_h Q^h(x, u).   (5)

Starting from an optimal Q-function for every state-action pair, the optimal policy is calculated as follows:

h*(x) ∈ arg min_{u ∈ U} Q*(x, u),   (6)

where Q* satisfies the Bellman optimality equation [36]:

Q*(x, u) = E_{w ∼ p_W(·|x)} [ ρ(x, u, w) + γ min_{u′ ∈ U} Q*(f(x, u, w), u′) ].   (7)

The next three paragraphs give a formal description of the state space, the backup controller, and the cost function tailored to demand response.

A. State description

Following the notational style of [37], the state space X is spanned by a time-dependent component X_t, a controllable component X_ph, and an uncontrollable exogenous component X_ex:

X = X_t × X_ph × X_ex.   (8)

1) Timing: The time-dependent component X_t describes the part of the state space related to timing, i.e. it carries timing information that is relevant for the dynamics of the system:

X_t = X^q_t × X^d_t   with   X^q_t = {1, ..., 96},   X^d_t = {1, ..., 7},   (9)

where x^q_t ∈ X^q_t denotes the quarter in the day and x^d_t ∈ X^d_t denotes the day in the week. The rationale is that most consumer behavior tends to be repetitive and tends to follow a diurnal pattern.
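As a minimal illustration (not from the paper), the timing component in (9) can be derived from a timestamp as follows; the function name is hypothetical.

```python
from datetime import datetime

def time_features(ts: datetime):
    """Map a timestamp to the timing component (x_t^q, x_t^d) of (9):
    the quarter-hour of the day in {1, ..., 96} and the day of the week in {1, ..., 7}."""
    quarter_of_day = ts.hour * 4 + ts.minute // 15 + 1  # 1..96
    day_of_week = ts.isoweekday()                       # 1 (Monday) .. 7 (Sunday)
    return quarter_of_day, day_of_week

# Example: Wednesday 14:37 falls in quarter 59 of the day.
print(time_features(datetime(2015, 3, 18, 14, 37)))     # -> (59, 3)
```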

2) Physical representation: The controllable component X_ph represents the physical state information related to the quantities that are measured locally and that are influenced by the control actions, e.g. the indoor air temperature or the state of charge of an electric water heater:

x_ph ∈ X_ph   with   \underline{x}_ph < x_ph < \overline{x}_ph,   (10)

where \underline{x}_ph and \overline{x}_ph denote the lower and upper bounds, set to guarantee the comfort and safety of the end user.

3) Exogenous information: The state description of the uncontrollable exogenous state is split into two components:

X_ex = X^ph_ex × X^c_ex.   (11)

When the random disturbance w_{k+1} is independent of w_k, there is no need to include exogenous variables in the state space. However, most physical processes, such as the outside temperature and solar radiation, exhibit a certain degree of autocorrelation, where the outcome of the next state depends on the current state. For this reason we include an exogenous component x^ph_ex ∈ X^ph_ex in our state space description. This exogenous component is related to the observable exogenous information that has an impact on the physical dynamics and that cannot be influenced by the control actions, e.g. the outside temperature.

The second exogenous component x^c_ex ∈ X^c_ex has no direct influence on the dynamics, but contains information to calculate the cost c_k. This work assumes that a deterministic forecast of the exogenous state information related to the cost, x^c_ex, and related to the physical dynamics (i.e. outside temperature and solar radiation), x^ph_ex, is provided for the time span covering the optimization problem.

B. Backup controller

This paper assumes that each TCL is equipped with an overrule mechanism that guarantees comfort and safety constraints. The backup function B : X × U → U^ph maps the requested control action u_k ∈ U taken in state x_k to a physical control action u^ph_k ∈ U^ph:

u^ph_k = B(x_k, u_k).   (12)

The settings of the backup function B are unknown to the learning agent, but the resulting action u^ph_k can be measured by the learning agent (see the dashed arrow in Fig. 1).

C. Cost function

In general, RL techniques do not require a description of the cost function. However, for most demand response business models a cost function is available. This paper considers two typical cost functions related to demand response.

1) Dynamic pricing: In the dynamic pricing scenario an external price profile is known deterministically at the start of the horizon. The cost function is described as:

c_k = u^ph_k x^c_{k,ex} Δt,   (13)

where x^c_{k,ex} is the electricity price at time step k and Δt is the length of a control period.

2) Day-ahead scheduling: The objective of the second business case is to determine a day-ahead consumption plan and to follow this plan during operation. The day-ahead consumption plan should be minimized based on day-ahead prices. In addition, any deviation between the planned consumption and the actual consumption should be avoided. As such, the cost function can be written as:

c_k = u_k x^c_{k,ex} Δt + α | u_k Δt − u^ph_k Δt |,   (14)

where u_k is the planned consumption, u^ph_k is the actual consumption, x^c_{k,ex} is the forecasted day-ahead price, and α > 0 is a penalty. The first part of (14) is the cost for buying energy at the day-ahead market, whereas the second part penalizes any deviation between the planned consumption and the actual consumption.
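For concreteness, a minimal sketch of the two cost functions (13) and (14); the function names and argument conventions (power in kW, prices per kWh, Δt in hours) are illustrative assumptions.

```python
def dynamic_pricing_cost(u_ph_k, price_k, dt):
    """Cost (13): energy actually consumed in the control period times the electricity price."""
    return u_ph_k * price_k * dt

def day_ahead_cost(u_k, u_ph_k, price_k, dt, alpha):
    """Cost (14): day-ahead energy cost of the planned consumption u_k plus a penalty
    alpha on the deviation between planned and actual consumption u_ph_k."""
    return u_k * price_k * dt + alpha * abs(u_k * dt - u_ph_k * dt)
```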

D. Reinforcement learning for demand response

When the description of the transition function and cost function is available, techniques that make use of the Markov decision process framework, such as approximate dynamic programming [38] or direct policy search [30], can be used to find near-optimal policies. However, in our implementation we assume that the transition function f, the backup controller B, and the underlying probability distribution of the exogenous information w are unknown. In addition, we assume that they are challenging to obtain in a residential setting. For these reasons, we present a model-free batch RL approach that builds on previous theoretical work on RL, in particular fitted Q-iteration [23], expert knowledge [39], and the synthesis of artificial trajectories [21].

Algorithm 1 Fitted Q-iteration using a forecast of the exogenous data (extended FQI)

Input: F = {(x_l, u_l, x′_l, u^ph_l)}_{l=1}^{#F}, {(\hat{x}^{ph}_{l,ex}, \hat{x}^c_{l,ex})}_{l=1}^{#F}, T
1: let Q_0 be zero everywhere on X × U
2: for l = 1, . . . , #F do
3:   \hat{x}′_l ← (x^{q′}_{l,t}, x^{d′}_{l,t}, x′_{l,ph}, \hat{x}^{ph′}_{l,ex})   ▷ replace the observed exogenous part of the next state x^{ph′}_{l,ex} by its forecast \hat{x}^{ph′}_{l,ex}
4: end for
5: for N = 1, . . . , T do
6:   for l = 1, . . . , #F do
7:     c_l ← ρ(\hat{x}^c_{l,ex}, u^ph_l)
8:     Q_{N,l} ← c_l + min_{u ∈ U} Q_{N−1}(\hat{x}′_l, u)
9:   end for
10:  use regression to obtain Q_N from T_reg = {((x_l, u_l), Q_{N,l}), l = 1, . . . , #F}
11: end for
Output: Q* = Q_T

IV. ALGORITHMS

Typically, batch RL techniques construct policies based on a batch of tuples of the form F = {(x_l, u_l, x′_l, c_l)}_{l=1}^{#F}, where x_l = (x^q_{l,t}, x^d_{l,t}, x_{l,ph}, x^ph_{l,ex}) denotes the state at time step l and x′_l denotes the state at time step l + 1. However, for most demand response applications, the cost function ρ is given a priori and is of the form ρ(x^c_{l,ex}, u^ph_l). As such, this paper considers tuples of the form (x_l, u_l, x′_l, u^ph_l).

A. Fitted Q-iteration using a forecast of the exogenous data

Here we demonstrate how fitted Q-iteration [23] can be extended to the situation where a forecast of the exogenous component is provided (Algorithm 1). The algorithm iteratively builds a training set T_reg with all state-action pairs (x, u) in F as the input. The target values consist of the corresponding cost values ρ(\hat{x}^c_ex, u^ph) and the optimal Q-values, based on the approximation of the Q-function of the previous iteration, for the next states: min_{u ∈ U} Q_{N−1}(\hat{x}′_l, u). Since we consider a finite horizon problem with T control periods, Algorithm 1 needs T iterations until the Q-function contains all information about the future rewards. Notice that for N = 1 (first iteration), the Q-values in the training set correspond to their immediate cost, Q_{N,l} ← c_l (line 7 in Algorithm 1). In the subsequent iterations, the Q-values are updated using value iteration based on the Q-function of the previous iteration.

It is important to note that \hat{x}′_l denotes the successor state in F, where the observed exogenous information x^{ph′}_{l,ex} is replaced by its forecast \hat{x}^{ph′}_{l,ex} (line 3 in Algorithm 1). Note that in our algorithm the next state contains information on the forecasted exogenous data, whereas for standard fitted Q-iteration the next state contains past observations of the exogenous data. By replacing the observed exogenous parts of the next state by their forecasts, the Q-function of the next state assumes that the exogenous information will follow its forecast. In other words, the Q-value of the next state becomes biased towards the provided forecast of the uncontrollable exogenous data. Also notice that we recalculate the cost for each tuple in F by using the price for the next day \hat{x}^c_ex (line 7 in Algorithm 1). The proposed algorithm is relevant for demand response applications that are influenced by exogenous weather data. Examples of these applications are heat-pump thermostats and air conditioning units.

In principle, any function approximator, such as a neural network [40], can be applied in combination with fitted Q-iteration. However, because of its robustness and fast calculation time, an ensemble of extremely randomized trees [23] is used to approximate the Q-function.
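A minimal sketch of Algorithm 1 built on scikit-learn's ExtraTreesRegressor; the array layout, the quarter-hourly Δt, and the helper names are assumptions made for illustration, not part of the paper.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def extended_fqi(X, U, X_next_forecast, U_ph, price_forecast, actions, T, dt=0.25):
    """Sketch of extended FQI (Algorithm 1).

    X              : (n, d) observed states x_l
    U              : (n,)   requested actions u_l
    X_next_forecast: (n, d) next states whose exogenous part is replaced by its forecast
    U_ph           : (n,)   physical actions u_l^ph measured after the backup controller
    price_forecast : (n,)   price forecast for the next day, used to recompute the cost
    actions        : iterable of admissible control actions
    T              : number of control periods in the horizon
    dt             : length of a control period in hours (assumed quarter-hourly)
    """
    cost = U_ph * price_forecast * dt            # line 7: recompute the cost of each tuple
    Q = None                                     # Q_0 is zero everywhere (line 1)
    for _ in range(T):                           # T value-iteration sweeps (line 5)
        if Q is None:
            targets = cost
        else:
            # line 8: min_u Q_{N-1}(x'_l, u) evaluated on the forecast-based next states
            q_next = np.column_stack([
                Q.predict(np.column_stack([X_next_forecast,
                                           np.full(len(X_next_forecast), a)]))
                for a in actions])
            targets = cost + q_next.min(axis=1)
        # line 10: regression step, fit Q_N on the pairs ((x_l, u_l), Q_{N,l})
        Q = ExtraTreesRegressor(n_estimators=50, random_state=0)
        Q.fit(np.column_stack([X, U]), targets)
    return Q                                     # approximation of Q* after T iterations
```

A greedy policy can then be obtained by evaluating the returned regressor over the admissible actions and taking the minimizer, as in (6).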

A discussion of the convergence properties of the proposed method and a comparison with Q-learning can be found in the Appendix.

B. Expert policy adjustment

Given the Q-function from Algorithm 1, a near-optimal policy h* can be constructed by solving (6) for every state in the state space. However, in some cases, e.g. when F contains a limited number of observations, the resulting policy can be improved by using general prior knowledge about its shape.

In this section, we show how expert knowledge about the monotonicity of the policy can be exploited to regularize the policy. The method enforces monotonicity conditions by using convex optimization to approximate the policy, where expert knowledge is included in the form of extra constraints. These constraints can result directly from an expert or from a model-based solution. In order to define a convex optimization problem we use a fuzzy model with triangular membership functions [30] to approximate the policy. The centers of the triangular membership functions are located on an equidistant grid with N membership functions along each dimension of the state space. This partitioning leads to N^d state-dependent membership functions for each action. The parameter vector θ* that approximates the original policy can be found by solving the following least-squares problem:

θ* ∈ arg min_θ ∑_{l=1}^{#F} ( [F(θ)](x_l) − h*(x_l) )²   s.t. expert knowledge,   (15)

where F(θ) denotes an approximation mapping of a weighted linear combination of a set of triangular membership functions φ and [F(θ)](x) denotes the policy F(θ) evaluated at state x. The triangular Membership Functions (MFs) for each state variable x_d, with d ∈ {1, . . . , dim(X)}, are defined as follows:

φ_{d,1}(x_d) = max( 0, (c_{d,2} − x_d)/(c_{d,2} − c_{d,1}) ),
φ_{d,i}(x_d) = max[ 0, min( (x_d − c_{d,i−1})/(c_{d,i} − c_{d,i−1}), (c_{d,i+1} − x_d)/(c_{d,i+1} − c_{d,i}) ) ],
φ_{d,N_d}(x_d) = max( 0, (x_d − c_{d,N_d−1})/(c_{d,N_d} − c_{d,N_d−1}) ),

for i = 2, . . . , N_d − 1. The centers of the MFs along dimension d are denoted by c_{d,1}, . . . , c_{d,N_d}, which must satisfy c_{d,1} < · · · < c_{d,N_d}. These centers are chosen equidistantly along the state space in such a way that x_d ∈ [c_{d,1}, c_{d,N_d}] for d = 1, . . . , dim(X).

The fuzzy approximation of the policy allows us to add expert knowledge to the policy in the form of convex constraints of the least-squares problem (15), which can be solved using a convex optimization solver. Using the same notation as in [39], we can enforce monotonicity conditions along the dth dimension of the state space as follows:

δ_d [F(θ)](x_d) ≤ δ_d [F(θ)](x′_d)   (16)

for all state components x_d ≤ x′_d along dimension d. If δ_d is −1, then [F(θ)] will be decreasing along the dth dimension of X, whereas if δ_d is 1, then [F(θ)] will be increasing along the dth dimension of X. Once h* is found, the adjusted policy h*_exp, given this expert knowledge, can be calculated as h*_exp(x) = [F(θ*)](x).
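As a sketch only, the constrained least-squares problem (15)-(16) can be posed with cvxpy for a two-dimensional state (as used later for the electric water heater); the grid layout, the product-of-MFs evaluation, and the helper names are assumptions for illustration.

```python
import numpy as np
import cvxpy as cp

def triangular_mfs(x, centers):
    """Evaluate the triangular membership functions with the given centers at scalar x.
    Returns a vector that is non-zero for at most two neighboring MFs and sums to 1."""
    phi = np.zeros(len(centers))
    x = np.clip(x, centers[0], centers[-1])
    i = np.searchsorted(centers, x)
    if i == 0:
        phi[0] = 1.0
    else:
        w = (x - centers[i - 1]) / (centers[i] - centers[i - 1])
        phi[i - 1], phi[i] = 1.0 - w, w
    return phi

def adjust_policy(states, h_star, centers_1, centers_2, delta=1):
    """Fit the fuzzy policy approximation (15) and enforce the monotonicity
    condition (16) along the second state dimension (delta = +1 increasing,
    delta = -1 decreasing) through the grid of parameters theta."""
    theta = cp.Variable((len(centers_1), len(centers_2)))
    residuals = []
    for x, h in zip(states, h_star):
        phi = np.outer(triangular_mfs(x[0], centers_1),
                       triangular_mfs(x[1], centers_2))   # basis weights at state x
        residuals.append(cp.sum(cp.multiply(phi, theta)) - h)
    if delta == 1:
        constraints = [theta[:, :-1] <= theta[:, 1:]]
    else:
        constraints = [theta[:, 1:] <= theta[:, :-1]]
    cp.Problem(cp.Minimize(cp.sum_squares(cp.hstack(residuals))), constraints).solve()
    return theta.value    # adjusted policy values on the grid; evaluate with the same MFs
```

Because the triangular basis is a partition of unity, monotone node values imply that the interpolated policy is monotone along that dimension between grid points as well.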

C. Day-ahead consumption plan

This section explains how to construct a day-ahead schedule starting from the Q-function obtained by Algorithm 1 and using cost function (14). Finding a day-ahead schedule has a direct relation to two situations:
• a day-ahead market, where participants have to submit a day-ahead schedule one day in advance of the actual consumption [35];
• a distributed optimization process, where two or more participants are coupled by a common constraint, e.g. congestion management [9].

Algorithm 2 describes a model-free Monte Carlo method to find a day-ahead schedule that makes use of a metric based on Q-values. The method estimates the average return of a policy by synthesizing p sequences of transitions of length T from F. These p sequences can be seen as a proxy of the actual trajectories that could be obtained by simulating the policy on the given control problem. Note that since we consider a stochastic setting, p needs to be greater than 1. A sequence is grown in length by selecting a new transition among the samples of not-yet-used one-step transitions. Each new transition is selected by minimizing a distance metric with the previously selected transition.

In [21], Fonteneau et al. propose the following distance metric in X × U: ∆((x, x′), (u, u′)) = ‖x − x′‖ + ‖u − u′‖, where ‖·‖ denotes the Euclidean norm. It is important to note that this metric weighs each dimension of the state and action space equally. In order to avoid specifying weights for each dimension, we propose the following distance metric:

|Q*(x^i_k, u^i_k) − Q*(x_l, u_l)| + ξ ‖x^i_k − x_l‖.   (17)

Here Q* is obtained by applying Algorithm 1 with cost function (14). The state x^i_k and action u^i_k denote the state and action corresponding to the ith trajectory at time step k. The control action u^i_k is computed by minimizing the Q-function Q* in x^i_k (see line 6). The next state x′_{l_i} is found by taking the next state of the tuple that minimizes this distance metric (see line 8). The regularization parameter ξ is a scalar that is included to penalize states that have similar Q-values, but have a large Euclidean norm in the state space. When the Q-function is strictly increasing or decreasing, ξ can be set to 0. The motivation behind using Q-values instead of the Euclidean distance in X × U is that Q-values capture the dynamics of the system and, therefore, there is no need to select individual weights.


Algorithm 2 Model-free Monte Carlo method using a Q-function to find a feed-forward plan

Input: F = {(x_l, u_l, x′_l, u^ph_l)}_{l=1}^{#F}, {(\hat{x}^{ph}_{l,ex}, \hat{x}^c_{l,ex})}_{l=1}^{#F}, T, x_1, p, ξ
1: G ← F
2: apply Algorithm 1 with cost function (14) to obtain Q*
3: for i = 1, . . . , p do
4:   x^i_1 ← x_1
5:   for k = 1, . . . , T do
6:     u^i_k ← arg min_{u ∈ U} Q*(x^i_k, u)
7:     G_s ← {(x_l, u_l, x′_l, u^ph_l) ∈ G | u_l = u^i_k}
8:     H ← arg min_{(x_l, u_l, x′_l, u^ph_l) ∈ G_s} |Q*(x^i_k, u^i_k) − Q*(x_l, u_l)| + ξ ‖x^i_k − x_l‖
9:     l_i ← lowest index in G of the transitions in H
10:    P^i_k ← u^i_k
11:    x^i_{k+1} ← x′_{l_i}
12:    G ← G \ {(x_{l_i}, u_{l_i}, x′_{l_i}, u^ph_{l_i})}   ▷ do not reuse tuple
13:  end for
14: end for
Output: p artificial trajectories: [P^1_k]_{k=1}^T, . . . , [P^p_k]_{k=1}^T


Notice that once a tuple with the lowest distance metric is selected, this tuple is removed from the given batch (see line 12). This ensures that the p artificial trajectories are distinct and thus can be seen as p stochastic realizations that could be obtained by simulating the policy on the given control problem.
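A compact Python sketch of Algorithm 2; the batch layout (a list of (x_l, u_l, x'_l, u^ph_l) tuples with NumPy state vectors), the callable Q_star, and the fallback when no unused tuple shares the greedy action are assumptions made for illustration.

```python
import numpy as np

def mfmc_day_ahead(F, Q_star, actions, x1, T, p, xi):
    """Synthesize p artificial trajectories from the batch F using the Q-based
    metric (17) and return the averaged day-ahead consumption plan."""
    available = list(range(len(F)))                 # indices of not-yet-used transitions
    plans = np.zeros((p, T))
    for i in range(p):
        x = np.asarray(x1, dtype=float)
        for k in range(T):
            # line 6: greedy action in the current artificial state
            u = min(actions, key=lambda a: Q_star(x, a))
            # line 7: unused tuples that took the same action (fallback: all unused tuples)
            candidates = [l for l in available if F[l][1] == u] or available
            # line 8: minimize |Q*(x, u) - Q*(x_l, u_l)| + xi * ||x - x_l||
            best = min(candidates,
                       key=lambda l: abs(Q_star(x, u) - Q_star(F[l][0], F[l][1]))
                                     + xi * np.linalg.norm(x - np.asarray(F[l][0])))
            plans[i, k] = u                          # line 10: planned consumption
            x = np.asarray(F[best][2], dtype=float)  # line 11: jump to the stored next state
            available.remove(best)                   # line 12: do not reuse the tuple
    return plans.mean(axis=0), plans                 # day-ahead plan and the p trajectories
```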

V. SIMULATIONS

This section presents the simulation results of three experiments and evaluates the performance of the proposed algorithms. We focus on two examples of flexible loads, i.e. an electric water heater and a heat-pump thermostat. The first experiment evaluates the performance of extended FQI (Algorithm 1) for a heat-pump thermostat. The rationale behind using extended FQI for a heat-pump thermostat is that the temperature dynamics of a building are influenced by exogenous weather data, which is less the case for an electric water heater. In the second experiment, we apply the policy adjustment method to an electric water heater. The final experiment uses the model-free Monte Carlo method to find a day-ahead consumption plan for a heat-pump thermostat. It should be noted that the policy adjustment method and the model-free Monte Carlo method can also be applied to both the heat-pump thermostat and the electric water heater.

A. Thermostatically controlled loads

Here we describe the state definition and the settings of the backup controller of the electric water heater and the heat-pump thermostat.

1) Electric water heater: We consider that the storage tank of the electric water heater is equipped with n_s temperature sensors. The full state description of the electric water heater is defined as follows:

x_k = (x^q_{k,t}, T^1_k, . . . , T^i_k, . . . , T^{n_s}_k),   (18)

where x^q_{k,t} denotes the current quarter in the day and T^i_k denotes the temperature measurement of the ith sensor. This work uses feature extraction to reduce the dimensionality of the controllable state space component by replacing it with the average sensor measurement. As such, the reduced state is defined as follows:

x_k = ( x^q_{k,t}, (1/n_s) ∑_{i=1}^{n_s} T^i_k ).   (19)

More generic dimension reduction techniques, such as an auto-encoder network and a principal component analysis [41], [42], will be explored in future research.

The logic of the backup controller of the electric water heater is defined as:

B(x_k, u_k) =
  u^ph_k = u_max   if x_{k,soc} ≤ 30%
  u^ph_k = u_k     if 30% < x_{k,soc} < 100%
  u^ph_k = 0       if x_{k,soc} ≥ 100%.   (20)

The electric heating element of the electric water heater can be controlled with a binary control action u_k ∈ {0, u_max}, where u_max = 2.3 kW is the maximum power. A detailed description of the nonlinear tank model of the electric water heater and the calculation of its state of charge x_soc can be found in [31]. We use a set of hot-water demand profiles with a daily mean load of 100 liters, obtained from [43], to simulate a realistic tap demand.
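The switching logic (20) is straightforward to express in code; a minimal sketch, assuming the state of charge is given in percent and u_max = 2.3 kW as stated above.

```python
def backup_water_heater(x_soc, u_k, u_max=2.3):
    """Backup controller (20): force heating at full power below 30% state of charge,
    block heating at or above 100%, and pass the requested action through otherwise."""
    if x_soc <= 30.0:
        return u_max
    if x_soc >= 100.0:
        return 0.0
    return u_k
```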

2) Heat-pump thermostat: Our second application considers a heat-pump thermostat that can measure the indoor air temperature, the outdoor air temperature, and the solar radiation. The full state of the heat-pump thermostat is defined as follows:

x_k = (x^q_{k,t}, T_{k,in}, T_{k,out}, S_k),   (21)

where the physical state space component consists of the indoor air temperature T_{k,in} and the exogenous state space component contains the outside air temperature T_{k,out} and the solar radiation S_k. The internal heat gains, caused by user behavior and electric appliances, cannot be measured and are excluded from the state. We included a measurement of the solar radiance in the state since solar energy transmitted through windows can significantly impact the indoor temperature dynamics [32]. In order to have a practical implementation, we consider that we cannot measure the temperature of the building envelope. We used a feature extraction technique based on past state observations by including a virtual building envelope temperature, which is a running average of the past n_r air temperatures:

x_k = ( x^q_{k,t}, T_{k,in}, (1/n_r) ∑_{i=k−n_r}^{k−1} T_{i,in}, T_{k,out}, S_k ).   (22)
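A small sketch of the feature extraction in (22), assuming the indoor temperature history is a list whose last element is the current measurement T_{k,in} and that at least n_r + 1 samples are available; the names are illustrative.

```python
import numpy as np

def heat_pump_state(quarter, T_in_history, T_out, S, n_r=3):
    """Reduced state (22): the hidden building-envelope temperature is replaced by a
    running average of the past n_r indoor air temperatures (excluding the current one)."""
    T_in = T_in_history[-1]                                       # current indoor temperature
    T_envelope_proxy = float(np.mean(T_in_history[-n_r - 1:-1]))  # T_{k-n_r}, ..., T_{k-1}
    return (quarter, T_in, T_envelope_proxy, T_out, S)
```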

The backup controller of the heat-pump thermostat is defined as follows:

B(x_k, u_k) =
  u^ph_k = u_max   if T_{k,in} ≤ \underline{T}_k
  u^ph_k = u_k     if \underline{T}_k < T_{k,in} < \overline{T}_k
  u^ph_k = 0       if T_{k,in} ≥ \overline{T}_k,   (23)

where \underline{T}_k and \overline{T}_k are the minimum and maximum temperature settings defined by the end user. The action space of the heat pump is discretized into 10 steps, u_k ∈ {0, ..., u_max}, where u_max = 3 kW is the maximum power. In the simulation section we define a minimum and maximum comfort setting of 19°C and 23°C. A detailed description of the temperature dynamics of the indoor air and building envelope can be found in [32]. The exogenous information consists of the outside temperature, the solar radiation, and the internal heat gains from a location in Belgium [1].

In the following experiments we define a control period of one quarter of an hour and an optimization horizon T of 96 control periods.

B. Experiment 1

The goal of the first experiment is to compare the performance of Fitted Q-Iteration (standard FQI) [23] to the performance of our extension of FQI (extended FQI), given by Algorithm 1. The objective of the considered heating system is to minimize the electricity cost of the heat pump by responding to an external price signal. The electricity prices are taken from the Belgian wholesale market [35]. We assume that a forecast of the outside temperature and of the solar radiance is available. Since the goal of the experiment is to assess the impact on the performance of FQI when a forecast is included, we assume perfect forecasts of the outside temperature and solar radiance.

1) FQI controllers: Both FQI controllers start with an empty batch F. At the end of each day, they add the tuples of the given day to their current batch and they compute a T-stage control policy for the next day. Both FQI controllers calculate the cost value of each tuple in F (line 7 in Algorithm 1). As such, FQI can reuse (or replay) previous observations even when the external prices change daily. The extended FQI controller uses the forecasted values of the outside temperature and solar radiance to construct the next states in the batch, \hat{x}′_l ← (x^{q′}_{l,t}, T′_{l,in}, \hat{T}′_{l,out}, \hat{S}′_l), where \hat{(·)} denotes a forecast. In contrast, standard FQI uses observed values of the outside temperature and solar radiance to construct the next states in the batch.

The observed state information of both FQI controllers is defined by (22), where a handcrafted feature is used to represent the temperature of the building envelope (n_r set to 3). During the day, both FQI controllers use an ε-greedy exploration strategy. This exploration strategy selects a random control action with probability ε_d and follows the policy with probability 1 − ε_d. Since more interactions result in a better coverage of the state-action space, the exploration probability ε_d is decreased on a daily basis, according to the harmonic sequence 1/d^n, where n is set to 0.7 and d denotes the current day. As a result the agent becomes greedier, i.e. it selects the best action more often, as the number of tuples in the batch increases.

Fig. 2. Simulation results for a heat-pump thermostat and a dynamic pricing scheme using an optimal controller (Optimal), Fitted Q-Iteration (FQI), extended FQI, and a default controller (Default). The top plot depicts the performance metric M and the bottom plot depicts the cumulative electricity cost.

A possible route of future work could be to include a Boltzmann exploration strategy that explores interesting state-action pairs based on the current estimate of the Q-function [38].

The exploration parameters and exploration probabilities of both FQI and extended FQI are identical. Moreover, we used identical disturbances, prices, and weather conditions in both experiments.
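A sketch of the ε-greedy rule with the harmonic decay described above; the function and its arguments are illustrative, and actions are compared by (to-be-minimized) Q-values.

```python
import random

def epsilon_greedy_action(Q_star, x, actions, day, n=0.7):
    """Select a control action with exploration probability eps_d = 1 / day**n
    (day >= 1); otherwise act greedily with respect to the cost-minimizing Q-function."""
    eps_d = 1.0 / (day ** n)
    if random.random() < eps_d:
        return random.choice(list(actions))               # explore
    return min(actions, key=lambda a: Q_star(x, a))       # exploit
```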

2) Results: In order to compare the performance of the FQI controllers we define the following metric:

M = (c_fqi − c_d) / (c_o − c_d),   (24)

where c_fqi denotes the daily cost of the FQI controller, c_d denotes the daily cost of the default controller, and c_o denotes the daily cost of the optimal controller. The metric M corresponds to 0 if the FQI controller obtains the same performance as the default controller and corresponds to 1 if the FQI controller obtains the same performance as the optimal controller. The default controller is a hysteresis controller that switches on when the indoor air temperature is lower than 19°C and stops heating when the indoor air temperature reaches 20°C. The optimal controller is a model-based controller that has full information on the model parameters and has perfect forecasts of all exogenous variables, i.e. the outside temperature, solar radiation, and internal heat gains. The optimal controller formalizes the problem as a mixed integer problem and uses a commercial solver [44]. The simulation results of the heating system for a simulation horizon of 80 days are depicted in Fig. 2. The top plot depicts the daily metric M, where the performance of the default controller corresponds to zero and the performance of the optimal controller corresponds to one. The bottom plot depicts the cumulative electricity cost of both FQI controllers, and of the default and optimal controllers. The average metric M over the simulation horizon is 0.56 for standard FQI and 0.71 for extended FQI, which is an improvement of 27%. The performance gap of 0.29 between extended FQI and the optimal controller is a reasonable result given that the model dynamics and disturbances are unknown, and that exploration is included.


Fig. 3. Simulation results for an electric water heater and a dynamic pricing scheme. The left column depicts the original policies for days 7, 14, and 21, where the color corresponds to the control action, namely white (off) and black (on). The corresponding repaired policies are depicted in the right column. The price profile corresponding to each policy is depicted in the background.


This experiment demonstrated that fitted Q-iteration can be successfully extended to incorporate forecasted data. Extended FQI was able to decrease the total electricity cost by 19% compared to the default controller over the total simulation horizon, whereas standard FQI decreased the total electricity cost by 14%. It is important to note that the reduction of 19% is not the result of a lower energy consumption, since the energy consumption increased by 4% compared to the default controller when extended FQI was used.

C. Experiment 2

The following experiment demonstrates the policy adjustment method for an electric water heater. As an illustrative example we used a sinusoidal price profile. As stated in (19), the state space of an electric water heater consists of two dimensions, i.e. the time component and the average temperature.

1) Policy adjustment method: As described in Section IV-B, the policy obtained with Algorithm 1 is represented using a grid of fuzzy membership functions. Our representation consists of 5 × 5 membership functions, located equidistantly along the state space. We enforce a monotonicity constraint along the second dimension, which contains the average sensor temperature, as follows:

[F(θ)](x^q_t, T_a) ≤ [F(θ)](x^q_t, T′_a),   (25)

for all states x^q_t and T_a, and where T_a < T′_a. These monotonicity constraints are added to the least-squares problem (15).

2) Results: The original policies obtained with fitted Q-iteration after 7, 14, and 21 days are presented in the left column of Fig. 3. It can be seen that the original policies, obtained by fitted Q-iteration, violate the monotonicity constraints along the second dimension in several states. The adjusted policies obtained by the policy adjustment method are depicted in the right column. The policy adjustment method was able to reduce the total electricity cost by 11% over 60 days compared to standard FQI. These simulation results indicate that when the number of tuples in the batch increases, the original and the adjusted policies converge. Furthermore, the results indicate that when the number of tuples in F is small, the expert policy adjustment method can be used to improve the performance of standard fitted Q-iteration.

Fig. 4. The top plots depict the metric M (left) and the daily deviation between the day-ahead consumption plan and the actual consumption (right). The day-ahead consumption plan and the indoor temperature obtained with the Model-Free Monte Carlo (MFMC) method are depicted in the middle and bottom plots.


D. Experiment 3

The final experiment demonstrates the Model-Free Monte Carlo (MFMC) method (Algorithm 2) to find the day-ahead consumption plan of a heat-pump thermostat.

1) MFMC method: First, the MFMC method uses Algorithm 1 with cost function (14) to calculate the Q-function. The parameter α was set to 10³ to penalize possible deviations between the planned consumption profile and the actual consumption. The resulting Q-function is then used as a metric to build p distinct artificial trajectories (line 8 of Algorithm 2). The regularization parameter ξ was set to 0.1 to penalize states with identical Q-values but with a large Euclidean norm in the state space. The parameter p, which indicates the number of artificial trajectories, was set to 4. Increasing the number of artificial trajectories beyond 4 did not significantly improve the performance of the MFMC method. A day-ahead consumption plan was obtained by taking the average of these 4 artificial trajectories.

2) Results: In order to assess the performance of our MFMC method, we introduce an optimal model-based controller. As in Experiment 1, the model-based controller has exact knowledge about the future exogenous variables and the model equations. The solution of the model-based controller was found using a commercial solver [44]. Moreover, we define a performance metric M = c_MFMC / c_o, where c_MFMC is the daily cost of the MFMC method and c_o is the daily cost of the optimal model-based controller. If M equals 1, the performance of the MFMC controller is equal to the performance of the optimal model-based controller.

The results of an experiment spanning 100 days are depicted in Fig. 4. The experiment starts with an empty batch and the tuples of the current day are added to the batch at the end of each day. The top left plot depicts the daily metric M of our MFMC method, where the metric of the optimal model-based controller corresponds to 1.

The top right plot indicates the absolute value of the daily imbalance between the day-ahead consumption plan and the actual consumption. This plot demonstrates that the daily imbalance decreases as the number of observations (days) in F increases. The mean metric M of the MFMC method over the whole simulation period, including the exploration phase, is 0.81. Furthermore, the mean deviation relative to the consumed energy of the MFMC method over the whole simulation period corresponded to 4.6%. Since α was set to 10³, the solution of the optimal model-based controller resulted in a deviation of 0%. This implies that for the optimal model-based controller, the forward consumption plan and the actual consumption are identical.

A representative result of a single day, obtained with a mature MFMC controller, is depicted in the middle and bottom plots of Fig. 4. As can be seen in the figure, the MFMC method minimizes its day-ahead cost and follows its day-ahead schedule during the day. These results demonstrate that the model-free Monte Carlo method can be successfully used to construct a forward consumption plan for the next day.

VI. CONCLUSION

Driven by the challenges presented by the system identification step of model-based controllers, this paper has contributed to the application of model-free batch Reinforcement Learning (RL) to demand response. Motivated by the fact that some demand response applications, e.g. a heat-pump thermostat, are influenced by exogenous weather data, we have adapted a well-known batch RL technique, fitted Q-iteration, to incorporate a forecast of the exogenous data. The presented empirical results have indicated that the proposed extension of fitted Q-iteration was able to improve the performance of standard fitted Q-iteration by 27%. In general, batch RL techniques do not require any prior knowledge about the system behavior or the solution. However, for some demand response applications, expert knowledge about the monotonicity of the solution, i.e. the policy, can be available. Therefore, we have presented an expert policy adjustment method that can exploit this expert knowledge. The results of an experiment with an electric water heater have indicated that the policy adjustment method was able to reduce the cost objective by 11% compared to fitted Q-iteration without expert knowledge. A final challenge for model-free batch RL techniques is that of finding a consumption plan for the next day, i.e. an open-loop solution.

Fig. 5. Mean daily cost and standard deviation of 100 simulation runs with Q-learning for different learning rates α, fitted Q-iteration, and the model-based solution (Optimal) during 300 iterations.

In order to solve this problem, we have presented a model-free Monte Carlo method that uses a distance metric based on the Q-function. In a final experiment, spanning 100 days, we have successfully tested this method to find the day-ahead consumption plan of a residential heat pump.

Our future research in this area will focus on employing the presented algorithms in a realistic lab environment. We are currently testing the expert policy adjustment method on a converted electric water heater and an air conditioning unit with promising results. The preliminary findings of the lab experiments indicate that the expert policy adjustment and extended fitted Q-iteration can be successfully used in a real-world demand response setting.

APPENDIX: CONVERGENCE PROPERTIES AND COMPARISON WITH Q-LEARNING

This section discusses the convergence properties of the fitted Q-iteration algorithm and the model-free Monte Carlo method. In addition, it provides an empirical comparison between a well-known reinforcement learning method, i.e. Q-learning, and fitted Q-iteration.

Fitted Q-iteration: In this paper, we build upon the theoretical work of [16], [23], and [24], and we demonstrate how fitted Q-iteration can be adapted to work in a demand response setting. In [16], Ernst et al., starting from [24], show that when a kernel-based regressor is used, fitted Q-iteration converges to an optimal policy. For the finite horizon case, they show that the solution of the T-stage optimal control problem yields a policy that approximates the true optimal policy. As the ensemble of extremely randomized trees [23] used in the current work readjusts its approximator architecture to the output variables, it is not possible to guarantee convergence. However, empirically we observed that it performs better than a frozen set of trees.

An empirical evaluation of fitted Q-iteration with an ensemble of extremely randomized trees is presented in Fig. 5. In this experiment, we implemented Q-learning [26] and applied it to a heat-pump thermostat with a time-varying price profile as described in Section V. The intent of the experiment is to compare fitted Q-iteration with two other approaches, namely Q-learning and the optimal solution. Since Q-learning discards the given tuple after each observation and assumes no model of the reward and the exogenous data is provided, we repeated the same day, i.e. identical price profile, disturbances, and outside temperature, during 300 iterations. Specifically, Fig. 5 shows the mean daily cost and standard deviation of 100 simulation runs with Q-learning for different learning rates. Note that each simulation run uses a different random seed during the exploration step. As seen in Fig. 5, fitted Q-iteration converges to a near-optimal solution within 25 days. The solution of the model-based controller was found by solving the corresponding mixed integer problem with a commercial solver [44]. The model-based controller knows the exact future disturbance at the start of the optimization horizon. These empirical results indicate that the fitted Q-iteration algorithm clearly outperforms the standard Q-learning algorithm for the given demand response problem.

Model-free Monte Carlo: This paper uses a model-free Monte Carlo (MC) method to construct a day-ahead consumption schedule of a flexible load based on the Q-function obtained from fitted Q-iteration. Specifically, in [45] Fonteneau et al. infer bounds on the returns of the model-free MC method given a batch of tuples. They demonstrate that the lower and upper bounds converge at least linearly towards the true return of the policy when the size of the batch grows, i.e. when new tuples are added to the batch.

ACKNOWLEDGMENTS

The authors would like to thank Karel Macek of Honeywell and the reinforcement learning team of the University of Liège for their comments and suggestions. This work has been done under a grant from the Flemish Institute for the Promotion of Scientific and Technological Research in Industry (IWT).

REFERENCES

[1] B. Dupont, P. Vingerhoets, P. Tant, K. Vanthournout, W. Cardinaels, T. De Rybel, E. Peeters, and R. Belmans, "LINEAR breakthrough project: Large-scale implementation of smart grid technologies in distribution grids," in Proc. 3rd IEEE PES Innov. Smart Grid Technol. Conf. (ISGT Europe), Berlin, Germany, Oct. 2012, pp. 1–8.

[2] U.S. Energy Information Administration (EIA), "Annual energy outlook 2014 with projections to 2040," http://www.eia.gov/forecasts/aeo/pdf/0383(2014).pdf, U.S. Department of Energy, Washington, DC, Tech. Rep., April 2015, [Online: accessed March 21, 2015].

[3] S. Koch, J. L. Mathieu, and D. S. Callaway, "Modeling and control of aggregated heterogeneous thermostatically controlled loads for ancillary services," in Proc. 17th IEEE Power Sys. Comput. Conf. (PSCC), Stockholm, Sweden, Aug. 2011, pp. 1–7.

[4] E. C. Kara, M. Berges, B. Krogh, and S. Kar, "Using smart devices for system-level management and control in the smart grid: A reinforcement learning framework," in Proc. 3rd IEEE Int. Conf. on Smart Grid Commun. (SmartGridComm), Tainan, Taiwan, Nov. 2012, pp. 85–90.

[5] F. Ruelens, B. Claessens, S. Vandael, S. Iacovella, P. Vingerhoets, and R. Belmans, "Demand response of a heterogeneous cluster of electric water heaters using batch reinforcement learning," in Proc. 18th IEEE Power Sys. Comput. Conf. (PSCC), Wrocław, Poland, Aug. 2014, pp. 1–8.

[6] J. Tastu, P. Pinson, P.-J. Trombe, and H. Madsen, "Probabilistic forecasts of wind power generation accounting for geographically dispersed information," IEEE Trans. on Smart Grid, vol. 5, no. 1, pp. 480–489, Jan. 2014.

[7] E. Peeters, D. Six, M. Hommelberg, R. Belhomme, and F. Bouffard, "The ADDRESS project: An architecture and markets to enable active demand," in Proc. 6th IEEE Int. Conf. on the European Energy Market (EEM), Leuven, Belgium, May 2009, pp. 1–5.

[8] F. Bliek, A. van den Noort, B. Roossien, R. Kamphuis, J. de Wit, J. van der Velde, and M. Eijgelaar, "PowerMatching City, a living lab smart grid demonstration," in Proc. 1st IEEE PES Innov. Smart Grid Technol. Conf. (ISGT Europe), Gothenburg, Sweden, Oct. 2010, pp. 1–8.

[9] A. Molderink, V. Bakker, M. Bosman, J. Hurink, and G. Smit, "Management and control of domestic smart grid technology," IEEE Trans. on Smart Grid, vol. 1, no. 2, pp. 109–119, Sept. 2010.

[10] D. O'Neill, M. Levorato, A. Goldsmith, and U. Mitra, "Residential demand response using reinforcement learning," in Proc. 1st IEEE Int. Conf. on Smart Grid Commun. (SmartGridComm), Gaithersburg, MD, US, Oct. 2010, pp. 409–414.

[11] S. Vandael, B. Claessens, M. Hommelberg, T. Holvoet, and G. Deconinck, "A scalable three-step approach for demand side management of plug-in hybrid vehicles," IEEE Trans. on Smart Grid, vol. 4, no. 2, pp. 720–728, Nov. 2012.

[12] J. Cigler, D. Gyalistras, J. Siroky, V. Tiet, and L. Ferkl, "Beyond theory: the challenge of implementing model predictive control in buildings," in Proc. 11th REHVA World Congress (CLIMA), Prague, Czech Republic, 2013.

[13] J. Mathieu and D. Callaway, "State estimation and control of heterogeneous thermostatically controlled loads for load following," in Proc. 45th Hawaii Int. Conf. on System Science (HICSS), Maui, HI, Jan. 2012, pp. 2002–2011.

[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.

[15] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Nashua, NH: Athena Scientific, 1996.

[16] D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel, "Reinforcement learning versus model predictive control: a comparison on a power system problem," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 2, pp. 517–529, 2009.

[17] G. P. Henze and J. Schoenmann, "Evaluation of reinforcement learning control for thermal energy storage systems," HVAC&R Research, vol. 9, no. 3, pp. 259–275, 2003.

[18] D. Lee and W. B. Powell, "An intelligent battery controller using bias-corrected Q-learning," in Proc. AAAI Conf. on Artificial Intelligence, J. Hoffmann and B. Selman, Eds. Palo Alto, CA: AAAI Press, 2012.

[19] Y. Liang, L. He, X. Cao, and Z.-J. Shen, "Stochastic control for smart grid users with flexible demand," IEEE Trans. on Smart Grid, vol. 4, no. 4, pp. 2296–2308, Dec. 2013.

[20] H. van Hasselt, "Double Q-learning," in Proc. 24th Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2010, pp. 2613–2621. [Online]. Available: http://papers.nips.cc/paper/3964-double-q-learning.pdf

[21] R. Fonteneau, S. A. Murphy, L. Wehenkel, and D. Ernst, "Batch mode reinforcement learning based on the synthesis of artificial trajectories," Annals of Operations Research, vol. 208, no. 1, pp. 383–416, 2013.

[22] S. Adam, L. Busoniu, and R. Babuska, "Experience replay for real-time reinforcement learning control," IEEE Trans. on Syst., Man, and Cybern., Part C: Applications and Reviews, vol. 42, no. 2, pp. 201–212, 2012.

[23] D. Ernst, P. Geurts, and L. Wehenkel, "Tree-based batch mode reinforcement learning," Journal of Machine Learning Research, vol. 6, pp. 503–556, 2005.

[24] D. Ormoneit and S. Sen, "Kernel-based reinforcement learning," Machine Learning, vol. 49, no. 2-3, pp. 161–178, 2002.

[25] S. Lange, T. Gabel, and M. Riedmiller, "Batch reinforcement learning," in Reinforcement Learning: State-of-the-Art, M. Wiering and M. van Otterlo, Eds. New York, NY: Springer, 2012, pp. 45–73.

[26] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[27] G. Rummery and M. Niranjan, On-line Q-learning using connectionist systems. Cambridge, UK: Cambridge University, 1994.

[28] Z. Wen, D. O'Neill, and H. Maei, "Optimal demand response using device-based reinforcement learning," [Online]. Available: http://web.stanford.edu/class/ee292k/reports/ZhengWen.pdf, Stanford University, Yahoo Labs, Stanford, CA, Tech. Rep., 2015.

[29] S. Vandael, B. Claessens, D. Ernst, T. Holvoet, and G. Deconinck, "Reinforcement learning of heuristic EV fleet charging in a day-ahead electricity market," IEEE Trans. on Smart Grid, vol. 6, no. 4, pp. 1795–1805, July 2015.


[30] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL: CRC Press, 2010.

[31] K. Vanthournout, R. D'hulst, D. Geysen, and G. Jacobs, "A smart domestic hot water buffer," IEEE Trans. on Smart Grid, vol. 3, no. 4, pp. 2121–2127, Dec. 2012.

[32] D. Chassin, K. Schneider, and C. Gerkensmeyer, "GridLAB-D: An open-source power systems modeling and simulation environment," in Proc. IEEE Transmission and Distribution Conf. and Expos., Chicago, IL, US, April 2008, pp. 1–5.

[33] J. N. Tsitsiklis and B. Van Roy, "Feature-based methods for large scale dynamic programming," Machine Learning, vol. 22, no. 1-3, pp. 59–94, 1996.

[34] W. Curran, T. Brys, M. Taylor, and W. Smart, "Using PCA to efficiently represent state spaces," in Proc. 12th European Workshop on Reinforcement Learning (EWRL 2015), Lille, France, 2015.

[35] "Belpex - Belgian power exchange," http://www.belpex.be/, [Online: accessed March 21, 2015].

[36] R. Bellman, Dynamic Programming. New York, NY: Dover Publications, 1957.

[37] D. P. Bertsekas, "Approximate dynamic programming," in Dynamic Programming and Optimal Control, 3rd ed. Belmont, MA: Athena Scientific, 2011, pp. 324–552.

[38] W. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. Hoboken, NJ: Wiley-Blackwell, 2011.

[39] L. Busoniu, D. Ernst, R. Babuska, and B. De Schutter, "Exploiting policy knowledge in online least-squares policy iteration: An empirical study," Automation, Computers, Applied Mathematics, vol. 19, no. 4, 2010.

[40] M. Riedmiller, "Neural fitted Q-iteration - first experiences with a data efficient neural reinforcement learning method," in Proc. 16th European Conference on Machine Learning (ECML), vol. 3720. Porto, Portugal: Springer, Oct. 2005, p. 317.

[41] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[42] S. Lange and M. Riedmiller, "Deep auto-encoder neural networks in reinforcement learning," in Proc. IEEE 2010 Int. Joint Conf. on Neural Networks (IJCNN), Barcelona, Spain, July 2010, pp. 1–8.

[43] U. Jordan and K. Vajen, "Realistic domestic hot-water profiles in different time scales: Report for the International Energy Agency, Solar Heating and Cooling Task (IEA-SHC)," Universität Marburg, Marburg, Germany, Tech. Rep., 2001.

[44] Gurobi Optimization, "Gurobi optimizer reference manual," http://www.gurobi.com/, [Online: accessed March 21, 2015].

[45] R. Fonteneau, S. Murphy, L. Wehenkel, and D. Ernst, "Inferring bounds on the performance of a control policy from a sample of trajectories," in Proc. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Nashville, TN, March 2009, pp. 117–123.

Frederik Ruelens received his M.Sc. degree (cum laude) in electrical engineering from the KU Leuven, Belgium, in 2010. Since 2011, his Ph.D. research on demand response using reinforcement learning is supported by an IWT Flanders scholarship. His research interests include machine learning, data science, reinforcement learning, and power systems.

Bert Claessens was born in Neeroeteren, Belgium. He received his M.Sc. and Ph.D. degrees in applied physics from the University of Technology of Eindhoven, The Netherlands, in 2002 and 2006, respectively. In 2006, he started working at ASML, Veldhoven, The Netherlands, as a design engineer. Since June 2010 he has been working as a Researcher at the Vlaamse Instelling voor Technologisch Onderzoek (VITO), Mol, Belgium. His research interests include algorithm development and data analysis.

Stijn Vandael was born in Hasselt, Belgium. He received the M.S. degree in industrial engineering from the Katholieke Hogeschool Limburg (KHLim), Belgium, and the M.Sc. degree from the Katholieke Universiteit Leuven (KU Leuven), Belgium. Currently, he is working towards a Ph.D. degree at the Department of Computer Science (KU Leuven) in close cooperation with the Department of Electrical Engineering (KU Leuven). His research interests include coordination in multi-agent systems, plug-in hybrid electric vehicles, and smart grids.

Bart De Schutter (M'08-SM'10) received the M.Sc. degree in electrotechnical-mechanical engineering and the Ph.D. degree (summa cum laude with congratulations of the examination jury) in applied sciences from Katholieke Universiteit Leuven, Leuven, Belgium, in 1991 and 1996, respectively. He is currently a Full Professor at the Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands. His current research interests include control of intelligent transportation systems, hybrid systems control, discrete-event systems, and multiagent multilevel control. Dr. De Schutter is an Associate Editor of Automatica and the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS.

Robert Babuska received the M.Sc. degree (cum laude) in control engineering from the Czech Technical University in Prague, in 1990, and the Ph.D. degree (cum laude) from the Delft University of Technology, The Netherlands, in 1997. He has had faculty appointments at the Technical Cybernetics Department of the Czech Technical University in Prague and at the Electrical Engineering Faculty of the Delft University of Technology. Currently, he is a Professor of Intelligent Control and Robotics at the Delft Center for Systems and Control. His research interests include reinforcement learning, neural and fuzzy systems, identification and state estimation, nonlinear model-based and adaptive control, and dynamic multi-agent systems. He has been working on applications of these techniques in the fields of robotics, mechatronics, and aerospace.

Ronnie Belmans (S'77-M'84-SM'89-F'05) received the M.S. degree in electrical engineering and the Ph.D. degree, both from the University of Leuven (KU Leuven), Heverlee, Belgium, in 1979 and 1984, respectively, the Special Doctorate from KU Leuven in 1989, and the Habilitierung from RWTH Aachen, Germany, in 1993. Currently, he is a Full Professor with the KU Leuven, teaching electric power and energy systems. His research interests include techno-economic aspects of power systems, power quality, and distributed generation. He is also a Guest Professor at Imperial College of Science, Technology and Medicine, London, U.K. Dr. Belmans is a Fellow of the Institution of Engineering and Technology (IET, U.K.). He is the Honorary Chairman of the board of Elia, the Belgian transmission grid operator. He is also the President of the International Union for Electricity Applications (UIE).