Reinforcement learning (RL)

Reinforcement learning (RL) is a machine learning approach originally inspired by behaviorist psychology: the agent (or machine) learns to make suitable decisions in response to specific stimuli, based on feedback received in the form of rewards or punishments. This feedback is called the reinforcement signal; the agent discovers the link between a specific action and a reward (or punishment) and then selects a preferred action within a given context. The whole process is known as reinforcement learning. The main difference between RL and other machine learning paradigms is that RL is not exactly supervised learning, since it does not rely directly on a set of "supervised" (labeled) data (a training set); instead, it relies on observing the response to the actions taken and evaluating it in terms of a "reward". However, it is not unsupervised learning either, because when we model our "learner" we know in advance what the expected reward is.
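The interaction described above can be pictured as a simple loop in which the agent acts, receives a reinforcement signal, and adjusts its behavior. The following is a minimal, illustrative Python sketch of that loop; the toy Environment, Agent, reward values and learning rate are hypothetical and are not taken from any of the cited works.

```python
# Minimal sketch of the agent-environment loop: act, observe reward, update.
# All classes, rewards and parameters here are illustrative assumptions.
import random


class Environment:
    """Toy two-action environment: action 1 pays more on average (illustrative)."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The reward is the reinforcement signal fed back to the agent.
        reward = random.gauss(1.0 if action == 1 else 0.0, 0.5)
        self.state = (self.state + action) % 2
        return self.state, reward


class Agent:
    """Agent that learns a preferred action from rewards rather than labels."""

    def __init__(self, n_actions=2, alpha=0.1):
        self.values = [0.0] * n_actions   # estimated value of each action
        self.alpha = alpha                # learning rate

    def select_action(self):
        # Pick the action currently believed to be best.
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, action, reward):
        # Move the value estimate toward the observed reinforcement signal.
        self.values[action] += self.alpha * (reward - self.values[action])


env, agent = Environment(), Agent()
state = env.reset()
for _ in range(1000):
    action = agent.select_action()
    state, reward = env.step(action)
    agent.update(action, reward)
print("learned action values:", agent.values)
```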

In the fields of operations research and control theory, RL is also known as approximate dynamic programming or neuro-dynamic programming, introduced in the early work of Bertsekas & Tsitsiklis (1995), although most studies in those areas have focused on the existence of optimal solutions and their characterization. In all these research areas, RL provides a framework for computational learning in which the agent learns how to behave through a trial-and-error mechanism in an environment so as to maximize its rewards over time (Kaelbling et al., 1996). In the most challenging and interesting cases, actions may affect not only the immediate reward from the environment but also the subsequent situation and, through that, all future rewards (Sutton & Barto, 1998).

So far, RL has been applied in various disciplines, namely machine learning, psychology, operations research, and control theory. Being quite successful at solving different problems across different applications, it has become more popular owing to its simple algorithmic and mathematical foundations. RL shares some common ground with DP-based algorithms, in particular the use of the Bellman equation to obtain an optimal sequence of decisions that optimizes the reward function. However, it is able to avoid the limitations of DP algorithms in dealing with large-scale problems. To this end, RL provides the following two key benefits in tackling large-scale problems:

1) Curse of dimensionality: RL uses forward recursion instead of the backward recursion used by DP and SDP to solve the Bellman equation. This avoids having to solve the problem for every possible realization of the system's state variable, so the computational time required for solving the problem is reduced, since far fewer state realizations need to be evaluated at each stage. Furthermore, function approximation methods can be used to mitigate the curse of dimensionality even further.

2) Curse of modeling: RL does not require knowledge of either the exact transition functions or the transition probability matrix of the system. As a result, the MDP can be solved without an exact model of the system (a small contrast of the two recursion modes is sketched after this list).
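To make the contrast between the two recursion modes concrete, the sketch below solves a tiny two-state MDP first by backward-style value iteration, which requires the full transition model and sweeps all states, and then by forward, sample-based Q-learning, which only observes sampled transitions. The MDP, rewards and parameters are illustrative assumptions, not taken from any cited study.

```python
# Backward recursion (DP/SDP style) versus forward, model-free recursion (RL style)
# on an illustrative 2-state, 2-action MDP.
import random

n_states, n_actions, gamma = 2, 2, 0.9
# Transition probabilities P[s][a][s'] and rewards R[s][a]; only the DP sweep uses P directly.
P = [[[0.8, 0.2], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]

# --- Backward recursion: value iteration sweeping over ALL states each pass ---
V = [0.0] * n_states
for _ in range(200):
    V = [max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)) for s in range(n_states)]

# --- Forward recursion: Q-learning from sampled transitions only ---
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, s = 0.1, 0
for _ in range(20000):
    a = random.randrange(n_actions)                  # explore
    s2 = 0 if random.random() < P[s][a][0] else 1    # the environment samples s'
    target = R[s][a] + gamma * max(Q[s2])            # Bellman target from one sample
    Q[s][a] += alpha * (target - Q[s][a])            # no transition model is used here
    s = s2

print("DP values:", [round(v, 2) for v in V])
print("RL values:", [round(max(q), 2) for q in Q])
```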

In RL, the environment is commonly described as an MDP, the same framework used by DP methods. One specific distinction between conventional techniques and RL algorithms is that RL needs no prior information about the MDP, and it usually targets large MDPs where exact approaches become infeasible. Moreover, RL requires neither input/output pairs nor explicit correction of sub-optimal actions. Special attention is paid, however, to its online performance, which comes down to striking a balance between exploration and exploitation. The exploration-versus-exploitation trade-off in RL has attracted much attention in problems formulated as finite MDPs, in particular multi-armed bandit problems, as sketched below.
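A minimal sketch of this balance, using an ε-greedy strategy on a toy multi-armed bandit; the arm means, the value of ε and the number of plays are illustrative assumptions.

```python
# Epsilon-greedy trade-off: explore random arms with probability epsilon,
# otherwise exploit the arm with the highest estimated value.
import random

true_means = [0.2, 0.5, 0.8]          # unknown to the agent
Q = [0.0] * len(true_means)           # estimated value of each arm
counts = [0] * len(true_means)
epsilon = 0.1                         # probability of exploring

for t in range(5000):
    if random.random() < epsilon:                     # explore: random arm
        a = random.randrange(len(Q))
    else:                                             # exploit: current best estimate
        a = max(range(len(Q)), key=lambda i: Q[i])
    reward = random.gauss(true_means[a], 1.0)         # stochastic feedback
    counts[a] += 1
    Q[a] += (reward - Q[a]) / counts[a]               # incremental sample mean

print("estimated arm values:", [round(q, 2) for q in Q])
```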

Recently, RL approaches have gained popularity in applications to water reservoir systems. Castelletti (2002) presented a new methodology known as "Q-learning planning" (QLP) and applied the newly designed algorithm to a water resource operation problem. The approach combined a Q-learning strategy with conventional SDP techniques to produce a hybrid method for the reservoir planning problem. Lee & Labadie (2007) proposed a solution for the multireservoir operation problem based on the Q-learning approach. In this study, they examined the impact of operating rules with and without conditioning on a predicted hydrologic state variable. For the unconditional case, optimized operating policies obtained with different discount factors showed similar patterns. The discount factor is a basic parameter of RL algorithms that determines how much the estimate of future value is discounted. For the conditional cases, two approaches, K-means clustering and percentile values, were compared for determining the hydrologic states of the model. Although both approaches were integrated into the Q-learning algorithm, the K-means clustering approach showed better performance. Furthermore, an interpolation method was adopted to approximate the optimal future reward within the Q-learning algorithm, and nearest-neighbor methods were found to yield more appropriate and stable convergence. In these studies, the performance of the Q-learning algorithm was compared with that of SDP or SSDP approaches.
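To illustrate how such an interpolation step can enter the Q-learning backup, the sketch below reads the future value off the nearest grid point of a coarse storage discretization. The storage grid, number of actions, reward and parameters are illustrative assumptions and do not reproduce the setup of Lee & Labadie (2007).

```python
# Nearest-neighbour approximation of the future value term in a Q-learning backup
# over a coarse storage grid (illustrative values throughout).
import numpy as np

storage_grid = np.linspace(0.0, 100.0, 11)    # coarse discretization of reservoir storage
n_actions = 3                                 # e.g. low / medium / high release
Q = np.zeros((storage_grid.size, n_actions))  # Q-factors stored on the grid


def nearest_q(storage, action):
    """Approximate Q(s, a) for a continuous storage by its nearest grid point."""
    i = int(np.argmin(np.abs(storage_grid - storage)))
    return Q[i, action], i


def backup(storage, action, reward, next_storage, alpha=0.1, gamma=0.95):
    """One Q-learning update whose future term max_a' Q(s', a') is read off the nearest neighbour."""
    _, i = nearest_q(storage, action)
    future = max(nearest_q(next_storage, a)[0] for a in range(n_actions))
    Q[i, action] += alpha * (reward + gamma * future - Q[i, action])
```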

Castelletti et al. (2010) employed reinforcement learning, in particular a fitted Q-iteration approach, to determine the daily water release policy of a multireservoir system over an annual time horizon. In this study, they combined continuous estimation of the value functions with learning from experience in order to find the optimal daily water release when the operating policies exhibit cyclo-stationary behavior (yearly operation). The continuous approximation technique used in the Q-learning approach was based on a tree-based regression algorithm. One promising property of tree-based regression is the possibility of reducing the curse of dimensionality by applying a very coarse discretization scale. The fitted Q-iteration algorithm finds a stationary policy, i.e., a single operating rule of the general form u_t = m(x_t) for a stationary system, whereas the actual system is not stationary but periodic over a time period (a year). A simple remedy is to extend the basic framework to a non-stationary version by including time as part of the state vector. Finally, the learning experience, in the form of a dataset generated from historical observations, provides an opportunity to overcome the curse of modeling.
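The core of fitted Q-iteration can be sketched as repeatedly regressing Bellman targets on (state, time, action) tuples built from a one-step transition dataset. The sketch below assumes scikit-learn's ExtraTreesRegressor as the tree-based regression algorithm and uses a synthetic toy dataset with toy reward and mass-balance relations; none of the data, dimensions or parameters reproduce Castelletti et al. (2010).

```python
# Fitted Q-iteration sketch: alternate between building Bellman targets and
# refitting a tree-based regressor on (state, time, action) inputs.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
N, gamma, actions = 2000, 0.98, np.array([0.0, 0.5, 1.0])   # illustrative release fractions

# Synthetic one-step transition dataset (x_t, t, u_t, r_t, x_{t+1}, t+1).
x = rng.uniform(0.0, 1.0, N)                  # normalized storage
t = rng.integers(0, 365, N)                   # day of year, included in the state
u = rng.choice(actions, N)                    # applied release decision
r = u * x                                     # toy reward: energy ~ release * head
x_next = np.clip(x + 0.1 - u * x, 0.0, 1.0)   # toy mass balance
t_next = (t + 1) % 365

X = np.column_stack([x, t, u])                # regression inputs (state, time, action)
q_target = r.copy()                           # the first target is the immediate reward

for _ in range(20):                           # fitted Q-iteration loop
    model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, q_target)
    # Future value: max over actions of the current Q-estimate at the next state.
    q_next = np.column_stack([
        model.predict(np.column_stack([x_next, t_next, np.full(N, a)]))
        for a in actions
    ]).max(axis=1)
    q_target = r + gamma * q_next             # refreshed regression targets


def policy(storage, day):
    """Greedy (cyclo-stationary) policy: u = argmax over actions of Q(storage, day, u)."""
    q_vals = [model.predict([[storage, day, a]])[0] for a in actions]
    return actions[int(np.argmax(q_vals))]
```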

In recent applications to the reservoir optimization problem, RL techniques have been integrated with other newly proposed approaches in an attempt to generate more efficient solutions. Mariano-Romero et al. (2007) addressed a hydrologic optimization problem using multiobjective distributed Q-learning, which essentially brings a multi-agent approach into Q-learning. Abolpour et al. (2007) coupled RL with fuzzy logic principles to improve a river basin allocation problem. In another work, Tizhoosh & Ventresca (2008) developed optimal mid-term policies for a hydropower plant by applying Q-learning, and then went on to combine Q-learning with opposition-based learning for the management of a single-reservoir system with a monthly time step over a year.

So far, various applications of RL techniques to single-reservoir and multireservoir operation problems have been studied with the purpose of hydropower optimization, where the operating rules are also verified and tested over all possible scenarios to assess their efficiency and reliability. This indicates that the agent is able to estimate the value of its actions from simulated experience and thereby improve toward an optimal or near-optimal control policy. In the literature, no studies were found that apply the RL approach to improve short-term planning of the hydropower generation problem for multireservoir systems while also dealing with cross-correlated objectives and constraints.

Table of Contents

INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
1.1 Dynamic Programming (DP)
1.1.1 Background
1.1.2 Deterministic Dynamic Programming
1.1.3 Stochastic Dynamic Programming
1.1.4 Stochastic Dual Dynamic Programming
1.1.5 Sampling Stochastic Dynamic Programming
1.1.6 Dynamic programming with function approximation
1.2 Reinforcement learning (RL)
1.2.1 Artificial Neural Networks
1.3 Other Optimization Techniques
1.3.1 Linear Programming
1.3.2 Multiobjective Optimization
1.3.3 Nonlinear Programming
1.3.4 Genetic Algorithm
1.3.5 Fuzzy Programming
1.4 Hydrological simulation model
1.5 Conclusions
CHAPTER 2 THE Q-LEARNING (QL) APPROACH
2.1 Introduction
2.2 QL basic concept
2.3 QL algorithm: lookup table version
2.4 Monte Carlo property
2.5 Exploration/Exploitation
2.6 Return function
2.7 Learn/Update process
2.8 Overview of complete QL approach
2.9 Function approximation
2.10 Summary
CHAPTER 3 Q-LEARNING IN OPERATION OF MULTIRESERVOIR SYSTEMS
3.1 Introduction
3.2 Multireservoir operation problem
3.3 QL in multireservoir operation problem
3.3.1 State and action definition
3.3.2 Initialization of QL algorithm
3.3.3 Calibration of QL parameters
3.3.4 Training dataset
3.3.5 Surrogate form of cost function
3.3.6 Approximated QL on multireservoir operation problem
3.4 Summary
CHAPTER 4 MODEL TESTING AND IMPLEMENTATION
4.1 Introduction
4.2 Description of Romaine complex
4.3 Model of environment
4.4 Experimental plan
4.5 Experimental setting
4.6 Evaluation of adaptive step-size and ε-greedy
4.7 Pre-processing the training dataset
4.8 Aggregated Q-learning
4.9 Q-learning with linear Q-factor approximation
4.10 Experimental setting (II)
4.11 Algorithm Extension
4.12 Sensitivity Analysis
4.13 Summary
CONCLUSION
