
A Survey of Approximate Dynamic Programming

WANG Lin, PENG Hui, ZHU Hua-yong, SHEN Lin-cheng

College of Mechatronics Engineering and Automation, National Univ. of Defense Technology, Changsha, China

E-mail: [email protected]

Abstract—Multi-stage decision problems under uncertainty are abundant in process industries. The Markov decision process (MDP) is a general mathematical formulation of such problems. Although stochastic programming and dynamic programming are the standard methods for solving MDPs, their unwieldy computational requirements limit their usefulness in real applications. Approximate dynamic programming (ADP) combines simulation and function approximation to alleviate the "curse of dimensionality" associated with the traditional dynamic programming approach. In this paper, the ADP approach, which abates the curse of dimensionality by solving the DP within a carefully chosen, small subset of the state space, is introduced, and a survey of recent research directions within the field of ADP is presented.

Keywords-Dynamic Programming; Approximate Dynamic Programming; Reinforcement Learning; Markov Decision Processes

I. INTRODUCTION

How can we develop better general-purpose tools for doing optimization over time, by using learning and approximation to allow us to handle larger-scale, more difficult problems? Even today, there is only one exact method for solving problems of optimization over time, in the general case of nonlinearity with random disturbance: dynamic programming (DP). But exact dynamic programming is used only in niche applications today, because the “curse of dimensionality” limits the size of problems which can be handled. Thus in many engineering applications, basic stability is now assured via conservative design, but overall performance is far from optimal.

There has been enormous progress over the past ten years in increasing the scale of what we can handle with approximate DP and learning. In many practical applications today, we can use ADP as a kind of offline numerical method which tries to "learn", or converge to, an optimal adaptive control policy for a particular type of plant (such as a car engine). But even in offline learning, convergence speed is a major issue, and similar mathematical challenges arise. Nevertheless, the flourishing theory and growing range of applications of ADP suggest that it is a promising way to attack these problems.

In this paper, a survey of ADP and its applications is presented. In section two, an introduction to ADP is given, covering its basic principles, key issues, and algorithms. In section three, some new improvements of ADP are presented, including new algorithms, policies, and methods. In section four, several kinds of applications are introduced.

II. APPROXIMATE DYNAMIC PROGRAMMING

The term "ADP" can be interpreted either as "Adaptive Dynamic Programming" or "Approximate Dynamic Programming" (the latter is more common). It is also called "adaptive critic designs" (ACD). Various strands of the field have sometimes been called "reinforcement learning", "adaptive critics", or "neuro-dynamic programming", although the term "reinforcement learning" has meant many different things to many different people. Introduced by Werbos in 1977 [4], ADP has received increasing attention recently [1], [2], [3], [4].

As shown in equation (1), in dynamic programming the user supplies both a utility function, the function to be maximized, and a stochastic model of the external plant or environment. One then solves for the unknown function J(x(t)).

J(x(t)) = \max_{u(t)} \left[ U(x(t), u(t)) + \langle J(x(t+1)) \rangle / (1 + r) \right] \qquad (1)
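For concreteness, the following is a minimal sketch (not taken from the paper) of how equation (1) can be solved exactly by value iteration on a small finite MDP; the toy transition model P, utility table U, and interest rate r are assumptions made purely for illustration.

```python
# Illustrative sketch only: exact DP (value iteration) for equation (1) on a
# toy finite MDP.  P, U, and r are assumptions of this example.
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# P[u, x, y] = probability of moving from state x to state y under action u
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# U[x, u] = utility of taking action u in state x
U = rng.random((n_states, n_actions))

r = 0.1                       # interest rate: 1/(1+r) acts as a discount factor
gamma = 1.0 / (1.0 + r)

J = np.zeros(n_states)
for _ in range(1000):
    # Bellman backup of equation (1): J(x) = max_u [ U(x,u) + <J(x')> / (1+r) ]
    Q = U + gamma * np.einsum('uxy,y->xu', P, J)
    J_new = Q.max(axis=1)
    if np.max(np.abs(J_new - J)) < 1e-10:
        break
    J = J_new

greedy_policy = Q.argmax(axis=1)
print("J:", J)
print("greedy policy:", greedy_policy)
```

This loop is exactly the computation whose cost explodes with the number of states and actions, which is what motivates the approximations surveyed below.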

Approximate dynamic programming, as defined by Werbos [1], abates the curse of dimensionality by solving the DP within a carefully chosen, small subset of the state space. Two ideas are central to this: value approximation and alternate starting points.

A. Value approximation

Instead of solving for J(x) exactly, we can use a universal approximation function J(x, W), containing a set of parameters W, and try to estimate the parameters W that make J(x, W) a good approximation to the true function J [4], [6].

In [4], Werbos proposed training J(x, W) to match or predict the target:

J^{*}(t) = U(t) + J(t+1) / (1 + r) \qquad (2)

This training can be done by minimizing the least-squares error with backpropagation, or by any other method used to train a function approximator to match examples of input vectors and desired output vectors. This approach was named "Heuristic Dynamic Programming" (HDP) by Werbos [4].
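As an illustration of this training procedure (a sketch under assumed names, not Werbos's original code), the fragment below fits a linear critic J(x, W) = W·phi(x) to the target of equation (2) by stochastic least-squares gradient steps along a simulated trajectory; the feature map phi, the plant model env_step, and all constants are hypothetical.

```python
# HDP-style sketch: a linear critic trained toward U(t) + J(t+1)/(1+r).
# phi, env_step, and all constants are illustrative assumptions.
import numpy as np

def phi(x):                       # hypothetical feature map
    return np.array([1.0, x, x * x])

def env_step(x, rng):             # hypothetical plant: next state and utility
    x_next = 0.9 * x + 0.1 * rng.normal()
    utility = -x * x
    return x_next, utility

rng = np.random.default_rng(0)
W = np.zeros(3)
r, alpha = 0.1, 0.05              # interest rate and learning rate

x = 1.0
for t in range(5000):
    x_next, U_t = env_step(x, rng)
    target = U_t + phi(x_next) @ W / (1.0 + r)   # HDP target, eq. (2)
    error = target - phi(x) @ W                  # temporal-difference error
    W += alpha * error * phi(x)                  # least-squares gradient step
    x = x_next

print("learned critic weights:", W)
```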

In 1983, Barto, Sutton and Anderson [7] generalized Widrow's method [6] to define the class of methods TD(λ), such that λ = 0 and λ = 1 would yield a choice between Werbos's and Widrow's earlier methods. Both TD(λ) and TD-learning have been researched widely; more details can be found in [9], [10], [11]. There are also a wide variety of other methods available, some more truly learning-based and some based on linear programming [2].
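The sketch below shows one common form of TD(λ) with accumulating eligibility traces, evaluating a fixed policy on a toy Markov chain; the chain, rewards, and step sizes are assumptions of this illustration. With λ = 0 it reduces to a one-step TD update, while λ = 1 approaches full-return targets.

```python
# Illustrative TD(lambda) with accumulating eligibility traces on a toy chain.
import numpy as np

n_states = 5
rng = np.random.default_rng(0)
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
R = rng.random(n_states)              # reward received on leaving each state

gamma, lam, alpha = 0.9, 0.7, 0.05    # discount, trace decay, learning rate
V = np.zeros(n_states)
e = np.zeros(n_states)                # eligibility trace

x = 0
for t in range(20000):
    x_next = rng.choice(n_states, p=P[x])
    delta = R[x] + gamma * V[x_next] - V[x]   # TD error
    e *= gamma * lam                          # decay all traces
    e[x] += 1.0                               # accumulate trace for current state
    V += alpha * delta * e                    # update every recently visited state
    x = x_next

print("estimated values:", V)
```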

Besides value updates, policy updates are another route to value approximation, and they are much more straightforward in theory. Policy iteration is a two-step approach composed of policy evaluation and policy improvement. Rather than solving for a cost-to-go function by successive substitution into the Bellman equation, the policy iteration method starts with a specific policy, and the policy evaluation step computes the cost-to-go values under that policy. Then, the policy improvement step tries to build an improved policy based on the cost-to-go function of the previous policy. The policy evaluation and improvement steps are repeated until the policy no longer changes significantly. Hence, this method iterates in the policy space rather than in the cost-to-go function space [35].
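A minimal sketch of this two-step loop on a toy finite MDP follows (exact policy evaluation by a linear solve, then greedy improvement); the transition and utility tables are illustrative assumptions only.

```python
# Illustrative policy iteration: evaluate the current policy exactly, then
# improve greedily, until the policy stops changing.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
U = rng.random((n_states, n_actions))

policy = np.zeros(n_states, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) J = U_pi exactly
    P_pi = P[policy, np.arange(n_states)]          # transitions under the policy
    U_pi = U[np.arange(n_states), policy]
    J = np.linalg.solve(np.eye(n_states) - gamma * P_pi, U_pi)

    # Policy improvement: act greedily with respect to the evaluated J
    Q = U + gamma * np.einsum('uxy,y->xu', P, J)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("policy:", policy)
print("cost-to-go:", J)
```

Because each evaluation step here is exact, the number of outer iterations is small; the practical difficulty is again the size of the state space, which is where approximate policy evaluation enters.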

More recently, some researchers have used the term "Policy Gradient Control Synthesis" (PGCS) for this approach to updating or training an action network; this may require the efficient real-time calculation of selected second derivatives.

B. Alternate starting points

The previous section discussed value approximation for ADP designs based directly on the Bellman equation, including HDP, TD, and their variations. The Bellman equation itself may be viewed as a kind of recursion equation for the function J(x).

Instead of always starting from the Bellman equation directly, we can start from related recurrence equations which sometimes improve the approximation.

There are two alternative starting points which are important to applications of ADP today. One of them is the recursion equation which underlies methods called Action-Dependent HDP (ADHDP), Q-learning, and related methods. For Action-Dependent HDP [5], Werbos proposed that Q(x(t), u(t)) and u(x) could be approximated by universal approximators, such as neural networks. We can train a network Q(x, u, W) to match the target Q^{*}, using exactly the same procedure as in HDP. Similarly, we can train u(x, W_u) to maximize Q, by tuning the weights W_u in response to the derivatives of Q, backpropagated from Q through u to the weights. A small illustrative sketch of this idea is given at the end of this subsection.

The other is the recursion relation underlying Dual Heuristic Programming (DHP), in which the equations are essentially a stochastic generalization of the Pontryagin equation, which is as important and well known as the Hamilton-Jacobi equation in some fields of science. DHP is not the only method in its class. There have also been a few simulations of Action-Dependent DHP (ADDHP), Globalized DHP (GDHP), ADGDHP, and related Error Critic designs.
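To make the ADHDP idea above concrete, here is a deliberately small sketch (the quadratic critic form, the scalar plant, and all names are assumptions of this illustration, not the designs in [5]): a critic Q(x, u, W) is trained toward the target U + Q(x', u(x'))/(1+r) exactly as in HDP, and the actor parameter follows the derivative of Q with respect to u, chained through du/dw.

```python
# Illustrative ADHDP-flavoured loop: quadratic critic Q(x,u), linear actor u = wa*x.
import numpy as np

def features(x, u):                 # hypothetical critic feature map for Q(x, u)
    return np.array([x * x, x * u, u * u, 1.0])

def plant(x, u, rng):               # hypothetical scalar plant and utility
    x_next = 0.9 * x + 0.5 * u + 0.05 * rng.normal()
    utility = -(x * x + 0.1 * u * u)
    return x_next, utility

rng = np.random.default_rng(0)
Wc = np.zeros(4)                    # critic weights
wa = 0.0                            # actor gain: u = wa * x
r, ac, aa = 0.1, 0.05, 0.01         # interest rate, critic and actor step sizes

x = 1.0
for t in range(20000):
    u = wa * x
    x_next, U_t = plant(x, u, rng)
    u_next = wa * x_next

    # Critic update: same least-squares procedure as HDP, applied to Q
    target = U_t + features(x_next, u_next) @ Wc / (1.0 + r)
    err = target - features(x, u) @ Wc
    Wc += ac * err * features(x, u)

    # Actor update: follow dQ/du, chained through du/dwa = x
    dQ_du = Wc[1] * x + 2.0 * Wc[2] * u
    wa = np.clip(wa + aa * dQ_du * x, -1.7, 0.0)   # keep this toy loop stable
    x = x_next

print("critic weights:", Wc)
print("actor gain:", wa)
```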

III. IMPROVEMENTS OF UPDATE POLICIES AND APPROACHES

In this section, we survey improvements of update policies and approaches. The improvements are divided into three kinds: improvements of basic methods, combinations, and other works. Here "combination" means combining the basic methods of ADP with approaches from other fields of optimization, control, and learning.

A. Improvements of basic methods

To reduce the complexity, a method was introduced by Bo Lincoln and Anders Rantzer [22] in which the distance from optimality is kept within prespecified bounds and the size of the bounds determines the computational complexity; the method has been applied to a partially observable Markov decision problem (POMDP). In [12], a method for accelerating DHP critic training using templates and perceptual learning was proposed; both faster and more stable learning were achieved by using the value template and exploiting its inherent constraints to regularize the perceptual learning task. Dimitri P. Bertsekas introduced a method of implicit cost-to-go approximation [8], in which the values of the cost-to-go function V_{k+1} at the states are computed on-line as needed, via some computation of future costs starting from these states (optimal or suboptimal/heuristic, with or without a rolling horizon). Three such methods are proposed: rollout, open-loop feedback control, and model predictive control.

Recently, a system-theoretic framework for learning and optimization has been developed that shows how many approximate dynamic programming paradigms, such as perturbation analysis, Markov decision processes, and reinforcement learning, are very closely related. Using this framework, a new optimization technique called gradient-based policy iteration (GBPI) has been developed [14].

B. Combination

Many learning algorithms have been successfully applied in various fields. However, each algorithm has its advantages and disadvantages. With the increasing complexity of environments and tasks, it is difficult for a single learning algorithm to cope with complicated learning problems with high performance. This motivated researchers to combine some learning algorithms to improve the learning quality.

In [19], Roman Ilin et al. applied the extended Kalman filter (EKF) algorithm to train a cellular SRN to solve a difficult function approximation problem. The speed of convergence increased by several orders of magnitude compared to previous results, EKF training proved much less sensitive to the random initialization of the network's weights, and the trained networks demonstrated good generalization capabilities.

In [31], Maryam Shokri et al. proposed an extension of eligibility traces to address one of the challenging problems in reinforcement learning, combining the idea of opposition with eligibility traces to construct an opposition-based Q(λ). The results were compared with the conventional Watkins' Q(λ) and showed a remarkable performance increase.

A graph-based evolutionary algorithm named "Genetic Network Programming" (GNP) was proposed in [30]. GNP represents its solutions as graph structures, which improves expression ability and performance. In addition, GNP with Reinforcement Learning (GNP-RL) was proposed a few years ago [29], and in [28] GNP with Actor-Critic (GNP-AC), a new type of GNP-RL, was proposed. Originally, GNP dealt with discrete information, whereas GNP-AC aims to deal with continuous information; the proposed method was applied in the Khepera simulator and its performance was evaluated. In [9], recursive least-squares (RLS) methods were used to solve reinforcement learning problems, and two new reinforcement learning algorithms using linear value function approximators, called RLS-TD(λ) and Fast Adaptive Heuristic Critic (Fast-AHC), were proposed and analyzed.

Ju Jiang et al. proposed a new multiple-learning architecture, the "Aggregated Multiple Reinforcement Learning System" (AMRLS). AMRLS adopts three different learning algorithms (Q(λ) [25], SARSA(λ) [26], and AC(λ) [27]) to learn individually and then combines their results with aggregation methods. AMRLS was tested on two different environments: a cart-pole system and a maze environment. The presented simulation results reveal that aggregation not only provides robustness and fault-tolerance ability, but also produces smoother learning curves and needs fewer learning steps than the individual learning algorithms [23], [24].

C. Other Works

In [32], [33], [34], Tang Hao et al. introduced how to apply neuro-dynamic programming (NDP) and neuro-policy iteration (NPI) algorithms to Markov decision processes, including CTMDPs, POMDPs, and SMDPs.

IV. APPLICATIONS

Research on ADP and its applications has been supported by the U.S. National Science Foundation (NSF). Three funded activities focus on: (1) new cross-disciplinary partnerships addressing electric power networks (EPNES); (2) space solar power (JIETSSP); and (3) sustainable technology (TSE). To learn about these activities in detail, search on these acronyms at http://www.nsf.gov.

J. H. Lee and J. M. Lee have pointed to the potential of ADP as a general methodology for solving scheduling and control problems of a stochastic complexity previously out of reach; more details can be found in [35].

Recently, ADP has been used in many industrial processes. Here we introduce some examples. In [13], action-dependent heuristic dynamic programming (ADHDP), a member of the adaptive critic designs family, was used for the design of a static compensator (STATCOM) neuro-fuzzy controller.

In [15], an ADP approach to a communication-constrained sensor management problem was proposed. In [16], [17], [18], Hossein Javaherian and Derong Liu summarized their work on neural network modeling for a V8 engine and on self-learning controller development based on adaptive critic designs; the methods they used were ADHDP and HDP. In [20], Hariharan Lakshmanan proposed a general approach for decentralized control based on approximate dynamic programming. They considered approximations to the Q-function via local approximation architectures, which lead to decentralization of the task of choosing control actions and can be computed and stored efficiently; they proposed an approach for fitting the Q-function based on linear programming, and showed that error bounds previously developed for cost-to-go function approximation via linear programming can be extended to the case of Q-function approximation. In [21], Anders Rantzer proposed an application of ADP to switching systems, in which a computational example was given for switching control on a graph with 60 nodes, 120 edges, and 30 continuous states.

V. CONCLUSION

We have surveyed approximate dynamic programming, which is closely linked with reinforcement learning. An introduction to ADP was given, covering its history, basic principles, key issues, and algorithms. Then, some recent improvements of ADP were presented, including improvements of the basic methods, combinations with other approaches, and other works. Lastly, several kinds of applications were introduced.

REFERENCES

[1] P. Werbos, "ADP: Goals, Opportunities and Principles," in Handbook of Learning and Approximate Dynamic Programming, IEEE Press / John Wiley & Sons, Inc., 2004, pp. 1-42.

[2] W. B. Powell and B. Van Roy, "Approximate Dynamic Programming for High-Dimensional Resource Allocation Problems," in Handbook of Learning and Approximate Dynamic Programming, IEEE Press / John Wiley & Sons, Inc., 2004, pp. 261-284.

[3] P. Werbos, "Stable Adaptive Control Using New Critic Designs," 1998 (arXiv: adap-org/9810001).

[4] P. Werbos, “Advanced forecasting for global crisis warning and models of intelligence,” General Systems Yearbook, 1977.

[5] P. Werbos, "Neural networks for control and system identification," Proceedings of the IEEE Conference on Decision and Control, 1989.

[6] B. Widrow, N. Gupta and S. Maitra, "Punish/reward: learning with a Critic in adaptive threshold systems," IEEE Trans. SMC, vol. 5, 1973, pp. 455-465.

[7] A. Barto, R. Sutton and C. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Trans. SMC, vol. 13, 1983, pp.834-846.

[8] D. P. Bertsekas, "Dynamic Programming and Suboptimal Control: A Survey from ADP to MPC," 2005, Report LIDS 2632.

[9] Xin Xu, Han-gen He and Dewen Hu, "Efficient Reinforcement Learning Using Recursive Least-Squares Methods," Journal of Artificial Intelligence Research, vol. 16, 2002, pp. 259-292.


[10] P. D. Dayan, "The Convergence of TD(λ) for General λ," Machine Learning, vol. 8, 1992, pp. 341-362.

[11] John N. Tsitsiklis and Benjamin Van Roy, "An Analysis of Temporal-Difference Learning with Function Approximation," Van Roy's homepage, 1997.

[12] T. T. Shannon, R. A. Santiago and G. Lendaris, "Accelerating Critic Learning in Approximate Dynamic Programming Via Value Templates and Perceptual Learning," IEEE 0-7803-7898-9/03, 2003, pp. 2922-2927.

[13] S. Mohagheghi and Ganesh K. Venayagamoorthy, "Adaptive Critic Design Based Neuro-Fuzzy Controller for a Static Compensator in a Multimachine Power System," IEEE Transactions on Power Systems, vol. 21, no. 4, 2006, pp. 1744-1755.

[14] X.-R. Cao, "Learning and optimization - from a system theoretic perspective," in Handbook of Learning and Approximate Dynamic Programming: Scaling up to the Real World, Wiley-IEEE Press, 2004.

[15] J. L. Williams, J. W. Fisher and A. S. Willsky, “An Approximate Dynamic Programming Approach to a Communication Constrained Sensor Management Problem,” 2005 7th International Conference on Information Fusion, 2005, pp.582-560.

[16] H. Javaherian, D. Liu, and Olesia Kovalenko, “Automotive Engine Torque and Air-Fuel Ratio Control Using Dual Heuristic Dynamic Programming,” 2006 International Joint Conference on Neural Networks, 2006, pp.518-526.

[17] H. Javaherian, D. Liu, Y. Zhang and O. Kovalenko, "Adaptive critic learning techniques for automotive engine control," Proceedings of the American Control Conference, 2004, pp. 4066-4071.

[18] O. Kovalenko, D. Liu, and H. Javaherian, “Neural network modeling and adaptive critic control of automotive fuel-injection systems,” Proceedings of the IEEE International Symposium on Intelligent Control, 2004, pp.386–373.

[19] R. Ilin, R. Kozma and P. Werbos, “Cellular SRN Trained by Extended Kalman Filter Shows Promise for ADP,” 2006 International Joint Conference on Neural Networks, 2006, pp.506-511.

[20] H. Lakshmanan and D. P. Farias, “Decentralized Approximate Dynamic Programming for Dynamic Networks of Agents,” Proceedings of the 2006 American Control Conference, 2006, pp.1648-1654.

[21] A. Rantzer, “On Approximate Dynamic Programming in Switching Systems,” 44th IEEE Conference on Decision and Control, and the European Control Conference, 2006, pp.1391-1397.

[22] B. Lincoln and A. Rantzer, "Relaxing Dynamic Programming," IEEE Transactions on Automatic Control, vol. 51, no. 8, 2006, pp. 1249-1261.

[23] Ju Jiang and M. S. Kamel, “Aggregation of Reinforcement Learning Algorithms,” 2006 International Joint Conference on Neural Networks, 2006.

[24] Ju Jiang, M. Kamel and Lei Chen, “Reinforcement Learning and Aggregation,” Proceedings of IEEE International Conference on Systems, Man, and Cybernetics 04, 2004, pp.1303-1308.

[25] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, 1992, pp. 279-292.

[26] R. S. Sutton and A. G. Barto, “Reinforcement Learning, An Introduction.” A Bradford Book, The MIT Press, Cambridge, Massachusetts, London, England, ISBN 0-262-19398-1, 1998.

[27] L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, vol. 4, 1996, pp. 237-255.

[28] H. Hatakeyama and S. Mabu, “An Extension of Genetic Network Programming with Reinforcement Learning Using Actor-Critic,” 2006 IEEE Congress on Evolutionary Computation, 2006.

[29] S. Mabu, K. Hirasawa and J. Hu, "Genetic Network Programming with Reinforcement Learning and its Performance Evaluation," 2004 Genetic and Evolutionary Computation Conference, part II, 2004, pp. 710-711.

[30] K. Hirasawa, M. Okubo, H. Katagiri, J. Hu, and J. Murata, “Comparison between genetic network programming (GNP) and genetic programming (GP),” Proc. of 2001 Congress on Evolutionary Computation, 2001, pp.1276–1282.

[31] M. Shokri, H. R. Tizhoosh and M. Kamel, "Opposition-Based Q(λ) Algorithm," 2006 International Joint Conference on Neural Networks, 2006.

[32] TANG Hao, ZHOU Lei and YUAN Ji-bin, "Unified NDP method based on TD(0) learning for both average and discounted Markov decision processes," Control Theory & Applications, vol. 23, no. 2, 2006, pp. 292-297.

[33] TANG Hao, YUAN Ji-Bin, LU Yang, and CHENG Wen-Juan, “Performance Potential-based Neuro-dynamic Programming for SMDPs,” ACTA AUTOMATICA SINICA, vol. 31, no. 4, 2005, pp.642-646.

[34] TANG Hao, XI Hong-Sheng and YIN Bo-Qun, "A Simulation Optimization Algorithm for CTMDPs Based on Randomized Stationary Policies," ACTA AUTOMATICA SINICA, vol. 30, no. 2, 2004, pp. 229-235.

[35] J. H. Lee and J. M. Lee, “Approximate dynamic programming based approach to process control and scheduling,” Computers and Chemical Engineering, no. 30, 2006, pp.1603–1618.

[36] J. M. Lee and J. H. Lee, "Approximate dynamic programming based approaches for input-output data-driven control of nonlinear processes," Automatica, vol. 41, no. 7, 2005, pp. 1281-1288.

[37] J. M. Lee, N. S. Kaisare and J. H. Lee, “Choice of approximator and design of penalty function for an approximate dynamic programming based control approach,” Journal of Process Control, vol.16, no. 2, 2006, pp.135–156.
