Trajectory-Based Modified Policy Iteration

R. Sharma; M. Gopal

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

Trajectory-Based Modified Policy Iteration

Authors: R. Sharma, M. Gopal

Abstract:

This paper presents a new problem solving approach that is able to generate optimal policy solution for finite-state stochastic sequential decision-making problems with high data efficiency. The proposed algorithm iteratively builds and improves an approximate Markov Decision Process (MDP) model along with cost-to-go value approximates by generating finite length trajectories through the state-space. The approach creates a synergy between an approximate evolving model and approximate cost-to-go values to produce a sequence of improving policies finally converging to the optimal policy through an intelligent and structured search of the policy space. The approach modifies the policy update step of the policy iteration so as to result in a speedy and stable convergence to the optimal policy. We apply the algorithm to a non-holonomic mobile robot control problem and compare its performance with other Reinforcement Learning (RL) approaches, e.g., a) Q-learning, b) Watkins Q(λ), c) SARSA(λ).

Keywords: Markov Decision Process (MDP), Mobile robot, Policy iteration, Simulation.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1083823

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1447

References:

[1] D.P. Bertsekas and J.N. Tsitsiklis, Neurodynamic-Programming, Athena Scientific, Belmont MA, 1996.
[2] C.W. Zobel and W.T. Scherer, "Simulation-Based Policy Generation Using Large Scale Markov Decision Processes", IEEE Transactions on Systems, Man and Cybernetics-Part A: Systems & Humans, Vol. 31, No-6, November 2001.
[3] Stephan Ten Hagen, "Continuous state-space and Q learning for control of Non-linear systems", Ph. D. Thesis, Amsterdam University, 2001.
[4] C.G. Atkenson and J.C. Santamaria, "A Comparison of Direct and Model-Based Reinforcement Learning", Tech. Rep. GA 30332-0280, College of Computing, Georgia Institute of Technology, Atlanta.
[5] J.N Tsitsiklis, "On the Convergence of Optimistic Policy Iteration", Journal of Machine Learning Research 3 , pp. 59-72, 2002.
[6] Leonid Kuvayev, "Model-Based Reinforcement Learning with an Approximate Learned Model", Master-s Thesis, Department of Computer Science, University of Massachusetts, 1997.
[7] L.P. Kaelbling, M.L. Littman and A.W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research 4, pp. 237-285,1996.
[8] R. Fierro and F.L.Lewis, "Control of a Non-holonomic Mobile Robot using Neural Networks", IEEE Transactions on Neural Networks, Vol. 9, No.4, July 1998.
[9] A.G. Barto, S.J. Bradtke and S.P. Singh, "Learning to Act Using Real- Time Dynamic Programming", Artificial Intelligence, 72(1), pp. 81-138, 1995.
[10] S.P. Singh & R.S. Sutton, " Reinforcement Learning with Replacing Eligibility Traces", Machine Learning, 22, pp.23-158, 1996.
[11] J.N. Tsiksiklis and B.V. Roy, " An Analysis of Temporal Difference Learning with Function Approximation", IEEE Transactions on Automatic Control 42(5), pp. 674- 690,1997.
[12] C.J.C.H. Watkins and P.Dayan, " Technical Note: Q-Learning", Machine Learning, 8(3/4), pp. 279-292, 1992.
[13] R.S.Sutton, " Learning to predict by the Methods of Temporal Differences", Machine Learning, 3, pp. 9-44,1988.
[14] C.J.C.H.Watkins, " Learning from Delayed Rewards", Ph.D. Thesis, Cambridge University, Cambridge, England, 1989.
[15] R.S.Sutton & A.G.Barto, " Reinforcement Learning: An introduction", MIT Press, Cambridge, Massachusetts, 1998.
[16] R.E.Bellman, "Dynamic Programming", Princeton University Press, Princeton NJ, 1957.
[17] P.Dayan, "The convergence of TD(λ) for general λ", Machine Learning, 8, pp. 341- 362,1992.