On Q-learning Convergence for Non-Markov Decision Processes

Sultan Javed Majeed, Marcus Hutter

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 2546-2552. https://doi.org/10.24963/ijcai.2018/353

Temporal-difference (TD) learning is an attractive, computationally efficient framework for model-free reinforcement learning. Q-learning is one of the most widely used TD learning techniques, enabling an agent to learn the optimal action-value function, i.e. the Q-value function. Despite its widespread use, Q-learning has only been proven to converge on Markov Decision Processes (MDPs) and Q-uniform abstractions of finite-state MDPs. On the other hand, most real-world problems are inherently non-Markovian: the full true state of the environment is not revealed by recent observations. In this paper, we investigate the behavior of Q-learning when applied to non-MDP and non-ergodic domains, which may have infinitely many underlying states. We prove that the convergence guarantee of Q-learning can be extended to a class of such non-MDP problems, in particular, to some non-stationary domains. We show that state-uniformity of the optimal Q-value function is a necessary and sufficient condition for Q-learning to converge even in the case of infinitely many internal states.
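
For concreteness, the algorithm whose convergence is studied is standard (Watkins) Q-learning. The sketch below is a minimal tabular illustration of its update rule with an epsilon-greedy behavior policy; the environment interface (reset, step, actions) is a hypothetical assumption for this sketch and is not part of the paper.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    `env` is an assumed interface with `reset() -> state`,
    `step(action) -> (next_state, reward, done)`, and a list `env.actions`.
    """
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection over the current estimates.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)

            # Off-policy TD target: bootstrap with the greedy value of s_next.
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
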
Keywords:
Machine Learning: Online Learning
Machine Learning: Reinforcement Learning
Planning and Scheduling: Markov Decision Processes