
In this thesis, we study the application of a class of stochastic primal-dual algorithms to reinforcement learning. The method takes advantage of the min-max formulation of the Bellman equation and jointly learns the optimal value function and the optimal policy through a series of primal-dual updates. We study both the online reinforcement learning problem and the simulation problem, and present several variants of the method together with their theoretical guarantees.

In Chapter 2, we investigate the Infinite-Horizon Average-Reward Markov Decision Process (AMDP) and propose an online primal-dual algorithm that learns the optimal policy by actively making decisions along a single state trajectory. Our results appear to be the first duality-based value-policy gradient method and regret analysis for infinite-horizon RL. We prove an O(√T) regret upper bound that also depends on an ergodicity parameter of the AMDP, which can be significantly smaller than existing diameter-dependent bounds in some cases.

In Chapter 3, we propose a class of utility maximization problems as a generalization of AMDPs that incorporates nonlinearity into the objective. We develop a primal-dual algorithm to solve the proposed problems and show that it achieves superior sample complexity compared to previous duality-based algorithms when the problem reduces to an AMDP.

Chapter 4 presents a general primal-dual framework for solving the utility maximization problem with function approximation. We also study a specific class of linear/bilinear models and prove that the adjusted algorithm achieves a comparable convergence rate.
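To illustrate the kind of primal-dual update the abstract describes, here is a minimal sketch for a tabular AMDP, built on the linear-programming (min-max) form of the Bellman equation: the dual occupancy measure is updated by a multiplicative-weights ascent step on the advantage, and the primal value function by a descent step on the flow-conservation residual. The tiny two-state MDP, step sizes, and iteration count below are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

# Illustrative 2-state, 2-action AMDP (an assumption for this sketch):
# action 0 always moves to state 0 and pays reward 1; action 1 always
# moves to state 1 and pays reward 0. The optimal average reward is 1.
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] transition kernel
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
r = np.zeros((n_states, n_actions))
r[:, 0] = 1.0

# Primal variable: differential value function v.
# Dual variable: state-action occupancy measure mu (distribution over (s, a)).
v = np.zeros(n_states)
mu = np.full((n_states, n_actions), 1.0 / (n_states * n_actions))

eta_v, eta_mu, T = 0.05, 0.05, 4000
mu_avg = np.zeros_like(mu)

for t in range(T):
    # Dual ascent (multiplicative-weights / mirror step) on the Lagrangian:
    # advantage(s, a) = r(s, a) + sum_{s'} P(s'|s, a) v(s') - v(s).
    adv = r + P @ v - v[:, None]
    mu = mu * np.exp(eta_mu * adv)
    mu /= mu.sum()

    # Primal descent on v: the gradient is the flow-conservation residual
    # g(s) = inflow(s) - outflow(s) under the current occupancy mu.
    inflow = np.einsum('sa,sap->p', mu, P)
    outflow = mu.sum(axis=1)
    v -= eta_v * (inflow - outflow)

    # Average the dual iterates over the second half of the run.
    if t >= T // 2:
        mu_avg += mu

mu_avg /= T - T // 2
rho_hat = float((mu_avg * r).sum())  # estimated optimal average reward
policy = mu_avg / mu_avg.sum(axis=1, keepdims=True)  # induced policy
```

On this toy instance the averaged occupancy concentrates on the reward-1 action, so the recovered policy is optimal and `rho_hat` approaches 1; the thesis's algorithms replace the exact gradients above with stochastic estimates from sampled transitions.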
Page Count: 126
Publication Date: 2021-01-01