1. 16:20 — Policy Mirror Descent Inherently Explores Action Space

  • Yan Li, Georgia Institute of Technology

Explicit exploration in the action space has been assumed to be indispensable for online policy gradient methods to avoid a drastic degradation in sample complexity when solving general reinforcement learning problems over finite state and action spaces. We show that this is not needed by introducing new policy evaluation operators used in the policy mirror descent update. A sample complexity of $O(1/\epsilon^2)$ is established. Notably, the evaluation operator uses no explicit exploration in estimating the state-action value function.
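
For context, a policy mirror descent update in its standard proximal form is sketched below; the evaluation operators of the talk supply the value-function estimate without explicit exploration, but their exact construction is not reproduced here.

$$\pi_{k+1}(\cdot \mid s) \in \operatorname*{argmin}_{p \in \Delta(\mathcal{A})} \Big\{ \eta_k \big\langle \widehat{Q}^{\pi_k}(s, \cdot),\, p \big\rangle + D\big(p,\ \pi_k(\cdot \mid s)\big) \Big\}, \qquad s \in \mathcal{S},$$

where $\widehat{Q}^{\pi_k}$ is an estimate of the state-action value function of $\pi_k$, $\eta_k$ is a stepsize, and $D$ is a Bregman divergence (e.g., the KL divergence); the update is written in the cost-minimization convention.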

2. 16:50 — Nonconvex Landscape of Policy Optimization for A Class of Finite Horizon Markov Decision Processes and Applications in Operations Models

  • Xin Chen, Georgia Institute of Technology

We explore policy gradient methods for finite horizon Markov Decision Processes (MDPs) with continuous state and action spaces. Policy gradient methods for MDPs do not converge to globally optimal solutions in general due to the non-convexity of the objective functions. We identify several easily verifiable conditions that guarantee the global Kurdyka-Łojasiewicz (KŁ) condition for the objectives of policy gradient optimization problems for a class of MDPs. This allows us to establish that the policy gradient optimization problems can be solved by first-order methods with a sample complexity on the order of $1/\epsilon$ and polynomial in the planning horizon length to attain $\epsilon$-globally optimal solutions. Our results find applications in a host of operations models, including the stochastic cash balance problem and multi-period inventory systems with Markov-modulated demand, giving the first sample complexity results in the literature for these problems.
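
For reference, a global Kurdyka-Łojasiewicz (gradient dominance) condition on an objective $F$ with optimal value $F^*$ is commonly stated as follows; the exponent and constant appropriate to this class of MDPs are part of the results themselves and are not reproduced here.

$$\|\nabla F(x)\| \;\ge\; c \,\bigl(F(x) - F^*\bigr)^{\theta} \quad \text{for all feasible } x,$$

for some constant $c > 0$ and exponent $\theta \in (0, 1)$; the case $\theta = 1/2$ recovers the Polyak-Łojasiewicz condition, under which first-order methods can reach $\epsilon$-globally optimal solutions despite nonconvexity of $F$.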

3. 17:20 — Adversarially Robust Optimal Control with Causal Transport Distance

We investigate stochastic optimal control problems in the presence of ambiguous disturbance distributions. To address this dynamic uncertainty, we propose a minimax distributionally robust formulation based on the $\infty$-causal transport distance. We develop a nested dynamic programming reformulation for this problem. To tackle the dynamic programming problem involving continuous states and controls, we devise a stochastic dual dynamic programming algorithm. Numerical experiments on energy planning demonstrate the effectiveness of our approach.
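
Schematically, a minimax distributionally robust control problem of this kind admits nested Bellman recursions of the generic form below; the precise stage-wise ambiguity sets induced by the $\infty$-causal transport distance are those developed in the work and are not reproduced here.

$$V_t(x_t) \;=\; \min_{u_t \in \mathcal{U}_t(x_t)} \ \max_{Q_t \in \mathcal{P}_t} \ \mathbb{E}_{\xi_t \sim Q_t}\!\left[\, c_t(x_t, u_t, \xi_t) + V_{t+1}\bigl(f_t(x_t, u_t, \xi_t)\bigr) \right], \qquad t = T, \dots, 1,$$

with $V_{T+1} \equiv 0$, where $c_t$ is the stage cost, $f_t$ the state transition, and $\mathcal{P}_t$ the ambiguity set for the stage disturbance $\xi_t$; a stochastic dual dynamic programming scheme then approximates each value function $V_t$ by cutting planes over the continuous states and controls.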