1. 14:00 — Bayesian Risk-Averse Reinforcement Learning

In recent years, reinforcement learning (RL) has become a powerful tool in fields such as autonomous driving and robotics. In RL, an agent continuously interacts with its environment by choosing actions, observing rewards, and transitioning to new states. However, interacting with the actual physical environment can be costly or even infeasible, which has led to the development of offline RL. In offline RL, the agent operates within a simulated training environment constructed from previously observed data to approximate the real environment where the policy will ultimately be deployed.

Directly deploying a policy learned in the training environment to the real environment can be risky because of model misspecification caused by the lack of historical data. To account for this misspecification, we adopt an infinite-horizon Bayesian risk MDP (BRMDP) formulation, which uses the Bayesian posterior to estimate the transition model and imposes a risk measure to account for model uncertainty. We derive an asymptotic normality result that characterizes the difference between the Bayesian risk value function and the original value function under the true, unknown distribution. The result indicates that the Bayesian risk-averse approach tends to pessimistically underestimate the original value function, and that this gap widens with stronger risk aversion but shrinks as more data become available.
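As a rough illustration of the risk-averse evaluation idea (our own sketch, not the algorithm from the talk), the code below runs a tabular value iteration in which the Bellman backup averages the worst posterior samples of the transition model, i.e., a CVaR over a Dirichlet posterior. The flat Dirichlet prior, the choice of CVaR as the risk measure, and all names (`counts`, `rewards`, `alpha`) are illustrative assumptions.

```python
import numpy as np

def bayesian_risk_value_iteration(counts, rewards, gamma=0.95, alpha=0.2,
                                  n_post=200, n_iter=500, seed=0):
    """Tabular value iteration with a CVaR risk measure over a Dirichlet
    posterior on the transition model (illustrative sketch).

    counts[s, a, s'] : observed transition counts (flat Dirichlet prior assumed)
    rewards[s, a]    : known reward table
    alpha            : CVaR level; smaller alpha means stronger risk aversion
    """
    rng = np.random.default_rng(seed)
    S, A, _ = counts.shape
    # Posterior samples of the transition kernel, shape (n_post, S, A, S).
    P_post = np.stack([
        np.stack([rng.dirichlet(counts[s, a] + 1.0, size=n_post)
                  for a in range(A)], axis=1)
        for s in range(S)
    ], axis=1)

    V = np.zeros(S)
    k_tail = max(1, int(np.ceil(alpha * n_post)))
    for _ in range(n_iter):
        EV = P_post @ V                      # expected next value per sample, (n_post, S, A)
        tail = np.sort(EV, axis=0)[:k_tail]  # worst alpha-fraction of posterior samples
        Q = rewards + gamma * tail.mean(axis=0)
        V = Q.max(axis=1)
    return V
```

Under this construction, shrinking `alpha` makes the backup more pessimistic, while more data concentrates the posterior and closes the gap, mirroring the qualitative behavior described above.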

2. 14:30 — Robust Regret Markov Decision Processes

Markov decision processes (MDPs) are a standard framework for reinforcement learning and sequential decision-making under uncertainty. However, a practical limitation of standard MDPs is their assumption of precise knowledge of model parameters, such as transition probabilities and rewards. This assumption becomes problematic in real-world scenarios, where estimation errors in these parameters can accumulate over time, leading to output policies that are sensitive to errors and may fail catastrophically when deployed. To address this challenge, robust MDPs have emerged as a popular alternative. Robust MDPs aim to optimize worst-case performance by accounting for ambiguity in the model parameters. While robust MDPs yield reliable policies even with limited data, they are often overly conservative. In this presentation, we propose to adopt the regret decision criterion to tackle this issue. We introduce a novel framework called robust regret MDPs, which optimizes stepwise regret under model ambiguity. By incorporating this criterion, we aim to strike a balance between policy reliability and the overly conservative nature of worst-case performance. The proposed robust regret MDPs can be solved by value iteration. Moreover, we present novel algorithms for efficiently computing the associated Bellman updates for large instances.
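To make the regret criterion concrete, here is a minimal brute-force sketch (our own illustration of minimax regret over a finite set of candidate models, not the stepwise-regret Bellman recursion from the talk): it enumerates deterministic policies on a small tabular MDP and picks the one minimizing the worst-case regret against each model's optimal value. All names are hypothetical.

```python
import itertools
import numpy as np

def policy_value(P, r, policy, gamma=0.9):
    """Discounted value of a deterministic policy under kernel P[s, a, s'] and reward r[s, a]."""
    S = r.shape[0]
    P_pi = P[np.arange(S), policy]   # (S, S) transitions under the policy
    r_pi = r[np.arange(S), policy]   # (S,) rewards under the policy
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def minimax_regret_policy(models, gamma=0.9):
    """Deterministic policy minimizing worst-case regret over candidate models [(P, r), ...]."""
    S, A = models[0][1].shape
    policies = list(itertools.product(range(A), repeat=S))
    values = {(i, pi): policy_value(P, r, np.array(pi), gamma)
              for i, (P, r) in enumerate(models) for pi in policies}
    # Optimal value function in each candidate model (pointwise best policy).
    opt = [np.max([values[(i, pi)] for pi in policies], axis=0)
           for i in range(len(models))]
    # Worst-case (over models and states) regret of a policy, then minimize it.
    def worst_regret(pi):
        return max(np.max(opt[i] - values[(i, pi)]) for i in range(len(models)))
    return min(policies, key=worst_regret)
```

This brute-force version scales exponentially in the number of states; the value iteration and efficient Bellman updates described in the talk are what make a regret-based criterion tractable for large instances.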

3. 15:00 — Distributionally Robust Path Integral Control

We consider a continuous-time, continuous-space stochastic optimal control problem in which the controller lacks exact knowledge of the underlying diffusion process and relies instead on a finite set of historical disturbance trajectories. When data collection is limited, the controller synthesized from empirical data may perform poorly. To address this issue, we introduce a novel approach named Distributionally Robust Path Integral (DRPI). The proposed method employs distributionally robust optimization (DRO) to robustify the resulting policy against the unknown diffusion process. Notably, the DRPI scheme resembles risk-sensitive control, which enables us to use the path integral control (PIC) framework as an efficient solution scheme. We derive theoretical performance guarantees for the DRPI scheme, which align closely with selecting a risk parameter in risk-sensitive control. We validate the efficacy of our scheme and demonstrate its superiority over risk-neutral PIC policies when the true diffusion process is unknown.
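For readers unfamiliar with path integral control, the following is a minimal, discrete-time Monte Carlo sketch of a vanilla PIC/MPPI-style update (it does not implement the DRPI robustification itself); the dynamics, cost, and the temperature/risk parameter `lam` are assumed placeholders.

```python
import numpy as np

def pic_control(x0, dynamics, cost, horizon, n_samples=256, lam=1.0,
                noise_std=1.0, seed=0):
    """One path-integral control update: sample noise sequences, roll out the
    (assumed) dynamics, and exponentially reweight by trajectory cost."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, noise_std, size=(n_samples, horizon))  # control noise
    costs = np.empty(n_samples)
    for k in range(n_samples):
        x, c = x0, 0.0
        for t in range(horizon):
            u = eps[k, t]              # zero nominal control plus exploration noise
            x = dynamics(x, u)
            c += cost(x, u)
        costs[k] = c
    w = np.exp(-(costs - costs.min()) / lam)  # softmin weights (shifted for stability)
    w /= w.sum()
    return w @ eps[:, 0]                      # cost-weighted first-step control

# Toy example: scalar single-integrator driving x toward 0.
dyn = lambda x, u: x + 0.1 * u
cst = lambda x, u: x**2 + 0.01 * u**2
u0 = pic_control(x0=1.0, dynamics=dyn, cost=cst, horizon=20)
```

In the spirit of the talk, greater distributional ambiguity around the empirically estimated disturbance model would correspond to a more conservative choice of this temperature/risk parameter.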

4. 15:30 — Towards Optimal Offline Reinforcement Learning

We study offline reinforcement learning problems with a long-run average reward objective. Under any fixed behavioral policy, the generated state-action pairs follow a Markov chain, and the empirical state-action-next-state distribution satisfies a large deviations principle. We use the rate function of this large deviations principle to construct an uncertainty set for the unknown true state-action-next-state distribution. We also construct a distribution shift transformation that maps any distribution in this uncertainty set to a state-action-next-state distribution of the Markov chain generated by a fixed evaluation policy, which may differ from the behavioral policy. We prove that the worst-case average reward of the evaluation policy with respect to all distributions in the shifted uncertainty set provides, in a rigorous statistical sense, the least conservative estimator for the average reward under the unknown true distribution. This guarantee holds even if one only has access to a single trajectory of serially correlated state-action pairs. The resulting robust optimization problem can be viewed as a robust Markov decision process with a non-rectangular uncertainty set. We develop an efficient first-order algorithm for solving such problems. Numerical experiments show that our methods compare favorably against state-of-the-art methods.
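As a point of reference only, the sketch below builds the empirical state-action-next-state counts from a single trajectory and applies the distribution-shift step to obtain a plug-in (non-robust) average-reward estimate for an evaluation policy. The worst case over the rate-function uncertainty set and the first-order algorithm from the talk are not reproduced, the chain induced by the evaluation policy is assumed irreducible, and all names are illustrative.

```python
import numpy as np

def empirical_counts(trajectory, S, A):
    """Counts N[s, a, s'] from a single trajectory of (state, action) pairs."""
    N = np.zeros((S, A, S))
    for (s, a), (s_next, _) in zip(trajectory[:-1], trajectory[1:]):
        N[s, a, s_next] += 1.0
    return N

def plug_in_average_reward(N, rewards, eval_policy):
    """Plug-in long-run average reward of an evaluation policy.

    N[s, a, s']        : empirical transition counts from the behavioral data
    rewards[s, a]      : reward table
    eval_policy[s, a]  : evaluation policy (may differ from the behavioral one)
    """
    S, A, _ = N.shape
    row_sums = N.sum(axis=2, keepdims=True)
    P_hat = np.where(row_sums > 0, N / np.maximum(row_sums, 1.0), 1.0 / S)
    # Distribution shift: Markov chain induced by the evaluation policy on P_hat.
    P_pi = np.einsum('sa,sat->st', eval_policy, P_hat)
    # Stationary distribution of P_pi (left eigenvector for eigenvalue 1).
    w, v = np.linalg.eig(P_pi.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    mu = np.abs(mu) / np.abs(mu).sum()
    return float(mu @ (eval_policy * rewards).sum(axis=1))
```

The estimator described in the talk replaces this plug-in step with a worst case over all distributions in the shifted, rate-function-based uncertainty set, which is what yields the statistical guarantee from a single correlated trajectory.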