1. 08:30 — On Foundations of Distributionally Robust Reinforcement Learning

Motivated by the need for a robust policy in the face of environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered around distributionally robust Markov decision processes (DRMDPs). This framework obliges the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct DRMDPs that embrace various modeling attributes for both the decision maker and the adversary. These attributes include the granularity of adaptability, covering history-dependent, Markov, and Markov time-homogeneous dynamics for both the decision maker and the adversary. Additionally, we delve into the flexibility of the shifts induced by the adversary, examining SA-rectangularity and S-rectangularity. Within this DRMDP framework, we investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of a DPP has significant implications, as the vast majority of existing data- and computationally efficient RL algorithms rely on it. To study its existence, we comprehensively examine combinations of controller and adversary attributes, providing streamlined proofs grounded in a unified methodology. We also offer counterexamples for settings in which a DPP with full generality is absent.
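For reference, in the SA-rectangular, Markov time-homogeneous case, the DPP is usually stated as a robust Bellman equation of the following form (a standard rendering with illustrative notation, not necessarily the talk's exact statement; $\mathcal{P}_{s,a}$ denotes the ambiguity set of transition kernels at the state-action pair $(s,a)$ and $\gamma$ a discount factor):

$$ V^*(s) \;=\; \max_{a \in \mathcal{A}} \; \min_{p \in \mathcal{P}_{s,a}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s')\, V^*(s') \Big\}. $$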

2. 09:00 — Regularization for Adversarial Robust Learning

  • Jie Wang, Georgia Institute of Technology
  • Rui Gao, University of Texas at Austin
  • Yao Xie, Georgia Institute of Technology

Despite the growing prevalence of artificial neural networks in real-world applications, their vulnerability to adversarial attacks remains a significant concern, which motivates us to investigate the robustness of machine learning models. While various heuristics aim to optimize the distributionally robust risk under the $\infty$-Wasserstein metric, this notion of robustness frequently leads to computational intractability. To tackle this challenge, we develop a novel approach to adversarial training that integrates f-divergence regularization into the distributionally robust risk function. This regularization yields a notable computational improvement over the original formulation. We develop stochastic gradient methods with biased gradient oracles to solve this problem efficiently, and our convergence analysis shows that the algorithm has near-optimal sample complexity. Moreover, we establish the regularization effects of this formulation and show that it is asymptotically equivalent to a regularized empirical risk minimization (ERM) framework under various scaling regimes of the regularization parameter $\eta$ and robustness level $\rho$. These regimes yield gradient norm regularization, variance regularization, or a smoothed gradient norm regularization that interpolates between these extremes. We numerically validate the proposed method on supervised learning and reinforcement learning applications and showcase its state-of-the-art performance against various adversarial attacks.
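A minimal sketch of the kind of biased-oracle stochastic gradient step referred to above, assuming KL divergence as the f-divergence and approximating the $\infty$-Wasserstein ball by finitely many sampled perturbations; the linear model, function names, and constants are illustrative choices, not the authors' algorithm:

# Sketch: adversarial training with an f-divergence-regularized inner problem
# (KL divergence assumed). The hard max over perturbations in the epsilon-ball
# is replaced by a softmax weighting controlled by the regularization
# parameter eta, which gives a biased but cheap gradient oracle.
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, x, y):
    """Squared loss of a linear model and its gradient w.r.t. w (toy choice)."""
    resid = x @ w - y
    return 0.5 * resid ** 2, resid * x

def biased_robust_grad(w, x, y, eps=0.1, eta=0.5, num_samples=16):
    """Biased gradient oracle: softmax-weighted gradients over perturbations
    sampled inside the L-infinity ball of radius eps."""
    losses, grads = [], []
    for _ in range(num_samples):
        delta = rng.uniform(-eps, eps, size=x.shape)
        l, g = loss_and_grad(w, x + delta, y)
        losses.append(l)
        grads.append(g)
    losses = np.array(losses)
    weights = np.exp((losses - losses.max()) / eta)  # higher loss -> higher weight
    weights /= weights.sum()
    return sum(wt * g for wt, g in zip(weights, grads))

# Plain SGD on the regularized robust risk (data and step size are illustrative).
d, n = 5, 200
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)
for t in range(2000):
    i = rng.integers(n)
    w -= 0.01 * biased_robust_grad(w, X[i], y[i])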

3. 09:30 — Distributionally Robust Affine Markov Decision Processes

Markov Decision Processes (MDPs) serve as a powerful model for sequential decision-making under uncertainty. Despite their prominence, a key challenge in employing MDPs for large-scale planning and control applications is the significant estimation uncertainty in the model parameters when they are estimated from limited data. To mitigate its impact, we consider a robust approach that seeks policies yielding the highest average reward achievable uniformly over all state-transition models in an uncertainty set. Our primary contribution is to exhibit distributionally robust MDP models for which (i) the robust value function is simply an affine function of the present state and, as a result, (ii) optimal policies can be found by solving lower-dimensional recursions without suffering the curse of dimensionality. We achieve this for state-transition and reward models with an affine dependence on the present state-action pair, and for action spaces that are linearly constrained by the present state in a naturally decomposable manner. Consequently, these robust affine MDP models are a novel and expressive addition to the limited collection of MDP instances known to be simultaneously immune to both estimation uncertainty and dimension-related challenges.
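As a schematic illustration of property (i) (with hypothetical notation and a finite-horizon rendering rather than the average-reward criterion considered above), the affine structure means the robust value function admits the form

$$ V_t(s) \;=\; \alpha_t + \beta_t^\top s, $$

so that the robust recursion

$$ V_t(s) \;=\; \max_{a \in \mathcal{A}(s)} \; \min_{P \in \mathcal{U}} \Big\{ r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\alpha_{t+1} + \beta_{t+1}^\top s'\big] \Big\} $$

reduces, for affine transition and reward models, to low-dimensional recursions for the coefficients $(\alpha_t, \beta_t)$ rather than a recursion over the full state space, as claimed in (ii).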

4. 10:00 — An Imitation-Based Learning Approach Using DAgger for the Casual Employee Call Timing Problem

Predictive models are increasingly important in enhancing decision-making processes. This study proposes an innovative approach utilizing DAgger, an imitation learning algorithm, to iteratively train a policy for addressing stochastic sequential decision problems. These problems can be challenging, especially when expert input is costly or unavailable. Our focus lies in crafting an effective expert within the DAgger framework, drawing on deterministic solutions derived from contextual scenarios generated at each decision point. Subsequently, a predictive model is developed to mimic the expert's behavior, aiding real-time decision-making. To illustrate the applicability of this methodology, we address a dynamic employee call-timing problem concerning the scheduling of casual personnel for on-call work shifts. The key decision involves determining the optimal time to contact the next employee in seniority order, allowing them to select a preferred shift. Uncertainty arises from the varying response times of employees. The goal is to strike a balance between minimizing schedule changes induced by early notifications or calls and avoiding unassigned shifts due to late notifications. Unlike traditional predict-and-optimize approaches, our method uses optimization to train learning models that connect the system's current state to the expert's wait time. We apply our algorithm to data provided by our industrial partner to derive an operational policy. Results demonstrate the superiority of this policy over the heuristic method currently in use.
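As a rough illustration of the training loop described above (a minimal sketch: solve_expert_wait_time and simulate_step are hypothetical placeholders for the scenario-based deterministic expert and the call-timing environment, and the ridge regressor is an arbitrary stand-in for the predictive model; none of these names or modeling choices come from the talk):

# Sketch of a DAgger-style loop: roll out the current policy, query the
# scenario-based expert at every visited state, aggregate the labeled states,
# and retrain the predictive model that maps states to wait times.
import numpy as np
from sklearn.linear_model import Ridge

def solve_expert_wait_time(state, scenarios):
    """Placeholder: deterministic optimization over contextual scenarios
    returning the expert's recommended wait time before the next call."""
    return max(0.0, float(np.median(scenarios)) - state[0])  # illustrative rule only

def simulate_step(state, wait_time, rng):
    """Placeholder one-step transition of the call-timing environment."""
    response_delay = rng.exponential(scale=5.0)
    return state + np.array([wait_time + response_delay, -1.0])

rng = np.random.default_rng(0)
dataset_X, dataset_y = [], []
policy = Ridge(alpha=1.0)
policy_is_fit = False

for iteration in range(10):            # DAgger iterations
    state = np.array([0.0, 20.0])      # (elapsed time, shifts left), illustrative
    for step in range(20):             # roll out the current policy
        scenarios = rng.exponential(scale=5.0, size=50)  # contextual scenarios
        expert_wait = solve_expert_wait_time(state, scenarios)
        dataset_X.append(state.copy()) # aggregate visited state with expert label
        dataset_y.append(expert_wait)
        if policy_is_fit:
            wait = float(policy.predict(state.reshape(1, -1))[0])
        else:
            wait = expert_wait         # first iteration follows the expert
        state = simulate_step(state, wait, rng)
    policy.fit(np.array(dataset_X), np.array(dataset_y))
    policy_is_fit = True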