1. 14:00 — Relative Monte Carlo for Reinforcement Learning

We introduce relative Monte Carlo (rMC), a new general-purpose policy gradient algorithm for reinforcement learning with discrete action spaces. The policy is improved in real time using relative returns between a root sample path and counterfactual simulated paths, instantiated by taking a different action from the root. The method is compatible with any differentiable policy, including the leading choice of neural network parametrization. It is guaranteed to converge for episodic and average-reward tasks. Unlike traditional Monte Carlo, rMC policy gradient steps are performed throughout the rollout using a memory-efficient update decomposition inspired by eligibility traces. Strong couplings between root and counterfactual paths further reduce data-generation and memory requirements. We test rMC with a policy network on a two-tiered inventory fulfillment problem. Numerical results show it performs well compared to related algorithms.
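To make the relative-return idea concrete, here is a minimal Python sketch (my own illustration, not the authors' rMC implementation). A toy episodic environment is deep-copied at the root so a counterfactual rollout can branch off with the other action, and the difference between the two coupled returns replaces the usual baseline in a REINFORCE-style update. The environment, tabular softmax policy, and constants are all hypothetical, and the sketch only updates at the root state rather than throughout the rollout as rMC does.

```python
# Minimal sketch of a relative-return policy gradient step (hypothetical
# environment and constants; not the authors' rMC implementation).
import copy
import numpy as np

rng = np.random.default_rng(0)

class ChainEnv:
    """Toy 10-step chain: action 1 moves right, reward 1 while at the last state."""
    def __init__(self):
        self.s, self.t = 0, 0
    def step(self, a):
        self.t += 1
        self.s = min(self.s + 1, 4) if a == 1 else 0
        return self.s, float(self.s == 4), self.t >= 10

def rollout(env, theta, first_action):
    """Play an episode to the end, forcing the first action (the branch point)."""
    G, done, a = 0.0, False, first_action
    while not done:
        _, r, done = env.step(a)
        G += r
        p = np.exp(theta[env.s] - theta[env.s].max()); p /= p.sum()
        a = int(rng.choice(2, p=p))          # subsequent actions follow the policy
    return G

theta = np.zeros((5, 2))                     # tabular softmax policy: 5 states, 2 actions
for episode in range(500):
    env = ChainEnv()
    s = env.s
    p = np.exp(theta[s] - theta[s].max()); p /= p.sum()
    a = int(rng.choice(2, p=p))              # root action
    branch = copy.deepcopy(env)              # coupled counterfactual path: same start state
    G_root = rollout(env, theta, first_action=a)
    G_cf = rollout(branch, theta, first_action=1 - a)
    grad_log = -p; grad_log[a] += 1.0        # grad of log pi(a|s) w.r.t. the logits
    theta[s] += 0.1 * (G_root - G_cf) * grad_log   # relative return replaces a baseline
```

Because both paths start from an identical copy of the environment (and here the dynamics are deterministic), the return difference isolates the effect of the root action, which is the kind of coupling the abstract refers to.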

2. 14:30 — Structured Inverse-free Matrix Adaptive Methods for Large Neural Networks

Optimization is an essential ingredient of deep learning. Many optimization problems can be reformulated from a probabilistic perspective to exploit the Fisher-Rao geometric structure of a probability family. In this talk, we show how to design new quasi-Newton methods for large-scale neural network (NN) training by leveraging this geometric structure. We first establish a second-order view of adaptive methods such as RMSProp and full-matrix AdaGrad when the square root is removed from their preconditioned gradient step. We then introduce and exploit preconditioner invariance to make such square-root-free (non-diagonal) adaptive methods inverse-free while maintaining their preconditioner structures for modern low-precision training with mini-batches. Finally, we propose Kronecker-factored adaptive methods without the root to bridge the computation gap between non-diagonal and diagonal adaptive methods. By removing numerically unstable matrix square-root decompositions and inversions, our methods show advantages over square-root-based counterparts such as Shampoo for training large NNs, including Transformers and Mambas, in half-precision.
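The hedged sketch below illustrates two of these ingredients on a toy quadratic; it is an illustration of the general idea, not the talk's method. Removing the square root from full-matrix AdaGrad yields a Newton-like step d = S^{-1} g with S = eps*I + sum of g g^T, and because S grows by rank-one terms, S^{-1} can be maintained directly via the Sherman-Morrison identity, so no matrix inversion or square-root decomposition is ever executed (the operations that become unstable in low precision). All names and constants are hypothetical; the Kronecker-factored methods in the talk replace the single full matrix with per-layer factors.

```python
# Hedged sketch: a square-root-free, inverse-free full-matrix adaptive step on a
# toy quadratic (illustration of the ingredients, not the talk's method).
import numpy as np

rng = np.random.default_rng(1)
d = 10
A = rng.normal(size=(d, d)); A = A @ A.T / d + np.eye(d)   # SPD quadratic 0.5*w'Aw - b'w
b = rng.normal(size=d)
w = np.zeros(d)
S_inv = np.eye(d) / 1e-3        # inverse of the damping term eps*I, eps = 1e-3

for t in range(200):
    g = A @ w - b               # exact gradient of the quadratic
    # S_t = S_{t-1} + g g^T is a rank-one update, so its inverse is maintained
    # directly (Sherman-Morrison): no inversion or square-root decomposition runs.
    Sg = S_inv @ g
    S_inv -= np.outer(Sg, Sg) / (1.0 + g @ Sg)
    w -= 0.5 * (S_inv @ g)      # square-root-free (Newton-like) preconditioned step

print("distance to optimum:", np.linalg.norm(w - np.linalg.solve(A, b)))  # check only
```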

3. 15:00 — Target-based Surrogates for Stochastic Optimization

We consider minimizing functions for which it is expensive to compute the (possibly stochastic) gradient. Such functions are prevalent in reinforcement learning, imitation learning, and adversarial training. Our target optimization framework uses the (expensive) gradient computation to construct surrogate functions in a target space (e.g., the logits output by a linear model for classification) that can be minimized efficiently. This allows for multiple parameter updates to the model, amortizing the cost of gradient computation. In the full-batch setting, we prove that our surrogate is a global upper bound on the loss and can be (locally) minimized using a black-box optimization algorithm. We prove that the resulting majorization-minimization algorithm ensures convergence to a stationary point of the loss. Next, we instantiate our framework in the stochastic setting and propose the SSO algorithm, which can be viewed as projected stochastic gradient descent in the target space. This connection enables us to prove theoretical guarantees for SSO when minimizing convex functions. Our framework allows the use of standard stochastic optimization algorithms to construct surrogates, which can then be minimized by any deterministic optimization method. To evaluate our framework, we consider a suite of supervised learning and imitation learning problems. Our experiments indicate the benefits of target optimization and the effectiveness of SSO.
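As a concrete (hypothetical) instance of the framework, the sketch below uses multiclass logistic regression, where the target space is the logit matrix z = XW. The expensive softmax cross-entropy gradient is computed once with respect to z, a local quadratic surrogate is built around the current outputs, and several cheap parameter steps are taken on the surrogate before the next true gradient is paid for. The data, step sizes, and loop counts are illustrative only.

```python
# Hedged sketch of target-space surrogate minimization for multiclass logistic
# regression (my reading of the idea; names and constants are hypothetical).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(128, 20))
y = rng.integers(0, 3, size=128)
W = np.zeros((20, 3))                       # linear model: targets are logits z = X @ W
n, eta = len(y), 1.0                        # eta: step size / curvature in target space

def loss_and_grad_z(z, y):
    """'Expensive' loss: mean softmax cross-entropy and its gradient w.r.t. z."""
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
    g = p.copy(); g[np.arange(len(y)), y] -= 1.0
    return -np.log(p[np.arange(len(y)), y]).mean(), g / len(y)

for outer in range(20):
    z0 = X @ W
    L, gz = loss_and_grad_z(z0, y)          # ONE expensive gradient per outer step
    for inner in range(10):                 # many cheap steps on the surrogate
        z = X @ W
        # surrogate: <gz, z - z0> + ||z - z0||^2 / (2 * eta * n); its W-gradient:
        gW = X.T @ (gz + (z - z0) / (eta * n))
        W -= 0.1 * gW
    if outer % 5 == 0:
        print(f"outer {outer}: loss {L:.3f}")
```

The inner loop is where the amortization happens: ten parameter updates are made for each evaluation of the true loss gradient.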

4. 15:30 — Stochastic Extragradient Methods: New Convergence Guarantees for Monotone and Non-monotone Variational Inequalities

Stochastic extragradient methods (SEG) have garnered significant interest in recent years and rank among the most efficient algorithms for solving the large-scale min-max optimization and variational inequality problems (VIPs) that occur in various machine learning tasks. Despite their undeniable popularity, current convergence analyses of SEG require strong assumptions such as bounded variance or growth conditions. Moreover, several important questions regarding the convergence properties of these methods remain unanswered, including the effects of mini-batching, efficient step-size selection, and convergence guarantees under various sampling strategies. In this talk, we address these questions and provide novel convergence guarantees for several variants of SEG, including Same-sample SEG (S-SEG), Stochastic Past Extragradient (SPEG), and SEG with Random Reshuffling (SEG-RR). We complement our convergence guarantees with experiments that empirically validate our theoretical findings.
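For concreteness, here is a hedged sketch of the same-sample variant on a toy stochastic bilinear saddle-point problem min_x max_y x^T B_i y; the problem, matrices, and step size are my own choices for illustration. The defining feature of S-SEG is that the extrapolation step and the update step reuse the same drawn sample.

```python
# Hedged sketch of same-sample stochastic extragradient (S-SEG) on a toy
# bilinear saddle-point problem (illustrative setting, not the talk's).
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 5
Bs = rng.normal(size=(n, d, d)) + np.eye(d)      # per-sample bilinear matrices B_i
x, y = np.ones(d), np.ones(d)

def F(B, x, y):
    """Monotone operator of min_x max_y x^T B y:  F(x, y) = (B y, -B^T x)."""
    return B @ y, -B.T @ x

gamma = 0.05
for t in range(2000):
    B = Bs[rng.integers(n)]                      # draw ONE sample per iteration
    gx, gy = F(B, x, y)                          # extrapolation step ...
    xh, yh = x - gamma * gx, y - gamma * gy
    gx, gy = F(B, xh, yh)                        # ... and update step use the SAME sample
    x, y = x - gamma * gx, y - gamma * gy

print("||x|| =", np.linalg.norm(x), " ||y|| =", np.linalg.norm(y))
```

In this toy problem every sample shares the solution (0, 0), so the stochastic noise vanishes at the optimum and the iterates converge to it; this interpolation-like structure is the kind of setting where same-sample analyses give the cleanest guarantees.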