1 — 14:00 — Why Line-Search When You Can Plane-Search?
The practical performance of an optimization method depends on details such as using good step sizes. Strategies for setting step sizes are generally limited to hyperparameter tuning (for a fixed step size), step size schedules, and line searches. For many common machine learning problems, line optimization and subspace optimization find accurate step sizes at asymptotically the same cost as using a fixed step size. In some cases, line optimization can find step sizes that are ruled out by the standard Armijo condition. For optimization methods that use multiple search directions, such as gradient descent with momentum, using subspace optimization instead of fixed step size schedules allows for better adaptivity and potentially faster convergence. For some neural networks, subspace optimization also allows different step sizes for different layers, which could reduce the training time needed as well as the dependence on hyperparameter tuning.
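A minimal sketch of the subspace (plane) search idea, assuming a quadratic objective f(x) = 0.5 x^T A x - b^T x and standard NumPy (an illustration, not the speaker's code): each iteration exactly minimizes f over the plane spanned by the negative gradient and the momentum direction, which for a quadratic reduces to a 2x2 linear solve, so its cost is dominated by the same matrix-vector products a fixed-step method would perform.

import numpy as np

def plane_search_step(A, b, x, d1, d2):
    # Exactly minimize f(x + alpha*d1 + beta*d2) for f(x) = 0.5 x^T A x - b^T x.
    D = np.column_stack([d1, d2])                     # the search plane
    G = D.T @ A @ D                                   # 2x2 reduced Hessian
    g = D.T @ (A @ x - b)                             # reduced gradient
    coeffs = np.linalg.lstsq(G, -g, rcond=None)[0]    # optimal (alpha, beta)
    return x + D @ coeffs

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
A = A @ A.T + np.eye(20)                              # symmetric positive definite Hessian
b = rng.standard_normal(20)
x_prev = x = np.zeros(20)

for _ in range(25):
    grad = A @ x - b
    momentum = x - x_prev                             # zero on the first iteration; lstsq handles it
    x_prev, x = x, plane_search_step(A, b, x, -grad, momentum)

print("final gradient norm:", np.linalg.norm(A @ x - b))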
2 — 14:30 — Searching for optimal per-coordinate step-sizes with multidimensional backtracking
The backtracking line-search is an effective technique to automatically tune the step-size in smooth optimization. It guarantees performance similar to that of the theoretically optimal step-size. Many approaches have been developed to instead tune per-coordinate step-sizes, also known as diagonal preconditioners, but none of the existing methods are provably competitive with the optimal per-coordinate step-sizes. We propose multidimensional backtracking, an extension of the backtracking line-search to find good diagonal preconditioners for smooth convex problems. Our key insight is that the gradient with respect to the step-sizes, also known as the hyper-gradient, yields separating hyperplanes that let us search for good preconditioners using cutting-plane methods. As black-box cutting-plane approaches like the ellipsoid method are computationally prohibitive, we develop an efficient algorithm tailored to our setting. Multidimensional backtracking is provably competitive with the best diagonal preconditioner and requires no manual tuning.
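A small illustration of the key ingredient, assuming only a gradient oracle and elementwise step sizes p (this sketches the hyper-gradient computation, not the paper's full cutting-plane algorithm): by the chain rule, the gradient of f(x - p * grad f(x)) with respect to p is -grad f(x) * grad f(x - p * grad f(x)), taken elementwise, and for convex f this vector is the normal of a halfspace of candidate preconditioners that cannot give a lower one-step loss from x.

import numpy as np

def hypergradient(grad_f, x, p):
    # Gradient of f(x - p * grad_f(x)) with respect to the per-coordinate
    # step sizes p (chain rule); * is the elementwise product.
    g = grad_f(x)
    return -g * grad_f(x - p * g)

# Toy example: a badly scaled quadratic, where a diagonal preconditioner helps.
scales = np.array([100.0, 1.0, 0.01])
grad_f = lambda x: scales * x            # gradient of f(x) = 0.5 * sum(scales * x**2)

x = np.ones(3)
p = np.full(3, 0.5)                      # current candidate per-coordinate step sizes
h = hypergradient(grad_f, x, p)
print("hyper-gradient:", h)
# For convex f, any candidate p_new with h @ (p_new - p) >= 0 yields at least as
# large a one-step loss from x as p does, so that halfspace can be cut away.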
3 — 15:00 — When Online Learning Meets Stochastic Calculus
Online learning is a theoretical framework for learning and optimization under adversarial circumstances. Optimization methods developed in this setting are usually robust and come with formal notions of beyond-worst-case (or adaptive) performance. However, the development and analysis of such methods can often be challenging and unintuitive. A recent line of work has looked at online learning through the lens of differential equations and continuous-time analysis. This viewpoint has yielded optimal results for several problems in online learning. In this talk I will give an overview of a few uses of stochastic calculus as a guide in the design and analysis of online learning methods, with a particular focus on the fundamental problem of prediction with experts' advice.
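For context, a sketch of the classical exponential-weights forecaster for prediction with experts' advice, the problem the talk centers on (this is the standard discrete-time baseline with the textbook learning-rate tuning, not the stochastic-calculus-based methods of the talk):

import numpy as np

def exponential_weights(loss_matrix, eta):
    # loss_matrix: (T, n) array of losses in [0, 1] for n experts over T rounds.
    T, n = loss_matrix.shape
    cum_losses = np.zeros(n)
    total_loss = 0.0
    for t in range(T):
        weights = np.exp(-eta * cum_losses)
        probs = weights / weights.sum()          # distribution over experts
        total_loss += probs @ loss_matrix[t]     # expected loss this round
        cum_losses += loss_matrix[t]
    return total_loss - cum_losses.min()         # regret vs. best expert in hindsight

rng = np.random.default_rng(0)
losses = rng.random((1000, 10))
eta = np.sqrt(8 * np.log(10) / 1000)             # classical tuning for T rounds, n experts
print("regret:", exponential_weights(losses, eta))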
4 — 15:30 — Don't be so Monotone: Relaxing Stochastic Line Search in Over-Parameterized Models
Recent works have shown that line search methods can speed up SGD and Adam in modern over-parameterized settings. However, existing line searches may take steps that are smaller than necessary since they require a monotone decrease of the (mini-)batch objective function. We explore nonmonotone line search methods to relax this condition and possibly accept larger step sizes. Despite the lack of a monotone decrease, we prove the same fast rates of convergence as in the monotone case. These are the first rates proving the convergence of nonmonotone line search methods in the stochastic setting under interpolation. Our experiments show that nonmonotone methods improve the speed of convergence and generalization properties of SGD/Adam even beyond the previous monotone line searches. We propose a POlyak NOnmonotone Stochastic (PoNoS) method, obtained by combining a nonmonotone line search with a Polyak initial step size. Furthermore, we develop a new resetting technique that, in the majority of the iterations, reduces the number of backtracks to zero while still maintaining a large initial step size. Finally, a first runtime comparison (to the best of our knowledge) shows that the epoch-wise advantage of line-search-based methods is also reflected in the overall computational time.
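A generic sketch of the two ingredients on a toy interpolating least-squares problem (the exact acceptance rule, window size, and resetting technique of PoNoS may differ): a Polyak-type initial step size (f(x) - f*) / ||grad f(x)||^2 with f* = 0 under interpolation, combined with a backtracking test that compares against the maximum loss over a recent window instead of requiring a monotone decrease.

import numpy as np
from collections import deque

def nonmonotone_polyak_step(f, grad, x, history, c=0.5, beta=0.8, f_star=0.0):
    g = grad(x)
    g_norm2 = g @ g
    eta = (f(x) - f_star) / (g_norm2 + 1e-12)     # Polyak initial step size
    reference = max(history)                      # nonmonotone reference value
    while f(x - eta * g) > reference - c * eta * g_norm2:
        eta *= beta                               # backtrack
    x_new = x - eta * g
    history.append(f(x_new))
    return x_new

# Toy usage on a least-squares objective where interpolation holds (min loss = 0).
rng = np.random.default_rng(0)
A, x_true = rng.standard_normal((50, 10)), rng.standard_normal(10)
b = A @ x_true
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b) / len(b)

x = np.zeros(10)
history = deque([f(x)], maxlen=10)                # window for the nonmonotone test
for _ in range(100):
    x = nonmonotone_polyak_step(f, grad, x, history)
print("final loss:", f(x))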