108:30 — Global-Local Smoothness: Line Search can really help! Really!

Iteration complexities for first-order optimization algorithms are typically stated in terms of a global Lipschitz constant of the gradient, and optimal results can typically be achieved using fixed step sizes. But many objective functions that arise in practice have regions with small Lipschitz constants where larger step sizes can be used. Many local Lipschitz assumptions have thus been proposed, which lead to results showing that adaptive step sizes and/or line searches yield improved convergence rates over fixed step sizes. However, these faster rates tend to depend on the path taken by the algorithm, which makes it difficult to compare the iteration complexities of different methods.

We consider a simple characterization of global and local smoothness that only depends on properties of the function. This allows explicit lower and upper bounds on iteration complexities in terms of problem-dependent constants, which allows us to compare iteration complexities between algorithms. Under this assumption it is straightforward to show the advantages of line searches over fixed step sizes, of line searches with unbounded step sizes over those that bound the maximum step size, and even that in some settings gradient descent with line search has a better iteration complexity than accelerated gradient methods with fixed step sizes.

209:00 — What is the right extension of Polyak's stepsize to mirror descent?

We investigate the convergence of stochastic mirror descent (SMD) under interpolation in relatively smooth and smooth convex optimization. In relatively smooth convex optimization we provide new convergence guarantees for SMD with a constant stepsize. For smooth convex optimization we propose a new adaptive stepsize scheme --- the mirror stochastic Polyak stepsize (mSPS) . We show convergence to a neighborhood that vanishes under interpolation. Consequently, these results correspond to the first convergence guarantees under interpolation for the exponentiated gradient algorithm for fixed or adaptive stepsizes. Although mSPS provides one generalization of the Polyak stepsize, it may not be best suited for relatively smooth problems, we discuss potential alternatives that be more compatible with relative smoothness assumptions.

309:30 — Scaling Law: Finding Compute Optimal Curves on a Simple Model with SGD

We describe a program of analysis of stochastic gradient descent on a high-dimensional least squares problem with power law random features. First, we give description of these loss curves that can be analyzed precisely. This is a simple two-hyperparameter family of optimization problems, which displays 5 distinct phases of loss curves; these phases are determined by the relative complexities of the target, data distribution, and whether these are ‘high-dimensional’ or not (which in context can be precisely defined). In each phase, we also give, for a given fixed compute budget, the optimal random-feature dimensionality. In this sense, we predict the best allocation of parameters for a given compute budget.

Joint work with Elliot Paquette (McGill), Jeffrey Pennington (Google Deepmind), and Lechao Xiao (Google Deepmind).

410:00 — ** CANCELLED ** Generalized Polyak Step Size for First Order Optimization with Momentum

In machine learning applications, it is well known that carefully designed learning rate (step size) schedules can significantly improve the convergence of commonly used first-order optimization algorithms. Therefore how to set step size adaptively becomes an important research question. A popular and effective method is the Polyak step size, which sets step size adaptively for gradient descent or stochastic gradient descent without the need to estimate the smoothness parameter of the objective function. However, there has not been a principled way to generalize the Polyak step size for algorithms with momentum accelerations. This paper presents a general framework to set the learning rate adaptively for first-order optimization methods with momentum, motivated by the derivation of Polyak step size. It is shown that the resulting techniques are much less sensitive to the choice of momentum parameter and may avoid the oscillation of the heavy-ball method on ill-conditioned problems. These adaptive step sizes are further extended to the stochastic settings, which are attractive choices for stochastic gradient descent with momentum. Our methods are demonstrated to be more effective for stochastic gradient methods than prior adaptive step size algorithms in large-scale machine learning tasks.