1 — 14:00 — Inexact Restoration for minimization with inexact evaluation of both the objective function and the constraints
In a recent paper, an Inexact Restoration method for solving continuous constrained optimization problems was analyzed from the point of view of worst-case functional complexity and convergence. In a separate line of research, the Inexact Restoration methodology was employed to handle minimization problems with inexact evaluations and simple constraints. The present report combines these two methodologies to address constrained minimization problems in which both the objective function and the constraints, as well as their derivatives, are subject to evaluation errors. Together with a complete description of the method, complexity and convergence results will be presented. Preliminary numerical results will also be discussed.
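As a sketch of the setting (our notation, not necessarily the authors'), the problem class can be written as
\[
  \min_{x \in \Omega} f(x) \quad \text{subject to} \quad h(x) = 0,
\]
where only inexact values \(\bar f(x)\) and \(\bar h(x)\) satisfying
\[
  |\bar f(x) - f(x)| \le \varepsilon_f, \qquad \|\bar h(x) - h(x)\| \le \varepsilon_h
\]
are available, with analogous error bounds assumed for the derivatives.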
2 — 14:30 — A Generalized Version of Chung’s Lemma and its Applications
Chung's lemma is a well-known tool for establishing asymptotic convergence rates of (stochastic) optimization methods under strong-convexity-type assumptions and appropriate polynomially diminishing step sizes. In this talk, we propose a generalized version of Chung's lemma, which provides a simple non-asymptotic convergence analysis framework for a more general family of step size rules. We demonstrate the broad applicability of the generalized Chung's lemma by deriving tight non-asymptotic convergence rates for a variety of stochastic approaches. Specifically, we obtain partially new non-asymptotic complexity results for stochastic optimization methods (e.g., SGD and Random Reshuffling) under a general (theta, mu)-Polyak-Lojasiewicz (PL) condition and various step size rules, including polynomial, constant, exponential, and cosine schedules. Notably, we observe that the exponential step size can adapt to the objective function's geometry, achieving the optimal convergence rate without requiring exact knowledge of the underlying landscape.
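For reference, one common Łojasiewicz-type formulation of such a condition, together with the step size schedules mentioned above, reads as follows (our notation; the talk's precise definitions may differ):
\[
  \|\nabla f(x)\| \ge \mu \,(f(x) - f^*)^{\theta}, \qquad \theta \in [1/2, 1),\ \mu > 0,
\]
where \(\theta = 1/2\) recovers the classical PL condition. Over a horizon of \(T\) iterations, typical schedules are
\[
  \alpha_t = \frac{\alpha_0}{(t+1)^p} \ \text{(polynomial)}, \quad
  \alpha_t = \alpha_0 \ \text{(constant)}, \quad
  \alpha_t = \alpha_0\,\beta^{t/T} \ \text{(exponential, } \beta \in (0,1)\text{)}, \quad
  \alpha_t = \frac{\alpha_0}{2}\Big(1 + \cos\frac{\pi t}{T}\Big) \ \text{(cosine)}.
\]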
3 — 15:00 — On the Trajectories of SGD Without Replacement
This talk is about the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than is typically considered in theoretical work on SGD: for example, we allow the product of the learning rate and the Hessian to be O(1), and we do not assume any particular model architecture, learning task, or loss (objective) function.
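To make the sampling distinction concrete, here is a minimal NumPy sketch of the two schemes on a toy least-squares loss (the names and the toy problem are ours, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

def sgd_without_replacement(w, grad, n, lr, epochs):
    # Random reshuffling: each epoch visits every sample exactly once,
    # in a freshly drawn random order.
    for _ in range(epochs):
        for i in rng.permutation(n):
            w = w - lr * grad(w, i)
    return w

def sgd_with_replacement(w, grad, n, lr, epochs):
    # I.i.d. sampling: each step draws an index uniformly at random, so
    # within an epoch some samples may repeat and others may be skipped.
    for _ in range(epochs * n):
        i = rng.integers(n)
        w = w - lr * grad(w, i)
    return w

# Toy least-squares problem: f_i(w) = 0.5 * (a_i . w - b_i)^2.
n, d = 32, 4
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad = lambda w, i: (A[i] @ w - b[i]) * A[i]  # per-sample gradient
w0 = np.zeros(d)
print(sgd_without_replacement(w0, grad, n, lr=0.1, epochs=50))
print(sgd_with_replacement(w0, grad, n, lr=0.1, epochs=50))

Both variants take the same number of stochastic steps per epoch; the point developed below is that the without-replacement trajectory is locally biased by an additional regularizer that the i.i.d. variant does not share.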
Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to taking an additional step on a novel regularizer. This implies that the trajectory of SGD without replacement diverges from both noise-injected GD and SGD with replacement (in which batches are sampled i.i.d.). Indeed, the two SGD variants traverse flat regions of the loss landscape in distinct directions and at different speeds. In expectation, SGD without replacement traverses flat areas and may escape saddle points significantly faster.
Moreover, we find that SGD implicitly regularizes the trace of the noise covariance in the eigendirections associated with small and negative Hessian eigenvalues. On several vision tasks, this coincides with penalizing a weighted trace of the Fisher matrix and the Hessian, thus encouraging sparsity in the spectrum of the Hessian of the loss, in line with empirical observations from prior work. We also propose an explanation for why SGD, unlike GD, does not train at the edge of stability.
4 — 15:30 — Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods
Over the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted considerable interest, owing to its strong performance in practice and the comparative lack of theoretical understanding. For Lipschitz convex functions, different works have established the optimal