1 — 14:00 — Inexact restoration for minimization with inexact evaluation both of the objective function and the constraints
In a recent paper, an Inexact Restoration method for solving continuous constrained optimization problems was analyzed from the point of view of worst-case functional complexity and convergence. In a separate line of research, the Inexact Restoration methodology was employed to handle minimization problems with inexact evaluations and simple constraints. The present report combines these two methodologies to address constrained minimization problems in which both the objective function and the constraints, as well as their derivatives, are subject to evaluation errors. Together with a complete description of the method, complexity and convergence results will be presented. Preliminary numerical results will also be discussed.
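As a rough illustration of the two-phase structure underlying Inexact Restoration, the sketch below alternates a restoration (feasibility) phase with an optimization phase, where objective and constraint values are only available up to a tolerance eps. All names, the finite-difference gradient, and the merit parameter are illustrative assumptions, not the authors' actual scheme.

```python
import numpy as np

def ir_iteration(x, f, c, jac_c, eps, t=0.5):
    """One illustrative Inexact Restoration iteration (hypothetical sketch).

    f(x, eps) and c(x, eps) return the objective value and the constraint
    vector with evaluation error at most eps; jac_c(x, eps) returns an
    eps-accurate constraint Jacobian.
    """
    # Restoration phase: reduce the infeasibility ||c(x)|| with one
    # Gauss-Newton step computed from inexact constraint values.
    J = jac_c(x, eps)
    y = x - np.linalg.lstsq(J, c(x, eps), rcond=None)[0]

    # Optimization phase: decrease the (inexact) objective while moving
    # tangentially to the constraints; the gradient is approximated by
    # forward differences of the inexact objective.
    h = 1e-6
    g = np.array([(f(y + h * e, eps) - f(y, eps)) / h for e in np.eye(len(y))])
    Jy = jac_c(y, eps)
    P = np.eye(len(y)) - np.linalg.pinv(Jy) @ Jy  # projector onto null(Jy)
    z = y - 0.1 * P @ g

    # Merit test: accept the trial point only if a convex combination of
    # objective value and infeasibility decreases; otherwise fall back
    # to the restored point.
    merit = lambda v: t * f(v, eps) + (1 - t) * np.linalg.norm(c(v, eps))
    return z if merit(z) < merit(x) else y
```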
2 — 14:30 — A Generalized Version of Chung’s Lemma and its Applications
Chung's lemma is a well-known tool for establishing asymptotic convergence rates of (stochastic) optimization methods under strong convexity-type assumptions and appropriately chosen polynomially diminishing step sizes. In this talk, we propose a generalized version of Chung's lemma, which provides a simple non-asymptotic convergence analysis framework for a more general class of step size rules. We demonstrate the broad applicability of the generalized Chung's lemma by deriving tight non-asymptotic convergence rates for a variety of stochastic approaches. Specifically, we obtain partially new non-asymptotic complexity results for stochastic optimization methods (e.g., SGD and Random Reshuffling) under a general $(\theta, \mu)$-Polyak-Łojasiewicz (PL) condition and various step size rules, including polynomial, constant, exponential, and cosine step sizes. Notably, we observe that the exponential step size can adapt to the geometry of the objective function, achieving the optimal convergence rate without requiring exact knowledge of the underlying landscape.
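For concreteness, the step size families referenced above can be written as follows. This is a minimal sketch with placeholder constants, not the exact schedules or parameter choices used in the analysis.

```python
import math

# Illustrative step size rules of the kinds discussed in the talk;
# alpha, p, q, and the horizon T are placeholder constants.
def polynomial(t, alpha=1.0, p=0.5):
    """alpha / (t + 1)^p: the classical polynomially diminishing step."""
    return alpha / (t + 1) ** p

def constant(t, alpha=0.01):
    """A fixed step size, independent of the iteration counter t."""
    return alpha

def exponential(t, alpha=1.0, T=1000):
    """alpha * q^t with q chosen so the step decays by a factor T over T steps."""
    q = (1.0 / T) ** (1.0 / T)
    return alpha * q ** t

def cosine(t, alpha=1.0, T=1000):
    """Cosine annealing from alpha down to 0 over a horizon of T steps."""
    return 0.5 * alpha * (1 + math.cos(math.pi * t / T))
```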
3 — 15:00 — On the Trajectories of SGD Without Replacement
This talk is about the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than is typical of theoretical works on SGD: for example, we allow the product of the learning rate and the Hessian to be $O(1)$, and we do not specify any model architecture, learning task, or loss (objective) function.
Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to taking an additional step on a novel regularizer. This implies that the trajectory of SGD without replacement diverges from both noise-injected GD and SGD with replacement (in which batches are sampled i.i.d.). Indeed, the two SGDs traverse flat regions of the loss landscape in distinct directions and at different speeds. In expectation, SGD without replacement traverses flat regions and may escape saddle points significantly faster.
Moreover, we find that SGD implicitly regularizes the trace of the noise covariance in the eigendirections corresponding to small and negative Hessian eigenvalues. On several vision tasks, this coincides with penalizing a weighted trace of the Fisher matrix and the Hessian, thus encouraging sparsity in the spectrum of the Hessian of the loss, in line with empirical observations from prior work. We also propose an explanation for why SGD, unlike GD, does not train at the edge of stability.
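To make the distinction between the two sampling schemes concrete, here is a minimal sketch of one SGD epoch in both regimes; the gradient-oracle interface and all names are hypothetical, not taken from the talk.

```python
import numpy as np

def sgd_epoch(w, grads, lr, rng, replacement=False):
    """One epoch of SGD over n per-sample gradient oracles.

    grads[i](w) returns the gradient of the i-th loss term at w.
    replacement=True draws indices i.i.d. (SGD with replacement);
    replacement=False visits each sample exactly once in a random
    order (SGD without replacement, a.k.a. random reshuffling).
    """
    n = len(grads)
    order = (rng.integers(0, n, size=n) if replacement
             else rng.permutation(n))
    for i in order:
        w = w - lr * grads[i](w)
    return w

# Usage: rng = np.random.default_rng(0); the two schemes differ only in
# how the index sequence is drawn, yet, as argued above, their
# trajectories diverge on flat regions of the loss landscape.
```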
4 — 15:30 — Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods
In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted considerable interest due to its strong performance in practice coupled with a lack of theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all existing works are either limited to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate without these two restrictive assumptions. Beyond this important question, several other theoretical problems remain open. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only a few results for smooth optimization have been developed. Additionally, the existing results are all limited to a non-composite objective and the standard Euclidean norm, and it remains unclear whether last-iterate convergence can be provably extended to the broader settings of composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove convergence rates, both in expectation and in high probability, while accommodating general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain last-iterate convergence under heavy-tailed noise.
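As a minimal illustration of what "last-iterate" refers to, the sketch below runs SGD and returns both the final iterate and the running average of the iterates: classical analyses bound the suboptimality of the average, whereas the results above concern the final point directly. The oracle interface and names are assumptions for illustration only.

```python
import numpy as np

def sgd_last_vs_average(w0, stoch_grad, steps, lr, rng):
    """Run SGD and return both the last iterate and the iterate average.

    stoch_grad(w, rng) returns an unbiased stochastic (sub)gradient at w;
    lr(t) is any step size rule indexed by the iteration counter t.
    """
    w = np.asarray(w0, dtype=float)
    avg = w.copy()
    for t in range(1, steps + 1):
        w = w - lr(t) * stoch_grad(w, rng)
        avg += (w - avg) / t      # online average of the iterates so far
    return w, avg                 # last iterate vs averaged iterate
```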