1. 14:00 — Convergence of Momentum Stochastic Gradient Descent

In this talk, we discuss the optimization of a non-convex objective function using Momentum Stochastic Gradient Descent (MSGD) in the context of supervised learning. Empirical observations motivate the long-standing conjecture that including a momentum term improves the performance of stochastic optimization algorithms; however, only a few theoretical results prove the advantage of these methods. We present a framework that allows a theoretical analysis of MSGD. Using local Łojasiewicz inequalities, we show almost sure convergence of the MSGD scheme both under classical assumptions on the noise and in the overparametrized regime. These results are made possible by a careful choice of a Lyapunov function and rely on the derivation of convergence rates that are polynomial for classical noise and exponential in the overparametrized setting.
In the overparametrized, small-learning-rate regime, we prove explicit bounds on the exponential rate of convergence of the MSGD scheme. Moreover, we analyze the optimal friction parameter and prove that, for the optimal choice of parameters, MSGD indeed speeds up convergence in the case of flat minima. This result extends the classical work of Polyak (1964), which showed a beneficial effect of momentum in the optimization of a strictly convex objective function using deterministic gradient descent.
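
As a point of reference for the scheme analyzed above, here is a minimal Python sketch of a heavy-ball (momentum) SGD update; the names (grad_fn, friction, noise_scale) are illustrative, the noise model is a simple Gaussian perturbation, and the exact parametrization of MSGD in the talk may differ.

```python
import numpy as np

def msgd(grad_fn, x0, lr=1e-2, friction=0.9, n_steps=1000, noise_scale=0.1, rng=None):
    """Minimal momentum SGD (heavy-ball) sketch.

    grad_fn(x) returns the exact gradient; Gaussian noise is added to mimic a
    stochastic oracle. `friction` plays the role of the momentum/friction
    parameter discussed in the talk (the exact parametrization may differ).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_fn(x) + noise_scale * rng.standard_normal(x.shape)  # stochastic gradient
        v = friction * v - lr * g   # velocity (momentum) update
        x = x + v                   # parameter update
    return x

# Example: an ill-conditioned quadratic with a flat direction, where momentum is expected to help.
A = np.diag([1.0, 0.01])
x_hat = msgd(lambda x: A @ x, x0=[5.0, 5.0], lr=0.05, friction=0.9, n_steps=2000)
```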

2. 14:30 — Convergence of a Normal Map-Based Prox-SGD Method for Stochastic Composite Optimization

  • Andre Milzarek, The Chinese University of Hong Kong, Shenzhen
  • Junwen Qiu, The Chinese University of Hong Kong, Shenzhen

In this talk, we present a novel stochastic normal map-based algorithm (nor-SGD) for nonsmooth nonconvex composite-type optimization problems and discuss its asymptotic convergence properties. Using a time-window-based strategy and a suitable merit function, we first analyze the global convergence behavior of nor-SGD and show that every accumulation point of the generated sequence of iterates is a stationary point almost surely and in expectation. The obtained results hold under standard assumptions and extend the more limited convergence guarantees of the basic proximal stochastic gradient method. In addition, based on the well-known Kurdyka-Łojasiewicz (KL) analysis framework, we provide novel point-wise convergence results for the iterates generated by nor-SGD and derive convergence rates that depend on the KL exponent and the step-size dynamics. The obtained rates are faster than existing convergence rates for SGD in the nonconvex setting. Notably, the asymptotic rates align with those in the strongly convex setting when the KL exponent is less than or equal to 1/2. The techniques studied in this work can potentially be applied to other families of stochastic and simulation-based algorithms.
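
To make the normal-map construction concrete, here is a minimal Python sketch of a stochastic proximal step based on the normal map of f + phi, using an l1-regularizer (soft thresholding) as the nonsmooth part; the names (stoch_grad, lam, reg) are illustrative, and the time-window strategy, merit function, and step-size schedule from the analysis in the talk are omitted.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (the nonsmooth part phi in this sketch)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def nor_sgd(stoch_grad, z0, lam=1.0, reg=0.1, lr=1e-3, n_steps=1000):
    """Sketch of a normal map-based stochastic proximal step.

    The normal map of f + phi at z is
        G(z) = grad f(prox_{lam*phi}(z)) + (z - prox_{lam*phi}(z)) / lam,
    and the iteration below replaces grad f by a stochastic estimate.
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(n_steps):
        x = soft_threshold(z, lam * reg)      # x_k = prox_{lam*phi}(z_k)
        g = stoch_grad(x)                     # stochastic gradient of the smooth part at x_k
        z = z - lr * (g + (z - x) / lam)      # normal-map step on the z-variable
    return soft_threshold(z, lam * reg)       # return the proximally mapped iterate

# Example: sparse least squares, f(x) = 0.5*||A x - b||^2, phi = reg*||x||_1, with noisy gradients.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
noisy_grad = lambda x: A.T @ (A @ x - b) + 0.01 * rng.standard_normal(10)
x_hat = nor_sgd(noisy_grad, z0=np.zeros(10), lam=1.0, reg=0.1, lr=1e-3, n_steps=2000)
```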

3. 15:00 — Stochastic Composition Optimization in the Absence of a Lipschitz Continuous Gradient

This talk considers the optimization of the nested composition of two functions, where at least the inner function has an expectation form. In such nested structures, obtaining unbiased estimates of the gradient of the composition is challenging. Nested stochastic optimization has a range of recent applications in machine learning, particularly in reinforcement learning and meta-learning. We develop stochastic algorithms for composition optimization with convergence guarantees in the absence of Lipschitz continuity of the gradient of the inner and/or outer functions. More specifically, the smoothness property is generalized by the notion of relative smoothness, which motivates the Bregman gradient method. We propose three Stochastic Composition Bregman Gradient algorithms for the three possible relatively smooth compositional scenarios and provide their sample complexities for reaching an $\epsilon$-approximate stationary point. For the smooth-of-relatively-smooth composition, the first algorithm requires $O(\epsilon^{-2})$ calls to the stochastic oracles of the inner function value and gradient as well as the outer function gradient. When both functions are relatively smooth, the second algorithm requires $O(\epsilon^{-3})$ calls to the inner function value stochastic oracle and $O(\epsilon^{-2})$ calls to the stochastic oracles of the inner and outer function gradients. We further improve the second algorithm by variance reduction for the setting in which only the inner function is smooth; the resulting algorithm requires $O(\epsilon^{-5/2})$ calls to the inner function value stochastic oracle, $O(\epsilon^{-3/2})$ calls to the inner function gradient stochastic oracle, and $O(\epsilon^{-2})$ calls to the outer function gradient stochastic oracle. Numerical studies of the performance of the proposed algorithms will be presented.
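
As an illustration of the kind of update these methods build on, the sketch below performs a single Bregman gradient step for a composition F(x) = f(E[g(x, xi)]) using the Burg entropy kernel on the positive orthant; the oracle and parameter names (inner_val_oracle, inner_jac_oracle, outer_grad, eta) are hypothetical, the gradient estimate is a crude single-sample chain-rule estimate, and the actual algorithms in the talk (including the variance-reduced variant) are considerably more refined.

```python
import numpy as np

def bregman_composition_step(x, inner_val_oracle, inner_jac_oracle, outer_grad, eta):
    """One stochastic composition Bregman-gradient step (illustrative, simplified).

    Objective: F(x) = f(E[g(x, xi)]). We form a single-sample estimate of
    grad F(x) = J_g(x)^T grad f(E[g(x, xi)]) from the stochastic oracles, then take
    a Bregman step with the Burg entropy kernel h(x) = -sum(log x_i) on the positive
    orthant, for which grad h(x) = -1/x and the mirror update has a closed form.
    """
    y_hat = inner_val_oracle(x)             # stochastic estimate of E[g(x, xi)]
    J_hat = inner_jac_oracle(x)             # stochastic estimate of the Jacobian of g at x
    grad_est = J_hat.T @ outer_grad(y_hat)  # chain-rule gradient estimate of F
    # Bregman step: grad h(x_next) = grad h(x) - eta * grad_est, with grad h(x) = -1/x;
    # valid while the argument below stays componentwise positive.
    return 1.0 / (1.0 / x + eta * grad_est)
```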

4. 15:30 — Adam-family Methods with Decoupled Weight Decay in Deep Learning

In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural networks with weight decay. Motivated by the AdamW method, we propose a novel framework for Adam-family methods with decoupled weight decay, in which the estimators for the first-order and second-order moments of the stochastic subgradients are updated independently of the weight decay term. Under mild assumptions, we prove that our framework converges when non-diminishing stepsizes are used for updating the variables. In addition, we show that the proposed framework encompasses a wide variety of well-known Adam-family methods, hence offering convergence guarantees for these methods in the training of nonsmooth neural networks. More importantly, we prove that our framework asymptotically approximates the SGD method, thereby providing an explanation for the empirical observation that decoupled weight decay enhances the generalization performance of Adam-family methods. As a practical application of the framework, we propose a novel Adam-family method named Adam with Decoupled Weight Decay (AdamD) and establish its convergence properties under mild conditions. Extensive numerical experiments show that AdamD outperforms Adam and is comparable to AdamW in terms of both generalization performance and efficiency.
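
For concreteness, the following is a minimal Python sketch of one Adam step with decoupled weight decay in the AdamW style; the proposed AdamD update may differ in its details, and the hyperparameter names and defaults below are illustrative only.

```python
import numpy as np

def adam_decoupled_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                        eps=1e-8, weight_decay=1e-2):
    """One Adam step with decoupled weight decay (AdamW-style; illustrative).

    The moment estimates m and v are updated from the stochastic (sub)gradient g
    only, i.e. independently of the weight decay term, which is applied directly
    to the parameters.  `t` is the 1-based iteration counter for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate (no decay term)
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate (no decay term)
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # Adam direction
    theta = theta - lr * weight_decay * theta            # decoupled weight decay
    return theta, m, v
```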