08:30 — CMA Light: a minibatch algorithm for large-scale non-convex finite sum optimization

We consider minimizing the sum of a large number of smooth and possibly non-convex functions, the typical problem encountered when training deep neural networks on large datasets. Drawing inspiration from the Controlled Minibatch Algorithm (CMA) scheme proposed by Liuzzi et al. (2022), we define CMA Light, an incremental gradient-like method that requires only occasional computation of the objective function, namely when light watchdog rules are not satisfied. Thanks to a derivative-free extrapolation linesearch, convergence still holds without any assumption on the norm of the gradient. We first carry out numerical tests on standard feed-forward neural networks to highlight the advantage of CMA Light over CMA and other minibatch algorithms on large datasets. We then present a thorough computational analysis of the performance of CMA Light on standard image classification tasks on CIFAR10 and CIFAR100 with deep convolutional neural networks and residual networks, showing that CMA Light easily scales up to problems with 50-100 million variables and has an advantage over its state-of-the-art competitors.
Finally, we present the randomized counterpart of CMA Light, which converges under standard assumptions on the norm of the gradient, and show its scaling capabilities on classification tasks with vision transformers, where the number of variables grows to the order of billions.
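
As a rough illustration of the kind of scheme described above, the sketch below runs a plain minibatch gradient loop on a toy least-squares finite sum and only occasionally evaluates the full objective, halving the stepsize when a simple sufficient-decrease test fails. It is not the CMA Light algorithm: the watchdog test, the stepsize-halving stand-in for the derivative-free extrapolation linesearch, and all parameter values are illustrative assumptions.

```python
import numpy as np

# Toy finite-sum problem: f(w) = (1/N) * sum_i 0.5 * (x_i^T w - y_i)^2
rng = np.random.default_rng(0)
N, d = 1000, 20
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

def full_obj(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def minibatch_grad(w, idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def minibatch_with_watchdog(epochs=30, batch=64, alpha0=0.1, check_every=5):
    """Illustrative minibatch loop: the full objective is evaluated only
    every few epochs, when a hypothetical watchdog test is checked."""
    w = np.zeros(d)
    alpha = alpha0
    f_ref = full_obj(w)              # reference value for the watchdog
    for epoch in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch):
            idx = order[start:start + batch]
            w -= alpha * minibatch_grad(w, idx)
        # Hypothetical watchdog: pay for a full objective evaluation only
        # occasionally; without sufficient decrease, shrink the stepsize
        # (a crude stand-in for the derivative-free linesearch).
        if (epoch + 1) % check_every == 0:
            f_now = full_obj(w)
            if f_now > f_ref - 1e-6:
                alpha *= 0.5
            f_ref = min(f_ref, f_now)
    return w, full_obj(w)

w_hat, f_hat = minibatch_with_watchdog()
print(f"final objective: {f_hat:.4e}")
```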

09:00 — On adaptive stochastic heavy ball momentum for solving linear systems

The stochastic heavy ball momentum (SHBM) method has gained considerable popularity as a scalable approach for solving large-scale optimization problems. However, one limitation of this method is its reliance on prior knowledge of certain problem parameters, such as the singular values of a matrix. In this paper, we propose an adaptive variant of the SHBM method for solving stochastic problems that are reformulated from linear systems using user-defined distributions. Our adaptive SHBM (ASHBM) method uses iterative information to update the parameters, addressing an open problem in the literature regarding the adaptive learning of momentum parameters. We prove that our method converges linearly in expectation, with a better convergence bound than the basic method. Notably, we demonstrate that the deterministic version of our ASHBM algorithm can be reformulated as a variant of the conjugate gradient (CG) method, inheriting many of its appealing properties, such as finite-time convergence. Consequently, the ASHBM method can be further generalized into a new framework of stochastic CG (SCG) methods for solving linear systems. Our theoretical results are supported by numerical experiments.
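
To fix notation, the sketch below implements the basic (non-adaptive) stochastic heavy ball momentum iteration on a Kaczmarz-type row sampling of a consistent linear system, with fixed parameters alpha and beta. The adaptive rules of ASHBM and the CG connection are not reproduced here; the sampling distribution and parameter values are illustrative choices.

```python
import numpy as np

# Consistent linear system A x = b (toy instance).
rng = np.random.default_rng(1)
m, n = 200, 50
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

def shbm_kaczmarz(iters=5000, alpha=1.0, beta=0.3):
    """Basic stochastic heavy ball momentum on a Kaczmarz-type
    reformulation: one randomly sampled row per iteration plus a momentum
    term. alpha and beta are fixed here; the adaptive variant of the talk
    would update them from iterative information."""
    x_prev = np.zeros(n)
    x = np.zeros(n)
    row_norms = np.sum(A * A, axis=1)
    probs = row_norms / row_norms.sum()   # sample rows proportionally to their norm
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        residual_i = A[i] @ x - b[i]
        step = alpha * residual_i / row_norms[i] * A[i]
        x_next = x - step + beta * (x - x_prev)   # heavy ball update
        x_prev, x = x, x_next
    return x

x_hat = shbm_kaczmarz()
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```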

09:30 — Optimal gradient tracking for decentralized optimization

  • Ming Yan, The Chinese University of Hong Kong, Shenzhen

In this talk, I will focus on solving the decentralized optimization problem of minimizing the sum of objective functions over a multi-agent network. The agents are embedded in an undirected graph where they can only send/receive information directly to/from their immediate neighbors. Assuming smooth and strongly convex objective functions, we propose an Optimal Gradient Tracking (OGT) method that simultaneously achieves the optimal gradient computation complexity and communication complexity. Its development involves two building blocks that are also of independent interest. The first one is another new decentralized gradient tracking method, SSGT, which achieves the optimal gradient computation complexity. The second one is a technique termed LCA, which can be implemented "looplessly" yet achieves an effect similar to adding multiple inner loops of Chebyshev acceleration to the algorithm. This LCA technique can accelerate many other gradient tracking-based methods with respect to the graph condition number.
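
For readers unfamiliar with gradient tracking, the following sketch shows the plain gradient-tracking template (in the spirit of DIGing/NEXT) on a ring of five agents with quadratic local objectives; OGT, SSGT, and LCA build acceleration on top of such a template and are not reproduced here. The mixing matrix, stepsize, and problem data are illustrative assumptions.

```python
import numpy as np

# Decentralized quadratic problem: agent i minimizes 0.5*||A_i x - b_i||^2,
# and the network minimizes the sum over a ring graph of 5 agents.
rng = np.random.default_rng(2)
n_agents, d = 5, 10
A_loc = [rng.standard_normal((20, d)) for _ in range(n_agents)]
b_loc = [rng.standard_normal(20) for _ in range(n_agents)]

def local_grad(i, x):
    return A_loc[i].T @ (A_loc[i] @ x - b_loc[i]) / A_loc[i].shape[0]

# Symmetric, doubly stochastic mixing matrix for the ring graph.
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

def gradient_tracking(iters=1000, eta=0.02):
    """Plain gradient tracking: each agent mixes with its neighbors,
    descends along its tracker Y, and updates Y to follow the network-wide
    average gradient."""
    X = np.zeros((n_agents, d))                       # one row per agent
    G = np.array([local_grad(i, X[i]) for i in range(n_agents)])
    Y = G.copy()                                      # gradient trackers
    for _ in range(iters):
        X_new = W @ X - eta * Y                       # mix, then descend
        G_new = np.array([local_grad(i, X_new[i]) for i in range(n_agents)])
        Y = W @ Y + G_new - G                         # track the average gradient
        X, G = X_new, G_new
    return X

X_final = gradient_tracking()
x_bar = X_final.mean(axis=0)
avg_grad = np.mean([local_grad(i, x_bar) for i in range(n_agents)], axis=0)
print("consensus spread:", np.linalg.norm(X_final - x_bar))
print("norm of average gradient at mean iterate:", np.linalg.norm(avg_grad))
```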

10:00 — Communication-efficient distributed optimization algorithms

In the big data era, the explosion in the size and complexity of data has come in parallel with a shift towards distributed computations. In distributed optimization and machine learning, and especially in federated learning, specific issues arise, such as decentralized data storage for privacy reasons. In this setting, a large number of machines perform computations in parallel and communicate back and forth with a distant server. Communication is typically costly and slow and forms the main bottleneck. To reduce this cost, two strategies are popular: 1) communicate less frequently; 2) compress the communicated vectors. I will present several randomized algorithms we have developed recently in this area, with proven convergence guarantees and state-of-the-art communication complexity.
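
As a concrete instance of strategy 2), the sketch below simulates distributed gradient descent in which each worker sends an unbiased rand-k sparsification of its local gradient to the server. The setup, compressor choice, and parameter values are illustrative and are not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(3)

def rand_k(v, k):
    """Unbiased rand-k sparsifier: keep k random coordinates and rescale
    by d/k so that the compressed vector equals v in expectation."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

# Simulated setup: 4 workers each hold a local least-squares objective;
# the server aggregates compressed gradients (all names are illustrative).
n_workers, d, k = 4, 50, 5
A_w = [rng.standard_normal((100, d)) for _ in range(n_workers)]
b_w = [rng.standard_normal(100) for _ in range(n_workers)]

def worker_grad(w_id, x):
    return A_w[w_id].T @ (A_w[w_id] @ x - b_w[w_id]) / 100

def compressed_gd(iters=2000, eta=0.05):
    x = np.zeros(d)
    for _ in range(iters):
        # Each worker sends only k of d coordinates per round (~10x less traffic).
        msgs = [rand_k(worker_grad(i, x), k) for i in range(n_workers)]
        x -= eta * np.mean(msgs, axis=0)   # server averages and updates
    return x

x_hat = compressed_gd()
full_grad = np.mean([worker_grad(i, x_hat) for i in range(n_workers)], axis=0)
print("final full gradient norm:", np.linalg.norm(full_grad))
```

With a constant stepsize, the compression noise leaves the iterates in a neighborhood of the solution; error-feedback or decreasing stepsizes, as studied in this line of work, remove that floor.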