1 — 08:30 — Structure Utilization for Reducing Dimensional Dependency for Nonsmooth Derivative-Free Optimization
Deterministic methods for derivative-free optimization that use finite differences to approximate first-order updates typically have a per-iteration evaluation cost proportional to the problem dimension, making them prohibitively expensive for high-dimensional problems.
In this work, we consider a regularized setting of minimizing the sum of a smooth function and a nonsmooth regularization term, where we only have access to function values of the smooth term but assume full knowledge of the regularizer. Using an inexact proximal gradient method built on finite differences, we identify the structure induced by the regularization at the point of convergence, and propose an acceleration method whose per-iteration evaluation cost locally depends on the rank of that structure instead of the original problem dimension, thus making the algorithm much more efficient without harming the convergence guarantees. We discuss how the structure can be identified under a general error bound condition, and cover related convergence properties.
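As a heavily simplified illustration of the setting (not the authors' exact method), the sketch below performs one inexact proximal gradient step for $\min_x f(x) + \lambda \|x\|_1$, with the gradient of $f$ estimated by forward finite differences; the $n + 1$ function evaluations per step are precisely the dimensional cost the talk seeks to reduce. The $\ell_1$ regularizer, stepsize, and difference scheme are illustrative assumptions.

    import numpy as np

    def fd_gradient(f, x, h=1e-6):
        """Forward finite-difference gradient estimate: n + 1 evaluations of f."""
        fx, g = f(x), np.empty_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - fx) / h
        return g

    def soft_threshold(z, tau):
        """Proximal operator of tau*||.||_1 (the regularizer is fully known)."""
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def inexact_prox_grad_step(f, x, lam, step, h=1e-6):
        """One inexact proximal gradient step with a finite-difference gradient."""
        g = fd_gradient(f, x, h)              # cost: O(n) evaluations of the smooth term
        return soft_threshold(x - step * g, step * lam)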
We further consider a progressive identification strategy that gradually decreases the estimated rank of the structure through an approximate Riemannian proximal gradient method, by confining iterates to the manifold representing the structure at the current point, and discuss how this could potentially improve the convergence speed.
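For the $\ell_1$ example above, the structure at a point is its support and the corresponding manifold is the fixed-support subspace; once the support has been (tentatively) identified, the finite differences only need to probe those coordinates, so the per-step cost drops from $n + 1$ evaluations to roughly the size of the support. The sketch below, reusing the helpers above, is a hypothetical illustration of such a reduced-cost step, not the algorithm presented in the talk.

    def restricted_prox_grad_step(f, x, support, lam, step, h=1e-6):
        """Prox-gradient step that probes only the coordinates in `support`
        (assumed to carry the structure of the limit point)."""
        fx, g = f(x), np.zeros_like(x)
        for i in support:                     # |support| + 1 evaluations instead of n + 1
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - fx) / h
        x_new = soft_threshold(x - step * g, step * lam)
        return x_new, np.flatnonzero(x_new)   # re-estimate (and possibly shrink) the structure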
This is joint work with Geovani Nunes Grapiglia.
2 — 09:00 — Adaptable proximal algorithms: improved stepsize bounds and weaker assumptions
The choice of stepsize parameters has a significant impact on the performance of gradient-based optimization methods. Linesearches are arguably the most successful established strategies, for they enact an online tuning of the stepsize that reflects the local landscape of the function, gaining a significant advantage over constant-stepsize regimes, which are concerned only with worst-case scenarios. On the other hand, the extra numerical effort involved in the necessary trial-and-error inner loops can constitute a severe drawback, especially when functions are expensive to evaluate.
An answer to this dilemma has recently been found in (self-)adaptive strategies, in which stepsize parameters are updated based on local information with no need for a posteriori validation. Building upon recent work on these linesearch-free adaptive first-order methods, this talk proposes a framework that both unifies and tightens existing results by providing larger stepsize selection policies and improved lower bounds. While retaining a robust and linesearch-free nature, the use of time-varying algorithmic meta-parameters provides an extra layer of adaptivity that can further boost performance in practice. Primal, dual, and primal-dual algorithms are investigated, and their effectiveness is showcased with numerical simulations. Finally, we comment on recent developments demonstrating their effectiveness even under mere local Hölder continuity of the gradient.
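To make "local information with no a posteriori validation" concrete, one representative rule in this family (in the spirit of Malitsky and Mishchenko's adaptive gradient descent) is sketched below: the stepsize is set from a local curvature estimate and from its own history, with no trial steps. The exact constants, and the larger stepsize policies proposed in the talk, differ from this minimal template.

    import numpy as np

    def adaptive_gradient_descent(grad, x0, lam0=1e-8, iters=1000):
        """Linesearch-free adaptive gradient descent: stepsizes come from a local
        Lipschitz estimate and the ratio of consecutive stepsizes."""
        x_prev, g_prev = x0, grad(x0)
        x, lam_prev, theta = x0 - lam0 * g_prev, lam0, np.inf
        for _ in range(iters):
            g = grad(x)
            L_loc = np.linalg.norm(g - g_prev) / (np.linalg.norm(x - x_prev) + 1e-16)
            lam = min(np.sqrt(1.0 + theta) * lam_prev, 0.5 / (L_loc + 1e-16))
            x_prev, g_prev = x, g
            x = x - lam * g                   # no backtracking, no posterior check
            theta, lam_prev = lam / lam_prev, lam
        return x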
3 — 09:30 — MGProx: A nonsmooth multigrid proximal gradient method with adaptive restriction for strongly convex optimization
We study the combination of proximal gradient descent with multigrid for solving a class of possibly nonsmooth, strongly convex optimization problems. We propose a multigrid proximal gradient method called MGProx, which accelerates the proximal gradient method with multigrid by exploiting the hierarchical structure of the optimization problem.
MGProx applies a newly introduced adaptive restriction operator to simplify the Minkowski sum of subdifferentials of the nondifferentiable objective function across levels. We provide a theoretical characterization of MGProx. First, we show that the variables at all levels exhibit a fixed-point property at convergence. Next, we show that the coarse correction is a descent direction for the fine variable in the general nonsmooth case. Lastly, under mild assumptions we provide convergence rates for the algorithm, including the classical sublinear $O(1/k)$ rate as well as a linear rate.
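Schematically, and only schematically (the adaptive restriction operator and the construction of the coarse model are the technical core of MGProx and are left abstract here), one two-level cycle can be organized as in the sketch below, with the restriction R, prolongation P, and coarse solver supplied as placeholders.

    def two_level_cycle(prox_grad, coarse_solve, R, P, x, alpha=1.0):
        """Generic two-level template around a fine-level proximal gradient step.
        prox_grad    : one proximal gradient step at the fine level
        coarse_solve : (approximately) minimizes a coarse model built at R @ x
        R, P         : restriction / prolongation operators (placeholders here;
                       MGProx adapts R to the subdifferential structure)"""
        x = prox_grad(x)                      # pre-smoothing on the fine level
        x_H = R @ x                           # restrict the current fine iterate
        d = P @ (coarse_solve(x_H) - x_H)     # prolongated coarse correction
        return prox_grad(x + alpha * d)       # damped correction + post-smoothing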
By treating the multigrid proximal gradient iteration as a black box, we also propose a fast MGProx with Nesterov's acceleration, which attains the classical $O(1/k^2)$ rate.
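The acceleration mentioned here treats one such cycle as a black-box map T; the standard Nesterov/FISTA extrapolation around it then looks roughly as follows (whether the $O(1/k^2)$ guarantee carries over rests on the properties of T established in the paper, not on this generic template).

    import math

    def accelerated(T, x0, iters=100):
        """Nesterov-style extrapolation wrapped around a black-box iteration map T
        (here T would be one multigrid proximal gradient cycle)."""
        x_prev, y, t = x0, x0, 1.0
        for _ in range(iters):
            x = T(y)                                        # black-box inner iteration
            t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
            y = x + ((t - 1.0) / t_next) * (x - x_prev)     # momentum/extrapolation step
            x_prev, t = x, t_next
        return x_prev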
In the numerical experiments, we show that MGProx converges significantly faster than proximal gradient descent, with or without Nesterov's acceleration, on nonsmooth convex optimization problems such as the elastic obstacle problem, for which the restriction operator is well known.
Lastly, if time permits, we also discuss possible extensions of MGProx to other first-order methods in primal-dual and splitting settings, as well as to second-order methods such as nonsmooth trust-region methods.
4 — 10:00 — Understanding Neural Network Binarization with Forward and Backward Proximal Quantizers
In neural network binarization, BinaryConnect (BC) and its variants are considered the standard. These methods apply the sign function in their forward pass, and their respective gradients are backpropagated to update the weights. However, the derivative of the sign function is zero whenever it is defined, which consequently freezes training. Therefore, implementations of BC (e.g., BNN) usually replace the derivative of sign in the backward computation with identity or other approximate gradient alternatives. Although this practice works well empirically, it is largely a heuristic or “training trick.” We aim to shed light on these training tricks from an optimization perspective. Building on existing theory for ProxConnect (PC, a generalization of BC), we (1) equip PC with different forward-backward quantizers and obtain ProxConnect++ (PC++), which includes existing binarization techniques as special cases; (2) derive a principled way to synthesize forward-backward quantizers with automatic theoretical guarantees; (3) illustrate our theory by proposing an enhanced binarization algorithm, BNN++; and (4) conduct image classification experiments on CNNs and vision transformers, empirically verifying that BNN++ generally achieves competitive results in binarizing these models.
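To make the forward/backward decoupling concrete, the snippet below shows the classic pattern the abstract refers to: a sign forward pass paired with a (clipped) identity backward pass, i.e. the straight-through heuristic used in BNN-style training, written as a custom autograd function in PyTorch. PC++ generalizes exactly this kind of forward-backward quantizer pair; the particular clipping here is the common heuristic, not the BNN++ design.

    import torch

    class SignWithSTE(torch.autograd.Function):
        """Forward quantizer: sign.  Backward quantizer: clipped identity
        (the straight-through estimator)."""
        @staticmethod
        def forward(ctx, w):
            ctx.save_for_backward(w)
            return torch.sign(w)

        @staticmethod
        def backward(ctx, grad_out):
            (w,) = ctx.saved_tensors
            # Pass the incoming gradient through where |w| <= 1 and zero it elsewhere,
            # instead of using the true (almost-everywhere zero) derivative of sign.
            return grad_out * (w.abs() <= 1).to(grad_out.dtype)

    def binarize(w):
        return SignWithSTE.apply(w)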