1 — 16:20 — Differentiating Nonsmooth Solutions to Parametric Monotone Inclusion Problems
Understanding the differentiability and regularity of the solution to a monotone inclusion problem as a function of the problem data is an important question with consequences for convex optimization, machine learning, signal processing, and beyond. Past attempts have been made either under very restrictive assumptions that ensure the solution is continuously differentiable or using mathematical tools that are incompatible with automatic differentiation. In this talk, we discuss how to leverage path differentiability and a recent result on nonsmooth implicit differentiation calculus to give sufficient conditions ensuring that the solution to a monotone inclusion problem will be path differentiable. We also provide formulas for computing its generalized gradient that can be used in practice, notably for applications in deep learning. Our approach is fully compatible with automatic differentiation and the backpropagation algorithm and comes with assumptions which are easy to check. In fact, we show that semialgebraicity (or even just definability in some o-minimal structure) and strong monotonicity of the monotone inclusion problem are sufficient to guarantee the path differentiability of the solution. We illustrate the scope of our results by considering three fundamental composite problem settings that are often encountered in practice: strongly convex problems, dual solutions to convex minimization problems and primal-dual solutions to min-max problems.
2 — 16:50 — The gradient’s limit of a definable family of functions is a conservative set-valued field
It is well-known that the convergence of a family of smooth functions does not imply the convergence of its gradients. In this talk, we will show that if the family is definable in an o-minimal structure (for instance semialgebraic, subanalytic, or any composition of the previous with exp, log), then the gradient's limit is actually a conservative set-valued field in the sense introduced by Bolte and Pauwels. This result will be presented in a more general setting, where the functions in the original family might be merely locally Lipschitz continuous, vector-valued and the gradients are replaced by the Clarke Jacobians (or an arbitrary conservative mapping).
In particular, while a conservative set-valued field of a function is not necessarily equal to its gradient, it is equal to its gradient almost everywhere, implying the convergence of the gradients on a set of full measure. Moreover, the general geometric structure of a conservative field and its links with the usual subgradients are well-understood and will be presented in the talk.
Finally, we will discuss consequences of this result for convergence guarantees of smoothing methods and potential generalizations to the case where the functions are not locally Lipschitz continuous.
3 — 17:20 — Riemannian Stochastic Approximation for Minimizing Tame Nonsmooth Objective Functions
In many learning applications, the parameters in a model are structurally constrained in a way that can be modeled as them lying on a Riemannian manifold. Riemannian optimization, wherein procedures to enforce an iterative minimizing sequence to be constrained to the manifold, is used to train such models. At the same time, tame geometry has become a significant topological description of nonsmooth functions that appear in the landscapes of training neural networks and other important models with structural compositions of continuous nonlinear functions with nonsmooth maps. In this paper, we study the properties of such stratifiable functions on a manifold and the behavior of retracted stochastic gradient descent, with diminishing stepsizes, for minimizing such functions.
Functions whose graph satisfies this structure, called tame, have been the
subject of significant interest in recent years. This is in light of the fact that
the optimization problem defining the training of the weights of a deep neural
network is expressed as the minimization of a tame function. The function is a
nested composition of component-wise maximums (rectified linear units) and
nonlinear activation functions (softmax, hyperbolic tangent, etc.), resulting in a
nonsmooth and nonconvex objective landscape, however one that is also locally
Lispchitz and with a well-defined set of points at which it is nondifferentiable
with a hierarchical structure.