1 — 08:30 — ** CANCELLED ** Deep Learning with Nontrivial Constraints
Despite the sweeping success and impact of deep learning (DL) in numerous domains, imposing explicit constraints in DL is relatively new but increasingly pressing, stimulated by, e.g., trustworthy AI, which performs robust optimization over complicated perturbation sets, and scientific and engineering applications that must respect physical laws and constraints. In this talk, we will (1) survey DL problems with nontrivial constraints across science, engineering, and medicine, (2) highlight the NCVX computing framework we have recently built, which provides deterministic solvers for constrained DL problems, and (3) invite the optimization community to tackle stochastic constrained DL problems.
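For reference, the constrained DL problems in view can be written schematically as a stochastic program with explicit constraints; the symbols below ($\theta$ for the network weights, $\ell$ for the training loss, $c_i$ and $h_j$ for inequality and equality constraints) are generic placeholders rather than notation from the talk:
\[
\min_{\theta}\; \mathbb{E}_{\xi \sim \mathcal{D}}\big[\ell(\theta;\xi)\big]
\quad \text{s.t.} \quad c_i(\theta) \le 0,\; i = 1,\dots,m,
\qquad h_j(\theta) = 0,\; j = 1,\dots,p,
\]
where the $c_i$ and $h_j$ may encode, e.g., robustness requirements over perturbation sets or physical laws.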
2 — 09:00 — Adaptive Quasi-Newton and Anderson Acceleration with Global (Accelerated) Convergence Rates
Despite the impressive numerical performance of quasi-Newton and Anderson/nonlinear acceleration methods, their global convergence rates have remained elusive for over 50 years. This study addresses this long-standing issue by introducing a framework that derives novel, adaptive quasi-Newton and nonlinear/Anderson acceleration schemes. Under mild assumptions, the proposed iterative methods exhibit explicit, non-asymptotic convergence rates that blend those of gradient descent and the cubic regularized Newton method. The proposed approach also includes an accelerated version for convex functions. Notably, these rates are achieved adaptively, without prior knowledge of the function's parameters. The framework presented in this study is generic, and its special cases include algorithms such as Newton's method with random subspaces, finite differences, or lazy Hessian updates. Numerical experiments demonstrate the efficiency of the proposed framework, even compared to the L-BFGS algorithm with Wolfe line search.
The presented method can be seen as a mix of 1) (orthogonal) subspace Newton, 2) finite-difference approximation of the Hessian with successive gradients, 3) lazy Newton's method, and 4) cubic regularization of Newton's method. The combination of these strategies ensures a fast global convergence rate while reducing the per-iteration cost to $O(N^2 d)$, where $N$ is the number of gradients stored in memory and $d$ is the problem dimension.
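For concreteness, the following minimal Python sketch combines two of the ingredients above, an orthogonalized gradient subspace and a cubic-regularized Newton step, with the reduced Hessian formed by explicit finite differences (which costs $N$ extra gradient calls per step, unlike the reuse of successive stored gradients described above). It is an illustrative instance of the general recipe, not the authors' algorithm; the function names and the parameters N (memory) and M (cubic regularization) are ours, and the hard case of the cubic subproblem is ignored.

```python
import numpy as np

def cubic_subspace_step(grad_f, x, g, Q, M, h=1e-5):
    """One cubic-regularized Newton step restricted to the subspace spanned by the
    orthonormal columns of Q (d x N); reduced Hessian via finite differences."""
    _, N = Q.shape
    g_red = Q.T @ g
    cols = [Q.T @ ((grad_f(x + h * Q[:, i]) - g) / h) for i in range(N)]
    H_red = np.column_stack(cols)
    H_red = 0.5 * (H_red + H_red.T)              # symmetrize Q^T H Q
    # Solve min_z g_red'z + z'H_red z/2 + (M/6)||z||^3 via the secular equation
    # (H_red + sigma I) z = -g_red with sigma = (M/2)||z||, by bisection on sigma.
    lam, V = np.linalg.eigh(H_red)
    b = V.T @ g_red
    z_of = lambda sigma: -V @ (b / (lam + sigma))
    lo = max(0.0, -lam.min()) + 1e-12
    hi = lo + 1.0
    while 0.5 * M * np.linalg.norm(z_of(hi)) > hi:   # grow until the root is bracketed
        hi *= 2.0
    for _ in range(100):                             # bisection
        mid = 0.5 * (lo + hi)
        if 0.5 * M * np.linalg.norm(z_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return x + Q @ z_of(hi)                          # map the step back to R^d

def adaptive_subspace_cubic(grad_f, x0, N=5, M=10.0, iters=50):
    """Toy driver: the subspace is the orthonormalized span of the last N gradients."""
    x, mem = np.asarray(x0, dtype=float), []
    for _ in range(iters):
        g = grad_f(x)
        mem = (mem + [g])[-N:]
        Q, _ = np.linalg.qr(np.column_stack(mem))    # d x (<= N) orthonormal basis
        x = cubic_subspace_step(grad_f, x, g, Q, M)
    return x
```

Per iteration, the linear algebra (QR, the reduced Hessian, and its eigendecomposition) costs $O(N^2 d)$ for $N \le d$, in line with the figure above, on top of the $N$ finite-difference gradient calls.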
3 — 09:30 — Non-Convexity Two Ways: new bounds for the Paulsen Problem via Geodesic PL-condition
The Paulsen problem is a basic question in frame theory: given $u_{1}, \dots, u_{n} \in \mathbb{R}^{d}$ that $\varepsilon$-nearly satisfy the Parseval and equal-norm conditions, what is the nearest frame $v_{1}, \dots, v_{n} \in \mathbb{R}^{d}$ that exactly satisfies these two conditions? For this question, the distance is measured as $\text{dist}(U,V) := \sum_{j \in [n]} \|u_{j} - v_{j}\|_{2}^{2}$, and the fundamental conjecture was whether $\text{dist}(U,V) \leq \text{poly}(d)\, \varepsilon$, i.e., bounded by a function independent of $n$.
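For concreteness, under one common normalization (conventions in the literature differ by scaling), the two conditions read
\[
\sum_{j=1}^{n} v_j v_j^{\top} = I_d \quad \text{(Parseval)}, \qquad \|v_j\|_2^{2} = \frac{d}{n} \;\; \text{for all } j \in [n] \quad \text{(equal norm)},
\]
with the $\varepsilon$-near version typically formalized as $(1-\varepsilon)\, I_d \preceq \sum_{j} u_j u_j^{\top} \preceq (1+\varepsilon)\, I_d$ and $(1-\varepsilon)\,\frac{d}{n} \le \|u_j\|_2^2 \le (1+\varepsilon)\,\frac{d}{n}$.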
In a breakthrough result, [Kwok et al 17] were able to affirm this conjecture using techniques from the operator scaling framework of [Gurvits et al 15]. Subsequently, this was improved to the optimal bound $\text{dist}(U,V) \lesssim \varepsilon$ in [R 21] using a principled perspective of geodesic convex optimization over positive definite matrices.
In this talk, we revisit some partial results that were known before the above resolution of the conjecture: in [Bodmann, Casazza 10] and [Casazza et al 12], it was shown that if $n$ and $d$ are relatively prime, then $\text{dist}(U,V) \leq \text{poly}(d,n)\, \varepsilon^{2}$. Notably, this is a beyond-worst-case bound, as it beats the worst-case $\varepsilon$ lower bound.
In this work, we are able to re-derive these results from the perspective of geodesic optimization.
Specifically, we show that the procedures developed in [BC10] and [CFM 12] can be understood as gradient flows over certain non-Euclidean manifolds. Unlike in the scaling framework, the resulting optimization problem is still non-convex.
Nevertheless, we are able to prove global convergence by appealing to a geodesic version of the Polyak-Łojasiewicz condition.
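In a standard form (our paraphrase), the geodesic Polyak-Łojasiewicz condition asks that, for some $\mu > 0$ and all $x$ on the manifold,
\[
\|\operatorname{grad} f(x)\|_{x}^{2} \;\ge\; 2\mu\,\big(f(x) - f^{\star}\big),
\]
so that along the gradient flow $\dot{x}(t) = -\operatorname{grad} f(x(t))$ one gets $\tfrac{d}{dt}\big(f(x(t)) - f^{\star}\big) = -\|\operatorname{grad} f(x(t))\|_{x(t)}^{2} \le -2\mu\,\big(f(x(t)) - f^{\star}\big)$, hence $f(x(t)) - f^{\star} \le e^{-2\mu t}\big(f(x(0)) - f^{\star}\big)$ and global convergence in value despite non-convexity.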
We are also able to use these techniques to give another proof of a result in [Kwok, Lau, R 19], which shows a distance bound of $\varepsilon^{2}$ in the setting of random inputs.
4 — 10:00 — Strict saddle avoidance for Riemannian Gradient Descent
It is well known that, for a $C^2$ function with $L$-Lipschitz gradient, Euclidean Gradient Descent with a step size less than $1/L$ does not converge to a strict saddle point for almost all initializations.
The equivalent scenario in the Riemannian setting has received limited attention, with existing results only valid in the context of the metric projection retraction and subject to overly restrictive conditions on the step size.
In this talk, I will present some first steps towards generalizing these results in the case of Riemannian Gradient Descent with the exponential map or the metric projection as retractions, emphasizing the role of the curvature and the regularity of the cost function.
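As a generic illustration of the setup (not of the talk's results), the sketch below runs Riemannian gradient descent on the unit sphere with either the exponential map or the metric projection as the retraction; in the Rayleigh-quotient example at the end, the non-extremal eigenvectors of the matrix are strict saddle points.

```python
import numpy as np

def sphere_rgd(egrad, x0, step, iters=2000, retraction="projection"):
    """Riemannian gradient descent on the unit sphere S^{d-1}.
    egrad: Euclidean gradient of the cost; step: fixed step size."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        g = egrad(x)
        rg = g - (x @ g) * x                      # project onto the tangent space at x
        v = -step * rg                            # tangent-space step
        if retraction == "projection":            # metric-projection retraction
            x = (x + v) / np.linalg.norm(x + v)
        else:                                     # exponential map on the sphere
            nv = np.linalg.norm(v)
            x = x if nv == 0 else np.cos(nv) * x + np.sin(nv) * (v / nv)
    return x

# Example: minimize x'Ax over the sphere (Rayleigh quotient). The eigenvectors of A
# are the critical points; the non-extremal ones are strict saddles.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20)); A = 0.5 * (A + A.T)
x_min = sphere_rgd(lambda x: 2.0 * A @ x, rng.standard_normal(20), step=0.05)
```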
Joint work with Nicolas Boumal.