1 — 14:00 — High order stochastic optimization without evaluating the function
In this work, we consider the problem of finding approximate minimizers for unconstrained nonconvex optimization problems. Our proposed algorithm belongs to the class of adaptive regularization methods, in which Taylor approximations of order $p$ are used for the step computation; these methods are known for their optimal worst-case complexity in the deterministic setting.
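For context (standard adaptive-regularization background rather than a claim specific to this talk), the trial step of an order-$p$ method is obtained by approximately minimizing a regularized Taylor model of the form
\[
m_k(s) \;=\; T_{f,p}(x_k,s) \;+\; \frac{\sigma_k}{p+1}\,\|s\|^{p+1},
\]
where $T_{f,p}(x_k,s)$ is the $p$th-order Taylor expansion of $f$ at $x_k$ (its constant term plays no role in the step) and $\sigma_k>0$ is the adaptive regularization parameter; the exact scaling of the regularization term varies across papers.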
This work also contributes to the field of objective function-free optimization (OFFO), a concept motivated by the fact that many optimization methods demand finer accuracy in function evaluations than in gradient evaluations for effective step size control and global convergence.
We present here an extension of our algorithms to a stochastic variant in which the objective function is never evaluated and only stochastic approximations of the derivatives are available.
We prove that the excellent complexity bounds of regularization methods remain applicable in this stochastic context, even though significantly less information is used.
More precisely, the algorithm finds an $\epsilon$-approximate first-order minimizer in at most $O(\epsilon^{-(p+1)/p})$ iterations, and an $\epsilon$-approximate second-order critical point in at most $O(\epsilon^{-(p+1)/(p-1)})$ iterations, under conditions stated in conditional expectation on the errors of the derivative tensors.
In practice, our algorithm requires dynamic control over the accuracy of the derivatives, which aligns well with machine learning problems, where the batch size offers the necessary control over the approximate derivatives.
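As an illustration of this batch-size control only (the per-sample callables grad_i and hess_i, the fixed regularization weight sigma and the doubling batch schedule below are assumptions, not the speaker's algorithm), the following Python sketch shows the flavour of an order-2 objective-function-free step: derivatives are estimated from a mini-batch whose size sets their accuracy, a cubically regularized model is approximately minimized, and the objective value itself is never computed.

import numpy as np

rng = np.random.default_rng(0)

def batch_derivatives(grad_i, hess_i, n_data, batch_size):
    """Mini-batch gradient/Hessian estimates; accuracy improves as batch_size grows."""
    idx = rng.choice(n_data, size=batch_size, replace=False)
    g = np.mean([grad_i(i) for i in idx], axis=0)
    H = np.mean([hess_i(i) for i in idx], axis=0)
    return g, H

def cubic_step(g, H, sigma, n_grid=100):
    """Crudely minimize g^T s + 0.5 s^T H s + (sigma/3)||s||^3 by scanning shifts in
    the optimality condition (H + sigma*||s|| I) s = -g and keeping the best candidate."""
    lam0 = max(0.0, -np.linalg.eigvalsh(H)[0]) + 1e-8
    best_s, best_m = np.zeros_like(g), 0.0
    for lam in np.linspace(lam0, lam0 + 10.0, n_grid):
        s = np.linalg.solve(H + lam * np.eye(len(g)), -g)
        m = g @ s + 0.5 * s @ H @ s + sigma / 3.0 * np.linalg.norm(s) ** 3
        if m < best_m:
            best_s, best_m = s, m
    return best_s

def offo_ar2_sketch(grad_i, hess_i, x0, n_data, iters=20, sigma=1.0, batch0=32):
    """OFFO-flavoured loop: every step is taken without evaluating the objective;
    only the batch size (and hence the derivative accuracy) changes over the iterations."""
    x = x0.copy()
    for k in range(iters):
        batch = min(n_data, batch0 * 2 ** k)   # simple, assumed accuracy schedule
        g, H = batch_derivatives(grad_i, hess_i, n_data, batch)
        x = x + cubic_step(g, H, sigma)
    return x

A genuine OFFO method would also adapt sigma from iteration to iteration using derivative information only; that logic is omitted in this sketch.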
2 — 14:30 — Randomized matrix decompositions for faster scientific computing
Traditional numerical methods based on expensive matrix factorizations struggle with the scale of modern scientific applications. For example, kernel-based algorithms take a data set of size $N$, form the kernel matrix of size $N \times N$, and then perform an eigendecomposition or inversion at a cost of $O(N^3)$ operations. For data sets of size $N \geq 10^5$, kernel learning is too expensive, straining the limits of personal workstations and even dedicated computing clusters. Randomized iterative methods have emerged as a faster alternative to the classical approaches. These methods combine randomized exploration with information about which matrix structures are important, leading to significant speed gains.
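For concreteness, here is a minimal NumPy sketch of that dense baseline (an RBF kernel is assumed purely for illustration): forming the kernel matrix already costs $O(N^2 d)$ time and $O(N^2)$ memory, and the eigendecomposition costs $O(N^3)$.

import numpy as np

def rbf_kernel(X, bandwidth=1.0):
    """Dense N x N RBF kernel matrix: O(N^2 d) time, O(N^2) memory."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

X = np.random.default_rng(0).standard_normal((2000, 10))  # N = 2000 already takes a moment
K = rbf_kernel(X)
evals, evecs = np.linalg.eigh(K)   # O(N^3): the step that becomes prohibitive at N >= 10^5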
In this talk, I will review recent developments concerning two randomized algorithms. The first is "randomized block Krylov iteration", which uses an array of random Gaussian test vectors to probe a large data matrix in order to provide a randomized principal component analysis. Remarkably, this approach works well even when the matrix of interest is not low-rank. The second algorithm is "randomly pivoted Cholesky decomposition", which iteratively samples columns from a positive semidefinite matrix using a novelty metric and reconstructs the matrix from the randomly selected columns. Ultimately, both algorithms furnish a randomized approximation of an $N \times N$ matrix with a reduced rank $k \ll N$, which enables fast inversion or singular value decomposition at a cost of $O(N k^2)$ operations. The speed-up factor from $N^3$ to $N k^2$ operations can be 3 million. The newest algorithms achieve this speed-up factor while guaranteeing performance across a broad range of input matrices.
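As a hedged illustration of the second algorithm (a generic sketch of the randomly pivoted Cholesky idea, not pseudocode from the talk), the following Python function samples pivots in proportion to the residual diagonal, the "novelty metric" mentioned above, and returns a rank-$k$ factor. A careful implementation touches only the diagonal and $k$ columns of the matrix; a dense matrix is passed here purely for simplicity.

import numpy as np

def rp_cholesky(A, k, seed=0):
    """Randomly pivoted Cholesky sketch: returns F (N x k) with A ≈ F @ F.T.
    A is assumed symmetric positive semidefinite with rank at least k."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    F = np.zeros((n, k))
    d = np.array(np.diag(A), dtype=float)        # residual diagonal (the novelty metric)
    for i in range(k):
        s = rng.choice(n, p=d / d.sum())          # sample a pivot proportionally to novelty
        col = A[:, s] - F[:, :i] @ F[s, :i]       # residual column at the chosen pivot
        F[:, i] = col / np.sqrt(col[s])           # Cholesky-style rank-1 update
        d = np.maximum(d - F[:, i] ** 2, 0.0)     # shrink the residual diagonal
    return F

The factor F can then stand in for the full matrix downstream: for instance, a regularized solve with A plus a multiple of the identity can be applied in $O(N k^2)$ operations via the Woodbury identity, which is the source of the speed-up quoted above.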
3 — 15:00 — Adaptive Stochastic Optimization with Constraints
We consider optimization problems with a stochastic objective and deterministic equality constraints, and design a trust-region sequential quadratic programming method with random models (TR-SQP-STORM) to find both first- and second-order stationary points. In our method, random models are constructed in each iteration, which require estimates of the objective value, gradient, and Hessian to satisfy adaptive accuracy conditions with a fixed probability. To reduce KKT residuals and to increase the smallest eigenvalue of the reduced Lagrangian Hessian when it is negative, we design gradient steps and eigensteps, respectively. Gradient steps are computed using the adaptive relaxation technique proposed by Fang et al. (2022), while eigensteps are novel and computed in a parameter-free manner. To help the algorithm move away from saddle points, we perform an additional second-order correction step when the current iteration is deemed unsuccessful and the current iterate is close to the feasible set. Under reasonable assumptions, we establish both first- and second-order global convergence guarantees: the sequence of iterates converges to first-order stationary points almost surely, and a subsequence of iterates converges to second-order stationary points almost surely. We apply our method to a subset of problems in the CUTEst set and to constrained logistic regression problems, demonstrating its promising empirical performance.
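To make the eigenstep idea concrete (an illustration under assumed names only, not the authors' exact, parameter-free construction), the sketch below computes a negative-curvature direction of the Lagrangian Hessian reduced to the null space of the constraint Jacobian and scales it to a trust-region radius.

import numpy as np
from scipy.linalg import null_space

def eigenstep(H_lagrangian, J, radius):
    """Illustrative eigenstep: if the reduced Lagrangian Hessian Z^T H Z has a negative
    eigenvalue, return a negative-curvature direction lifted back to the full space and
    scaled to the trust-region radius; otherwise return the zero step."""
    Z = null_space(J)                              # basis of the constraint null space
    if Z.size == 0:
        return np.zeros(H_lagrangian.shape[0])
    H_red = Z.T @ H_lagrangian @ Z
    evals, evecs = np.linalg.eigh(H_red)
    if evals[0] >= 0.0:                            # no negative curvature: nothing to do
        return np.zeros(H_lagrangian.shape[0])
    v = Z @ evecs[:, 0]                            # lift the offending eigenvector
    v /= np.linalg.norm(v)
    # The sign of v does not affect the curvature term; practical methods choose it
    # using gradient information, which is omitted in this sketch.
    return radius * v

In the actual method the derivative estimates and accuracy requirements are random and adaptive, which this static sketch does not attempt to reproduce.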
4 — 15:30 — Random subspace second order methods for nonconvex optimization
To improve the scalability of nonconvex optimization algorithms, we devise and analyse a random subspace variant of cubic regularization that achieves optimal complexity with high probability and almost sure convergence, relying on key properties of sketching matrices that allow efficient projection to low dimensions while preserving the structure needed for the analysis. We discuss the challenges of the analysis, in particular the key difficulty of using sketching given that the lower bound on the step required for optimal complexity depends on the random gradient at the next iterate. We obtain improved results for low-rank functions, which vary only within a low-dimensional subspace, are a common occurrence in applications involving overparameterized models, and can serve as an insightful proxy for the training landscape of neural networks. Numerical experiments are presented both in the general case and in the special case of low-rank functions, using different sketching matrices (such as Gaussian or hashing matrices), with encouraging results. We discuss the strategies needed for careful algorithm development that go beyond vanilla application of sketching, toward careful design of the subspace. Time permitting, we will explore local rates of convergence and other random subspace variants and their complexity.
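As a minimal sketch of the sketched subproblem only (grad and hess_vec are assumed callables for the gradient and Hessian-vector products; the adaptive regularization update, acceptance rules and complexity-critical safeguards from the talk are omitted), one random-subspace cubic-regularization step can be written as follows; a hashing (sparse) sketch could replace the Gaussian one without changing the surrounding logic.

import numpy as np

def gaussian_sketch(l, n, rng):
    """Scaled Gaussian sketching matrix S in R^{l x n}: an l-dimensional random subspace."""
    return rng.standard_normal((l, n)) / np.sqrt(l)

def subspace_cubic_step(grad, hess_vec, x, l, sigma, rng):
    """One illustrative step: build and minimize the cubic model only in the sketched space."""
    n = x.size
    S = gaussian_sketch(l, n, rng)
    g_hat = S @ grad(x)                                                  # reduced gradient in R^l
    H_hat = np.column_stack([S @ hess_vec(x, S[i]) for i in range(l)])   # S H S^T (l x l)
    H_hat = 0.5 * (H_hat + H_hat.T)                                      # symmetrize
    # Crude cubic-subproblem solver: scan shifts in (H_hat + sigma*||u|| I) u = -g_hat.
    lam0 = max(0.0, -np.linalg.eigvalsh(H_hat)[0]) + 1e-8
    best_u, best_m = np.zeros(l), 0.0
    for lam in np.linspace(lam0, lam0 + 10.0, 100):
        u = np.linalg.solve(H_hat + lam * np.eye(l), -g_hat)
        m = g_hat @ u + 0.5 * u @ H_hat @ u + sigma / 3.0 * np.linalg.norm(u) ** 3
        if m < best_m:
            best_u, best_m = u, m
    return x + S.T @ best_u                                              # lift the step back to R^n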