1 — 16:20 — Efficient Methods for Solving Bilevel-structure Optimization Problem
Bilevel optimization is an important class of hierarchical decision-making problems in which one optimization problem is nested within another. This modeling framework finds extensive applications across diverse domains, including machine learning, energy systems, and economics. However, as the scale and complexity of these systems continue to increase, there is a critical need to develop scalable and reliable algorithms for bilevel optimization. For example, model-agnostic meta-learning in machine learning can be cast as a bilevel optimization problem in which the follower computes task-specific parameters using a set of meta-parameters (the leader’s decision variable), and the goal is to learn meta-parameters that produce appropriate task-specific parameters after adaptation. Other examples in this area include few-shot learning, reinforcement learning, and hyperparameter learning.
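As a point of reference only (a generic sketch, not necessarily the exact formulation used in the talk), a bilevel problem nests a lower-level problem inside the upper-level objective; in the meta-learning example, $x$ plays the role of the meta-parameters and $y^*(x)$ that of the task-specific parameters obtained after adaptation:
\[
\min_{x} \; f\bigl(x, y^*(x)\bigr)
\quad \text{s.t.} \quad
y^*(x) \in \arg\min_{y} \; g(x, y),
\]
where $f$ denotes the leader’s (upper-level) objective and $g$ the follower’s (lower-level) objective.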
In this talk, we discuss various bilevel-structured problems, including constrained bilevel optimization and bilevel saddle-point problems. For each problem structure, we introduce a novel, efficient first-order algorithm with a convergence rate guarantee. Several methods have been developed for tackling unconstrained bilevel optimization problems, but there is limited work on the constrained and min-max settings. Our proposed methods achieve an improved per-iteration complexity over existing methods. We also present numerical experiments that showcase the superior performance of our methods compared with state-of-the-art alternatives.
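Purely to illustrate the problem classes named above (the exact formulations treated in the talk may differ), a constrained bilevel problem adds upper-level constraints that couple the leader’s variable with the follower’s response, while a bilevel saddle-point (min-max) problem replaces the upper-level minimization with a min-max objective:
\[
\min_{x \in X} \; f\bigl(x, y^*(x)\bigr) \;\; \text{s.t.} \;\; h\bigl(x, y^*(x)\bigr) \le 0,
\qquad\quad
\min_{x \in X} \max_{z \in Z} \; f\bigl(x, z, y^*(x)\bigr),
\]
in both cases with the lower-level response $y^*$ defined by an inner minimization as before.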
2 — 16:50 — Non-asymptotic Global Convergence Rates of BFGS with Exact Line Search
In this talk, we explore the non-asymptotic global convergence rates of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented with exact line search. Notably, due to Dixon’s equivalence result, our findings are also applicable to other quasi-Newton methods in the convex Broyden class employing exact line search, such as the Davidon-Fletcher-Powell (DFP) method. Specifically, we focus on problems where the objective function is strongly convex with Lipschitz continuous gradient and Hessian. Our results hold for any initial point and any symmetric positive definite initial Hessian approximation matrix. The analysis unveils a detailed three-phase convergence process, characterized by distinct linear and superlinear rates, contingent on the iteration progress. Additionally, our theoretical findings demonstrate the trade-offs between linear and superlinear convergence rates for BFGS when we modify the initial Hessian approximation matrix, a phenomenon further corroborated by our numerical experiments.
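For orientation, the iteration analyzed here has the standard BFGS form with an exact line search (a generic sketch, independent of the specific assumptions of the talk): given the current iterate $x_k$ and a symmetric positive definite Hessian approximation $B_k$,
\[
d_k = -B_k^{-1}\nabla f(x_k), \qquad
\eta_k = \arg\min_{\eta \ge 0} f(x_k + \eta d_k), \qquad
x_{k+1} = x_k + \eta_k d_k,
\]
followed by the curvature-pair update
\[
B_{k+1} = B_k - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k} + \frac{y_k y_k^{\top}}{y_k^{\top} s_k},
\qquad
s_k = x_{k+1} - x_k, \quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).
\]
The exact minimization over the step size $\eta$ is what distinguishes this setting from the inexact (e.g., Wolfe-type) line searches more commonly used in practice.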
3 — 17:20 — Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients
We show that, for finite-sum minimization problems, incorporating partial second-order information about the objective function can dramatically improve the robustness of variance-reduced stochastic gradient methods to the mini-batch size, making them more scalable while retaining their benefits over traditional Newton-type approaches. We demonstrate this phenomenon on a prototypical stochastic second-order algorithm, called Mini-Batch Stochastic Variance-Reduced Newton (Mb-SVRN), which combines variance-reduced gradient estimates with access to an approximate Hessian oracle. In particular, we show that when the data size $n$ is sufficiently large, i.e., $n \gg \alpha^2 \kappa$, where $\kappa$ is the condition number and $\alpha$ is the Hessian approximation factor, Mb-SVRN achieves a fast linear convergence rate that is independent of the gradient mini-batch size $b$, as long as $b$ lies between $1$ and $b_{\max} = O(n/(\alpha \log n))$. Only after the mini-batch size is increased past this critical point $b_{\max}$ does the method begin to transition into a standard Newton-type algorithm, which is much more sensitive to the Hessian approximation quality. We demonstrate this phenomenon empirically on benchmark optimization tasks, showing that, after tuning the step size, the convergence rate of Mb-SVRN remains fast over a wide range of mini-batch sizes, and that the dependence of the phase transition point $b_{\max}$ on the Hessian approximation factor $\alpha$ agrees with our theory.
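As a rough sketch of the type of update involved (an illustration under standard SVRG-style assumptions, not a verbatim statement of the Mb-SVRN algorithm), one combines a variance-reduced mini-batch gradient estimate with a fixed approximate Hessian $\widehat{H}$ of the objective at a snapshot point $\tilde{x}$:
\[
g_t = \nabla f_{S_t}(x_t) - \nabla f_{S_t}(\tilde{x}) + \nabla f(\tilde{x}),
\qquad
x_{t+1} = x_t - \eta\, \widehat{H}^{-1} g_t,
\]
where $S_t$ is a gradient mini-batch of size $b$, $\nabla f_{S_t}$ denotes the average gradient over $S_t$, $\nabla f(\tilde{x})$ is a full gradient computed at the snapshot, and $\alpha$ quantifies the quality of the Hessian approximation $\widehat{H}$, as in the abstract above.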