1. 14:00 — Generalized-Smooth Nonconvex Optimization is As Efficient As Smooth Nonconvex Optimization

Various optimal gradient-based algorithms have been developed for smooth nonconvex optimization. However, many nonconvex machine learning problems do not belong to the class of smooth functions, so the existing algorithms are sub-optimal for them. Instead, these problems have been shown to satisfy certain generalized-smooth conditions, which are not yet well understood in the existing literature. In this paper, we propose a notion of $\alpha$-symmetric generalized-smoothness that extends the existing notions and covers many important functions, such as high-order polynomials and exponential functions. We study the fundamental properties of the functions in this class and establish descent lemmas for them. Then, to solve this large class of nonconvex problems, we design a special deterministic normalized gradient descent algorithm that achieves the optimal iteration complexity $\mathcal{O}(\epsilon^{-2})$, and we also prove that the popular SPIDER variance-reduction algorithm achieves the optimal sample complexity $\mathcal{O}(\epsilon^{-3})$ in the stochastic setting. Our results show that solving generalized-smooth nonconvex problems is as efficient as solving smooth nonconvex problems.
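
To make the algorithm concrete, the following is a minimal Python sketch of deterministic normalized gradient descent; the fixed step size, stopping rule, and quartic test objective are illustrative assumptions rather than the paper's exact scheme.

    import numpy as np

    def normalized_gd(grad, x0, lr=0.1, tol=1e-8, max_iter=1000):
        # Normalized gradient descent: move a fixed distance lr along the
        # negative gradient direction, so the update stays bounded even when
        # the gradient grows rapidly under generalized smoothness.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            norm = np.linalg.norm(g)
            if norm < tol:  # approximately stationary
                break
            x = x - lr * g / norm
        return x

    # Toy usage: a quartic objective, whose gradient is not globally Lipschitz.
    grad_f = lambda x: 4 * x**3  # gradient of f(x) = sum(x**4)
    x_hat = normalized_gd(grad_f, x0=np.array([2.0, -1.5]))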

2. 14:30 — Stochastic Optimization under Hidden Convexity

In this work, we consider constrained stochastic optimization problems under hidden convexity, i.e., problems that admit a convex reformulation via a non-linear (but invertible) map $c(\cdot)$. A number of non-convex problems, ranging from optimal control, revenue and inventory management, to convex reinforcement learning, admit such a hidden convex structure. Unfortunately, in the majority of applications considered, the map $c(\cdot)$ is unavailable or implicit, so directly solving the convex reformulation is not possible. On the other hand, stochastic gradients with respect to the original variable are often easy to obtain. Motivated by these observations, we examine basic projected stochastic (sub)gradient methods for solving such problems under hidden convexity. We provide the first sample complexity guarantees for global convergence in both the smooth and non-smooth settings. Additionally, in the smooth setting, we strengthen our results to last-iterate convergence in terms of the function value gap using a momentum variant of projected stochastic gradient descent.
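
For reference, here is a minimal Python sketch of the projected stochastic gradient method operating directly on the original variable; the quadratic noise model, box constraint, and step size are hypothetical choices for illustration.

    import numpy as np

    def projected_sgd(stoch_grad, project, x0, lr=0.01, n_iter=5000, seed=0):
        # Projected SGD on the original (possibly non-convex) variable x;
        # under hidden convexity these iterates can still converge globally.
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            g = stoch_grad(x, rng)   # unbiased stochastic (sub)gradient
            x = project(x - lr * g)  # Euclidean projection onto the feasible set
        return x

    # Toy usage: minimize E[(x - z)^2] with z ~ N(1, 0.1), over the box [0, 2]^3.
    stoch_grad = lambda x, rng: 2 * (x - rng.normal(1.0, 0.1, size=x.shape))
    project = lambda x: np.clip(x, 0.0, 2.0)
    x_hat = projected_sgd(stoch_grad, project, x0=np.zeros(3))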

3. 15:00 — Convergence of Douglas-Rachford Splitting and Primal-Dual Hybrid Gradient for Non-Monotone, Non-Smooth Problems

Despite the widespread use of splitting methods, their convergence analysis has predominantly been limited to convex/monotone settings. This presentation delves into the convergence properties of two such methods, Douglas-Rachford splitting (DRS) and the primal-dual hybrid gradient (PDHG) method, in the absence of monotonicity and smoothness. Specifically, we introduce the notion of semimonotonicity and establish conditions under which DRS and PDHG exhibit global convergence, even when dealing with the sum of two semimonotone operators. Through illustrative examples, we showcase the versatility of our theoretical framework in addressing problems arising in optimization and reinforcement learning.
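
For readers less familiar with the method, below is a minimal Python sketch of the classical DRS iteration written in terms of proximal operators, applied to a convex toy problem chosen purely for illustration; the talk's analysis extends beyond this to semimonotone (non-monotone) operators.

    import numpy as np

    def drs(prox_f, prox_g, z0, n_iter=200):
        # Douglas-Rachford splitting:
        #   x_k = prox_f(z_k), y_k = prox_g(2*x_k - z_k), z_{k+1} = z_k + y_k - x_k
        z = np.asarray(z0, dtype=float)
        for _ in range(n_iter):
            x = prox_f(z)
            y = prox_g(2 * x - z)
            z = z + y - x
        return prox_f(z)

    # Toy usage: minimize 0.5*||x - a||^2 subject to x >= 0.
    a = np.array([1.0, -2.0, 0.5])
    prox_f = lambda z: (z + a) / 2.0       # prox of 0.5*||x - a||^2 (unit step)
    prox_g = lambda z: np.maximum(z, 0.0)  # projection onto the non-negative orthant
    x_sol = drs(prox_f, prox_g, z0=np.zeros(3))  # converges to [1.0, 0.0, 0.5]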

4. 15:30 — On the Lower Bound of Minimizing Polyak-Lojasiewicz Functions

The Polyak-Lojasiewicz (PL) condition (Polyak, 1963) is weaker than strong convexity but suffices to ensure global convergence of the Gradient Descent algorithm. In this paper, we study the lower bound for algorithms that use first-order oracles to find an approximate optimal solution. We show that any first-order algorithm requires at least $\Omega\left(\frac{L}{\mu}\log\frac{1}{\epsilon}\right)$ gradient queries to find an $\epsilon$-approximate optimal solution for a general $L$-smooth function with PL constant $\mu$. This result demonstrates the optimality of the Gradient Descent algorithm for minimizing smooth PL functions, in the sense that there exists a "hard" PL function on which no first-order algorithm can be faster than Gradient Descent, up to a numerical constant. In contrast, it is well known that momentum techniques, e.g., Nesterov (2003, chap. 2), can provably accelerate Gradient Descent to $O\left(\sqrt{\frac{L}{\hat{\mu}}}\log\frac{1}{\epsilon}\right)$ gradient queries for functions that are $L$-smooth and $\hat{\mu}$-strongly convex. Therefore, our result separates the hardness of minimizing smooth PL functions from that of smooth strongly convex functions, as the complexity of the former cannot be improved by any polynomial order in general.
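
For context, the upper bound matched by this lower bound follows from the standard one-step analysis of Gradient Descent with step size $1/L$ (a well-known derivation, sketched here for completeness): $L$-smoothness gives $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$, and combining this with the $\mu$-PL inequality $\|\nabla f(x_k)\|^2 \ge 2\mu\left(f(x_k) - f^*\right)$ yields $f(x_{k+1}) - f^* \le \left(1 - \frac{\mu}{L}\right)\left(f(x_k) - f^*\right)$, i.e., linear convergence within $O\left(\frac{L}{\mu}\log\frac{1}{\epsilon}\right)$ gradient queries.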