1. 14:00 — Exploiting Negative Curvature in Conjunction with Adaptive Sampling: Theoretical Results and a Practical Algorithm

With the emergence of deep learning in recent years, nonconvex optimization has attracted increasing interest, but nonconvexity poses significant challenges for algorithm design. Our goal is to develop algorithms with second-order convergence guarantees and affordable complexity, so that they can be applied to large-scale problems. To this end, we exploit negative curvature in conjunction with an adaptive sampling strategy. First, we provide convergence results for the deterministic setting in which gradients and Hessians are inexact. Second, I will present our results in the stochastic setting, where the gradients are constructed from random samples. Beyond the theory, I will also discuss our practical algorithm, a Newton-CG method with negative curvature detection and adaptive sampling, and conclude with experimental results on several nonconvex problems.
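
As a rough illustration of the practical algorithm described above, the following Python sketch shows a deterministic Newton-CG loop in which the CG solver monitors the curvature d^T H d and returns a direction of negative curvature when it encounters one. The adaptive sampling of gradients and Hessians is not reproduced here; the function names, tolerances and the simple Armijo line search are illustrative choices, not the speakers' implementation.

```python
import numpy as np

def cg_with_nc_detection(hess_vec, g, tol=1e-8, max_iter=100):
    """Run CG on H p = -g; stop early if a direction of negative
    curvature (d^T H d <= 0) is encountered and return it instead."""
    p = np.zeros_like(g)
    r = g.copy()                           # residual of H p + g
    d = -r
    for _ in range(max_iter):
        Hd = hess_vec(d)
        dHd = d @ Hd
        if dHd <= 0:                       # negative curvature detected
            return d, True
        alpha = (r @ r) / dHd
        p += alpha * d
        r_new = r + alpha * Hd
        if np.linalg.norm(r_new) <= tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return p, False

def newton_cg_nc(f, grad, hess_vec, x0, max_iter=50, tol=1e-6):
    """Outer loop: take the Newton-CG step, or the negative-curvature
    direction when one is found, with a backtracking line search."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        p, is_nc = cg_with_nc_detection(lambda v: hess_vec(x, v), g)
        if is_nc and g @ p > 0:            # orient NC direction for descent
            p = -p
        t, fx = 1.0, f(x)                  # Armijo backtracking
        while f(x + t * p) > fx + 1e-4 * t * (g @ p) and t > 1e-12:
            t *= 0.5
        x = x + t * p
    return x
```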

2. 14:30 — Solving large-scale nonlinear least-squares with random Gauss-Newton models

We address the solution of large-scale nonlinear least-squares problems by stochastic Gauss-Newton methods combined with a line-search strategy. The proposed algorithms have a per-iteration computational complexity lower than that of classical deterministic methods, thanks to the use of random models inspired by randomized linear algebra tools. Under suitable assumptions, the stochastic optimization procedures can achieve a desired level of accuracy in the first-order optimality condition. We discuss the construction of the random models and the iteration complexity results for driving the gradient norm below a prescribed accuracy, and then present results from our computational experience.
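
As a hedged illustration of the kind of random Gauss-Newton model discussed above, the Python sketch below subsamples residual components and Jacobian rows to form the model, solves the resulting small linear least-squares problem for the step, and applies a backtracking line search. The interfaces `residual(x)` and `jacobian_rows(x, idx)` and all constants are hypothetical, and the actual randomized linear algebra constructions from the talk are not reproduced.

```python
import numpy as np

def stochastic_gauss_newton(residual, jacobian_rows, x0, m, batch=64,
                            max_iter=100, c1=1e-4, tol=1e-6):
    """Illustrative stochastic Gauss-Newton loop with line search.
    residual(x) returns the full residual vector r(x) in R^m;
    jacobian_rows(x, idx) returns only the Jacobian rows in idx,
    which is where the per-iteration saving comes from."""
    rng = np.random.default_rng(0)
    x = x0.copy()
    f = lambda z: 0.5 * np.sum(residual(z) ** 2)   # full objective
    for _ in range(max_iter):
        idx = rng.choice(m, size=min(batch, m), replace=False)
        r_S = residual(x)[idx]                     # sampled residuals
        J_S = jacobian_rows(x, idx)                # sampled Jacobian rows
        g_S = J_S.T @ r_S                          # stochastic gradient estimate
        if np.linalg.norm(g_S) <= tol:
            break
        # random Gauss-Newton model: min_p ||J_S p + r_S||^2
        p, *_ = np.linalg.lstsq(J_S, -r_S, rcond=None)
        # Armijo backtracking line search on the objective
        t, fx = 1.0, f(x)
        while f(x + t * p) > fx + c1 * t * (g_S @ p) and t > 1e-12:
            t *= 0.5
        x = x + t * p
    return x
```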

3. 15:00 — A stochastic gradient method with variance control and variable learning rate for Deep Learning

In this talk we present a stochastic gradient algorithm that governs the increase of the mini-batch size in a predefined fashion and automatically adjusts the learning rate by means of a monotone or non-monotone line-search procedure. The mini-batch size is incremented at a suitable a priori rate throughout the iterative process, so that the variance of the stochastic gradients is progressively reduced. The a priori rate is not subject to restrictive assumptions, allowing for a slow increase of the mini-batch size. On the other hand, the learning rate can vary non-monotonically along the iterations, as long as it remains appropriately bounded. Convergence results for the proposed method are provided for both convex and nonconvex objective functions. Moreover, the algorithm can be shown to enjoy a global linear rate of convergence on strongly convex functions. The low per-iteration cost, the limited memory requirements and the robustness with respect to hyperparameter settings make the suggested approach well suited for implementation within the deep learning framework, including on GPGPU-equipped architectures. Numerical results on training deep neural networks for multi-class image classification show promising behaviour of the proposed scheme with respect to similar state-of-the-art competitors.
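
A minimal sketch of the mechanism described above, under assumed interfaces `loss_batch(w, idx)` and `grad_batch(w, idx)`: the mini-batch size grows at a predefined geometric rate, and the learning rate is selected by a simple monotone backtracking line search bounded by `lr_min` and `lr_max`. The non-monotone variant and the convergence safeguards from the talk are omitted; all names and constants are illustrative.

```python
import numpy as np

def sgd_variance_control(loss_batch, grad_batch, w0, n_samples,
                         b0=16, growth=1.05, lr_max=1.0, lr_min=1e-6,
                         c1=1e-4, max_iter=200):
    """Stochastic gradient loop in which the mini-batch size grows at a
    predefined rate (progressively reducing the gradient variance) and
    the learning rate is set by backtracking on the same mini-batch."""
    rng = np.random.default_rng(0)
    w = w0.copy()
    for k in range(max_iter):
        b = min(int(np.ceil(b0 * growth ** k)), n_samples)   # a priori rate
        idx = rng.choice(n_samples, size=b, replace=False)
        g = grad_batch(w, idx)                                # mini-batch gradient
        fk = loss_batch(w, idx)
        lr = lr_max
        # monotone backtracking line search on the mini-batch loss
        while loss_batch(w - lr * g, idx) > fk - c1 * lr * (g @ g) and lr > lr_min:
            lr *= 0.5
        w = w - lr * g
    return w
```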

4. 15:30 — Probabilistic Trust Region method for solving Multi-objective Problems

The problem considered is a multi-objective optimization problem, in which the goal is to find an optimal value of a vector function representing various criteria. We derive and analyse an algorithm that combines the trust-region framework with probabilistic model functions, which makes it able to cope with noisy problems and with approximate functions and derivatives. We prove almost sure convergence of the proposed algorithm to a Pareto critical point, provided the model functions are good approximations in a probabilistic sense. Numerical results demonstrate the effectiveness of the probabilistic trust-region method by comparing it with competitive stochastic multi-objective solvers. An application to supervised machine learning is showcased by training non-discriminatory logistic regression models on data sets of different sizes. Additionally, we use several test examples with irregularly shaped Pareto fronts to exhibit the efficiency of the algorithm.
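
For intuition only, the following sketch specializes the idea to two objectives with a deterministic skeleton: linear models built from gradient estimates, a common descent direction given by the min-norm convex combination of the two gradients, and a trust-region acceptance test on the worst-case decrease. The probabilistic accuracy requirements and the almost-sure convergence machinery from the talk are not reproduced, and all names and constants are illustrative.

```python
import numpy as np

def pareto_descent_dir(g1, g2):
    """Min-norm convex combination of two (estimated) gradients: the
    classical common descent direction for a bi-objective problem."""
    diff = g1 - g2
    denom = diff @ diff
    lam = 0.5 if denom == 0 else np.clip((g2 @ (g2 - g1)) / denom, 0.0, 1.0)
    return -(lam * g1 + (1 - lam) * g2)

def tr_multiobjective(f1, f2, grad1, grad2, x0, delta=1.0,
                      eta=0.1, max_iter=100, delta_min=1e-8):
    """Trust-region loop for two objectives with possibly noisy
    evaluations: a common descent step within the trust region and an
    acceptance ratio based on the worst-case (min over objectives) decrease."""
    x = x0.copy()
    for _ in range(max_iter):
        g1, g2 = grad1(x), grad2(x)
        d = pareto_descent_dir(g1, g2)
        nd = np.linalg.norm(d)
        if nd <= 1e-12 or delta <= delta_min:     # approximately Pareto critical
            break
        p = min(1.0, delta / nd) * d
        pred = min(-(g1 @ p), -(g2 @ p))          # predicted worst-case decrease
        ared = min(f1(x) - f1(x + p), f2(x) - f2(x + p))
        rho = ared / pred if pred > 0 else -np.inf
        if rho >= eta:                            # successful step: accept, enlarge region
            x, delta = x + p, 2.0 * delta
        else:                                     # unsuccessful: shrink region
            delta *= 0.5
    return x
```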