1. 14:00 — Practical Performance Guarantees for Pipelined DNN Inference

We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into stages and minimizing the running time of the bottleneck stage, including communication. We give practical and effective algorithms for this NP-hard problem, but our emphasis is on tackling the practitioner's dilemma of deciding when a solution is good enough. To this end, we design novel mixed-integer programming (MIP) relaxations for proving lower bounds. Applying these methods to a diverse testbed of 369 production models, for {2, 4, 8, 16, 32, 64} pipeline stages, we empirically show that these lower bounds are strong enough to be useful in practice. Our lower bounds are substantially stronger than standard combinatorial bounds. For example, evaluated via geometric means across our production testbed at a fixed number of pipeline stages, our MIP formulations raised the lower bound from 0.4598 to 0.9452, expressed as a fraction of the best partition found. In other words, our improved lower bounds reduced the optimality gap by a factor of 9.855x.
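To make the objective and the "standard combinatorial bounds" concrete, here is a minimal sketch for the special case of a chain-shaped model graph. All names, costs, and the chain assumption are illustrative and not the paper's exact model, which handles general model graphs and communication in full.

```python
# Hedged sketch: evaluate one candidate pipeline partition (bottleneck stage time,
# including the communication across its downstream cut) and compute two standard
# combinatorial lower bounds. Chain-shaped graph and all values are assumptions.

def bottleneck_time(node_costs, edge_costs, boundaries):
    """Max over stages of (stage compute time + communication across its cut)."""
    # boundaries: indices where each stage begins, e.g. [0, 2, 4] for 3 stages
    stages = list(zip(boundaries, boundaries[1:] + [len(node_costs)]))
    worst = 0.0
    for start, end in stages:
        compute = sum(node_costs[start:end])
        # communication: cost of the edge cut at the stage's downstream boundary
        comm = edge_costs[end - 1] if end < len(node_costs) else 0.0
        worst = max(worst, compute + comm)
    return worst

def combinatorial_lower_bound(node_costs, k):
    """Heaviest single operator, or total work spread evenly over k stages."""
    return max(max(node_costs), sum(node_costs) / k)

if __name__ == "__main__":
    node_costs = [4.0, 1.0, 3.0, 2.0, 6.0, 2.0]   # per-operator compute times
    edge_costs = [0.5, 0.2, 0.8, 0.1, 0.3]        # activation transfer costs
    partition = [0, 2, 4]                         # k = 3 stages
    ub = bottleneck_time(node_costs, edge_costs, partition)
    lb = combinatorial_lower_bound(node_costs, k=3)
    print(f"partition bottleneck = {ub:.2f}, combinatorial lower bound = {lb:.2f}")
```

The gap between such a partition's bottleneck time and the simple lower bound is exactly the gap the MIP relaxations in the talk aim to close.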

2. 14:30 — Contextual Nearest Neighbor Search in Sublinear Time

Over the past couple of decades, locality-sensitive hash functions (LSH) have emerged as the main technique for finding near neighbors, with a vast range of applications in machine learning, data mining, and database management, among other areas. In this work, we introduce the concept of "contextual near-neighbor queries", where each query consists of a point accompanied by a context that specifies which dimensions matter and which should be ignored.
To accommodate such queries, we extend the near-neighbor problem to what we call the in-context near-neighbor problem.
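As a way to pin down the query semantics, here is a brute-force sketch in which the context is modeled as a boolean mask over dimensions; this mask-based reading and all names are our assumptions, and the talk itself concerns sublinear-time data structures rather than a linear scan.

```python
# Hedged sketch of contextual near-neighbor semantics: distances are measured
# only on the dimensions selected by the query's context mask.

import numpy as np

def contextual_nearest_neighbor(data, query, context_mask):
    """Index of the point closest to `query` on the dimensions where
    `context_mask` is True; all other dimensions are ignored."""
    masked_data = data[:, context_mask]
    masked_query = query[context_mask]
    dists = np.linalg.norm(masked_data - masked_query, axis=1)
    return int(np.argmin(dists))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 8))
    query = rng.normal(size=8)
    mask = np.array([True, True, False, False, True, False, False, False])
    print("nearest under this context:", contextual_nearest_neighbor(data, query, mask))
```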

3. 15:00 — Multi-Vector Retrieval: The New Frontier of Similarity Search

Over the past decade, dense embedding models have become a fundamental component of modern information retrieval (IR) paradigms. These models produce a single d-dimensional embedding per data point, allowing for fast retrieval via highly optimized maximum inner product search (MIPS) algorithms. Recently, beginning with the landmark ColBERT model, it has been shown that multi-vector models, which produce a set of embeddings per data point, can achieve significantly higher-quality results on IR tasks than single-embedding models. Unfortunately, retrieval (nearest neighbor search) for these models is computationally expensive, as multi-vector similarity is a significantly more complex metric that requires new techniques. This dilemma is the primary bottleneck preventing the widespread use of multi-vector models, making multi-vector retrieval a central challenge in modern similarity search.
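To see why scoring is expensive, here is the ColBERT-style "sum of max inner products" (Chamfer) similarity commonly used by multi-vector models; the shapes and names below are illustrative.

```python
# Hedged sketch: each query vector is matched to its best document vector and
# the per-vector scores are summed.

import numpy as np

def chamfer_similarity(query_vecs, doc_vecs):
    """query_vecs: (m, d), doc_vecs: (n, d) -> scalar similarity."""
    sims = query_vecs @ doc_vecs.T          # (m, n) all pairwise inner products
    return float(sims.max(axis=1).sum())    # best match per query vector, summed
```

Unlike a single dot product, scoring one document costs O(m * n * d), which is why exhaustive multi-vector search does not scale.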

In this talk, we discuss the first scalable and direct search algorithms for multi-vector similarities. Our approach is based on a randomized, data-dependent embedding from a set of vectors to a single vector, called a "Fixed Dimensional Encoding" (FDE), which employs random space partitions and projections to approximate multi-vector similarity with single-vector dot products. This enables the use of highly optimized MIPS solvers, backed by decades of prolific research, for multi-vector retrieval. The method builds on the theory of probabilistic tree embeddings underlying decades of theoretical nearest neighbor search literature. Empirically, it achieves latency on the same order of magnitude as single-vector retrieval, while nearly optimally capturing the greater expressive power of multi-vector similarity.
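Below is a simplified, single-repetition sketch of the FDE idea as we read it: partition the space with random hyperplanes, aggregate a point set per bucket (sum on the query side, mean on the document side), and concatenate the buckets so that one dot product approximates the multi-vector score. The actual construction involves repetitions, inner projections, and empty-bucket handling that we omit; all names here are ours.

```python
# Hedged, simplified FDE-style encoding: SimHash buckets + per-bucket aggregation.

import numpy as np

def simhash_buckets(vecs, hyperplanes):
    """Assign each vector a bucket id from the sign pattern of its projections."""
    bits = (vecs @ hyperplanes.T) > 0                      # (n, b) booleans
    return bits @ (1 << np.arange(hyperplanes.shape[0]))   # bucket ids in [0, 2^b)

def fde(vecs, hyperplanes, is_query):
    b, d = hyperplanes.shape[0], vecs.shape[1]
    out = np.zeros((2 ** b, d))
    ids = simhash_buckets(vecs, hyperplanes)
    for bucket in range(2 ** b):
        members = vecs[ids == bucket]
        if len(members) == 0:
            continue
        # queries sum their vectors per bucket; documents average theirs
        out[bucket] = members.sum(0) if is_query else members.mean(0)
    return out.ravel()                                     # one fixed-dimensional vector

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, b = 16, 3
    planes = rng.normal(size=(b, d))
    q, doc = rng.normal(size=(8, d)), rng.normal(size=(40, d))
    q_fde, doc_fde = fde(q, planes, True), fde(doc, planes, False)
    print("FDE dot product (approximate multi-vector score):", q_fde @ doc_fde)
```

The resulting single vectors can then be fed to any off-the-shelf MIPS index, which is the point of the reduction.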

4. 15:30 — Attention-based Model Structure Optimization

We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
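The following is a hedged, simplified sketch of the greedy loop we understand Sequential Attention to perform, instantiated for linear regression with plain gradient descent; the masking scheme, optimizer, and hyperparameters are our assumptions, not the authors' implementation.

```python
# Hedged sketch: at each step, train attention logits jointly with the model,
# read the softmax weights over not-yet-selected features as importance scores,
# and greedily add the highest-scoring feature.

import numpy as np

def sequential_attention(X, y, k, steps=500, lr=0.1):
    n, d = X.shape
    selected = []
    for _ in range(k):
        candidates = [j for j in range(d) if j not in selected]
        w = np.zeros(d)                     # linear model weights
        logits = np.zeros(len(candidates))  # attention logits for candidate features
        for _ in range(steps):
            att = np.exp(logits - logits.max())
            att /= att.sum()                # softmax attention over candidates
            mask = np.zeros(d)
            mask[selected] = 1.0            # already-selected features pass through
            mask[candidates] = att          # candidates are softly gated
            resid = (X * mask) @ w - y
            grad_w = (X * mask).T @ resid / n
            grad_mask = (X * w).T @ resid / n
            grad_att = grad_mask[candidates]
            grad_logits = att * (grad_att - (att * grad_att).sum())  # softmax backprop
            w -= lr * grad_w
            logits -= lr * grad_logits
        selected.append(candidates[int(np.argmax(logits))])
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = 3 * X[:, 2] - 2 * X[:, 7] + 0.1 * rng.normal(size=200)
    print("selected features:", sequential_attention(X, y, k=2))
```

In the linear-regression setting, this greedy attention-weighted selection is the kind of procedure the talk relates to Orthogonal Matching Pursuit.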