1. 14:00 — Adaptive Experimentation at Scale: A Computational Framework

Experimentation is the basis of scientific decision-making. Adaptive experimentation (e.g., bandits) can provide substantial statistical power, but existing algorithms are designed by theoreticians to achieve analytic guarantees and are notoriously difficult to implement and maintain in practice. We propose a new paradigm of adaptive experimentation centered around computational tools. Motivated by problem instances involving large batches, delayed feedback, and a small number of opportunities to reallocate sampling effort, we design near-optimal, easy-to-deploy adaptive experimentation algorithms that can flexibly handle any batch size. Our main observation is that the normal approximations that are universal in statistics can guide the design of scalable adaptive experimentation methods. Our preliminary results show large performance gains over state-of-the-art methods (e.g., Thompson sampling).
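To make the idea concrete, here is a minimal sketch (not the authors' implementation) of a batched experiment driven by Gaussian approximations: after each large batch, every arm's mean reward is summarized by its sample mean and standard error, and the next batch is allocated in proportion to the approximate probability that each arm is best. The Bernoulli rewards, batch sizes, and the 1% allocation floor are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_batched_experiment(true_means, batch_size=1000, n_batches=5, n_draws=2000):
    """Batched adaptive experiment driven by Gaussian (CLT) approximations.

    After each large batch, each arm's mean reward is summarized by a
    normal approximation (sample mean, standard error). The next batch is
    allocated in proportion to the approximate probability that each arm
    is best, i.e. a batched, Gaussian-approximation analogue of Thompson
    sampling.
    """
    k = len(true_means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    sq_sums = np.zeros(k)
    alloc = np.full(k, 1.0 / k)          # start with a uniform split

    for _ in range(n_batches):
        n_per_arm = np.maximum((alloc * batch_size).astype(int), 2)
        for a in range(k):
            x = rng.binomial(1, true_means[a], size=n_per_arm[a]).astype(float)
            counts[a] += n_per_arm[a]
            sums[a] += x.sum()
            sq_sums[a] += (x ** 2).sum()

        means = sums / counts
        variances = np.maximum(sq_sums / counts - means ** 2, 1e-6)
        std_errs = np.sqrt(variances / counts)

        # Probability each arm is best under the Gaussian approximation,
        # estimated by Monte Carlo; this drives the next batch's allocation.
        draws = rng.normal(means, std_errs, size=(n_draws, k))
        p_best = np.bincount(draws.argmax(axis=1), minlength=k) / n_draws
        alloc = np.clip(p_best, 0.01, None)   # keep a floor for exploration
        alloc /= alloc.sum()

    return means, alloc

means, alloc = run_batched_experiment(true_means=[0.05, 0.06, 0.065])
print("estimated means:", means.round(4), "final allocation:", alloc.round(3))
```

Because the allocation is recomputed only at batch boundaries, the same sketch works for any batch size and any (small) number of reallocation opportunities, which is the regime the abstract targets.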

2. 14:30 — Exploration Incentives in Model-Based Reinforcement Learning

Reinforcement Learning (RL) is a form of stochastic adaptive control in which one seeks to estimate the parameters of a controller solely from data, and it has gained popularity in recent years. However, technological applications of RL are often hindered by the astronomical sample complexity demanded by training. Model-based reinforcement learning (MBRL) is known to provide a practical, sample-efficient approach; however, its performance certificates in terms of Bayesian regret often require restrictive Gaussian assumptions and may fail to distinguish between vastly different performance in sparse and dense reward settings. Motivated by these gaps, we propose a way to make MBRL, namely posterior sampling combined with Model-Predictive Control (MPC), computationally efficient for mixture distributions, based on a novel application of integral probability metrics and the kernelized Stein discrepancy. We then build upon this insight to pose a new exploration incentive called Stein Information Gain, which yields a variant of information-directed sampling (IDS) whose exploration incentive can be evaluated in closed form. Bayesian and information-theoretic regret bounds for the proposed algorithms are presented. Finally, experimental validation on environments from OpenAI Gym and the DeepMind Control Suite illustrates the merits of the proposed methodologies in the sparse-reward setting.
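As background for the Stein-based quantities mentioned in the abstract, the following is a small numpy sketch of a V-statistic estimate of the kernelized Stein discrepancy with an RBF kernel. The function name, bandwidth, and the standard-normal target in the example are illustrative assumptions, not the speakers' implementation.

```python
import numpy as np

def ksd_vstat(samples, score_fn, bandwidth=1.0):
    """V-statistic estimate of the kernelized Stein discrepancy (KSD).

    `samples` has shape (n, d); `score_fn(x)` returns grad log p(x) of the
    target p, evaluated row-wise. Uses an RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 h^2)). A small KSD indicates the sample
    is close to the target, without needing p's normalizing constant.
    """
    n, d = samples.shape
    h2 = bandwidth ** 2
    s = score_fn(samples)                                # (n, d) scores

    diff = samples[:, None, :] - samples[None, :, :]     # (n, n, d)
    sqdist = (diff ** 2).sum(-1)                         # (n, n)
    K = np.exp(-sqdist / (2 * h2))                       # kernel matrix

    term1 = (s @ s.T) * K                                # score-score term
    grad_kx = -diff / h2 * K[..., None]                  # grad_x k(x, y)
    term2 = np.einsum("id,ijd->ij", s, -grad_kx)         # s(x) . grad_y k
    term3 = np.einsum("jd,ijd->ij", s, grad_kx)          # s(y) . grad_x k
    term4 = K * (d / h2 - sqdist / h2 ** 2)              # trace of mixed grad

    return (term1 + term2 + term3 + term4).mean()

# Example: standard-normal target, so score(x) = -x.
rng = np.random.default_rng(0)
good = rng.normal(size=(300, 2))
bad = rng.normal(loc=1.5, size=(300, 2))
score = lambda x: -x
print("KSD (matched):", ksd_vstat(good, score))
print("KSD (shifted):", ksd_vstat(bad, score))
```

The appeal of a discrepancy of this form is that it only needs the score of the target, so it can be evaluated for unnormalized posteriors; the closed-form exploration incentive described in the abstract builds on quantities of this kind.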

3. 15:00 — RL and ADP for Rideshare Operations

Dispatch is at the center of rideshare operations. In this talk, we will first revisit the core developments in reinforcement learning (RL) for rideshare dispatch and the extensions to online learning and joint optimization with repositioning. We will then examine, in parallel, an approximate dynamic programming (ADP) approach to the same problem and discuss their interesting connections, which offer an alternative lens for interpreting the RL approach.
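For readers unfamiliar with the setup, here is a hedged toy sketch of the value-function-based dispatch scoring common in this line of work: each driver-order pair is scored by the immediate fare plus a one-step temporal-difference term, and the assignment is solved as a max-weight bipartite matching. The zone values, fares, and distance penalty below are placeholders, not the speaker's system.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dispatch(drivers, orders, value, gamma=0.9):
    """Assign drivers to ride requests with a value-aware matching.

    Each (driver, order) pair is scored by the immediate trip fare plus the
    discounted value of the order's destination zone minus the value of the
    driver's current zone, i.e. a one-step temporal-difference term on top
    of the myopic reward. The assignment is then a max-weight bipartite
    matching, solved here with the Hungarian algorithm.
    """
    scores = np.zeros((len(drivers), len(orders)))
    for i, d in enumerate(drivers):
        for j, o in enumerate(orders):
            td_bonus = gamma * value[o["dest"]] - value[d["zone"]]
            pickup_cost = 0.1 * abs(d["zone"] - o["origin"])   # toy distance
            scores[i, j] = o["fare"] - pickup_cost + td_bonus

    rows, cols = linear_sum_assignment(-scores)   # maximize total score
    return [(drivers[i]["id"], orders[j]["id"]) for i, j in zip(rows, cols)]

value = [1.0, 2.5, 0.5, 3.0]                      # toy per-zone values
drivers = [{"id": "d1", "zone": 0}, {"id": "d2", "zone": 2}]
orders = [{"id": "o1", "origin": 0, "dest": 3, "fare": 5.0},
          {"id": "o2", "origin": 2, "dest": 1, "fare": 4.0}]
print(dispatch(drivers, orders, value))
```

The connection the talk alludes to is visible here: whether the per-zone values come from an RL-style learned value function or from an ADP recursion, the dispatch decision itself reduces to the same value-augmented matching.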

4. 15:30 — Deep Reinforcement Learning for Recommender Systems and Beyond

Current recommender systems predominantly employ supervised learning algorithms, which often fail to optimize for long-term user engagement. This short-sighted approach highlights the significance of sequential recommender systems, designed to make decisions that extend beyond immediate user responses. To maximize cumulative positive feedback, these systems must balance exploration—probing users for insightful feedback to inform future recommendations—and the strategic selection of items that pave the way for more successful future interactions.

However, the dynamic nature of user behavior and evolving social trends present additional challenges, demanding that sequential recommender systems operate effectively in non-stationary environments. This necessitates a more discerning exploration strategy, focusing on gathering enduring insights rather than ephemeral information.
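As one simple illustration of exploration under drift (not the speakers' production algorithm), the sketch below runs Bernoulli Thompson sampling with exponentially discounted posterior counts, so that stale feedback fades and the recommender keeps probing items whose appeal may have changed. The discount factor, reward model, and trend shift in the toy run are assumptions made for the example.

```python
import numpy as np

class DiscountedThompson:
    """Bernoulli Thompson sampling with exponential discounting.

    Posterior (Beta) counts decay by `gamma` each round, so feedback from
    long-ago interactions fades and the agent keeps exploring items whose
    current appeal may have drifted, one simple way to run exploration in
    a non-stationary recommendation environment.
    """

    def __init__(self, n_items, gamma=0.99, seed=0):
        self.alpha = np.ones(n_items)     # pseudo-counts of positive feedback
        self.beta = np.ones(n_items)      # pseudo-counts of negative feedback
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def recommend(self):
        # Sample a plausible click rate per item and pick the best.
        return int(self.rng.beta(self.alpha, self.beta).argmax())

    def update(self, item, clicked):
        # Decay all counts toward the uniform prior, then add new evidence.
        self.alpha = 1.0 + self.gamma * (self.alpha - 1.0)
        self.beta = 1.0 + self.gamma * (self.beta - 1.0)
        self.alpha[item] += clicked
        self.beta[item] += 1 - clicked

# Toy run: item 1 is best early, item 0 becomes best after a trend shift.
rng = np.random.default_rng(1)
agent = DiscountedThompson(n_items=3, gamma=0.98)
for t in range(2000):
    rates = [0.05, 0.10, 0.07] if t < 1000 else [0.12, 0.04, 0.07]
    item = agent.recommend()
    agent.update(item, rng.binomial(1, rates[item]))
print("posterior means:", (agent.alpha / (agent.alpha + agent.beta)).round(3))
```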

In this talk, we introduce several key advancements in sequential recommender systems and related production systems. These contributions center on sequential decision making and scalable, intelligent exploration techniques that accommodate both immediate and delayed user feedback while adeptly adjusting to non-stationary contexts. We provide empirical evidence demonstrating that our proposed algorithms outperform existing methods on various benchmarks and in real-world systems.