1. 08:30 — HyperAgent: Advancing Scalable Exploration through Fast Uncertainty Estimation in RL

  • Yingru Li, The Chinese University of Hong Kong, Shenzhen
  • Jiawei Xu, The Chinese University of Hong Kong, Shenzhen
  • Lei Han, Tencent
  • Zhi Quan Luo, The Chinese University of Hong Kong, Shenzhen

We introduce HyperAgent, a scalable reinforcement learning (RL) algorithm that advances data-efficient exploration through fast incremental uncertainty estimation. HyperAgent leverages the hypermodel framework to efficiently approximate the posterior distribution of action-value functions, providing a simple yet more effective alternative to the epsilon-greedy exploration strategy that integrates seamlessly with established deep RL frameworks. Additionally, we extend this approach with GPT-derived feature embeddings in GPT-HyperAgent, enhancing uncertainty-aware reward modeling in contextual bandit problems with natural language input.

Theoretical analysis confirms that, with logarithmic per-step computational complexity, HyperAgent's performance matches that of exact Thompson sampling (TS) in linear contextual bandits and of Randomized Least-Squares Value Iteration (RLSVI) in tabular RL environments. Central to this analysis is a new probability tool for sequential random projection, the first of its kind.

Empirical results support our theoretical findings, demonstrating HyperAgent's robust performance in large-scale deep RL benchmarks and real-world safety-critical applications, such as automated content moderation with human feedback.
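The abstract does not spell out the hypermodel architecture, but a minimal PyTorch sketch of the general idea (an index-conditioned Q head whose random index $z$ plays the role of a posterior sample, enabling Thompson-style action selection) might look as follows. `HyperQ`, `index_dim`, and the linear-in-index parameterization are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class HyperQ(nn.Module):
    """Illustrative hypermodel head: maps a random index z to a perturbation
    of the Q-values, so sampling z ~ N(0, I) yields an approximate posterior
    sample of the action-value function (names and sizes are assumptions)."""
    def __init__(self, obs_dim, n_actions, index_dim=8, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, n_actions)                 # point estimate
        self.hyper_head = nn.Linear(hidden, n_actions * index_dim)    # uncertainty part
        self.index_dim = index_dim
        self.n_actions = n_actions

    def forward(self, obs, z):
        h = self.torso(obs)
        mean = self.mean_head(h)
        coef = self.hyper_head(h).view(-1, self.n_actions, self.index_dim)
        return mean + torch.einsum("bad,d->ba", coef, z)

def act(model, obs, index_dim=8):
    # Thompson-style exploration: draw one index sample, then act greedily
    # with respect to the sampled Q function.
    z = torch.randn(index_dim)
    with torch.no_grad():
        q = model(obs.unsqueeze(0), z)
    return int(q.argmax(dim=-1))
```

In this kind of scheme, resampling the index once per episode rather than per step is what typically produces temporally consistent ("deep") exploration; the choice here is only illustrative.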

2. 09:00 — Variance-Dependent Regret Bounds for Non-stationary Linear Bandits

We investigate the non-stationary stochastic linear bandit problem, in which the reward distribution evolves each round. Existing algorithms characterize the non-stationarity through the total variation budget $B_K$, the sum of the changes between consecutive feature vectors of the linear bandits over $K$ rounds. However, this quantity only measures non-stationarity with respect to the expectation of the reward distribution, which makes existing algorithms sub-optimal in the general non-stationary distribution setting. In this work, we propose algorithms that utilize both the variance of the reward distribution and $B_K$, and show that they achieve tighter regret upper bounds. Specifically, we introduce two novel algorithms addressing the cases where the variance information of the rewards is known and unknown, respectively. Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary stochastic linear bandits under different settings. Experimental evaluations further validate the superior performance of our proposed algorithms over existing works.
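For concreteness, one common formalization of the two quantities in the non-stationary linear bandit literature (assumed here for illustration; the paper's exact definitions may differ) is $B_K = \sum_{k=1}^{K-1} \lVert \theta_{k+1} - \theta_k \rVert_2$ and $V_K = \sum_{k=1}^{K} \sigma_k^2$, where $\theta_k$ is the unknown parameter and $\sigma_k^2$ the reward-noise variance at round $k$. Variance-aware algorithms can then exploit a small $V_K$ (e.g., nearly deterministic rewards) to obtain regret bounds tighter than those depending on $K$ alone.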

3. 09:30 — Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution using Langevin Monte Carlo, an efficient type of Markov chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a sub-linear regret bound. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
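Concretely, the "noisy gradient descent updates" described above correspond to Langevin-style (SGLD) steps on the Q-network parameters. Below is a minimal sketch assuming a generic TD loss; `loss_fn`, `step_size`, and `inverse_temp` are illustrative names, and the paper's deep RL variant replaces this plain gradient step with an Adam-based one.

```python
import torch

def langevin_step(q_net, loss_fn, batch, step_size=1e-3, inverse_temp=1e4):
    """One noisy gradient (Langevin) update on the Q-network parameters.

    SGLD-style sketch: gradient descent on the TD loss plus Gaussian noise
    scaled by sqrt(2 * step_size / inverse_temp). The names and constants
    are assumptions for illustration, not the paper's API.
    """
    loss = loss_fn(q_net, batch)
    grads = torch.autograd.grad(loss, list(q_net.parameters()))
    with torch.no_grad():
        for p, g in zip(q_net.parameters(), grads):
            noise = torch.randn_like(p) * (2.0 * step_size / inverse_temp) ** 0.5
            p.add_(-step_size * g + noise)
```

Iterating this update keeps the parameters (approximately) distributed according to a Gibbs posterior over Q functions, so acting greedily with respect to the current network already implements a Thompson-sampling-like exploration rule.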

4. 10:00 — Rethinking of Deep RL: Learning the Target Network in Function Space

We focus on the task of learning the value function in the approximate reinforcement learning (RL) setting. Existing algorithms solve this task by updating a pair of online and target networks while ensuring that the parameters of these two networks are equivalent. We propose Lookahead-Replicate (LR), a new value-function approximation algorithm that is agnostic to this parameter-space equivalence. Instead, the algorithm is designed to maintain an equivalence between the two networks in function space, obtained by employing a new target-network update. We show that LR leads to convergent behavior in learning the value function. We also present empirical results demonstrating that LR-based updates significantly improve the performance of deep RL on the Atari benchmark.
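The abstract does not specify the new target-network update, but one way a function-space replication step could look (an assumed reading of "equivalence in function space", not necessarily the authors' exact LR update) is to fit the target network's outputs to the online network's outputs on a batch of states, instead of copying parameters:

```python
import torch

def replicate_in_function_space(online_net, target_net, states,
                                n_steps=10, lr=1e-3):
    """Illustrative function-space target update: regress the target
    network's outputs onto the online network's outputs on a batch of
    states, rather than copying parameters. This is an assumed sketch,
    not the paper's exact Lookahead-Replicate procedure."""
    opt = torch.optim.SGD(target_net.parameters(), lr=lr)
    with torch.no_grad():
        online_out = online_net(states)          # fixed regression targets
    for _ in range(n_steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(target_net(states), online_out)
        loss.backward()
        opt.step()
    return loss.item()
```

The appeal of such an update is that the two networks can have different parameterizations (or even architectures) while still agreeing, approximately, on the states that matter for the TD targets.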