116:20 — Hindsight Learning for MDPs with Exogenous Inputs

Many resource management problems require sequential decision-making under uncertainty, where the only uncertainty affecting the decision outcomes are exogenous variables outside the control of the decision-maker. We model these problems as Exo-MDPs (Markov Decision Processes with Exogenous Inputs) and design a class of data-efficient algorithms for them termed Hindsight Learning (HL). Our HL algorithms achieve data efficiency by leveraging a key insight: having samples of the exogenous variables, past decisions can be revisited in hindsight to infer counterfactual consequences that can accelerate policy improvements. We compare HL against classic baselines in the multi-secretary, airline revenue management, and newsvendor problems. We also scale our algorithms to a business-critical cloud resource management problem - allocating Virtual Machines (VMs) to physical machines, and simulate their performance with real datasets from a large public cloud provider. We find that HL algorithms outperform domain-specific heuristics, as well as state-of-the-art reinforcement learning methods.

216:50 — VC Theory for Inventory Policies

Advances in computational power and AI have increased interest in reinforcement learning approaches to inventory management. This paper provides a theoretical foundation for these approaches and investigates the benefits of restricting to policy structures that are well-established by decades of inventory theory. In particular, we prove generalization guarantees for learning several well-known classes of inventory policies, including base-stock and (s, S) policies, by leveraging the celebrated Vapnik-Chervonenkis (VC) theory. We apply the concepts of the Pseudo-dimension and Fat-shattering dimension from VC theory to determine the generalizability of inventory policies, that is, the difference between an inventory policy’s performance on training data and its expected performance on unseen data. We focus on a classical setting without contexts, but allow for an arbitrary distribution over demand sequences and do not make any assumptions such as independence over time. We corroborate our supervised learning results using numerical simulations.
Managerially, our theory and simulations translate to the following insights. First, there is a principle of “learning less is more” in inventory management: depending on the amount of data available, it may be beneficial to restrict oneself to a simpler, albeit suboptimal, class of inventory policies to minimize overfitting errors. Second, the number of parameters in a policy class may not be the correct measure of overfitting error: in fact, the class of policies defined by T time-varying base-stock levels exhibits a generalization error comparable to that of the two-parameter (s, S) policy class. Finally, our research suggests situations in which it could be beneficial to incorporate the concepts of base-stock and inventory position into black-box learning machines, instead of having these machines directly learn the order quantity actions.

317:20 — Costly Data Collection in the Newsvendor Problem

In recent years, data-driven algorithms and policies have been developed and widely used in many applications. While some applications can utilize the preexisting data, there are many cases where collecting data is very expensive and time-consuming. Previous research has mainly focused on improving decision-making with data, leaving a gap in understanding the optimal quantity and quality of data needed. Our paper studies the data-driven variant of one of the most widely used operations models, the newsvendor model, in which the retailer is selling products with uncertain demand. Effective ordering decisions can only be made when the retailer collects insightful data to learn about the demand distribution. However, the accuracy of data can vary, and collecting this data is costly due to the sample size (quantity) and sample noise (quality). We aim to understand how the quantity and quality of data impact the retailer’s newsvendor profit and to develop policies for data collection to enhance profit margins. In this paper, we provide a novel denoised-SAA approach that minimizes the in-sample newsvendor cost given a noisy dataset, and prove several bounds of the cost margin on the quantity and quality. Based on our in-depth theoretical analysis, we develop a series of data collection policies with analytical performance guarantees. In an extensive set of computational experiments, we show that these policies perform well in realistic settings.

417:50 — Learning to Order for Inventory Systems with Lost Sales and Uncertain Supplies

We consider a stochastic lost-sales inventory control system with a lead time L over a planning horizon T. Supply is uncertain, and is a function of the order quantity (due to random yield/capacity, etc). We aim to minimize the T-period cost, a problem that is known to be computationally intractable even under known distributions of demand and supply. In this paper, we assume that both the demand and supply distributions are unknown and develop a computationally efficient online learning algorithm. We show that our algorithm achieves a regret (i.e. the performance gap between the cost of our algorithm and that of an optimal policy over T periods) of O(L+\sqrt{T}) when L≥log(T). We do so by 1) showing our algorithm cost is higher by at most O(L+\sqrt{T}) for any L≥0 compared to an optimal constant-order policy under complete information (a well-known and widely-used algorithm) and 2) leveraging its known performance guarantee from the existing literature. To the best of our knowledge, a finite-sample O(\sqrt{T}) (and polynomial in L) regret bound when benchmarked against an optimal policy is not known before in the online inventory control literature. A key challenge in this learning problem is that both demand and supply data can be censored; hence only truncated values are observable. We circumvent this challenge by showing that the data generated under an order quantity q_2 allows us to simulate the performance of not only q_2 but also q_1 for all q_1