1. 08:30 — Network dual reoptimization policies and bounds for managing energy real options

Energy operations and investments are managed with real option models that embed timing and/or switching decisions, which change the status of the real asset irreversibly and reversibly, respectively. Optimizing the flexibility in these models requires solving a finite-horizon Markov decision process with a finite endogenous state space but a high-dimensional and continuous exogenous state space, which is typically intractable and calls for heuristic approaches. Least squares Monte Carlo (LSM) is a popular approximate dynamic programming method for this purpose, whereas the known forecast-based reoptimization heuristic (FRH) is not well suited to handling timing decisions. We develop a network dual reoptimization framework that leverages network flow algorithms and employs a novel class of partial information relaxations to overcome this shortcoming. Our framework provides both a new policy and a dual bound, and we establish that: (i) the decisions taken by our policy are equivalent to solving a scenario-based two-stage stochastic program but are computationally cheaper to obtain because scenarios are solved independently; and (ii) our dual bound is tighter than the known penalized perfect information bound. Numerical experiments on two energy real options, dealing with commodity production and vehicle fleet upgrades, show that our policy outperforms FRH by 10-20\% and LSM by 2-5\%, while our dual bound improves on the penalized perfect information bound by less than 1\%.
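For readers who want a concrete picture of the reoptimization idea, the Python sketch below values each first-stage action of a simple switch-on/off real option by solving sampled price scenarios independently as deterministic dynamic programs over the small endogenous (mode) network, in the spirit of the scenario-based two-stage view described above. The mean-reverting price model, cost parameters, and two-mode asset are illustrative assumptions, not the paper's formulation or its network flow algorithms and partial information relaxations.

# Hypothetical sketch of a scenario-based reoptimization policy for a
# two-mode switching option; the price model and costs are assumptions.
import numpy as np

rng = np.random.default_rng(0)

T = 12           # decision stages
N_SCEN = 200     # sampled price scenarios
SWITCH_COST = 2.0

def sample_price_paths(p0, n, horizon):
    """Sample mean-reverting log-price paths (illustrative exogenous model)."""
    paths = np.empty((n, horizon))
    x = np.full(n, np.log(p0))
    for t in range(horizon):
        x = 0.9 * x + 0.1 * np.log(p0) + 0.2 * rng.standard_normal(n)
        paths[:, t] = np.exp(x)
    return paths

def deterministic_value(prices, start_mode):
    """Solve one scenario by backward recursion over the small endogenous
    state (operating mode), i.e. a longest path in the switching network."""
    modes = (0, 1)                      # 0 = idle, 1 = producing
    v = {m: 0.0 for m in modes}         # terminal values
    for t in reversed(range(len(prices))):
        v_new = {}
        for m in modes:
            best = -np.inf
            for m_next in modes:
                payoff = prices[t] - 1.0 if m_next == 1 else 0.0
                cost = SWITCH_COST if m_next != m else 0.0
                best = max(best, payoff - cost + v[m_next])
            v_new[m] = best
        v = v_new
    return v[start_mode]

def reoptimization_decision(p_now, mode_now):
    """Pick the first-stage mode by valuing each choice with independent
    per-scenario deterministic solves (the two-stage idea in the abstract)."""
    scen = sample_price_paths(p_now, N_SCEN, T - 1)
    values = {}
    for m_next in (0, 1):
        payoff = p_now - 1.0 if m_next == 1 else 0.0
        cost = SWITCH_COST if m_next != mode_now else 0.0
        cont = np.mean([deterministic_value(s, m_next) for s in scen])
        values[m_next] = payoff - cost + cont
    return max(values, key=values.get)

print(reoptimization_decision(p_now=1.5, mode_now=0))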

2. 09:00 — Reinforcement Learning for Energy-Efficient Job Shop Scheduling with Large Structured Action Spaces

In recent years, reinforcement learning has gained increasing traction in the operations research community, with novel solution methods proposed in computer science finding their way into sequential planning and scheduling problems. Although many methods have been proposed to handle large state spaces, combinatorial optimization problems are also well known for their extremely large decision spaces. Despite their size, such decision spaces often display spatiotemporal structure and desirable properties such as convexity, which are commonly exploited through mathematical programming. This research proposes a novel algorithmic pipeline in which local decision neighborhoods are generated and formulated as mathematical programs. These neighborhoods are subsequently explored based on Q-values, using both mathematical programming and simulated annealing to identify improving actions. The proposed pipeline is applied to a stylized variant of the job shop scheduling problem in which energy consumption must be balanced. Its performance is shown to match or exceed that of state-of-the-art embedding methods, and it scales beyond enumerable decision spaces.
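As a rough illustration of exploring a large structured action space with Q-value guidance, the sketch below runs simulated annealing over job-to-machine reassignments, accepting moves according to a stand-in Q-value that rewards balanced energy loads. The action encoding, the Q-function, and the omission of the mathematical programming component are assumptions made for brevity; this is not the paper's pipeline.

# Illustrative sketch: simulated annealing over a structured action space,
# guided by a stand-in Q-value. Action encoding and Q-function are assumptions.
import numpy as np

rng = np.random.default_rng(1)

N_JOBS, N_MACHINES = 20, 4
energy = rng.uniform(1.0, 3.0, size=N_JOBS)     # per-job energy draw

def q_value(assignment):
    """Stand-in Q-value: reward balanced energy load across machines."""
    loads = np.bincount(assignment, weights=energy, minlength=N_MACHINES)
    return -np.var(loads)

def neighbor(assignment):
    """Local move: reassign one job to a (possibly different) machine."""
    a = assignment.copy()
    j = rng.integers(N_JOBS)
    a[j] = rng.integers(N_MACHINES)
    return a

def anneal(start, n_iters=2000, temp0=1.0, cooling=0.995):
    """Simulated annealing over the decision neighborhood: always accept
    Q-improving moves, accept worsening moves with Boltzmann probability."""
    current, best = start, start
    q_cur, q_best = q_value(start), q_value(start)
    temp = temp0
    for _ in range(n_iters):
        cand = neighbor(current)
        q_cand = q_value(cand)
        if q_cand > q_cur or rng.random() < np.exp((q_cand - q_cur) / temp):
            current, q_cur = cand, q_cand
            if q_cur > q_best:
                best, q_best = current, q_cur
        temp *= cooling
    return best, q_best

init = rng.integers(N_MACHINES, size=N_JOBS)
best_action, best_q = anneal(init)
print(best_q)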

3. 09:30 — Stochastic Approximation for Upper Bound Optimization in Markov Decision Processes

Markov decision processes (MDPs) with small endogenous state spaces are prevalent in option pricing, investment planning, and operations. Solving these problems is difficult because of a high-dimensional exogenous state driven by a stochastic process. Computing an upper bound on the optimal policy value of such MDPs is important for assessing the quality of heuristic policies. Approximate linear programming (ALP) and pathwise optimization (PO) are two established approaches that minimize an upper bound on the optimal policy value, but they often lead to challenging semi-infinite linear programs; PO is known to provide better upper bounds than ALP. We propose a new ALP relaxation formulation based on a surrogate relaxation that uses exogenous state information. The upper bound computed by this model is better than that of ALP and converges to the MDP optimal policy value under appropriately sampled basis functions. In addition, we establish a relationship between the upper bounds of PO and our ALP relaxation and compare these two models more broadly. We develop first-order methods with convergence guarantees to solve our ALP relaxation and show that optimized upper bounds for MDPs with small endogenous state spaces can be computed more efficiently than with ALP.
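To make the upper-bound-minimization idea concrete, the sketch below applies a subgradient method to a penalized, sampled ALP-style objective for a small random MDP: the approximate value at an initial state plus a penalty on Bellman-constraint violations, both expressed through basis-function weights. The toy MDP, the polynomial basis, and the exact-penalty surrogate are illustrative assumptions and do not reproduce the paper's surrogate relaxation or its convergence guarantees.

# Illustrative sketch: projected-subgradient descent on a penalized ALP-style
# bound for a small discounted MDP; the MDP, basis, and penalty are assumptions.
import numpy as np

rng = np.random.default_rng(2)

GAMMA = 0.95
N_STATES, N_ACTIONS = 50, 3
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # transitions
R = rng.uniform(0, 1, size=(N_STATES, N_ACTIONS))                 # rewards

def phi(s):
    """Simple polynomial basis in the scaled state index."""
    x = s / N_STATES
    return np.array([1.0, x, x * x])

PHI = np.vstack([phi(s) for s in range(N_STATES)])    # N_STATES x 3
E_PHI = np.einsum('sap,pk->sak', P, PHI)              # E[phi(s') | s, a]

def bound_and_subgradient(beta, penalty=10.0):
    """f(beta) = V_beta(s0) + penalty * mean Bellman-constraint violation."""
    v = PHI @ beta                                    # V_beta over all states
    slack = R + GAMMA * E_PHI @ beta - v[:, None]     # Bellman violations
    active = (slack > 0).astype(float)
    f = v[0] + penalty * np.maximum(slack, 0.0).mean()
    grad = PHI[0] + penalty * np.einsum(
        'sa,sak->k', active, GAMMA * E_PHI - PHI[:, None, :]) / slack.size
    return f, grad

beta = np.zeros(3)
for it in range(2000):
    f, g = bound_and_subgradient(beta)
    beta -= 0.05 / np.sqrt(it + 1) * g                # diminishing step size
print(f)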

4. 10:00 — Reinforcement Learning or Policy Sampling: An Online Coupling Policy with an Energy Storage Application

We propose a new approach for solving episodic sequential decision problems in which the underlying uncertainty is unknown. The approach leverages concurrent real-time simulation to couple online reinforcement learning with a multi-armed-bandit-style online learning approach that we introduce under the name policy sampling. We establish theoretical performance guarantees for the proposed coupling policy and illustrate the new method by applying it to an energy storage management problem.
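A minimal sketch of the coupling idea, under illustrative assumptions: each episode, a UCB bandit chooses whether to act with an online Q-learning policy or with one of two fixed rules in a toy storage problem, while the shared Q-table keeps learning from whichever trajectory is generated. The storage model, the candidate rules, and the use of UCB as the policy-sampling mechanism are assumptions, not the authors' construction.

# Illustrative sketch: a UCB bandit couples an online Q-learning policy with
# fixed rule-based policies for a toy storage problem; all details are assumptions.
import numpy as np

rng = np.random.default_rng(3)

LEVELS, HORIZON = 5, 24                  # storage levels, hours per episode

def price(t):
    return 1.0 + np.sin(2 * np.pi * t / HORIZON) + 0.3 * rng.standard_normal()

def step(level, action, p):
    """action in {-1, 0, +1}: discharge, hold, charge."""
    new = int(np.clip(level + action, 0, LEVELS - 1))
    reward = -(new - level) * p          # pay to charge, earn to discharge
    return new, reward

# candidate policies: two simple rules plus a learned Q-table policy
def rule_greedy(level, t, p, Q): return -1 if p > 1.0 else 1
def rule_hold(level, t, p, Q):   return 0
def rl_policy(level, t, p, Q):   return int(np.argmax(Q[t, level])) - 1
POLICIES = [rule_greedy, rule_hold, rl_policy]

Q = np.zeros((HORIZON, LEVELS, 3))       # Q-table shared across episodes
counts, means = np.zeros(3), np.zeros(3)

def run_episode(policy):
    level, total = LEVELS // 2, 0.0
    for t in range(HORIZON):
        p = price(t)
        a = policy(level, t, p, Q)
        new, r = step(level, a, p)
        # Q-learning update keeps the RL arm improving regardless of who acted
        target = r + (np.max(Q[t + 1, new]) if t + 1 < HORIZON else 0.0)
        Q[t, level, a + 1] += 0.1 * (target - Q[t, level, a + 1])
        level, total = new, total + r
    return total

for ep in range(1, 500):
    ucb = means + np.sqrt(2 * np.log(ep) / np.maximum(counts, 1))
    arm = int(np.argmax(ucb)) if ep > 3 else ep % 3   # play each arm once first
    ret = run_episode(POLICIES[arm])
    counts[arm] += 1
    means[arm] += (ret - means[arm]) / counts[arm]
print(means)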