### 1 — 08:30 — A Recurrent Reinforcement Learning Strategy for Online Scheduling of a State Task Network Under Uncertainty

Deep Reinforcement Learning (DRL) has gained interest for optimal scheduling in industrial facilities. The aim is to train an agent to find the optimal sequence of decisions in the performance of a task, e.g., a short-term schedule for manufacturing production. Key features of DRL agents for scheduling applications under uncertainty are the unique method to explore solutions in complex systems, fast adaptation to unexpected deviations in the parameters, and their ability to provide immediate responses. DRL has just emerged in the field of short-term scheduling optimization, showing good capabilities in creating schedules through different techniques. In the literature, the usual approach for the decision-making models in DRL is the Markov Decision Process (MDP) which uses information from the present time to take the consecutive action in the schedule. Nevertheless, MDPs might not be attractive to model the decision process in scheduling because of the evident correlation between events in scheduling. In this work we proposed the use of a Partially Observable MDP (POMDP) that considers a set of consecutive events to make a decision.

In this work, a methodology to develop a RL agent that acts as an online scheduler on a State Task Network (STN) subject to uncertainty is presented. Proximal Policy Optimization (PPO) is used to train a hybrid agent capable to provide multiple actions [1]. To account for the partial observability of the environment, the agent receives multiple states and correlates them using a Recurrent Neural Network (RNN). The agent produces a hybrid action at every time interval: one that starts a task in the network and the other that specifies the amount of product to be initialized. The hybrid agent was trained to maximize production in a given horizon. To the authors’ knowledge, the use of a hybrid RNN agent has not been considered to address scheduling in STNs under uncertainty.

The hybrid agent designs a schedule considering the objective, uncertainty, and constraints of the system by using a reward shaping strategy. The DRL framework was tested on two case studies involving the online scheduling of two well-known STNs [2], [3]. Parametric uncertainty was incorporated in both cases to show the agent’s adaptability to different circumstances. Results showed that the agent extracted useful knowledge from the instances to create attractive scheduling decisions.

[1] Z. Fan, R. Su, W. Zhang, and Y. Yu, “Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space.” arXiv, May 30, 2019. Accessed: Jan. 18, 2024. [Online]. Available: http://arxiv.org/abs/1903.01344

[2] E. Kondili, C. C. Pantelides, and R. W. H. Sargent, “A general algorithm for short-term scheduling of batch operations—I. MILP formulation,” Comput. Chem. Eng., vol. 17, no. 2, Art. no. 2, Feb. 1993, doi: 10.1016/0098-1354(93)80015-F.

[3] L. G. Papageorgiou and C. C. Pantelides, “Optimal Campaign Planning/Scheduling of Multipurpose Batch/Semicontinuous Plants. 2. A Mathematical Decomposition Approach,” Ind. Eng. Chem. Res., vol. 35, no. 2, pp. 510–529, Jan. 1996, doi: 10.1021/ie950082d.

### 2 — 09:00 — A Unified Framework for Sequential Decision Problems: From Dynamic Programming/RL to Stochastic Programming

"Stochastic programming" is typically associated with two (or multi) stage math programs using scenario trees, or methods based on Benders decomposition. These methods are treated as competitive with approximate dynamic programming/reinforcement learning and model predictive control, and are distinct from recently popular methods from “AI” based on deep neural networks. I refer to these (and many other methods) as the “jungle of stochastic optimization.” I will pull these (and other) methods together in a single universal framework that can be used to model any sequential decision problem. Every method that has been proposed in the academic literature (or used in practice) can be organized into one of four (meta) classes of policies. I will highlight the power of parameterized policies, which are easily the most widely used methods in practice. I will then describe the concept of a parametric cost function approximation which is simply a parameterized deterministic approximation which has to be properly tuned in a simulator (but is sometimes tuned in the field). While the academic literature has focused on stochastic lookahead policies (using scenario trees) and methods based on Bellman’s equation (which includes Benders decomposition, approximate dynamic programming and reinforcement learning), industry almost exclusively uses different forms of parameterized policies, often applied in an ad hoc fashion. Parameterized policies are far more practical, but introduce the challenges of designing the parameterization, and tuning the parameters. I will close by arguing that if the stochastic programming community shifts its focus to these classes of policies, our work will be much more useful in practice.

### 3 — 09:30 — Inducing Group Fairness in Real-World Pipelines through Uncertainty Protections

When the data used for optimization and decision-making is uncertain or random, optimum solutions can have a disparate impact on different demographic groups. This impact arises from variations in the probability distributions of data corresponding to different groups and the interactions of our estimates about these with optimization. In this talk, we will present a framework to tackle such disparities using "uncertainty protections" on different groups of people. We will discuss provable strategies to mitigate the resultant disparities, depending on the properties of the notion of fairness used. We will augment the theory with promising computational results on (i) school allocation with differentially private data and (ii) inventory management with stochastic demands.

### 4 — 10:00 — A Functional Model Method for Nonconvex Nonsmooth Conditional Stochastic Optimization

We consider stochastic optimization problems involving an expected value of a nonlinear function of a base random vector and a conditional expectation of another function depending on the base random vector, a dependent random vector, and the decision variables. We call such problems conditional stochastic optimization problems. They arise in many applications, such as uplift modeling, reinforcement learning, and contextual optimization. We propose a specialized single time-scale stochastic method for nonconvex constrained conditional stochastic optimization problems with a Lipschitz smooth outer function and a generalized differentiable inner function. In the method, we approximate the inner conditional expectation with a rich parametric model whose mean squared error satisfies a stochastic version of a {\L}ojasiewicz condition. The model is used by an inner learning algorithm. The main feature of our approach is that unbiased stochastic estimates of the directions used by the method can be generated with one observation from the joint distribution per iteration, which makes it applicable to real-time learning. The directions, however, are not gradients or subgradients of any overall objective function. We prove the convergence of the method with probability one, using the method of differential inclusions and a specially designed Lyapunov function, involving a stochastic generalization of the Bregman distance. Finally, a numerical illustration demonstrates the viability of our approach.