Machine Learning for Mechanical Ventilation Control
Daniel Suo, Naman Agarwal, Wenhan Xia, Xinyi Chen, Udaya Ghai, Alexander Yu, Paula Gradu, Karan Singh, Cyril Zhang, Edgar Minasyan, Julienne LaChance, Tom Zajdel, Manuel Schottdorf, Daniel Cohen, Elad Hazan
Affiliations: Google LLC; Princeton University
2021-02-12

We consider the problem of controlling an invasive mechanical ventilator for pressure-controlled ventilation: a controller must let air in and out of a sedated patient's lungs according to a trajectory of airway pressures specified by a clinician. Hand-tuned PID controllers and similar variants have comprised the industry standard for decades, yet can behave poorly by over- or under-shooting their target or oscillating rapidly. We consider a data-driven machine learning approach: First, we train a simulator based on data we collect from an artificial lung. Then, we train deep neural network controllers on these simulators. We show that our controllers are able to track target pressure waveforms significantly better than PID controllers. We further show that a learned controller generalizes across lungs with varying characteristics much more readily than PID controllers do.

Mechanical ventilation is a widely used treatment with applications spanning anaesthesia (Coppola et al., 2014), neonatal intensive care (van Kaam et al., 2019), and life support during the current COVID-19 pandemic (Meng et al., 2020; Wunsch, 2020; Möhlenkamp and Thiele, 2020). This life-sustaining treatment has two common modes: invasive ventilation, where the patient is fully sedated, and assist-control ventilation, where the patient can initiate breaths (Oto et al., 2021). Even though mechanical ventilation has been deployed in ICUs for decades, several challenges remain that can lead to ventilator-induced lung injury (VILI) for patients (Cruz et al., 2018). In pressure-support ventilation, a form of assist-control ventilation, evidence suggests that a combination of high peak pressure and high tidal volume can lead to tissue injury in the lung (Jain et al., 2017). Pressure-support ventilation also suffers from patient-ventilator asynchrony, where the patient's breathing pattern does not match the ventilator's, which can result in hypoxemia (low level of blood oxygen), cardiovascular compromise, and patient discomfort (Mellott et al.). However, the risk of developing VILI depends not only on factors related to the ventilator, but also on intrinsic characteristics of the patient's lung (Cruz et al., 2018). These characteristics usually cannot be directly observed, so trained clinicians must continuously monitor the patient. Given the highly manual process of mechanical ventilation, it is desirable to have control methods that can better track prescribed pressure targets and are robust to variations of the patient's lung.

Motivated by this potential to improve patient health, we focus on pressure-controlled invasive ventilation (PCV) (Rittayamai et al., 2015) as a starting point. In this setting, an algorithm controls two valves that let air in and out of a patient's lung according to a target waveform of lung pressure (see Figure 2). We consider the control task only on ISO-standard (ISO 80601-2-80:2018) artificial lungs.

State of the art.
Despite its importance, ventilator control has remained largely unchanged for years, relying on PID (Bennett, 1993) controllers and similar variants to track patient state according to a prescribed target waveform. However, this approach is not optimal in terms of tracking: PID can overshoot, undershoot, and exhibit ringing behavior for certain lungs. It is also not sufficiently robust: ventilators are carefully tuned during design, manufacture, and maintenance (Ziegler et al., 1942; Chen et al., 2012), and any changes in ventilator dynamics (e.g., tubing, response delay), environment (e.g., atmospheric pressure), or patient must be accounted for and continuously monitored by trained clinicians via various physical controls on the ventilator (Rees et al., 2006).

Challenges of ventilator control. A ventilator controller must adapt quickly and reliably across the spectrum of clinical conditions, which are only indirectly observable given a single measurement of pressure. A model that is highly expressive may learn the dynamics of the underlying systems more precisely and thus adapt faster to the patient's condition. However, such models usually require a large amount of data to train, which can take prohibitively long to collect by purely running the ventilator. We opt instead to learn a simulator to generate artificial data, though learning such a simulator for a partially observed non-linear system is itself a difficult problem.

We present better-performing, more robust results, and we provide resources for future researchers. Specifically:
1. We demonstrate that learning a controller as a neural network correction to PID outperforms its uncorrected counterpart (optimality).
2. We show that a single learned controller trained on several ISO lung settings outperforms the PID controller that performs best across the same settings (robustness).
3. We provide self-contained differentiable simulators for the ventilation problem. These simulators reduce the entrance cost for future researchers to contribute to invasive mechanical ventilation.
4. We conduct a methodological study of reinforcement learning techniques, both model-based and model-free, including policy gradient, Q-learning, and other variants. We conclude that model-based approaches are more sample- and computationally efficient.

Of note, we limit our investigation to open-source ventilators. Control methods used by proprietary ventilators cannot be modified or assessed independently from their hardware, and such equipment is cost-prohibitive for academic research. We see this study as a preliminary investigation of machine learning for ventilator control. In future work, we hope to extend this methodology to non-invasive ventilation and pressure-support ventilation, and to conduct clinical trials.

The modern positive-pressure ICU mechanical ventilator dates back to the 1940s (Kacmarek, 2011), with many open-source ventilator designs (Bouteloup et al., 2020) published during the COVID-19 pandemic. Yet at their core, ventilators all rely on controlling air in and out of an elastic lung via a respiratory circuit, as described in many physics-based models (Marini and Crooke, 1993). Such simple operation masks the complexity of treatment (Chatburn and Mireles-Cabodevila, 2011), and recent work on augmenting PID controllers with adaptive methods (Hazarika and Swarup, 2020; Shi et al., 2020) has sought to address more advanced clinical needs.
To the best of our knowledge, our data-driven approach of learning both simulator and controller is novel in this field.

Control and RL in virtual and physical systems. Much progress has been made on learning dynamics when the dynamics themselves exist in silico: MuJoCo physics (Hafner et al., 2019), Atari games (Kaiser et al., 2019), and board games (Schrittwieser et al., 2020). Combining such data-driven models with either pixel-space or latent-space planning has been shown to be effective for policy learning; Ha and Schmidhuber (2018) is an example of this research program for the deep learning era. Progress on deploying end-to-end learned agents (i.e., controllers) in the physical world is more limited in comparison, due to difficulties in scaling parallel data collection and higher variability in real-world data. Bellemare et al. (2020) present a case study on autonomous balloon navigation using a Q-learning approach, rather than a model-based one like ours. Akkaya et al. (2019) use domain randomization with non-differentiable simulators for a difficult dexterous manipulation task.

System identification and residual policy learning. System identification has been studied for decades in control and reinforcement learning; see, e.g., (Schoukens and Ljung, 2019; Billings, 1980) for nonlinear system identification. Deep neural networks have been used to represent nonlinear dynamics; see, e.g., Punjani and Abbeel (2015). Residual policy learning (Silver et al., 2019) is a model-free analogue of our controller design: it learns a correction term on an initial, imperfect policy, and is shown to be more data-efficient than learning from scratch, especially for complex robotic tasks. More recently, concurrent work by Hynes et al. (2020) uses residual policy learning to improve PID for the car suspension control problem.

Multi-task reinforcement learning. Part of our methodology has close parallels in multi-task reinforcement learning (Taylor and Stone, 2009), where the objective is to learn a policy that performs well across diverse environments. To make our controllers more robust, we optimize our policy simultaneously on an ensemble of learned models corresponding to different physical settings, similar to the work of Rajeswaran et al. (2016) and Chebotar et al. (2019) on robotic manipulation.

Healthcare offers a multitude of opportunities and challenges for machine learning; for a survey, see (Ghassemi et al., 2020). Specifically, reinforcement learning and control have found numerous applications (Yu et al., 2020a), and recently for weaning patients off mechanical ventilators (Prasad et al., 2017; Yu et al., 2019, 2020b). As far as we know, there is no prior work on improving the control of ventilators using machine learning.

We begin with some formalisms of the control problem. A partially-observable discrete-time dynamical system is given by the equations
$$x_{t+1} = f(x_t, u_t), \qquad o_t = g(x_t),$$
where $x_t$ is the underlying state of the dynamical system, $o_t$ is the observation of the state available to the controller, $u_t$ is the control input, and $f, g$ are the transition and observation functions, respectively. Given a dynamical system, the control problem is to minimize the sum of cost functions over a long-term horizon:
$$\min_{u_1, \ldots, u_T} \; \sum_{t=1}^{T} c_t(x_t, u_t).$$
This problem is in general computationally intractable, and theoretical guarantees are available for special cases of dynamics (notably linear dynamics) and perturbations. For an in-depth exposition on the subject, see the textbooks by Bertsekas (2017), Zhou et al. (1996), and Tedrake (2020).
PID control. A ubiquitous technique for the control of dynamical systems is the use of linear error-feedback controllers, i.e., policies that choose a control based on a linear function of the current and past errors versus a target state. That is,
$$u_t = \sum_{i=0}^{k-1} \alpha_i \, \epsilon_{t-i},$$
where $\epsilon_t$ is the deviation from the target state at time $t$, and $k$ represents the history length of the controller. PID applies a linear control with proportional, integral, and differential coefficients:
$$u_t = K_P \, \epsilon_t + K_I \sum_{i=0}^{k-1} \epsilon_{t-i} + K_D \, (\epsilon_t - \epsilon_{t-1}).$$
This special class of linear error-feedback controllers, motivated by physical laws, is a simple, efficient, and widely used technique (Åström and Hägglund, 1995). It is currently the industry standard for (open-source) ventilator control.

In invasive ventilation, the ventilator is connected to a patient's main airway and applies pressure in a cyclic manner to simulate healthy breathing. During the inspiratory phase, the target applied pressure increases to the peak inspiratory pressure (PIP). During the expiratory phase, the target decreases to the positive end-expiratory pressure (PEEP), maintained in order to prevent the lungs from collapsing. The PIP and PEEP values, along with the durations of these phases, define the time-varying target waveform, specified by the clinician.

The goal of ventilator control is to regulate the pressure sensor measurements to follow the target waveform $p^\star_t$ by controlling the air flow into the system, which forms the control input $u_t$. As a dynamical system, we can denote the underlying state of the ventilator-patient system as $x_t$, evolving as $x_{t+1} = f(x_t, u_t)$ for an unknown $f$, where the pressure sensor measurement $p_t$ is the observation available to us. The cost function can be defined as a measure of the deviation from the target, e.g., the absolute deviation $c_t(p_t, u_t) = |p_t - p^\star_t|$. The objective is to design a controller that minimizes the total cost over $T$ time steps.

A ventilator needs to take into account the structure of the lung to determine the optimal pressure to induce. Such structural factors include compliance (C), the change in lung volume per unit pressure, and resistance (R), the change in pressure per unit flow.

Physics-based model. A simplistic formalization of the ventilator-lung dynamical system can be derived from the physics of a connected two-balloon system, with a latent state $v_t$ representing the volume of air inside the lung. The dynamics equations can be written as
$$v_{t+1} = v_t + u_t \, \Delta_t, \qquad r_t = \left(\frac{3 v_t}{4 \pi}\right)^{1/3}, \qquad p_t - p_0 \propto \frac{1}{r_0^2 \, r_t} \left(1 - \left(\frac{r_0}{r_t}\right)^6\right),$$
where $p_t$ is the measured pressure, $v_t$ is the volume, $r_t$ is the radius of the lung, $u_t$ is the input air flow rate, and $\Delta_t$ is the time increment. The flow $u_t$ originates from a pressure difference between the lung pressure $p_t$ and the supply pressure $p_{\text{supply}}$, regulated by a valve: $u_t = \frac{p_{\text{supply}} - p_t}{R_{\text{in}}}$. The resistance of the valve is $R_{\text{in}} \propto 1/d^4$ (Poiseuille's law), where $d$, the opening of the valve, is controlled by a motor. The constants $p_0, r_0$ depend on both the lung and ventilator.

In Nadeem (2021), several physics-based models are benchmarked, showing errors that are an order of magnitude larger than what can be achieved with a data-driven approach. While the interpretability of such models is appealing, their low fidelity is prohibitive for offline reinforcement learning. The physics-based dynamics models described above are highly idealized, and are suitable only to provide coarse predictions for the behaviors of very simple controllers.
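For intuition, a minimal simulation sketch of the idealized model above follows. This is our own illustration: the constants (p_supply, p0, r0, k, dt), the forward-Euler discretization, and the function name are assumptions, not values or code from the paper.

```python
import numpy as np

def physics_step(v_t, d_t, p_supply=50.0, p0=5.0, r0=1.0, k=10.0, dt=0.03):
    """One Euler step of the idealized two-balloon ventilator-lung model.

    v_t: current lung volume; d_t: inspiratory valve opening in (0, 1].
    p_supply, p0, r0, k, dt are illustrative constants, not fitted values.
    """
    r_t = (3.0 * v_t / (4.0 * np.pi)) ** (1.0 / 3.0)          # lung radius from volume
    p_t = p0 + k / (r0 ** 2 * r_t) * (1.0 - (r0 / r_t) ** 6)  # balloon-like pressure law
    r_in = 1.0 / max(d_t, 1e-3) ** 4                          # valve resistance, R_in ~ 1/d^4
    u_t = (p_supply - p_t) / r_in                             # flow from the pressure difference
    return v_t + u_t * dt, p_t                                # next volume, measured pressure

# Example: hold the valve 30% open and record the simulated pressure trace.
v, trace = 10.0, []
for _ in range(100):
    v, p = physics_step(v, d_t=0.3)
    trace.append(p)
```

Even this toy rollout makes the idealizations visible: the pressure responds instantaneously to the valve opening, with none of the propagation delays or patient variability discussed next.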
We list some sources of error arising from using physics equations for model-based control:
• Idealization of physics: oversimplifying fluid flow and turbulence via ideal incompressible gas assumptions, and linearizing the dynamics of the lung and ventilator components.
• Lagged and partial observations: assuming instantaneous changes to volume and pressure across the system. In reality, there are non-negligible propagation times for pressure impulses, delayed pressure feedback arising from lung elasticity, and computational latency.
• Underspecification of variability: different patients' clinical scenarios, captured by the latent constants $p_0, r_0$, may intrinsically vary in more complex (i.e., higher-dimensional) ways.

For the reasons listed above, it is highly desirable to adopt a learned model-based approach in this setting because of its sample-efficiency and reusability. A reliable simulator enables much cheaper and faster data collection for training a controller, and allows us to incorporate multi-task objectives and domain randomization (e.g., different waveforms, or even different patients). An additional goal is to make the simulator differentiable, enabling direct gradient-based policy optimization through the system's dynamics (rather than stochastic estimates thereof).

We show that in this partially-observed (but single-input, single-output) system, we can query a reasonable amount of training data in real time from the test lung, and use it offline to learn a differentiable simulator of its dynamics ("real2sim"). Then, we complete the pipeline by leveraging interactive access to this simulator to train a controller ("sim2real"). We demonstrate that this pipeline is sufficiently robust that the learned controllers can outperform PID controllers tuned directly on the test lung.

To develop simulators and control algorithms, we run mechanical ventilation tasks on a physical test lung (IngMar, 2020) using the open-source ventilator designed by Princeton University's People's Ventilator Project (PVP) (LaChance et al., 2020). For our experiments, we use the commercially available adult test lung, "QuickLung", manufactured by IngMar Medical. The lung has three compliance settings (C = {10, 20, 50} mL/cmH2O) and three airway resistance settings (R = {5, 20, 50} cmH2O/L/s), for a total of 9 settings, which are specified by the ISO standard for ventilatory support equipment (ISO 80601-2-80:2018). An operator can change the lung's compliance and resistance settings manually. We connect the test lung to the ventilator via a respiratory circuit (McIntyre, 1986; Parmley et al., 1972) as shown in Figure 1. Figure 3 shows a snapshot of our hardware setup.

There are many forms of ventilator treatment. In addition to various pressure target trajectories, clinicians may want to focus on other factors, such as volume and flow (Chatburn, 2007). The PVP ventilator focuses on targeting pressure for a completely sedated patient (i.e., the patient does not initiate any breaths) and comprises two pathways (see Figure 1): (1) the inspiratory pathway that regulates airflow into the lung, and (2) the expiratory pathway for airflow out of the lung. A software controller is able to adjust one valve for each pathway. The inspiratory valve is a proportional control flow valve that allows control in a continuous range from fully closed to fully open. The expiratory valve is a binary on-off valve that only permits zero or maximum airflow.
To prevent damage to the ventilator and/or injury to the operator, we implement software overrides that abort a given run: 1) if pressure or volume in the lung exceeds certain thresholds, 2) if tubing disconnects, or 3) if there is significant software delay. The PVP pneumatic design also includes a safety valve in case software overrides fail.

We treat the mechanical ventilation task as episodic by separating each inspiratory phase (e.g., light gray regions in Figure 4) from the breath timeseries and treating those as individual episodes. This approach reflects both physical and medical realities. Mechanically ventilated breaths are by their nature highly regular and feature long expiratory phases (dark gray regions in Figure 4) that end with the ventilator-lung system close to its initial state, thereby justifying the episodic treatment. Further, the inspiratory phase is the most relevant to clinical treatment and the harder regime to control, with prevalent problems of under- or over-shooting the target pressure and ringing. We therefore attempt to learn a simulator for the ventilator-lung dynamics of the inspiratory phase; repeated episodes of inspiratory phases thus serve as simplified, faithful units of training data.

With the hardware setup outlined in Section 3, we have a physical system suitable for benchmarking, in place of a true patient's lung. In this section, we present our approach to learning a simulator for the inspiratory phase of this ventilator-lung system, subject to the practical constraints of real-time data collection. Two main considerations drive our simulator training and evaluation design. First, the evaluation of any simulator can only be performed using a black-box metric, since we do not have explicit access to the system dynamics, and existing physics models are poor approximations to the empirical behavior. Second, the dynamical system we simulate is very challenging for a comprehensive simulation covering all modalities, and in particular exhibits chaotic behavior in boundary cases. Therefore, since the end goal for the simulator is better control, we only evaluate the simulator on "reasonable" scenarios that are relevant to the control task.

The class of learned simulators we consider consists of deep neural networks; thus, in addition to our lack of explicit access to the system dynamics, the simulator dynamics themselves are complex non-linear operations. We therefore deviate from the standard distance metrics (between the simulator and the true system) considered in the literature, such as Ferns et al. (2005), as they explicitly involve the value function over states, transition probabilities, or other unknown quantities. Rather, we consider metrics that are based on the evolution of the dynamics, as studied in Vishwanathan et al. (2007). However, unlike the latter work, we take into account the particular distribution over control sequences that we expect to search around during the controller training phase. We thus define the following distance between dynamical systems. Let $f_1, f_2$ be two dynamical systems over the same state-action spaces. Let $\mathcal{D}$ be a distribution over sequences of controls denoted $u = \{u_1, u_2, \ldots, u_T\}$. We define the open-loop distance w.r.t. horizon $T$ and control sequence distribution $\mathcal{D}$ as
$$d_{\mathrm{open}}(f_1, f_2) = \mathbb{E}_{u \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \big\| x^{(1)}_t - x^{(2)}_t \big\| \right],$$
where $x^{(i)}_t$ denotes the state reached by system $f_i$ after applying the open-loop control sequence $u_1, \ldots, u_{t-1}$. We use the Euclidean norm over the states in the inner loop, although this can be generalized to any metric.
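To make this metric concrete, here is a minimal Monte Carlo sketch of how the open-loop distance between two simulators could be estimated; the function names and sampling interface are our own assumptions, not the authors' code.

```python
import numpy as np

def open_loop_distance(f1, f2, sample_controls, x0, num_samples=100):
    """Monte Carlo estimate of the open-loop distance between two systems.

    f1, f2: functions mapping (state, control) -> next state.
    sample_controls: draws a control sequence u_1, ..., u_T from D.
    Both systems are rolled out open-loop on the *same* sampled sequence,
    and we accumulate the Euclidean distance between their states.
    """
    total = 0.0
    for _ in range(num_samples):
        u_seq = sample_controls()                    # shape (T, control_dim)
        x1, x2 = np.asarray(x0, float), np.asarray(x0, float)
        dist = 0.0
        for u in u_seq:
            x1, x2 = f1(x1, u), f2(x2, u)            # open-loop: no feedback
            dist += np.linalg.norm(x1 - x2)          # Euclidean norm per step
        total += dist
    return total / num_samples
```

Because both systems receive the same control sequence with no feedback, differences between the rollouts isolate the mismatch in the dynamics themselves.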
Compared to metrics involving feedback from the simulator, the open-loop distance is a more reliable description of transfer, since it minimizes hard-to-analyze interactions between the policy and the simulator. We evaluate our data-driven simulator using the open-loop distance metric and illustrate a result in the top half of Figure 5: we see low error as we increase the number of steps we project. In the bottom half of Figure 5, we show a sample trajectory of our simulator alongside the ground truth; the simulated trajectory tracks the true trajectory quite closely. See Section 4.3 for experimental details.

Motivated by the black-box metric described above, we focus on collecting trajectories comprising control sequences and the pressure sequences measured upon execution of those controls, which together form a training dataset. Due to safety and complexity issues, we cannot hope to exhaustively explore the space of all trajectories. Instead, keeping the eventual control task in mind, we choose to explore trajectories near the control sequence generated by a baseline PI controller. The goal is to have the simulator faithfully capture the true dynamics in a reasonably large vicinity of the optimal control trajectory on the true system. To this end, for each of the lung settings, we collect data by choosing a safe PID controller baseline and introducing random exploratory perturbations according to the following two policies:
1. Boundary exploration: at the very beginning of the inhalation, add an additional control sampled uniformly from $(c^a_{\min}, c^a_{\max})$ and decrease this additive control linearly to zero over a time frame sampled randomly from $(t^a_{\min}, t^a_{\max})$.
2. Triangular exploration: sample a maximal additional control from a range $(c^b_{\min}, c^b_{\max})$ and an interval $(t^b_{\min}, t^b_{\max})$ within the inhalation. Start from 0 additional control at time $t^b_{\min}$, increase the additional control linearly until $(t^b_{\min} + t^b_{\max})/2$, and then decrease it to 0 linearly until $t^b_{\max}$.

For each breath during data collection, we choose policy (a) with probability $p_a$ and policy (b) with probability $(1 - p_a)$. The ranges in (a) and (b) are lung-specific; we give the exact values used in the Appendix. This protocol balances the need to explore a significant part of the state space with the need to ensure safety. The boundary exploration capitalizes on the fact that at the beginning of the breath, exploration is both safer and more valuable: safer because the lung is at steady state, and more valuable because the typical target waveform for inhalation requires a rapid pressure increase followed by a quick switch to stabilization, leading to a need for a better understanding of the dynamics in the early phase of a breath. The structure of the triangular exploration is inspired by the need for a persistent exploration strategy (similar ideas exist in Dabney et al. (2020)) that can capture intrinsic delay in the system. We illustrate this approach in Figure 6: control inputs used in our exploration policy are shown on the top, and the pressure measurements of the ventilator-lung system are shown on the bottom.

Figure 6: We overlay the controls and pressures from all inspiratory phases in the upper and lower plots, respectively. From this example of the simulator training data (lung setting R = 5, C = 50), we see that we explore a wide range of control inputs (upper plot), but a more limited "safe" range around the resulting pressures.

Precise parameters for our exploration policy are listed in Table 1 in the Appendix.
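As an illustration of the two exploration policies (a) and (b), the following is a minimal sketch of how the additive perturbation on top of the baseline PID control could be generated for one inspiratory phase. The parameter ranges and function name are hypothetical placeholders, not the values given in Table 1 of the appendix.

```python
import numpy as np

def exploration_perturbation(T, p_a=0.5, c_a=(0.0, 20.0), t_a=(5, 20),
                             c_b=(0.0, 20.0), t_b=(10, 60)):
    """Additive control perturbation for one inspiratory phase of length T.

    With probability p_a, boundary exploration: an additive control drawn from
    the range c_a is applied at the start of inhalation and decayed linearly to
    zero over a duration drawn from t_a. Otherwise, triangular exploration:
    ramp an additive control up and back down within an interval drawn from
    t_b. Assumes T is at least as long as the upper ends of t_a and t_b.
    """
    extra = np.zeros(T)
    if np.random.rand() < p_a:                         # (a) boundary exploration
        c = np.random.uniform(*c_a)
        dur = np.random.randint(*t_a)
        extra[:dur] = np.linspace(c, 0.0, dur)
    else:                                              # (b) triangular exploration
        c = np.random.uniform(*c_b)
        t_min = np.random.randint(t_b[0], t_b[1] - 1)
        t_max = np.random.randint(t_min + 1, t_b[1])
        mid = (t_min + t_max) / 2.0
        steps = np.arange(t_min, t_max)
        extra[t_min:t_max] = np.interp(steps, [t_min, mid, t_max], [0.0, c, 0.0])
    return extra  # added to the baseline PID control at each time step
```

The returned perturbation is simply summed with the safe PID baseline at each time step of the breath, so exploration stays anchored near control sequences the hardware can tolerate.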
Now we describe the architectural details of our data-driven simulator. Due to the inherent differences across lungs, we opt to learn a different simulator for each of the tasks, which we can wrap into a single meta-simulator through code that selects the appropriate model based on a user's input of the R and C parameters.

Training task(s). The simulator aims to learn the unknown dynamics of the inhalation phase. We approximate the state of the system (which is not observable to us) by the sequence of past pressures and controls, up to history lengths $H_p$ and $H_c$, respectively. The task of the simulator can now be distilled down to predicting the next pressure $p_{t+1}$ based on the past $H_c$ controls $u_t, \ldots, u_{t-H_c}$ and the past $H_p$ pressures $p_t, \ldots, p_{t-H_p}$. We define the training task by constructing a regression dataset whose inputs come from contiguous overlapping windows of length $H_p$, $H_c$ within the collected trajectories, and whose targets are the pressures that follow each window.

Boundary model. To further improve simulator performance, we found that an additional distinction needs to be made between the behavior of the dynamics during the "rise" and "stabilize" phases of an inhalation. We therefore learned a collection of individual models for the very beginning of the inhalation/episode and a general model for the rest of the inhalation, mirroring our choice of exploration policies. This proves to be very helpful, as the dynamics at the very beginning of an inhalation are transient, and also extremely important to get right due to downstream effects. Concretely, our final model stitches together a list of $N_B$ boundary models and a general model, whose training tasks are as described earlier (details can be found in Appendix B, Table 3).

In this section we describe the following two controller tasks:
1. Performance: improve performance for tracking the desired waveform in ISO-specified benchmarks. Specifically, we minimize the combined $L_1$ deviation from the target inhalation behavior across all target pressures on the simulator corresponding to a single lung setting of interest.
2. Robustness: improve performance using a single trained controller. Specifically, we minimize the combined $L_1$ deviation from the target inhalation behavior across all target pressures and across the simulators corresponding to several lung settings of interest.

Controller architecture. Our controller comprises a PID baseline upon which we learn a deep network correction, controlled with a regularization parameter λ (see the sketch below). This residual setup can be seen as a regularization against the gap between the simulator and the real dynamics: in particular, it prevents the controller training from over-fitting on the simulator. We found this approach to be significantly better than directly using the best (and perhaps over-fitted) controller on the simulator. We provide further details about the architecture and ablation studies in the Appendix.
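A minimal sketch of this residual design follows. It is our own illustration: the error-window featurization, layer sizes, and the default λ are assumptions, and the paper's appendix gives the actual architecture.

```python
import torch
import torch.nn as nn

class ResidualPIDController(nn.Module):
    """A fixed PID baseline plus a learned correction, scaled by lam."""

    def __init__(self, kp, ki, kd, history=10, width=32, lam=0.1):
        super().__init__()
        self.kp, self.ki, self.kd, self.lam = kp, ki, kd, lam
        # The correction network sees a window of recent tracking errors; this
        # featurization is an assumption and may differ from the paper's.
        self.correction = nn.Sequential(
            nn.Linear(history, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1),
        )

    def forward(self, errors):
        """errors: tensor of shape (history,), most recent error last."""
        e_t = errors[-1]
        pid = self.kp * e_t + self.ki * errors.sum() + self.kd * (e_t - errors[-2])
        residual = self.correction(errors.unsqueeze(0)).squeeze()
        return pid + self.lam * residual  # control signal, e.g., inspiratory valve opening
```

Keeping λ small keeps the learned policy close to the PID baseline, which is what guards against exploiting inaccuracies of the learned simulator.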
For our experiments, we use the physical test lung to run our proposed controllers (trained on the simulators) and compare them against the PID controllers that perform best on the physical lung. To make these comparisons, we compute a score for each controller on a given test lung setting (e.g., R = 5, C = 50) by averaging the $L_1$ deviation from the target pressure waveform over all inspiratory phases, and then averaging these average $L_1$ errors over the six waveforms specified in ISO 80601-2-80:2018. We choose $L_1$ as the error metric so as not to over-penalize breaths that fall short of their target pressures and to avoid engineering a new metric. We determine the best-performing PID controller for a given setting by running exhaustive grid searches over the P, I, D coefficients for each lung setting (details for both our score and the grid searches can be found in the Appendix).

Figure 7: We show that for each lung setting, the controller we trained on the simulator for that setting outperforms the best-performing PID controller found on the physical test lung.

As an example, we compare our method (learned controller on learned simulator) to the best P-only, I-only, and PID controllers relative to a target waveform (dotted line). Whereas our controller rises quickly and stays very near the target waveform, the other controllers take significantly longer to rise, overshoot, and, in the case of P-only and PID, ring throughout the entire inspiratory phase.

As part of our investigation, we benchmarked and compared several reinforcement learning (RL) methods for policy optimization on the simulator before settling on the analytic policy gradient approach outlined before, which leverages the ability to differentiate through the simulated dynamics. We consider popular RL algorithms, namely PPO (Schulman et al., 2017) and DQN (Mnih et al., 2013), and compare them to direct analytic policy gradient descent. These algorithms are representative of two mainstream RL paradigms, policy gradient and Q-learning, respectively. We performed experiments on simulators that represent lungs with different R, C parameters. The metric, as earlier, is the $L_1$ distance between the target and achieved lung pressure per step. To ensure a fair model comparison, we used the same state featurization (as described in the previous section) for all algorithms and performed an extensive hyperparameter search for our baselines during the training phase.

Results are shown in Figure 10. Our algorithm achieves scores comparable to the baselines across all simulators. Importantly, our analytic gradient-based method achieves a comparable score to PPO/DQN with orders of magnitude fewer samples. This sample-efficiency property of our algorithm can be clearly observed in Figure 11: our method converges within 100 episodes of training, while the other methods require tens of thousands of episodes. Further, our algorithm has a stable training process, in contrast to the notable training instability of the baselines. Furthermore, our method is robust with respect to hyperparameter tuning, unlike the baselines, which require an extensive search over hyperparameters to achieve comparable performance. Such an extensive hyperparameter search is infeasible for resource-constrained or online learning scenarios, which are typical use cases for these control systems. Specifically, for the results provided here, we conducted 720 trials with different hyperparameter configurations for PPO and 180 trials for DQN. In contrast, our method only requires a few trials of standard optimizer learning-rate tuning, which is minimal effort by deep learning standards.
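To make the analytic policy gradient approach concrete, here is a minimal training-loop sketch: the learned controller is unrolled inside the differentiable simulator and the per-step $L_1$ tracking loss is backpropagated directly through the simulated dynamics. The module signatures, horizon, optimizer, and learning rate are our assumptions, not the paper's implementation.

```python
import torch

def train_policy(controller, simulator, target, p_init, horizon=29,
                 epochs=100, lr=1e-3):
    """Direct (analytic) policy gradient through a differentiable simulator.

    controller: torch module mapping (pressure history, target pressure) -> control.
    simulator:  differentiable torch module mapping (pressure history, control
                history) -> next pressure. Both signatures are assumptions.
    target:     tensor of target pressures for one inspiratory phase.
    """
    opt = torch.optim.Adam(controller.parameters(), lr=lr)
    for _ in range(epochs):
        pressures, controls = [p_init], []
        loss = torch.zeros(())
        for t in range(horizon):
            u_t = controller(pressures, target[t])
            p_next = simulator(pressures, controls + [u_t])
            loss = loss + torch.abs(p_next - target[t])  # per-step L1 tracking cost
            pressures.append(p_next)
            controls.append(u_t)
        opt.zero_grad()
        loss.backward()   # gradients flow back through the entire simulator unroll
        opt.step()
    return controller
```

Because the gradient of the tracking loss is computed exactly through the unrolled dynamics rather than estimated from sampled returns, each episode provides a much stronger learning signal than a PPO or DQN update, which is consistent with the sample-efficiency gap reported above.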
We have presented a machine learning approach to ventilator control, demonstrating the potential of end-to-end learned controllers by obtaining improvements over industry-standard baselines. Our main conclusions are:
1. The nonlinear lung-ventilator dynamical system can be modeled by a neural network more accurately than by previously studied physics-based models.
2. Controllers based on deep neural networks can outperform PID controllers across multiple clinical settings (waveforms), and can generalize better across patient lung characteristics, despite having significantly more parameters.
3. Direct policy optimization for differentiable environments has the potential to significantly outperform Q-learning or (standard) policy gradient methods in terms of sample and computational complexity.

There remain a number of areas to explore, mostly motivated by medical need. The lung settings we examined are by no means representative of all lung characteristics (e.g., neonatal, child, non-sedated), and lung characteristics are not static over time; a patient may improve or worsen, or begin coughing. Ventilator costs also drive further research. As an example, inexpensive valves have less consistent behavior and longer reaction times, which exacerbate bad PID behavior (e.g., overshooting, ringing), yet are crucial to bringing down costs and expanding access. Learned controllers that adapt to these deficiencies may obviate the need for such trade-offs.

PID grid search. For each grid point, we target six different waveforms (with identical PEEP and breaths per minute, but varying PIP over [10, 15, 20, 25, 30, 35] cmH2O). This gives us 2,400 trajectories for each lung setting. We determine a score for the run by averaging the $L_1$ loss between the actual and target pressures, ignoring the first breath. Each run lasts 300 time steps (approximately 9 seconds, or three breaths), which we have found to give sufficiently consistent results compared with a longer run. Of note, some of our coefficients reach our maximum grid setting (i.e., 10.0). We explored going beyond 10 but found that performance actually degrades quickly, since a quickly rising pressure is offset by subsequent overshooting and/or ringing.

Table 2: P and I coefficients that give the best $L_1$ controller performance relative to the target waveform, averaged across the six waveforms associated with PIP = [10, 15, 20, 25, 30, 35].

Open-loop test. To validate the simulator's performance, we hold out 20% of the trajectory data we collected, including residual exploration. We run the exact sequence of controls derived from the lung execution on the simulator. We define the point-wise error to be the absolute value of the distance between the pressure observed on the real lung and the corresponding output of the simulator, i.e., $\mathrm{err}_t = |p^{\mathrm{sim}}_t - p^{\mathrm{lung}}_t|$. We assess the MAE loss corresponding to the errors accumulated across all test trajectories. The following table contains the optimal objective values achieved via the above training and evaluation, along with an architecture search over the parameters $H_p$ (pressure window), $H_c$ (control window), $W$ (width), $d$ (depth), and $N_B$ (number of boundary models).
Trajectory comparison. In addition to the open-loop test, we compare the true trajectories to simulated ones as described in Section 4.

The following table describes the settings for determining policies (a) and (b) for collecting simulator training data as described in Section 4. We use an initial learning rate of $10^{-1}$ and weight decay $10^{-5}$ over 30 epochs. For the generalization task, we train controllers across multiple simulators corresponding to the lung settings (R = 20, C = [10, 20, 50] in our case). For each target waveform (there are six, one for each PIP in [10, 15, 20, 25, 30, 35] cmH2O) and each simulator, we train the controller round-robin (i.e., one after another sequentially) once per epoch. We zero out the gradients between each epoch.

References.
Stephen A. Billings. Identification of nonlinear systems: a survey. In IEE Proceedings D (Control Theory and Applications), volume 127, pages 272-285. IET, 1980.
Julien Bouteloup, Emmanuel Vilsbol, Asem Alaa, and Francois Branciard. Covid-19-open-
Solving Rubik's Cube with a robot hand.
PID Controllers: Theory, Design, and Tuning. ISA - The Instrumentation, Systems, and Automation Society.
Autonomous navigation of stratospheric balloons using reinforcement learning.
Development of the PID controller.
Dynamic Programming and Optimal Control, volume I.
Learning robust neural network policies using model ensembles.
Using physiological models and decision theory for selecting appropriate ventilator settings.
Pressure-controlled vs volume-controlled ventilation in acute respiratory failure: a physiology-based narrative and systematic review.
Nonlinear system identification: A user-oriented roadmap.
Mastering Atari, Go, chess and shogi by planning with a learned model.
Proximal policy optimization algorithms.
Self-adjusting ventilator control strategy based on PID.
Transfer learning for reinforcement learning domains: A survey.
Underactuated Robotics: Algorithms for Walking, Running, Swimming, Flying, and Manipulation (Course Notes for MIT 6.832).
Modes and strategies for providing conventional mechanical ventilation in neonates.
Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes.
Mechanical ventilation in COVID-19: interpreting the current epidemiology.
Inverse reinforcement learning for intelligent mechanical ventilation and sedative dosing in intensive care units.
Reinforcement learning in healthcare: A survey.
Supervised actor-critic reinforcement learning for intelligent mechanical ventilation and sedative dosing in intensive care units.
Robust and Optimal Control.
Optimum settings for automatic controllers.