Screening for an Infectious Disease as a Problem in Stochastic Control

Jakub Marecek

November 1, 2020

Abstract: There has been much recent interest in screening populations for an infectious disease. Here, we present a stochastic-control model wherein the optimum screening policy is provably difficult to find, but wherein Thompson sampling has provably optimal performance guarantees in the form of Bayesian regret. Thompson sampling seems applicable especially to diseases for which we do not understand the dynamics well, such as the super-spreading COVID-19.

There has been much recent interest in screening populations for an infectious disease. In the case of COVID-19, data from contact-tracing apps [1, 2, 3, 4, e.g.], especially [5], suggest that mortality is negatively associated with the number of tests performed in the given community, especially in low-income countries and countries with lower government-effectiveness scores [5, 6]. While the association does not imply causation, once the number of tests required by contact tracing exceeds the capacity for performing tests [3], or once contact tracing becomes futile for other reasons, the importance of statistical approaches to the allocation of tests to communities [7, 8, 9] and, hypothetically [10], individuals, becomes clear.

Screening for an infectious disease, especially in a pandemic, has two conflicting goals: one is to stop the spread of the disease, and the other is to understand the spread of the disease as precisely as possible. The former goal may lead to increasing the intensity of testing in communities where the disease has spread widely. The latter goal may lead to uniform sampling from the population, perhaps using tests of limited accuracy, as exemplified by Slovakia [11], which tested the entirety of its population with immunoassays over a weekend. These two goals conflict, but their conflict is well understood in Stochastic Control [12, 13]. Indeed, Stochastic Control could be seen as a field concerned with balancing the trade-off between "exploration" and "exploitation" [13]. In exploration, we aim to learn the stochastic processes [14] involved. In exploitation, one wishes to utilise the current estimates of the stochastic processes involved to optimize a functional, such as the long-run number of persons infected. Notice that exploitation and exploration do not necessarily map to the short-term and long-term objectives in disease control: long-term disease control requires both exploration and exploitation, and hence the balancing of the trade-off. In contrast, there seems to be a mismatch between the techniques used for screening for an infectious disease at the moment and the state of the art in Stochastic Control.

In Epidemiology [17], there are very many compartmental models [17], ranging from the simple SIR model [18], which recognises three stages of infection, to SIDARTHE [4], which recognises eight. It is, however, increasingly recognised [19, 20] that such models may be of limited utility in screening for novel diseases, which are super-spreading, allow for reinfections, and may overwhelm the healthcare system, for several reasons: First, it is non-trivial to identify the parameters of the compartmental models until a substantial number of cases of the novel disease is documented in a particular intervention regime.
Second, and more importantly, the compartmental models underestimate the variance of the associated stochastic processes in so-called super-spreading diseases.¹ Third, many widely studied compartmental models (SI, SIR, SEIR, ..., SIDARTHE) do not model reinfections,² not only because of the prevalence of diseases with "immunizing infections," but also for technical reasons [28]. Fourth, compartmental models that allow for the study of the impact of a test assume that individuals can be effectively isolated once tested positive, while overloaded healthcare systems may not be able to prevent further infections by those who have tested positive. While one can and should make forecasts based on the current models [29], one should also realise that the dynamics are uncertain [29] and that modelling the stochastic aspects better [14] may be beneficial [30].

Based on a long history of work in Stochastic Control [12, 13], we present several insights into monitoring the spread of diseases for which we do not understand the dynamics well, such as the super-spreading COVID-19. Overall, our aims are three-fold:

• to remove as many assumptions from the screening for an infectious disease as possible,
• to study the computational complexity [31] of allocating the budget of tests [32] independent of any conjectures (e.g., $\mathrm{P} \overset{?}{=} \mathrm{NP}$ [31]), and
• to guarantee optimality of practical algorithms for the same problem.

¹ Statements such as "offspring distribution of COVID-19 is highly overdispersed" with $k = 0.1$ [21] suggest that "10% of cases lead to 80% of the spread" [21], which is hard to model in the compartmental models.
² Reinfections are also well documented [22, 23], although their numbers [24, 25, 26, 27] and impact [27] are still unclear.

Stochastic Models: Our first suggestion is to consider the stochastic aspects of the problem explicitly, starting with the fact that the tests are imperfect.³ Consider the hypothetical situation where we performed multiple low-accuracy tests of a single person, or perhaps a sequence of tests of increasing accuracy. In deciding whether to test further, we should like to consider both the outcomes of the tests for that person, as in their mean, but also some measure of the variance of the outcomes for that person.⁴ If, at some point, everyone quarantined perfectly and there were an unlimited capacity to perform tests, the screening would reduce to the so-called multi-armed bandit problem (MAB) [13] in Stochastic Control [12, 13]. (See the Supplementary Material for a definition.) If the capacity to perform tests were limited, this would correspond to the combinatorial variant [35, 36] of the MAB, which is substantially harder [37]. If, however, the disease spreads, these models are no longer useful and one has to consider the so-called restless bandits [38, 13]. In particular, in modelling the spread of the disease as restless bandits, the stochastic process could be the positivity rate in a particular region or community, for example with a sampling frequency of one day. One could have several such stochastic processes, one for each community. There are no assumptions on the evolution of the stochastic processes, including no assumption that the random variables are independent and identically distributed. A minimal sketch of a screening policy in this spirit is given below.
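To illustrate this view, the following is a minimal sketch, not the paper's implementation, of Thompson sampling with an independent Beta-Bernoulli model per community: each day, a positivity rate is sampled from each community's posterior, the day's test budget is allocated to the community with the highest sample, and the posterior is updated with the outcomes. The community names, the budget, and the true_rates used for simulation are all hypothetical; for simplicity, the sketch also treats the rates as stationary (the classical MAB case), whereas the restless case would additionally model the day-to-day evolution of each rate.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical communities and their (unknown) true positivity rates.
true_rates = {"A": 0.02, "B": 0.10, "C": 0.25}
budget = 100                           # tests available per day
alpha = {c: 1.0 for c in true_rates}   # Beta(1, 1) priors per community
beta = {c: 1.0 for c in true_rates}

for day in range(30):
    # Thompson sampling: draw a positivity rate from each posterior.
    samples = {c: rng.beta(alpha[c], beta[c]) for c in true_rates}
    # Allocate the whole day's budget to the community with the
    # highest sampled rate (a 1-sparse allocation, for simplicity).
    chosen = max(samples, key=samples.get)
    positives = rng.binomial(budget, true_rates[chosen])
    # Conjugate update of the chosen community's posterior.
    alpha[chosen] += positives
    beta[chosen] += budget - positives
    print(day, chosen, positives)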
Computational Complexity: Consider the problem of whom to test given a budget of tests [32]. For example, in COVID-19, the reported symptom of loss of taste and/or smell was the one most strongly associated with a positive test result [1, 39], so in the short term, it may be beneficial to test symptomatic patients. It is clear, however, that this is suboptimal in the long run, where one also needs to test asymptomatic individuals [9] in communities where no (or few) tests have been positive so far. Our second insight is that in the restless-bandit model outlined in the previous paragraph, the problem of whom to test given a budget of tests [32] is computationally hard. In the language of computational complexity, its approximation to any non-trivial factor is complete for polynomial-space Turing machines [40]. This suggests that, independent of any unproven conjectures, the problem is as hard as any computation that can be performed using a polynomial amount of space on a Turing machine in any amount of time. This is based on the well-known complexity results for the restless multi-armed bandit problem [38, 41, 13], under very modest assumptions, as we detail in the Supplementary Material.

³ The probability of detecting the disease conditional on the person tested being infected is less than one. This is true for chest CT and RT-PCR [33] and immunoassays. Likewise, the probability of detecting the disease conditional on the person tested not being infected is larger than zero [33]. The probability of a person passing on the infection, conditional on them testing positive, is also very much less than one for super-spreading [19] diseases.

⁴ A classical policy [12] considers the so-called upper confidence bound (UCB1) based on Hoeffding's Inequality [12]. Following $n$ tests in aggregate, out of which $n_i$ tests have been performed on individual $i$ with mean outcome $\mu_i$, individual $i$ will receive an "index" $\mu_i + \sqrt{2 \ln n / n_i}$. Each day, individuals with the highest indices are chosen for a test, up to the capacity; a minimal sketch is given below. An alternative policy, known as Thompson sampling [34], selects the individual according to the probability that it is optimal, considering some prior.

Optimality of Thompson sampling: Our third insight is that there are (asymptotically) optimal algorithms for the screening problem, despite the complexity results. Moreover, these optimal algorithms can be as simple as Thompson sampling [34, 42], which is a natural approach wherein one draws a random sample $\theta_l$ from a prior, applies actions that maximize the expected reward considering the sample $\theta_l$ drawn, observes the outcome, and updates the prior using the observation of the outcome. This is repeated, possibly daily, as suggested in Algorithm 1 in the Supplementary Material. The second and third insights may seem contradictory, especially considering that, until recently, the best guarantees [43] for the restless multi-armed bandit problem suggested an $\tilde{O}(\sqrt{T})$ bound on the average of the distance between the optimal action and the action chosen by the algorithm (regret) by time $T$, but only for an algorithm that is intractable in general. More recent work [44, 45] allows for the $\tilde{O}(\sqrt{T})$ bound on the regret using Thompson sampling, the classic algorithm [34, 42], both in the special case of binary rewards [44], which arises in the testing of individuals (either an infected person tests positive, yielding reward 1, or we do not receive any reward), and in the more general non-episodic case [45]. These are applicable to the two variants of the problem discussed above.
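To make the index policy of footnote 4 concrete, the following is a minimal sketch, under the assumption of stationary outcomes and a fixed daily capacity, of choosing whom to test by UCB1 indices. The individuals, counts, and capacity are hypothetical, and each individual is assumed to have been tested at least once.

import math

def ucb1_indices(means, counts, n_total):
    # UCB1 index per individual: empirical mean outcome plus an
    # exploration bonus that shrinks as the individual is tested more.
    return {
        i: means[i] + math.sqrt(2.0 * math.log(n_total) / counts[i])
        for i in means
    }

# Hypothetical state after n = 50 tests in aggregate.
means = {"p1": 0.40, "p2": 0.10, "p3": 0.35}   # mean test outcomes
counts = {"p1": 10, "p2": 30, "p3": 10}        # tests per individual
capacity = 2                                   # tests available today

indices = ucb1_indices(means, counts, n_total=50)
# Test the `capacity` individuals with the highest indices.
chosen = sorted(indices, key=indices.get, reverse=True)[:capacity]
print(chosen)  # ['p1', 'p3']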
This can be seen as a complement to the more traditional model-based optimal control [46, 47, 48, 49, 50, 51, 52, 53] for the introduction of restrictions.

Making it Practical: In order to make the optimal algorithms practically relevant, one needs to choose the prior wisely. The specifics depend, obviously, on the nature of the data available. The CMU Delphi lab, for instance, makes 10 different graphs available, outside of any data from any test-and-trace application. On such a dataset, for instance, Kemeny-based priors [10] or highest-degree-first priors [54, e.g.] may work well, as documented by clinical trials in other diseases [54, e.g.]. One may also consider extensions drawing on work in reinforcement learning [55].

There has been a substantial number of proposals as to how to screen for COVID-19, under a variety of strong assumptions. We present a very natural approach to removing the assumptions. This seems particularly useful for diseases whose dynamics are poorly understood, but which may be super-spreading and allow for reinfections, e.g., COVID-19. Within this model, well-known policies from stochastic control come with strong performance guarantees (Bayesian regret bounds) relative to the best possible deterministic policies in hindsight. Such an optimum deterministic policy, e.g., one based on contact tracing, is, however, unknown a priori. Our guarantees are optimal up to a logarithmic factor.

In Section A of the Supplementary Material, we introduce the stochastic models involved, following [40]. In Section B, we summarize the guarantees for Thompson sampling, following [44].

For three decades, one of the best-studied problems in applied probability and stochastic analysis has been the restless multi-armed bandit problem [38, 41, 13]. Formally, in the restless bandits problem, we are given $n$ Markov chains (bandits) $X_i(t)$, $i = 1, \ldots, n$, $t = 0, 1, \ldots$, that evolve on a common finite state space $S = \{1, \ldots, M\}$. We are also given the initial state of each chain. At each time $t$, bandit $i(t)$ is chosen. For $i = i(t)$, $X_i(t+1)$ is determined by a transition matrix $P$. For every $i \neq i(t)$, $X_i(t+1)$ is determined by some other transition matrix $Q$. At each time step, we incur a cost
$$C(t) = c\left(X_{i(t)}(t)\right) + \sum_{i \neq i(t)} d\left(X_i(t)\right)$$
for some rational-valued functions $c$ and $d$ defined on the state space $S$. Given the states of the different bandits, a policy $\pi : S^n \to \{1, \ldots, n\}$ decides which bandit should be played next; that is, $i(t) = \pi(X_1(t), \ldots, X_n(t))$. Its average expected cost is defined as
$$\limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=1}^{T} C(t)\right],$$
and we are interested in finding a policy minimizing the average expected cost. The multi-armed bandit problem is a special case of restless bandits, in which bandits that are not played do not change their state and do not incur any cost, i.e., we have $Q$ equal to the identity matrix and $d = 0$. It is well known that:

Theorem 1 (Theorem 4 in [40]). Restless bandits are PSPACE-hard.

Actually, the proof of Theorem 4 in [40] shows that deciding whether the optimal reward is non-zero is also PSPACE-hard, hence ruling out any algorithm with a non-trivial approximation ratio. Furthermore, the result holds [40] even if the matrices $P$, $Q$ correspond to one deterministic transition rule for all bandits that are not played and another deterministic transition rule applying to the bandit that is played. A small simulation of this model appears below.
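To make the definition above concrete, the following is a minimal sketch of simulating the average cost of a fixed policy in this restless-bandit model. The transition matrices, costs, and the myopic policy are all hypothetical choices for illustration, not taken from [40].

import numpy as np

rng = np.random.default_rng(1)

M, n = 3, 4                       # |S| = M states, n bandits
# Hypothetical row-stochastic transition matrices:
P = np.array([[0.9, 0.1, 0.0],    # applied to the bandit that is played
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
Q = np.array([[0.6, 0.3, 0.1],    # applied to all bandits not played
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
c = np.array([0.0, 1.0, 3.0])     # cost c of the played bandit's state
d = np.array([0.0, 2.0, 5.0])     # cost d of each unplayed bandit's state

def myopic_policy(states):
    # Play the bandit whose current state is most costly to leave idle.
    return int(np.argmax(d[states]))

states = np.zeros(n, dtype=int)   # initial states of the chains
T, total = 100_000, 0.0
for t in range(T):
    i = myopic_policy(states)
    # C(t) = c(X_{i(t)}(t)) + sum over j != i of d(X_j(t)).
    total += c[states[i]] + d[states].sum() - d[states[i]]
    for j in range(n):            # every chain evolves each step (restless)
        row = P[states[j]] if j == i else Q[states[j]]
        states[j] = rng.choice(M, p=row)
print("average cost per step:", total / T)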
In the general case, [56] introduce a hierarchy of $N$ (where $N$ is the number of bandits) increasingly stronger linear programming relaxations, the last of which is exact and corresponds to the (exponential-size) formulation of the problem as a Markov decision chain, while the other relaxations provide bounds and are efficiently computed. They also propose a priority-index heuristic scheduling policy derived from the solution to the first-order relaxation, where the indices are defined in terms of optimal dual variables. Similarly, [43] present strong guarantees, but for an algorithm that is not tractable. Under stricter assumptions, better guarantees are possible. Under assumptions on the rate of change, [57] present a framework for reasoning about the regret of policies computable in polynomial time. Other assumptions [58, 44] also yield guarantees on the approximation ratio. Specifically, [44] consider guarantees in the case of the rewards being binary, which is indeed the case in COVID-19 screening.

Algorithm 1: Thompson sampling for COVID-19 screening, based on [44]
1: Input: prior $Q$, episode length $L$, policy mapping $\mu$
2: Initialize posterior $Q_1 = Q$, history $H = \emptyset$
3: for episodes $l = 1, \ldots, m$ do
4:   Draw a parameter $\theta_l \sim Q_l$ and compute the policy $\pi_l = \mu(\theta_l)$
5:   Set $H_0 = \emptyset$
6:   for $t = 1, \ldots, L$ do
7:     Select $N$ communities to test in $A_t = \pi_l(t, H_{t-1})$
8:     Evaluate tests to obtain rewards $X_{t, A_t}$
9:     Update $H_t$
10:  end for
11:  Append $H_L$ to $H$ and update the posterior distribution $Q_{l+1}$ using $H$
12: end for

Our guarantees are relative to a broad class of benchmark policies, including the optimal fixed policy, the myopic policy, and index-based policies, all of which satisfy:

Definition 2 ([13, 44]). A deterministic policy $\pi$ takes a time index and history $(t, H_{t-1})$ as an input and outputs a fixed action $A_t = \pi(t, H_{t-1})$. A deterministic policy mapping $\mu$ takes a system parameter $\theta$ as an input and outputs a deterministic policy $\pi = \mu(\theta)$.

In particular, we bound the regret
$$R(T; \theta) = \sum_{l=1}^{m} \left[ V(\mu(\theta); \theta) - V(\pi_l; \theta) \right],$$
where the value function is
$$V(\pi; \theta) = \mathbb{E}\left[ \sum_{t=1}^{L} X_{t, A_t} \,\middle|\, \theta,\; A_t = \pi(t, H_{t-1}) \right].$$
A variant of the regret, where one assumes we have access to a prior distribution $Q$ over the set of system parameters $\Theta$, is the Bayesian regret
$$BR(T) = \mathbb{E}_{\theta \sim Q}\left[ R(T; \theta) \right].$$
The bound is as follows:

Theorem 3. The Bayesian regret of Algorithm 1 is bounded by $\tilde{O}(\sqrt{T})$, where $T = mL$.

Proof. The proof is by a straightforward application of Theorem 1 of [44].

It is known [44] that this bound is tight for $L = 1$, $N = 1$.
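As a complement to the listing above, here is a minimal runnable sketch of Algorithm 1, assuming an independent Beta-Bernoulli model per community as the prior $Q$ and a greedy policy mapping $\mu$; both are illustrative choices of ours, not prescribed by [44], and all numbers are hypothetical.

import numpy as np

rng = np.random.default_rng(2)

K, N, L, m = 5, 2, 7, 20          # K communities, N tested per step,
                                  # episode length L, m episodes
true_theta = rng.uniform(0.0, 0.3, size=K)  # hypothetical positivity rates

# Prior Q: independent Beta(1, 1) per community; posterior kept as counts.
wins = np.ones(K)
losses = np.ones(K)

def policy_mapping(theta):
    # mu(theta): a greedy policy that always tests the N communities
    # with the highest sampled positivity rates, ignoring the history.
    top = np.argsort(theta)[-N:]
    return lambda t, history: top

for l in range(m):                          # episodes l = 1, ..., m
    theta_l = rng.beta(wins, losses)        # draw theta_l ~ Q_l
    pi_l = policy_mapping(theta_l)          # compute pi_l = mu(theta_l)
    history = []                            # H_0 = empty
    for t in range(L):                      # t = 1, ..., L
        A_t = pi_l(t, history)              # select N communities
        X_t = rng.binomial(1, true_theta[A_t])  # binary rewards X_{t, A_t}
        history.append((A_t, X_t))          # update H_t
    for A_t, X_t in history:                # update posterior Q_{l+1}
        wins[A_t] += X_t
        losses[A_t] += 1 - X_t
print("posterior means:", wins / (wins + losses))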
References
[1] Population-scale longitudinal mapping of COVID-19 symptoms, behaviour and testing
[2] Flipping the script for coronavirus disease 2019 contact tracing
[3] Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID-19
[4] Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy
[5] COVID-19 mortality is negatively associated with test number and government effectiveness
[6] An economic model of the COVID-19 epidemic: The importance of testing and age-specific policies
[7] Impact of delays on effectiveness of contact tracing strategies for COVID-19: a modelling study
[8] COVID-19 and the other pandemic: populations made vulnerable by systemic inequity
[9] Testing of asymptomatic individuals for fast feedback-control of COVID-19 pandemics
[10] Kemeny-based testing for COVID-19
[11] Slovakia to test all adults for SARS-CoV-2
[12] Prediction, learning, and games
[13] Multi-armed bandit allocation indices
[14] Capturing the time-varying drivers of an epidemic using stochastic dynamical systems
[15] Special report: The simulations driving the world's response to COVID-19
[16] Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
[17] Infectious diseases of humans: dynamics and control
[18] A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London, Series A, Containing Papers of a Mathematical and Physical Character
[19] Superspreading and the effect of individual variation on disease emergence
[20] Clustering and superspreading potential of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections in Hong Kong
[21] Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China
[22] A case report of possible novel coronavirus 2019 reinfection
[23] Serum antibody profile of a patient with COVID-19 reinfection
[24] Will we see protection or reinfection in COVID-19?
[25] Clinical recurrences of COVID-19 symptoms after recovery: viral relapse, reinfection or inflammatory rebound?
[26] COVID-19 reinfection: Myth or truth
[27] COVID-19: What if immunity wanes?
[28] Systematic approximations to susceptible-infectious-susceptible dynamics on networks
[29] Immune life history, vaccination, and the dynamics of SARS-CoV-2 over the next 5 years
[30] Digital technology and COVID-19
[31] Computational complexity: a modern approach
[32] Fair allocation of scarce medical resources in the time of COVID-19
[33] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
[34] On the likelihood that one unknown probability exceeds another in view of the evidence of two samples
[35] Combinatorial bandits
[36] Combinatorial bandits revisited
[37] Tight lower bounds for combinatorial multi-armed bandits. Proceedings of Machine Learning Research
[38] Restless bandits: Activity allocation in a changing world
[39] Smell and taste changes are early indicators of the COVID-19 pandemic and political decision effectiveness
[40] The complexity of optimal queuing network control
[41] On an index policy for restless bandits
[42] A tutorial on Thompson sampling. Foundations and Trends in Machine Learning
[43] Online regret bounds for undiscounted continuous reinforcement learning
[44] Regret bounds for Thompson sampling in episodic restless bandit problems
[45] Thompson sampling in non-episodic restless bandits
[46] Modeling, state estimation, and optimal control for the US COVID-19 outbreak
[47] Optimal COVID-19 quarantine and testing policies
[48] Optimal control of an epidemic through social distancing
[49] On fast multi-shot epidemic interventions for post-lock-down mitigation: Implications for simple COVID-19 models
[50] Robust and optimal predictive control of the COVID-19 outbreak
[51] Optimal lockdown for pandemic stabilization
[52] Optimal targeted lockdowns in a multi-group SIR model
[53] Controlling epidemic spread: Reducing economic losses with targeted closures
[54] Clinical trial of an AI-augmented intervention for HIV prevention in youth experiencing homelessness
[55] Whittle index based Q-learning for restless bandits with average reward
[56] Restless bandits, linear programming relaxations, and a primal-dual index heuristic
[57] Stochastic multi-armed-bandit problem with non-stationary rewards
[58] Approximation algorithms for restless bandit problems