key: cord-146213-924ded7t authors: Kiamari, Mehrdad; Ramachandran, Gowri; Nguyen, Quynh; Pereira, Eva; Holm, Jeanne; Krishnamachari, Bhaskar title: COVID-19 Risk Estimation using a Time-varying SIR-model date: 2020-08-11 journal: nan DOI: nan sha: doc_id: 146213 cord_uid: 924ded7t

Policy-makers require data-driven tools to assess the spread of COVID-19 and inform the public of their risk of infection on an ongoing basis. We propose a rigorous hybrid model-and-data-driven approach to risk scoring based on a time-varying SIR epidemic model that ultimately yields a simplified color-coded risk level for each community. The risk score $\Gamma_t$ that we propose is proportional to the probability of someone currently healthy getting infected in the next 24 hours. We show how this risk score can be estimated using another useful metric of infection spread, $R_t$, the time-varying average reproduction number, which indicates the average number of individuals an infected person would infect in turn. The proposed approach also allows for quantification of uncertainty in the estimates of $R_t$ and $\Gamma_t$ in the form of confidence intervals. Code and data from our effort have been open-sourced and are being applied to assess and communicate the risk of infection in the City and County of Los Angeles.

The ongoing COVID-19 epidemic has forced governments and public authorities to employ stringent measures [1], [2], including closing businesses and implementing stay-at-home orders, to contain the spread. When making such decisions, policymakers require tools to understand in "real-time" how the virus is spreading in the community, as well as tools to help communicate the level of risk to citizens so that they can be encouraged to take appropriate measures and take the public health directives seriously. One metric that has been found to be useful for authorities to assess the level of containment over time is the effective reproduction number [3].
The effective reproduction number, $R_t$, indicates on average how many currently susceptible persons can be infected by a currently infected individual. The epidemic grows if this measure is above one. It is desirable to keep this value as far below one as possible over time in order to contain and eventually, hopefully, eliminate the virus from the community. While $R_t$ is meaningful for understanding the rate at which the epidemic is spreading and has been proposed previously (for example, see https://rt.live/), what has been missing in the public discourse is a risk metric that is more suitable for communication to a wider public. One key requirement for such a metric is that it be something that a citizen could relate to on an individual basis. Another requirement is that it be easy to communicate to a wide audience. We address both these requirements in this work and make the following contributions. First, we obtain the daily effective reproduction number $R_t$ of a time-varying SIR model as well as the corresponding confidence interval. The confidence interval reflects uncertainty in both the parameters of the underlying model and the data itself. Further, we present the mathematical derivation of the distribution of $R_t$. Second, we propose a novel risk score $\Gamma_t$ for a community that is proportional to the probability that an individual will get infected in the next 24 hours. We show that the risk score can be calculated given estimates of four quantities: a) an estimate of $I_{rep,new}(t)$, the most recently reported count of new confirmed infectious cases, b) an estimate of $R_t$ as discussed above, c) an estimate of $K$, the ratio of true infectious cases to the number of confirmed cases, and d) an estimate of $S(t)$, the current number of susceptible individuals in the community. To make the score more meaningful, we normalize the probability of infection by multiplying it by 10,000.
Then, a risk score of $x$ is an indication that there is, on average, a chance of $x$ in 10,000 of an individual in the community becoming infected in the next 24 hours. Third, we propose to convert the numerical risk score, which has an intuitive meaning as indicated above, to a color-coded risk level based on suitably chosen thresholds. We propose the use of four color levels to indicate the corresponding risk level from low to high: green, yellow, orange, and red. Fourth, we have implemented software to estimate the risk level for any community and released it as open-source. The code requires only time-series data on confirmed new cases, the population of the community, and an estimate for the ratio of true to confirmed (detected) COVID-19 positive cases. This software is being used at USC to process the daily data of communities within Los Angeles County to estimate and generate maps of risk levels by community.

The block diagram in Figure 1 illustrates key elements of our system design. Our data parser retrieves the raw data from online data sources, cleans it up, and stores it in machine-friendly (CSV and JSON) formats. Our code for infection risk calculation uses this data in conjunction with a time-varying SIR-based Bayesian mathematical model to obtain risk estimates and predictions for different communities. The results are provided in CSV format and can be used to generate a heatmap-type visualization as well. The risk scoring model we describe in this work is now being used by the City of Los Angeles, which in turn is working with the County of Los Angeles and other partners to develop a publicly accessible tool that can be used by individuals and communities to grow awareness and mitigate risk of infection. We believe that our risk estimation approach will be similarly of value to other communities around the world.

II. RELATED WORK

As noted above, the calculation of the risk score requires an estimate of $R_t$.
We show how this can be estimated using a time-varying SIR model, a generalization of the well-known SIR compartmental model [4], [5], which consists of three states, namely the susceptible state, the infected state, and the recovered state. While traditionally this model is assumed to have a constant interaction rate / infection rate parameter, one recent work has used a time-varying SIR model to recover the time-varying effective reproduction number [6]. Going beyond that work, we also show how to derive a confidence interval for $R_t$. Further, the authors of [6] make strong assumptions on the number of susceptible individuals by approximating it as a constant factor of the entire population. This assumption may not be accurate when the number of infected individuals is high compared to the total population of a community; we therefore take a more general approach. Another recent work by Systrom [7] has presented a Bayesian prediction approach to obtain confidence intervals for $R_t$. However, Systrom's work builds on [8], where the definition of infection rate $R_t$ is not based on a time-varying contact rate of the SIR model. Instead, their approach estimates infection rate probabilistically based on the number of new cases alone. We are not aware of prior work that has proposed defining risk for COVID-19 or other epidemics in terms of an individual's probability of infection, which we argue is more meaningful for communicating risk to the public.

III. METHODOLOGY

Compartmental mathematical models for epidemic spread, including the well-known SIR model, have been used since the work of Kermack and McKendrick in 1927 [4]. In the SIR model, each member of a given population is in one of three states at any time: susceptible, infectious, or recovered. Any individual that is susceptible could become infected with some probability when they come into contact with an infected individual.
Any individual that is infectious eventually recovers (in the context of COVID-19 when applying the SIR model, note that the category of recovered individuals also includes individuals removed due to death, who could be modeled as a constant fraction of all individuals in this category). In the classical SIR model, the number of susceptible individuals that become infected depends on the rate at which infected and susceptible individuals encounter each other, and this rate is assumed to be constant. A well-known parameter in the classical SIR model is $R_0$, the basic reproduction number, which measures the average number of infections caused by an infectious individual at the beginning of the epidemic. In our work, we have extended the SIR model to a time-varying model, in which the rate of encounters and the infection probability between individuals in the population are assumed to be time-varying. This better reflects the reality of our present epidemic, where interventions such as stay-at-home orders have been put in place and relaxed at various times, and compliance with recommendations such as wearing masks and maintaining physical distancing has also been time-varying. Based on this model, we are able to define and derive a new approach to calculating a time-varying version of the effective reproductive number, which we refer to as $R_t$. A particularly innovative aspect of our model is that it is a Bayesian model that allows the incorporation of various sources of uncertainty, including uncertainty in the actual numbers of infected individuals (due to not every infected individual having been tested, as studies [2] have shown), uncertainty in recovery times, and uncertainty in the choice of parameters for de-noising the empirical data. This allows us not only to generate an estimate of $R_t$, but also to quantify confidence in the estimate from a rigorous statistical perspective. In this section, we elaborate upon the SIR model in detail.
The SIR model is one of the simplest and most well-known epidemic models [4], [5], in which each person belongs to one of three states: the susceptible state, the infected state, and the recovered state. Susceptible individuals have not yet had the virus, but may become infected if they are exposed to an infected individual. An infected individual is a previously susceptible person who has contracted the virus after exposure to infected individuals. Finally, a person enters the recovered state when the individual either heals or dies. One important point about this model is that a recovered person does not become susceptible again. The SIR model follows the differential equations

$$\frac{dS(t)}{dt} = -\frac{\beta\, S(t)\, I(t)}{N}, \qquad \frac{dI(t)}{dt} = \frac{\beta\, S(t)\, I(t)}{N} - \sigma I(t), \qquad \frac{dR(t)}{dt} = \sigma I(t), \tag{1}$$

where $S(t)$, $I(t)$, and $R(t)$ respectively represent the number of susceptible, infected, and recovered people in a population of size $N$ at time $t$. The parameter $\sigma$ is the recovery rate after being infected and is equal to $\frac{1}{D_I}$, where $D_I$ represents the average number of infectious days. The parameter $\beta$ is known as the effective contact rate, i.e., the average number of contacts an individual has with others per unit time. In analyzing whether a pandemic is contained, it is crucial to obtain the parameter $\beta$. We next show how $\beta$ can be derived from the aforementioned differential equations.

1) Obtaining $\beta_t$ and $R_t$ for the SIR Model: In the SIR model, we can express the number of susceptible individuals in terms of the population size and the number of infected persons as $S(t) \approx N - I(t)$. By replacing $S(t)$ with $N - I(t)$ in the second differential equation of (1), we have

$$\frac{dI(t)}{dt} = \frac{\beta\,(N - I(t))\, I(t)}{N} - \sigma I(t). \tag{2}$$

We can rewrite (2) as follows:

$$\frac{dI(t)}{I(t)\left[(\beta - \sigma) - \frac{\beta}{N} I(t)\right]} = dt. \tag{3}$$

By taking the definite integral from time $t_1$ to $t_2$ and assuming $\beta$ to be constant in this time interval, we have

$$\int_{I(t_1)}^{I(t_2)} \frac{dI}{I\left[(\beta - \sigma) - \frac{\beta}{N} I\right]} = \int_{t_1}^{t_2} dt, \tag{4}$$

which leads to

$$\frac{1}{\beta - \sigma}\, \log \frac{I(t_2)\left[(\beta - \sigma) - \frac{\beta}{N} I(t_1)\right]}{I(t_1)\left[(\beta - \sigma) - \frac{\beta}{N} I(t_2)\right]} = t_2 - t_1. \tag{5}$$

One can easily check that (5) has a unique solution for $\beta$, because the term $\frac{1}{\beta - \sigma}$ and the log term both behave monotonically.
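As an illustrative sketch (not the authors' released code), the implicit relation (5) can be solved for $\beta$ numerically with a bracketing root finder. The form of (5) used here is the one obtained by integrating (2) with $S(t) \approx N - I(t)$. The function names, the bracket limits, and the synthetic parameter values are assumptions made for this example, and the sketch only covers the growing case $I(t_1) < I(t_2)$.

```python
import math
from scipy.optimize import brentq


def lhs_eq5(beta, i1, i2, n, sigma):
    """Left-hand side of (5): log-ratio term divided by (beta - sigma)."""
    a = beta - sigma          # net growth rate
    b = beta / n              # saturation coefficient
    return math.log(i2 * (a - b * i1) / (i1 * (a - b * i2))) / a


def solve_beta(i1, i2, n, sigma, dt=1.0, beta_max=50.0):
    """Solve (5) for beta given infected counts at t1 and t2 = t1 + dt."""
    if not 0 < i1 < i2 < n:
        raise ValueError("sketch only covers the growing case 0 < I(t1) < I(t2) < N")
    # The log argument is positive only when beta > sigma / (1 - I(t2)/N),
    # so the bracket starts just above that bound (assumed tolerance 1e-9).
    beta_min = sigma / (1.0 - i2 / n) * (1.0 + 1e-9)
    return brentq(lambda b: lhs_eq5(b, i1, i2, n, sigma) - dt, beta_min, beta_max)


def logistic_step(i0, beta, sigma, n, dt=1.0):
    """Closed-form solution of (2) after dt, used only to make test data."""
    a = beta - sigma
    b = beta / n
    e = math.exp(a * dt)
    return a * i0 * e / (a + b * i0 * (e - 1.0))


# Hypothetical values chosen for illustration only.
N, SIGMA, BETA_TRUE = 1_000_000, 0.1, 0.3
I1 = 1000.0
I2 = logistic_step(I1, BETA_TRUE, SIGMA, N)
beta_hat = solve_beta(I1, I2, N, SIGMA)
print(beta_hat, beta_hat / SIGMA)   # recovered beta and the implied beta / sigma
```

Because the left-hand side of (5) decreases from very large values near the lower bracket to zero as $\beta$ grows, a single sign change exists inside the bracket, which is what `brentq` requires.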
An epidemic occurs when the number of infected individuals increases, i.e., $\frac{dI(t)}{dt} > 0$, or consequently

$$\frac{\beta}{\sigma} \cdot \frac{N - I(t)}{N} > 1. \tag{6}$$

In the early stage of an epidemic, almost everyone is susceptible except for very few cases. Therefore, $N - I(t) \approx N$ and, as a result, condition (6) turns into $\frac{\beta}{\sigma} > 1$. The variable $R \triangleq \frac{\beta}{\sigma}$ is defined as the effective reproduction number. It is a useful metric to determine epidemic growth. When $R > 1$, the epidemic is growing exponentially, while $R < 1$ indicates that the epidemic is contained and will decline and eventually die out. For discrete-time cases such as daily reporting of the number of infected cases, the time-variant effective contact rate $\beta_t$, which represents the contact rate for time slot $t$, can be derived by solving the following equation:

$$\frac{1}{\beta_t - \sigma}\, \log \frac{I(t)\left[(\beta_t - \sigma) - \frac{\beta_t}{N} I(t-1)\right]}{I(t-1)\left[(\beta_t - \sigma) - \frac{\beta_t}{N} I(t)\right]} = 1. \tag{7}$$

Therefore, the time-variant effective reproduction number is defined as $R_t \triangleq \frac{\beta_t}{\sigma}$. Since it is difficult to write a closed-form solution for $\beta_t$ in (7), we take a simpler approximation to $\beta_t$ by considering

$$\beta_t \approx \frac{N \left[I(t) - I(t-1) + \sigma I(t-1)\right]}{I(t-1)\left[N - I(t-1)\right]}, \tag{8}$$

which is based on (2). Then, we estimate $R_t$ as $\frac{\beta_t}{\sigma}$.

2) Obtaining the Confidence Interval for $R_t$: Since there is uncertainty about the parameter $D_I$ (or equivalently $\sigma$) and the number of infected cases $I(t)$, we now provide the derivation of a confidence interval for $R_t$. To model the ambiguity in the number of infected cases, we express the actual number of infected cases as a factor of the reported ones, i.e., $I_{rep}(t) \triangleq \frac{1}{K} I(t)$, where $K$ is a constant greater than 1. The intuition behind this factor is to account for two phenomena: an insufficient number of tests (especially at the beginning of the pandemic) and asymptomatic cases (mild infections which might not even be noticed). To derive the confidence interval, we first need to find the marginal distribution of $R_t$.
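The simpler approximation to $\beta_t$ based on (2), followed by $R_t = \beta_t / \sigma$, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: it assumes a one-day forward difference of (2) between $t-1$ and $t$, takes a series of currently infected counts $I(t)$ as input, and uses the hypothetical function name `estimate_beta_rt` and made-up parameter values.

```python
import numpy as np


def estimate_beta_rt(infected, n, d_i):
    """Estimate beta_t and R_t from a series of infected counts I(0..T).

    Assumes the one-step discretization of (2):
        I(t) - I(t-1) = beta_t * (N - I(t-1)) * I(t-1) / N - sigma * I(t-1)
    with sigma = 1 / D_I.
    """
    infected = np.asarray(infected, dtype=float)
    sigma = 1.0 / d_i
    prev, curr = infected[:-1], infected[1:]
    beta_t = n * (curr - prev + sigma * prev) / (prev * (n - prev))
    return beta_t, beta_t / sigma


# Synthetic series generated with a constant contact rate (hypothetical values),
# so the estimator should recover that rate for every day.
N = 1_000_000
D_I = 10.0
BETA = 0.25
series = [500.0]
for _ in range(14):
    i = series[-1]
    series.append(i + BETA * (N - i) * i / N - i / D_I)

beta_t, r_t = estimate_beta_rt(series, N, D_I)
print(np.round(r_t, 3))
```

With real data, the input series would typically be smoothed first, since day-to-day reporting noise is amplified by the difference $I(t) - I(t-1)$.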
By considering $f_D(d)$ and $f_K(k)$ as the probability density functions (pdfs) of the parameters $D_I$ and $K$, respectively, the joint pdf of these parameters is

$$f_{D,K}(d, k) = f_D(d)\, f_K(k) \tag{9}$$

due to the independence of $D_I$ and $K$. We can derive the pdf of $R_t$ by applying a transformation to the parameters $D_I$ and $K$ that introduces an auxiliary variable $Z$ [equation missing in source]. Since the transformation of $(Z, R_t)$ to $(D_I, K)$ is one-to-one [equation and definition of $a_t$ missing in source], the joint pdf of $Z$ and $R_t$ is $f_{Z,R_t}(z, r) = |J|\, f_{D,K}(d, k)$, with the Jacobian $J$ given by (12) [equation missing in source]. Substituting the corresponding values of the parameters and the Jacobian yields (13) [equation missing in source]. The marginal pdf of $R_t$ can be obtained by integrating (13) over $z$, i.e.,

$$f_{R_t}(r) = \int f_{Z,R_t}(z, r)\, dz. \tag{14}$$

Remark 1: One reasonable assumption regarding the pdfs of $D_I$ and $K$ is that both have Gaussian distributions. By considering $D_I \sim \mathcal{N}(\mu_D, \sigma_D^2)$ and $K \sim \mathcal{N}(\mu_K, \sigma_K^2)$, the pdf of $R_t$ can be simplified to (15) [equation missing in source], where $\phi_{\mu_c, \sigma_c^2}(\cdot)$ denotes the pdf of a normal distribution with mean $\mu_c$ and variance $\sigma_c^2$, together with auxiliary definitions [equation missing in source]. By a change of variables in the integral, (15) can be rewritten as (17) [equation missing in source], where $\Phi_{\mu_c, \sigma_c^2}(\cdot)$ represents the cumulative distribution function (cdf) of a normal distribution with mean $\mu_c$ and variance $\sigma_c^2$.

The confidence interval is $(\bar{R}_t - \delta, \bar{R}_t + \delta)$, where $\bar{R}_t \triangleq E[R_t] = \int r\, f_{R_t}(r)\, dr$ and $\delta$ is chosen to satisfy $P(|R_t - \bar{R}_t| \le \delta) = \int_{\bar{R}_t - \delta}^{\bar{R}_t + \delta} f_{R_t}(x)\, dx = 1 - \epsilon$ for some small $\epsilon > 0$.

3) Estimating the Risk Score: We propose a novel risk score metric for a given community that is proportional to the probability of someone in that community becoming infected in the next time period (typically, 24 hours). The risk score can be derived as the average number of people in that community who are likely to get infected in the next 24 hours by the currently infectious people, divided by the current number of susceptible individuals.
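The confidence interval for $R_t$ described in 2) can also be approximated by simulation instead of evaluating the marginal pdf in closed form. The sketch below is a Monte Carlo stand-in for that derivation, not the analytical result: it draws $D_I$ and $K$ from the Gaussian distributions of Remark 1, scales reported counts by $K$, applies a one-day forward-difference estimate of $\beta_t$ based on (2), and returns $\bar{R}_t$ and the half-width $\delta$ with $P(|R_t - \bar{R}_t| \le \delta) \approx 1 - \epsilon$. All names and numeric values are hypothetical, and draws with non-physical values are discarded as a simplifying assumption.

```python
import numpy as np


def rt_confidence_interval(i_rep_prev, i_rep_curr, n,
                           mu_d, sd_d, mu_k, sd_k,
                           eps=0.05, draws=100_000, seed=0):
    """Monte Carlo estimate of (R_bar, delta) for one time slot."""
    rng = np.random.default_rng(seed)
    d_i = rng.normal(mu_d, sd_d, draws)   # infectious period D_I
    k = rng.normal(mu_k, sd_k, draws)     # true-to-reported ratio K

    # Assumption: keep only physically meaningful draws.
    ok = (d_i > 0) & (k > 1) & (k * i_rep_prev < n)
    d_i, k = d_i[ok], k[ok]

    prev, curr = k * i_rep_prev, k * i_rep_curr   # I(t) = K * I_rep(t)
    sigma = 1.0 / d_i
    beta_t = n * (curr - prev + sigma * prev) / (prev * (n - prev))
    r_t = beta_t / sigma

    r_bar = r_t.mean()
    delta = np.quantile(np.abs(r_t - r_bar), 1.0 - eps)
    return r_bar, delta


# Hypothetical inputs: reported infected counts on two consecutive days.
r_bar, delta = rt_confidence_interval(
    i_rep_prev=1000, i_rep_curr=1050, n=1_000_000,
    mu_d=10.0, sd_d=1.0, mu_k=5.0, sd_k=1.0,
)
print(f"R_t in ({r_bar - delta:.3f}, {r_bar + delta:.3f}) with mean {r_bar:.3f}")
```

The symmetric interval around the mean mirrors the definition of $\delta$ in the text; a percentile interval would be an alternative when the distribution of $R_t$ is noticeably skewed.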
We further normalize this probability by multiplying by 10,000, so that a score of 1 implies a 1 in 10,000 chance of getting infected, a score of 2 implies a 2 in 10,000 chance of getting infected, and so on. Mathematically, the risk score is defined as follows:

$$\Gamma_t \triangleq 10{,}000 \cdot \frac{R_t\, I(t)}{D_I\, S(t)} \approx 10{,}000 \cdot \frac{K\, I_{rep,new}(t)\, R_t}{N},$$

where $I_{rep,new}(t)$ indicates the most recently reported count of new confirmed infectious cases, $K$ refers to the ratio of true cases to reported cases, $R_t$ is the time-varying reproduction number, and $N$ is the total population size of the community. The approximation follows from the fact that $I_{rep,new}(t)$ is approximately equal to $\frac{I(t)}{D_I \cdot K}$ and that $S(t)$, the number of susceptible people in the community, is approximately equal to $N$ in the early stages of the epidemic. Confidence intervals for the risk score $\Gamma_t$ can be obtained numerically using a process similar to the one described for $R_t$, accounting also for uncertainty in $K$. Note that since $K$ may not be known for a given community, it may be helpful to use the following normalized form of the risk score: $\frac{\Gamma_t}{K}$, which is still proportional to the probability of infection for an individual.

4) Color-coded Risk Levels: To further simplify the presentation of the risk score to a wider audience, we propose to classify the risk into four color-coded levels: green, yellow, orange, and red. The risk level is determined by evaluating the normalized risk score $\frac{\Gamma_t}{K}$ with respect to three pre-specified threshold levels $\theta_1$, $\theta_2$, $\theta_3$, such that when $\frac{\Gamma_t}{K} < \theta_1$ the risk level is green, when $\theta_1 \le \frac{\Gamma_t}{K} < \theta_2$ the risk level is yellow, when $\theta_2 \le \frac{\Gamma_t}{K} < \theta_3$ the risk level is orange, and when $\frac{\Gamma_t}{K} \ge \theta_3$ the risk level is red.

The software for data collection, infection rate estimation, and prediction has already been implemented and made available as open-source software (at the following repository: https://github.com/ANRGUSC/covid19_risk_estimation). The software is written in Python using standard data processing libraries such as NumPy and SciPy.
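The risk score and the color-coded levels described above can be illustrated with a short sketch. It assumes the approximate form $\Gamma_t \approx 10{,}000 \cdot K \cdot I_{rep,new}(t) \cdot R_t / N$, which uses the four quantities named in the text with $S(t) \approx N$. The function names, the input values, and in particular the thresholds $\theta_1, \theta_2, \theta_3$ are hypothetical, since no threshold values are given here, and the code is not taken from the released repository.

```python
def risk_score(i_rep_new, r_t, k, n):
    """Approximate Gamma_t: expected infections per 10,000 people per day."""
    return 10_000 * k * i_rep_new * r_t / n


def risk_level(normalized_score, thresholds):
    """Map the normalized score Gamma_t / K to a color using (theta1, theta2, theta3)."""
    theta1, theta2, theta3 = thresholds
    if normalized_score < theta1:
        return "green"
    if normalized_score < theta2:
        return "yellow"
    if normalized_score < theta3:
        return "orange"
    return "red"


# Hypothetical community: 2,000 new reported cases, R_t = 1.1,
# K = 5 true cases per reported case, population 10 million.
I_REP_NEW, R_T, K, N = 2000, 1.1, 5.0, 10_000_000
THRESHOLDS = (1.0, 2.0, 3.0)  # made-up thresholds for illustration only

gamma_t = risk_score(I_REP_NEW, R_T, K, N)
level = risk_level(gamma_t / K, THRESHOLDS)
print(f"Gamma_t = {gamma_t:.2f}, Gamma_t / K = {gamma_t / K:.2f}, level = {level}")
```

Classifying on $\Gamma_t / K$ rather than $\Gamma_t$ keeps the color level independent of the assumed value of $K$, which matches the motivation given for the normalized score.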
We have acquired COVID-19 case data from the LA County Department of Public Health using a Python-based data parser we wrote (open-sourced at the following link: https://github.com/ANRGUSC/lacounty_covid19_data). We have been updating this repository with the latest data every day since mid-March, and we also make available plots of the number of cases, the number of fatalities, the top 6 communities with the largest number of cases, the infection rate for the entire LA County, and the top 9 communities with the highest infection rate at the following link: http://anrg.usc.edu/www/covid19.html. The following data sources are used for the infection rate and prediction:

• The COVID-19 case information was collected from LA County's daily press releases (accessible through the following website: http://publichealth.lacounty.gov/media/Coronavirus/).
• Recovery information provided by the World Health Organization.
• The population data from the LA County Census, available online (from lacounty.gov/government/geography-statistics/cities-and-communities/).

The City of Los Angeles is currently using the risk model described in this work, which has been developed by researchers at USC, to help assess location-based risk for COVID-19 infection. The City is working with the County and other partners to develop a tool that is publicly accessible and can be used by individuals and communities to mitigate risk of infection. The goal is to change behaviors to reduce risk of infection and promote a greater understanding of factors that increase COVID-19 risk. A color-coded COVID-19 threat level tool that can be used by citizens has also been unveiled by the Mayor of the City of LA, online at https://corona-virus.la/covid-19-threat-level.

We present below plots from our analysis of LA County community case data using the estimation approach described in this work.
Figure 2 shows plots of the estimated expected reproductive number $R_t$ and the estimated risk score for the entire LA County. These plots are based on a 14-day moving average applied to the daily number of confirmed cases. In accordance with LA County daily press releases, there is a sharp jump in both $R_t$ and the risk score around the beginning of July. Note that the risk score at the beginning of July is higher than the risk score during the last week of March, despite $R_t$ being the same, because there were significantly more confirmed cases in July than in March. Figure 3 shows the risk score estimates over time for four representative communities within LA County. Figure 4 shows the color-coded risk levels for communities in LA County for select dates over the past 3 months.

We have proposed a new risk metric $\Gamma_t$ that can be used by individuals in any community to assess their probability of getting infected by COVID-19. The metric builds on the estimation of $R_t$, the average reproductive number, which is obtained from a time-varying extension of the classical SIR model. We show how to evaluate the uncertainty in both metrics as well. In future work, we plan to generalize the approach to the SEIR model, which also models an additional incubation period. We have released code to implement an estimation of the risk score that can be used for any community worldwide as long as time-series data for confirmed new cases and the population are known. We have also proposed the use of simple color-coded risk levels to inform and guide the public, as has been adopted in the City of Los Angeles.
[1] Evaluation of the lockdowns for the SARS-CoV-2 epidemic in Italy and Spain after one month follow up
[2] The lockdowns worked, but what comes next
[3] The reproductive number of COVID-19 is higher compared to SARS coronavirus
[4] The Kermack-McKendrick epidemic model revisited
[5] Networks: An introduction
[6] A time-dependent SIR model for COVID-19
[7] The metric we need to manage COVID-19
[8] Real time Bayesian estimation of the epidemic potential of emerging infectious diseases